Wordlist and Spell checking for Amharic and Tigrigna

Biniam Gebremichael

  • Spell checking

  • Corpus building

    To help Geez Natural Language Processing (NLP) developers, I have created a web crawler that collects Amharic and Tigrigna texts from the Internet. A word list is generated for each language, sorted by number of occurrences; both lists are available for download below.
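    As a rough illustration of the counting step, the sketch below builds a frequency-sorted word list from already-collected text. It is not the Geez Crawler's actual code: the function name build_wordlist, the choice of Unicode ranges, and the tie-breaking rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Ethiopic letters. The ranges cover the
# syllables in the Ethiopic, Ethiopic Supplement and Ethiopic Extended blocks,
# and leave out Ethiopic punctuation (U+1360-U+1368) and digits (U+1369-U+137C).
ETHIOPIC_WORD = re.compile(r"[\u1200-\u135A\u1380-\u138F\u2D80-\u2DDE]+")


def build_wordlist(texts):
    """Return (word, count) pairs sorted by descending count.

    Hypothetical helper: ties are broken by code point order so that the
    result is deterministic.
    """
    counts = Counter()
    for text in texts:
        counts.update(ETHIOPIC_WORD.findall(text))
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Two made-up sample "pages"; Latin text and ASCII digits are ignored.
    pages = ["ሰላም ዓለም። ሰላም", "ዓለም፡ሰላም hello 123"]
    for word, count in build_wordlist(pages):
        print(word, count)
```

    Restricting the pattern to Ethiopic letters means the Ethiopic word separator (፡) and full stop (።) split words without being counted, and text in other scripts that a crawl picks up is dropped.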

    This Geez Crawler software is similar to Kevin Scannell's Crubadan Corpus builder, except that the former is specific to Geez languages. If you want to know more about web crawling, read Kevin's site.

    The word lists are updated periodically and are free to download and use for research purposes. You will need software to unzip the files and a Unicode font to display them properly.

    • Tigrigna
      • download: word list [433,164 words] from newspapers and books
      • updated in Feb 2011

    • Amharic
      • download: word list [342,625 words] crawled from selected internet sites and books
      • updated in Apr 2011
    One of the many motivations for this work is to build a word list for a spell checker. See the spell checking section.
  • The program has also been successfully applied to Greek, Icelandic and South African languages. If you are interested in using it for a different language, I can send you the source code by email.