


The Archimedes Project Morphology Service also provides an XML-RPC web interface: a script which forwards queries to the Morpheus lemmatiser/parser. Such a script can be included in pages of other text collections, enabling lemmatising searches via a "third-party" service.

Another approach often used for expanding search results is stemming, which typically uses an algorithm to normalize inflected words and "chop off" the inflections to produce a "stem" word. An example for Latin is the Schinke Latin Stemmer. The search engine Egothor also has a trainable stemmer component.

Another difficulty in searching a corpus can be orthographic (spelling) variation in the text. For example, Latin has no standard orthography, which for diplomatic transcriptions (where the spelling has not been normalized by the editor, but remains as it is in the text) can mean that the same word appears spelled differently throughout the corpus. XTF has a good introduction to how they have approached the problem of spelling correction in their search engine (mainly from the perspective of users "mistyping" their query, but the problem is the same).

Numerous services and tools provide, for any word in a given ancient text, the possible lexico-morphological combinations (e.g., τῶν has six possible analyses: lexeme ὁ or ὅς, each in masculine, feminine, or neuter forms). Such data is useful in many contexts, especially pedagogical ones. But such data will include many forms that are, for the context, incorrect. Some scholarly research questions require well-curated data sets, where alternative lexico-morphological forms are eliminated, weighted, or qualified. Curating a lexico-morphological dataset can be time-consuming (due in part to interpretive difficulties), but enormously profitable, since such data can be queried in sophisticated ways (e.g., in this corpus, how much more frequent are first-person aorists than third-person indicatives?).
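The stemming and spelling-normalisation ideas described above can be sketched together in a few lines. The following is a minimal illustration only: the suffix list, the `min_stem` threshold, and the function names are assumptions made for this example, and it is not the Schinke rule set or the Egothor stemmer. It folds j/i and v/u (a common normalisation for Latin), strips diacritics, and then removes the longest matching suffix.

```python
import unicodedata

# Illustrative suffix list, ordered longest first.
# This is an assumption for the sketch, NOT the Schinke Latin Stemmer rules.
SUFFIXES = (
    "ibus", "orum", "arum",
    "ium", "ius",
    "is", "us", "um", "am", "ae", "em", "es", "os", "as",
    "a", "e", "i", "o",
)


def normalise(word: str) -> str:
    """Lower-case, strip combining diacritics, and fold j->i and v->u."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.replace("j", "i").replace("v", "u")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first (longest) matching suffix, keeping at least min_stem letters."""
    w = normalise(word)
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for form in ("hominis", "hominem", "hominibus", "homo", "juvenis", "iuvenis"):
        print(form, "->", stem(form))
```

In this sketch 'hominis', 'hominem', and 'hominibus' collapse to the same stem, and the spelling variants 'juvenis' and 'iuvenis' are treated alike, but 'homo' ends up with a different stem from its own oblique forms. That gap is the kind of case where a lemma dictionary does better than suffix stripping.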
The lemma dictionaries typically connect many occurrences of inflected word forms to their lemma form, and act as a mediator between a query (or the one who asks it) and a database, a corpus, or a text collection.

For Greek and Latin, the foremost freely available lemma dictionaries are included in the Morpheus source as XML files.

A related problem is that of parsing an inflected form, that is, of performing a morphological analysis of that word: for example, saying that 'hominis' is the genitive singular of the lemma 'homo, -inis'. This can aid in lemmatisation because often multiple lemmas can be inflected to the same form, meaning that looking up the inflected form in a lemma dictionary will yield multiple results for the lemma form. This is why lemmatisation software and online services typically also provide a morphological analysis of the inflected form, so they act both as lemmatisers and parsers.

Disambiguating to the correct lemma form is a difficult problem, and parsing words in context to their correct part of speech can aid in this immensely. One approach is to use software such as TreeTagger trained to your language with a Treebank (such as the Perseus Treebanks).
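The relationship between lemma lookup, parsing, and disambiguation described above can be illustrated with a toy lemma dictionary. Everything in this sketch is an assumption made for illustration: the `Analysis` record, the `LEMMA_DICT` entries, and the function names are hypothetical and do not reflect the Morpheus XML format or any service's API. The optional `pos` argument stands in for the part-of-speech tag that a tagger would supply from context.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    lemma: str     # dictionary headword
    pos: str       # part of speech
    features: str  # morphological description


# Toy data for illustration only; a real lemma dictionary is far larger.
LEMMA_DICT = {
    "hominis": [Analysis("homo", "noun", "genitive singular")],
    "canis": [
        Analysis("canis", "noun", "nominative singular"),
        Analysis("canis", "noun", "genitive singular"),
        Analysis("cano", "verb", "2nd singular present active indicative"),
    ],
}


def analyse(form: str) -> list:
    """Return every known analysis of an inflected form (empty if unknown)."""
    return LEMMA_DICT.get(form.lower(), [])


def lemmas(form: str, pos: Optional[str] = None) -> list:
    """Return candidate lemmas, optionally filtered by a part-of-speech tag."""
    return sorted({a.lemma for a in analyse(form) if pos is None or a.pos == pos})


if __name__ == "__main__":
    print(lemmas("hominis"))            # one candidate lemma
    print(lemmas("canis"))              # ambiguous: noun or verb
    print(lemmas("canis", pos="verb"))  # narrowed by a contextual tag
```

Looking up 'canis' without context yields two lemmas, while supplying a part of speech narrows the result to one, which is the role a tagger trained on a treebank would play.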

Typically when implementing a search engine for a digital corpus, one wants to enable discovery not only of occurrences of the exact word forms in the query but also of other inflections of the search terms. For example, if you search Google for "digital classicism", your results will include Digital Classicist: even though "classicist" is not the exact word "classicism", you may be interested in the result. The same applies even more to highly inflected languages such as Greek and Latin (this is, after all, how people are taught to use dictionaries: you have to know, or predict, the lemma of a word to be able to look up its meaning and other information on it).
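One way to picture this kind of inflection-aware search is query expansion through a lemma table: the query term is mapped to its lemma, the lemma is mapped back to all of its known forms, and the corpus is searched for any of them. The sketch below assumes a small hand-made `FORM_TO_LEMMA` table and an arbitrary token list; both are hypothetical illustration data, and the function names are not taken from any particular search engine.

```python
from collections import defaultdict

# Hypothetical form -> lemma table, for illustration only.
FORM_TO_LEMMA = {
    "homo": "homo", "hominis": "homo", "homini": "homo",
    "hominem": "homo", "homines": "homo",
    "rex": "rex", "regis": "rex", "regem": "rex",
}


def build_index(form_to_lemma: dict) -> dict:
    """Invert the table so that each lemma lists all of its known forms."""
    index = defaultdict(set)
    for form, lemma in form_to_lemma.items():
        index[lemma].add(form)
    return index


LEMMA_TO_FORMS = build_index(FORM_TO_LEMMA)


def expand(term: str) -> set:
    """Expand a query term to every known form of its lemma."""
    lemma = FORM_TO_LEMMA.get(term.lower())
    if lemma is None:
        return {term.lower()}  # unknown word: fall back to exact matching
    return set(LEMMA_TO_FORMS[lemma])


def search(term: str, tokens: list) -> list:
    """Return the positions of tokens matching any expanded form."""
    wanted = expand(term)
    return [i for i, tok in enumerate(tokens) if tok.lower() in wanted]


if __name__ == "__main__":
    tokens = ["Homo", "homini", "lupus", "regem", "homines", "timent"]
    print(search("hominis", tokens))
    print(search("regis", tokens))
```

A query for 'hominis' here also finds 'Homo', 'homini', and 'homines', even though the exact form 'hominis' never occurs in the token list.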
