IMPACT at Conference on New Methods in Historical Corpora


By: Tomaž Erjavec

At this conference, Tomaž Erjavec of the Jožef Stefan Institute will present a follow-up to the work on the Slovene lexicon, entitled “Annotating historical Slovene texts: first experiments”.

This paper outlines a tool developed to process historical Slovene text, initially from the 19th century, which annotates words in a TEI-encoded corpus or digital library with their morphosyntactic tags, modern-day equivalents and lemmas. Such a tool is useful for developing historical corpora of the language, as it allows corpora to be searched by, for example, modern-day lemmas. It also has other uses, such as modernising historical texts for today's readers unfamiliar with older alphabets and orthography, and making it simpler to hand-correct OCR transcriptions.
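
As a rough illustration of what word-level annotation of this kind makes possible, the following Python sketch searches a small TEI fragment by modern-day lemma and produces a modernised reading. It is not the tool described in the paper: the use of the norm, lemma and msd attributes on the TEI w element, the two sample tokens, their tags, and the function names find_by_lemma and modernise are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
W = f"{{{TEI_NS}}}w"  # fully qualified name of the TEI <w> (word) element

# Hypothetical annotated fragment. The attribute layout (norm, lemma, msd)
# and the tokens and tags below are illustrative assumptions, not data
# taken from the paper or its corpus.
SAMPLE = f"""
<s xmlns="{TEI_NS}">
  <w norm="srce" lemma="srce" msd="Ncnsn">serce</w>
  <w norm="smrti" lemma="smrt" msd="Ncfsg">smerti</w>
</s>
""".strip()


def find_by_lemma(root, lemma):
    """Return the historical word forms annotated with the given modern lemma."""
    return [w.text for w in root.iter(W) if w.get("lemma") == lemma]


def modernise(root):
    """Join the modern-day equivalents, falling back to the original form."""
    return " ".join(w.get("norm", w.text) for w in root.iter(W))


root = ET.fromstring(SAMPLE)
print(find_by_lemma(root, "smrt"))
print(modernise(root))
```

Keeping the historical form as the element text and the modern equivalent as an attribute means the original transcription stays intact while both searching and modernised display work from the same file.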

The tool is, apart from the specific language resources it uses, fairly language-independent and could be useful for processing other languages, especially those of the Slavic family. These typically share Slovene's characteristics: rich inflection and available annotated corpora of the contemporary language, but a lack of resources for the historical language.