Adaptive OCR engine: A comprehensive software system that will significantly improve the recognition of historical texts by making adaptivity a core feature of the text recognition process. It will integrate several other tools, such as the image enhancement toolkit, the ABBYY FineReader engine, the post-correction modules and the lexicon tools.
A full web-based collaborative correction system: A web-based platform, suitable for massive volunteer participation, for validating and correcting OCR results. It enables the general public to help with large-scale digitisation efforts.
Improved FineReader OCR engine: ABBYY's state-of-the-art OCR engine will be adapted to cope with the challenge of recognising historical fonts and layouts.
Image enhancement toolkit: A set of software tools for manipulating scanned images in order to improve the recognition results of OCR engines.
Segmentation toolkit: A set of software tools for recognising and segmenting important features of scanned documents, such as blocks, lines and characters.
Post-correction modules: A set of software tools for improving the lexicon-based verification and correction of recognition results.
Experimental prototypes and tools:
- A special OCR classifier for typewriter characters, which are especially hard to recognise.
- A word spotting engine capable of searching for words in texts that cannot be processed by OCR in the traditional sense.
- A research prototype for extracting the complete character inventory of a given book based on shape clustering.
Named entities repository and collaborative environment: A collaborative web-based workspace for named entity management.
Lexical resources to improve OCR accuracy in historical documents and to enhance information retrieval by improving the matching between queries submitted to search engines and the variants of the search term found in historical documents.
Toolboxes providing the means to overcome the historical language barrier.
Functional Extension Parser: A set of web services that can be exploited to automatically detect and tag structural metadata of scanned material.
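
The lexicon-based ideas behind the post-correction modules and lexical resources listed above can be illustrated with a minimal Python sketch. It is not the project's implementation: the lexicon entries, the function names `correct_token` and `expand_query`, and the similarity cutoff are all assumptions chosen for illustration, and the fuzzy matching uses the standard library's `difflib` rather than any project tool.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon: historical spelling -> modern form.
# The entries are illustrative only and not taken from any project lexicon.
HISTORICAL_LEXICON = {
    "musick": "music",
    "publick": "public",
    "shew": "show",
}


def correct_token(token, lexicon=HISTORICAL_LEXICON, cutoff=0.8):
    """Map an OCR token to the closest attested historical spelling.

    Returns the token unchanged when no lexicon entry is similar enough.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return lowered
    # Fuzzy lookup against the attested historical word forms.
    matches = get_close_matches(lowered, list(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def expand_query(term, lexicon=HISTORICAL_LEXICON):
    """Expand a modern search term with its known historical variants."""
    modern = term.lower()
    variants = sorted(h for h, m in lexicon.items() if m == modern)
    return [modern] + variants


if __name__ == "__main__":
    # An OCR misreading of "publick" (capital I instead of l) is repaired.
    print(correct_token("pubIick"))
    # A modern query is widened to include the historical spelling.
    print(expand_query("music"))
```

Keeping correction and query expansion as two lookups over the same lexicon mirrors the double role described above: the same word list can support both OCR verification and retrieval.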