As part of its i2010 vision of a European Digital Library, the EU launched an ambitious plan for large-scale digitisation projects that transform Europe’s printed heritage into digitally available resources. The aim of fully integrating intellectual content into the modern information and communication technologies environment can only be achieved by full-text digitisation: transforming digital images of scanned books into electronic text.
Over the last two to three years, mass-digitisation has become one of the most prominent issues in the library world. Today, a number of advanced libraries in Europe are scanning millions of pages each year, and large-scale digitisation is a reality, no longer a vision. However, these efforts can tackle only a fraction of the total heritage held by cultural memory organisations. Digitised material is becoming available too slowly, in too small quantities and from too few sources, for three reasons.
- There is a lack of institutional knowledge and expertise, which causes inefficiency and ‘re-inventing the wheel’. This is a problem for the vast majority of libraries, museums and archives in Europe.
- The cost of producing full-featured electronic text from historical documents is much too high. Cultural heritage institutions will not be able to satisfy their users’ need for electronic text rather than mere digital images. Manual keying costs around 1 EUR per page, so a typical book comes to 400, 500 or even 1000 EUR.
- Automated text recognition, carried out by Optical Character Recognition (OCR) engines, in many cases does not produce satisfactory results for historical documents. Recognition rates are poor, and the output is sometimes unusable. No commercial or other OCR engine is able to cope satisfactorily with the wide range of printed materials published between the start of the Gutenberg age in the 15th century and the start of the industrial production of books in the middle of the 19th century.
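The manual keying figure quoted in the list above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `keying_cost_eur` and its parameters are assumptions introduced here, and the page counts are simply those implied by the quoted totals at roughly 1 EUR per page.

```python
def keying_cost_eur(pages: int, rate_eur_per_page: float = 1.0) -> float:
    """Estimate the manual keying cost of one book.

    The default rate of 1 EUR per page is the approximate figure
    quoted in the text; real rates will vary (assumption).
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * rate_eur_per_page


# Page counts implied by the totals quoted in the text.
for pages in (400, 500, 1000):
    print(f"{pages} pages: about {keying_cost_eur(pages):.0f} EUR")
```

Multiplying the per-book figure by the size of a collection shows why manual keying does not scale to holdings of many thousands of volumes.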
The IMPACT project will remove many of these barriers. The project will push innovation in OCR technology and language technology for historical document processing and retrieval, and will share expertise to build capacity in digitisation across Europe. During the project, a Centre of Competence will be set up to provide a central service entry point for all libraries, archives and museums involved in the digitisation of textual material.
The consortium brings together twenty-six national and regional libraries, research institutions and commercial suppliers. They will share their know-how and best practices, develop innovative tools to enhance both the capabilities of OCR engines and the accessibility of digitised text, and lay the foundations for the mass-digitisation programmes that will take place over the next decade.