The EC Digital Agenda and official launch of the IMPACT Centre of Competence
Khalil Rouhana will talk about the Digital Agenda of the European Commission and officially launch the IMPACT Centre of Competence.
Khalil Rouhana is the director for digital content and cognitive systems in DG INFSO. He is responsible for digital content policy in the Digital Agenda including public sector information, open data, cultural heritage, safe content and learning, and for EU supported research and innovation actions in creative content, information management, language technologies, interaction and robotics.
His previous experience includes heading the Unit in charge of ICT research and innovation strategy in DG INFSO between 2004 and 2010. An important part of his career was devoted to the monitoring and management of R&D and innovation activities in the public and private sectors. He was a project officer in the ESPRIT programme for 8 years in the areas of High Performance Computing and Networking and Future and Emerging Technologies.
Before joining the Commission in 1992, he was for 5 years the director of an institute and school of engineering (Grande Ecole) in France. He started his career as a research and development engineer in the aeronautics industry, worked for the French University in Beirut and also founded his own company. He has two master's degrees in electrical engineering.
OCR and the transformation of the Humanities
Three fundamental shifts are transforming the Humanities within academia and, more broadly, the fundamental relationship between society and the cultural heritage of the world. First, the scale of analysis has grown at once broader and more intensive. Second, the interpretation of the past is becoming a shared, global enterprise that must flow beyond traditional barriers of space, language and culture. Third, a decentralization of intellectual production is underway, in which student researchers and citizen scholars work side by side with faculty and library professionals. OCR both enables and demands each of these three transformations and must be considered a key constituent in any emergent infrastructure.
Gregory Crane's interests are twofold. On the one hand, he has published on a wide range of ancient Greek authors (including articles on Greek drama and Hellenistic poetry and a book on the Odyssey). At the same time, he has a long-standing interest in the relationship between the humanities and rapidly developing digital technology. He began this side of his work as a graduate student at Harvard when the Classics Department purchased its first TLG authors on magnetic tape in the summer of 1982. He developed a Unix-based full text retrieval system for the TLG that was widely used in North America and Europe in the middle 1980s. Since 1985 he has been engaged in planning and development of the Perseus Project, which he directs as the Editor-in-Chief. Besides supervising the Perseus Project as a whole, he has been primarily responsible for the development of the morphological analysis system which provides many of the links within the Perseus database.
From 1998 through 2006 he directed a grant from the Digital Library Initiative to study general problems of digital libraries in the humanities. In 2006, he produced a named entity identification system, published a 55 million word collection, and authored several publications describing the system. With the rise of the Google Books project in 2004, he began to focus upon the problems and opportunities that arise when whole libraries rather than curated collections become available on-line. Crane is especially interested in helping the emerging Cyberinfrastructure serve the needs of the humanities in general and classical studies in particular.
Strategic Digital overview
Digitisation has the potential to revolutionise research: the efficiency gains made possible by access to digital rather than physical items are huge. Many national libraries and other institutions are digitising significant proportions of their collections. However, this is only the first step towards making these documents accessible. Optical Character Recognition, indexing, text mining and sophisticated browser-based applications are also required to make this material easy to find, view, compare and analyse. Improvements in these techniques will only be made possible by collaborative projects in which the cost, risk and benefits can be shared between project partners. IMPACT is such a project. IMPACT has moved the boundaries of research forward by developing the tools that are required to unlock our physical collections and turn them into digital treasures.
Mr. Richard Boulderstone joined the British Library as Director of e-Strategy in July 2002. Formerly a CTO and Product Development Director at a number of international information providers, he has led the creation of many information-based products in both the USA and the UK. Mr. Boulderstone has led the creation of the British Library's Digital Library System, the primary repository for the Library’s digital collections, and is also responsible for the Science, Technology & Medicine collections and services at the BL.
Applied IMPACT: Does the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency? A case study by KB and CCS
OCR of historic newspapers printed in old font types like Fraktur and Gothic is a challenge and so far has not provided a sufficient level of text accuracy. KB and CCS investigate whether integrating the new ABBYY FineReader 10 and INL’s Dutch lexicon into a mass digitisation workflow system leads to improved text accuracy and fewer manual corrections.
Claus Gravenhorst joined CCS Content Conversion Specialists GmbH in 1983 and holds a diploma in Electrical Engineering (TU Braunschweig, 1983). Today he is the Director of Strategic Initiatives at CCS, leading business development. For 10 years Claus was in charge of the product management of CCS products. During the METAe Project, sponsored by the European Union Framework 5, from 2000 to 2003 Claus collaborated with 16 international partners (Universities, Libraries and Research Institutions) to develop a conversion engine for books and journals. Claus was responsible for the project management, exploration and dissemination. Since 2003 he has been engaged in Business Development and has acted as a speaker at various international conferences and exhibitions. In 2006 Claus contributed as a co-author to “Digitalization - International Projects in Libraries and Archives”, published in June 2007 by BibSpider, Berlin.
Experiences in mass digitisation: examining OCR quality
With millions of books digitized, there is now a massive amount of OCR-extracted text in use for building indexes, end-user display and other purposes. What can we learn about the quality of this material, and how can we apply that knowledge to improving the precision of OCR tools?
Paul Fogel is the Technical Lead for Mass Digitization at the University of California's California Digital Library, as well as serving as the co-Technical Lead for UC's partnership in the HathiTrust. He provides technical leadership for UC's collaborative work on mass digitization of books across all of the 10 UC campus libraries.
Crowdsourcing in the Digitalkoot project
Digitalkoot is a game-based crowdsourcing project in which volunteers focus on fixing OCR errors in digitised newspapers. As volunteers become an important part of the digitising process, the user experience changes because the quality of the digitised material improves. Additionally, the possibility of taking part in the digitising process brings cultural heritage closer to its users.
Majlis Bremer-Laamanen is Director of the Centre for Preservation and Digitisation at the National Library of Finland. She has experience as Project Leader and Work Package Leader in several national and international projects. Bremer-Laamanen has actively contributed to a number of publications in the field. She is a member of the EU Member States' Expert group on Digitisation and Digital Preservation and the IFLA Newspaper Section. In Finland, she participates in the National Digital Library: Availability Section, The National Library Management Group, the Mikkeli University Consortium: Management group and Advisory Board and the Digital Mikkeli Management group.
CLARIN and IMPACT: Crossing Paths
I will give a very brief overview of what CLARIN is and where it is going, and I will show how CLARIN’s and IMPACT’s paths seem to be crossing. The main question I will ask myself and the audience is how we can celebrate and, more importantly, exploit this in order to provide better services to our target audiences.
Steven Krauwer is the coordinator of CLARIN, which aims to construct a Language Resources and Technology Infrastructure to serve the Humanities and Social Sciences research community. He obtained his degree in mathematics from Utrecht University with a minor in linguistics. Since 1972 he has been working as a researcher, lecturer and project manager in the Utrecht Institute of Linguistics of the Faculty of Humanities of Utrecht University. His main research interest has been Language Technology, with a special focus on Machine Translation. His recent research interests include language resources and tools for the Humanities, endangered languages and research infrastructures.
Experts from the IMPACT project
Hildelies Balk-Pennington de Jongh
Digitisation challenges & IMPACT achievements so far
Since 2008 the IMPACT consortium, a European team of scientists, industry partners and digitisation professionals, has worked on addressing the challenges involved in digitising historical material. This presentation will provide a summary of the IMPACT solutions, which include new approaches in image enhancement, segmentation, OCR correction through crowdsourcing, document profiling, document structuring, improvement of the ABBYY FineReader, a novel Adaptive OCR approach, experimental OCR engines, tools for language technology as well as historical lexica for nine European languages, a framework with valuable testing and evaluation tools and a large IMPACT dataset that will continue to inspire future research activities.
Hildelies Balk-Pennington de Jongh is Head of the section European Projects for Research and Development in the department of Innovation and Research of the KB National Library of the Netherlands. Her section acquires and runs research projects on interoperability, digital preservation, digitization and access with partners in and outside Europe. Hildelies holds a PhD in the History of Art and is an experienced researcher and manager in the field of cultural heritage. She joined the KB in 2006 as head of the National programmes for digitization. The obvious need for improving access to the digital content created in these programmes gave rise to the forming of a European consortium to address the challenges in OCR in mass digitization of historical text. Hildelies coordinated the forming of this consortium and the writing of the proposal, resulting in the IMPACT project, led by the KB. She is project director of IMPACT and responsible for sustaining the results and the expertise of this project in a Centre of Competence to be launched at the end of 2011.
The IMPACT Knowledge Bank
The IMPACT Knowledge Bank is a suite of ‘paper tools’ that provide a framework for strategic digitisation decision making, a range of resources for capacity building, and support and advice for those interested in (mass) digitisation of historical text-based material. Its content will form the core of the new IMPACT Centre of Competence website (www.digitisation.eu).
Neil Fitzgerald is the IMPACT Delivery Manager for The British Library. Prior to this he led the Microsoft Digitisation Project, a public/private partnership to digitise and OCR up to 25 million pages of mostly 19th century books. He has worked for the BL since 2002 in a variety of roles delivering both on-demand and project-based imaging services. Before joining the BL he worked in the commercial imaging sector.
The IMPACT Interoperability Framework – Workflows for OCR and beyond
Clemens Neudecker will present the IMPACT Interoperability Framework, a novel web-based infrastructure for experimental workflow development in digitisation. The Framework provides facilities for wrapping a range of tools and services (both IMPACT and external) for transforming physical documents into digital resources as web services, which can then easily be used to build workflows. Because the workflows are registered through a Web 2.0 environment integrated with a workflow management system, users can discover, share, rate and tag workflows. The Framework also incorporates valuable testing and evaluation tools, which enable current and future developers to verify their progress against the existing state-of-the-art methods.
Clemens Neudecker holds an M.A. in Philosophy, Computer Science and Political Science. He was a member of the Munich Digitisation Centre (MDZ) from 2003 to 2009, mostly involved with OCR processing workflows in various national digitization projects. Since December 2009 he has worked at the KB National Library of the Netherlands, currently as the Technical Project Manager for IMPACT.
Evaluation tools, ground truth and dataset
Stefan Pletschacher will present the IMPACT Evaluation tools, a series of software tools for the performance evaluation of image enhancement, segmentation and OCR output, and the IMPACT dataset, a large dataset of over half a million representative pages of digitised historical texts (newspapers, books, pamphlets, typewritten material), compiled by a number of European National Libraries. A carefully selected subset of these pages has been reproduced in a sophisticated Ground Truth, containing 100% correct text as well as extensive information on the layout of the page. This Ground Truth can be used for development and testing of tools and language resources and for evaluating and demonstrating results. IMPACT Ground Truth for segmentation and text recognition is stored and exchanged via XML instances in the PAGE format (Page Analysis and Ground-truth Elements) developed by the University of Salford.
Stefan Pletschacher obtained his Master's degree (Diplom-Informatiker) in Computer Science from Chemnitz University of Technology in 2003. While initially working in the field of machine learning and data mining as a freelance software developer, he turned more towards document engineering when joining the Institute for Print and Media Technology at the University of Chemnitz as a Research Assistant. After a research sabbatical at the PRImA Research Group, he joined the University of Salford in 2008 as a Research Fellow with the main focus on document image analysis and recognition systems.
ABBYY & OCR improvements for IMPACT
Applying Optical Character Recognition (OCR) requires multiple steps and sophisticated technologies to achieve good results. Additionally, processing historic documents presents new and different challenges. Technically there is no single switch for optimization: each step has to be tuned. This presentation gives an overview of where ABBYY OCR technology was optimized during IMPACT, e.g. in image pre-processing, document analysis and character recognition.
Michael Fuchs is Senior Product Marketing Manager for OCR, Data Capture and Linguistic SDKs at ABBYY Europe. In this technical marketing role he serves as the “middleman” between market and customer requirements, the ABBYY Sales Teams and the ABBYY product development groups. Working with ABBYY since 2005, he has been involved in the IMPACT project right from the beginning, actively taking part in its conventions and discussions. His responsibilities at ABBYY include monitoring the market introduction of ABBYY products equipped with the technology enhancements gained through the IMPACT project.
IBM Adaptive OCR engine and CONCERT Cooperative Correction
Asaf Tzadok will present IBM's Adaptive OCR engine, a comprehensive software system which significantly improves the recognition of historical texts by making adaptivity one of the main features of the text recognition process. He will also show CONCERT, the COoperative eNgine for the Correction of ExtRacted Text, a web-based platform suitable for massive volunteer participation which validates and corrects OCR results and enables the general public to help with large-scale digitisation efforts.
Asaf Tzadok is the manager of the image and document analytics group at IBM Research - Haifa and has been part of the multimedia department since joining IBM in 1998. He is an expert in various aspects of image and document processing. Asaf's main research activities are in OCR, automatic parcel sorting, segmentation, scanner-printer quality, layout analysis, binarization and glyph vectorization. During the last four years Asaf has led the development of IBM's books and newspapers digitization platform, known as CONCERT, short for COoperative eNgine for the Correction of ExtRacted Text.
The Functional Extension Parser: A Document Understanding Platform
The FEP (Functional Extension Parser) is a document understanding platform that supports the extraction of structural metadata from digitized documents, such as table of contents entries, page numbers, headings or footnotes. FEP utilizes the results of Optical Character Recognition (OCR) and comes as a set of web-services ready to be integrated into digitization workflows. Display and correction of results can be done online.
Günter Mühlberger, Ph.D., Head of the Department for Digitisation and Digital Preservation of the University Innsbruck Library. Professional experience: Coordinator and project manager of several R&D projects from the 4th to the 7th EU Framework Programme, eTen, Culture and ICT-PSP. Initiated the eBooks-on-Demand (EOD) Network with 30 libraries from 14 European countries. Project manager of national projects, e.g. Austrian Literature Online (one of the largest digital repositories in Austria) or Innsbrucker Zeitungsarchiv. Several publications and lectures on digitisation issues.
Postcorrection in IMPACT
Compared to the classical scenarios of OCR post-correction, the large-scale digitisation of old printed material poses a variety of new challenges. Because historical language variation is a second source of uncertainty besides OCR accuracy, special techniques have to be applied in order to distinguish spelling variants from OCR errors and treat both accordingly. Moreover, the conventional idea of interactive post-correction systems, where documents have to be verified page by page, word by word, does not meet the requirements of mass digitisation efforts. The system presented here uses special algorithms for the analysis of historical, OCRed text. This analysis provides us with document-specific knowledge of the language and also a model of the OCR error channel, which can be used to find and correct systematic errors with greater efficiency and accuracy.
Ulrich Reffle graduated in computational linguistics in 2006 and then joined the research group of Prof. Klaus Schulz at the University of Munich. His research focuses on efficient finite-state technology in the area of text correction tasks.
Overview of language work in IMPACT
This talk will provide the necessary background on the IMPACT approach for the construction and use of lexica for OCR and retrieval in historical documents. The approach has led to the successful development of lexica for nine European languages. This presentation is a joint effort of Katrien Depuydt (INL) and Klaus Schulz (CIS group, University of Munich).
Katrien Depuydt is the Head of the Dutch Language Bank at the INL (Institute for Dutch Lexicology) in Leiden. She is a historical linguist and lexicographer. She has worked on two major historical dictionaries and has many years of experience in managing electronic publishing and content management projects. In IMPACT she leads the work packages on language resources and on tools for building and applying language resources.
Jesse de Does
Evaluation of lexicon supported OCR and Information Retrieval
In IMPACT considerable effort has been put into building lexica to improve OCR and retrieval for historical texts. In this talk, results for nine European languages will be presented.
Jesse de Does is a computational linguist. He holds a PhD in applied mathematics and a master's degree in Slavic Linguistics, and has many years of experience in language processing and retrieval applications. In IMPACT he is responsible for several tools for lexicon building and lexicon application, as well as for the implementation software for the IMPACT historical dictionaries and for the evaluation of lexica in OCR and retrieval.
Introduction to the IMPACT Centre of Competence
The end of the IMPACT project in December 2011 is not the end of the unique collaboration of scientists, industry partners and digitisation professionals: the IMPACT Centre of Competence will allow all cultural heritage and research institutions to keep working together in an innovative way to improve access to historical texts. Its mission is to make the digitisation of historical printed text in Europe faster, cheaper and better, and to provide tools, services and facilities to further advance the state-of-the-art in the field of document imaging and processing of historic text. This session will outline the opportunities for the digitisation community to engage with the IMPACT Centre of Competence.
Aly Conteh is Digitisation Programme Manager at the British Library. He has been involved in many digitisation projects at the British Library including projects to digitise 25 million pages of 19th Century books, 4 million pages of pre-1900 newspapers and significant numbers of manuscript volumes. He serves on the Executive Board for the IMPACT project and is a member of the European Commission’s Member States’ Expert Group on Digitization and Digital Preservation.
Parallel Research session
The following speakers will take part in the Parallel Research session at the IMPACT conference:
Apostolos Antonacopoulos heads the Pattern Recognition and Image Analysis (PRImA) research laboratory in the School of Computing, Science and Engineering at the University of Salford, UK. He received his PhD from the University of Manchester Institute of Science and Technology (UMIST), UK in 1995. Dr. Antonacopoulos has worked and published extensively on various problems in Document Analysis and Understanding as well as on other applications of Pattern Recognition and Image Analysis. He is a member of the Editorial Boards of the International Journal on Document Analysis and Recognition and of the Electronic Letters on Computer Vision and Image Analysis. He has served as Head of Computer Science at the University of Salford and has chaired or served as a member of a number of International Association for Pattern Recognition (IAPR) and other committees. Most notably he served as the 1st Vice President of the IAPR. He has given a number of invited talks and tutorials (most recently on the analysis and recognition of historical documents) and is a member of the programme committees of most conferences in his field. He has also co-edited a special issue on the Analysis of Historical Documents in the International Journal on Document Analysis and Recognition. He has significant experience in leading and participating in national, European (FP7 and earlier) and industry-sponsored projects.
Basilis G. Gatos received his Electrical Engineering Diploma in 1992 and his Ph.D. degree in 1998, both from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. His Ph.D. thesis is on Optical Character Recognition Techniques. In 1993 he was awarded a scholarship from the Institute of Informatics and Telecommunications, NCSR "Demokritos", where he worked till 1996. From 1997 to 1998 he worked as a Software Engineer at Computer Logic S.A. From 1998 to 2001 he worked at Lambrakis Press Archives as a Director of the Research Division in the field of digital preservation of old newspapers. From 2001 to 2003 he worked at BSI S.A. as Managing Director of the R&D Division in the field of document management and recognition. He is currently working as a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research "Demokritos", Athens, Greece. His main research interests are in Image Processing and Document Image Analysis, OCR and Pattern Recognition. He has more than 110 publications in journals and international conference proceedings and has participated in several research programs funded by the European Community. He is a member of the Technical Chamber of Greece and of the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR), and a program committee member of several international Conferences and Workshops (e.g. ICDAR 2009, ICFHR 2010, ICDAR 2011, CBDAR 2011, AND 2011, International Workshop on Historical Document Imaging and Processing 2011).
Parallel Language session
The following speakers will take part in the Parallel Language session at the IMPACT conference:
Frank Landsbergen works for the Institute for Dutch Lexicology (INL) in Leiden, The Netherlands. He has a PhD in modern linguistics. His main task within IMPACT is named entity recognition.
Annette Gotscharek is a computational linguist and researcher at LMU with a special background in the construction of electronic dictionaries and their use for text correction and text interpretation tasks. Before the start of the IMPACT project, she had already gained experience in various related topics in a project on adaptive text correction funded by the German Research Foundation (DFG).
Prof. Janusz S. Bien is a computer scientist and a linguist and served as vice-president of the Polish Linguistic Society from 2003 to 2005. He has been involved in the digitisation of several dictionaries, including the 17th century Knapski's dictionary, the 18th century dictionary of Troc, the 20th century so-called "Warsaw dictionary" and some volumes of the work-in-progress dictionary of the Polish language of the 16th century. He leads the Ministry of Education project "Digitalization tools for philological research" (2009-2012).
Dr. Tomaž Erjavec works at the Dept. of Knowledge Technologies at the Jožef Stefan Institute on language technologies, mostly the development of Slovene language resources, methods for linguistic annotation and standardisation of language encoding. He is the founding president of the Slovenian Language Technologies Society (1998), is the Slovene representative in ISO TC 37, was a member of the EACL board and TEI council, and is on the editorial board of several journals. For more information, see nl.ijs.si/et/
Digitisation Tips session
The Digitisation Tips session is hosted by Aly Conteh (Digitisation Programme Manager, The British Library). Members of the panel include:
- Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects department, KB National Library of the Netherlands)
- Geneviève Cron (OCR expert, Bibliothèque nationale de France)
- Christa Müller (Director Digital Services Department, Austrian National Library)
- Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland)
- Alenka Kavcic-Colic (Head of the Research and Development Unit, National and University Library of Slovenia)