Speakers - external experts

Pat Manson

Digitisation of Cultural Resources: European Actions and the Context of IMPACT

Making the resources of Europe's libraries, archives and museums accessible online is one of the major challenges identified in the Commission's i2010 digital libraries initiative. Digitisation of analogue collections and promoting their use online is a key element in optimising the economic and cultural potential of Europe's cultural heritage. The European Commission and Council have recognised the importance of this and have committed to various actions in support of mass digitisation and accessibility online. Amongst the challenges are the need to improve the cost-effectiveness of digitisation, through improved technologies and tools, and to expand the competences in digitisation across Europe's cultural institutions. At the core of this is the concept of (virtual) centres of competence which aim to exploit the results of research and to leverage national and other initiatives.

 

Pat Manson has worked with the European Commission's research programmes since the early 1990s, first as project officer and latterly as Head of Unit for Cultural Heritage and Technology Enhanced Learning. The unit focuses mainly on research in digital libraries, digital preservation and in the use of ICTs for improving learning, but is also involved in the development of the i2010 digital libraries policies and actions. Prior to joining the Commission, she worked in the UK providing a national advisory and market watch service to libraries on the use of new technologies.

 


Simon Tanner

Measuring the OCR Accuracy across The British Library 2 Million Page Newspaper Archive

King's Digital Consultancy Services (KDCS) has worked with The British Library to establish the OCR accuracy of the recent digitisation of 2 million pages of newspaper from the 19th Century. This paper will discuss the methodology used, as developed by KDCS and Digital Divide Data (DDD). It will further consider the implications of this approach and its advantages for other OCR-based digitisation projects.


Simon Tanner is the Director of King's Digital Consultancy Services (KDCS) in the Centre for Computing in the Humanities (CCH) at King's College London. KDCS provides research and consulting services specializing in the information and digital domain for the cultural, heritage and information sectors.
Simon is an independent member of the UK Legal Deposit Advisory Panel and Chair of its Web Archiving sub-committee. He is also a member of the JISC Digitisation Advisory Group. Simon authored the book, Digital Futures: Strategies for the Information Age, with Dr Marilyn Deegan and they co-edited the book, Digital Preservation. He has recently been leading the pilot digital imaging of the Dead Sea Scrolls.

 


Rose Holley

Many Hands Make Light Work: Collaborative OCR Text Correction in Australian Historic Newspapers.

In July 2008 Australian Newspapers beta was released to the public by the National Library of Australia. Public users have the ability to improve the OCR text by correcting it. Rose Holley will report on the issues, activity and highly successful outcomes of this innovative idea.


Rose Holley is manager of the Australian Newspaper Digitisation Program (http://www.nla.gov.au/ndp/) at the National Library of Australia. Prior to this she worked in New Zealand instigating and managing digitisation projects and was actively involved in raising awareness of digitisation techniques across the cultural heritage sector via her roles for the National Digital Forum (NDF http://ndf.natlib.govt.nz/) and the Auckland Heritage Archivists and Librarians Group (AHLAG www.ahlag.auckland.ac.nz). Rose is passionate about utilising digital technologies to enable preservation, discovery and access of our cultural heritage resources and in moving from small scale digitisation to mass digitisation. Her previously published papers on digitisation are available here: http://eprints.rclis.org/view/people/Holley=3ARose=3A=3A.html

 

Claus Gravenhorst

Future Challenges for OCR Technology

The basic requirement for accessibility, presentation and delivery of digitised material is an almost error-free text. Due to the enormous variety of document types and source qualities, current OCR technology faces major challenges. Experiences from various digitisation projects at Ivy League cultural heritage institutions show that next generation OCR and related technologies need substantial improvements to meet future requirements.


Claus Gravenhorst, Director Strategic Initiatives, joined CCS Content Conversion Specialists GmbH in 1983 and holds a diploma in Electrical Engineering (TU Braunschweig, 1983). Today he is the director of Strategic Initiatives at CCS, leading business development, product management and product quality assurance. For 10 years Claus was in charge of the product management of CCS products. During the METAe Project, sponsored by the European Union Framework 5, from 2000 to 2003 Claus collaborated with 16 international partners (Universities, Libraries and Research Institutions) to develop a conversion engine for books and journals. Claus was responsible for the project management, exploration and dissemination. The METAe Project was successfully completed in August 2003. Since 2003 he has been engaged in Business Development and has promoted docWORKS as a speaker at various international conferences and exhibitions. In 2006 Claus contributed as a co-author to “Digitalization - International Projects in Libraries and Archives”, published in June 2007 by BibSpider, Berlin.

 


Speakers - Experts from the IMPACT project

Hildelies Balk

Introduction to IMPACT

In the EC-funded project IMPACT (Improving Access to Text) seven libraries, six research institutes and two private sector companies across Europe work together to address the challenges of mass digitisation by developing OCR software and technologies that significantly exceed the accuracy of current state-of-the-art software. The IMPACT solutions focus on the entire process of recognition after the document leaves the scanner: image processing, OCR processing (including use of dictionaries), OCR correction and document formatting. IMPACT will also build capacity in mass digitisation by sharing best practice and expertise with the cultural heritage communities in Europe.


Hildelies Balk holds a PhD in the history of art and is an experienced researcher and programme manager in the field of cultural heritage. She joined the National Library of the Netherlands (KB) in 2006 as head of National Digitisation Programmes. After coordinating the forming of the IMPACT consortium and procuring the funding for this project she became coordinator of the project. She now heads the European Projects section within the department of Research and Development of the KB.

 


Astrid Verheusen

Library Challenges for Mass Digitisation

Since the world wide web made it possible to display graphics, libraries have been scanning their older documents and pictures to provide access to them. Due to the progress in the field of digitisation a profound knowledge about best practices has been developed. However, the current trend toward mass digitisation and the need to speed up the process make new approaches necessary. In this presentation the experience of the Koninklijke Bibliotheek, the National Library of the Netherlands, will be used to show how to handle several aspects of the digitisation process in order to digitise efficiently on a large scale.

 

Astrid Verheusen holds a Master’s degree in history and has been working for the KB since 2001. She has many years of experience in research and development projects in the field of digitisation and digital preservation. She is currently head of the Digitisation Department.

 


Asaf Tzadok

Adaptive OCR

Asaf Tzadok will talk about the IMPACT Adaptive OCR system. He will share the Adaptive OCR vision and challenges, and outline the main processes, which combine fully automatic and semi-automatic algorithms.
One major aspect of this system is the online Collaborative Correction module, through which volunteers can correct the OCR results and in this way improve the adaptive OCR engine.

 

Asaf Tzadok works as an R&D team leader and senior research staff member at IBM Haifa Research Lab. He is an expert in various aspects of image and document processing. His main research activities were in OCR, automatic parcel sorting, segmentation, scanner-printer quality, layout analysis, binarization and glyph vectorization. Asaf was responsible for managing and developing an archive digitization engine for the Hearst Newsreel Archive in the USA.

 


Basilis Gatos

A User Friendly Platform for Document Image Enhancement and Segmentation

The IMPACT enhancement and segmentation platform lets the user select a methodology for the enhancement and/or segmentation stages. It also provides interactive visualisation, not only of the final results of enhancement or segmentation but also of the results produced by the intermediate stages (e.g. the image after de-warping or block segmentation). A live demo of the platform's functionality will be presented, covering state-of-the-art techniques as well as a first version of representative new IMPACT toolkits.


Dr. Basilis G. Gatos is a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research "Demokritos", Athens, Greece. His main research interests are in image processing and document image analysis, OCR, processing and recognition of historical documents. He has more than 80 publications in journals and international conference proceedings and has participated in several research programs funded by the European community. He is a member of the Technical Chamber of Greece and program committee member of several international Conferences (e.g. ICFHR 2008, ICDAR 2009).

 


Klaus Schulz

Language Technology for Improving OCR on Historical Texts

In this talk Klaus Schulz will discuss how language technology can help to improve or correct OCR results on historical texts. After a brief summary of current techniques for OCR correction he will comment on the specific challenges that have to be faced when recognizing historical texts. Results will be presented on how the use of special dictionaries for historical language in OCR engines affects OCR quality. The second part of the talk is centered on current research on "profiling" OCR output in a fully automated way. The intended profiles try to detect the base language, special vocabulary and typical spelling variants of the underlying text, as well as typical OCR errors found in the output. Profiles of this form can be used for postcorrection or for improving OCR output in a second run.


Prof. Klaus U. Schulz finished his Ph.D. in Mathematics in 1987. After a visiting professorship at the University of Niteroi he was appointed professor of Computational Linguistics at the University of Munich (LMU) in 1991. He is a technical director of the Centrum für Informations- und Sprachverarbeitung (CIS) of the LMU. Recent research interests are concentrated on text correction, document analysis, information retrieval and semantic technologies.

 


Günter Mühlberger

Günter Mühlberger, Ph.D., Head of the Department for Digitisation and Digital Preservation of the University Innsbruck Library. Professional experience: since 1998 coordinator and project manager of several EU RTD projects. Publications and lectures on digitisation issues. Coordinator of eBooks on Demand (EOD), a network of 18 libraries from 10 European countries providing a Digitisation and Print on Demand service.

 


Aly Conteh

Aly Conteh is the Digitisation Programme Manager at the British Library, a post he took up in April 2003. He is responsible for the development and implementation of the policies and frameworks to govern digitisation of items from the Library’s collections in accordance with the British Library Strategy. His background is in IT in both the public and private sectors. He has been involved in many digitisation projects at the British Library, including projects to digitise 20 million pages of 19th Century books and 4 million pages of pre-1900 newspapers.

 


Apostolos Antonacopoulos

Digital Restoration and Layout Analysis: Challenges and IMPACT

The presentation will cover the background issues, challenges and opportunities in image restoration and layout analysis of historical documents for large-scale full-text conversion. The talk starts by categorising and examining the different factors that give rise to artefacts which affect full-text conversion, along with possibilities for improvement. Work carried out within IMPACT on image enhancement and layout analysis is then outlined.


Dr. Apostolos Antonacopoulos heads the Pattern Recognition and Image Analysis (PRImA) research group at the University of Salford, UK. He is the 2008-2010 1st Vice President of the International Association for Pattern Recognition (IAPR). He is a member of the Editorial Boards of the International Journal on Document Analysis and Recognition (IJDAR) and of the Electronic Letters on Computer Vision and Image Analysis (ELCVIA) journal. He is Chair of ACM DocEng2010, Program Co-Chair of ICDAR2009 and he serves on the program committees of most conferences in Document Image Analysis. In 2007 he co-edited an IJDAR special issue on the Analysis of Historical Documents.

 


Katrien Depuydt

Historical Lexicon Building and How it Improves Access to Text

In this talk Katrien Depuydt will discuss how lexica of historical language can be built and how they can be used to improve access to text. First some linguistic issues in building computational lexica of historical language will be discussed. Then some insight will be given into the lexicon building process. Finally, it will be demonstrated how computational lexica of historical language can overcome the historical language barrier in retrieval.

 
Katrien Depuydt is head of the language database department at the Institute for Dutch Lexicology (INL) in Leiden. She is a historical linguist and lexicographer. She has worked on two major historical dictionaries of Dutch, the Woordenboek der Nederlandsche Taal (Dictionary of the Dutch Language) and the Vroegmiddelnederlands Woordenboek (Dictionary of Early Middle Dutch). She has many years of experience in linguistic data and application development.

 


Neil Fitzgerald

Decision Support Tools

The Decision Support Tools presentation will provide an overview of current digitisation practice within project partner institutions, explain the work IMPACT has done in this area, and set out the overall objectives for these tools by project end. The overall digitisation process, from starting a project to OCR issues, will be considered.


Neil Fitzgerald has been the IMPACT Delivery Manager for The British Library since June 2008. Prior to this he led the Microsoft Digitisation Project, a public/private partnership to digitise and OCR 25 million pages of mostly 19th century books. He has worked for the BL since 2002 in a variety of roles delivering both on-demand and project-based imaging services. Before joining the BL he worked in the commercial imaging sector.

 
