- Each session will be held twice so participants have the option to attend 2 out of 3 sessions -
Presentation and discussion of state-of-the-art research tools for document analysis and OCR
In IMPACT, researchers from various universities and research centres have developed research prototypes that significantly advance the state of the art in text recognition. During this session, several new approaches in areas such as image enhancement, segmentation and experimental OCR engines will be presented and discussed. This session is hosted by Apostolos Antonacopoulos (University of Salford).
Apostolos Antonacopoulos heads the Pattern Recognition and Image Analysis (PRImA) research laboratory in the School of Computing, Science and Engineering at the University of Salford, UK. He received his PhD from the University of Manchester Institute of Science and Technology (UMIST), UK in 1995. Dr. Antonacopoulos has worked and published extensively on various problems in Document Analysis and Understanding as well as on other applications of Pattern Recognition and Image Analysis. He is a member of the Editorial Boards of the International Journal on Document Analysis and Recognition and of the Electronic Letters on Computer Vision and Image Analysis. He has served as Head of Computer Science at the University of Salford and has chaired or served as a member of a number of International Association for Pattern Recognition (IAPR) and other committees. Most notably, he served as the 1st Vice President of the IAPR. He has given a number of invited talks and tutorials (most recently on the analysis and recognition of historical documents) and is a member of the programme committees of most conferences in his field. He has also co-edited a special issue on the Analysis of Historical Documents in the International Journal on Document Analysis and Recognition. He has significant experience in leading and participating in national, European (FP7 and earlier) and industry-sponsored projects.
Basilis G. Gatos received his Electrical Engineering Diploma in 1992 and his Ph.D. degree in 1998, both from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. His Ph.D. thesis is on Optical Character Recognition Techniques. In 1993 he was awarded a scholarship from the Institute of Informatics and Telecommunications, NCSR "Demokritos", where he worked until 1996. From 1997 to 1998 he worked as a Software Engineer at Computer Logic S.A. From 1998 to 2001 he worked at Lambrakis Press Archives as Director of the Research Division in the field of digital preservation of old newspapers. From 2001 to 2003 he worked at BSI S.A. as Managing Director of the R&D Division in the field of document management and recognition. He is currently working as a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research "Demokritos", Athens, Greece. His main research interests are in Image Processing and Document Image Analysis, OCR and Pattern Recognition. He has more than 110 publications in journals and international conference proceedings and has participated in several research programs funded by the European Community. He is a member of the Technical Chamber of Greece and of the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR), and a program committee member of several international conferences and workshops (e.g. ICDAR 2009, ICFHR 2010, ICDAR 2011, CBDAR 2011, AND 2011, International Workshop on Historical Document Imaging and Processing 2011).
Presentation and demonstration of the IMPACT language tools & resources in further detail, hosted by Katrien Depuydt (INL)
Named Entity Work in IMPACT by Frank Landsbergen
Named entities (NEs) refer to person names, locations and organisations, e.g. John Paul the Second, Edinburgh, SPECTRE. They are popular search terms in historical texts. The challenge within IMPACT is to improve the often poor OCR quality of NEs in such texts and to enable users to find NEs regardless of historical spelling variation. In this talk, we discuss NE extraction work and variant matching in historical English and Dutch.
Frank Landsbergen works for the Institute for Dutch Lexicology (INL) in Leiden, The Netherlands. He has a PhD in modern linguistics. His main task within IMPACT is named entity recognition.
Special resources to access 16th century German by Annette Gotscharek
Together with BSB, we set up a special dataset of 16th century data, as documents from this period are of special interest to them. We report on our experience in selecting, preparing and analysing the data as well as the corpus-based construction of special lexica for OCR and IR.
Annette Gotscharek is a computational linguist and researcher at LMU with a special background in the construction of electronic dictionaries and their use for text correction and text interpretation tasks. Before the start of the IMPACT project, she had already gained experience on various related topics in a project on adaptive text correction funded by the German Research Foundation (DFG).
Polish language resources in IMPACT by Janusz Bien
Resources used and created for the IMPACT project will be presented and discussed. They include the dictionary of the Polish language of the 17th and the first half of the 18th century, developed and published using the Internet (http://sxvii.pl/), a morphological analyser used to process Polish texts, and a corpus tool to be used to make the textual resources created for the project available to the general public.
Prof. Janusz S. Bien is a computer scientist and a linguist and served as vice-president of the Polish Linguistic Society in 2003-2005. He has been involved in the digitisation of several dictionaries, including the 17th century Knapski's dictionary, the 18th century dictionary of Troc, the 20th century so-called "Warsaw dictionary" and some volumes of the work-in-progress dictionary of the Polish language of the 16th century. He is the leader of the Ministry of Education project "Digitalization tools for philological research" (2009-2012).
Slovene language resources in IMPACT by Tomaž Erjavec
The talk will present the work undertaken by the Slovene language partner of IMPACT to arrive at a collection of transcriptions of Slovene historical language, a hand-annotated reference corpus, a lexicon and a tool to annotate historical Slovene language. The work presented was performed in cooperation with the National and University Library of Slovenia and with the Scientific Research Centre of the Slovenian Academy of Sciences and Arts, the latter in the scope of the Google Digital Humanities Research Award for Developing Language Models for Historical Slovene. The presented language resources provide a good basis for computational processing of historical Slovene, from the perspective of OCR correction and IR as well as corpus research, and will be made available under a Creative Commons licence in order to stimulate research and development in the processing of historical Slovene.
Dr. Tomaž Erjavec works at the Dept. of Knowledge Technologies at the Jožef Stefan Institute. His work focuses on language technologies, mostly the development of Slovene language resources, methods for linguistic annotation and standardisation of language encoding. He is the founding president of the Slovenian Language Technologies Society (1998), is the Slovene representative in ISO TC 37, was a member of the EACL board and TEI council, and is on the editorial board of several journals. For more information see nl.ijs.si/et/
Digitisation tips session
Meet the expert: questions & answers on digitisation issues
Still not sure how (mass) digitisation is going to work at your institute? Do you now have more questions than answers? If so, this is your opportunity to ask the experts! Whether you have a simple practical question or a global concern you would like to share with everyone, bring it to our panel of library experts, from both inside and outside the IMPACT consortium, for their opinion.
The session is hosted by Aly Conteh (Digitisation Programme Manager, The British Library). Members of the panel include:
- Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects department, KB National Library of the Netherlands)
- Geneviève Cron (OCR expert, Bibliothèque nationale de France)
- Christa Müller (Director Digital Services Department, Austrian National Library)
- Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland)
- Alenka Kavcic-Colic (Head of the Research and Development Unit, National and University Library of Slovenia)