IMPACT is a project funded by the European Commission. It aims to significantly improve access to historical text and to remove the barriers that stand in the way of the mass digitisation of European cultural heritage. Fifteen participants from different European countries form the project consortium, which includes representatives of:
• universities (University in Munich, Universität Innsbruck, University of Salford, University of Bath, Charles University in Prague);
• scientific institutes and centres (Instituut voor Nederlandse Lexicologie, National Centre for Scientific Research (Greece), Jožef Stefan Institute (Slovenia));
• national and state libraries (Österreichische Nationalbibliothek, The British Library, Deutsche Nationalbibliothek, Bibliothèque Nationale de France, Koninklijke Bibliotheek, Narodna in univerzitetna knjižnica, Národní knihovna České republiky);
• private companies (IBM Israel – Science and Technology Ltd and ABBYY Production – Moscow).
IMPACT is funded under the Seventh Framework Programme of the European Commission (FP7). The participating countries are Austria, Bulgaria, Great Britain, Germany, Greece, Israel, Russia, Slovenia, France, the Netherlands and the Czech Republic. The project’s coordinator is the National Library of the Netherlands. Most countries participate with two institutions: the first develops the project and the second provides the necessary resources. The Institute for Parallel Processing of the Bulgarian Academy of Sciences and the St. St. Cyril and Methodius National Library are the Bulgarian representatives in the project.
The project’s aim is to remove the obstacles to creating the European Digital Library and to digitisation in Europe. In the i2010 vision of a European Digital Library, the EU launched an ambitious plan for large-scale digitisation projects that transform Europe’s printed heritage into digitally available resources. However, these processes have been delayed for the following reasons:
• Automated text recognition, carried out by Optical Character Recognition (OCR) engines, in many cases does not produce satisfactory results for historical documents. The recognition of old prints, which have many typographic variations or complex layouts (as in newspapers), is not very successful. The same applies to microfilms and unpublished typewritten texts;
• Present-day lexica are not sufficient to recognise the obsolete words, endings and orthographic variants found in historical texts.
The research objectives of the project are:
• To recognise all printed text created before 1900;
• To enable OCR software to recognise obsolete letters, which is one of the most important objectives. The project will extend its research and provide a large lexical resource covering the conjugated forms and endings of obsolete words and their relations to the present-day word forms;
• To test this approach in 9 European languages from 3 basic linguistic groups (Germanic, Slavonic and Romance).
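The lexical resource mentioned in the objectives links obsolete word forms to their present-day forms. A minimal sketch of such a lookup follows, assuming a simple in-memory mapping. The entries, the names `HISTORICAL_LEXICON`, `modern_form` and `normalise`, and the example words are illustrative assumptions, not data or tools from the IMPACT project:

```python
from __future__ import annotations

# Hypothetical historical lexicon: obsolete spelling -> present-day form.
# The entries are illustrative examples only, not IMPACT data.
HISTORICAL_LEXICON = {
    "olde": "old",
    "musick": "music",
    "shew": "show",
}


def modern_form(token: str, lexicon: dict[str, str] = HISTORICAL_LEXICON) -> str:
    """Return the present-day form of a token, or the token itself if unknown."""
    return lexicon.get(token.lower(), token)


def normalise(text: str) -> list[str]:
    """Map every whitespace-separated token to its present-day form."""
    return [modern_form(tok) for tok in text.split()]
```

A real historical lexicon would also have to handle inflected forms and several possible modern equivalents per variant; the one-to-one mapping here is only the simplest case.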
The project’s idea is to:
• Apply a multilingual approach to accessing digital documents and increase the use of such materials;
• Provide and adapt language technologies and create language resources for languages not yet covered by the project, including Bulgarian, Slovenian and Czech. Within the project, 3 more libraries will have the opportunity to provide datasets, demonstrate the project’s results and build digitisation capacity in their language areas.
The main objective is to develop innovative language technologies for OCR in order to remove the historical language barrier. Two leading industrial partners in text recognition are involved. IMPACT explores new methods for image enhancement and segmentation, as well as the use of language technology and historical vocabulary in OCR. The project develops tools for creating a lexicon (thesaurus), for using the vocabulary in OCR and in preserving the digital copies, and for structuring documents.
The second objective of the project is to improve the process of mass digitisation by sharing know-how and best practices in digitisation across Europe. To that end, the project will develop a website, a help desk, support tools, a coherent training programme and a permanent Centre of Competence, in order to provide a single access point for all players involved in mass digitisation and full-text generation across Europe.
Stages of the project
The first stage of the project ran from 2008 to 2009. Work on the development of most of the IMPACT tools and content started in this period. At the same time, work began on outlining the Interoperability Framework in detail and on procuring the images to be used for developing, testing and demonstrating the IMPACT results.
Three substantial historical lexica, for English, Dutch and German, were created in this period.
The second stage of the project ran from 2010 to 2011. With the entry of 11 new IMPACT2.0 partners in 2010, the IMPACT consortium now brings together twenty-six national and regional libraries, research institutions and commercial suppliers, increasing the level of expertise of the project. The new technology partners will implement and adapt language tools and build lexical resources for languages not yet addressed, using the synergy of this cooperation to gain a cross-language view on the accessibility and enhancement of digitised text. Five new national and university libraries from Southern and Eastern Europe, including Bulgaria, will provide datasets, demonstrate project results and build digitisation capacity in their language areas. The objectives of the project have been broadened:
• Demonstration of the IMPACT tools for creating a successful lexicon (thesaurus) for Slovenian, Bulgarian and Czech. To this end, the Jožef Stefan Institute, the Institute for Parallel Processing of the Bulgarian Academy of Sciences and the Institute of the Czech National Corpus, Charles University Prague, will continue working on improving the OCR software by using a special lexicon for the historical language. In addition, the Bulgarian Academy of Sciences, together with ABBYY, will add historical (obsolete) Cyrillic letters to the OCR software. The national libraries of Slovenia, Bulgaria and the Czech Republic will provide and maintain datasets for the development, assessment and demonstration of the results;
• Demonstration and dissemination of the results from the project in Slovenia, Bulgaria and the Czech Republic;
• Building a permanent Centre of Competence. The inclusion of more languages and library partners demonstrates the operating model of IMPACT as a Centre of Competence across Europe.
The participation of the National Library of Bulgaria in IMPACT
The National Library of Bulgaria is a partner in the second project stage and is obliged to provide digital resources for testing the Bulgarian research developments. Its tasks are:
• Providing a dataset for the development, assessment and demonstration of the software, namely its own collection of digitised Bulgarian journals and newspapers from 1882 to 1944;
• Presenting and disseminating the results of the project and supporting digitisation competence in Bulgaria.
Activities carried out for the project so far
A large number of digital images (up to 5 000) have to be selected for the project, after which they go through the ground truth (GT) process. This process consists of creating metadata for each image. The metadata describes all the symbols, the positions of the segments of the image, and so on.
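The metadata described above can be pictured as one record per image, holding the segments, their positions and the correct text. A minimal sketch under stated assumptions: the class names `Region` and `GroundTruthPage` and all field names are hypothetical, and they do not reproduce the PAGE XML schema or any IMPACT tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Region:
    """One segment of a page image (hypothetical structure)."""
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int
    text: str    # correct transcription of the segment


@dataclass
class GroundTruthPage:
    """Ground-truth metadata for a single digital image."""
    image_file: str
    regions: list[Region] = field(default_factory=list)

    def full_text(self) -> str:
        """Join the transcriptions, ordered top to bottom, then left to right."""
        ordered = sorted(self.regions, key=lambda r: (r.y, r.x))
        return "\n".join(r.text for r in ordered)
```

Sorting by position is a simplification: multi-column newspaper pages need an explicit reading order, which is one reason the ground truth is prepared and corrected by hand.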
For the first project stage, about 3 700 digital images (serials and 2 collections), taken by camera, were selected. They were then OCR-tested by our project partners from the Bulgarian Academy of Sciences. The results were good, but there were several problems with some of the symbols. Because of that we had to make a new selection of scanned documents and new texts.
For the second project stage, about 3 000 images (serials only) with a resolution of over 300 dpi were selected. They were scanned with the new scanners of the Digital Centre of the National Library and then OCR-tested by our project partners from the Bulgarian Academy of Sciences. The results were very successful; there were only a few remarks concerning stains, scribbles, drawings and similar marks on some of the pages, which impeded recognition.
We are now waiting for a specialised company to start processing the images. This company will create the PAGE XML data, which will later be corrected.
The additional work of the National Library, related to the future development of the IMPACT project, involved selecting many old printed books, scanning parts of them and providing the scans to our partners from the Bulgarian Academy of Sciences so that the obsolete symbols can be recognised.
The Demonstration Day of the IMPACT project will be held on October 13 at the National Library
The final conference of the IMPACT project will be held on 24 and 25 October 2011 at the British Library
IMPACT Newsletter, edition 9, May 2011
IMPACT Newsletter, edition 10, July 2011