Guillaume Lazzara (EPITA 2008) is a Research Engineer at EPITA’s Research and Development Laboratory (LRDE), where he works on Scribo, a project in which the LRDE is a partner.
What is Scribo?
Scribo is a project that aims to provide free tools for semi-automatic and collaborative annotation of digital documents. The approach is based on extracting knowledge from text and images.
The project participants are laboratories specializing in the analysis of textual and graphical documents and in knowledge extraction (the LRDE and the ALPAGE team of INRIA); the knowledge engineering laboratory of CEA LIST; Nuxeo, specialized in content management for businesses; Proxem, a publisher of solutions for semantic processing of natural language; Tagmatica, specialized in parsing and ISO standards; Xwiki, a publisher of Web 2.0 collaborative solutions; and business-driven users such as Agence France-Presse (AFP) and Mandriva.
The project is funded by the state and by local governments in Paris under the 5th call for projects launched by the Business Competitiveness Fund (FCE). It was certified in November 2007 by the competitiveness cluster “Systematic” under its “Free Software” theme.
At what level is the LRDE involved?
An important part of the project, handled by the LRDE's partners, is to perform a semantic analysis of the text in order to extract relevant words or phrases that can be used for indexing.
As some documents may only be available as images, the text must first be extracted from them. This is the problem on which the LRDE, which specializes in image processing libraries, was able to contribute.
Over the last two years we have developed a document processing chain, that is to say, a set of tools that locates text, extracts it cleanly and passes it to optical character recognition (OCR) software. The tools also detect other page elements, such as dividers or photos.
Example of text box identification:
What is the use of Scribo?
At the end of the processing chain, we can reconstruct the document as HTML, PDF or even Open Document, while preserving its structure. Depending on the output format, the text that was embedded in the image becomes selectable and/or editable. The text can then be used to annotate the document or to perform any other task automatically.
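As a rough illustration of that last step, the sketch below rebuilds a simple HTML page from a list of detected regions. It is not the LRDE's code: the `Region` type, its `kind` labels and the `rebuild_html` function are hypothetical names chosen for the example, and it assumes the layout analysis and OCR have already produced the regions in reading order.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Region:
    """One page element found by the (assumed) layout analysis step."""
    kind: str     # hypothetical labels: "title", "paragraph" or "image"
    content: str  # recognized text, or a file path when kind == "image"


def rebuild_html(regions):
    """Emit the regions in order as HTML so the page structure is kept."""
    parts = ["<html><body>"]
    for region in regions:
        if region.kind == "title":
            parts.append(f"<h1>{escape(region.content)}</h1>")
        elif region.kind == "image":
            parts.append(f'<img src="{escape(region.content)}">')
        else:
            # Anything else is treated as ordinary, selectable text.
            parts.append(f"<p>{escape(region.content)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = [
    Region("title", "Scribo & co"),
    Region("paragraph", "Text recovered by OCR."),
    Region("image", "photo.png"),
]
print(rebuild_html(page))
```

Because the recognized text ends up in ordinary HTML elements rather than in pixels, it can be selected, edited or indexed like any other text.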
The digitization of paper documents is a very active field right now. It has been in the news recently with the scanning of books by Google. Under the Grand Emprunt (2010 National Loan Project), France wants to invest 58 million euros in calls for projects covering the scanning, archiving and development of works.
At the request of a partner, Agence France-Presse (AFP), we have reused our tools to provide a chain for detecting text in images. AFP receives over 10,000 photos per day, and less than 10% of them are annotated and therefore easy to find later! This chain aims to flag images that contain text to the people who annotate them manually.
The markets covered by Scribo are therefore numerous: intelligent monitoring in general or specialized areas (press, defense, seismic data, specific technologies, etc.), analysis and routing of incoming documents (letters, emails, etc.), or the semantic desktop.
Example of rebuilding a document: