Limited Corporation for Information Management and Documentation

Conversion from PDF into XML

Many documents today are made available in PDF format. Working systematically and comprehensively with the contents of these documents requires conversion into XML and, where necessary, intellectual post-processing.

Project duration: 9 months

Today, most official documents are published and made available for download as PDF files, including all EU laws and directives (see http://eur-lex.europa.eu). To work systematically and comprehensively with these documents, conversion into XML is advisable. In a first step, parsers create an XML file based on the formatting in the PDF file. For most applications, however, this raw file is unusable, since neither the assignment of the individual XML tags nor their contents yield an immediately consistent result. Intellectual post-processing, i.e. editorial work by a person, is therefore indispensable if specific demands are made on searchability and cross-references. It may comprise the following points:
  • Standardization of separator characters (e.g. regular spaces vs. non-breaking spaces)
  • Correct insertion of links with the appropriate attributes (e.g. link to Directive …, effective as of …)
  • Standardization of referenced documents (e.g. internal and external references) and of nomenclature (e.g. Dir for “directive”)
  • Integration of notes, footnotes, lists, appendices, etc., at the respective position in the text, in order to improve readability in electronic media
  • Consistent treatment of numbers (e.g. the decimal point), special characters, and diagrams
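
Some of the points above, such as separator characters, nomenclature, and decimal points, can be partly prepared by script before an editor reviews the result. The following is a minimal sketch only, not a description of GIMD's actual tooling. The element names, the sample text, the choice of "Directive" as the canonical form, and the assumption that the source uses decimal commas are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw parser output; element names and content are illustrative only.
RAW_XML = (
    "<doc><p>See Dir.\u00a095/46/EC and directive 2002/58/EC; "
    "the fee is 3,5\u00a0% of turnover.</p></doc>"
)

NBSP = "\u00a0"


def normalize_text(text):
    """Apply three illustrative normalization rules to a text fragment."""
    # Rule 1 (assumption): non-breaking spaces become regular spaces.
    text = text.replace(NBSP, " ")
    text = re.sub(r" {2,}", " ", text)
    # Rule 2 (assumption): "Dir", "Dir." and "directive" before a number
    # are unified to the canonical form "Directive".
    text = re.sub(r"\b(?:Dir\.?|directive)\s+(?=\d)", "Directive ", text,
                  flags=re.IGNORECASE)
    # Rule 3 (assumption): a comma between two digits is a decimal comma
    # and becomes a decimal point. Not safe for thousands separators.
    text = re.sub(r"(?<=\d),(?=\d)", ".", text)
    return text


def normalize_tree(root):
    """Normalize the text and tail of every element in the tree in place."""
    for element in root.iter():
        if element.text:
            element.text = normalize_text(element.text)
        if element.tail:
            element.tail = normalize_text(element.tail)
    return root


root = normalize_tree(ET.fromstring(RAW_XML))
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

Rules of this kind only cover the mechanical part of the work. Whether a comma really is a decimal separator, or which document a reference points to, still has to be decided by an editor.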
After this post-processing and the subsequent validation against a DTD (Document Type Definition), a high-quality data stock is available. The high formal and content-related consistency of the data is the precondition for its further use on electronic platforms and in various applications. All requisite steps can be carried out either remotely, i.e. in the client's system, or in GIMD’s ARTIS database. In the latter case, the ARTIS software supports our editors with needs-based checking routines, keyboard shortcuts, automated data import and export, allocation of work packages, and much more.
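
Validation against a DTD can be illustrated as follows. This is a sketch under stated assumptions: it uses the third-party lxml library, which the text does not mention, and the DTD, the element names, and the `target` attribute value are hypothetical examples, not the DTD used in the project.

```python
from io import StringIO

from lxml import etree  # third-party library, assumed to be installed

# Hypothetical DTD: a document consists of paragraphs, and every
# reference must carry a "target" attribute.
DTD_TEXT = """
<!ELEMENT doc (p+)>
<!ELEMENT p (#PCDATA | ref)*>
<!ELEMENT ref (#PCDATA)>
<!ATTLIST ref target CDATA #REQUIRED>
"""

dtd = etree.DTD(StringIO(DTD_TEXT))


def validate(xml_text):
    """Return a list of validation messages; an empty list means valid."""
    root = etree.fromstring(xml_text)
    if dtd.validate(root):
        return []
    return [f"line {entry.line}: {entry.message}"
            for entry in dtd.error_log.filter_from_errors()]


# Illustrative documents: the second one lacks the required attribute.
GOOD = '<doc><p>See <ref target="dir-95-46-ec">Directive 95/46/EC</ref>.</p></doc>'
BAD = "<doc><p>See <ref>Directive 95/46/EC</ref>.</p></doc>"

print(validate(GOOD))
for message in validate(BAD):
    print(message)
```

A DTD check of this kind confirms formal consistency only, for example that every reference has a link target. It cannot confirm that the target is the right document.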