Société de gestion d’informations et de documentation

Conversion of PDF documents into XML format

Project examples

Many documents are currently available in PDF format only. To access their content systematically, these documents have to be converted into another format. In this project, the automated format conversion had to be followed by manual post-processing to ensure that links and references, as well as separators and other structural elements of the text, were correct.

Project duration: 9 months

Today, most official documents are published and made available for download as PDF files, including all EU laws and directives (see http://eur-lex.europa.eu). To work with these documents systematically and comprehensively, conversion into XML format is advisable. In a first step, so-called parsers create an XML file based on the formatting in the PDF file. For most applications, however, this raw file is unusable as it stands, since neither the allocation of the individual XML tags nor their contents yields an immediately consistent result.

Intellectual (manual) post-processing is therefore indispensable if specific demands are made on searchability and cross-references. It may comprise the following points:

  • Standardization of separator characters (e.g. normal blank spaces vs. non-breaking spaces)
  • Correct insertion of links with the appropriate attributes (e.g. link to Directive …, effective as of …)
  • Standardization of referenced documents (e.g. internal and external references) and of nomenclature (e.g. Dir for “directive”)
  • Integration of notes, footnotes, lists, appendices, etc., at the respective position in the text, in order to improve readability in electronic media
  • Consolidation of numbers (e.g. decimal points), special characters, and diagrams
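
Parts of the points above can be prepared automatically before an editor reviews the text. The following Python sketch is only an illustration, not the routine used in the project: the function names, the choice of normal spaces as the target separator, the canonical form "Dir", and the rule that a comma between two digits is a decimal separator are all assumptions made for this example.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def normalize_spaces(text: str) -> str:
    """Replace non-breaking spaces with normal spaces and collapse runs.

    Assumption: normal spaces are the target form; a project may choose
    the opposite convention.
    """
    text = text.replace(NBSP, " ")
    return re.sub(r"[ \t]+", " ", text).strip()


def normalize_nomenclature(text: str, canonical: str = "Dir") -> str:
    """Map 'Directive', 'directive' and 'Dir.' to one canonical form.

    Only matches that are followed by a number are rewritten, so the
    ordinary word 'directive' in running text is left alone.
    """
    pattern = r"\b(?:Directive|directive|Dir\.?)(?=\s+\d)"
    return re.sub(pattern, canonical, text)


def normalize_decimals(text: str) -> str:
    """Turn a comma between two digits into a decimal point (3,5 -> 3.5).

    Assumption: such commas are never thousands separators or list
    separators; real data needs an editor's check here.
    """
    return re.sub(r"(?<=\d),(?=\d)", ".", text)


def normalize(text: str) -> str:
    """Apply the three hypothetical normalization steps in order."""
    text = normalize_spaces(text)
    text = normalize_nomenclature(text)
    return normalize_decimals(text)


if __name__ == "__main__":
    raw = "Dir.\u00a095/46/EC sets a rate of 3,5\u00a0%"
    print(normalize(raw))
```

Rules of this kind only reduce the amount of routine work. Ambiguous cases, such as a comma that separates two article numbers, still have to be resolved by an editor.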

Once this post-processing and the subsequent validation against a DTD (Document Type Definition) are complete, a high-quality data stock is available. The high formal and content-related consistency of the data is the precondition for its further use on electronic platforms and in various applications.
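
To give an idea of what such a validation step checks, the following Python sketch tests which elements may appear inside which others. It is a simplified stand-in and not DTD validation: Python's standard library does not validate against a DTD, so the rules are written here as a plain dictionary. The element names (`document`, `article`, `para`, `link`, and so on) and the function `check_structure` are hypothetical and do not come from the project's actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> allowed child element names.
# A real DTD also constrains order, cardinality, and attributes.
ALLOWED_CHILDREN = {
    "document": {"title", "article"},
    "title": set(),
    "article": {"para", "note"},
    "para": {"link"},
    "note": {"para"},
    "link": set(),
}


def check_structure(xml_text: str) -> list:
    """Return a list of structural problems; an empty list means none found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []
    if root.tag not in ALLOWED_CHILDREN:
        problems.append(f"unknown root element <{root.tag}>")

    # Walk the tree and compare each child against its parent's rule.
    stack = [root]
    while stack:
        parent = stack.pop()
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                problems.append(
                    f"<{child.tag}> not allowed inside <{parent.tag}>"
                )
            stack.append(child)
    return problems


if __name__ == "__main__":
    sample = "<document><para>misplaced paragraph</para></document>"
    for problem in check_structure(sample):
        print(problem)
```

In practice a validating parser would be run against the real DTD; the sketch only shows the kind of inconsistency that validation brings to light after conversion.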

All required steps can be carried out either remotely, i.e. in the client system, or in GIMD’s ARTIS database. In the latter case, the ARTIS software supports our editors with needs-based checking routines, keyboard shortcuts, automated data import and export, allocation of work packages, and much more.