Converting print data into XML
A scientific reference work that existed only in print was converted to electronic form, so that its content is now searchable and reusable in many ways. A particular challenge was posed by diagrams at different levels of detail (similar to a map at different scales), in which data was embedded either visibly or invisibly.
Project duration: 6 months
Even today, many reference books and standard textbooks are produced primarily for print and are therefore available for secondary use in electronic media only to a limited extent. Search and navigation via cross-references in particular are heavily restricted. To enable both in electronic media as well, we developed, together with a leading academic publisher, a concept for transferring print data into an augmented XML format. In a first step, we created raw files using a parser, exploiting as much information from the typesetting data as possible, e.g. type size, font, or color, since these features carry a clearly defined meaning in the printed work. Even at this stage, it turned out that such features were often used ambiguously, e.g. italics both for emphasis and for the names of biological species. As a result, XML tags can only be assigned unambiguously by human editors with the requisite expert knowledge. In addition, the publisher decided to augment the text at the content level as well.
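The first step can be illustrated with a minimal sketch. It assumes the typesetting data has already been reduced to runs of (text, font style, type size); the tag names, the mapping tables, and the `build_raw_entry` helper are hypothetical and not taken from the project. Features with a single meaning are tagged directly, while ambiguous ones such as italics are only marked for later review by an editor.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from (font style, type size) to XML tags.
UNAMBIGUOUS = {
    ("bold", 12): "headword",
    ("regular", 9): "text",
}

# Features that carry more than one meaning in print, e.g. italics.
AMBIGUOUS = {
    ("italic", 9): ("emphasis", "species"),
}


def build_raw_entry(runs):
    """Turn typesetting runs into a raw XML entry.

    Ambiguous runs become <unresolved> elements that list the
    candidate tags, so that an editor can decide later.
    """
    entry = ET.Element("entry")
    for text, font, size in runs:
        key = (font, size)
        if key in UNAMBIGUOUS:
            element = ET.SubElement(entry, UNAMBIGUOUS[key])
        elif key in AMBIGUOUS:
            element = ET.SubElement(entry, "unresolved")
            element.set("candidates", "|".join(AMBIGUOUS[key]))
        else:
            # Unknown formatting is kept, together with its raw features.
            element = ET.SubElement(entry, "unknown")
            element.set("font", font)
            element.set("size", str(size))
        element.text = text
    return entry


runs = [
    ("Honey bee", "bold", 12),
    ("Apis mellifera", "italic", 9),
    ("is a social insect.", "regular", 9),
]
entry = build_raw_entry(runs)
xml_text = ET.tostring(entry, encoding="unicode")
```

In this sketch the parser never guesses: the italic run ends up as an `unresolved` element with both candidate tags, which mirrors the point that the final allocation requires expert judgment.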
Practical examples include:
- insertion of synonym relations (e.g. between term and abbreviation)
- insertion of additional information, definitions, etc.
- links between index and text
- semantic additions (e.g. distinguishing between the starting material and the product of a chemical reaction)
- additional search options through hidden synonyms and spelling variants
- integration and indexing of purely graphical elements
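One of the augmentations listed above, hidden synonyms for search, can be sketched as follows. The element names (`term`, `synonym`), the `display` attribute, and both helper functions are assumptions made for illustration; the actual XML format of the project is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry as it might look after the parsing step.
SOURCE = '<entry id="e1"><term>deoxyribonucleic acid</term></entry>'


def add_hidden_synonyms(entry, synonyms):
    """Attach synonyms that are searchable but not meant for display."""
    for synonym in synonyms:
        element = ET.SubElement(entry, "synonym", {"display": "none"})
        element.text = synonym


def build_search_index(entries):
    """Map every term and synonym (lower-cased) to the ids of its entries."""
    index = {}
    for entry in entries:
        for element in entry.iter():
            if element.tag in ("term", "synonym") and element.text:
                key = element.text.strip().lower()
                index.setdefault(key, set()).add(entry.get("id"))
    return index


entry = ET.fromstring(SOURCE)
add_hidden_synonyms(entry, ["DNA", "desoxyribonucleic acid"])
index = build_search_index([entry])
```

A search for the abbreviation or for the alternative spelling now leads to the same entry as the printed term, although neither synonym would appear in the rendered text.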
After this post-processing and the subsequent validation against a DTD (Document Type Definition), a high-quality data set is available. The high formal and content-related consistency of the data is the prerequisite for its further use on electronic platforms and in various applications.
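As an illustration of the kind of consistency such validation secures, the sketch below checks that every link target points to an existing entry, similar to what a DTD expresses with ID and IDREF attributes. It is not a DTD validation: Python's standard `xml.etree.ElementTree` does not validate against a DTD, so a validating parser would be needed for that. The document structure and the attribute names `id` and `target` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with entries and index items that link to them.
DOC = """<book>
  <entry id="e1"><term>enzyme</term></entry>
  <entry id="e2"><term>substrate</term></entry>
  <index>
    <item target="e1">enzyme</item>
    <item target="e9">catalyst</item>
  </index>
</book>"""


def find_dangling_links(root):
    """Return all link targets that do not match any id in the document."""
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    return [
        el.get("target")
        for el in root.iter()
        if el.get("target") and el.get("target") not in ids
    ]


root = ET.fromstring(DOC)
dangling = find_dangling_links(root)
```

Here the index item for "catalyst" refers to an entry that does not exist, so its target is reported and can be corrected before the data is delivered.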
All of these steps can be carried out either remotely in the client's system or in GIMD’s ARTIS database. The ARTIS software supports editors with needs-based checking routines, keyboard shortcuts, automated data import and export, allocation of work packages, and much more.