Structured content is information or data that is organised in a predictable way. There are various formats for structured content; the most notable are XML (Extensible Markup Language), JSON (JavaScript Object Notation) and YAML (YAML Ain’t Markup Language).
XML is used widely in academic book and journal publishing. It makes scholarly content layout-independent, more flexible and reusable across a variety of formats (e.g. PDF, HTML, EPUB). It also offers improved searchability, accessibility and preservation, and enables text mining. Humanities scholars have traditionally used TEI (Text Encoding Initiative) XML, while publishers increasingly use the NISO standard JATS (Journal Article Tag Suite) and its extension BITS (Book Interchange Tag Suite). XML can be introduced at different stages of the production process, with XML-first, XML-last and XML-middle workflows.
Commercial publishers typically outsource the production of XML or use specialist software, neither of which may be an option for small-scale open access journals with limited funds. Kotahi is a notable open-source alternative that includes JATS XML export features, although installing it requires technical skills.
XML in journal production
The XML format uses machine-readable tagging. The most important feature of XML tags is that they are semantic: an article’s title, for example, is tagged explicitly as the title, rather than merely as text, as is typically done in HTML. XML can also capture other metadata, including authors, funding information, the publication date and more.
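To make the idea of semantic tagging concrete, the following is a minimal sketch in Python. It assumes a heavily simplified, JATS-style fragment that has not been validated against the JATS schema; the element names follow JATS, but the article, authors and funder are invented for illustration, and `extract_metadata` is a hypothetical helper rather than part of any publishing tool.

```python
import xml.etree.ElementTree as ET

# Simplified JATS-style front matter. All values are invented for illustration.
JATS_FRAGMENT = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Open Access Publishing for Small Journals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>Alex</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date>
        <day>05</day><month>03</month><year>2024</year>
      </pub-date>
      <funding-group>
        <award-group>
          <funding-source>Example Research Council</funding-source>
        </award-group>
      </funding-group>
    </article-meta>
  </front>
</article>
"""


def extract_metadata(xml_text: str) -> dict:
    """Read title, authors, date and funders directly from their tags."""
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")

    # Keep only contributors explicitly tagged as authors.
    authors = [
        f"{c.findtext('name/given-names')} {c.findtext('name/surname')}"
        for c in meta.findall("contrib-group/contrib")
        if c.get("contrib-type") == "author"
    ]

    date = meta.find("pub-date")
    return {
        "title": meta.findtext("title-group/article-title"),
        "authors": authors,
        "published": "-".join(date.findtext(p) for p in ("year", "month", "day")),
        "funders": [f.text for f in meta.iter("funding-source")],
    }


if __name__ == "__main__":
    for field, value in extract_metadata(JATS_FRAGMENT).items():
        print(f"{field}: {value}")
```

Because each piece of metadata sits inside a tag that names what it is, the script does not need to guess which line of text is the title or who the authors are.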
The advantage of an XML-first approach is that, from the time of submission, manuscripts can be tagged and edited in a structured format, reducing the frustrations that may arise when annotating different versions of PDF or Word documents. Structured XML documents are also easier to analyse with artificial intelligence and automated checking tools, with potential time savings throughout the publication workflow.
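As an illustration of what an automated check might look like, here is a minimal sketch in Python. The list of required elements is a hypothetical set of house rules, not a JATS requirement, and `missing_elements` is an invented helper; a real workflow would more likely validate against the official schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rules: element paths every submission must contain.
REQUIRED_PATHS = {
    "article title": "front/article-meta/title-group/article-title",
    "at least one contributor": "front/article-meta/contrib-group/contrib",
    "abstract": "front/article-meta/abstract",
}

# Invented submission that has a title and an author but no abstract.
SUBMISSION = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A Study of Small Journals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def missing_elements(xml_text: str) -> list[str]:
    """Return the labels of required elements that are absent or empty."""
    root = ET.fromstring(xml_text)
    problems = []
    for label, path in REQUIRED_PATHS.items():
        element = root.find(path)
        # An element with no text anywhere inside it counts as missing.
        if element is None or not "".join(element.itertext()).strip():
            problems.append(label)
    return problems


if __name__ == "__main__":
    print(missing_elements(SUBMISSION))
```

A check of this kind could run at submission, so that gaps are flagged to authors before editorial work begins.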
Later, once an article is published, XML brings benefits in terms of accessibility and discoverability. Structured documents can be easily indexed by search engines, and researchers can analyse them with automated tools. Structured documents are also well suited to screen readers, making content easier to access for blind or partially sighted users and for those with reading disorders.
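One reason structured documents work well with screen readers is that their sections and headings can be carried over into HTML headings and paragraphs. The sketch below, in Python, assumes a simplified JATS-style body made up only of `sec`, `title` and `p` elements; the sample text and the `body_to_html` helper are invented for illustration, and a production conversion would need to handle many more elements.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified JATS-style body. The content is invented for illustration.
BODY = """
<body>
  <sec>
    <title>Methods</title>
    <p>We surveyed 20 journals &amp; 3 publishers.</p>
  </sec>
  <sec>
    <title>Results</title>
    <p>Most journals published PDF only.</p>
  </sec>
</body>
"""


def body_to_html(xml_text: str) -> str:
    """Convert top-level sections into HTML sections with real headings."""
    root = ET.fromstring(xml_text)
    lines = []
    for sec in root.findall("sec"):
        lines.append("<section>")
        title = sec.findtext("title")
        if title:
            # A real heading element lets screen readers navigate by section.
            lines.append(f"  <h2>{escape(title)}</h2>")
        for paragraph in sec.findall("p"):
            text = "".join(paragraph.itertext())
            lines.append(f"  <p>{escape(text)}</p>")
        lines.append("</section>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(body_to_html(BODY))
```

The same source could be passed through other converters to produce PDF or EPUB, which is what makes the content layout-independent.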
Disadvantages of XML
Using XML is more complex than ‘basic’ PDF or HTML publishing, and one of its primary disadvantages is the learning curve involved in implementing it effectively. This challenge particularly affects smaller publishers who have neither in-house technical resources nor the budget to outsource XML production. The other disadvantage of XML is that it is time-consuming to apply, which might further discourage new journals. As a result, many smaller publishers prefer to work with PDF or HTML formats. Should a journal choose this approach, it is essential to make articles as accessible as possible within the time and resources available.
- XML. (n.d.). Focus Area News.
- JSON. (n.d.). Introducing JSON.
- YAML. (n.d.). YAML.
- The University of Edinburgh. (2019). Support for XML-based publishing in OJS.
- NISO. (2019). ANSI/NISO Z39.96-2019, JATS: Journal Article Tag Suite, version 1.2.
- JATS. (n.d.). Book Interchange Tag Set: JATS Extension. National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine.
- Kotahi. (n.d.). Features.
- Aries Systems. (2020, November 12). The Benefits of an XML-First Publishing Workflow.
- Access Computing. (n.d.). Is XML accessible? University of Washington.
- GitHub. (n.d.). Scholarly HTML.
This work is licensed under a Creative Commons Attribution 4.0 International License