Structured content

Topic lead: Andy Byers, Alex Mendonça
Last updated: 05/06/2023

Structured content has become increasingly important in scholarly communication, as it provides a standard, machine-readable format for organising and exchanging information. XML is widely used by publishers: each element of an article is tagged according to a standard vocabulary.

Structured content is information or data that is organised in a predictable way. There are various formats for structured content, the most notable being XML (Extensible Markup Language), JSON (JavaScript Object Notation) and YAML (YAML Ain’t Markup Language).
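As a rough illustration of what "organised in a predictable way" means, the sketch below writes the same small record as JSON and as XML using only the Python standard library. The record, its field names and the `record` root element are invented for this example and do not follow any publishing standard.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: the field names are assumptions made for illustration.
record = {"title": "An Example Article", "year": "2023"}

# JSON: the record serialised as key/value pairs.
as_json = json.dumps(record)

# XML: the same record, with each field wrapped in a named element.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

In both cases a program can retrieve the title by name rather than guessing from its position or appearance on the page.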

XML is used widely in academic book and journal publishing. It makes scholarly content layout-independent, more flexible and reusable across a variety of output formats (e.g. PDF, HTML, EPUB). It also improves searchability, accessibility and preservation, and enables text mining. Humanities scholars have traditionally used TEI (Text Encoding Initiative) XML, while publishers increasingly use the NISO standard JATS (Journal Article Tag Suite) and its extension BITS (Book Interchange Tag Suite). XML can be introduced at different stages of the production process, giving XML-first, XML-middle and XML-last workflows.

Commercial publishers typically outsource the production of XML or use specialist software, which may not be an option for small-scale open access journals with limited funds. One open source option is Kotahi, which can export JATS XML, although installing it requires technical skills.

XML in journal production

The XML format uses machine-readable tagging. The most important feature of XML tags is that they are semantic: for example, an article’s title is tagged explicitly as the title, rather than simply as formatted text, as is typical in HTML. XML can also carry other metadata, including authors, funding information, the publication date and more.
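To make the idea of semantic tagging concrete, here is a minimal sketch in Python that reads the title, authors and publication year from a heavily simplified JATS-style fragment. The sample data and the `read_metadata` helper are invented for this illustration; a real JATS file contains many more elements and attributes, and this is not a complete or validated record.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-style fragment (invented sample data).
SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
    </article-meta>
  </front>
</article>"""


def read_metadata(xml_text):
    """Return title, author names and year from a JATS-style string."""
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")
    # Each value is located by the meaning of its tag, not by its appearance.
    title = meta.findtext("title-group/article-title")
    authors = [
        f"{c.findtext('name/given-names')} {c.findtext('name/surname')}"
        for c in meta.findall("contrib-group/contrib[@contrib-type='author']")
    ]
    year = meta.findtext("pub-date/year")
    return {"title": title, "authors": authors, "year": year}


metadata = read_metadata(SAMPLE)
print(metadata)
```

Because the title sits in its own `article-title` element, software can find it without guessing from font size or position, which is the practical benefit of semantic tags.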

The advantage of an XML-first approach is that, from the time of submission, manuscripts can be tagged and edited in a structured format, reducing the frustrations that may arise when annotating different versions of PDF or Word documents. Structured XML documents are also easier to analyse with artificial intelligence and automated checking tools, with potential time savings throughout the publication workflow.
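One simple kind of automated check is sketched below: a short Python script that reports which required elements are missing from a manuscript. The list of required paths is a hypothetical house rule invented for this example, not a requirement of JATS, and the draft is made-up sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rules: paths (relative to <article>) a journal might require.
REQUIRED_PATHS = {
    "article title": "front/article-meta/title-group/article-title",
    "at least one author": "front/article-meta/contrib-group/contrib",
    "abstract": "front/article-meta/abstract",
}


def missing_elements(xml_text):
    """Return the labels of required elements that are absent or empty."""
    root = ET.fromstring(xml_text)
    missing = []
    for label, path in REQUIRED_PATHS.items():
        element = root.find(path)
        # Missing means: not present, or present with no text anywhere inside.
        if element is None or not "".join(element.itertext()).strip():
            missing.append(label)
    return missing


# Invented draft that has a title and an author but no abstract.
DRAFT = """<article><front><article-meta>
  <title-group><article-title>Draft Title</article-title></title-group>
  <contrib-group>
    <contrib contrib-type="author"><name><surname>Doe</surname></name></contrib>
  </contrib-group>
</article-meta></front></article>"""

problems = missing_elements(DRAFT)
print(problems)
```

A check like this could run at submission, so that authors are told about a missing abstract before an editor has to open the file.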

Once an article is published, XML brings benefits for accessibility and discoverability. Structured documents can be easily indexed by search engines, and researchers can analyse them with automated tools. Structured documents are also well suited to screen readers, making content easier to access for blind or partially sighted users and those with reading disorders.

Disadvantages of XML

Using XML is more complex than ‘basic’ PDF or HTML publishing, and one of its primary disadvantages is the learning curve involved in implementing it effectively. This challenge particularly affects smaller publishers that have neither in-house technical resources nor the budget to outsource XML production. XML is also time-consuming to apply, which might further discourage new journals. As a result, many smaller publishers prefer to work with PDF or HTML formats. Should a journal choose this approach, it is essential to make articles as accessible as possible within the time and resources available.
