Converting LaTeX Documents for Academic Publishing

LaTeX remains the de facto standard for scientific manuscripts, conference papers, and theses. Its strength lies in precise typesetting of mathematics, bibliographies, and complex structures. Yet publishers, institutional repositories, and readers often demand the same material in alternative formats—PDF/A for archiving, HTML for web‑based reading, or EPUB for e‑readers. The conversion step is fraught with hidden pitfalls: missing fonts, broken cross‑references, or altered spacing that compromise the scholarly record.

This article walks through a systematic workflow that keeps the authorial intent intact while producing distribution‑ready files. The focus is on practical decisions, tool selection, and verification methods that work for a single manuscript or a batch of submissions.


1. Understand the Target Formats and Their Constraints

Before running any conversion, define the exact output requirements. Different delivery channels impose distinct technical constraints:

  • PDF/A‑1b – an ISO standard (ISO 19005‑1) for long‑term preservation. It forbids encryption, requires embedded fonts, and disallows unreferenced color spaces.
  • PDF/UA – a PDF variant that meets accessibility norms (proper tags, reading order, alt‑text for images).
  • HTML5 – ideal for web portals; requires semantic markup, responsive images, and MathML or fallback images for equations.
  • EPUB 3 – the e‑book format that supports reflowable text, embedded fonts, and MathML; suitable for tablets and e‑readers.

Each format dictates specific compilation flags or post‑processing steps. Mapping those constraints early saves time and avoids costly re‑work.


2. Choose a Robust LaTeX Engine

The engine you invoke determines how faithfully the source is rendered and which auxiliary files are produced.

  • pdfLaTeX – Strengths: direct PDF output, mature ecosystem, broad package support. Typical use cases: simple articles and conference submissions where PDF/A compliance can be added later.
  • XeLaTeX – Strengths: native Unicode handling, easy font selection via system fonts, good for multilingual texts. Typical use cases: documents with non‑Latin scripts or custom OpenType fonts.
  • LuaLaTeX – Strengths: extensible via Lua scripting, fine‑grained control of fonts and PDFs. Typical use cases: complex layouts, programmable bibliography styles, or tight control over PDF metadata.

For archival PDFs (PDF/A), pdfLaTeX combined with the pdfx package is a reliable baseline. For HTML or EPUB, you’ll later pass the LaTeX source through a conversion tool that expects a clean intermediate PDF or DVI.
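As a concrete starting point, a minimal PDF/A‑oriented document for pdfLaTeX might look like the following sketch (document class, body text, and metadata values are placeholders):

```latex
% manuscript.tex -- minimal PDF/A-1b sketch for pdfLaTeX
\documentclass{article}
\usepackage[a-1b]{pdfx}  % load early; injects XMP metadata and forces font embedding
\begin{document}
Body text goes here.
\end{document}
```

The pdfx package also reads document metadata from a companion \jobname.xmpdata file, e.g.:

```latex
% manuscript.xmpdata -- metadata consumed by pdfx
\Title{Example Manuscript}
\Author{A. Author}
```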


3. Prepare the Source for Conversion

3.1 Keep Packages Minimal and Well‑Documented

Redundant or obsolete packages increase the chance of compile errors when you switch engines. Audit \usepackage{} statements and remove any that are not essential to the final appearance.

3.2 Embed Fonts Explicitly

When the final PDF must embed every glyph, declare the font family using \setmainfont{} (XeLaTeX/LuaLaTeX) or the \pdfmapfile{} mechanism (pdfLaTeX). Verify that the chosen fonts are licensed for distribution; otherwise, the conversion will silently substitute defaults, breaking visual consistency.
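Under XeLaTeX or LuaLaTeX, explicit font declarations are a few lines of fontspec; the font names below are examples and must be installed (and licensed for embedding) on the build machine:

```latex
% preamble sketch for XeLaTeX/LuaLaTeX font selection
\usepackage{fontspec}
\setmainfont{TeX Gyre Termes}   % serif body font (example)
\setmonofont{TeX Gyre Cursor}   % monospace font for code listings (example)
```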

3.3 Use Standard Bibliography Tools

Maintain bibliography data in a single .bib file and rely on biblatex with biber for modern citation styles. This approach preserves citation keys across formats, making it easier to generate reference lists in HTML or EPUB.
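A typical biblatex setup is a short skeleton like this (the .bib file name, citation style, and citation key are illustrative):

```latex
% preamble
\usepackage[backend=biber,style=authoryear]{biblatex}
\addbibresource{references.bib}  % single shared .bib file

% in the body: \cite{knuth1984} and friends

% at the end of the document
\printbibliography
```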


4. Generating a High‑Quality PDF Baseline

A clean PDF is the cornerstone for most downstream conversions. Follow these steps:

  1. Compile at least twice to resolve cross‑references and the table of contents.
  2. Run biber (or bibtex if you stay with legacy styles) between compilations.
  3. Apply the pdfx package near the top of the preamble:
    \usepackage[a-1b]{pdfx}
    
    This injects the required PDF/A metadata and forces font embedding. (The x-1a option targets PDF/X‑1a, a print‑production standard; for archiving you want a-1b.) The package also expects document metadata in a \jobname.xmpdata file.
  4. Check the log for any Missing font warnings. If they appear, add the missing fonts to the map file or switch to XeLaTeX.

Use a PDF validator (e.g., veraPDF) to confirm PDF/A compliance before proceeding.
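Step 4 can be scripted: a quick grep over the compile log surfaces font problems before you hand the PDF to a validator. The log name and warning patterns below are assumptions about typical pdflatex/latexmk output:

```shell
# Scan the compile log for font problems (log name and pattern are assumptions
# about typical pdflatex/latexmk output).
log="manuscript.log"
if grep -qi "missing font" "$log" 2>/dev/null; then
  echo "font warnings found in $log - fix the map file or switch engines"
else
  echo "no font warnings in $log"
fi
```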


5. Converting PDF to HTML and EPUB

Two primary strategies exist:

5.1 Direct LaTeX‑to‑HTML/EPUB Tools

  • pandoc – a universal converter that reads LaTeX and emits HTML5 or EPUB. It handles citations, figures, and simple equations via MathJax.
  • latex2html – older, lighter, but struggles with modern packages and complex math.

Pandoc workflow:

pandoc manuscript.tex \
  --citeproc \
  --mathml \
  -s -o manuscript.html

pandoc manuscript.tex \
  --citeproc \
  --mathml \
  -s -o manuscript.epub

Key options:

  • --citeproc makes pandoc process the .bib file and render a bibliography.
  • --mathml converts equations to MathML, which modern browsers and EPUB 3 readers render natively.
  • -s produces a standalone document with a complete header; add --embed-resources if you also want images and CSS inlined into a single file.

Note that --pdf-engine only matters when pandoc's output format is PDF; it has no effect on HTML or EPUB output, so font choices for those formats are controlled by CSS instead.

5.2 PDF‑First Approach

If the PDF already meets the PDF/A and PDF/UA standards, you can extract its structure with pdf2htmlEX (for HTML) or Calibre (for EPUB). This method preserves the exact pagination and font rendering but may embed large raster images for equations.

Pros: Near‑identical visual fidelity.
Cons: Larger output size, limited accessibility because the underlying text is often represented as images.


6. Preserving Mathematics Across Formats

Equations are the most fragile element during conversion.

  • MathML – native support in modern browsers and EPUB 3. Pandoc can emit MathML via the --mathml flag.
  • LaTeXML – a dedicated LaTeX‑to‑XML pipeline that produces high‑quality MathML and XHTML.
  • Image fallback – for environments that cannot render MathML, pandoc's --webtex option replaces each formula with an <img> link to an external rendering service; pointing it at an SVG endpoint keeps the formulas scalable.

Note that pandoc's math rendering options (--mathml, --mathjax, --webtex, and friends) are mutually exclusive—the last one given wins—so you cannot combine MathML and an image fallback in a single run. Pick one per output:

pandoc manuscript.tex \
  --mathml \
  -s -o manuscript.html

For plain HTML aimed at a broad browser audience, --mathjax is the more common choice; for EPUB 3, --mathml is the standard route.


7. Managing Figures and External Media

Figures often come from separate PDF, PNG, or EPS sources. To ensure consistency:

  1. Embed figures as PDF when using pdfLaTeX. This keeps vector quality in the final PDF.
  2. Convert figures to SVG for HTML/EPUB. Tools like Inkscape preserve crispness and allow CSS styling (Inkscape 1.x: inkscape fig.pdf -o fig.svg; older 0.9x releases used the -l/--export-plain-svg syntax).
  3. Provide alt‑text in the LaTeX source using \caption[Alt text]{Full caption}. Pandoc extracts the optional argument for accessibility.
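The short‑caption/alt‑text convention from step 3 looks like this in practice (file name, caption text, and label are placeholders):

```latex
\begin{figure}
  \centering
  \includegraphics[width=0.8\linewidth]{pipeline.pdf}
  % the optional argument doubles as alt-text when pandoc converts to HTML/EPUB
  \caption[Diagram of the conversion pipeline]{Diagram of the conversion
    pipeline, from LaTeX source through PDF/A, HTML, and EPUB outputs.}
  \label{fig:pipeline}
\end{figure}
```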

Avoid large raster images unless the figure is inherently pixel‑based (e.g., microscopy photographs). For those, compress with optipng or jpegoptim before inclusion.


8. Validating the Output

8.1 PDF Validation

  • veraPDF – checks PDF/A compliance.
  • veraPDF (with its PDF/UA profile) or PAC (PDF Accessibility Checker) – verify accessibility tags.

Run both on the final PDF and fix any reported issues (missing alt‑text, untagged tables, etc.).

8.2 HTML Validation

  • W3C HTML validator – ensures syntactic correctness.
  • axe-core – scans for accessibility violations (missing ARIA labels, improper heading order).

8.3 EPUB Validation

  • epubcheck – the reference validator, originally developed under the International Digital Publishing Forum (IDPF) and now maintained by the W3C. It will flag missing metadata, invalid navigation files, or malformed MathML.

Automating these checks in a CI pipeline (e.g., GitHub Actions) guarantees that every new revision passes quality gates before release.
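As a sketch of such a gate, a GitHub Actions job might compile the PDF and validate the EPUB on every push. The workflow below is illustrative only: the community latex-action and the availability of pandoc and epubcheck on the runner are assumptions you must adapt to your setup:

```yaml
# .github/workflows/validate.yml -- illustrative quality gate
name: validate-manuscript
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Compile LaTeX to PDF
        uses: xu-cheng/latex-action@v3   # community action, not an official GitHub one
        with:
          root_file: manuscript.tex
      - name: Convert to EPUB
        run: pandoc manuscript.tex --citeproc --mathml -s -o manuscript.epub
      - name: Validate EPUB
        run: epubcheck manuscript.epub   # assumes epubcheck is installed on the runner
```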


9. Automating the Workflow for Multiple Manuscripts

Researchers often need to process dozens of theses or conference papers each year. A lightweight automation script can orchestrate the steps described above.

#!/usr/bin/env bash
set -euo pipefail

DOCS=("paper1" "paper2" "paper3")
for d in "${DOCS[@]}"; do
  (
    cd "$d"
    # 1. Build PDF/A (pdfx must be loaded in each document's preamble;
    #    latexmk has no dedicated PDF/A flag)
    latexmk -pdf -pdflatex='pdflatex -interaction=nonstopmode'
    # 2. Validate PDF/A
    verapdf "${d}.pdf"
    # 3. Convert to HTML & EPUB with pandoc
    pandoc "${d}.tex" --citeproc --mathml -s -o "${d}.html"
    pandoc "${d}.tex" --citeproc --mathml -s -o "${d}.epub"
    # 4. Validate HTML & EPUB
    html5validator "${d}.html"
    epubcheck "${d}.epub"
  )  # subshell keeps the cd local, even if a step fails
done

The script uses latexmk for incremental compilation and runs the three validators after each conversion. Adjust the DOCS array to match your directory layout.


10. When to Use an Online Conversion Service

A cloud‑based tool such as convertise.app can be handy for one‑off conversions, especially when you lack a full TeX installation on a workstation. The service processes LaTeX sources in a sandbox, returns PDF/A, HTML, or EPUB, and respects the same privacy principles outlined in its documentation. For sensitive research data, however, prefer a self‑hosted pipeline or run the conversion locally to keep the manuscript under your control.


11. Common Pitfalls and How to Avoid Them

  • Missing fonts in PDF/A – Symptom: text falls back to a generic Times look, or the validator warns about unembedded fonts. Remedy: embed fonts explicitly; use \setmainfont{} with XeLaTeX or the pdfx package with pdfLaTeX.
  • Broken citations after HTML export – Symptom: [?] placeholders in the final HTML. Remedy: ensure the bibliography file is reachable and use --citeproc (pandoc) or run biber before conversion.
  • Equations rendered as images only – Symptom: no selectable text, large file size. Remedy: enable MathML output (--mathml) or, where MathML is unsupported, switch to an SVG fallback (--webtex).
  • Unnamed figure captions – Symptom: alt‑text missing for screen readers. Remedy: supply an optional short caption (\caption[Alt]{Long}) that pandoc extracts.
  • Overly large EPUB files – Symptom: slow downloads, reader crashes. Remedy: optimize raster images (jpegoptim/optipng) and prefer vector SVG where possible.

By checking each of these items early, you prevent a cascade of re‑work later in the publication pipeline.


12. Integrating the Process into Institutional Repositories

Many universities run institutional repositories that ingest submissions in various formats. To streamline ingestion:

  1. Standardize on PDF/A‑1b as the archival master. Produce it directly from LaTeX as described in section 4.
  2. Generate HTML abstracts using the same LaTeX source; store them as separate metadata fields for search engine indexing.
  3. Offer EPUB as an auxiliary download for readers who prefer e‑readers; keep the file size under 5 MB by compressing images.
  4. Record the conversion provenance (engine version, package list, validator results) in the repository’s metadata schema. This satisfies audit requirements and aids future reproducibility.
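The provenance record in step 4 can be captured mechanically at conversion time. The snippet below records the version of whichever listed tools are installed; the tool list and output file name are illustrative:

```shell
# Record tool versions used for the conversion; missing tools are skipped.
for tool in pdflatex biber pandoc verapdf epubcheck; do
  if command -v "$tool" >/dev/null 2>&1; then
    printf '%s: %s\n' "$tool" "$("$tool" --version 2>/dev/null | head -n 1)"
  fi
done > provenance.txt
```

Storing provenance.txt alongside the archival PDF gives auditors a reproducible record of the toolchain.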

13. Summary

Converting LaTeX manuscripts into multiple delivery formats is not a simple "click‑and‑go" task. It demands a clear understanding of the target standards, deliberate preparation of the source, and rigorous validation of every output. By selecting the appropriate engine, embedding fonts, using a robust PDF/A workflow, and leveraging tools like pandoc, LaTeXML, and dedicated validators, authors can publish a single source that safely reaches traditional journals, web portals, and e‑readers alike. Automation scripts keep the process repeatable, while occasional use of privacy‑focused online services such as convertise.app can fill gaps without compromising data security. Implement these practices, and your scholarly work will retain its fidelity and accessibility across the whole digital lifecycle.