File Conversion for Knowledge Graphs: Turning Documents into Structured Data

Knowledge graphs have moved from academic curiosities to core components of search engines, recommendation systems, and enterprise data platforms. Their power lies in representing entities, relationships, and attributes in a machine‑readable, linked format—usually RDF (Resource Description Framework) or JSON‑LD. Yet most of the information that fuels a knowledge graph lives in unstructured or semi‑structured files: PDFs of research papers, Word contracts, Excel inventories, and legacy archives. Converting those files into structured triples without losing meaning, provenance, or legal compliance is a non‑trivial engineering problem.

This article walks through a complete, production‑ready workflow for turning everyday office documents into knowledge‑graph‑ready data. We cover the why, the preparation, the actual conversion techniques, validation, privacy safeguards, and finally how to ingest the output into a graph store. The guidance is deliberately platform‑agnostic, but we reference convertise.app as a convenient, privacy‑first tool for the initial format‑to‑format step when needed.


Why File Conversion Matters for Knowledge Graph Construction

A knowledge graph is only as good as the data it ingests. When the source material is a messy PDF, a scanned image, or a spreadsheet riddled with merged cells, the downstream extraction process either fails or produces noisy triples that degrade query precision. Proper file conversion serves two critical purposes:

  1. Normalization of Input – Converting PDFs to searchable, text‑rich formats (e.g., PDF/A → plain text or HTML) eliminates OCR bottlenecks. Similarly, turning legacy Office binary files (.doc, .xls) into the open‑XML variants (.docx, .xlsx) ensures that parsers can reliably locate headings, tables, and metadata.
  2. Preservation of Contextual Metadata – Conversion tools that retain author, creation date, version, and even custom properties allow the resulting RDF to carry provenance information automatically. In a knowledge graph, provenance is a first‑class citizen; it enables trust scoring, audit trails, and compliance with regulations like GDPR.

When conversion is performed with precision, the downstream semantic extraction stage can concentrate on what the data says rather than how to read it.


Understanding the Semantic Targets: RDF, JSON‑LD, and CSV

Before starting a conversion campaign, define the target serialization format. Each has strengths:

  • RDF/Turtle – Ideal for complex vocabularies, custom ontologies, and when you need explicit subject‑predicate‑object triples. It is the lingua franca of SPARQL queries.
  • JSON‑LD – A JSON‑compatible representation that embeds linked‑data context directly. It is developer‑friendly, works well with web APIs, and is increasingly supported by search engines for rich snippets.
  • CSV – When the knowledge graph will be built from tabular data (e.g., product catalogs), a well‑structured CSV can be directly mapped to RDF using tools like OpenRefine or the W3C's CSV on the Web specification.

The choice dictates the conversion path. For instance, a PDF containing a table of chemical compounds may be best rendered as CSV first, then mapped to RDF. A contract in Word that mentions parties, dates, and obligations benefits from a direct RDF or JSON‑LD output, preserving nested clauses as separate entities.
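
To make the trade‑off concrete, here is a minimal sketch (Python with rdflib 6+; the product IRI is illustrative) showing that the same in‑memory graph can be serialized as either Turtle or JSON‑LD, so the target format need not lock in the pipeline:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SCHEMA = Namespace("https://schema.org/")
g = Graph()
g.bind("schema", SCHEMA)

item = URIRef("https://example.org/product/42")   # illustrative IRI
g.add((item, RDF.type, SCHEMA.Product))
g.add((item, SCHEMA.name, Literal("Acme Widget")))
g.add((item, SCHEMA.price, Literal("19.99", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))    # RDF/Turtle for SPARQL-centric stores
print(g.serialize(format="json-ld"))   # JSON-LD for web APIs and rich snippets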


Preparing Source Files for Semantic Extraction

Raw files often hide obstacles that manifest as extraction errors. A disciplined preparation phase pays dividends.

  1. Detect Encoding Early – Text files may be UTF‑8, UTF‑16, or legacy Windows-1252. Use a tool (e.g., chardet in Python) to identify the encoding and re‑encode to UTF‑8 before any conversion. This prevents garbled characters in RDF literals.
  2. Normalize Line Endings – Mixes of CR, LF, and CRLF break parsers that rely on line‑by‑line processing, especially when generating CSV. Convert all to LF (\n) using dos2unix or similar utilities.
  3. Separate Embedded Media – PDFs often embed images that contain critical data (charts, signatures). Extract those images first (using pdfimages or a cloud service) and treat them as separate assets that can be linked via foaf:Image or schema:ImageObject in the graph.
  4. Flatten Complex Layouts – Tables that span multiple pages, merged cells, or nested lists need to be flattened. Tools like Tabula for PDFs or pandoc for Word can export tables into CSV while preserving column headers.
  5. Validate Licenses and Permissions – Ensure you have the right to repurpose the content. When dealing with third‑party documents, store the original license URL in a dcterms:license triple attached to the source entity.

Once these pre‑flight steps are complete, the file is ready for deterministic conversion.
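
A minimal sketch of steps 1 and 2 in Python, using chardet for detection (the file path is illustrative):

import chardet

def preflight(path: str) -> None:
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)        # e.g. {'encoding': 'Windows-1252', ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # CRLF/CR -> LF
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

preflight("documents/inventory.txt")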


Converting Documents to Structured Formats

Below we outline concrete conversion pipelines for the three most common source families.

1. PDF → Text/HTML → RDF or JSON‑LD

  • Step 1 – Text Extraction: Use a PDF‑to‑HTML converter that preserves the visual hierarchy (headings, lists, tables). Open‑source pdf2htmlEX does this while keeping CSS classes that map to logical structure.
  • Step 2 – Semantic Annotation: Apply a rule‑based engine (e.g., Apache Tika combined with custom regex patterns) to tag headings as section boundaries of a schema:Article, tables as schema:Table, and inline citations as schema:CreativeWork references.
  • Step 3 – RDF Generation: Feed the annotated HTML into a transformation engine such as XSLT or a Python script that walks the DOM, mints an identifier for each section (a fragment URI, or a blank node such as _:section1), and emits triples; a sketch follows this list. A typical triple for a table row might be:
:compound123 a chem:Compound ;
    chem:hasName "Acetaminophen" ;
    chem:hasMolecularWeight "151.16"^^xsd:float ;
    dcterms:source <file:///documents/report.pdf#page12> .
  • Step 4 – JSON‑LD Packaging: If the downstream consumer prefers JSON‑LD, serialize the same RDF graph using a compact context that maps chem: prefixes to a publicly shared ontology.
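
A sketch of Step 3, assuming a two‑column compound table in the converted HTML; the chem: namespace IRI, the compound URI pattern, and the file names are illustrative (requires beautifulsoup4 and rdflib):

from bs4 import BeautifulSoup
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

CHEM = Namespace("https://example.org/chem#")      # hypothetical ontology IRI
DCT = Namespace("http://purl.org/dc/terms/")
g = Graph()
g.bind("chem", CHEM)
g.bind("dcterms", DCT)

soup = BeautifulSoup(open("report.html", encoding="utf-8"), "html.parser")
for i, row in enumerate(soup.select("table tr")[1:], start=1):   # skip header
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 2:
        continue                                   # skip malformed rows
    compound = URIRef(f"https://example.org/compound/{i}")
    g.add((compound, RDF.type, CHEM.Compound))
    g.add((compound, CHEM.hasName, Literal(cells[0])))
    g.add((compound, CHEM.hasMolecularWeight, Literal(cells[1], datatype=XSD.float)))
    g.add((compound, DCT.source, URIRef("file:///documents/report.pdf#page12")))

print(g.serialize(format="turtle"))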

2. Word (.docx) → Structured XML → RDF/JSON‑LD

  • Step 1 – OOXML Extraction: A .docx file is a ZIP archive whose main content lives in word/document.xml. Unzip the archive and parse that file with an XML library (a sketch follows this list). Word's built‑in style hierarchy (Heading1, Heading2) maps cleanly to knowledge‑graph sections.
  • Step 2 – Table Normalization: Extract <w:tbl> elements, convert them to CSV rows, then feed the CSV into a mapping script that creates schema:Product or schema:Event entities based on column headers.
  • Step 3 – Preserve Custom Properties: Word documents often store custom metadata in docProps/custom.xml. Capture each <property> element and add a corresponding dcterms:description or a domain‑specific predicate.
  • Step 4 – RDF Emission: Use a templating system like Jinja2 to transform the XML tree into Turtle. Each paragraph becomes its own node carrying a schema:text literal (schema.org defines no dedicated paragraph class, so reuse a document ontology or mint a custom term); headings gain schema:headline.
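
A sketch of Step 1 using only the Python standard library; element names follow the WordprocessingML namespace, and the file name is illustrative:

import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

with zipfile.ZipFile("contract.docx") as docx:
    root = ET.fromstring(docx.read("word/document.xml"))

for para in root.iter(f"{W}p"):
    style = para.find(f"{W}pPr/{W}pStyle")
    style_name = style.get(f"{W}val") if style is not None else "Normal"
    text = "".join(t.text or "" for t in para.iter(f"{W}t"))
    if text:
        print(style_name, "|", text)    # e.g. "Heading1 | Definitions"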

3. Spreadsheet (XLSX/CSV) → CSV → RDF via Mapping Files

  • Step 1 – Unified CSV Export: For XLSX, use xlsx2csv or the pandas library to flatten each sheet into a separate CSV, ensuring that cell types (date, number) are converted to ISO‑8601 strings or typed XSD values.
  • Step 2 – Mapping Specification – Write a mapping file that declares how each column maps to RDF predicates. This can be a standard RML mapping (or its YAML form, YARRRML), or a simple custom spec like the following:
mapping:
  - source: product_id
    predicate: schema:productID
  - source: price_usd
    predicate: schema:price
    datatype: xsd:decimal
  - source: release_date
    predicate: schema:datePublished
    datatype: xsd:date
  • Step 3 – Transformation Engine – Run a standard RML/YARRRML mapping with an RML processor (e.g., rmlmapper-java); a custom spec like the one above needs only a small translation script (see the sketch after this list). Either way, the result is a stream of Turtle triples ready for ingestion.
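
A sketch of that translation script, applying the YAML mapping above with pandas and rdflib; the subject URI pattern and the file names are assumptions:

import pandas as pd
import yaml
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

SCHEMA = Namespace("https://schema.org/")
DATATYPES = {"xsd:decimal": XSD.decimal, "xsd:date": XSD.date}

rules = yaml.safe_load(open("mapping.yaml"))["mapping"]
df = pd.read_csv("products.csv", dtype=str)

g = Graph()
g.bind("schema", SCHEMA)
for _, row in df.iterrows():
    # Assumed subject pattern: one product entity per product_id value.
    subject = URIRef(f"https://example.org/product/{row['product_id']}")
    for rule in rules:
        predicate = SCHEMA[rule["predicate"].split(":", 1)[1]]
        datatype = DATATYPES.get(rule.get("datatype"))
        g.add((subject, predicate, Literal(row[rule["source"]], datatype=datatype)))

g.serialize("products.ttl", format="turtle")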

Preserving Context, Ontology Alignment, and URIs

A conversion that yields syntactically correct RDF but semantically ambiguous triples is of limited use. Follow these practices to keep meaning intact:

  • Stable URIs – Derive identifiers from immutable source attributes (e.g., a DOI, ISBN, or a combination of document hash + section number); a minimal minting sketch follows this list. Avoid using volatile filenames that might change on a later sync.
  • Ontology Reuse – Before inventing new predicates, search existing vocabularies (Schema.org, FOAF, DC, or domain‑specific ontologies like bio:Gene). Reusing established terms improves interoperability and reduces downstream mapping effort.
  • Link Back to Source – Always attach a dcterms:source triple that points to the original file or the specific page/section. This link is invaluable for auditors and for users who need to verify the provenance of a statement.
  • Version Annotation – When the source document is under version control, include a schema:version triple referencing the Git commit hash or the document's revision number.
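
A minimal sketch of hash‑based URI minting (the namespace is illustrative):

import hashlib

def section_uri(path: str, section: int) -> str:
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()[:16]
    return f"https://example.org/doc/{digest}/section/{section}"

print(section_uri("documents/report.pdf", 3))
# -> https://example.org/doc/<hash>/section/3  (stable even if the file is renamed)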

Handling Large Corpora: Batch Conversion Strategies

Enterprise environments may need to process thousands of PDFs and spreadsheets each night. Scaling the conversion pipeline requires careful orchestration:

  1. Chunking – Break the workload into batches of 500–1,000 files. Use a message queue (RabbitMQ, AWS SQS) to dispatch conversion jobs to worker nodes.
  2. Stateless Workers – Each worker should pull a file from storage (e.g., S3), perform the conversion using a containerized toolchain (pandoc, pdf2htmlEX, custom scripts), and push the resulting RDF to a triple store endpoint.
  3. Idempotency – Design the job so that re‑running it on the same file produces identical RDF. Store a hash of the source file and the generated graph; if the hash matches a previous run, skip re‑ingestion (see the sketch after this list).
  4. Monitoring and Retries – Track conversion success rates with Prometheus metrics. Failed jobs should be retried with exponential back‑off, and persistent failures logged for manual review.
  5. Leveraging convertise.app – For occasional one‑off conversions, especially for formats not natively supported by your toolchain (e.g., converting old CorelDRAW files to SVG), convertise.app provides a quick, privacy‑focused bridge without custom code.
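
A sketch of the idempotency check from item 3; convert and ingest are placeholders for your real toolchain, and the seen‑hash set would live in a persistent store in production:

import hashlib

def convert(path: str) -> str:
    """Placeholder for the real toolchain (pandoc, pdf2htmlEX, custom scripts)."""
    return f'<urn:doc> <urn:convertedFrom> "{path}" .'

def ingest(turtle: str) -> None:
    """Placeholder for a SPARQL UPDATE or bulk-load call."""
    print(turtle)

seen: set[str] = set()    # in production: a persistent store such as Redis

def handle(path: str) -> None:
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    if digest in seen:
        return            # identical input yields identical RDF; skip re-ingestion
    ingest(convert(path))
    seen.add(digest)

handle("documents/report.pdf")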

Quality Assurance: Validation, SHACL, and Automated Tests

After conversion, validate both syntactic and semantic correctness:

  • Syntax Check – Run the RDF through a parser (e.g., rapper from the Redland library) to catch malformed Turtle or JSON‑LD.
  • Shape Constraints (SHACL) – Define SHACL shapes that capture the expected structure of your graph. For a product catalog, a shape may require schema:price to be a decimal, schema:productID to be a non‑empty string, and schema:availability to be one of a controlled vocabulary (a runnable sketch follows this list).
  • SPARQL Conformance Tests – Write SPARQL ASK queries that verify critical triples exist (e.g., every schema:Person must have a schema:name). Automate these queries as part of your CI pipeline.
  • Round‑Trip Tests – Convert the RDF back to a human‑readable format (e.g., CSV) and compare with the original source using diff tools. Small differences often highlight lost whitespace or rounding errors in numeric fields.
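
A runnable sketch of the product‑catalog price constraint using pyshacl; the shape and data are inline for brevity, and the product IRI is illustrative:

from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

schema:ProductShape a sh:NodeShape ;
    sh:targetClass schema:Product ;
    sh:property [ sh:path schema:price ;
                  sh:datatype xsd:decimal ;
                  sh:minCount 1 ] .
""", format="turtle")

data = Graph().parse(data="""
@prefix schema: <https://schema.org/> .
<https://example.org/product/42> a schema:Product ;
    schema:price "19.99" .    # plain string, not xsd:decimal: should fail
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)    # False
print(report)      # human-readable violation report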

Privacy, Licensing, and Ethical Considerations

When converting files that contain personal data, you must address GDPR, CCPA, or other jurisdictional rules.

  • Data Minimization – Extract only the fields required for the knowledge graph. If a PDF contains a full address but the graph only needs city and country, discard the street-level data before generating triples.
  • Pseudonymization – Replace direct identifiers (email, phone) with hashed versions using a salt stored separately (see the sketch after this list). Keep a mapping file in a secure vault for audit purposes.
  • License Propagation – Include a dcterms:license triple that references the original document's license URL. If the source is under a Creative Commons license, propagate that information to every derived entity.
  • Retention Policies – Decide how long the converted RDF should be retained. Implement automated expiry based on the age of the source document, especially for sensitive contracts.
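
A sketch of the pseudonymization step using HMAC‑SHA‑256 with a secret salt; the salt below is a placeholder and must be loaded from a vault, never hard‑coded:

import hashlib
import hmac

SALT = b"load-me-from-a-vault"    # placeholder: never ship the real salt in code

def pseudonymize(identifier: str) -> str:
    digest = hmac.new(SALT, identifier.strip().lower().encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"urn:pseudo:{digest[:24]}"

print(pseudonymize("alice@example.com"))
# Stable per input (so joins still work) but unlinkable without the salt.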

Ingesting the Converted Data into a Knowledge Graph Store

Once you have clean RDF, the final step is loading it into a graph database. The process differs slightly between native triple stores (Blazegraph, GraphDB) and property‑graph systems (e.g., Neo4j with the neosemantics RDF plugin).

  1. Bulk Load – Most stores accept a bulk INSERT DATA operation or a bulk loader that reads Turtle/NT files directly. Partition the data into logical named graphs (e.g., graph:finance, graph:research) to support fine‑grained access control.
  2. Streaming Ingestion – For continuous pipelines, issue SPARQL 1.1 UPDATE requests with INSERT DATA statements as each batch finishes, or POST batches over the Graph Store HTTP Protocol (see the sketch after this list). Kafka connectors exist for many stores, allowing you to stream triples in real time.
  3. Indexing – Enable full‑text indexes on literals you expect to search (titles, abstracts). Some stores also provide geo‑indexes for schema:geo predicates, which is useful when your source files contain addresses.
  4. Query Validation – After load, run a suite of benchmark queries that reflect production use cases (e.g., “Find all contracts signed after 2020 where the counter‑party is a listed company”). Verify response times and result completeness.
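
A sketch of item 2 over the SPARQL 1.1 Graph Store HTTP Protocol; the endpoint URL follows GraphDB's layout, the graph IRI and file name are illustrative, and authentication depends on your store:

import requests

ENDPOINT = "http://localhost:7200/repositories/kg/rdf-graphs/service"  # GraphDB-style

with open("batch.ttl", "rb") as f:
    resp = requests.post(
        ENDPOINT,
        params={"graph": "http://example.org/graph/finance"},
        data=f,
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()    # any non-2xx status raises here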

Real‑World Walkthrough: Turning an Annual Report into a Knowledge Graph

Scenario: A financial analyst wants to query every instance of “net profit” across the past ten years of a corporation’s annual reports, which are published as PDFs.

  1. Collect PDFs – Store the PDFs in an S3 bucket, keyed by year.
  2. Pre‑flight – Run pdfinfo to confirm that each file is PDF/A‑1b (archival). Use pdf2htmlEX to convert each PDF to HTML, preserving headings.
  3. Extract Tables – Scan the converted HTML for tables that mention “Profit” (pdf2htmlEX keeps table structure addressable via CSS classes), then export each matching table from the source PDF to CSV via tabula-java.
  4. Map to RDF – Write an RML mapping that creates a schema:FinancialStatement entity per year, and for each row, generates schema:Revenue, schema:NetProfit, and schema:OperatingExpense triples (illustrative extension terms; schema.org itself does not define these financial properties), casting numeric values to xsd:decimal.
  5. Add Provenance – Attach prov:wasGeneratedBy linking to a prov:Activity that records the conversion script version and the S3 URI of the source PDF.
  6. Validate – Execute a SHACL shape that enforces schema:NetProfit to be present for every schema:FinancialStatement. Any missing value triggers a log entry for manual review.
  7. Ingest – Load the Turtle into GraphDB under the named graph graph:annual_reports. Create a full‑text index on schema:financialMetric literals.
  8. Query – Run the SPARQL query:
SELECT ?year ?netProfit WHERE {
  GRAPH <graph:annual_reports> {
    ?stmt a schema:FinancialStatement ;
          schema:year ?year ;
          schema:NetProfit ?netProfit .
  }
}
ORDER BY ?year

The analyst now receives a clean, sortable list of net profit figures without manually opening each PDF.


Best‑Practice Checklist for File‑to‑Graph Conversion

  • Identify Target Serialization (RDF/Turtle, JSON‑LD, CSV) before any conversion.
  • Normalize Encoding and Line Endings to avoid hidden character corruption.
  • Extract Embedded Media Separately and link them with proper predicates.
  • Use Open Formats for Intermediate Steps (e.g., HTML, CSV) to keep the pipeline transparent.
  • Preserve Original Metadata (author, creation date, license) as provenance triples.
  • Generate Stable, Namespace‑aware URIs based on immutable identifiers.
  • Reuse Established Ontologies instead of inventing new predicates.
  • Validate with SHACL and SPARQL ASK as part of an automated test suite.
  • Apply Data Minimization & Pseudonymization for personal data.
  • Document Licensing on every generated entity.
  • Employ Batch Workers with Idempotent Jobs for large corpora.
  • Monitor Conversion Success Rates and retain logs for audit.
  • Leverage convertise.app for niche source‑format conversions that lack native tooling.

Conclusion

Converting everyday office files into knowledge‑graph‑ready data is a disciplined process that blends classic file‑format handling with semantic‑web best practices. By treating conversion as the first gate of a data‑quality pipeline—normalizing encodings, extracting structural cues, preserving provenance, and validating with SHACL—you turn noisy PDFs and spreadsheets into a clean, queryable graph.

The effort pays off: downstream analytics become faster, compliance auditors gain transparent provenance, and enterprises can reuse the same structured data across search, recommendation, and AI models. As the volume of unstructured documentation continues to grow, mastering file conversion for knowledge graphs will become an essential skill for data engineers, archivists, and anyone who wants to unlock the latent value hidden inside PDFs, Word docs, and Excel sheets.