How Journalists Use Document Metadata — And How to Protect Yourself | Purgit Blog

Metadata analysis is standard investigative practice

Investigative journalists are trained to examine every aspect of a document they receive — not just the text on the page, but the technical data embedded in the file. Metadata analysis is taught in journalism schools, covered in investigative reporting manuals, and used routinely by newsrooms worldwide.

This is a legitimate and important journalistic technique. It helps verify document authenticity, establish timelines, and corroborate sources. But it also means that any document shared externally — with the media, with regulators, with opposing parties, or with the public — will be examined at the metadata level by anyone with the motivation and the (freely available) tools to do so.

What journalists extract and how

Author and contributor identification

One of the first things an investigative journalist does with a leaked document is check the Author and Last Modified By fields. These fields can identify who created or last edited the document, which can narrow down the source of a leak or help confirm the document's authenticity.
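
As a rough illustration of how little effort this takes: a .docx file is a ZIP archive, and the author fields live in a small XML part named docProps/core.xml. The sketch below reads them with Python's standard library only. The function name read_core_properties is a hypothetical helper, not part of any tool mentioned in this article, and the sketch assumes a well-formed package.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in OOXML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_core_properties(docx_file):
    """Return author-related fields from a .docx (path or file-like object).

    Hypothetical helper for illustration; missing fields come back as None.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text_of(tag):
        node = root.find(tag, NS)
        return node.text if node is not None else None

    return {
        "author": text_of("dc:creator"),
        "last_modified_by": text_of("cp:lastModifiedBy"),
        "created": text_of("dcterms:created"),
        "modified": text_of("dcterms:modified"),
    }
```

Calling read_core_properties("report.docx") on a typical Word file returns a dictionary with the same names a journalist would see in the document properties dialog.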

In the Scott Ritter case, metadata in documents was used to establish authorship and editing patterns that were relevant to the investigation. Document properties provided evidence about who had handled the files and when.

Timeline reconstruction

Document timestamps — creation date, last modification date, last printed date — allow journalists to reconstruct when a document was prepared. If a company claims a policy was created in January but the document metadata shows a creation date of June, the timeline discrepancy becomes part of the story.

Journalists also examine the gap between creation and modification dates. A document created on March 1 and last modified on March 1 suggests a single drafting session. A document created on March 1 and modified on November 15 suggests ongoing revision — which may contradict claims about when a decision was finalized.
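
The comparison described above is simple arithmetic on two timestamps. Here is a minimal sketch, assuming the dates are stored in the ISO-style form that Word uses in its core properties (for example 2024-03-01T09:00:00Z). The function names and the wording of the returned messages are illustrative assumptions.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a timestamp such as 2024-03-01T09:00:00Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def editing_window(created, modified):
    """Return the timedelta between creation and last modification."""
    return parse_timestamp(modified) - parse_timestamp(created)


def describe(created, modified):
    """Summarize the gap in plain language (illustrative wording only)."""
    gap = editing_window(created, modified)
    if gap.total_seconds() < 0:
        return "modified before created: a clock error or tampering is possible"
    if gap.days == 0:
        return "edited within 24 hours: consistent with a single drafting session"
    return f"revised over {gap.days} days"
```

Timestamps are only as reliable as the clock on the machine that wrote them, so a surprising gap is a lead to check, not proof on its own.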

Software and version analysis

The Creator and Producer fields in PDFs, and the Application field in Word documents, reveal which software was used. This can verify or undermine claims about document origin. If a government agency claims a document was produced by their standard systems but the Creator field shows a different software suite, that inconsistency is newsworthy.
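
For Word documents, the Application field sits in a second XML part, docProps/app.xml, next to related fields such as the application version and the template name. The sketch below reads them with the standard library. read_app_properties is a hypothetical helper name, and which fields are actually present varies by the software that wrote the file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def read_app_properties(docx_file):
    """Return software-related fields from a .docx; absent fields are None."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/app.xml"))
    wanted = ("Application", "AppVersion", "Template", "Company")
    return {name: root.findtext(EP + name) for name in wanted}
```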

Printer tracking dots

This is not metadata in the traditional sense, but it is a related form of hidden data. Most color laser printers embed nearly invisible yellow dots on every page they print. These dots encode the printer's serial number and the date and time of printing. The Electronic Frontier Foundation documented this practice extensively.

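In 2017, Reality Winner, a contractor working at an NSA facility, was arrested for leaking a classified document. Scans of the document published online carried printer tracking dots, and security researchers showed that the dots encoded the printer's serial number and the date and time of printing. The FBI affidavit did not cite the dots: it said the pages appeared creased, suggesting they had been printed and carried out, and that an internal audit found six people had printed the document, Winner among them.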

This case demonstrated that even printing a document and sharing the physical paper does not eliminate technical tracing. The digital and physical worlds both contain tracking mechanisms.

EXIF data in leaked photos

Photos shared with journalists carry EXIF data that can identify the device used, the location where the photo was taken, and the exact timestamp. For whistleblowers sharing photos of workplace conditions, safety violations, or other evidence, EXIF data can expose the source.
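
In a JPEG file, EXIF data is stored in an APP1 segment near the start of the file, which makes it easy to detect before sharing a photo. The following is a minimal sketch using only the standard library. It locates the segment but does not decode individual tags such as GPS coordinates, and find_exif_segment is a hypothetical name.

```python
import struct


def find_exif_segment(jpeg_bytes):
    """Return the raw EXIF payload from a JPEG, or None if there is none.

    Walks the marker segments that precede the image data and looks for
    an APP1 segment (marker 0xFFE1) that begins with the "Exif" header.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # not at a marker; stop rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):
            break  # start of scan or end of image: no more header segments
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return payload[6:]
        pos += 2 + length
    return None
```

A result other than None means the photo still carries an EXIF block and should be cleaned before it is shared.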

Journalists have an ethical obligation to protect sources, and responsible newsrooms strip EXIF data from published photos. But the journalist still sees the original EXIF during their investigation — and not all publications are equally careful about source protection.

Tools journalists use

These are all freely available, open-source tools. Any motivated individual can use them.

exiftool

ExifTool is among the most comprehensive metadata extraction tools available. Running exiftool document.pdf or exiftool photo.jpg prints the metadata fields it can read from the file. It supports hundreds of file formats and reads metadata standards including EXIF, IPTC, XMP, the PDF info dictionary, and OOXML properties.
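
For batch work, exiftool can emit JSON with its -json option, which is easier to process than the default text listing. The sketch below wraps that call from Python. It assumes exiftool is installed and on the PATH, and the helper names are hypothetical.

```python
import json
import subprocess


def exiftool_command(path):
    """Build the command line; -json asks exiftool for JSON output."""
    return ["exiftool", "-json", path]


def parse_exiftool_json(output):
    """exiftool prints a JSON array with one object per input file."""
    records = json.loads(output)
    return records[0] if records else {}


def read_metadata(path):
    """Run exiftool on one file and return its metadata as a dict."""
    result = subprocess.run(
        exiftool_command(path), capture_output=True, text=True, check=True
    )
    return parse_exiftool_json(result.stdout)
```

The field names in the returned dictionary depend on the file type, so code that consumes it should treat every key as optional.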

pdfinfo and pdftotext

Part of the Poppler utilities, pdfinfo extracts PDF metadata including title, author, creator, producer, creation date, and modification date. pdftotext can extract text that is present in the PDF data layer but not visible on the page — which is how "redacted" text is recovered from improperly redacted PDFs.
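
pdfinfo prints one "Key: value" pair per line, so its output is straightforward to turn into a dictionary. This is a minimal sketch that assumes pdfinfo is installed. The helper names are hypothetical, and the parser simply splits each line at the first colon.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key: value" lines into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split at the first colon only, so timestamps keep their colons.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def read_pdfinfo(path):
    """Run pdfinfo on one file and return the parsed fields."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```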

mat2 (Metadata Anonymisation Toolkit)

A tool specifically designed to remove metadata from files. Journalists use it to clean documents before publication, but its existence also demonstrates how well-understood the metadata extraction/removal workflow is.

Hex editors

For deep analysis, journalists and forensic analysts use hex editors to examine the raw binary content of files. This can reveal metadata that higher-level tools miss, including fragments of deleted metadata, embedded thumbnails, and non-standard fields.
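
What a hex editor shows is just the file's bytes, printed as offsets, hexadecimal values, and any readable characters. A few lines of Python can produce the same kind of view, sketched here with a hypothetical hex_dump function. Names and leftover strings often stand out in the right-hand column.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx xx xx ..." text
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)
```

For example, print(hex_dump(open("photo.jpg", "rb").read(256))) shows the first 256 bytes of a file, which is where JPEG metadata segments usually begin.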

What this means for organizations

Every external document is a potential source

If your organization is involved in any matter of public interest — regulatory compliance, environmental impact, labor practices, financial reporting — documents you share externally may end up in a journalist's hands. Metadata in those documents will be extracted and examined.

Metadata contradictions become stories

A common investigative pattern: the journalist receives a document, extracts the metadata, and finds that it contradicts the official narrative. The creation date does not match the claimed timeline. The author is someone who was not supposed to be involved. The template path reveals a connection to another entity. These contradictions are stories in themselves.

Source protection works both ways

Journalists protect their sources, but the metadata in leaked documents can identify sources before the journalist even publishes. If an employee leaks a document and the Author field or Last Modified By field identifies them, the organization can trace the leak internally without the journalist's involvement.

How to protect your documents

Clean before sharing externally

Every document that leaves your organization should be sanitized. This is not about hiding wrongdoing — it is about controlling what information you share and ensuring you are not inadvertently leaking organizational details, employee identities, or timeline information that you did not intend to disclose.

Be aware of printer tracking

If you print sensitive documents, be aware that color laser printers embed tracking dots. Black-and-white printers generally do not. Photocopying a printed page may preserve or obscure the dots depending on the copier.

Clean at the structural level

Surface-level tools like Document Inspector catch common metadata fields but miss deeper artifacts. Structural-level scanning — reading the document's internal format (XML, PDF objects, EXIF binary) — provides comprehensive coverage. Verification by re-scanning the output confirms that cleaning was successful.
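
The clean-then-verify loop can be sketched for one narrow case: blanking the author fields in a .docx file's docProps/core.xml and re-scanning the result. This is an illustration under stated assumptions, not a description of how Purgit or any other product works. The function names and the list of sensitive tags are hypothetical, and a real cleaner would also need to handle app.xml, comments, tracked changes, and embedded images.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Keep the original prefixes when re-serializing; values such as
# xsi:type="dcterms:W3CDTF" rely on the dcterms prefix being declared.
ET.register_namespace("cp", CP)
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", "http://purl.org/dc/terms/")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")

# Illustrative selection only, not a complete list of identifying fields.
SENSITIVE_TAGS = {
    f"{{{DC}}}creator",
    f"{{{CP}}}lastModifiedBy",
    f"{{{CP}}}lastPrinted",
}


def scan_core(docx_bytes):
    """Return {tag: text} for sensitive fields that still carry a value."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        if CORE_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PART))
    return {el.tag: el.text for el in root if el.tag in SENSITIVE_TAGS and el.text}


def clean_core(docx_bytes):
    """Copy the package, blanking sensitive fields in docProps/core.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for el in root:
                    if el.tag in SENSITIVE_TAGS:
                        el.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)
    return out.getvalue()


def clean_and_verify(docx_bytes):
    """Clean, then re-scan the output and fail loudly if anything is left."""
    cleaned = clean_core(docx_bytes)
    leftovers = scan_core(cleaned)
    if leftovers:
        raise RuntimeError(f"cleaning incomplete: {sorted(leftovers)}")
    return cleaned
```

The design point is the last function: the output is scanned with the same code that found the problem, so a cleaning step that silently does nothing is caught before the file leaves the building.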

Assume competence on the receiving end

The tools described in this article are free, widely available, and easy to use. Assume that any document you share will be examined at the metadata level by someone who knows what they are looking for.

Purgit scans documents and images for the same metadata that investigative journalists extract — author names, timestamps, GPS coordinates, software identifiers, device serial numbers. Clean your files before they leave your organization.
