What Is Document Sanitization? (And Why Saving As PDF Isn't Enough)

What document sanitization means

Document sanitization is the process of identifying and removing hidden data from a file before sharing it with someone outside your organization. Hidden data includes metadata (author names, timestamps, GPS coordinates), revision history (tracked changes, deleted content), embedded objects (images with their own metadata, OLE objects), and structural artifacts (hidden layers, form field data, script actions).

Sanitization is not the same as redaction. Redaction removes specific visible content — a social security number, a patient name, a classified paragraph. Sanitization removes the invisible data that surrounds the visible content. A properly shared document should have both: redaction of sensitive visible content, and sanitization of hidden structural content.

Why "Save As PDF" does not sanitize

The most common assumption about document cleaning is that saving a Word document as a PDF removes hidden data. This is wrong in several specific ways.

PDFs inherit source metadata

When you export a Word document to PDF, the PDF file inherits metadata from the Word source. The Author field from Word becomes the Author field in the PDF. The Title property carries over. The "Creator" field in the PDF records which software generated it (e.g., "Microsoft Word 16.0"), and the "Producer" field records which PDF engine rendered it.
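A quick way to see this for yourself is to look for the information dictionary keys in the raw bytes of an exported PDF. The sketch below is a rough first check, not a PDF parser: it assumes the fields are stored as plain literal strings in an uncompressed object, and it will miss hex or UTF-16 strings, fields inside compressed object streams, and XMP metadata. The function name find_info_fields is made up for this example.

```python
import re

# Keys from the PDF document information dictionary that commonly leak identity.
INFO_KEYS = (b"Author", b"Title", b"Creator", b"Producer")

# Matches "/Key (literal string)" and tolerates backslash-escaped characters.
# Assumption: no nested parentheses inside the value.
_INFO_PATTERN = re.compile(
    rb"/(" + b"|".join(INFO_KEYS) + rb")\s*\(((?:\\.|[^\\()])*)\)"
)


def find_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return Info-style fields found as plain literal strings in the raw bytes."""
    found: dict[str, str] = {}
    for key, value in _INFO_PATTERN.findall(pdf_bytes):
        # latin-1 is only an approximation of the PDF string encodings.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found
```

Treat any hit as a lead rather than proof. Outline (bookmark) entries also use a /Title key, for example, and an empty result does not mean the file is clean.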

A PDF exported from a Word document is not a fresh file — it is a conversion that preserves the original's metadata fingerprint.

Tracked changes may survive

If you export a Word document to PDF while tracked changes are displayed as markup, the tracked changes appear in the PDF as visible content. Accepting all changes before exporting removes the markup, but it is not a guarantee: some PDF export tools create text annotations or comment objects from change-tracking data, so the exported PDF still needs to be checked.

Embedded images retain EXIF

A Word document containing a photo with GPS coordinates in its EXIF data can produce a PDF where that photo still contains GPS coordinates. The photo is stored in the PDF as an image object, and when the exporter copies the original JPEG data through without re-encoding it, the EXIF block stays intact. Format conversion alone is not guaranteed to touch the image's internal metadata.
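To check whether a given JPEG still carries EXIF, you can walk its header segments and look for an APP1 segment that starts with the "Exif" identifier. The sketch below does only that. It assumes you already have the raw JPEG bytes (for example, the original photo, or an image stream extracted from the PDF by other means), and the function name jpeg_has_exif is hypothetical.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Return True if a JPEG byte string carries an APP1 segment tagged 'Exif'."""
    if not data.startswith(b"\xff\xd8"):       # SOI marker: not a JPEG otherwise
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:                  # lost sync with the segment stream
            return False
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):             # SOS or EOI: header segments are over
            return False
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

This only answers "is an EXIF block present?". Reading the GPS tags inside it would need a TIFF/EXIF parser, which is out of scope for this sketch.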

Comments and annotations

Resolved comments in Word may not appear in the exported PDF, but open comments and annotations are typically preserved. Some users assume that every comment disappears during export. That assumption is unsafe.
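Before exporting, it can help to confirm whether the source DOCX still holds any comments at all. A minimal sketch, assuming the comments live in the usual word/comments.xml part of the package; the function name count_docx_comments is made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_docx_comments(docx_bytes: bytes) -> int:
    """Count <w:comment> elements in word/comments.xml, if the part exists."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        if "word/comments.xml" not in package.namelist():
            return 0
        root = ET.fromstring(package.read("word/comments.xml"))
    return sum(1 for _ in root.iter(f"{{{W}}}comment"))
```

A count above zero means the comments are still in the file, whether or not they are visible on screen.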

What a proper sanitization pipeline looks like

Effective document sanitization follows a four-step process: parse, scan, transform, and verify.

Step 1: Parse

Read the file at the structural level. For a PDF, this means parsing the object tree: every dictionary, stream, and cross-reference entry. For a Word document (DOCX), this means unzipping the file and reading each XML part: word/document.xml, docProps/core.xml, docProps/app.xml, and every relationship and embedded object. For images, this means reading the binary header and extracting all EXIF, IPTC, and XMP tag blocks.

Structural parsing reveals data that surface-level inspection misses. Document Inspector in Word performs a partial structural parse, but it does not cover all XML parts and does not handle embedded image EXIF.
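For DOCX, the parse step can be sketched with nothing but the Python standard library, because a DOCX file is a ZIP package of XML parts. The example below lists every part and pulls a few fields from docProps/core.xml. The function name parse_docx and the shape of the returned dictionary are choices made for this illustration, not a standard API.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml (OPC core properties and Dublin Core).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def parse_docx(docx_bytes: bytes) -> dict:
    """List every part in the package and pull a few core properties."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        parts = package.namelist()
        core = {}
        if "docProps/core.xml" in parts:
            root = ET.fromstring(package.read("docProps/core.xml"))
            for label, path in [
                ("creator", "dc:creator"),
                ("last_modified_by", "cp:lastModifiedBy"),
                ("created", "dcterms:created"),
            ]:
                node = root.find(path, NS)
                if node is not None and node.text:
                    core[label] = node.text
    return {"parts": parts, "core": core}
```

The parts list matters as much as the properties: it is where embedded images (typically under word/media/), custom XML, and comment parts show up.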

Step 2: Scan

Analyze every parsed element against a set of rules. Each rule looks for a specific type of hidden data:

Metadata rules check for author names, company names, timestamps, template paths, software identifiers
Revision rules check for tracked changes, revision marks, comment threads (including resolved comments)
Embedded object rules check for images with EXIF data, OLE objects, attached files
Structural rules check for hidden layers (in PDFs), hidden sheets (in Excel), hidden slides (in PowerPoint), JavaScript actions, form field data

The scan produces a report of findings — every piece of hidden data identified, categorized by type and severity.
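The rule-based scan described above can be sketched as a list of small functions that each inspect the parsed parts and return findings. The Finding class, the two rules, and the severity labels below are illustrative assumptions; a real rule set would be much larger and would cover PDFs and images as well.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"


@dataclass(frozen=True)
class Finding:
    category: str   # e.g. "metadata" or "revision"
    severity: str   # illustrative scale: "low", "medium", "high"
    detail: str


def metadata_rule(parts: dict[str, bytes]) -> list[Finding]:
    """Flag a non-empty author in docProps/core.xml."""
    raw = parts.get("docProps/core.xml")
    if raw is None:
        return []
    creator = ET.fromstring(raw).find(f"{{{DC}}}creator")
    if creator is not None and creator.text:
        return [Finding("metadata", "medium", f"author: {creator.text}")]
    return []


def revision_rule(parts: dict[str, bytes]) -> list[Finding]:
    """Flag tracked insertions and deletions left in word/document.xml."""
    raw = parts.get("word/document.xml")
    if raw is None:
        return []
    root = ET.fromstring(raw)
    count = sum(1 for _ in root.iter(f"{{{W}}}ins")) + sum(
        1 for _ in root.iter(f"{{{W}}}del")
    )
    if count:
        return [Finding("revision", "high", f"{count} tracked change(s)")]
    return []


RULES = [metadata_rule, revision_rule]


def scan(parts: dict[str, bytes]) -> list[Finding]:
    """Run every rule over the parsed parts and collect the findings."""
    return [finding for rule in RULES for finding in rule(parts)]
```

Keeping each rule as a separate function makes it easy to add one rule per type of hidden data and to run the identical list again after cleaning.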

Step 3: Transform

Remove or replace each finding. Metadata fields are cleared or set to neutral values. Tracked changes are purged from the XML (not just accepted in the UI). Embedded image EXIF is stripped. Hidden layers are flattened or removed. Comments are deleted. Form field data is cleared.

The key property of a good transform is that it modifies only the metadata and hidden data — the visible content of the document remains unchanged. You should be able to read the sanitized document and see exactly the same text, images, and formatting as the original.
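As one narrow illustration of that property, the sketch below rewrites a DOCX package so that the author fields in docProps/core.xml are emptied while every other part is copied through byte for byte. It covers only two fields and is not a complete transform; the function name clear_core_properties is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Register the prefixes core.xml normally uses so they survive re-serialization.
# This matters because attribute values such as xsi:type refer to them by name.
for prefix, uri in {
    "cp": CP,
    "dc": DC,
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}.items():
    ET.register_namespace(prefix, uri)

FIELDS_TO_CLEAR = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def clear_core_properties(docx_bytes: bytes) -> bytes:
    """Return a copy of the package with author fields in core.xml emptied."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for tag in FIELDS_TO_CLEAR:
                    for node in root.iter(tag):
                        node.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other part, including word/document.xml, is left untouched.
            dst.writestr(item.filename, data)
    return out.getvalue()
```

Because the document body is never parsed or rewritten here, the visible text cannot change as a side effect of clearing the metadata.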

Step 4: Verify

Re-scan the output file using the same rules that were applied in Step 2. If the verification scan finds zero issues, the sanitization is confirmed. If it finds remaining issues, the process needs another pass or manual review.

This verification step is what separates sanitization from hope. Without it, you are trusting that the cleaning worked — but you have no evidence. With it, you have a machine-verifiable confirmation that the output is clean.
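The verify step amounts to a loop: transform, re-run the same scan, and stop only when the scan comes back empty or a pass limit is reached. A minimal sketch follows, with scan and transform passed in as functions; the name sanitize_and_verify and the max_passes default are assumptions made for this example.

```python
from typing import Callable, TypeVar

Doc = TypeVar("Doc")


def sanitize_and_verify(
    document: Doc,
    scan: Callable[[Doc], list[str]],
    transform: Callable[[Doc, list[str]], Doc],
    max_passes: int = 3,
) -> tuple[Doc, list[str]]:
    """Transform, then re-scan with the same rules until the scan is empty.

    Returns the final document plus any findings still present, so a
    non-empty list signals that manual review is needed.
    """
    findings = scan(document)
    passes = 0
    while findings and passes < max_passes:
        document = transform(document, findings)
        findings = scan(document)   # same scan function as before the transform
        passes += 1
    return document, findings
```

Returning the leftover findings instead of raising an error lets the caller decide whether to block the file, retry, or send it for manual review.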

What manual approaches miss

Document Inspector limitations

Word's Document Inspector can find and remove comments, revisions, document properties, and some hidden content. However, it has known limitations:

It does not remove EXIF data from embedded images
It may not catch all custom XML parts
It does not handle headers and footers that reference removed content
It reports removal but does not verify it with a re-scan
It is specific to Office documents and does not handle PDFs or images

"Properties > Remove Personal Information"

Windows offers a "Remove Properties and Personal Information" option in the file properties dialog. This removes a subset of metadata fields from supported file types but does not handle tracked changes, comments, embedded objects, or structural data. It also does not verify removal.

Online converters

Converting a document using an online tool (Word to PDF, image format conversion) may or may not strip metadata. Each converter behaves differently. Some preserve all metadata. Some strip some fields but not others. None provide verification. And uploading a sensitive document to a third-party converter creates its own privacy risk.

When sanitization matters most

Before sending externally

Any document leaving your organization should be sanitized. This is the minimum standard for professional document hygiene.

Before publishing

Documents published on websites, in regulatory filings, or in court records become permanently public. Once a document is published, its metadata cannot be recalled, so the consequences of publishing sensitive metadata are permanent.

Before legal disclosure

Documents produced in discovery or shared with opposing counsel are examined for metadata — both as potential evidence and as potential intelligence. Legal documents require the most thorough sanitization.

Before sharing templates

Document templates propagate their metadata to every file created from them. A contaminated template creates a recurring metadata leak. Clean templates before adding them to shared libraries.

Purgit implements the full scan-clean-verify pipeline for PDFs, Word documents, and images. Upload a file to see what hidden data it contains — and clean it in one step.
