What Is Document Sanitization? (And Why Saving As PDF Isn't Enough)
Document sanitization is the systematic removal of hidden data from files before sharing. Saving as PDF does not do it. Here's what actually works.
What document sanitization means
Document sanitization is the process of identifying and removing hidden data from a file before sharing it with someone outside your organization. Hidden data includes metadata (author names, timestamps, GPS coordinates), revision history (tracked changes, deleted content), embedded objects (images with their own metadata, OLE objects), and structural artifacts (hidden layers, form field data, script actions).
Sanitization is not the same as redaction. Redaction removes specific visible content — a social security number, a patient name, a classified paragraph. Sanitization removes the invisible data that surrounds the visible content. A properly shared document should have both: redaction of sensitive visible content, and sanitization of hidden structural content.
Why "Save As PDF" does not sanitize
The most common assumption about document cleaning is that saving a Word document as a PDF removes hidden data. This is wrong in several specific ways.
PDFs inherit source metadata
When you export a Word document to PDF, the PDF file inherits metadata from the Word source. The Author field from Word becomes the Author field in the PDF. The Title property carries over. The "Creator" field in the PDF records which software generated it (e.g., "Microsoft Word 16.0"), and the "Producer" field records which PDF engine rendered it.
A PDF exported from a Word document is not a fresh file — it is a conversion that preserves the original's metadata fingerprint.
Tracked changes may survive
If you export a Word document to PDF while tracked changes are visible (i.e., displayed as markup), the tracked changes appear in the PDF as visible content. But even if you accept all changes before exporting, the PDF may contain artifacts. Some PDF export tools create text annotations or comment objects from change-tracking data.
Embedded images retain EXIF
A Word document containing a photo with GPS coordinates in its EXIF data can produce a PDF in which that photo still carries those coordinates. When the exporter embeds the original JPEG data as an image object without re-encoding it, the EXIF block is copied along with the pixels. Format conversion alone does not guarantee that an image's internal metadata is touched.
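One way to check whether an image still carries EXIF is to walk the JPEG segment markers and look for an APP1 segment that starts with the "Exif" tag. The sketch below is a minimal illustration, assuming you already have the raw JPEG bytes (for example, a file pulled from a DOCX's word/media/ folder). The function name jpeg_has_exif is hypothetical, and the parser ignores rarer cases such as fill bytes between segments.

```python
import struct

def jpeg_has_exif(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment tagged 'Exif'."""
    if data[:2] != b"\xff\xd8":            # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:              # not at a marker: give up (simplification)
            return False
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):         # SOS or EOI: no more header segments
            return False
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return True
        pos += 2 + length                  # length counts its own two bytes
    return False
```

A result of True only says that an EXIF block is present. Deciding whether it contains GPS tags would require parsing the TIFF structure inside the segment.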
Comments and annotations
Resolved comments in Word may not appear in the exported PDF, but open comments and annotations are typically preserved. Some users assume that every comment disappears during export; that assumption is not safe.
What a proper sanitization pipeline looks like
Effective document sanitization follows a four-step process: parse, scan, transform, and verify.
Step 1: Parse
Read the file at the structural level. For a PDF, this means parsing the object tree: every dictionary, stream, and cross-reference entry. For a Word document (DOCX), this means unzipping the file and reading each XML part: word/document.xml, docProps/core.xml, docProps/app.xml, and every relationship and embedded object. For images, this means reading the binary segments of the file and extracting all EXIF, IPTC, and XMP tag blocks.
Structural parsing reveals data that surface-level inspection misses. Document Inspector in Word performs a partial structural parse, but it does not cover all XML parts and does not handle embedded image EXIF.
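For DOCX, the parse step can be sketched with Python's standard library alone, because a DOCX file is a ZIP package of XML parts. The function name parse_docx and the returned dictionary shape are assumptions made for illustration, not a description of any particular tool.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def parse_docx(data: bytes) -> dict:
    """Map every part in a DOCX package to a parsed XML root, or raw bytes for binary parts."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in package.namelist():
            raw = package.read(name)
            if name.endswith((".xml", ".rels")):
                parts[name] = ET.fromstring(raw)   # XML part: keep the element tree
            else:
                parts[name] = raw                  # e.g. word/media/image1.jpg
    return parts
```

Listing the keys of the result shows every part that later rules need to cover, including media files that never appear in the application's own properties dialog.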
Step 2: Scan
Analyze every parsed element against a set of rules. Each rule looks for a specific type of hidden data:
- Metadata rules check for author names, company names, timestamps, template paths, software identifiers
- Revision rules check for tracked changes, revision marks, comment threads (including resolved comments)
- Embedded object rules check for images with EXIF data, OLE objects, attached files
- Structural rules check for hidden layers (in PDFs), hidden sheets (in Excel), hidden slides (in PowerPoint), JavaScript actions, form field data
The scan produces a report of findings — every piece of hidden data identified, categorized by type and severity.
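A rule set like the one described above could be modeled as a list of small functions, each yielding findings for one kind of hidden data. The sketch below is illustrative only: the Finding fields, the severity labels, and the two sample rules are assumptions. It takes a dictionary of part name to raw XML bytes and checks for an author name in docProps/core.xml and for tracked insertions and deletions (w:ins, w:del) in word/document.xml.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"

@dataclass
class Finding:
    category: str   # "metadata", "revision", ...
    severity: str   # illustrative labels, not a standard scale
    part: str
    detail: str

def rule_author(parts):
    raw = parts.get("docProps/core.xml")
    if raw is None:
        return
    creator = ET.fromstring(raw).find(f"{{{DC}}}creator")
    if creator is not None and (creator.text or "").strip():
        yield Finding("metadata", "medium", "docProps/core.xml",
                      f"author: {creator.text.strip()}")

def rule_tracked_changes(parts):
    raw = parts.get("word/document.xml")
    if raw is None:
        return
    root = ET.fromstring(raw)
    for tag in ("ins", "del"):
        count = sum(1 for _ in root.iter(f"{{{W}}}{tag}"))
        if count:
            yield Finding("revision", "high", "word/document.xml",
                          f"{count} w:{tag} element(s)")

RULES = [rule_author, rule_tracked_changes]

def scan(parts):
    """Run every rule and collect the findings into one report."""
    return [finding for rule in RULES for finding in rule(parts)]
```

Keeping each rule independent means a new check can be added to RULES without touching the others, and the same list can be reused unchanged for the verification pass.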
Step 3: Transform
Remove or replace each finding. Metadata fields are cleared or set to neutral values. Tracked changes are purged from the XML (not just accepted in the UI). Embedded image EXIF is stripped. Hidden layers are flattened or removed. Comments are deleted. Form field data is cleared.
The key property of a good transform is that it modifies only the metadata and hidden data — the visible content of the document remains unchanged. You should be able to read the sanitized document and see exactly the same text, images, and formatting as the original.
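As a small example of a transform with that property, the sketch below blanks the author and last-modified-by fields in a DOCX's docProps/core.xml and copies every other part through byte for byte. The function name clear_core_properties and the choice of fields are assumptions for illustration. A real transform would cover many more fields and parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE = "docProps/core.xml"
DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DCTERMS = "http://purl.org/dc/terms/"
FIELDS = [f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy"]

# Keep the usual prefixes on output; core.xml attribute values such as
# xsi:type="dcterms:W3CDTF" refer to the dcterms prefix by name.
ET.register_namespace("cp", CP)
ET.register_namespace("dcterms", DCTERMS)

def clear_core_properties(data: bytes) -> bytes:
    """Return a copy of the DOCX with selected core properties emptied."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            raw = src.read(info.filename)
            if info.filename == CORE:
                root = ET.fromstring(raw)
                for tag in FIELDS:
                    for element in root.iter(tag):
                        element.text = ""          # clear the value, keep the element
                raw = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(info.filename, raw)       # all other parts pass through unchanged
    return out.getvalue()
```

Because only one part is rewritten, comparing the remaining parts of the input and output packages gives a quick check that visible content was not altered.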
Step 4: Verify
Re-scan the output file using the same rules that were applied in Step 2. If the verification scan finds zero issues, the sanitization is confirmed. If it finds remaining issues, the process needs another pass or manual review.
This verification step is what separates sanitization from hope. Without it, you are trusting that the cleaning worked — but you have no evidence. With it, you have a machine-verifiable confirmation that the output is clean.
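The scan, transform, and re-scan loop can be sketched as a short driver function. The names here (sanitize_and_verify, max_passes) are hypothetical, and the scan and transform callables stand in for whatever rules and cleaners are actually in use.

```python
def sanitize_and_verify(data: bytes, scan, transform, max_passes: int = 3):
    """Clean `data` until a re-scan finds nothing, or until max_passes is reached.

    scan(data) returns a list of findings; transform(data, findings) returns
    cleaned bytes. The result is (output, remaining_findings); an empty list
    of remaining findings means the verification scan came back clean.
    """
    findings = scan(data)
    passes = 0
    while findings and passes < max_passes:
        data = transform(data, findings)
        findings = scan(data)      # verify with the same rules used to scan
        passes += 1
    return data, findings
```

Returning the remaining findings rather than a bare success flag leaves a record of what still needs manual review when the passes run out.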
What manual approaches miss
Document Inspector limitations
Word's Document Inspector can find and remove comments, revisions, document properties, and some hidden content. However, it has known limitations:
- It does not remove EXIF data from embedded images
- It may not catch all custom XML parts
- It does not handle headers and footers that reference removed content
- It reports removal but does not verify it with a re-scan
- It is specific to Office documents and does not handle PDFs or images
"Properties > Remove Personal Information"
Windows offers a "Remove Properties and Personal Information" option in the file properties dialog. This removes a subset of metadata fields from supported file types but does not handle tracked changes, comments, embedded objects, or structural data. It also does not verify removal.
Online converters
Converting a document using an online tool (Word to PDF, image format conversion) may or may not strip metadata. Each converter behaves differently. Some preserve all metadata. Some strip some fields but not others. None provide verification. And uploading a sensitive document to a third-party converter creates its own privacy risk.
When sanitization matters most
Before sending externally
Any document leaving your organization should be sanitized. This is the minimum standard for professional document hygiene.
Before publishing
Documents published on websites, in regulatory filings, or in court records become permanently public. Once a document is published, its metadata cannot be recalled: copies may already be downloaded, cached, or archived. The consequences of publishing a document with sensitive metadata are permanent.
Before legal disclosure
Documents produced in discovery or shared with opposing counsel are examined for metadata — both as potential evidence and as potential intelligence. Legal documents require the most thorough sanitization.
Before sharing templates
Document templates propagate their metadata to every file created from them. A contaminated template creates a recurring metadata leak. Clean templates before adding them to shared libraries.
Purgit implements the full scan-clean-verify pipeline for PDFs, Word documents, and images. Upload a file to see what hidden data it contains — and clean it in one step.