The Hidden Data in Every Word Document You Send
You think you're sending a 12-page proposal. You're actually sending the proposal plus 20 invisible data fields about who wrote it, when, on what machine, and what the original draft said.
What you think you're sending
A 12-page proposal. Clean formatting. Your firm's logo in the header. Professional language. Ready for the client.
What you're actually sending
The 12-page proposal, plus approximately 20 invisible data fields recording:
- Who wrote it (full name, sometimes the Windows account username)
- Who last edited it (full name)
- What company created it (the organization name from the Office installation)
- What software was used (application name and exact version number)
- What template it was based on (often including the full file path on the author's computer)
- How many times it was revised (revision count)
- How long the author spent editing (total editing time, in minutes)
- When it was created and last modified (timestamps to the second)
- What was deleted during editing (if tracked changes are enabled)
- What comments were made during review (even if "resolved" in Word's interface)
- The previous version's content (in the revision history XML)
None of this is visible when reading the document. All of it is accessible to anyone who knows where to look — and increasingly, people know where to look.
The anatomy of a .docx file
A Word document saved as .docx is not a single file. It is a ZIP archive. If you rename a .docx file to .zip and open it, you will find a directory structure containing dozens of XML files:
document.docx (renamed to .zip)
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── core.xml        ← Author, dates, revision count
│   ├── app.xml         ← Company, application, template, editing time
│   └── custom.xml      ← Custom properties (classification, workflow)
└── word/
    ├── document.xml    ← The actual content + tracked changes
    ├── comments.xml    ← Comment threads
    ├── settings.xml    ← Template path, compatibility settings
    ├── styles.xml      ← Style definitions
    ├── fontTable.xml   ← Embedded font references
    └── media/          ← Embedded images (with their own EXIF data)
Each of these XML files contains structured data that Word uses to render the document. The data in docProps/core.xml and docProps/app.xml is metadata about the document. The data in word/document.xml includes the visible text and the invisible revision history. The data in word/comments.xml includes every comment ever made on the document, including resolved ones.
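Because a .docx is an ordinary ZIP archive, its parts can be listed without renaming the file. The following is a minimal sketch using Python's standard zipfile module. The function name list_parts and the tiny in-memory archive are made up for this illustration; a real document will contain many more parts.

```python
import io
import zipfile


def list_parts(docx):
    """Return the sorted part names inside a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return sorted(archive.namelist())


# Hypothetical stand-in for a real .docx: a two-part archive built in memory.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("docProps/core.xml", "<coreProperties/>")
sample.seek(0)

parts = list_parts(sample)
print(parts)
```

For a file on disk, the same function should accept a path, for example list_parts("proposal.docx"), where the filename is a placeholder.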
Let's walk through each of these layers.
Layer 1: Document properties (core.xml)
The docProps/core.xml file stores the Dublin Core metadata properties of the document:
<dc:creator>Jane Smith</dc:creator>
<cp:lastModifiedBy>Michael Johnson</cp:lastModifiedBy>
<dcterms:created>2026-01-15T09:23:00Z</dcterms:created>
<dcterms:modified>2026-02-12T16:47:00Z</dcterms:modified>
<cp:revision>47</cp:revision>
<dc:title>Q1 Strategic Review — Confidential</dc:title>
<dc:subject>Client engagement proposal</dc:subject>
<cp:keywords>M&amp;A, due diligence, valuation</cp:keywords>
<dc:description>Draft proposal for ABC Corp engagement</dc:description>
What this reveals:
- dc:creator: "Jane Smith" — the person who created the document. This is set automatically from the Office profile or Windows account.
- cp:lastModifiedBy: "Michael Johnson" — the last person to save the document. If you're sending a proposal under your firm's name, this reveals the individual who prepared it.
- cp:revision: "47" — this document was saved 47 times. This tells the recipient how much work went into it.
- dc:title, dc:subject, cp:keywords: These fields are often populated from templates or previous versions of the document. A proposal for Client B may still carry the title or keywords from Client A's engagement if the document was adapted from an earlier proposal.
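The fields above can be read with a few lines of standard-library Python. This is a sketch, not a complete scanner: read_core_properties and the SAMPLE string are hypothetical names for this example, and the sample contains only three of the fields shown earlier.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = [
    "dc:creator", "cp:lastModifiedBy", "dcterms:created", "dcterms:modified",
    "cp:revision", "dc:title", "dc:subject", "cp:keywords", "dc:description",
]


def read_core_properties(xml_text):
    """Return a dict of the non-empty core properties found in core.xml."""
    root = ET.fromstring(xml_text)
    found = {}
    for field in FIELDS:
        node = root.find(field, NS)
        if node is not None and node.text:
            found[field] = node.text
    return found


# Hypothetical sample; a real core.xml would come out of the .docx archive.
SAMPLE = """<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>Jane Smith</dc:creator>
  <cp:lastModifiedBy>Michael Johnson</cp:lastModifiedBy>
  <cp:revision>47</cp:revision>
</cp:coreProperties>"""

core = read_core_properties(SAMPLE)
print(core)
```

With a real file, the XML can be obtained with zipfile.ZipFile(path).read("docProps/core.xml") and passed to the same function.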
Layer 2: Application properties (app.xml)
The docProps/app.xml file stores information about the application and organization:
<Properties>
<Application>Microsoft Office Word</Application>
<AppVersion>16.0000</AppVersion>
<Company>Competitor Corp</Company>
<Template>EngagementProposal_v3.dotx</Template>
<TotalTime>380</TotalTime>
<Pages>12</Pages>
<Words>4287</Words>
</Properties>
What this reveals:
- Company: "Competitor Corp". This is a common metadata leak in consulting. If you adapted a proposal from a file or template created at a previous employer, the Company field can still contain that employer's name. The client receives your proposal and sees "Competitor Corp" recorded as the organization that created it.
- Template: "EngagementProposal_v3.dotx" — this reveals the internal template name. Depending on naming conventions, this can expose your firm's document management system structure.
- TotalTime: "380" minutes — 6 hours and 20 minutes of editing time. In billable-hour environments, this is sensitive information. If you quoted 20 hours for the proposal and the metadata shows 6 hours of editing, the client may question your billing.
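A short sketch of reading these application properties and converting TotalTime from minutes into hours and minutes. The helper names read_app_properties and editing_time, and the trimmed SAMPLE, are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the elements in docProps/app.xml.
EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def read_app_properties(xml_text):
    """Map each top-level app.xml element (namespace stripped) to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag.replace(EP, ""): (child.text or "") for child in root}


def editing_time(props):
    """Convert the TotalTime field (minutes) to an (hours, minutes) tuple."""
    return divmod(int(props.get("TotalTime", "0")), 60)


# Hypothetical sample mirroring the fields discussed above.
SAMPLE = """<Properties
    xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">
  <Application>Microsoft Office Word</Application>
  <Company>Competitor Corp</Company>
  <Template>EngagementProposal_v3.dotx</Template>
  <TotalTime>380</TotalTime>
</Properties>"""

props = read_app_properties(SAMPLE)
hours, minutes = editing_time(props)
print(props["Company"], props["Template"], hours, minutes)
```

Here divmod(380, 60) gives 6 hours and 20 minutes, matching the figure in the text.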
Layer 3: Settings and template path (settings.xml)
The word/settings.xml file contains document settings, including the template attachment:
<w:attachedTemplate r:id="rId1"/>
The corresponding relationship file (word/_rels/settings.xml.rels) may contain:
<Relationship Id="rId1" Target="file:///C:/Users/jsmith/AppData/Roaming/Microsoft/Templates/CompetitorCorp/EngagementProposal_v3.dotx" TargetMode="External"/>
What this reveals: The full file path on the author's computer. This path includes:
- The username ("jsmith")
- The directory structure ("AppData/Roaming/Microsoft/Templates/CompetitorCorp/")
- The template name with version number
This is one of the most commonly overlooked metadata fields, because it does not appear in Word's Properties panel. You have to inspect the XML directly to see it.
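Inspecting the XML directly might look like the sketch below, which reads a relationships part and pulls out any attached-template target plus the username embedded in the path. It assumes the relationship lives in word/_rels/settings.xml.rels and that its Type ends in "/attachedTemplate"; template_targets and username_in_path are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def template_targets(rels_xml):
    """Return the Target of every attached-template relationship."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target", "")
        for rel in root.iter(REL_NS + "Relationship")
        if rel.get("Type", "").endswith("/attachedTemplate")
    ]


def username_in_path(target):
    """Return the folder name following 'Users' in a path, if there is one."""
    match = re.search(r"[/\\]Users[/\\]([^/\\]+)", target)
    return match.group(1) if match else None


# Hypothetical sample built from the path shown above.
SAMPLE = """<Relationships
    xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/attachedTemplate"
    Target="file:///C:/Users/jsmith/AppData/Roaming/Microsoft/Templates/CompetitorCorp/EngagementProposal_v3.dotx"
    TargetMode="External"/>
</Relationships>"""

targets = template_targets(SAMPLE)
user = username_in_path(targets[0])
print(targets[0], user)
```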
Layer 4: Tracked changes (document.xml)
When Track Changes is enabled in Word, every insertion and deletion is recorded as an XML node in the document content:
<w:del w:id="42" w:author="Jane Smith" w:date="2026-02-10T14:23:00Z">
<w:r><w:delText>We propose a fee of $175,000 for this engagement</w:delText></w:r>
</w:del>
<w:ins w:id="43" w:author="Michael Johnson" w:date="2026-02-11T09:15:00Z">
<w:r><w:t>We propose a fee of $225,000 for this engagement</w:t></w:r>
</w:ins>
What this reveals: The document originally proposed a fee of $175,000. It was later revised to $225,000. The revision was made by Michael Johnson on February 11. The recipient of this document can see the original price, the revised price, who made the change, and when.
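Extracting this information takes very little code. The sketch below walks the document XML and collects every insertion and deletion with its author, date, and text. Word typically stores deleted text in w:delText rather than w:t, so the sketch accepts both. The function name tracked_changes and the SAMPLE document are made up for this example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tracked_changes(document_xml):
    """Return a list of dicts describing each w:ins / w:del element."""
    root = ET.fromstring(document_xml)
    changes = []
    for node in root.iter():
        if node.tag not in (W + "ins", W + "del"):
            continue
        # Inserted text lives in w:t; deleted text usually lives in w:delText.
        text = "".join(
            t.text or ""
            for t in node.iter()
            if t.tag in (W + "t", W + "delText")
        )
        changes.append({
            "kind": "insertion" if node.tag == W + "ins" else "deletion",
            "author": node.get(W + "author"),
            "date": node.get(W + "date"),
            "text": text,
        })
    return changes


# Hypothetical sample based on the snippet above.
SAMPLE = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body><w:p>
    <w:del w:id="42" w:author="Jane Smith" w:date="2026-02-10T14:23:00Z">
      <w:r><w:delText>We propose a fee of $175,000 for this engagement</w:delText></w:r>
    </w:del>
    <w:ins w:id="43" w:author="Michael Johnson" w:date="2026-02-11T09:15:00Z">
      <w:r><w:t>We propose a fee of $225,000 for this engagement</w:t></w:r>
    </w:ins>
  </w:p></w:body>
</w:document>"""

changes = tracked_changes(SAMPLE)
for change in changes:
    print(change["kind"], change["author"], change["date"], change["text"])
```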
"Accepting" this change in Word removes the visual markup. The <w:ins> and <w:del> nodes may or may not persist in the XML depending on the Word version and save format. In many cases, they do persist — and are extractable by parsing the file.
Layer 5: Comments (comments.xml)
<w:comment w:id="7" w:author="Jane Smith" w:date="2026-02-09T11:42:00Z">
<w:p><w:r><w:t>Michael — should we go higher on this? They accepted $200k last year without pushback.</w:t></w:r></w:p>
</w:comment>
What this reveals: An internal discussion about pricing strategy. The client learns that the firm considered charging more, and that the client accepted a higher price in a previous engagement.
Comments that are "resolved" in Word's interface may still exist in the word/comments.xml file. They are hidden from the visual display but remain in the XML. Any tool that parses the .docx archive can read them.
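As an illustration of how little such a tool needs to do, the sketch below reads every comment in a comments part. It does not distinguish resolved comments from open ones; it returns whatever is present in the XML. The name read_comments and the shortened SAMPLE are assumptions for this example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_comments(comments_xml):
    """Return (author, date, text) for every w:comment element."""
    root = ET.fromstring(comments_xml)
    comments = []
    for comment in root.iter(W + "comment"):
        text = "".join(t.text or "" for t in comment.iter(W + "t"))
        comments.append((comment.get(W + "author"), comment.get(W + "date"), text))
    return comments


# Hypothetical sample based on the comment shown above (text shortened).
SAMPLE = """<w:comments
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:comment w:id="7" w:author="Jane Smith" w:date="2026-02-09T11:42:00Z">
    <w:p><w:r><w:t>Michael, should we go higher on this?</w:t></w:r></w:p>
  </w:comment>
</w:comments>"""

comments = read_comments(SAMPLE)
for author, date, text in comments:
    print(author, date, text)
```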
Layer 6: Embedded images with their own metadata
If the Word document contains embedded images (photographs, screenshots, diagrams), each image file can carry its own metadata. A photo embedded in a proposal may contain EXIF fields such as GPS coordinates, device information, and timestamps.
Word does not strip EXIF data from embedded images. The image is stored in the word/media/ directory within the .docx archive, complete with all its original metadata.
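A rough way to flag embedded images that may carry EXIF data is sketched below. It is a byte-level heuristic, not an EXIF parser: it only checks whether a file under word/media/ starts like a JPEG and contains the "Exif" marker near the beginning. The function name media_with_exif and the fake image bytes are assumptions for this example.

```python
import io
import zipfile


def media_with_exif(docx):
    """Return names of JPEG files under word/media/ that contain an Exif marker."""
    flagged = []
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/"):
                continue
            head = archive.read(name)[:65536]
            # JPEG files start with FF D8; EXIF data is tagged "Exif\0\0".
            if head.startswith(b"\xff\xd8") and b"Exif\x00\x00" in head:
                flagged.append(name)
    return flagged


# Hypothetical stand-in archive: one fake JPEG header with an Exif marker,
# one fake PNG header without.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/media/image1.jpg",
                     b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00" + b"\x00" * 8)
    archive.writestr("word/media/image2.png", b"\x89PNG\r\n\x1a\n")
sample.seek(0)

flagged = media_with_exif(sample)
print(flagged)
```

A flagged file still has to be opened with an image metadata tool to see which fields it actually contains.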
A real example, assembled
Here is a composite scenario — representative of actual incidents, assembled from common patterns:
A management consulting firm prepares a proposal for a new client engagement. The proposal document reveals the following through its metadata:
| Field | Value | What it exposes |
|-------|-------|-----------------|
| Author | Jane Smith | Contractor's personal name, not the firm |
| Company | Competitor Corp | File adapted from a previous employer's template |
| Revision count | 47 | Document went through 47 drafts |
| Total editing time | 380 minutes | Only ~6 hours of actual editing |
| Template path | C:\Users\jsmith...\CompetitorCorp\ | Author's home directory and template source |
| Tracked deletion | "fee of $175,000" | Original lower price was revised upward |
| Comment | "they accepted $200k last year" | Internal pricing strategy discussion |
Every one of these fields is invisible in the rendered document. Every one is readable by parsing the .docx archive. The client who receives this proposal can, with minimal technical effort, reconstruct the firm's internal deliberation, discover the contractor relationship, and identify the pricing strategy.
What Microsoft Word's own tools catch
Word includes a "Check for Issues" > "Inspect Document" feature (the Document Inspector) that can find and remove some of these data points. It is a useful first step.
What Document Inspector catches:
- Comments, revisions, and annotations (most of the time)
- Document properties and personal information (author, company, title)
- Custom XML data
- Invisible content and hidden text
What Document Inspector may miss:
- Template path in settings.xml (not always cleared)
- Resolved comments that persist in comments.xml
- Custom document properties in custom.xml
- EXIF metadata in embedded images
- Residual revision data in complex documents with long editing histories
What Document Inspector does not do:
- Verify removal by re-parsing the output
- Produce a report of what was found and removed
- Process batch documents
- Work on files outside of Word (PDFs, spreadsheets, images)
What to do before sending any document
- Know what's in the file. Before you send a document, inspect it. Not visually, but structurally. Check the Properties panel. Run Document Inspector. Or use a dedicated metadata scanning tool that reads the XML.
- Clean at the data layer. Removing metadata means removing it from the XML, not from the visual display. "Accept All Changes" is not enough. Author fields must be cleared from core.xml and app.xml. Template paths must be cleared from settings.xml.
- Verify the cleanup. After cleaning, re-inspect the file. If the metadata is gone from the XML, the cleanup worked. If any fields persist, the cleanup was incomplete.
- Make it routine. The documents that cause problems are not the ones you carefully reviewed. They are the ones you sent in a hurry on a Friday afternoon. The only reliable defense is a process that runs on every document, every time, without depending on memory or discipline.
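The verification step can be sketched as a re-scan of the cleaned file that reports anything still present. The example below is deliberately partial: it checks only the author, last-editor, and company fields plus tracked-change and comment elements, and it does not look at template paths or image metadata. The names residual_metadata and build, and both sample archives, are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Property elements that should be empty after cleaning.
PROPERTY_FIELDS = {
    "docProps/core.xml": {"creator", "lastModifiedBy"},
    "docProps/app.xml": {"Company"},
}
# Markup elements that should be absent after cleaning.
MARKUP_TAGS = {
    "word/document.xml": {W + "ins", W + "del"},
    "word/comments.xml": {W + "comment"},
}


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def residual_metadata(docx):
    """Return human-readable findings; an empty list means nothing was found."""
    findings = []
    with zipfile.ZipFile(docx) as archive:
        names = set(archive.namelist())
        for part, fields in PROPERTY_FIELDS.items():
            if part not in names:
                continue
            for node in ET.fromstring(archive.read(part)).iter():
                value = (node.text or "").strip()
                if local_name(node.tag) in fields and value:
                    findings.append(f"{part}: {local_name(node.tag)} = {value}")
        for part, tags in MARKUP_TAGS.items():
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            count = sum(1 for node in root.iter() if node.tag in tags)
            if count:
                findings.append(f"{part}: {count} tracked-change or comment element(s)")
    return findings


def build(parts):
    """Build an in-memory archive from {part name: XML text} (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, xml in parts.items():
            archive.writestr(name, xml)
    buffer.seek(0)
    return buffer


APP = ('<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/'
       '2006/extended-properties"><Company>{}</Company></Properties>')
before = residual_metadata(build({"docProps/app.xml": APP.format("Competitor Corp")}))
after = residual_metadata(build({"docProps/app.xml": APP.format("")}))
print(before, after)
```

Running a check like this on every outgoing file, rather than on the ones someone remembers to inspect, is what turns the cleanup into a routine.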
Purgit scans Word, Excel, and PowerPoint documents at the XML level. It identifies author fields, tracked changes, comments, company names, template paths, and revision history — then removes them and verifies removal by re-parsing the file. Before you send, scan.