The Metadata Trail Left by Document Version Control

Documents accumulate metadata through review cycles

Every time a document passes through a review — from draft to manager review to legal review to final approval — it accumulates metadata. Each reviewer's name, each edit, each comment, and each save operation adds to the document's internal record of its own history.

This accumulation is invisible during normal use. You see the current content of the document. But the file's internal structure contains a detailed history of how that content was produced, who contributed, and what was changed along the way.

When you share the document externally, that history travels with it.

How Word's revision tracking works internally

Microsoft Word tracks revisions at the XML level using a system of revision session identifiers called rsid values. Understanding this system explains why "Accept All Changes" does not fully clean a document.

Revision Session IDs (rsid)

Each editing session (roughly, the period between two saves) gets its own revision session ID (rsid), an eight-digit hexadecimal value. Word stamps this ID on the paragraphs, text runs (inline text segments), and other elements that were added or modified during that session.

In the document's XML, a run of text might look like this internally:

<w:r w:rsidR="00A21B34" w:rsidRPr="00C45D67">
  <w:t>Contract renewal terms</w:t>
</w:r>

The rsidR attribute identifies the session in which the run was inserted. The rsidRPr attribute identifies the session in which the run's formatting was last changed. The document stores a master list of all rsid values in its settings part (word/settings.xml), and each value corresponds to one editing session in the document's history.
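
As a rough illustration of where these values live, the sketch below uses Python's standard library to read the rsid attributes from each run in a word/document.xml string and the master list from a word/settings.xml string. The function names are hypothetical and chosen for this example only; the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the XML above).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_rsids(document_xml: str) -> list[dict]:
    """Return one record per run: its text plus any rsid attributes it carries."""
    root = ET.fromstring(document_xml)
    records = []
    for run in root.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t"))
        # Keep only attributes whose local name starts with "rsid".
        rsids = {
            name.split("}")[1]: value
            for name, value in run.attrib.items()
            if name.startswith(f"{{{W_NS}}}rsid")
        }
        records.append({"text": text, **rsids})
    return records


def master_rsid_list(settings_xml: str) -> list[str]:
    """Return the values of the <w:rsid> entries listed under <w:rsids>."""
    root = ET.fromstring(settings_xml)
    values = []
    for rsids in root.iter(f"{{{W_NS}}}rsids"):
        for entry in rsids.findall(f"{{{W_NS}}}rsid"):
            value = entry.get(f"{{{W_NS}}}val")
            if value:
                values.append(value)
    return values
```

Both functions take XML text rather than a file path, so they can be pointed at parts extracted from a .docx package with any ZIP tool.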

What rsid values reveal

While rsid values do not directly store author names, they can be correlated with other metadata to reconstruct editing patterns:

How many editing sessions occurred: the number of unique rsid values approximates how many separate editing sessions the document went through
Which text was added in which session: by grouping text runs by their rsid, an analyst can reconstruct which content was added at each stage
Which edits were made together: each session gets a single rsid, so the values show which changes belong to the same session; putting the sessions themselves in order usually requires correlating them with timestamps or other metadata
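
The grouping step can be sketched in a few lines of Python. This is a minimal example that assumes each run carries its own rsidR attribute; runs without one are collected under a placeholder key, and a fuller analysis would also look at paragraph-level attributes. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_by_session(docx_path) -> dict[str, list[str]]:
    """Group run text by the rsidR value on each run.

    Runs with no rsidR attribute are collected under the key "(none)".
    """
    # A .docx file is a ZIP package; the body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    groups = defaultdict(list)
    for run in root.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t"))
        if text:
            groups[run.get(f"{{{W_NS}}}rsidR", "(none)")].append(text)
    return dict(groups)
```

The number of keys in the result gives a rough count of the sessions that contributed text that is still in the document.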

Accept All Changes does not remove rsid values

When you accept tracked changes in Word, the visual markup (strikethroughs, colored text, margin indicators) disappears. But the rsid values on the text runs remain. The document still contains a record of which text was added or modified in which editing session.

In other words, acceptance clears the explicit change-tracking markup but leaves the rsid history intact, so a motivated analyst can still infer how the document was assembled from the rsid patterns.

Tracked changes: what persists after acceptance

Word's tracked changes feature explicitly records insertions, deletions, and formatting changes with author attribution and timestamps. When you accept all changes, those records are resolved into the final text and no longer appear as markup. However:

Deleted text may persist in the XML in some scenarios, particularly if the acceptance was performed by a macro or an older version of Word that did not fully purge deletion records
Author information from the change records may be referenced by other metadata structures that survive acceptance
The total editing time stored in document properties continues to reflect the cumulative editing duration across all sessions, including review sessions
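
To see what the document properties still say after acceptance, the properties parts can be read directly. The sketch below assumes a standard .docx package with docProps/app.xml (which holds TotalTime, in minutes) and docProps/core.xml (which holds the creator, last-modified-by, and revision fields); the function names and result keys are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"


def _find_text(package, part, tag):
    """Return the text of the first `tag` element in `part`, or None if absent."""
    if part not in package.namelist():
        return None
    element = ET.fromstring(package.read(part)).find(f".//{tag}")
    return element.text if element is not None else None


def residual_properties(docx_path) -> dict:
    """Read editing-history fields from a .docx package's properties parts."""
    with zipfile.ZipFile(docx_path) as package:
        return {
            "total_time_minutes": _find_text(
                package, "docProps/app.xml", f"{{{EP_NS}}}TotalTime"),
            "creator": _find_text(
                package, "docProps/core.xml", f"{{{DC_NS}}}creator"),
            "last_modified_by": _find_text(
                package, "docProps/core.xml", f"{{{CP_NS}}}lastModifiedBy"),
            "revision": _find_text(
                package, "docProps/core.xml", f"{{{CP_NS}}}revision"),
        }
```

Values are returned as the raw strings stored in the file; a missing part or field comes back as None.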

PDF version stacking

PDF files handle versioning differently from Word. PDFs support incremental saves, where modifications are appended to the end of the file rather than overwriting the original content. This means that previous versions of the content can coexist within a single PDF file.

How incremental saves work

When a PDF is modified and saved incrementally, the original objects remain in the file and new objects are appended. Each incremental save also appends its own cross-reference section and trailer; a PDF reader starts from the last one, which points to the latest version of each object. But the previous versions are still present in the file's byte stream.
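
Because every save ends with its own end-of-file marker, a quick heuristic for spotting incremental saves is to count those markers in the raw bytes. The sketch below is only a heuristic and rests on that assumption: linearized ("fast web view") PDFs can also contain more than one marker, so a count above one is a prompt for closer inspection, not proof. The function name and result keys are hypothetical.

```python
import re


def count_revisions(pdf_bytes: bytes) -> dict:
    """Count end-of-file markers and startxref keywords in raw PDF bytes."""
    eof_markers = len(re.findall(rb"%%EOF", pdf_bytes))
    startxref_keywords = len(re.findall(rb"startxref", pdf_bytes))
    return {
        "eof_markers": eof_markers,
        "startxref_keywords": startxref_keywords,
        # More than one marker suggests (but does not prove) incremental saves.
        "likely_incremental": eof_markers > 1,
    }
```

Typical use is to read the file with open(path, "rb").read() and pass the bytes in.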

What this means for metadata

Previous text content — if a PDF was edited to change text (for example, correcting a name or updating a figure), the original text may still exist in the file as an older object version
Previous metadata — if metadata fields were changed (author name updated, title corrected), the original metadata values may persist as unreferenced objects
Annotation history — comments and annotations that were added and then deleted may remain in the file as deleted objects

Viewing a PDF normally shows only the current version. But tools that parse the raw PDF structure can extract previous versions, revealing what changed and what the document looked like before modification.
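
One simple way such tools recover earlier states is to cut the file at each end-of-file marker: with plain incremental saves, the bytes up to and including a given marker form the file as it stood after that save. The sketch below illustrates the idea under that assumption; it does not handle linearized files or damaged markers, and the function name is hypothetical.

```python
def carve_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one byte string per %%EOF marker, each a prefix of the input.

    The last entry covers the file through its final marker; earlier entries
    are candidate previous versions.
    """
    marker = b"%%EOF"
    revisions = []
    position = 0
    while True:
        index = pdf_bytes.find(marker, position)
        if index == -1:
            break
        end = index + len(marker)
        # Keep the line ending that follows the marker, if any.
        while end < len(pdf_bytes) and pdf_bytes[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(pdf_bytes[:end])
        position = end
    return revisions
```

Each carved prefix can be written to its own file and opened in a PDF viewer to see what that stage of the document looked like.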

SharePoint and OneDrive version history

Cloud-based document management platforms add another layer of version tracking that operates independently of the document's internal metadata.

Platform-level version history

SharePoint and OneDrive maintain a version history for every document. A new version is created when the document is saved manually and periodically while auto-save is active. This version history includes:

Who made each version — the user account that saved the file
When each version was created — timestamp for each save
The full content of each version — the complete document at that point in time

This version history is accessible to anyone with appropriate permissions on the SharePoint site or OneDrive folder. It exists independently of the document's internal metadata — even if you clean the document's internal metadata, the platform's version history retains all previous versions with their original metadata.
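
For a sense of what this history looks like to someone with access, the sketch below summarizes a version list. It assumes the JSON shape returned by Microsoft Graph's list-versions call for a drive item (GET /drives/{drive-id}/items/{item-id}/versions), where each entry carries an id, a lastModifiedDateTime, a lastModifiedBy identity, and a size. Fetching the payload and authenticating are out of scope here, and the function name and output keys are hypothetical.

```python
def summarize_versions(payload: dict) -> list[dict]:
    """Flatten a list-versions response into who/when/size rows."""
    rows = []
    for version in payload.get("value", []):
        # lastModifiedBy is an identity set; the human user sits under "user".
        user = version.get("lastModifiedBy", {}).get("user", {})
        rows.append({
            "version": version.get("id"),
            "saved_by": user.get("displayName"),
            "saved_at": version.get("lastModifiedDateTime"),
            "size_bytes": version.get("size"),
        })
    return rows
```

None of these fields come from inside the file, which is why cleaning the document itself leaves them untouched.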

The distinction between file metadata and platform metadata

Cleaning a document's internal metadata (author field, comments, revision history) addresses only what is embedded in the file itself. Platform-level metadata (version history, access logs, sharing records) is stored by the platform and is not affected by changes to the file's content.

When preparing a document for external sharing:

Clean the file's internal metadata (this is what Purgit does)
Download or export a clean copy rather than sharing a link to the version-controlled file
If sharing via link is required, understand that the recipient may have access to the platform's version history depending on permission settings

A proper clean workflow before external sharing

Step 1: Finalize content before cleaning

Complete all internal reviews and accept all changes before beginning the metadata removal process. Cleaning metadata during an active review cycle means the next review will re-introduce metadata.

Step 2: Create a clean copy

Rather than cleaning the working file, create a copy specifically for external sharing. This preserves the internal version (with its revision history, for your records) while producing a clean export.

Step 3: Remove internal metadata

On the clean copy, remove:

All tracked changes (accept and purge, not just accept)
All comments
Document properties (author, company, title, etc.)
Revision session identifiers (rsid values)
Custom XML parts from document management systems
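
As an illustration of the rsid item in the list above, the sketch below strips rsid attributes from document XML and drops the master list from settings XML using regular expressions. It is a simplified example, not a complete cleaner: it assumes the conventional w: namespace prefix and operates on XML text you have already extracted from the package. The function names are hypothetical.

```python
import re

# Matches attributes such as w:rsidR, w:rsidRPr, w:rsidP, w:rsidRDefault.
RSID_ATTR = re.compile(r'\s+w:rsid[A-Za-z]*="[0-9A-Fa-f]+"')
# Matches the master list block in word/settings.xml.
RSID_LIST = re.compile(r"<w:rsids>.*?</w:rsids>", re.DOTALL)


def strip_rsid_attributes(document_xml: str) -> str:
    """Remove every w:rsid* attribute from the given XML text."""
    return RSID_ATTR.sub("", document_xml)


def strip_rsid_list(settings_xml: str) -> str:
    """Remove the <w:rsids> block from the given settings XML text."""
    return RSID_LIST.sub("", settings_xml)
```

A real cleaning step would apply this to every XML part that can carry rsid attributes (headers, footers, footnotes) and then re-zip the package.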

Step 4: Flatten PDFs

If the output is a PDF, flatten it to remove incremental save layers, which means rewriting the file from scratch instead of appending to it. A flattened PDF contains only the current version of each object, with no residual previous versions.

Step 5: Verify

Re-scan the clean copy to confirm that metadata has been successfully removed. This catches cases where the removal process missed fields or introduced new metadata.
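
A re-scan for a Word file can be as simple as searching each XML part of the package for the markers discussed earlier. The sketch below is a minimal check under stated assumptions: it looks only for the conventional w: prefix, covers a handful of markers, and its labels and function name are hypothetical. An empty result means none of these specific markers were found, not that the file is free of all metadata.

```python
import re
import zipfile

# Byte patterns for residual revision markup; "[ >]" avoids matching
# longer element names such as <w:delText> or <w:insideH>.
CHECKS = {
    "rsid attributes": re.compile(rb'w:rsid[A-Za-z]*="'),
    "rsid master list": re.compile(rb"<w:rsids[ >]"),
    "tracked insertions": re.compile(rb"<w:ins[ >]"),
    "tracked deletions": re.compile(rb"<w:del[ >]"),
    "comment anchors": re.compile(rb"<w:commentReference[ />]"),
}


def scan_docx(docx_path) -> dict[str, list[str]]:
    """Map each residual marker found to the package parts containing it."""
    findings: dict[str, list[str]] = {}
    with zipfile.ZipFile(docx_path) as package:
        names = package.namelist()
        if "word/comments.xml" in names:
            findings["comments part"] = ["word/comments.xml"]
        for name in names:
            if not name.endswith(".xml"):
                continue
            data = package.read(name)
            for label, pattern in CHECKS.items():
                if pattern.search(data):
                    findings.setdefault(label, []).append(name)
    return findings
```

Running the scan on both the working file and the clean copy makes it easy to confirm that the markers present in the first are absent from the second.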

Step 6: Share the clean copy, not the working file

Share only the verified clean copy. Never share the working file or a link to the file in your version-controlled system.

Purgit scans documents for revision metadata — rsid values, tracked changes, comments, document properties, and incremental PDF layers. It removes the version trail at the structural level and verifies removal before you share.
