The Metadata Trail Left by Document Version Control
Every review cycle adds metadata to your documents. Revision IDs, tracked changes, and version history create a trail that persists after you click Accept All Changes.
Documents accumulate metadata through review cycles
Every time a document passes through a review — from draft to manager review to legal review to final approval — it accumulates metadata. Each reviewer's name, each edit, each comment, and each save operation adds to the document's internal record of its own history.
This accumulation is invisible during normal use. You see the current content of the document. But the file's internal structure contains a detailed history of how that content was produced, who contributed, and what was changed along the way.
When you share the document externally, that history travels with it.
How Word's revision tracking works internally
Alongside the Track Changes feature, Microsoft Word stamps content at the XML level with revision session identifiers called rsid values. These values are written whether or not Track Changes is switched on, which explains why "Accept All Changes" does not fully clean a document.
Revision Session IDs (rsid)
Each editing session, meaning the period between one save and the next, gets its own revision session ID (rsid). Word attaches this ID as an attribute to the paragraphs, text runs (inline text segments), and other elements that were added or modified during that session.
In the document's XML, a text run might look like this internally:
<w:r w:rsidR="00A21B34" w:rsidRPr="00C45D67">
<w:t>Contract renewal terms</w:t>
</w:r>
The rsidR attribute identifies the session in which the run was inserted. The rsidRPr attribute identifies the session in which the run's formatting was last changed. The document also stores a master list of all rsid values in word/settings.xml, and each value corresponds to one session in the document's editing history.
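To make the master list concrete, here is a minimal sketch that reads it from a .docx package using only the Python standard library. The function name list_rsids is hypothetical, and the sketch assumes the file uses the transitional WordprocessingML namespace; a document saved as Strict Open XML uses a different namespace and would need the constant changed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: transitional WordprocessingML namespace (the common case).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_rsids(docx_bytes: bytes) -> list[str]:
    """Return the distinct rsid values listed in word/settings.xml, in file order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/settings.xml"))
    values: list[str] = []
    # <w:rsid w:val="..."/> entries live inside the <w:rsids> element.
    for entry in root.iter(f"{{{W}}}rsid"):
        value = entry.get(f"{{{W}}}val")
        if value and value not in values:
            values.append(value)
    return values
```

Calling list_rsids(open("contract.docx", "rb").read()) on a real file (the filename is only an example) typically returns one entry per editing session.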
What rsid values reveal
While rsid values do not directly store author names, they can be correlated with other metadata to reconstruct editing patterns:
- How many editing sessions occurred: the number of unique rsid values approximates how many times the document was edited and saved
- Which text was added in which session: by grouping text runs by their rsid, an analyst can reconstruct which content was added at each stage
- The relative order of edits: each rsid marks a whole session rather than a single edit, so the values are not timestamps, but cross-referencing them with the master list and other metadata can suggest the order in which content was added
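The grouping described in the list above can be sketched in a few lines. This is an illustration rather than a forensic tool: group_text_by_rsid is a hypothetical name, the input is the text of word/document.xml, and runs without an rsidR attribute are filed under a placeholder key.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Assumption: transitional WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def group_text_by_rsid(document_xml: str) -> dict[str, list[str]]:
    """Map each rsidR value to the text of the runs that carry it."""
    root = ET.fromstring(document_xml)
    groups: dict[str, list[str]] = defaultdict(list)
    for run in root.iter(f"{{{W}}}r"):
        # Runs with no rsidR are grouped under a placeholder key.
        rsid = run.get(f"{{{W}}}rsidR", "(none)")
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        if text:
            groups[rsid].append(text)
    return dict(groups)
```

Each key in the result corresponds to one editing session, so the values show which fragments of the final text arrived together.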
Accept All Changes does not remove rsid values
When you accept tracked changes in Word, the visual markup (strikethroughs, colored text, margin indicators) disappears, and the explicit change-tracking markup behind it is removed from the XML. But the rsid values on the text runs remain.
Acceptance does not flatten the document's rsid history, so a motivated analyst can still work out from rsid patterns which text was added or modified in which editing session.
Tracked changes: what persists after acceptance
Word's tracked changes feature explicitly records insertions, deletions, and formatting changes with author attribution and timestamps. When you accept all changes, the change records are removed from the visible document. However:
- Deleted text may persist in the XML in some scenarios, particularly if the acceptance was performed by a macro or an older version of Word that did not fully purge deletion records
- Author information from the change records may be referenced by other metadata structures that survive acceptance
- The total editing time stored in document properties continues to reflect the cumulative editing duration across all sessions, including review sessions
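One way to check a specific file for the leftovers listed above is to count the tracked-change elements still present and read the stored editing time. The sketch below assumes the transitional namespaces and uses a hypothetical function name, residual_revision_data. It looks for w:ins, w:del and w:delText elements in word/document.xml and for the TotalTime property (stored in minutes) in docProps/app.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: transitional namespaces for the main document and app properties.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
EP = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def residual_revision_data(docx_bytes: bytes) -> dict:
    """Count leftover tracked-change elements and read the TotalTime property."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        app = None
        if "docProps/app.xml" in archive.namelist():
            app = ET.fromstring(archive.read("docProps/app.xml"))
    counts = {
        tag: sum(1 for _ in body.iter(f"{{{W}}}{tag}"))
        for tag in ("ins", "del", "delText")
    }
    # TotalTime is kept as a string; None means the property is absent.
    minutes = app.findtext(f"{{{EP}}}TotalTime") if app is not None else None
    return {"tracked": counts, "total_minutes": minutes}
```

A fully accepted document should report zero for all three counts; a non-zero delText count means deleted text is still readable in the file.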
PDF version stacking
PDF files handle versioning differently from Word. PDFs support incremental saves, where modifications are appended to the end of the file rather than overwriting the original content. This means that previous versions of the content can coexist within a single PDF file.
How incremental saves work
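When a PDF is modified and saved incrementally, the original objects remain in the file and the new or changed objects are appended after them. Each incremental save also appends its own cross-reference section and trailer; the newest one, at the end of the file, points to the latest version of each object and links back to the previous cross-reference section. The superseded objects are no longer referenced, but they are still present in the file's byte stream.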
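Because every save ends with a startxref offset followed by an %%EOF marker, a rough way to spot incremental saves is to count those pairs. The following is a heuristic sketch with a hypothetical function name, pdf_revision_offsets. It can overcount, since linearized PDFs contain two such markers even when saved once and the byte sequence could also occur inside a stream.

```python
import re

# A save ends with: startxref, the byte offset of its xref section, then %%EOF.
STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def pdf_revision_offsets(pdf_bytes: bytes) -> list[int]:
    """Return the xref offset recorded by each save found in the file, in order."""
    return [int(match.group(1)) for match in STARTXREF.finditer(pdf_bytes)]
```

More than one offset in the result is a hint, not proof, that the file carries earlier revisions.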
What this means for metadata
- Previous text content — if a PDF was edited to change text (for example, correcting a name or updating a figure), the original text may still exist in the file as an older object version
- Previous metadata — if metadata fields were changed (author name updated, title corrected), the original metadata values may persist as unreferenced objects
- Annotation history — comments and annotations that were added and then deleted may remain in the file as deleted objects
Viewing a PDF normally shows only the current version. But tools that parse the raw PDF structure can extract previous versions, revealing what changed and what the document looked like before modification.
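To illustrate how little effort this takes, the sketch below recovers candidate earlier versions by cutting the file at each %%EOF marker. It relies on the fact that an incrementally saved PDF is the previous file plus appended bytes. The function name split_pdf_revisions is hypothetical, and the same caveats apply as for any marker-based heuristic: linearized files and markers inside streams can produce prefixes that are not real revisions.

```python
def split_pdf_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one file prefix per %%EOF marker; each is a candidate earlier version."""
    marker = b"%%EOF"
    revisions: list[bytes] = []
    position = 0
    while True:
        index = pdf_bytes.find(marker, position)
        if index == -1:
            break
        end = index + len(marker)
        # Everything up to and including this marker was once a complete file.
        revisions.append(pdf_bytes[:end])
        position = end
    return revisions
```

Writing each prefix to disk and opening it in a viewer shows the document as it stood after that save, if the prefix is a genuine revision.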
SharePoint and OneDrive version history
Cloud-based document management platforms add another layer of version tracking that operates independently of the document's internal metadata.
Platform-level version history
SharePoint and OneDrive maintain a version history for documents wherever versioning is enabled, which is typically the default. As the document is saved, manually or through AutoSave, new versions are added to that history. This version history includes:
- Who made each version — the user account that saved the file
- When each version was created — timestamp for each save
- The full content of each version — the complete document at that point in time
This version history is accessible to anyone with appropriate permissions on the SharePoint site or OneDrive folder. It exists independently of the document's internal metadata — even if you clean the document's internal metadata, the platform's version history retains all previous versions with their original metadata.
The distinction between file metadata and platform metadata
Cleaning a document's internal metadata (author field, comments, revision history) addresses only what is embedded in the file itself. Platform-level metadata (version history, access logs, sharing records) is stored by the platform and is not affected by changes to the file's content.
When preparing a document for external sharing:
- Clean the file's internal metadata (this is what Purgit does)
- Download or export a clean copy rather than sharing a link to the version-controlled file
- If sharing via link is required, understand that the recipient may have access to the platform's version history depending on permission settings
A proper clean workflow before external sharing
Step 1: Finalize content before cleaning
Complete all internal reviews and accept all changes before beginning the metadata removal process. Cleaning metadata during an active review cycle means the next review will re-introduce metadata.
Step 2: Create a clean copy
Rather than cleaning the working file, create a copy specifically for external sharing. This preserves the internal version (with its revision history, for your records) while producing a clean export.
Step 3: Remove internal metadata
On the clean copy, remove:
- All tracked changes (accept and purge, not just accept)
- All comments
- Document properties (author, company, title, etc.)
- Revision session identifiers (rsid values)
- Custom XML parts from document management systems
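As an illustration of the rsid item in the list above, the sketch below strips rsid attributes and the master list from WordprocessingML text. It is a simplified, regex-based example with a hypothetical function name, strip_rsids. It assumes the conventional w: namespace prefix, and it would also alter body text that happens to contain a matching string, so a production tool should work on the parsed XML instead.

```python
import re

# Matches attributes such as w:rsidR, w:rsidRPr, w:rsidP, w:rsidRDefault, w:rsidDel.
RSID_ATTR = re.compile(r'\s+w:rsid[A-Za-z]*="[0-9A-Fa-f]+"')
# Matches the master list stored in word/settings.xml.
RSIDS_BLOCK = re.compile(r"<w:rsids>.*?</w:rsids>", re.DOTALL)


def strip_rsids(xml_text: str) -> str:
    """Remove rsid attributes and the <w:rsids> master list from an XML string."""
    without_attrs = RSID_ATTR.sub("", xml_text)
    return RSIDS_BLOCK.sub("", without_attrs)
```

The function would be applied to each XML part of the clean copy (document, headers, footers, settings) before the package is zipped again.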
Step 4: Flatten PDFs
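If the output is a PDF, flatten it by rewriting the file as a full, non-incremental save so that earlier save layers are dropped. A PDF rewritten this way contains only the current version of each object, with no residual previous versions. Note that many PDF tools use "flatten" to mean merging annotations and form fields into the page, which does not by itself remove incremental save data.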
Step 5: Verify
Re-scan the clean copy to confirm that metadata has been successfully removed. This catches cases where the removal process missed fields or introduced new metadata.
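A verification pass can be as simple as searching every XML part of the package for markers that should no longer be there. The sketch below is a minimal example under stated assumptions: the function name find_residual_metadata and the check labels are hypothetical, the patterns assume the conventional w: and dc: prefixes, and the list of checks is deliberately short rather than exhaustive.

```python
import io
import re
import zipfile

# Hypothetical, non-exhaustive set of residue patterns.
CHECKS = {
    "rsid attributes": re.compile(rb'\sw:rsid[A-Za-z]*="'),
    "tracked changes": re.compile(rb"<w:(?:ins|del|moveFrom|moveTo)[\s>]"),
    "comments": re.compile(rb"<w:comment[\s>]"),
    "author": re.compile(rb"<dc:creator>[^<]+</dc:creator>"),
}


def find_residual_metadata(docx_bytes: bytes) -> dict[str, list[str]]:
    """Map each triggered check to the package parts where it matched."""
    findings: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            data = archive.read(name)
            for label, pattern in CHECKS.items():
                if pattern.search(data):
                    findings.setdefault(label, []).append(name)
    return findings
```

An empty result means none of these particular markers were found; it does not prove the file is free of every kind of metadata.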
Step 6: Share the clean copy, not the working file
Share only the verified clean copy. Never share the working file or a link to the file in your version-controlled system.
Purgit scans documents for revision metadata — rsid values, tracked changes, comments, document properties, and incremental PDF layers. It removes the version trail at the structural level and verifies removal before you share.