What Is Document Metadata? A Guide for Non-Technical Professionals
Document metadata is the invisible data embedded in every file you create — author names, timestamps, GPS coordinates, revision history. Here's what it is, why it matters, and what to do about it.
The short version
Every document you create — every PDF, every Word file, every photo — contains invisible data about itself. This data is called metadata. It records who created the file, when, with what software, and sometimes where. It can include revision history, comments, GPS coordinates, and device identifiers.
You cannot see metadata by reading a document normally. But anyone who receives the file can extract it in seconds.
This guide explains what metadata is, what kinds of files contain it, why it matters for professionals, and what you can do about it — without requiring any technical background.
What metadata actually is
The word "metadata" means "data about data." In the context of documents, metadata is information about the file that is not part of the visible content.
Think of it this way: a printed letter contains the words on the page. But the envelope contains additional information — the sender's address, the postmark date, the postal service used. Metadata is the digital equivalent of the envelope.
Unlike a physical envelope, digital metadata is invisible unless you specifically look for it. It is embedded inside the file itself, traveling with the document wherever it goes — via email, file sharing, cloud storage, or USB drive.
What kinds of metadata exist
Different file types store different kinds of metadata. Here are the most common categories.
Author and identity metadata
Almost every document format records who created the file and who last modified it. This information is pulled automatically from your computer's user account or your software profile.
Where it appears:
- PDFs: Author, creator, producer fields
- Word/Excel/PowerPoint: Author, last modified by, company name
- Images: Camera make and model, software used
Why it matters: If you are a consultant adapting a proposal from a template, the author field may show a previous employee's name or a competitor's company name. If you are a lawyer, the author field may identify the individual attorney who drafted the document — which may not be appropriate to share.
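For readers comfortable with a little Python, here is where those fields live. A .docx file is a ZIP archive, and the author fields sit in a small XML part named docProps/core.xml. The sketch below reads them using only the standard library. The function name read_author_fields is illustrative, and the sketch assumes a well-formed Office Open XML package that includes that part.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_author_fields(docx_file):
    """Return the creator and last-modified-by values (None when a field is absent).

    docx_file may be a path or a binary file-like object.
    Raises KeyError if the package has no docProps/core.xml part.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NS),
    }
```

Calling read_author_fields("proposal.docx") on a file adapted from a template would show whether a previous author's name is still attached. The file name here is only an example.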
Timestamps
Documents record when they were created, last modified, and sometimes when they were last printed or accessed.
Where it appears:
- Most document formats store creation and modification timestamps
- Images store the exact date and time the photo was taken
- Word documents store the total editing time in minutes
Why it matters: Timestamps can reveal when work was performed, how long it took, and whether a document was created before or after a claimed date. In legal contexts, timestamps are frequently examined as evidence.
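As an illustration of how easily these values can be read, the Python sketch below pulls the creation date, modification date, and total editing time out of a Word file. It assumes the package contains both docProps/core.xml and docProps/app.xml, which is typical but not guaranteed. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime

CORE_NS = {"dcterms": "http://purl.org/dc/terms/"}
APP_NS = {"ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"}


def parse_timestamp(text):
    """Parse a timestamp such as 2024-03-01T09:15:00Z; return None if missing."""
    if not text:
        return None
    # Older Python versions do not accept the trailing "Z", so spell out UTC.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def read_timing(docx_file):
    """Return created/modified datetimes and total editing time in minutes."""
    with zipfile.ZipFile(docx_file) as package:
        core = ET.fromstring(package.read("docProps/core.xml"))
        app = ET.fromstring(package.read("docProps/app.xml"))
    minutes = app.findtext("ep:TotalTime", namespaces=APP_NS)
    return {
        "created": parse_timestamp(core.findtext("dcterms:created", namespaces=CORE_NS)),
        "modified": parse_timestamp(core.findtext("dcterms:modified", namespaces=CORE_NS)),
        "editing_minutes": int(minutes) if minutes else None,
    }
```

Comparing the returned "created" value with a claimed drafting date is the kind of check a reviewer might perform in a dispute. Keep in mind that these fields reflect the clock of the computer that wrote them.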
Location data
Photos taken with smartphones typically embed GPS coordinates when location services are enabled for the camera: the latitude and longitude where the photo was taken, often accurate to within a few meters.
Where it appears:
- JPEG and HEIC image files (in EXIF metadata); PNG files can also carry EXIF data, though less commonly
- Not typically in PDFs or Word documents (unless they contain embedded photos with GPS data)
Why it matters: A photo shared without removing GPS data reveals the location where it was taken. This could be your home, your office, a client's location, a hospital, or any other place you would prefer not to disclose.
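EXIF stores each coordinate as degrees, minutes, and seconds plus a reference letter (N, S, E, or W). Turning that into the decimal form that mapping services accept is simple arithmetic, which is part of why embedded location data is so easy to use. A minimal Python sketch, with a hypothetical function name and arbitrary example values:

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees.

    ref is "N", "S", "E", or "W"; southern and western values become negative.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


# Arbitrary illustrative coordinates, not taken from any real photo.
latitude = dms_to_decimal(40, 26, 46, "N")
longitude = dms_to_decimal(79, 58, 56, "W")
print(f"{latitude:.6f}, {longitude:.6f}")
```

The resulting pair of numbers can be pasted directly into most map applications.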
Revision history and editing data
Word documents and other office formats can retain a complete record of every change made to the document — insertions, deletions, formatting changes, and comments.
Where it appears:
- Word (.docx): Tracked changes, revision history, comment threads
- Excel (.xlsx): Cell change history, hidden sheets
- PowerPoint (.pptx): Comments, speaker notes, hidden slides
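Why it matters: Tracked changes that are merely hidden from view (for example, by switching to the "No Markup" view) rather than accepted or rejected remain in the file, along with the name of each reviewer. Even after changes are accepted, comments and revision identifiers may persist in the file's internal structure. The recipient can potentially see what text was deleted, what the original phrasing was, and what internal comments were made during the review process.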
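To make this concrete: in a .docx file, unresolved tracked deletions and insertions are stored as w:del and w:ins elements inside word/document.xml, and comments live in a separate word/comments.xml part. The Python sketch below lists them. The function name is hypothetical, and the sketch looks only at the main document body, not headers, footers, or footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_tracked_changes(docx_file):
    """List the text of tracked deletions and insertions still in the file."""
    with zipfile.ZipFile(docx_file) as package:
        body = ET.fromstring(package.read("word/document.xml"))
        has_comments = "word/comments.xml" in package.namelist()

    def joined_text(element, tag):
        # Concatenate every text run of the given tag below this element.
        return "".join(t.text or "" for t in element.iter(f"{{{W}}}{tag}"))

    return {
        "deleted": [joined_text(d, "delText") for d in body.iter(f"{{{W}}}del")],
        "inserted": [joined_text(i, "t") for i in body.iter(f"{{{W}}}ins")],
        "has_comments_part": has_comments,
    }
```

An empty "deleted" list does not prove a file is clean. It only shows that no tracked deletions remain in the main body.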
Software and system information
Documents record which software was used to create them, including the application name, version number, and sometimes the operating system.
Where it appears:
- PDFs: Creator and producer fields (e.g., "Microsoft Word 16.0," "Adobe Acrobat Pro 2024")
- Word/Excel/PowerPoint: Application name and version
- Images: Software field (e.g., "Adobe Photoshop 26.1," "Lightroom Classic 14.0")
Why it matters: Software identifiers reveal your tools and workflow. In some contexts, this is insignificant. In others — such as when a law firm wants to present work as its own without revealing that it was prepared using specific tools or AI assistants — the software metadata is sensitive.
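The PDF creator and producer fields are often stored as plain text inside the file, so even a crude search can surface them. The Python sketch below is deliberately simple and makes strong assumptions: it only finds values written as uncompressed literal strings, so it will miss hex-encoded values, compressed object streams, encrypted files, and XMP metadata packets. The function name is hypothetical.

```python
import re

# Matches entries such as "/Producer (Some Tool 1.0)" with a literal-string value.
# Escaped characters are tolerated; unescaped nested parentheses are not.
INFO_FIELD = re.compile(rb"/(Author|Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)")


def pdf_info_fields(pdf_bytes):
    """Return the first Author, Creator, and Producer values found in raw PDF bytes."""
    found = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep the first occurrence of each key; later ones may be older revisions.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found
```

A dedicated PDF library would parse the document structure properly. This sketch only shows that the information is not hard to reach.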
Tool attribution strings
A newer category of metadata: strings that identify the tool or AI system that contributed to the document's creation.
Where it appears:
- Documents created with AI assistance may contain strings like "Generated with ChatGPT" or "Co-authored by Claude" in metadata fields or embedded comments
- Images generated by AI tools may contain "Created with Midjourney" or "DALL-E" in the metadata
- C2PA Content Credentials may embed a full provenance chain
Why it matters: If you are using AI tools to assist with document preparation — which is increasingly common and perfectly legitimate — you may or may not want the tool's name to appear in the file you share with clients, opposing counsel, or regulatory bodies.
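A rough way to check a file for attribution strings like these is to search its contents for known phrases. The Python sketch below does that, unpacking Office files first because they are ZIP archives whose parts are compressed. The marker list simply reuses the examples above and is not exhaustive, the function names are hypothetical, and text stored in other encodings or inside compressed PDF streams will not be found.

```python
import io
import zipfile

# Example markers taken from the list above; extend to suit your own tools.
ATTRIBUTION_MARKERS = [
    "Generated with ChatGPT",
    "Co-authored by Claude",
    "Created with Midjourney",
    "DALL-E",
]


def iter_payloads(data):
    """Yield (name, bytes) pairs: each ZIP member for Office files, else the raw file."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            for name in package.namelist():
                yield name, package.read(name)
    else:
        yield "(file)", data


def find_attribution_strings(data, markers=ATTRIBUTION_MARKERS):
    """Return (part name, marker) pairs for every marker found, ignoring case."""
    hits = []
    for name, payload in iter_payloads(data):
        lowered = payload.lower()
        for marker in markers:
            if marker.lower().encode("utf-8") in lowered:
                hits.append((name, marker))
    return hits
```

A hit tells you which part of the file to inspect. An empty result means only that none of the listed phrases appeared in readable form.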
Who should care about metadata
Metadata is relevant to anyone who shares files professionally. Here are the most common contexts.
Legal professionals
Lawyers handle privileged, confidential, and strategically sensitive documents daily. Metadata in legal documents has been used in litigation to challenge arguments, expose negotiation strategies, and identify document authors. Several bar associations have issued ethics opinions on the duty to strip metadata before sharing documents with opposing parties.
Healthcare professionals
The Health Insurance Portability and Accountability Act (HIPAA) treats geographic information more specific than a state as an identifier that can make health information Protected Health Information (PHI). GPS coordinates in a clinical photo can therefore make an otherwise anonymous image identifiable, and sharing it without authorization may constitute a HIPAA violation. Healthcare professionals who share photos or documents containing patient-related information need to ensure metadata does not reveal location, timing, or device details.
Consultants and professional services
Management consultants, accountants, and financial advisors frequently adapt documents from templates or previous engagements. The metadata — author name, company name, template path — can reveal the source of the document, the individual who prepared it, and sometimes the previous client the template was created for.
Academics and researchers
Researchers preparing manuscripts for double-blind peer review need to remove author identification from the file's metadata, not just from the text. A PDF submitted for anonymous review that still contains the author's name in the metadata defeats the purpose of the blind review process.
IT and security professionals
Development teams and security engineers working with document processing pipelines need systematic, automated metadata handling. Manual cleaning does not scale to hundreds or thousands of documents.
Anyone sharing photos
Anyone who shares photos via email, messaging apps, or platforms that do not automatically strip EXIF data is potentially sharing their GPS location, device information, and timestamps.
How to check for metadata
Quick check: Document properties
- Word/Excel/PowerPoint: File > Info > Properties panel on the right side. This shows author, company, dates, and word count.
- PDF (Adobe Reader): File > Properties > Description tab. Shows author, creator, producer, dates.
- macOS (any file): Right-click > Get Info. Expand "More Info" for metadata fields.
- Windows (images): Right-click > Properties > Details tab. Shows EXIF data including GPS if present.
Deeper check: Document Inspector (Word)
File > Info > Check for Issues > Inspect Document. This scans for comments, revisions, document properties, hidden content, and custom XML. It can remove what it finds, but does not verify the removal.
Comprehensive check: Dedicated tools
For thorough metadata inspection — including fields that quick checks miss, such as template paths, resolved comments, and embedded image EXIF data — a dedicated metadata scanning tool that reads the file at the structural level (XML for OOXML, objects for PDF, EXIF tags for images) provides more complete coverage.
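As a small example of what reading a file "at the structural level" means, the Python sketch below walks the segments at the start of a JPEG file and reports whether an EXIF block is present. It does not decode the EXIF tags themselves. The function name is hypothetical, and the sketch assumes a conventionally structured JPEG without padding bytes between segments.

```python
import struct


def has_exif_segment(jpeg_bytes):
    """Return True if the JPEG header contains an APP1 segment holding EXIF data."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing start-of-image marker)")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # not positioned on a marker; stop rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        # The two-byte length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        if marker == 0xE1 and jpeg_bytes[pos + 4:pos + 10] == b"Exif\x00\x00":
            return True
        pos += 2 + length
    return False
```

A file embedded in a Word document or PDF can be checked the same way once its bytes have been extracted, which is how embedded-image metadata gets found when the quick checks miss it.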
What to do about it
You have three options. The first two are quick but incomplete; the third is the most reliable:
Option 1: Manual inspection and removal
Use the built-in tools in your software (Document Inspector in Word, the Properties dialog in a PDF editor) to find and remove metadata. This works for most common fields but may miss some data, does not verify removal, and requires manual execution for every file. Note that view-only PDF readers generally display these properties without letting you change them.
Option 2: Save as a new format
Copying the visible content into a new document or saving as a different format (e.g., Word to PDF) can strip some metadata. However, the results are inconsistent: a PDF exported from Word often inherits properties such as title and author from the source document, and the conversion process may add new metadata (the PDF creator and producer fields).
Option 3: Scan, clean, and verify
Use a tool that reads the file at the structural level, identifies all metadata fields, removes or replaces them, and then re-scans the output to verify that the metadata is gone. This is the most reliable approach because it includes a verification step — confirmation that the cleanup actually worked.
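The scan, clean, and verify loop can be sketched in a few lines of Python. This is an illustration of the pattern under narrow assumptions, not a description of how any particular product works: it handles only the two author fields in a Word file's docProps/core.xml part, and all function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}
FIELDS = ["dc:creator", "cp:lastModifiedBy"]

# Keep the familiar prefixes when the XML is written back out.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def scan(docx_bytes):
    """Return the author fields that currently hold a non-empty value."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    values = {field: root.findtext(field, namespaces=NS) for field in FIELDS}
    return {field: value for field, value in values.items() if value}


def clean(docx_bytes):
    """Copy the package, blanking the author fields and leaving other parts as-is."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for field in FIELDS:
                    for element in root.findall(field, NS):
                        element.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item.filename, data)
    return out.getvalue()


def clean_and_verify(docx_bytes):
    """Clean, then re-scan the output and fail loudly if anything survived."""
    cleaned = clean(docx_bytes)
    leftovers = scan(cleaned)
    if leftovers:
        raise RuntimeError(f"metadata still present after cleaning: {leftovers}")
    return cleaned
```

The important design choice is that verification runs against the cleaned output rather than trusting the cleaning step. A real tool would cover many more fields and file types, but the shape of the loop is the same.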
Purgit scans PDFs, Word documents, Excel spreadsheets, PowerPoint presentations, and images for hidden metadata. It removes findings at the structural level and verifies removal by re-scanning the output. No expertise required — just upload your file.