What Purgit does to your documents.

Every file goes through the same five-stage pipeline: parse, scan, plan, transform, verify. Here is exactly what happens at each stage.

Five stages. Every document. Every time.

Stage 01

Parse

Your document is read and its internal structure is mapped. Purgit doesn't just look at what you see — it reads the layers underneath.

A PDF has metadata fields, text objects, annotations, attachments, and revision history. A Word document is actually a ZIP archive containing dozens of XML files. A JPEG photo carries EXIF data recording GPS coordinates, camera settings, and timestamps. Purgit opens these layers and catalogs what's there.
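To make the "ZIP archive of XML files" point concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the container format, not Purgit's parser. The sample archive is a tiny stand-in built in memory, and the helper names are hypothetical.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of every part stored inside a .docx (ZIP) container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def build_sample_docx() -> bytes:
    """Build a tiny stand-in archive with a few parts that a real .docx contains."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("docProps/core.xml", "<coreProperties/>")
        archive.writestr("docProps/app.xml", "<Properties/>")
    return buffer.getvalue()


parts = list_parts(build_sample_docx())
for name in parts:
    print(name)
```

Running the same `list_parts` helper on a real Word file typically shows many more parts (styles, settings, comments, themes), which is why a catalog step comes before any scanning.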

Stage 02

Scan

Each element of your document is checked against a policy — a set of rules that define what constitutes a finding.

A finding is anything that could expose information you didn't intend to share: an author name in a PDF's metadata, GPS coordinates in a photo's EXIF data, tracked changes showing deleted text in a Word document. Every finding is categorized by type, assigned a severity level, and explained in plain language.
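One way to picture a finding and a policy is the sketch below. It assumes a policy is a lookup from field location to a rule; the field names, severity labels, and the `Finding` and `scan` names are illustrative assumptions, not Purgit's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    kind: str         # finding type, e.g. "author" or "gps"
    severity: str     # e.g. "low", "medium", "high" (labels are assumed)
    location: str     # where in the document the value lives
    value: str        # the value that would be exposed
    explanation: str  # plain-language reason it matters


# Hypothetical policy: field location -> (type, severity, explanation)
POLICY = {
    "pdf.info.Author": ("author", "medium", "Reveals who wrote the document."),
    "exif.GPSLatitude": ("gps", "high", "Reveals where the photo was taken."),
}


def scan(elements: dict[str, str], policy: dict = POLICY) -> list[Finding]:
    """Check each cataloged element against the policy and collect findings."""
    findings = []
    for location, value in elements.items():
        rule = policy.get(location)
        if rule and value:  # only non-empty fields covered by a rule count
            kind, severity, explanation = rule
            findings.append(Finding(kind, severity, location, value, explanation))
    return findings


elements = {"pdf.info.Author": "J. Doe", "pdf.info.Title": "Q3 plan"}
findings = scan(elements)
for finding in findings:
    print(finding.severity, finding.location, finding.explanation)
```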

Stage 03

Plan

Before anything is changed, Purgit builds a transform plan — a list of exactly what will be modified in your document.

You see the plan before any changes are applied. The plan respects your policy. If your policy says "remove GPS but keep camera model," the plan reflects that. If your policy says "randomize author name," the plan shows the synthetic value that will replace it. Nothing changes until you approve the plan.
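The two policy examples above can be sketched as a small planning function. This is a simplified model under stated assumptions: a policy maps a finding type to an action, unlisted types are left untouched, and the synthetic author name is supplied by the caller. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanStep:
    location: str
    action: str                      # "remove", "keep", or "replace"
    new_value: Optional[str] = None  # shown to the user before approval


# Hypothetical policy: "remove GPS but keep camera model", "randomize author name"
POLICY = {"gps": "remove", "camera_model": "keep", "author": "replace"}


def build_plan(findings, policy, synthetic):
    """Turn (finding type, location) pairs into reviewable plan steps."""
    plan = []
    for kind, location in findings:
        action = policy.get(kind, "keep")  # assumption: unlisted types are kept
        new_value = synthetic.get(kind) if action == "replace" else None
        plan.append(PlanStep(location, action, new_value))
    return plan


findings = [
    ("gps", "exif.GPSLatitude"),
    ("camera_model", "exif.Model"),
    ("author", "pdf.info.Author"),
]
plan = build_plan(findings, POLICY, synthetic={"author": "Alex Morgan"})
for step in plan:
    print(step)
```

Because the plan is plain data, it can be displayed for approval before any file is touched.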

Stage 04

Transform

The planned changes are applied to a copy of your document. Your original file is never modified.

Transforms include:

  • Removing metadata fields
  • Stripping GPS coordinates from photos
  • Accepting and purging tracked changes from Word documents
  • Removing comment threads, speaker notes, and hidden slides
  • Clearing revision history and editing time
  • Randomizing metadata with plausible synthetic values (Pro)

Every transform is deterministic.
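A rough sketch of what "applied to a copy" and "deterministic" can mean in practice, modeling a document's metadata as a dictionary. The plan format and the `apply_plan` name are assumptions made for illustration; the real engine works on file structures, not dictionaries.

```python
import copy
from typing import Optional


def apply_plan(metadata: dict, plan: list[tuple[str, str, Optional[str]]]) -> dict:
    """Apply (location, action, new_value) steps to a copy of the metadata."""
    output = copy.deepcopy(metadata)  # the input dictionary is never mutated
    for location, action, new_value in plan:
        if action == "remove":
            output.pop(location, None)
        elif action == "replace":
            output[location] = new_value
    return output


original = {"author": "J. Doe", "gps": "40.4461,-79.9822", "camera_model": "X100"}
plan = [("gps", "remove", None), ("author", "replace", "Alex Morgan")]

first = apply_plan(original, plan)
second = apply_plan(original, plan)
print(first)
```

In this model, the same input and the same plan always give the same output, and the original is unchanged afterwards.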

Stage 05

Verify

After transformation, Purgit re-scans the output file using the same rules that identified the original findings.

This is the step that sets Purgit apart from tools that strip metadata and stop there. If a finding persists after transformation, it's flagged in the report. If all findings are resolved, the verification passes. Verification is not optional: every document that goes through Purgit is verified.
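The re-scan idea can be sketched in a few lines: run the same rule set over the input and the output, then report each original finding as resolved or remaining. The rule set, field names, and function names below are hypothetical.

```python
# Hypothetical rule set: the fields this policy treats as findings.
RULES = {"exif.GPSLatitude", "exif.GPSLongitude", "xmp.creator"}


def scan(metadata: dict[str, str]) -> set[str]:
    """Return the fields in this document that match a rule."""
    return {field for field in metadata if field in RULES}


def verify(original: dict[str, str], output: dict[str, str]) -> dict[str, bool]:
    """Map every original finding to True if the output no longer contains it."""
    before = scan(original)
    after = scan(output)  # same rules, applied to the transformed file
    return {field: field not in after for field in sorted(before)}


original = {
    "exif.GPSLatitude": "40.4461",
    "exif.GPSLongitude": "-79.9822",
    "xmp.creator": "J. Doe",
    "exif.Model": "X100",
}
# Simulated output in which the XMP creator survived the transform.
output = {"xmp.creator": "J. Doe", "exif.Model": "X100"}

results = verify(original, output)
passed = all(results.values())
print(results, "PASS" if passed else "FAIL")
```

Here the verification fails because one finding persists, which is the case the report is meant to surface.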

Deep format support. Not surface cleaning.

PDF

Purgit reads PDF files at the object level, not just the surface metadata panel.

  • Document properties (author, title, subject, keywords, creator, producer)
  • Creation and modification timestamps
  • Annotations (comments, highlights, sticky notes)
  • Embedded file attachments
  • JavaScript actions
  • Form field data
  • Incremental save history (revision snapshots)
  • Text under visual overlays (redaction safety check)
  • XMP metadata streams
  • Tool attribution strings (e.g., "Created with ChatGPT")

Redaction verification: Purgit checks whether text content exists beneath visual redaction rectangles. A black box drawn over text in a PDF does not remove the text — it only hides it visually. Purgit identifies this pattern and flags it.
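At its core, this kind of check is a geometry test. The sketch below assumes the text spans and the filled rectangles have already been extracted with bounding boxes in the same page coordinate space; that extraction step is omitted, and the names are hypothetical rather than Purgit's own.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def text_under_redactions(text_spans, redactions):
    """Return the text of every span that sits beneath a redaction rectangle."""
    return [
        text
        for text, box in text_spans
        if any(overlaps(box, rect) for rect in redactions)
    ]


text_spans = [
    ("SSN 123-45-6789", Box(100, 700, 220, 712)),
    ("Public summary", Box(100, 650, 200, 662)),
]
redactions = [Box(95, 695, 230, 715)]  # a black rectangle drawn over the first span

hidden = text_under_redactions(text_spans, redactions)
print(hidden)
```

Any text returned here is still present in the file and can be copied out, even though the page shows only a black box.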

Word (.docx)

Word documents are ZIP archives containing XML files. Purgit unpacks the archive and scans each XML part individually.

  • Author and last-modified-by (docProps/core.xml)
  • Company name (docProps/app.xml)
  • Application name and version (docProps/app.xml)
  • Template path — often contains internal network paths or personal folder names
  • Tracked changes — insertions, deletions, and formatting changes with author attribution
  • Comment threads — including deleted comments that persist in the XML
  • Revision count and total editing time
  • Custom document properties
  • Embedded objects and macros

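Beyond "Accept All Changes": Accepting tracked changes in Word resolves the visible insertions and deletions, but other revision data can persist in the document's XML structure: revision identifiers on paragraphs and runs, comment threads, the revision count, and total editing time. Purgit removes the structural data, not just the visual indicators.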
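For a sense of what tracked changes look like inside the XML, the sketch below parses a small hand-written fragment in the style of word/document.xml. The `w:ins`, `w:del`, `w:delText`, and `w:author` names come from the WordprocessingML format; the sample content and the `tracked_changes` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body><w:p>
    <w:ins w:id="1" w:author="Alice" w:date="2024-01-01T00:00:00Z">
      <w:r><w:t>new clause</w:t></w:r>
    </w:ins>
    <w:del w:id="2" w:author="Bob" w:date="2024-01-02T00:00:00Z">
      <w:r><w:delText>old price: 500</w:delText></w:r>
    </w:del>
  </w:p></w:body>
</w:document>"""


def tracked_changes(document_xml: str) -> list[tuple[str, str, str]]:
    """Return (change type, author, text) for each tracked insertion or deletion."""
    root = ET.fromstring(document_xml)
    changes = []
    # Inserted text lives in w:t; deleted text is kept in w:delText.
    for tag, text_tag in (("ins", "t"), ("del", "delText")):
        for node in root.iter(f"{{{W}}}{tag}"):
            author = node.get(f"{{{W}}}author", "")
            text = "".join(t.text or "" for t in node.iter(f"{{{W}}}{text_tag}"))
            changes.append((tag, author, text))
    return changes


changes = tracked_changes(SAMPLE)
for change in changes:
    print(change)
```

Note that the deleted text and both author names are readable in the file even though a reader viewing the final document would not see them.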

Excel (.xlsx)

  • Author and last-modified-by
  • Company name
  • Named ranges (which may reveal internal data model names)
  • Embedded chart metadata
  • Formula references to external files
  • Custom document properties
  • Hidden sheets

PowerPoint (.pptx)

  • Author and last-modified-by
  • Company name
  • Comment threads on slides
  • Speaker notes (often contain talking points not intended for the audience)
  • Hidden slides
  • Embedded media metadata
  • Custom document properties

Images (JPEG, PNG, HEIC)

  • GPS coordinates (latitude, longitude, altitude)
  • Camera make and model
  • Lens model
  • ISO, aperture, shutter speed
  • Firmware and software version
  • Capture and modification timestamps
  • IPTC author and copyright fields
  • XMP metadata
  • Device serial number
  • Thumbnail images (which may contain uncropped versions)

GPS precision: Modern smartphone GPS records coordinates accurate to within a few meters. A photo shared without stripping EXIF data reveals where it was taken — your home, your office, your client's location.
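EXIF stores latitude and longitude as three rational numbers (degrees, minutes, seconds) plus a hemisphere reference. The sketch below shows the conversion to decimal degrees; the sample coordinates are arbitrary and the function name is hypothetical.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref: str) -> float:
    """Convert EXIF-style ((num, den), (num, den), (num, den)) to decimal degrees."""
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are stored as positive values with a reference letter.
    return float(-value if ref in ("S", "W") else value)


# Arbitrary sample values: 40 deg 26' 46" N, 79 deg 58' 56" W
latitude = dms_to_decimal(((40, 1), (26, 1), (46, 1)), "N")
longitude = dms_to_decimal(((79, 1), (58, 1), (56, 1)), "W")
print(round(latitude, 6), round(longitude, 6))
```

One arc-second of latitude is roughly 31 meters on the ground, so coordinates stored with fractional seconds are enough to single out an individual building.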

Every scan produces a report. Every report shows exactly what changed.

Scan Report

  • Every finding identified (type, severity, field location, current value)
  • The policy applied (name, version, rule count)
  • The transform plan (what was changed and how)
  • Verification result (pass/fail for each original finding)
  • Input file hash (SHA-256) and output file hash (SHA-256)
  • Engine version and timestamp

Report formats: HTML, JSON

Free tier reports include Purgit branding in the footer. Pro and Team reports are unbranded.

Verification Certificate

Pro+
  • Input and output file SHA-256 hashes
  • Policy ID and version applied
  • Engine version and timestamp (ISO 8601)
  • List of transform steps applied
  • List of resolved and remaining findings
  • Pass/fail status

The certificate documents that processing occurred and lets you confirm, by recomputing the output file's SHA-256 hash, that the file has not been modified since processing. It does not claim to prove authenticity or legal compliance; it is a processing record.
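Checking a file against a recorded hash takes only a few lines. The sketch below uses Python's standard hashlib; the certificate is modeled as a plain dictionary, and its field names are assumptions, not the actual certificate schema.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def unchanged_since_processing(file_bytes: bytes, certificate: dict) -> bool:
    """Compare the file's current hash with the output hash on the certificate."""
    return sha256_hex(file_bytes) == certificate["output_sha256"]


clean_file = b"sanitized document bytes"
# Hypothetical certificate fields, for illustration only.
certificate = {"output_sha256": sha256_hex(clean_file), "status": "pass"}

still_intact = unchanged_since_processing(clean_file, certificate)
after_edit = unchanged_since_processing(clean_file + b" edited", certificate)
print(still_intact, after_edit)
```

Any change to the file, even a single byte, produces a different hash, so a match shows the file is the one that was processed and nothing more.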

Policies define what matters to you.

A policy is a named set of rules that define which findings matter for your use case. Different professionals care about different things.

Standard

Removes all common metadata fields across all formats

All tiers

Legal

Focused on author attribution, revision history, tracked changes, and redaction safety

All tiers

Healthcare

Focused on GPS coordinates, device identifiers, timestamps, and IPTC fields

All tiers

Academic

Focused on author identity for double-blind review preparation

All tiers

Custom Policies

Pro+

Select which rules to include or exclude. Save your policy for reuse across all your documents.
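Conceptually, a custom policy can be thought of as a built-in rule set with rules added or removed, saved in a reusable form. The sketch below models that with sets and JSON; the rule identifiers and the `customize` helper are hypothetical and do not reflect Purgit's policy file format.

```python
import json

# Hypothetical rule identifiers standing in for a built-in policy.
STANDARD_RULES = frozenset({"author", "gps", "timestamps", "camera_model"})


def customize(base_rules, include=(), exclude=()) -> frozenset:
    """Start from a base rule set, add extra rules, then drop excluded ones."""
    return (frozenset(base_rules) | frozenset(include)) - frozenset(exclude)


# Example: keep the camera model, and also treat lens model as a finding.
custom = customize(STANDARD_RULES, include={"lens_model"}, exclude={"camera_model"})

# A sorted list serializes to the same text every time, which suits saving and sharing.
saved = json.dumps({"name": "photo-portfolio", "rules": sorted(custom)})
print(saved)
```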

Shared Policies

Team+

Team admins define org-wide policies. Every team member applies the same standards. Changes propagate automatically.

Your files stay on your device.

On the free tier and by default on paid plans, Purgit processes your documents entirely in your browser. The scan engine is compiled to WebAssembly and runs locally. Your files are never uploaded to our servers unless you explicitly opt into cloud batch processing.

This is not a marketing claim — it is the architecture. The free tier has no server-side file processing capability. There is no upload endpoint for free tier users.

  • Legal professionals processing privileged documents cannot risk cloud exposure
  • Healthcare professionals handling PHI indicators need local-only guarantees
  • Compliance officers evaluating tools need verifiable architecture, not promises

Works where you work.

CLI

Available via npm for developers and power users. Runs locally — no files are uploaded. Supports all formats. Available on macOS, Linux, and Windows.

npx @purgit/cli scan document.pdf
npx @purgit/cli sanitize document.pdf --policy standard
npx @purgit/cli batch ./documents/ --policy legal --output ./clean/
npx @purgit/cli verify document.pdf --hash abc123...

REST API

Team+

Programmatic access for integration into existing workflows. OpenAPI 3.1 documentation is generated automatically. SDK clients are available for TypeScript, Python, and Go.

POST /api/v1/scan
POST /api/v1/sanitize
GET  /api/v1/verify/:hash
GET  /api/v1/policies

Team tier: 1,000 API calls/month. Enterprise: unlimited.
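To plan against the Team tier limit, a back-of-the-envelope calculation helps. The sketch below assumes each document costs one call per endpoint used; how calls are actually metered is not stated here, so treat the numbers as illustrative.

```python
TEAM_MONTHLY_CALLS = 1_000  # Team tier limit stated above

# Assumption: one API call per endpoint per document.
WORKFLOWS = {
    "scan only": ["POST /api/v1/scan"],
    "scan + sanitize": ["POST /api/v1/scan", "POST /api/v1/sanitize"],
    "scan + sanitize + verify": [
        "POST /api/v1/scan",
        "POST /api/v1/sanitize",
        "GET /api/v1/verify/:hash",
    ],
}


def documents_per_month(calls_per_document: int, budget: int = TEAM_MONTHLY_CALLS) -> int:
    """How many whole documents fit in the monthly call budget."""
    return budget // calls_per_document


capacity = {name: documents_per_month(len(calls)) for name, calls in WORKFLOWS.items()}
for name, documents in capacity.items():
    print(f"{name}: {documents} documents/month")
```

Under that assumption, a full scan, sanitize, and verify workflow fits about 333 documents a month on the Team tier.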

See it in action.

Upload a document. See what's hidden. Download it clean.