Building a Document Sanitization Pipeline for Your Team
How to set up a repeatable document sanitization workflow for your team — from ad-hoc manual checks to automated pipelines with shared policies, audit logs, and CI/CD integration.
Why teams need a pipeline, not a tool
Individual professionals can scan documents one at a time. Open the file, run the check, download the clean version. For a solo practitioner who sends ten documents a week, this works.
Teams face a different problem. When five, ten, or fifty people in an organization share documents externally, the question is not "can we clean documents?" but "are we consistently cleaning every document, every time, to the same standard?"
The difference between a tool and a pipeline is the difference between a fire extinguisher and a sprinkler system. One requires someone to remember to use it. The other runs automatically.
This article walks through four maturity levels of document sanitization — from ad-hoc manual checks to fully automated pipelines — and explains what each level requires, what it protects against, and when to move to the next level.
Level 1: Ad-hoc manual scanning
Who this is for: Solo practitioners, small teams getting started, anyone evaluating the problem.
How it works:
- Before sending a document, the sender opens a scanning tool
- They upload the document and review the findings
- They download the cleaned version
- They send the cleaned version instead of the original
What this catches:
- Obvious metadata (author name, company, dates)
- Tracked changes and comments (if the sender remembers to check)
- GPS coordinates in images (if the sender uses an image-aware tool)
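To make the "obvious metadata" above concrete: a .docx file is a ZIP package, and author-type fields are stored in its `docProps/core.xml` part. The sketch below uses only the Python standard library to list a few of those fields. It illustrates what a scanner looks for and is not a description of how any particular tool works; the function name `read_core_properties` is hypothetical, and the company name (stored separately in `docProps/app.xml`) is not read here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Label we report -> element that carries the value.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx_bytes: bytes) -> dict:
    """Return the identifying core properties present in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))

    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling `read_core_properties(open("draft.docx", "rb").read())` on a typical Word file returns a dictionary of whichever of these fields are populated, which is exactly the information a recipient can read without any special tooling.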
What this misses:
- Documents sent in a hurry (the most dangerous ones)
- Files shared via drag-and-drop to cloud storage or messaging apps (bypasses the scanning step)
- Consistency across team members (one person strips metadata, another doesn't)
- Any record of what was cleaned or when
When to move to Level 2: When more than one person in your organization sends documents externally, or when you need to demonstrate compliance with a policy (HIPAA, legal ethics, client requirements).
Level 2: Shared policies and team standards
Who this is for: Teams of 2-20 who need consistent document handling.
How it works:
- A team admin defines a shared policy — a set of rules specifying what metadata must be removed
- All team members apply the same policy when scanning documents
- Policies are version-controlled and centrally managed
- Scan reports are retained (metadata only) as an audit trail
What this adds over Level 1:
- Consistency: Every team member applies the same rules. If the policy says "remove author, company, tracked changes, and GPS coordinates," every document is checked against that standard.
- Accountability: An audit log records who scanned what, when, and whether the scan passed. This is metadata only — no file content is stored — but it provides evidence that the process was followed.
- Policy governance: When the team's requirements change (a new client requires additional metadata removal, a regulation changes), the admin updates the policy once. All team members get the updated rules automatically.
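As an illustration of a "metadata only" audit record, the sketch below builds one log entry per scan. The field names, the `audit_entry` function, and the choice to store a SHA-256 fingerprint of the file are all assumptions made for this example, not the format of any specific product. The point is that the entry identifies who scanned what, when, under which policy version, and with what result, without retaining the document itself.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(user, file_bytes, policy_name, policy_version, passed, when=None):
    """Build one audit record. Only metadata is kept; file content is not stored."""
    when = when or datetime.now(timezone.utc)
    return {
        "user": user,
        # A fingerprint lets an auditor match the record to a file later
        # without the log ever containing the file's content.
        "file_sha256": hashlib.sha256(file_bytes).hexdigest(),
        "policy": f"{policy_name}@v{policy_version}",
        "passed": passed,
        "scanned_at": when.isoformat(),
    }


if __name__ == "__main__":
    entry = audit_entry("alice", b"example document bytes", "legal", 3, True)
    # One JSON object per line is easy to append to a log and to query later.
    print(json.dumps(entry))
```

Recording the policy version alongside the result matters for governance: when the admin updates the policy, older entries still show which rules a document was checked against.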
Implementation:
- Choose a tool that supports shared policies and team accounts
- Define your policy based on your organization's requirements:
  - Legal teams: Author, tracked changes, comments, revision history, redaction safety
  - Healthcare teams: GPS coordinates, device identifiers, timestamps, author identity
  - Consulting teams: Author, company name, template path, revision history, total editing time
- Train your team on the scanning workflow (this takes 5 minutes per person)
- Review the audit log monthly to verify adoption
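A shared policy can be thought of as a named, versioned set of fields that must not appear in an outgoing document. The sketch below models the three example policies from the list above in plain Python. The identifiers (`Policy`, `POLICIES`, `violations`, and the field names such as `unsafe_redactions`) are hypothetical and chosen for this illustration; a real tool will have its own policy format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    version: int
    must_remove: frozenset  # field identifiers that may not be present


# Hypothetical encodings of the example policies described in the article.
POLICIES = {
    "legal": Policy("legal", 1, frozenset({
        "author", "tracked_changes", "comments",
        "revision_history", "unsafe_redactions",
    })),
    "healthcare": Policy("healthcare", 1, frozenset({
        "gps_coordinates", "device_identifiers", "timestamps", "author",
    })),
    "consulting": Policy("consulting", 1, frozenset({
        "author", "company", "template_path",
        "revision_history", "total_editing_time",
    })),
}


def violations(policy, fields_present):
    """Return, sorted, the fields found in a document that the policy forbids."""
    return sorted(policy.must_remove & set(fields_present))


if __name__ == "__main__":
    found_in_document = ["author", "gps_coordinates"]
    for name, policy in POLICIES.items():
        print(name, violations(policy, found_in_document))
```

Because every team member evaluates documents against the same `Policy` object, the same document produces the same verdict no matter who scans it, which is the consistency property Level 2 is meant to provide.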
What this misses:
- Still relies on human behavior — someone has to remember to scan before sending
- Does not integrate with existing workflows (email, file sharing, document management)
- No automated enforcement
When to move to Level 3: When compliance requirements mandate that every external document is scanned (not just the ones people remember), or when your team processes more than 50 documents per week and manual scanning becomes a bottleneck.
Level 3: CLI integration and batch processing
Who this is for: IT teams, development teams, and organizations that want to embed scanning into existing workflows.
How it works:
- The scanning tool's CLI is installed on team machines or servers
- Scanning is integrated into existing workflows:
  - Check-in hooks in document management systems (or pre-commit hooks in version control)
  - Batch processing scripts that scan entire directories
  - CI/CD pipeline steps that scan documents before deployment
  - Email gateway hooks that scan attachments before sending
- Results are logged to a central audit system
Integration patterns:
Batch scanning a shared directory
    # Scan all documents in the outgoing directory
    purgit batch ./outgoing/ --policy legal --output ./clean/ --report ./reports/
    # Only send files from the clean directory
This pattern is useful for teams that stage outgoing documents in a shared folder. A scheduled script or manual command scans everything in the staging area, produces cleaned versions in a separate directory, and generates reports.
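For the scheduled variant of this pattern, a small wrapper can run the same command and report whether it succeeded. The sketch below only reuses the command and flags shown above; the function names are hypothetical, and treating exit status 0 as "batch completed" is an assumption that should be checked against the CLI's documentation.

```python
import subprocess


def batch_command(outgoing, clean, reports, policy="legal"):
    """Build the batch command as an argument list, so no shell quoting is needed."""
    return ["purgit", "batch", outgoing, "--policy", policy,
            "--output", clean, "--report", reports]


def run_batch(outgoing, clean, reports, policy="legal", runner=subprocess.run):
    """Run one batch scan and return True only if the command exits with status 0.

    Assumption: a zero exit status means the batch completed successfully.
    The runner parameter exists so the function can be exercised without the CLI.
    """
    completed = runner(batch_command(outgoing, clean, reports, policy), check=False)
    return completed.returncode == 0


if __name__ == "__main__":
    ok = run_batch("./outgoing/", "./clean/", "./reports/")
    print("batch scan completed" if ok else "batch scan failed; do not send")
```

A scheduler such as cron can invoke this script at a fixed interval, with senders instructed to attach files only from the clean directory.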
Pre-send scanning in a script
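    #!/bin/bash
    # scan-before-send.sh: scan a file before attaching it to an email
    # Usage: scan-before-send.sh <file>
    if [ "$#" -ne 1 ]; then
      echo "Usage: $0 <file>" >&2
      exit 2
    fi
    FILE="$1"
    # Scan against the shared policy and capture the JSON report
    RESULT=$(purgit scan "$FILE" --policy standard --json)
    # Extract the overall verification status from the report
    STATUS=$(echo "$RESULT" | jq -r '.verification.status')
    if [ "$STATUS" = "pass" ]; then
      echo "File is clean. Proceed to send."
    else
      # Any status other than "pass" (including a failed scan) blocks the send
      echo "Findings detected or scan failed. Run: purgit sanitize \"$FILE\"" >&2
      exit 1
    fi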
CI/CD pipeline integration
    # GitHub Actions step
    - name: Scan documents for metadata
      run: |
        npx @purgit/cli batch ./docs/ --policy standard --fail-on-findings
This blocks a deployment or release if any document under ./docs/ contains metadata that violates the policy.
What this adds over Level 2:
- Automation: Scanning happens as part of existing workflows, not as a separate manual step
- Batch processing: Hundreds of documents processed in minutes, not one at a time
- Enforcement: CI/CD integration can block releases that contain documents with metadata violations
- Integration: Fits into existing IT infrastructure (scripts, cron jobs, CI/CD, document management hooks)
What this misses:
- Requires technical setup (CLI installation, script writing, CI/CD configuration)
- Does not cover ad-hoc sharing (someone emails a document directly without going through the pipeline)
- API integration requires a paid tier
When to move to Level 4: When you need real-time scanning of every document that leaves the organization, regardless of how it's shared — email, cloud storage, messaging, API uploads.
Level 4: API-driven automated pipeline
Who this is for: Enterprise organizations, compliance-heavy industries, teams with dedicated IT/security staff.
How it works:
- The scanning tool's API is integrated into your organization's document handling infrastructure
- Every document that crosses an organizational boundary is automatically scanned
- Findings are logged to your SIEM or compliance platform
- Policies are managed centrally and applied consistently across all channels
- Reports and certificates are generated automatically and stored in your compliance archive
Architecture pattern:
    Document created (Word, PDF, etc.)
    → User initiates share (email, cloud upload, API transfer)
    → Organization's middleware intercepts the file
    → Middleware calls Purgit API: POST /api/v1/sanitize
    → API returns: clean file + report + certificate
    → Middleware replaces original with clean version
    → Share proceeds with clean file
    → Report and certificate logged to compliance system
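The flow above can be sketched as a single middleware function. Only the path `/api/v1/sanitize` comes from this article. Everything else here is an assumption made for illustration: the host name, the bearer-token header, the JSON request body, and the response keys `clean_file`, `report`, and `certificate` are hypothetical and should be replaced with whatever the real API reference specifies.

```python
import base64
import json
import urllib.request

# Hypothetical host; only the path is taken from the article.
API_URL = "https://purgit.example/api/v1/sanitize"


def http_post(url, body, headers):
    """Send a POST request and return the raw response body."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def intercept_share(file_bytes, policy, api_key, post=http_post, log=print):
    """Sanitize a file before it leaves the organization and return the clean bytes.

    Assumed request shape:  {"policy": ..., "file": <base64>}
    Assumed response shape: {"clean_file": <base64>, "report": ..., "certificate": ...}
    """
    payload = json.dumps({
        "policy": policy,
        "file": base64.b64encode(file_bytes).decode("ascii"),
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    reply = json.loads(post(API_URL, payload, headers))

    # Log the evidence, not the document, to the compliance system.
    log({"report": reply["report"], "certificate": reply["certificate"]})

    # The caller shares these bytes in place of the original file.
    return base64.b64decode(reply["clean_file"])
```

One design choice worth noting: the function does not catch network or parsing errors. If the API call fails, the exception propagates and the share does not proceed, so the pipeline fails closed rather than letting an unscanned original through.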
Integration points:
- Email gateway: Scan attachments before outbound delivery (Microsoft Exchange transport rules, Google Workspace DLP)
- Cloud storage: Scan files on upload to SharePoint, Google Drive, Box, or Dropbox (webhook trigger → API call → replace file)
- Document management system: Scan files on check-out or on export (iManage, NetDocuments, SharePoint integration)
- Custom applications: Any workflow that handles documents can call the API before sharing
What this adds over Level 3:
- Real-time, automatic scanning — no human action required
- Coverage of all sharing channels — email, cloud, DMS, API, messaging
- Centralized compliance evidence — every scan is logged, every report is archived
- Policy enforcement without user friction — users share documents normally; scanning happens transparently
Requirements:
- API access (Team or Enterprise tier)
- IT resources to build and maintain integrations
- Compliance team to define and manage policies
- Monitoring and alerting for failed scans or policy violations
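The monitoring requirement can start very simply: count non-passing scans per sharing channel and flag any channel that exceeds a threshold. The sketch below assumes each scan event is a dictionary with a `channel` and a `status` key, where any status other than `"pass"` counts as a failure; the event shape and the function name `channels_to_alert` are hypothetical.

```python
from collections import Counter


def channels_to_alert(events, max_failures=0):
    """Return, sorted, the channels whose non-passing scans exceed max_failures."""
    failures = Counter(
        event["channel"] for event in events if event["status"] != "pass"
    )
    return sorted(
        channel for channel, count in failures.items() if count > max_failures
    )


if __name__ == "__main__":
    recent = [
        {"channel": "email", "status": "pass"},
        {"channel": "email", "status": "error"},
        {"channel": "cloud", "status": "findings"},
        {"channel": "cloud", "status": "findings"},
        {"channel": "dms", "status": "pass"},
    ]
    # With a tolerance of one failure per channel, only "cloud" is flagged.
    print(channels_to_alert(recent, max_failures=1))
```

In practice the returned list would feed whatever alerting system the organization already uses. Failed scans deserve attention as much as policy violations do, since a scan that errors out means a document was never checked.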
Choosing your starting point
Not every organization needs Level 4. The right starting point depends on your risk profile and team size.
| Factor | Level 1 | Level 2 | Level 3 | Level 4 |
|--------|---------|---------|---------|---------|
| Team size | 1-3 | 2-20 | 5-100 | 20+ |
| Documents shared externally per week | < 10 | 10-50 | 50-500 | 500+ |
| Regulatory requirements | Low | Moderate | High | Strict |
| IT resources available | None | Minimal | Some | Dedicated |
| Compliance audit exposure | Low | Moderate | High | Mandatory |
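As a rough way to apply the table, the sketch below maps two of its rows (weekly document volume and regulatory requirements) to a suggested starting level and takes the higher of the two. The ranges in the table overlap at their edges, so the exact cut-offs used here (50 documents still counts as Level 2, 500 counts as Level 4) are a judgment call made for this example, and the function names are hypothetical.

```python
# Regulatory requirement row of the table, mapped to a level.
REGULATORY_LEVEL = {"low": 1, "moderate": 2, "high": 3, "strict": 4}


def level_for_volume(docs_per_week):
    """Map weekly external document volume to a level, following the table's ranges."""
    if docs_per_week < 10:
        return 1
    if docs_per_week <= 50:
        return 2
    if docs_per_week < 500:
        return 3
    return 4


def suggest_level(docs_per_week, regulatory):
    """Suggest the lowest level that satisfies both volume and regulatory pressure."""
    return max(level_for_volume(docs_per_week), REGULATORY_LEVEL[regulatory])


if __name__ == "__main__":
    # A small practice with few documents but high regulatory exposure.
    print(suggest_level(5, "high"))
```

Taking the maximum reflects the advice in this section: volume alone may point to Level 1, but a high regulatory burden still calls for the enforcement that a higher level provides.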
Start at the lowest level that addresses your risk. A solo practitioner at Level 1 who scans every document is better protected than an enterprise at Level 4 with a misconfigured pipeline.
Building incrementally
The advantage of a pipeline-oriented approach is that each level builds on the previous one:
- Level 1 → Level 2: Add shared policies and an audit log. No infrastructure changes required — just upgrade to a team plan and define your policy.
- Level 2 → Level 3: Install the CLI and write a batch processing script. Start with the highest-risk document category (client deliverables, regulatory submissions, clinical photos) and expand.
- Level 3 → Level 4: Add API integration to one sharing channel at a time. Start with outbound email (highest volume, highest risk), then expand to cloud storage and document management.
Each level reduces risk. Each level produces better compliance evidence. Each level requires less human discipline and more systematic enforcement.
The goal is not perfection from day one. The goal is a process that improves over time and catches the documents that human attention misses.
Purgit supports every level of this pipeline. Web app for individual scanning. Shared policies and audit logs for team coordination. CLI for batch processing and CI/CD integration. REST API for automated pipelines. Start where you are, grow as you need.