Building a Document Sanitization Pipeline for Your Team

Why teams need a pipeline, not a tool

Individual professionals can scan documents one at a time. Open the file, run the check, download the clean version. For a solo practitioner who sends ten documents a week, this works.

Teams face a different problem. When five, ten, or fifty people in an organization share documents externally, the question is not "can we clean documents?" but "are we consistently cleaning every document, every time, to the same standard?"

The difference between a tool and a pipeline is the difference between a fire extinguisher and a sprinkler system. One requires someone to remember to use it. The other runs automatically.

This article walks through four maturity levels of document sanitization — from ad-hoc manual checks to fully automated pipelines — and explains what each level requires, what it protects against, and when to move to the next level.

Level 1: Ad-hoc manual scanning

Who this is for: Solo practitioners, small teams getting started, anyone evaluating the problem.

How it works:

Before sending a document, the sender opens a scanning tool
They upload the document and review the findings
They download the cleaned version
They send the cleaned version instead of the original

What this catches:

Obvious metadata (author name, company, dates)
Tracked changes and comments (if the sender remembers to check)
GPS coordinates in images (if the sender uses an image-aware tool)

What this misses:

Documents sent in a hurry (the most dangerous ones)
Files shared via drag-and-drop to cloud storage or messaging apps (bypasses the scanning step)
Inconsistent standards across team members (one person strips metadata, another doesn't)
No record of what was cleaned or when

When to move to Level 2: When more than one person in your organization sends documents externally, or when you need to demonstrate compliance with an external requirement (HIPAA, legal ethics rules, client contracts).

Level 2: Shared policies and team standards

Who this is for: Teams of 2-20 who need consistent document handling.

How it works:

A team admin defines a shared policy — a set of rules specifying what metadata must be removed
All team members apply the same policy when scanning documents
Policies are version-controlled and centrally managed
Scan reports are retained (metadata only) as an audit trail
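
To make the idea of a shared policy concrete, here is a minimal Bash sketch. It assumes a policy is nothing more than a list of metadata field names that must not appear in an outgoing document. The field names, the `POLICY_FIELDS` variable, and the `policy_violations` function are illustrative assumptions, not part of any particular tool.

```bash
#!/usr/bin/env bash
# Hypothetical shared policy: metadata fields that must be removed before sharing.
# In practice this list would live in one centrally managed, version-controlled file.
POLICY_FIELDS="author company tracked_changes gps_coordinates"

# policy_violations FIELD...
# Prints, one per line, each field found in a document that the policy forbids.
policy_violations() {
  local field
  for field in "$@"; do
    case " $POLICY_FIELDS " in
      *" $field "*) printf '%s\n' "$field" ;;
    esac
  done
}

# Example with made-up scanner findings: prints "author" and "gps_coordinates".
policy_violations author title gps_coordinates
```

Because every team member evaluates findings against the same list, changing the policy means editing one file rather than retraining each person.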

What this adds over Level 1:

Consistency: Every team member applies the same rules. If the policy says "remove author, company, tracked changes, and GPS coordinates," every document is checked against that standard.
Accountability: An audit log records who scanned what, when, and whether the scan passed. This is metadata only — no file content is stored — but it provides evidence that the process was followed.
Policy governance: When the team's requirements change (a new client requires additional metadata removal, a regulation changes), the admin updates the policy once. All team members get the updated rules automatically.
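
A metadata-only audit entry could look something like the sketch below. The tab-separated layout, the `audit_record` function name, and the choice to identify the file by a SHA-256 digest are assumptions made for illustration; the point is only that the record says who scanned what, when, and with what result, without storing any file content.

```bash
#!/usr/bin/env bash
# audit_record USER FILE STATUS
# Prints one tab-separated audit line: UTC timestamp, user, file name,
# SHA-256 digest of the file, and scan status. No file content is recorded.
audit_record() {
  local user="$1" file="$2" status="$3" digest
  [ -r "$file" ] || return 1
  # Use sha256sum where available, otherwise fall back to shasum (e.g. on macOS).
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum < "$file" | cut -d' ' -f1)
  else
    digest=$(shasum -a 256 < "$file" | cut -d' ' -f1)
  fi
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$user" "$(basename -- "$file")" "$digest" "$status"
}

# Usage sketch (hypothetical names): append one line per scan to a shared log.
# audit_record "alice" "./outgoing/contract.docx" "pass" >> ./audit.tsv
```

The digest lets a reviewer confirm later that the file that was sent is the file that was scanned, which is one reasonable way to get evidence without retaining documents.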

Implementation:

- Choose a tool that supports shared policies and team accounts
- Define your policy based on your organization's requirements:
  - Legal teams: Author, tracked changes, comments, revision history, redaction safety
  - Healthcare teams: GPS coordinates, device identifiers, timestamps, author identity
  - Consulting teams: Author, company name, template path, revision history, total editing time
- Train your team on the scanning workflow (this takes 5 minutes per person)
- Review the audit log monthly to verify adoption

What this misses:

Still relies on human behavior — someone has to remember to scan before sending
Does not integrate with existing workflows (email, file sharing, document management)
No automated enforcement

When to move to Level 3: When compliance requirements mandate that every external document be scanned (not just the ones people remember), or when your team processes more than 50 documents per week and manual scanning becomes a bottleneck.

Level 3: CLI integration and batch processing

Who this is for: IT teams, development teams, and organizations that want to embed scanning into existing workflows.

How it works:

- The scanning tool's CLI is installed on team machines or servers
- Scanning is integrated into existing workflows:
  - Pre-commit or check-in hooks in version control and document management systems
  - Batch processing scripts that scan entire directories
  - CI/CD pipeline steps that scan documents before deployment
  - Email gateway hooks that scan attachments before sending
- Results are logged to a central audit system

Integration patterns:

Batch scanning a shared directory

# Scan all documents in the outgoing directory
purgit batch ./outgoing/ --policy legal --output ./clean/ --report ./reports/

# Only send files from the clean directory

This pattern is useful for teams that stage outgoing documents in a shared folder. A scheduled script or manual command scans everything in the staging area, produces cleaned versions in a separate directory, and generates reports.
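
The rule "only send files from the clean directory" can itself be enforced in a script rather than left to memory. The following is a sketch under stated assumptions: `is_cleaned_copy` is a hypothetical helper, the directory layout mirrors the batch example, and `realpath` (GNU coreutils or a recent macOS) is assumed to be available.

```bash
#!/usr/bin/env bash
# is_cleaned_copy FILE CLEAN_DIR
# Succeeds only if FILE resolves to a path inside CLEAN_DIR.
# Resolving both paths first means "../" segments and symlinks are not trusted.
is_cleaned_copy() {
  local file_path clean_root
  file_path=$(realpath -- "$1") || return 1
  clean_root=$(realpath -- "$2") || return 1
  case "$file_path" in
    "$clean_root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch: refuse to attach anything that is not a cleaned copy.
# is_cleaned_copy "./clean/report.docx" "./clean" && echo "OK to send"
```

A send script that calls this check first turns the staging convention into something the tooling verifies.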

Pre-send scanning in a script

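#!/bin/bash
# scan-before-send.sh: wrapper that scans a file before it is attached to an email

# Require exactly one argument: the file to scan
if [ "$#" -ne 1 ]; then
  echo "Usage: $0 <file>" >&2
  exit 2
fi

FILE="$1"
# Run the scan and capture the JSON report
RESULT=$(purgit scan "$FILE" --policy standard --json)
# Extract the verification status from the report
STATUS=$(echo "$RESULT" | jq -r '.verification.status')

if [ "$STATUS" = "pass" ]; then
  echo "File is clean. Proceed to send."
else
  # Any status other than "pass" (including a failed scan) blocks the send
  echo "Findings detected. Run: purgit sanitize $FILE"
  exit 1
fi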

CI/CD pipeline integration

# GitHub Actions step
- name: Scan documents for metadata
  run: |
    npx @purgit/cli batch ./docs/ --policy standard --fail-on-findings

This blocks a deployment or release if any document in the repository contains metadata that violates the policy.

What this adds over Level 2:

Automation: Scanning happens as part of existing workflows, not as a separate manual step
Batch processing: Hundreds of documents processed in minutes, not one at a time
Enforcement: CI/CD integration can block releases that contain documents with metadata violations
Integration: Fits into existing IT infrastructure (scripts, cron jobs, CI/CD, document management hooks)

What this misses:

Requires technical setup (CLI installation, script writing, CI/CD configuration)
Does not cover ad-hoc sharing (someone emails a document directly without going through the pipeline)
API integration requires a paid tier

When to move to Level 4: When you need real-time scanning of every document that leaves the organization, regardless of how it's shared — email, cloud storage, messaging, API uploads.

Level 4: API-driven automated pipeline

Who this is for: Enterprise organizations, compliance-heavy industries, teams with dedicated IT/security staff.

How it works:

The scanning tool's API is integrated into your organization's document handling infrastructure
Every document that crosses an organizational boundary is automatically scanned
Findings are logged to your SIEM or compliance platform
Policies are managed centrally and applied consistently across all channels
Reports and certificates are generated automatically and stored in your compliance archive

Architecture pattern:

Document created (Word, PDF, etc.)
  → User initiates share (email, cloud upload, API transfer)
  → Organization's middleware intercepts the file
  → Middleware calls Purgit API: POST /api/v1/sanitize
  → API returns: clean file + report + certificate
  → Middleware replaces original with clean version
  → Share proceeds with clean file
  → Report and certificate logged to compliance system
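
The middleware step in the flow above can be sketched in Bash as follows. This is an illustration, not the actual integration: `intercept_share` is a hypothetical function, the sanitizer is passed in as a placeholder for the real API call (whose request and response format are not specified here), and the fail-closed behavior is a design assumption.

```bash
#!/usr/bin/env bash
# intercept_share FILE SANITIZER LOG
# SANITIZER stands in for the real API call: any command or function that
# reads the original from $1 and writes a clean version to $2.
intercept_share() {
  local file="$1" sanitizer="$2" log="$3" clean stamp
  stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  clean=$(mktemp) || return 1
  if "$sanitizer" "$file" "$clean"; then
    cat -- "$clean" > "$file"   # replace the original in place with the clean version
    rm -f -- "$clean"
    printf '%s\tsanitized\t%s\n' "$stamp" "$(basename -- "$file")" >> "$log"
  else
    rm -f -- "$clean"
    printf '%s\tblocked\t%s\n' "$stamp" "$(basename -- "$file")" >> "$log"
    return 1                    # fail closed: the share must not proceed
  fi
}
```

Failing closed means a sanitizer outage delays a share instead of letting an unscanned file through. Whether that trade-off is right depends on how time-critical your outbound documents are.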

Integration points:

Email gateway: Scan attachments before outbound delivery (Microsoft Exchange transport rules, Google Workspace DLP)
Cloud storage: Scan files on upload to SharePoint, Google Drive, Box, or Dropbox (webhook trigger → API call → replace file)
Document management system: Scan files on check-out or on export (iManage, NetDocuments, SharePoint integration)
Custom applications: Any workflow that handles documents can call the API before sharing

What this adds over Level 3:

Real-time, automatic scanning — no human action required
Coverage of every integrated sharing channel: email, cloud storage, DMS, API, and any messaging platform you connect
Centralized compliance evidence — every scan is logged, every report is archived
Policy enforcement without user friction — users share documents normally; scanning happens transparently

Requirements:

API access (Team or Enterprise tier)
IT resources to build and maintain integrations
Compliance team to define and manage policies
Monitoring and alerting for failed scans or policy violations
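
For the monitoring requirement, a simple starting point is a scheduled check over the scan log. The sketch below assumes a tab-separated log whose second column holds a status such as `sanitized`, `blocked`, or `fail`. That format and both function names are assumptions for illustration, so adapt them to whatever your compliance platform actually records.

```bash
#!/usr/bin/env bash
# count_failures LOG
# Prints how many log entries have "blocked" or "fail" in the second column.
count_failures() {
  awk -F'\t' '$2 == "blocked" || $2 == "fail" { n++ } END { print n + 0 }' "$1"
}

# alert_if_failures LOG THRESHOLD
# Returns non-zero and writes to stderr when failures exceed THRESHOLD,
# so a cron job or monitoring agent can raise an alert.
alert_if_failures() {
  local count
  count=$(count_failures "$1")
  if [ "$count" -gt "$2" ]; then
    echo "ALERT: $count failed or blocked scans in $1" >&2
    return 1
  fi
}
```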

Choosing your starting point

Not every organization needs Level 4. The right starting point depends on your risk profile and team size.

| Factor | Level 1 | Level 2 | Level 3 | Level 4 |
|--------|---------|---------|---------|---------|
| Team size | 1-3 | 2-20 | 5-100 | 20+ |
| Documents shared externally per week | < 10 | 10-50 | 50-500 | 500+ |
| Regulatory requirements | Low | Moderate | High | Strict |
| IT resources available | None | Minimal | Some | Dedicated |
| Compliance audit exposure | Low | Moderate | High | Mandatory |

Start at the lowest level that addresses your risk. A solo practitioner at Level 1 who scans every document is better protected than an enterprise at Level 4 with a misconfigured pipeline.
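
As a rough aid, the document-volume row of the table can be turned into a small helper. This sketch considers weekly volume only, and the handling of the boundary values (50 maps to Level 2, 500 maps to Level 4) is an assumption, since the ranges in the table overlap at their edges. Regulatory pressure or audit exposure may justify a higher level than volume alone suggests.

```bash
#!/usr/bin/env bash
# recommend_level DOCS_PER_WEEK
# Prints a starting level (1-4) based only on weekly external document volume.
recommend_level() {
  local docs="$1"
  if [ "$docs" -lt 10 ]; then
    echo 1
  elif [ "$docs" -le 50 ]; then
    echo 2
  elif [ "$docs" -lt 500 ]; then
    echo 3
  else
    echo 4
  fi
}

recommend_level 30    # prints 2
```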

Building incrementally

The advantage of a pipeline-oriented approach is that each level builds on the previous one:

Level 1 → Level 2: Add shared policies and an audit log. No infrastructure changes required — just upgrade to a team plan and define your policy.
Level 2 → Level 3: Install the CLI and write a batch processing script. Start with the highest-risk document category (client deliverables, regulatory submissions, clinical photos) and expand.
Level 3 → Level 4: Add API integration to one sharing channel at a time. Start with outbound email (highest volume, highest risk), then expand to cloud storage and document management.

Each level reduces risk. Each level produces better compliance evidence. Each level requires less human discipline and more systematic enforcement.

The goal is not perfection from day one. The goal is a process that improves over time and catches the documents that human attention misses.

Purgit supports every level of this pipeline. Web app for individual scanning. Shared policies and audit logs for team coordination. CLI for batch processing and CI/CD integration. REST API for automated pipelines. Start where you are, grow as you need.

Scan before you share.