Integrating Purgit Into Your Document Pipeline via API
How to integrate Purgit's API into your document workflows. Covers authentication, the scan-sanitize-verify flow, code examples, and webhook setup.
Why integrate via API
Purgit's web interface handles individual document scanning and sanitization. But if your organization processes documents at volume — contract management systems, report generators, document publishing workflows, automated compliance pipelines — you need metadata handling integrated directly into your software.
The Purgit API provides programmatic access to the same scan, sanitize, and verify capabilities available in the web interface. You send a document, receive a metadata report, request sanitization, and get back a clean file with verification results.
Authentication
API keys
All API requests require authentication via Bearer token. Generate API keys from your Purgit dashboard under Settings > API Keys.
Authorization: Bearer purgit_live_sk_abc123...
Key management best practices
- Generate separate keys for each integration (one for your CMS connector, one for your CI pipeline, one for your email gateway)
- Store keys in environment variables or a secrets manager, never in source code
- Rotate keys on a regular schedule and immediately if a key may have been exposed
- Use test-mode keys (purgit_test_sk_...) during development; test keys process documents but do not count against your plan quota
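The key-handling advice above can be sketched as a small helper that reads the key from the environment and builds the Bearer header. This is an illustrative sketch, not part of any Purgit SDK: the function name build_auth_headers is hypothetical, and the prefix check assumes keys always begin with purgit_live_sk_ or purgit_test_sk_, as in the examples in this article.

```python
import os
from typing import Mapping

# Key prefixes as they appear in this article; treat them as an assumption.
KEY_PREFIXES = ("purgit_live_sk_", "purgit_test_sk_")


def build_auth_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key from the environment and return the Bearer header."""
    key = env.get("PURGIT_API_KEY", "").strip()
    if not key:
        raise RuntimeError("PURGIT_API_KEY is not set")
    if not key.startswith(KEY_PREFIXES):
        raise RuntimeError("PURGIT_API_KEY does not look like a Purgit key")
    return {"Authorization": f"Bearer {key}"}
```

Failing at startup when the variable is missing or malformed tends to be easier to debug than a 401 from the first request.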
The scan, sanitize, verify flow
The Purgit API follows a three-step flow that mirrors the verification-first philosophy of the product.
Step 1: Upload and scan
Upload a document to receive a metadata scan report.
POST /api/v1/scan
Content-Type: multipart/form-data
file: [binary file data]
policy: "default" // optional: policy ID for custom rules
The response includes a scan ID and the full metadata inventory:
{
  "scanId": "scan_8f3a2b1c",
  "status": "complete",
  "findings": {
    "total": 12,
    "categories": {
      "identity": 3,
      "timestamps": 2,
      "revision": 4,
      "comments": 2,
      "location": 1
    },
    "details": [
      {
        "ruleId": "PDF-META-001",
        "field": "Author",
        "value": "Jane Smith",
        "severity": "high",
        "category": "identity"
      }
    ]
  },
  "fileHash": "sha256:a1b2c3..."
}
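A scan report shaped like the one above can be summarized before deciding what to sanitize. The sketch below assumes the response has already been parsed into a dict with the field names shown in the sample. The helper names are hypothetical, and "high" is the only severity value this article shows, so other values are not assumed.

```python
def categories_by_count(scan: dict) -> list:
    """Return (category, count) pairs, largest count first, ties by name."""
    categories = scan.get("findings", {}).get("categories", {})
    return sorted(categories.items(), key=lambda kv: (-kv[1], kv[0]))


def high_severity_findings(scan: dict) -> list:
    """Return the detail entries whose severity is exactly "high"."""
    details = scan.get("findings", {}).get("details", [])
    return [d for d in details if d.get("severity") == "high"]
```

A pipeline might log the category counts for every file and block only on high-severity entries.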
Step 2: Sanitize
Request sanitization of a scanned document. You can sanitize all findings or specify which categories or individual findings to address.
POST /api/v1/sanitize
Content-Type: application/json
{
  "scanId": "scan_8f3a2b1c",
  "mode": "all",
  "policy": "default"
}
Or selectively:
{
  "scanId": "scan_8f3a2b1c",
  "mode": "selective",
  "include": ["identity", "comments", "revision"],
  "exclude": ["timestamps"]
}
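The two request shapes above can be produced by one small builder. This is a sketch only: build_sanitize_request is a hypothetical helper, and whether the API accepts include without exclude (or the reverse) is an assumption rather than something this article states.

```python
def build_sanitize_request(scan_id, include=None, exclude=None, policy=None):
    """Build a /api/v1/sanitize body: mode "all" unless categories are given."""
    include = list(include or [])
    exclude = list(exclude or [])
    overlap = set(include) & set(exclude)
    if overlap:
        raise ValueError(f"categories both included and excluded: {sorted(overlap)}")

    body = {"scanId": scan_id, "mode": "selective" if (include or exclude) else "all"}
    if include:
        body["include"] = include
    if exclude:
        body["exclude"] = exclude
    if policy:
        body["policy"] = policy
    return body
```

Rejecting overlapping categories locally avoids sending a request whose intent is ambiguous.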
The response includes a sanitization ID and a download URL for the clean file:
{
  "sanitizeId": "san_4d5e6f7g",
  "status": "complete",
  "removedCount": 10,
  "preservedCount": 2,
  "downloadUrl": "/api/v1/download/san_4d5e6f7g",
  "expiresAt": "2026-04-01T00:00:00Z"
}
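Because the download URL carries an expiresAt timestamp, a client that defers the download may want to check it first. The sketch below assumes expiresAt is always an ISO 8601 UTC timestamp with a trailing Z, as in the sample response; download_expired is a hypothetical helper name.

```python
from datetime import datetime, timezone


def download_expired(sanitize: dict, now=None) -> bool:
    """Return True once the current time has reached the expiresAt timestamp."""
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    expires = datetime.fromisoformat(sanitize["expiresAt"].replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires
```

Passing now explicitly keeps the check deterministic in tests.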
Step 3: Verify
The sanitized file is automatically re-scanned to verify that metadata was successfully removed. Verification results are included in the sanitization response, but you can also request a standalone verification scan.
POST /api/v1/verify
Content-Type: application/json
{
  "sanitizeId": "san_4d5e6f7g"
}
Response:
{
  "verifyId": "ver_9h0i1j2k",
  "status": "pass",
  "remainingFindings": 0,
  "report": {
    "scannedFields": 47,
    "cleanFields": 47,
    "flaggedFields": 0
  }
}
If verification fails (remaining findings > 0), the response includes details on which metadata persisted and why.
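A pipeline will usually want to stop when verification does not pass. The following sketch gates on a parsed verify response using only the status and remainingFindings fields from the sample above. The exception class and function name are hypothetical, and treating any status other than "pass" as a failure is an assumption.

```python
class VerificationFailed(Exception):
    """Raised when a sanitized file still contains flagged metadata."""


def require_clean(verify: dict) -> dict:
    """Return the verify response unchanged, or raise if it did not pass."""
    remaining = verify.get("remainingFindings", 0)
    if verify.get("status") != "pass" or remaining > 0:
        raise VerificationFailed(
            f"verification {verify.get('verifyId', '?')} left {remaining} finding(s)"
        )
    return verify
```

Raising instead of returning a boolean makes it harder for calling code to publish a file after a failed verification by accident.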
Code examples
Node.js
const fs = require('fs');
const path = require('path');
const API_BASE = 'https://api.purgit.io';
const API_KEY = process.env.PURGIT_API_KEY;
// Uses the global fetch, FormData, and Blob available in Node.js 18+.
async function scanAndSanitize(filePath) {
  // Step 1: Upload and scan
  const formData = new FormData();
  // Pass the original filename so the part is not uploaded under the default name "blob"
  formData.append('file', new Blob([fs.readFileSync(filePath)]), path.basename(filePath));
  formData.append('policy', 'default');
  const scanRes = await fetch(`${API_BASE}/api/v1/scan`, {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${API_KEY}` },
    body: formData,
  });
  if (!scanRes.ok) throw new Error(`Scan failed with HTTP ${scanRes.status}`);
  const scan = await scanRes.json();
  if (scan.findings.total === 0) {
    console.log('No metadata found.');
    return null;
  }
  console.log(`Found ${scan.findings.total} metadata findings.`);
  // Step 2: Sanitize
  const sanitizeRes = await fetch(`${API_BASE}/api/v1/sanitize`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      scanId: scan.scanId,
      mode: 'all',
    }),
  });
  if (!sanitizeRes.ok) throw new Error(`Sanitize failed with HTTP ${sanitizeRes.status}`);
  const sanitize = await sanitizeRes.json();
  // Step 3: Download clean file
  const fileRes = await fetch(`${API_BASE}${sanitize.downloadUrl}`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` },
  });
  if (!fileRes.ok) throw new Error(`Download failed with HTTP ${fileRes.status}`);
  const cleanFile = Buffer.from(await fileRes.arrayBuffer());
  // report.pdf becomes report.clean.pdf
  const outputPath = filePath.replace(/(\.\w+)$/, '.clean$1');
  fs.writeFileSync(outputPath, cleanFile);
  console.log(`Clean file saved to ${outputPath}`);
  return sanitize;
}
Python
import os
import requests
API_BASE = "https://api.purgit.io"
API_KEY = os.environ["PURGIT_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
def scan_and_sanitize(file_path: str) -> dict | None:
    # Step 1: Upload and scan
    with open(file_path, "rb") as f:
        scan_res = requests.post(
            f"{API_BASE}/api/v1/scan",
            headers=HEADERS,
            files={"file": f},
            data={"policy": "default"},
        )
    scan_res.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
    scan = scan_res.json()
    if scan["findings"]["total"] == 0:
        print("No metadata found.")
        return None
    print(f"Found {scan['findings']['total']} metadata findings.")
    # Step 2: Sanitize
    sanitize_res = requests.post(
        f"{API_BASE}/api/v1/sanitize",
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"scanId": scan["scanId"], "mode": "all"},
    )
    sanitize_res.raise_for_status()
    sanitize = sanitize_res.json()
    # Step 3: Download clean file
    file_res = requests.get(
        f"{API_BASE}{sanitize['downloadUrl']}",
        headers=HEADERS,
    )
    file_res.raise_for_status()
    # report.pdf becomes report.clean.pdf
    base, ext = os.path.splitext(file_path)
    output_path = f"{base}.clean{ext}"
    with open(output_path, "wb") as f:
        f.write(file_res.content)
    print(f"Clean file saved to {output_path}")
    return sanitize
GitHub Actions workflow
Automate metadata scanning as part of your CI pipeline. This workflow scans documents in your repository before they are published or distributed.
name: Document Metadata Scan
on:
  push:
    paths:
      - 'docs/**'
      - 'assets/**'
  pull_request:
    paths:
      - 'docs/**'
      - 'assets/**'
jobs:
  scan-metadata:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Find documents
        id: find-docs
        run: |
          find docs assets -type f \
            \( -name "*.pdf" -o -name "*.docx" -o -name "*.xlsx" \
            -o -name "*.pptx" -o -name "*.jpg" -o -name "*.png" \) \
            > /tmp/doc-list.txt
          echo "count=$(wc -l < /tmp/doc-list.txt)" >> "$GITHUB_OUTPUT"
      - name: Scan documents for metadata
        if: steps.find-docs.outputs.count > 0
        env:
          PURGIT_API_KEY: ${{ secrets.PURGIT_API_KEY }}
        run: |
          FAILED=0
          while IFS= read -r file; do
            RESULT=$(curl -s -X POST "https://api.purgit.io/api/v1/scan" \
              -H "Authorization: Bearer $PURGIT_API_KEY" \
              -F "file=@$file" \
              -F "policy=default")
            # "// empty" yields an empty string when the response has no findings count
            FINDINGS=$(echo "$RESULT" | jq -r '.findings.total // empty')
            if [ -z "$FINDINGS" ]; then
              echo "::error file=$file::Scan did not return a findings count"
              FAILED=1
            elif [ "$FINDINGS" -gt 0 ]; then
              echo "::warning file=$file::Found $FINDINGS metadata findings"
              FAILED=1
            fi
          done < /tmp/doc-list.txt
          if [ "$FAILED" -eq 1 ]; then
            echo "::error::Documents contain metadata or could not be scanned. Run Purgit to clean before committing."
            exit 1
          fi
Webhook setup
For asynchronous processing of large files, configure webhooks to receive scan and sanitization results.
Register a webhook endpoint
POST /api/v1/webhooks
Content-Type: application/json
{
  "url": "https://your-app.com/webhooks/purgit",
  "events": ["scan.complete", "sanitize.complete", "verify.complete"],
  "secret": "whsec_your_signing_secret"
}
Webhook payload
{
  "event": "sanitize.complete",
  "timestamp": "2026-03-31T14:30:00Z",
  "data": {
    "sanitizeId": "san_4d5e6f7g",
    "scanId": "scan_8f3a2b1c",
    "status": "complete",
    "removedCount": 10,
    "downloadUrl": "/api/v1/download/san_4d5e6f7g",
    "verification": {
      "status": "pass",
      "remainingFindings": 0
    }
  }
}
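A receiver typically inspects the event name and the embedded verification result before acting. The sketch below decides what to do with a raw payload shaped like the sample above. The function name and the three action labels ("ignore", "download", "review") are hypothetical and not part of the Purgit API.

```python
import json


def next_action(raw_body: bytes) -> str:
    """Map a webhook payload to one of: "ignore", "download", "review"."""
    event = json.loads(raw_body)
    if event.get("event") != "sanitize.complete":
        return "ignore"  # this sketch only acts on finished sanitizations

    verification = event.get("data", {}).get("verification", {})
    passed = verification.get("status") == "pass"
    clean = verification.get("remainingFindings") == 0
    # Only fetch the file when verification passed with nothing remaining.
    return "download" if passed and clean else "review"
```

Signature verification, covered in the next section, should run on the raw body before any parsing like this.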
Signature verification
Webhook payloads are signed with your webhook secret using HMAC-SHA256. Verify the signature from the X-Purgit-Signature header before processing:
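const crypto = require('crypto');
function verifyWebhookSignature(payload, signature, secret) {
  // payload must be the raw request body, not a re-serialized JSON object
  const expected = crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');
  const received = Buffer.from(signature);
  const computed = Buffer.from(expected);
  // timingSafeEqual throws if the buffers differ in length, so check that first
  return received.length === computed.length &&
    crypto.timingSafeEqual(received, computed);
}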
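For Python receivers, the same check can be sketched with the standard library. As in the Node.js version, this assumes the X-Purgit-Signature header carries the bare hex digest of the raw request body; if the real header uses a prefix or a timestamped scheme, the comparison would need adjusting.

```python
import hashlib
import hmac


def verify_webhook_signature(payload: bytes, signature: str, secret: str) -> bool:
    """Return True if signature is the hex HMAC-SHA256 of payload under secret."""
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time and accepts inputs of unequal length;
    # comparing bytes avoids a TypeError on non-ASCII header values.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Pass the bytes exactly as received. Re-encoding parsed JSON can change whitespace or key order and invalidate the signature.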
Error handling and retry logic
HTTP status codes
- 200: success
- 400: invalid request (unsupported file format, missing required fields)
- 401: invalid or missing API key
- 413: file exceeds size limit
- 429: rate limit exceeded
- 500: server error (retry with backoff)
Retry strategy
For 429 and 500 responses, implement exponential backoff:
async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, options);
    if (res.status === 429 || res.status >= 500) {
      if (attempt === maxRetries) throw new Error(`Failed after ${maxRetries} retries`);
      // Wait 1s, 2s, 4s, ... before the next attempt
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(r => setTimeout(r, delay));
      continue;
    }
    return res;
  }
}
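The same backoff policy can be sketched in Python. To stay independent of any particular HTTP client, this version takes a zero-argument callable that performs the request and returns an object with a status_code attribute (as requests responses do). The name call_with_retry and the injectable sleep parameter are illustrative choices, not part of any Purgit client.

```python
import time
from typing import Any, Callable


def call_with_retry(
    send: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a status other than 429 or 5xx."""
    for attempt in range(max_retries + 1):
        res = send()
        if res.status_code == 429 or res.status_code >= 500:
            if attempt == max_retries:
                raise RuntimeError(f"Failed after {max_retries} retries")
            # Delays double each time: base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
            continue
        return res
```

With requests, a call might look like call_with_retry(lambda: requests.post(url, headers=headers, json=body)). For file uploads the callable should reopen or rewind the file on each attempt.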
Rate limits
API rate limits depend on your plan:
| Plan | Requests/minute | Concurrent uploads |
|------|-----------------|--------------------|
| Pro | 60 | 5 |
| Team | 300 | 20 |
| Enterprise | Custom | Custom |
Rate limit headers are included in every response:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1711900800
Monitor X-RateLimit-Remaining and throttle requests as needed rather than hitting the limit and handling 429 responses.
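One way to act on these headers is to compute how long to pause before the next request. The sketch below assumes X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but this article does not state. The function name and the floor parameter are hypothetical.

```python
import time


def throttle_delay(headers, now=None, floor=1):
    """Return seconds to wait before the next request (0.0 if none needed)."""
    # A plain dict is case-sensitive; requests' response.headers is not.
    remaining = int(headers.get("X-RateLimit-Remaining", floor + 1))
    if remaining > floor:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if now is None:
        now = time.time()
    # Never return a negative delay if the reset time has already passed.
    return max(0.0, float(reset_at - now))
```

A client could call time.sleep(throttle_delay(response.headers)) after each response, which pauses only when the remaining budget drops to the floor.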
Ready to integrate? Generate your API key at purgit.io/dashboard/api-keys and start scanning documents programmatically. Free tier includes 50 API scans per month.