Why PDF Redaction Fails (And How to Do It Right)
PDF redaction fails because most tools cover text visually without removing it from the file's data layer. Here's how PDF redaction actually works, why it breaks, and how to verify it.
The redaction problem
You need to share a PDF, but certain information must be hidden — names, social security numbers, financial figures, addresses, privileged communications. You open the PDF, draw a black box over the sensitive text, save the file, and send it.
The text is still in the file.
This is the most common PDF redaction failure, and it has occurred in court filings, government publications, corporate disclosures, and regulatory submissions. It happens because drawing a visual overlay on a PDF does not modify the text layer underneath.
Understanding why requires understanding how PDFs store content.
How PDFs store text
A PDF is a structured document format with multiple layers of data. The visual rendering (what you see on screen or in print) is composed from underlying data objects. Three of these matter for redaction:
The content stream
The content stream is the core of a PDF page. It contains positioning commands and text-drawing operators that tell the PDF renderer where to place each character:
BT % Begin text block
/F1 12 Tf % Set font: F1, 12pt
100 700 Td % Move to position (100, 700)
(John Smith) Tj % Draw the string "John Smith"
ET % End text block
This is the actual text data. When you select text in a PDF reader and copy it, you are copying from the content stream. When a search engine indexes a PDF, it reads the content stream. When a screen reader narrates a PDF, it reads the content stream.
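To make this concrete, here is a minimal Python sketch that pulls string operands out of a content stream like the one above. It assumes the stream is already uncompressed and uses only simple literal strings with the Tj operator; real PDFs also use TJ arrays, hex strings, and font-specific encodings, which this sketch does not handle. The function name is illustrative, not part of any library.

```python
import re

# Matches a simple literal string followed by the Tj (show text) operator.
# Escaped characters such as \( and \) are tolerated; nested unescaped
# parentheses, hex strings (<...>), and TJ arrays are not handled.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_tj_strings(content_stream: str) -> list[str]:
    """Return the raw string operand of every Tj operator found."""
    return [m.group(1) for m in TJ_PATTERN.finditer(content_stream)]


sample = """BT
/F1 12 Tf
100 700 Td
(John Smith) Tj
ET"""

print(extract_tj_strings(sample))  # ['John Smith']
```

Nothing in this extraction looks at what is drawn on top of the page, which is why a visual overlay has no effect on it.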
The annotation layer
Annotations are objects that sit on top of the page content. They include comments, highlights, form fields, stamps, and drawing shapes — including rectangles.
When you use a PDF editor's drawing tools to place a black rectangle over text, you are adding an annotation to the annotation layer. The annotation is rendered above the content stream in the visual display. The content stream is unchanged.
% Annotation: black rectangle from (95, 695) to (195, 715)
<<
/Type /Annot
/Subtype /Square
/Rect [95 695 195 715]
/IC [0 0 0] % Interior (fill) color: black
>>
The text "John Smith" remains in the content stream at position (100, 700). The black rectangle is drawn above it at position (95, 695). The visual result appears redacted. The data is untouched.
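The relationship between the two layers can be checked geometrically. The sketch below is a simplified illustration: it assumes you already have text positions and overlay rectangles in page coordinates (the tuple shape and function names are hypothetical), and it tests only the text origin rather than the full glyph bounding box.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A PDF rectangle given as lower-left and upper-right corners."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def text_under_overlays(text_items, overlays):
    """Return the text items whose origin falls inside any overlay rectangle.

    text_items: iterable of (text, x, y) tuples (hypothetical input shape).
    overlays:   iterable of Rect objects, e.g. built from /Rect entries.
    """
    return [
        text
        for text, x, y in text_items
        if any(rect.contains(x, y) for rect in overlays)
    ]


overlay = Rect(95, 695, 195, 715)  # the black /Square annotation
items = [("John Smith", 100, 700), ("Page 1", 300, 50)]
print(text_under_overlays(items, [overlay]))  # ['John Smith']
```

Any text reported by a check like this is covered on screen but still present in the file.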
The metadata dictionary
A PDF's metadata is stored in the document information dictionary and in XMP metadata streams. These contain author, title, creation date, creator application, and other properties — separate from the page content.
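As a rough illustration, the following sketch scans raw PDF bytes for common document information entries. It is deliberately naive: it assumes the dictionary is stored uncompressed with literal string values, and it will miss entries held in compressed object streams, hex strings, or XMP packets. It can also pick up /Title entries that belong to bookmarks rather than to the information dictionary.

```python
import re

INFO_KEYS = (b"Author", b"Title", b"Subject", b"Keywords", b"Creator", b"Producer")

# /Key (literal string value), tolerating escaped parentheses in the value.
INFO_PATTERN = re.compile(
    rb"/(" + b"|".join(INFO_KEYS) + rb")\s*\(((?:\\.|[^\\()])*)\)"
)


def scan_info_entries(pdf_bytes: bytes) -> dict[str, str]:
    """Return the metadata-like key/value pairs visible in the raw bytes."""
    found = {}
    for key, value in INFO_PATTERN.findall(pdf_bytes):
        found[key.decode("ascii")] = value.decode("latin-1")
    return found


sample_pdf = b"1 0 obj\n<< /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>\nendobj\n"
print(scan_info_entries(sample_pdf))
```

An empty result from a scan like this does not prove the metadata is clean; it only means nothing was stored in the simplest form.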
Four ways redaction fails
Failure 1: Visual overlay without text removal
This is the classic failure. A rectangle, highlight, or image is placed over text. The text remains in the content stream. Anyone can extract it by:
- Selecting and copying the text under the rectangle
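- Opening the PDF in a text editor and searching for the text strings (this works directly when the content streams are uncompressed; compressed streams must be inflated first)
- Using a PDF parsing library (pdf.js, pdfplumber, or pypdf, the successor to PyPDF2) to extract all text objects by page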
- Using accessibility tools that read the content stream, not the visual layer
This is what happened in the Epstein court filings, the SISMI intelligence leak, and numerous other documented incidents.
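The extraction routes above need very little tooling. As a sketch, the following Python uses only the standard library to inflate Flate-compressed streams and search them for a string. It assumes streams are delimited by the plain stream and endstream keywords and use either no filter or FlateDecode; it ignores encryption, other filters, and font encodings that can make the stored bytes differ from the displayed text.

```python
import re
import zlib

# "stream" keyword, payload, then "endstream". The lookbehind keeps the
# pattern from matching the tail of a preceding "endstream".
STREAM_PATTERN = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def iter_stream_payloads(pdf_bytes: bytes):
    """Yield each stream payload, inflated when it is valid zlib data."""
    for match in STREAM_PATTERN.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate-compressed, or uses a different filter


def streams_containing(pdf_bytes: bytes, needle: bytes) -> int:
    """Count the streams whose decoded payload contains the needle."""
    return sum(needle in payload for payload in iter_stream_payloads(pdf_bytes))


compressed = zlib.compress(b"BT (John Smith) Tj ET")
sample_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)
print(streams_containing(sample_pdf, b"John Smith"))  # 1
```

A black rectangle stored as an annotation never enters this code path, so it cannot hide anything from it.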
Failure 2: White text on white background
A variation of the overlay approach: instead of covering text with a black rectangle, the text color is changed to white. The text is invisible on a white background but remains in the content stream. Selecting all text on the page (Ctrl+A) reveals the white text, and copy-paste extracts it.
Some users also set the font size to 1pt, making the text effectively invisible. The text is still there.
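Both tricks leave recognizable traces in the content stream. The sketch below flags text shown while the fill color is white or the font size is 1pt or smaller. It is a simplification that assumes an uncompressed stream, DeviceRGB or DeviceGray fill operators (rg and g), and simple Tj strings; it does not track graphics state save/restore, other color spaces, or text rendering modes.

```python
import re

NUM = r"\d*\.?\d+"
TOKEN = re.compile(
    rf"(?P<rgb>(?:{NUM}\s+){{3}})rg\b"         # r g b rg   -> RGB fill color
    rf"|/\S+\s+(?P<size>{NUM})\s+Tf\b"         # /Font n Tf -> font size
    rf"|(?P<gray>{NUM})\s+g\b"                 # n g        -> gray fill color
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"    # (string) Tj
)


def find_hidden_text(content_stream: str) -> list[tuple[str, str]]:
    """Return (text, reason) pairs for text drawn in white or at <= 1pt."""
    fill_is_white = False  # the default fill color is black
    font_is_tiny = False
    hidden = []
    for m in TOKEN.finditer(content_stream):
        if m.group("rgb") is not None:
            fill_is_white = all(float(v) == 1.0 for v in m.group("rgb").split())
        elif m.group("size") is not None:
            font_is_tiny = float(m.group("size")) <= 1.0
        elif m.group("gray") is not None:
            fill_is_white = float(m.group("gray")) == 1.0
        elif m.group("text") is not None:
            if fill_is_white:
                hidden.append((m.group("text"), "white fill"))
            elif font_is_tiny:
                hidden.append((m.group("text"), "tiny font"))
    return hidden


stream = (
    "BT /F1 12 Tf 1 1 1 rg 100 700 Td (Secret) Tj "
    "0 g (Visible) Tj /F1 1 Tf (Tiny) Tj ET"
)
print(find_hidden_text(stream))
```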
Failure 3: Image-based redaction that does not flatten
Some users take a screenshot of the PDF page with the redaction applied, then insert the screenshot back into the PDF. If the original text layer is not removed — if the screenshot is placed over the existing page content rather than replacing it — the text remains accessible beneath the image.
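This happens when the page is not truly flattened after the image insertion. In this context, flattening means rasterizing the page so that a single rendered image replaces all of its layers. Flattening only the annotations is not enough, because that merges them into the page content and leaves the original text in the content stream. Without full flattening, the original content stream coexists with the overlaid image.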
Failure 4: Incomplete structural redaction
Even when using a proper redaction tool (like Adobe Acrobat Pro's Redact feature), redaction can be incomplete:
- Metadata not cleared: The document properties may contain the author's name or other information that should have been redacted.
- Bookmarks and links: The table of contents, bookmarks, or hyperlinks may reference redacted content by name.
- Incremental save history: PDFs support incremental saves, where new versions are appended to the file without overwriting previous versions. A "redacted" version may be appended, but the previous unredacted version still exists in the file's byte stream. Specialized tools can recover incremental revisions.
- Form fields: Interactive form fields may contain values that overlap with redacted content.
- Embedded attachments: Files attached to the PDF (other documents, images, spreadsheets) are not affected by page-level redaction.
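Several of the items above leave markers that a quick byte-level scan can surface. The following is a hedged sketch, not a complete audit: it counts %%EOF markers as a proxy for saved revisions and looks for a few well-known PDF names. Objects stored in compressed object streams are invisible to it, and linearized PDFs can legitimately contain more than one %%EOF, so treat the output as a prompt for closer inspection.

```python
import re

# Names from the PDF format that indicate structures page-level redaction
# does not touch. The dictionary keys are labels chosen for this sketch.
MARKERS = {
    "embedded_files": rb"/EmbeddedFiles?\b",
    "bookmarks": rb"/Outlines\b",
    "form_fields": rb"/AcroForm\b",
    "xmp_metadata": rb"/Type\s*/Metadata\b",
    "link_annotations": rb"/Subtype\s*/Link\b",
}


def scan_residue(pdf_bytes: bytes) -> dict[str, int]:
    """Count structural markers that may carry unredacted content."""
    report = {
        name: len(re.findall(pattern, pdf_bytes))
        for name, pattern in MARKERS.items()
    }
    # Each incremental save appends a new trailer ending in %%EOF.
    report["revisions"] = pdf_bytes.count(b"%%EOF")
    return report


sample_pdf = (
    b"%PDF-1.7\n1 0 obj << /Type /Catalog /Outlines 2 0 R >> endobj\n%%EOF\n"
    b"5 0 obj << /Type /EmbeddedFile >> endobj\n%%EOF\n"
)
print(scan_residue(sample_pdf))
```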
How to redact a PDF correctly
Correct redaction requires three steps: mark, apply, and verify.
Step 1: Mark areas for redaction
Use a tool that supports structural redaction — not drawing tools, not annotation tools. In Adobe Acrobat Pro, this is Tools > Redact > Mark for Redaction. In other professional tools, look for a dedicated "Redact" function that is separate from the drawing or comment tools.
Mark all areas that contain sensitive content. The tool should highlight these areas but not yet remove the content.
Step 2: Apply redaction
Once all areas are marked, apply the redaction. This step removes the text objects from the content stream and replaces them with the redaction marks (black rectangles or other visual indicators). The text data is deleted from the PDF's data structures.
In Acrobat Pro: after marking, click "Apply Redactions." The tool will warn you that this action is permanent and will remove the underlying text. This warning is the key indicator that you are using structural redaction, not visual overlay.
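To show what removing text from the content stream means, as opposed to covering it, here is a toy Python sketch. It is not how Acrobat or any other product implements redaction. It assumes an uncompressed stream in which each string is positioned by a Td immediately followed by a Tj, and it treats the Td operands as page coordinates, which holds only for the first Td in a text block under the default transformation.

```python
import re

NUM = r"-?\d*\.?\d+"
SHOW = re.compile(
    rf"(?P<x>{NUM})\s+(?P<y>{NUM})\s+Td\s*"
    r"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"
)


def redact_stream(content_stream: str, rect: tuple) -> str:
    """Blank out strings whose Td position lies inside rect (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect

    def replace(m: re.Match) -> str:
        x, y = float(m.group("x")), float(m.group("y"))
        if x0 <= x <= x1 and y0 <= y <= y1:
            # Keep the positioning but drop the string data itself.
            return f"{m.group('x')} {m.group('y')} Td () Tj"
        return m.group(0)

    return SHOW.sub(replace, content_stream)


sample = "BT\n/F1 12 Tf\n100 700 Td\n(John Smith) Tj\nET"
redacted = redact_stream(sample, (95, 695, 195, 715))
print("John Smith" in redacted)  # False
```

After this kind of edit there is nothing left for copy, search, or a parser to find, which is the property a visual overlay lacks. A real tool would also paint the redaction mark into the page content and handle images, kerned text arrays, and partially covered strings.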
After applying redaction, also:
- Remove metadata (File > Properties > remove author, title, etc.)
- Remove hidden information (Tools > Redact > Remove Hidden Information — this strips comments, metadata, attachments, bookmarks, and incremental save data)
Step 3: Verify
After redaction and metadata removal, verify that the content is actually gone:
- Select and copy test: Try to select text in the redacted areas. If you can copy readable text, the redaction failed.
- Search test: Use Ctrl+F to search for keywords that should have been redacted. If the search finds them, the redaction failed.
- Parser test: Open the PDF with a text extraction tool (pdftotext, pdf.js, or a dedicated metadata scanner) and check whether the redacted text appears in the extracted output.
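- Incremental save test: Check for leftover revisions. A redacted PDF that is the same size as or larger than the original is a warning sign, though not proof, that incremental save data still holds the unredacted version. A more direct check is to count the %%EOF markers in the file, since each incremental save appends one (linearized files can also contain more than one). Save the file with "Save As" (not "Save") to create a new file without incremental history.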
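The search and parser tests can be scripted once the text has been extracted. The sketch below assumes you already have the extracted text as a string (for example, the output of pdftotext saved to a file) and a list of terms that should no longer appear; both names are placeholders. Matching is case-insensitive and ignores differences in whitespace, so a name split across a line break is still caught.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and fold case for tolerant comparison."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_leaks(extracted_text: str, redacted_terms: list[str]) -> list[str]:
    """Return the terms that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [term for term in redacted_terms if normalize(term) in haystack]


extracted = "The plaintiff, John\nSmith, resides at [REDACTED]."
print(find_leaks(extracted, ["john smith", "123-45-6789"]))  # ['john smith']
```

An empty list is necessary but not sufficient. Text that was hyphenated, positioned glyph by glyph, or stored only in metadata or an earlier revision will not show up in a plain substring match.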
The verification gap
The core problem with PDF redaction is not that proper tools don't exist — Adobe Acrobat Pro's Redact tool works correctly when used correctly. The problem is:
- Users reach for the wrong tool. Drawing tools are more familiar and more accessible than the Redact tool. The visual result is identical. Nothing in the interface warns that one removes content and the other does not.
- Verification is not built into the workflow. After applying redaction, users visually inspect the result ("the text looks hidden") and proceed. Visual inspection is not verification. The text layer is invisible by design: you cannot verify its absence by looking at the rendered page.
- Partial redaction is hard to detect. If 99 out of 100 redactions are structural but one is a visual overlay, visual inspection will not catch it. The one failed redaction looks identical to the other 99.
- Metadata and incremental saves are forgotten. Even users who apply structural redaction correctly may forget to remove metadata properties, clear incremental save history, or check embedded attachments.
A systematic approach
Reliable redaction requires treating the process as a pipeline, not a single step:
- Structural redaction: Remove text from the content stream, not just cover it visually.
- Metadata removal: Clear all document properties, XMP metadata, and application identifiers.
- Incremental save cleanup: Save the file as a new document to eliminate revision history.
- Attachment and bookmark review: Check that embedded files, bookmarks, and links do not reference redacted content.
- Automated verification: Re-parse the output file and check for the presence of text that should have been redacted. This must be done programmatically — not by visual inspection.
- Report generation: Produce a record of what was found, what was redacted, and whether verification passed. This record matters if the adequacy of the redaction is later questioned.
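As one possible shape for the final record, the sketch below collects the outcome of the earlier checks into a single structure and serializes it as JSON. The class and its field names are hypothetical and chosen for illustration; they do not describe the output of any particular product.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RedactionReport:
    """Hypothetical record of one verification run."""
    source_file: str
    terms_checked: list[str]
    leaked_terms: list[str] = field(default_factory=list)
    revisions: int = 1
    metadata_keys: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Pass only if nothing leaked, no extra revisions, no metadata left.
        return (
            not self.leaked_terms
            and self.revisions <= 1
            and not self.metadata_keys
        )

    def to_json(self) -> str:
        data = asdict(self)
        data["passed"] = self.passed
        return json.dumps(data, indent=2)


report = RedactionReport(
    source_file="filing_redacted.pdf",
    terms_checked=["john smith"],
    leaked_terms=["john smith"],
    revisions=2,
)
print(report.to_json())
```

Keeping the inputs (which terms were checked) next to the result makes the record useful later, since a pass means little without knowing what was tested.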
Purgit scans PDFs for redaction safety — checking whether text content exists beneath visual overlays, whether metadata contains sensitive information, and whether revision history preserves unredacted content. After sanitization, Purgit re-scans the output to verify that findings are resolved.