Detect Bad Redactions

x-ray scans PDFs to find visual redaction boxes that leave underlying text accessible, reporting exact page, bounding box, and recovered text via CLI or Python.

Many public PDFs contain “redactions” that can still be read. This open-source tool finds those ineffective redactions so that sensitive text does not slip through.

Source: GitHub - freelawproject/x-ray

Highlights

Metric          Value / Notes
Purpose         Python library for finding bad redactions in PDF documents.
Core engine     Uses PyMuPDF to parse and inspect PDFs.
Output format   JSON mapping page numbers to lists of detections (bbox + text).
Interfaces      Command-line tool (xray), Python module, and runnable via uvx.
License         BSD-2-Clause permissive license.
CI / Releases   Releases happen automatically via GitHub Actions.
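
To make the output format row concrete, here is a minimal sketch of consuming such a report. It assumes each detection is an object with "bbox" and "text" keys, as the description suggests; the sample values and the summarize helper are hypothetical and are not taken from the tool's real output.

```python
import json

# Illustrative sample only: the shape follows the description above
# (page number -> list of detections); the values are made up.
sample_output = """
{
  "1": [
    {"bbox": [58.5, 72.2, 75.6, 739.4], "text": "example recovered text"}
  ],
  "3": []
}
"""


def summarize(report_json: str) -> list[tuple[int, str]]:
    """Return (page, recovered text) pairs for every detection in a report."""
    report = json.loads(report_json)
    found = []
    # JSON object keys are strings, so convert page numbers before sorting.
    for page, detections in sorted(report.items(), key=lambda kv: int(kv[0])):
        for detection in detections:
            found.append((int(page), detection["text"]))
    return found


rows = summarize(sample_output)
for page, text in rows:
    print(f"page {page}: {text}")
```

Pages with an empty list contribute nothing to the summary, so an empty result means no bad redactions were reported.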

Key points

  • Detection method: find rectangles, find letters in same location, render rectangle, check if rendered region is a single color.
  • When a rectangle is uniform color over underlying text, x-ray flags it as a bad redaction and returns the text and bbox.
  • CLI usage: xray <path-or-https-url> prints JSON for a local path or an https URL. PDF bytes already in memory are handled through the Python inspect API rather than the CLI.
  • Python API: xray.inspect accepts a path, URL, or bytes and returns a Python dict mirroring the JSON output.
  • Input type rules: a str starting with https is downloaded as a URL; any other str or a Path is read as a local file; bytes are treated as a PDF already in memory.
  • Examples in the README show recovered text and bounding boxes to illustrate false redactions.
  • Contributions require a signed contributor license agreement; issues list tracks feature requests and unsupported cases.
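
The detection steps listed in the key points can be sketched on toy data. This is not x-ray's code, which works on real PDFs through PyMuPDF; it is a simplified model under stated assumptions: a page is a small grid of color values, and Box, Char, is_single_color, and find_bad_redactions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap only if they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Char:
    glyph: str
    box: Box


def is_single_color(pixels, box):
    """True if every rendered pixel inside the box has the same color."""
    colors = {
        pixels[y][x]
        for y in range(int(box.y0), int(box.y1))
        for x in range(int(box.x0), int(box.x1))
    }
    return len(colors) == 1


def find_bad_redactions(rects, chars, pixels):
    """Flag rectangles that render as one flat color yet sit over text."""
    results = []
    for rect in rects:
        covered = [c for c in chars if rect.intersects(c.box)]
        if covered and is_single_color(pixels, rect):
            results.append({
                "bbox": (rect.x0, rect.y0, rect.x1, rect.y1),
                "text": "".join(c.glyph for c in covered),
            })
    return results


# Toy page, 10 columns by 4 rows. Rows 0-1: a solid box (color 0) drawn over
# the word "secret". Rows 2-3: mixed colors, i.e. the text "ok" is visible.
solid = [0] * 6 + [1] * 4
mixed = [1, 0] * 5
pixels = [solid, solid, mixed, mixed]

chars = [Char(g, Box(i, 0, i + 1, 2)) for i, g in enumerate("secret")]
chars += [Char(g, Box(i, 2, i + 1, 4)) for i, g in enumerate("ok")]
rects = [Box(0, 0, 6, 2), Box(0, 2, 4, 4)]

found = find_bad_redactions(rects, chars, pixels)
print(found)
```

Only the first rectangle is flagged: it renders as one flat color while text still sits underneath it. The second rectangle covers text too, but its rendered region is not a single color, so in this model it is treated as visible text rather than a redaction.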

Timeline

  • 2021-10-26: Congressional testimony PDF referenced as an example (no bad redactions).
  • 2025-12-08: Release v0.3.5 listed as latest on the repository page.

Why this matters

Faulty redactions can expose sensitive information in public records and archives. x-ray automates detection, improving privacy, transparency, and the integrity of legal and archival datasets while enabling teams to remediate risky documents at scale.