Detect Bad Redactions
x-ray scans PDFs to find visual redaction boxes that leave underlying text accessible, reporting exact page, bounding box, and recovered text via CLI or Python.
Millions of public PDFs may hide “redactions” you can still read. This open‑source tool finds worthless redactions so sensitive text stops slipping through.
Source: GitHub - freelawproject/x-ray
Highlights
| Aspect | Details |
|---|---|
| Purpose | Python library for finding bad redactions in PDF documents. |
| Core engine | Uses PyMuPDF to parse and inspect PDFs. |
| Output format | JSON mapping page numbers to lists of detections (bbox + text). |
| Interfaces | Command-line tool (xray), Python module, and runnable via uvx. |
| License | BSD-2-Clause permissive license. |
| CI / Releases | Releases happen automatically via GitHub Actions. |
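To make the output format concrete, here is a minimal sketch of walking a result shaped like the one described above: page numbers mapped to lists of detections, each with a bbox and the recovered text. The sample data is made up, the helper names `summarize` and `pages_with_bad_redactions` are hypothetical, and the bbox order (x0, y0, x1, y1) is an assumption rather than something the source states.

```python
from typing import Any, Dict, List, Tuple

# Made-up sample shaped like the described output:
# page number -> list of detections, each with "bbox" and "text".
# The (x0, y0, x1, y1) bbox order is an assumption.
sample_result: Dict[Any, List[dict]] = {
    1: [
        {"bbox": (58.5, 72.2, 75.6, 739.4), "text": "hypothetical recovered text"},
    ],
    3: [
        {"bbox": (100.0, 200.0, 180.0, 212.0), "text": "another example"},
        {"bbox": (100.0, 230.0, 150.0, 242.0), "text": "third example"},
    ],
}


def summarize(result: Dict[Any, List[dict]]) -> List[Tuple[int, tuple, str]]:
    """Flatten a result into (page, bbox, text) rows, ordered by page.

    Page keys are passed through int() so the helper works both for a
    Python dict with integer keys and for parsed JSON, where keys are strings.
    """
    rows = []
    for page in sorted(result, key=int):
        for detection in result[page]:
            rows.append((int(page), tuple(detection["bbox"]), detection["text"]))
    return rows


def pages_with_bad_redactions(result: Dict[Any, List[dict]]) -> List[int]:
    """Return the page numbers that have at least one detection."""
    return [int(page) for page in sorted(result, key=int) if result[page]]


if __name__ == "__main__":
    for page, bbox, text in summarize(sample_result):
        print(f"page {page}: {text!r} at {bbox}")
```

A flat list of rows like this is convenient for writing a CSV report or for queueing documents for manual review.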
Key points
- Detection method: find rectangles on the page, find letters at the same location, render the rectangle, and check whether the rendered region is a single color.
- When a rectangle renders as a single uniform color while text still sits underneath it, x-ray flags it as a bad redaction and returns the text and bbox.
- CLI usage: xray <path-or-https-url> prints JSON. The CLI accepts local paths and https URLs; PDFs held in memory as bytes are handled through the Python inspect API.
- Python API: xray.inspect accepts a path, URL, or bytes and returns a Python dict mirroring the JSON output.
- Input type rules: str/Path -> local file, https-prefixed str -> download URL, bytes -> PDF in memory.
- Examples in the README show recovered text and bounding boxes to illustrate false redactions.
- Contributions require a signed contributor license agreement; issues list tracks feature requests and unsupported cases.
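The detection steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not x-ray's actual implementation: the real tool uses PyMuPDF to extract rectangles and letters and to render the page, while this sketch swaps in a hand-built pixel grid. All names here (`Box`, `Char`, `find_bad_redactions`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
WHITE: Color = (255, 255, 255)
BLACK: Color = (0, 0, 0)


@dataclass(frozen=True)
class Box:
    x0: int
    y0: int
    x1: int
    y1: int

    def intersects(self, other: "Box") -> bool:
        # Standard axis-aligned overlap test.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Char:
    text: str
    box: Box


def region_is_single_color(pixels: Sequence[Sequence[Color]], box: Box) -> bool:
    """True if every rendered pixel inside the box has the same color."""
    colors = {pixels[y][x]
              for y in range(box.y0, box.y1)
              for x in range(box.x0, box.x1)}
    return len(colors) == 1


def find_bad_redactions(rects: List[Box], chars: List[Char],
                        pixels: Sequence[Sequence[Color]]) -> List[dict]:
    findings = []
    for rect in rects:
        # Letters that occupy the same location as the rectangle.
        covered = [c for c in chars if rect.intersects(c.box)]
        # Text exists but the rendering shows one flat color: the text is
        # visually hidden yet still present in the file.
        if covered and region_is_single_color(pixels, rect):
            findings.append({
                "bbox": (rect.x0, rect.y0, rect.x1, rect.y1),
                "text": "".join(c.text for c in covered),
            })
    return findings


# Hypothetical 20x10 "rendered page" with one pixel per unit.
pixels = [[WHITE] * 20 for _ in range(10)]
redaction = Box(2, 2, 10, 6)
for y in range(redaction.y0, redaction.y1):
    for x in range(redaction.x0, redaction.x1):
        pixels[y][x] = BLACK            # solid box painted over the text
outlined = Box(12, 2, 18, 6)
pixels[3][14] = BLACK                   # a visible glyph pixel inside an unfilled box

chars = [Char("h", Box(3, 3, 5, 5)), Char("i", Box(5, 3, 6, 5)),
         Char("x", Box(13, 3, 15, 5))]
found = find_bad_redactions([redaction, outlined], chars, pixels)
```

In this toy page only the solid black box is reported. The outlined box also overlaps a letter, but its rendered region contains more than one color, so the letter is visible and nothing is being hidden.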
Timeline
- 2021-10-26: Congressional testimony PDF referenced as an example (no bad redactions).
- 2025-12-08: Release v0.3.5 listed as latest on the repository page.
Why this matters
Faulty redactions can expose sensitive information in public records and archives. x-ray automates detection, improving privacy, transparency, and the integrity of legal and archival datasets while enabling teams to remediate risky documents at scale.