Why does pdfplumber return no coverage limits from a scanned policy PDF?

The page is image-only with no recoverable text layer, so extract_words() returns nothing. Enforce a text-density floor and divert low-density pages to the OCR pathway, which returns words with bounding boxes so the rest of the extraction is identical to a born-digital page.

How do I stop a deductible being read as the coverage limit?

Cluster words into rows by their vertical centre, then select the largest plausible amount on the row that matches a coverage anchor. Lower the confidence score whenever more than one candidate amount appears on the row so ambiguous rows are quarantined for manual review instead of guessed.

Why use Decimal instead of float for parsed coverage amounts?

Coverage limits and deductibles carry exact cents that float cannot represent without rounding error. Parsing into Decimal preserves the exact value through normalization, audit logging, and any downstream comparison against the bound policy.

How is an extracted coverage limit made auditable for regulators?

Persist the bounding box, the raw token, the confidence score, the OCR engine version, and a SHA-256 hash of the source page alongside the normalized amount. This append-only lineage lets an examiner reconstruct exactly how a value was derived from a specific pixel region.

Extracting Coverage Limits from Scanned Policy PDFs

This walkthrough extends the PDF Text Extraction with pdfplumber stage with the specific coordinate-anchoring, optical fallback, and normalization steps needed to recover coverage limits from rasterized declarations pages, and it sits inside the broader Policy PDF Parsing & Extraction Workflows architecture.

Problem Statement

Scanned declarations pages defeat the lexical string-matching that works on born-digital policies. A rasterized page carries no recoverable text layer, so page.extract_text() returns an empty string rather than raising — and a naive pipeline records a hollow result instead of recovering the limit. Even after optical character recognition restores text, coverage limits sit inside multi-column grids where identical font weights erase the visual boundary between a deductible, a per-occurrence limit, and an aggregate. The failure this page resolves is precise: a Coverage A dwelling limit of $450,000 is silently transcribed as the $2,500 wind/hail deductible printed two columns to its right, and that wrong number flows into reserving and adjudication unchecked. The fix is to stop reading the page as a character stream and start reading it as a coordinate plane, anchoring every amount to the coverage label that owns it.

Prerequisites

This pattern assumes a Python 3.11+ worker with the following pinned dependencies, plus two system binaries on the path for the optical fallback: Tesseract (4.x or 5.x) for recognition, and Poppler, which pdf2image calls to rasterize pages:

pdfplumber==0.11.4
pdf2image==1.17.0
pytesseract==0.3.13

The optical fallback rasterizes only the pages flagged image-only by the density probe — it is not run on every document. For the routing logic that decides when a whole document belongs in the optical queue rather than a single page, see OCR Integration & Sync; pages whose orientation is non-zero must be normalized first using the handling rotated pages transform, otherwise the configured coordinates address the pre-rotation space.

Step 1 — Probe text density and load coordinate-anchored words

Every page must first be classified as born-digital or image-only. The probe measures recoverable characters per unit area; image-only scans score near zero and are rasterized through Tesseract, which returns words with bounding boxes so the rest of the pipeline is identical regardless of source. Anchoring on coordinates — not character order — is the single decision that makes scanned extraction deterministic.

from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Sequence

import pdfplumber
import pytesseract
from pdf2image import convert_from_path

logger = logging.getLogger("coverage_extraction")


class ScannedExtractionError(RuntimeError):
    """Raised when a page cannot yield a trustworthy coordinate-anchored text layer."""


@dataclass(frozen=True)
class Word:
    text: str
    x0: float
    x1: float
    top: float
    bottom: float

    @property
    def y_center(self) -> float:
        return (self.top + self.bottom) / 2.0


# Characters per 1000 pt^2 below which the page is treated as image-only.
TEXT_DENSITY_FLOOR = 0.05
OCR_DPI = 300


def load_words(page: "pdfplumber.page.Page", source_pdf: str) -> list[Word]:
    raw = page.extract_words(use_text_flow=False, keep_blank_chars=False)
    area = max(page.width * page.height, 1.0)
    density = (sum(len(w["text"]) for w in raw) * 1000.0) / area

    if density >= TEXT_DENSITY_FLOOR:
        logger.info("page %d: born-digital, %d words", page.page_number, len(raw))
        return [
            Word(w["text"], float(w["x0"]), float(w["x1"]), float(w["top"]), float(w["bottom"]))
            for w in raw
        ]

    logger.info("page %d: density %.4f below floor, diverting to OCR", page.page_number, density)
    return _ocr_words(source_pdf, page.page_number, page.height)


def _ocr_words(source_pdf: str, page_number: int, page_height: float) -> list[Word]:
    images = convert_from_path(source_pdf, dpi=OCR_DPI, first_page=page_number, last_page=page_number)
    if not images:
        raise ScannedExtractionError(f"page {page_number}: rasterization produced no image")

    image = images[0]
    scale = page_height / image.height  # map pixel space back to PDF points
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words: list[Word] = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
    ):
        token = text.strip()
        if not token or float(conf) < 40:  # drop empty and low-confidence tokens
            continue
        words.append(
            Word(token, left * scale, (left + width) * scale, top * scale, (top + height) * scale)
        )
    if not words:
        raise ScannedExtractionError(f"page {page_number}: OCR recovered no usable tokens")
    return words
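To make the density floor and the pixel-to-point scale concrete, the arithmetic can be sketched in isolation. The helper names here (text_density, pixel_to_point_scale) are illustrative and not part of the pipeline above, and the page is assumed to be US Letter (612 x 792 pt) rendered at 300 DPI.

```python
def text_density(char_count: int, width_pt: float, height_pt: float) -> float:
    """Characters per 1000 pt^2, the same measure load_words() computes."""
    area = max(width_pt * height_pt, 1.0)
    return (char_count * 1000.0) / area


def pixel_to_point_scale(page_height_pt: float, image_height_px: int) -> float:
    """Factor that maps OCR pixel coordinates back to PDF points."""
    return page_height_pt / image_height_px


LETTER = (612.0, 792.0)  # assumed page size in points

# On Letter, a 0.05 floor works out to roughly 24 characters per page:
# 24 characters fall just below it, 25 just clear it.
sparse = text_density(24, *LETTER)
dense = text_density(25, *LETTER)

# A 792 pt page rendered at 300 DPI is 3300 px tall, so the scale is 72 / 300.
scale = pixel_to_point_scale(792.0, 3300)
```

Under these assumptions the floor only tolerates stray marks such as a page number or a stamp, so a scan that carries a small embedded text fragment may still need a higher floor to reach the OCR branch.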

Step 2 — Cluster words into rows and anchor coverage labels

Coverage limits live on the same visual row as their label. Grouping words by their vertical centre — with a small tolerance for OCR baseline jitter — reconstructs those rows, after which each row is tested against an alias table of coverage anchors. Matching on aliases (Coverage A, Dwelling, Cov A) absorbs carrier-specific shorthand without a per-carrier rewrite.

ROW_TOLERANCE = 4.0  # points; words within this vertical band share a row

COVERAGE_ANCHORS: dict[str, tuple[str, ...]] = {
    "dwelling": ("dwelling", "coverage a", "cov a"),
    "other_structures": ("other structures", "coverage b", "cov b"),
    "personal_property": ("personal property", "contents", "coverage c"),
    "loss_of_use": ("loss of use", "coverage d"),
    "personal_liability": ("personal liability", "liability", "coverage e"),
    "medical_payments": ("medical payments", "med pay", "coverage f"),
}


def group_rows(words: Sequence[Word], tolerance: float = ROW_TOLERANCE) -> list[list[Word]]:
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: (w.y_center, w.x0)):
        for row in rows:
            if abs(row[0].y_center - word.y_center) <= tolerance:
                row.append(word)
                break
        else:
            rows.append([word])
    for row in rows:
        row.sort(key=lambda w: w.x0)
    return rows


def match_anchor(row_text: str) -> str | None:
    haystack = row_text.lower()
    for canonical, aliases in COVERAGE_ANCHORS.items():
        if any(alias in haystack for alias in aliases):
            return canonical
    return None

Step 3 — Normalize currency tokens and score each limit

The final step converts the candidate amounts into exact Decimal values — never float, which would corrupt cents — and selects the amount that belongs to the anchored coverage. When a row contains several numbers (a limit and a deductible), the rule that wins in practice is “largest amount on the anchored row”, with a confidence score that downgrades ambiguous rows so the gate can quarantine them. These normalized records are what the Field Mapping Strategies layer maps onto canonical ACORD identifiers.

import re
from decimal import Decimal, InvalidOperation

CURRENCY_RE = re.compile(
    r"\$?\s*([0-9][0-9,]*)(?:\.(\d{2}))?\s*(K|M|MM|B)?$", re.IGNORECASE
)
MULTIPLIERS = {"k": Decimal(1_000), "m": Decimal(1_000_000), "mm": Decimal(1_000_000), "b": Decimal(1_000_000_000)}
MIN_PLAUSIBLE_LIMIT = Decimal(100)


@dataclass(frozen=True)
class CoverageLimit:
    coverage: str
    amount: Decimal
    raw: str
    confidence: float
    bbox: tuple[float, float, float, float]


def parse_amount(token: str) -> Decimal | None:
    match = CURRENCY_RE.fullmatch(token.strip())
    if not match:
        return None
    whole, cents, suffix = match.groups()
    try:
        value = Decimal(whole.replace(",", ""))
    except InvalidOperation:
        return None
    if cents:
        value += Decimal(f"0.{cents}")
    if suffix:
        value *= MULTIPLIERS[suffix.lower()]
    return value


def extract_limits_from_page(page: "pdfplumber.page.Page", source_pdf: str) -> list[CoverageLimit]:
    records: list[CoverageLimit] = []
    for row in group_rows(load_words(page, source_pdf)):
        coverage = match_anchor(" ".join(w.text for w in row))
        if coverage is None:
            continue
        candidates = [(w, amt) for w in row if (amt := parse_amount(w.text)) and amt >= MIN_PLAUSIBLE_LIMIT]
        if not candidates:
            logger.warning("anchor %s matched with no parseable amount", coverage)
            continue
        word, amount = max(candidates, key=lambda pair: pair[1])
        # One clean amount on the row is high confidence; multiple competing numbers lowers it.
        confidence = 0.95 if len(candidates) == 1 else 0.70
        records.append(
            CoverageLimit(coverage, amount, word.text, confidence, (word.x0, word.top, word.x1, word.bottom))
        )
    return records
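The gate that consumes these confidence scores is not shown above. One way it might look is sketched here, assuming a single review threshold of 0.90; ScoredLimit, REVIEW_THRESHOLD, and partition_by_confidence are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from decimal import Decimal

REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune per carrier and line of business


@dataclass(frozen=True)
class ScoredLimit:
    """Stand-in carrying only the CoverageLimit fields the gate reads."""
    coverage: str
    amount: Decimal
    confidence: float


def partition_by_confidence(
    records: list[ScoredLimit], threshold: float = REVIEW_THRESHOLD
) -> tuple[list[ScoredLimit], list[ScoredLimit]]:
    """Split records into (accepted, quarantined) without dropping any."""
    accepted = [r for r in records if r.confidence >= threshold]
    quarantined = [r for r in records if r.confidence < threshold]
    return accepted, quarantined


accepted, quarantined = partition_by_confidence([
    ScoredLimit("dwelling", Decimal("450000"), 0.95),
    ScoredLimit("personal_liability", Decimal("500000"), 0.70),
])
```

Returning both lists, rather than filtering in place, keeps the quarantined rows available for the manual-review queue.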

Verification & Testing

Correctness here is not “did it return a number” but “did it return the right number with a defensible confidence”. Pin a fixture declarations page with known limits and assert on both the value and the score. The confidence gate is what keeps a misread out of the downstream system — assert that low-confidence rows are flagged rather than silently trusted.

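from decimal import Decimal

import pdfplumber

# Assumed module name for the code in Steps 1-3; adjust to your package layout.
from coverage_extraction import (
    MIN_PLAUSIBLE_LIMIT,
    Word,
    extract_limits_from_page,
    group_rows,
    match_anchor,
    parse_amount,
)


def test_dwelling_limit_recovered_exactly() -> None:
    with pdfplumber.open("fixtures/scanned_ho3_declarations.pdf") as pdf:
        limits = extract_limits_from_page(pdf.pages[0], "fixtures/scanned_ho3_declarations.pdf")

    by_coverage = {limit.coverage: limit for limit in limits}

    assert by_coverage["dwelling"].amount == Decimal("450000")
    # Holds only when the fixture's dwelling row carries a single amount;
    # a deductible clustered onto the same row would score 0.70.
    assert by_coverage["dwelling"].confidence >= 0.90
    # The wind/hail deductible must NOT be misattributed as the Coverage A limit.
    assert by_coverage["dwelling"].amount != Decimal("2500")


def test_ambiguous_row_is_quarantined() -> None:
    rows = group_rows([
        Word("Coverage", 50, 120, 700, 712),
        Word("E", 124, 132, 700, 712),
        Word("$300,000", 300, 380, 700, 712),
        Word("$500,000", 420, 500, 700, 712),  # umbrella attachment on the same row
    ])
    assert len(rows) == 1
    coverage = match_anchor(" ".join(w.text for w in rows[0]))
    assert coverage == "personal_liability"
    # Two competing amounts on the row is the condition that drops confidence to 0.70.
    amounts = [amt for w in rows[0] if (amt := parse_amount(w.text)) and amt >= MIN_PLAUSIBLE_LIMIT]
    assert amounts == [Decimal("300000"), Decimal("500000")]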

Run the gate with pytest -q, and in production assert a per-document invariant: the set of recovered coverage keys must match the carrier’s expected schedule for that policy form, otherwise raise ScannedExtractionError and route the document to manual review rather than persisting a partial result.
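The per-document invariant could be expressed roughly as follows. This is a sketch under stated assumptions: the EXPECTED_SCHEDULE table and its "HO-3" entry are hypothetical placeholders for carrier configuration, and the exception class is redeclared only so the example runs on its own.

```python
class ScannedExtractionError(RuntimeError):
    """Mirrors the exception defined in Step 1 so this sketch runs on its own."""


# Hypothetical expected schedules keyed by policy form; real values come from carrier config.
EXPECTED_SCHEDULE: dict[str, frozenset[str]] = {
    "HO-3": frozenset({
        "dwelling", "other_structures", "personal_property",
        "loss_of_use", "personal_liability", "medical_payments",
    }),
}


def assert_schedule_complete(policy_form: str, recovered: set[str]) -> None:
    """Raise unless the recovered coverage keys equal the expected schedule."""
    expected = EXPECTED_SCHEDULE[policy_form]
    missing = sorted(expected - recovered)
    unexpected = sorted(recovered - expected)
    if missing or unexpected:
        raise ScannedExtractionError(
            f"{policy_form}: missing={missing} unexpected={unexpected}"
        )
```

Reporting both the missing and the unexpected keys gives the reviewer a starting point, since a missing anchor and a mismatched alias tend to show up together.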

Compliance & Audit Note

Every recovered CoverageLimit must be traceable from the adjudicated value back to the exact pixel region it came from. Persist the bbox, the raw token, the confidence, the OCR engine version, and a SHA-256 hash of the source page alongside the normalized amount, so a regulator examining a disputed settlement can reconstruct precisely how $450,000 was derived. This lineage is also what lets the Coverage Validation Rules engine cross-check an extracted limit against the bound policy on file — and what feeds the deductible side of that check, covered in validating deductible thresholds automatically. Append-only audit records with hash chaining satisfy NAIC data-governance expectations and state department-of-insurance examination requests without exposing the production database.
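A hash-chained audit record might be assembled along these lines. The field names, the GENESIS_HASH sentinel, and the choice of canonical JSON as the hashed form are all assumptions made for illustration; page_bytes stands for whatever byte representation of the source page the pipeline chooses to retain.

```python
import hashlib
import json
from decimal import Decimal

GENESIS_HASH = "0" * 64  # assumed sentinel for the first record in a chain


def _digest(body: dict[str, object]) -> str:
    # Hash a canonical JSON form so the digest is stable across runs.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(
    *,
    coverage: str,
    amount: Decimal,
    raw: str,
    confidence: float,
    bbox: tuple[float, float, float, float],
    page_bytes: bytes,
    ocr_engine_version: str,
    prev_hash: str = GENESIS_HASH,
) -> dict[str, object]:
    record: dict[str, object] = {
        "coverage": coverage,
        "amount": str(amount),  # serialize Decimal as text to keep it exact
        "raw": raw,
        "confidence": confidence,
        "bbox": list(bbox),
        "page_sha256": hashlib.sha256(page_bytes).hexdigest(),
        "ocr_engine_version": ocr_engine_version,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = _digest(record)
    return record


def verify_chain(records: list[dict[str, object]]) -> bool:
    """True when every record hashes correctly and links to its predecessor."""
    prev = GENESIS_HASH
    for record in records:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if body["prev_hash"] != prev or _digest(body) != record["record_hash"]:
            return False
        prev = str(record["record_hash"])
    return True
```

Because each record embeds the hash of the one before it, editing an earlier amount invalidates the chain from that point forward, which is the property an append-only log relies on.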

Troubleshooting Checklist

The scenarios below cover the large majority of scanned coverage-limit incidents. Each entry names the failure, the diagnostic signal, and the fix.

Deductible read as the coverage limit — the limit is plausible but too small

A deductible printed in a column to the right of the limit lands on the same clustered row, and a naive “first number wins” rule selects it. Diagnose by logging every (word, amount) candidate per anchored row. Fix by selecting the largest plausible amount on the row and lowering confidence whenever more than one candidate exists, so genuinely ambiguous rows are quarantined rather than guessed.

Empty result on an image-only page — extraction "succeeds" with zero limits

The page has no text layer, so extract_words() returns nothing and the density probe should have fired. Diagnose by logging the computed density against TEXT_DENSITY_FLOOR. Fix by confirming the OCR branch is reached; if Tesseract is missing from the path, pytesseract raises TesseractNotFoundError, and the document must dead-letter with an operator alert, not record an empty success.
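A startup preflight can surface a missing binary before any document is processed. The sketch below uses only the standard library and assumes the default executable names: "tesseract" for pytesseract and "pdftoppm" (shipped with Poppler) for pdf2image. Deployments that point pytesseract at an explicit path would need a different check.

```python
import shutil

# Executable names are assumptions based on the default installs.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names: tuple[str, ...] = REQUIRED_BINARIES) -> list[str]:
    """Return the required executables that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


def preflight() -> None:
    missing = missing_binaries()
    if missing:
        # Fail the worker at startup rather than recording empty extractions later.
        raise RuntimeError(f"OCR fallback unavailable, missing binaries: {missing}")
```

Calling preflight() once when the worker boots turns a silent per-document failure into a single, visible deployment error.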

Row fragmentation from OCR baseline jitter — one logical row splits into two

At 300 DPI, OCR can place a label and its amount a few points apart vertically, breaking the row cluster. Diagnose by dumping y_center values for the affected coverage. Fix by widening ROW_TOLERANCE for the carrier, or by re-rasterizing at a higher DPI; do not widen the tolerance so far that adjacent schedule rows merge.

Suffix amounts collapse to the wrong magnitude — `$1M` parsed as `1`

A carrier prints shorthand ($1M, 750K) that the regex multiplier path missed. Diagnose by asserting that any token ending in K, M, MM, or B resolves above MIN_PLAUSIBLE_LIMIT. Fix by confirming the suffix group is captured and mapped through MULTIPLIERS; extend the table for carrier-specific notations rather than special-casing values inline.

Ruled-grid declarations defeat row clustering — columns interleave across coverages

Some carriers render limits inside a fully ruled table where coordinate clustering alone is unreliable. Diagnose by checking whether the page has dense rect lines. Fix by routing those pages to lattice-mode table reconstruction via Table Parsing with Camelot and anchoring on resolved cells instead of raw word rows.
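The density check for ruling lines could be sketched as below, assuming pdfplumber-style rects and lines attributes on the page object. The threshold is a hypothetical starting value, and the stand-in pages exist only so the example runs without a PDF on disk. Note that these attributes describe vector graphics; on an image-only scan the rules are pixels, so this check would only help where the grid survives as drawn objects.

```python
from types import SimpleNamespace

RULED_GRID_MIN_SEGMENTS = 40  # hypothetical threshold; calibrate on known ruled pages


def looks_ruled(page: object, min_segments: int = RULED_GRID_MIN_SEGMENTS) -> bool:
    """True when the page carries enough drawn rects and lines to suggest a ruled table."""
    rects = getattr(page, "rects", [])
    lines = getattr(page, "lines", [])
    return len(rects) + len(lines) >= min_segments


# Stand-ins for pdfplumber pages.
ruled_page = SimpleNamespace(rects=[{}] * 30, lines=[{}] * 20)
plain_page = SimpleNamespace(rects=[], lines=[{}] * 3)
```

Pages that pass the check would be routed to lattice-mode reconstruction; everything else stays on the row-clustering path.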

PDF Text Extraction with pdfplumber — the parent stage this walkthrough extends
Handling Rotated Pages in Policy Documents — normalize orientation before applying coordinate bounds
Optimizing Camelot for Complex Insurance Tables — ruled-grid reconstruction for limit schedules
Building Async Batch Processors for Daily Policy Ingestion — run this extractor at portfolio scale
Validating Deductible Thresholds Automatically — check the recovered values against the bound policy