When should I use Camelot lattice mode versus stream mode for policy tables?

Use lattice when the table has explicit cell borders or vector strokes, which is common for modern carrier declarations and ruled loss runs. Use stream when the table is rendered as whitespace-aligned text with no structural lines, typical of legacy mainframe schedules. Select the flavor from measured vector density rather than hardcoding one, because forcing lattice on a borderless table returns zero tables and forcing stream on a ruled grid merges columns.

Why does Camelot return an empty table list for some insurance PDFs?

The document is either rasterized with no vector layer (a scan or fax) or its grid uses hairline rules that lattice does not detect as lines. Check vector density first: if density is near zero, divert the document to the OCR fallback queue; if rules exist but are thin, raise the line_scale parameter for that carrier, then confirm with parsing_report accuracy once a table is actually returned (an empty table list has no report to inspect).

How do I stop Camelot from shifting policy values into the wrong column?

Merged or multi-line cells cause Camelot to emit a phantom column that offsets every later value. Validate the recovered DataFrame column count against the form's expected schema in the manifest and reject any mismatch, so a shifted deductible quarantines the document instead of silently persisting wrong data. Enabling split_text also reduces spurious column splits.

What infrastructure does Camelot need in a production container?

Camelot's lattice backend renders pages through the system Ghostscript installation, which the Python ghostscript package only binds to, so install gs 10.x in the build stage in addition to the Python ghostscript and opencv-python-headless packages. Pin all of these explicitly, because unpinned Ghostscript or OpenCV releases change line detection and make extraction non-reproducible across deploys.

Table Parsing with Camelot for Insurance Policy and Claims Data

This stage of the broader Policy PDF Parsing & Extraction Workflows architecture owns the ruled-grid pathway: turning declarations pages, loss-run schedules, premium allocation matrices, and endorsement riders that encode their data inside table geometry into structured, row-aligned, auditable records. Where the native-text stage flattens a page into a character stream, tabular policy data needs deterministic table reconstruction to preserve the row-column relationships that carry coverage limits, deductibles, and premium splits. Camelot is the workhorse for this stage because it extracts tables from the PDF vector layer — cell strokes, text coordinates, and whitespace columns — without any probabilistic guesswork, so every recovered cell traces back to a definite position on a specific page. Deployed inside an InsurTech pipeline, it demands strict routing logic, explicit error categorization, and compliance-aware data boundaries rather than a blind read_pdf call.

What Breaks Without a Disciplined Table Stage

At low volume you can open a declarations page, call camelot.read_pdf(...), and hand the first DataFrame downstream. That approach collapses against a real multi-carrier portfolio, where three production failures dominate.

The first is silent column misalignment. A carrier renders a premium schedule with thin or absent cell borders, lattice mode finds no grid, and the parser either returns nothing or — worse — splits a two-line address into phantom columns that shift every subsequent value one cell to the right. Nothing throws. A deductible lands in the limit column, and thousands of rating records absorb the corruption before anyone notices.

The second is flavor mis-routing. lattice and stream are not interchangeable: forcing lattice on a borderless whitespace table yields zero tables, and forcing stream on a dense ruled grid merges adjacent columns. A pipeline that hardcodes one flavor silently degrades on every document that does not match its assumption.

The third is rasterized-table blindness. A scanned loss run contains no vector strokes and no text layer, so Camelot returns an empty table list rather than an error. A naive pipeline logs “no tables found” and moves on, when the document should have been diverted into the optical pathway for reconstruction.

This stage exists to make those failures loud and early: classify the table geometry before parsing, select the flavor deterministically, validate the recovered schema, score confidence, and route anything ambiguous away from the fast path instead of letting it pollute the canonical policy store.

Prerequisites and Environment Setup

Camelot’s behaviour is dominated by its native dependencies, so pin the full stack — an unpinned Ghostscript or OpenCV release changes lattice line detection and produces non-reproducible extraction across deploys.

# requirements.txt — pin Camelot and its table-detection dependencies
camelot-py[base]==0.11.0
opencv-python-headless==4.10.0.84   # lattice line detection
ghostscript==0.7                    # Python binding; system Ghostscript also required
pandas==2.2.2
pypdf==4.3.1                        # cheap pre-flight vector/geometry inspection

System Ghostscript (gs 10.x) must be installed in the runtime image; the lattice backend calls into it through the Python binding to render pages as images. Bake it into the container; do not rely on the host.
Python 3.11+: the orchestration tier relies on @dataclass(slots=True) and asyncio task-group semantics.
Object storage for immutable raw bytes (S3, GCS, or MinIO). The worker reads document bytes from storage, never from the inbound request, so it stays stateless and replayable.
A message broker (RabbitMQ, SQS, or Redis Streams) carrying lightweight work items — a content hash and a routing key — not the document bytes.
A version-controlled extraction manifest mapping carrier and form identifiers to expected column schemas and flavor hints. Treat it as code: reviewed, tagged, and deployed so any historical extraction is reproducible against the schema in force at the time.
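The manifest described in the last item could be modelled roughly as follows. This is a minimal sketch, not the pipeline's actual manifest format: the ManifestEntry fields, the "carrier:form" routing-key shape, and the sample entry (including the carrier name) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ManifestEntry:
    """One carrier/form record; every field name here is illustrative."""
    expected_columns: Tuple[str, ...]
    flavor_hint: Optional[str] = None      # "lattice", "stream", or None to let the router decide
    manifest_version: str = "unversioned"  # tag of the manifest release in force


# Hypothetical content. In practice this would be loaded from a reviewed,
# tagged file in version control rather than defined inline.
MANIFEST: Dict[str, ManifestEntry] = {
    "acme:dec-page-v3": ManifestEntry(
        expected_columns=("coverage", "limit", "deductible", "premium"),
        flavor_hint="lattice",
        manifest_version="2024.06.1",
    ),
}


def lookup_manifest(routing_key: str) -> ManifestEntry:
    """Fail loudly on an unknown form instead of guessing a schema."""
    try:
        return MANIFEST[routing_key]
    except KeyError:
        raise LookupError(f"No manifest entry for routing key {routing_key!r}") from None
```

Recording manifest_version alongside each extraction is one way to keep a historical run reproducible against the schema that was in force at the time.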

Confirm a document is a candidate for this stage before doing anything expensive: a parseable cross-reference table, a non-trivial vector-stroke density, or detectable whitespace-aligned columns. Documents that fail every check belong in the optical or native-text pathways, not here.

Deterministic Routing Architecture

Carriers and third-party administrators distribute documents across dozens of templates, each with varying table geometries, border styles, and header conventions. A production pipeline must never invoke Camelot blindly. The ingestion layer evaluates PDF metadata, page dimensions, and vector object density to classify the document before extraction begins, then selects one of Camelot’s two extraction paradigms:

lattice executes with high precision when explicit grid lines, cell borders, or vector strokes define table boundaries — the common case for modern carrier declarations and ruled loss runs.
stream is necessary when tables are rendered as whitespace-aligned text blocks with no structural strokes — typical of legacy mainframe-generated schedules.

If neither a vector table structure nor alignment heuristics are present, the router hands the document back to PDF Text Extraction with pdfplumber for line-level parsing rather than forcing Camelot into a degraded state. When the document is a scanned image or carries rasterized tables, vector analysis returns zero confidence and the work item diverts into the OCR Integration & Sync fallback queue, which reconstructs structural boundaries before the page is re-evaluated for table extraction. Every routing decision is logged, versioned, and tied to a deterministic confidence threshold so the pipeline categorizes outcomes at ingestion rather than discovering corruption downstream during field mapping.
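The four-way routing decision described above can be sketched as a pure function. The Route names, the has_text_layer and has_aligned_columns inputs, and the default threshold are assumptions made for illustration; how those signals are actually measured is left to the ingestion layer.

```python
from enum import Enum


class Route(Enum):
    # Illustrative route labels, not identifiers from the pipeline itself.
    LATTICE = "lattice"
    STREAM = "stream"
    TEXT_FALLBACK = "pdfplumber_text"
    OCR_FALLBACK = "ocr_queue"


def choose_route(
    vector_density: float,
    has_text_layer: bool,
    has_aligned_columns: bool,
    lattice_threshold: float = 150.0,
) -> Route:
    """Classify a document before any Camelot call is made."""
    if not has_text_layer:
        # Scans and faxes: neither flavor has anything to read.
        return Route.OCR_FALLBACK
    if vector_density > lattice_threshold:
        # Enough drawing operators to suggest a ruled grid.
        return Route.LATTICE
    if has_aligned_columns:
        # Borderless but whitespace-aligned text.
        return Route.STREAM
    # No grid and no alignment: hand back to line-level text parsing.
    return Route.TEXT_FALLBACK
```

Keeping the decision free of I/O makes it cheap to log, version, and backtest against a labelled sample.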

Core Implementation: A Production Extraction Module

The module below is the deterministic core of the stage. It enforces type hints, structured logging, geometry-driven flavor selection, schema validation, and explicit error propagation, categorizing each failure so the queue layer can decide retry-versus-quarantine from the error type rather than from a half-populated result.

import logging
from typing import Dict, Any, List, Optional, Tuple
import camelot
import pandas as pd
from dataclasses import dataclass, field
from enum import Enum
import pypdf
from datetime import datetime, timezone

logger = logging.getLogger("insurtech.table_extractor")


class ExtractionErrorType(Enum):
    STRUCTURAL = "structural_pdf_error"
    BOUNDARY = "table_boundary_error"
    VALIDATION = "schema_validation_error"
    COMPLIANCE = "compliance_boundary_violation"


class TableFlavor(Enum):
    LATTICE = "lattice"
    STREAM = "stream"


@dataclass
class ExtractionResult:
    success: bool
    dataframes: List[pd.DataFrame]
    routing_decision: str
    confidence_score: float
    audit_trail_id: str
    extraction_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    error_type: Optional[ExtractionErrorType] = None
    error_message: Optional[str] = None
    compliance_flags: Dict[str, Any] = field(default_factory=dict)


def _assess_vector_density(pdf_path: str) -> Tuple[bool, float]:
    """Evaluate the content stream for explicit grid lines and vector strokes."""
    try:
        reader = pypdf.PdfReader(pdf_path)
        total_vectors = 0
        for page in reader.pages:
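            # get_contents() also covers pages whose /Contents is an array of
            # streams, which a raw page["/Contents"] lookup would skip.
            content = page.get_contents()
            if content is not None:
                # Content-stream line count is a coarse proxy: it counts text
                # operators as well as drawing operators.
                total_vectors += len(content.get_data().splitlines())
        # Heuristic: >150 content-stream lines/page suggests a ruled, lattice-able table.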
        avg_vectors = total_vectors / max(len(reader.pages), 1)
        return avg_vectors > 150, min(avg_vectors / 300.0, 1.0)
    except Exception as exc:
        logger.error("Vector density assessment failed: %s", exc)
        return False, 0.0


def validate_schema(df: pd.DataFrame, expected_columns: List[str]) -> bool:
    """Reject empty or column-count-mismatched frames before they propagate."""
    if df.empty:
        return False
    if df.shape[1] != len(expected_columns):
        logger.warning(
            "Column-count mismatch: got %d, expected %d",
            df.shape[1], len(expected_columns),
        )
        return False
    return True


def extract_insurance_tables(
    pdf_path: str,
    expected_columns: List[str],
    audit_trail_id: str,
    flavor_override: Optional[TableFlavor] = None,
) -> ExtractionResult:
    """Production table extraction with deterministic routing and compliance tracking."""
    has_vectors, vector_confidence = _assess_vector_density(pdf_path)

    if flavor_override:
        selected_flavor = flavor_override.value
        routing_confidence = 0.85
    elif has_vectors:
        selected_flavor = TableFlavor.LATTICE.value
        routing_confidence = vector_confidence
    else:
        selected_flavor = TableFlavor.STREAM.value
        routing_confidence = max(vector_confidence, 0.60)

    logger.info(
        "Routing to %s extraction (confidence: %.2f) | AuditID: %s",
        selected_flavor, routing_confidence, audit_trail_id,
    )

    try:
        tables = camelot.read_pdf(
            pdf_path,
            flavor=selected_flavor,
            pages="all",
            split_text=True,
            strip_text="\n\r",
        )

        if not tables or len(tables) == 0:
            return ExtractionResult(
                success=False,
                dataframes=[],
                routing_decision=selected_flavor,
                confidence_score=routing_confidence,
                audit_trail_id=audit_trail_id,
                error_type=ExtractionErrorType.BOUNDARY,
                error_message="No tables detected in target pages.",
            )

        validated_frames: List[pd.DataFrame] = []
        for table in tables:
            df = table.df.copy()  # copy so renaming headers does not mutate Camelot's table
            if not validate_schema(df, expected_columns):
                continue
            # Standardize headers and strip cell whitespace deterministically.
            df.columns = expected_columns
            df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
            validated_frames.append(df)

        if not validated_frames:
            return ExtractionResult(
                success=False,
                dataframes=[],
                routing_decision=selected_flavor,
                confidence_score=routing_confidence,
                audit_trail_id=audit_trail_id,
                error_type=ExtractionErrorType.VALIDATION,
                error_message="Extracted tables failed schema validation.",
            )

        return ExtractionResult(
            success=True,
            dataframes=validated_frames,
            routing_decision=selected_flavor,
            confidence_score=routing_confidence,
            audit_trail_id=audit_trail_id,
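            compliance_flags={
                "data_classification": "policy_financial",
                "retention_schedule": "7_years",
                # No masking happens in this function; the downstream masking
                # step must set this flag before persistence.
                "pii_masked": False,
            },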
        )

    # Camelot reads files through pypdf, so malformed PDFs surface as pypdf's PdfReadError.
    except pypdf.errors.PdfReadError as exc:
        logger.error("PDF structural error: %s", exc)
        return ExtractionResult(
            success=False,
            dataframes=[],
            routing_decision=selected_flavor,
            confidence_score=0.0,
            audit_trail_id=audit_trail_id,
            error_type=ExtractionErrorType.STRUCTURAL,
            error_message=str(exc),
        )
    except Exception as exc:
        logger.exception("Unexpected extraction failure")
        return ExtractionResult(
            success=False,
            dataframes=[],
            routing_decision=selected_flavor,
            confidence_score=0.0,
            audit_trail_id=audit_trail_id,
            error_type=ExtractionErrorType.COMPLIANCE,
            # Keep the exception class so operators can tell a real violation
            # from an ordinary runtime fault.
            error_message=(
                "Processing halted on unexpected "
                f"{type(exc).__name__}; treated as a compliance boundary violation."
            ),
        )

Two design choices carry most of the production value. First, the flavor is chosen from measured vector density rather than guessed, so the same deployed pipeline absorbs ruled commercial declarations and borderless legacy schedules without a redeploy. Second, the function never returns a half-populated result on a structural problem; it returns a typed ExtractionResult whose error_type lets the broker layer route the work item to retry, fallback parsing, reconciliation, or a security alert.
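One way the broker layer might act on error_type is a plain lookup table. The sketch below redeclares the enum so it runs standalone; the action names, and the assumption that the four error types map onto retry, fallback parsing, reconciliation, and security alert in that order, are illustrative rather than taken from the pipeline.

```python
from enum import Enum
from typing import Dict, Optional


class ExtractionErrorType(Enum):
    # Local copy of the enum from the module above, so the sketch runs on its own.
    STRUCTURAL = "structural_pdf_error"
    BOUNDARY = "table_boundary_error"
    VALIDATION = "schema_validation_error"
    COMPLIANCE = "compliance_boundary_violation"


# Hypothetical action names; the pairing with error types is an assumption.
ACTIONS: Dict[ExtractionErrorType, str] = {
    ExtractionErrorType.STRUCTURAL: "retry",
    ExtractionErrorType.BOUNDARY: "fallback_parsing",
    ExtractionErrorType.VALIDATION: "reconciliation",
    ExtractionErrorType.COMPLIANCE: "security_alert",
}


def next_action(success: bool, error_type: Optional[ExtractionErrorType]) -> str:
    """Decide what the queue layer does with a finished work item."""
    if success:
        return "persist"
    if error_type is None:
        # A failure with no category is itself suspicious: hold it for review.
        return "quarantine"
    return ACTIONS[error_type]
```

Because the decision depends only on the typed result, the same table can be unit-tested without a PDF or a broker in the loop.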

Configuration and Tuning

Hardcoded thresholds are the enemy of a multi-carrier pipeline. Drive every tunable from the environment and the carrier manifest so operators can recalibrate without shipping code.

import os
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class TableExtractorConfig:
    lattice_vector_threshold: float   # avg vector ops/page above which lattice wins
    stream_floor_confidence: float    # confidence floor assigned to stream routing
    ocr_divert_threshold: float       # below this density -> OCR fallback queue
    line_scale: int                   # lattice line-detection sensitivity

    @classmethod
    def from_env(cls) -> "TableExtractorConfig":
        return cls(
            lattice_vector_threshold=float(os.getenv("CAMELOT_LATTICE_VECTOR_THRESHOLD", "150")),
            stream_floor_confidence=float(os.getenv("CAMELOT_STREAM_FLOOR", "0.60")),
            ocr_divert_threshold=float(os.getenv("CAMELOT_OCR_DIVERT_THRESHOLD", "0.05")),
            line_scale=int(os.getenv("CAMELOT_LINE_SCALE", "40")),
        )

Calibration guidance:

lattice_vector_threshold (default 150) is the boundary between trusting cell strokes and falling back to whitespace columns. Dense ruled loss runs sit well above it; sparse benefit schedules sit below. Backtest against a labelled sample and watch the column-misalignment rate, not just the table-detection rate.
line_scale controls Camelot’s lattice line-detection sensitivity. Raise it when thin or broken rules go undetected, but raising it too far makes the detector treat underlined text as table borders and fabricate phantom rows.
ocr_divert_threshold separates a genuine vector table from a rasterized scan; set it just above the vector density of your worst image-only PDFs so they divert cleanly to optical recovery.
Carrier-specific overrides — flavor hints, expected column schemas, and per-form line_scale — belong in the manifest, keyed by routing key. Once raw tables are recovered, hand their cells to Field Mapping Strategies to normalize carrier column shorthand into canonical ACORD identifiers before the values reach the policy store. Multi-page declarations, nested sub-tables, and carrier-specific coordinate filtering are tuned in depth in Optimizing Camelot for Complex Insurance Tables.
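Combining the environment defaults with per-form manifest overrides might look like the sketch below. The function name and its parameters are hypothetical. The one Camelot-specific point it relies on is that line_scale is a lattice option and columns is a stream option, and that Camelot validates keyword arguments per flavor, so each is attached only to the flavor that accepts it.

```python
from typing import Any, Dict, List, Optional


def build_read_pdf_kwargs(
    flavor: str,
    default_line_scale: int = 40,
    manifest_line_scale: Optional[int] = None,
    manifest_columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for camelot.read_pdf from config and manifest."""
    kwargs: Dict[str, Any] = {"flavor": flavor, "pages": "all", "split_text": True}
    if flavor == "lattice":
        # The manifest override wins over the environment default when present.
        kwargs["line_scale"] = (
            manifest_line_scale if manifest_line_scale is not None else default_line_scale
        )
    elif flavor == "stream":
        # Explicit separators only when the manifest supplies them.
        if manifest_columns:
            kwargs["columns"] = manifest_columns
    else:
        raise ValueError(f"Unknown flavor: {flavor!r}")
    return kwargs
```

A call such as camelot.read_pdf(pdf_path, **build_read_pdf_kwargs("lattice", manifest_line_scale=60)) would then apply the carrier override without any flavor-specific branching at the call site.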

Compliance Integration and Auditability

Insurance table extraction operates under strict regulatory frameworks, so every extraction event must map to auditable controls. Because the ExtractionResult carries compliance_flags inline, downstream systems enforce retention and privacy obligations without an additional transformation layer.

Data lineage — the audit_trail_id ties raw PDF ingestion to the parsed DataFrames, capturing the selected flavor, the confidence score, and a UTC timestamp so a compliance officer can reconstruct why a given cell was produced during a NAIC or SOX examination.
PII boundaries — financial matrices frequently intersect with policyholder identifiers, so column-level masking is enforced before downstream persistence in line with state-level privacy statutes. The segregation of raw bytes, recovered tables, and adjudicated values follows Data Boundary Enforcement so financial and personal fields never cross a tier without an explicit, logged transition.
Retention scheduling — extracted outputs inherit the retention_schedule flag, triggering automated archival workflows that satisfy statutory document-retention requirements.
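The lineage record behind these controls could be as small as the sketch below. The field names and the choice of SHA-256 as the content hash are assumptions for illustration; the source only states that the audit trail ties raw bytes to the flavor, the confidence score, and a UTC timestamp.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_lineage_record(
    raw_pdf_bytes: bytes,
    audit_trail_id: str,
    flavor: str,
    confidence_score: float,
    retention_schedule: str = "7_years",
) -> str:
    """Serialize one extraction event as a JSON audit record."""
    record = {
        "audit_trail_id": audit_trail_id,
        # The hash ties the record to the exact immutable bytes in object storage.
        "content_sha256": hashlib.sha256(raw_pdf_bytes).hexdigest(),
        "flavor": flavor,
        "confidence_score": round(confidence_score, 4),
        "retention_schedule": retention_schedule,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
    # sort_keys gives a stable field order, which keeps records diff-friendly.
    return json.dumps(record, sort_keys=True)
```

Writing the record at extraction time, rather than reconstructing it later, means an examiner sees the decision inputs exactly as the router saw them.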

Recovered values are reconciled against canonical definitions enforced by Policy Schema Design, and the coverage limits and deductibles this stage pulls from premium and schedule tables then flow into Coverage Validation Rules, where they are checked against the bound policy before any claim adjudicates.

Failure Modes and Troubleshooting

The scenarios below account for the large majority of table-extraction incidents. Each names the failure, the diagnostic signal, and the code-level fix.

Merged-cell column shift — values are present but offset one column

A multi-line address or a vertically merged label causes Camelot to emit an extra phantom column, shifting every subsequent value right. Diagnose by logging df.shape[1] against the manifest’s expected column count — the validate_schema guard above rejects the frame on mismatch. Fix by enabling split_text=True (already set) and adding the form’s true column count to the manifest so the mismatch quarantines the document instead of silently persisting a shifted deductible.

Lattice finds no table — empty table list on a clearly tabular page

The page renders its grid with hairline or anti-aliased rules that Ghostscript does not surface as strokes, so lattice detects no cells. Diagnose by re-running with camelot.read_pdf(..., flavor="lattice", line_scale=60): if the list is still empty, the rules are not being detected at all; if a table now appears, inspect tables[0].parsing_report and treat a low accuracy as a sign the grid is only partially recovered. Fix by raising line_scale for that carrier in the manifest, or, when no rules truly exist, letting the vector-density router fall through to stream.
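The diagnosis can be automated by sweeping a few line_scale values and recording what each returns. In this sketch the helper name and the injected read_pdf parameter are hypothetical (the injection exists only so the helper can be exercised with a stub); the scale values are examples, with 15 being Camelot's documented default.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def probe_line_scales(
    pdf_path: str,
    scales: Iterable[int] = (15, 40, 60),
    read_pdf: Optional[Callable[..., Any]] = None,
) -> List[Dict[str, Any]]:
    """Re-run lattice at several line_scale values and summarize each attempt."""
    if read_pdf is None:
        import camelot  # imported lazily so a stub can replace it in tests
        read_pdf = camelot.read_pdf

    findings: List[Dict[str, Any]] = []
    for scale in scales:
        tables = read_pdf(pdf_path, flavor="lattice", line_scale=scale)
        # Guard the index: an empty table list has no tables[0] to inspect.
        accuracy = tables[0].parsing_report["accuracy"] if len(tables) > 0 else None
        findings.append({"line_scale": scale, "tables": len(tables), "accuracy": accuracy})
    return findings
```

The lowest scale that yields a table with acceptable accuracy is a reasonable candidate for that carrier's manifest override.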

Stream over-segmentation — clean grid explodes into too many columns

stream mode infers column boundaries from whitespace, so a ruled table forced through it splits on internal spacing and fabricates columns. Diagnose by comparing tables[0].df.shape against the expected schema. Fix by ensuring _assess_vector_density routes ruled pages to lattice, and by passing explicit columns=["72,180,340"] coordinate separators from the manifest only for genuinely borderless forms.

Rasterized or rotated table — zero vectors, zero tables

A scanned or sideways-scanned loss run has no vector layer, so vector density scores near 0.0 and Camelot returns nothing. Diagnose by reading the BOUNDARY error type alongside the raw vector density; extract_insurance_tables floors the stream-path confidence_score at 0.60, so the returned score alone will not read zero. Fix by diverting the work item to OCR Integration & Sync, which reconstructs structure before the page is re-evaluated for table extraction; rotation normalization specifically is covered in handling rotated pages.

Ghostscript missing in the runtime image — lattice raises at import or read time

The container ships the Python ghostscript bindings but no system Ghostscript, so lattice fails while stream appears to work, masking the gap. Diagnose by running gs --version in the image and reading the logged exception: a missing Ghostscript does not arrive as a PDF read error, so expect it outside the STRUCTURAL path. Fix by installing system Ghostscript in the build stage and treating its absence as a fatal deploy-time check, not a per-document runtime surprise.
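A deploy-time check along these lines would catch the gap before any document is processed. It is a sketch for POSIX images only; the function names are hypothetical, and whether a given Camelot version needs the gs executable, the shared library, or both is worth confirming against its backend, which is why both are reported.

```python
import ctypes.util
import shutil
from typing import Dict, Optional


def ghostscript_status() -> Dict[str, Optional[str]]:
    """Report where, if anywhere, system Ghostscript is visible to this process."""
    return {
        "gs_binary": shutil.which("gs"),              # executable on PATH, or None
        "gs_library": ctypes.util.find_library("gs"),  # shared library name, or None
    }


def assert_ghostscript_present() -> None:
    """Raise at startup rather than on the first lattice document."""
    status = ghostscript_status()
    if not any(status.values()):
        raise RuntimeError(
            "System Ghostscript not found: install it in the image build stage "
            f"(probe result: {status})"
        )
```

Calling assert_ghostscript_present() from the worker's startup hook or the image's health check turns a per-document failure into a failed deploy.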

Going Deeper: Focused Walkthroughs

This stage anchors a set of narrower implementation guides that drill into specific table-extraction problems:

Optimizing Camelot for Complex Insurance Tables extends the routing and validation contract here into production scaling concerns — memory isolation for large multi-page binders, deterministic fallback architectures, and per-form coordinate filtering for merged cells, nested sub-schedules, and inconsistent legacy grids.

For high-volume ingestion, wrap extract_insurance_tables inside an async batch orchestrator that respects backpressure and applies exponential backoff to transient I/O failures; the daily policy ingestion batch processors walkthrough wires that orchestration end to end. As the catalogue grows, each new walkthrough plugs into the same routing, confidence-scoring, and audit contract defined on this page, so patterns compose rather than diverge.
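A minimal sketch of that orchestration pattern follows, assuming each job is a zero-argument callable that fetches bytes from storage and then calls extract_insurance_tables, and that transient I/O failures surface as OSError. The function names and defaults are hypothetical, and the semaphore only caps in-flight work inside one process; it is not a substitute for broker-level flow control.

```python
import asyncio
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def run_with_backoff(
    job: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> T:
    """Run a blocking job in a worker thread, retrying OSError with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.to_thread(job)
        except OSError:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")


async def process_batch(
    jobs: Sequence[Callable[[], T]],
    max_concurrency: int = 4,
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> List[T]:
    """Process jobs concurrently, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job: Callable[[], T]) -> T:
        async with semaphore:
            return await run_with_backoff(job, max_attempts, base_delay)

    # gather preserves input order, so results line up with the submitted jobs.
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))
```

Only OSError is retried here because extract_insurance_tables reports parsing problems through its typed result rather than by raising, so those outcomes should be routed by error_type instead of retried blindly.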

Policy PDF Parsing & Extraction Workflows — the parent architecture this stage belongs to
PDF Text Extraction with pdfplumber — the native-text pathway tables route back to when no grid exists
OCR Integration & Sync — optical fallback for rasterized or scanned tables
Field Mapping Strategies — normalizing recovered table cells into canonical ACORD identifiers
Coverage Validation Rules — validating extracted limits and deductibles against the bound policy
Data Boundary Enforcement — tiered isolation of raw bytes, recovered tables, and adjudicated values