Why isn't the page /Rotate metadata enough to detect orientation?

Many legacy scanning utilities strip, misapply, or override the /Rotate directive during multiplexing, so it cannot be trusted alone. Use it as the first layer and fall back to a geometric text-flow estimate on a downsampled render when the metadata is absent or low-confidence.

How do I correct rotation without running out of memory on a large policy binder?

Open the source PDF read-only, correct only the affected page, and write that single page to scratch storage with insert_pdf rather than concatenating the whole document in memory. Peak resident memory then stays bounded by one page and its embedded images, which keeps a 500-page binder under roughly 150 MB as long as no single scanned page is unusually large.

What should happen when orientation confidence is too low to auto-correct?

Do not force a transformation. When geometric confidence falls below 0.85, raise a LowConfidenceRotation exception and route the page to the human-review fallback queue, because forcing a guess on watermark- or stamp-heavy pages corrupts the coordinate space more reliably than leaving it alone.

Why must each rotation produce an audit record?

State departments of insurance and NAIC data-governance reviewers treat undocumented automated modifications as integrity defects. An append-only record of the source hash, page index, detected angle, confidence, method, re-anchored media box, engine version, and timestamp makes every correction reconstructable and distinguishes automated deskew from manual override.

Handling Rotated Pages in Policy Documents

This guide covers the deskew-and-orientation step inside the OCR Integration & Sync workflow: the synchronous correction that must run before any rasterized policy page reaches an OCR engine or a coordinate-based extractor.

Problem Statement

A 90-degree rotation on a declarations page or a 270-degree rotation on a loss-run schedule silently breaks every assumption downstream extraction makes. Coordinate-based parsers anchor crop regions to a media box that no longer matches the visible text flow, table-boundary detection returns null grids, and OCR confidence collapses because the engine reads glyphs sideways. The failure is rarely loud: the vendor still returns HTTP 200, the JSON still validates, and a $1,245.00 premium quietly becomes $12,450.00 once a displaced decimal lands in the wrong bounding box. Carrier submissions, scanned endorsements, and legacy binder compilations routinely mix portrait declarations, landscape loss runs, and rotated addendum scans inside a single file, so a document-wide rotation guess is never safe. This page resolves exactly that gap: deterministic per-page orientation detection, memory-safe correction, and an immutable record of every transformation applied.

Prerequisites

This pattern targets digitally-mixed and rasterized PDFs and assumes the surrounding pipeline already classifies documents by text density before this step runs.

Python 3.10+ (the code uses @dataclass(slots=True), which was added in 3.10).
PyMuPDF==1.24.* (imported as fitz) for page-level matrix transforms over native C bindings.
numpy>=1.26 for the geometric orientation fallback on rasterized pages.
A writable scratch directory on fast local storage for the corrected intermediate, never the source bucket.
Documents already triaged so that digital-origin files with a usable text layer are handled by PDF Text Extraction with pdfplumber rather than forced through this rasterized path.

Pin the PyMuPDF minor version explicitly. Matrix and set_rotation semantics have shifted across major releases, and an unpinned upgrade is a classic source of silent coordinate drift.
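One way to make the pin enforceable at runtime, and not only in the requirements file, is a small startup guard. The sketch below is illustrative: `PINNED_PREFIX`, `is_pinned_version`, and `require_pinned_version` are hypothetical names, and the version string is assumed to come from something like `fitz.VersionBind`, the same attribute the audit record in Step 3 reads.

```python
PINNED_PREFIX = "1.24."  # hypothetical constant mirroring PyMuPDF==1.24.*


def is_pinned_version(version: str, prefix: str = PINNED_PREFIX) -> bool:
    """Return True when the version string falls inside the pinned minor series."""
    return version.startswith(prefix)


def require_pinned_version(version: str, prefix: str = PINNED_PREFIX) -> None:
    """Fail fast at startup instead of letting coordinates drift in production."""
    if not is_pinned_version(version, prefix):
        raise RuntimeError(
            f"PyMuPDF {version} is outside the pinned {prefix}* series"
        )
```

Calling `require_pinned_version(fitz.VersionBind)` once when a worker starts would surface an accidental upgrade before any page is transformed.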

Step-by-Step Implementation

Step 1 — Detect orientation with a dual-layer classifier

Never trust the page /Rotate integer alone; legacy scanning utilities strip or misapply it during multiplexing. Read the declared rotation first, then fall back to a geometric estimate from the rendered pixels when the metadata is absent or low-confidence. Each page is classified in isolation.

from __future__ import annotations

import logging
from dataclasses import dataclass

import fitz  # PyMuPDF
import numpy as np

logger = logging.getLogger("policy.rotation")

CONFIDENCE_FLOOR = 0.85


@dataclass(frozen=True, slots=True)
class OrientationResult:
    page_index: int
    angle: int          # normalized clockwise degrees: 0, 90, 180, 270
    confidence: float
    method: str         # "metadata" | "geometry"


def _geometric_angle(page: fitz.Page) -> tuple[int, float]:
    """Estimate dominant text-flow angle from a downsampled render.

    Projection variance only separates horizontal from vertical text flow, so
    this returns 0 or 90; it cannot tell 0 from 180 or 90 from 270.
    """
    pix = page.get_pixmap(matrix=fitz.Matrix(0.5, 0.5), colorspace=fitz.csGRAY)
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width)
    ink = img < 128  # binarize: text is dark
    row_density = ink.mean(axis=1).var()
    col_density = ink.mean(axis=0).var()
    # Heuristic confidence: grows with the gap between the two variances.
    confidence = float(min(0.99, 0.6 + abs(row_density - col_density) * 40))
    # Upright text yields high row-to-row variance; rotated text shifts it to columns.
    angle = 0 if row_density >= col_density else 90
    return angle, confidence


def detect_orientation(page: fitz.Page, page_index: int) -> OrientationResult:
    declared = page.rotation  # PyMuPDF normalizes to {0, 90, 180, 270}
    if declared in (90, 180, 270):
        logger.info("rotation.metadata", extra={"page": page_index, "angle": declared})
        return OrientationResult(page_index, declared, confidence=0.99, method="metadata")

    angle, confidence = _geometric_angle(page)
    logger.info(
        "rotation.geometry",
        extra={"page": page_index, "angle": angle, "confidence": round(confidence, 3)},
    )
    return OrientationResult(page_index, angle, confidence, method="geometry")

When the geometric confidence falls below CONFIDENCE_FLOOR, do not guess. Forcing a transformation on a low-confidence page — one cluttered with rotated watermarks, overlapping stamps, or marginalia — corrupts the coordinate space more reliably than leaving it alone.
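To see why comparing row and column variance in _geometric_angle separates upright from sideways text, it can help to run the same idea on a tiny synthetic page. The sketch below uses only the standard library; `projection_variances` and `rotate_90_clockwise` are hypothetical helpers, and the 8×8 grid of alternating ink and blank rows is an assumed stand-in for real text lines, not a realistic scan.

```python
from statistics import pvariance


def projection_variances(ink: list[list[int]]) -> tuple[float, float]:
    """Return (row_variance, col_variance) of the mean ink density."""
    row_means = [sum(row) / len(row) for row in ink]
    col_means = [sum(col) / len(col) for col in zip(*ink)]
    return pvariance(row_means), pvariance(col_means)


def rotate_90_clockwise(ink: list[list[int]]) -> list[list[int]]:
    """Rotate a 2-D grid a quarter turn clockwise."""
    return [list(row) for row in zip(*ink[::-1])]


# Synthetic page: even rows are solid "text lines" (1), odd rows are blank (0).
upright = [[1] * 8 if y % 2 == 0 else [0] * 8 for y in range(8)]
sideways = rotate_90_clockwise(upright)

upright_rows, upright_cols = projection_variances(upright)
sideways_rows, sideways_cols = projection_variances(sideways)
```

On the upright grid all the variance sits in the rows; after the quarter turn it moves entirely to the columns. Real pages are far noisier, which is why the classifier attaches a confidence and the floor exists.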

Step 2 — Apply a page-isolated rotation matrix

Loading a 500-page binder fully into memory to correct three pages is the primary cause of out-of-memory incidents in high-throughput ingestion. Open the source read-only, touch only the affected page, write the corrected page to a scratch file, and never deserialize the whole document.

from pathlib import Path


class LowConfidenceRotation(Exception):
    """Raised when orientation cannot be determined safely enough to auto-correct."""


def correct_page_orientation(
    src_path: Path,
    page_index: int,
    scratch_dir: Path,
) -> tuple[Path, OrientationResult]:
    with fitz.open(src_path) as doc:
        page = doc[page_index]
        result = detect_orientation(page, page_index)

        if result.confidence < CONFIDENCE_FLOOR:
            raise LowConfidenceRotation(
                f"page {page_index} confidence {result.confidence:.3f} < {CONFIDENCE_FLOOR}"
            )

        if result.angle == 0:
            return src_path, result

        # set_rotation rewrites the page's /Rotate entry so the visible text reads
        # upright; page.rect (the rotation-aware rectangle the extractor
        # re-anchors to) changes with it.
        normalized = (page.rotation - result.angle) % 360
        page.set_rotation(normalized)

        out = scratch_dir / f"{src_path.stem}.p{page_index}.corrected.pdf"
        # Copy only the corrected page into a fresh document; the context
        # manager closes it even if save() raises.
        with fitz.open() as single:
            single.insert_pdf(doc, from_page=page_index, to_page=page_index)
            single.save(out, garbage=4, deflate=True)

    logger.info(
        "rotation.applied",
        extra={"page": page_index, "angle": result.angle, "out": str(out)},
    )
    return out, result

Using insert_pdf for the single corrected page keeps peak resident memory bounded by one page plus its embedded images, not the full binder. Memory use therefore tracks the largest single page rather than the page count, and a 500-page policy stays under roughly 150 MB of RAM as long as no individual scan is unusually large.
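When several pages of one binder need checking, the LowConfidenceRotation gate has to be handled per page so that one ambiguous scan does not abort the rest. The following is a minimal sketch of that loop under stated assumptions: `BatchOutcome` and `correct_batch` are hypothetical names, the review list stands in for whatever human-review queue the pipeline uses, and the exception class is re-declared only so the sketch runs on its own.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass, field


class LowConfidenceRotation(Exception):
    """Stand-in for the exception defined in Step 2, repeated so this runs alone."""


@dataclass
class BatchOutcome:
    corrected: list[int] = field(default_factory=list)
    needs_review: list[tuple[int, str]] = field(default_factory=list)


def correct_batch(
    page_indices: Iterable[int],
    correct_one: Callable[[int], object],
) -> BatchOutcome:
    """Run a per-page corrector and collect low-confidence pages for review."""
    outcome = BatchOutcome()
    for idx in page_indices:
        try:
            correct_one(idx)
        except LowConfidenceRotation as exc:
            # Do not force a transform; record the page and the reason instead.
            outcome.needs_review.append((idx, str(exc)))
        else:
            outcome.corrected.append(idx)
    return outcome
```

In the pipeline, `correct_one` could be a closure such as `lambda i: correct_page_orientation(src_path, i, scratch_dir)`, with `needs_review` forwarded to the fallback queue afterwards.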

Step 3 — Re-anchor extraction and emit the audit record

Once the page is upright, the media box has changed and every downstream crop region must be recalculated against it. Inject the orientation metadata into the output payload so the normalization stage — handled by Field Mapping Strategies — never assumes a static layout, and write an immutable audit entry for the correction.

import hashlib
import json
import time
from dataclasses import asdict


@dataclass(frozen=True, slots=True)
class RotationAudit:
    source_sha256: str
    page_index: int
    detected_angle: int
    confidence: float
    method: str
    media_box: tuple[float, float, float, float]
    engine_version: str
    applied_at: float


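def build_audit(src_path: Path, corrected: Path, result: OrientationResult) -> RotationAudit:
    # Hash in 1 MiB chunks so a large binder is never read into memory whole.
    sha = hashlib.sha256()
    with src_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    digest = sha.hexdigest()
    with fitz.open(corrected) as doc:
        # page.rect is the rotation-aware page rectangle (x0, y0, x1, y1).
        box = tuple(round(v, 2) for v in doc[0].rect)  # re-anchored media box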

    return RotationAudit(
        source_sha256=digest,
        page_index=result.page_index,
        detected_angle=result.angle,
        confidence=round(result.confidence, 4),
        method=result.method,
        media_box=box,  # type: ignore[arg-type]
        engine_version=f"pymupdf-{fitz.VersionBind}",
        applied_at=time.time(),
    )


def append_audit(record: RotationAudit, ledger: Path) -> None:
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")

Downstream extractors must read media_box from this record rather than hardcoded pixel offsets — that single discipline is what makes the pipeline resilient across carrier formatting variations.
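One way an extractor could follow that discipline is to store crop regions as fractions of the page and resolve them against the recorded media_box at run time. The sketch below assumes the JSON-lines ledger written by append_audit; `latest_media_box` and `resolve_region` are hypothetical helpers, and the fractional-region convention is an assumption for illustration, not something the pipeline prescribes.

```python
from __future__ import annotations

import json
from collections.abc import Iterable

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def latest_media_box(
    ledger_lines: Iterable[str], source_sha256: str, page_index: int
) -> Box | None:
    """Return the most recent media_box recorded for one page, or None."""
    found = None
    for line in ledger_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record["source_sha256"] == source_sha256 and record["page_index"] == page_index:
            found = tuple(record["media_box"])  # later entries win
    return found


def resolve_region(media_box: Box, fractions: Box) -> Box:
    """Map a region given as fractions (0..1) of the page onto the recorded box."""
    x0, y0, x1, y1 = media_box
    width, height = x1 - x0, y1 - y0
    fx0, fy0, fx1, fy1 = fractions
    return (x0 + fx0 * width, y0 + fy0 * height, x0 + fx1 * width, y0 + fy1 * height)
```

With this shape, a region described as "the top-right quarter strip" resolves to different absolute coordinates before and after a page is re-anchored, without any constant in the extractor changing.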

Verification & Testing

Assert orientation is correct before committing the page to OCR. The cheapest reliable signal is text-density variance after correction: an upright page concentrates ink variance across rows, and a fixture-based round trip confirms the matrix actually moved the media box.

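def assert_upright(corrected: Path) -> None:
    with fitz.open(corrected) as doc:
        # Check the rendered pixels directly. A corrected page usually carries a
        # non-zero /Rotate, so detect_orientation would return that metadata
        # value instead of measuring the visible text flow.
        angle, confidence = _geometric_angle(doc[0])
    assert angle == 0, f"page still rotated by {angle}°"
    assert confidence >= CONFIDENCE_FLOOR, "post-correction confidence too low"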


def test_known_landscape_fixture(tmp_path: Path) -> None:
    src = Path("tests/fixtures/loss_run_landscape_90.pdf")
    corrected, result = correct_page_orientation(src, page_index=0, scratch_dir=tmp_path)
    assert result.angle == 90  # detected the rotation
    assert_upright(corrected)
    audit = build_audit(src, corrected, result)
    # Re-anchored media box must be portrait after a 90° landscape correction.
    x0, y0, x1, y1 = audit.media_box
    width, height = x1 - x0, y1 - y0
    assert height > width, "media box did not re-anchor to portrait"

Maintain a fixtures directory with one document per orientation (0, 90, 180, 270) plus one deliberately ambiguous scan that should raise LowConfidenceRotation. Run these on every CI build so a PyMuPDF upgrade that changes matrix semantics fails fast instead of corrupting production coordinates.

Compliance & Audit Note

Every automated modification of a policy document must be traceable, because state departments of insurance and NAIC data-governance reviewers treat an undocumented transformation as a potential data-integrity defect. The RotationAudit record — source SHA-256, page index, detected angle, classifier confidence and method, re-anchored media box, engine version, and timestamp — makes each correction reconstructable and distinguishes automated deskew from manual override. Writing it to an append-only ledger lets you hash-chain the original submission, the corrected intermediate, and the final extracted dataset, which is the evidentiary spine the broader Core Architecture & Compliance Mapping backbone relies on during regulatory examination. Retain the original, unrotated submission in immutable cold storage; the corrected page is a derived artifact, never a replacement for the record of original receipt.
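The hash-chaining idea can be sketched with the standard library alone. This is an illustrative construction, not the ledger format the pipeline mandates: `GENESIS`, `chain_entry`, and `verify_chain` are hypothetical names, and the payload dictionaries stand in for whatever is recorded about the original submission, the corrected intermediate, and the extracted dataset.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical marker used as the first link's predecessor


def chain_entry(prev_hash: str, payload: dict) -> dict:
    """Build a ledger entry whose hash covers the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    return {"prev_hash": prev_hash, "payload": payload, "entry_hash": entry_hash}


def verify_chain(entries: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in entries:
        expected = chain_entry(prev, entry["payload"])["entry_hash"]
        if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry's hash depends on its predecessor, altering the record of the original submission after the fact would invalidate every later link, which is what makes the sequence checkable during an examination.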

Troubleshooting Checklist

Page corrected but text still reads sideways

set_rotation was applied without accounting for the page’s existing /Rotate value, so the angles compounded instead of cancelling.

Fix: normalize against the current rotation — normalized = (page.rotation - detected_angle) % 360 — rather than setting the detected angle directly, and re-run assert_upright on the output.
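The arithmetic behind that fix is small enough to check by hand. The helper below, with the hypothetical name `normalized_rotation`, simply wraps the same expression so the cancelling behaviour is visible for a few input pairs.

```python
def normalized_rotation(current_rotation: int, detected_angle: int) -> int:
    """Return the /Rotate value that cancels the detected clockwise angle."""
    return (current_rotation - detected_angle) % 360


# No existing rotation, text detected 90 degrees clockwise: rotate back by 90.
no_metadata = normalized_rotation(0, 90)        # 270
# Existing /Rotate already equals the detected angle: the two cancel.
cancelling = normalized_rotation(90, 90)        # 0
# The modulo keeps negative differences inside the 0..359 range.
wrapped = normalized_rotation(180, 270)         # 270
```

Setting the detected angle directly would give 90 in the first case, which turns the page the wrong way and leaves it upside down rather than upright.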

Out-of-memory during a large binder

The whole document is being rendered or concatenated in memory instead of one page at a time.

Fix: keep correction page-isolated with insert_pdf(from_page=i, to_page=i), save with garbage=4, deflate=True, and never hold rendered pixmaps across the loop — release each Pixmap before advancing.

Ambiguous scan auto-corrected the wrong way

A page dense with rotated stamps, watermarks, or marginalia fooled the geometric classifier into a confident-looking but wrong angle.

Fix: honor the CONFIDENCE_FLOOR gate — raise LowConfidenceRotation and route the page to human review through the OCR Integration & Sync fallback queue instead of forcing a transform.

Downstream coordinates drift after correction

The extractor is still anchored to the pre-rotation media box or hardcoded pixel offsets.

Fix: have the extractor read media_box from the RotationAudit record for that page, and add a tolerance assertion that fails the page if table boundaries fall outside the re-anchored rectangle.
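A tolerance assertion of that kind might look like the sketch below. `region_within_box` is a hypothetical helper, the one-point default tolerance is an assumed value to tune per pipeline, and boxes follow the (x0, y0, x1, y1) layout stored in the audit record.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def region_within_box(region: Box, media_box: Box, tolerance: float = 1.0) -> bool:
    """Return True when the region lies inside the media box, within tolerance."""
    rx0, ry0, rx1, ry1 = region
    bx0, by0, bx1, by1 = media_box
    return (
        rx0 >= bx0 - tolerance
        and ry0 >= by0 - tolerance
        and rx1 <= bx1 + tolerance
        and ry1 <= by1 + tolerance
    )


# A table detected against the old landscape layout...
table = (36.0, 100.0, 756.0, 500.0)
fits_landscape = region_within_box(table, (0.0, 0.0, 792.0, 612.0))
# ...no longer fits once the page has been re-anchored to portrait.
fits_portrait = region_within_box(table, (0.0, 0.0, 612.0, 792.0))
```

A False result after correction is a sign that the extractor is still working from stale coordinates, so the page can be failed before any values are read from it.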

Monetary values shift by an order of magnitude

A residual skew displaced a decimal’s bounding box even though the page passed orientation checks.

Fix: cross-validate extracted monetary totals against line-item summations during validation and route mismatches beyond tolerance to human review, because orientation correction alone does not guarantee sub-degree alignment.
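The cross-check can be expressed in a few lines with Decimal, which avoids binary floating-point rounding on currency. This is a minimal sketch: `totals_reconcile` is a hypothetical name, the one-cent tolerance is an assumed default, and the figures reuse the $1,245.00 example from the problem statement.

```python
from collections.abc import Iterable
from decimal import Decimal


def totals_reconcile(
    line_items: Iterable[Decimal],
    stated_total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Return True when the line items sum to the stated total within tolerance."""
    computed = sum(line_items, Decimal("0"))
    return abs(computed - stated_total) <= tolerance


items = [Decimal("1000.00"), Decimal("245.00")]
clean_read = totals_reconcile(items, Decimal("1245.00"))
shifted_read = totals_reconcile(items, Decimal("12450.00"))
```

A total that has drifted by an order of magnitude fails the comparison immediately, so the page can be sent to review even though it passed the orientation checks.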

OCR Integration & Sync — the parent workflow this deskew step runs inside
Extracting Coverage Limits from Scanned Policy PDFs — downstream value extraction once pages are upright
Building Async Batch Processors for Daily Policy Ingestion — orchestrating correction across nightly carrier batches
Optimizing Camelot for Complex Insurance Tables — table re-anchoring after orientation normalization
Policy PDF Parsing & Extraction Workflows — the parent extraction domain