Implementing Error Handling & Retry Logic in Geospatial Schema Mapping Pipelines

Geospatial ETL workflows operating at municipal or federal scale require deterministic failure management. When automating attribute standardization across heterogeneous sources, transient network faults, malformed geometries, or schema drift will inevitably interrupt batch processing. Error handling and retry logic must be engineered as a first-class pipeline component rather than an afterthought. This stage operates at the core of Automated Attribute Transformation & ETL Workflows, wrapping every transformation step with fault-tolerant routing, tolerance thresholds, and compliance-grade audit trails.

This page covers the resilience layer that surrounds attribute transformation: classifying errors, retrying transient faults, and quarantining records that cannot be recovered. It does not implement the transformations themselves — type casting belongs to Field Renaming & Type Coercion Rules, the recursive traversal of structured payloads belongs to Nested JSON/GeoJSON Flattening, and the streaming execution loop these policies wrap is described in Batch Schema Processing Pipelines. Reprojection faults raised here are recovered upstream by CRS Normalization & Sync.

Declarative Retry & Tolerance Manifest

Resilient schema mapping engines drive failure behaviour from externalized manifests rather than hard-coded constants. By isolating retry parameters, timeout windows, tolerance thresholds, and fallback destinations into YAML, engineering teams achieve reproducible execution across staging and production and can review a policy change as a diff. The manifest must enforce strict typing and explicitly distinguish mandatory from optional fields so that a typo fails loudly at load time instead of silently widening the failure budget in production.

# pipeline_resilience.yaml — validated at startup against the pydantic model below
pipeline:
  id: "municipal-parcel-sync"
  retry_policy:
    max_attempts: 3              # MANDATORY: int >= 1. Hard stop threshold.
    base_delay_seconds: 2.0      # MANDATORY: float > 0. Initial backoff window.
    jitter: true                 # OPTIONAL: bool. Defaults to true. Prevents thundering herd.
    retryable_status_codes:      # MANDATORY: list[int]. Transient fault identifiers.
      - 429
      - 502
      - 503
    terminal_status_codes:       # OPTIONAL: list[int]. Defaults to [400, 401, 403, 404, 500].
      - 400
      - 401
  tolerance:
    max_null_rate: 0.005         # MANDATORY: float 0.0-1.0. Acceptable missing-data threshold.
    halt_on_geometry_invalid: true   # MANDATORY: bool. Enforces spatial integrity.
    precision_decimal_threshold: 6   # OPTIONAL: int. Defaults to 6. Coordinate precision floor.
  dead_letter:
    sink_uri: "s3://gis-quarantine/parcel-sync/"  # MANDATORY: str. Durable failure store.
    retain_days: 90              # OPTIONAL: int. Defaults to 90. Governance retention window.

Field	Required	Type / range	Default	Purpose
`retry_policy.max_attempts`	Mandatory	`int >= 1`	—	Hard ceiling on reprocessing of a single record.
`retry_policy.base_delay_seconds`	Mandatory	`float > 0`	—	First backoff window; doubled each attempt.
`retry_policy.jitter`	Optional	`bool`	`true`	Randomizes delay to avoid synchronized retry storms.
`retry_policy.retryable_status_codes`	Mandatory	`list[int]`	—	Faults eligible for retry; everything else is terminal.
`retry_policy.terminal_status_codes`	Optional	`list[int]`	`[400,401,403,404,500]`	Faults that must fail fast without retry.
`tolerance.max_null_rate`	Mandatory	`float 0.0–1.0`	—	Batch-level null-injection ceiling before rejection.
`tolerance.halt_on_geometry_invalid`	Mandatory	`bool`	—	Stops the batch on any invalid geometry.
`tolerance.precision_decimal_threshold`	Optional	`int`	`6`	Minimum coordinate decimal places accepted.
`dead_letter.sink_uri`	Mandatory	`str`	—	Durable destination for exhausted records.
`dead_letter.retain_days`	Optional	`int`	`90`	Quarantine retention for audit/reprocessing.

Mandatory fields must be present and validated at pipeline initialization; a missing value triggers an immediate ValidationError before any spatial I/O occurs. Optional fields inherit the documented defaults above, and any override must still pass schema validation so that policy drift cannot creep in unnoticed.
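
As a dependency-free illustration of the mandatory/optional rule, the sketch below loads the `retry_policy` block into a frozen dataclass. It is a stand-in, not the pydantic model the pipeline validates against: the names `RetryPolicySketch` and `load_retry_policy`, and the `ValueError` they raise, are assumptions made for this example. The field names, ranges, and defaults are taken from the table above.

```python
from dataclasses import dataclass

# Default from the manifest table; applied when the key is omitted.
DEFAULT_TERMINAL = (400, 401, 403, 404, 500)


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical stand-in for the real pydantic model."""
    max_attempts: int                         # mandatory, int >= 1
    base_delay_seconds: float                 # mandatory, float > 0
    retryable_status_codes: tuple[int, ...]   # mandatory
    jitter: bool = True                       # optional
    terminal_status_codes: tuple[int, ...] = DEFAULT_TERMINAL  # optional

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.max_attempts, bool) or not isinstance(self.max_attempts, int):
            raise ValueError("max_attempts must be an int")
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")
        if not self.base_delay_seconds > 0:
            raise ValueError("base_delay_seconds must be > 0")


def load_retry_policy(raw: dict) -> RetryPolicySketch:
    """Build the policy from a parsed YAML mapping.

    A missing mandatory key or a misspelled key surfaces as a TypeError
    from the dataclass constructor and is re-raised as ValueError.
    """
    kwargs = {k: tuple(v) if isinstance(v, list) else v for k, v in raw.items()}
    try:
        return RetryPolicySketch(**kwargs)
    except TypeError as exc:
        raise ValueError(f"invalid retry_policy: {exc}") from exc
```

Because the constructor rejects unknown keyword arguments, a typo such as `max_attemps` fails at load time rather than silently falling back to a default.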

Preprocessing: Normalising the Failure Surface

The hardest part of geospatial retry logic is not the loop; it is that the underlying libraries raise wildly different exception types for conceptually identical faults. A transient PROJ grid-server timeout surfaces as pyproj.exceptions.ProjError, an unparseable or mismatched CRS definition as pyproj.exceptions.CRSError, a type-cast overflow as pyarrow.lib.ArrowInvalid, and a self-intersecting ring as a GEOS TopologyException, which Shapely 2.x raises as shapely.errors.GEOSException. Before a record enters the retry loop, every one of these must be collapsed into a single classified error carrying a transient flag, so the loop can make a deterministic route/retry decision without library-specific branching.

# preprocessing/error_classifier.py
# Python 3.10+, pyproj >=3.6, pyarrow >=14, shapely >=2.0
from dataclasses import dataclass
from pyproj.exceptions import CRSError, ProjError
from pyarrow.lib import ArrowInvalid
from shapely.errors import GEOSException

TRANSIENT_EXC = (ProjError,)            # grid server / network-style faults
TERMINAL_EXC = (CRSError, ArrowInvalid, GEOSException)  # data-shape faults

@dataclass(slots=True)
class ClassifiedError:
    record_id: str
    error_class: str
    transient: bool
    detail: str

def classify(record_id: str, exc: Exception) -> ClassifiedError:
    status = getattr(exc, "status_code", None)
    if status in (429, 502, 503):       # upstream API throttling / gateway
        transient = True
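    elif isinstance(exc, TERMINAL_EXC):
        # Checked before TRANSIENT_EXC: CRSError subclasses ProjError in pyproj,
        # so the reverse order would misclassify every CRSError as transient.
        transient = False
    elif isinstance(exc, TRANSIENT_EXC):
        transient = True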
    else:
        transient = False               # unknown == terminal; never guess "retry"
    return ClassifiedError(
        record_id=record_id,
        error_class=type(exc).__name__,
        transient=transient,
        detail=str(exc)[:500],
    )

The default-to-terminal rule is deliberate: an unrecognised exception is never retried, because blindly replaying a deterministic data fault wastes the attempt budget and delays quarantine. Records must also arrive flattened and reprojected — if nested attributes or mismatched CRS reach this stage, the resulting exceptions are data-shape faults that no amount of retrying will fix.

Execution Engine & Exponential Backoff

The retry mechanism operates at both the record and batch levels. When a spatial transformation or attribute coercion fails, the engine classifies the exception, increments the attempt counter, and either reschedules reprocessing or escalates to terminal handling. For transient failures such as database connection resets, temporary API throttling, and grid-server timeouts, exponential backoff (covered in depth in Implementing exponential backoff in schema mapping jobs) spaces retries at predictably growing intervals, and randomized jitter keeps a batch of retries from hammering an already-degraded dependency in lockstep.
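
The doubling rule from the manifest (first window `base_delay_seconds`, doubled on each attempt) makes the waiting time easy to budget ahead of a run. The helpers below are an illustrative sketch; the names `backoff_schedule` and `worst_case_wait` are not part of the pipeline, and the jitter bound assumes an additive jitter of at most one `base_delay_seconds` per sleep.

```python
def backoff_schedule(base_delay: float, max_attempts: int) -> list[float]:
    """Deterministic part of the delay slept after each failed attempt.

    A run with max_attempts tries sleeps at most max_attempts - 1 times,
    because the final failure raises instead of sleeping.
    """
    return [base_delay * (2 ** (attempt - 1)) for attempt in range(1, max_attempts)]


def worst_case_wait(base_delay: float, max_attempts: int, jitter: bool = True) -> float:
    """Upper bound on total sleep time for one record.

    Assumes jitter adds between 0 and base_delay seconds to each sleep.
    """
    delays = backoff_schedule(base_delay, max_attempts)
    extra = base_delay * len(delays) if jitter else 0.0
    return sum(delays) + extra
```

With the manifest values `base_delay_seconds: 2.0` and `max_attempts: 3`, the schedule is two sleeps of 2.0 s and 4.0 s, so a record that never recovers waits 6.0 s without jitter and at most 10.0 s with it before it is dead-lettered.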

# execution/retry_engine.py — Python 3.10+, stdlib only for the core loop
import time
import random
import logging
from typing import Callable, Any
from preprocessing.error_classifier import classify, ClassifiedError

logger = logging.getLogger(__name__)

class RetriesExhausted(Exception):
    def __init__(self, err: ClassifiedError) -> None:
        self.err = err
        super().__init__(f"exhausted retries for {err.record_id}")

def execute_with_backoff(
    func: Callable[[], Any],
    record_id: str,
    max_attempts: int,
    base_delay: float,
    jitter: bool = True,
) -> Any:
    """Run func() with exponential backoff. Raises on terminal faults
    immediately and RetriesExhausted once the attempt budget is spent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:                      # noqa: BLE001 — classified next
            err = classify(record_id, exc)
            if not err.transient:
                logger.error("terminal %s on %s: %s",
                             err.error_class, record_id, err.detail)
                raise                                  # fail fast, no retry
            if attempt == max_attempts:
                logger.error("budget spent (%d) for %s [%s]",
                             max_attempts, record_id, err.error_class)
                raise RetriesExhausted(err) from exc
            delay = base_delay * (2 ** (attempt - 1))
            if jitter:
                delay += random.uniform(0, base_delay)  # decorrelate retries
            logger.warning("transient %s on %s; retry %d/%d in %.2fs",
                           err.error_class, record_id,
                           attempt + 1, max_attempts, delay)
            time.sleep(delay)

There are no silent failures in this loop: a terminal fault propagates immediately, and an exhausted record raises a typed RetriesExhausted carrying the classified error so the caller can route it deterministically. The caller — typically the streaming loop described in Batch Schema Processing Pipelines — catches RetriesExhausted and forwards the payload to the dead-letter sink rather than aborting the whole batch.
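
A minimal sketch of that calling pattern follows. It is not the loop from Batch Schema Processing Pipelines: `process_batch`, `run_with_retry`, and `dead_letter` are hypothetical names, and `RetriesExhausted` is redefined locally only so the example runs on its own. Terminal faults are left to propagate here; routing them according to the failure table is omitted for brevity.

```python
from typing import Any, Callable, Iterable


class RetriesExhausted(Exception):
    """Local stand-in for execution.retry_engine.RetriesExhausted."""


def process_batch(
    records: Iterable[dict],
    run_with_retry: Callable[[dict], Any],
    dead_letter: Callable[[dict, Exception], None],
) -> dict[str, int]:
    """Process every record; quarantine exhausted ones instead of aborting."""
    counts = {"ok": 0, "quarantined": 0}
    for record in records:
        try:
            run_with_retry(record)
            counts["ok"] += 1
        except RetriesExhausted as exc:
            # Forward the full payload so it can be replayed later.
            dead_letter(record, exc)
            counts["quarantined"] += 1
    return counts
```

In a real caller, `run_with_retry` would wrap the transformation in `execute_with_backoff`, and `dead_letter` would persist the payload to the configured `dead_letter.sink_uri`.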

Queue-Based Routing & Idempotency

Persistent failures exceeding the configured threshold must be isolated from the primary stream. Durable storage for failed payloads preserves the original geometry, source metadata, and transformation context for manual review or automated reprocessing during off-peak windows. Queue consumers must enforce idempotent write operations so that a successful retry never inserts a duplicate feature: derive a deterministic feature ID such as source_id + sha1(canonical_attributes) and write through an UPSERT against the target geodatabase. Every routing decision — retry, fallback, or quarantine — is emitted as structured JSON for downstream auditability.
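
The idempotent write can be sketched with the standard library, using an in-memory SQLite table as a stand-in for the target geodatabase. The table layout, the `:` separator in the ID, and the choice of key-sorted JSON as the canonical attribute encoding are assumptions for this example; it also assumes attributes are JSON-serialisable and a SQLite build that supports `ON CONFLICT ... DO UPDATE` (3.24 or newer).

```python
import hashlib
import json
import sqlite3


def feature_id(source_id: str, attributes: dict) -> str:
    """source_id plus sha1 over a canonical (key-sorted, compact) JSON encoding."""
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"{source_id}:{digest}"


def upsert_feature(conn: sqlite3.Connection, source_id: str,
                   attributes: dict, wkt: str) -> str:
    """Insert the feature, or overwrite the row that already holds this ID."""
    fid = feature_id(source_id, attributes)
    conn.execute(
        """
        INSERT INTO features (id, attributes, geom_wkt) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            attributes = excluded.attributes,
            geom_wkt = excluded.geom_wkt
        """,
        (fid, json.dumps(attributes, sort_keys=True), wkt),
    )
    return fid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (id TEXT PRIMARY KEY, attributes TEXT, geom_wkt TEXT)"
)
attrs = {"zoning": "R1", "apn": "001023"}
upsert_feature(conn, "parcel:04019", attrs, "POINT (0 0)")
upsert_feature(conn, "parcel:04019", attrs, "POINT (0 0)")  # simulated retry
row_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
```

Both calls resolve to the same ID regardless of attribute key order, so the simulated retry updates the existing row and `row_count` stays at 1.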

Failure Modes & Fallback Routing

Every fault must resolve to exactly one deterministic action. The table below maps the failure types this stage encounters to their cause and recovery route; nothing is swallowed.

Failure type	Typical cause	Classification	Deterministic recovery action
HTTP 429 / 502 / 503	Upstream API throttling or gateway flap	Transient	Exponential backoff with jitter; retry up to `max_attempts`, then dead-letter.
`ProjError` (grid timeout)	PROJ datum-grid server unreachable	Transient	Retry with backoff; on exhaustion route to dead-letter, flag for Datum Transformation Fallback Chains.
`CRSError`	Unparseable or mismatched CRS on the record	Terminal	Fail fast; quarantine for CRS repair upstream — no retry.
`ArrowInvalid`	Type-coercion overflow or schema mismatch	Terminal	Quarantine with the offending field name; fix the coercion rule, not the loop.
`TopologyException` / invalid geometry	Self-intersection, null geometry	Terminal (batch-halting if `halt_on_geometry_invalid`)	Reject record; halt batch when integrity flag is set.
Null rate over `max_null_rate`	Source-wide attribute gaps	Batch-level	Reject the whole batch before commit; emit a schema-drift warning.
`RetriesExhausted`	Transient fault that never cleared	Terminal after retries	Persist full payload to `dead_letter.sink_uri` for off-peak reprocessing.
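
The batch-level null check in the table can be sketched as below. The source does not define exactly how the null rate is computed, so this example assumes it is the fraction of (record, field) cells that are missing or `None` across the whole batch; `null_rate` and `batch_within_tolerance` are hypothetical names.

```python
from typing import Iterable, Mapping, Sequence


def null_rate(records: Iterable[Mapping[str, object]], fields: Sequence[str]) -> float:
    """Fraction of (record, field) cells that are missing or None."""
    total = 0
    nulls = 0
    for record in records:
        for name in fields:
            total += 1
            if record.get(name) is None:
                nulls += 1
    # An empty batch has no nulls to report.
    return nulls / total if total else 0.0


def batch_within_tolerance(
    records: Iterable[Mapping[str, object]],
    fields: Sequence[str],
    max_null_rate: float,
) -> bool:
    """True when the batch may be committed under the manifest's null ceiling."""
    return null_rate(records, fields) <= max_null_rate
```

Running this check before the commit, rather than per record, matches the batch-level route in the table: a batch that exceeds `tolerance.max_null_rate` is rejected as a whole.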

Compliance Reporting Output

Compliance-grade pipelines maintain immutable audit trails. For every record that is retried, rerouted to a fallback, or quarantined, the stage appends one structured JSON line to the rejection log using Python's built-in logging module with a JSON formatter. The minimum lineage fields are the record identifier, the source dataset URI, the classified error_class, the transient flag, the attempt count reached, the final routing destination, and a UTC timestamp.

{
  "ts": "2026-06-25T14:02:11Z",
  "pipeline_id": "municipal-parcel-sync",
  "record_id": "parcel:04019:001023",
  "source_uri": "s3://county-ingest/2026Q2/parcels.gpkg",
  "error_class": "ProjError",
  "transient": true,
  "attempts": 3,
  "routing": "dead_letter",
  "sink_uri": "s3://gis-quarantine/parcel-sync/",
  "detail": "PROJ grid server timeout: us_noaa_nadcon"
}

These records satisfy municipal and federal data-governance requirements, give analysts a single queryable surface for root-cause analysis during schema-drift events, and let a reprocessing job replay exactly the quarantined records once the upstream fault clears. The batch-level summary that aggregates these lines — total processed, retried, quarantined, and null-rate observed — is written by the surrounding pipeline and folds into the broader audit output documented in Batch Schema Processing Pipelines.
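
One way to produce such lines with only the standard library is a small `logging.Formatter` subclass, sketched below. The class name `JsonLineFormatter`, the `LINEAGE_FIELDS` tuple, and the convention of passing lineage values through `extra=` are assumptions for this example, not the pipeline's actual formatter.

```python
import json
import logging
from datetime import datetime, timezone

# Lineage fields shown in the sample record; callers supply them via `extra=`.
LINEAGE_FIELDS = ("pipeline_id", "record_id", "source_uri", "error_class",
                  "transient", "attempts", "routing", "sink_uri")


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object on a single line."""

    def format(self, record: logging.LogRecord) -> str:
        # record.created is a Unix timestamp; emit it as UTC with a Z suffix.
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {"ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ")}
        for name in LINEAGE_FIELDS:
            if hasattr(record, name):      # only fields the caller provided
                payload[name] = getattr(record, name)
        payload["detail"] = record.getMessage()
        return json.dumps(payload)
```

A caller would attach the formatter to a handler and log with, for example, `logger.error("PROJ grid server timeout", extra={"record_id": "parcel:04019:001023", "attempts": 3, "routing": "dead_letter"})`; keys in `extra` become attributes on the log record, which is how the formatter finds them.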

CI Integration

Resilience logic must be validated before deployment, not discovered in production. Gate the retry manifest and the backoff behaviour with two checks in CI: a schema lint that rejects an out-of-range policy, and a fault-injection simulation that proves transient faults retry and terminal faults fail fast.

# .github/workflows/pipeline_resilience.yml
name: Validate Pipeline Resilience
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with: { python-version: "3.12" }   # 3.10+ required
      - name: Install dependencies
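        # The fault-injection tests import the classifier, which needs pyproj, pyarrow, and shapely.
        run: pip install "pydantic>=2" pytest pyyaml "pyproj>=3.6" "pyarrow>=14" "shapely>=2.0"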
      - name: Validate retry manifest
        run: |
          python -c "
          import yaml, sys
          from pydantic import BaseModel, Field, ValidationError

          class RetryPolicy(BaseModel):
              max_attempts: int = Field(..., ge=1)
              base_delay_seconds: float = Field(..., gt=0)
              jitter: bool = True
              retryable_status_codes: list[int]

          try:
              cfg = yaml.safe_load(open('pipeline_resilience.yaml'))
              RetryPolicy(**cfg['pipeline']['retry_policy'])
              print('manifest valid')
          except ValidationError as e:
              print(f'config invalid: {e}', file=sys.stderr); sys.exit(1)
          "
      - name: Run fault-injection simulation
        run: pytest tests/test_retry_logic.py -v

A matching pytest module asserts the two invariants that matter most: a function that raises an exception carrying a retryable status code is invoked exactly max_attempts times before RetriesExhausted is raised, and a function that raises a terminal CRSError is invoked exactly once. Wiring these as a required status check means no policy change merges without proving its routing behaviour.

# tests/test_retry_logic.py — pytest, Python 3.10+
import pytest
from execution.retry_engine import execute_with_backoff, RetriesExhausted

class Throttled(Exception):
    status_code = 503

def test_transient_exhausts_budget():
    calls = {"n": 0}
    def fn():
        calls["n"] += 1
        raise Throttled()
    with pytest.raises(RetriesExhausted):
        execute_with_backoff(fn, "rec-1", max_attempts=3,
                             base_delay=0.0, jitter=False)
    assert calls["n"] == 3            # retried, not silently dropped

def test_terminal_fails_fast():
    from pyproj.exceptions import CRSError
    calls = {"n": 0}
    def fn():
        calls["n"] += 1
        raise CRSError("unparseable")
    with pytest.raises(CRSError):
        execute_with_backoff(fn, "rec-2", max_attempts=3,
                             base_delay=0.0, jitter=False)
    assert calls["n"] == 1            # no wasted retries on a data fault

Frequently Asked Questions

Why default an unrecognised exception to terminal rather than retrying it? Most faults that crash a geospatial transform are deterministic data-shape problems — an unparseable CRS string, a self-intersecting ring, a type-cast overflow — and replaying them simply burns the attempt budget and delays quarantine. The classifier only marks a fault transient when it matches a known network-style signature (an upstream 429/502/503, or a ProjError from a grid server). Everything else fails fast, so a genuinely broken record reaches the dead-letter sink in one attempt instead of three.

How do retries avoid creating duplicate features in the target geodatabase? The write path is idempotent by construction. Each feature carries a deterministic ID derived from source_id + sha1(canonical_attributes), and the consumer writes through an UPSERT rather than an INSERT. A retry that succeeds on its second or third attempt resolves to the same ID and overwrites the partial row instead of appending a second copy — so a flapping connection can never inflate the feature count.

Should backoff jitter ever be disabled? Only in tests, where deterministic timing makes assertions stable (the tests above pass jitter=False). In production keep it on: without jitter, a batch of records that all hit the same throttled PROJ grid server retry in lockstep and re-saturate the dependency the instant it recovers. The randomized component decorrelates those retries. Backoff curve and jitter tuning are covered in depth in implementing exponential backoff in schema mapping jobs.

What is the difference between a record-level reject and a batch-level halt? A record-level reject quarantines a single feature and lets the batch continue — the normal path for a CRSError or an exhausted retry. A batch-level halt stops the whole commit before any write lands, and is reserved for integrity violations: an invalid geometry when halt_on_geometry_invalid is set, or a source-wide null rate over max_null_rate. The distinction keeps one bad parcel from aborting a clean batch while still refusing to publish a structurally corrupt dataset.

Where should a ProjError actually be fixed? Not in the retry loop. A transient ProjError from an unreachable datum-grid server is retried here, but a persistent one means a missing transformation grid, which is recovered upstream by the Datum Transformation Fallback Chains strategy under CRS Normalization & Sync. This stage only routes the fault and records it; the grid repair is a CRS concern.

Deeper Implementation Guides

Implementing exponential backoff in schema mapping jobs — backoff curve tuning, jitter strategies, and attempt-budget math for throttled upstream APIs.

Related

Automated Attribute Transformation & ETL Workflows — the parent discipline this resilience layer wraps.
Batch Schema Processing Pipelines — the streaming execution loop that calls this retry engine.
Field Renaming & Type Coercion Rules — where ArrowInvalid and coercion faults originate.
Nested JSON/GeoJSON Flattening — preprocessing that removes structural faults before this stage runs.
CRS Normalization & Sync — upstream recovery for ProjError and CRSError quarantine routes.
