Implementing Batch Schema Processing Pipelines for Geospatial Standardization

Government GIS teams and open-source maintainers routinely ingest heterogeneous spatial datasets — vendor CAD exports, legacy municipal shapefiles, REST API GeoJSON responses — that must reach a deterministic, standards-compliant attribute schema before publication or archival. Batch schema processing pipelines are the execution layer that performs this alignment at scale, translating raw inputs into validated, publication-ready feature collections without manual reconciliation. This pipeline stage operates at the core of Automated Attribute Transformation & ETL Workflows, driving the per-field mapping logic across thousands of files in a single, auditable run.

This page covers the scope from schema manifest authoring through streaming execution and compliance reporting. It does not cover CRS reprojection (handled by CRS Normalization & Sync) or the recursive JSON traversal required to flatten nested payloads before attribute coercion begins (covered by Nested JSON/GeoJSON Flattening).

Configuration-as-Code: The Schema Manifest

Pipeline architects must declare all mapping rules in version-controlled YAML manifests rather than embedding them in transformation code. Each manifest entry specifies the source field path, target column name, expected type, and coercion tolerances. Storing these manifests in Git alongside the pipeline source gives every change a traceable commit, makes rollbacks deterministic, and ensures that Field Renaming & Type Coercion Rules execute identically across every municipal dataset that flows through the system.

# schemas/manifest.yaml (geopandas >=0.14, pyarrow >=14)
schema_version: "2.1"
target_crs: "EPSG:4326"
compliance_profile: "FGDC-STD-001"
fields:
  # Mandatory: record is routed to quarantine if null or missing; the batch continues
  - source: "PROP_ID"
    target: "parcel_id"
    type: "string"
    required: true
    description: "Unique parcel identifier. Null value routes entire record to quarantine."

  # Optional with fallback: pipeline continues; fallback applied and flagged
  - source: "ACRES"
    target: "area_hectares"
    type: "float64"
    required: false
    fallback: 0.0
    tolerance: 1e-6
    description: "Area in hectares after unit conversion. Missing → 0.0 with audit flag."

  # Optional with regex validation
  - source: "ZONING"
    target: "zoning_code"
    type: "string"
    required: false
    fallback: "UNKNOWN"
    validation_regex: "^[A-Z]{2,4}-\\d{2}$"
    description: "Zoning designation. Non-matching values logged as schema_drift warnings."
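
One way to catch manifest mistakes before a batch starts is to load each entry into a typed rule object. The sketch below assumes the manifest has already been parsed into a Python dict (for example with `yaml.safe_load`); `FieldRule`, `parse_rules`, and `ALLOWED_TYPES` are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumption: the four type names used by the manifest on this page.
ALLOWED_TYPES = {"string", "float64", "int64", "bool"}


@dataclass(frozen=True)
class FieldRule:
    """One manifest entry, validated at load time (hypothetical helper)."""
    source: str
    target: str
    type: str = "string"
    required: bool = False
    fallback: Any = None
    tolerance: Optional[float] = None
    validation_regex: Optional[str] = None


def parse_rules(manifest: dict[str, Any]) -> list[FieldRule]:
    """Turn the manifest's `fields` list into FieldRule objects, rejecting bad entries."""
    rules: list[FieldRule] = []
    seen_targets: set[str] = set()
    for entry in manifest.get("fields", []):
        # `description` is documentation only, so it is dropped here.
        known = {k: v for k, v in entry.items() if k != "description"}
        rule = FieldRule(**known)  # raises TypeError on unknown or missing keys
        if rule.type not in ALLOWED_TYPES:
            raise ValueError(f"{rule.target}: unsupported type {rule.type!r}")
        if rule.target in seen_targets:
            raise ValueError(f"duplicate target column {rule.target!r}")
        seen_targets.add(rule.target)
        rules.append(rule)
    return rules
```

Failing at load time keeps a typo in the manifest from surfacing as thousands of per-record audit rows halfway through a run.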

Mandatory vs. Optional Field Semantics

| Rule | `required: true` | `required: false` |
| --- | --- | --- |
| Source field missing | Hard failure → quarantine record | Fallback value applied |
| Value is null | Hard failure → quarantine record | Fallback value applied |
| Type coercion fails | Hard failure → quarantine record | Fallback value applied; warning emitted |
| Regex validation fails | Hard failure → quarantine record | Warning emitted; value kept as-is |
| Audit log entry | `QUARANTINE` with reason code | `FALLBACK_APPLIED` or `SCHEMA_DRIFT` |

This two-tier design guarantees that records missing a parcel identifier never reach the target geodatabase, while preserving pipeline throughput for datasets where area measurements are incomplete. The per-field casting that each manifest entry triggers is detailed in Writing Robust Python Scripts for Automated Field Type Casting, which the execution engine below calls once per field.
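
The regex row of the table above can be sketched as a small helper that runs after string coercion. This is a minimal illustration under the table's stated semantics; `check_pattern` is a hypothetical name, and the audit row shape simply mirrors the `FALLBACK_APPLIED` rows used elsewhere on this page.

```python
import re
from typing import Any


def check_pattern(
    value: str,
    field_def: dict[str, Any],
    record_id: str,
    audit_rows: list[dict],
) -> str:
    """Return value unchanged; raise for mandatory fields, log SCHEMA_DRIFT for optional ones."""
    pattern = field_def.get("validation_regex")
    # fullmatch keeps the check strict whether or not the pattern carries ^...$ anchors.
    if not pattern or re.fullmatch(pattern, value) is not None:
        return value
    if field_def.get("required", False):
        # Mandatory field: the caller treats ValueError as a quarantine signal.
        raise ValueError(
            f"Mandatory field '{field_def['target']}' failed validation for record {record_id}"
        )
    audit_rows.append({
        "record_id": record_id,
        "field": field_def["target"],
        "event": "SCHEMA_DRIFT",
        "reason": f"regex_mismatch: {value!r}",
    })
    return value  # optional field: value kept as-is
```

With the `ZONING` pattern from the manifest, `"RS-01"` passes silently while `"RES-1"` is kept and produces one `SCHEMA_DRIFT` audit row.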

Preprocessing Requirements

Before schema coercion executes, inputs must satisfy two shape requirements:

1. Flat attribute structure. Type coercion operates on scalar field values, not nested objects. GeoJSON payloads with deeply nested attribute dictionaries must be projected to top-level columns before the manifest is applied. The Nested JSON/GeoJSON Flattening stage handles this using dot-notation path resolution without mutating coordinate arrays; for the recursion limits and cycle guards that keep a malformed feature collection from exhausting the stack, see Flattening Deeply Nested GeoJSON Feature Collections Safely.

2. CRS declared and readable. The pipeline must be able to read the source CRS from the file header before opening the streaming reader. Files with missing or malformed .prj sidecar files must be rejected at the preprocessing gate with an explicit error code rather than silently assuming a coordinate system. If CRS reconciliation across multiple source datasets is required, CRS Normalization & Sync provides the full multi-dataset harmonization workflow.

# Python 3.10+ — fiona >=1.9, pyproj >=3.6
from pathlib import Path
import fiona
from pyproj import CRS
from pyproj.exceptions import CRSError

def assert_readable_crs(source_path: Path) -> CRS:
    """Return the source CRS, raising CRSError if it is missing or unparseable."""
    with fiona.open(str(source_path), "r") as src:
        # Fiona 1.9 may return an empty (falsy) CRS object rather than None
        # when no CRS is declared, so test the WKT string instead.
        crs_wkt = src.crs_wkt
    if not crs_wkt:
        raise CRSError(
            f"No CRS declared in {source_path.name}. "
            "Provide a .prj sidecar or set the CRS explicitly before ingestion."
        )
    # CRS.from_wkt itself raises CRSError on a malformed definition.
    return CRS.from_wkt(crs_wkt)
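
The first requirement, a flat attribute structure, can be gated in a similar way. The helper below is a hypothetical sketch (`find_nested_properties` is not part of any library): it reports which properties still hold containers so the preprocessing gate can reject the feature or send it back to the flattening stage.

```python
from typing import Any


def find_nested_properties(properties: dict[str, Any]) -> list[str]:
    """Return the names of properties whose values are containers rather than scalars."""
    return [
        name
        for name, value in properties.items()
        if isinstance(value, (dict, list, tuple))
    ]


# Usage: an empty result means the record is safe to hand to the coercion step.
sample = {"PROP_ID": "A-100", "owner": {"name": "City"}, "tags": ["res", "r1"]}
nested = find_nested_properties(sample)
if nested:
    print(f"Flatten before coercion: {nested}")
```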

Execution Engine and Precision Guards

Execution must prioritize memory stability over raw throughput. Streaming iteration applies schema transformations record-by-record rather than materialising entire feature collections in RAM. For the specific challenge of processing archives of thousands of discrete files while maintaining stable allocation, see Batch Transforming 10k+ Shapefiles Without Memory Leaks.

The execution engine below implements the full coercion-fallback-write loop with explicit error handling for the failure types most common in municipal GIS data:

# Python 3.10+ — fiona >=1.9, geopandas >=0.14, pyarrow >=14
import gc
import logging
import resource
import uuid
from pathlib import Path
from typing import Any

import fiona
import pyarrow as pa  # pyarrow >=14

logger = logging.getLogger(__name__)

# Cap virtual address space at 2 GiB so runaway allocations fail fast with MemoryError.
# The resource module is Unix-only; this call is unavailable on Windows.
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, 2 * 1024**3))

PYARROW_TYPE_MAP: dict[str, pa.DataType] = {
    "string": pa.string(),
    "float64": pa.float64(),
    "int64": pa.int64(),
    "bool": pa.bool_(),
}


def coerce_field(
    value: Any,
    field_def: dict[str, Any],
    record_id: str,
    audit_rows: list[dict],
) -> Any:
    """
    Apply type coercion and fallback logic for a single field.
    Returns the coerced value; appends to audit_rows on any deviation.
    """
    target = field_def["target"]
    required = field_def.get("required", False)
    fallback = field_def.get("fallback")
    tolerance = field_def.get("tolerance")

    if value is None or (isinstance(value, str) and value.strip() == ""):
        if required:
            raise ValueError(f"Mandatory field '{target}' is null for record {record_id}")
        audit_rows.append({
            "record_id": record_id,
            "field": target,
            "event": "FALLBACK_APPLIED",
            "reason": "null_source_value",
        })
        return fallback

    try:
        field_type = field_def.get("type", "string")
        if field_type == "float64":
            coerced = float(value)
            # Enforce numeric tolerance: values below tolerance snap to 0.0
            if tolerance is not None and abs(coerced) < tolerance:
                coerced = 0.0
            return coerced
        if field_type == "int64":
            return int(value)
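        if field_type == "bool":
            # bool("false") is True, so parse common string spellings explicitly.
            if isinstance(value, str):
                token = value.strip().lower()
                if token in {"true", "t", "yes", "y", "1"}:
                    return True
                if token in {"false", "f", "no", "n", "0"}:
                    return False
                raise ValueError(f"cannot interpret {value!r} as bool")
            return bool(value)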
        # Default: string coercion
        return str(value).strip()

    except (TypeError, ValueError) as exc:
        if required:
            raise ValueError(
                f"Type coercion failed for mandatory field '{target}' on record {record_id}: {exc}"
            ) from exc
        audit_rows.append({
            "record_id": record_id,
            "field": target,
            "event": "FALLBACK_APPLIED",
            "reason": f"coercion_error: {exc}",
        })
        return fallback


def stream_transform(
    source_path: Path,
    target_path: Path,
    quarantine_path: Path,
    schema_map: list[dict[str, Any]],
) -> dict[str, Any]:
    """
    Stream-transform all features in source_path to target_path.
    Quarantined records go to quarantine_path.
    Returns a compliance summary dict.
    """
    batch_id = str(uuid.uuid4())
    audit_rows: list[dict] = []
    counts = {"processed": 0, "quarantined": 0, "fallback_applied": 0}

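    # Manifest type names are not guaranteed to be valid Fiona schema type names,
    # so translate them explicitly. "bool" support depends on the Fiona/GDAL build;
    # map it to "int" if the driver rejects it.
    fiona_types = {"string": "str", "float64": "float", "int64": "int64", "bool": "bool"}
    out_properties = {
        fd["target"]: fiona_types[fd.get("type", "string")] for fd in schema_map
    }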

    with fiona.open(str(source_path), "r") as src:
        out_schema = {"geometry": src.schema["geometry"], "properties": out_properties}
        q_schema = {"geometry": src.schema["geometry"], "properties": {"source_id": "str", "failure_reason": "str"}}

        with (
            fiona.open(str(target_path), "w", driver="GPKG", schema=out_schema, crs=src.crs) as dst,
            fiona.open(str(quarantine_path), "w", driver="GPKG", schema=q_schema, crs=src.crs) as qst,
        ):
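            for index, feat in enumerate(src):
                # Fall back to the feature's position so quarantined records do not
                # reuse an ID (the processed counter does not advance for them).
                feat_id = feat.get("id")
                record_id = str(index if feat_id is None else feat_id)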
                row_audit: list[dict] = []
                props: dict[str, Any] = {}
                quarantine_reason: str | None = None

                for field_def in schema_map:
                    src_key = field_def["source"]
                    raw_val = feat["properties"].get(src_key)
                    try:
                        props[field_def["target"]] = coerce_field(
                            raw_val, field_def, record_id, row_audit
                        )
                    except ValueError as exc:
                        quarantine_reason = str(exc)
                        break

                if quarantine_reason:
                    qst.write({
                        "geometry": feat["geometry"],
                        "properties": {"source_id": record_id, "failure_reason": quarantine_reason},
                    })
                    counts["quarantined"] += 1
                    audit_rows.append({"record_id": record_id, "event": "QUARANTINE", "reason": quarantine_reason})
                else:
                    dst.write({"geometry": feat["geometry"], "properties": props})
                    counts["processed"] += 1
                    fallback_events = [r for r in row_audit if r["event"] == "FALLBACK_APPLIED"]
                    counts["fallback_applied"] += len(fallback_events)
                    audit_rows.extend(row_audit)

    gc.collect()
    return {"batch_id": batch_id, "counts": counts, "audit": audit_rows}

Idempotency Guarantee

For recurring municipal updates, batch processing must be idempotent: re-running the same pipeline against identical inputs must produce identical outputs without duplicating records. Use a deterministic composite key — typically source_id + hash(geometry_wkt + sorted_attributes) — and UPSERT semantics in the target geodatabase so that duplicate runs update existing rows rather than appending new ones.
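
A minimal sketch of such a composite key follows. It uses SHA-256 rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process and would differ between runs. `composite_key` is a hypothetical helper, and the sketch assumes the caller produces the WKT with a consistent coordinate precision.

```python
import hashlib
import json
from typing import Any


def composite_key(source_id: str, geometry_wkt: str, attributes: dict[str, Any]) -> str:
    """Build a deterministic key: source_id plus a SHA-256 of geometry and sorted attributes."""
    # sort_keys fixes attribute order; default=str covers dates and other non-JSON scalars.
    canonical_attrs = json.dumps(
        attributes, sort_keys=True, separators=(",", ":"), default=str
    )
    payload = f"{geometry_wkt}|{canonical_attrs}".encode("utf-8")
    return f"{source_id}:{hashlib.sha256(payload).hexdigest()}"


# Attribute order does not affect the key, so identical inputs always collide on UPSERT.
key_a = composite_key("P-1", "POINT (1 2)", {"zoning_code": "RS-01", "area_hectares": 1.5})
key_b = composite_key("P-1", "POINT (1 2)", {"area_hectares": 1.5, "zoning_code": "RS-01"})
```

The target store can then declare this key as its unique constraint and resolve conflicts with its own UPSERT syntax.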

Failure Modes and Fallback Routing

| Failure Type | Typical Cause | Deterministic Recovery |
| --- | --- | --- |
| `MandatoryFieldNull` | Source dataset missing required column | Record routed to quarantine; batch continues |
| `TypeCoercionError` | String in numeric field (e.g. `"N/A"` in `ACRES`) | Optional: fallback applied + audit row. Mandatory: quarantine |
| `RegexValidationFail` | Zoning code in unexpected format (`"RES-1"` vs. `"RS-01"`) | Optional: `SCHEMA_DRIFT` warning emitted; value kept as-is. Mandatory: quarantine |
| `CRSError` | Missing `.prj` sidecar; EPSG code not in PROJ registry | Batch rejected at preprocessing gate; no partial writes |
| `MemoryLimitExceeded` | Single feature with excessively large geometry coordinate array | Feature quarantined; the `MemoryError` raised under `RLIMIT_AS` is caught at the record boundary |
| `FileHandleLeak` | Driver not closed after partial write | Mitigated by `with` context managers on all `fiona.open` calls |
| `IdempotencyViolation` | Duplicate source ID arriving in subsequent batch run | UPSERT semantics prevent duplication; audit log records update event |

The Error Handling & Retry Logic stage wraps the outer batch loop to handle transient I/O failures (network-mounted shares, S3 throttling) with exponential backoff before the per-record failure routing above executes. The exact retry envelope — jittered delays, attempt caps, and idempotency keys that keep a retried batch from double-writing — is worked through in Implementing Exponential Backoff in Schema Mapping Jobs.
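
For orientation, the outer retry wrapper might look like the sketch below. It is a generic capped exponential backoff with full jitter, not the implementation from the linked guide; `retry_with_backoff` and its default values are assumptions, and it treats any `OSError` as transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying OSError with capped, fully jittered exponential delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise  # attempts exhausted: let the caller fail the batch
            # Ceiling doubles each attempt (0.5 s, 1 s, 2 s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # full jitter spreads out retries
    raise AssertionError("unreachable")
```

A caller would wrap one whole file, for example `retry_with_backoff(lambda: stream_transform(src, dst, quarantine, schema_map))`. Because a retried file is rewritten from the start, this only stays safe when the target write is idempotent.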

Compliance Reporting Output

Every batch run must emit a structured JSON compliance record to the audit trail. This record is the primary artefact for lineage tracking, FGDC-STD-001 compliance audits, and incident investigation:

{
  "batch_id": "a3f2e1d0-...",
  "source_file": "parcels_2024_q3.shp",
  "source_hash": "sha256:4c3d...",
  "compliance_profile": "FGDC-STD-001",
  "schema_manifest_version": "2.1",
  "schema_manifest_git_sha": "abc123",
  "run_timestamp_utc": "2026-06-25T09:14:00Z",
  "counts": {
    "total_features": 18432,
    "processed": 18420,
    "quarantined": 12,
    "fallback_applied": 223
  },
  "schema_drift_warnings": [
    {"field": "ZONE_CLASS", "reason": "unexpected_column_not_in_manifest"}
  ],
  "quarantine_reasons": [
    {"code": "MandatoryFieldNull", "field": "parcel_id", "count": 12}
  ]
}

The source_hash field ties the report to a specific input file, so a re-run against identical input can be told apart from a run against a changed file. The schema_manifest_git_sha pins the exact manifest version that produced this output, which supports the lineage documentation that FGDC-STD-001 Section 2 (Data Quality Information) calls for.
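
A hedged sketch of assembling the core of this record from the `stream_transform` summary is shown below. `file_sha256` and `build_compliance_record` are hypothetical helpers; the sketch assumes `total_features` equals processed plus quarantined, as those counters are maintained in the engine, and it hashes only the single file passed in (a shapefile's `.dbf` and `.shx` sidecars would need hashing separately).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large inputs are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def build_compliance_record(
    summary: dict[str, Any],
    source_path: Path,
    manifest_version: str,
    manifest_git_sha: str,
    compliance_profile: str = "FGDC-STD-001",
) -> dict[str, Any]:
    """Combine the run summary with file and manifest identity into one JSON-ready dict."""
    counts = dict(summary["counts"])
    # Assumption: every feature is either written to the target or quarantined.
    counts["total_features"] = counts["processed"] + counts["quarantined"]
    return {
        "batch_id": summary["batch_id"],
        "source_file": source_path.name,
        "source_hash": file_sha256(source_path),
        "compliance_profile": compliance_profile,
        "schema_manifest_version": manifest_version,
        "schema_manifest_git_sha": manifest_git_sha,
        "run_timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "counts": counts,
    }
```

The result can be written with `json.dumps(record, indent=2)`; the drift-warning and quarantine-reason lists are left out here because their aggregation rules are not specified on this page.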

CI Integration

Treat schema manifests as immutable release artifacts: a manifest change must pass the same gates as a code change before it reaches any production dataset. The GitHub Actions workflow below lints the YAML against a JSON Schema registry, runs a synthetic integration test against a known-good fixture dataset, and publishes the compliance report as a build artifact.

# .github/workflows/schema-etl.yml
name: Schema Validation and ETL Test
on:
  push:
    paths:
      - "schemas/**"
      - "src/etl/**"
      - "tests/fixtures/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install \
            "fiona>=1.9" \
            "geopandas>=0.14" \
            "pyproj>=3.6" \
            "pyarrow>=14" \
            pyyaml jsonschema pytest

      - name: Lint manifest against JSON Schema registry
        run: |
          python - <<'EOF'
          import yaml, jsonschema, sys, json
          with open("schemas/manifest.yaml") as f:
              manifest = yaml.safe_load(f)
          with open("schemas/manifest.schema.json") as f:
              schema = json.load(f)
          try:
              jsonschema.validate(manifest, schema)
              print("Manifest valid.")
          except jsonschema.ValidationError as exc:
              print(f"Manifest invalid: {exc.message}", file=sys.stderr)
              sys.exit(1)
          EOF

      - name: Run synthetic ETL integration test
        run: pytest tests/etl_integration.py -v --tb=short

      - name: Publish compliance report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: compliance-report
          path: reports/compliance_*.json

Add a pre-commit hook that runs the manifest linter locally so schema errors are caught before CI:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: lint-schema-manifest
        name: Lint schema manifest
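        language: python
        entry: python scripts/lint_manifest.py
        files: "schemas/.*\\.yaml$"
        # Assumes the linter imports yaml and jsonschema, as the CI lint step does;
        # pre-commit builds an isolated environment, so they must be declared here.
        additional_dependencies: [pyyaml, jsonschema]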

The synthetic fixture dataset deserves the same version-control discipline as the manifest: pin a small, hand-curated tests/fixtures/parcels_golden.gpkg that exercises every failure branch (one null parcel_id, one non-numeric ACRES, one drifting ZONING code) and assert the resulting counts block byte-for-byte. When the golden output changes unexpectedly, the diff is the schema-drift signal.
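
A byte-for-byte comparison only works if the counts block is serialised deterministically. The helpers below are a hypothetical sketch of that step (`canonical_counts` and `assert_counts_match_golden` are illustrative names); where the golden bytes are stored is left to the test suite.

```python
import json
from typing import Any


def canonical_counts(counts: dict[str, Any]) -> bytes:
    """Serialise a counts block with sorted keys and fixed separators."""
    return json.dumps(counts, sort_keys=True, separators=(",", ":")).encode("utf-8")


def assert_counts_match_golden(actual: dict[str, Any], golden_bytes: bytes) -> None:
    """Fail with both byte strings in the message so the drift is visible in CI logs."""
    actual_bytes = canonical_counts(actual)
    if actual_bytes != golden_bytes:
        raise AssertionError(
            f"counts drifted: expected {golden_bytes!r}, got {actual_bytes!r}"
        )
```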

Deep-Dive Implementation Guides

The detailed engineering problems that sit beneath this stage are covered in dedicated walkthroughs:

Batch Transforming 10k+ Shapefiles Without Memory Leaks works through bounded memory iteration, driver-handle hygiene, and RLIMIT_AS tuning when a single batch run spans tens of thousands of discrete files.

Frequently Asked Questions

Should the schema manifest live in the same repository as the pipeline code? Yes. Co-locating schemas/manifest.yaml with the ETL source means a single commit captures both the rule change and any code that depends on it, and the schema_manifest_git_sha recorded in each compliance report resolves to a reachable commit. Splitting them across repositories breaks lineage traceability under FGDC-STD-001 Section 2.

Why quarantine records instead of dropping them or aborting the batch? Dropping silently destroys evidence a government audit needs; aborting the whole batch lets one bad record block thousands of valid ones. Quarantining writes the failing feature plus a machine-readable reason to a separate store, so the batch completes deterministically and the rejected records remain inspectable and re-runnable.

How do I keep re-runs from duplicating rows in the target geodatabase? Derive a deterministic composite key from source_id plus a hash of the geometry WKT and sorted attributes, then UPSERT on that key. Identical inputs then update existing rows rather than appending, which is what makes recurring municipal refreshes idempotent.

Does this stage reproject coordinates? No. Attribute coercion assumes a readable, already-declared CRS; multi-dataset reprojection is the responsibility of CRS Normalization & Sync, which must run before the manifest is applied.

Related

Field Renaming & Type Coercion Rules: the coercion rule library this pipeline applies per field
Nested JSON/GeoJSON Flattening: the preprocessing stage that flattens nested API payloads before coercion
Error Handling & Retry Logic: exponential backoff and outer-loop retry wrapping the batch stage
CRS Normalization & Sync: coordinate reference system reconciliation that must run before attribute coercion
Automated Attribute Transformation & ETL Workflows: the parent discipline this batch stage belongs to
