Implementing Nested JSON/GeoJSON Flattening in Standardized ETL Pipelines Jump to heading

Geospatial data ingestion routinely encounters deeply nested payloads from municipal APIs, federal open-data portals, and third-party survey platforms. Normalizing these hierarchical structures into flat, relational, or columnar formats is a mandatory prerequisite for spatial indexing, downstream analysis, and compliance auditing. Nested JSON/GeoJSON flattening operates as a discrete, high-precision transformation stage at the core of Automated Attribute Transformation & ETL Workflows, requiring deterministic mapping logic, strict schema enforcement, and predictable fallback routing.

This page covers the scope from mapping-manifest design through the traversal engine, failure routing, and CI gating for nested attribute structures. It does not cover the downstream type-casting library that consumes the flattened columns — that belongs to Field Renaming & Type Coercion Rules — nor the at-scale orchestration of thousands of files, which is the domain of Batch Schema Processing Pipelines. Flattening runs first: it produces the predictable, single-level records those neighboring stages assume as input.

Declarative Mapping Manifest Jump to heading

Hardcoded parsing routines introduce maintenance debt and schema-drift vulnerabilities. The flattening stage must be driven by version-controlled YAML manifests that externalize all transformation logic. Each mapping rule declares a source dot-notation path (relative to properties), a target column identifier, a delimiter convention, an expected data type, and explicit mandatory/optional status.

# mapping_manifest.yaml
# Loaded and validated at pipeline init; never edited by hand in production.
schema_version: "1.0"      # bumped whenever the rule set changes; recorded in lineage
delimiter: "__"            # joins nested keys; must match the SQL-safe column convention
rules:
  - source_path: "admin.zoning.code"   # dot-path relative to properties
    target_column: "zoning_code"       # output column name
    type: "string"
    mandatory: true                    # missing → quarantine the record
    fallback: null
  - source_path: "survey.responses"
    target_column: "survey_responses"
    type: "json_string"
    mandatory: false                   # missing → apply fallback, keep record
    fallback: "[]"
    array_policy: "serialize"          # collapse the array to a compact JSON string
  - source_path: "metadata.last_updated"
    target_column: "record_timestamp"
    type: "iso8601"
    mandatory: true
    fallback: "1970-01-01T00:00:00Z"

Every directive has a defined contract. The table below fixes which fields a valid rule must carry versus which are optional:

Field	Required	Purpose
`source_path`	mandatory	Dot-notation path within `properties`; never prefix with `properties.`
`target_column`	mandatory	Output column; must be unique across the manifest
`type`	mandatory	Declared output type handed to the coercion stage (`string`, `iso8601`, `json_string`, …)
`mandatory`	mandatory	Whether a missing value quarantines the record or applies a fallback
`fallback`	optional	Value substituted when an optional field is absent; ignored when `mandatory: true`
`array_policy`	optional	`serialize` collapses arrays to compact JSON; default leaves the list in place
`delimiter`	optional (manifest-level)	Overrides the global key-join delimiter for this manifest

Pipeline initialization must validate the manifest against a central JSON Schema before any feature is processed. Invalid path definitions, circular references, or conflicting target names trigger immediate pipeline halts with structured diagnostic output. This declarative architecture aligns directly with the rule library defined in Field Renaming & Type Coercion Rules, ensuring attribute transformations remain auditable, reproducible, and environment-agnostic.

Preprocessing Requirements Jump to heading

Flattening assumes a specific input shape. Before the traversal engine runs, the stage establishes the following invariants:

One feature per record. A FeatureCollection is decomposed into individual Feature objects upstream; the engine operates on a single feature at a time so memory stays bounded.
Reserved keys isolated. Per RFC 7946, the top-level keys type, bbox, geometry, and id are pulled aside and never recursed into. Only the properties dictionary is traversed.
CRS treated as informational. RFC 7946 deprecated the top-level crs member; any crs key is logged but not flattened, and coordinate reconciliation is deferred to CRS Normalization & Sync downstream.
Geometry passes through untouched. The geometry object and its coordinate arrays are preserved byte-for-byte so spatial topology survives the attribute transformation.

These guarantees keep flattening orthogonal to geometry handling, which is exactly what the at-scale orchestration in Batch Schema Processing Pipelines relies on when it parallelizes across feature collections.

Execution Engine & Precision Guards Jump to heading

GeoJSON introduces rigid structural constraints that generic JSON flattening routines frequently violate. The implementation isolates the reserved keys, restricts recursive traversal exclusively to the properties dictionary, and uses an explicit stack rather than language recursion so a hostile payload cannot exhaust the call stack.

# geojson_flattener.py  (Python 3.10+)
import json
from typing import Any, Dict, List

RESERVED_KEYS = {"type", "bbox", "geometry", "id", "crs"}


def flatten_properties(properties: Dict[str, Any], delimiter: str = "__") -> Dict[str, Any]:
    """Flatten a nested properties dict using an iterative stack (avoids recursion limits)."""
    flat: Dict[str, Any] = {}
    stack = [(properties, "")]

    while stack:
        current, prefix = stack.pop()
        if isinstance(current, dict):
            for key, val in current.items():
                new_key = f"{prefix}{delimiter}{key}" if prefix else key
                if isinstance(val, (dict, list)):
                    stack.append((val, new_key))
                else:
                    flat[new_key] = val
        # Lists are stored as-is; callers handle array_policy

    return flat


def flatten_geojson(
    feature: Dict[str, Any],
    rules: List[Dict[str, Any]],
    delimiter: str = "__",
) -> Dict[str, Any]:
    """Flatten a single GeoJSON feature according to manifest rules.

    source_path in each rule is dot-notation relative to 'properties';
    after flattening, dots become the configured delimiter.
    """
    output = {k: v for k, v in feature.items() if k in RESERVED_KEYS}
    properties = feature.get("properties") or {}
    flat_props = flatten_properties(properties, delimiter)

    # Apply manifest rules
    for rule in rules:
        target = rule["target_column"]
        # Convert dot-notation path to flattened key
        flat_key = rule["source_path"].replace(".", delimiter)
        value = flat_props.get(flat_key)
        fallback = rule.get("fallback")
        mandatory = rule.get("mandatory", False)

        if value is None:
            if mandatory:
                raise ValueError(f"Mandatory field missing: {rule['source_path']}")
            output[target] = fallback
            continue

        if rule.get("array_policy") == "serialize" and isinstance(value, list):
            output[target] = json.dumps(value, separators=(",", ":"))
        else:
            output[target] = value

    return output

For datasets exceeding available RAM, do not materialize the full feature collection. Stream features incrementally with a generator-based parser such as ijson (>=3.2) and hand each feature to flatten_geojson before writing the flattened row directly to Parquet (pyarrow >=14) or PostGIS via batch inserts. Mandatory-field violations must not abort the entire stream — catch the ValueError per feature, quarantine that record, and keep the generator running so a single bad payload never costs the whole batch:

# streaming_flatten.py  (Python 3.10+)
import logging
import ijson  # ijson >=3.2

log = logging.getLogger("nested_flatten")


def flatten_stream(path: str, rules, delimiter: str = "__"):
    """Yield one flattened record per feature without loading the whole file.

    Quarantines (does not raise) on a mandatory-field miss so one hostile
    feature cannot terminate the batch; the quarantine list is handed to the
    audit writer for the compliance record.
    """
    quarantined: list[dict] = []
    with open(path, "rb") as fh:
        for feature in ijson.items(fh, "features.item"):
            try:
                yield flatten_geojson(feature, rules, delimiter)
            except ValueError as exc:
                fid = feature.get("id", "<no-id>")
                log.warning("quarantined feature %s: %s", fid, exc)
                quarantined.append({"feature_id": fid, "reason": str(exc)})
    if quarantined:
        log.info("stream complete: %d feature(s) quarantined", len(quarantined))

When the flattened rows are written onward, wrap the columnar write so a schema mismatch surfaces as a logged, routable failure rather than a process crash — pyarrow raises pa.lib.ArrowInvalid (and ArrowTypeError) when a value does not fit the declared column type:

# write_parquet.py  (Python 3.10+)
import logging
import pyarrow as pa          # pyarrow >=14
import pyarrow.parquet as pq

log = logging.getLogger("nested_flatten")


def write_batch(records: list[dict], schema: pa.Schema, dest: str) -> None:
    """Write a flattened batch to Parquet, routing schema errors to the audit log."""
    try:
        table = pa.Table.from_pylist(records, schema=schema)
        pq.write_table(table, dest)
    except (pa.lib.ArrowInvalid, pa.lib.ArrowTypeError) as exc:
        # A value survived flattening but violates the columnar contract:
        # log it as a coercion-boundary failure, never write a partial file.
        log.error("parquet write rejected: %s", exc)
        raise

For municipal datasets with nested administrative boundaries, multi-level zoning attributes, or survey arrays, deterministic traversal prevents structural corruption. Detailed strategies for irregular geometries, recursion-depth caps, and coordinate-precision clamping are covered in Flattening Deeply Nested GeoJSON Feature Collections Safely.

Failure Modes & Fallback Routing Jump to heading

No record may fail silently. Every traversal outcome maps to a deterministic action so a downstream auditor can reconstruct exactly why a record was dropped, defaulted, or serialized.

Failure mode	Cause	Deterministic recovery action
Mandatory field missing	`source_path` resolves to `None` and `mandatory: true`	Raise, quarantine the record, log the rule and feature `id`
Optional field missing	`source_path` resolves to `None` and `mandatory: false`	Apply declared `fallback`, continue, log a `field_defaulted` event
Target column collision	Two rules declare the same `target_column`	Halt at manifest-validation time, before any feature is read
Circular reference in payload	Self-referencing nested object	Iterative stack bounds traversal; depth cap quarantines the record
Non-`iso8601` timestamp	Source value not parseable as a datetime	Reject the value, route to coercion-error log, do not write
Array where scalar expected	Nested list under a scalar rule	Apply `array_policy: serialize` or quarantine if no policy declared
Columnar write rejection	Flattened value violates the declared Parquet/Arrow column type	Catch `ArrowInvalid`/`ArrowTypeError`, log to the audit trail, never write a partial file

Mandatory-field violations halt and quarantine; optional violations default and continue. This routing model integrates with the parallel validation in Batch Schema Processing Pipelines, and transient I/O failures during streaming are wrapped by the exponential-backoff strategy described in Error Handling & Retry Logic:

# retry_handler.py  (Python 3.10+)
import time
from functools import wraps
from typing import Callable, Any, Tuple, Type

def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
) -> Callable:
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            delay = base_delay
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

Compliance Reporting Output Jump to heading

Each batch run writes a structured audit record so flattening decisions are reproducible and reviewable. The record captures lineage (which manifest version produced the output), per-field defaulting events, and the quarantine log.

# audit.py  (Python 3.10+)
import json
from datetime import datetime, timezone

def write_audit(manifest_version: str, processed: int, defaulted: list, quarantined: list) -> str:
    record = {
        "stage": "nested_flatten",
        "manifest_version": manifest_version,   # lineage: ties output to the exact rule set
        "run_at": datetime.now(timezone.utc).isoformat(),
        "records_processed": processed,
        "records_quarantined": len(quarantined),
        "fields_defaulted": defaulted,          # [{feature_id, target_column, fallback}]
        "quarantine_log": quarantined,          # [{feature_id, rule, reason}]
    }
    return json.dumps(record, separators=(",", ":"))

The audit fields map onto the lineage and rejection-log conventions used across every ETL stage, so the flattening stage’s output slots into the same compliance dashboard that consumes coercion and batch-processing reports.

CI Integration Jump to heading

Configuration manifests and flattening logic must be validated before deployment. A minimal GitHub Actions workflow gates manifest schema compliance, unit-test coverage, and synthetic-payload regression for the traversal engine.

# .github/workflows/validate-flattener.yml
name: Validate Flattening Pipeline
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: pip install pyyaml jsonschema ijson pytest
      - name: Validate YAML Manifest
        run: |
          python -c "
          import yaml, jsonschema
          with open('mapping_manifest.yaml') as f:
              manifest = yaml.safe_load(f)
          schema = {
              'type': 'object',
              'required': ['schema_version', 'rules'],
              'properties': {
                  'rules': {'type': 'array', 'minItems': 1}
              }
          }
          jsonschema.validate(manifest, schema)
          print('Manifest valid.')
          "
      - name: Run Unit Tests
        run: pytest tests/ --tb=short

Pair the CI job with a local pre-commit hook so duplicate target_column declarations and unresolvable paths are caught before they ever reach the pipeline:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: lint-flatten-manifest
        name: Lint flatten manifest
        language: python
        entry: python scripts/lint_manifest.py
        files: "mapping_manifest\\.yaml$"

This enforces version-control discipline, prevents schema drift, and guarantees that traversal logic stays deterministic across staging and production.

Frequently Asked Questions Jump to heading

Why use an iterative stack instead of recursion? Vendor payloads can nest dozens of levels deep, and a maliciously or accidentally deep object would raise RecursionError and crash the worker. The explicit stack in flatten_properties bounds traversal to heap memory and lets you enforce a depth cap as a quarantine condition rather than an uncaught exception.

Should the geometry ever be flattened? No. Coordinate arrays are preserved exactly so spatial topology survives. Flattening is restricted to the properties object; reconciling coordinate systems is handled separately by CRS Normalization & Sync.

How are arrays in properties handled? By the array_policy directive. serialize collapses an array to a compact JSON string in a single column; without a policy, an array under a scalar rule is treated as a failure mode and quarantined rather than silently coerced.

Where does type casting happen — here or downstream? This stage emits the declared type alongside each value but performs only structural flattening. The actual casting (string → int, datetime parsing, null normalization) is the job of Field Renaming & Type Coercion Rules, which consumes the flattened columns.

Flattening Deeply Nested GeoJSON Feature Collections Safely — recursion-depth caps, coordinate precision clamping, and edge-case handling for vendor GeoJSON
Field Renaming & Type Coercion Rules — the type-casting stage that consumes flattened columns
Batch Schema Processing Pipelines — parallel, at-scale orchestration of the flattening and coercion stages
Error Handling & Retry Logic — exponential backoff for transient failures during streaming ingestion
CRS Normalization & Sync — coordinate reference reconciliation that runs alongside attribute transformation

Implementing Nested JSON/GeoJSON Flattening in Standardized ETL Pipelines Jump to heading#

Declarative Mapping Manifest Jump to heading#

Preprocessing Requirements Jump to heading#

Execution Engine & Precision Guards Jump to heading#

Failure Modes & Fallback Routing Jump to heading#

Compliance Reporting Output Jump to heading#

CI Integration Jump to heading#

Frequently Asked Questions Jump to heading#

Related Pages Jump to heading#