Automated FGDC CSDGM to ISO 19115 Conversion Pipeline Jump to heading

Converting FGDC CSDGM to ISO 19115 automatically requires deterministic schema translation rather than heuristic parsing. Legacy FGDC XML structures lack the strict cardinality, temporal precision, and controlled vocabulary constraints mandated by ISO 19115-2003 and ISO 19115-1:2014. Direct mapping pipelines routinely fail when encountering unstructured <idinfo> blocks, ambiguous coordinate reference system declarations, or missing lineage sequencing. This guide provides a production-ready Python ETL configuration that enforces strict type coercion, resolves schema drift, and guarantees compliance through automated validation and fallback routing. For foundational context on spatial metadata interoperability, review the Geospatial Schema Architecture & Standards Mapping framework before deploying this pipeline.

Declarative Mapping Configuration Jump to heading

The conversion engine must operate on a declarative translation matrix to prevent ad-hoc XPath mutations during execution. Define a YAML-driven mapping layer that explicitly binds FGDC elements to ISO 19115 equivalents while enforcing mandatory field population. The following minimal reproducible configuration establishes the baseline translation rules:

yaml
mapping_matrix:
  idinfo/citation/citeinfo/title: identificationInfo/MD_DataIdentification/citation/CI_Citation/title
  idinfo/citation/citeinfo/pubdate: identificationInfo/MD_DataIdentification/citation/CI_Citation/date/CI_Date/date
  idinfo/citation/citeinfo/geoform: identificationInfo/MD_DataIdentification/spatialRepresentationType/MD_SpatialRepresentationTypeCode
  dataqual/lineage/procstep/procdate: dataQualityInfo/DQ_DataQuality/lineage/LI_Lineage/processStep/LI_ProcessStep/dateTime/CI_Date/date
  spref/horizsys/geodetic/horizdn: referenceSystemInfo/MD_ReferenceSystem/referenceSystemIdentifier/RS_Identifier/code

Implement this matrix using lxml.etree with a deterministic traversal order. Avoid recursive wildcard searches (//) as they introduce non-deterministic node selection and break batch processing consistency. The Python implementation below enforces strict path resolution and handles missing source nodes gracefully:

python
import yaml
from lxml import etree
from typing import Dict, Optional, Tuple

def load_mapping_config(config_path: str) -> Dict[str, str]:
    with open(config_path, "r") as f:
        return yaml.safe_load(f)["mapping_matrix"]

def translate_node(
    source_tree: etree._ElementTree,
    fgdc_path: str,
    iso_path: str
) -> Optional[Tuple[str, str]]:
    """Extracts FGDC value and returns ISO target path + value."""
    node = source_tree.find(fgdc_path)
    if node is None or not node.text or not node.text.strip():
        return None
    return iso_path, node.text.strip()

Precision Management and Temporal Coercion Jump to heading

Precision loss during coordinate and temporal conversion is the primary failure vector in automated pipelines. FGDC stores dates as YYYYMMDD or YYYY strings, while ISO 19115 requires strict YYYY-MM-DD ISO 8601 formatting with explicit time zone offsets. Implement a regex-based temporal parser with a hard fallback threshold. For spatial bounding boxes, FGDC uses <westbc>, <eastbc>, <northbc>, and <southbc>. ISO 19115 requires EX_GeographicBoundingBox with explicit longitude/latitude tags.

Apply the following thresholds and rules:

  • Temporal Precision: If temporal precision drops below day-level, inject 00:00:00Z and append a <gco:CharacterString> provenance note indicating automated normalization.
  • Coordinate Precision: Enforce a 6-decimal precision threshold during float coercion to prevent floating-point drift that corrupts downstream spatial indexing.
  • Rounding Strategy: Values exceeding the threshold must be rounded using decimal.ROUND_HALF_EVEN to maintain IEEE 754 compliance.
  • Null Handling: Missing bounding box coordinates trigger a fallback to the dataset’s declared CRS extent or raise a ValueError if no authoritative extent exists.
python
import re
from decimal import Decimal, ROUND_HALF_EVEN
from datetime import datetime

DATE_PATTERN = re.compile(r"^(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?$")

def coerce_date(raw_date: str) -> str:
    match = DATE_PATTERN.match(raw_date.strip())
    if not match:
        raise ValueError(f"Invalid FGDC date format: {raw_date}")
    
    year = match.group("year")
    month = match.group("month") or "01"
    day = match.group("day") or "01"
    
    iso_date = f"{year}-{month}-{day}"
    try:
        datetime.strptime(iso_date, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"Invalid calendar date: {iso_date}")
        
    return f"{iso_date}T00:00:00Z"

def coerce_coordinate(val: str) -> str:
    d = Decimal(val.strip())
    return str(d.quantize(Decimal("0.000001"), rounding=ROUND_HALF_EVEN))

Schema Drift Debugging and Fallback Routing Jump to heading

Schema drift occurs when FGDC profiles omit mandatory elements or use deprecated vocabulary codes. Production pipelines must implement a quarantine-and-retry architecture rather than hard-failing. When FGDC Metadata Mapping profiles diverge from the baseline, route non-compliant records to a staging directory with structured error manifests.

Handle the following edge cases explicitly:

  • Missing Fields: If a mandatory ISO 19115 element lacks a direct FGDC equivalent, inject a <gco:nilReason> attribute with the value missing or unknown per ISO 19115-1:2014 Section 6.2.
  • CRS Mismatches: FGDC often declares horizontal datums via free-text strings. Resolve these using an authoritative EPSG registry lookup. If resolution fails, default to EPSG:4326 and log a CRS_AMBIGUOUS warning.
  • CI Pipeline Failures: Integrate schema validation into your CI/CD workflow. Fail fast on XML well-formedness errors, but allow soft-failures on optional metadata blocks to prevent deployment bottlenecks.

Automated Validation and CI Integration Jump to heading

Post-conversion validation must occur before data publication. Use lxml with the official ISO 19115 XML Schema Definition (XSD) and Schematron rules to enforce business logic constraints. Embed validation directly into your CI pipeline using pytest and a custom fixture that asserts zero critical violations. Refer to the official ISO 19115-1:2014 Specification for authoritative schema definitions.

python
import subprocess
from pathlib import Path

def validate_iso19115(xml_path: Path, xsd_path: Path) -> bool:
    """Validates ISO 19115 XML against official XSD using xmllint."""
    result = subprocess.run(
        ["xmllint", "--noout", "--schema", str(xsd_path), str(xml_path)],
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Schema Validation Failed:\n{result.stderr}")
        return False
    return True

Configure your CI runner to:

  • Run validate_iso19115() on every pull request touching the metadata directory.
  • Block merges if validation returns False or if the error log contains CRITICAL severity tags.
  • Archive failed XML payloads for manual review by GIS data stewards.

This pipeline architecture guarantees deterministic output, preserves spatial and temporal precision, and aligns with federal geospatial interoperability mandates.