Back to notebook Case 004

Data Processing

Mass data pipelines, cleansing and ETL automation for critical datasets in production workflows.

Mission

Replace fragile manual imports with repeatable, safer data-processing workflows at production scale.

Current Stack
PythonMySQLPostgreSQLSQL

Casefile Summary

System Role

ETL and cleanup layer for high-volume imports that need repeatable processing and safer operational reruns.

Primary Users

Internal operations and systems that depend on large datasets arriving clean enough for downstream use.

Operational Constraint

Encoding, malformed rows and partial failures had to be treated as first-class workflow behavior, not one-off exceptions.

Why It Exists

To replace fragile import scripts and manual cleanup with one operational workflow that can process large batches safely.

Problem

Large imports were error-prone, repetitive and exposed to encoding issues that did not show up in small test runs.

Architecture

The workflow separates staging, cleanup and final persistence so bad data can be corrected without corrupting the target state.

Source file -> staging tables -> cleanup rules -> validated rows
     |               |                |                 |
     v               v                v                 v
 batch input    raw persistence   SQL / Python     target tables

Key Decisions

  • Use staging layers before touching production-facing tables.
  • Handle encoding and malformed data as part of the pipeline.
  • Prefer repeatable reruns over manual cleanup heroics.

System Diagram

flowchart LR A["Source JSON / CSV Files"] --> B["Staging Load"] B --> C["Normalization / Flattening"] C --> D["Validation + Cleanup Rules"] D --> E["Chunked Delivery"] E --> F["Sheets / Target Tables"]

Operational Flow

01 Intake

Batches are received

Large source datasets arrive with inconsistent content, formatting and hidden cleanup work.

02 Staging

Raw rows are isolated first

Data lands in staging so cleanup and validation can happen before production tables are touched.

03 Cleanup

Rules normalize the batch

Encoding fixes, trimming, transformations and rejects are handled as pipeline logic instead of manual edits.

04 Persist

Validated data moves forward

Only cleaned and trusted rows continue into the operational dataset.

Repository Snippet

Based on the private import automation script, anonymized down to the normalization pattern.

def load_and_normalize_json(file_path: Path) -> pd.DataFrame:
    with file_path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, list):
        df = pd.json_normalize(data, sep=".")
    elif isinstance(data, dict):
        df = pd.json_normalize([data], sep=".")
    df.insert(0, "_source_file", file_path.name)
    return df

Decision Record

Context

Imports looked simple in small samples, but at scale they created repeated manual fixes and hard-to-reproduce failures.

Decision

Build staged ETL logic with explicit cleanup and validation instead of pushing raw imports directly downstream.

Tradeoff

More pipeline structure and more SQL, but far better repeatability once the batch size and failure rate matter.

Result

Imports became operational workflows instead of brittle scripts that only work on good days.

Result

More than 900,000 records were processed through controlled ETL logic, with encoding and cleanup problems handled as system concerns rather than ad hoc fixes.

Production Signals

  • Encoding and data quality became architectural concerns at volume.
  • ETLs were built for repeatability and visibility, not just throughput.
  • Imports stopped being scripts once failure handling became operational work.
  • Large reruns became safer because staging and cleanup were explicit.

Verified Workflow Evidence

Checked against the current private automation repositories behind these import workflows. The implementation is not just SQL cleanup; it also includes file flattening and destination-specific delivery logic.

  • Verified Python-based import automation for flattening nested JSON into tabular structures.
  • Verified chunked Google Sheets uploads designed to avoid payload-size failures.
  • Verified overwrite and rerun logic so the destination state stays controlled instead of manually repaired.

Operational Readout

The important point is that ingestion, normalization and delivery are treated as one workflow. The import is not done when parsing succeeds; it is done when the destination system is consistent.