Chapter 04

Data Wrangling and Cleaning

Government data is not dirty by accident — it reflects decades of incompatible systems, agency-specific conventions, and evolving federal standards. Cleaning it correctly requires knowing the domain as much as knowing the tools. This chapter covers pandas, PySpark, CAGE codes, NAICS standardization, PII handling, and the Bronze/Silver/Gold tier pattern.


The Problem with Government Data

General-purpose data cleaning tutorials use toy datasets. The government context introduces problems you won't find there: DUNS-to-UEI transitions mid-dataset, NAICS code version changes every five years, agency names in 47 variants across source systems, and PII embedded in datasets labelled as procurement records.

This chapter covers specific cleaning patterns for each major problem type, with platform implementations on Databricks (pandas and PySpark) and Palantir Foundry (Pipeline Builder and Code Repositories).

Cleaning with pandas

Null Handling

Federal datasets have structured missingness — specific columns are null for specific record types. Obligation amounts are null for task order modifications. Agency codes are null for records from legacy systems that used a different coding scheme. Treat nulls as domain knowledge, not errors to fill blindly.

python
import pandas as pd
import numpy as np

# Profile nulls before treating them
null_profile = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(2),
    "dtype": df.dtypes,
}).sort_values("null_pct", ascending=False)

print("Null profile (columns with > 5% null):")
print(null_profile[null_profile["null_pct"] > 5])

# Domain-informed fills
# Federal action obligation: null on modifications is intentional — fill 0
df["federal_action_obligation"] = df["federal_action_obligation"].fillna(0)

# Period of performance end: null on IDIQs is legitimate — keep null, don't fill
# (Filling with a date would corrupt analyses of open-ended contracts)

# Agency code: null often means "same as parent" in hierarchical contracts
# Fill with the awarding agency code rather than a placeholder
df["funding_agency_code"] = df["funding_agency_code"].fillna(
    df["awarding_agency_code"]
)

Deduplication

Government contract data has a critical deduplication nuance: modifications are not duplicates. A contract action with modification_number = "0" is the base award. modification_number = "P00001" is the first modification. Both are legitimate records — deduplicating on contract ID alone destroys the history of changes.

python
# When you want the current state of each contract (latest modification)
df_current = (
    df
    .sort_values(["contract_award_unique_key", "action_date"], ascending=[True, False])
    .groupby("contract_award_unique_key", as_index=False)
    .first()
)
print(f"Original: {len(df):,} rows | Deduplicated to current state: {len(df_current):,} rows")

# When you want the full modification history (all actions)
# No deduplication needed — the full table IS the history
# Just ensure your primary key is (contract_award_unique_key, modification_number)
dupes = df.duplicated(subset=["contract_award_unique_key", "modification_number"])
if dupes.any():
    print(f"WARNING: {dupes.sum():,} duplicate (key, modification) pairs found")
    df = df[~dupes]

PySpark at Scale on Databricks

When the dataset is too large for pandas (typically over 5M rows on Advana and Jupiter), PySpark handles the same operations distributed across the cluster. The Bronze/Silver/Gold lakehouse pattern structures this work: Bronze is raw ingested data, Silver is cleaned and standardized, Gold is validated and business-rule-enforced.

python
from pyspark.sql import functions as F, Window

# Read Bronze tier (raw)
df_bronze = spark.table("procurement_catalog.bronze.usaspending_fy2024")

# Clean and standardize for Silver tier
df_silver = (
    df_bronze
    # Standardize NAICS to 6 digits
    .withColumn("naics_code",
        F.lpad(F.regexp_replace(F.col("naics_code").cast("string"), r"[^0-9]", ""), 6, "0")
    )
    # Standardize obligation as double
    .withColumn("federal_action_obligation",
        F.col("federal_action_obligation").cast("double")
    )
    # Derive fiscal year (Oct–Sep)
    .withColumn("fiscal_year",
        F.when(F.month("action_date") >= 10, F.year("action_date") + 1)
         .otherwise(F.year("action_date"))
    )
    # Flag data quality issues rather than silently dropping
    .withColumn("dq_flag",
        F.when(F.col("naics_code").rlike(r"^0{6}$"), "INVALID_NAICS")
         .when(F.col("recipient_uei").isNull(), "MISSING_UEI")
         .otherwise(None)
    )
    # Remove true duplicates using window function
    .withColumn("row_rank",
        F.row_number().over(
            Window.partitionBy("contract_award_unique_key", "modification_number")
                  .orderBy(F.col("action_date").desc())
        )
    )
    .filter(F.col("row_rank") == 1)
    .drop("row_rank")
)

# Write to Silver tier
(
    df_silver
    .write.format("delta").mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("procurement_catalog.silver.usaspending_fy2024_clean")
)

Government-Specific Identifier Standardization

CAGE Codes and UEI/DUNS

In April 2022, the federal government replaced DUNS (Dun & Bradstreet's proprietary identifier) with UEI (Unique Entity Identifier) via SAM.gov. Datasets spanning pre- and post-2022 will have both, inconsistently. CAGE codes (5-character alphanumeric DoD identifiers) are separate from UEIs — a contractor will have multiple CAGE codes for different divisions but one UEI per legal entity.

python
def standardize_vendor_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize CAGE codes, UEI, and legacy DUNS numbers.
    Handles the April 2022 SAM.gov UEI transition period.
    """
    # CAGE codes: always uppercase, 5 chars, alphanumeric
    # First char cannot be I, O, or Q (reserved)
    # fillna("") before astype(str), otherwise NaN becomes the string "nan"
    df["cage_code"] = (
        df["cage_code"].fillna("").astype(str).str.upper().str.strip()
        .str.replace(r"[^A-Z0-9]", "", regex=True)
    )
    valid_cage = df["cage_code"].str.match(r"^[A-HJ-NP-Z0-9][A-Z0-9]{4}$")
    df.loc[~valid_cage & (df["cage_code"] != ""), "data_quality_flag"] = "INVALID_CAGE_FORMAT"

    # UEI: 12-character alphanumeric (SAM.gov)
    df["recipient_uei"] = (
        df["recipient_uei"].fillna("").astype(str).str.upper().str.strip()
        .str.replace(r"[^A-Z0-9]", "", regex=True)
    )
    valid_uei = df["recipient_uei"].str.match(r"^[A-Z0-9]{12}$")
    post_2022 = df["action_date"] >= "2022-04-04"
    df.loc[post_2022 & ~valid_uei & (df["recipient_uei"] != ""),
           "data_quality_flag"] = "INVALID_UEI_FORMAT"

    # DUNS: 9-digit numeric, zero-padded (legacy, pre-2022)
    duns = (
        df["recipient_duns_number"].fillna("").astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
    )
    # Zero-pad only non-empty values; padding empties would fabricate "000000000"
    df["recipient_duns_number"] = duns.where(duns == "", duns.str.zfill(9))
    return df

NAICS Code Standardization

NAICS codes should be six digits, with the first two identifying the sector, but they often arrive as integers with leading zeros dropped or with stray punctuation. Standardize the code first, then derive the sector from the first two digits.

python
NAICS_SECTORS = {
    "11": "Agriculture/Forestry/Fishing/Hunting",
    "21": "Mining/Quarrying/Oil/Gas",
    "22": "Utilities",
    "23": "Construction",
    "31": "Manufacturing", "32": "Manufacturing", "33": "Manufacturing",
    "42": "Wholesale Trade",
    "44": "Retail Trade", "45": "Retail Trade",
    "48": "Transportation/Warehousing", "49": "Transportation/Warehousing",
    "51": "Information",
    "52": "Finance/Insurance",
    "53": "Real Estate/Rental/Leasing",
    "54": "Professional/Scientific/Technical Services",
    "55": "Management of Companies/Enterprises",
    "56": "Administrative/Support/Waste Services",
    "61": "Educational Services",
    "62": "Health Care/Social Assistance",
    "71": "Arts/Entertainment/Recreation",
    "72": "Accommodation/Food Services",
    "81": "Other Services (except Public Administration)",
    "92": "Public Administration",
}

def standardize_naics(df: pd.DataFrame) -> pd.DataFrame:
    # fillna("") before astype(str), otherwise NaN becomes the string "nan"
    df["naics_code"] = (
        df["naics_code"].fillna("").astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
    )
    # Zero-pad only non-empty codes; padding empties would fabricate "000000"
    df["naics_code"] = df["naics_code"].where(
        df["naics_code"] == "", df["naics_code"].str.zfill(6)
    )
    df.loc[df["naics_code"].str.len() != 6, "data_quality_flag"] = "INVALID_NAICS_LENGTH"
    df["naics_sector_code"] = df["naics_code"].str[:2]
    df["naics_sector_desc"] = df["naics_sector_code"].map(NAICS_SECTORS).fillna("Unknown")
    return df

Palantir Foundry: Pipeline Builder vs. Code Repositories

On Foundry, you have two tools for building data transforms. Choosing between them is a real decision that shapes how your team maintains the pipeline.

Pipeline Builder is a visual, node-based interface — drag, connect, configure. No Python required. A business analyst can maintain it. Use Pipeline Builder for standard joins, filters, casts, and renames; when the business owner needs to modify without a developer; and when visual lineage matters for non-engineers.

Code Repositories are Git-backed Python environments using the @transform_df decorator. Use Code Repositories when you need complex conditional logic, ML model inference inside the pipeline, unit tests against transform logic, or edge case handling that a GUI can't express.

The most common mistake on Foundry: reaching for Code Repositories by default because you're an engineer who prefers code. Pipeline Builder is the right choice for standard cleaning tasks in production — lower maintenance cost, visible to non-engineers, and less likely to break silently when a junior analyst inherits it.

Data Quality Metrics

python
import mlflow

def compute_quality_metrics(df: pd.DataFrame, table_name: str) -> dict:
    """Log data quality metrics to MLflow for timestamped audit trail."""
    n_rows = len(df)
    metrics = {"table_name": table_name, "row_count": n_rows}

    critical_cols = ["contract_award_unique_key", "recipient_uei",
                     "federal_action_obligation", "action_date"]
    for col in critical_cols:
        if col in df.columns:
            completeness = (1 - df[col].isnull().mean()) * 100
            metrics[f"completeness_{col}"] = round(completeness, 2)

    if "naics_code" in df.columns:
        valid_naics = df["naics_code"].str.match(r"^\d{6}$").mean() * 100
        metrics["validity_naics_6digit_pct"] = round(valid_naics, 2)

    pk = ["contract_award_unique_key", "modification_number"]
    if all(c in df.columns for c in pk):
        dup_pct = df.duplicated(subset=pk).mean() * 100
        metrics["uniqueness_pk_dup_pct"] = round(dup_pct, 2)

    return metrics


metrics = compute_quality_metrics(df_cleaned, "usaspending_fy2024_clean")

with mlflow.start_run(run_name="procurement_data_quality_check"):
    mlflow.log_param("source_table", "procurement_catalog.raw.usaspending_fy2024")
    for key, val in metrics.items():
        if isinstance(val, (int, float)):
            mlflow.log_metric(key, val)

Handling PII and PHI

Government datasets routinely contain PII. Sole-source contracts with individual consultants often include SSNs in place of EINs. Veteran-owned small business certifications include personal background information. Run a pattern-match scan before assuming you're clean.

python
import hashlib, hmac

def scan_for_pii_patterns(df: pd.DataFrame) -> pd.DataFrame:
    """Scan string columns for likely PII patterns."""
    patterns = {
        "SSN":         r"\b\d{3}-\d{2}-\d{4}\b",
        "phone":       r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
        "email":       r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
    }
    results = []
    for col in df.select_dtypes(include="object").columns:
        sample = df[col].dropna().astype(str).head(10000)
        for label, pattern in patterns.items():
            matches = sample.str.contains(pattern, regex=True).sum()
            if matches > 0:
                results.append({"column": col, "pattern": label,
                                 "matches_in_sample": int(matches)})
    return pd.DataFrame(results)


import os

# Load the key from a secrets manager or environment variable; never hard-code it
TOKENIZATION_KEY = os.environ.get("TOKENIZATION_KEY", "dev-only-placeholder").encode("utf-8")

def tokenize_field(value: str, key: bytes = TOKENIZATION_KEY) -> str:
    """HMAC-SHA256 tokenization — deterministic but not reversible without the key."""
    if pd.isnull(value) or str(value).strip() == "":
        return ""
    token = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    return f"TKN_{token}"


pii_scan = scan_for_pii_patterns(df_raw)
if len(pii_scan) > 0:
    print("PII patterns detected — review before proceeding:")
    print(pii_scan)
    # Tokenize all flagged columns before loading to shared workspace
    for col in pii_scan["column"].unique():
        df_raw[f"{col}_token"] = df_raw[col].apply(tokenize_field)
        df_raw = df_raw.drop(columns=[col])

Where This Goes Wrong

Failure Mode 1: Treating Cleaning as a One-Time Event

Cleaning the dataset once, saving a Parquet file, and moving on. Six months later the pipeline updates with new records that have new formatting quirks. Fix: treat the cleaning pipeline as production code with unit tests. Run quality metric checks on every new data load. Set thresholds — if completeness on recipient_uei drops below 92%, fail the pipeline and alert someone.
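
The threshold check can be sketched as a small gate function run after every load, using the metric names produced by compute_quality_metrics() above. The 92% UEI completeness floor comes from the text; the other limits here are illustrative placeholders you would tune per dataset.

```python
# Illustrative thresholds keyed by the metric names from compute_quality_metrics()
QUALITY_THRESHOLDS = {
    "completeness_recipient_uei": (">=", 92.0),   # floor from the text above
    "validity_naics_6digit_pct": (">=", 95.0),    # placeholder limit
    "uniqueness_pk_dup_pct": ("<=", 0.5),         # placeholder limit
}

def enforce_quality_gate(metrics: dict) -> None:
    """Raise (failing the pipeline run) when any metric crosses its threshold."""
    failures = []
    for name, (op, limit) in QUALITY_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not computed for this table; skip it
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}={value} (expected {op} {limit})")
    if failures:
        raise ValueError("Data quality gate failed: " + "; ".join(failures))
```

Wire the raised exception into whatever alerting your platform provides (a Databricks job failure notification, for example) so a degraded load never lands silently.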

Failure Mode 2: Normalizing Away Real Signal

"Boeing" and "Boeing Defense Space and Security" are separate legal entities with different contract vehicles and performance histories. Collapsing them to "BOEING" for normalization destroys this distinction. Fix: normalize for display, preserve the original field. Join against SAM.gov's entity hierarchy data to get official parent/child relationships.
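
A minimal pandas sketch of that pattern, with illustrative column names: derive a normalized column for grouping and display, and leave the source legal name untouched for joins against entity hierarchy data.

```python
import pandas as pd

def add_display_name(df: pd.DataFrame, name_col: str = "recipient_name") -> pd.DataFrame:
    """Add a normalized display column; never overwrite the original legal name."""
    out = df.copy()
    out[f"{name_col}_display"] = (
        out[name_col].astype(str).str.upper().str.strip()
        # Strip punctuation and collapse whitespace for display/grouping only
        .str.replace(r"[^\w\s]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
    )
    return out
```

Because the original column survives, you can always group on the display name for a rollup chart and still join on the exact legal name when entity-level precision matters.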

Failure Mode 3: Skipping PII Review on "Non-Sensitive" Datasets

The dataset is labelled "procurement data" so you assume it's clean of PII and load it into a shared workspace accessible to 200 analysts. Fix: run scan_for_pii_patterns() before loading any new dataset into a shared workspace. This is not optional, and getting it wrong has career consequences.

Platform Comparison

| Capability | Advana (Databricks) | Navy Jupiter | Palantir Foundry | Qlik |
|---|---|---|---|---|
| Primary cleaning interface | PySpark notebooks | PySpark notebooks | Pipeline Builder + Code Repositories | Qlik Data Manager |
| No-code ETL | Limited | Limited | Strong (Pipeline Builder) | Strong (Data Manager) |
| Handles 100M+ rows | Yes (Spark clusters) | Yes (Spark clusters) | Yes (Spark-backed) | No (memory-bound) |
| Data lineage | Unity Catalog | Collibra + Unity Catalog | Native Ontology lineage | Partial |
| Quality metric tracking | MLflow | MLflow | Pipeline health dashboards | Built-in data profiling |
| Semantic entity layer | No | No | Yes (Ontology) | No |

Exercises

This chapter includes 4 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
