Data Wrangling and Cleaning
Government data is not dirty by accident — it reflects decades of incompatible systems, agency-specific conventions, and evolving federal standards. Cleaning it correctly requires knowing the domain as much as knowing the tools. This chapter covers pandas, PySpark, CAGE codes, NAICS standardization, PII handling, and the Bronze/Silver/Gold tier pattern.
The Problem with Government Data
General-purpose data cleaning tutorials use toy datasets. The government context introduces problems you won't find there: DUNS-to-UEI transitions mid-dataset, NAICS code version changes every five years, agency names in 47 variants across source systems, and PII embedded in datasets labelled as procurement records.
This chapter covers the specific cleaning patterns for each major problem type, with the platform implementations on Databricks (pandas and PySpark) and Palantir Foundry (Pipeline Builder and Code Repositories).
Cleaning with pandas
Null Handling
Federal datasets have structured missingness — specific columns are null for specific record types. Obligation amounts are null for task order modifications. Agency codes are null for records from legacy systems that used a different coding scheme. Treat nulls as domain knowledge, not errors to fill blindly.
```python
import pandas as pd
import numpy as np

# Profile nulls before treating them
null_profile = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(2),
    "dtype": df.dtypes,
}).sort_values("null_pct", ascending=False)

print("Null profile (columns with > 5% null):")
print(null_profile[null_profile["null_pct"] > 5])

# Domain-informed fills

# Federal action obligation: null on modifications is intentional — fill 0
df["federal_action_obligation"] = df["federal_action_obligation"].fillna(0)

# Period of performance end: null on IDIQs is legitimate — keep null, don't fill
# (Filling with a date would corrupt analyses of open-ended contracts)

# Agency code: null often means "same as parent" in hierarchical contracts
# Fill with the awarding agency code rather than a placeholder
df["funding_agency_code"] = df["funding_agency_code"].fillna(
    df["awarding_agency_code"]
)
```
Deduplication
Government contract data has a critical deduplication nuance: modifications are not duplicates. A contract action with modification_number = "0" is the base award. modification_number = "P00001" is the first modification. Both are legitimate records — deduplicating on contract ID alone destroys the history of changes.
```python
# When you want the current state of each contract (latest modification)
# Note: drop_duplicates keeps whole rows. groupby().first() would take the
# first NON-NULL value per column, silently mixing values from different rows.
df_current = (
    df
    .sort_values(["contract_award_unique_key", "action_date"], ascending=[True, False])
    .drop_duplicates(subset="contract_award_unique_key", keep="first")
)
print(f"Original: {len(df):,} rows | Deduplicated to current state: {len(df_current):,} rows")

# When you want the full modification history (all actions)
# No deduplication needed — the full table IS the history
# Just ensure your primary key is (contract_award_unique_key, modification_number)
dupes = df.duplicated(subset=["contract_award_unique_key", "modification_number"])
if dupes.any():
    print(f"WARNING: {dupes.sum():,} duplicate (key, modification) pairs found")
    df = df[~dupes]
```
PySpark at Scale on Databricks
When the dataset is too large for pandas (typically over 5M rows on Advana and Jupiter), PySpark handles the same operations distributed across the cluster. The Bronze/Silver/Gold lakehouse pattern structures this work: Bronze is raw ingested data, Silver is cleaned and standardized, Gold is validated and business-rule-enforced.
```python
from pyspark.sql import functions as F, Window

# Read Bronze tier (raw)
df_bronze = spark.table("procurement_catalog.bronze.usaspending_fy2024")

# Clean and standardize for Silver tier
df_silver = (
    df_bronze
    # Standardize NAICS to 6 digits
    .withColumn(
        "naics_code",
        F.lpad(F.regexp_replace(F.col("naics_code").cast("string"), r"[^0-9]", ""), 6, "0"),
    )
    # Standardize obligation as double
    .withColumn(
        "federal_action_obligation",
        F.col("federal_action_obligation").cast("double"),
    )
    # Derive federal fiscal year (Oct–Sep, named for the calendar year it ends in)
    .withColumn(
        "fiscal_year",
        F.when(F.month("action_date") >= 10, F.year("action_date") + 1)
         .otherwise(F.year("action_date")),
    )
    # Flag data quality issues rather than silently dropping
    .withColumn(
        "dq_flag",
        F.when(F.col("naics_code").rlike(r"^0{6}$"), "INVALID_NAICS")
         .when(F.col("recipient_uei").isNull(), "MISSING_UEI")
         .otherwise(None),
    )
    # Remove true duplicates using a window function
    .withColumn(
        "row_rank",
        F.row_number().over(
            Window.partitionBy("contract_award_unique_key", "modification_number")
                  .orderBy(F.col("action_date").desc())
        ),
    )
    .filter(F.col("row_rank") == 1)
    .drop("row_rank")
)

# Write to Silver tier
(
    df_silver
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("procurement_catalog.silver.usaspending_fy2024_clean")
)
```
Government-Specific Identifier Standardization
CAGE Codes and UEI/DUNS
In April 2022, the federal government replaced DUNS (Dun & Bradstreet's proprietary identifier) with UEI (Unique Entity Identifier) issued through SAM.gov. Datasets spanning pre- and post-2022 will have both, inconsistently. CAGE codes (5-character alphanumeric DoD identifiers) are separate from UEIs — a contractor may have multiple CAGE codes for different divisions and locations but only one UEI per legal entity.
```python
def standardize_vendor_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize CAGE codes, UEI, and legacy DUNS numbers.
    Handles the April 2022 SAM.gov UEI transition period.
    """
    # CAGE codes: always uppercase, 5 chars, alphanumeric
    # First char cannot be I, O, or Q (reserved)
    df["cage_code"] = (
        df["cage_code"].astype(str).str.upper().str.strip()
        .str.replace(r"[^A-Z0-9]", "", regex=True)
    )
    valid_cage = df["cage_code"].str.match(r"^[A-HJ-NP-Z0-9][A-Z0-9]{4}$")
    df.loc[~valid_cage & (df["cage_code"] != ""), "data_quality_flag"] = "INVALID_CAGE_FORMAT"

    # UEI: 12-character alphanumeric (SAM.gov)
    df["recipient_uei"] = (
        df["recipient_uei"].astype(str).str.upper().str.strip()
        .str.replace(r"[^A-Z0-9]", "", regex=True)
    )
    valid_uei = df["recipient_uei"].str.match(r"^[A-Z0-9]{12}$")
    post_2022 = df["action_date"] >= "2022-04-04"
    df.loc[post_2022 & ~valid_uei & (df["recipient_uei"].str.len() > 0),
           "data_quality_flag"] = "INVALID_UEI_FORMAT"

    # DUNS: 9-digit numeric, zero-padded (legacy, pre-2022)
    df["recipient_duns_number"] = (
        df["recipient_duns_number"].astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
        .str.zfill(9)
    )
    return df
```
NAICS Code Standardization
```python
NAICS_SECTORS = {
    "11": "Agriculture/Forestry/Fishing/Hunting",
    "21": "Mining/Quarrying/Oil/Gas",
    "22": "Utilities",
    "23": "Construction",
    "31": "Manufacturing", "32": "Manufacturing", "33": "Manufacturing",
    "42": "Wholesale Trade",
    "44": "Retail Trade", "45": "Retail Trade",
    "48": "Transportation/Warehousing", "49": "Transportation/Warehousing",
    "51": "Information",
    "52": "Finance/Insurance",
    "53": "Real Estate/Rental/Leasing",
    "54": "Professional/Scientific/Technical Services",
    "55": "Management of Companies",
    "56": "Administrative/Support/Waste Services",
    "61": "Educational Services",
    "62": "Health Care/Social Assistance",
    "71": "Arts/Entertainment/Recreation",
    "72": "Accommodation/Food Services",
    "81": "Other Services",
    "92": "Public Administration",
}

def standardize_naics(df: pd.DataFrame) -> pd.DataFrame:
    df["naics_code"] = (
        df["naics_code"].astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
        .str.zfill(6)
    )
    # zfill guarantees a minimum of 6 digits, so this catches over-long codes
    df.loc[df["naics_code"].str.len() != 6, "data_quality_flag"] = "INVALID_NAICS_LENGTH"
    df["naics_sector_code"] = df["naics_code"].str[:2]
    df["naics_sector_desc"] = df["naics_sector_code"].map(NAICS_SECTORS).fillna("Unknown")
    return df
```
Palantir Foundry: Pipeline Builder vs. Code Repositories
On Foundry, you have two tools for building data transforms. Choosing between them is a real decision that shapes how your team maintains the pipeline.
Pipeline Builder is a visual, node-based interface — drag, connect, configure. No Python required. A business analyst can maintain it. Use Pipeline Builder for standard joins, filters, casts, and renames; when the business owner needs to modify without a developer; and when visual lineage matters for non-engineers.
Code Repositories are Git-backed Python environments using the @transform_df decorator. Use Code Repositories when you need complex conditional logic, ML model inference inside the pipeline, unit tests against transform logic, or edge case handling that a GUI can't express.
The most common mistake on Foundry: reaching for Code Repositories by default because you're an engineer who prefers code. Pipeline Builder is the right choice for standard cleaning tasks in production — lower maintenance cost, visible to non-engineers, and less likely to break silently when a junior analyst inherits it.
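When Code Repositories are the right choice, the unit-testing advantage is easiest to realize by keeping the transform logic in a plain function. A minimal sketch using the agency-code fill rule from earlier in this chapter; the @transform_df wiring is shown only as a comment because transforms.api resolves only inside Foundry, and the dataset paths in that comment are hypothetical:

```python
import pandas as pd

# Inside a Foundry Code Repository this function would be registered with
# the transforms API, e.g.:
#   @transform_df(Output("/pipeline/silver/awards_clean"),
#                 df=Input("/pipeline/bronze/awards_raw"))
# Keeping the rule in a plain function makes it unit-testable anywhere.
def fill_funding_agency(df: pd.DataFrame) -> pd.DataFrame:
    """Null funding agency means 'same as awarding agency' (domain rule)."""
    out = df.copy()
    out["funding_agency_code"] = out["funding_agency_code"].fillna(
        out["awarding_agency_code"]
    )
    return out
```

A test in the same repository can then assert the rule against a small fixture — exactly the edge-case coverage Pipeline Builder cannot give you.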
Data Quality Metrics
```python
import mlflow

def compute_quality_metrics(df: pd.DataFrame, table_name: str) -> dict:
    """Compute quality metrics; the MLflow run below logs them as a timestamped audit trail."""
    n_rows = len(df)
    metrics = {"table_name": table_name, "row_count": n_rows}

    critical_cols = ["contract_award_unique_key", "recipient_uei",
                     "federal_action_obligation", "action_date"]
    for col in critical_cols:
        if col in df.columns:
            completeness = (1 - df[col].isnull().mean()) * 100
            metrics[f"completeness_{col}"] = round(completeness, 2)

    if "naics_code" in df.columns:
        valid_naics = df["naics_code"].str.match(r"^\d{6}$").mean() * 100
        metrics["validity_naics_6digit_pct"] = round(valid_naics, 2)

    pk = ["contract_award_unique_key", "modification_number"]
    if all(c in df.columns for c in pk):
        dup_pct = df.duplicated(subset=pk).mean() * 100
        metrics["uniqueness_pk_dup_pct"] = round(dup_pct, 2)

    return metrics

metrics = compute_quality_metrics(df_cleaned, "usaspending_fy2024_clean")

with mlflow.start_run(run_name="procurement_data_quality_check"):
    mlflow.log_param("source_table", "procurement_catalog.raw.usaspending_fy2024")
    for key, val in metrics.items():
        if isinstance(val, (int, float)):
            mlflow.log_metric(key, val)
```
Handling PII and PHI
Government datasets routinely contain PII. Sole-source contracts with individual consultants often include SSNs in place of EINs. Veteran-owned small business certifications include personal background information. Run a pattern-match scan before assuming you're clean.
```python
import hashlib
import hmac

def scan_for_pii_patterns(df: pd.DataFrame) -> pd.DataFrame:
    """Scan string columns for likely PII patterns."""
    patterns = {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "phone": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
        "email": r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
    }
    results = []
    for col in df.select_dtypes(include="object").columns:
        sample = df[col].dropna().astype(str).head(10000)
        for label, pattern in patterns.items():
            matches = sample.str.contains(pattern, regex=True).sum()
            if matches > 0:
                results.append({"column": col, "pattern": label,
                                "matches_in_sample": int(matches)})
    return pd.DataFrame(results)

TOKENIZATION_KEY = b"use_a_real_secret_from_your_vault_here"

def tokenize_field(value: str, key: bytes = TOKENIZATION_KEY) -> str:
    """HMAC-SHA256 tokenization — deterministic but not reversible without the key."""
    if pd.isnull(value) or str(value).strip() == "":
        return ""
    token = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    return f"TKN_{token}"

pii_scan = scan_for_pii_patterns(df_raw)
if len(pii_scan) > 0:
    print("PII patterns detected — review before proceeding:")
    print(pii_scan)
    # Tokenize all flagged columns before loading to a shared workspace
    for col in pii_scan["column"].unique():
        df_raw[f"{col}_token"] = df_raw[col].apply(tokenize_field)
        df_raw = df_raw.drop(columns=[col])
```
Where This Goes Wrong
Failure Mode 1: Treating Cleaning as a One-Time Event
Cleaning the dataset once, saving a Parquet file, and moving on. Six months later the pipeline updates with new records that have new formatting quirks. Fix: treat the cleaning pipeline as production code with unit tests. Run quality metric checks on every new data load. Set thresholds — if completeness on recipient_uei drops below 92%, fail the pipeline and alert someone.
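That threshold check can be a few lines at the end of every load job. A minimal sketch, assuming the metric names emitted by compute_quality_metrics above; the 92% floor is the illustrative number from the text, not a universal standard, and the floors here are hypothetical:

```python
# Hypothetical minimum-completeness floors, keyed by the metric names that
# compute_quality_metrics() emits. Tune per dataset.
COMPLETENESS_FLOORS = {
    "completeness_recipient_uei": 92.0,
    "completeness_contract_award_unique_key": 100.0,
}

def enforce_quality_floors(metrics: dict) -> None:
    """Raise (failing the pipeline run) if any tracked metric falls below its floor."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f}% < floor {floor:.2f}%"
        for name, floor in COMPLETENESS_FLOORS.items()
        if metrics.get(name, 0.0) < floor
    ]
    if failures:
        # In a scheduled job, an uncaught exception fails the run and
        # triggers whatever alerting the scheduler is configured with
        raise ValueError("Data quality floor violated: " + "; ".join(failures))
```

Calling enforce_quality_floors(metrics) right after the MLflow logging step turns a silent drift into a loud, attributable pipeline failure.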
Failure Mode 2: Normalizing Away Real Signal
"Boeing" and "Boeing Defense, Space & Security" appear as distinct recipient names tied to different contract vehicles and performance histories. Collapsing both to "BOEING" during normalization destroys that distinction. Fix: normalize for display, preserve the original field. Join against SAM.gov's entity hierarchy data to get official parent/child relationships.
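The "normalize for display, preserve the original" rule can be as simple as writing the cleaned name into a new column. A minimal sketch; the recipient_name column follows the chapter's USAspending-style naming, and the _display suffix is an assumed convention:

```python
import pandas as pd

def add_display_name(df: pd.DataFrame) -> pd.DataFrame:
    """Add a normalized display column; never overwrite recipient_name itself."""
    out = df.copy()
    out["recipient_name_display"] = (
        out["recipient_name"]
        .astype(str)
        .str.upper()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)  # collapse internal whitespace
        .str.replace(r"[.,]", "", regex=True)  # drop punctuation for grouping
    )
    # recipient_name is untouched, so the entity distinction survives
    return out
```

Group and chart on recipient_name_display, but always join and audit on the preserved original plus the UEI.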
Failure Mode 3: Skipping PII Review on "Non-Sensitive" Datasets
The dataset is labelled "procurement data" so you assume it's clean of PII and load it into a shared workspace accessible to 200 analysts. Fix: run scan_for_pii_patterns() before loading any new dataset into a shared workspace. This is not optional, and getting it wrong has career consequences.
Platform Comparison
| Capability | Advana (Databricks) | Navy Jupiter | Palantir Foundry | Qlik |
|---|---|---|---|---|
| Primary cleaning interface | PySpark notebooks | PySpark notebooks | Pipeline Builder + Code Repositories | Qlik Data Manager |
| No-code ETL | Limited | Limited | Strong (Pipeline Builder) | Strong (Data Manager) |
| Handles 100M+ rows | Yes (Spark clusters) | Yes (Spark clusters) | Yes (Spark-backed) | No (memory-bound) |
| Data lineage | Unity Catalog | Collibra + Unity Catalog | Native Ontology lineage | Partial |
| Quality metric tracking | MLflow | MLflow | Pipeline health dashboards | Built-in data profiling |
| Semantic entity layer | No | No | Yes (Ontology) | No |
Exercises
This chapter includes 4 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →