Chapter 05

Exploratory Data Analysis

EDA in a federal context means knowing which patterns are real signal and which are artifacts of fiscal calendars, procurement cycles, and reporting lags. Before you model anything, understand the data's seasonal structure, distribution shape, and platform-specific idiosyncrasies.

~35 min read · Platforms: Databricks, Qlik, Palantir Foundry, Advana, Navy Jupiter · Code on GitHub

Statistical Profiling

The first step after cleaning is profiling: understanding distributions, identifying outliers, and checking for patterns that signal data quality problems the cleaning step didn't catch.

python
import pandas as pd
import numpy as np

def profile_government_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """
    Comprehensive statistical profile for a government dataset.
    Returns a summary DataFrame covering shape, nulls, distributions,
    and cardinality for each column.
    """
    rows, cols = df.shape
    print(f"Dataset shape: {rows:,} rows x {cols} columns")

    profile = []
    for col in df.columns:
        series = df[col]
        dtype = str(series.dtype)
        n_null = series.isnull().sum()
        n_unique = series.nunique(dropna=True)

        row = {
            "column":     col,
            "dtype":      dtype,
            "null_count": n_null,
            "null_pct":   round(n_null / rows * 100, 2),
            "n_unique":   n_unique,
        }

        if pd.api.types.is_numeric_dtype(series):
            row.update({
                "mean":   round(series.mean(), 4),
                "median": round(series.median(), 4),
                "std":    round(series.std(), 4),
                "min":    series.min(),
                "p25":    series.quantile(0.25),
                "p75":    series.quantile(0.75),
                "max":    series.max(),
                "skew":   round(series.skew(), 3),
            })
        elif dtype == "object":
            top = series.value_counts().head(1)
            row["top_value"]      = top.index[0] if len(top) > 0 else None
            row["top_value_freq"] = int(top.values[0]) if len(top) > 0 else 0

        profile.append(row)

    return pd.DataFrame(profile)


profile = profile_government_dataset(df)
print("\nHighly skewed numeric columns (skew > 2):")
print(profile[profile["skew"].abs() > 2][["column", "mean", "median", "skew"]])

Government Data Patterns to Know

Fiscal Year Seasonality

The federal fiscal year ends September 30. The "September surge" — also called "use it or lose it" spending — is one of the most reliable patterns in all of federal data. In many agencies, 20–30% of annual contract obligations occur in September. This is not an anomaly. If you model procurement data without controlling for fiscal month, your model will learn September as a prediction target rather than a calendar artifact.

python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Fiscal month: 1 = October, 12 = September
df["fiscal_month"] = ((df["action_date"].dt.month - 10) % 12) + 1
fiscal_month_names = {
    1: "Oct", 2: "Nov", 3: "Dec", 4: "Jan", 5: "Feb", 6: "Mar",
    7: "Apr", 8: "May", 9: "Jun", 10: "Jul", 11: "Aug", 12: "Sep",
}
df["fiscal_month_name"] = df["fiscal_month"].map(fiscal_month_names)

monthly = (
    df.groupby("fiscal_month")["federal_action_obligation"]
    .agg(["sum", "count"])
    .reset_index()
)
monthly["obligation_billions"] = monthly["sum"] / 1e9

fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.bar(monthly["fiscal_month"], monthly["obligation_billions"],
        color="#1B3A6B", alpha=0.8)
ax1.set_xticks(monthly["fiscal_month"])
ax1.set_xticklabels([fiscal_month_names[m] for m in monthly["fiscal_month"]])
ax1.set_xlabel("Fiscal Month (Oct through Sep)")
ax1.set_ylabel("Total Obligations ($B)")
ax1.yaxis.set_major_formatter(mticker.StrMethodFormatter("{x:,.0f}"))
ax1.set_title("Federal Contract Obligations by Fiscal Month — FY2024\n"
              "Note: September surge is structural, not anomalous")
plt.tight_layout()
plt.savefig("fiscal_month_obligations.png", dpi=150, bbox_inches="tight")

Procurement Cycles

Beyond fiscal year seasonality, government procurement has multi-year cycles driven by program budgets, continuing resolutions, and the biennial Congressional budget cycle. A model trained on FY2018–2020 data may perform poorly on FY2023–2024 because the procurement environment changed — budget authority shifted to continuing resolutions, new contract vehicles were established, and agency priorities realigned.

Unit Identifier Chaos

Government datasets join across unit identifiers that evolved independently: UIC (Unit Identification Code) for military units, FIPS codes for geographic entities, TAS (Treasury Account Symbol) for budget accounts, and BETC (Business Event Type Code) for financial transactions. None of these map cleanly to each other. Before building any cross-system join, document the identifier chain explicitly.
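One lightweight way to make that chain explicit is a crosswalk table validated at load time. Below is a sketch; the column names and UIC/FIPS/TAS values are illustrative placeholders, not real assignments, and `validate_crosswalk` is a hypothetical helper:

```python
import pandas as pd

# Hypothetical crosswalk: each row documents one hop in the identifier chain.
# Values below are placeholders, not real UIC-to-FIPS/TAS mappings.
crosswalk = pd.DataFrame({
    "uic":        ["N00024", "N00039", "W56HZV"],
    "fips_state": ["51", "11", "26"],
    "tas":        ["017-1804", "017-1810", "021-2035"],
})

def validate_crosswalk(xw: pd.DataFrame, key: str) -> dict:
    """Check a crosswalk for the two failure modes that silently break joins:
    duplicate keys (row fan-out on merge) and null mappings (row loss)."""
    return {
        "duplicate_keys": int(xw[key].duplicated().sum()),
        "null_mappings":  int(xw.drop(columns=[key]).isnull().any(axis=1).sum()),
    }

print(validate_crosswalk(crosswalk, "uic"))
```

Running the validator before every cross-system merge turns "the join looked fine" into a checked invariant: zero duplicate keys means no fan-out, zero null mappings means no silent row loss.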

EDA by Platform

Databricks: SQL + Python Notebooks

On Databricks (Advana and Jupiter), the standard EDA workflow alternates between SQL queries in %sql cells and Python visualization in %python cells. This lets you run aggregate queries against the full distributed dataset and then visualize the results without ever materializing the full data to the driver.

sql
-- Run in a %sql cell: aggregate directly on Spark without .toPandas()
SELECT
    naics_sector_code,
    naics_sector_desc,
    fiscal_year,
    COUNT(*)                                          AS award_count,
    SUM(federal_action_obligation)                    AS total_obligations,
    AVG(federal_action_obligation)                    AS avg_award_size,
    PERCENTILE_APPROX(federal_action_obligation, 0.5) AS median_award_size,
    PERCENTILE_APPROX(federal_action_obligation, 0.9) AS p90_award_size
FROM procurement_catalog.silver.usaspending_fy2024_clean
WHERE fiscal_year >= 2020
GROUP BY naics_sector_code, naics_sector_desc, fiscal_year
ORDER BY fiscal_year DESC, total_obligations DESC

Qlik: Associative Exploration

Qlik's QIX engine makes EDA interactive in a way that SQL notebooks cannot. Every selection in a Qlik app filters all other charts simultaneously — clicking on "NAICS 54" in a sector breakdown immediately shows all related vendors, timeframes, and agencies filtered to that sector. This associative exploration finds unexpected relationships that a pre-built SQL query structure would miss.

Palantir Foundry: Object Explorer

On Foundry, EDA happens through the Object Explorer — a graphical browser of the Ontology. You navigate from a Contract object to its linked Vendor object to all associated ContractModification objects without writing a single query. For understanding relationships between entities in a complex dataset, Foundry's Ontology-driven EDA is faster than any SQL notebook approach.

Outlier Detection

Government contract data is genuinely right-skewed — a small number of large contracts dominate the total obligation volume. "Outliers" in this data may be real (a $5B IDIQ contract is legitimate, not an error). Treat extreme values as domain knowledge problems before deciding to clip or remove them.

python
def identify_outliers(df: pd.DataFrame, col: str,
                      method: str = "iqr") -> pd.Series:
    """
    Identify outliers using IQR or percentile method.

    For government financial data, use percentile with high thresholds
    (P99 or P99.5) rather than IQR — the data is genuinely right-skewed
    and IQR flags too many legitimate large contracts.
    """
    series = df[col].dropna()

    if method == "iqr":
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outlier_mask = (df[col] < lower) | (df[col] > upper)

    elif method == "percentile":
        lower = series.quantile(0.005)  # 0.5th percentile
        upper = series.quantile(0.995)  # 99.5th percentile
        outlier_mask = (df[col] < lower) | (df[col] > upper)

    else:
        raise ValueError(f"Unknown method {method!r}: use 'iqr' or 'percentile'")

    n_outliers = outlier_mask.sum()
    pct = n_outliers / len(df) * 100
    print(f"{col}: {n_outliers:,} outliers ({pct:.2f}%) | "
          f"range: [{lower:,.0f}, {upper:,.0f}]")
    print("  Flagged value summary:")
    print(df.loc[outlier_mask, col].describe())
    return outlier_mask


# For obligation amounts — use percentile, not IQR
# A $5B IDIQ is real data, not an error
outlier_mask = identify_outliers(df, "federal_action_obligation", method="percentile")

# Review the actual records flagged before any removal decision
flagged_records = df[outlier_mask][["contract_award_unique_key", "recipient_name",
                                    "federal_action_obligation", "award_type"]].head(20)
print("\nSample outlier records (review before removing):")
print(flagged_records)

Distribution Checks for ML Readiness

Before building a model, EDA should answer three questions about distributions: Is the target variable balanced enough for classification? Are continuous features distributed in a way that requires transformation? Are there distribution differences between your training and test periods?

python
from scipy import stats

def check_target_class_balance(y: pd.Series, target_name: str = "target") -> None:
    """Check class balance for binary classification targets."""
    counts = y.value_counts()
    pct = y.value_counts(normalize=True) * 100

    print(f"Class balance for '{target_name}':")
    for cls, count in counts.items():
        print(f"  Class {cls}: {count:,} ({pct[cls]:.1f}%)")

    # Imbalance ratio
    if len(counts) == 2:
        ratio = counts.max() / counts.min()
        if ratio > 10:
            print(f"\n  WARNING: Severe imbalance (ratio {ratio:.1f}:1)")
            print("  Consider: SMOTE, class weights, or stratified sampling")
        elif ratio > 3:
            print(f"\n  NOTE: Moderate imbalance (ratio {ratio:.1f}:1)")
            print("  Use class_weight='balanced' in scikit-learn estimators")
        else:
            print(f"\n  Balance is acceptable (ratio {ratio:.1f}:1)")


def test_feature_normality(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Test whether numeric features are approximately normal (useful for linear models)."""
    results = []
    for col in numeric_cols:
        series = df[col].dropna()
        if len(series) < 20:
            continue
        _, p_value = stats.shapiro(series.sample(min(len(series), 5000), random_state=42))
        skewness = series.skew()
        results.append({
            "feature":          col,
            "skewness":         round(skewness, 3),
            "shapiro_p":        round(p_value, 4),
            "likely_normal":    p_value > 0.05,
            "needs_transform":  abs(skewness) > 1.0,
        })
    return pd.DataFrame(results).sort_values("skewness", key=abs, ascending=False)

Where This Goes Wrong

Failure Mode 1: Treating Fiscal Seasonality as Signal

The mistake: including action_month as a raw feature in a model that predicts future procurement behavior. The model learns that September is high-volume because of the fiscal year end, which is true, but the result is a model that "predicts" based on calendar position rather than underlying procurement drivers. Fix: encode fiscal month as a cyclical feature, or include it explicitly as a control variable rather than as a predictive feature.
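The cyclical encoding maps fiscal month onto a sin/cos pair so that month 12 (Sep) and month 1 (Oct) sit next to each other on the unit circle instead of eleven units apart. A minimal sketch, assuming the `fiscal_month` column computed earlier:

```python
import numpy as np
import pandas as pd

def encode_fiscal_month(df: pd.DataFrame, col: str = "fiscal_month") -> pd.DataFrame:
    """Add sin/cos columns encoding fiscal month (1 = Oct ... 12 = Sep)
    as a point on the unit circle, preserving Sep-to-Oct adjacency."""
    df = df.copy()
    angle = 2 * np.pi * (df[col] - 1) / 12
    df[f"{col}_sin"] = np.sin(angle)
    df[f"{col}_cos"] = np.cos(angle)
    return df

demo = pd.DataFrame({"fiscal_month": [1, 6, 12]})
print(encode_fiscal_month(demo).round(3))
```

With this encoding, the Euclidean distance between months 12 and 1 equals the distance between any other adjacent pair, so a linear model cannot exploit raw month magnitude as a proxy for the September surge.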

Failure Mode 2: Skipping Temporal Distribution Checks

The mistake: doing all EDA on the aggregated dataset without checking whether the distributions of key variables changed between the training period and the test period. Fiscal policy changes, continuing resolutions, and procurement reform create structural breaks. Fix: always stratify your EDA by year and look explicitly for distributional shifts.

Failure Mode 3: Removing Legitimate Large Values

The mistake: applying standard IQR-based outlier removal to obligation amounts and accidentally dropping the largest IDIQ contracts, which represent legitimate program funding. Fix: use the percentile method (P0.5 / P99.5) rather than IQR, and always review the flagged records before removing them. A $5B contract is not a data error.

Platform Comparison

| EDA Capability | Databricks (Advana/Jupiter) | Qlik | Palantir Foundry |
| --- | --- | --- | --- |
| Large-scale profiling | PySpark + MLflow | QIX engine (memory-bound) | Code Workspaces |
| Interactive exploration | Databricks SQL dashboards | Best-in-class (associative) | Object Explorer + Workshop |
| Statistical tests | scipy, statsmodels | Limited | Code Workspaces (scipy) |
| Distribution visualization | matplotlib, seaborn, plotly | Qlik charts (limited custom) | matplotlib in Code Workspaces |
| Cross-entity relationships | SQL JOINs | Associative model | Ontology navigation (best) |

Exercises

This chapter includes 6 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.

View Exercises on GitHub →