Exploratory Data Analysis
EDA in a federal context means knowing which patterns are real signal and which are artifacts of fiscal calendars, procurement cycles, and reporting lags. Before you model anything, understand the data's seasonal structure, distribution shape, and platform-specific idiosyncrasies.
Statistical Profiling
The first step after cleaning is profiling: understanding distributions, identifying outliers, and checking for patterns that signal data quality problems the cleaning step didn't catch.
```python
import pandas as pd
import numpy as np


def profile_government_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """
    Comprehensive statistical profile for a government dataset.

    Returns a summary DataFrame covering shape, nulls, distributions,
    and cardinality for each column.
    """
    rows, cols = df.shape
    print(f"Dataset shape: {rows:,} rows x {cols} columns")

    profile = []
    for col in df.columns:
        series = df[col]
        dtype = str(series.dtype)
        n_null = series.isnull().sum()
        n_unique = series.nunique(dropna=True)
        row = {
            "column": col,
            "dtype": dtype,
            "null_count": n_null,
            "null_pct": round(n_null / rows * 100, 2),
            "n_unique": n_unique,
        }
        if pd.api.types.is_numeric_dtype(series):
            row.update({
                "mean": round(series.mean(), 4),
                "median": round(series.median(), 4),
                "std": round(series.std(), 4),
                "min": series.min(),
                "p25": series.quantile(0.25),
                "p75": series.quantile(0.75),
                "max": series.max(),
                "skew": round(series.skew(), 3),
            })
        elif dtype == "object":
            top = series.value_counts().head(1)
            row["top_value"] = top.index[0] if len(top) > 0 else None
            row["top_value_freq"] = int(top.values[0]) if len(top) > 0 else 0
        profile.append(row)
    return pd.DataFrame(profile)


profile = profile_government_dataset(df)

print("\nHighly skewed numeric columns (skew > 2):")
print(profile[profile["skew"].abs() > 2][["column", "mean", "median", "skew"]])
```
Government Data Patterns to Know
Fiscal Year Seasonality
The federal fiscal year ends September 30. The "September surge" — also called "use it or lose it" spending — is one of the most reliable patterns in all of federal data. In many agencies, 20–30% of annual contract obligations occur in September. This is not an anomaly. If you model procurement data without controlling for fiscal month, your model will learn September as a prediction target rather than a calendar artifact.
```python
import matplotlib.pyplot as plt

# Fiscal month: 1 = October, 12 = September
df["fiscal_month"] = ((df["action_date"].dt.month - 10) % 12) + 1
fiscal_month_names = {
    1: "Oct", 2: "Nov", 3: "Dec", 4: "Jan", 5: "Feb", 6: "Mar",
    7: "Apr", 8: "May", 9: "Jun", 10: "Jul", 11: "Aug", 12: "Sep",
}
df["fiscal_month_name"] = df["fiscal_month"].map(fiscal_month_names)

monthly = (
    df.groupby("fiscal_month")["federal_action_obligation"]
    .agg(["sum", "count"])
    .reset_index()
)
monthly["obligation_billions"] = monthly["sum"] / 1e9

fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.bar(monthly["fiscal_month"], monthly["obligation_billions"],
        color="#1B3A6B", alpha=0.8)
ax1.set_xlabel("Fiscal Month (1=Oct, 12=Sep)")
ax1.set_ylabel("Total Obligations ($B)")
ax1.set_title("Federal Contract Obligations by Fiscal Month — FY2024\n"
              "Note: September surge is structural, not anomalous")
plt.tight_layout()
plt.savefig("fiscal_month_obligations.png", dpi=150, bbox_inches="tight")
```
Procurement Cycles
Beyond fiscal year seasonality, government procurement has multi-year cycles driven by program budgets, continuing resolutions, and the biennial Congressional budget cycle. A model trained on FY2018–2020 data may perform poorly on FY2023–2024 because the procurement environment changed — budget authority shifted to continuing resolutions, new contract vehicles were established, and agency priorities realigned.
Unit Identifier Chaos
Government datasets join across unit identifiers that evolved independently: UIC (Unit Identification Code) for military units, FIPS codes for geographic entities, TAS (Treasury Account Symbol) for budget accounts, and BETC (Business Event Type Code) for financial transactions. None of these map cleanly to each other. Before building any cross-system join, document the identifier chain explicitly.
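Documenting the identifier chain pays off most at join time. The sketch below is a hypothetical example of an explicit crosswalk join with pandas: the table, column names, and UIC values are illustrative, not real system fields. The point is the pattern, using `validate` to catch duplicate crosswalk rows and `indicator` to surface unmatched identifiers instead of silently dropping them.

```python
import pandas as pd

# Hypothetical unit-level obligations keyed by UIC
units = pd.DataFrame({
    "uic": ["W123AA", "W456BB", "N789CC"],
    "obligation": [1_200_000, 450_000, 980_000],
})

# Hypothetical UIC-to-FIPS crosswalk; N789CC is deliberately missing
crosswalk = pd.DataFrame({
    "uic": ["W123AA", "W456BB"],
    "fips_county": ["51059", "22071"],
})

# validate="m:1" raises if the crosswalk has duplicate UICs;
# indicator=True adds a _merge column flagging unmatched rows
joined = units.merge(crosswalk, on="uic", how="left",
                     validate="m:1", indicator=True)
unmatched = joined[joined["_merge"] == "left_only"]
print(f"Unmatched UICs: {len(unmatched)} of {len(units)}")
```

Logging the unmatched count at every hop in the chain turns a silent data loss into an auditable join report.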
EDA by Platform
Databricks: SQL + Python Notebooks
On Databricks (Advana and Jupiter), the standard EDA workflow alternates between SQL queries in %sql cells and Python visualization in %python cells. This lets you run aggregate queries against the full distributed dataset and then visualize the results without ever materializing the full data to the driver.
```sql
-- Run in a %sql cell: aggregate directly on Spark without .toPandas()
SELECT
    naics_sector_code,
    naics_sector_desc,
    fiscal_year,
    COUNT(*) AS award_count,
    SUM(federal_action_obligation) AS total_obligations,
    AVG(federal_action_obligation) AS avg_award_size,
    PERCENTILE_APPROX(federal_action_obligation, 0.5) AS median_award_size,
    PERCENTILE_APPROX(federal_action_obligation, 0.9) AS p90_award_size
FROM procurement_catalog.silver.usaspending_fy2024_clean
WHERE fiscal_year >= 2020
GROUP BY naics_sector_code, naics_sector_desc, fiscal_year
ORDER BY fiscal_year DESC, total_obligations DESC
```
Qlik: Associative Exploration
Qlik's QIX engine makes EDA interactive in a way that SQL notebooks cannot. Every selection in a Qlik app filters all other charts simultaneously — clicking on "NAICS 54" in a sector breakdown immediately shows all related vendors, timeframes, and agencies filtered to that sector. This associative exploration finds unexpected relationships that a pre-built SQL query structure would miss.
Palantir Foundry: Object Explorer
On Foundry, EDA happens through the Object Explorer — a graphical browser of the Ontology. You navigate from a Contract object to its linked Vendor object to all associated ContractModification objects without writing a single query. For understanding relationships between entities in a complex dataset, Foundry's Ontology-driven EDA is typically faster than writing the equivalent SQL in a notebook.
Outlier Detection
Government contract data is genuinely right-skewed — a small number of large contracts dominate the total obligation volume. "Outliers" in this data may be real (a $5B IDIQ contract is legitimate, not an error). Treat extreme values as domain knowledge problems before deciding to clip or remove them.
```python
import pandas as pd


def identify_outliers(df: pd.DataFrame, col: str,
                      method: str = "iqr") -> pd.Series:
    """
    Identify outliers using the IQR or percentile method.

    For government financial data, use the percentile method with high
    thresholds (P99 or P99.5) rather than IQR -- the data is genuinely
    right-skewed and IQR flags too many legitimate large contracts.
    """
    series = df[col].dropna()
    if method == "iqr":
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    elif method == "percentile":
        lower = series.quantile(0.005)  # 0.5th percentile
        upper = series.quantile(0.995)  # 99.5th percentile
    else:
        raise ValueError(f"Unknown method: {method!r}")
    outlier_mask = (df[col] < lower) | (df[col] > upper)

    n_outliers = outlier_mask.sum()
    pct = n_outliers / len(df) * 100
    print(f"{col}: {n_outliers:,} outliers ({pct:.2f}%) | "
          f"range: [{lower:,.0f}, {upper:,.0f}]")
    print(f"  Values outside range:\n{df.loc[outlier_mask, col].describe()}")
    return outlier_mask


# For obligation amounts -- use percentile, not IQR.
# A $5B IDIQ is real data, not an error.
outlier_mask = identify_outliers(df, "federal_action_obligation", method="percentile")

# Review the actual records flagged before any removal decision
flagged_records = df[outlier_mask][["contract_award_unique_key", "recipient_name",
                                    "federal_action_obligation", "award_type"]].head(20)
print("\nSample outlier records (review before removing):")
print(flagged_records)
```
Distribution Checks for ML Readiness
Before building a model, EDA should answer three questions about distributions: Is the target variable balanced enough for classification? Are continuous features distributed in a way that requires transformation? Are there distribution differences between your training and test periods?
```python
import pandas as pd
from scipy import stats


def check_target_class_balance(y: pd.Series, target_name: str = "target") -> None:
    """Check class balance for binary classification targets."""
    counts = y.value_counts()
    pct = y.value_counts(normalize=True) * 100
    print(f"Class balance for '{target_name}':")
    for cls, count in counts.items():
        print(f"  Class {cls}: {count:,} ({pct[cls]:.1f}%)")

    # Imbalance ratio
    if len(counts) == 2:
        ratio = counts.max() / counts.min()
        if ratio > 10:
            print(f"\n  WARNING: Severe imbalance (ratio {ratio:.1f}:1)")
            print("  Consider: SMOTE, class weights, or stratified sampling")
        elif ratio > 3:
            print(f"\n  NOTE: Moderate imbalance (ratio {ratio:.1f}:1)")
            print("  Use class_weight='balanced' in scikit-learn estimators")
        else:
            print(f"\n  Balance is acceptable (ratio {ratio:.1f}:1)")


def test_feature_normality(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Test whether numeric features are approximately normal (useful for linear models)."""
    results = []
    for col in numeric_cols:
        series = df[col].dropna()
        if len(series) < 20:
            continue
        # Shapiro-Wilk loses accuracy above ~5,000 samples, so test a subsample
        _, p_value = stats.shapiro(series.sample(min(len(series), 5000), random_state=42))
        skewness = series.skew()
        results.append({
            "feature": col,
            "skewness": round(skewness, 3),
            "shapiro_p": round(p_value, 4),
            "likely_normal": p_value > 0.05,
            "needs_transform": abs(skewness) > 1.0,
        })
    return pd.DataFrame(results).sort_values("skewness", key=abs, ascending=False)
```
Where This Goes Wrong
Failure Mode 1: Treating Fiscal Seasonality as Signal
Including action_month as a raw feature in a model that predicts future procurement behavior. The model learns that September is high-volume because of the fiscal year end, which is true — but this produces a model that "predicts" based on calendar position rather than underlying procurement drivers. Fix: encode fiscal month as a cyclical feature or include it explicitly as a control variable rather than a predictive feature.
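The cyclical encoding mentioned in the fix can be sketched as a sin/cos transform, so a linear or distance-based model treats month 12 (September) and month 1 (October) as adjacent rather than maximally distant. Column names here are illustrative.

```python
import numpy as np
import pandas as pd

# One row per fiscal month (1 = Oct ... 12 = Sep)
months = pd.DataFrame({"fiscal_month": range(1, 13)})

# Map each month onto the unit circle; months 12 and 1 end up neighbors
angle = 2 * np.pi * (months["fiscal_month"] - 1) / 12
months["fm_sin"] = np.sin(angle)
months["fm_cos"] = np.cos(angle)
print(months.round(3))
```

With a raw integer encoding, September (12) and October (1) are eleven units apart; on the circle they are one step apart, which matches the fiscal calendar.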
Failure Mode 2: Skipping Temporal Distribution Checks
Doing all EDA on the aggregated dataset without checking whether the distribution of key variables changed between your training period and your test period. Fiscal policy changes, continuing resolutions, and procurement reform create structural breaks. Fix: always stratify your EDA by year and look explicitly for distributional shifts.
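One way to operationalize the year-by-year check is a two-sample Kolmogorov–Smirnov test of each year against a baseline year. The data below is synthetic (the column names follow the chapter's examples, and the 2023 shift is built in to illustrate what a structural break looks like).

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic obligations: log-normal, with the scale shifted upward in 2023
df = pd.DataFrame({
    "fiscal_year": np.repeat([2021, 2022, 2023], 2000),
    "federal_action_obligation": np.concatenate([
        rng.lognormal(11, 1.5, 2000),
        rng.lognormal(11, 1.5, 2000),
        rng.lognormal(12, 1.5, 2000),  # structural break
    ]),
})

baseline = df.loc[df["fiscal_year"] == 2021, "federal_action_obligation"]
results = {}
for year in (2022, 2023):
    sample = df.loc[df["fiscal_year"] == year, "federal_action_obligation"]
    stat, p = stats.ks_2samp(baseline, sample)
    results[year] = (stat, p)
    flag = "SHIFT" if p < 0.01 else "stable"
    print(f"2021 vs {year}: KS={stat:.3f}, p={p:.4f} -> {flag}")
```

A low p-value for a later year is a prompt to investigate, not an automatic verdict: with large samples the KS test flags even small, benign shifts, so inspect the per-year distributions before concluding a break exists.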
Failure Mode 3: Removing Legitimate Large Values
Using standard IQR-based outlier removal on obligation amounts and accidentally removing the largest IDIQ contracts that represent legitimate program funding. Fix: apply the percentile method (P0.5 / P99.5) rather than IQR, and always review the flagged records before removing them. A $5B contract is not a data error.
Platform Comparison
| EDA Capability | Databricks (Advana/Jupiter) | Qlik | Palantir Foundry |
|---|---|---|---|
| Large-scale profiling | PySpark + MLflow | QIX engine (memory-bound) | Code Workspaces |
| Interactive exploration | Databricks SQL dashboards | Best-in-class (associative) | Object Explorer + Workshop |
| Statistical tests | scipy, statsmodels | Limited | Code Workspaces (scipy) |
| Distribution visualization | matplotlib, seaborn, plotly | Qlik charts (limited custom) | matplotlib in Code Workspaces |
| Cross-entity relationships | SQL JOINs | Associative model | Ontology navigation (best) |
Exercises
This chapter includes 6 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →