# Unsupervised Machine Learning
When there are no labels, you still have structure. Clustering finds natural groupings in contract portfolios. Anomaly detection flags spending patterns that fall outside the fiscal calendar's rhythm. Dimensionality reduction makes 40-feature datasets interpretable for a program manager who will never open a Jupyter notebook.
## Clustering

### K-Means for Procurement Segmentation
K-means clusters contracts into groups that share similar characteristics — obligation size, competition type, NAICS sector, vendor relationship patterns. The output gives a contracting officer a vocabulary for describing the portfolio without enumerating individual awards.
The right number of clusters is not 2 and not 20. Use the elbow method (inertia curve) and silhouette score together. The elbow is a heuristic; the silhouette tells you whether clusters are meaningfully separated.
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Feature selection for procurement clustering
feature_cols = [
    "log_obligation",             # np.log1p(federal_action_obligation)
    "period_of_performance_days",
    "competition_type_encoded",   # 0=sole source, 1=limited, 2=full
    "naics_sector_encoded",
    "modification_count",
    "vendor_award_count_prior",   # vendor experience proxy
]
X = df[feature_cols].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Evaluate k from 2 to 15
inertias = []
silhouettes = []
k_values = range(2, 16)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(
        silhouette_score(X_scaled, labels, sample_size=5000, random_state=42)
    )
    print(f"k={k:2d}  inertia={km.inertia_:,.0f}  silhouette={silhouettes[-1]:.3f}")

# Choose k at the silhouette peak
best_k = k_values[np.argmax(silhouettes)]
print(f"\nBest k by silhouette: {best_k}")

# Final model
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
df["cluster"] = km_final.fit_predict(X_scaled)

# Profile each cluster
cluster_profile = (
    df.groupby("cluster")
    .agg(
        count=("contract_award_unique_key", "count"),
        total_obligation=("federal_action_obligation", "sum"),
        median_obligation=("federal_action_obligation", "median"),
        sole_source_rate=("competition_type_encoded", lambda x: (x == 0).mean()),
    )
    .sort_values("total_obligation", ascending=False)
)
print(f"\nCluster profiles:\n{cluster_profile}")
```
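Segments are only useful if new awards can be placed into them. A minimal sketch of scoring, reusing the fitted `scaler` and `km_final` from above; synthetic data stands in for the real feature matrix so the example runs on its own.

```python
# Sketch: assign new awards to existing segments. `scaler` and `km_final`
# play the roles of the fitted objects from the block above; here they are
# refit on synthetic data so the example runs standalone.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 6))  # stand-in for the scaled feature matrix
scaler = StandardScaler().fit(X_train)
km_final = KMeans(n_clusters=4, random_state=42, n_init=10).fit(
    scaler.transform(X_train)
)

# New awards must pass through the SAME fitted scaler — refitting on the
# new batch would shift the feature space and scramble segment identities.
X_new = rng.normal(size=(10, 6))
new_segments = km_final.predict(scaler.transform(X_new))
print(new_segments)  # one segment id per new award
```

The design point: persist the scaler alongside the model; applying a freshly fit scaler to next month's awards silently changes what each cluster means.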
### Distributed K-Means with Spark MLlib on Databricks

For datasets too large for scikit-learn (over roughly 5M rows), use Spark MLlib's distributed K-means, which Databricks runs across the cluster.
```python
from pyspark.ml.clustering import KMeans as SparkKMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler as SparkScaler
from pyspark.ml import Pipeline as SparkPipeline

# Assemble feature vector for Spark ML
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="raw_features",
    handleInvalid="keep",
)
scaler = SparkScaler(
    inputCol="raw_features",
    outputCol="features",
    withMean=True,
    withStd=True,
)
km_spark = SparkKMeans(
    featuresCol="features",
    predictionCol="cluster",
    k=best_k,
    seed=42,
    maxIter=20,
)
pipeline_spark = SparkPipeline(stages=[assembler, scaler, km_spark])
model = pipeline_spark.fit(df_spark)
df_clustered = model.transform(df_spark)

# Write results to Gold tier
df_clustered.write.format("delta").mode("overwrite").saveAsTable(
    "procurement_catalog.gold.contract_clusters"
)
```
### DBSCAN for Irregular-Shaped Clusters

K-means assumes spherical clusters. Government data rarely has spherical structure. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and explicitly marks low-density points as noise (label -1) — which is often exactly what you want: identifying the "normal" contracts and flagging the outliers.
```python
from sklearn.cluster import DBSCAN

# DBSCAN: eps = neighborhood radius, min_samples = core point threshold
# On scaled procurement data, eps in the 0.5-1.0 range is a typical starting point
dbscan = DBSCAN(eps=0.7, min_samples=10, n_jobs=-1)
df["dbscan_cluster"] = dbscan.fit_predict(X_scaled)

n_clusters = len(set(df["dbscan_cluster"])) - (1 if -1 in df["dbscan_cluster"].values else 0)
n_noise = (df["dbscan_cluster"] == -1).sum()
print(f"DBSCAN results: {n_clusters} clusters, {n_noise:,} noise points "
      f"({n_noise/len(df)*100:.1f}%)")

# The noise points (-1) are your anomaly candidates
anomalies = df[df["dbscan_cluster"] == -1].sort_values(
    "federal_action_obligation", ascending=False
)
print("\nTop anomalous contracts by obligation:")
print(anomalies[["contract_award_unique_key", "recipient_name",
                 "federal_action_obligation"]].head(10))
```
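Rather than guessing `eps`, you can read a starting value off the k-distance curve: sort each point's distance to its `min_samples`-th neighbor and look for the knee. A sketch under the assumption that `X_scaled` is the scaled feature matrix (synthetic data here so it runs standalone); the 95th-percentile shortcut is a crude stand-in for eyeballing the elbow on a plot.

```python
# Sketch: derive a candidate eps from the k-distance curve instead of
# guessing. Synthetic data stands in for X_scaled.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(1000, 6))

min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])  # distance to each point's min_samples-th neighbor

# Crude knee proxy: a high quantile of the sorted curve. In practice,
# plot k_dist and pick eps at the visible elbow.
eps_candidate = float(np.quantile(k_dist, 0.95))
print(f"candidate eps: {eps_candidate:.2f}")
```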
## Anomaly Detection with Isolation Forest
Isolation Forest is the most practical anomaly detector for government financial data because it handles mixed feature types, scales to millions of records, and produces an anomaly score (not just a binary flag) that you can threshold operationally.
The government-specific wrinkle: September is anomalous by design (fiscal year-end spending surge). If you train without controlling for fiscal calendar effects, you'll generate a spike of false positives every September. Include fiscal month as a feature so the model learns that September is "normal-for-September."
```python
from sklearn.ensemble import IsolationForest

# Include fiscal calendar features to avoid September false positives
feature_cols_anomaly = feature_cols + [
    "fiscal_month",   # 1=Oct through 12=Sep — critical for government data
    "is_fiscal_eoq",  # end of fiscal quarter (December, March, June, September)
    "is_fiscal_eoy",  # September only
]
X_anomaly = df[feature_cols_anomaly].fillna(0)
X_scaled_anomaly = StandardScaler().fit_transform(X_anomaly)

iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expect ~5% anomalies
    random_state=42,
    n_jobs=-1,
)
df["anomaly_score"] = iso_forest.fit_predict(X_scaled_anomaly)
df["anomaly_raw"] = iso_forest.decision_function(X_scaled_anomaly)
# anomaly_score: -1 = anomaly, 1 = normal
# anomaly_raw: lower = more anomalous

anomaly_rate = (df["anomaly_score"] == -1).mean()
print(f"Anomaly rate: {anomaly_rate:.3f} ({anomaly_rate*100:.1f}% of contracts flagged)")

# Review flagged contracts — present to contracting officer
flagged = df[df["anomaly_score"] == -1].sort_values("anomaly_raw")
print("\nTop anomalies (lowest scores = most anomalous):")
print(flagged[["contract_award_unique_key", "recipient_name",
               "federal_action_obligation", "fiscal_month",
               "anomaly_raw"]].head(20))
```
## Dimensionality Reduction

### PCA for Feature Compression
PCA reduces many correlated features to a smaller set of orthogonal components. For procurement data with 40+ features, PCA can compress to 5-10 components that capture 85% of variance — making downstream clustering faster and visualization possible.
```python
from sklearn.decomposition import PCA

pca = PCA(random_state=42)
pca.fit(X_scaled)

# Cumulative explained variance — how many components cross each threshold?
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_85 = np.argmax(cumulative_variance >= 0.85) + 1
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components for 85% variance: {n_components_85}")
print(f"Components for 95% variance: {n_components_95}")

# Apply chosen dimensionality
pca_final = PCA(n_components=n_components_85, random_state=42)
X_pca = pca_final.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_pca.shape[1]} dimensions")
```
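Components are only interpretable if you can name them, and the loadings matrix is how. A sketch of inspecting loadings, assuming objects like `pca_final` and `feature_cols` from above; the feature names and synthetic data here are illustrative stand-ins so the example runs on its own.

```python
# Sketch: inspect component loadings so each retained component can be
# described in business terms. Synthetic stand-ins for X_scaled and
# feature_cols keep the example self-contained.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

feature_cols = ["log_obligation", "period_days", "competition",
                "naics_sector", "mod_count", "vendor_history"]
rng = np.random.default_rng(42)
X_scaled = rng.normal(size=(300, len(feature_cols)))

pca_final = PCA(n_components=3, random_state=42).fit(X_scaled)
loadings = pd.DataFrame(
    pca_final.components_.T,       # rows: features, columns: components
    index=feature_cols,
    columns=[f"PC{i+1}" for i in range(3)],
)
# The largest-magnitude loadings name the component: a component dominated
# by log_obligation and period_days reads as a "contract size" axis.
print(loadings.round(2))
```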
### Topic Modeling for Contract Descriptions
Federal contract descriptions are free text that encodes procurement intent in ways that NAICS codes miss. LDA (Latent Dirichlet Allocation) finds recurring themes across tens of thousands of contract descriptions — useful for portfolio analysis and anomaly detection in contract language.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA models term counts, so vectorize with CountVectorizer (not tf-idf)
vectorizer = CountVectorizer(
    max_features=5000,
    min_df=10,            # ignore terms that appear in fewer than 10 contracts
    max_df=0.95,          # ignore terms in more than 95% of contracts
    stop_words="english",
    ngram_range=(1, 2),   # unigrams and bigrams
)
X_text = vectorizer.fit_transform(df["contract_description"].fillna(""))
feature_names = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(
    n_components=10,  # 10 topics
    max_iter=15,
    learning_method="online",
    random_state=42,
    n_jobs=-1,
)
lda.fit(X_text)

# Display top words per topic
def display_topics(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {topic_idx:2d}: {', '.join(top_words)}")

print("LDA Topics:")
display_topics(lda, feature_names)

# Assign dominant topic to each contract
df["dominant_topic"] = lda.transform(X_text).argmax(axis=1)
```
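The "anomaly detection in contract language" use case falls out of the same model: a description whose topic distribution is spread thin fits no recurring theme. A sketch under stated assumptions; the tiny synthetic corpus, the 0.6 cutoff, and the two-topic model are illustrative, not calibrated values.

```python
# Sketch: flag contracts whose descriptions don't fit any topic well.
# A tiny synthetic corpus stands in for the real descriptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = (["software maintenance support services"] * 20
        + ["construction repair facility roofing"] * 20
        + ["quantum llama procurement zeppelin"])  # an odd description
X_text = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(X_text)

topic_probs = lda.transform(X_text)   # each row is a distribution over topics
max_topic_prob = topic_probs.max(axis=1)
# A low maximum topic probability means the description's language is
# spread across topics, i.e. it matches no recurring theme.
flagged = np.where(max_topic_prob < 0.6)[0]
print(f"{len(flagged)} low-confidence descriptions")
```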
## Foundry: Unsupervised Learning as a Transform
In Foundry, the same clustering logic runs as a scheduled transform. Since this function works on a pandas DataFrame, it uses the `transform_pandas` decorator, which suits feature tables that fit in driver memory; `transform_df` operates on Spark DataFrames and is the choice at scale.

```python
from transforms.api import transform_pandas, Input, Output
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

@transform_pandas(
    Output("/analytics/gold/contract_segments"),
    features=Input("/analytics/silver/contract_features"),
)
def segment_contracts(features: pd.DataFrame) -> pd.DataFrame:
    """
    Foundry transform: cluster contracts into operational segments.
    Output is written to Gold tier and exposed as a Foundry Object Type.
    """
    df = features.copy()
    cluster_cols = ["log_obligation", "period_days", "competition_type",
                    "naics_sector", "fiscal_month"]
    X = df[cluster_cols].fillna(0)
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=6, random_state=42, n_init=10)
    df["segment_id"] = km.fit_predict(X_scaled)
    return df[["contract_id", "segment_id",
               "federal_action_obligation", "naics_sector",
               "recipient_name", "fiscal_year"]]
```
## Where This Goes Wrong

### Failure Mode 1: Choosing k Without Validation
Picking k=5 because "five segments seems reasonable" without computing silhouette scores or profiling the resulting clusters for interpretability. Fix: always compute silhouette scores across a range of k values. Then profile the clusters — if you can't explain what each cluster represents to a non-statistician in one sentence, the clustering is not useful.
### Failure Mode 2: Ignoring Fiscal Calendar in Anomaly Detection
Training an anomaly detector on procurement data and generating a wave of false positives every September and December because the model doesn't know that end-of-quarter spending surges are structural. Fix: include fiscal month and quarter-end indicators as features so the model learns the calendar pattern as "normal."
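A minimal sketch of deriving those calendar features, assuming an award-date column named `action_date` (the column name is illustrative). The federal fiscal year starts in October, so `fiscal_month` maps October to 1 and September to 12.

```python
# Sketch: derive fiscal calendar features from an action_date column.
# Column names are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({"action_date": pd.to_datetime(
    ["2023-10-15", "2024-03-30", "2024-09-28"])})

# Oct -> 1 ... Sep -> 12 (federal fiscal year)
df["fiscal_month"] = (df["action_date"].dt.month - 10) % 12 + 1
# Fiscal quarter-ends are December, March, June, September
df["is_fiscal_eoq"] = df["action_date"].dt.month.isin([12, 3, 6, 9]).astype(int)
# Fiscal year-end is September
df["is_fiscal_eoy"] = (df["action_date"].dt.month == 9).astype(int)
print(df)
# Oct 15 -> fiscal_month 1; Sep 28 -> fiscal_month 12 with eoq=1 and eoy=1
```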
### Failure Mode 3: Treating Noise Points as Errors
In DBSCAN, noise points (label -1) are contracts that don't fit any cluster — which is exactly what you wanted to find. Automatically removing them or re-assigning them to the nearest cluster destroys the analysis. Fix: surface noise points to the contracting officer as anomaly candidates. That's the output of the analysis, not a data quality problem.
## Platform Comparison
| Capability | Databricks (Advana/Jupiter) | Palantir Foundry | Qlik |
|---|---|---|---|
| Distributed clustering | MLlib K-means (millions of rows) | Code Workspaces (scikit-learn) | Not supported |
| Anomaly detection | Isolation Forest (scikit-learn) | Code Workspaces (scikit-learn) | Not supported |
| Topic modeling | LDA, BERTopic (DBR ML) | Code Workspaces | Not supported |
| Results as operational objects | Delta table → Gold tier | Foundry Objects (Ontology) | QVD → charts |
| Experiment tracking | MLflow | Foundry versioned artifacts | Not applicable |
## Exercises
This chapter includes 5 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →