# Unsupervised Machine Learning
When there are no labels, you still have structure. Clustering finds natural groupings in contract portfolios. Anomaly detection flags spending patterns that fall outside the fiscal calendar's rhythm. Dimensionality reduction makes 40-feature datasets interpretable for a program manager who will never open a Jupyter notebook.
## Clustering

### K-Means for Procurement Segmentation
K-means clusters contracts into groups that share similar characteristics — obligation size, competition type, NAICS sector, vendor relationship patterns. The output gives a contracting officer a vocabulary for describing the portfolio without enumerating individual awards.
The right number of clusters is not 2 and not 20. Use the elbow method (inertia curve) and silhouette score together. The elbow is a heuristic; the silhouette tells you whether clusters are meaningfully separated.
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Feature selection for procurement clustering
feature_cols = [
    "log_obligation",             # np.log1p(federal_action_obligation)
    "period_of_performance_days",
    "competition_type_encoded",   # 0=sole source, 1=limited, 2=full
    "naics_sector_encoded",
    "modification_count",
    "vendor_award_count_prior",   # vendor experience proxy
]
X = df[feature_cols].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Evaluate k from 2 to 15
inertias = []
silhouettes = []
k_values = range(2, 16)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(
        silhouette_score(X_scaled, labels, sample_size=5000, random_state=42)
    )
    print(f"k={k:2d}  inertia={km.inertia_:,.0f}  silhouette={silhouettes[-1]:.3f}")

# Choose k at the silhouette peak
best_k = k_values[np.argmax(silhouettes)]
print(f"\nBest k by silhouette: {best_k}")

# Final model
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
df["cluster"] = km_final.fit_predict(X_scaled)

# Profile each cluster
cluster_profile = (
    df.groupby("cluster")
    .agg(
        count=("contract_award_unique_key", "count"),
        total_obligation=("federal_action_obligation", "sum"),
        median_obligation=("federal_action_obligation", "median"),
        sole_source_rate=("competition_type_encoded", lambda x: (x == 0).mean()),
    )
    .sort_values("total_obligation", ascending=False)
)
print(f"\nCluster profiles:\n{cluster_profile}")
```
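Segments are only useful if new awards can be placed into them. A minimal sketch of scoring, reusing the fitted `scaler` and `km_final` from above; synthetic data stands in for the real feature matrix so the example runs on its own.

```python
# Sketch: assign new awards to existing segments. `scaler` and `km_final`
# play the roles of the fitted objects from the block above; here they are
# refit on synthetic data so the example runs standalone.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 6))  # stand-in for the scaled feature matrix
scaler = StandardScaler().fit(X_train)
km_final = KMeans(n_clusters=4, random_state=42, n_init=10).fit(
    scaler.transform(X_train)
)

# New awards must pass through the SAME fitted scaler — refitting on the
# new batch would shift the feature space and scramble segment identities.
X_new = rng.normal(size=(10, 6))
new_segments = km_final.predict(scaler.transform(X_new))
print(new_segments)  # one segment id per new award
```

The design point: persist the scaler alongside the model; applying a freshly fit scaler to next month's awards silently changes what each cluster means.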
### Distributed K-Means with Spark MLlib on Databricks

For datasets too large for scikit-learn (over roughly 5M rows), use Spark MLlib's distributed K-means, which Databricks runs across the cluster.
```python
from pyspark.ml.clustering import KMeans as SparkKMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler as SparkScaler
from pyspark.ml import Pipeline as SparkPipeline

# Assemble feature vector for Spark ML
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="raw_features",
    handleInvalid="keep",
)
scaler = SparkScaler(
    inputCol="raw_features",
    outputCol="features",
    withMean=True,
    withStd=True,
)
km_spark = SparkKMeans(
    featuresCol="features",
    predictionCol="cluster",
    k=best_k,
    seed=42,
    maxIter=20,
)
pipeline_spark = SparkPipeline(stages=[assembler, scaler, km_spark])
model = pipeline_spark.fit(df_spark)
df_clustered = model.transform(df_spark)

# Write results to Gold tier
df_clustered.write.format("delta").mode("overwrite").saveAsTable(
    "procurement_catalog.gold.contract_clusters"
)
```
### DBSCAN for Irregular-Shaped Clusters

K-means assumes spherical clusters. Government data rarely has spherical structure. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and explicitly marks low-density points as noise (label -1) — which is often exactly what you want: identifying the "normal" contracts and flagging the outliers.
```python
from sklearn.cluster import DBSCAN

# DBSCAN: eps = neighborhood radius, min_samples = core point threshold
# On scaled procurement data, eps in the 0.5-1.0 range is a typical starting point
dbscan = DBSCAN(eps=0.7, min_samples=10, n_jobs=-1)
df["dbscan_cluster"] = dbscan.fit_predict(X_scaled)

n_clusters = len(set(df["dbscan_cluster"])) - (1 if -1 in df["dbscan_cluster"].values else 0)
n_noise = (df["dbscan_cluster"] == -1).sum()
print(f"DBSCAN results: {n_clusters} clusters, {n_noise:,} noise points "
      f"({n_noise/len(df)*100:.1f}%)")

# The noise points (-1) are your anomaly candidates
anomalies = df[df["dbscan_cluster"] == -1].sort_values(
    "federal_action_obligation", ascending=False
)
print("\nTop anomalous contracts by obligation:")
print(anomalies[["contract_award_unique_key", "recipient_name",
                 "federal_action_obligation"]].head(10))
```
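Rather than guessing `eps`, you can read a starting value off the k-distance curve: sort each point's distance to its `min_samples`-th neighbor and look for the knee. A sketch under the assumption that `X_scaled` is the scaled feature matrix (synthetic data here so it runs standalone); the 95th-percentile shortcut is a crude stand-in for eyeballing the elbow on a plot.

```python
# Sketch: derive a candidate eps from the k-distance curve instead of
# guessing. Synthetic data stands in for X_scaled.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(1000, 6))

min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])  # distance to each point's min_samples-th neighbor

# Crude knee proxy: a high quantile of the sorted curve. In practice,
# plot k_dist and pick eps at the visible elbow.
eps_candidate = float(np.quantile(k_dist, 0.95))
print(f"candidate eps: {eps_candidate:.2f}")
```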
## Anomaly Detection with Isolation Forest
Isolation Forest is the most practical anomaly detector for government financial data because it handles mixed feature types, scales to millions of records, and produces an anomaly score (not just a binary flag) that you can threshold operationally.
The government-specific wrinkle: September is anomalous by design (fiscal year-end spending surge). If you train without controlling for fiscal calendar effects, you'll generate a spike of false positives every September. Include fiscal month as a feature so the model learns that September is "normal-for-September."
```python
from sklearn.ensemble import IsolationForest

# Include fiscal calendar features to avoid September false positives
feature_cols_anomaly = feature_cols + [
    "fiscal_month",   # 1=Oct through 12=Sep — critical for government data
    "is_fiscal_eoq",  # end of fiscal quarter (December, March, June, September)
    "is_fiscal_eoy",  # September only
]
X_anomaly = df[feature_cols_anomaly].fillna(0)
X_scaled_anomaly = StandardScaler().fit_transform(X_anomaly)

iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expect ~5% anomalies
    random_state=42,
    n_jobs=-1,
)
df["anomaly_score"] = iso_forest.fit_predict(X_scaled_anomaly)
df["anomaly_raw"] = iso_forest.decision_function(X_scaled_anomaly)
# anomaly_score: -1 = anomaly, 1 = normal
# anomaly_raw: lower = more anomalous

anomaly_rate = (df["anomaly_score"] == -1).mean()
print(f"Anomaly rate: {anomaly_rate:.3f} ({anomaly_rate*100:.1f}% of contracts flagged)")

# Review flagged contracts — present to contracting officer
flagged = df[df["anomaly_score"] == -1].sort_values("anomaly_raw")
print("\nTop anomalies (lowest scores = most anomalous):")
print(flagged[["contract_award_unique_key", "recipient_name",
               "federal_action_obligation", "fiscal_month",
               "anomaly_raw"]].head(20))
```
## Dimensionality Reduction

### PCA for Feature Compression
PCA reduces many correlated features to a smaller set of orthogonal components. For procurement data with 40+ features, PCA can compress to 5-10 components that capture 85% of variance — making downstream clustering faster and visualization possible.
```python
from sklearn.decomposition import PCA

pca = PCA(random_state=42)
pca.fit(X_scaled)

# Cumulative explained variance — how many components cross each threshold?
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_85 = np.argmax(cumulative_variance >= 0.85) + 1
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components for 85% variance: {n_components_85}")
print(f"Components for 95% variance: {n_components_95}")

# Apply chosen dimensionality
pca_final = PCA(n_components=n_components_85, random_state=42)
X_pca = pca_final.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_pca.shape[1]} dimensions")
```
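Components are only interpretable if you can name them, and the loadings matrix is how. A sketch of inspecting loadings, assuming objects like `pca_final` and `feature_cols` from above; the feature names and synthetic data here are illustrative stand-ins so the example runs on its own.

```python
# Sketch: inspect component loadings so each retained component can be
# described in business terms. Synthetic stand-ins for X_scaled and
# feature_cols keep the example self-contained.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

feature_cols = ["log_obligation", "period_days", "competition",
                "naics_sector", "mod_count", "vendor_history"]
rng = np.random.default_rng(42)
X_scaled = rng.normal(size=(300, len(feature_cols)))

pca_final = PCA(n_components=3, random_state=42).fit(X_scaled)
loadings = pd.DataFrame(
    pca_final.components_.T,       # rows: features, columns: components
    index=feature_cols,
    columns=[f"PC{i+1}" for i in range(3)],
)
# The largest-magnitude loadings name the component: a component dominated
# by log_obligation and period_days reads as a "contract size" axis.
print(loadings.round(2))
```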
### Topic Modeling for Contract Descriptions
Federal contract descriptions are free text that encodes procurement intent in ways that NAICS codes miss. LDA (Latent Dirichlet Allocation) finds recurring themes across tens of thousands of contract descriptions — useful for portfolio analysis and anomaly detection in contract language.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA models term counts, so vectorize with CountVectorizer (not tf-idf)
vectorizer = CountVectorizer(
    max_features=5000,
    min_df=10,            # ignore terms that appear in fewer than 10 contracts
    max_df=0.95,          # ignore terms in more than 95% of contracts
    stop_words="english",
    ngram_range=(1, 2),   # unigrams and bigrams
)
X_text = vectorizer.fit_transform(df["contract_description"].fillna(""))
feature_names = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(
    n_components=10,  # 10 topics
    max_iter=15,
    learning_method="online",
    random_state=42,
    n_jobs=-1,
)
lda.fit(X_text)

# Display top words per topic
def display_topics(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {topic_idx:2d}: {', '.join(top_words)}")

print("LDA Topics:")
display_topics(lda, feature_names)

# Assign dominant topic to each contract
df["dominant_topic"] = lda.transform(X_text).argmax(axis=1)
```
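The "anomaly detection in contract language" use case falls out of the same model: a description whose topic distribution is spread thin fits no recurring theme. A sketch under stated assumptions; the tiny synthetic corpus, the 0.6 cutoff, and the two-topic model are illustrative, not calibrated values.

```python
# Sketch: flag contracts whose descriptions don't fit any topic well.
# A tiny synthetic corpus stands in for the real descriptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = (["software maintenance support services"] * 20
        + ["construction repair facility roofing"] * 20
        + ["quantum llama procurement zeppelin"])  # an odd description
X_text = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(X_text)

topic_probs = lda.transform(X_text)   # each row is a distribution over topics
max_topic_prob = topic_probs.max(axis=1)
# A low maximum topic probability means the description's language is
# spread across topics, i.e. it matches no recurring theme.
flagged = np.where(max_topic_prob < 0.6)[0]
print(f"{len(flagged)} low-confidence descriptions")
```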
## Foundry: Unsupervised Learning as a Transform
In Foundry, the same clustering logic runs as a scheduled transform. Since this function works on a pandas DataFrame, it uses the `transform_pandas` decorator, which suits feature tables that fit in driver memory; `transform_df` operates on Spark DataFrames and is the choice at scale.

```python
from transforms.api import transform_pandas, Input, Output
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

@transform_pandas(
    Output("/analytics/gold/contract_segments"),
    features=Input("/analytics/silver/contract_features"),
)
def segment_contracts(features: pd.DataFrame) -> pd.DataFrame:
    """
    Foundry transform: cluster contracts into operational segments.
    Output is written to Gold tier and exposed as a Foundry Object Type.
    """
    df = features.copy()
    cluster_cols = ["log_obligation", "period_days", "competition_type",
                    "naics_sector", "fiscal_month"]
    X = df[cluster_cols].fillna(0)
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=6, random_state=42, n_init=10)
    df["segment_id"] = km.fit_predict(X_scaled)
    return df[["contract_id", "segment_id",
               "federal_action_obligation", "naics_sector",
               "recipient_name", "fiscal_year"]]
```
## Where This Goes Wrong

### Failure Mode 1: Choosing k Without Validation
Picking k=5 because "five segments seems reasonable" without computing silhouette scores or profiling the resulting clusters for interpretability. Fix: always compute silhouette scores across a range of k values. Then profile the clusters — if you can't explain what each cluster represents to a non-statistician in one sentence, the clustering is not useful.
### Failure Mode 2: Ignoring Fiscal Calendar in Anomaly Detection
Training an anomaly detector on procurement data and generating a wave of false positives every September and December because the model doesn't know that end-of-quarter spending surges are structural. Fix: include fiscal month and quarter-end indicators as features so the model learns the calendar pattern as "normal."
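A minimal sketch of deriving those calendar features, assuming an award-date column named `action_date` (the column name is illustrative). The federal fiscal year starts in October, so `fiscal_month` maps October to 1 and September to 12.

```python
# Sketch: derive fiscal calendar features from an action_date column.
# Column names are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({"action_date": pd.to_datetime(
    ["2023-10-15", "2024-03-30", "2024-09-28"])})

# Oct -> 1 ... Sep -> 12 (federal fiscal year)
df["fiscal_month"] = (df["action_date"].dt.month - 10) % 12 + 1
# Fiscal quarter-ends are December, March, June, September
df["is_fiscal_eoq"] = df["action_date"].dt.month.isin([12, 3, 6, 9]).astype(int)
# Fiscal year-end is September
df["is_fiscal_eoy"] = (df["action_date"].dt.month == 9).astype(int)
print(df)
# Oct 15 -> fiscal_month 1; Sep 28 -> fiscal_month 12 with eoq=1 and eoy=1
```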
### Failure Mode 3: Treating Noise Points as Errors
In DBSCAN, noise points (label -1) are contracts that don't fit any cluster — which is exactly what you wanted to find. Automatically removing them or re-assigning them to the nearest cluster destroys the analysis. Fix: surface noise points to the contracting officer as anomaly candidates. That's the output of the analysis, not a data quality problem.
## Platform Comparison
| Capability | Databricks (Advana/Jupiter) | Palantir Foundry | Qlik |
|---|---|---|---|
| Distributed clustering | MLlib K-means (millions of rows) | Code Workspaces (scikit-learn) | Not supported |
| Anomaly detection | Isolation Forest (scikit-learn) | Code Workspaces (scikit-learn) | Not supported |
| Topic modeling | LDA, BERTopic (DBR ML) | Code Workspaces | Not supported |
| Results as operational objects | Delta table → Gold tier | Foundry Objects (Ontology) | QVD → charts |
| Experiment tracking | MLflow | Foundry versioned artifacts | Not applicable |
## Exercises
This chapter includes 5 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →