Overview
Databricks is the production ML development environment in the DoD data ecosystem. On both Advana and Navy Jupiter, Databricks is the tool you use when your work requires scale — distributed compute, version-controlled ML pipelines, experiment tracking, and model deployment.
The platform achieved FedRAMP High authorization on AWS GovCloud on February 27, 2025. It holds DoD IL5 authorization on both AWS and Azure Government. As of December 2025, all new Databricks accounts exclusively use Unity Catalog as the governance layer.
The core technical components — PySpark for distributed data processing, MLflow for experiment tracking, Delta Lake for storage, Unity Catalog for governance — are all open source or open standard. That matters in a government context where vendor lock-in and data portability are real concerns.
Architecture on Advana and Jupiter
*Figure: Databricks architecture in the DoD ecosystem. Notebooks run on managed clusters; all data passes through Unity Catalog governance.*
Python Environment and Package Management
The first thing to understand when starting on a government Databricks deployment is the Databricks Runtime (DBR): the DBR version determines your Python version and the set of pre-installed packages.
DBR 14.x ships with Python 3.10 and includes pandas, numpy, scikit-learn, matplotlib, seaborn, mlflow, and PySpark pre-installed. DBR 14.x ML adds torch, tensorflow, xgboost, lightgbm, and huggingface_hub.
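To confirm what a given cluster actually provides, you can inspect the interpreter and package versions directly from a notebook cell. A minimal stdlib-only sketch (the package list here is illustrative, not exhaustive):

```python
import sys
from importlib import metadata

# Collect versions of a few commonly used packages; None means not installed.
versions = {}
for pkg in ("pandas", "numpy", "scikit-learn", "mlflow"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = None

# On DBR 14.x this should report Python 3.10
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(versions)
```

Running this once per new cluster configuration and saving the output alongside your project notes makes environment drift easy to spot later.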
On government Databricks deployments (IL4/IL5), outbound PyPI requests may be blocked entirely. You cannot run `%pip install package-name` without a network path to PyPI. Check with your platform administrator before starting a project that requires packages not in the default DBR image.
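Before assuming `%pip install` will work, you can probe for a network path to PyPI from the cluster. This is a diagnostic sketch only: it tests raw TCP reachability, not proxy or allow-list policy, and the host/port defaults are assumptions.

```python
import socket

def pypi_reachable(host: str = "pypi.org", port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection toward PyPI succeeds.

    On IL4/IL5 deployments this typically fails, which means %pip
    installs from PyPI will not work on that cluster.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(pypi_reachable())
```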
Your options when you need a package not in the DBR image:
- Pre-approved cluster-level libraries: Submit a request to your platform team. They install the package at the cluster level. Typical turnaround is one to two weeks.
- Wheel files in Unity Catalog volumes: Obtain the `.whl` file through your organization's software approval process, upload it to a Unity Catalog volume, and install via `%pip install /Volumes/catalog/schema/volume/package.whl`.
- DBFS upload: On older configurations, upload the wheel to DBFS and reference the path in `%pip install`.
PySpark Patterns for Federal Data
At the scale of DoD logistics and financial data, distributed processing is not optional. Here are the core patterns you will use on most contracts.
```python
# Read from a Unity Catalog Delta table
df = spark.table("don_catalog.readiness.ship_maintenance_silver")

# Basic data quality check before analysis
from pyspark.sql import functions as F

quality_report = df.agg(
    F.count("*").alias("total_rows"),
    F.count_distinct("hull_number").alias("unique_ships"),
    F.sum(F.col("mission_capable_flag").cast("int")).alias("mc_count"),
    F.count(F.when(F.col("maintenance_date").isNull(), 1)).alias("null_dates"),
)
quality_report.show()

# Cleaning: standardize unit codes and dates
cleaned = (
    df
    .withColumn("unit_code", F.trim(F.upper(F.col("unit_code"))))
    .withColumn(
        "maintenance_date",
        F.to_date(F.col("maintenance_date"), "yyyy-MM-dd"),
    )
    .dropDuplicates(["hull_number", "maintenance_date", "work_order_id"])
)

# Write cleaned data to the silver tier
(
    cleaned.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable("don_catalog.readiness.ship_maintenance_silver_v2")
)
```
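Standardization and dedup rules like these are easy to get subtly wrong. One option is to mirror them in plain Python so they can be unit-tested off-cluster before you trust them in the pipeline. This is a hypothetical local mirror for testing, not the PySpark code itself:

```python
def standardize_unit_code(raw: str) -> str:
    # Mirrors F.trim(F.upper(...)) from the PySpark pipeline
    return raw.strip().upper()

def dedupe(rows, keys=("hull_number", "maintenance_date", "work_order_id")):
    # Mirrors dropDuplicates: keep the first row seen for each key tuple
    seen, out = set(), []
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"hull_number": "LHA-6", "maintenance_date": "2025-01-03", "work_order_id": 1},
    {"hull_number": "LHA-6", "maintenance_date": "2025-01-03", "work_order_id": 1},
]
print(standardize_unit_code("  lha-6 "))  # LHA-6
print(len(dedupe(rows)))  # 1
```

Keeping the rules in one tested place also makes it easier to explain them during a data-quality review.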
MLflow: Experiment Tracking and Model Registry
MLflow is the experiment tracking and model registry standard on Databricks. Every model you train should be logged to an MLflow experiment — this is not optional on government contracts where model reproducibility and audit trails are required.
```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

# Set the experiment path (your email or team path in Databricks)
mlflow.set_experiment("/Users/your.email@agency.gov/maintenance-anomaly-v2")

with mlflow.start_run(run_name="rf_v3_balanced_classes"):
    # Log parameters
    params = {
        "n_estimators": 300,
        "max_depth": 10,
        "class_weight": "balanced",
        "random_state": 42,
    }
    mlflow.log_params(params)

    # Log data lineage (critical for audit)
    mlflow.log_param("training_table", "don_catalog.readiness.ship_maintenance_silver")
    mlflow.log_param("training_cutoff", "2025-01-01")
    mlflow.log_param("featurization_version", "v3.1")

    # Train
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Log metrics
    mlflow.log_metric("f1_macro", f1_score(y_test, y_pred, average="macro"))
    mlflow.log_metric("f1_weighted", f1_score(y_test, y_pred, average="weighted"))
    mlflow.log_metric("test_accuracy", clf.score(X_test, y_test))

    # Log the model with a signature
    signature = infer_signature(X_train, y_pred)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="maintenance-anomaly-detector",
    )

    # Log the classification report as an artifact
    report = classification_report(y_test, y_pred)
    mlflow.log_text(report, "classification_report.txt")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
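A pattern that complements the lineage parameters above is logging a deterministic fingerprint of them, so two runs trained on identical inputs can be matched during an audit. A stdlib-only sketch of the idea, with a hypothetical helper (not an MLflow API):

```python
import hashlib
import json

def lineage_fingerprint(lineage: dict) -> str:
    """Deterministic short hash of lineage parameters.

    Log the result as a run param, e.g.
    mlflow.log_param("lineage_hash", lineage_fingerprint(lineage)).
    """
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(lineage, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

lineage = {
    "training_table": "don_catalog.readiness.ship_maintenance_silver",
    "training_cutoff": "2025-01-01",
    "featurization_version": "v3.1",
}
print(lineage_fingerprint(lineage))
```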
Unity Catalog: Governance in Practice
All new Databricks accounts now use Unity Catalog exclusively as the governance layer, so understanding its three-level namespace is essential.
```sql
-- Unity Catalog three-level namespace: catalog.schema.table
--   catalog = data domain or organization
--   schema  = data tier or subject area
--   table   = specific dataset

-- List catalogs you have access to
SHOW CATALOGS;

-- List schemas in a catalog
SHOW SCHEMAS IN don_catalog;

-- Examine table lineage and properties
DESCRIBE EXTENDED don_catalog.readiness.ship_maintenance_gold;

-- Check your permissions on a table
SHOW GRANTS ON TABLE don_catalog.readiness.ship_maintenance_gold;

-- Query with the full namespace (always use three-part names)
SELECT hull_number, COUNT(*) AS maintenance_events
FROM don_catalog.readiness.ship_maintenance_gold
WHERE fiscal_year = 2025
  AND mission_capable_flag = TRUE
GROUP BY hull_number
ORDER BY maintenance_events DESC
LIMIT 20;
```
Always use the three-part name (`catalog.schema.table`) when referencing tables in production code. Relative references work in notebooks but break in scheduled jobs and pipelines where the default catalog may differ.
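When table names are assembled in code, such as in parameterized jobs, a small validator can catch malformed names before they reach Spark. A hypothetical helper (not a Databricks API):

```python
def fq_name(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified catalog.schema.table name,
    rejecting parts that are not simple identifiers."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part.isidentifier():
            raise ValueError(f"invalid identifier: {part!r}")
    return ".".join(parts)

print(fq_name("don_catalog", "readiness", "ship_maintenance_gold"))
```

The helper is deliberately strict; if your tables use characters beyond simple identifiers, loosen the check to match your naming standard.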
Certifications Worth Getting
| Certification | Focus | Who Should Get It |
|---|---|---|
| Data Engineer Associate | Delta Lake, Spark, ETL pipelines | Anyone building data pipelines |
| Machine Learning Associate | MLflow, feature engineering, model deployment | Anyone training and deploying models |
| Data Analyst Associate | SQL, visualization in Databricks | Analysts transitioning from SQL-based tools |