Platform Guide

Databricks

The FedRAMP High lakehouse for data engineering, machine learning, and AI workloads in the DoD ecosystem. When you build a production ML model on Advana or Navy Jupiter, you build it here.

Authorization: FedRAMP High (Feb 2025)
Impact Level: IL2, IL4, IL5
Networks: NIPRNET, IL5
Infrastructure: AWS GovCloud / Azure Gov
Governance: Unity Catalog (Dec 2025+)
Users: 80%+ federal exec depts

Overview

Databricks is the production ML development environment in the DoD data ecosystem. On both Advana and Navy Jupiter, Databricks is the tool you use when your work requires scale — distributed compute, version-controlled ML pipelines, experiment tracking, and model deployment.

The platform achieved FedRAMP High authorization on AWS GovCloud on February 27, 2025. It holds DoD IL5 authorization on both AWS and Azure Government. As of December 2025, all new Databricks accounts exclusively use Unity Catalog as the governance layer.

The core technical components — PySpark for distributed data processing, MLflow for experiment tracking, Delta Lake for storage, Unity Catalog for governance — are all open source or open standard. That matters in a government context where vendor lock-in and data portability are real concerns.

Architecture on Advana and Jupiter

graph TD
    A[Your Databricks Notebook] --> B[Databricks Runtime Cluster]
    B --> C[Delta Lake Storage]
    B --> D[MLflow Tracking Server]
    B --> E[Unity Catalog]
    C --> F[Bronze / Silver / Gold Tables]
    D --> G[Model Registry]
    E --> H[Data Governance + Access Control]
    G --> I[Production Serving Endpoint]
    I --> J[Qlik SSE / API Consumers]

Databricks architecture in the DoD ecosystem. Your notebooks run on managed clusters; all data passes through Unity Catalog governance.

Python Environment and Package Management

The first thing to understand on any government Databricks deployment is the Databricks Runtime (DBR): the DBR version determines your Python version and which packages come pre-installed.

DBR 14.x ships with Python 3.10 and includes pandas, numpy, scikit-learn, matplotlib, seaborn, mlflow, and PySpark pre-installed. DBR 14.x ML adds torch, tensorflow, xgboost, lightgbm, and huggingface_hub.

On government Databricks deployments (IL4/IL5), outbound PyPI requests may be blocked entirely. You cannot run %pip install package-name without a network path to PyPI. Check with your platform administrator before starting a project that requires packages not in the default DBR image.

Your options when you need a package not in the DBR image:

  1. Pre-approved cluster-level libraries: Submit a request to your platform team. They install the package at the cluster level. Typical turnaround is one to two weeks.
  2. Wheel files in Unity Catalog volumes: Obtain the .whl file through your organization's software approval process, upload it to a Unity Catalog volume, and install via %pip install /Volumes/catalog/schema/volume/package.whl.
  3. DBFS upload: On older configurations, upload to DBFS and reference the path in %pip install.
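Before filing a library request, it is worth confirming that the package is not already in the DBR image. A minimal pre-flight check you can run in any notebook cell (pure standard library; the module names in the loop are examples, not a fixed list):

```python
import sys
import importlib.util

def runtime_has(module: str) -> bool:
    """True if the module is importable in the current runtime."""
    return importlib.util.find_spec(module) is not None

# Confirm the interpreter version (DBR 14.x should report Python 3.10)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# Check whatever your project actually needs before requesting anything
for mod in ("pandas", "pyspark", "some_niche_package"):
    status = "present" if runtime_has(mod) else "needs a library request"
    print(f"{mod}: {status}")
```

This saves a one-to-two-week approval cycle when the package turns out to be importable under a different module name than its PyPI distribution name.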

PySpark Patterns for Federal Data

At the scale of DoD logistics and financial data, distributed processing is not optional. Here are the core patterns you will use on most contracts.

```python
# Reading from Unity Catalog Delta table
df = spark.table("don_catalog.readiness.ship_maintenance_silver")

# Basic data quality check before analysis
from pyspark.sql import functions as F

quality_report = df.agg(
    F.count("*").alias("total_rows"),
    F.count_distinct("hull_number").alias("unique_ships"),
    F.sum(F.col("mission_capable_flag").cast("int")).alias("mc_count"),
    F.count(F.when(F.col("maintenance_date").isNull(), 1)).alias("null_dates"),
)
quality_report.show()

# Cleaning: standardize unit codes and dates
cleaned = (
    df
    .withColumn("unit_code", F.trim(F.upper(F.col("unit_code"))))
    .withColumn(
        "maintenance_date",
        F.to_date(F.col("maintenance_date"), "yyyy-MM-dd")
    )
    .dropDuplicates(["hull_number", "maintenance_date", "work_order_id"])
)

# Write cleaned data to silver tier
(
    cleaned.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable("don_catalog.readiness.ship_maintenance_silver_v2")
)
```

MLflow: Experiment Tracking and Model Registry

MLflow is the experiment tracking and model registry standard on Databricks. Every model you train should be logged to an MLflow experiment — this is not optional on government contracts where model reproducibility and audit trails are required.

```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

# X_train, X_test, y_train, y_test are assumed to be prepared upstream

# Set experiment path (your email or team path in Databricks)
mlflow.set_experiment("/Users/your.email@agency.gov/maintenance-anomaly-v2")

with mlflow.start_run(run_name="rf_v3_balanced_classes") as run:
    # Log parameters
    params = {
        "n_estimators": 300,
        "max_depth": 10,
        "class_weight": "balanced",
        "random_state": 42,
    }
    mlflow.log_params(params)

    # Log data lineage (critical for audit)
    mlflow.log_param("training_table", "don_catalog.readiness.ship_maintenance_silver")
    mlflow.log_param("training_cutoff", "2025-01-01")
    mlflow.log_param("featurization_version", "v3.1")

    # Train
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Log metrics
    mlflow.log_metric("f1_macro", f1_score(y_test, y_pred, average="macro"))
    mlflow.log_metric("f1_weighted", f1_score(y_test, y_pred, average="weighted"))
    mlflow.log_metric("test_accuracy", clf.score(X_test, y_test))

    # Log the model with signature
    signature = infer_signature(X_train, y_pred)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="maintenance-anomaly-detector",
    )

    # Log classification report as artifact
    report = classification_report(y_test, y_pred)
    mlflow.log_text(report, "classification_report.txt")

# Use the run handle captured by the context manager;
# mlflow.active_run() returns None once the `with` block exits
print(f"Run ID: {run.info.run_id}")
```

Unity Catalog: Governance in Practice

With Unity Catalog now the governance layer on every new Databricks account, understanding its three-level namespace is essential.

```sql
-- Unity Catalog three-level namespace: catalog.schema.table
-- catalog = data domain or organization
-- schema  = data tier or subject area
-- table   = specific dataset

-- List catalogs you have access to
SHOW CATALOGS;

-- List schemas in a catalog
SHOW SCHEMAS IN don_catalog;

-- Examine table lineage and properties
DESCRIBE EXTENDED don_catalog.readiness.ship_maintenance_gold;

-- Check your permissions on a table
SHOW GRANTS ON TABLE don_catalog.readiness.ship_maintenance_gold;

-- Query with full namespace (always use three-part names)
SELECT hull_number, COUNT(*) AS maintenance_events
FROM don_catalog.readiness.ship_maintenance_gold
WHERE fiscal_year = 2025
  AND mission_capable_flag = TRUE
GROUP BY hull_number
ORDER BY maintenance_events DESC
LIMIT 20;
```

Always use the three-part name (catalog.schema.table) when referencing tables in production code. Relative references work in notebooks but break in scheduled jobs and pipelines where the default catalog may differ.
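This rule is cheap to enforce in code review or CI. A hypothetical helper (the regex is our simplification, covering unquoted alphanumeric/underscore identifiers, not Unity Catalog's full backtick-quoting rules):

```python
import re

# Three dot-separated identifiers: catalog.schema.table
# (simplified: unquoted names of letters, digits, and underscores)
_THREE_PART = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){2}$")

def is_three_part_name(table_name: str) -> bool:
    """True only for fully qualified catalog.schema.table references."""
    return bool(_THREE_PART.match(table_name))

print(is_three_part_name("don_catalog.readiness.ship_maintenance_gold"))  # True
print(is_three_part_name("ship_maintenance_gold"))                        # False
```

A check like this in a pre-merge lint step catches the relative references that work interactively but fail in scheduled jobs.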

Certifications Worth Getting

| Certification | Focus | Who Should Get It |
|---|---|---|
| Data Engineer Associate | Delta Lake, Spark, ETL pipelines | Anyone building data pipelines |
| Machine Learning Associate | MLflow, feature engineering, model deployment | Anyone training and deploying models |
| Data Analyst Associate | SQL, visualization in Databricks | Analysts transitioning from SQL-based tools |