Overview
Databricks is the production ML development environment in the DoD data ecosystem. On both Advana and Navy Jupiter, Databricks is the tool you use when your work requires scale — distributed compute, version-controlled ML pipelines, experiment tracking, and model deployment.
The platform achieved FedRAMP High authorization on AWS GovCloud on February 27, 2025. It holds DoD IL5 authorization on both AWS and Azure Government. As of December 2025, all new Databricks accounts exclusively use Unity Catalog as the governance layer.
The core technical components — PySpark for distributed data processing, MLflow for experiment tracking, Delta Lake for storage, Unity Catalog for governance — are all open source or open standard. That matters in a government context where vendor lock-in and data portability are real concerns.
Architecture on Advana and Jupiter
*Figure: Databricks architecture in the DoD ecosystem. Notebooks run on managed clusters; all data passes through Unity Catalog governance.*
Python Environment and Package Management
The first thing to understand when starting on a government Databricks deployment is the Databricks Runtime (DBR): the DBR version determines your Python version and the set of pre-installed packages.
DBR 14.x ships with Python 3.10 and includes pandas, numpy, scikit-learn, matplotlib, seaborn, mlflow, and PySpark pre-installed. DBR 14.x ML adds torch, tensorflow, xgboost, lightgbm, and huggingface_hub.
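To confirm what a given cluster actually provides, you can inspect the interpreter and package versions directly from a notebook cell. A minimal stdlib-only sketch (the package list here is illustrative, not exhaustive):

```python
import sys
from importlib import metadata

# Collect versions of a few commonly used packages; None means not installed.
versions = {}
for pkg in ("pandas", "numpy", "scikit-learn", "mlflow"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = None

# On DBR 14.x this should report Python 3.10
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(versions)
```

Running this once per new cluster configuration and saving the output alongside your project notes makes environment drift easy to spot later.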
On government Databricks deployments (IL4/IL5), outbound PyPI requests may be blocked entirely. You cannot run `%pip install package-name` without a network path to PyPI. Check with your platform administrator before starting a project that requires packages not in the default DBR image.
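Before assuming `%pip install` will work, you can probe for a network path to PyPI from the cluster. This is a diagnostic sketch only: it tests raw TCP reachability, not proxy or allow-list policy, and the host/port defaults are assumptions.

```python
import socket

def pypi_reachable(host: str = "pypi.org", port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection toward PyPI succeeds.

    On IL4/IL5 deployments this typically fails, which means %pip
    installs from PyPI will not work on that cluster.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(pypi_reachable())
```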
Your options when you need a package not in the DBR image:
- Pre-approved cluster-level libraries: Submit a request to your platform team. They install the package at the cluster level. Typical turnaround is one to two weeks.
- Wheel files in Unity Catalog volumes: Obtain the `.whl` file through your organization's software approval process, upload it to a Unity Catalog volume, and install via `%pip install /Volumes/catalog/schema/volume/package.whl`.
- DBFS upload: On older configurations, upload the wheel to DBFS and reference the path in `%pip install`.
PySpark Patterns for Federal Data
At the scale of DoD logistics and financial data, distributed processing is not optional. Here are the core patterns you will use on most contracts.
```python
# Read from a Unity Catalog Delta table
df = spark.table("don_catalog.readiness.ship_maintenance_silver")

# Basic data quality check before analysis
from pyspark.sql import functions as F

quality_report = df.agg(
    F.count("*").alias("total_rows"),
    F.count_distinct("hull_number").alias("unique_ships"),
    F.sum(F.col("mission_capable_flag").cast("int")).alias("mc_count"),
    F.count(F.when(F.col("maintenance_date").isNull(), 1)).alias("null_dates"),
)
quality_report.show()

# Cleaning: standardize unit codes and dates
cleaned = (
    df
    .withColumn("unit_code", F.trim(F.upper(F.col("unit_code"))))
    .withColumn(
        "maintenance_date",
        F.to_date(F.col("maintenance_date"), "yyyy-MM-dd"),
    )
    .dropDuplicates(["hull_number", "maintenance_date", "work_order_id"])
)

# Write cleaned data to the silver tier
(
    cleaned.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable("don_catalog.readiness.ship_maintenance_silver_v2")
)
```
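Standardization and dedup rules like these are easy to get subtly wrong. One option is to mirror them in plain Python so they can be unit-tested off-cluster before you trust them in the pipeline. This is a hypothetical local mirror for testing, not the PySpark code itself:

```python
def standardize_unit_code(raw: str) -> str:
    # Mirrors F.trim(F.upper(...)) from the PySpark pipeline
    return raw.strip().upper()

def dedupe(rows, keys=("hull_number", "maintenance_date", "work_order_id")):
    # Mirrors dropDuplicates: keep the first row seen for each key tuple
    seen, out = set(), []
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"hull_number": "LHA-6", "maintenance_date": "2025-01-03", "work_order_id": 1},
    {"hull_number": "LHA-6", "maintenance_date": "2025-01-03", "work_order_id": 1},
]
print(standardize_unit_code("  lha-6 "))  # LHA-6
print(len(dedupe(rows)))  # 1
```

Keeping the rules in one tested place also makes it easier to explain them during a data-quality review.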
MLflow: Experiment Tracking and Model Registry
MLflow is the experiment tracking and model registry standard on Databricks. Every model you train should be logged to an MLflow experiment — this is not optional on government contracts where model reproducibility and audit trails are required.
```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

# Set the experiment path (your email or team path in Databricks)
mlflow.set_experiment("/Users/your.email@agency.gov/maintenance-anomaly-v2")

with mlflow.start_run(run_name="rf_v3_balanced_classes"):
    # Log parameters
    params = {
        "n_estimators": 300,
        "max_depth": 10,
        "class_weight": "balanced",
        "random_state": 42,
    }
    mlflow.log_params(params)

    # Log data lineage (critical for audit)
    mlflow.log_param("training_table", "don_catalog.readiness.ship_maintenance_silver")
    mlflow.log_param("training_cutoff", "2025-01-01")
    mlflow.log_param("featurization_version", "v3.1")

    # Train
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Log metrics
    mlflow.log_metric("f1_macro", f1_score(y_test, y_pred, average="macro"))
    mlflow.log_metric("f1_weighted", f1_score(y_test, y_pred, average="weighted"))
    mlflow.log_metric("test_accuracy", clf.score(X_test, y_test))

    # Log the model with a signature
    signature = infer_signature(X_train, y_pred)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="maintenance-anomaly-detector",
    )

    # Log the classification report as an artifact
    report = classification_report(y_test, y_pred)
    mlflow.log_text(report, "classification_report.txt")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
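A pattern that complements the lineage parameters above is logging a deterministic fingerprint of them, so two runs trained on identical inputs can be matched during an audit. A stdlib-only sketch of the idea, with a hypothetical helper (not an MLflow API):

```python
import hashlib
import json

def lineage_fingerprint(lineage: dict) -> str:
    """Deterministic short hash of lineage parameters.

    Log the result as a run param, e.g.
    mlflow.log_param("lineage_hash", lineage_fingerprint(lineage)).
    """
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(lineage, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

lineage = {
    "training_table": "don_catalog.readiness.ship_maintenance_silver",
    "training_cutoff": "2025-01-01",
    "featurization_version": "v3.1",
}
print(lineage_fingerprint(lineage))
```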
Unity Catalog: Governance in Practice
All new Databricks accounts now use Unity Catalog exclusively as the governance layer, so understanding its three-level namespace is essential.
```sql
-- Unity Catalog three-level namespace: catalog.schema.table
--   catalog = data domain or organization
--   schema  = data tier or subject area
--   table   = specific dataset

-- List catalogs you have access to
SHOW CATALOGS;

-- List schemas in a catalog
SHOW SCHEMAS IN don_catalog;

-- Examine table lineage and properties
DESCRIBE EXTENDED don_catalog.readiness.ship_maintenance_gold;

-- Check your permissions on a table
SHOW GRANTS ON TABLE don_catalog.readiness.ship_maintenance_gold;

-- Query with the full namespace (always use three-part names)
SELECT hull_number, COUNT(*) AS maintenance_events
FROM don_catalog.readiness.ship_maintenance_gold
WHERE fiscal_year = 2025
  AND mission_capable_flag = TRUE
GROUP BY hull_number
ORDER BY maintenance_events DESC
LIMIT 20;
```
Always use the three-part name (`catalog.schema.table`) when referencing tables in production code. Relative references work in notebooks but break in scheduled jobs and pipelines where the default catalog may differ.
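When table names are assembled in code, such as in parameterized jobs, a small validator can catch malformed names before they reach Spark. A hypothetical helper (not a Databricks API):

```python
def fq_name(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified catalog.schema.table name,
    rejecting parts that are not simple identifiers."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part.isidentifier():
            raise ValueError(f"invalid identifier: {part!r}")
    return ".".join(parts)

print(fq_name("don_catalog", "readiness", "ship_maintenance_gold"))
```

The helper is deliberately strict; if your tables use characters beyond simple identifiers, loosen the check to match your naming standard.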
Certifications Worth Getting
| Certification | Focus | Who Should Get It |
|---|---|---|
| Data Engineer Associate | Delta Lake, Spark, ETL pipelines | Anyone building data pipelines |
| Machine Learning Associate | MLflow, feature engineering, model deployment | Anyone training and deploying models |
| Data Analyst Associate | SQL, visualization in Databricks | Analysts transitioning from SQL-based tools |