Chapter 08

Deep Learning and Neural Networks

Deep learning is the right tool for specific problems — large labeled datasets, unstructured inputs (images, text, audio), sequential data with temporal dependencies. For everything else, simpler models win. This chapter covers where deep learning is warranted in federal AI, GPU costs and constraints, transfer learning in air-gapped environments, and DoD Directive 3000.09 requirements.

~50 min read · Platforms: Databricks, Advana, Navy Jupiter, Palantir AIP/Foundry · Code on GitHub

When Deep Learning is Warranted

Before writing any neural network code, establish a strong XGBoost or gradient boosting baseline. If the neural network doesn't beat the baseline by 3+ percentage points on the primary metric, ship the XGBoost. Explainability is easier. Auditability is simpler. The contract sponsor is happier.
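
The baseline gate can be sketched as follows. The dataset is synthetic and the 3-point margin is the rule of thumb from above; sklearn's GradientBoostingClassifier stands in for XGBoost here, and the names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in data; in practice this is your contract feature table
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

def ship_neural_net(nn_auc: float, baseline_auc: float, margin: float = 0.03) -> bool:
    """Gate: ship the neural network only if it clears the baseline by the margin."""
    return nn_auc >= baseline_auc + margin

print(f"Baseline AUC: {baseline_auc:.3f}")
```

If the gate returns False, the XGBoost baseline ships and the neural-network branch gets archived with its MLflow run for the record.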

Deep learning is warranted when:

  • The dataset has 100,000+ labeled examples
  • The input is unstructured: imagery, text documents, audio recordings
  • The problem has temporal sequence structure (LSTM, Transformer)
  • High-cardinality categorical variables benefit from learned embeddings

PyTorch on Databricks

PyTorch is the dominant framework for government AI work. TensorFlow is still present in legacy programs but new development defaults to PyTorch. On Databricks ML Runtime, PyTorch is pre-installed alongside CUDA for GPU clusters.

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import mlflow, mlflow.pytorch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


class ContractRiskNet(nn.Module):
    """
    Feedforward network with embeddings for tabular government data.
    Uses embeddings for high-cardinality categoricals (hull_class, naics_code)
    and standard linear layers for numeric features.
    """
    def __init__(self, n_numeric: int, naics_vocab: int, hull_vocab: int,
                 embed_dim: int = 8, hidden_dim: int = 128):
        super().__init__()
        self.naics_embed = nn.Embedding(naics_vocab, embed_dim)
        self.hull_embed  = nn.Embedding(hull_vocab, embed_dim)

        input_dim = n_numeric + 2 * embed_dim
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.BatchNorm1d(hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, numeric, naics_idx, hull_idx):
        naics_emb = self.naics_embed(naics_idx)
        hull_emb  = self.hull_embed(hull_idx)
        x = torch.cat([numeric, naics_emb, hull_emb], dim=1)
        return self.network(x).squeeze(1)


model = ContractRiskNet(
    n_numeric=6, naics_vocab=500, hull_vocab=30
).to(device)

optimizer  = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion  = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]).to(device))
# steps_per_epoch must match the actual loader length; OneCycleLR raises
# once it is stepped past total_steps
scheduler  = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, steps_per_epoch=len(train_loader), epochs=20
)

mlflow.set_experiment("/models/contract_risk_neural")

with mlflow.start_run(run_name="feedforward_baseline"):
    mlflow.log_params({
        "architecture": "feedforward_with_embeddings",
        "hidden_dim": 128, "embed_dim": 8, "dropout": 0.3,
        "optimizer": "AdamW", "lr": 1e-3, "epochs": 20,
    })

    # train_loader is assumed to be a DataLoader yielding
    # (numeric, naics_idx, hull_idx, labels) batches
    for epoch in range(20):
        model.train()
        epoch_loss = 0.0
        for batch in train_loader:
            numeric, naics_idx, hull_idx, labels = [b.to(device) for b in batch]
            optimizer.zero_grad()
            logits = model(numeric, naics_idx, hull_idx)
            loss = criterion(logits, labels.float())
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            epoch_loss += loss.item()

        # Log the mean batch loss, not the raw sum, so runs with different
        # loader lengths stay comparable
        mlflow.log_metric("train_loss", epoch_loss / len(train_loader), step=epoch)

    mlflow.pytorch.log_model(model, artifact_path="model")

Transfer Learning

Government AI projects have a structural disadvantage: labeled data is expensive and slow to acquire. Transfer learning addresses this directly — a model pre-trained on large public datasets has already learned general representations that transfer to your specific task.

mermaid
graph TD
    A[Pre-trained Model\nImageNet / CommonCrawl] --> B{Transfer Strategy}
    B -->|Small dataset\nunder 5K examples| C[Freeze backbone\nTrain head only]
    B -->|Medium dataset\n5K to 50K examples| D[Freeze early layers\nFine-tune deep + head]
    B -->|Large dataset\n50K+ examples| E[Fine-tune all layers\nLow LR on backbone]
    C --> F[Fast training\nLow overfitting risk]
    D --> G[Balanced speed\nand performance]
    E --> H[Maximum performance\nRequires careful regularization]

Transfer learning strategy by dataset size. Government datasets typically fall in the small-to-medium range — freeze more of the backbone to prevent overfitting on limited labels.
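
The small-dataset strategy above can be sketched in a few lines. The toy two-layer backbone stands in for a real pre-trained model (a torchvision ResNet, a HuggingFace BERT); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained backbone (in practice: ResNet, BERT, etc.)
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
head = nn.Linear(32, 2)  # task-specific classification head

# Small-dataset strategy: freeze every backbone parameter, train only the head
for param in backbone.parameters():
    param.requires_grad = False

model = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters")

# The optimizer sees only the head's parameters
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

The medium-dataset variant is the same pattern with the freeze loop applied only to the early layers of the backbone.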

Transfer Learning in Air-Gapped Environments

At IL5 and above, your training environment has no internet access. pip install transformers and torchvision.models.resnet50(weights="IMAGENET1K_V2") will both fail. Pre-trained model weights live on HuggingFace Hub and PyTorch Hub — both on the public internet, both inaccessible.

The process for getting pre-trained weights into an air-gapped environment:

  1. Download on an unclassified system. Use model.save_pretrained("/path/to/output") for HuggingFace models. This produces config.json, tokenizer.json, and model.safetensors. Compute its SHA-256 hash for chain-of-custody.
  2. Package for transfer. Create a single archive. Complete a DD-1149 (Requisition and Invoice/Shipping Document) or equivalent transfer form.
  3. Virus scan and media transfer. Archive goes on approved removable media (hardware write-protected USB). The receiving side's ISSO runs the virus scan before mounting.
  4. Install in the secure environment. Load from local path: AutoModel.from_pretrained("/path/to/local/model") for HuggingFace; set TORCH_HOME environment variable for PyTorch Hub.

This process takes 2–6 weeks at organizations with a mature transfer process, 3 months at organizations doing it for the first time. Model selection must happen before the transfer request goes in. You cannot pivot to a larger BERT model after three weeks of waiting for the weights to clear.
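
The download-side hashing from step 1 and the secure-side load from step 4 can be sketched as follows. The paths are illustrative, and the AutoModel call is only valid inside the secure environment once the weights have cleared transfer, so it appears here as a comment.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Chain-of-custody hash: computed on the unclassified side before transfer,
    and recomputed by the receiving side after the virus scan."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Unclassified side: record the hash alongside the DD-1149 paperwork, e.g.
# print(sha256_of("/staging/bert-base/model.safetensors"))

# Secure side, after ISSO scan and mount — load strictly from the local path:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("/data/models/bert-base", local_files_only=True)
```

Setting local_files_only=True makes the load fail loudly if the library ever attempts a network fetch, which is the behavior you want in an air-gapped environment.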

Multi-GPU Training on Databricks

python
import os
import torch
from pyspark.ml.torch.distributor import TorchDistributor
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_fn(n_gpus: int):
    """Training function executed on each GPU worker."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # build_facility_classifier is project-specific (defined elsewhere)
    model = build_facility_classifier(freeze_backbone=False)
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # SatelliteImageDataset, train_dir, train_transforms are defined elsewhere
    train_ds = SatelliteImageDataset(train_dir, transform=train_transforms)
    sampler  = DistributedSampler(train_ds)
    loader   = DataLoader(train_ds, batch_size=32, sampler=sampler, num_workers=4)

    # ... training loop ...
    dist.destroy_process_group()

n_gpus = torch.cuda.device_count()
distributor = TorchDistributor(
    num_processes=n_gpus,
    local_mode=False,
    use_gpu=True,
)
distributor.run(train_fn, n_gpus)
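
One consequence of the distributed sampler above: the effective global batch size is the per-GPU batch size times the number of processes, and the learning rate is commonly scaled to match. A sketch of the linear-scaling heuristic (a rule of thumb, not a guarantee; validate on your own task):

```python
def scaled_lr(base_lr: float, per_gpu_batch: int, n_gpus: int,
              ref_batch: int = 32) -> float:
    """Linear LR scaling heuristic: scale the base LR by the ratio of the
    global batch size to the reference batch the base LR was tuned for."""
    global_batch = per_gpu_batch * n_gpus
    return base_lr * global_batch / ref_batch

# batch_size=32 per GPU across 4 GPUs gives a global batch of 128,
# so a base LR of 1e-3 scales to 4e-3
print(scaled_lr(1e-3, per_gpu_batch=32, n_gpus=4))
```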

GPU Compute Costs on Federal Platforms

GPU time is expensive everywhere. On federal platforms it is expensive and administratively constrained.

| Phase | Right approach | Wrong approach |
|---|---|---|
| Feature development | CPU cluster or local machine | GPU cluster running idle |
| Architecture search | 1x A10g, early stopping | 8x A10g, full training runs |
| Final training runs | Multi-GPU with distributed sampler | Single GPU for large dataset |
| Model serving | Serverless (pay per inference) | Dedicated cluster running 24/7 |
| Exploratory analysis | Standard CPU cluster | GPU cluster for tabular EDA |

The biggest GPU budget mistake on government contracts: leaving clusters running. Configure auto-termination at 60 minutes from day one. Over a project, idle GPU time adds up to tens of thousands of dollars — real budget that could fund the next phase.
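
The idle-cost arithmetic is worth making explicit. The hourly rate below is an illustrative assumption, not a quoted price for any platform:

```python
def idle_gpu_cost(hourly_rate: float, idle_hours_per_day: float,
                  days: int, n_gpus: int = 1) -> float:
    """Cost of GPU-hours spent idle; hourly_rate is an assumed per-GPU rate."""
    return hourly_rate * idle_hours_per_day * days * n_gpus

# Example: a 4-GPU cluster left idle 8 hours/day over a 180-day period
# of performance, at an assumed $3.00/GPU-hour
print(f"${idle_gpu_cost(3.00, 8, 180, n_gpus=4):,.0f}")  # $17,280
```

Even at modest assumed rates the idle total lands in the tens of thousands, which is why auto-termination belongs in the cluster policy, not in a reminder email.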

DoD Directive 3000.09

DoD Directive 3000.09 establishes policy for autonomous weapons systems, but its engineering implications reach into any AI system used in operational military contexts:

  • Human judgment must be exercised in use-of-force decisions. Your model is decision support, not the decision maker. The interface must visually distinguish model output from human decision.
  • The system must minimize unintended engagements. Your false positive rate on a safety-critical class is a legal and policy constraint, not just a model quality metric. It goes in the requirements document. It gets tested. It gets certified.
  • Systems must be designed for failure safety. When confidence is low, the system must have a defined behavior — escalation to human review, flag for analyst queue — rather than defaulting silently to the highest-probability class.

python
import numpy as np
import pandas as pd

class OperationalInferencePipeline:
    """
    Wrapper enforcing DoD 3000.09 requirements:
    - Confidence threshold gate (below threshold → human review)
    - Full probability distribution logged with every prediction
    - Audit trail for every inference
    """
    def __init__(self, model, confidence_threshold: float = 0.80):
        self.model               = model
        self.confidence_threshold = confidence_threshold

    def predict(self, inputs, record_id: str) -> dict:
        probs      = self.model.predict_proba([inputs])[0]
        pred_class = int(np.argmax(probs))
        confidence = float(probs.max())

        result = {
            "record_id":            record_id,
            "predicted_class":      pred_class,
            "confidence":           confidence,
            "probability_dist":     probs.tolist(),   # ALWAYS log full distribution
            "requires_human_review": confidence < self.confidence_threshold,
            "timestamp":            pd.Timestamp.now().isoformat(),
        }

        # Log to audit trail — every inference, no exceptions
        self._log_to_audit_table(result)
        return result

    def _log_to_audit_table(self, result: dict) -> None:
        # On Foundry: write to an Ontology Action
        # On Databricks: append to a Delta audit table
        pass

Where This Goes Wrong

Failure Mode 1: Using Deep Learning When It Isn't Warranted

A senior analyst asks for "an AI model." You build a 6-layer neural network. It performs worse than logistic regression on the same features. The program office loses confidence in the entire AI effort. Fix: always establish a strong baseline first. If the neural network doesn't beat XGBoost by 3+ points on the primary metric, ship the XGBoost.

Failure Mode 2: Training on the Wrong Distribution

The model trains and validates on historical data, then deploys on current data from a different operational context — different geography, different season, different equipment generation. Fix: treat distribution shift as a first-class engineering problem. Log prediction confidence distributions in production. Alert when mean confidence drops 5+ percentage points below training baseline.
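
The confidence-drift alert described above can be sketched with numpy. The 5-point threshold matches the text; the training-time mean is whatever you logged when the model was validated, and the values here are illustrative.

```python
import numpy as np

def confidence_drift_alert(prod_confidences, training_mean: float,
                           drop_threshold: float = 0.05) -> bool:
    """Alert when mean production confidence drops more than drop_threshold
    below the mean confidence recorded on the training/validation set."""
    prod_mean = float(np.mean(prod_confidences))
    return (training_mean - prod_mean) > drop_threshold

# Training-time mean confidence was 0.82; this production batch runs lower
print(confidence_drift_alert([0.71, 0.69, 0.74, 0.70], training_mean=0.82))  # True
```

In practice the production confidences come from the same audit table the inference pipeline writes to, so the alert is a scheduled query rather than a separate logging path.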

Failure Mode 3: Ignoring the Confidence Score

Deploying a model that outputs a class label without exposing the confidence distribution to the end user. Users treat model outputs as facts rather than probabilistic predictions. Fix: always expose confidence scores. Define a threshold below which predictions route to human review. Document that threshold in the system's AI ethics review.

Platform Comparison

| Capability | Advana (Databricks) | Navy Jupiter | Palantir Foundry/AIP |
|---|---|---|---|
| GPU training | A10g via GovCloud IL5 | A10g via GovCloud IL5 | External training; deploy via AIP |
| Multi-GPU training | TorchDistributor | TorchDistributor | Code Repository (custom) |
| Experiment tracking | MLflow 3.0 | MLflow 3.0 | Foundry Model Registry |
| Inference serving | Mosaic AI Model Serving | Mosaic AI Model Serving | AIP Logic + Workshop |
| Human-in-the-loop routing | Custom (via Workflows) | Custom (via Workflows) | Native (AIP + Workshop queues) |
| Max practical model size (IL5) | 7B (single A10g, 4-bit) | 7B (single A10g, 4-bit) | IL5 Azure GPU (varies) |

Exercises

This chapter includes 4 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.

View Exercises on GitHub →