Chapter 02

Python and R Foundations for Federal Platforms

The environment is the constraint. Federal platforms enforce Python versions, restrict package installs, and operate behind authentication walls that break standard workflows. This chapter covers what actually works — and what silently fails — when your environment is inside Advana, Databricks, Foundry, or Jupiter.


The Environment Problem

On your laptop, pip install pandas and you're done. On a federal platform, that same command might fail with a permissions error, pull from an internal mirror that is six months behind PyPI, or succeed silently while installing a version incompatible with the platform's Spark runtime. The tools are the same. The environment is not.

This chapter covers the practical realities: CAC authentication in code, the data structures each platform optimizes for, the SDK patterns that work, and the failure modes that new practitioners hit within their first two weeks.

All code in this chapter is written for Python 3.9+. Databricks Runtime 13.x and later ship Python 3.10. Foundry Code Repositories default to Python 3.9. Advana and Jupiter use Databricks under the hood — check your cluster's DBR version before assuming library availability.

Platform-Specific Environments

Databricks Runtime

Databricks clusters run one of several runtime variants. The plain DBR (e.g., DBR 14.3 LTS) includes Python, PySpark, pandas, and a broad set of data engineering libraries. The ML Runtime (DBR 14.3 LTS ML) adds scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and MLflow pre-installed. The GPU ML Runtime additionally configures CUDA.

The practical implication: on a plain DBR cluster, import xgboost fails. On an ML cluster, it works. Choose your cluster type before you begin a project, not after you hit the first import error.
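One way to fail fast is a library probe in the first notebook cell — a minimal sketch; adjust the required list to your project's dependencies:

```python
import importlib.util

def runtime_check(required: list) -> dict:
    """Report which importable modules this cluster's runtime provides."""
    return {name: importlib.util.find_spec(name) is not None for name in required}

# On a plain DBR cluster, expect xgboost to be missing; on ML Runtime, present.
status = runtime_check(["pandas", "pyspark", "xgboost"])
for name, present in status.items():
    print(f"  {name:<10} {'OK' if present else 'MISSING — check cluster type'}")
```

Running this before any real imports turns a mid-notebook `ModuleNotFoundError` into an immediate, actionable cluster-type decision.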

CAC Authentication in Code

Most government laptops require a Common Access Card (CAC) for authentication. When your Python script needs to call an API, access a data catalog, or push to a Git repository, standard OAuth and password flows don't apply.

mermaid
graph TD
    A[Python Script] --> B{Platform}
    B -->|Databricks| C[Personal Access Token\nor Service Principal]
    B -->|Foundry| D[foundry-sdk token\nfrom browser session]
    B -->|Jupiter / Advana| E[Databricks PAT\ninside DoD SSO boundary]
    B -->|External APIs| F[CAC middleware\nor API key from\ndata.gov / SAM.gov]
    C --> G[Works in notebooks\nand local dev]
    D --> G
    E --> G
    F --> G

Authentication paths for common federal platform access patterns.

python
# Databricks authentication — use environment variables, never hardcoded tokens
import os
from databricks.sdk import WorkspaceClient

# PAT injected at cluster launch via Databricks Secrets or environment config
# NEVER: client = WorkspaceClient(token="dapi1234567890abcdef")
client = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)

me = client.current_user.me()
print(f"Authenticated as: {me.user_name}")
python
# Palantir Foundry: token is injected automatically inside the platform
# For local development:
import os
from foundry import FoundryClient

client = FoundryClient(
    auth_token=os.environ.get("FOUNDRY_TOKEN"),
    hostname=os.environ.get("FOUNDRY_HOSTNAME", "your-agency.palantirfoundry.com"),
)

Core Data Structures by Platform

pandas DataFrames

pandas is the standard for in-memory, single-node data analysis. The constraint: pandas operates in memory on the driver node. For datasets above roughly 10 million rows, you will exhaust memory on a standard cluster. Switch to PySpark — do not just increase the driver size.

python
import pandas as pd

# USASpending procurement data — always load NAICS as string, not int
df = pd.read_csv(
    "usaspending_fy2024_awards.csv",
    dtype={
        "contract_award_unique_key": str,
        "recipient_uei": str,
        "naics_code": str,           # "031" != 31 — always string
        "federal_action_obligation": float,
    },
    parse_dates=["action_date", "period_of_performance_start_date"],
    low_memory=False,
)

# Federal FY: Oct 1 to Sep 30
# FY2024 = Oct 1, 2023 through Sep 30, 2024
df["fiscal_year"] = (
    df["action_date"].dt.year
    + (df["action_date"].dt.month >= 10).astype("int64")  # vectorized; avoids per-row apply
)

print(f"Rows: {len(df):,}  |  Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
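The fiscal-year rule is worth isolating into a plain function you can unit-test — the mapping below follows the Oct 1–Sep 30 convention stated above:

```python
from datetime import date

def federal_fiscal_year(d: date) -> int:
    """US federal fiscal year runs Oct 1 – Sep 30, named for the year it ends in."""
    return d.year + 1 if d.month >= 10 else d.year

# FY2024 boundary checks
assert federal_fiscal_year(date(2023, 10, 1)) == 2024   # first day of FY2024
assert federal_fiscal_year(date(2024, 9, 30)) == 2024   # last day of FY2024
assert federal_fiscal_year(date(2024, 10, 1)) == 2025   # first day of FY2025
```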

PySpark on Databricks

PySpark distributes computation across your cluster. On Databricks (Advana, Jupiter, or standalone workspace), a SparkSession is already active in any notebook.

python
from pyspark.sql import functions as F

# In a Databricks notebook, spark is already available
df = spark.table("procurement_catalog.silver.usaspending_fy2024")

# PySpark operations are lazy — nothing runs until an action is called
df_filtered = (
    df
    .filter(F.col("federal_action_obligation") > 1_000_000)
    .filter(F.col("naics_sector_code").isin(["54", "51", "33"]))
    .withColumn("obligation_millions",
                F.round(F.col("federal_action_obligation") / 1_000_000, 2))
    .select(
        "contract_award_unique_key", "recipient_name",
        "naics_code", "obligation_millions", "action_date", "fiscal_year"
    )
)

print(f"Filtered count: {df_filtered.count():,}")

NumPy for Numerical Computing

Use float64 (NumPy's default) rather than float32 when working with obligation amounts. Float32 has ~7 decimal digits of precision; rounding errors compound across large aggregations of trillion-dollar budgets.

python
import numpy as np

obligations = df["federal_action_obligation"].to_numpy(dtype=np.float64)
obligations = obligations[~np.isnan(obligations)]  # drop missing values so sum/median stay finite

print(f"Total obligations: ${obligations.sum():,.2f}")
print(f"Median: ${np.median(obligations):,.2f}")
print(f"P90: ${np.percentile(obligations, 90):,.2f}")
print(f"P99: ${np.percentile(obligations, 99):,.2f}")
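A quick demonstration of the precision claim — near $1 trillion, adjacent representable float32 values are 65,536 apart, so cent-level amounts vanish entirely:

```python
import numpy as np

# Gap between adjacent representable values at 1e12
gap32 = np.spacing(np.float32(1e12))   # 65536.0
gap64 = np.spacing(np.float64(1e12))   # ~0.000122

# Adding one cent to a trillion dollars: lost in float32, kept in float64
lost = np.float32(1e12) + np.float32(0.01) == np.float32(1e12)
kept = np.float64(1e12) + np.float64(0.01) > np.float64(1e12)
print(f"float32 gap: {gap32:,.0f}   cent lost in float32: {lost}   float64 keeps it: {kept}")
```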

Platform SDKs

Databricks SDK

python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs, trigger runs, inspect clusters
jobs = list(w.jobs.list())
print(f"Found {len(jobs)} jobs in workspace")

run = w.jobs.run_now(job_id=12345)
print(f"Job triggered, run ID: {run.run_id}")

for cluster in w.clusters.list():
    print(f"  {cluster.cluster_name}: {cluster.state.value}")

Palantir Foundry Transforms

python
from transforms.api import transform_pandas, Input, Output
import pandas as pd

@transform_pandas(
    Output("/analytics/silver/procurement_cleaned"),
    raw=Input("/data/raw/usaspending_fy2024"),
)
def clean_procurement(raw: pd.DataFrame) -> pd.DataFrame:
    """
    Foundry transform: input and output datasets are automatically
    version-controlled. transform_pandas hands the function pandas
    DataFrames (transform_df would pass Spark DataFrames instead); it
    loads the full input into memory, so reserve it for small-to-medium
    datasets and use transform_df for large ones.
    """
    df = raw.copy()
    df["naics_code"] = (
        df["naics_code"]
        .astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
        .str.zfill(6)
    )
    df["federal_action_obligation"] = pd.to_numeric(
        df["federal_action_obligation"], errors="coerce"
    )
    return df
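The NAICS cleanup is easier to trust when the rule is also captured as a plain function with unit checks — a sketch mirroring the transform's logic (the helper name is mine):

```python
import re

def normalize_naics(code) -> str:
    """Strip non-digits and left-pad to 6, matching the transform above.

    NAICS codes must stay strings: leading zeros are significant, and
    values read from Excel or CSV often arrive as ints or floats.
    """
    digits = re.sub(r"[^0-9]", "", str(code).strip())
    return digits.zfill(6) if digits else ""

print(normalize_naics(" 5415 "))   # 005415
print(normalize_naics(541511))     # 541511
print(repr(normalize_naics("N/A")))  # ''
```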

Notebook Environments

Databricks Notebooks

Databricks notebooks support Python, SQL, R, and Scala in the same notebook via magic commands (%python, %sql, %r). Notebooks are stored in the workspace, not on the local filesystem — they are accessible from any cluster and persist after cluster termination.

Databricks notebooks auto-save but are not version-controlled by default. Set up Git integration (Databricks Repos) on your first day. Three hours of uncommitted work is one accidental overwrite or deleted cell away from being unrecoverable.

Foundry Code Workspaces

Foundry Code Workspaces are JupyterLab environments running inside Foundry's managed compute. They have direct access to Foundry datasets without needing to configure SDK authentication. Workspaces are designed for interactive development; production pipelines run in Code Repositories (Git-backed, scheduled, tested).

Package Constraints on Restricted Networks

On classified and restricted networks, pip install hitting the public PyPI index will fail. Packages must come from an internal mirror, a pre-loaded cluster library, or an approved repository.

On Databricks GovCloud, most common libraries can be installed from Databricks' managed mirror. For packages outside the mirror, submit a request to your Databricks administrator; review typically takes 3–5 business days on unclassified networks and 2–4 weeks in classified environments.

python
# pkg_resources is deprecated — importlib.metadata is the stdlib replacement
from importlib.metadata import version, PackageNotFoundError

def check_package_availability(packages: list) -> dict:
    """Map each distribution name to its installed version (None if missing)."""
    results = {}
    for pkg in packages:
        try:
            results[pkg] = version(pkg)
        except PackageNotFoundError:
            results[pkg] = None
    return results

needed = ["scikit-learn", "xgboost", "shap", "mlflow", "hyperopt"]
status = check_package_availability(needed)
for pkg, version in status.items():
    status_str = f"v{version}" if version else "NOT INSTALLED"
    print(f"  {pkg:<20} {status_str}")

R on Federal Platforms

R remains widely used in federal statistical agencies (Census, BLS, BEA, CDC) and some DoD analytical shops. On Databricks, R runs in notebooks via the %r magic. SparkR and sparklyr provide R interfaces to the Spark distributed computing layer.

r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

df_spark <- spark_read_table(sc, "procurement_catalog.silver.usaspending_fy2024")

# dplyr verbs translate to Spark operations via sparklyr
summary_df <- df_spark %>%
  filter(federal_action_obligation > 1e6) %>%
  group_by(naics_sector_code, fiscal_year) %>%
  summarise(
    total_obligations = sum(federal_action_obligation, na.rm = TRUE),
    award_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_obligations)) %>%
  collect()  # bring result to local R dataframe

print(head(summary_df, 10))

Where This Goes Wrong

Failure Mode 1: Assuming Your Laptop Environment Transfers

Developing locally and then copying notebooks to Databricks expecting them to work. Version mismatches, missing native libraries, or packages that assume internet access at runtime all break silently on the first import. Fix: check available packages before writing code that depends on them. Build a cluster init script for non-standard dependencies and test it in isolation.

Failure Mode 2: Using pandas on Data That Needs Spark

Loading a 50M-row Delta table into pandas with .toPandas() exhausts driver memory. Fix: write analyses for datasets over 5M rows in PySpark. If you need pandas for a specific operation, sample first (df.sample(fraction=0.01)) or use pyspark.pandas which runs pandas-like syntax on the full distributed dataset.
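A back-of-envelope estimate before calling .toPandas() can save a dead driver — a rough sketch assuming ~50 bytes per cell for mixed string/numeric frames (pure-numeric frames run closer to 8):

```python
def estimate_pandas_memory_gb(n_rows: int, n_cols: int, bytes_per_cell: int = 50) -> float:
    """Rough in-memory footprint of a mixed-type pandas DataFrame."""
    return n_rows * n_cols * bytes_per_cell / 1e9

# The 50M-row case above, at 40 columns: ~100 GB — a pandas non-starter.
print(f"{estimate_pandas_memory_gb(50_000_000, 40):.0f} GB")
```

Compare the estimate against your driver's memory, not the cluster total — pandas runs on the driver alone.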

Failure Mode 3: Hard-Coding Credentials

A Databricks PAT or API key in a notebook cell works until the notebook is shared or committed to Git — at which point it becomes a reportable security incident. Fix: on Databricks, use dbutils.secrets.get(scope="my-scope", key="api-key"). On Foundry, the platform injects credentials via environment variables automatically.
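Before sharing or committing a notebook, a simple scan for the Databricks PAT prefix catches the obvious cases — an illustrative pattern, not a substitute for a real secret scanner such as detect-secrets:

```python
import re

# Databricks PATs begin with "dapi" followed by hex characters
PAT_PATTERN = re.compile(r"dapi[0-9a-f]{32}")

def find_pats(text: str) -> list:
    """Return any Databricks-PAT-shaped strings found in text."""
    return PAT_PATTERN.findall(text)

print(find_pats('client = WorkspaceClient(token="dapi' + 'a' * 32 + '")'))
print(find_pats("token = os.environ['DATABRICKS_TOKEN']  # clean"))
```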

Platform Comparison

| Capability | Databricks (Advana/Jupiter) | Palantir Foundry | Qlik |
| --- | --- | --- | --- |
| Primary Python interface | Notebooks + PySpark | Code Workspaces (JupyterLab) + Code Repositories | Server-Side Extensions (gRPC) |
| R support | Yes (SparkR, sparklyr) | Limited | Via SSE plugin |
| Package management | Cluster libraries + init scripts | Conda env per repository | Plugin server dependencies |
| Git integration | Databricks Repos (native) | Built into Code Repositories | External |
| Secrets management | Databricks Secrets (dbutils) | Foundry Secrets | Qlik Sense credentials |
| Spark distributed computing | Native (PySpark) | Yes (Foundry backend) | No |

Practical Checklist

  • Verified Python and DBR version on your cluster before writing import statements
  • Stored all credentials in Databricks Secrets — no hardcoded tokens anywhere
  • Set up Git integration (Databricks Repos or Foundry Code Repository) before writing significant code
  • Confirmed dataset size before choosing pandas vs. PySpark — switch at 5M rows
  • Verified required non-standard libraries are available in the target environment
  • Set cluster auto-termination to 60 minutes or less to avoid idle compute costs
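The auto-termination setting from the checklist lives in the cluster spec — an illustrative fragment using Clusters API field names (node type and runtime version are assumptions; match them to your workspace):

```python
cluster_spec = {
    "cluster_name": "analytics-dev",
    "spark_version": "14.3.x-scala2.12",   # DBR 14.3 LTS
    "node_type_id": "i3.xlarge",           # assumption: AWS node type
    "num_workers": 2,
    "autotermination_minutes": 60,         # idle shutdown per the checklist
}

assert cluster_spec["autotermination_minutes"] <= 60
```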

Exercises

This chapter includes 5 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.

View Exercises on GitHub →