Python and R Foundations for Federal Platforms
The environment is the constraint. Federal platforms enforce Python versions, restrict package installs, and operate behind authentication walls that break standard workflows. This chapter covers what actually works — and what silently fails — when your environment is inside Advana, Databricks, Foundry, or Jupiter.
The Environment Problem
On your laptop, `pip install pandas` and you're done. On a federal platform, that same command might fail with a permissions error, pull from an internal mirror that is six months behind PyPI, or succeed silently while installing a version incompatible with the platform's Spark runtime. The tools are the same. The environment is not.
This chapter covers the practical realities: CAC authentication in code, the data structures each platform optimizes for, the SDK patterns that work, and the failure modes that new practitioners hit within their first two weeks.
All code in this chapter is written for Python 3.9+. Databricks Runtime 13.x and 14.x ship Python 3.10, and newer runtimes move to later versions (DBR 15.x ships Python 3.11). Foundry Code Repositories default to Python 3.9. Advana and Jupiter use Databricks under the hood — check your cluster's DBR version before assuming library availability.
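A quick sanity check worth running in the first cell of any new notebook (the 3.9 floor is this chapter's assumption, not a platform requirement):

```python
import sys

# Fail fast if the attached cluster's Python is older than this chapter assumes
MIN_PYTHON = (3, 9)

if sys.version_info < MIN_PYTHON:
    raise RuntimeError(
        f"Python {sys.version_info.major}.{sys.version_info.minor} found; "
        f"this chapter assumes {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+"
    )
print(f"Python {sys.version.split()[0]} OK")
```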
Platform-Specific Environments
Databricks Runtime
Databricks clusters run one of several runtime variants. The plain DBR (e.g., DBR 14.3 LTS) includes Python, PySpark, pandas, and a broad set of data engineering libraries. The ML Runtime (DBR 14.3 LTS ML) adds scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and MLflow pre-installed. The GPU ML Runtime additionally configures CUDA.
The practical implication: on a plain DBR cluster, `import xgboost` fails; on an ML cluster, it works. Choose your cluster type before you begin a project, not after you hit the first import error.
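One way to probe for the ML Runtime libraries without triggering a hard failure is `importlib`. Note that these are import names, which differ from pip distribution names (`sklearn` vs. `scikit-learn`); the list below covers the common ML Runtime extras and is illustrative, not exhaustive:

```python
from importlib.util import find_spec

# Import names (not pip names) for the ML Runtime extras
ml_runtime_libs = ["sklearn", "xgboost", "lightgbm", "torch", "mlflow"]
missing = [name for name in ml_runtime_libs if find_spec(name) is None]

if missing:
    print("Likely a plain DBR cluster. Missing:", ", ".join(missing))
else:
    print("All ML Runtime libraries importable")
```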
CAC Authentication in Code
Most government laptops require a Common Access Card (CAC) for authentication. When your Python script needs to call an API, access a data catalog, or push to a Git repository, standard OAuth and password flows don't apply.
Authentication paths differ by platform; the patterns below cover the most common cases.

```python
# Databricks authentication: use environment variables, never hardcoded tokens
import os

from databricks.sdk import WorkspaceClient

# PAT injected at cluster launch via Databricks Secrets or environment config
# NEVER: client = WorkspaceClient(token="dapi1234567890abcdef")
client = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)

me = client.current_user.me()
print(f"Authenticated as: {me.user_name}")
```
```python
# Palantir Foundry: a token is injected automatically inside the platform.
# For local development, supply one via environment variables:
import os

from foundry import FoundryClient

client = FoundryClient(
    auth_token=os.environ.get("FOUNDRY_TOKEN"),
    hostname=os.environ.get("FOUNDRY_HOSTNAME", "your-agency.palantirfoundry.com"),
)
```
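For services that authenticate with the certificate itself rather than a token, the usual pattern is mutual TLS. Direct CAC hardware access requires PKCS#11 middleware, but many agencies issue software certificates derived from the CAC. A sketch under that assumption, using `requests`; the helper name, file paths, and URL are all placeholders:

```python
import requests

def cac_session(cert_path: str, key_path: str, ca_bundle: str) -> requests.Session:
    """Build a session that presents a client certificate (mutual TLS) on every request."""
    session = requests.Session()
    session.cert = (cert_path, key_path)  # client cert + private key, PEM format
    session.verify = ca_bundle            # DoD root CA bundle, not the public trust store
    return session

# session = cac_session("piv_cert.pem", "piv_key.pem", "dod_roots.pem")
# response = session.get("https://internal-api.example.mil/v1/datasets")
```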
Core Data Structures by Platform
pandas DataFrames
pandas is the standard for in-memory, single-node data analysis. The constraint: pandas operates in memory on the driver node, and datasets in the millions of rows will exhaust memory on a standard cluster. This chapter uses 5 million rows as the working threshold: above it, switch to PySpark — do not just increase the driver size.
```python
import pandas as pd

# USASpending procurement data — always load NAICS as string, not int
df = pd.read_csv(
    "usaspending_fy2024_awards.csv",
    dtype={
        "contract_award_unique_key": str,
        "recipient_uei": str,
        "naics_code": str,  # "031" != 31 — always string
        "federal_action_obligation": float,
    },
    parse_dates=["action_date", "period_of_performance_start_date"],
    low_memory=False,
)

# Federal FY runs Oct 1 to Sep 30: FY2024 = Oct 1, 2023 through Sep 30, 2024.
# Vectorized: dates in Oct-Dec roll forward into the next fiscal year.
df["fiscal_year"] = df["action_date"].dt.year + (df["action_date"].dt.month >= 10)

print(f"Rows: {len(df):,} | Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```
PySpark on Databricks
PySpark distributes computation across your cluster. On Databricks (Advana, Jupiter, or standalone workspace), a SparkSession is already active in any notebook.
```python
from pyspark.sql import functions as F

# In a Databricks notebook, spark is already available
df = spark.table("procurement_catalog.silver.usaspending_fy2024")

# PySpark operations are lazy — nothing runs until an action is called
df_filtered = (
    df
    .filter(F.col("federal_action_obligation") > 1_000_000)
    .filter(F.col("naics_sector_code").isin(["54", "51", "33"]))
    .withColumn(
        "obligation_millions",
        F.round(F.col("federal_action_obligation") / 1_000_000, 2),
    )
    .select(
        "contract_award_unique_key", "recipient_name",
        "naics_code", "obligation_millions", "action_date", "fiscal_year",
    )
)

# count() is an action — this line triggers the computation
print(f"Filtered count: {df_filtered.count():,}")
```
NumPy for Numerical Computing
Use float64 (NumPy's default) rather than float32 when working with obligation amounts. Float32 has ~7 decimal digits of precision; rounding errors compound across large aggregations of trillion-dollar budgets.
```python
import numpy as np

obligations = np.array(df["federal_action_obligation"].values, dtype=np.float64)

print(f"Total obligations: ${obligations.sum():,.2f}")
print(f"Median: ${np.median(obligations):,.2f}")
print(f"P90: ${np.percentile(obligations, 90):,.2f}")
print(f"P99: ${np.percentile(obligations, 99):,.2f}")
```
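The limitation is concrete: above roughly $16 million, float32 cannot even resolve individual dollars, let alone cents. A quick check on a single mid-sized obligation:

```python
import numpy as np

amount = 123_456_789.12  # a mid-sized contract obligation

as64 = np.float64(amount)
as32 = np.float32(amount)

print(f"float64: {as64:,.2f}")  # 123,456,789.12
print(f"float32: {as32:,.2f}")  # 123,456,792.00 -- off by $2.88 before any math runs
```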
Platform SDKs
Databricks SDK
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs, trigger runs, inspect clusters
jobs = list(w.jobs.list())
print(f"Found {len(jobs)} jobs in workspace")

run = w.jobs.run_now(job_id=12345)  # job_id is a placeholder
print(f"Job triggered, run ID: {run.run_id}")

for cluster in w.clusters.list():
    print(f"  {cluster.cluster_name}: {cluster.state.value}")
```
Palantir Foundry Transforms
```python
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F

@transform_df(
    Output("/analytics/silver/procurement_cleaned"),
    raw=Input("/data/raw/usaspending_fy2024"),
)
def clean_procurement(raw):
    """
    Foundry transform: inputs and outputs are automatically version-controlled
    datasets. transform_df passes and returns Spark DataFrames (not pandas) and
    runs in Foundry's managed Spark environment.
    """
    # Normalize NAICS: strip non-digits, left-pad to six characters
    naics = F.regexp_replace(
        F.trim(F.col("naics_code").cast("string")), "[^0-9]", ""
    )
    return (
        raw
        .withColumn("naics_code", F.lpad(naics, 6, "0"))
        .withColumn(
            "federal_action_obligation",
            F.col("federal_action_obligation").cast("double"),  # non-numeric -> null
        )
    )
```
Notebook Environments
Databricks Notebooks
Databricks notebooks support Python, SQL, R, and Scala in the same notebook via magic commands (`%python`, `%sql`, `%r`, `%scala`). Notebooks are stored in the workspace rather than on a local filesystem, so they are accessible from any cluster and persist after cluster termination.
Databricks notebooks auto-save but are not version-controlled by default. Set up Git integration (Databricks Repos) on your first day. A notebook you've been working on for three hours that hasn't been committed to Git is at risk if the cluster crashes or you accidentally overwrite it.
Foundry Code Workspaces
Foundry Code Workspaces are JupyterLab environments running inside Foundry's managed compute. They have direct access to Foundry datasets without needing to configure SDK authentication. Workspaces are designed for interactive development; production pipelines run in Code Repositories (Git-backed, scheduled, tested).
Package Constraints on Restricted Networks
On classified and restricted networks, a `pip install` that points at the public PyPI index will fail. Packages must come from an internal mirror, a pre-loaded cluster library, or an approved repository.
On Databricks GovCloud, you can install most common libraries from Databricks' managed mirror. For packages not in the mirror, submit a request to your Databricks administrator. Review typically takes 3–5 business days on unclassified networks and 2–4 weeks in classified environments.
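If your administrators expose an internal mirror, pointing pip at it once in a config file beats passing flags on every command. A sketch of the config — the mirror URL is a placeholder for your agency's actual endpoint:

```ini
# ~/.config/pip/pip.conf (Linux/macOS) or %APPDATA%\pip\pip.ini (Windows)
[global]
index-url = https://pypi-mirror.example.mil/simple
trusted-host = pypi-mirror.example.mil
```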
```python
# importlib.metadata (stdlib since 3.8) replaces the deprecated pkg_resources
from importlib.metadata import version, PackageNotFoundError

def check_package_availability(packages: list) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    results = {}
    for pkg in packages:
        try:
            results[pkg] = version(pkg)
        except PackageNotFoundError:
            results[pkg] = None
    return results

needed = ["scikit-learn", "xgboost", "shap", "mlflow", "hyperopt"]
status = check_package_availability(needed)
for pkg, ver in status.items():
    status_str = f"v{ver}" if ver else "NOT INSTALLED"
    print(f"  {pkg:<20} {status_str}")
```
R on Federal Platforms
R remains widely used in federal statistical agencies (Census, BLS, BEA, CDC) and some DoD analytical shops. On Databricks, R runs in notebooks via the %r magic. SparkR and sparklyr provide R interfaces to the Spark distributed computing layer.
```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
df_spark <- spark_read_table(sc, "procurement_catalog.silver.usaspending_fy2024")

# dplyr verbs translate to Spark operations via sparklyr
summary_df <- df_spark %>%
  filter(federal_action_obligation > 1e6) %>%
  group_by(naics_sector_code, fiscal_year) %>%
  summarise(
    total_obligations = sum(federal_action_obligation, na.rm = TRUE),
    award_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_obligations)) %>%
  collect()  # bring the result to a local R dataframe

print(head(summary_df, 10))
```
Where This Goes Wrong
Failure Mode 1: Assuming Your Laptop Environment Transfers
The failure: developing locally, then copying notebooks to Databricks and expecting them to run unchanged. Version mismatches, missing native libraries, and packages that assume internet access at runtime all surface as failures only once the code runs on the platform. Fix: check available packages before writing code that depends on them, and build a cluster init script for non-standard dependencies, testing it in isolation.
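A minimal init script for the non-standard dependencies mentioned above. This is a configuration sketch: the mirror URL and version pins are placeholders, and the script must be stored in a location your cluster config can reference:

```bash
#!/bin/bash
# Cluster init script: runs on every node at cluster start.
# Install approved extras from the internal mirror (placeholder URL and pins).
set -euo pipefail
/databricks/python/bin/pip install \
    --index-url https://pypi-mirror.example.mil/simple \
    shap==0.44.1 hyperopt==0.2.7
```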
Failure Mode 2: Using pandas on Data That Needs Spark
Loading a 50M-row Delta table into pandas with `.toPandas()` exhausts driver memory. Fix: write analyses for datasets over 5M rows in PySpark. If you need pandas for a specific operation, sample first (`df.sample(fraction=0.01)`) or use `pyspark.pandas`, which runs pandas-like syntax on the full distributed dataset.
Failure Mode 3: Hard-Coding Credentials
A Databricks PAT or API key in a notebook cell works until the notebook is shared or committed to Git — at which point it becomes a reportable security incident. Fix: on Databricks, use `dbutils.secrets.get(scope="my-scope", key="api-key")`. On Foundry, the platform injects credentials via environment variables automatically.
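During development it helps to have one code path that works both on-cluster and locally. A sketch of a helper that tries Databricks Secrets first and falls back to an environment variable off-platform; the function, scope, and key names are illustrative, not a platform API:

```python
import os
from typing import Optional

def get_secret(scope: str, key: str, env_fallback: Optional[str] = None) -> str:
    """Read from Databricks Secrets on-cluster; fall back to an env var locally."""
    try:
        from pyspark.dbutils import DBUtils  # only importable on Databricks clusters
        from pyspark.sql import SparkSession
        dbutils = DBUtils(SparkSession.builder.getOrCreate())
        return dbutils.secrets.get(scope=scope, key=key)
    except Exception:
        if env_fallback and env_fallback in os.environ:
            return os.environ[env_fallback]
        raise RuntimeError(f"Secret {scope}/{key} unavailable and no fallback set")
```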
Platform Comparison
| Capability | Databricks (Advana/Jupiter) | Palantir Foundry | Qlik |
|---|---|---|---|
| Primary Python interface | Notebooks + PySpark | Code Workspaces (JupyterLab) + Code Repositories | Server-Side Extensions (gRPC) |
| R support | Yes (SparkR, sparklyr) | Limited | Via SSE plugin |
| Package management | Cluster libraries + init scripts | Conda env per repository | Plugin server dependencies |
| Git integration | Databricks Repos (native) | Built into Code Repositories | External |
| Secrets management | Databricks Secrets (dbutils) | Foundry Secrets | Qlik Sense credentials |
| Spark distributed computing | Native (PySpark) | Yes (Foundry backend) | No |
Practical Checklist
- Verified Python and DBR version on your cluster before writing import statements
- Stored all credentials in Databricks Secrets — no hardcoded tokens anywhere
- Set up Git integration (Databricks Repos or Foundry Code Repository) before writing significant code
- Confirmed dataset size before choosing pandas vs. PySpark — switch at 5M rows
- Verified required non-standard libraries are available in the target environment
- Set cluster auto-termination to 60 minutes or less to avoid idle compute costs
Exercises
This chapter includes 5 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →