Chapter 02

Python and R Foundations for Federal Platforms

The environment is the constraint. Federal platforms enforce Python versions, restrict package installs, and operate behind authentication walls that break standard workflows. This chapter covers what actually works — and what silently fails — when your environment is inside Advana, Databricks, Foundry, or Jupiter.


The Environment Problem

On your laptop, pip install pandas and you're done. On a federal platform, that same command might fail with a permissions error, pull from an internal mirror that is six months behind PyPI, or succeed silently while installing a version incompatible with the platform's Spark runtime. The tools are the same. The environment is not.

This chapter covers the practical realities: CAC authentication in code, the data structures each platform optimizes for, the SDK patterns that work, and the failure modes that new practitioners hit within their first two weeks.

All code in this chapter is written for Python 3.9+. Databricks Runtime 13.x and later ship Python 3.10. Foundry Code Repositories default to Python 3.9. Advana and Jupiter use Databricks under the hood — check your cluster's DBR version before assuming library availability.

Platform-Specific Environments

Databricks Runtime

Databricks clusters run one of several runtime variants. The plain DBR (e.g., DBR 14.3 LTS) includes Python, PySpark, pandas, and a broad set of data engineering libraries. The ML Runtime (DBR 14.3 LTS ML) adds scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and MLflow pre-installed. The GPU ML Runtime additionally configures CUDA.

The practical implication: on a plain DBR cluster, import xgboost fails. On an ML cluster, it works. Choose your cluster type before you begin a project, not after you hit the first import error.
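One way to fail fast is a library probe in the first notebook cell — a minimal sketch; adjust the required list to your project's dependencies:

```python
import importlib.util

def runtime_check(required: list) -> dict:
    """Report which importable modules this cluster's runtime provides."""
    return {name: importlib.util.find_spec(name) is not None for name in required}

# On a plain DBR cluster, expect xgboost to be missing; on ML Runtime, present.
status = runtime_check(["pandas", "pyspark", "xgboost"])
for name, present in status.items():
    print(f"  {name:<10} {'OK' if present else 'MISSING — check cluster type'}")
```

Running this before any real imports turns a mid-notebook `ModuleNotFoundError` into an immediate, actionable cluster-type decision.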

CAC Authentication in Code

Most government laptops require a Common Access Card (CAC) for authentication. When your Python script needs to call an API, access a data catalog, or push to a Git repository, standard OAuth and password flows don't apply.

mermaid
graph TD
    A[Python Script] --> B{Platform}
    B -->|Databricks| C[Personal Access Token\nor Service Principal]
    B -->|Foundry| D[foundry-sdk token\nfrom browser session]
    B -->|Jupiter / Advana| E[Databricks PAT\ninside DoD SSO boundary]
    B -->|External APIs| F[CAC middleware\nor API key from\ndata.gov / SAM.gov]
    C --> G[Works in notebooks\nand local dev]
    D --> G
    E --> G
    F --> G

Authentication paths for common federal platform access patterns.

python
# Databricks authentication — use environment variables, never hardcoded tokens
import os
from databricks.sdk import WorkspaceClient

# PAT injected at cluster launch via Databricks Secrets or environment config
# NEVER: client = WorkspaceClient(token="dapi1234567890abcdef")
client = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)

me = client.current_user.me()
print(f"Authenticated as: {me.user_name}")
python
# Palantir Foundry: token is injected automatically inside the platform
# For local development:
import os
from foundry import FoundryClient

client = FoundryClient(
    auth_token=os.environ.get("FOUNDRY_TOKEN"),
    hostname=os.environ.get("FOUNDRY_HOSTNAME", "your-agency.palantirfoundry.com"),
)

Core Data Structures by Platform

pandas DataFrames

pandas is the standard for in-memory, single-node data analysis. The constraint: pandas operates in memory on the driver node. For datasets above roughly 10 million rows, you will exhaust memory on a standard cluster. Switch to PySpark — do not just increase the driver size.

python
import pandas as pd

# USASpending procurement data — always load NAICS as string, not int
df = pd.read_csv(
    "usaspending_fy2024_awards.csv",
    dtype={
        "contract_award_unique_key": str,
        "recipient_uei": str,
        "naics_code": str,           # "031" != 31 — always string
        "federal_action_obligation": float,
    },
    parse_dates=["action_date", "period_of_performance_start_date"],
    low_memory=False,
)

# Federal FY: Oct 1 to Sep 30
# FY2024 = Oct 1, 2023 through Sep 30, 2024
df["fiscal_year"] = (
    df["action_date"].dt.year
    + (df["action_date"].dt.month >= 10).astype("int64")  # vectorized; avoids per-row apply
)

print(f"Rows: {len(df):,}  |  Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
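The fiscal-year rule is worth isolating into a plain function you can unit-test — the mapping below follows the Oct 1–Sep 30 convention stated above:

```python
from datetime import date

def federal_fiscal_year(d: date) -> int:
    """US federal fiscal year runs Oct 1 – Sep 30, named for the year it ends in."""
    return d.year + 1 if d.month >= 10 else d.year

# FY2024 boundary checks
assert federal_fiscal_year(date(2023, 10, 1)) == 2024   # first day of FY2024
assert federal_fiscal_year(date(2024, 9, 30)) == 2024   # last day of FY2024
assert federal_fiscal_year(date(2024, 10, 1)) == 2025   # first day of FY2025
```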

PySpark on Databricks

PySpark distributes computation across your cluster. On Databricks (Advana, Jupiter, or standalone workspace), a SparkSession is already active in any notebook.

python
from pyspark.sql import functions as F

# In a Databricks notebook, spark is already available
df = spark.table("procurement_catalog.silver.usaspending_fy2024")

# PySpark operations are lazy — nothing runs until an action is called
df_filtered = (
    df
    .filter(F.col("federal_action_obligation") > 1_000_000)
    .filter(F.col("naics_sector_code").isin(["54", "51", "33"]))
    .withColumn("obligation_millions",
                F.round(F.col("federal_action_obligation") / 1_000_000, 2))
    .select(
        "contract_award_unique_key", "recipient_name",
        "naics_code", "obligation_millions", "action_date", "fiscal_year"
    )
)

print(f"Filtered count: {df_filtered.count():,}")

NumPy for Numerical Computing

Use float64 (NumPy's default) rather than float32 when working with obligation amounts. Float32 has ~7 decimal digits of precision; rounding errors compound across large aggregations of trillion-dollar budgets.

python
import numpy as np

obligations = df["federal_action_obligation"].to_numpy(dtype=np.float64)
obligations = obligations[~np.isnan(obligations)]  # drop missing values so sum/median stay finite

print(f"Total obligations: ${obligations.sum():,.2f}")
print(f"Median: ${np.median(obligations):,.2f}")
print(f"P90: ${np.percentile(obligations, 90):,.2f}")
print(f"P99: ${np.percentile(obligations, 99):,.2f}")
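A quick demonstration of the precision claim — near $1 trillion, adjacent representable float32 values are 65,536 apart, so cent-level amounts vanish entirely:

```python
import numpy as np

# Gap between adjacent representable values at 1e12
gap32 = np.spacing(np.float32(1e12))   # 65536.0
gap64 = np.spacing(np.float64(1e12))   # ~0.000122

# Adding one cent to a trillion dollars: lost in float32, kept in float64
lost = np.float32(1e12) + np.float32(0.01) == np.float32(1e12)
kept = np.float64(1e12) + np.float64(0.01) > np.float64(1e12)
print(f"float32 gap: {gap32:,.0f}   cent lost in float32: {lost}   float64 keeps it: {kept}")
```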

Platform SDKs

Databricks SDK

python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs, trigger runs, inspect clusters
jobs = list(w.jobs.list())
print(f"Found {len(jobs)} jobs in workspace")

run = w.jobs.run_now(job_id=12345)
print(f"Job triggered, run ID: {run.run_id}")

for cluster in w.clusters.list():
    print(f"  {cluster.cluster_name}: {cluster.state.value}")

Palantir Foundry Transforms

python
from transforms.api import transform_pandas, Input, Output
import pandas as pd

@transform_pandas(
    Output("/analytics/silver/procurement_cleaned"),
    raw=Input("/data/raw/usaspending_fy2024"),
)
def clean_procurement(raw: pd.DataFrame) -> pd.DataFrame:
    """
    Foundry transform: input and output datasets are automatically
    version-controlled. transform_pandas hands the function pandas
    DataFrames (transform_df would pass Spark DataFrames instead); it
    loads the full input into memory, so reserve it for small-to-medium
    datasets and use transform_df for large ones.
    """
    df = raw.copy()
    df["naics_code"] = (
        df["naics_code"]
        .astype(str).str.strip()
        .str.replace(r"[^0-9]", "", regex=True)
        .str.zfill(6)
    )
    df["federal_action_obligation"] = pd.to_numeric(
        df["federal_action_obligation"], errors="coerce"
    )
    return df
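The NAICS cleanup is easier to trust when the rule is also captured as a plain function with unit checks — a sketch mirroring the transform's logic (the helper name is mine):

```python
import re

def normalize_naics(code) -> str:
    """Strip non-digits and left-pad to 6, matching the transform above.

    NAICS codes must stay strings: leading zeros are significant, and
    values read from Excel or CSV often arrive as ints or floats.
    """
    digits = re.sub(r"[^0-9]", "", str(code).strip())
    return digits.zfill(6) if digits else ""

print(normalize_naics(" 5415 "))   # 005415
print(normalize_naics(541511))     # 541511
print(repr(normalize_naics("N/A")))  # ''
```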

Notebook Environments

Databricks Notebooks

Databricks notebooks support Python, SQL, R, and Scala in the same notebook via magic commands (%python, %sql, %r). Notebooks are stored in the workspace, not on the local filesystem — they are accessible from any cluster and persist after cluster termination.

Databricks notebooks auto-save but are not version-controlled by default. Set up Git integration (Databricks Repos) on your first day. Three hours of uncommitted work is one accidental overwrite or deleted cell away from being unrecoverable.

Foundry Code Workspaces

Foundry Code Workspaces are JupyterLab environments running inside Foundry's managed compute. They have direct access to Foundry datasets without needing to configure SDK authentication. Workspaces are designed for interactive development; production pipelines run in Code Repositories (Git-backed, scheduled, tested).

Package Constraints on Restricted Networks

On classified and restricted networks, pip install hitting the public PyPI index will fail. Packages must come from an internal mirror, a pre-loaded cluster library, or an approved repository.

On Databricks GovCloud, most common libraries can be installed from Databricks' managed mirror. For packages outside the mirror, submit a request to your Databricks administrator; review typically takes 3–5 business days on unclassified networks and 2–4 weeks in classified environments.

python
# pkg_resources is deprecated — importlib.metadata is the stdlib replacement
from importlib.metadata import version, PackageNotFoundError

def check_package_availability(packages: list) -> dict:
    """Map each distribution name to its installed version (None if missing)."""
    results = {}
    for pkg in packages:
        try:
            results[pkg] = version(pkg)
        except PackageNotFoundError:
            results[pkg] = None
    return results

needed = ["scikit-learn", "xgboost", "shap", "mlflow", "hyperopt"]
status = check_package_availability(needed)
for pkg, version in status.items():
    status_str = f"v{version}" if version else "NOT INSTALLED"
    print(f"  {pkg:<20} {status_str}")

R on Federal Platforms

R remains widely used in federal statistical agencies (Census, BLS, BEA, CDC) and some DoD analytical shops. On Databricks, R runs in notebooks via the %r magic. SparkR and sparklyr provide R interfaces to the Spark distributed computing layer.

r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

df_spark <- spark_read_table(sc, "procurement_catalog.silver.usaspending_fy2024")

# dplyr verbs translate to Spark operations via sparklyr
summary_df <- df_spark %>%
  filter(federal_action_obligation > 1e6) %>%
  group_by(naics_sector_code, fiscal_year) %>%
  summarise(
    total_obligations = sum(federal_action_obligation, na.rm = TRUE),
    award_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_obligations)) %>%
  collect()  # bring result to local R dataframe

print(head(summary_df, 10))

Where This Goes Wrong

Failure Mode 1: Assuming Your Laptop Environment Transfers

Developing locally and then copying notebooks to Databricks expecting them to work. Version mismatches, missing native libraries, or packages that assume internet access at runtime all break silently on the first import. Fix: check available packages before writing code that depends on them. Build a cluster init script for non-standard dependencies and test it in isolation.

Failure Mode 2: Using pandas on Data That Needs Spark

Loading a 50M-row Delta table into pandas with .toPandas() exhausts driver memory. Fix: write analyses for datasets over 5M rows in PySpark. If you need pandas for a specific operation, sample first (df.sample(fraction=0.01)) or use pyspark.pandas which runs pandas-like syntax on the full distributed dataset.
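A back-of-envelope estimate before calling .toPandas() can save a dead driver — a rough sketch assuming ~50 bytes per cell for mixed string/numeric frames (pure-numeric frames run closer to 8):

```python
def estimate_pandas_memory_gb(n_rows: int, n_cols: int, bytes_per_cell: int = 50) -> float:
    """Rough in-memory footprint of a mixed-type pandas DataFrame."""
    return n_rows * n_cols * bytes_per_cell / 1e9

# The 50M-row case above, at 40 columns: ~100 GB — a pandas non-starter.
print(f"{estimate_pandas_memory_gb(50_000_000, 40):.0f} GB")
```

Compare the estimate against your driver's memory, not the cluster total — pandas runs on the driver alone.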

Failure Mode 3: Hard-Coding Credentials

A Databricks PAT or API key in a notebook cell works until the notebook is shared or committed to Git — at which point it becomes a reportable security incident. Fix: on Databricks, use dbutils.secrets.get(scope="my-scope", key="api-key"). On Foundry, the platform injects credentials via environment variables automatically.
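Before sharing or committing a notebook, a simple scan for the Databricks PAT prefix catches the obvious cases — an illustrative pattern, not a substitute for a real secret scanner such as detect-secrets:

```python
import re

# Databricks PATs begin with "dapi" followed by hex characters
PAT_PATTERN = re.compile(r"dapi[0-9a-f]{32}")

def find_pats(text: str) -> list:
    """Return any Databricks-PAT-shaped strings found in text."""
    return PAT_PATTERN.findall(text)

print(find_pats('client = WorkspaceClient(token="dapi' + 'a' * 32 + '")'))
print(find_pats("token = os.environ['DATABRICKS_TOKEN']  # clean"))
```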

Platform Comparison

| Capability | Databricks (Advana/Jupiter) | Palantir Foundry | Qlik |
| --- | --- | --- | --- |
| Primary Python interface | Notebooks + PySpark | Code Workspaces (JupyterLab) + Code Repositories | Server-Side Extensions (gRPC) |
| R support | Yes (SparkR, sparklyr) | Limited | Via SSE plugin |
| Package management | Cluster libraries + init scripts | Conda env per repository | Plugin server dependencies |
| Git integration | Databricks Repos (native) | Built into Code Repositories | External |
| Secrets management | Databricks Secrets (dbutils) | Foundry Secrets | Qlik Sense credentials |
| Spark distributed computing | Native (PySpark) | Yes (Foundry backend) | No |

Practical Checklist

  • Verified Python and DBR version on your cluster before writing import statements
  • Stored all credentials in Databricks Secrets — no hardcoded tokens anywhere
  • Set up Git integration (Databricks Repos or Foundry Code Repository) before writing significant code
  • Confirmed dataset size before choosing pandas vs. PySpark — switch at 5M rows
  • Verified required non-standard libraries are available in the target environment
  • Set cluster auto-termination to 60 minutes or less to avoid idle compute costs
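The auto-termination setting from the checklist lives in the cluster spec — an illustrative fragment using Clusters API field names (node type and runtime version are assumptions; match them to your workspace):

```python
cluster_spec = {
    "cluster_name": "analytics-dev",
    "spark_version": "14.3.x-scala2.12",   # DBR 14.3 LTS
    "node_type_id": "i3.xlarge",           # assumption: AWS node type
    "num_workers": 2,
    "autotermination_minutes": 60,         # idle shutdown per the checklist
}

assert cluster_spec["autotermination_minutes"] <= 60
```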

Exercises

This chapter includes 5 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.

View Exercises on GitHub →