Chapter 03

Data Acquisition and Government Data Sources

Federal data is classified, catalogued, and controlled. Getting the right data to the right analyst on the right platform requires navigating classification authorities, data catalog systems, and API authentication schemes that exist nowhere outside government. This chapter covers how to actually get the data.


Data Classification First

Before you write a single line of code to acquire data, you need to know its classification level. This is not a compliance formality — it determines which platform you can use, who can see the results, and whether you can even describe the data in a status update.

graph TD
    A[Identify Data Source] --> B{Classification Level?}
    B -->|Unclassified - Public| C[data.gov, USASpending,\nCensus API, SAM.gov]
    B -->|CUI / FOUO| D[Agency intranet,\nAdvana IL4/IL5,\nJupiter NIPRNET]
    B -->|Secret| E[SIPRNET systems,\nJupiter SIPRNET tier]
    B -->|Top Secret / SCI| F[JWICS-connected\nplatforms only]
    C --> G[Standard API calls\nover public internet]
    D --> H[Requires VPN or\nplatform access + authorization]
    E --> I[Separate network,\nno cross-domain without XDS]
    F --> J[Air-gapped environment,\nno cloud ML platforms]

Data classification determines the acquisition path and authorized platforms.

Classification Levels That Matter

CUI (Controlled Unclassified Information) is the catch-all category for sensitive but unclassified data. FOUO (For Official Use Only) is a legacy marking that predates the CUI program but remains in widespread use. CUI requires access controls, markings, and authorized platforms — it cannot live in a shared Google Drive or on a personal laptop without encryption.

PII (Personally Identifiable Information) and PHI (Protected Health Information) may be CUI or may have additional handling requirements under Privacy Act, HIPAA, or DoD 5400.11. The classification level of a dataset does not tell you whether it contains PII — you must scan for it explicitly (see Chapter 04).
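Chapter 04 covers PII scanning in depth, but even a crude regex pass catches obvious leaks before data lands in a shared workspace. A minimal sketch — the patterns and the `prescan_pii` helper are illustrative, not an exhaustive scanner:

```python
import re
import pandas as pd

# Illustrative patterns only — a real PII scan needs far broader coverage
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def prescan_pii(df: pd.DataFrame, sample_size: int = 1000) -> dict:
    """Return {column: [pattern names matched]} over a sample of each column."""
    hits = {}
    sample = df.head(sample_size)
    for col in sample.columns:
        values = sample[col].dropna().astype(str)
        matched = [name for name, pat in PII_PATTERNS.items()
                   if values.str.contains(pat, regex=True).any()]
        if matched:
            hits[col] = matched
    return hits

df = pd.DataFrame({
    "vendor": ["Acme Corp", "Initech"],
    "poc":    ["john.doe@example.com", "555-867-5309"],
})
print(prescan_pii(df))  # {'poc': ['email', 'phone']}
```

A hit here means stop and involve the data owner — it does not mean a clean result clears the dataset.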

If you're not sure about a dataset's classification, ask the data owner — not a colleague who also isn't sure. Mishandling classified data is a federal crime, and "I didn't know" is not a defense. The data owner or your agency's information system security officer (ISSO) is the authoritative source.

Federal Open Data APIs

A substantial amount of useful federal data is publicly available via APIs that require no clearance — only an API key or no authentication at all. These are the starting points for procurement analysis, budget research, demographic work, and economic analysis.

USASpending.gov API

USASpending.gov is the public-facing portal for all federal spending data — contract awards, grants, loans, and other financial assistance. No API key or authentication required for public endpoints. Rate limits apply but are not formally published; for bulk data pulls, use the award data archive downloads rather than paginating the API.

python
import requests
import pandas as pd
import time

def fetch_usaspending_awards(
    fiscal_year: int,
    naics_codes: list,
    min_amount: float = 1_000_000,
    max_pages: int = 10,
) -> pd.DataFrame:
    """
    Fetch contract awards from USASpending.gov API.
    Returns a DataFrame of awards matching the filters.

    Rate limits are not formally published; pause briefly between pages
    and prefer the bulk award archive downloads for large pulls.
    """
    url = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
    all_records = []

    for page in range(1, max_pages + 1):
        payload = {
            "filters": {
                "time_period": [{"start_date": f"{fiscal_year - 1}-10-01",
                                 "end_date": f"{fiscal_year}-09-30"}],
                "naics_codes": naics_codes,
                "award_type_codes": ["A", "B", "C", "D"],  # contract types
                "award_amounts": [{"lower_bound": min_amount}],
            },
            "fields": [
                "Award ID", "Recipient Name", "Award Amount",
                "NAICS Code", "NAICS Description",
                "Period of Performance Start Date",
                "Period of Performance Current End Date",
                "Awarding Agency", "Awarding Sub Agency",
            ],
            "page": page,
            "limit": 100,
            "sort": "Award Amount",
            "order": "desc",
        }

        resp = requests.post(url, json=payload, timeout=30)
        resp.raise_for_status()
        data = resp.json()

        results = data.get("results", [])
        if not results:
            break

        all_records.extend(results)
        print(f"Page {page}: fetched {len(results)} records "
              f"(total so far: {len(all_records):,})")

        if page < data.get("page_metadata", {}).get("total_pages", 1):
            time.sleep(0.1)  # stay under rate limit
        else:
            break

    return pd.DataFrame(all_records)


# Example: IT services (NAICS 5415 family) contracts over $1M in FY2024
df_it_contracts = fetch_usaspending_awards(
    fiscal_year=2024,
    naics_codes=["541511", "541512", "541513", "541519"],
    min_amount=1_000_000,
)
print(f"\nFetched {len(df_it_contracts):,} IT services contracts")

SAM.gov API

SAM.gov (System for Award Management) is the authoritative source for vendor registration data — UEI numbers, CAGE codes, business size certifications, and exclusion lists. An API key is required and free to obtain at api.sam.gov.

python
import requests
import os

SAM_API_KEY = os.environ["SAM_API_KEY"]  # free, register at api.sam.gov

def lookup_vendor_by_uei(uei: str) -> dict:
    """
    Look up a vendor's SAM.gov registration by UEI.
    Returns registration data including business type, certifications,
    and registration status.
    """
    url = "https://api.sam.gov/entity-information/v3/entities"
    params = {
        "api_key": SAM_API_KEY,
        "ueiSAM": uei,
        "includeSections": "entityRegistration,coreData,assertions",
    }

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    entities = data.get("entityData", [])
    if not entities:
        return {}

    entity = entities[0]
    reg = entity.get("entityRegistration", {})
    core = entity.get("coreData", {})

    return {
        "uei":                reg.get("ueiSAM"),
        "legal_name":         reg.get("legalBusinessName"),
        "registration_status": reg.get("registrationStatus"),
        "cage_code":          reg.get("cageCode"),
        "expiration_date":    reg.get("registrationExpirationDate"),
        "entity_type":        core.get("entityInformation", {}).get("entityTypeDesc"),
        "business_types":     core.get("businessTypes", {}).get("businessTypeList", []),
    }


# Check a vendor before awarding a contract
vendor = lookup_vendor_by_uei("ABCDE1234567")
print(f"Vendor: {vendor.get('legal_name')}")
print(f"Status: {vendor.get('registration_status')}")  # Must be "Active"
print(f"CAGE: {vendor.get('cage_code')}")

Census Bureau API

The Census API provides demographic, economic, and geographic data. Essential for workforce analysis, program area demographics, and any model requiring population denominators. Free API key at api.census.gov.

python
import requests
import pandas as pd
import os

CENSUS_KEY = os.environ["CENSUS_API_KEY"]  # free at api.census.gov

def fetch_acs_demographics(state_fips: str, year: int = 2022) -> pd.DataFrame:
    """
    Fetch American Community Survey 5-year estimates by county.
    Returns population, median age, education, and income by county.
    """
    url = f"https://api.census.gov/data/{year}/acs/acs5"
    variables = [
        "B01003_001E",  # Total population
        "B01002_001E",  # Median age
        "B15003_022E",  # Bachelor's degree (25+)
        "B15003_023E",  # Master's degree (25+)
        "B19013_001E",  # Median household income
        "NAME",
    ]

    params = {
        "get": ",".join(variables),
        "for": "county:*",
        "in": f"state:{state_fips}",
        "key": CENSUS_KEY,
    }

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    df = pd.DataFrame(data[1:], columns=data[0])
    df = df.rename(columns={
        "B01003_001E": "total_population",
        "B01002_001E": "median_age",
        "B15003_022E": "bachelors_degree",
        "B15003_023E": "masters_degree",
        "B19013_001E": "median_household_income",
    })

    # Convert numeric columns from strings
    numeric_cols = ["total_population", "median_age", "bachelors_degree",
                    "masters_degree", "median_household_income"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

    return df


# Virginia county demographics (FIPS 51)
va_demographics = fetch_acs_demographics("51", year=2022)
print(va_demographics.head())
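With the ACS columns converted to numerics, derived rates are one line each. A quick sketch using a toy frame that mirrors the columns `fetch_acs_demographics` returns (the population and degree counts are illustrative, and note that a true attainment rate would use the 25-and-over universe, B15003_001E, as its denominator):

```python
import pandas as pd

# Toy frame mirroring the columns produced by fetch_acs_demographics
df = pd.DataFrame({
    "NAME": ["Fairfax County, Virginia", "Arlington County, Virginia"],
    "total_population": [1_000_000, 200_000],   # illustrative values
    "bachelors_degree": [250_000, 70_000],
    "masters_degree":   [150_000, 34_000],
})

# Degrees per 1,000 residents — a rough proxy; a true attainment rate
# would use the 25-and-over universe (B15003_001E) as the denominator
df["degrees_per_1k"] = (
    (df["bachelors_degree"] + df["masters_degree"])
    / df["total_population"] * 1000
).round(1)

print(df.sort_values("degrees_per_1k", ascending=False)[["NAME", "degrees_per_1k"]])
```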

Data Catalogs on Federal Platforms

Once inside a federal platform, you don't acquire data by calling external APIs — you query internal data catalogs. Each platform has its own catalog system.

Unity Catalog on Databricks

Unity Catalog is Databricks' centralized governance layer. On Advana and Jupiter, all authorized datasets are registered in Unity Catalog with three-part names: catalog.schema.table. You don't need to know where the data physically lives — query it by name.

sql
-- Discover what's available in Unity Catalog
SHOW CATALOGS;

-- Browse schemas in a catalog
SHOW SCHEMAS IN advana_procurement;

-- List tables in a schema
SHOW TABLES IN advana_procurement.silver;

-- Examine a table's structure and metadata
DESCRIBE TABLE EXTENDED advana_procurement.silver.fpds_contract_actions;

-- Search the catalog for tables related to a topic
SELECT table_catalog, table_schema, table_name, comment
FROM system.information_schema.tables
WHERE lower(table_name) LIKE '%procurement%'
  AND table_type = 'MANAGED'
ORDER BY table_catalog, table_schema, table_name;

Collibra on Navy Jupiter

Navy Jupiter uses Collibra as its primary data catalog alongside Unity Catalog. Collibra provides the business glossary, data stewardship workflows, and Bronze/Silver/Gold tier designations. When looking for an authoritative dataset on Jupiter, start in Collibra — it tells you which Unity Catalog table is the Gold-tier approved source for a given subject area, who the data steward is, and what access request form to submit.

Palantir Foundry Ontology

On Foundry, data discovery happens through the Ontology — the semantic layer that maps raw datasets to Object Types. Rather than looking for a table called usaspending_awards, you navigate to the Contract Object Type in the Ontology Browser and see all the underlying data, linked objects, and available properties. The data model is the catalog.

python
# Foundry: query objects through the Ontology SDK
import os

import pandas as pd
from foundry import FoundryClient
from foundry.ontology import Contract

client = FoundryClient(auth_token=os.environ["FOUNDRY_TOKEN"],
                       hostname=os.environ["FOUNDRY_HOSTNAME"])

# Load contracts over $5M from the Ontology
contracts = (
    client.ontology.objects.Contract
    .filter(Contract.obligation_amount.gt(5_000_000))
    .filter(Contract.fiscal_year.eq(2024))
    .select(["contract_id", "vendor_name", "obligation_amount",
             "naics_code", "period_of_performance_end_date"])
    .take(1000)
)

df = pd.DataFrame([c.as_dict() for c in contracts])
print(f"Fetched {len(df):,} high-value contracts from Foundry Ontology")

Messy File Formats in Government Data

Government data arrives in formats that reflect decades of incompatible systems: fixed-width text files from mainframes, FOIA-produced PDFs, Excel files with merged cells used as headers, and CSV exports with inconsistent encodings.
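Encoding mismatches are the most common silent corruption in these exports. A minimal try-in-order decoder using only the standard library — the candidate list is a starting point, not a guarantee, and latin-1 must come last because it decodes any byte sequence without error:

```python
def sniff_decode(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")) -> tuple[str, str]:
    """Try each encoding in order; return (decoded_text, encoding_used).

    latin-1 never fails (every byte maps to a code point), so keep it
    last as the fallback of last resort.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Unreachable with latin-1 in the candidate list, but be explicit
    return raw.decode("latin-1", errors="replace"), "latin-1"

# Bytes that are valid cp1252 but not valid UTF-8 fall through correctly
text, enc = sniff_decode("Café—résumé".encode("cp1252"))
print(enc)  # cp1252
```

A decode that succeeds is not proof the encoding is right — spot-check accented names and currency symbols in the output before trusting it.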

Fixed-Width Files

Many legacy government systems export data in fixed-width format — each field occupies a fixed column position, no delimiters. The format specification is usually in a separate data dictionary document.

python
import pandas as pd

# Fixed-width specification from data dictionary
# Format: (field_name, start_col, end_col) — 0-indexed, end is exclusive
FIELD_SPEC = [
    ("record_type",          0,   2),
    ("agency_code",          2,   6),
    ("bureau_code",          6,  10),
    ("program_code",        10,  16),
    ("obligation_amount",   16,  30),
    ("fiscal_year",         30,  34),
    ("action_date",         34,  42),   # YYYYMMDD
    ("award_type",          42,  44),
]

colspecs   = [(start, end) for _, start, end in FIELD_SPEC]
col_names  = [name for name, _, _ in FIELD_SPEC]

df = pd.read_fwf(
    "financial_transactions_fy2024.dat",
    colspecs=colspecs,
    names=col_names,
    dtype=str,           # Read everything as string first
    encoding="latin-1",  # Many legacy government files use latin-1, not UTF-8
    skiprows=1,          # Skip header record
)

# Parse dates and amounts after initial load
df["action_date"] = pd.to_datetime(df["action_date"], format="%Y%m%d", errors="coerce")
df["obligation_amount"] = pd.to_numeric(
    df["obligation_amount"].str.strip(), errors="coerce"
) / 100  # Often stored as integer cents

print(f"Loaded {len(df):,} financial transactions")

Processing FOIA-Released PDFs

FOIA responses often arrive as PDFs — sometimes searchable, often not. For searchable PDFs, pdfplumber can extract tables directly. For scanned PDFs, you need OCR.

python
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path: str) -> list[pd.DataFrame]:
    """
    Extract all tables from a searchable PDF using pdfplumber.
    Returns a list of DataFrames, one per detected table per page.
    Works on native PDFs; scanned PDFs require OCR preprocessing.
    """
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for t_idx, table in enumerate(page_tables):
                if not table or len(table) < 2:
                    continue
                df = pd.DataFrame(table[1:], columns=table[0])
                df["_source_page"] = page_num
                df["_table_index"] = t_idx
                tables.append(df)
                print(f"  Page {page_num}, Table {t_idx}: "
                      f"{len(df)} rows x {len(df.columns)} columns")

    return tables


tables = extract_tables_from_pdf("foia_response_contracts.pdf")
print(f"Extracted {len(tables)} tables from PDF")
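pdfplumber output is raw: header cells often contain embedded newlines, empty strings stand in for missing values, and every cell is a string. A small cleanup pass, assuming the DataFrame shape produced by `extract_tables_from_pdf` above (the `clean_extracted_table` helper is a sketch, not a general solution):

```python
import pandas as pd

def clean_extracted_table(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize a pdfplumber-extracted table: collapse whitespace in
    headers, name None columns, treat empty strings as NA, drop empty rows."""
    out = df.copy()
    out.columns = [
        " ".join(str(c).split()) if c is not None else f"col_{i}"
        for i, c in enumerate(out.columns)
    ]
    # Keep metadata columns (_source_page, _table_index) out of the NA logic
    data_cols = [c for c in out.columns if not c.startswith("_")]
    out[data_cols] = out[data_cols].replace("", pd.NA)
    return out.dropna(subset=data_cols, how="all").reset_index(drop=True)

raw = pd.DataFrame(
    [["W91CRB-24-C-0001", "1,250,000", ""], ["", "", ""]],
    columns=["Contract\nNumber", "Obligated\nAmount", None],
)
cleaned = clean_extracted_table(raw)
print(cleaned.columns.tolist())  # ['Contract Number', 'Obligated Amount', 'col_2']
print(len(cleaned))              # 1
```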

Ingesting External Data into Platforms

Once you have data from an API or file, you need to ingest it into your platform's storage layer for analysis at scale.

python

# Write a pandas DataFrame to a Delta table in Unity Catalog
# After fetching from an external API, this is the standard ingestion pattern

# Convert to Spark DataFrame — Delta rejects column names containing
# spaces, so normalize the API's "Award ID"-style field names first
df_clean = df_it_contracts.rename(
    columns=lambda c: c.strip().lower().replace(" ", "_")
)
df_spark = spark.createDataFrame(df_clean)

# Write to Bronze tier — raw data, minimally transformed
(
    df_spark
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("procurement_catalog.bronze.usaspending_it_fy2024")
)

print("Written to Bronze tier — raw, minimally processed")

# Add ingestion metadata — TBLPROPERTIES values must be literals,
# so interpolate the date rather than calling current_date()
from datetime import date

spark.sql(f"""
    ALTER TABLE procurement_catalog.bronze.usaspending_it_fy2024
    SET TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',
        'source_system' = 'USASpending.gov API v2',
        'ingestion_date' = '{date.today().isoformat()}',
        'classification' = 'Unclassified/Public'
    )
""")

Where This Goes Wrong

Failure Mode 1: Skipping the Classification Check

Loading a dataset without confirming its classification level. A dataset labelled "procurement" may include sole-source awards with SSNs, personnel records, or data elements that bump it from unclassified to CUI. Run a classification review and a PII scan before loading to any shared analytics workspace. Not optional.
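A column-name heuristic run before any load catches the most obvious red flags. It is not a substitute for a classification review or a content-level PII scan — just a tripwire. The indicator list here is illustrative; tune it to your agency's data dictionary:

```python
# Illustrative indicators — tune to your agency's data dictionary
SENSITIVE_INDICATORS = ("ssn", "social_security", "dob", "date_of_birth",
                        "salary", "home_address", "phone", "email")

def flag_sensitive_columns(columns) -> list[str]:
    """Return column names whose lowercased name contains a sensitive indicator."""
    flagged = []
    for col in columns:
        name = str(col).lower()
        if any(ind in name for ind in SENSITIVE_INDICATORS):
            flagged.append(col)
    return flagged

cols = ["award_id", "vendor_name", "POC_Email", "officer_ssn"]
print(flag_sensitive_columns(cols))  # ['POC_Email', 'officer_ssn']
```

Any flagged column means the load stops until the data owner rules on handling — a clean result means only that the column names look benign.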

Failure Mode 2: Treating API Results as Authoritative Without Reconciliation

The USASpending.gov API reflects contract data sourced from FPDS (the Federal Procurement Data System, now accessed through SAM.gov), which has known data quality issues — late reporting, missing modifications, and vendor name inconsistencies. For any analysis that will be presented to a contracting officer or program manager, reconcile against the official contracting system of record, not just the API. Note this limitation explicitly in your analysis.
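A reconciliation pass can be as simple as an outer merge with an indicator column, flagging awards present in only one source and matched awards whose amounts disagree. A sketch with toy data (a real reconciliation would also compare modification counts and dates):

```python
import pandas as pd

# Toy extracts: API pull vs. system-of-record export
api_df = pd.DataFrame({"award_id": ["A1", "A2", "A3"],
                       "amount":   [1_000_000, 2_500_000, 750_000]})
sor_df = pd.DataFrame({"award_id": ["A1", "A2", "A4"],
                       "amount":   [1_000_000, 2_600_000, 400_000]})

recon = api_df.merge(sor_df, on="award_id", how="outer",
                     suffixes=("_api", "_sor"), indicator=True)

# Awards present in only one source, and matched awards with amount drift
only_one_side = recon[recon["_merge"] != "both"]
mismatched = recon[(recon["_merge"] == "both")
                   & (recon["amount_api"] != recon["amount_sor"])]

print(f"{len(only_one_side)} award(s) missing from one source")    # 2
print(f"{len(mismatched)} matched award(s) with amount mismatch")  # 1
```

Every row in either frame belongs in the reconciliation memo that accompanies the analysis.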

Failure Mode 3: Acquiring Data You're Not Authorized to Use

Just because you can query a table in Unity Catalog does not mean you are authorized to use it for your specific analysis purpose. Data authorization on federal platforms is often role-based but purpose-limited — your credentials may grant read access for operational reporting purposes but not for building ML training datasets. When in doubt, get written authorization from the data steward. This protects both the data and your contract.

Platform Comparison

Dimension               | Advana (Databricks)            | Navy Jupiter                              | Palantir Foundry
----------------------- | ------------------------------ | ----------------------------------------- | ----------------------------------------
Data catalog            | Unity Catalog                  | Collibra + Unity Catalog                  | Ontology (Object Types)
Data tiers              | Bronze / Silver / Gold (Delta) | Bronze / Silver / Gold (Delta + Collibra) | Raw datasets → Foundry Objects
External API ingestion  | Notebooks + Delta write        | Notebooks + Delta write                   | Foundry Connectors or Code Repositories
Classification ceiling  | IL5 (ITAR/CUI)                 | IL5 (Navy NIPRNET + SIPRNET)              | IL5 (FedRAMP High)
Access request path     | CDAO access request + PKI      | DD 2875 + CAC access                      | Foundry data access request

Exercises

This chapter includes 6 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.

View Exercises on GitHub →