Data Acquisition and Government Data Sources
Federal data is classified, catalogued, and controlled. Getting the right data to the right analyst on the right platform requires navigating classification authorities, data catalog systems, and API authentication schemes that exist nowhere outside government. This chapter covers how to actually get the data.
Data Classification First
Before you write a single line of code to acquire data, you need to know its classification level. This is not a compliance formality — it determines which platform you can use, who can see the results, and whether you can even describe the data in a status update.
Data classification determines the acquisition path and authorized platforms.
Classification Levels That Matter
CUI (Controlled Unclassified Information) is the catch-all for sensitive but unclassified data. FOUO (For Official Use Only) is a legacy CUI sub-category still in widespread use. CUI requires access controls, markings, and authorized platforms — it cannot live in a shared Google Drive or a personal laptop without encryption.
PII (Personally Identifiable Information) and PHI (Protected Health Information) may be CUI or may have additional handling requirements under the Privacy Act, HIPAA, or DoD 5400.11. The classification level of a dataset does not tell you whether it contains PII — you must scan for it explicitly (see Chapter 04).
If you're not sure about a dataset's classification, ask the data owner — not a colleague who also isn't sure. Mishandling classified data is a federal crime, and "I didn't know" is not a defense. The data owner or your agency's information system security officer (ISSO) is the authoritative source.
Federal Open Data APIs
A substantial amount of useful federal data is publicly available via APIs that require no clearance — only an API key or no authentication at all. These are the starting points for procurement analysis, budget research, demographic work, and economic analysis.
USASpending.gov API
USASpending.gov is the public-facing portal for all federal spending data — contract awards, grants, loans, and other financial assistance. No API key or authentication required for public endpoints. Rate limits apply but are not formally published; for bulk data pulls, use the award data archive downloads rather than paginating the API.
import requests
import pandas as pd
import time

def fetch_usaspending_awards(
    fiscal_year: int,
    naics_codes: list,
    min_amount: float = 1_000_000,
    max_pages: int = 10,
) -> pd.DataFrame:
    """
    Fetch contract awards from the USASpending.gov API.
    Returns a DataFrame of awards matching the filters.
    Rate limits are not formally published; sleep briefly between
    pages to stay well clear of throttling.
    """
    url = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
    all_records = []
    for page in range(1, max_pages + 1):
        payload = {
            "filters": {
                "time_period": [{"start_date": f"{fiscal_year - 1}-10-01",
                                 "end_date": f"{fiscal_year}-09-30"}],
                "naics_codes": naics_codes,
                "award_type_codes": ["A", "B", "C", "D"],  # contract types
                "award_amounts": [{"lower_bound": min_amount}],
            },
            "fields": [
                "Award ID", "Recipient Name", "Award Amount",
                "NAICS Code", "NAICS Description",
                "Period of Performance Start Date",
                "Period of Performance Current End Date",
                "Awarding Agency", "Awarding Sub Agency",
            ],
            "page": page,
            "limit": 100,
            "sort": "Award Amount",
            "order": "desc",
        }
        resp = requests.post(url, json=payload, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        results = data.get("results", [])
        if not results:
            break
        all_records.extend(results)
        print(f"Page {page}: fetched {len(results)} records "
              f"(total so far: {len(all_records):,})")
        if page < data.get("page_metadata", {}).get("total_pages", 1):
            time.sleep(0.1)  # be polite between pages
        else:
            break
    return pd.DataFrame(all_records)
# Example: IT services (NAICS 54) contracts over $1M in FY2024
df_it_contracts = fetch_usaspending_awards(
fiscal_year=2024,
naics_codes=["541511", "541512", "541513", "541519"],
min_amount=1_000_000,
)
print(f"\nFetched {len(df_it_contracts):,} IT services contracts")
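Once the awards are in a DataFrame, the usual next step is aggregation. A minimal sketch, using a tiny hypothetical sample (`sample_awards`) with the same field names requested in the "fields" list above, rather than live API results:

```python
import pandas as pd

# Hypothetical rows shaped like spending_by_award results
sample_awards = pd.DataFrame([
    {"Award ID": "W91-001", "Recipient Name": "Acme Federal LLC",
     "Award Amount": 4_200_000, "Awarding Agency": "Department of Defense"},
    {"Award ID": "GS-002", "Recipient Name": "Beta Systems Inc",
     "Award Amount": 1_500_000, "Awarding Agency": "General Services Administration"},
    {"Award ID": "W91-003", "Recipient Name": "Acme Federal LLC",
     "Award Amount": 2_800_000, "Awarding Agency": "Department of Defense"},
])

# Total obligated dollars and award count by awarding agency, largest first
by_agency = (
    sample_awards
    .groupby("Awarding Agency")["Award Amount"]
    .agg(total="sum", awards="count")
    .sort_values("total", ascending=False)
)
print(by_agency)
```

The same groupby applies unchanged to a real `fetch_usaspending_awards` result, since the column names come straight from the requested fields.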
SAM.gov API
SAM.gov (System for Award Management) is the authoritative source for vendor registration data — Unique Entity Identifiers (UEIs), CAGE codes, business size certifications, and exclusion lists. A free API key is required; register at api.sam.gov.
import requests
import os

SAM_API_KEY = os.environ["SAM_API_KEY"]  # free, register at api.sam.gov

def lookup_vendor_by_uei(uei: str) -> dict:
    """
    Look up a vendor's SAM.gov registration by UEI.
    Returns registration data including business type, certifications,
    and registration status.
    """
    url = "https://api.sam.gov/entity-information/v3/entities"
    params = {
        "api_key": SAM_API_KEY,
        "ueiSAM": uei,
        "includeSections": "entityRegistration,coreData,assertions",
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    entities = data.get("entityData", [])
    if not entities:
        return {}
    entity = entities[0]
    reg = entity.get("entityRegistration", {})
    core = entity.get("coreData", {})
    return {
        "uei": reg.get("ueiSAM"),
        "legal_name": reg.get("legalBusinessName"),
        "registration_status": reg.get("registrationStatus"),
        "cage_code": reg.get("cageCode"),
        "expiration_date": reg.get("registrationExpirationDate"),
        "entity_type": core.get("entityInformation", {}).get("entityTypeDesc"),
        "business_types": core.get("businessTypes", {}).get("businessTypeList", []),
    }
# Check a vendor before awarding a contract
vendor = lookup_vendor_by_uei("ABCDE1234567")
print(f"Vendor: {vendor.get('legal_name')}")
print(f"Status: {vendor.get('registration_status')}") # Must be "Active"
print(f"CAGE: {vendor.get('cage_code')}")
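Vendor checks are usually more than a status read. A minimal sketch of a vetting helper that flags inactive or soon-to-expire registrations, operating on the dict shape that `lookup_vendor_by_uei` returns; the function name, thresholds, and sample record are illustrative assumptions, not a SAM.gov rule:

```python
from datetime import date, timedelta

def flag_vendor_risks(vendor: dict, today: date) -> list[str]:
    """Return human-readable risk flags for a vendor registration dict."""
    flags = []
    if vendor.get("registration_status") != "Active":
        flags.append("registration not active")
    exp = vendor.get("expiration_date")
    if exp:
        # Expiration dates here are assumed ISO-formatted (YYYY-MM-DD...)
        exp_date = date.fromisoformat(exp[:10])
        if exp_date < today:
            flags.append("registration expired")
        elif exp_date < today + timedelta(days=60):
            flags.append("registration expires within 60 days")
    return flags

sample = {"legal_name": "Acme Federal LLC", "registration_status": "Active",
          "expiration_date": "2024-07-01"}
print(flag_vendor_risks(sample, today=date(2024, 6, 1)))
# flags the upcoming expiration
```

Passing `today` explicitly keeps the check deterministic and easy to test; in production you would call it with `date.today()`.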
Census Bureau API
The Census API provides demographic, economic, and geographic data. Essential for workforce analysis, program area demographics, and any model requiring population denominators. Free API key at api.census.gov.
import requests
import pandas as pd
import os

CENSUS_KEY = os.environ["CENSUS_API_KEY"]  # free at api.census.gov

def fetch_acs_demographics(state_fips: str, year: int = 2022) -> pd.DataFrame:
    """
    Fetch American Community Survey 5-year estimates by county.
    Returns population, median age, education, and income by county.
    """
    url = f"https://api.census.gov/data/{year}/acs/acs5"
    variables = [
        "B01003_001E",  # Total population
        "B01002_001E",  # Median age
        "B15003_022E",  # Bachelor's degree (25+)
        "B15003_023E",  # Master's degree (25+)
        "B19013_001E",  # Median household income
        "NAME",
    ]
    params = {
        "get": ",".join(variables),
        "for": "county:*",
        "in": f"state:{state_fips}",
        "key": CENSUS_KEY,
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df.rename(columns={
        "B01003_001E": "total_population",
        "B01002_001E": "median_age",
        "B15003_022E": "bachelors_degree",
        "B15003_023E": "masters_degree",
        "B19013_001E": "median_household_income",
    })
    # Convert numeric columns from strings
    numeric_cols = ["total_population", "median_age", "bachelors_degree",
                    "masters_degree", "median_household_income"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    return df
# Virginia county demographics (FIPS 51)
va_demographics = fetch_acs_demographics("51", year=2022)
print(va_demographics.head())
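Population denominators are the point of pulling ACS data, so a derived rate usually follows. A minimal sketch on a hypothetical two-county sample shaped like `fetch_acs_demographics` output; note this uses total population as a rough denominator, whereas a proper attainment rate would use the population 25 and older (ACS variable B15003_001E):

```python
import pandas as pd

# Hypothetical sample with the columns produced by fetch_acs_demographics
df = pd.DataFrame({
    "NAME": ["Fairfax County, Virginia", "Arlington County, Virginia"],
    "total_population": [1_150_000, 238_000],
    "bachelors_degree": [260_000, 62_000],
    "masters_degree": [150_000, 45_000],
})

# Degrees held per 1,000 residents (rough proxy; see denominator caveat above)
df["degrees_per_1k"] = (
    (df["bachelors_degree"] + df["masters_degree"])
    / df["total_population"] * 1000
).round(1)
print(df[["NAME", "degrees_per_1k"]])
```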
Data Catalogs on Federal Platforms
Once inside a federal platform, you don't acquire data by calling external APIs — you query internal data catalogs. Each platform has its own catalog system.
Unity Catalog on Databricks
Unity Catalog is Databricks' centralized governance layer. On Advana and Jupiter, all authorized datasets are registered in Unity Catalog with three-part names: catalog.schema.table. You don't need to know where the data physically lives — query it by name.
-- Discover what's available in Unity Catalog
SHOW CATALOGS;
-- Browse schemas in a catalog
SHOW SCHEMAS IN advana_procurement;
-- List tables in a schema
SHOW TABLES IN advana_procurement.silver;
-- Examine a table's structure and metadata
DESCRIBE TABLE EXTENDED advana_procurement.silver.fpds_contract_actions;
-- Search the catalog for tables related to a topic
SELECT table_catalog, table_schema, table_name, comment
FROM system.information_schema.tables
WHERE lower(table_name) LIKE '%procurement%'
AND table_type = 'MANAGED'
ORDER BY table_catalog, table_schema, table_name;
Collibra on Navy Jupiter
Navy Jupiter uses Collibra as its primary data catalog alongside Unity Catalog. Collibra provides the business glossary, data stewardship workflows, and Bronze/Silver/Gold tier designations. When looking for an authoritative dataset on Jupiter, start in Collibra — it tells you which Unity Catalog table is the Gold-tier approved source for a given subject area, who the data steward is, and what access request form to submit.
Palantir Foundry Ontology
On Foundry, data discovery happens through the Ontology — the semantic layer that maps raw datasets to Object Types. Rather than looking for a table called usaspending_awards, you navigate to the Contract Object Type in the Ontology Browser and see all the underlying data, linked objects, and available properties. The data model is the catalog.
# Foundry: query objects through the Ontology SDK
import os
import pandas as pd
from foundry import FoundryClient
from foundry.ontology import Contract

client = FoundryClient(auth_token=os.environ["FOUNDRY_TOKEN"],
                       hostname=os.environ["FOUNDRY_HOSTNAME"])

# Load contracts over $5M from the Ontology
contracts = (
    client.ontology.objects.Contract
    .filter(Contract.obligation_amount.gt(5_000_000))
    .filter(Contract.fiscal_year.eq(2024))
    .select(["contract_id", "vendor_name", "obligation_amount",
             "naics_code", "period_of_performance_end_date"])
    .take(1000)
)
df = pd.DataFrame([c.as_dict() for c in contracts])
print(f"Fetched {len(df):,} high-value contracts from Foundry Ontology")
Messy File Formats in Government Data
Government data arrives in formats that reflect decades of incompatible systems: fixed-width text files from mainframes, FOIA-produced PDFs, Excel files with merged cells used as headers, and CSV exports with inconsistent encodings.
Fixed-Width Files
Many legacy government systems export data in fixed-width format — each field occupies a fixed column position, no delimiters. The format specification is usually in a separate data dictionary document.
import pandas as pd

# Fixed-width specification from data dictionary
# Format: (field_name, start_col, end_col) — 0-indexed, end is exclusive
FIELD_SPEC = [
    ("record_type", 0, 2),
    ("agency_code", 2, 6),
    ("bureau_code", 6, 10),
    ("program_code", 10, 16),
    ("obligation_amount", 16, 30),
    ("fiscal_year", 30, 34),
    ("action_date", 34, 42),  # YYYYMMDD
    ("award_type", 42, 44),
]
colspecs = [(start, end) for _, start, end in FIELD_SPEC]
col_names = [name for name, _, _ in FIELD_SPEC]

df = pd.read_fwf(
    "financial_transactions_fy2024.dat",
    colspecs=colspecs,
    names=col_names,
    dtype=str,            # Read everything as string first
    encoding="latin-1",   # Many legacy government files use latin-1, not UTF-8
    skiprows=1,           # Skip header record
)

# Parse dates and amounts after initial load
df["action_date"] = pd.to_datetime(df["action_date"], format="%Y%m%d", errors="coerce")
df["obligation_amount"] = pd.to_numeric(
    df["obligation_amount"].str.strip(), errors="coerce"
) / 100  # Often stored as integer cents
print(f"Loaded {len(df):,} financial transactions")
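To see the spec in action without a real mainframe extract, the same `colspecs` can be exercised against an in-memory sample. A minimal sketch with two synthetic 44-character records (the record contents are invented for illustration):

```python
import io
import pandas as pd

FIELD_SPEC = [
    ("record_type", 0, 2), ("agency_code", 2, 6), ("bureau_code", 6, 10),
    ("program_code", 10, 16), ("obligation_amount", 16, 30),
    ("fiscal_year", 30, 34), ("action_date", 34, 42), ("award_type", 42, 44),
]
colspecs = [(start, end) for _, start, end in FIELD_SPEC]
col_names = [name for name, _, _ in FIELD_SPEC]

# Two synthetic records, each exactly 44 characters wide
raw = io.StringIO(
    "0197000100PRG00100000012345000" "2024" "20240315" "DO" "\n"
    "0197000200PRG00200000000678900" "2024" "20241101" "DO" "\n"
)
df = pd.read_fwf(raw, colspecs=colspecs, names=col_names, dtype=str)

# Same post-load parsing as the real file: dates, then cents-to-dollars
df["action_date"] = pd.to_datetime(df["action_date"], format="%Y%m%d", errors="coerce")
df["obligation_amount"] = pd.to_numeric(
    df["obligation_amount"].str.strip(), errors="coerce"
) / 100
print(df[["agency_code", "obligation_amount", "action_date"]])
```

Reading everything as `dtype=str` first matters here: agency and program codes keep their leading zeros, and malformed amounts become NaN via `errors="coerce"` instead of crashing the load.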
Processing FOIA-Released PDFs
FOIA responses often arrive as PDFs — sometimes searchable, often not. For searchable PDFs, pdfplumber can extract tables directly. For scanned PDFs, you need OCR.
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path: str) -> list[pd.DataFrame]:
    """
    Extract all tables from a searchable PDF using pdfplumber.
    Returns a list of DataFrames, one per detected table per page.
    Works on native PDFs; scanned PDFs require OCR preprocessing.
    """
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for t_idx, table in enumerate(page_tables):
                if not table or len(table) < 2:
                    continue
                df = pd.DataFrame(table[1:], columns=table[0])
                df["_source_page"] = page_num
                df["_table_index"] = t_idx
                tables.append(df)
                print(f"  Page {page_num}, Table {t_idx}: "
                      f"{len(df)} rows x {len(df.columns)} columns")
    return tables
tables = extract_tables_from_pdf("foia_response_contracts.pdf")
print(f"Extracted {len(tables)} tables from PDF")
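A multi-page FOIA table usually needs to be stitched back into one frame, and pdfplumber often leaves newlines and stray whitespace in header cells, so the same logical column can arrive under slightly different names on different pages. A minimal sketch of a combining helper; the function name and sample tables are illustrative assumptions:

```python
import pandas as pd

def combine_extracted_tables(tables: list[pd.DataFrame]) -> pd.DataFrame:
    """Normalize header text and stack per-page tables that share a layout."""
    cleaned = []
    for df in tables:
        df = df.copy()
        # Collapse embedded newlines/whitespace, then snake_case the header
        df.columns = [" ".join(str(c).split()).lower().replace(" ", "_")
                      for c in df.columns]
        cleaned.append(df)
    return pd.concat(cleaned, ignore_index=True)

# Hypothetical per-page tables: same column, different header formatting
t1 = pd.DataFrame({"Contract\nNumber": ["A1"], "Amount": ["100"]})
t2 = pd.DataFrame({"Contract Number": ["B2"], "Amount": ["200"]})
combined = combine_extracted_tables([t1, t2])
print(combined)
```

Normalizing headers before `pd.concat` is what makes the two pages align into one `contract_number` column instead of two.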
Ingesting External Data into Platforms
Once you have data from an API or file, you need to ingest it into your platform's storage layer for analysis at scale.
from pyspark.sql import functions as F
import datetime

# Write a pandas DataFrame to a Delta table in Unity Catalog
# After fetching from an external API, this is the standard ingestion pattern

# Convert to Spark DataFrame
df_spark = spark.createDataFrame(df_it_contracts)

# Write to Bronze tier — raw data, minimally transformed
(
    df_spark
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("procurement_catalog.bronze.usaspending_it_fy2024")
)
print("Written to Bronze tier — raw, minimally processed")

# Add ingestion metadata. TBLPROPERTIES values must be string literals,
# so compute the ingestion date in Python rather than calling current_date()
ingestion_date = datetime.date.today().isoformat()
spark.sql(f"""
    ALTER TABLE procurement_catalog.bronze.usaspending_it_fy2024
    SET TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',
        'source_system' = 'USASpending.gov API v2',
        'ingestion_date' = '{ingestion_date}',
        'classification' = 'Unclassified/Public'
    )
""")
Where This Goes Wrong
Failure Mode 1: Skipping the Classification Check
Loading a dataset without confirming its classification level. A dataset labelled "procurement" may include sole-source awards with SSNs, personnel records, or data elements that bump it from unclassified to CUI. Run a classification review and a PII scan before loading to any shared analytics workspace. Not optional.
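A PII scan does not need to be elaborate to catch the obvious cases before data lands in a shared workspace. A minimal illustrative sketch that counts SSN-shaped strings per column; a real review uses broader pattern sets and your agency's approved scanning tooling, not one regex:

```python
import re
import pandas as pd

# SSN-shaped pattern (NNN-NN-NNNN); illustrative, not an approved PII ruleset
SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

def scan_for_ssns(df: pd.DataFrame) -> dict[str, int]:
    """Count SSN-pattern hits per string column; any nonzero hit means
    stop the load and escalate to the data owner."""
    hits = {}
    for col in df.select_dtypes(include=["object", "string"]).columns:
        n = df[col].astype(str).str.count(SSN_PATTERN).sum()
        if n:
            hits[col] = int(n)
    return hits

sample = pd.DataFrame({
    "vendor": ["Acme Federal LLC"],
    "notes": ["POC SSN 123-45-6789 per mod 3"],
})
print(scan_for_ssns(sample))
```

Running a check like this on every inbound file is cheap; remediating a spill after CUI lands in an unclassified workspace is not.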
Failure Mode 2: Treating API Results as Authoritative Without Reconciliation
The USASpending.gov API reflects contract data sourced from FPDS-NG (the Federal Procurement Data System — Next Generation), which has known data quality issues — late reporting, missing modifications, and vendor name inconsistencies. For any analysis that will be presented to a contracting officer or program manager, reconcile against the official contracting system of record, not just the API. Note this limitation explicitly in your analysis.
Failure Mode 3: Acquiring Data You're Not Authorized to Use
Just because you can query a table in Unity Catalog does not mean you are authorized to use it for your specific analysis purpose. Data authorization on federal platforms is often role-based but purpose-limited — your credentials may grant read access for operational reporting purposes but not for building ML training datasets. When in doubt, get written authorization from the data steward. This protects both the data and your contract.
Platform Comparison
| Dimension | Advana (Databricks) | Navy Jupiter | Palantir Foundry |
|---|---|---|---|
| Data catalog | Unity Catalog | Collibra + Unity Catalog | Ontology (Object Types) |
| Data tiers | Bronze / Silver / Gold (Delta) | Bronze / Silver / Gold (Delta + Collibra) | Raw datasets → Foundry Objects |
| External API ingestion | Notebooks + Delta write | Notebooks + Delta write | Foundry Connectors or Code Repositories |
| Classification ceiling | IL5 (ITAR/CUI) | IL5 (Navy NIPRNET + SIPRNET) | IL5 (FedRAMP High) |
| Access request path | CDAO access request + PKI | DD 2875 + CAC access | Foundry data access request |
Exercises
This chapter includes 6 hands-on exercises with full solutions — coding challenges, analysis tasks, and scenario-based problems.
View Exercises on GitHub →