Methodology
How CareGraph processes, validates, and presents public healthcare data.
Our Approach: Trust-First Design
CareGraph exists because public healthcare data should be publicly accessible — not locked behind vendor portals, paywalls, or impenetrable government websites. Every number on this site comes from an official CMS, CDC, or Medicaid dataset. We do not editorialize the data; we clean it, connect it, and present it with full provenance.
Trust-first design means three things:
- Provenance on every page. Every entity page shows exactly which dataset produced each number, when it was downloaded, and how many rows were in the source file.
- Limitations up front. We document known caveats, suppression rules, and coverage gaps for every dataset — not buried in footnotes, but in dedicated methodology pages linked from every entity page.
- Reproducible pipeline. The entire ETL pipeline is open source. Anyone can re-run it, verify the outputs, and file an error report if something looks wrong.
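Concretely, provenance can be pictured as a small record attached to each published number. The sketch below is illustrative only; the field names and values are assumptions, not CareGraph's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Provenance:
    """Illustrative provenance envelope; CareGraph's real schema may differ."""
    dataset_id: str
    source_url: str
    downloaded_on: str  # ISO date the raw CSV was fetched
    source_rows: int    # row count of the raw source file
    etl_version: str

env = Provenance(
    dataset_id="hospital_general_info",               # hypothetical ID
    source_url="https://data.cms.gov/provider-data",  # hypothetical URL
    downloaded_on="2026-03-01",
    source_rows=5384,                                 # placeholder count
    etl_version="1.0.0",
)
print(json.dumps(asdict(env), indent=2))
```

Serializing the envelope alongside each enriched field is what lets every entity page display its own audit trail without extra lookups.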
Datasets Overview
CareGraph ingests 29 datasets from CMS, CDC, and Medicaid, organized across 7 entity types. Each dataset has a dedicated methodology page describing its contents, join strategy, known limitations, and data quality notes.
Each dataset is classified by its role: base datasets create entity pages, enrichment datasets add measures to existing pages, and cross-link datasets connect entities across types.
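The three roles can be captured in a tiny registry keyed by dataset ID. The entries below are illustrative stand-ins, not the pipeline's actual identifiers:

```python
from enum import Enum

class Role(Enum):
    BASE = "base"               # creates entity pages
    ENRICHMENT = "enrichment"   # adds measures to existing pages
    CROSS_LINK = "cross-link"   # connects entities across types

# Hypothetical registry entries; real dataset IDs live in the ETL config.
REGISTRY = {
    "hospital_general_info": ("hospital", Role.BASE),
    "hcahps_hospital": ("hospital", Role.ENRICHMENT),
    "aco_participants": ("aco", Role.CROSS_LINK),
}

def base_dataset(entity_type: str) -> str:
    """Find the dataset that creates pages for an entity type."""
    for ds, (etype, role) in REGISTRY.items():
        if etype == entity_type and role is Role.BASE:
            return ds
    raise KeyError(f"no base dataset for {entity_type}")
```

Because each entity type has exactly one base dataset, lookups like this are unambiguous; enrichment and cross-link datasets can only attach to pages the base dataset has already created.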
Hospitals (11 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Hospital General Information | base | CMS | FY2026 | View details |
| Hospital Readmissions Reduction Program | enrichment | CMS | FY2026 | View details |
| Hospital Value-Based Purchasing TPS | enrichment | CMS | FY2026 | View details |
| Timely and Effective Care — Hospital | enrichment | CMS | FY2026 | View details |
| Complications and Deaths — Hospital | enrichment | CMS | FY2026 | View details |
| Patient Survey (HCAHPS) — Hospital | enrichment | CMS | FY2026 | View details |
| Healthcare Associated Infections — Hospital | enrichment | CDC | FY2026 | View details |
| Unplanned Hospital Visits — Hospital | enrichment | CMS | FY2026 | View details |
| Medicare Spending Per Beneficiary — Hospital | enrichment | CMS | FY2026 | View details |
| Hospital Provider Cost Report | enrichment | CMS | FY2023 | View details |
| Hospital-Acquired Condition (HAC) Reduction Program | enrichment | CMS | FY2026 | View details |
Skilled Nursing Facilities (6 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Nursing Home Provider Info | base | CMS | Mar 2026 | View details |
| SNF Quality Measures (MDS) | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Penalties | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Health Deficiencies | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Ownership | enrichment | CMS | Mar 2026 | View details |
| Skilled Nursing Facility Cost Report | enrichment | CMS | FY2023 | View details |
Counties (3 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Geographic Variation by County | base | CMS | 2014–2023 | View details |
| CDC PLACES County-Level Data | enrichment | CDC | 2023 Release | View details |
| CDC SDOH Measures for County | enrichment | CDC | 2023 Release | View details |
ACOs (4 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| MSSP ACO Performance PY2024 | base | CMS | PY2024 | View details |
| ACO Participants | cross-link | CMS | PY2024 | View details |
| ACO SNF Affiliates | cross-link | CMS | PY2024 | View details |
| ACO Assigned Beneficiaries by County | cross-link | CMS | PY2023 | View details |
Drugs (4 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Part D Spending by Drug | base | CMS | CY2023 | View details |
| Medicare Part B Spending by Drug | enrichment | CMS | CY2023 | View details |
| Medicare Part B Discarded Drug Units | enrichment | CMS | CY2023 | View details |
| NADAC National Average Drug Acquisition Cost | enrichment | Medicaid | 2026 (weekly) | View details |
Conditions
Condition entity pages are derived from the CDC PLACES County-Level Data dataset by aggregating prevalence estimates across all counties for each health measure. See the Counties section for the source dataset methodology.
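One plausible shape for that roll-up is a population-weighted mean per measure. This sketch assumes county rows carry a prevalence estimate and a population weight; the production pipeline's weighting may differ:

```python
from collections import defaultdict

def aggregate_conditions(rows):
    """Population-weighted prevalence per PLACES measure.

    rows: iterable of (measure, county_fips, prevalence_pct, population).
    Counties with a suppressed (None) estimate are skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # measure -> [weighted sum, pop]
    for measure, _fips, prev, pop in rows:
        if prev is None:  # suppressed or missing county estimate
            continue
        totals[measure][0] += prev * pop
        totals[measure][1] += pop
    return {m: ws / p for m, (ws, p) in totals.items() if p}

rows = [
    ("Diabetes", "01001", 12.0, 50_000),
    ("Diabetes", "01003", 10.0, 150_000),
]
print(aggregate_conditions(rows))  # {'Diabetes': 10.5}
```

Skipping suppressed counties (rather than treating them as zero) matters: imputing zeros would systematically bias national estimates downward for measures with heavy rural suppression.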
DRGs (1 dataset)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Inpatient Hospitals by Provider and Service (DRG) | base | CMS | CY2023 | View details |
Limitations Ledger
Across all datasets, the following systemic limitations apply:
- Medicare FFS only. Most CMS datasets cover Medicare fee-for-service beneficiaries. Medicare Advantage enrollees (over 50% of Medicare beneficiaries as of 2024) are generally excluded. This is not a sampling choice — MA plans are not required to report claims-level data to CMS in the same way FFS does.
- Reporting lag. CMS data typically reflects a measurement period 12–24 months prior to the data release date. Quality metrics like star ratings, readmission ratios, and VBP scores use multi-year measurement windows that lag even further.
- Cell-size suppression. To protect patient privacy, CMS suppresses values when the underlying count is fewer than 11 (sometimes 25 for specific programs). This disproportionately affects rural providers and small counties.
- Facility vs. system. CMS data is organized by individual facility CCN, not by health system. A hospital system operating 15 facilities appears as 15 separate entities. System-level aggregation is not currently supported in CareGraph.
- Model-based estimates. CDC PLACES county-level prevalence estimates are model-derived (MRP from BRFSS), not direct measurements. Confidence intervals should be consulted, especially for small-population counties.
- Claims-based identification. Chronic conditions, complications, and readmissions are identified from billing claims, not clinical records. Coding practices vary across providers and can affect reported rates independently of actual clinical events.
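In practice, suppressed cells reach a pipeline as marker strings rather than numbers. A minimal parse sketch follows; the marker set here is an assumption, since each CMS file documents its own conventions:

```python
SUPPRESSED = {"*", "Not Available", ""}  # assumed markers; varies by dataset

def parse_measure(raw):
    """Return a float, or None for a suppressed cell.

    CMS withholds values built on fewer than 11 cases (25 in some
    programs), so small rural providers often end up as None here.
    """
    if raw is None or raw.strip() in SUPPRESSED:
        return None
    return float(raw)

assert parse_measure("14.2") == 14.2
assert parse_measure("*") is None
```

Keeping suppression as `None` (instead of 0 or a sentinel number) preserves the distinction between "measured as zero" and "withheld", which matters for every downstream average and benchmark.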
Build Pipeline: From Raw Data to Published Site
Producing CareGraph end-to-end takes three orchestrated stages: a Python data pipeline, an AI-assisted editorial pipeline, and a static site build deployed to GitHub Pages. Every stage is open source, idempotent, and reproducible from the same raw CSVs.
Stage 1 — Data pipeline (etl/run.py)
A single Python orchestrator walks through roughly eighteen numbered steps, grouped below by phase. In order:
- Acquire. Downloads 29 raw CSV files from the data.cms.gov, data.cdc.gov, and data.medicaid.gov APIs. Files are stored with date-stamped names (`{dataset_id}_{YYYY-MM-DD}.csv`) under `data/raw/`; re-running on the same day skips already-downloaded files. Previous vintages are preserved rather than overwritten.
- Normalize join keys. CCNs are zero-padded to 6 characters, FIPS codes to 5 digits, NPIs to 10 digits. SSA state/county codes are converted to FIPS.
- Build base entity manifests for all 7 entity types — Hospitals, SNFs, Counties, ACOs, Drugs, Conditions, and DRGs — one JSON manifest per entity, written to `site_data/{entity}/{id}.json`.
- Enrich entities with supplementary datasets: HRRP/HVBP penalties, HCAHPS patient experience, HAI infection ratios, complications, timely & effective care, MSPB spending, HAC reduction, hospital and SNF cost reports, nursing-home penalties, deficiencies, and ownership, ACO participants and SNF affiliates, county-level CDC PLACES prevalence and SDOH indicators, Part D / Part B drug spending, Part B discarded units, and NADAC acquisition prices.
- Cross-link entities — hospitals ↔ ACOs, SNFs ↔ ACOs, ACOs ↔ counties (beneficiary assignment), hospitals ↔ counties, drugs ↔ DRGs. Cross-links are written back into each side's manifest so entity pages render without runtime joins.
- Compute benchmarks and peer cohorts for ACOs, hospitals, and counties. Peer cohorts are built by matching ACOs on assigned-beneficiary size, region, and track so individual ACO pages can show "how does this compare to similar ACOs?" without client-side computation.
- Derive precomputed views. Search index, map choropleth layers (county-level), compare-mode summaries, and the paginated Explore-table indexes are all built as static JSON artifacts.
- Validate & package. Pydantic schemas validate every manifest. A provenance envelope (dataset ID, download date, source URL, row count, ETL version) is attached to each enriched field. The pipeline writes `site_data/index.json` with run-level metadata.
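The join-key normalization step above can be sketched as below (the SSA-to-FIPS conversion needs a crosswalk table and is omitted here):

```python
def normalize_ccn(ccn: str) -> str:
    """CMS Certification Numbers: zero-pad to 6 characters."""
    return ccn.strip().zfill(6)

def normalize_fips(fips: str) -> str:
    """County FIPS codes: zero-pad to 5 digits. Leading zeros are
    commonly lost when CSVs pass through spreadsheet tools."""
    return fips.strip().zfill(5)

def normalize_npi(npi: str) -> str:
    """NPIs: zero-pad to 10 digits."""
    return npi.strip().zfill(10)

assert normalize_ccn("10001") == "010001"
assert normalize_fips("1001") == "01001"  # Autauga County, AL
```

Normalizing every key once, at ingest, is what makes the later enrichment and cross-link steps simple dictionary joins instead of fuzzy matches.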
Stage 2 — Editorial pipeline (etl/editorial/run.py)
Every dataset gets a dedicated methodology page. These pages are drafted by a large language model under tight constraints, not hand-written, so they can scale to dozens of datasets while staying consistent in structure. The editorial orchestrator:
- Iterates each registered dataset and loads a shared prompt template (`etl/editorial/prompts/dataset_methodology.txt`) plus a dataset-specific hints file (`etl/editorial/hints/{dataset_id}.md`) curated by the maintainer to pin down known caveats, join keys, suppression rules, and reporting lags.
- Substitutes dataset metadata (name, entity type, row count from the ETL's `index.json`, join-key format, field count, suppression notes) into the template and invokes the Claude CLI (`claude -p`) to generate the page.
- Validates the generated markdown against required sections (Overview, Join Strategy, Known Limitations, Data Quality Notes), minimum length, and forbidden-phrase rules. Failed validation is flagged, not silently accepted.
- Writes a checkpoint (`etl/editorial/.checkpoint.json`) so successful generations are not regenerated on the next run — keeping costs bounded and making the pipeline resumable after transient failures. A `--force` flag regenerates everything; `--skip-ai` produces template placeholders without calling the model.
- Stage 1 then copies `etl/editorial/output/*.md` into `site_data/editorial/`, where the Astro build picks them up.
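The validation gate can be sketched as follows; the length threshold and forbidden phrases here are stand-ins for whatever the repository actually configures:

```python
REQUIRED_SECTIONS = ("Overview", "Join Strategy",
                     "Known Limitations", "Data Quality Notes")
FORBIDDEN = ("as an ai language model",)  # assumed; real list lives in the repo
MIN_CHARS = 1500                          # assumed threshold

def validate_page(md: str) -> list:
    """Return a list of failures; an empty list means the page passes."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if f"## {section}" not in md:
            errors.append(f"missing section: {section}")
    if len(md) < MIN_CHARS:
        errors.append(f"too short: {len(md)} < {MIN_CHARS} chars")
    for phrase in FORBIDDEN:
        if phrase in md.lower():
            errors.append(f"forbidden phrase: {phrase!r}")
    return errors
```

Structural checks like these are what make model-drafted pages safe to publish at scale: a page that passes is guaranteed to have the sections readers expect, even though its prose came from a model.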
All 29 datasets currently have a generated methodology page. The editorial layer describes the data; it does not invent, reinterpret, or summarize any numbers shown on entity pages.
Stage 3 — Static site build & deploy
- Astro static build (`site/`, run with `npm run build`). Astro reads the JSON manifests in `site_data/` plus the generated editorial markdown and renders a static HTML page for every entity, dataset methodology page, explore table, map, and index. No server-side rendering, no runtime database, no API calls from the browser — just precomputed HTML and JSON.
- Commit & push. The built `site/dist/` directory is checked into the repository on `main`.
- GitHub Pages deploy. A GitHub Actions workflow (`.github/workflows/deploy.yml`) publishes `site/dist/` to GitHub Pages on every push. The custom domain caregraph.org is configured via `site/public/CNAME`.
The full pipeline — from an empty checkout to a deployable site — is a handful of commands: `pip install -e .`, `python etl/run.py`, `python etl/editorial/run.py`, `cd site && npm install && npm run build`.
Anyone can re-run it against the same raw CSVs and verify the outputs byte-for-byte.
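One simple way to check byte-for-byte equality between two runs is to digest every output file and compare the results; a minimal sketch:

```python
import hashlib
import pathlib

def digest_tree(root):
    """SHA-256 of every file under root, keyed by relative path.

    Two reproducible runs over the same raw CSVs should produce
    identical dictionaries.
    """
    base = pathlib.Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }

# Compare two checkouts' outputs (paths are illustrative):
# assert digest_tree("run_a/site_data") == digest_tree("run_b/site_data")
```

Byte-level comparison is stricter than value-level comparison: it also catches nondeterministic key ordering or timestamps leaking into the artifacts, which would undermine the reproducibility claim.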
Report an Error
If you find incorrect data, a missing entity, or a methodology concern, please file an error report on GitHub. We review every report and publish corrections with full changelog entries.