Methodology
How CareGraph processes, validates, and presents public healthcare data.
Our Approach: Trust-First Design
CareGraph exists because public healthcare data should be publicly accessible — not locked behind vendor portals, paywalls, or impenetrable government websites. Every number on this site comes from an official CMS, CDC, or Medicaid dataset. We do not editorialize the data; we clean it, connect it, and present it with full provenance.
Trust-first design means three things:
- Provenance on every page. Every entity page shows exactly which dataset produced each number, when it was downloaded, and how many rows were in the source file.
- Limitations up front. We document known caveats, suppression rules, and coverage gaps for every dataset — not buried in footnotes, but in dedicated methodology pages linked from every entity page.
- Reproducible pipeline. The entire ETL pipeline is open source. Anyone can re-run it, verify the outputs, and file an error report if something looks wrong.
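Concretely, provenance can be pictured as a small record attached to each published number. The sketch below is illustrative only; the field names and values are assumptions, not CareGraph's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Provenance:
    """Illustrative provenance envelope; CareGraph's real schema may differ."""
    dataset_id: str
    source_url: str
    downloaded_on: str  # ISO date the raw CSV was fetched
    source_rows: int    # row count of the raw source file
    etl_version: str

env = Provenance(
    dataset_id="hospital_general_info",               # hypothetical ID
    source_url="https://data.cms.gov/provider-data",  # hypothetical URL
    downloaded_on="2026-03-01",
    source_rows=5384,                                 # placeholder count
    etl_version="1.0.0",
)
print(json.dumps(asdict(env), indent=2))
```

Serializing the envelope alongside each enriched field is what lets every entity page display its own audit trail without extra lookups.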
Datasets Overview
CareGraph ingests 29 datasets from CMS, CDC, and Medicaid, organized across 7 entity types. Each dataset has a dedicated methodology page describing its contents, join strategy, known limitations, and data quality notes.
Each dataset is classified by its role: base datasets create entity pages, enrichment datasets add measures to existing pages, and cross-link datasets connect entities across types.
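The three roles can be captured in a tiny registry keyed by dataset ID. The entries below are illustrative stand-ins, not the pipeline's actual identifiers:

```python
from enum import Enum

class Role(Enum):
    BASE = "base"               # creates entity pages
    ENRICHMENT = "enrichment"   # adds measures to existing pages
    CROSS_LINK = "cross-link"   # connects entities across types

# Hypothetical registry entries; real dataset IDs live in the ETL config.
REGISTRY = {
    "hospital_general_info": ("hospital", Role.BASE),
    "hcahps_hospital": ("hospital", Role.ENRICHMENT),
    "aco_participants": ("aco", Role.CROSS_LINK),
}

def base_dataset(entity_type: str) -> str:
    """Find the dataset that creates pages for an entity type."""
    for ds, (etype, role) in REGISTRY.items():
        if etype == entity_type and role is Role.BASE:
            return ds
    raise KeyError(f"no base dataset for {entity_type}")
```

Because each entity type has exactly one base dataset, lookups like this are unambiguous; enrichment and cross-link datasets can only attach to pages the base dataset has already created.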
Hospitals (11 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Hospital General Information | base | CMS | FY2026 | View details |
| Hospital Readmissions Reduction Program | enrichment | CMS | FY2026 | View details |
| Hospital Value-Based Purchasing TPS | enrichment | CMS | FY2026 | View details |
| Timely and Effective Care — Hospital | enrichment | CMS | FY2026 | View details |
| Complications and Deaths — Hospital | enrichment | CMS | FY2026 | View details |
| Patient Survey (HCAHPS) — Hospital | enrichment | CMS | FY2026 | View details |
| Healthcare Associated Infections — Hospital | enrichment | CDC | FY2026 | View details |
| Unplanned Hospital Visits — Hospital | enrichment | CMS | FY2026 | View details |
| Medicare Spending Per Beneficiary — Hospital | enrichment | CMS | FY2026 | View details |
| Hospital Provider Cost Report | enrichment | CMS | FY2023 | View details |
| Hospital-Acquired Condition (HAC) Reduction Program | enrichment | CMS | FY2026 | View details |
Skilled Nursing Facilities (6 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Nursing Home Provider Info | base | CMS | Mar 2026 | View details |
| SNF Quality Measures (MDS) | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Penalties | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Health Deficiencies | enrichment | CMS | Mar 2026 | View details |
| Nursing Home Ownership | enrichment | CMS | Mar 2026 | View details |
| Skilled Nursing Facility Cost Report | enrichment | CMS | FY2023 | View details |
Counties (3 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Geographic Variation by County | base | CMS | 2014–2023 | View details |
| CDC PLACES County-Level Data | enrichment | CDC | 2023 Release | View details |
| CDC SDOH Measures for County | enrichment | CDC | 2023 Release | View details |
ACOs (4 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| MSSP ACO Performance PY2024 | base | CMS | PY2024 | View details |
| ACO Participants | cross-link | CMS | PY2024 | View details |
| ACO SNF Affiliates | cross-link | CMS | PY2024 | View details |
| ACO Assigned Beneficiaries by County | cross-link | CMS | PY2023 | View details |
Drugs (4 datasets)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Part D Spending by Drug | base | CMS | CY2023 | View details |
| Medicare Part B Spending by Drug | enrichment | CMS | CY2023 | View details |
| Medicare Part B Discarded Drug Units | enrichment | CMS | CY2023 | View details |
| NADAC National Average Drug Acquisition Cost | enrichment | Medicaid | 2026 (weekly) | View details |
Conditions
Condition entity pages are derived from the CDC PLACES County-Level Data dataset by aggregating prevalence estimates across all counties for each health measure. See the Counties section for the source dataset methodology.
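One plausible shape for that roll-up is a population-weighted mean per measure. This sketch assumes county rows carry a prevalence estimate and a population weight; the production pipeline's weighting may differ:

```python
from collections import defaultdict

def aggregate_conditions(rows):
    """Population-weighted prevalence per PLACES measure.

    rows: iterable of (measure, county_fips, prevalence_pct, population).
    Counties with a suppressed (None) estimate are skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # measure -> [weighted sum, pop]
    for measure, _fips, prev, pop in rows:
        if prev is None:  # suppressed or missing county estimate
            continue
        totals[measure][0] += prev * pop
        totals[measure][1] += pop
    return {m: ws / p for m, (ws, p) in totals.items() if p}

rows = [
    ("Diabetes", "01001", 12.0, 50_000),
    ("Diabetes", "01003", 10.0, 150_000),
]
print(aggregate_conditions(rows))  # {'Diabetes': 10.5}
```

Skipping suppressed counties (rather than treating them as zero) matters: imputing zeros would systematically bias national estimates downward for measures with heavy rural suppression.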
DRGs (1 dataset)
| Dataset | Role | Source | Vintage | Methodology |
|---|---|---|---|---|
| Medicare Inpatient Hospitals by Provider and Service (DRG) | base | CMS | CY2023 | View details |
Limitations Ledger
Across all datasets, the following systemic limitations apply:
- Medicare FFS only. Most CMS datasets cover Medicare fee-for-service beneficiaries. Medicare Advantage enrollees (over 50% of Medicare beneficiaries as of 2024) are generally excluded. This is not a sampling choice — MA plans are not required to report claims-level data to CMS in the same way FFS does.
- Reporting lag. CMS data typically reflects a measurement period 12–24 months prior to the data release date. Quality metrics like star ratings, readmission ratios, and VBP scores use multi-year measurement windows that lag even further.
- Cell-size suppression. To protect patient privacy, CMS suppresses values when the underlying count is fewer than 11 (sometimes 25 for specific programs). This disproportionately affects rural providers and small counties.
- Facility vs. system. CMS data is organized by individual facility CCN, not by health system. A hospital system operating 15 facilities appears as 15 separate entities. System-level aggregation is not currently supported in CareGraph.
- Model-based estimates. CDC PLACES county-level prevalence estimates are model-derived (MRP from BRFSS), not direct measurements. Confidence intervals should be consulted, especially for small-population counties.
- Claims-based identification. Chronic conditions, complications, and readmissions are identified from billing claims, not clinical records. Coding practices vary across providers and can affect reported rates independently of actual clinical events.
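In practice, suppressed cells reach a pipeline as marker strings rather than numbers. A minimal parse sketch follows; the marker set here is an assumption, since each CMS file documents its own conventions:

```python
SUPPRESSED = {"*", "Not Available", ""}  # assumed markers; varies by dataset

def parse_measure(raw):
    """Return a float, or None for a suppressed cell.

    CMS withholds values built on fewer than 11 cases (25 in some
    programs), so small rural providers often end up as None here.
    """
    if raw is None or raw.strip() in SUPPRESSED:
        return None
    return float(raw)

assert parse_measure("14.2") == 14.2
assert parse_measure("*") is None
```

Keeping suppression as `None` (instead of 0 or a sentinel number) preserves the distinction between "measured as zero" and "withheld", which matters for every downstream average and benchmark.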
Build Pipeline: From Raw Data to Published Site
Producing CareGraph end-to-end takes three orchestrated stages: a Python data pipeline, an AI-assisted editorial pipeline, and a static site build deployed to GitHub Pages. Every stage is open source, idempotent, and reproducible from the same raw CSVs.
Stage 1 — Data pipeline (etl/run.py)
A single Python orchestrator walks through roughly eighteen numbered steps, grouped below by phase. In order:
- Acquire. Downloads 29 raw CSV files from the data.cms.gov, data.cdc.gov, and data.medicaid.gov APIs. Files are stored with date-stamped names (`{dataset_id}_{YYYY-MM-DD}.csv`) under `data/raw/`; re-running on the same day skips already-downloaded files. Previous vintages are preserved rather than overwritten.
- Normalize join keys. CCNs are zero-padded to 6 characters, FIPS codes to 5 digits, NPIs to 10 digits. SSA state/county codes are converted to FIPS.
- Build base entity manifests for all 7 entity types — Hospitals, SNFs, Counties, ACOs, Drugs, Conditions, and DRGs — one JSON manifest per entity, written to `site_data/{entity}/{id}.json`.
- Enrich entities with supplementary datasets: HRRP/HVBP penalties, HCAHPS patient experience, HAI infection ratios, complications, timely & effective care, MSPB spending, HAC reduction, hospital and SNF cost reports, nursing-home penalties, deficiencies, and ownership, ACO participants and SNF affiliates, county-level CDC PLACES prevalence and SDOH indicators, Part D / Part B drug spending, Part B discarded units, and NADAC acquisition prices.
- Cross-link entities — hospitals ↔ ACOs, SNFs ↔ ACOs, ACOs ↔ counties (beneficiary assignment), hospitals ↔ counties, drugs ↔ DRGs. Cross-links are written back into each side's manifest so entity pages render without runtime joins.
- Compute benchmarks and peer cohorts for ACOs, hospitals, and counties. Peer cohorts are built by matching ACOs on assigned-beneficiary size, region, and track so individual ACO pages can show "how does this compare to similar ACOs?" without client-side computation.
- Derive precomputed views. Search index, map choropleth layers (county-level), compare-mode summaries, and the paginated Explore-table indexes are all built as static JSON artifacts.
- Validate & package. Pydantic schemas validate every manifest. A provenance envelope (dataset ID, download date, source URL, row count, ETL version) is attached to each enriched field. The pipeline writes `site_data/index.json` with run-level metadata.
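The join-key normalization step above can be sketched as below (the SSA-to-FIPS conversion needs a crosswalk table and is omitted here):

```python
def normalize_ccn(ccn: str) -> str:
    """CMS Certification Numbers: zero-pad to 6 characters."""
    return ccn.strip().zfill(6)

def normalize_fips(fips: str) -> str:
    """County FIPS codes: zero-pad to 5 digits. Leading zeros are
    commonly lost when CSVs pass through spreadsheet tools."""
    return fips.strip().zfill(5)

def normalize_npi(npi: str) -> str:
    """NPIs: zero-pad to 10 digits."""
    return npi.strip().zfill(10)

assert normalize_ccn("10001") == "010001"
assert normalize_fips("1001") == "01001"  # Autauga County, AL
```

Normalizing every key once, at ingest, is what makes the later enrichment and cross-link steps simple dictionary joins instead of fuzzy matches.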
Stage 2 — Editorial pipeline (etl/editorial/run.py)
Every dataset gets a dedicated methodology page. These pages are drafted by a large language model under tight constraints, not hand-written, so they can scale to dozens of datasets while staying consistent in structure. The editorial orchestrator:
- Iterates each registered dataset and loads a shared prompt template (`etl/editorial/prompts/dataset_methodology.txt`) plus a dataset-specific hints file (`etl/editorial/hints/{dataset_id}.md`) curated by the maintainer to pin down known caveats, join keys, suppression rules, and reporting lags.
- Substitutes dataset metadata (name, entity type, row count from the ETL's `index.json`, join-key format, field count, suppression notes) into the template and invokes the Claude CLI (`claude -p`) to generate the page.
- Validates the generated markdown against required sections (Overview, Join Strategy, Known Limitations, Data Quality Notes), minimum length, and forbidden-phrase rules. Failed validation is flagged, not silently accepted.
- Writes a checkpoint (`etl/editorial/.checkpoint.json`) so successful generations are not regenerated on the next run — keeping costs bounded and making the pipeline resumable after transient failures. A `--force` flag regenerates everything; `--skip-ai` produces template placeholders without calling the model.
- Stage 1 then copies `etl/editorial/output/*.md` into `site_data/editorial/`, where the Astro build picks them up.
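The validation gate can be sketched as follows; the length threshold and forbidden phrases here are stand-ins for whatever the repository actually configures:

```python
REQUIRED_SECTIONS = ("Overview", "Join Strategy",
                     "Known Limitations", "Data Quality Notes")
FORBIDDEN = ("as an ai language model",)  # assumed; real list lives in the repo
MIN_CHARS = 1500                          # assumed threshold

def validate_page(md: str) -> list:
    """Return a list of failures; an empty list means the page passes."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if f"## {section}" not in md:
            errors.append(f"missing section: {section}")
    if len(md) < MIN_CHARS:
        errors.append(f"too short: {len(md)} < {MIN_CHARS} chars")
    for phrase in FORBIDDEN:
        if phrase in md.lower():
            errors.append(f"forbidden phrase: {phrase!r}")
    return errors
```

Structural checks like these are what make model-drafted pages safe to publish at scale: a page that passes is guaranteed to have the sections readers expect, even though its prose came from a model.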
All 29 datasets currently have a generated methodology page. The editorial layer describes the data; it does not invent, reinterpret, or summarize any numbers shown on entity pages.
Stage 3 — Static site build & deploy
- Astro static build (`site/`, run with `npm run build`). Astro reads the JSON manifests in `site_data/` plus the generated editorial markdown and renders a static HTML page for every entity, dataset methodology page, explore table, map, and index. No server-side rendering, no runtime database, no API calls from the browser — just precomputed HTML and JSON.
- Commit & push. The built `site/dist/` directory is checked into the repository on `main`.
- GitHub Pages deploy. A GitHub Actions workflow (`.github/workflows/deploy.yml`) publishes `site/dist/` to GitHub Pages on every push. The custom domain caregraph.org is configured via `site/public/CNAME`.
The full pipeline — from an empty checkout to a deployable site — is a handful of commands: `pip install -e .`, `python etl/run.py`, `python etl/editorial/run.py`, `cd site && npm install && npm run build`.
Anyone can re-run it against the same raw CSVs and verify the outputs byte-for-byte.
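One simple way to check byte-for-byte equality between two runs is to digest every output file and compare the results; a minimal sketch:

```python
import hashlib
import pathlib

def digest_tree(root):
    """SHA-256 of every file under root, keyed by relative path.

    Two reproducible runs over the same raw CSVs should produce
    identical dictionaries.
    """
    base = pathlib.Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }

# Compare two checkouts' outputs (paths are illustrative):
# assert digest_tree("run_a/site_data") == digest_tree("run_b/site_data")
```

Byte-level comparison is stricter than value-level comparison: it also catches nondeterministic key ordering or timestamps leaking into the artifacts, which would undermine the reproducibility claim.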
Report an Error
If you find incorrect data, a missing entity, or a methodology concern, please file an error report on GitHub. We review every report and publish corrections with full changelog entries.