Our Approach: Trust-First Design

CareGraph exists because public healthcare data should be publicly accessible — not locked behind vendor portals, paywalls, or impenetrable government websites. Every number on this site comes from an official CMS, CDC, or Medicaid dataset. We do not editorialize the data; we clean it, connect it, and present it with full provenance.

Trust-first design means three things:

Datasets Overview

CareGraph ingests 29 datasets from CMS, CDC, and Medicaid, organized across 7 entity types. Each dataset has a dedicated methodology page describing its contents, join strategy, known limitations, and data quality notes.

  - Hospitals: 11 datasets · 5,426 entities
  - Skilled Nursing Facilities: 6 datasets · 14,703 entities
  - Counties: 3 datasets · 3,198 entities
  - ACOs: 4 datasets · 476 entities
  - Drugs: 4 datasets · 1,938 entities
  - Conditions: derived · 40 entities
  - DRGs: 1 dataset · 534 entities

Each dataset is classified by its role: base datasets create entity pages, enrichment datasets add measures to existing pages, and cross-link datasets connect entities across types.
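A role registry of this kind might be sketched in Python as follows; the class, entry, and dataset IDs here are hypothetical illustrations, not CareGraph's actual code:

```python
from dataclasses import dataclass
from enum import Enum

class DatasetRole(str, Enum):
    """The three roles a dataset can play in the build."""
    BASE = "base"              # creates entity pages
    ENRICHMENT = "enrichment"  # adds measures to existing pages
    CROSS_LINK = "cross-link"  # connects entities across types

@dataclass(frozen=True)
class DatasetEntry:
    dataset_id: str
    name: str
    role: DatasetRole
    source: str  # "CMS", "CDC", or "Medicaid"

# Illustrative entries; the IDs are made up for this sketch.
REGISTRY = [
    DatasetEntry("hospital_general_info", "Hospital General Information",
                 DatasetRole.BASE, "CMS"),
    DatasetEntry("aco_participants", "ACO Participants",
                 DatasetRole.CROSS_LINK, "CMS"),
]
```

Keeping the role explicit lets the pipeline decide, per dataset, whether to create pages, merge measures, or write link edges.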

Hospitals (11 datasets)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| Hospital General Information | base | CMS | FY2026 |
| Hospital Readmissions Reduction Program | enrichment | CMS | FY2026 |
| Hospital Value-Based Purchasing TPS | enrichment | CMS | FY2026 |
| Timely and Effective Care — Hospital | enrichment | CMS | FY2026 |
| Complications and Deaths — Hospital | enrichment | CMS | FY2026 |
| Patient Survey (HCAHPS) — Hospital | enrichment | CMS | FY2026 |
| Healthcare Associated Infections — Hospital | enrichment | CDC | FY2026 |
| Unplanned Hospital Visits — Hospital | enrichment | CMS | FY2026 |
| Medicare Spending Per Beneficiary — Hospital | enrichment | CMS | FY2026 |
| Hospital Provider Cost Report | enrichment | CMS | FY2023 |
| Hospital-Acquired Condition (HAC) Reduction Program | enrichment | CMS | FY2026 |

Skilled Nursing Facilities (6 datasets)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| Nursing Home Provider Info | base | CMS | Mar 2026 |
| SNF Quality Measures (MDS) | enrichment | CMS | Mar 2026 |
| Nursing Home Penalties | enrichment | CMS | Mar 2026 |
| Nursing Home Health Deficiencies | enrichment | CMS | Mar 2026 |
| Nursing Home Ownership | enrichment | CMS | Mar 2026 |
| Skilled Nursing Facility Cost Report | enrichment | CMS | FY2023 |

Counties (3 datasets)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| Medicare Geographic Variation by County | base | CMS | 2014–2023 |
| CDC PLACES County-Level Data | enrichment | CDC | 2023 Release |
| CDC SDOH Measures for County | enrichment | CDC | 2023 Release |

ACOs (4 datasets)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| MSSP ACO Performance PY2024 | base | CMS | PY2024 |
| ACO Participants | cross-link | CMS | PY2024 |
| ACO SNF Affiliates | cross-link | CMS | PY2024 |
| ACO Assigned Beneficiaries by County | cross-link | CMS | PY2023 |

Drugs (4 datasets)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| Medicare Part D Spending by Drug | base | CMS | CY2023 |
| Medicare Part B Spending by Drug | enrichment | CMS | CY2023 |
| Medicare Part B Discarded Drug Units | enrichment | CMS | CY2023 |
| NADAC National Average Drug Acquisition Cost | enrichment | Medicaid | 2026 (weekly) |

Conditions (derived)

Condition entity pages are derived from the CDC PLACES County-Level Data dataset by aggregating prevalence estimates across all counties for each health measure. See the Counties section for the source dataset methodology.
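The derivation described above can be sketched in a few lines. This is a minimal illustration assuming a simple unweighted mean across counties; the function and field names are hypothetical, and the actual aggregation method is described on the source dataset's methodology page:

```python
from collections import defaultdict

def derive_condition_entities(county_rows):
    """Group county-level prevalence rows by health measure and
    aggregate across counties (unweighted mean in this sketch)."""
    by_measure = defaultdict(list)
    for row in county_rows:
        by_measure[row["measure"]].append(row["prevalence"])
    return {
        measure: {
            "county_count": len(values),
            "mean_prevalence": sum(values) / len(values),
        }
        for measure, values in by_measure.items()
    }

# Toy input: two counties reporting diabetes, one reporting obesity.
rows = [
    {"fips": "01001", "measure": "Diabetes", "prevalence": 11.2},
    {"fips": "01003", "measure": "Diabetes", "prevalence": 9.8},
    {"fips": "01001", "measure": "Obesity", "prevalence": 38.1},
]
conditions = derive_condition_entities(rows)
```

Each resulting entry becomes one Condition entity, so the 40 condition pages fall directly out of the distinct PLACES health measures.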

DRGs (1 dataset)

| Dataset | Role | Source | Vintage |
| --- | --- | --- | --- |
| Medicare Inpatient Hospitals by Provider and Service (DRG) | base | CMS | CY2023 |

Limitations Ledger

Across all datasets, the following systemic limitations apply:

Build Pipeline: From Raw Data to Published Site

Producing CareGraph end-to-end takes three orchestrated stages: a Python data pipeline, an AI-assisted editorial pipeline, and a static site build deployed to GitHub Pages. Every stage is open source, idempotent, and reproducible from the same raw CSVs.

Stage 1 — Data pipeline (etl/run.py)

A single Python orchestrator walks through roughly eighteen numbered steps, summarized here as eight phases. In order:

  1. Acquire. Downloads 29 raw CSV files from data.cms.gov, data.cdc.gov, and data.medicaid.gov APIs. Files are stored under date-stamped names ({dataset_id}_{YYYY-MM-DD}.csv) in data/raw/; re-running on the same day skips already-downloaded files. Previous vintages are preserved rather than overwritten.
  2. Normalize join keys. CCNs are zero-padded to 6 characters, FIPS codes to 5 digits, NPIs to 10 digits. SSA state/county codes are converted to FIPS.
  3. Build base entity manifests for all 7 entity types — Hospitals, SNFs, Counties, ACOs, Drugs, Conditions, and DRGs — one JSON manifest per entity, written to site_data/{entity}/{id}.json.
  4. Enrich entities with supplementary datasets: HRRP/HVBP penalties, HCAHPS patient experience, HAI infection ratios, complications, timely & effective care, MSPB spending, HAC reduction, hospital and SNF cost reports, nursing-home penalties and deficiencies and ownership, ACO participants and SNF affiliates, county-level CDC PLACES prevalence and SDOH indicators, Part D / Part B drug spending, Part B discarded units, and NADAC acquisition prices.
  5. Cross-link entities — hospitals ↔ ACOs, SNFs ↔ ACOs, ACOs ↔ counties (beneficiary assignment), hospitals ↔ counties, drugs ↔ DRGs. Cross-links are written back into each side's manifest so entity pages render without runtime joins.
  6. Compute benchmarks and peer cohorts for ACOs, hospitals, and counties. Peer cohorts are built by matching ACOs on assigned-beneficiary size, region, and track so individual ACO pages can show "how does this compare to similar ACOs?" without client-side computation.
  7. Derive precomputed views. Search index, map choropleth layers (county-level), compare-mode summaries, and the paginated Explore-table indexes are all built as static JSON artifacts.
  8. Validate & package. Pydantic schemas validate every manifest. A provenance envelope (dataset ID, download date, source URL, row count, ETL version) is attached to each enriched field. The pipeline writes site_data/index.json with run-level metadata.
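The key normalization in step 2 amounts to zero-padding, which matters because CSV readers often strip leading zeros from numeric-looking codes. A minimal sketch, with helper names that are illustrative rather than the pipeline's actual functions:

```python
def normalize_ccn(ccn) -> str:
    """CMS Certification Numbers: zero-pad to 6 characters."""
    return str(ccn).strip().zfill(6)

def normalize_fips(fips) -> str:
    """County FIPS codes: zero-pad to 5 digits."""
    return str(fips).strip().zfill(5)

def normalize_npi(npi) -> str:
    """National Provider Identifiers: zero-pad to 10 digits."""
    return str(npi).strip().zfill(10)

# A CCN read as the integer 10001 is really the 6-character "010001";
# without this step, joins against properly padded keys silently miss.
assert normalize_ccn(10001) == "010001"
assert normalize_fips("1001") == "01001"
```

Normalizing every key once, up front, is what lets the later enrichment and cross-link steps join datasets by simple equality.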

Stage 2 — Editorial pipeline (etl/editorial/run.py)

Every dataset gets a dedicated methodology page. These pages are drafted by a large language model under tight constraints, not hand-written, so they can scale to dozens of datasets while staying consistent in structure. The editorial orchestrator:

  1. Iterates each registered dataset and loads a shared prompt template (etl/editorial/prompts/dataset_methodology.txt) plus a dataset-specific hints file (etl/editorial/hints/{dataset_id}.md) curated by the maintainer to pin down known caveats, join keys, suppression rules, and reporting lags.
  2. Substitutes dataset metadata (name, entity type, row count from the ETL's index.json, join-key format, field count, suppression notes) into the template and invokes the Claude CLI (claude -p) to generate the page.
  3. Validates the generated markdown against required sections (Overview, Join Strategy, Known Limitations, Data Quality Notes), minimum length, and forbidden-phrase rules. Failed validation is flagged, not silently accepted.
  4. Writes a checkpoint (etl/editorial/.checkpoint.json) so successful generations are not regenerated on the next run — keeping costs bounded and making the pipeline resumable after transient failures. A --force flag regenerates everything; --skip-ai produces template placeholders without calling the model.
  5. Stage 1 then copies etl/editorial/output/*.md into site_data/editorial/ where the Astro build picks them up.
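The structural validation in step 3 can be sketched as a simple checker; the threshold, function name, and return convention here are assumptions for illustration, not the orchestrator's actual code:

```python
REQUIRED_SECTIONS = [
    "Overview", "Join Strategy", "Known Limitations", "Data Quality Notes",
]
MIN_LENGTH = 500  # characters; illustrative threshold

def validate_methodology_page(markdown: str, forbidden_phrases=()) -> list:
    """Return a list of problems; an empty list means the page passes.
    Pages with problems are flagged for review, never silently accepted."""
    problems = []
    lowered = markdown.lower()
    for section in REQUIRED_SECTIONS:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")
    if len(markdown) < MIN_LENGTH:
        problems.append(f"too short: {len(markdown)} < {MIN_LENGTH} chars")
    for phrase in forbidden_phrases:
        if phrase.lower() in lowered:
            problems.append(f"forbidden phrase: {phrase!r}")
    return problems
```

Returning a list of problems, rather than raising on the first failure, lets the orchestrator log every issue with a generated page in one pass.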

All 29 datasets currently have a generated methodology page. The editorial layer describes the data; it does not invent, reinterpret, or summarize any numbers shown on entity pages.

Stage 3 — Static site build & deploy

  1. Astro static build (site/, run with npm run build). Astro reads the JSON manifests in site_data/ plus the generated editorial markdown and renders a static HTML page for every entity, dataset methodology page, explore table, map, and index. No server-side rendering, no runtime database, no API calls from the browser — just precomputed HTML and JSON.
  2. Commit & push. The built site/dist/ directory is checked into the repository on main.
  3. GitHub Pages deploy. A GitHub Actions workflow (.github/workflows/deploy.yml) publishes site/dist/ to GitHub Pages on every push. The custom domain caregraph.org is configured via site/public/CNAME.

The full pipeline — from an empty checkout to a deployable site — is a handful of commands:

  1. pip install -e .
  2. python etl/run.py
  3. python etl/editorial/run.py
  4. cd site && npm install && npm run build

Anyone can re-run it against the same raw CSVs and verify the outputs byte-for-byte.

Report an Error

If you find incorrect data, a missing entity, or a methodology concern, please file an error report on GitHub. We review every report and publish corrections with full changelog entries.