Hospital General Information
Dataset ID: xubh-q36u
Provenance
- Dataset ID: xubh-q36u
- Entity Type: hospital
- Role: base
- Source: CMS
- Vintage: FY2026
- Entity Count: 5,426
- Last ETL Run: 2026-04-13
Overview
The Hospital General Information dataset (dataset ID xubh-q36u) is published by the Centers for Medicare & Medicaid Services (CMS) through the Care Compare initiative (formerly Hospital Compare) and distributed via data.cms.gov. It contains one row per Medicare-certified hospital and covers about 5,400 facilities (5,426 entities in the FY2026 vintage) across all 50 states, the District of Columbia, and US territories. Each record includes facility identifiers, address, phone number, hospital type, ownership, emergency services availability, and the Hospital Overall Rating (1 to 5 stars). The file is refreshed by CMS on a roughly quarterly cycle tied to Care Compare update releases.
This dataset answers questions such as: What type of hospital is this (acute care, critical access, psychiatric, etc.)? Who owns it? Does it provide emergency services? What is its CMS star rating? On CareGraph, it serves as the foundational spine for all hospital entity pages — every hospital page is seeded from a row in this file.
Join Strategy
Each row is matched to a hospital entity page using the Facility ID column, which contains the CMS Certification Number (CCN). The CCN is a 6-character string, zero-padded on the left. During ETL, the normalize_ccn function strips whitespace and non-alphanumeric characters, then zero-pads to 6 characters. One JSON manifest is emitted per valid CCN into site_data/hospital/{CCN}.json. Rows with blank or unparseable CCNs are logged and skipped. The hospital entity page at /hospital/{CCN} renders data from this manifest. Because this dataset defines the hospital entity roster, every hospital page on CareGraph originates from a record in this file — other hospital datasets (HRRP, HVBP) are enrichment joins onto the same CCN key.
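The normalization step described above can be sketched as follows. This is a minimal illustration, not the actual CareGraph ETL code: only the name normalize_ccn and the site_data/hospital/{CCN}.json path come from the text. The manifest_path helper, the upper-casing of letters, and the decision to treat values longer than 6 characters as unparseable are assumptions.

```python
import re


def normalize_ccn(raw):
    """Return a 6-character CCN, or None if the value is blank or unparseable.

    Assumptions (not confirmed by the source): letters are upper-cased, and
    values longer than 6 characters after cleaning are rejected.
    """
    if raw is None:
        return None
    # Strip whitespace and every other non-alphanumeric character.
    cleaned = re.sub(r"[^0-9A-Za-z]", "", str(raw))
    if not cleaned or len(cleaned) > 6:
        return None
    # Zero-pad on the left to 6 characters.
    return cleaned.upper().zfill(6)


def manifest_path(ccn):
    """Hypothetical helper: build the manifest path for a normalized CCN."""
    return f"site_data/hospital/{ccn}.json"
```

Under these assumptions, a value such as " 10001 " normalizes to 010001 and maps to site_data/hospital/010001.json, while a blank or whitespace-only value returns None and would be logged and skipped.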
Known Limitations
- Star rating suppression. The Hospital overall rating field displays "Not Available" when CMS determines a hospital has insufficient measure data to compute a rating, or when the hospital has opted out. CMS computes star ratings using a latent variable model that groups approximately 60 quality measures into 5 categories (mortality, safety of care, readmission, patient experience, timely and effective care). Hospitals missing data in too many categories receive no rating. This is not cell-size suppression: no patient-level counts are present in this dataset.
- Emergency services is self-reported. The Emergency Services field (Yes/No) is declared by hospitals during enrollment and does not distinguish between a full emergency department, a freestanding ED, and an urgent care center. It should not be used as a proxy for trauma center designation or ED capability level.
- VA, tribal, and territorial hospitals. VA hospitals, Indian Health Service / tribal hospitals, and hospitals in US territories (Puerto Rico, Guam, US Virgin Islands, American Samoa, Northern Mariana Islands) are included in the file but frequently lack star ratings. VA facility CCNs may begin with alpha prefixes that do not match the numeric 6-character format used in other CMS datasets, which can cause join failures with enrichment files such as HRRP or HVBP.
- Hospital Type conflation. The Hospital Type field groups Critical Access Hospitals (CAHs) alongside other acute care categories. CAHs operate under fundamentally different Medicare payment rules (cost-based reimbursement, 25-bed cap, 96-hour length-of-stay limit) and different cost reporting structures. Comparing quality or financial metrics across CAHs and non-CAH acute care hospitals without accounting for these structural differences is misleading.
- Address and phone data staleness. CMS updates provider enrollment data (addresses, phone numbers) on a different refresh cycle than the quality datasets. A hospital may have relocated, changed phone numbers, or updated its mailing address since the last enrollment update. These fields should not be treated as real-time contact information.
- Medicare-certified facilities only. This file covers only hospitals certified to participate in Medicare. Hospitals that serve exclusively non-Medicare populations (e.g., some state psychiatric hospitals, certain specialty surgical facilities) are absent.
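One way to surface the CCN-format risk noted above before running enrichment joins is to partition the roster by identifier format. The sketch below is illustrative only: the function name split_by_ccn_format is hypothetical, and the identifiers in the usage note are made up rather than real CCNs.

```python
def split_by_ccn_format(ccns):
    """Partition CCNs into all-numeric 6-character values and everything else.

    Values in the second list (for example, identifiers containing letters)
    are the ones most likely to fail a join against numeric-only files.
    """
    numeric, other = [], []
    for ccn in ccns:
        if len(ccn) == 6 and ccn.isascii() and ccn.isdigit():
            numeric.append(ccn)
        else:
            other.append(ccn)
    return numeric, other
```

For example, split_by_ccn_format(["010001", "AB1234"]) places the first value in the numeric list and the second in the other list, which can then be reviewed before joining to HRRP or HVBP.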
Data Quality Notes
- "Not Available" sentinel values. Several fields use the string "Not Available" rather than null or empty string to indicate missing data. The ETL treats empty strings, "Not Available", and "N/A" uniformly as null in the output JSON manifest. The hospital_overall_rating field is set to null when the source value is blank or non-numeric.
- Numeric fields stored as strings. The source CSV encodes all columns as text. Fields that contain numeric values (e.g., Hospital overall rating, ZIP Code) are not cast to numeric types during ingest. The ETL preserves them as trimmed strings in the data.general_info sub-object and extracts selected fields (e.g., hospital_overall_rating) into top-level manifest keys.
- Column name mapping. The source CSV uses human-readable column headers with mixed case and spaces (e.g., Facility Name, City/Town, County/Parish). The ETL maps these to snake_case keys in the manifest (e.g., facility_name, city, county_name). The original column names are preserved in the data.general_info block for provenance.
- CCN validation failures. Rows with blank, whitespace-only, or otherwise unparseable Facility ID values are dropped during ETL without raising an error. The build log reports the total number of skipped rows. In practice, this affects fewer than 10 rows per refresh.
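The notes above can be illustrated with a small sketch of how a source row might be turned into manifest fields. It is not the actual ETL: the function names clean_value, parse_rating, and build_manifest are hypothetical, COLUMN_MAP lists only the three column pairs named in the text, and the exact manifest layout (mapped keys at the top level, original headers under data.general_info) is an assumption. The rating is kept as a string here, on the assumption that the "not cast to numeric types" rule also applies to the extracted top-level key.

```python
# Strings the ETL is described as treating uniformly as null.
NULL_SENTINELS = {"", "Not Available", "N/A"}

# Partial mapping: only these three pairs are named in the text.
COLUMN_MAP = {
    "Facility Name": "facility_name",
    "City/Town": "city",
    "County/Parish": "county_name",
}


def clean_value(raw):
    """Trim a source string and map sentinel values to None."""
    if raw is None:
        return None
    text = raw.strip()
    return None if text in NULL_SENTINELS else text


def parse_rating(raw):
    """Return the rating as a trimmed string, or None if blank or non-numeric."""
    text = clean_value(raw)
    if text is None or not (text.isascii() and text.isdigit()):
        return None
    return text


def build_manifest(row):
    """Build manifest fields from one CSV row (a dict of header -> string)."""
    # Original column names are kept under data.general_info for provenance.
    general_info = {col: clean_value(val) for col, val in row.items()}
    manifest = {"data": {"general_info": general_info}}
    # Selected fields are copied to snake_case top-level keys (assumed layout).
    for source_col, key in COLUMN_MAP.items():
        manifest[key] = general_info.get(source_col)
    manifest["hospital_overall_rating"] = parse_rating(
        row.get("Hospital overall rating")
    )
    return manifest
```

With this sketch, a row whose Hospital overall rating is "Not Available" yields a null hospital_overall_rating, while the original header still appears as a key under data.general_info.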