Provenance

Dataset ID: hosp-hai
Entity Type: hospital
Role: enrichment
Source: CDC
Vintage: FY2026
Entity Count: 5,399
Last ETL Run: 2026-04-13

Overview

The Healthcare Associated Infections - Hospital (HAI) dataset is published by the Centers for Disease Control and Prevention (CDC) through the National Healthcare Safety Network (NHSN) and distributed via the CMS Care Compare initiative on data.cms.gov (Provider Data API identifier 77hc-ibv8). Unlike most CMS quality datasets, HAI data is not derived from claims: hospitals report infections directly to the NHSN as part of the CMS Inpatient Quality Reporting (IQR) program requirements, so the data uses clinical surveillance definitions rather than billing codes. Each row represents one infection measure for one hospital, with fields for observed infections, predicted infections, the Standardized Infection Ratio (SIR), SIR confidence intervals, and the comparison category against the national benchmark. The current file covers FY2026.

Six HAI measures are reported: CLABSI (central line-associated bloodstream infections), CAUTI (catheter-associated urinary tract infections), SSI (surgical site infections) for colon surgery, SSI for abdominal hysterectomy, MRSA bacteremia, and C. difficile infection (CDI). The SIR is the primary metric: it is the ratio of observed infections to a predicted count derived from a 2015 national baseline. An SIR of 1.0 means the hospital had exactly as many infections as the 2015 baseline predicts; below 1.0 indicates fewer infections than predicted, and above 1.0 indicates more. This dataset answers questions such as: which hospitals have higher- or lower-than-expected infection rates for specific HAI categories, how a hospital's infection control performance compares to the national benchmark, and which infection types are most prevalent at a given facility.
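The SIR arithmetic described above can be sketched in a few lines. This is only an illustration: the published file already carries the SIR, and the function names, the guard against a non-positive predicted count, and the wording of the labels are assumptions, not CareGraph or NHSN code. The comparison category in the dataset is a separate field, so the label here describes the point estimate only.

```python
from typing import Optional


def standardized_infection_ratio(observed: float, predicted: float) -> Optional[float]:
    """Return observed / predicted, or None when no ratio can be formed.

    Assumption: a non-positive predicted count is treated as "no SIR".
    """
    if predicted <= 0:
        return None
    return observed / predicted


def describe_sir(sir: Optional[float]) -> str:
    """Describe an SIR point estimate relative to the 1.0 baseline (hypothetical labels)."""
    if sir is None:
        return "not available"
    if sir < 1.0:
        return "fewer infections than predicted"
    if sir > 1.0:
        return "more infections than predicted"
    return "same as predicted"


# Invented numbers for illustration: 3 observed infections against 4 predicted.
sir = standardized_infection_ratio(3, 4)
print(sir, "-", describe_sir(sir))
```

With these invented inputs the ratio is 0.75, which falls below the 1.0 baseline.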

Join Strategy

This dataset joins to hospital entity pages on CareGraph using the Facility ID field, which contains the CMS Certification Number (CCN) as a 6-digit zero-padded string (e.g., 010001). During ETL, the normalize_ccn() function strips whitespace and zero-pads values shorter than 6 characters to ensure consistent matching.

The generic _load_measures_by_ccn() loader reads the source CSV and identifies two columns using a candidate-list strategy: the CCN column (checking "Facility ID", "Hospital CCN", "Provider Number", and variants) and the measure column (checking candidates including "HAI Measure ID" and "Measure ID"). All measure rows are grouped by normalized CCN. Numeric fields are parsed via _try_float(), and non-numeric values such as "Not Available" are excluded during loading. Each hospital's HAI records are attached to its JSON manifest under the hai key as an array of per-measure objects.

The join is a left join from the hospital manifest: hospitals without HAI records retain their existing data and display missing indicators for this dataset. Not all hospitals report all six measures, because some measures (e.g., SSI for abdominal hysterectomy) apply only to hospitals that perform the corresponding procedure.

Known Limitations

Data Quality Notes
