Healthcare Associated Infections — Hospital
Dataset ID: hosp-hai
Provenance
- Dataset ID: hosp-hai
- Entity Type: hospital
- Role: enrichment
- Source: CDC
- Vintage: FY2026
- Entity Count: 5,399
- Last ETL Run: 2026-04-13
Overview
The Healthcare Associated Infections — Hospital (HAI) dataset is published by the Centers for Disease Control and Prevention (CDC) through the National Healthcare Safety Network (NHSN), and distributed via the CMS Care Compare initiative on data.cms.gov (Provider Data API identifier 77hc-ibv8). Unlike most CMS quality datasets, HAI data is not derived from claims — hospitals report infections directly to the NHSN as part of CMS Inpatient Quality Reporting (IQR) program requirements. The data uses clinical surveillance definitions rather than billing codes. Each row represents one infection measure for one hospital, with fields for observed infections, predicted infections, the Standardized Infection Ratio (SIR), SIR confidence intervals, and the comparison category against the national benchmark. The current file covers FY2026.
Six HAI measures are reported: CLABSI (central line-associated bloodstream infections), CAUTI (catheter-associated urinary tract infections), SSI for colon surgery, SSI for abdominal hysterectomy, MRSA bacteremia, and C. difficile infection (CDI). The SIR is the primary metric: it is the number of observed infections divided by a predicted count derived from a 2015 national baseline. An SIR of 1.0 means the hospital had exactly as many infections as the baseline predicts; below 1.0 indicates fewer infections than predicted, and above 1.0 indicates more. This dataset answers questions such as: which hospitals have higher- or lower-than-expected infection rates for specific HAI categories, how a hospital's infection control performance compares to the national benchmark, and which infection types are most prevalent at a given facility.
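The SIR arithmetic can be illustrated with a minimal sketch. The function names here are hypothetical and are not part of the CareGraph ETL; the rule that no ratio is produced when fewer than 1 infection is predicted follows the suppression behavior described under Known Limitations. The `describe()` helper looks only at the point estimate, whereas the published comparison category also depends on the confidence interval, so it should not be read as a substitute for that field.

```python
from typing import Optional


def standardized_infection_ratio(observed: float, predicted: float) -> Optional[float]:
    """Return observed / predicted, or None when predicted is below 1.0.

    Hypothetical helper for illustration only; not the ETL's own code.
    """
    if predicted < 1.0:
        # Too few predicted infections for a stable ratio.
        return None
    return observed / predicted


def describe(sir: Optional[float]) -> str:
    """Plain-language reading of the SIR point estimate."""
    if sir is None:
        return "not calculated"
    if sir < 1.0:
        return "fewer infections than predicted"
    if sir > 1.0:
        return "more infections than predicted"
    return "same as predicted"


# 3 observed infections against 4.0 predicted
sir = standardized_infection_ratio(3, 4.0)
print(sir, describe(sir))  # 0.75 fewer infections than predicted
```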
Join Strategy
This dataset joins to hospital entity pages on CareGraph using the Facility ID field, which contains the CMS Certification Number (CCN) as a 6-digit zero-padded string (e.g., 010001). During ETL, the normalize_ccn() function strips whitespace and zero-pads values shorter than 6 characters to ensure consistent matching. The generic _load_measures_by_ccn() loader reads the source CSV, identifies the CCN column using a candidate-list strategy (checking "Facility ID", "Hospital CCN", "Provider Number", and variants), and identifies the measure column via candidates including "HAI Measure ID" and "Measure ID". All measure rows are grouped by normalized CCN. Each hospital's HAI records are attached to its JSON manifest under the hai key as an array of per-measure objects. Non-numeric values such as "Not Available" are excluded during loading; numeric fields are parsed via _try_float(). The join is a left join from the hospital manifest — hospitals without HAI records retain their existing data and display missing indicators for this dataset. Not all hospitals report all six measures, as some measures (e.g., SSI for abdominal hysterectomy) apply only to hospitals that perform the corresponding procedure.
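The join steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual ETL source: `find_column`, `load_measures_by_ccn`, and `attach_hai` are simplified stand-ins, the case-insensitive header match is one plausible reading of "variants", and the `Score` column and measure labels in the sample CSV are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Candidate header names taken from the description above.
CCN_CANDIDATES = ["Facility ID", "Hospital CCN", "Provider Number"]
MEASURE_CANDIDATES = ["HAI Measure ID", "Measure ID"]


def normalize_ccn(raw: str) -> str:
    """Strip whitespace and left-pad with zeros to 6 characters."""
    return raw.strip().zfill(6)


def find_column(fieldnames, candidates):
    """Return the first header that matches a candidate, ignoring case."""
    by_key = {name.strip().lower(): name for name in fieldnames}
    for candidate in candidates:
        match = by_key.get(candidate.lower())
        if match is not None:
            return match
    raise KeyError(f"none of {candidates} found in {list(fieldnames)}")


def load_measures_by_ccn(csv_text: str) -> dict:
    """Group every measure row under its normalized CCN."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ccn_col = find_column(reader.fieldnames, CCN_CANDIDATES)
    find_column(reader.fieldnames, MEASURE_CANDIDATES)  # fail early if absent
    grouped = defaultdict(list)
    for row in reader:
        grouped[normalize_ccn(row[ccn_col])].append(row)
    return dict(grouped)


def attach_hai(manifests: dict, measures_by_ccn: dict) -> None:
    """Left join: only hospitals with HAI rows gain an 'hai' key."""
    for ccn, manifest in manifests.items():
        records = measures_by_ccn.get(ccn)
        if records:
            manifest["hai"] = records


# Hypothetical sample data; column names and values are illustrative only.
sample_csv = (
    "Facility Id,Measure ID,Score\n"
    "10001,CLABSI,0.75\n"
    " 010001 ,CAUTI,Not Available\n"
)
grouped = load_measures_by_ccn(sample_csv)
manifests = {"010001": {"name": "Hospital A"}, "999999": {"name": "Hospital B"}}
attach_hai(manifests, grouped)
print(sorted(manifests["010001"]))   # ['hai', 'name']
print("hai" in manifests["999999"])  # False
```

Both sample rows land under `010001` even though one is missing its leading zero and the other carries stray whitespace, which is the mismatch the normalization step is meant to absorb.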
Known Limitations
- Suppression for low predicted infections. Hospitals with fewer than 1 predicted infection for a given measure are suppressed. The SIR is statistically unreliable when the predicted denominator is below 1.0, so these entries show suppression markers rather than calculated values. Suppression indicates low device-days or procedure volume, not a data quality problem.
- Stale baseline year. The SIR baseline is derived from 2015 national data and is not updated annually. As national infection rates have generally declined since 2015, an increasing proportion of hospitals show SIR values below 1.0 ("better than expected"). This baseline drift means current SIR values are not directly comparable to SIR values from earlier reporting periods, and cross-year trend analysis requires caution.
- Surveillance definition changes. NHSN has revised HAI surveillance definitions over time — notably, CAUTI definitions were updated in 2015, narrowing the criteria for what constitutes a reportable infection. These definitional changes create discontinuities in time-series comparisons and can cause apparent drops in infection rates that reflect reclassification rather than clinical improvement.
- Not derived from claims data. HAI data reflects clinical surveillance reported directly by hospital infection preventionists to NHSN. It does not use billing codes, ICD-10 diagnoses, or Medicare claims. This means the data is not affected by coding variation, but is subject to variation in hospital surveillance practices, staffing of infection prevention programs, and interpretation of NHSN case definitions.
- Not all measures apply to all hospitals. Procedure-specific measures (SSI colon, SSI abdominal hysterectomy) are reported only by hospitals that perform those procedures in sufficient volume. Device-associated measures (CLABSI, CAUTI) require a minimum number of device-days. Missing measures for a hospital do not indicate suppression — the procedure or device exposure may simply not be applicable.
- No Medicare Advantage distinction. Because HAI reporting is based on clinical surveillance rather than claims, it covers all patients in qualifying units regardless of payer. However, hospitals exempt from the CMS IQR program (e.g., Critical Access Hospitals that do not voluntarily report) are absent from the dataset entirely.
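Because a missing measure and a row without a usable SIR mean different things, a consumer of the manifest may want to tell them apart. The sketch below is one way to do that. It assumes a hypothetical record shape in which each entry in the `hai` array has a `measure` label and an `sir` value that is null when no SIR was published; the six labels are placeholders rather than the identifiers used in the source file.

```python
from typing import Optional

# Placeholder labels for the six measures; the real identifiers may differ.
MEASURES = ("CLABSI", "CAUTI", "SSI_COLON", "SSI_HYST", "MRSA", "CDI")


def measure_status(hai_records: Optional[list], measure: str) -> str:
    """Classify one measure for one hospital.

    'sir_available' : a row exists and carries a numeric SIR
    'no_sir_value'  : a row exists but the SIR is null (e.g. suppressed)
    'no_row'        : the hospital has no row for this measure at all
    """
    for record in hai_records or []:
        if record.get("measure") == measure:
            if record.get("sir") is not None:
                return "sir_available"
            return "no_sir_value"
    return "no_row"


# Hypothetical manifest fragment for illustration.
hai = [
    {"measure": "CLABSI", "sir": 0.75},
    {"measure": "CAUTI", "sir": None},
]
for name in MEASURES:
    print(name, measure_status(hai, name))
```

Treating `no_row` separately from `no_sir_value` keeps a hospital that never performs colon surgery from being displayed as if its SSI result had been withheld.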
Data Quality Notes
- SIR and count fields stored as strings. The source CSV encodes SIR, observed infections, predicted infections, and confidence interval bounds as string values. Suppressed rows contain "Not Available" or similar markers instead of numeric values. The ETL parses these with _try_float(), converting non-numeric entries to null in the JSON manifest. Rows where all value fields resolve to "Not Available" are excluded entirely during the _load_measures_by_ccn() step.
- Column name variation across vintages. CMS changes column header casing and naming between file releases (e.g., "Facility ID" vs. "Facility Id", "HAI Measure ID" vs. "Measure ID"). The ETL uses a candidate-list column matching strategy via _find_column() to handle these variations without manual mapping updates.
- Missing value encoding inconsistency. The source data uses "Not Available", "Not Applicable", and empty strings interchangeably for missing values depending on the field and suppression reason. The ETL's _clean() function normalizes whitespace and strips these sentinel values, converting them uniformly to null. Date fields use MM/DD/YYYY string format in the source.
- Measure-level granularity. Each hospital may have up to six rows (one per HAI measure), but the actual count varies. Hospitals with zero applicable measures have no rows in the source file and no hai key in their manifest. The ETL does not impute missing measures; absence of a measure in the manifest means the hospital either did not report it or was not eligible.
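The value-cleaning behavior described in these notes might look something like the following. This is a sketch under assumptions, not the ETL's code: `clean` and `try_float` are simplified stand-ins for `_clean()` and `_try_float()`, the sentinel set contains only the markers named above, and `parse_date` is a hypothetical helper added to show the MM/DD/YYYY format.

```python
from datetime import date, datetime
from typing import Optional

# Missing-value markers named in the notes above, compared case-insensitively.
SENTINELS = {"", "not available", "not applicable"}


def clean(value: Optional[str]) -> Optional[str]:
    """Collapse whitespace and map sentinel markers to None."""
    if value is None:
        return None
    text = " ".join(value.split())
    return None if text.lower() in SENTINELS else text


def try_float(value: Optional[str]) -> Optional[float]:
    """Parse a numeric string, returning None for anything unparseable."""
    text = clean(value)
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def parse_date(value: Optional[str]) -> Optional[date]:
    """Parse an MM/DD/YYYY string; hypothetical helper for illustration."""
    text = clean(value)
    if text is None:
        return None
    return datetime.strptime(text, "%m/%d/%Y").date()


print(try_float(" 0.752 "))        # 0.752
print(try_float("Not Available"))  # None
print(parse_date("04/01/2024"))    # 2024-04-01
```

Routing every field through one cleaning function means a new sentinel spelling only needs to be added in a single place.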