Each release of new overall hospital ratings is captivating to journalists, hospital leaders, and health care consumers in the United States. These overall ratings, whether published by U.S. News, Consumer Reports, or Hospital Compare, aggregate a wide array of underlying measures into a single score for each hospital. Without such composite scores, it would be impossible to create rankings, star ratings, and “honor rolls” based on overall performance. As every hospital chief executive (and college president) knows, boards of directors rarely ignore these ratings.
Responsible creators of overall performance ratings carefully consider the validity and reliability of individual measures. They ask questions such as, Is risk adjustment adequate? and Is the signal-to-noise ratio reasonable? These are important questions, and methodologic guides based on measurement science can help answer them.1
Where mathematics ends, however, there is an inescapable value judgment: In computing the overall hospital rating, how much weight should each measure (or group of measures) receive? Measurement science is largely silent on this question, as it should be. No equation can help report makers decide how much relative weight to place on fundamentally different dimensions of inherently desirable performance, such as technical quality, patient experience, and efficiency of care. Thus, as currently constructed, the weighting systems that underlie overall hospital performance ratings are expressions of the values, preferences, and tastes of their creators.
Is this approach appropriate? Why should the opinions of report creators hold sway, if the intent is to inform patient choice? Instead, why not ask patients what’s important to them? Report creators could survey patients to estimate weights that reflect the population mean. Such an approach might help align overall performance ratings with the preferences of the average patient. However, individual patients vary considerably in their needs and preferences. A report tailored to the “average patient” will probably be a poor fit for most.
We therefore suggest creating overall performance scores that can be modified, in real time, by each user in accordance with the user’s individual needs, values, and preferences. Although such an approach would have been impossible before the Internet age, it is feasible with current technology. To illustrate how such a report might look, we have created a mock-up based on the 2016 version of the Centers for Medicare and Medicaid Services (CMS) Overall Hospital Quality Star Rating System, available on Hospital Compare.
The 2016 Hospital Compare overall hospital star ratings used latent variable modeling techniques to combine 57 process and outcomes measures into seven domains of quality (mortality, safety of care, readmissions, patients’ experience, timeliness of care, effectiveness of care, and efficient use of medical imaging).2 Overall star ratings were computed from a weighted average of these seven domain scores, using weights chosen for consistency with existing CMS policies and priorities, incorporating input from stakeholders such as members of the agency’s panel of technical experts. Mortality, safety of care, readmissions, and patient experience each received weights of 22%, and the remaining domains each received weights of 4%. In other words, report creators considered readmissions to be 5.5 times as important as the effectiveness of care. We sought to allow report users to make their own determinations.
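The weighted-average step described above can be sketched in a few lines of code. This is a minimal illustration only: the weights are those stated in the article, but the domain labels are shorthand and the scores passed in are hypothetical stand-ins, not actual CMS data; the real star ratings also involve latent-variable modeling and clustering steps not shown here.

```python
# Default 2016 domain weights (percentages, as described in the article).
DEFAULT_WEIGHTS = {
    "mortality": 22, "safety": 22, "readmissions": 22, "patient_experience": 22,
    "timeliness": 4, "effectiveness": 4, "imaging_efficiency": 4,
}

def overall_score(domain_scores, weights):
    """Weighted average of standardized domain scores.

    Domains with a missing score (None) are dropped and the remaining
    weights are renormalized, so the result is always a proper average.
    """
    pairs = [(score, weights[domain])
             for domain, score in domain_scores.items()
             if score is not None]
    total_weight = sum(weight for _, weight in pairs)
    return sum(score * weight for score, weight in pairs) / total_weight
```

Under these weights, a one-unit improvement in the readmissions domain moves the overall score 5.5 times as much as the same improvement in effectiveness of care (22% vs. 4%), which is the disparity the article highlights.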
To do so, we modified the original program that CMS had used to create the 2016 Overall Hospital Quality Star Ratings to allow report users to set their own weights (choosing from 100, 50, 22, 4, and 0 points — corresponding to “extremely important,” “very important,” “quite important,” “minimally important,” and “unimportant”). We then applied this modified program to individual measure scores from the publicly available 2016 Hospital Compare database, recomputing overall hospital stars for every possible combination of weights and dividing each domain’s points by the total (so that the weights always summed to 100%). Finally, we created a Web-based report card on which report users could display customized overall hospital ratings, using weights that reflect their own assessments of the relative importance of the seven domains. Overall ratings were sensitive to customized weights, as a few examples can demonstrate.
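The user-weighting step can be sketched similarly. Again, this is a hypothetical illustration: the five-point importance scale is the one described in the article, but the function names and domain labels are invented for demonstration.

```python
# Importance labels offered to report users, mapped to points
# (scale from the article: 100, 50, 22, 4, 0).
POINTS = {
    "extremely important": 100,
    "very important": 50,
    "quite important": 22,
    "minimally important": 4,
    "unimportant": 0,
}

def normalized_weights(choices):
    """Convert per-domain importance labels into weights summing to 100%."""
    raw = {domain: POINTS[label] for domain, label in choices.items()}
    total = sum(raw.values())
    return {domain: 100 * points / total for domain, points in raw.items()}
```

Dividing each domain’s points by the total, as in the final line, is what guarantees that any combination of user choices yields weights summing to 100%.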
Let’s say Patient A is a pregnant woman in Taunton, Massachusetts. She wonders whether to establish obstetrical care locally or at one of the well-known hospitals in downtown Boston. The default Hospital Compare overall star ratings give four stars to Massachusetts General Hospital (downtown), four stars to Saint Anne’s Hospital in Fall River (a closer option), and three stars to Sturdy Memorial Hospital (another nearby option). However, Patient A prefers to assign zero points to mortality, readmissions, and efficient use of medical imaging, since these variables are based on conditions and services that have questionable relevance to obstetrical care. She then assigns 100 points to effectiveness (which includes some obstetrical measures), safety (she is concerned about postoperative complications, should she need a cesarean section), and timeliness (she wants to be seen right away if she presents to the emergency department) and 50 points to patient experience (all else being equal, she values a good night’s sleep). Using these personalized weights, she finds that both her local options receive five-star ratings, whereas Massachusetts General Hospital receives three stars.
Patient B, a generally healthy 45-year-old man living in West Covina, California, recently had a bike accident. The nearest hospitals offering elective knee surgery are Chino Valley Medical Center (four-star default overall rating on Hospital Compare) and Methodist Hospital of Southern California (five stars). Perusing the underlying measures, the man decides that effectiveness and safety are most relevant to his surgery and assigns them each 100 points. Because his job allows only limited sick leave, he cares greatly about avoiding readmission, so he assigns that domain maximum weight as well. He assigns 50 points to patient experience and gives the remaining domains four points each because he considers them minimally important to his surgery. With these personalized weights, the hospitals’ relative rankings are reversed, with Chino Valley now at five stars and Methodist at four.
Professor C’s interest in hospital ratings is less personal than that of Patients A and B. She is a researcher living in Chicago who questions the validity of the measures underlying the safety domain.3 She sets the weight for safety to zero points and leaves the remaining default weights unchanged. Having thus removed the safety domain from the star calculations, she notes substantial changes in Chicago-area hospital ratings: four hospitals’ ratings drop from five to four stars, whereas those of five others, including Northwestern Memorial Hospital, increase from three to four stars.
Thus, overall hospital ratings are sensitive to the inherently subjective weights applied to the underlying performance measures. One-size-fits-all weighting, which was necessary when performance ratings were published only in print, can be replaced with user-determined weights in the Internet age. By allowing such personalization, creators of performance reports can enhance the value of their overall ratings and rankings to the consumers who might use them.
Our illustrative report card, which is intended only as an example and not as a consumer-ready performance-rating site, does not seek to address every methodologic challenge of public reporting. Rather, it is intended to give users an intuitive understanding of how different weightings can affect overall hospital performance ratings. With further development to assist consumers in setting their personalized weights (perhaps by suggesting weights on the basis of their responses to a questionnaire) and to help them interpret the resulting ratings,4,5 we believe user-determined weights could become a highly desirable feature of future hospital ratings.
From the Northland District Health Board, Whangarei, New Zealand (J.R.-S.); Rand Corporation, Santa Monica, CA (J.R.-S., J.G.), and Boston (M.W.F.); and Brigham and Women’s Hospital and Harvard Medical School — both in Boston (M.W.F.).
1. Friedberg MW, Damberg CL. Methodological considerations in generating provider performance scores for use in public reporting: a guide for community quality collaboratives. Rockville, MD: Agency for Healthcare Research and Quality, September 2011.
2. Yale New Haven Health Services Corporation/Center for Outcomes Research and Evaluation. Overall hospital quality star rating on Hospital Compare: December 2016 updates and specifications report. Woodlawn, MD: Centers for Medicare and Medicaid Services, October 20, 2016 (https://www.rand.org/content/dam/rand/www/external/health/projects/hospital-performance-report-card/StrRtgDec16PrevQUS_rept_110416.pdf).
3. Rajaram R, Barnard C, Bilimoria KY. Concerns about using the patient safety indicator-90 composite in pay-for-performance programs. JAMA 2015;313:897-898.
4. Hibbard JH, Peters E, Slovic P, Finucane ML, Tusler M. Making health care quality reports easier to use. Jt Comm J Qual Improv 2001;27:591-604.
5. Hibbard J, Sofaer S. Best practices in public reporting no. 1: how to effectively present health care performance data to consumers. Rockville, MD: Agency for Healthcare Research and Quality, June 2010.
This Perspective article originally appeared in The New England Journal of Medicine.