Stars 101 explained the basic idea of Star Ratings. This tutorial details how the Star Ratings are assigned to each plan each year. We start with the very basics -- weighted averages -- and build up through sections on each technical area. The important part for most people is the list of “Implications” in the third section.
A Weighted Average of Scores
The simplest definition of the Overall Star Rating is that it’s a weighted average of Plans’ ratings for individual measures. It’s very much like (probably inspired by) college grade point average. If you can remember that far back, you might recall two things. First, grades are assigned A, B, C, D or F for each class, and once the grade is assigned the underlying numeric value is set aside. It doesn’t matter if your “A“ was a 96% from one teacher or a 75% from a teacher who grades on a curve -- only the letter grade matters. Each class can have its own scale. It would be a mess if you tried to average these scores, but the school pulls it off because it normalizes these systems into a common standard. The second thing to recall is that not all classes are equally weighted. It’s better to get an A in 3-credit Calculus and a B in 1-credit Tennis than vice versa.
CMS calculates Stars using the same type of weighted average. The first task is to collect scores for each of the measures from each of the plans and convert the numeric scores into 1, 2, 3, 4 or 5 Stars. The second task is to take the individual measure stars and calculate the weighted average and round to get a final Star Rating. I’m going to gloss over a whole bunch of details here so we can focus on the weighted average and rounding. Here’s a simple example with a few measures:
Id | Score | Stars | Weight | Measure |
---|---|---|---|---|
C02 | 65 | 3 | 1 | Colon Cancer Screening |
C11 | 75 | 4 | 3 | Blood Sugar Control |
C23 | 0.15 | 3 | 2 | Complaints about Health Plan |
D05 | 85 | 2 | 2 | Rating of Drug Plan |
D10 | 78 | 3 | 3 | Medication Adherence for Statins |
The weighted average is easy to calculate: SUM(Stars * Weight) / SUM(Weight)
Id | Stars | Weight | Numerator (Star * Weight) | Denominator (Weight) |
---|---|---|---|---|
C02 | 3 | 1 | 3 | 1 |
C11 | 4 | 3 | 12 | 3 |
C23 | 3 | 2 | 6 | 2 |
D05 | 2 | 2 | 4 | 2 |
D10 | 3 | 3 | 9 | 3 |
Total | | | 34 | 11 |
The raw calculation for the overall rating is 34/11 = 3.091. This rounds to 3 Stars.
Stars round on the half-star, not whole numbers, so the plan would need a raw score of 3.250 to move higher. If C02 were 4 Stars instead of 3, the rating would move to 3.18 -- still short. But if that measure jumped to 5 Stars, or if a higher-weighted measure added a star, it would move the average above 3.25 and put the Overall Rating up to 3.5.
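To make the arithmetic concrete, here’s a minimal sketch of the two-step calculation in Python, using the worked example above. The half-star rounding helper is my own illustration of the 3.250 threshold described here, not CMS’s published code.

```python
import math

# (measure_id, stars, weight) triples from the worked example.
measures = [
    ("C02", 3, 1), ("C11", 4, 3), ("C23", 3, 2),
    ("D05", 2, 2), ("D10", 3, 3),
]

# Weighted average: SUM(Stars * Weight) / SUM(Weight)
raw = (sum(stars * weight for _, stars, weight in measures)
       / sum(weight for _, _, weight in measures))  # 34 / 11 = 3.091

def round_half_star(x: float) -> float:
    # Round to the nearest half star; a tie like 3.25 rounds up,
    # matching the 3.250 threshold for moving from 3.0 to 3.5 Stars.
    return math.floor(x * 2 + 0.5) / 2

print(round(raw, 3), round_half_star(raw))  # 3.091 3.0
```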
If you’re new to Star Ratings, now might be a good time to pause and play with an example for a bit to get a more intuitive sense of it. If you click on BHA’s Stars Planner, it will randomly pick a contract’s real data from the past year. See what happens to the overall rating when you move a 1x measure up and down a star. Pick some higher weighted measure and get a feel for what it takes to move the overall rating.
Stuff Glossed Over
The calculation is not actually a 2 step process. It’s more like 11 steps:
Assign Stars Per Measure
- Collect and validate numeric scores from all plans for all measures
- Calculate the numeric score Improvement Measure for Part C and for Part D
- Use statistical methods to come up with cutpoints for each measure (the grading curve)
- Convert each numeric score to a 1, 2, 3, 4, or 5 Star rating for each measure
- Adjust certain CAHPS measure results up or down a star
- Apply disaster area (and Covid) rules
Assign Overall Star Rating
- Calculate the weighted average
- Recalculate the weighted average without the Improvement Measure and apply the “hold harmless” rules.
- Apply Categorical Adjustment Index (CAI) to adjust for plans with low-income members
- Apply the Reward Factor that boosts plans with low variance and high mean ratings
- Round on the half-star and award the final star ratings.
The first Technical Notes, published in 2012, ran 79 pages. The 2022 edition has grown to 175 pages as CMS has layered adjustment after adjustment into the calculation. Like any good rulemaking body, CMS listened to critics who pointed out unintended consequences of parts of the calculation and introduced adjustments to offset them. We’ll use the rest of this chapter to discuss these adjustments.
If you’re not ready for all the details, stick around at least for the table in the next section. We quickly summarize the “what” and “why” and try to answer the “what does it mean to you?”
Implications of Rules and Adjustments
Here’s a summary of the various adjustments and quirks in the calculation. “Why” -- what is the theoretical reason for the rule, or what problem is it solving? “What” -- what is its impact on the calculation? “Implications” -- what does it mean practically to Plans? Several of the quirks pertain to the 5x-weighted “Part C Improvement” and “Part D Improvement” measures. These are discussed in detail with the Improvement measure.
Rule/Concept | Why? | What? | Practical Implications |
---|---|---|---|
Stars Mechanics | |||
Annual Cutpoints | Ensures that the “curve” is based on current relative performance. Blocks Plans from gaming the system by only doing the minimum necessary work. | Changes cutpoints annually based on actual results from all plans’ data. | You don’t quite know the rules until it’s too late to do anything. Be sure to overshoot your goals a bit. |
Measure Weights | Some measure topics have a greater impact on members’ health than others. | Weighs measures 1x to 5x according to the type of measure. Improvement and member experience measures carry more weight than “process of care” measures. | (1) Not all gaps are equally important. (2) Be attentive to member experience measures that are jumping in weight. |
Adjustments | |||
CAI / Categorical Adjustment Index | Disabled and low income members are less likely to score well on Star Ratings. Some Plans have more than others. | Adjusts final score up or down based on the percentage of low income and disabled members. Adjustment between -0.04 and +0.16, which comes to -3 to +13 weighted stars. | Account for it when projecting your rating. But the adjustment is small except for plans that have the most disabled and low income members. |
Reward Factor | Top quality means performing well across all measures, not selectively excelling on some and ignoring others. | Adjusts final score up 0.1 - 0.4 for plans that have high average scores and low variance. | You can hit a 4.5 or 5 Star overall rating without a majority of 5-Star measure scores by avoiding any low-scoring measures. |
Disaster Areas | Natural disasters can make some measures unfair to impacted Plans. | If > 25% of members are impacted by disasters, CMS takes the “better of” current or prior year scores for most measures. | Focus on your worst measures from the prior year. |
Covid | Disaster rules applied nationwide. | For 2022 Stars, the hold harmless rules apply to all plans. | - |
Merged Contracts | A quirk allowed large insurers to get higher ratings by merging large Plans into smaller plans. | Merged Plans’ scores are calculated as the membership-weighted average of the prior plans’ scores. | - |
SNP Only Measures | Some services are critical for high-acuity Special Needs Plans. | SNPs (C-SNP, I-SNP, D-SNP) have Care for Older Adults (2 measures as of 2022, but 3 previously) and SNP Care Management. | These should be “gimme” measures to boost your rating. |
Improvement Measures | |||
Part C Improvement Measure | Encourages Plans to always be striving for improvement. | The 5x-weighted measure for Part C improvement is calculated from how many other measures improve. | Big help to plans on their way up. Spread effort around to make steady improvement on a wide front instead of big jumps in a few places. |
Part D Improvement Measure | Drug Plan improvement is measured separately from the medical plan. | 5x Part D measure. In Part D, the three triple-weighted medication adherence measures have an outsized impact. | Try to go up every year in medication adherence. |
“Statistical Significance” Testing | Ensures that “improved” or “declined” is meaningful and not just random chance. | Results are not counted as “improved” or “declined” unless the change meets CMS’ definition of “statistically significant”. | Large-denominator measures need smaller movement to meet this test. Hence medication adherence is more likely to move up or down than a HEDIS hybrid measure. |
Hold Harmless #1 - 5 Star Measures | Once a plan is performing well on a measure, small declines are not meaningful. | Scores that “Declined” will be counted as “No Change” if both years are 5 Stars. | Since this only applies at 5 Star level and you don’t know the cutpoints, it’s not something to rely on. |
Hold Harmless - Improvement | 4-Star Plans would be penalized on Improvement when they max out the practical limits of performance. | Score is calculated twice -- with and without both improvement measures. 4-Star plans get the better result. | Improvement can’t hurt you once you’re 4 Stars. It’s usually a nonfactor once the plan is in the 4-Star range. |
Hold Harmless - Covid | All Plans met the definition of “disaster” in 2022. | Hold harmless rule applied to all plans. Rating calculated “with” and “without”; plan gets the higher rating. | Improvement measure can only help in 2022 Stars. This is a one-off. |
Setting Cutpoints
CMS explains how they use clustering to set cutpoints:
"Mean resampling is used to determine the cut points for all non-CAHPS measures. With mean resampling, measure-specific scores for the current year’s Star Ratings are randomly separated into 10 equal-sized groups. The hierarchal clustering algorithm is then applied 10 times, each time leaving one of the 10 groups out of the clustered data. The method results in 10 sets of measure-specific cut points. The mean for each 1 through 5 star level cut point is taken across the 10 sets for each measure to produce the final cut points used for assigning measure stars."
I couldn’t begin to explain this. Well, maybe I could, but the Tech Specs themselves and a statistics textbook are the place to learn the theory and practice behind it. Instead, here we’ll focus on a few of the impacts.
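If you just want the shape of the method, here is a minimal sketch, assuming scipy’s Ward-linkage hierarchical clustering as a stand-in for CMS’s exact algorithm -- the Tech Notes define the real one, and the function names and fold logic here are my own illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_cutpoints(scores, n_clusters=5):
    # Cluster one-dimensional scores into (up to) 5 groups and return
    # the boundaries between adjacent groups, taken as midpoints.
    scores = np.sort(np.asarray(scores, dtype=float))
    Z = linkage(scores.reshape(-1, 1), method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    groups = sorted((scores[labels == k] for k in np.unique(labels)),
                    key=lambda g: g.mean())
    return [(lo.max() + hi.min()) / 2 for lo, hi in zip(groups, groups[1:])]

def mean_resampled_cutpoints(scores, n_folds=10, seed=0):
    # Randomly split plans into 10 folds, cluster 10 times leaving one
    # fold out each time, then average the 10 sets of cutpoints.
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(scores)), n_folds)
    cuts = []
    for i in range(n_folds):
        keep = np.concatenate([f for j, f in enumerate(folds) if j != i])
        cuts.append(cluster_cutpoints(scores[keep]))
    return np.mean(cuts, axis=0)  # thresholds for 2, 3, 4 and 5 Stars
```

Now, the practical impacts: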
- Cutpoints are not Set Until After the Measurement Period. As designed, you cannot know the cutpoints until long after you can do anything to improve your score. This ensures that they’re fair in the sense that they represent actual relative performance across all Plans. It also means that you can’t game things by aiming for exact targets and then stopping. Frustratingly, it also means that you cannot definitively predict where you stand. Unless you’ve comfortably exceeded your target, you will have to operate on your best guess of where Stars will land. The actual rating will not be available until just before the annual enrollment period.
- Cutpoints are based on “Clusters”, not percentiles. This is the most surprising part if you’re new to Stars. The method yields 5 groups, and the Plans within each group have performance as similar as possible. There have been years where only one Plan got 1 Star because no other Plan was close. There have been years where cutpoint levels were skipped entirely. The count of Plans at 1, 2, 3, 4 and 5 Stars often differs widely across measures.
- Some Measures are Harder than Others. As a consequence of the prior point, the average Star Rating varies from measure to measure. CMS publishes the average each year in the “Fact Sheet” distributed with the technical specs. In 2020, C28-Complaints About the Health Plan was the “easiest” with an average of 4.9 -- meaning that the vast majority of plans achieved 5 Stars. C17-Falls Risk was the “hardest” with an average of only 2.5 Stars.
- Trends. The biggest mistake in predicting cutpoints is to assume they move in one direction. What you often see instead -- especially for new measures -- is that they swing higher, then swing lower. If a particular cutpoint jumps 6% one year, it will probably give some of that back the following year. A trustworthy prediction of the cutpoints has to predict both the direction of the industry and of the outliers.
You can get a feel for the cutpoint trends using BHA’s web tools. Click on our Measure Browser and check the graphs at the bottom of the screen.
Measure Weights
Measures are weighted by CMS as follows:
Weight | Category |
---|---|
1 | New Measures |
1 | Process Measures |
2 (2022); 4 (starting 2023, but proposed to end in 2026) | Patient’s Experience and Complaints |
3 | Intermediate Outcome Measures |
5 | Improvement Measures |
There’s an important distinction between “process” and “outcome” measures. Process measures are like Flu Shots and Eye Exams: the clinical standard of care is that a certain group of people is expected to receive a certain service. Outcome measures more directly reflect the members’ health, such as Blood Sugar Control and Hospital Readmission. There is some grey area; for example, Medication Adherence is considered an outcome measure. Every year, CMS solicits public feedback and shares the responses, and the classification and weights of measures are a frequent topic.
Lately, the big buzz around measure weights has been CMS’s decision to increase the weight of the “patient experience” measures. They had been 1.5X for years, but CMS announced plans to change them first to 2X and then to 4X. The jump to 4X is a huge shift in the relative importance of different types of measures. Every Stars-focused conference has a presentation on what to do about it. (Our blog does too.)
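To see what a weight jump does mechanically, here’s the earlier worked example re-run with D05 (Rating of Drug Plan, a patient-experience measure) bumped from 2x to 4x. This is a hedged illustration -- the real reweighting touches several measures at once.

```python
# (stars, weight) per measure from the worked example.
measures = {"C02": (3, 1), "C11": (4, 3), "C23": (3, 2),
            "D05": (2, 2), "D10": (3, 3)}

def weighted_avg(ms):
    return sum(s * w for s, w in ms.values()) / sum(w for _, w in ms.values())

print(round(weighted_avg(measures), 3))  # 3.091
measures["D05"] = (2, 4)                 # patient experience at 4x
print(round(weighted_avg(measures), 3))  # 2.923 -- the 2-Star CAHPS
                                         # result now drags much harder
```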
Reward Factor (r-factor)
CMS boosts the final rating for plans that have a “high average” and “low variance” in their individual measure ratings. The thresholds for “high” and “relatively high” averages are set at the 85th and 65th percentiles, calculated separately for Part C, Part D and Overall. Variance is the measure of how consistent scores are -- a Plan with all 5-Star results is low (actually zero) variance, but so is a plan with all 1-Star results. A plan could average 4.33 from mostly 4-Star results and an occasional 5, or it could achieve the same average with nearly all 5’s but a few 1’s. The first is low variance; the latter is not.
Fun Fact: Before 2015, CMS labeled this “Applying the Integration Factor (Reward for Consistently High Performance)” and referred to it as the “i-factor.” In 2016 this was truncated to “Applying the Reward Factor”. Maybe not so fun, but it was news to me. My old colleagues called it the “i-factor” and I never understood why.
The Tech Notes define precisely what constitutes a high/medium/low average score and high/medium/low variance. The r-factor is then as follows:
Reward | Mean Rating | Variance |
---|---|---|
0.4 | High | Low |
0.3 | High | Medium |
0.2 | “Relatively high” | Low |
0.1 | “Relatively high” | Medium |
0 | All else | |
The Tech Notes provide a table of thresholds for “High” and “Relatively high” separately for Part C, Part D and Overall. They vary, but it’s about 4.4 for “High” (85th percentile) and about 4.1 for “relatively High” (65th). The reward factor is jet fuel for getting to 4.5 and 5.0 Stars.
You can calculate backwards to figure out exactly how much scores can deviate from the mean. But the important thing to note is that high variance gets no bonus. CMS is rewarding plans for avoiding any poor scores.
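Here’s a minimal sketch of the r-factor lookup from the table above. The real thresholds vary by year and by Part C / Part D / Overall; the 4.4 and 4.1 figures are the rough approximations quoted above, not official values, and the variance classification itself comes from year-specific Tech Notes tables.

```python
HIGH_MEAN = 4.4        # ~85th percentile ("High"), approximate
REL_HIGH_MEAN = 4.1    # ~65th percentile ("Relatively high"), approximate

def reward_factor(mean_rating: float, variance_level: str) -> float:
    # variance_level is "low", "medium", or "high" per the Tech Notes'
    # year-specific variance thresholds. High variance gets no bonus.
    if mean_rating >= HIGH_MEAN:
        return {"low": 0.4, "medium": 0.3, "high": 0.0}[variance_level]
    if mean_rating >= REL_HIGH_MEAN:
        return {"low": 0.2, "medium": 0.1, "high": 0.0}[variance_level]
    return 0.0

print(reward_factor(4.45, "low"))   # 0.4
print(reward_factor(4.45, "high"))  # 0.0 -- consistency is everything
```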
Categorical Adjustment Index
I used to work for a Special Needs Plan (SNP). Every year, we’d complain that our patients were sicker and that Stars were stacked against us. It was true -- some Star measures were especially difficult for this population. But CMS’s policy response was not to adjust for this risk -- the sickest patients are precisely the ones who need the highest quality care. Finally, in 2017, CMS threw us a bone with the “interim analytical adjustment called the Categorical Adjustment Index.”
Fun Fact: CAI was introduced as an interim adjustment in 2017 “while measure stewards undertake a comprehensive review of their measures in the Star Ratings program”. The word “interim” was quietly dropped in 2021, and in 2022 the review language was also dropped. So, it seems CAI is here to stay.
For the CAI, the “sickest” patients are defined by two percentages:
- % of membership with “Disabled Status”
- % of membership with Low Income Subsidy or Dual Eligible status
CMS deducts 0.035526 points from the final score (approximately -3 weighted stars) for the least disabled/low income plans or adds as much as 0.156984 (+12) for the plans with the highest percentage of each.
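Putting the adjustments together, here’s a hedged sketch of the arithmetic shape of the final rating -- the Tech Notes apply these pieces in a specific order with more caveats, so treat this as illustration only:

```python
import math

def final_rating(weighted_mean: float, cai: float, reward: float) -> float:
    # Raw weighted average of measure stars, plus CAI (which can be
    # negative), plus the reward factor, rounded to the nearest half star.
    raw = weighted_mean + cai + reward
    return min(5.0, max(1.0, math.floor(raw * 2 + 0.5) / 2))

# The earlier example plan (3.091 raw) with the worst-case CAI deduction
# and no reward factor still rounds to 3.0 Stars:
print(final_rating(3.091, -0.035526, 0.0))  # 3.0
```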
There are two takeaways to note:
- If you have a healthy and wealthy membership, make some allowance for CAI to tug at your Star Rating
- Star Ratings are hard for low income and disabled populations. CAI is helpful. Don’t panic if the raw score is coming up short.
SNP-Only Measures
Care for Older Adults measures calculate the % of membership who receive 3 services at least once during the year: pain screens, functional status assessments and medication reviews. The SNP Care Management measure is the % of members who received their health risk assessment on time. These measures only apply to Special Needs Plans (SNPs).
SNPs are by definition high on disability and low income membership. The COA measures insist on following a basic standard of care for elderly and frail membership. HRAs are always part of SNP plans’ responsibilities. So, in addition to CAI, these measures exist to boost SNP plans’ Star Ratings.
The COAs are very simple to scan HEDIS charts for. Whatever the PCP hasn’t completed can be completed in a single phone call from the health plan. They should be a gimme.
Merged Contracts
Many of the rules discussed originated as CMS’s response to some quirk or loophole in Star Ratings. The handling of Star Ratings in contract mergers is an example. What happened when a Plan (H contract) ceased to exist? It used to be that the rating ceased to exist too -- even if it was a low score. What happened if the company had another Plan that the members could be moved to, and that plan had a higher Star Rating? Companies, of course, moved them. The practice was widespread; in one example, a 3-Star, 200,000-member plan closed and the members were shifted onto a 2,000-member, 4-Star plan.
CMS responded in 2017 with rules for “Mergers, Novations and Consolidations”. These set the merged contract’s Star Rating as the average rating of all the prior contracts, weighted by membership.
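A quick calculation shows why the weighted average closed the loophole, using the example above (hedged -- CMS’s actual consolidation math is defined in the Tech Notes and works through the measure scores):

```python
# (star rating, membership) for the two merging contracts.
plans = [(3.0, 200_000), (4.0, 2_000)]
merged = sum(r * m for r, m in plans) / sum(m for _, m in plans)
print(round(merged, 2))  # 3.01 -- the big plan's rating dominates
```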
Disaster Areas
Star Ratings are based on relative performance -- Plans compete with all other Plans in the country. But in a local disaster (hurricanes, floods, wildfires, etc.), the comparison becomes unfair. To adjust, CMS gives affected plans the prior year’s score instead. The current year’s scores aren’t entirely discarded: if the plan succeeded in raising the score for a measure, CMS uses the better of the two scores.
Affected contracts are defined as plans in which at least 25% of the members are in a county under a federally declared disaster “individual assistance area”.
There are a few exceptions; these are explained in the Tech Notes and not rehashed here.
COVID-19
The pandemic made the entire country effectively a disaster area and made all Plans eligible for relief. The above rules apply to everyone for 2022, and CMS had to make some adjustments to how it calculates the averages for the reward factor and the cutpoints.
Improvement Measure
CMS has two meta-measures included in Star Ratings:
- 5x - Part C Improvement
- 5x - Part D Improvement
The next chapter of this tutorial will be devoted to these two, but I want to point out some highlights before closing out this chapter.
- A good Improvement Rating can push you above 4 stars. But a bad one cannot drag you below it.
- The Improvement measure is calculated from the net of how many measures go up versus how many go down. It’s designed to center at 3 Stars when an equal number go up as go down. The up/down is based on the scores, not the measure stars, so if cutpoints jump higher, a measure could go down in stars but still count as “improved”. Final note: the ups and downs are weighted by measure.
- The change must be “statistically significant”. There is an interval around the prior year’s score that counts as “unchanged”; below it is “declined” and above it is “improved”. You can calculate this in advance of HEDIS and know exactly what you need to hit -- changing cutpoints do not affect this calculation.
The last point is important. Attention to Statistically Significant Improvement during HEDIS should be a critical piece of your Stars Strategy.
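As an illustration of why denominators matter here, below is a two-proportion z-test standing in for CMS’s exact significance rule -- a hedged sketch, since the Tech Notes define the real test and thresholds, and the sample sizes and rates are made up.

```python
import math

def classify_change(p_prior, n_prior, p_curr, n_curr, z=1.96):
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_prior * (1 - p_prior) / n_prior
                   + p_curr * (1 - p_curr) / n_curr)
    diff = p_curr - p_prior
    if diff > z * se:
        return "improved"
    if diff < -z * se:
        return "declined"
    return "no change"

# A 2-point gain is significant on a 20,000-member adherence measure,
# but not on a 400-chart HEDIS hybrid sample.
print(classify_change(0.78, 20_000, 0.80, 20_000))  # improved
print(classify_change(0.78, 400, 0.80, 400))        # no change
```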
Next Up
We’ve now provided an overview of the whole Star Ratings program and a detailed explanation of how they’re calculated. To complete the introduction, we’ll have a chapter on each of the types of measures (including Improvement) and a high level discussion of the types of things that Plans do to improve scores.