The statistics on how to win cat enforcement events

Disclaimer:
(Yes, this is another long post. That is because it is filled with facts. If facts without pictures is too much for you, then may I suggest a comic book instead. I have included a summary towards the end though.)

HOW TO WIN CAT ENFORCEMENT EVENTS

INTRODUCTION
They tested the cat-enforcement-with-added-benefits (i.e. with a new performance based cat model on top) on us. We criticized it. Their reply was to release it immediately and with no intention to change the model. Alright… so how does the model do?

Or we can put it differently, how do you win races using this new performance based model? And is there possibly something strange about how to win them? (Seeing as I and several others criticized it and predicted troubles…)

METHOD
I sampled as many cat C races using the new model as I could from ZP, test events included. That is not many races so far, but enough for some statistical testing. I discarded races with fewer than 10 ZP-registered participants in cat C.

Although several of the latest races are crits on Downtown Dolphin there are also other races of varying length and with varying numbers of participants.

The reason for choosing cat C for this little study is it generally has high attendance, meaning more data, while at the same time being a rather low cat, meaning strange things going on with the new model will likely be more pronounced for various reasons than in cat B, another well-attended cat.

I then made several comparisons over a number of ZP measures between three sub-samples, using a statistical test called the Mann-Whitney U-test. This test checks for a difference between two samples and answers the question: if some measure tends to be higher in one sample than in the other, could that be just a random coincidence, or is it highly likely that we would see the same difference again and again if we ran the test on new samples later on? (I.e. there IS a real difference.) Strictly speaking the test works on ranks rather than on the averages themselves, so the averages and medians below are there for illustration.

The Mann-Whitney test is a good choice for comparing samples when sample sizes differ and you don’t expect the measures you look at to be normally distributed, i.e. nicely bell shaped.
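For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a two-sided Mann-Whitney U-test in plain Python. It is not necessarily the exact procedure used for the numbers in this post: it assumes the normal approximation with a tie correction and no continuity correction, and the function names and the sample values at the bottom are made up for illustration.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: each group of t tied values reduces the variance.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical, nothing to distinguish
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p


# Hypothetical 15 sec W/kg values, NOT the data behind this post.
podium = [7.9, 7.4, 8.1, 7.0, 7.6]
rest = [5.1, 5.8, 6.0, 4.9, 5.5, 6.2, 5.3]
u, p = mann_whitney_u(podium, rest)
print(f"U = {u}, p = {p:.4g}")
```

With samples as large as the ones here (hundreds of racers) the normal approximation should be reasonable. For very small samples an exact test would be the safer choice.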

I have divided each race into three sub-samples. First I have made a comparison between podium winners (1st to 3rd) and the rest of the field. Then I have also compared the podium to 8th-10th place. The reason for the latter comparison is to get a fair number of racers who did put up a fight rather than use the race as a recovery ride, who thus still placed fairly high up in the results table (although there are some races with low attendance making 8th-10th a low placing), but who weren’t necessarily almost alongside the podium in the final sprint.
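The split described above can be sketched roughly as follows. This is only an illustration: it assumes each race is a list of rider records already sorted by finishing position, and the function name `split_race` and the `min_field` parameter are hypothetical.

```python
def split_race(finishers, min_field=10):
    """Split one race into the three sub-samples used in the comparisons.

    finishers: rider records ordered by finishing position, winner first.
    Returns (podium, early_losers, all_losers), or None if the race has
    fewer than min_field finishers and is discarded.
    """
    if len(finishers) < min_field:
        return None
    podium = finishers[:3]          # 1st to 3rd
    early_losers = finishers[7:10]  # 8th to 10th
    all_losers = finishers[3:]      # the rest of the field, 4th and down
    return podium, early_losers, all_losers


# Hypothetical race with 12 finishers, identified by finishing position only.
result = split_race(list(range(1, 13)))
if result is not None:
    podium, early_losers, all_losers = result
    print(podium, early_losers, len(all_losers))
```

Note that 8th-10th are also part of the rest of the field, so the two comparisons against the podium overlap by design.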

Below are the results and their implications for how to podium these events.

RESULTS

20 min W/kg - Podium vs Losers (the rest of the field)
Avg: 2.9 vs 2.8
Median: 2.9 vs 2.9
Sample sizes: 111 vs 788 racers
Significance: The result is statistically significant at the 1% level (p < 0.01). The p-value is 0.000435. It is therefore extremely unlikely that this difference, although small (2.9 vs 2.8), is random.

20 min W/kg - Podium vs Early Losers (8th-10th)
Avg: 2.9 vs 2.9
Median: 2.9 vs 2.9
Sample sizes: 111 vs 111
Significance: p = 0.77. p is NOT < 0.01. There is NO real difference between the podium and those going across the finish line fairly early with regards to 20 min W/kg.

The takeaway: You don’t really win by having a higher 20 min W/kg than others in your cat. We have to look elsewhere.

Watt - Podium vs All Losers
Avg: 240 W vs 223 W
Median: 242 W vs 224 W
Significance: p = 7.09E-07 (way smaller than 0.01). The difference is NOT random.

Watt - Podium vs Early Losers
Avg: 240W vs 229W
Median: 242W vs 233W
Significance: p = 0.017. The result is statistically significant at the 5% level but not at the 1% level. It is still quite unlikely that the difference is random.

Takeaway: 20 min W/kg doesn’t seem to matter much, not inside the category. But you do want to be able to push higher average Watts than others if you want to podium. Still, when Zwift remodeled the original new model, they suddenly chose to downplay the significance of raw Watt in favor of keeping W/kg measures as the primary cat divider. What were they smoking?

15 sec W/kg - Podium vs All Losers
Avg: 7.7 W/kg vs 5.5 W/kg
Median: 7.5 W/kg vs 5.2 W/kg
Significance: p = 0 as reported. (The calculation on a 64-bit computer doesn’t even return a non-zero value here.) The chance that this difference is random is vanishingly small.
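A side note on a p-value of exactly 0: one common way this happens is when the tail probability is computed as 1 minus a value that has already rounded to 1.0 in double precision. The sketch below shows that effect. It is not a claim about which software or formula produced the number above, the z-score of 10 is purely hypothetical, and the function names are made up.

```python
import math


def two_sided_p_naive(z):
    """2 * (1 - Phi(|z|)): collapses to 0.0 once Phi(|z|) rounds to 1.0."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2)))
    return 2.0 * (1.0 - phi)


def two_sided_p_stable(z):
    """Same quantity via the complementary error function, which keeps tiny tails."""
    return math.erfc(abs(z) / math.sqrt(2))


z = 10.0  # hypothetical z-score, not taken from the race data
print(two_sided_p_naive(z))   # the subtraction loses everything
print(two_sided_p_stable(z))  # tiny, but still representable
```

Either way the conclusion is the same: the difference is far beyond any usual significance threshold.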

15 sec W/kg - Podium vs Early Losers
Avg: 7.7 W/kg vs 5.9 W/kg
Median: 7.5 W/kg vs 5.8 W/kg
Significance: p = 3.38E-14. The difference is NOT random.

Takeaway: You want to have a strong sprint if you want to podium. No surprise there of course, but seeing as the new model is a mixed model that is supposed to take different parts of the power curve into account, don’t you think Zwift has undervalued the left end of the power curve? Just a little?

Height - Podium vs All Losers
Avg: 180 cm vs 179 cm
Median: 180 cm vs 180 cm
Significance: p = 0.21. The very small difference (1 cm) is not statistically significant and could simply be random. I haven’t even bothered to compare the podium to early losers as I expect the result to be the same.

Takeaway: Be average height. Or… height doesn’t really matter for race results (although we know it affects speed slightly in Zwift’s physics model).

Weight - Podium vs All Losers
Avg: 84 kg vs 81 kg
Median: 85 kg vs 81 kg
Significance: p = 0.0035. The difference is not random.

Weight - Podium vs Early Losers
Avg: 84 kg vs 81 kg
Median: 85 kg vs 80 kg
Significance: p = 0.016, significant at the 5% level.

Takeaway: Be heavy! I tested for weight in the old cat limits ages ago and the result was the same back then as it is today with the new and “improved” model. Heavy set racers with a larger muscle volume have an advantage. Also, juniors with very low weight have an advantage too but for different reasons (juniors were included in the podium sub-sample), and I expect the average weight difference between podium and losers to be even bigger if the juniors on the podium are discarded (gonna test this later). The heavy guy can usually push higher Watts while keeping in line with W/kg limits and the lighter guy usually has a hard time keeping up. And if he still does keep up, then he might not get the chance to race against the heavies anyway since the model might move him up to cat B. Why haven’t they accounted for weight in the model? Oh, it’s because they’re still stuck on W/kg. Fair? Fun? Zwift… you suck! You still suck!

Avg-to-Max HR - Podium vs All Losers
Avg: 0.85 vs 0.89
Median: 0.86 vs 0.90
Significance: p = 1.33E-15. No way in hell the difference is random.

Avg-to-Max HR - Podium vs Early Losers
Avg: 0.85 vs 0.89
Median: 0.86 vs 0.91
Significance: p = 7.22E-10. NOT random, no chance.

Note: This is an interesting ratio, comparing the avg HR to the HR peak during the race. Someone with a very low difference between the two measures is either holding back his efforts evenly across the race (not that common) or is close to a flat out effort with little variation for most of the race (more likely). By contrast, someone with a large difference between the two measures is doing a relatively low effort during most of the race compared to his max, which in turn will be his effort level at select moments, e.g. in the final sprint or in some crucial climb short enough to go hard in.
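As a rough illustration of the ratio, assuming it is simply the average HR divided by the peak HR over the same race (the function name and the sample values are hypothetical):

```python
def avg_to_max_hr(hr_samples):
    """Average heart rate divided by peak heart rate for one race.

    Values near 1.0 mean a steady effort close to the rider's race peak;
    lower values mean most of the race was ridden well below that peak.
    """
    if not hr_samples:
        raise ValueError("need at least one heart rate sample")
    return (sum(hr_samples) / len(hr_samples)) / max(hr_samples)


# Hypothetical riders: one steady near the limit, one easy with a hard sprint.
steady = [170, 172, 174, 175, 176]
sprinter = [140, 142, 145, 150, 182]
print(round(avg_to_max_hr(steady), 2), round(avg_to_max_hr(sprinter), 2))
```

The ratio uses each rider's peak in that particular race, not a lab-tested max HR, so it says something about pacing within the race rather than about absolute fitness.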

Takeaway: Keep cruising! It is still, with the new model in place, possible to stay in cat indefinitely and do a lesser effort, intentionally or not, than others putting up a fight, and still grab podiums. The cruising just isn’t as blatantly obvious as before since there is no transparency around the cat definitions anymore. You have to do a little statistics first to unearth the stink. And now I have. There it is, the proof that the new model doesn’t prevent cruising. Of course it doesn’t. It can’t, not as long as W/kg remains an important element in the model. This is what you get, epic fail. Shame on you, Zwift!

EXECUTIVE SUMMARY FOR COMICS FANS
This is how to improve your chances of winning the cat enforced races:

Be a relatively heavy sprinter, toward the upper end of the W/kg spectrum in your cat (like most others), a guy [sic!] who can also push heavy Watts on the flat. Oh, and don’t push yourself too hard when there is no need to (unlike in a sprint or a short climb). That will only get you promoted to the bottom of the next cat and that will be no fun.

In other words, little has changed with the new model. Surprise, surprise…

17 Likes

it won’t change the way races below A are won or probably even raced (nothing will), but it will do the simple but very important job of keeping the guy with the 5wkg ftp or the 5 minute power that is somehow 2wkg higher than his 20min from joining your category.

personally i don’t understand this philosophy that races need to be hard from start to finish anyway. they aren’t in real life. what even is cruising? because to me it just sounds like quality training

6 Likes

I agree that weight has a different impact on flatter races.

I’m a heavier guy with a big FTP and larger sprint power. I nearly always podium in CAT C crits but on races with climbs I often get dropped even though my average power is almost 3.2w/kg (the limit for cat c). For the crits my average power is usually well below 3.2w/kg.

If I slimmed down I would guess that I’d move to cat B and I’d struggle even more on the climbing races.

My take is that the higher FTP you have, the harder it is to compete in the climbing races in your category?

2 Likes

If you used mostly crits that are flat then most of us could tell you without looking at statistics that the stronger raw watts will win.

We also know by now that using w/kg on flat roads is like dividing into weight classes.

7 Likes

Not with road races but I think you’d agree that Crits, CX races and MTB races tend to be short and hard IRL, quite similar to Zwift races (running for about as long compared to the time to complete a road race). But that’s not the point here. The point is, why is there a consistent tendency for those with the best placings (not just the winners) to work less hard than people somewhat further down the field? You wouldn’t expect it to be the other way around either. You expect there to be no statistically significant difference.

Well, do you really? Couldn’t you argue that those doing well in races may not have to work as hard to keep those further down the field away from the podium since they are the strongest? That it’s a quite natural and logical phenomenon that we see? There are strong arguments against this idea.

If the model did a good job at predicting future success based on physical stats (a futile ambition), then you would see an even distribution of rider strength. At the top of cat you might be a little bit stronger than the other top contenders and so you win but only barely so. Such a small difference in strength would then only show at all under hard efforts.

I still wouldn’t like the principle, the philosophy behind their categorization, as I think it is very unhealthy for Zwift-as-a-sport. But at least Zwift would have shown that they succeeded in what they set out to do. This is not what we see however. At the top there is still room for a clique who can race at a more leisurely pace. So the model fails at its own ambition.

It is never a problem that Pogacar cruises a race against Joe Blow if you have a cat promotion system that will promote Pogacar because of his history, because of his previous race results. He was so strong that he only had to cruise and could still win. Great! WTG! But he can only keep doing that for so long. This only becomes a problem if the cat model 1) does not focus on history but rather tries to predict the future, 2) fails at doing that even, and 3) there is no mechanism promoting Pogacar’s success - the model dictates that he can stay in cat forever.

You’re wrong here. You couldn’t know. I didn’t show you in order to keep the original post somewhat shorter. I’ll post those results later (I’m at the wrong computer right now), but much like with the 15 sec W/kg comparison, there is a very strong statistically significant difference in 5 min and 1 min W/kg (as opposed to the 20 min measure), both between podiums and all losers and between podiums and early losers. The model completely fails in its ambition. It doesn’t take into account the factors that really seem to matter for winning.

Executive summary:
No, he’s wrong. The new model SUCKS!

There is only one good thing about the new model. It prevents sandbagging. But we never needed a new performance based model for that. They could simply have added that to the old model, the one that sucks too.

2 Likes

So what is your solution? A ranking system? Flint has said numerous times that is the plan, it just isn’t step 1 or 2.

4 Likes

I think your stats appear to show that racing is better than before.
I don’t recall the details of the previous system to compare the new results.
It looks like the field is more even but there are still winners and not winners.
Importantly, there appear to be fewer cheaters or people gaming the system but that will come and another change will occur.

To summarize:

  1. We got cat enforcement
  2. A new cat system that is more complete than the single point system of the past.
  2A. The system was revised in the first week for light riders.
  3. Big time weight doping hack was closed.

We have never seen so much change in 1 month.

The only previous big change was the evolution away from Z power and the almost universal use of smart power devices.

6 Likes

Sort out micro watts and sticky bursts and we’re golden…

1 Like

Is it perfect? Nope.

Is it better than before? Absolutely, without a doubt.

12 Likes

Thanks for the TL;DR, so to win you need the exact attributes anybody would expect to be able to win on the typical Zwift course.

It sounds like you are basically saying ‘we didn’t want massively improved, we wanted perfect. Please take away the massive improvement and start again’.

A ranking system is coming.

6 Likes

You guys aren’t getting it.

I’m not saying the model pinpoints the exact attributes expected to win the typical course. That’s the racers. Those with the most suitable attributes will win. The only thing the model does is to cement their advantage, to hedge them, eternally. It’s clear the ambition of the model was to make each cat more competitive. It just doesn’t.

I’m also not saying I wanted perfect. I wanted something different, and you know that. They thought the model was perfect. It isn’t. They didn’t manage to save the performance metrics, how could they? And it’s not a massive improvement, not at all.

A ranking system isn’t coming. Where did you get that from? They have never said they would. We thought that was what they were hinting at before Christmas. It turned out it wasn’t. We got this instead. They wanted us to test it and give feedback. We did. People were overjoyed we finally got cat enforcement. They take that as approval of the model. Zwift was so invested in the new model that they didn’t stop to think even once about the criticism that was presented regarding the model per se and they probably never intended to listen either. They just went live without even asking. What happened next was it was suggested to us that we, the malcontent, continue the discussion on a ranking system on our own. That’s just a diversion, quelling the waves. There is no ranking system coming.

I just proved a lot of problems in the new model. They are not new problems, they are the same as before. There are other problems too. People voiced those in the test. One of them is a subgroup (which includes me) can’t beat the top racers in the cat below, the hedged guys. Still this subgroup is placed in the cat above. Massive improvement? I think not.

Here is the thing:
What I showed you above is an early warning. It’s the things people will have a bad gut feeling and ungrounded opinions about 6 months from now.

And then people will come here and write about those bad feelings about racing, giving hearsay and anecdotal evidence. They will get mowed down by the Zwift/model fanbois, as usual. And it will continue like that until one or a few persons decide to persistently try to sway the misguided general opinion and put the root of the problems into the spotlight in various ways over and over. Then a few years later a storm is finally building up. That’s when you will see the next futile, stupid attempt at saving the hopeless performance measures. The staff may or may not agree with this course of action but Eric Min still refuses to budge, so what can they do?

Ok… let’s get settled for the next go round then. See you in a few years. I will necro this thread then (I will have to save it because it will be deleted).

3 Likes

The horse’s mouth. At least, that is their ambition.

5 Likes

I’d say we won’t be seeing ranking-based categories at least before Q2/2023, if even then. Feel free to surprise me.

7 Likes

Oh for sure. First of all they have to choose an appropriate system, then implement it, then design and deploy the mechanism for actually splitting pens by ranking and everything that goes along with that. As far as I can tell, the resources are only just coming together, so it will be a long way off. Should it have happened years ago? Yes. Is there any value in labouring that point? No. I think the best we can hope for is good 2-way comms as the system develops so that we can help steer it in the right direction, and we don’t end up with something that they think we need that actually misses the point entirely.

For me the biggest mistake would be to keep any sort of category system. There is no need to have A/B/C/D or E/1/2/3/4 or any other such split, in an online game. Rank from 0-1000 or something, and sure you could colour code in 100 point chunks or something if you wanted, but static boundaries only present problems.

6 Likes

Why would you hope for that? There is only 1-way communication, possibly leading up to them feeling that they are forced to act somehow. The communication you think you see is just CRM. It’s incident management to protect the brand. This is not co-development with the community or solicited subscriber influence. Never was. Never will be.

They have to stop hugging performance measures. They will never work. They cannot work. They and everyone else have to give up that dream. There is no other way about it. We knew that years ago. And I won’t stop saying it or showing it. I will never stop. Not until that happens. Then I would go and do something else elsewhere.

1 Like

In my opinion Zwift need to continually analyse race data for Category Enforcement, especially over the first several months, to ensure there are no anomalies that are favouring certain groups of riders. I’m aware that most want to know exactly how Zwift are calculating CAT enforcement, but if that were published you would get riders trying their hardest to beat the system so they can race in a category below the one they should be in. CAT enforcement certainly ain’t perfect, but if Zwift continue to monitor / adapt the profile it will improve fair racing for us all :crossed_fingers:.

1 Like

They don’t. You always had to do the job for them. Hence why I posted. And you can’t even ask them for the data. You have to scrape it yourself.

And once you spoonfeed them anomalies, they don’t act on them. At least not until a few annoying and very stubborn persons have spread the light to a large number of subscribers who weren’t easy to convince since they put too much faith in Zwift. Hence why I post a lot. Same message, for years.

We wouldn’t have cat enforcement even, if it wasn’t for people like me. That’s the sad truth. It took a few years to make the first action point of three in a short bullet list to happen. But OK. Maybe I can get the second point into the next 5-year plan of theirs, while still waiting for the third and crucial point to happen.

Bullet list? It’s this, always was:

  • Pen enforcement
  • Remove performance ceilings, stop the after-the-fact DQ’s
  • Results-based pens

That and only that would turn Zwift into a sport. Shaky, obviously, with user inputs they can’t possibly control (weight etc) and flaky tech both internally (physics model) and externally (unreliable trainers etc). But there would be something to build on at least. Today we have nothing. Oh, except in the cat A/elite events that they showcase. The bullets never applied to them, that’s why they work.

2 Likes

  • Pen enforcement - DONE
  • Remove performance ceilings, stop the after-the-fact DQ’s - DONE (this follows from bullet one)
  • Results-based pens - To do.

3 Likes

:joy: I was typing with :crossed_fingers:. I agree with all you say. It’s like once something new gets released, e.g. PD 3.0, I doubt there are any tweaks going on to continually improve things; rather, they stop optimising and start work on PD 4.0 :thinking:. I have a Wahoo Kickr Bike, and a bug has existed for several months where my avatar moves out of the draft. Disabling steering on BLE sometimes helps but not always. For me it’s a disaster for racing, but several updates later the issue is still there. I have subscriptions to other platforms, Rouvy and FulGaz. Although their communities are smaller, features get added much faster, and I’m sure their employee count is much smaller. Everyone has different opinions of what’s important, but I don’t see a clear roadmap with an aim to improve racing, improve source code / fix bugs, add features and generally improve everything. I think ZHQ thinks their platform is perfect as it is, which unfortunately ain’t the case.

1 Like

I was in a pen enforced race last week and there was still one person disqualified (WKG) on the zwiftpower results.

1 Like