Proposal Addressing Smogon's 10+ Year old Usage Stat Problems

lighthouse64 · Aug 26, 2023

TL;DR -- Here's what's wrong

The tiering cutoff definition allows a pokemon to be both truly OU and not designated OU, because the percentage cutoff is derived from unweighted usage stats while the percentages themselves are calculated using weighted stats (i.e. the cutoff calculation method is outdated by a decade).
The weight system overvalues new players and lets poorly performing new players attain weights (at least 0.05) that have a noticeable impact at scale.
The system has no mechanism to stop masses of bad players with mediocre weights from making a garbage pokemon appear equal to a good pokemon in a tier.

I propose changing the weight formula and bucketing weights into tiers (similar to histogram bins) to fix these problems.

Dear Smogon community,

I've spent the past 1.5 years researching the usage stats, and Antar's weighted stat system has some serious problems that don't seem to be well-known. There are two main issues: the usage calculation method and the justification for the cutoff.

There are two parts to the usage calculation problem: the weight calculation formula and the method for summing weights. Remember that new players start with a Glicko rating of R = 1500, RD = 130. Glicko ratings are not Elo; they are displayed to the right of Elo on a player's ratings.

Recall from Antar's FAQ that weights are calculated using the normal distribution's cumulative distribution function applied to Glicko ratings. The problem here is that the function strongly overvalues high rating deviation when calculating weights. This can be seen by comparing the following two ratings against the 1630 cutoff: P1 has R = 1480, RD = 100 and P2 has R = 1580, RD = 25. P1 has a weight of 0.0668 and P2's is 0.0228. If we compare GXE values though, we see P2 is likely much better: P2's GXE is 60.5 and P1's is 47.4.

Yet the usage stats think P1 is much better just because of its higher variability.
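To make the comparison concrete, here is a minimal sketch of the weight calculation as described above: the normal CDF evaluated at the cutoff. The function name `weight` and the default cutoff of 1630 are my own choices for illustration, not names taken from the actual scripts.

```python
from math import erf, sqrt


def weight(rating, deviation, cutoff=1630.0):
    """Probability that a player's true rating exceeds the cutoff,
    treating the rating as normally distributed with mean `rating`
    and standard deviation `deviation`."""
    z = (rating - cutoff) / (deviation * sqrt(2.0))
    return 0.5 * (1.0 + erf(z))


# The two players from the example above.
p1 = weight(1480, 100)  # lower rating, high deviation
p2 = weight(1580, 25)   # higher rating, low deviation

print(f"P1 weight: {p1:.4f}")
print(f"P2 weight: {p2:.4f}")
print("P1 outweighs P2:", p1 > p2)
```

P1 sits 1.5 deviations below the cutoff while P2 sits 2 deviations below it, which is why the lower-rated but less certain player ends up with roughly three times the weight.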

The second problem comes from directly summing weights to determine usage. With enough 0.05 weights, which are attainable by players who lose more than they win (50% winrate is usually enough to maintain R = 1500), bad players can easily overwhelm good players. There's no cap on the amount of weight that bad players can contribute because all weights are directly summed. The system incorrectly assumes their weights will be too low to matter.

When I looked at UU player ratings, there was a ratio of at least 100 : 1 for players with weights below 0.1 to those above. One can also abuse this by running bots on new accounts that play against each other, using residential proxies at the cost of a few dollars to bypass any IP restrictions.
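A toy calculation shows how directly summed weights let many low-weight teams cancel out a few high-weight ones. The team counts and the 0.9 weight below are made-up numbers for illustration only; the 0.05 weight is the figure discussed above.

```python
def weighted_usage(entries):
    """entries: list of (weight, uses_pokemon) pairs, one per team.
    Returns the share of total weight carried by teams using the pokemon."""
    total = sum(w for w, _ in entries)
    used = sum(w for w, uses_it in entries if uses_it)
    return used / total if total else 0.0


# Hypothetical ladder: 10 strong teams never use the pokemon,
# 180 weak teams always use it.
strong = [(0.9, False)] * 10
weak = [(0.05, True)] * 180

share = weighted_usage(strong + weak)
print(f"weighted usage: {share:.1%}")
```

In this sketch the 180 weak teams contribute as much total weight as the 10 strong teams, so the pokemon lands at about half of all weighted usage even though no strong team runs it.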

All of the cutoffs past gen 4 are calculated incorrectly. For gen 9, the dex says:

A Pokémon is truly OU if a typical competitive player is more than 50% likely to encounter that Pokémon at least once in a given day of playing (15 battles).

However, 4.52% weighted usage doesn't represent a 4.52% encounter rate. That 4.52% could be made up of 1000 bad pokemon or 1 good pokemon because a ton of low weights will stack up to be equal to a high weight. Thus, we have no idea what 4.52% usage actually means.

The 4.52% figure is calculated by assuming the stats represent encounter rates, i.e. that they are unweighted. We can see this from the following formula, which comes out just above 50%:

Code:

1 - (1 - 0.0452)^15 = 0.5003

Where the term in the parens represents the probability of not seeing a pokemon, and the exponent says that we don't see a pokemon 15 times in a row. More details on the old formula are here.
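The same arithmetic can be run in both directions: from a usage rate to the chance of at least one encounter, and from a target chance back to the usage rate it requires. The function names here are hypothetical, and the sketch assumes each battle is an independent draw at the given usage rate, which is exactly the unweighted reading the cutoff relies on.

```python
def encounter_probability(usage, battles=15):
    """Chance of seeing a pokemon at least once in `battles` battles,
    assuming each battle independently shows it with probability `usage`."""
    return 1.0 - (1.0 - usage) ** battles


def cutoff_for(probability=0.5, battles=15):
    """Smallest per-battle usage that gives at least `probability`
    chance of one or more encounters over `battles` battles."""
    return 1.0 - (1.0 - probability) ** (1.0 / battles)


chance = encounter_probability(0.0452)
cutoff = cutoff_for(0.5, 15)

print(f"chance at 4.52% usage: {chance:.4f}")
print(f"usage needed for 50%:  {cutoff:.4%}")
```

Solving for the cutoff gives a value that rounds to 4.52%, which is where the published number comes from.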

This leads to pokemon like gen 8's regieleki that can be both truly OU and not designated OU.

Also even if the low weights were ignored somehow, it's important to remember that weights from above average players (>0.5) can vary widely. Some are very close to 1 and others are close to 0.5, but according to Antar, both should matter (hence why the 1630 cutoff is used). This means that no matter how you put it, weighted stats cannot be used to justify cutoffs.

I have a suggested fix, but it would require changing the usage calculation scripts (but not the data needed for them).

First, change the weighting formula to use the same formula that GXE uses. GXE uses a reference player with R = 1500, RD = 130, so you could change the reference to R = 1630, RD = 0 for a cutoff of 1630.
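As a rough sketch of this first step, the code below uses the standard Glicko expected-score formula against a reference player. This is an assumption on my part: it is not copied from the ladder code, but with the reference player at R = 1500, RD = 130 it matches the GXE figures quoted earlier for P1 and P2 to within rounding. The names `expected_score`, `gxe`, and `proposed_weight` are hypothetical.

```python
from math import log, pi, sqrt

Q = log(10) / 400  # Glicko scaling constant


def expected_score(rating, deviation, ref_rating, ref_deviation):
    """Glicko expected score of a player against a reference player,
    with both rating deviations combined."""
    combined_sq = deviation ** 2 + ref_deviation ** 2
    g = 1.0 / sqrt(1.0 + 3.0 * Q ** 2 * combined_sq / pi ** 2)
    return 1.0 / (1.0 + 10 ** (-g * (rating - ref_rating) / 400))


def gxe(rating, deviation):
    """GXE as a percentage, assuming the R = 1500, RD = 130 reference."""
    return 100.0 * expected_score(rating, deviation, 1500, 130)


def proposed_weight(rating, deviation, cutoff=1630):
    """Proposed weight: expected score against a player at exactly the cutoff."""
    return expected_score(rating, deviation, cutoff, 0)


for name, r, rd in [("P1", 1480, 100), ("P2", 1580, 25)]:
    print(f"{name}: GXE {gxe(r, rd):.1f}, proposed weight {proposed_weight(r, rd):.3f}")
```

Under this formula a higher deviation pulls the score toward 0.5 instead of inflating it, so P2 ends up with the larger weight.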

Set buckets for weights based on ranges. You could have a category for weights from 0 - 0.1, 0.1 - 0.5, and 0.5 - 1. Each bucket calculates its own usage stats the same way as they are done now. Then force buckets to have a certain contribution percentage (e.g. force 0 - 0.1 to have 5% contribution so low weight spam no longer works). Then calculate overall usage per pokemon by taking the expected value of its buckets.

The great thing about buckets is that you can go back to using unweighted entries within buckets since the importance is already enforced by what bucket each entry falls into. This would allow stats to be based on encounter rates, while also limiting the contribution of bad players. It means people can actually understand stats, and the stats will be robust against abuse.
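Here is one way the bucketing could look in code. It is only a sketch: the bucket ranges and the 5% share for the lowest bucket come from the description above, while the 25% and 70% shares, the choice to renormalise over non-empty buckets, and all names are my own assumptions.

```python
# (low, high, forced contribution share); shares other than 0.05 are made up.
BUCKETS = [
    (0.0, 0.1, 0.05),
    (0.1, 0.5, 0.25),
    (0.5, 1.0, 0.70),
]


def bucket_index(weight):
    """Find the bucket for a weight; the last bucket includes its upper bound."""
    for i, (low, high, _share) in enumerate(BUCKETS):
        is_last = i == len(BUCKETS) - 1
        if low <= weight < high or (is_last and weight == high):
            return i
    raise ValueError(f"weight {weight} is outside [0, 1]")


def bucketed_usage(entries, pokemon):
    """entries: list of (weight, team) pairs, where team is a set of names.
    Each bucket gets an unweighted usage rate; the overall usage is the
    share-weighted average over buckets that contain at least one team."""
    teams = [0] * len(BUCKETS)
    hits = [0] * len(BUCKETS)
    for weight, team in entries:
        i = bucket_index(weight)
        teams[i] += 1
        hits[i] += int(pokemon in team)
    total_share = 0.0
    usage = 0.0
    for (_low, _high, share), n, k in zip(BUCKETS, teams, hits):
        if n == 0:
            continue  # assumption: empty buckets are skipped and shares renormalised
        total_share += share
        usage += share * (k / n)
    return usage / total_share if total_share else 0.0


# Hypothetical month: 200 low-weight teams spam Ambipom, better players do not.
entries = (
    [(0.05, {"Ambipom"})] * 200
    + [(0.30, {"Garchomp"})] * 20
    + [(0.90, {"Garchomp"})] * 10
)
ambipom = bucketed_usage(entries, "Ambipom")
garchomp = bucketed_usage(entries, "Garchomp")
print(f"Ambipom: {ambipom:.1%}, Garchomp: {garchomp:.1%}")
```

No matter how many low-weight teams pile onto Ambipom in this example, its overall usage cannot exceed the lowest bucket's forced share.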

There's also a very weak fix that is bad but easy to implement

Lower the deviation threshold a player must get under to have any weight. Currently it's 100, which takes about 5 battles to reach. This would help a bit, but it doesn't solve the root issue, and the threshold cannot logically be below 60. It takes about 30 battles to reach a deviation of 60, which is notably also the minimum number of battles required to meet suspect test requirements. Putting it below 60 would be saying that a person who qualifies for suspect reqs is not worthy of contributing to the usage stats, which makes no sense.

If the threshold is 60 though, a 50% winrate player can still achieve a weight of 0.015, which only needs about 100 bad players for every good player to be even, so it does a crappy job.

Realistically, we'd need an event like Ambipom to the Top (except via bots) to trigger any serious action, especially considering Zarel's inactivity, but it's still important to understand how to view the current usage stats critically -- the percentages alone are not enough to assess a pokemon's viability or frequency within a metagame. If you have any questions, please feel free to ask here or message me on discord (same username).

If you would like to know more, I wrote a paper on the subject and also made a video to help explain the problems. These are just supplements to the post itself though.

Video

Paper
https://drive.google.com/file/d/1GEtjBAUg_9PgdHA52cBX2BJjkZDtEMdB/view?usp=drive_link

Also thanks to pre for helping me understand some of the usage stat mechanisms as I did research.

Quite Quiet · Aug 26, 2023

I think it would be interesting if someone could plot out how the weighting would behave with either of these proposals. From what I remember of reading Antar's posts the intent originally was to have some sort of logarithmic curve to it that could be shifted based on how skillful the players had to be to meaningfully contribute to usage stats. It would make understanding how this impacts weighted usage much easier. If not full graphs, at least some samples.

Also: nobody at all works with the usage stats scripts these days. Marty runs them to collect the data but has, multiple times, expressed no interest (or time I think) in actually working on these. So, if anything should happen at all, someone else needs to step up and actually be willing to work with the usage scripts. The code exists here but has seen very few changes in years for exactly that reason.

lighthouse64 · Aug 26, 2023

Quite Quiet said:
I think it would be interesting if someone could plot out how the weighting would behave with either of these proposals. From what I remember of reading Antar's posts the intent originally was to have some sort of logarithmic curve to it that could be shifted based on how skillful the players had to be to meaningfully contribute to usage stats. It would make understanding how this impacts weighted usage much easier. If not full graphs, at least some samples.

Ooh, that's a great suggestion. Thanks for the idea

I'd be more than happy to plot these out -- the only issue is I don't have access to any of the data. If anyone could get me the 2016-08 RU logs (or any other logs with situations similar to gen 6 Ambipom), or something from gen 9 from April 2023 onwards (that data has significantly fewer ladder errors), please reach out to me.

Quite Quiet said:
Also: nobody at all works with the usage stats scripts these days. Marty runs them to collect the data but has, multiple times, expressed no interest (or time I think) in actually working on these. So, if anything should happen at all, someone else needs to step up and actually be willing to work with the usage scripts. The code exists here but has seen very few changes in years for exactly that reason.

I've studied the usage stats scripts pretty intensively, and yep, they definitely are a pain to work with. I've talked to the PS admin, and the main problem here is that there's no specific server to store the data logs, so usage stat calculation has to share resources with the main server. The pkmn library has a recreation of them on github, although I don't think they are used atm.

I'd be available to work on this, but now that depends on people's willingness to trust me. Also, I would like to see how my proposed mechanism looks on old data before investing in a server to dedicate.

Bughouse · Aug 27, 2023

I can follow well enough what the potential problems with the usage weighting formula currently are and what the proposals for modifying it would do.

That said, I want to be clear that there's nothing wrong with using a cutoff percentage based on raw usage rates or indeed any arbitrary cutoff percentage (since the OP seems to be suggesting that by calling the cutoffs "outdated"). It would be very bad to change to using a cutpoint that is based on the moving target of weighted usage stats that could be different across measurement periods or even within a measurement period across different tiers. A simple and easy heuristic 4.52% and an explanation of its derivation, more likely than not to appear within 15 battles, is far superior to something more inscrutable and that would be a moving target. Frankly I would have no issues with a cutoff simply being a hard 4.5% or 5%, etc. even if those would no longer be tied to any particular calculation. This cutoff percentage is really "just a number" and more or less just needs to conform to community expectations of what should be in or out of a given tier.

quziel · Aug 27, 2023

Working through the paper. I feel that the point on encounter rates vis a vis the 4.52% cutoff is sorta misleading. The original text says "typical competitive player", and I'm reading that as someone who is competitive, aka not casual, and thus is spending most of their time in higher ladder, and facing players with 1630 glicko or higher (1695 in OU). So if we just view it as like seeing the mon 50% of the time over 15 matches against higher ladder players I think the issue, as much as there is one, is sorta solved.

As for the random deviation problems, I feel these are a bit overstated just because an account rarely spends that long with high deviation; the act of playing games, aka what gets mons on the usage charts, lowers deviation, and thus quickly prevents low ladder new accounts from having that high of an impact on usage stats. The weak fix is likely functional here; I'd choose RD<=65, just because that's 1/2 of the diff between 1500 and 1630, and 1/3rd the diff between 1500 and 1695.

Aka just change the following line in common.py:
```if deviation > 100 and cutoff > 1500:```
to
```if deviation > 65 and cutoff > 1500:```
or
```deviation=min(deviation,65)```

Your choice. Either likely prevents spamming low ladder with new accounts from even being remotely possible to inflate usage stats, especially in OU, which uses usage stats that start gathering info at a higher level.
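For anyone comparing the two variants, here is a small sketch of how they differ. Only the `if` line is quoted from common.py above; the surrounding function, its signature, the `return 0.0` body, and the `max_rd` and `clamp` parameters are assumptions made for illustration.

```python
from math import erf, sqrt


def weighting(rating, deviation, cutoff, max_rd=65.0, clamp=False):
    """Weight of a team under either variant of the RD fix.

    clamp=False: players above max_rd get zero weight (stricter threshold).
    clamp=True:  players above max_rd are treated as if their RD were max_rd.
    """
    if clamp:
        deviation = min(deviation, max_rd)
    elif deviation > max_rd and cutoff > 1500:
        return 0.0  # assumed behaviour of the quoted threshold check
    return (erf((rating - cutoff) / deviation / sqrt(2.0)) + 1.0) / 2.0


# A brand-new account (R = 1500, RD = 130) against the 1630 cutoff.
dropped = weighting(1500, 130, 1630)              # excluded entirely
clamped = weighting(1500, 130, 1630, clamp=True)  # counted as if RD were 65
settled = weighting(1500, 65, 1630)               # same player once RD reaches 65

print(f"threshold: {dropped:.4f}, clamp: {clamped:.4f}, at RD 65: {settled:.4f}")
```

The threshold version ignores fresh accounts until their deviation drops, while the clamp version counts them immediately at the weight they would have at RD = 65.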

Frankly the bigger issue with usage stats (in lower tiers) is just down to very low ladder population, which makes gathering any statistics far more difficult, as single users begin to have larger and larger influence on the usage stats themselves. We've seen this across the gens once ladder pops get too low, as a sufficiently dedicated user can get mons to rise by simply just spamming games. There's not an easy fix to this though.

lighthouse64 · Aug 27, 2023

Thanks for all the comments! Overall, I would like to state that I understand the issues (apart from the cutoff) are not that important for the most part. Realistically, no average person is going to do this, as it requires a significant amount of technical knowledge to actually pull off any meaningful tier manipulation. That being said, it wouldn't be surprising if it happened at some point in the future.

Also I am in favor of setting an arbitrary cutoff. Just make it transparent that the cutoff is arbitrary.

Bughouse said:
That said, I want to be clear that there's nothing wrong with using a cutoff percentage based on raw usage rates or indeed any arbitrary cutoff percentage (since the OP seems to be suggesting that by calling the cutoffs "outdated"). It would be very bad to change to using a cutpoint that is based on the moving target of weighted usage stats that could be different across measurement periods or even within a measurement period across different tiers.

Yep, I agree that it's fine if the cutoff is an arbitrary 5% (or some other number), and the number should be unchanging. However, if the community wants it to be that, then it should be justified as an arbitrary cutoff accordingly. It's the current explanation for the cutoff that bothers me because it spreads misinformation for other policy review posts. See the post regarding the hitmontop problem as an example. The stats should not be so confusing that they trip up the people who decide tiering policy. I called it outdated because the explanation is based on using unweighted stats, which haven't been used for tiering in over 10 years. I also find it problematic that the explanation gives a contradictory tiering decision for pokemon in comparison to the weighted percentages.

My main issue is that people are misled into believing that weighted usage percents are encounter rates, when it's almost impossible for the two to line up. I wanted to introduce bucketed weighting so that usage stats could actually be interpretable by people who don't want to spend hours understanding its confusing details to know what they mean (in fact I still have no idea what meaning 4.52% weighted usage has).

Also if the cutoffs are arbitrary, I believe that pokemon should get seriously considered to be eligible to drop to a lower tier at the end of a generation if they are close to the borderline cutoff. It makes a ton of sense for a 4.4% usage pokemon to drop a tier and get banned if necessary, but is it really ideal to keep a 4.6% usage pokemon in the OU by technicality limbo if people think it's not good?

quziel said:
Working through the paper. I feel that the point on encounter rates vis a vis the 4.52% cutoff is sorta misleading. The original text says "typical competitive player", and I'm reading that as someone who is competitive, aka not casual, and thus is spending most of their time in higher ladder, and facing players with 1630 glicko or higher (1695 in OU). So if we just view it as like seeing the mon 50% of the time over 15 matches against higher ladder players I think the issue, as much as there is one, is sorta solved.

First of all, thanks so much for taking the time to read my paper. It means a lot to me. The follow-up I'd have to this statement then would be -- why not just use 1760 then, if we only care about super competitive players? That was answered by Doug here.

I'd like to really stress that it's a bad idea to consider weighted stats as encounter rates in any scenario. If you look at the 1695 vs 1825 cutoffs for OU this past month, the percentages are different. And the fact that they are different means that you really can't say that encounter rates for good players (by at least whatever antar cared for) can be captured by weighted usage. There's a lot of variability between what can be considered good -- really anywhere from 0.5 - 1 weight (which is 1695+ glicko center), so a pokemon seen by 10x 0.5 players would have a vastly different weight than a pokemon seen by 10x 0.7 players. For a current example, Meowscarada would be UU by 1825 cutoff standards, but by OU's 1695, it made the cutoff.

Even a 0.1% difference is HUGE (think gen 8 regieleki), as it can settle the difference between OU and UU.

quziel said:
As for the random deviation problems, I feel these are a bit overstated just because an account rarely spends that long with high deviation; the act of playing games, aka what gets mons on the usage charts, lowers deviation, and thus quickly prevents low ladder new accounts from having that high of an impact on usage stats.

In the context of just a single player, yes that'd be correct. If you played 30 games with 1 player, their deviation would go down quite a bit. This problem doesn't care about the common player though. What if I decided to play 30 games with 3 players instead? Sure, this effectively eliminates 15 of the games that count, but the players will still have a deviation over 85. The point I'm trying to make is that it is possible to play a lot of games at high deviation, not that it's normal. After all, one wouldn't consider active "tier trolling" to be a normal occurrence, yet it was on the forefront of Antar's mind.

quziel said:
The weak fix is likely functional here; I'd choose RD<=65, just because that's 1/2 of the diff between 1500 and 1630, and 1/3rd the diff between 1500 and 1695.

Your choice. Either likely prevents spamming low ladder with new accounts from even being remotely possible to inflate usage stats, especially in OU, which uses usage stats that start gathering info at a higher level.

Yep, the reasoning makes sense and this is a very valid point. In fact, this problem actually cannot exist in OU because there are so many battles (and also the cutoff is higher). A cutoff deviation of 65 makes sense, although it still gives a 1500 player around 0.02 weight (i.e. the problem would only exist for lower tiers with not many battles). I do believe it's the best choice in terms of how quickly it can be implemented, but it still leaves a terrible taste in my mouth because the usage stats could be so much more useful to the community.

quziel said:
Frankly the bigger issue with usage stats (in lower tiers) is just down to very low ladder population, which makes gathering any statistics far more difficult, as single users begin to have larger and larger influence on the usage stats themselves. We've seen this across the gens once ladder pops get too low, as a sufficiently dedicated user can get mons to rise by simply just spamming games. There's not an easy fix to this though.

Yep, that's exactly why I said the hitmontop issue and ambipom to the top were entirely separate problems in my video. My only idea towards this would be to limit how much an IP/account could contribute to the stats per day (say, 10 battles per day max). That way you would have a much harder time spamming a bad pokemon with a genuinely good player.
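A per-account cap like this could be applied as a pre-processing pass before the stats are computed. The sketch below is hypothetical: the `(username, day, battle)` tuple layout and the function name are invented for illustration and do not reflect the real log format.

```python
from collections import defaultdict


def cap_daily_battles(battles, max_per_day=10):
    """Keep at most `max_per_day` battles per username per day.

    battles: iterable of (username, day, battle) tuples in chronological
    order. Earlier battles are kept; later ones past the cap are dropped."""
    seen = defaultdict(int)
    kept = []
    for user, day, battle in battles:
        if seen[(user, day)] < max_per_day:
            seen[(user, day)] += 1
            kept.append((user, day, battle))
    return kept


# Hypothetical day: one account plays 30 games, another plays 3.
log = [("spammer", "2023-08-01", i) for i in range(30)]
log += [("other", "2023-08-01", i) for i in range(3)]

kept = cap_daily_battles(log)
print(len(kept), "of", len(log), "battles counted")
```

Keying on username alone is the weakest form of this idea, since alternate accounts get around it; keying on IP would need the extra bookkeeping discussed elsewhere in the thread.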

chaos · Aug 28, 2023

It's good that there is interest in working on stats. I'll read the paper soon, but I'm not sure how much feedback I can give, as I am not an expert in this area.

quziel · Aug 28, 2023

Really the purpose of taking 1630 stats isn't to look at the absolute best players, because honestly there's only so many of them, but rather to just weed out the Charizard/Blastoise/Venusaur/Snorlax/Pikachu/Lapras teams. To get effective usage stats we need a very large sample size, and these are available at 1695 in OU, with a sample size of 70k battles. I'd argue that 1630 might actually be too high of a cutoff for current lower tiers, which only sample 5k battles (using avg_weight x num_battles from smogon/stats). Small sample size absolutely is a larger enemy for us than the "quality" of teams sampled, as once you get down to 5k matches a month (this is true for NU, RU, PU) individual users start skewing stats super hard.

As for the IP measure, it's probably valid to say "if you sample one team from one player, what is the chance you see a specific mon", but that requires logging a lot more data; recording IPs as associated data for teams likely would increase the processing load for stats. While this probably treats usage stats in a better manner (you might be able to decouple individual users, their team preference, and overall usage), the increase in book-keeping probably means this hurts rather than helps usage stat collection as a whole.

To look at solutions to the problem at hand (again, as much as there is one), probably just lower the maximum allowed rating deviation, and for lower tiers consider going from 1630 to 1565 as the minimum allowable Glicko. Yes this means we have more influence from new players, but hopefully the stricter RD should help to prevent account spam from being as much of a problem for us. The sample sizes in lower tiers are simply just so small it's actually insane; NU is currently sampling 166 matches (333 teams) a day, and that's small enough that 30 games a day, provided your glicko is 1630+, is accounting for like 10% of the mons used.

-----

I think the framing of the encounter rates sorta obfuscates what 4.52 actually means. That is, if you sample 15 teams from 1630+ glicko battles, you will see a mon 50% of the time. Now that is something you can view as a typical laddering session, but strictly speaking it's just a random sample of 15 teams from that elo.

lighthouse64 · Aug 28, 2023

quziel said:
Really the purpose of taking 1630 stats isn't to look at the absolute best players, because honestly there's only so many of them, but rather to just weed out the Charizard/Blastoise/Venusaur/Snorlax/Pikachu/Lapras teams. To get effective usage stats we need a very large sample size, and these are available at 1695 in OU, with a sample size of 70k battles. I'd argue that 1630 might actually be too high of a cutoff for current lower tiers, which only sample 5k battles (using avg_weight x num_battles from smogon/stats). Small sample size absolutely is a larger enemy for us than the "quality" of teams sampled, as once you get down to 5k matches a month (this is true for NU, RU, PU) individual users start skewing stats super hard.
...
To look at solutions to the problem at hand (again, as much as there is one), probably just lower the maximum allowed rating deviation, and for lower tiers consider going from 1630 to 1565 as the minimum allowable Glicko. Yes this means we have more influence from new players, but hopefully the stricter RD should help to prevent account spam from being as much of a problem for us. The sample sizes in lower tiers are simply just so small it's actually insane; NU is currently sampling 166 matches (333 teams) a day, and that's small enough that 30 games a day, provided your glicko is 1630+, is accounting for like 10% of the mons used.

Ooh yes this would be quite good since it would be combined with a lower deviation threshold. I definitely agree that sample size is the most important thing to care about. This would help against the hitmontop problem too (as single players would have less influence), which is arguably the most relevant tier manipulation problem (considering the technical barrier to bot spamming). The problem then becomes what average battle #/month would warrant a lower cutoff -- easy enough to implement.

quziel said:
As for the IP measure, it's probably valid to say "if you sample one team from one player, what is the chance you see a specific mon", but that requires logging a lot more data; recording IPs as associated data for teams likely would increase the processing load for stats. While this probably treats usage stats in a better manner (you might be able to decouple individual users, their team preference, and overall usage), the increase in book-keeping probably means this hurts rather than helps usage stat collection as a whole.

IP logging in the battle logs is unnecessary to achieve this goal. There just needs to be pre-processing, similar to how validation rate limiting works (12 battles every 3 minutes). If one didn't want to deal with this (which isn't that much overhead in the first place), you could at least limit it per username.

quziel said:
I think the framing of the encounter rates sorta obfuscates what 4.52 actually means. That is, if you sample 15 teams from 1630+ glicko battles, you will see a mon 50% of the time. Now that is something you can view as a typical laddering session, but strictly speaking it's just a random sample of 15 teams from that elo.

I can understand why you are saying this -- it's because weighted usage stats are highly correlated with how likely you are to see a pokemon. So yes, you probably would see a mon about 50% of the time if it had 4.52% weighted usage. In the context of tiering (where the 4.52% comes up) though, high correlation isn't enough. When cutoffs have razor thin margins that basically give no room for error, it means that this assumption is bound to come up with several contradictions because 1630+ glicko has such high variability of weights. A pokemon could be seen more often than its weighted stats suggest, and vice versa. If one doesn't want to change the way usage stats are calculated, it would be much better to just use an arbitrary cutoff of 5% and say that it's arbitrary so that these confusing contradictions cannot happen. Again, my most important concern here is misinformation, not that the cutoff has meaningful justification.

Also, you bring up an important point I forgot to put in my paper. You can't relate glicko ratings to elo. They are correlated, but the correlation is weak. For the UU and RU battles I measured, someone at 1630 glicko could be anywhere from around 1150 elo up to over 1500 elo. This makes sense because elo is much more volatile than glicko, and thus, you gain almost no information about where you would actually encounter a pokemon on the ladder with regards to the cutoffs used, which also makes the 4.52% encounter thing even more dubious.
For 1760, there isn't really enough data here, but I could gather info from OU for higher cutoffs if someone doesn't believe this is actually the case.

Data excludes users at 1000 elo because the elo floor causes rpr and elo to unnaturally deviate.
