March Stats

In case you haven't been following, the PS server issues mean I can't run my stats scripts on the server, and copying the logs off is slow as balls.

If we wait to get all the logs in a place where I can process them, we're looking at a delivery date for the March stats some time in May, even if I just do the usage-based tiers.

There is a shitty alternative, though: running stats over a subsample of the month's battle logs, maybe 10-100k battles for each tier?

That might only take a few days to copy over and should process in a few hours.

The primary issue is going to be figuring out a way to have the subsample be at least somewhat random, but I think that's just a matter of researching functionality in rsync.
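
For concreteness, here's a rough sketch of the kind of random pull I have in mind. The listing file, hostname, and directories are made up, and this isn't the actual tooling:

```python
# Rough sketch only: pull a random subset of battle logs using rsync's
# --files-from option. Assumes we already have a plain listing of log paths
# relative to the log root (the listing file, host, and directories below
# are hypothetical).
import random
import subprocess

SAMPLE_SIZE = 100_000  # battles per tier, per the range floated above

with open("logs_listing.txt") as f:            # one relative path per line
    paths = [line.strip() for line in f if line.strip()]

subset = random.sample(paths, min(SAMPLE_SIZE, len(paths)))

with open("subset.txt", "w") as f:
    f.write("\n".join(subset) + "\n")

# rsync then transfers only the files named in subset.txt
subprocess.run(
    ["rsync", "-az", "--files-from=subset.txt",
     "user@ps-server:/logs/2017-03/", "./logs-sample/"],
    check=True,
)
```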

Thoughts?

Usage numbers don't usually land within a tenth of a percentage point of the cutoff, so someone who's taken Stats 101 should be able to tell me what sample size we need to have high confidence in the outcome.
 

Disjunction

Everything I waste gets recycled
I think the sampling idea is good but, as anyone who has taken Stats 101 should know, it runs the risk of being a very biased representation of any changes that happen. For the purpose of forming NU, this is probably fine considering our metagame will dramatically change come May regardless of samples, but I don't want this to end up misrepresenting the stats for OU, UU, and RU and, consequently, dropping some awful creatures that shouldn't drop, making the existing councils do more work than they already need to.

I think if we went down this route then we should have a higher cutoff for drops and rises to prevent any serious damage sampling error could cause. If people are comfortable with the cutoff we've been using for the quick drops, then I believe that would be the easiest and most recognizable bar we could use.

I don't have much of an opinion on the sample size either, as I feel we are slightly in the dark on the process, resources, etc. being used here. You mentioned in the usage stats thread that just OU alone would take the next month, but I don't know what that means for the UU/RU stats, considering they are significantly smaller in size. Would RU's full set of battles be fine to run, sitting at ~100K battles in February? Do we have to run only half of UU's battles because they're at 200K? Obviously a bigger sample is better, but we don't want to go so big that we delay the stats another month anyhow.
 

Bughouse

Like ships in the night, you're passing me by
What's your capacity to truly pull a random sample? Can you at the very least control for the day the battle occurred on? That's the biggest enemy of randomness here, since trends come and go.
 

Zarel

Not a Yuyuko fan
Creator of PS
I think the sampling idea is good but, as anyone who has taken Stats 101 should know, it runs the risk of being a very biased representation of any changes that happen.
Sampling bias isn't really that hard to deal with. If you know your sample method, you know what biases it introduces.

Time of day and date range are the main worries, so a sample such as "every third day" is going to be reasonably free of bias.
 

quziel

I am the Scientist now
Sorry to interject, but I tried to work out some minimum values for the number of battles needed to estimate the usage percentage at various confidence levels.

1. Usage stats around 3.41% are most important for tiering

2. Margin of error of 0.1% is acceptable

3. Underlying distribution is large enough to be approximately normal

4. Confidence level Z values are << n


· Modifying confidence interval for population proportion; most important part of the formula is diff. between upper and lower bounds

· Choosing population proportion of interest to be p=0.0341

· Varying confidence levels of 90% (z*= 1.645), 95% (z*=1.96), and 99% (z*=2.575)



z* = z-critical value for the various confidence intervals (relates to area on the normal distribution; here it's 90% (z* = 1.645), 95% (z* = 1.96), and 99% (z* = 2.575))

p = sample proportion (as we care most about usage near the drop point, going to set it to 3.41% usage, or p = 0.0341)


Basically I wanted to figure out a confidence interval for the true population proportion, so I messed with the formula a bit.
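
In symbols, this is my reading of the setup: the normal-approximation interval for a proportion, and the sample size that keeps its half-width at or below E = 0.001 (the ±0.1% target):

```latex
\[
\hat{p} \pm z^{*}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
z^{*}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \le E
\;\Longrightarrow\;
n \ge \frac{(z^{*})^{2}\,\hat{p}\,(1-\hat{p})}{E^{2}}
\]
```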

Did a couple of numerical calculations, and found the following N values for Confidence intervals of 90%, 95%, and 99% respectively (sorry, unsure how to make a table).

Conf. Level | N

90% | 912
95% | 126500
99% | 218500

These numbers are only approximate, and could vary a bit, but should give a vague idea of the number of battles needed to achieve accuracy within 0.1% and the confidence we can put into those values.


Edit: It seems there was a mistake or two in the equations I was using, only apparent at "low" N. I am currently redoing the calculations and will update the post once done.

There's a large chance that I could have gotten some of my working wrong, so if anyone more knowledgeable in statistics could check my work, that would be wonderful. Also, if anyone knows how to properly format the Conf. Level / Min. Battles into a table, could you please PM me?
 
Zarel, "every third day" is still WAAY too much. I was thinking more along the lines of "battles whose numbers end in "00" (that should give ~20k OU matches). I was also hoping there'd be an easy way to randomize rsync's order of transfer, but if there is a I've yet to find it.
 

phantom

Banned deucer.
I don't agree with using a subsample of the stats. If the full stats aren't there, then there shouldn't be any tier shifts for this month or an NU alpha ladder. I don't see any particular urgency in using a fraction of the stats for this; they're likely to differ from the full stats and thus throw UU/RU out of whack with whatever tier shifts result. I would understand the need to do this if NU were going into beta this month, but it's not, and NU alpha is not an official tier. Meanwhile, a tier shift based on potentially volatile stats will negatively affect tiers that are actually official and have been tiering. I don't think avoiding a messy ladder for an unofficial tier is worth putting official tiers in a potentially bad spot.

The best option seems to be to skip over stats for the month, allow RU to exit beta with what it currently has, let the tier shifts that were supposed to occur this month happen the next, and have NU skip over alpha and go into beta the following month as well. This would keep mostly everything on course, including not pushing PU back a month, while keeping the other tiers in a stable position instead of gambling with a small sample of stats that could result in some pretty wacky tier shifts.
 
I would like to use a subsample of the stats. I trust our resident smart people, Zarel and Antar, on how best to handle it. quziel's confidence intervals look promising too, but I don't know stats very well.

Either way, the NU playerbase would really appreciate some sort of way to play.
 

Zarel

Not a Yuyuko fan
Creator of PS
Conf. Level | N

90% | 912
95% | 126500
99% | 218500
You have an unusually huge gap between the 90% and 95% confidence levels, so I'm recalculating them.

p̂ ± z·√(p̂(1 − p̂)/n)


is the approximation function for a Bernoulli sample. I'm not exactly sure what formula you're using instead, but yours looks more complex than necessary.

Our target p̂ is 0.0341. We want our estimate to be within 0.1%, so we want the confidence interval to be p̂ ± 0.001.

plugging in our numbers, we get

0.001 = z·√(0.0341 · (1 − 0.0341)/n)

n = z² · 0.0341 · (1 − 0.0341) / 0.001²


z is 1.644 at 90%, 1.959 at 95%, 2.575 at 99%

Confidence level | N

90% | 89,021
95% | 126,402
99% | 218,394

So we got the same numbers for 95% and 99% confidence intervals, but I assume you messed up your 90% confidence interval calculation.
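
For anyone who wants to sanity-check these numbers, here's a throwaway snippet with the same formula (not part of the actual stats scripts):

```python
# n = z^2 * p * (1 - p) / E^2, with p = 0.0341 and E = 0.001 (the ±0.1% target)
p, E = 0.0341, 0.001
for conf, z in [("90%", 1.644), ("95%", 1.959), ("99%", 2.575)]:
    n = z * z * p * (1 - p) / (E * E)
    print(f"{conf}: n ≈ {n:,.0f}")
# 90%: n ≈ 89,021
# 95%: n ≈ 126,402
# 99%: n ≈ 218,394
```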
 
Honestly, if we had 99% confidence that sampled stats would mirror full stats, it would be stupid not to just switch to sampled stats moving forward, even once the new server is up and running. Imagine usage stats coming out on the 1st or 2nd instead of the 7th-10th. Can you really tell me that wouldn't be worth it?
 

Anty

let's drop
Is there a sort of sample trial you could do with the previous February stats? Like compare the "battles ending in 00" sample stats with the actual ones to see how close they are (you could chi-square test it, I think, or just look and compare). That would be the best way to convince people.
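
Something along these lines is what I'm picturing; the counts are made up, and since the real usage stats are rating-weighted, treat it as a rough check at best:

```python
# Rough sketch of comparing one Pokemon's usage in the "00" sample against
# the full February stats with a chi-square test. All counts are invented.
from scipy.stats import chi2_contingency

# [battles it appeared in, battles it didn't] for the full month vs. the sample
full   = [34_100, 965_900]   # 3.41% usage across the full month
sample = [360,      9_640]   # 3.60% usage in the 00-sample

chi2, p_value, dof, expected = chi2_contingency([full, sample])
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# A large p-value means the sample's usage rate is consistent with the full stats.
```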
 
Is there a sort of sample trial you could do with the previous February stats? Like compare the "battles ending in 00" sample stats with the actual ones to see how close they are (you could chi-square test it, I think, or just look and compare). That would be the best way to convince people.
In a perfect world, yeah. But I'm having enough trouble pulling March's logs. When it takes so long just to list and count the files that I have to give up, I don't see myself pulling a validation sample from another month's logs first.

But if you're talking after we move to the new server, then sure. I'll check then.

Keep in mind: this is not the first time I've calculated stats on partial months. For a few of the Nintendo competitions I've calculated stats early, and I've even calculated usage-based tiers early once or twice to give tier leaders a heads-up about upcoming changes. Those were nonrandom samples, sometimes missing a full week of data, and very rarely did anything change.
 

Honko

he of many honks
If we wanna be extra conservative about not making unnecessary tier shifts, while still using a small enough sample of the data to get stats out in a reasonable amount of time, we could always temporarily adjust the cutoffs for rises/drops. Like if you can be 99% confident with a confidence interval of +/- 0.5%, then we only allow rises for >3.91% and drops for <2.91%.

I don't personally think that's necessary, but it's an option to unblock things for this month if there's strong opposition to trusting the sampled stats.
 
Honestly, I would rather just skip this month of stats and put all resources into making sure the same mess doesn't happen next month than pull a sample for a month that's nearly half over anyway. By the time this got done, it wouldn't even be worth the time you put into writing the script.
 
We are going to have this problem for April as well; it's too late to make sure it won't happen again...

Now, May, I can relatively safely guarantee it won't be a problem for, if it's any consolation. :p
So then we're in a position where we can't advance tiering reliably for another month and a half, without a guarantee for the month after that? I feel like we're now at the point of "what other tiering option can we take" instead of "how do we fix the stats." We can't just do nothing for that long, and the sample stats, even at 99% confidence (which may still be impractical to pull), would be questionable. Looking into short-term viability tiering or something has to be a consideration at this stage.
 

Freeroamer

The greatest story of them all.
I think you're being very harsh, particularly when it's been explained that the problem is the sheer volume of battles being played, which is hardly a problem we can realistically expect to go away...

No one wants the issue to be solved via sample stats, but as a best-case scenario it's a perfectly reasonable alternative to not taking the stats at all. Viability tiering is also an option, of course, but if Smogon chose from the start not to tier based on what its more experienced members believed to be viable, there's a good reason it wouldn't consider that next to something we could have 95-99% confidence in; bias seems a particular risk here, as does the poor PR that could result from such a decision.

e: @ below, apologies if my reasoning is wrong, I haven't done stats in a while; it'd be nice if you could tell us what's wrong instead of a one-liner, though.
 
I think you're being very harsh, particularly when it's been explained that the problem is the sheer volume of battles being played, which is hardly a problem we can realistically expect to go away...

No one wants the issue to be solved via sample stats, but as a best-case scenario it's a perfectly reasonable alternative to not taking the stats at all. Viability tiering is also an option, of course, but if Smogon chose from the start not to tier based on what its more experienced members believed to be viable, there's a good reason it wouldn't consider that next to something we could have 95-99% confidence in; bias seems a particular risk here, as does the poor PR that could result from such a decision.
I'd agree that I'm being very harsh; I just feel like this is a totally avoidable position that we're in. I understand the situation is beyond anyone's control as well. So, why not take the step now and avoid it?
I have the feeling a number of people commenting itt haven't taken stats courses recently
Assuming you mean the people commenting that the 99% confidence is "iffy": I do understand that it would probably be just fine, but if we're using a debatable and subjective system at that point anyway, why not change to one that can get working quicker and isn't dependent on the stats?
 

Bughouse

Like ships in the night, you're passing me by
There are reviews of large government programs funneling billions of dollars around in the US (possibly trillions in the world) that are based on statistical samples... This is not me talking out of my ass - I literally work on one of these at my job. We can do this because well-designed statistical samples work.

I think we can use them when necessary on a Pokemon site (and frankly wouldn't be opposed to replacing full population stats with sample stats if they can be done much more quickly).
 

UltiMario

Out of Obscurity
For real though, even if we used "games ending in 00" (20k games, well below even the ~89k needed for 90% confidence above), we'd still have a better relative sample size than the samples used to predict election outcomes and to make corporate financial decisions.

I remember working with my brother once on analysis of a government survey that was supposed to represent specific income information for the entirety of the United States. The sample size was like 13k people. Surveys to analyse political opinions among the populace of countries often fall in the 1-2k range. Hopefully this gives an idea of how silly people are being over samples for Pokemon stats.

Assuming I did my math right, 20k games in OU with our current stats puts us at about a 90% confidence interval of +/- 0.55% error. Just change the adds/drops as Honk suggested if you all are THAT paranoid about screwing up the metagame over sample size.
 
