Statistics and probability for randomized games (loot shooters, roguelites, roguelikes, etc.)

2020/06/10

Goal: Let’s use R and statistics to make looter shooter drop rate videos better.

Games such as Borderlands 1, Borderlands 2, Borderlands 3, Binding of Isaac, Dark Souls, etc. all have random loot drops. This means that a player will randomly get items from defeating an enemy, completing a quest, opening a chest, or fulfilling some requirement. Players use the term drop rate for the frequency at which a particular item is “dropped” by a chest or a boss, that is, how often a player will get that item as a reward for a particular kill, chest, or quest.

There are many videos and guides online that serve the player communities of these games and cover particular items in them (zKarmaa’s Drop Rate Videos). They often review items and provide tips and tricks. Some of these contributions are completely empirical, consisting of numerous repeated tests in order to measure drop rates and other properties of the games.

Common questions I see addressed:

Questions that I don’t see addressed that should be addressed:

I really enjoy these empirical works. I enjoy when they do the testing well and share the results of their tests. I enjoy the same treatment of mundane things, like Project Farm’s videos on the effects of different brands of penetrating oil[1][2]. But I do care about stats and I want them done well.

What I want is better statistical analysis in game service videos. There is a glossary at the end if any of the terminology is confusing.

Case study: Borderlands 3

The Borderlands series is a loot shooter series. This means that when you defeat enemies and open chests the loot is usually randomized, pulled from a random distribution. The computer generates random numbers and these are converted into loot. What this looks like to a player is that some loot is common (white weapons) and some loot is rare (purple items, or orange legendary items).

Borderlands 3 is the latest instalment of the Borderlands series, in which you run around killing indiscriminately and every kill is a slot machine pull where random loot can appear. You typically optimize for blue, purple, and orange legendary items: the higher the rarity, the more money the item is worth and the more extra features it has. Although some legendaries are not worth it.

Many bosses have a chance to drop loot and some have dedicated drops. For instance, the Kaosan SMG is dropped by Commander Traunt on the planet of Athenas. Everyone likes the Kaosan because it is an accurate, high fire rate SMG that deals secondary damage by shooting little bombs. But the drop rate of the Kaosan SMG is relatively low. I had 4 Kaosan SMGs drop after 90 rounds of defeating Commander Traunt. So what is the probability of a Kaosan drop from Traunt? 4/90 ~ 4.44%? What if I got lucky? What if I got lucky multiple times? What if I was having bad luck? Is 4.44% accurate? Given my experience, what should you expect the drop rate to be?

Drop Rate of an item from defeating a boss

What is the actual drop rate of a particular item from a particular boss?

Would it surprise you if I told you that the true probability is probably somewhere between 1.7% and 10.9%? That’s a big spread. At 10.9% I expect to need about 9 kills of Commander Traunt on average, and about 26 to be 95% sure of getting what I want. At 1.7% that becomes about 59 kills on average, and more than 170 to be 95% sure. That’s a huge difference in my time.
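
Those kill counts come from treating each boss kill as an independent coin flip and asking the geometric distribution how long you wait for the first success. A quick sketch in R (my own back-of-the-envelope numbers, not data from the game):

# expected number of kills until the first drop is 1/p
1 / 0.109   # about 9 kills on average at a 10.9% drop rate
1 / 0.017   # about 59 kills on average at a 1.7% drop rate

# qgeom(q, p) returns the number of failures before the first success,
# so add 1 to get the kills needed to be 95% sure of at least one drop
qgeom(0.95, 0.109) + 1   # about 26 kills
qgeom(0.95, 0.017) + 1   # about 175 kills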

Imagine that 4.44% were the true probability. If I repeated this test 900 times and got 40 Kaosans, what would the spread from my testing be? 3.3% to 6.0%. By doing 10 times more work I still have a noticeable spread. Well, that stinks. Therein lies the rub: to test low probability events we need a lot of data, and these loot shooters love having low probability events.
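
If you want a rough feel for how much data a given precision costs, the usual normal-approximation sample size formula, n ~ z^2 * p * (1 - p) / m^2, works as a back-of-the-envelope estimate (a sketch assuming the true rate really is around 4.44% and that you want a margin of error of about plus or minus 1 percentage point):

p <- 4/90          # assumed true drop rate
m <- 0.01          # desired margin of error: plus or minus 1 percentage point
z <- qnorm(0.975)  # about 1.96 for a 95% interval
ceiling(z^2 * p * (1 - p) / m^2)   # roughly 1600 boss kills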

How did you calculate these spreads? I used the 95% confidence interval extracted from R’s 1-sample proportions test. This 95% confidence interval is calculated for Bernoulli trials, also called binomial trials: trials which are either a success or a failure with a fixed probability (like flipping a coin).

What’s a 95% confidence interval? It’s a range of values that, under repeated experiments in similar conditions, should contain the true value 95% of the time; so 1 time in 20 it is just plain wrong (like winning the lottery, it can happen). In this case it is an interval estimate of a proportion (frequencies are proportions): successes over total trials.

Here we do a 1-sample proportions test against 4/90 as the hypothesized true probability. The test says our observation is perfectly consistent with it: the p-value is 1.0, well above the usual thresholds of 0.05 or 0.1, meaning that seeing 4 drops in 90 kills is entirely plausible if the true drop rate is 4/90. More useful is the 95% confidence interval reported below: 0.017 to 0.109, or 1.7% to 10.9%. The confidence interval matters more to me as a player because it means my personal experience could range from needing around 10 kills on average to get the weapon up to needing around 60. That’s a big spread. I could get really lucky or really unlucky; that’s why you see those forum posts where a player complains about 100 kills and 0 drops. It can happen :-(.

Here we use the 1-sample proportions test prop.test, and we tell it that the hypothesized probability is 4/90 (p=4/90) and that we observed 4 successes in 90 runs. We want to know whether observing 4 out of 90 is consistent with a true probability of 4/90 (it says yes, it is), and we also want the 95% confidence interval for the probability p given this observation.

> prop.test(4,90,p=4/90)

	1-sample proportions test without continuity correction

data:  4 out of 90, null probability 4/90
X-squared = 0, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.04444444
95 percent confidence interval:
 0.01741724 0.10876855
sample estimates:
         p 
0.04444444 

Warning message:
In prop.test(4, 90, p = 4/90) : Chi-squared approximation may be incorrect

That warning message is warranted: 4 successes out of 90 is not a lot. The chi-squared approximation wants an expected count of at least 5 in each bin, and here the expected number of successes is only 4. But that’s another lesson.
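
One way to sidestep the approximation entirely (an aside on my part, not something from the video) is the exact binomial test, which reports an exact Clopper-Pearson confidence interval instead of the chi-squared-based one:

# exact binomial test: no chi-squared approximation, no warning message;
# the reported 95 percent confidence interval is the exact
# Clopper-Pearson interval for 4 successes in 90 trials
binom.test(4, 90, p = 4/90)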

Let’s look at what happens if we achieve the same rate with 10X the data, 40 successes over 900 trials:

> prop.test(40,900,p=4/90)

	1-sample proportions test without continuity correction

data:  40 out of 900, null probability 4/90
X-squared = 0, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.04444444
95 percent confidence interval:
 0.03280679 0.05995445
sample estimates:
         p 
0.04444444 

We see the confidence interval around the estimate is a lot tighter in this case. More trials, more certainty. Was it necessary? It depends on what you want to give your audience. With half of those trials you would still get a fairly similar interval.
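
You can sanity check that claim yourself (my own check, not a number from the video) by running the same test with half the trials:

# same observed rate, half the sample size: the interval widens a little,
# to very roughly 3% to 7%, but it is still far tighter than 4 out of 90 gave us
prop.test(20, 450, p = 4/90)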

Drop Rate Summary

Number of items dropped.

Some bosses drop lots of legendary items when they are defeated, some drop few. The big payout bosses are quite popular and people want to know what to expect.

Generally statisticians will try to match the thing you are measuring to a distribution. The values of the number of items dropped are kind of strange: they are non-negative, sometimes zero, and discrete (there are no 0.3 items). In other words the counts are non-negative integers, and the statistics for count data get complicated. People will often use negative binomial distributions and negative binomial regression to model these problems, but that is probably too much to discuss here.

Imagine that we don’t know the underlying distribution of our data and we measured this many items dropped per boss kill:

items <- c(4,4,4,4,2,4,3,3,2,3,2,4,0,4,5)

We can calculate some summary statistics such as the mean (the average), the median, and the standard deviation. Without knowing the underlying distribution many of these measures are not that useful. The median is arguably the most useful: it is the value in the middle, the one that 50% of the measurements fall above and 50% fall below, so half of the time you will do at least that well.

> items <- c(4,4,4,4,2,4,3,3,2,3,2,4,0,4,5)
> mean(items)
[1] 3.2
> median(items)
[1] 4
> summary(items)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     2.5     4.0     3.2     4.0     5.0 

We can also calculate the 2.5% and 97.5% quantiles. This is not the confidence interval we saw before; it is simply the range that 95% of our measurements fall within. Essentially we cut off the bottom 2.5% and the top 2.5%.

> # simple 95% quantiles [not a confidence interval]
> quantile(items,c(0.025,0.975))
 2.5% 97.5% 
 0.70  4.65 

Bootstrapping

We want to calculate an estimate of the 95% confidence interval of the number of items dropped.

The median and the average serve their purpose, but viewers are going to have different experiences. I would prefer to give 19 out of 20 viewers an accurate expectation of the results, and a 95% confidence interval does exactly that: it gives you an idea of what to expect 19 times out of 20, as in most of the time. The annoying part is that R and other languages often lack built-in code to calculate confidence intervals for arbitrary statistics.

So we’re going to bootstrap. Bootstrapping is a method of re-using existing measurements to estimate the uncertainty of a statistic, to get an idea of the effect of outliers, and to be a bit more confident in our results. The general idea is that you resample from your measurements, with replacement, the same number of values that you measured. So if you measured 20 things you sample 20 things from those 20 measurements. Then you compute your statistic (such as a mean or median) on each resample. Repeat that many times, sort the computed statistics, and cut off the first 2.5% and the last 2.5% (half of 5% on each side); the range that remains is a 95% confidence interval.

Generally you abuse computers and repeat these calculations hundreds to thousands to millions of times to get a good idea of what to expect. Your estimates improve both with more measurements and with more bootstrap repetitions. This has a tendency to exclude outright outliers (such as winning the lottery).

The first thing we’re going to do is sample 1 element from our measurements, 10000 times over. This models a player doing 1 trial of defeating the boss: 95% of the time they should expect between X and Y items to drop.

Adapted somewhat from DataCamp’s Quick R Bootstrapping tutorial: https://www.statmethods.net/advstats/bootstrapping.html

> # simple 95% confidence interval. We sample 10000 times from our
> # initial dataset to see what a 95% confidence interval will be
> # the statistic here is simply drawing one value from the resample.
> N=10000
> F=function(x) { return(sample(x,1)) }
> samples = sapply(1:N,function(i) { F(sample(items,length(items),replace=TRUE)) })
> quantile(samples,c(0.025,0.975))
 2.5% 97.5% 
    0     5 

Thus for N=10000 we sampled 1 item count N times. We cut off the bottom and top 2.5% and got a lower bound of 0 and an upper bound of 5. Not super helpful: in this scenario a single kill could yield anywhere from 0 to 5 items. Not very descriptive. But we only recorded 15 kills; perhaps if we doubled the number of recorded trials we could get a nicer bound.

More detail on the R code

sapply maps the values 1 to N through the supplied function and returns the function’s results, in that order. The supplied function draws a resample from items of length length(items). The resample can contain duplicates because we sampled with replacement (replace=TRUE). The function F is very simple: it picks 1 value from that resample, as if you defeated the boss once. So effectively we resample and then sample from the resample, because we want a 95% confidence interval over bootstrap resamples.

> F=function(x) { return(sample(x,1)) }
> samples = sapply(1:N,function(i) { F(sample(items,length(items),replace=TRUE)) })

Calculating an estimate of the average number of items a user should see

We can replace our function F with another function, such as the mean (the average), and get a bootstrap estimate of the average: based on our measurements, roughly 95% of those who replicate our experiment should see an average between 2.5 and 3.8 items per kill.

Effectively this means we resampled 10000 times, calculated the mean of each resample, sorted the results, and took the range from the 2.5% quantile to the 97.5% quantile: the 95% confidence interval.

> # Now if we want a confidence interval on the mean
> N=10000
> F=mean
> res = sapply(1:N,function(i) { F(sample(items,length(items),replace=TRUE)) })
> quantile(res,c(0.025,0.975))
    2.5%    97.5% 
2.533333 3.800000 
> 
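
As an aside (not part of the original workflow), R’s boot package wraps this same resample-and-take-quantiles recipe so you don’t have to hand-roll the sapply. A minimal sketch of the percentile interval for the mean:

library(boot)  # ships with most R installations

# statistic receives the data and the indices chosen for this resample
boot_mean <- boot(data = items, statistic = function(d, i) mean(d[i]), R = 10000)

# "perc" asks for the percentile interval, the same idea as quantile() above
boot.ci(boot_mean, type = "perc")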

Drop Rate of different types of items from defeating a boss

Now another popular part of Borderlands is finding bosses or quests that drop lots of loot. Often I’ll see guides and videos about farming these bosses, and sometimes you get a summary of the haul. It all feels very consumeristic and greedy, but if you’re in the game it is pretty fun.

Often you want to know the different types of items you are going to get. Some quests/bosses reward different kinds of loot than others. For instance, in Borderlands you typically have weapons, artifacts, class mods, grenade mods, and shields. Some people will group artifacts, class mods, grenade mods, and shields together and treat weapons separately.

These are categorical distributions, essentially counts of types. Typically there are 3 things we want of categorical distributions:

Be aware that the more categories you have the more tests you often have to do to detect small changes.

zKarmaa’s test of Graveward item drops with or without Loaded Dice

We’re going to use data from the video by zKarmaa:

zKarmaa farms the Graveward boss in BL3 1000 times and counts the types of items that drop from defeating the boss. He breaks the drops down into counts of world drop legendaries, legendary class mods, and legendary artifacts, plus a subcategory of legendaries that are anointed. In the video he does 500 kills with loaded dice and 500 without.

First we represent our distributions:

loaded  <- c(world=658, annoit=373, class=111, artifact=101)
control <- c(world=667, annoit=405, class=113, artifact=130)

So how do we tell whether using the loaded dice was any better than not using them? Our data is categorical counts: the counts cannot be negative, they are 0 or more. We assume that the counts scale proportionally with the number of kills and that the probability of each category is fixed.

To determine if loaded dice produce an effect we shall ask the question:

Given a treatment, such as loaded dice, do the items we observe come from the same distribution as the items without the treatment? That is, is there a difference if I use loaded dice or not? My data is categorical counts and I want to see if there is a difference. Statisticians will ask: “Does my measured data from applying a treatment come from the same distribution as my measured data without the treatment?” The tool for this is the chi-squared test (X^2). R has an implementation of this test:

> chisq.test(loaded, p=control, rescale.p=TRUE)

	Chi-squared test for given probabilities

data:  loaded
X-squared = 5.5142, df = 3, p-value = 0.1378

I’m asking R to calculate a chi-squared test, to treat the control measurements as the expected distribution (p=control), and to rescale those counts into proportions (rescale.p=TRUE). It returns an X-squared statistic, the degrees of freedom (number of categories - 1), and a p-value of 0.1378. The p-value is greater than the normally used alpha threshold of 0.05, so I cannot conclude that these distributions are statistically significantly different. This matches zKarmaa’s claim in his video.
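
If you are curious what the test is actually doing, it is only a few lines by hand (a sketch of the same calculation on the counts above, nothing beyond what chisq.test reports):

# scale the control counts into expected counts for the loaded-dice sample
expected <- sum(loaded) * control / sum(control)

# the chi-squared statistic: squared differences relative to expectation
x2 <- sum((loaded - expected)^2 / expected)   # about 5.51, as reported above

# p-value: chance of a statistic at least this large if the null were true
pchisq(x2, df = length(loaded) - 1, lower.tail = FALSE)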

Later he looks at counts of dedicated drops of Lob shotguns, Ward Shields, Graves, and Moxxi’s Endowment:

> loaded  <- c(lobs=38,wards=36,graves=43,moxxi=51)
> control <- c(lobs=43,wards=36,graves=43,moxxi=23)
> chisq.test(loaded,p=control,rescale.p=TRUE)

	Chi-squared test for given probabilities

data:  loaded
X-squared = 26.773, df = 3, p-value = 6.568e-06

We can clearly see that the p-value is awfully small. That suggests there is a statistically significant difference between the runs with loaded dice and the control runs without them.

Was the difference solely due to Moxxi’s Endowment?

> loaded  <- c(lobs=38,wards=36,graves=43)
> control <- c(lobs=43,wards=36,graves=43)
> chisq.test(loaded,p=control,rescale.p=TRUE)

	Chi-squared test for given probabilities

data:  loaded
X-squared = 0.39257, df = 2, p-value = 0.8218

Given that the p-value > 0.05 once Moxxi’s Endowment is removed, it looks that way. The loaded dice don’t seem to do much for zKarmaa other than increase the number of Moxxi’s Endowments.
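
Another way to spot which category drives a significant result, without rerunning the test with categories dropped, is to look at the Pearson residuals that chisq.test already computes (my suggestion, not something shown in the video); the category with the largest residual contributes the most to the statistic:

# re-run the 4-category dedicated-drop test and keep the result object
loaded  <- c(lobs=38, wards=36, graves=43, moxxi=51)
control <- c(lobs=43, wards=36, graves=43, moxxi=23)
result  <- chisq.test(loaded, p=control, rescale.p=TRUE)

# Pearson residuals: (observed - expected) / sqrt(expected);
# moxxi stands out with by far the largest value
result$residuals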

500 is a lot of trials

Is 500 trials enough? Or too many?

This is a topic for another post, but if you want to figure out how many runs you have to execute to determine whether a treatment has an effect, you can do a power analysis. Power analysis tells us how many trials we need in order to answer our hypothesis with a given probability.

The first thing you need to choose is how many categories you want to measure. In zKarmaa’s case he measured 4 categories, which means a test with 4-1=3 degrees of freedom. Next, the loaded dice effect might be subtle, so we want to be able to detect small effects, something like a 10% difference from the control. The effect size for a chi-squared power analysis is Cohen’s w, a measure of how different the treatment distribution is from the control; the conventional benchmarks for w are roughly 0.1 for a small effect, 0.3 for medium, and 0.5 for large. We’ll start with w = 0.2, between small and medium. Here’s a good explanation of effect size if you want one: Saul Mcleod, 2019, What does effect size tell you?, https://www.simplypsychology.org/effect-size.html
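
If you don’t want to memorize those benchmarks, the pwr package will report them for you (a small aside; these are the package’s built-in conventions, not numbers from the video):

library(pwr)

# conventional effect sizes for a chi-squared test:
# "small" is w = 0.1, "medium" is w = 0.3, "large" is w = 0.5
cohen.ES(test = "chisq", size = "small")
cohen.ES(test = "chisq", size = "medium")
cohen.ES(test = "chisq", size = "large")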

The second thing we need is power: if the effect is real, how often are you willing to miss it? 1 in 20 times? 1 in 100? We’ll say 1 in 20, so 1 - 1/20 = 0.95 is our power. If the loaded dice really do change the drops, we want the test to catch it 95% of the time.

The third thing is our alpha, the significance level (the chance of declaring a difference when there isn’t one), which is 0.05 as before. This is for video games and social media clout though, so we could even go up to 0.10 if we wanted.

> library(pwr) # install.packages("pwr") if you're missing it
> effect = 0.2
> df = (4-1)
> sig = 0.05
> power = 0.95
> pwr.chisq.test(w=effect, power=power, sig.level=sig, df=df)

     Chi squared power calculation 

              w = 0.2
              N = 429.2474
             df = 3
      sig.level = 0.05
          power = 0.95

NOTE: N is the number of observations

So zKarmaa was on the ball: roughly 430 observations are needed at w = 0.2 and he collected more than that. If he wanted to detect a smaller, more subtle effect, say w = 0.1 (something like a 10% difference from the control), he’d need a lot more.

> pwr.chisq.test(w=0.1, power=power, sig.level=sig, df=df)

     Chi squared power calculation 

              w = 0.1
              N = 1716.99
             df = 3
      sig.level = 0.05
          power = 0.95

NOTE: N is the number of observations

You can also gamble with this: drop your power to 80% and increase the sig.level to 0.1.

> pwr.chisq.test(w=0.1, power=0.8, sig.level=0.1, df=df)

     Chi squared power calculation 

              w = 0.1
              N = 879.7742
             df = 3
      sig.level = 0.1
          power = 0.8

NOTE: N is the number of observations

So the balance here is sensitivity (effect size) versus how trustworthy our chi-squared result is, plus accepting a 20% chance that we miss a real effect. Essentially it means that with around 1000 samples of the Graveward we could detect fairly small effects.

Summary

When comparing counts of categories we should use tests designed for categorical distributions, such as the chi-squared test.

Conclusion and Recommendations

In conclusion, it is not hard for streamers and game community members to provide more statistically reliable estimates of drop rates in their favorite games. Streamers who do this kind of testing, as long as their tests are consistent, should report the numbers they record so their fans can do their own statistical analysis. And with a little bit of R, one can calculate these estimates rather quickly and successfully.

Please consider reporting:

Glossary