Christer
Joined: Aug 02, 2003

Introduction
The Coach Rating (CR) system is a way for coaches to get an idea of the relative strength of other coaches on FUMBBL. A higher CR indicates that the coach is playing at a higher level, whereas a lower CR means the opposite. It is important to understand that the system is tuned to give more weight to more recent results and that it does, indeed, estimate coach strength based on actual results rather than their true ability.
The difference here is that a number of coaches sometimes enjoy playing teams that are deliberately a tougher challenge (for example switching to a new roster they are not used to, or playing all-lineman teams). This will inevitably push the CR value lower than the coach "deserves".
Another important takeaway is that the CR system is an estimation and not perfectly accurate. There is a fairly large variation of the actual rating number each coach has, and the precision shown should not be taken as an indication of accuracy.
Rating categories
Each coach has a number of different ratings. These are divided into one overall group and one group per competitive division (currently Ranked and Blackbox).
Each group has an overall rating, and one rating for each race available.
The overall ratings are completely independent from a mathematical perspective. This means they are completely separated and the ratings don't relate to each other. An important effect of this is that there is no formula to convert divisional ratings to the overall one; the ratings are updated based on the opponent's rating in that specific category.
The racial ratings are slightly different. They correlate to some extent with the overall rating for the division they are in. The reasoning behind this is that coaches often do not play all races on their own. For example: A coach playing goblins may play against an opponent that doesn't play goblins. Comparing the goblin rating to the opponent's goblin rating would therefore be inaccurate (e.g. the opponent will have the default rating rather than an actual rating). Another way to do this would be to use the rating of the race the opponent is currently using. I chose not to do this because the ratings between the different races are not necessarily in the same scale. A goblin rating of 160 may not be the same level of play as an orc rating of 160, because orcs are probably easier to play than goblins for most people. As a compromise, racial ratings are compared against the opponent's overall rating for the division.
The fact that the racial ratings are inherently based on fewer games makes them less accurate, and you should take these with a larger pinch of salt than normal.
Definitions
CR = Coach Rating. A starting coach has a CR value of 15000 (stored as a whole number, no fractions allowed), which is divided by 100 for display, giving 150.00.
CR' = New CR (i.e., the CR after a match has been processed)
k = An amplification factor designating the effect of a match result on CR
S = Factor for the actual result in the game.
p = Win probability of the match.
The Math
The core of the rating system is based on the Elo system, which has been heavily modified for the unique parts of Blood Bowl (primarily that teams are not equal). Note that this is calculated twice per match (and CR type): Once for each coach involved in the match.
We start with calculating the win probability (p) for a given match:
CR_coach = CR of one of the coaches
CR_opponent = CR of the opponent
We then compute a weighted CR difference:
CR_diff = CR_linear * f_1 + (CR_linear * f_3) ^ 3
CR_linear = CR_opponent - CR_coach
f_1 = 1.0
f_3 = 0.275
We also compute the normalized TV difference for each team:
TV_min = min(TV_coach, TV_opponent)
TV_diff = 100 * (TV_opponent - TV_coach) / TV_min
Finally, we get to the win probability:
p = 1 / (10 ^ (CR_diff / f_CR + TV_diff / f_TV) + 1)
f_CR = 400
f_TV = 70
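The win probability steps above can be sketched in a few lines of Python. This is only an illustration of the formulas as written, not the site's actual code. The function name and parameter names are made up for the example, and it assumes CR is passed in display units (150.00 for a new coach) and that both TV values use the same unit.

```python
def win_probability(cr_coach, cr_opponent, tv_coach, tv_opponent,
                    f_1=1.0, f_3=0.275, f_cr=400.0, f_tv=70.0):
    """Estimated win probability p for the coach rated `cr_coach`.

    Assumptions (not confirmed by the post): CR is given in display
    units (e.g. 150.0) and both TV values use the same unit.
    """
    # Weighted CR difference: a linear term plus a cubic term.
    cr_linear = cr_opponent - cr_coach
    cr_diff = cr_linear * f_1 + (cr_linear * f_3) ** 3

    # Normalized TV difference, as a percentage of the smaller team.
    tv_min = min(tv_coach, tv_opponent)
    tv_diff = 100.0 * (tv_opponent - tv_coach) / tv_min

    return 1.0 / (10.0 ** (cr_diff / f_cr + tv_diff / f_tv) + 1.0)


# The two sides of one match, seen from each coach in turn.
p_coach = win_probability(155.0, 150.0, 1000, 1100)
p_opponent = win_probability(150.0, 155.0, 1100, 1000)
print(p_coach, p_opponent)
```

Because both difference terms simply change sign when the two coaches are swapped, the two probabilities for a match add up to 1.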
Next, we define the S value, which is the actual result of the match:
S = 0.0 for a loss
S = 0.5 for a tie
S = 1.0 for a win
Then, we calculate the basic amplification factor k:
k = 2. This is the base k value that is used unless the exceptions below are in play.
We then evaluate the match:
outcome = (S - p)
bracket_diff = bracket_coach - bracket_opponent. Each bracket is a numerical value between 1 and 6 for the different CR brackets (Experienced through Legend). Additional detail about these brackets is posted below.
k = 2 + abs(bracket_diff) / 2, if bracket_diff < 0 and outcome > 0, or if bracket_diff > 0 and outcome < 0
This means that if the coach did better than expected vs an opponent in a higher bracket, or worse than expected vs a coach in a lower bracket, the k value is amplified to somewhere between 2.5 and 4.5 (a bracket difference of 1 through 5). In short, we amplify the result if there's a clear upset.
k = 2 - 2/(5 - abs(bracket_diff) / 2), if bracket_diff > 0 and outcome > 0, or if bracket_diff < 0 and outcome < 0
This means that if the coach did better than expected vs an opponent in a lower bracket, or worse than expected vs a coach in a higher bracket, the k value is dampened to somewhere between 1.2 and 1.56. In short, we dampen the result if there's a clear non-upset.
After all this, we are ready to calculate the CR change:
CR' = CR + k * (S - p)
For each match, this is repeated a total of 8 times: once for each of four ratings (the overall rating, the overall racial rating, the divisional rating and the division/race specific rating), all computed for each coach in the match.
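The k adjustment and the final update can be sketched as follows. The function names are hypothetical, brackets are assumed to be numbered 1 (Experienced) through 6 (Legend), and the post does not say whether k is applied to the stored or the displayed CR, so the sketch simply applies the formula to whatever value is passed in.

```python
def k_factor(bracket_coach, bracket_opponent, outcome, base_k=2.0):
    """Amplification factor k for one coach in one match.

    bracket_* are assumed to be integers from 1 (Experienced) to
    6 (Legend); outcome is S - p for this coach.
    """
    bracket_diff = bracket_coach - bracket_opponent
    gap = abs(bracket_diff)

    # Clear upset: better than expected against a higher bracket,
    # or worse than expected against a lower bracket.
    if (bracket_diff < 0 and outcome > 0) or (bracket_diff > 0 and outcome < 0):
        return base_k + gap / 2

    # Clear non-upset: the result went the way the brackets suggest.
    if (bracket_diff > 0 and outcome > 0) or (bracket_diff < 0 and outcome < 0):
        return base_k - 2 / (5 - gap / 2)

    # Same bracket, or the outcome was exactly as predicted.
    return base_k


def updated_cr(cr, s, p, k):
    """CR' = CR + k * (S - p)."""
    return cr + k * (s - p)


# A Veteran (2) beats a Super Star (5) with an estimated p of 0.25.
k = k_factor(2, 5, outcome=1.0 - 0.25)
print(k, updated_cr(150.0, 1.0, 0.25, k))
```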
Last edited by Christer on Oct 21, 2017  16:36; edited 2 times in total 

Christer
Joined: Aug 02, 2003

Rating Brackets
The rating brackets are large groupings of coaches, where each coach is given a bracket based on their CR.
The brackets follow the skill levels of Blood Bowl:
Rookie, Experienced, Veteran, Emerging Star, Star, Super Star and Legend.
The Rookie bracket is handled in a special way, where a coach who has played fewer than 10 matches for a given CR category is listed as a Rookie. This is a filter only applied on the display of the bracket, and does not affect the CR calculation above. Instead, coaches begin at Emerging Star as far as the rating system is concerned, much in the same way they start at CR 150.
The Legend bracket is also handled in a special way. It is loosely defined as the top 50 active coaches for a given CR category. Active in this context is defined as a coach who has played at least one match within the last 3 months (regardless of division).
This top 50 is reduced to roughly 14% (50/350) of the active coaches if the number of coaches in a category is less than 350 (although I believe this is not the case in any category).
The rest of the brackets, from Experienced through Super Star, are divided into CR ranges based on the standard deviation multiplied by a factor that is picked to make the top 50 legends. I realize this sounds complex, so let's review an example:
Let's consider a category where the average CR is 150 (this is not always exactly 150.0, but it will be close), and the standard deviation of the ratings is around 5.47.
We take the lowest CR among the top 50 coaches (let's say 174) and do the following calculation:
Raw CR above mean = 174 - 150 = 24
Number of sigmas above the mean = 24 / 5.47 = 4.39
Because we have 5 brackets we want to distribute the "remaining" coaches to, we calculate a bracket size:
sb = 4.39/2.5 = 1.756
We then define the brackets:
Experienced: 1.5*sb sigma or more below average CR
Veteran: between 1.5*sb and 0.5*sb sigma below average CR
Emerging Star: from 0.5*sb sigma below to 0.5*sb sigma above average CR
Star: between 0.5*sb and 1.5*sb sigma above average CR
Super Star: between 1.5*sb and 2.5*sb sigma above average CR
Legend: 2.5*sb sigma or more above average CR
Converting this to actual numbers using numbers above:
Experienced: below 135.59 CR (150 - 1.5*1.756*5.47 = 135.59)
Veteran: 135.59 to 145.20 CR (150 - 0.5*1.756*5.47 = 145.20)
Emerging Star: 145.20 to 154.80 CR (150+0.5*1.756*5.47 = 154.80)
Star: 154.80 to 164.41 CR (150+1.5*1.756*5.47 = 164.41)
Super Star: 164.41 to 174.01 CR (150+2.5*1.756*5.47 = 174.01)
Legend: over 174.01 CR
These bracket limits are recalculated daily.
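The worked example above can be reproduced with a short sketch. The function name and arguments are illustrative only, and the mean, standard deviation and top-50 cutoff are taken as given inputs rather than computed from real rating data.

```python
def bracket_limits(mean_cr, sigma, legend_cutoff):
    """Return the five boundaries between the six brackets, lowest first.

    legend_cutoff is the lowest CR among the top 50 active coaches.
    """
    sigmas_above = (legend_cutoff - mean_cr) / sigma  # e.g. 24 / 5.47
    sb = sigmas_above / 2.5                           # bracket size in sigmas
    step = sb * sigma                                 # bracket size in CR points
    multiples = (-1.5, -0.5, 0.5, 1.5, 2.5)
    return [mean_cr + m * step for m in multiples]


names = ["Experienced/Veteran", "Veteran/Emerging Star",
         "Emerging Star/Star", "Star/Super Star", "Super Star/Legend"]
for name, limit in zip(names, bracket_limits(150.0, 5.47, 174.0)):
    print(f"{name}: {limit:.2f}")
```

Without intermediate rounding, the bracket size works out to 24 / 2.5 = 9.6 CR points, so the limits land on 135.6, 145.2, 154.8, 164.4 and 174.0. The slightly different figures in the example above come from rounding 4.39 and 1.756 along the way.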
When a match is played and a CR increases or decreases, the resulting CR is compared against these brackets. To avoid a coach constantly swapping between two brackets, the resulting CR must exceed the lower limit of the higher bracket by 25% of that bracket's size for the coach to be promoted, or equivalently fall below the upper limit of the lower bracket by 25% of that bracket's size for the coach to be relegated. To be promoted to Legend or relegated to Experienced, the 25% comes from the current bracket (as these two extremes are unbounded and have no real size).
On a monthly basis, a script is run (after recalculating the bracket limits) that moves coaches into their bracket, regardless of this threshold.
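One way to read the 25% rule is sketched below. This is an interpretation rather than the actual implementation: it assumes all bounded brackets have the same width (which follows from the sigma-based limits), that the margin is measured from the boundary being crossed, and that a coach can move more than one bracket after a single match. The names are hypothetical.

```python
BRACKETS = ["Experienced", "Veteran", "Emerging Star",
            "Star", "Super Star", "Legend"]


def next_bracket(current, cr, limits, margin_fraction=0.25):
    """Return the bracket index after a match.

    current: index into BRACKETS before the match.
    limits:  the five ascending boundaries between the six brackets.
    """
    size = limits[1] - limits[0]      # assumed common bracket width
    margin = margin_fraction * size
    bracket = current

    # Promote only once CR clears the upper boundary by the margin.
    while bracket < len(limits) and cr > limits[bracket] + margin:
        bracket += 1
    # Relegate only once CR drops below the lower boundary by the margin.
    while bracket > 0 and cr < limits[bracket - 1] - margin:
        bracket -= 1
    return bracket


limits = [135.6, 145.2, 154.8, 164.4, 174.0]
# An Emerging Star at CR 156.0 is above the 154.8 limit but inside the margin.
print(BRACKETS[next_bracket(2, 156.0, limits)])
# At CR 157.5 the margin of 2.4 points has been cleared.
print(BRACKETS[next_bracket(2, 157.5, limits)])
```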
Last edited by Christer on Oct 21, 2017  16:36; edited 2 times in total 

Christer
Joined: Aug 02, 2003

Now that the specifications are out of the way, let's talk about what these things mean, and why things are implemented the way they are.
Differences from Elo
At its core, the system is based on Elo. A pure Elo system uses two primary formulas:
CR' = CR + k * (S - p), starting at 1500 for non-rated players
and
p = 1 / (10 ^ (CR_diff / 400) + 1)
The similarities should be quite obvious. Now, in order to add TV into the mix, or in a more generic context allow for uneven games, the CR formula we use has generalized this to:
p = 1 / (10 ^ (CR_diff / a + TV_diff / b) + 1), where a and b are constants to be defined somehow
Now, what I've also done is to consider the specific domain of Blood Bowl, and the TV diff was changed to be a normalized version (i.e., we effectively look at the percentage difference between the two teams rather than the direct TV difference). This means that TV 1000 vs TV 1100 is equivalent to playing TV 2000 vs TV 2200. While this doesn't account for inducements, I feel that the normalized difference is better than the direct non-normalized one.
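As a quick check of the normalization, here is a small sketch using the TV_diff formula from the first post. The function name is made up for the example.

```python
def normalized_tv_diff(tv_coach, tv_opponent):
    """Percentage TV difference relative to the smaller of the two teams."""
    return 100.0 * (tv_opponent - tv_coach) / min(tv_coach, tv_opponent)


# Both matchups are a 10% difference, so they are treated the same.
print(normalized_tv_diff(1000, 1100))
print(normalized_tv_diff(2000, 2200))
```

Seen from the bigger team's side the same matchups give the same size of difference with the opposite sign, because the divisor is always the smaller TV.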
With CR, we used a linear difference for a very long time. This is the root cause of why it's beneficial for your CR to play down a lot in terms of CR (e.g. cherry picking rookies). The estimated win probability is simply too low considering the match, and therefore winning has a larger effect than expected on the CR of the high CR coach.
CR system updates (Oct 2017)
So, I recently introduced the exponential curve, effectively X + X^3 (strictly speaking a cubic rather than an exponential), where the linear X needs to be there to avoid silly effects with matches between coaches with very similar CR. In a way, you can think of this like an "uncertainty effect". With coaches close to each other in CR, the system considers them relatively equal since the CR itself is an estimate and uses a bit of caution. With coaches further apart, it becomes more and more certain that their skill levels are actually different and a stronger effect is applied to the expected result.
The constants used (f_CR, f_1, f_3) are chosen to give a reasonable curve for the different CR differences. In my work in choosing these, I simply made an assumption that the normal distribution of the coach CRs would have a mean of 150.0 and a standard deviation of 10. This means that 68% of coaches on the site would be between 140 and 160 and 95% of coaches would be between 130 and 170. Something like 0.3% would be more than three standard deviations from the mean, so only about 0.1% would be above 180.
Then I made the assumption that CR 180 vs CR 150 would have the system assume something along the lines of 97% win rate, and CR 155 vs CR 150 would be relatively close to 55%. I tweaked this around a bit to get to a point where I felt comfortable with the curve (looking at CR 160 and CR 170 in between those points).
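To get a feel for the resulting curve, the sketch below evaluates the CR part of the win probability at equal TV for a few rating gaps. It uses the constants from the first post and assumes the gap is measured in display units (so 180 vs 150 is a gap of 30). The function name is hypothetical.

```python
def favourite_win_probability(cr_gap, f_1=1.0, f_3=0.275, f_cr=400.0):
    """p for the higher-rated coach at equal TV.

    cr_gap is the favourite's CR minus the opponent's CR, in display units.
    """
    cr_linear = -cr_gap  # opponent minus coach, from the favourite's side
    cr_diff = cr_linear * f_1 + (cr_linear * f_3) ** 3
    return 1.0 / (10.0 ** (cr_diff / f_cr) + 1.0)


for gap in (5, 10, 20, 30):
    p = favourite_win_probability(gap)
    print(f"CR {150 + gap} vs CR 150: p = {p:.3f}")
```

With these constants the 30-point gap comes out at roughly 97%, and the cubic term is what makes the probability climb quickly once the gap grows past 10 points or so.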
CR distribution
It's important to say here that overall low p values will generate a larger spread of the CRs for coaches, while p values that tend to quickly move away from 50% will narrow the curve (this is an effect I've learned mostly by trial and error, looking at how tweaking numbers affects the overall distribution of CRs).
At the same time, having too "flat" p values, where the win probability is estimated to be too close to 50%, will make it very powerful in terms of CR to cherry pick.
At this point, I think the basic foundation is strong. I have the tools to adjust the constants and adjust the curve that is used for CR differences in a fairly granular way.
However, the last CR update I ran gave me somewhat of a surprise. I picked the constants for the formula in a way that would effectively force the standard deviation of the ratings to be close to 10, and very very few coaches should be hitting CR 180 or higher. After running the script for a while (and also looking at the end result), we have loads of coaches beyond CR 200. While this isn't a problem as such, it was a surprising result to me.
Finding problems
Looking more into some details, "cherry picking" type games were still giving too much CR increase, despite the exponential CR difference *and* the k-value adjustment that is in place (which in a way is even further penalizing the cherry picking behaviour). Also, race-specific ratings are all over the place and hard to understand. So what gives?
Looking back at the formula, you will see that there's a racial filter between the raw p formula, and what's used to calculate the CR difference. Thinking about this further, I am of the opinion that this p adjustment is problematic.
What it does is take racial ratings and compare the average result for a racial matchup to the win rating. My thinking is that most of these racial factors are very close to 50%, regardless of TV brackets or races. In reality, the range of win rates for all sorts of TV differences is incredibly narrow (basically between 45 and 55%, regardless of races and TV difference, up to extremes like 500k).
This, I believe, causes the expected win probabilities to be dragged closer to 50% than I would expect. In turn, this will give cherry picking behaviours an underestimation for win rates (meaning cherry picking is good for CR) and at the same time increasing the standard deviation of the CR curve; ie, remember what I wrote about "flat" p values above.
What I am intending to do with the next update is to simply remove the racial factor. While I think it's still a good idea at its core, it's causing more problems than it solves. So removing it will hopefully make things better. 
Last edited by Christer on Oct 21, 2017  16:37; edited 1 time in total 

Christer
Joined: Aug 02, 2003


happygrue
Joined: Oct 15, 2010

Posted:
Oct 17, 2017  17:44 

That's really interesting, thanks Christer.
It seems to me the mechanics of the amplification factor, k, are central to many of the CR debates that rage across the forums from time to time. Finding the balance of getting to a somewhat accurate rating quickly vs a rating that resists wild fluctuations seems to be what "k" is all about, if I understand correctly. Is it better to have a rating that is much better but takes 1000 games to become effective, or a rating that gets close after 10 but will move around a lot? It seems obvious to me that "something in between" is the answer, but how to do it, exactly, with the added challenge that if a coach switches what teams they have been playing then it throws everything out of whack...
Anyway, I look forward to the rest of your series of posts. Thanks for your time! 
_________________ Come join us in #metabox, the Discord channel for HLP, ARR, and E.L.F. in the box!


Sp00keh
Joined: Dec 06, 2011

Posted:
Oct 17, 2017  18:32 

Ok very interesting, thanks for this
You're adjusting by TV differences,
and also tracking racial winrate for each race Vs each other race, at various tv brackets...
Are they both necessary? (Is it double counting the effect of TV, or did I misunderstand)
Could like a big matrix of race Vs race at all TV bracket combinations do the same job instead?
Maybe there are situations where tv has unexpected interaction with race... (inducements might make a difference)
Like maybe necro beat orcs if they can have babes, but they lose without them,
Maybe rookie lizards beats rookie chaos, but 2000tv lizards loses to 2000tv chaos, etc
On the other hand some matchups would just not have enough data to predict. Could sandbag the results by adding 100 fake draws to the stats or something, to make it lean heavily towards draws until it has enough data to overwhelm that
Maybe keeping TV matchup and racial matchup as 2 separate factors is actually better overall... 


Sp00keh
Joined: Dec 06, 2011

Posted:
Oct 17, 2017  19:38 

"S = 0.9 for a win with 1TD
S = 1.0 for a win with 2 or more TDs"
Maybe a language niggle but, does this mean win by TD difference?
So a 2-1 win is S=0.9?
Difference of 1.0 to 0.9 for "big win" Vs "small win" seems appropriate
EDIT: BUT the value could be higher, nearly double the CR from major win over minor win, if P is very high.
This means if you're almost completely certain to win the game, you'll get only a small CR boost (fair) but if you get only a minor win, that small amount will be halved...
I quite like this, it really deflates cherry pickers who play safe grindy matches the kind that newbies feel turned off by and don't return 
Last edited by Sp00keh on Oct 18, 2017  09:34; edited 1 time in total 

Sp00keh
Joined: Dec 06, 2011

Posted:
Oct 17, 2017  19:49 

In Azure machine learning, there's a function called tune model hyperparameters
Which basically can work out what the weights of constants should be
Is it possible to derive your constants from your existing data? Eg f_CR, f_TV. Also for finding the actual win rates here:
"CR 180 vs CR 150 would have the system assume something along the lines of 97% win rate, and CR 155 vs CR 150 would be relatively close to 55%"
Follow on point, you're using linear CR diff, but 150-155 may not have the same influence as 170-175?
edit: actually, did you change this already, the thing with X+X^3 ?
Last edited by Sp00keh on Oct 17, 2017  23:37; edited 5 times in total 

Sp00keh
Joined: Dec 06, 2011

Posted:
Oct 17, 2017  19:59 

Say if Race1 has an overall winrate 30%, Race2 has 60%
CoachA plays Race1 and gets 30% wins, CoachB plays Race2 gets 60%
They are both performing as expected, so would have the same CR?
This sounds great to me as it means it doesn't matter which race you play if you want to farm CR, just go with the race that you perform the relative best with, which would enable diversity
Hmmm, if they then both get beat by CoachC, they would both suffer the same amplification factor K no? As that's based on CR only
As Race1 loses more, so CoachA's CR will take more punishment from K?
Is masked a bit by f_race but still means that they wouldn't end up with the same CR, and you'd probably want to play tier1 races to make legend 
Last edited by Sp00keh on Oct 17, 2017  20:27; edited 1 time in total 

PurpleChest
Joined: Oct 25, 2003

Posted:
Oct 17, 2017  20:25 

I have always (perceptions aside) defended CR as a reasonable attempt at defining coach rating. Go check, I stand by everything I have ever posted.
I struggle to keep up with the pure math, but can usually follow the reasoning and hence eventually get the math too.
But I am lost as to why a 4-0 win is better than a 1-0 win.
It would seem contrary to your stated aims.
Increasing, as it does, the value of unfair games with high score outcomes.
Promoting certain styles of play and certain races while devaluing others,
Bringing 'perception' of a 'good' win into something seeking a rational outcome.
So I am interested as to why you feel this is an appropriate measure? It feels like a step toward 'personal preference of play style' and away from 'how likely is a win' to me. 
_________________ Barbarus hic ego sum quia non intelligor ulli
I am a barbarian here because i am not understood by anyone 

Sp00keh
Joined: Dec 06, 2011

Posted:
Oct 17, 2017  20:29 

A 1-0 win is worth 90% the CR of a 2-0 win
Seems fine to me... It is a small boost to reward a comprehensive win
Agile teams can win by more but they also can lose by more...
EDIT it's a range between 90% and 50%. See page2 
Last edited by Sp00keh on Oct 18, 2017  09:35; edited 2 times in total 

Arktoris
Joined: Feb 16, 2004

Posted:
Oct 17, 2017  20:31 

PurpleChest wrote: 
But I am lost as to why a 4-0 win is better than a 1-0 win.

Kind of like school. Student 1 got an A in the class, student 2 got a C.
both students passed and will go on to graduate.
But student 1 is definitely the better student and therefore should get proper recognition over student 2. 
_________________ Hail to Frik! The latest charioteer to DIE for bloodbowl!  Slain, by Raiders of the Lost Tomb 

The_Murker
Joined: Jan 30, 2011

Posted:
Oct 17, 2017  20:40 

Arktoris wrote:  PurpleChest wrote: 
But I am lost as to why a 4-0 win is better than a 1-0 win.

Kind of like school. Student 1 got an A in the class, student 2 got a C.
both students passed and will go on to graduate.
But student 1 is definitely the better student and therefore should get proper recognition over student 2. 
Mmmm.. no. In one 4-0 loss a noob with 1 reroll wasted it on turn 1 on a troll block, losing the game 4-0 against norse, or whatever. In a different 4-0 loss, a legend coach, victim of a BLITZ and several cas, who was down 2-nil at the half, tried the best possible plays he could in an effort to salvage a draw, or even a win. These amazing plays didn't get the dice, hence the 4-0 loss. These are two dramatically different games, with the same lopsided score, both of which happen often. Rewarding 4-0 is an error, imo.
_________________
Join the waitlist. Watch the action. Leave the Empire. Come to Bretonnia! 

MattDakka
Joined: Oct 09, 2007

Posted:
Oct 17, 2017  20:43 

I agree, there is not necessarily direct and strict correlation between winning with high TD difference and coach's skill (of course, there could be exceptions). 
_________________ Please upvote the Diving Catch bug report, thanks! 

JackassRampant
Joined: Feb 26, 2011

Posted:
Oct 17, 2017  20:46 

Sp00keh wrote: 
A 1-0 win is worth 90% the CR of a 2-0 win.

Only if p=1.00. If p=0.50, a 1-0 win is worth 80% the CR of a 2-0 win. No?
But it caps out at +2 TD, so I don't think it's too bad. 
_________________ What is Nuffle's tree? Risk its trunk, space the branches. Touchdowns are its fruit.
What is Nuffle's lawn? Inches, squares, and Tackle Zones; reddened blades of grass. 


 