FUMBBL :: Online Blood Bowl League

Joined: Apr 09, 2006

Posted: Sep 18, 2011 - 22:27

As I put in my blog, I was looking for ideas on how to test hypotheses against the FUMBBL API.

Some people suggested I should come here for advice, so here I am.

First, what I've got:

https://github.com/Hitonagashi/fumbbl_games

I've grabbed every single game played on FUMBBL since the Box turned LRB 6, and placed them all into a database. If you are programmatically minded and interested, I've also got installation instructions there, and an uploaded database. Theoretically, anyone can repeat my experiments. Feel free to grab it and do whatever you want with it (within the restriction that all data belongs to the Big C). If anyone has any changes to the code, feel free to submit a pull request.

As an experiment in writing against it, I've written a sample script that reproduces those racial tables we had for LRB 4. These are only using data gathered from the Box.

The script is here: https://github.com/Hitonagashi/fumbbl_games/blob/master/scenarios/team_comparison.rb

And the results are here:
http://hitonagashi.github.com/fumbbl_games/

I apologise for the lack of formatting on it; I've only been working on this today. :)

If anyone wants to format it more nicely, be my guest. It's on the branch gh_pages, and I'd love a more readable site.

It should be accurate though. I'd also *REALLY* appreciate it if you create a new thread to discuss the results. I'd like to keep this one for details of the analysis. Call the results a) an experiment, and b) a sanity check to see what could do with being investigated in more detail. They don't look too different to the LRB 4 figures though.

Okay.

Onto what I want to find.

To get meaningful data, I think we need 2 things:

1) A way of producing a coach ranking purely from the games they have played.
Win/Draw/Loss percentage might just be enough here? See 2), though, for why I have doubts.

2) A way of ranking results for a team.

The problem with 2) is that you can't just use a raw percentage (1/0/0 is considerably less impressive than 99/0/1), and I'm wondering if a weighted sum of results (5 for a win, 2 for a draw, 1 for a loss) falls prey to a team with 200 games and 100 losses ranking as one of the best teams purely through quantity.

Anyone of a more statistical bent, how do you get over this problem?
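
To make the volume problem concrete, here is a minimal Python sketch comparing a summed 5/2/1 score with the same score averaged per game. The two records are invented for illustration and the function names are hypothetical; nothing here comes from the fumbbl_games code.

```python
# Hypothetical illustration: the records below are invented, not Box data.
WEIGHTS = {"win": 5, "draw": 2, "loss": 1}  # the 5/2/1 weighting floated above


def total_points(wins, draws, losses):
    """Summed score: grows with every game played, even a loss."""
    return (WEIGHTS["win"] * wins
            + WEIGHTS["draw"] * draws
            + WEIGHTS["loss"] * losses)


def points_per_game(wins, draws, losses):
    """Same score averaged over games, so volume alone no longer helps."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return total_points(wins, draws, losses) / games


grinder = (100, 0, 100)  # 200 games, 100 of them losses
streaker = (9, 1, 0)     # 10 games, unbeaten

print(total_points(*grinder), total_points(*streaker))
print(points_per_game(*grinder), points_per_game(*streaker))
```

On the summed score the 200-game team wins easily; per game the unbeaten team comes out ahead. Averaging removes the volume bias, but it still leaves a 1/0/0 record looking perfect, so the small-sample question remains.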

Also, if anyone can suggest a clear scenario, or has any suggestions for analysing the data, I'm all ears. I realise that reading code isn't to everyone's liking, but if you navigate into the Models folder (on the website of the source code) then you can see what you have available for each model. There is more information in the API, but that's all I grabbed for now. I can go back later, however.

Right.

Feedback?

Malerun

Joined: Dec 03, 2003

Posted: Sep 18, 2011 - 22:52

I would recommend starting out as simple as you can get. So:

The statistician's answer to your doubt about 1/0/0 looking better than 99/0/1 is that it seems better, but the uncertainty (standard deviation) of the former is too large to decide. The statistician would then carry out the calculations anyway, but would get a larger uncertainty on results based on the 1/0/0 record than on those based on the 99/0/1 record.

A "cruder" model would be to use win percentage, but only for coaches with a reasonable number of (Box) games, say 10 or 100.
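
A rough Python sketch of both ideas. The Wilson score interval used here is just one common way to put an uncertainty range on a win rate; it is my pick for illustration, not something specified in this thread. Draws are ignored in the interval for simplicity, and the function names and the two records are hypothetical.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win rate (z = 1.96)."""
    if games <= 0:
        raise ValueError("need at least one game")
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games
                         + z * z / (4 * games * games)) / denom
    return centre - half, centre + half


def eligible(records, min_games=10):
    """Cruder alternative: keep only records with enough games played."""
    return {name: r for name, r in records.items() if sum(r) >= min_games}


# Invented (wins, draws, losses) records mirroring the 1/0/0 vs 99/0/1 example.
records = {"one_game": (1, 0, 0), "veteran": (99, 0, 1)}

for name, (w, d, l) in records.items():
    low, high = wilson_interval(w, w + d + l)
    print(name, round(low, 3), round(high, 3))
```

The one-game record gets a very wide interval and the 100-game record a narrow one, which is the "larger uncertainty" point in numbers. Ranking by the lower end of the interval is one possible way to stop 1/0/0 outranking 99/0/1.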

_________________
"noncompetitive play is like porn"
- Azure

Hitonagashi

Joined: Apr 09, 2006

Posted: Sep 19, 2011 - 00:37

Aye, I thought about statistical significance, Malerun, but the problem is that you can't just say "oh, he's a good coach with 90% probability".

The issue is that you need those figures *before* you can start analysing data, a consistent metric across all coaches/teams to act as a method of getting 'results'. That way you can then do the actual tests against that number. "Are orcs more effective at 1500 TV than at 1900 TV? Well, their SillyRanking goes up by 2.5 points with the following deviation, so yes, we can conclude they are slightly more effective".

I did consider chopping the number of teams down to those that had played X games (say, 10), and then doing all the calculations with that. I believe FUMBBL itself uses 5. Might be the best way to start.

_________________
http://www.calculateyour.tv - an easy way to work out specific team builds.

koadah

Joined: Mar 30, 2005

Posted: Sep 19, 2011 - 00:59

Why aren't the mirror matches all 50%?

_________________

[SL] + Official Stunty teams. Progression KO. Old & new teams welcome. 29th May!

RobRoyDuncan

Joined: Apr 15, 2011

Posted: Sep 19, 2011 - 01:07

koadah wrote:

Why aren't the mirror matches all 50%?

Good catch. Looks to me like it might only be looking for wins, so that draws are counting negatively against both sides. You can see that also in some of the other matchups. For instance, in the first data set, it shows Amazons at 33% against CD and CD at 50% against Amazons, with 6 played. If you figure that Zons won 2, CD won 3 and they drew 1, with the draw being discarded, the numbers add up. The tables should probably be rerun with a draw being worth half of a win.
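
A minimal Python sketch of that change, using the 2/3/1 split guessed above for the six Amazon vs Chaos Dwarf games. The function name is hypothetical and is not taken from team_comparison.rb.

```python
def win_fraction(wins, draws, losses, draw_value=0.5):
    """Share of games won, with each draw worth `draw_value` of a win."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return (wins + draw_value * draws) / games


# Guessed record from the post above: Amazons won 2, Chaos Dwarfs won 3, 1 draw.
amazons = win_fraction(2, 1, 3)
chaos_dwarfs = win_fraction(3, 1, 2)

# Wins-only version, which reproduces the 33% / 50% shown in the table.
amazons_wins_only = win_fraction(2, 1, 3, draw_value=0.0)
chaos_dwarfs_wins_only = win_fraction(3, 1, 2, draw_value=0.0)

print(amazons, chaos_dwarfs, amazons + chaos_dwarfs)
print(amazons_wins_only, chaos_dwarfs_wins_only)
```

With draws at half a win the two sides of any matchup sum to 100%, so every mirror match would come out at exactly 50%.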

Wallace

Joined: May 26, 2004

Posted: Sep 19, 2011 - 01:09

Great work! I'm happy to provide some suggestions. I use stats for fun and profit (i.e. my job and my hobby...) so hopefully I can help. My main comment, though, is that what is somewhat lacking is not so much a suitable comparison statistic as a clear idea of what you are trying to compare or achieve. If you can articulate a very clear hypothesis you want to test, then a meaningful statistic to test it would be much easier for myself and others to suggest. You'd probably be able to answer the question yourself, since you're clearly pretty cluey, having set all this up!

The confusion is that you seem to be talking about ranking teams' overall performance over many games (at all the TVs they have played at), as well as ranking coaches (I think), but then your tables so far compare matches played by race in TV brackets. These are all very different things. Let's first clarify exactly what we want the data to tell us; then how to pull it out becomes much easier to answer.

Hitonagashi

Joined: Apr 09, 2006

Posted: Sep 19, 2011 - 01:23

RobRoyDuncan wrote:

koadah wrote:

Why aren't the mirror matches all 50%?

Good catch. Looks to me like it might only be looking for wins, so that draws are counting negatively against both sides. You can see that also in some of the other matchups. For instance, in the first data set, it shows Amazons at 33% against CD and CD at 50% against Amazons, with 6 played. If you figure that Zons won 2, CD won 3 and they drew 1, with the draw being discarded, the numbers add up. The tables should probably be rerun with a draw being worth half of a win.

Exactly right.

I did that on purpose so that you can also see the number of games that are drawn (a mirror match with no draws would be 50%, so a 30% means that the race tends to draw a lot in mirror matches). This is a rating of how likely a win is when you play that race against the other... I could weight them; that isn't a huge change to the calculation.

Has anyone got any suggestions?

[EDIT]

I see your point... it didn't occur to me that I was treating a draw the same as a loss. I'll rerun those tomorrow :)

Last edited by Hitonagashi on Sep 19, 2011; edited 1 time in total

Hitonagashi

Joined: Apr 09, 2006

Posted: Sep 19, 2011 - 01:28

Wallace wrote:

Great work! I'm happy to provide some suggestions. I use stats for fun and profit (i.e. my job and my hobby...) so hopefully I can help. My main comment, though, is that what is somewhat lacking is not so much a suitable comparison statistic as a clear idea of what you are trying to compare or achieve. If you can articulate a very clear hypothesis you want to test, then a meaningful statistic to test it would be much easier for myself and others to suggest. You'd probably be able to answer the question yourself, since you're clearly pretty cluey, having set all this up!

The confusion is that you seem to be talking about ranking teams' overall performance over many games (at all the TVs they have played at), as well as ranking coaches (I think), but then your tables so far compare matches played by race in TV brackets. These are all very different things. Let's first clarify exactly what we want the data to tell us; then how to pull it out becomes much easier to answer.

Personally, I wanted to try and discover a way of testing whether an experienced chaos/nurgle team (you can only get the number of games, not the skills on the players) has a higher success rate in the hands of a less skilled coach than you would expect.

My hypothesis is that clawmbpo can take a poorer coach and give them better results than they would expect with a different race, but in the hands of a stronger coach, the success rate stays about the same. :)

Thing is, I thought that we'd need those 2 figures for pretty much anything to do with the data and analysing it usefully.

Stats is very far from my expertise though. I did do maths at uni, but I'm a number theory guy/programmer :D. Any advice is very much appreciated.

Wallace

Joined: May 26, 2004

Posted: Sep 19, 2011 - 02:24

Okay sounds good. Try this on for size.

The measures of performance (for coach or team) don't need to be complex, in fact it is better to be a simple reflection of how much they win. What is the key here is how these measures correlate in different circumstances.

So, let's define a simple Coach Ability (CA) statistic as the win fraction, with draws counting as half. For W/D/L percentages of, say, 50/20/30, we would have CA = 0.6. Note that we can define this over all games or filter out certain races, so we could compare a coach's overall CA to their CA without each race, or to their CA for each race alone.

Now we want to compute the correlations, and we need a suitable statistic. Have a look here at the section "Pearson's product-moment coefficient". You want to compute this statistic for each set of [CA (overall, but without playing with race X) : CA (only playing with race X)]. The coefficient runs from -1 to 1: the closer it is to 1, the stronger the positive correlation, and values near 0 mean little or no linear relationship.

So, to be clear, say you are looking at Chaos. First find the list of all coaches who have ever played a [B] match playing AS a Chaos team. Filter out those with less than some suitable number of chaos games, probably about 10. Now, for each of these coaches, calculate their Box CA value considering every game they have played except for those when they played as Chaos. Now calculate their CA value for ONLY those games in which they played as Chaos. You now have for each coach a pair of CA values. These values are the x and y mentioned on the wiki page and N is the number of coaches you have calculated this pair of values for. The most straightforward version of the statistic is the last one on that page, where it says "r_xy = <something> = <use this one>".
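
The steps above could be sketched in Python roughly as follows. The input layout (a list of `(coach, race_played, result)` tuples with results "W", "D" or "L") and both function names are assumptions for illustration, not the fumbbl_games schema. The sketch also skips coaches who have no games with any other race, since they have no x value. The correlation is written in the mean-centred form, which is algebraically the same quantity as the sums-based formula on the wiki page.

```python
import math
from collections import defaultdict

RESULT_VALUE = {"W": 1.0, "D": 0.5, "L": 0.0}  # draws count as half a win


def ca_pairs(games, race, min_games=10):
    """Return [(CA without `race`, CA with `race`)], one pair per coach.

    `games` is an iterable of (coach, race_played, result) tuples; this
    layout is an assumption made for the sketch.
    """
    with_race = defaultdict(list)
    without_race = defaultdict(list)
    for coach, played, result in games:
        bucket = with_race if played == race else without_race
        bucket[coach].append(RESULT_VALUE[result])

    pairs = []
    for coach, scores in with_race.items():
        others = without_race.get(coach, [])
        # Need enough games with the race, and some games without it.
        if len(scores) >= min_games and others:
            pairs.append((sum(others) / len(others),
                          sum(scores) / len(scores)))
    return pairs


def pearson_r(pairs):
    """Pearson product-moment correlation of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two coaches")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)
```

With real data loaded into `games`, something like `pearson_r(ca_pairs(games, "Chaos"))` would give the Chaos figure, and the same call can be repeated with each other race name.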

If you do this you now know the correlation between overall ability and ability using Chaos. Alone this probably doesn't tell you too much, so you need to repeat the above process for every other race to see how this correlation pans out in a comparative sense.

Now, if your hypothesis is very true, it should be evident in the results you will obtain (Chaos/Nurgle correlations will be significantly weaker than the others). However if it is not, there is a lot more that can be done in isolating CA in different TV bands and for different match ups. I can go through this in detail, but maybe if you have a go at the above we could go from there. I'd be very interested to see the results!

Purplegoo

Joined: Mar 23, 2006

Posted: Sep 19, 2011 - 08:30

I’m going to leave the nuts and bolts of statistics to those that clearly know more about it than I do.

However, I think your study needs some sort of control (sort of building on the above point, but you can do that along with this more specific thing). If you come up with ‘Clawmbpo allows a less than average coach to win 20% more than if he were using a team without the mechanic’, then OK – but is that 20% hike more than say, what he’d experience with low TV ‘Zons? Woodies? MA11 Gutters? You know what I’m saying – another great mechanic that we know to be excellent and helps newer coaches along, but doesn’t get the press. If you can tell me that Clawmbpo helps a n00b win 20% more, but if he were to keep Amazons below TV1500, he’d win 30% more than he ‘should’, we’d at least have a reference point.

The other question is how much of a winning hike is ‘too much’ of a winning hike (broken, op, yada yada). Tough to answer without an awful thread! ;)

garyt1

Joined: Mar 12, 2011

Posted: Sep 19, 2011 - 09:35

Do we really think that CLAWPOMB increases the win % significantly, or is it perhaps a small increase in win % and a rather large swing in casualties? CLAWPOMB Chaos Dwarfs can maul on autopilot, for example.
Is it possible to compare casualties in this study?

_________________
“A wise man can learn more from a foolish question than a fool can learn from a wise answer.”

Fela

Joined: Dec 27, 2004

Posted: Sep 19, 2011 - 10:16

garyt1 wrote:

Do we really think that CLAWPOMB increases the win % significantly

That's not the working hypothesis, though.

The WH is that clawpomb can somewhat substitute positioning skill and due to its inherent randomness works (again, somewhat) independently of coach skill. Therefore one might be able to see a pattern where less skilled coaches perform better than would be expected.

Hitonagashi

Joined: Apr 09, 2006

Posted: Sep 19, 2011 - 10:18

Goo:

Good idea. My initial inclination is rookie amazons. In the charts I posted, even if they are *just* using wins, look at their win rate at low TV! This isn't really news, but the only thing like it anywhere else in the chart is 1750-2250 TV woodies, who look very good as well. Both of which we knew already. I can't detect skills on players, so we will have to draw conclusions from the race and TV range.

Gary: I think if we made a comparison saying that clawmbpo causes more casualties than blodge, that would be fairly evident :D. I can get the figures for casualties, I think, though. All I care about right now is whether it helps you win games. LRB 4 Khemri caused plenty of blood, but only GW seemed to think they were OP.

uuni

Joined: Mar 12, 2010

Posted: Sep 19, 2011 - 11:03

@Hitonagashi:

Still, I wonder if you could actually ignore the coach factor. Can you not just compare the variance of winning for clawpombing teams with the variance of winning for non-clawpombers?

I am referring to your original thought in your blog of

Hitonagashi wrote:

It means that poorer coaches can beat stronger coaches by taking tons of killers, and hoping the dice pan out. If you have a clawmbpo heavy team, I think statistically, the percentage of your games where luck wins you the game increases (but not significantly... maybe from 40% to 45%).

My comment to that was

uuni wrote:

The first thing that comes to mind would be an experiment like the following:

Null hypothesis: The variances of game results for different rosters do not differ.

Hypothesis: The variance of game results for rosters that normally wield clawpomb is smaller than the variance for the complement.

If I have understood it right, you could use a statistical test of whether the variance of win ratios for clawpomb-race rosters is smaller than that of the complement rosters. Would a t-test work? Can someone more familiar with statistics tell?

What do you think?
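
On the t-test question: as far as I know, a t-test compares means rather than variances, so a test aimed at spread (Levene's test, an F-test, or a permutation test) would be a closer fit. Below is a minimal Python sketch of a permutation version, assuming you already have one win ratio per team in each group. The function name and the sample numbers are made up for illustration.

```python
import random
from statistics import mean, variance


def variance_permutation_p(claw, others, rounds=2000, seed=0):
    """One-sided p-value for 'variance(claw) < variance(others)'.

    Each group is centred on its own mean first, so a difference in
    average win ratio does not masquerade as a difference in spread.
    Each group needs at least two values.
    """
    rng = random.Random(seed)
    claw_mean, others_mean = mean(claw), mean(others)
    a = [x - claw_mean for x in claw]
    b = [x - others_mean for x in others]
    observed = variance(a) - variance(b)

    pooled = a + b
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # reassign group labels at random
        gap = variance(pooled[:len(a)]) - variance(pooled[len(a):])
        if gap <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


# Invented win ratios, one per team; not real Box data.
claw_ratios = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48]
other_ratios = [0.10, 0.90, 0.30, 0.70, 0.20, 0.80]

p_value = variance_permutation_p(claw_ratios, other_ratios)
print(p_value)
```

A small p-value would suggest the first group really is less spread out than the second. To test the opposite direction (clawpomb rosters having the larger variance), swap the two arguments.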

Fela

Joined: Dec 27, 2004

Posted: Sep 19, 2011 - 11:42

Why would the variance be smaller if the effect of luck on the match increases?