The metamorphosis of a Hacker: November 2019

TLDR: Here's the code.

This happened a few months ago...

World Cup cricket mania was gripping the nation, and I was feeling left out because of my own ignorance of the sport. I decided to fight the battle on my own battlefield and make the game a bit more interesting for myself. I wanted to take part in the company pool without putting in a lot of effort to follow the sport, but I also didn't want to be the person with nothing to say when all anyone talks about is cricket.
Given all the buzz surrounding AI/ML and the work that we are doing at Versa, I decided to build a minimum viable solution to help me win the pool.

The constraints I had for myself were simple:

Don't spend a lot of time. A maximum of 1 day to implement an end-to-end solution. This ruled out any massive model building using play-by-play stats, player stats, team dynamics, etc.
Encode the zeitgeist and perception of sports fans and enthusiasts rather than focusing on micro indicators.

Here's how I went about winning the pool (shared it with two other folks).

Home team advantage:

I don't claim to be an expert in any sport. However, I do claim to be an expert listener of sports blabber, and one thing that everyone seems to agree on is home team advantage. To factor this into the model, I needed to figure out whether either team is playing in its home country. An ML algorithm can easily make the connection between a victory and the location of the game, so I scraped a random website I found through Google to get the data:

	Date	Ground	Team_A	Team_B	Winner	City	Host
0	2000-01-02	Eden Park	New Zealand	West Indies	New Zealand	Auckland	New Zealand
37	2000-01-04	Owen Delany Park	New Zealand	West Indies	New Zealand	Taupo	New Zealand
39	2000-01-06	McLean Park	New Zealand	West Indies	New Zealand	Napier	New Zealand
67	2000-01-08	Westpac Stadium	New Zealand	West Indies	New Zealand	Wellington	New Zealand
98	2000-01-09	Brisbane Cricket Ground	Australia	Pakistan	Pakistan	Brisbane	Australia
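As a rough illustration of how the Host column could become a model input, here is a minimal sketch that turns each row into two 0/1 flags. The function name `home_flags` and the flag encoding are assumptions made for this example, not necessarily what the original code does; the rows are copied from the sample above.

```python
# Rows copied from the sample above: (date, team_a, team_b, winner, host).
matches = [
    ("2000-01-02", "New Zealand", "West Indies", "New Zealand", "New Zealand"),
    ("2000-01-09", "Australia", "Pakistan", "Pakistan", "Australia"),
]


def home_flags(team_a, team_b, host):
    """Return (a_is_home, b_is_home) as 0/1 flags.

    Hypothetical helper: both flags are 0 when the match is on neutral ground.
    """
    return int(team_a == host), int(team_b == host)


for date, team_a, team_b, winner, host in matches:
    a_home, b_home = home_flags(team_a, team_b, host)
    print(date, team_a, a_home, team_b, b_home, "winner:", winner)
```

Keeping two separate flags, rather than a single "home side" column, lets the model also see neutral-venue games, which matter for a World Cup where most teams play away from home.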

Encoding gaming form advantage:

Every sports fan I have spoken to talks about a team or a player underperforming or outperforming because they are in bad or good form.

How do we objectify "good" form? If team A is a vastly better team than team B, but team A is in its worst form and team B is in its best form, then who has the better chance? Surely there is an upper bound on the benefit of good form!

Based on these questions, I formulated a few axioms. These are not true "facts", but universally accepted ideas.

Sridhar's AXIOMS of team sports (SATS):

Axiom 1: A previous win against any specific team improves the winning probability by some 𝛼
Axiom 2: A previous loss against any specific team reduces the winning probability by some 𝛽, where 𝛼 < 𝛽
This is based on the observation that sports fans say a team is in form when it has won quite a few matches in the recent past, but are quick to declare the form broken when the team loses a single game. This suggests that a loss is a heavier blow to the psyche than the uplift provided by a win.

Axiom 3: Specific form against the competitor, 𝛼' and 𝛽', is also a factor
If team A has always won against team B, then it will have a higher probability of winning against team B, even if team B has been on a recent string of wins. This effect can be seen in World Cup cricket matches between India and Pakistan.

Axiom 4: The contribution of form to the winning probability is capped at some value

Axiom 5: Form is a function of the winning streak, i.e. a team has "good form" if it has been winning more matches in the recent past.
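One way to make Axiom 3 concrete is to count, for each match, how many earlier meetings each side has won against the other. The sketch below is only an illustration under that assumption: the function name `head_to_head_features` and the tuple layout are hypothetical, and the original code may encode head-to-head form differently.

```python
from collections import defaultdict


def head_to_head_features(matches):
    """Return, per match, (earlier wins of A over B, earlier wins of B over A).

    `matches` is a chronologically sorted list of (team_a, team_b, winner)
    tuples. Assumes the winner is always one of the two teams (no ties or
    abandoned games).
    """
    wins = defaultdict(int)  # (winner, loser) -> number of earlier wins
    features = []
    for team_a, team_b, winner in matches:
        # Read the history before updating it, so the current result
        # does not leak into its own features.
        features.append((wins[(team_a, team_b)], wins[(team_b, team_a)]))
        loser = team_b if winner == team_a else team_a
        wins[(winner, loser)] += 1
    return features


# The four New Zealand v West Indies games from the sample data above.
series = [("New Zealand", "West Indies", "New Zealand")] * 4
print(head_to_head_features(series))
```

The counts are raw on purpose: in the spirit of the axioms, the weights 𝛼' and 𝛽' are left for the learning algorithm to work out.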

From Axiom 5, we know that 𝛼=f(streak). So, let's define the streak and its growth function. I'm arbitrarily choosing a decay of 0.8 for the winning streak.

Let's take a look at the function growth for streak:

Here the maximum streak value is 5, which is reached after around 30 consecutive wins. However, I'm going to linearly decrease the streak when a game is lost. Note the arbitrary nature of the decay function and of the streak increase/decrease; I don't need to get the function exactly right.
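A streak function consistent with those numbers is a geometric one: each win adds 1 on top of the previous value scaled by 0.8, which converges to 1 / (1 - 0.8) = 5. The sketch below assumes that form. The decay of 0.8 comes from the text, but the name `update_streak`, the loss step of 1.0, and the floor at zero are assumptions made for illustration.

```python
DECAY = 0.8      # decay factor mentioned in the text
LOSS_STEP = 1.0  # assumed size of the linear decrease after a loss


def update_streak(streak, won):
    """Return the new streak value after one result (hypothetical helper)."""
    if won:
        # Wins follow s_n = 1 + DECAY * s_(n-1): 1, 1.8, 2.44, ...
        # The limit of this sequence is 1 / (1 - DECAY) = 5.
        return 1.0 + DECAY * streak
    # A loss subtracts a fixed step, floored at zero.
    return max(0.0, streak - LOSS_STEP)


streak = 0.0
for _ in range(30):
    streak = update_streak(streak, won=True)
print(round(streak, 3))  # close to, but just below, the cap of 5
```

Because the value saturates, the streak also respects Axiom 4: no run of wins, however long, pushes it past the cap.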

The machine learning algorithm will figure out the importance to give to the streak.

After running a gradient boosted tree classifier, I get an accuracy of 67% on the test set. I don't expect high accuracy from this model because the inputs are highly subjective. It's not great, but it's acceptable for a few hours of work.

Once all the i's were dotted and t's crossed, I was able to run the prediction on the actual matches. Here are a few test predictions:

   
print_update_prediction('Australia', 'Afghanistan', host='England')
Australia, 0.9750151038169861

print_update_prediction('India', 'Australia', host='India')
India, 0.6103534698486328

print_update_prediction('India', 'Australia', host='Australia')
India, 0.5572055578231812

Thanks to the untimely rains during the World Cup, 4-5 games were rained out. A few of those games might have been mispredicted by my model. All in all, I was able to predict 87% of the games correctly.

Check out the code on GitHub. I suspect that this would work for any team sport where the teams don't change. It should work for soccer, field hockey, etc., but not for IPL cricket or the NFL/NBA, where the teams change every year.

Thursday, November 07, 2019

Encoding the zeitgeist for predicting the outcome of a cricket match
