Thursday, November 07, 2019

Encoding the zeitgeist for predicting the outcome of a cricket match


TLDR: Here's the code.

This happened a few months ago...

World Cup cricket mania was gripping the nation, and I was feeling left out because of my own ignorance of the sport. I decided to fight the battle on my own battlefield and make the game a bit more interesting for myself. I wanted to take part in the company pool without putting in a lot of effort to follow the sport, but I also didn't want to be the person with nothing to say when all that people talk about is cricket.
Given all the buzz surrounding AI/ML and the work that we are doing at Versa, I decided to build a minimum viable solution to help me win the pool.

The constraints I had for myself were simple: 
  • Don't spend a lot of time. Maximum of 1 day to implement an end-to-end solution. This ruled out any massive model building using play-by-play stats, player stats, team dynamics etc. 
  • Encode the zeitgeist and perception of the sports fans and enthusiasts rather than focus on micro indicators. 
Here's how I went about winning the pool (shared it with two other folks).

Home team advantage:

I don't claim to be an expert in any sport. However, I do claim to be an expert listener of sports blabber, and one thing that everyone seems to agree on is home team advantage. To factor this into the model, I needed to figure out whether the teams are playing in their home country. The ML algorithm can easily make the connection between a victory and the location of the game, so I scraped a random website found through Google to get the data:

   | Date       | Ground                  | Team_A      | Team_B      | Winner      | City       | Host
0  | 2000-01-02 | Eden Park               | New Zealand | West Indies | New Zealand | Auckland   | New Zealand
37 | 2000-01-04 | Owen Delany Park        | New Zealand | West Indies | New Zealand | Taupo      | New Zealand
39 | 2000-01-06 | McLean Park             | New Zealand | West Indies | New Zealand | Napier     | New Zealand
67 | 2000-01-08 | Westpac Stadium         | New Zealand | West Indies | New Zealand | Wellington | New Zealand
98 | 2000-01-09 | Brisbane Cricket Ground | Australia   | Pakistan    | Pakistan    | Brisbane   | Australia
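
One way to hand the home advantage to a model is to turn the Host column into a pair of 0/1 flags, one per team. The snippet below is only a sketch of that idea: the flag names `a_is_home` and `b_is_home` and the helper `add_home_flags` are made up for illustration, and the rows are trimmed to the three columns the flags need.

```python
# Rows reduced to the columns needed for the home flags.
matches = [
    {"Team_A": "New Zealand", "Team_B": "West Indies", "Host": "New Zealand"},
    {"Team_A": "Australia", "Team_B": "Pakistan", "Host": "Australia"},
    {"Team_A": "Australia", "Team_B": "Afghanistan", "Host": "England"},
]


def add_home_flags(match):
    """Return a copy of the match with 0/1 home indicators for both teams."""
    flagged = dict(match)
    flagged["a_is_home"] = int(match["Team_A"] == match["Host"])
    flagged["b_is_home"] = int(match["Team_B"] == match["Host"])
    return flagged


flagged_matches = [add_home_flags(m) for m in matches]
for row in flagged_matches:
    print(row["Team_A"], row["Team_B"], row["a_is_home"], row["b_is_home"])
```

A neutral venue, like the last row, simply gets zero for both flags, so the model can tell "nobody is at home" apart from "Team_A is at home".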

Encoding gaming form advantage:

Every sports fan I have spoken to talks about a team or a player underperforming or outperforming because they are in bad/good form.

How do we quantify "good" form? If team A is a vastly better team than team B, but team A is in its worst form and team B is in its best form, then who has a better chance? Surely there is an upper bound on the benefit of good form!!

Based on these questions, I formulated a few axioms. These are not true "facts", but widely accepted ideas.

Sridhar's AXIOMS of team sports (SATS):

Axiom 1: A previous win against any specific team can improve the winning probability by 𝛼
Axiom 2: A previous loss against any specific team reduces the winning probability by some 𝛽, where 𝛼 < 𝛽
This is based on the observation that sports fans always say that a team is in form because it has won quite a few matches in the recent past, but are quick to retort that the form is broken if the team loses a single game. This suggests that a loss is a heavier blow to the psyche than the uplift provided by a win.

Axiom 3: Form against the specific competitor (𝛼' and 𝛽') is also a factor
If team A has always won against team B, then it'll have a higher probability of winning against team B, even if team B has been winning a recent string of matches. This effect can be seen in World Cup cricket matches between India and Pakistan.

Axiom 4: The contribution of form to the winning probability is capped to some number
Axiom 5: The form is a function of the winning streak, i.e. a team will have "good form" if it has been winning more matches in the recent past.
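
To make the bookkeeping behind the axioms concrete, here is a rough sketch that keeps two separate records per team: a general form (Axioms 1 and 2) and a form against each specific opponent (Axiom 3), both clamped to a cap (Axiom 4). Everything in it is illustrative: the constants `ALPHA`, `BETA` and `FORM_CAP` are placeholder values, the fixed-step update stands in for whatever form function is eventually used, and the function and feature names are hypothetical.

```python
from collections import defaultdict

ALPHA = 1.0     # gain from a win (placeholder value)
BETA = 2.0      # penalty from a loss; Axiom 2 requires ALPHA < BETA
FORM_CAP = 5.0  # Axiom 4: form cannot grow past this bound (placeholder value)


def update_form(form, won):
    """Return the form after one result, clamped to the range [0, FORM_CAP]."""
    form = form + ALPHA if won else form - BETA
    return max(0.0, min(FORM_CAP, form))


def form_features(matches):
    """Record each side's form as it stood before every game.

    `matches` is a list of (team_a, team_b, winner) tuples sorted by date.
    """
    overall = defaultdict(float)       # team -> general form
    head_to_head = defaultdict(float)  # (team, opponent) -> form against that opponent
    rows = []
    for team_a, team_b, winner in matches:
        # Snapshot the features first so a game never sees its own result.
        rows.append({
            "a_form": overall[team_a],
            "b_form": overall[team_b],
            "a_vs_b_form": head_to_head[(team_a, team_b)],
            "b_vs_a_form": head_to_head[(team_b, team_a)],
            "a_won": winner == team_a,
        })
        # Then fold the result into both records for both teams.
        for team, rival in ((team_a, team_b), (team_b, team_a)):
            won = winner == team
            overall[team] = update_form(overall[team], won)
            head_to_head[(team, rival)] = update_form(head_to_head[(team, rival)], won)
    return rows


history = [
    ("India", "Pakistan", "India"),
    ("India", "Australia", "Australia"),
    ("India", "Pakistan", "India"),
]
rows = form_features(history)
print(rows[2])
```

In this toy history, the loss to Australia wipes out India's general form before the third game, while its form against Pakistan is untouched. That is the distinction Axiom 3 is after.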

From Axiom 5, we know that 𝛼=f(streak). So, let's define the streak and its growth function. I'm arbitrarily choosing a decay of 0.8 for the winning streak.

Let's take a look at the function growth for streak:

Here the max streak is 5, which is reached after around 30 consecutive wins. However, I'm going to linearly decrease the streak when a game is lost. Note the arbitrary nature of the decay function and of the streak increase/decrease. I don't need to get the function exactly right.
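
One update rule that matches the numbers above (a decay of 0.8, a ceiling of 5, about 30 straight wins to get there) is a geometric sum: from a zero streak, consecutive wins add 1, 0.8, 0.64 and so on, which approaches 1 / (1 - 0.8) = 5. The sketch below assumes that rule. The exact recurrence, the helper names and the size of the drop per loss (`LOSS_STEP`) are guesses for illustration, not necessarily what the original code does.

```python
DECAY = 0.8      # decay chosen in the post
LOSS_STEP = 1.0  # assumption: fixed amount removed by each loss


def update_streak(streak, won):
    """Return the streak value after one more result."""
    if won:
        # Geometric growth: the value climbs towards 1 / (1 - DECAY) = 5 but never passes it.
        return 1.0 + DECAY * streak
    # Linear decrease on a loss, never dropping below zero.
    return max(0.0, streak - LOSS_STEP)


def streak_after(results):
    """Apply a sequence of results (True = win, False = loss) to an empty streak."""
    streak = 0.0
    for won in results:
        streak = update_streak(streak, won)
    return streak


for wins in (1, 5, 10, 30):
    print(wins, round(streak_after([True] * wins), 3))
```

With these choices, a loss always costs at least as much as the most recent win gained, which is in the spirit of Axiom 2.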

The machine learning algorithm will figure out the importance to give to the streak.

After training a gradient boosted tree classifier and evaluating it on the test set, I got an accuracy of 67%. I didn't expect high accuracy from this model because the inputs are highly subjective. It's not great, but acceptable for a few hours of work.

Once all the i's were dotted and t's crossed, I was able to run the prediction on the actual matches. Here are a few test predictions:
   
print_update_prediction('Australia','Afghanistan', host = 'England')
Australia, 0.9750151038169861

print_update_prediction('India','Australia', host='India') 
India, 0.6103534698486328


print_update_prediction('India','Australia', host='Australia') 
India, 0.5572055578231812
  

Thanks to the untimely rains in the World Cup, 4-5 games were rained out. A few of those games might have been mispredicted by my model. All in all, I was able to predict 87% of the games correctly.

Check out the code on GitHub. I suspect that this would work for any team sport where the set of teams doesn't change. It would work for soccer, field hockey etc., but not for IPL cricket or NFL/NBA, where the teams change every year.