**OBJECTIVE**

We are in the middle of one of the most exciting sporting events in the world: the World Cup, a soccer tournament in which 32 countries compete against each other. If you decided to predict the outcome of the games, could you do better than random guessing? In this post I want to demonstrate that, without any prior knowledge of the teams, and using only the insights from a simple exploratory data analysis (EDA), it is possible to predict the outcome of the Group Stage games with higher accuracy than random guessing simply by never predicting ties.

**ABOUT THE WORLD CUP**

The World Cup is a tournament that occurs every four years; this year it takes place in Russia. The first part of the tournament (aka the Group Stage) contains 32 teams split into eight groups of four. The groups for this year are presented in Figure 1 below. All the teams within a group compete against each other, for a total of six games per group (three games per team). For example, Group A produces the following six matches: Russia-Saudi Arabia, Russia-Egypt, Russia-Uruguay, Saudi Arabia-Egypt, Saudi Arabia-Uruguay and Egypt-Uruguay.

In each game a Team A competes against a Team B, and there are three possible outcomes: Team A wins, Team B wins, or there is a tie (or draw). A team gets 3 points for every match won, 1 point for every tie and 0 points for every loss. Each team adds up the points from its games, and the top two teams per group move on to the next stage, the Round of 16. From that stage on, winning teams advance and losing teams are eliminated from the tournament (games can no longer end in a tie). The whole tournament has 64 games, but 48 of them occur in the Group Stage. In this post I only analyze those 48 Group Stage games, because they are the only ones that can end in a tie!
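The points scheme above is simple enough to express in a few lines of code. This is only an illustrative sketch (the `group_points` helper and its result encoding are my own, not from the post):

```python
# Group Stage scoring: 3 points per win, 1 per tie, 0 per loss.
POINTS = {"win": 3, "tie": 1, "loss": 0}

def group_points(results):
    """Total Group Stage points for a team's three results, e.g. ['win', 'tie', 'loss']."""
    return sum(POINTS[r] for r in results)

# A team that wins twice and ties once finishes the Group Stage with 7 points.
print(group_points(["win", "win", "tie"]))  # → 7
```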

**DATA**

For this study I used a data set of the outcome of World Cup games obtained from Kaggle. I only used the data from the Group Stage for the years between 1986 and 2014. This data is used to explore patterns about the game outcomes. I then simulate the prediction of the 48 Group Stage games for the current World Cup 2018.

**MODEL**

I will now analyze a few different models and understand how I can improve upon a baseline random guessing model.

*Baseline (Random) Guessing*

Our first model consists of guessing the result of each game completely at random; this is the baseline model. There are three possible outcomes per game (Team A wins, Team B wins, or a tie), and for each game I guess one of them with equal probability (1/3 each). I simulated guessing all 48 games 10,000 times to determine how the model performs.
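The post's simulation code isn't shown; here is a minimal NumPy sketch of how it could work (the seed and variable names are my own). The key simplification is that a uniformly random guess matches the true outcome with probability 1/3 regardless of what that outcome is, so each game can be simulated as a single biased coin flip:

```python
import numpy as np

rng = np.random.default_rng(42)
n_games, n_sims = 48, 10_000

# A uniform guess over three outcomes is correct with probability 1/3,
# whatever the true outcome is, so one simulation is 48 flips with p = 1/3.
correct = rng.random((n_sims, n_games)) < 1 / 3
hits = correct.sum(axis=1)

print(hits.mean())  # on average about 16 of the 48 games (33.3%)
```

A histogram of `hits` reproduces the shape of Figure 2.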

In Figure 2 I present a histogram of the 10,000 simulations for predicting the 48 Group Stage games of the World Cup 2018. The x-axis shows the number of correct predictions, between 0 and 48, and the y-axis shows the probability of obtaining that number of correct guesses. On average the model guesses 15.98 games correctly (33.29% of the games), which is expected since each guess has a one-in-three chance of being right. The vertical dashed line indicates the average number of correct guesses.

Based on the chart above, it is possible, although very unlikely, to predict every game correctly, or to predict none of them correctly; on average, however, we will get one third of the games right.

*Bias Guessing*

In this part of the study we explore the outcomes of the Group Stage matches from the previous eight World Cups (1986-2014). Given that there are three possible outcomes per game, I will determine what proportion of the games resulted in a tie.

In Figure 3 I present the percentage of games that resulted in a tie for each year. On average, across all years, about 26% of the games ended in ties. The year 1998 had the highest percentage of ties, with one third of the games!

If 26% of the games result in a tie, then 74% of the games have a winner. What if we incorporate this new information into our model? We will predict a tie 26% of the time, Team A winning 37% of the time, and Team B winning 37% of the time. Notice that the probabilities for Team A and Team B add up to 74%!
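A sketch of this biased-guessing simulation, again not the original code. I use the actual 2018 tie count (9 of 48 games, reported later in the post) as the ground truth; the 20/19 split of winners between "Team A" and "Team B" is made up, but it does not affect the result, since both are guessed with the same probability:

```python
import numpy as np

rng = np.random.default_rng(42)
n_games, n_sims = 48, 10_000

# True 2018 outcomes: 9 ties, 39 decisive games (winner split is hypothetical).
# Encoding: 0 = Team A wins, 1 = Team B wins, 2 = tie.
truth = np.array([0] * 20 + [1] * 19 + [2] * 9)

# Guess each game with the historical frequencies: 37% / 37% / 26%.
guesses = rng.choice(3, size=(n_sims, n_games), p=[0.37, 0.37, 0.26])
hits = (guesses == truth).sum(axis=1)

print(hits.mean() / n_games)  # ≈ 0.35, close to the 34.92% reported
```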

In Figure 4, I present the results of re-running the 10,000 simulations. The histogram of the Bias Guess model (red) is shifted to the right of the Baseline model (blue): the average number of correct predictions increased from 15.98 to 16.76 games (34.92%). Although not a huge improvement, it is a step in the right direction.

*No Tie Guessing*

Can we improve the model even further? There is another approach that takes advantage of the fact that ties are biased: they occur less often than one third of the time. If only 26% of the games result in a tie, what if I always predict a winner? I will get those 26% of the games wrong (the ones that end in a tie), but for the remaining 74% I can randomly pick Team A or Team B and be right half of the time, i.e. on 37% of all the games, or 17.76 games. This is better than the previous two models!
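The expected-value arithmetic behind this claim takes two lines. Always predicting a winner makes every tie a guaranteed miss, while a fair 50/50 pick calls each decisive game correctly half the time:

```python
p_tie = 0.26  # historical share of ties in Group Stage games

# Expected accuracy when never predicting a tie:
# miss all ties, win half of the decisive games.
expected_accuracy = (1 - p_tie) / 2

print(expected_accuracy)                 # 0.37
print(round(expected_accuracy * 48, 2))  # 17.76 games out of 48
```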

In Figure 5, I present the histogram obtained by always predicting a winner, and never a tie, when re-running the 10,000 simulations. This year we were particularly lucky, because only 9 out of the 48 games (18.75%) resulted in a tie. In other words, 81.25% of the games had a winner, and by randomly picking a winner we should correctly predict half of those games, about 40.62%. The simulation (green curve) is very close to this expectation, with an average of 19.53 games (40.69%) predicted correctly.
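A sketch of the No Tie simulation against the 2018 outcomes (again, not the original code; the split of winners between "Team A" and "Team B" is hypothetical and does not affect the average):

```python
import numpy as np

rng = np.random.default_rng(42)
n_games, n_sims = 48, 10_000

# True 2018 outcomes: 9 ties, 39 decisive games.
# Encoding: 0 = Team A wins, 1 = Team B wins, 2 = tie.
truth = np.array([0] * 20 + [1] * 19 + [2] * 9)

# Never predict a tie: pick Team A or Team B with a fair coin.
guesses = rng.choice(2, size=(n_sims, n_games))
hits = (guesses == truth).sum(axis=1)

print(hits.mean() / n_games)  # ≈ 0.41, close to the 40.69% reported
```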

In the best-case scenario, a year with no ties at all in the Group Stage, we could expect to guess 50% of the games on average. Given that the worst case observed so far is one third of the games tied, in 1998, we have never seen a case where this strategy would have done worse than the Baseline model.

The table below compares the percentage and the number of games predicted correctly, on average over the 10,000 simulations, for all three models. In summary, by never predicting a tie in the Group Stage, we can correctly predict on average 3.5 more games than the Baseline model.

| | Baseline | Bias | No Tie |
|---|---|---|---|
| Percent of Correct Predictions | 33.29% | 34.92% | 40.69% |
| Number of Correct Predictions | 15.98 | 16.76 | 19.53 |

**CONCLUSION**

In this post we demonstrated, with the help of EDA, that ties in the Group Stage of the World Cup occur only 26% of the time (less than one third), a bias in the outcomes. We can use this information to our advantage and **increase** our number of **correct predictions** with a simple strategy: **never predicting ties**.