Wednesday, May 7, 2014

The Birthday Problem: Pro Tennis Edition


If you've ever taken an introductory-level course in statistics or probability, there's a decent chance you've encountered the birthday problem, a classic example used to illustrate how probability often defies our intuition. The problem goes something like this: you're in a classroom with 30 students and the professor asks, "What are the chances that two students in the room share the same birthday?" Our intuition is that the likelihood is very low, since the chance that one person shares a birthday with another random person is only 1/365 (let's assume no leap years...). But as soon as you start adding more people to the picture, the equation changes.

In a room with 2 people, the chance that they share a birthday is 1/365. Let's say the two people don't share a birthday. If you add one more person, the third person could share a birthday with either of the two people already in the room, so the chance that the third person shares a birthday with someone is 2/365. For a fourth person it is 3/365, and so on. In this way, the chance of getting a duplicate birthday increases with each person you add. For instance, if you have 36 people in the room already and nobody happens to share a birthday, there's a 36/365 chance (almost 10%) that adding a 37th person will result in two people having the same birthday. Combining these steps, it turns out there's a 70.6% chance that among 30 random people, at least 2 will share a birthday.
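To see where the 70.6% comes from, here is a minimal Python sketch of the calculation. It works with the complement: the chance that everyone's birthday is distinct is the product of the chances that each new person avoids all the birthdays already in the room, and the chance of at least one match is one minus that. The function name `prob_shared_birthday` is just something I made up for illustration, and it assumes 365 equally likely birthdays.

```python
def prob_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday.

    Assumes `days` equally likely birthdays and no leap years.
    """
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


print(round(prob_shared_birthday(30), 3))  # 0.706
```

The same function gives 1/365 for a room of 2, matching the first step above.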

I am a big tennis fan, and with clay season in full swing I've started paying attention to the ATP rankings lately, which got me thinking: how many top tennis players share the same birthday? I decided to put some of the tools I learned in Getting and Cleaning Data and Data Wrangling with MongoDB to work and write some code to scrape player data from the ATP website, then produce a list of players in the top 100 with the same birthday (note that the top 100 has changed since I did this).

First I went to the ATP site to look at the structure of their player bio webpages. Then I wrote Python code to download the HTML, extract the birthdays from the player bios, and loop through the players to check for duplicates. The code is a bit messy and separated into different modules, so I'll spare you having to look at it, but here is the final result:

['Novak-Djokovic', '22.05.1987'] ['Jurgen-Melzer', '22.05.1981']
['Roger-Federer', '08.08.1981'] ['Marinko-Matosevic', '08.08.1985']
['Andy-Murray', '15.05.1987'] ['Leonardo-Mayer', '15.05.1987']
['Milos-Raonic', '27.12.1990'] ['Gilles-Simon', '27.12.1984']
['Grigor-Dimitrov', '16.05.1991'] ['Lukasz-Kubot', '16.05.1982']
['Tommy-Robredo', '01.05.1982'] ['Michael-Russell', '01.05.1978']
['Florian-Mayer', '05.10.1983'] ['Federico-Delbonis', '05.10.1990']
['Radek-Stepanek', '27.11.1978'] ['Santiago-Giraldo', '27.11.1987']
['Jarkko-Nieminen', '23.07.1981'] ['Donald-Young', '23.07.1989']
['Albert-Montanes', '26.11.1980'] ['Matthew-Ebden', '26.11.1987']

It turns out that 20% of players in the top 100 share a birthday with someone else in the top 100, including many of the sport's biggest names. Andy Murray and Leonardo Mayer were even born on the exact same day!
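For the curious, the duplicate check itself is simple. This is not my original scraping code, just a rough sketch of the final step, assuming the player data is already a list of `[name, 'DD.MM.YYYY']` pairs like the ones printed above. The function name `shared_birthdays` and the small `sample` list are made up for illustration.

```python
from collections import defaultdict


def shared_birthdays(players):
    """Group players by day and month; keep only days with 2+ players."""
    by_day = defaultdict(list)
    for name, birthdate in players:
        day_month = birthdate[:5]  # 'DD.MM' taken from 'DD.MM.YYYY'
        by_day[day_month].append(name)
    return {day: names for day, names in by_day.items() if len(names) > 1}


# A few entries copied from the results above, plus one unmatched player.
sample = [
    ['Novak-Djokovic', '22.05.1987'],
    ['Jurgen-Melzer', '22.05.1981'],
    ['Andy-Murray', '15.05.1987'],
    ['Leonardo-Mayer', '15.05.1987'],
    ['Roger-Federer', '08.08.1981'],
]

for day, names in shared_birthdays(sample).items():
    print(day, names)
```

Grouping by day and month in a dictionary avoids comparing every player against every other player, and it would also catch three or more players sharing a day.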
