Whilst programming is often portrayed as quite a serious endeavor (someone hacking into a bank, a security analyst saving the world, a financial quant spending 80 hours a week modelling interest rates etc…) there is alot of fun and inventiveness that can be had with it. One such example is solving puzzles through simulating potential outcomes to determine the probability of an event occurring.
The first person who comes to mind on this topic, at least for me, is David Robinson (@drob - https://twitter.com/drob). In addition to being a highly regarded data scientist, he’s authored / maintained a number of R packages and finds time to blog / screen cast examples of analytics and problem solving using the R programming language (https://www.youtube.com/channel/UCeiiqmVK07qhY-wvg3IZiZQ).
My first stab at programming came with R, and as such I learnt a lot from reading David’s posts and watching his videos, along with those from other members of the amazing R community. These days my programming tends to be focused on Python, but I still try and keep up with what’s going on in the R community, leading me to David’s recent post solving the Birthday Paradox using some of the neater tidyverse tooling!
The jist of the problem is calculating the probability two people in a room will share the same birthday, as the number of people in the room increases. In order to blow off a few Christmas coding cobwebs I decided to have a stab at this using python, both in a functional manner using python’s base data structures and also the commonly used Pandas library!
The R way!
So, first up is David’s code using the crossing, mutate and map functions from tidyr, dplyr and purrr respectively. However, I did feel inclined to change summarize back to the English spelling….
If you’re interested in the breakdown of the code his post has a brilliant explanation of each element (http://varianceexplained.org/r/birthday-problem/), but I’ve added a couple of comments below too.
library(tidyverse) summarised <- crossing(people = seq(2, 50, 2), #Produce a cartesian df with people and trials trial = 1:10000) %>% #Give each trial a number of randomly assigned birthdays mutate(birthday = map(people, ~ sample(365, ., replace = TRUE)), #Check if any birthdays match multiple = map_lgl(birthday, ~ any(duplicated(.)))) %>% #Group by number of people and calculate average across all trials group_by(people) %>% summarise(chance = mean(multiple)) head(summarised)
## # A tibble: 6 x 2 ## people chance ## <dbl> <dbl> ## 1 2 0.0033 ## 2 4 0.0175 ## 3 6 0.0374 ## 4 8 0.0689 ## 5 10 0.118 ## 6 12 0.169
As you can see the above code leads to a dataframe with the simulated probability of a matching birthday based on the number of people in the room, which is what I’ve replicated below in two different ways.
Python part one - Native
For the first method I decided to stick mostly to python’s inbuilt data structures (list, tuples, etc), only using pandas to help aggregate and plot the data.
In order to stay away from dataframes I replied heavily on lists, with the first function taking an integer in order to provide a random sample on birthdays. This sample was then converted into a set (essentially a hashed list that can’t contain duplicates) in order to remove duplicates and a length check used to see if any items had been removed, giving me a binary output.
The second function then acted to replicate this function for both a number of iterations and a range of people, giving me a final list of tuples (essentially an immutable list) with the people, trial and result which I could feed into a pandas dataframe and aggregate, as shown below.
from random import choices import seaborn as sns import pandas as pd def birthday_trial(people): """Function that takes an integer of people and determines if at least two have matching birthdays""" #Build list of random birthdays birthdays = choices(range(1, 366, 1), k = people) #Check for matches if len(set(birthdays)) == people: return 0 else: return 1 def birthday_trailer(iterations, people): """Function to replicate previous function for required number of trails and groups of people """ results =  #Iterate over list of people for set number of iterations for i in people: for j in range(0, iterations): #for each iteration record number of people, trial and result and append to results list result = tuple([i, j, birthday_trial(i)]) results.append(result) return results chances = birthday_trailer(1000, range(2, 51, 2)) chances_df = pd.DataFrame(chances, columns = ["People", "Trial", "Probability"]) chances_summary = chances_df.loc[:, ["People", "Probability"]].groupby("People").mean() chances_summary.reset_index(inplace = True, drop = False) chances_summary.head(6)
Python part two - Pandas
Once I’d completed the puzzle in (mostly) base data structures I decided to have a go using Pandas, a library developed for data analysis and manipulation.
For this method I first created a Cartesian product dataframe, similar to David using crossing, and then applied a couple of custom functions. The first gave each row a list a random birthdays based on the number of people allocated to that row. The second checked for any matching birthdays and returns a binary result. This dataframe could then be grouped by the number of people and an average taken, as in the previous two examples.
Finally I decided to make a simple line plot using the seaborn package to highlight the change in probability as the number of people increases.
import pandas as pd import numpy as np from random import choices import seaborn as sns from itertools import product #Build the cartesian DF pop_df = pd.DataFrame(list(product(range(2, 51, 2), range(0, 1000))), columns=['People', 'Trial']) #Function to provide random birthdays def birthday_lister(people): return choices(range(1, 366, 1), k = people) #Function to check for matches def match_checker(x): if len(set(x)) == len(x): return 0 else: return 1 #Applying each function to the DF, each giving a new column pop_df["Birthdays"] = pop_df["People"].apply(birthday_lister) pop_df["Probability"] = pop_df["Birthdays"].apply(match_checker) #Grouping, summarising and plotting of data pop_df_summary = pop_df.loc[:, ["People", "Probability"]].groupby("People").mean() pop_df_summary.reset_index(inplace = True, drop = False) sns.lineplot(data=pop_df_summary, x='People',y='Probability').axhline(0.5, ls = "--")
So there we have it, a couple of ways to repeat David’s interesting solution using python! What I haven’t covered here is the expansion into assessing the probability of n birthdays matching, though this could be achieved through small tweaks to the defined functions, IE checking the difference between set and sample length and using that integer as a score.
Once again I’d highly advise checking out David’s blog (http://varianceexplained.org/) for a few other interesting puzzles, it’s certainly given me a few good quick reads!