Birthday paradox in Malaysia. This project was completed as a group project for the University of Calgary course DATA602 – Statistical Data Analysis.
The birthday paradox is a mathematical phenomenon that describes the probability of two people sharing the same birthday. It is considered a paradox because the probability of two people sharing a birthday is higher than most people would expect. The goal of this group project is to calculate the probability of shared birthdays using two different approaches: the Exact Probability Formula and Monte Carlo Simulation.
The following parts will discuss these approaches under various conditions:
The formula used for calculating the exact probability:
According to Figure 1, which depicts the probability of at least two guests in a party of N guests sharing the same birthday, the probability exceeds 50% once there are at least 23 guests at the party.
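The probability that all $N$ guests have different birthdays:
$$P(\text{all different}) = \frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365 - N + 1}{365} = \frac{365!}{(365 - N)!\,365^N}$$
The probability that at least two people share a birthday:
$$P(\text{at least one shared}) = 1 - \frac{365!}{(365 - N)!\,365^N}$$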
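A minimal sketch of the exact calculation in R, assuming 365 equally likely birthdays. The function name `p_shared_exact` and the range of party sizes are illustrative choices, not taken from the project code.

```r
# Exact probability that at least two of n guests share a birthday,
# assuming `days` equally likely birthdays (365 by default).
p_shared_exact <- function(n, days = 365) {
  if (n > days) return(1)  # more guests than days: a match is certain
  # P(all different) = (days / days) * ((days - 1) / days) * ... * ((days - n + 1) / days)
  1 - prod((days - seq_len(n) + 1) / days)
}

# Hypothetical range of party sizes, for illustration only.
probs <- sapply(1:60, p_shared_exact)

# Smallest party size for which the probability exceeds 50%.
min_n <- which(probs > 0.5)[1]
```

Computing the product term by term avoids evaluating `365!` directly, which would overflow double-precision arithmetic.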
Performing the Monte-Carlo Simulation with repetition of
Figure 2 illustrates the probability of two guests sharing the same birthday in a party of N guests using the Monte Carlo simulation. Whereas the combinatorial formula provides an exact probability, the Monte Carlo simulation estimates the probability from many random trials and therefore contains an element of randomness. This is why the value of N obtained from the Monte Carlo simulation can differ from the one given by the combinatorial formula. To reduce the difference between the two, one can increase the number of simulations.
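The simulation described above can be sketched as follows. The number of repetitions (`n_sims = 10000`), the seed, and the function name are assumptions made for illustration, since the project's own settings are not shown here.

```r
set.seed(602)  # hypothetical seed, used only to make the sketch reproducible

# Estimate P(at least two of n guests share a birthday) from n_sims random parties.
p_shared_mc <- function(n, n_sims = 10000, days = 365) {
  hits <- replicate(n_sims, {
    birthdays <- sample.int(days, n, replace = TRUE)  # one random party
    anyDuplicated(birthdays) > 0                      # TRUE if any birthday repeats
  })
  mean(hits)  # share of simulated parties with a shared birthday
}

est_23 <- p_shared_mc(23)
exact_23 <- 1 - prod((365 - 0:22) / 365)  # exact value for comparison
```

With 10,000 repetitions the standard error of the estimate is roughly 0.005, so `est_23` should land close to `exact_23` without matching it exactly.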
The random variable
The Expected Value
If guest
If guest
Therefore the Expected Value
We use the Monte-Carlo Simulations to compute the expected value of the number of guests who have birth-mates
The minimum number of
min_N = ceiling(min(results$N[results$Analytical_EY >= 1]))
cat("The minimum number of guests where E(Y) >= 1 is:", min_N)
The minimum number of guests where E(Y) >= 1 is: 20
The result may look surprising at first, but it is not surprising mathematically, for the following reasons:
- As $N$ increases, the probability of birth-mates increases.
- It may seem surprising because the number of guests required $(20)$ is relatively small compared to the total number of days in a year $(365)$.
- This result is a consequence of the birthday problem, which shows that shared birthdays become likely even in small groups due to the combinatorial nature of the problem.
- The intuition that shared birthdays are rare is often incorrect because people underestimate the number of possible pairs of guests that can share a birthday.
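The value of 20 can be cross-checked with a short sketch. It assumes the standard linearity-of-expectation result: each guest has a birth-mate with probability $1 - (364/365)^{N-1}$, so $E(Y) = N\,(1 - (364/365)^{N-1})$. Whether this is exactly the formula behind `Analytical_EY` is an assumption, and the function names, seed, and number of repetitions below are illustrative.

```r
# Expected number of guests who share their birthday with at least one other guest.
expected_birthmates <- function(n, days = 365) {
  n * (1 - ((days - 1) / days)^(n - 1))
}

set.seed(602)  # hypothetical seed

# Monte Carlo estimate of the same quantity.
simulate_birthmates <- function(n, n_sims = 10000, days = 365) {
  mean(replicate(n_sims, {
    birthdays <- sample.int(days, n, replace = TRUE)
    counts <- tabulate(birthdays, nbins = days)  # guests per calendar day
    sum(counts[counts > 1])                      # guests on days with 2+ guests
  }))
}

n_values <- 2:40
analytical <- expected_birthmates(n_values)
min_n <- n_values[which(analytical >= 1)[1]]  # smallest N with E(Y) >= 1
```

Under these assumptions `min_n` evaluates to 20, in line with the result reported above.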
Figure 3 illustrates the expected number of guests with birth-mates using Monte-Carlo Simulation and the Expected Value. As
To verify the assumption that birthdays are uniformly distributed, we will use the Malaysia dataset, which provides daily counts of births from 1920 to 2022. The data will be summarized in terms of the following statistics:
for
The Chi-Square Goodness of Fit Test is a statistical test that measures how far the observed frequencies in categorical data depart from the frequencies expected under the null hypothesis. The test will be performed to check whether birthdays followed a uniform distribution between 1994 and 2022 in Malaysia.
- Null Hypothesis (H0): The observed data follows a uniform distribution.
- Alternative Hypothesis (H1): The observed data does not follow a uniform distribution.
The chi-square test shows a large discrepancy between the observed data and the expected data. The null hypothesis assumes that births are equally likely on any day of the year. The test strongly rejects this assumption, since the p-value is extremely low, close to 0.
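The sketch below shows how such a test can be run with `chisq.test`. It uses simulated counts, not the Malaysia data: the weekday and weekend means, the seed, and the variable names are hypothetical and only illustrate how a weekly pattern leads to rejection of the uniform hypothesis.

```r
set.seed(602)  # hypothetical seed

# Hypothetical daily birth counts for a 365-day year, with fewer births
# on days 6 and 7 of each week (standing in for the weekend).
day_of_week <- rep_len(1:7, 365)
mean_births <- ifelse(day_of_week >= 6, 1200, 1400)
births <- rpois(365, lambda = mean_births)

# Under H0 every day of the year has probability 1/365.
test <- chisq.test(x = births, p = rep(1 / 365, 365))
test$p.value
```

Because the test is run on raw counts, the size of the statistic reflects both the weekly pattern and the number of births observed.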
After verifying the presence of a weekly cyclical pattern in daily births, we now investigate whether smoothing the data with a 7-day moving average can support the assumption of a uniform distribution of births.
The first step involves smoothing the data per day using a 7-day moving average:
Next, we organize the probabilities by week. The probability
Here, weeks 1 to 51 have 7 days each (
To assess whether the smoothed data follows a uniform distribution, we first compare the observed smoothed probabilities with the expected probabilities under a uniform distribution. The expected probabilities are:
The plot of the smoothed probabilities shows a trend similar to the expected uniform distribution, suggesting that smoothing the data may align it with a uniform distribution.
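A 7-day moving average can be sketched with `stats::filter`. The sketch assumes a centered window (three days on either side); the project may have used a trailing window instead. The input vector is hypothetical and contains a pure weekly cycle.

```r
# Centered 7-day moving average; the first and last 3 days come back as NA
# because the window does not fit there.
moving_average_7 <- function(x) {
  as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 2))
}

# Hypothetical daily counts: five higher days followed by two lower days, repeated.
daily <- rep_len(c(1400, 1400, 1400, 1400, 1400, 1200, 1200), 364)
smoothed <- moving_average_7(daily)
```

Every 7-day window covers exactly one full weekly cycle, so the smoothed series is constant wherever it is defined. This is why the weekly pattern disappears after smoothing.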
To confirm this, we perform a chi-square goodness-of-fit test with the following hypotheses:
- Null Hypothesis (H0): The observed data follows a uniform distribution.
- Alternative Hypothesis (H1): The observed data does not follow a uniform distribution.
chi = chisq.test(x=Smooth_week$Percentage, p=expected)
print(chi)
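The test yields a p-value of 1 (> 0.05), which means we fail to reject the null hypothesis (H0). After smoothing, the data is therefore consistent with a uniform distribution. Note that the test was applied to the `Percentage` column rather than to raw birth counts, so the test statistic does not reflect the sample size and the p-value should be interpreted with caution.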
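When reading a p-value from `chisq.test`, it helps to keep in mind that the statistic depends on the scale of `x`. The sketch below uses hypothetical weekly counts, not the project data, and runs the same test once on counts and once on shares that sum to 1.

```r
# Hypothetical weekly counts: 52 weeks alternating slightly above and below average.
counts <- rep(c(10500, 9500), 26)
expected <- rep(1 / 52, 52)

# Test on raw counts.
on_counts <- chisq.test(x = counts, p = expected)

# Same data expressed as shares of the total. R warns that the chi-squared
# approximation may be incorrect because the expected values are tiny.
on_shares <- suppressWarnings(chisq.test(x = counts / sum(counts), p = expected))

c(counts = on_counts$p.value, shares = on_shares$p.value)
```

The statistic on shares equals the statistic on counts divided by the total count, so the same deviations that are decisive on counts give a p-value near 1 on shares.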
For the final part of our analysis, we will look at how the probability of
To do this, we will split the birth data from Malaysia into 7 groups of different sizes.
Table 1: Probability that a birthday falls on a day within each group.
Next, instead of using the probability function from part 1, which assumed a uniform distribution of birthdays, we have to account for the different chances that the
To calculate the probability that all
From these results, we can see that the probability of two people sharing a birthday increases slightly for the non-uniform distribution that we used. The increase is only marginal because the 7 groups we defined all have very similar probabilities per day, spanning a small range from 0.00267 to 0.00285. If the probabilities spanned a greater range, or if the most common group had a higher probability, the increase would be more noticeable. When a group has greater relative odds for a birthday, it becomes more likely that multiple people will be born in that group. As a result, the uniform distribution is the one with the lowest probability of shared birthdays, because no group is more likely than any other to produce a shared birthday.
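The effect of a non-uniform distribution can be illustrated with a general sketch that works on per-day probabilities instead of the 7 groups used above, so it is not the project's own formula. It relies on the identity P(all different) = n! · e_n(p), where e_n is the n-th elementary symmetric polynomial of the day probabilities. The skewed distribution below is hypothetical: it only borrows the range 0.00267 to 0.00285 and renormalises it to sum to 1.

```r
# P(at least two of n people share a birthday) for arbitrary day probabilities p.
p_shared_general <- function(n, p) {
  stopifnot(n >= 1, abs(sum(p) - 1) < 1e-9)
  e <- c(1, numeric(n))  # e[k + 1] holds e_k over the days processed so far
  for (p_day in p) {
    for (k in n:1) {     # descending, so each day is used at most once per term
      e[k + 1] <- e[k + 1] + p_day * e[k]
    }
  }
  1 - factorial(n) * e[n + 1]
}

uniform <- rep(1 / 365, 365)

# Hypothetical non-uniform distribution spanning the range quoted in the text.
weights <- seq(0.00267, 0.00285, length.out = 365)
skewed <- weights / sum(weights)

p_uniform <- p_shared_general(23, uniform)
p_skewed <- p_shared_general(23, skewed)
```

For the uniform case the function reproduces the exact formula, and for the skewed case the probability is slightly higher, in line with the marginal increase described above.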
From our analysis of the birthday paradox, we observed the following:
- The number of people required for a 50% chance of at least two sharing a birthday is 23, both by exact calculation and by Monte Carlo simulation.
- The expected number of guests with birth-mates reaches 1 at 20 guests, fewer than the 23 required for a 50% chance of a shared birthday, because every shared birthday adds at least two guests to the count.
- When looking at real-world data, we do not observe a uniform distribution of birthdays. Weekday births are consistently more common than weekend births.
- After smoothing the real-world data by week, the chi-square test no longer rejects a uniform distribution.
- When calculating probabilities of a shared birthday for a non-uniform distribution, the likelihood of shared birthdays increases, but only slightly.
- Government of Malaysia Official Open Data Portal (2023). [Dataset] https://data.gov.my/data-catalogue/births