* Edit Jan 2021: I recently completed a YouTube video covering topics in this post.
Many statistical tools and techniques used in data analysis are based on probability. Probability measures how likely it is for an event to occur on a scale from 0 (the event never occurs) to 1 (the event always occurs). When working with data, variables in the columns of the data set can be thought of as random variables: variables that vary due to chance. A probability distribution describes how a random variable is distributed; it tells us which values a random variable is most likely to take on and which values are less likely.
In statistics, there are a range of precisely defined probability distributions that have different shapes and can be used to model different types of random events. In this lesson we'll discuss some common probability distributions and how to work with them in Python.
The Uniform Distribution
The uniform distribution is a probability distribution where each value within a certain range is equally likely to occur and values outside of the range never occur. If we make a density plot of a uniform distribution, it appears flat because no value is any more likely (and hence has any more density) than another.
Many useful functions for working with probability distributions in Python are contained in the scipy.stats library. Let's load in some libraries, generate some uniform data and plot a density curve:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
uniform_data = stats.uniform.rvs(size=100000,  # Generate 100000 numbers
                                 loc=0,        # From 0
                                 scale=10)     # To 10
In [4]:
pd.DataFrame(uniform_data).plot(kind="density",  # Plot the distribution
                                figsize=(9,9),
                                xlim=(-1,11))
Out[4]:
*Note: the plot above is an approximation of the underlying distribution, since it is based on a sample of observations.
In the code above, we generated 100,000 data points from a uniform distribution spanning the range 0 to 10. In the density plot, we see that the density of our uniform data is essentially level meaning any given value has the same probability of occurring. The area under a probability density curve is always equal to 1.
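As a quick sanity check (a minimal sketch, using the pdf() function described below), we can approximate this area numerically by summing density values over a fine grid of x values:

grid = np.arange(-1, 11, 0.01)                        # Fine grid spanning the plot range
densities = stats.uniform.pdf(grid, loc=0, scale=10)  # Density at each grid point
print( np.sum(densities * 0.01) )                     # Riemann sum of the density; close to 1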
Probability distributions in scipy come with several useful functions for generating random data and extracting values of interest:
- stats.distribution.rvs() generates random numbers from the specified distribution. The arguments to rvs() will vary depending on the type of distribution you're working with; in the case of the uniform distribution, we have to specify the starting and ending points and the size (the number of random points to generate).
- stats.distribution.cdf() is used to determine the probability that an observation drawn from a distribution falls below a specified value (this is known as the cumulative distribution function). In essence, cdf() gives you the area under the distribution's density curve to the left of a certain value on the x axis. For example, in the uniform distribution above, there is a 25% chance that an observation will be in the range 0 to 2.5 and a 75% chance it will fall in the range 2.5 to 10. We can confirm this with cdf():
In [5]:
stats.uniform.cdf(x=2.5,     # Cutoff value (quantile) to check
                  loc=0,     # Distribution start
                  scale=10)  # Distribution end
Out[5]:
0.25
- stats.distribution.ppf() is the inverse of cdf(): it returns the x axis cutoff value (quantile) associated with a given probability. For instance, if we want to know the cutoff value for which we have a 40% chance of drawing an observation below that value, we can use ppf():
In [6]:
stats.uniform.ppf(q=0.4,     # Probability cutoff
                  loc=0,     # Distribution start
                  scale=10)  # Distribution end
Out[6]:
4.0
- stats.distribution.pdf() gives you the probability density (height of the distribution) at a given x value. Since the uniform distribution is flat, all x values within its range will have the same probability density and x values outside the range have a probability density of 0:
In [7]:
for x in range(-1, 12, 3):
    print("Density at x value " + str(x))
    print( stats.uniform.pdf(x, loc=0, scale=10) )
Probability distribution functions in scipy also support median(), mean(), var() and std().
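For example, here is a quick sketch applying these summary functions to the uniform distribution above (loc and scale work the same way as they do for rvs()):

print( stats.uniform.mean(loc=0, scale=10) )    # Mean of the distribution: 5.0
print( stats.uniform.median(loc=0, scale=10) )  # Median: 5.0
print( stats.uniform.var(loc=0, scale=10) )     # Variance: 100/12, about 8.33
print( stats.uniform.std(loc=0, scale=10) )     # Standard deviation: about 2.89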
Generating Random Numbers and Setting The Seed
When you need to generate random real numbers in a range with equal probability, you can draw numbers from a uniform distribution using stats.distribution.rvs(). Python also comes with a library called "random" that lets you perform various operations that involve randomization. Let's look at a few functions in the random library:
In [8]:
import random
In [9]:
random.randint(0,10) # Get a random integer in the specified range
Out[9]:
In [10]:
random.choice([2,4,6,9]) # Get a random element from a sequence
Out[10]:
In [11]:
random.random() # Get a real number between 0 and 1
Out[11]:
In [12]:
random.uniform(0,10) # Get a real number in the specified range
Out[12]:
Notice that the random library also lets you generate random uniform numbers. Regardless of the method you use to generate random numbers, however, the result of a random process can differ from one run to the next. Having results vary each time you run a function is often not desirable. For example, if you want a colleague to be able to reproduce your results exactly, you can run into problems when you use randomization. You can ensure that your results are the same each time you use a function that involves randomness by setting the random number generator's seed value to initialize it prior to running the function. Set the random seed with random.seed():
In [13]:
random.seed(12) # Set the seed to an arbitrary value
print([random.uniform(0,10) for x in range(4)])
random.seed(12) # Set the seed to the same value
print([random.uniform(0,10) for x in range(4)])
Notice that we generated the exact same numbers with both calls to random.uniform() because we set the same seed before each call. If we had not set the seed, we would have gotten different numbers. This reproducibility illustrates the fact that these random numbers aren't truly random, but rather "pseudorandom".
Many functions in Python's libraries that use randomness have an optional random seed argument built in so that you don't have to set the seed outside of the function. For instance, the rvs() function has an optional random_state argument that lets you set the seed.
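For instance, here's a small sketch: passing the same random_state value to rvs() twice should produce identical draws:

print( stats.uniform.rvs(size=3, loc=0, scale=10, random_state=12) )
print( stats.uniform.rvs(size=3, loc=0, scale=10, random_state=12) )  # Same numbers again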
* Note: The Python standard library "random" has a separate internal seed from the numpy library. When using functions from numpy and libraries built on top of numpy (pandas, scipy, scikit-learn) use np.random.seed() to set the seed.
The Normal Distribution
The normal or Gaussian distribution is a continuous probability distribution characterized by a symmetric bell-shaped curve. A normal distribution is defined by its center (mean) and spread (standard deviation). The bulk of the observations generated from a normal distribution lie near the mean, which lies at the exact center of the distribution: as a rule of thumb, about 68% of the data lies within 1 standard deviation of the mean, 95% lies within 2 standard deviations and 99.7% lies within 3 standard deviations.
The normal distribution is perhaps the most important distribution in all of statistics. It turns out that many real world phenomena, like IQ test scores and human heights, roughly follow a normal distribution, so it is often used to model random variables. Many common statistical tests assume distributions are normal.
The scipy nickname for the normal distribution is norm. Let's investigate the normal distribution:
In [14]:
prob_under_minus1 = stats.norm.cdf(x=-1,
                                   loc=0,
                                   scale=1)

prob_over_1 = 1 - stats.norm.cdf(x=1,
                                 loc=0,
                                 scale=1)

between_prob = 1 - (prob_under_minus1 + prob_over_1)

print(prob_under_minus1, prob_over_1, between_prob)
The output shows that roughly 16% of the data generated by a normal distribution with mean 0 and standard deviation 1 is below -1, 16% is above 1 and 68% lies between -1 and 1, which agrees with the 68-95-99.7 rule. Let's plot the normal distribution and inspect the areas we calculated:
In [15]:
# Plot normal distribution areas
plt.rcParams["figure.figsize"] = (9,9)

plt.fill_between(x=np.arange(-4, -1, 0.01),
                 y1=stats.norm.pdf(np.arange(-4, -1, 0.01)),
                 facecolor='red',
                 alpha=0.35)

plt.fill_between(x=np.arange(1, 4, 0.01),
                 y1=stats.norm.pdf(np.arange(1, 4, 0.01)),
                 facecolor='red',
                 alpha=0.35)

plt.fill_between(x=np.arange(-1, 1, 0.01),
                 y1=stats.norm.pdf(np.arange(-1, 1, 0.01)),
                 facecolor='blue',
                 alpha=0.35)

plt.text(x=-1.8, y=0.03, s=round(prob_under_minus1, 3))
plt.text(x=-0.2, y=0.1,  s=round(between_prob, 3))
plt.text(x=1.4,  y=0.03, s=round(prob_over_1, 3))
Out[15]:
*Note: This lesson uses some plotting code we did not cover in the plotting lesson in order to make plots for explanatory purposes.
The plot above shows the bell shape of the normal distribution, the areas more than 1 standard deviation below and above the mean, and the area within 1 standard deviation of the mean.
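We can verify the other two numbers in the 68-95-99.7 rule the same way; a quick sketch using cdf() differences:

print( stats.norm.cdf(2) - stats.norm.cdf(-2) )  # Within 2 sd of the mean: roughly 0.95
print( stats.norm.cdf(3) - stats.norm.cdf(-3) )  # Within 3 sd of the mean: roughly 0.997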
Finding quantiles of the normal distribution is a common task when performing statistical tests. You can check normal distribution quantiles with stats.norm.ppf():
In [16]:
print( stats.norm.ppf(q=0.025) ) # Find the quantile for the 2.5% cutoff
print( stats.norm.ppf(q=0.975) ) # Find the quantile for the 97.5% cutoff
The quantile output above (roughly ±1.96) confirms that about 5% of the data lies more than 2 standard deviations from the mean.
*Note: a mean of 0 and standard deviation of 1 are default values for the normal distribution.
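Since ppf() is the inverse of cdf(), we can round-trip between probabilities and quantiles. A quick sketch, also showing that loc and scale shift the cutoff for a non-standard normal:

print( stats.norm.cdf(stats.norm.ppf(q=0.975)) )     # Round-trips back to 0.975
print( stats.norm.ppf(q=0.975, loc=100, scale=15) )  # 97.5% cutoff for mean 100, sd 15 (about 129.4)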
The Binomial Distribution
The binomial distribution is a discrete probability distribution that models the outcomes of a given number of random trials of some experiment or event. The binomial is defined by two parameters: the probability of success in any given trial and the number of trials. The binomial distribution tells you how likely it is to achieve a given number of successes in n trials of the experiment. For example, we could model flipping a fair coin 10 times with a binomial distribution where the number of trials is set to 10 and the probability of success is set to 0.5. In this case the distribution would tell us how likely it is to get zero heads, 1 head, 2 heads and so on.
The scipy name for the binomial is binom. Let's generate and investigate some binomial data:
In [17]:
fair_coin_flips = stats.binom.rvs(n=10,        # Number of flips per trial
                                  p=0.5,       # Success probability
                                  size=10000)  # Number of trials

# Print table of counts
print( pd.crosstab(index="counts", columns=fair_coin_flips) )

# Plot histogram
pd.DataFrame(fair_coin_flips).hist(range=(-0.5, 10.5), bins=11)
Out[17]:
Note that since the binomial distribution is discrete, it only takes on integer values, so we can summarize binomial data with a frequency table and its distribution with a histogram. The histogram shows us that a binomial distribution with a 50% probability of success is roughly symmetric, with the most likely outcomes lying at the center. This is reminiscent of the normal distribution, but if we alter the success probability, the distribution won't be symmetric:
In [18]:
biased_coin_flips = stats.binom.rvs(n=10,        # Number of flips per trial
                                    p=0.8,       # Success probability
                                    size=10000)  # Number of trials

# Print table of counts
print( pd.crosstab(index="counts", columns=biased_coin_flips) )

# Plot histogram
pd.DataFrame(biased_coin_flips).hist(range=(-0.5, 10.5), bins=11)
Out[18]:
The cdf() function lets us check the probability of achieving a number of successes within a certain range:
In [19]:
stats.binom.cdf(k=5,    # Probability of k = 5 successes or less
                n=10,   # With 10 flips
                p=0.8)  # And success probability 0.8
Out[19]:
In [20]:
1 - stats.binom.cdf(k=8,    # Probability of k = 9 successes or more
                    n=10,   # With 10 flips
                    p=0.8)  # And success probability 0.8
Out[20]:
For continuous distributions, you use pdf() to check the probability density at a given x value. For discrete distributions like the binomial, use stats.distribution.pmf() (probability mass function) to check the mass (proportion of observations) at a given number of successes k:
In [21]:
stats.binom.pmf(k=5,    # Probability of k = 5 successes
                n=10,   # With 10 flips
                p=0.5)  # And success probability 0.5
Out[21]:
In [22]:
stats.binom.pmf(k=8,    # Probability of k = 8 successes
                n=10,   # With 10 flips
                p=0.8)  # And success probability 0.8
Out[22]:
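Since the binomial is discrete, its pmf values over all possible success counts sum to 1. A quick sketch confirming this for the biased coin:

total = sum(stats.binom.pmf(k, n=10, p=0.8) for k in range(11))
print(total)  # The masses over k = 0 through 10 sum to (essentially) 1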
The Geometric and Exponential Distributions
The geometric and exponential distributions model the time it takes for an event to occur. The geometric distribution is discrete and models the number of trials it takes to achieve a success in repeated experiments with a given probability of success. The exponential distribution is a continuous analog of the geometric distribution and models the amount of time you have to wait before an event occurs given a certain occurrence rate.
The scipy nickname for the geometric distribution is "geom". Let's use the geom functions to model the number of trials it takes to get a success (heads) when flipping a fair coin:
In [23]:
np.random.seed(12)  # Seed numpy's random generator (used by scipy)

flips_till_heads = stats.geom.rvs(size=10000,  # Generate geometric data
                                  p=0.5)       # With success prob 0.5

# Print table of counts
print( pd.crosstab(index="counts", columns=flips_till_heads) )

# Plot histogram
pd.DataFrame(flips_till_heads).hist(range=(-0.5, max(flips_till_heads)+0.5),
                                    bins=max(flips_till_heads)+1)
Out[23]:
The distribution looks similar to what we'd expect: it is very likely to get heads in 1 or 2 flips, while it is very unlikely to take more than 5 flips to get heads. In the 10,000 trials we generated, the longest it took to get heads was 13 flips.
Let's use cdf() to check the probability of needing 6 flips or more to get a success:
In [24]:
first_five = stats.geom.cdf(k=5,   # Prob of success in first 5 flips
                            p=0.5)

1 - first_five
Out[24]:
Use pmf() to check the probability of needing a specific number of flips to get the first success:
In [25]:
stats.geom.pmf(k=2,   # Prob of needing exactly 2 flips to get first success
               p=0.5)
Out[25]:
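As an aside (a standard fact, not covered in the lesson above): the mean of the geometric distribution is 1/p, so with a fair coin we expect to need 2 flips on average to get the first heads:

print( stats.geom.mean(p=0.5) )  # Average number of flips to get the first heads: 2.0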
The scipy name for the exponential distribution is "expon". Let's investigate the exponential distribution:
In [26]:
# Get the probability of waiting more than 1 time unit before a success
prob_1 = stats.expon.cdf(x=1,
                         scale=1)  # Scale = 1/arrival rate

1 - prob_1
Out[26]:
*Note: The average arrival time for the exponential distribution is equal to 1/arrival_rate.
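To illustrate the note above, here is a small sketch: with scale=1 the arrival rate is 1 and the average wait is 1, while doubling the scale halves the rate and doubles the average wait:

print( stats.expon.mean(scale=1) )  # Arrival rate 1 -> average wait 1.0
print( stats.expon.mean(scale=2) )  # Arrival rate 0.5 -> average wait 2.0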
Let's plot this exponential distribution to get an idea of its shape:
In [27]:
plt.fill_between(x=np.arange(0, 1, 0.01),
                 y1=stats.expon.pdf(np.arange(0, 1, 0.01)),
                 facecolor='blue',
                 alpha=0.35)

plt.fill_between(x=np.arange(1, 7, 0.01),
                 y1=stats.expon.pdf(np.arange(1, 7, 0.01)),
                 facecolor='red',
                 alpha=0.35)

plt.text(x=0.3, y=0.2,  s=round(prob_1, 3))
plt.text(x=1.5, y=0.08, s=round(1 - prob_1, 3))
Out[27]:
Similar to the geometric distribution, the exponential distribution starts high and has a long right tail containing rare cases where you have to wait much longer than average for an arrival.
The Poisson Distribution
The Poisson distribution models the probability of seeing a certain number of successes within a time interval, where the time it takes for the next success is modeled by an exponential distribution. The Poisson distribution can be used to model traffic, such as the number of arrivals a hospital can expect in an hour's time or the number of emails you'd expect to receive in a week.
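To see this connection concretely, here is a rough sketch (the simulation setup is illustrative, not part of the original lesson): draw exponential waiting times, accumulate them into arrival times, and count how many arrivals land in each unit time interval. The counts behave like draws from a Poisson distribution with the same rate:

np.random.seed(12)

waits = stats.expon.rvs(scale=1, size=100000)  # Exponential waiting times, arrival rate 1
arrival_times = np.cumsum(waits)               # Running total gives arrival times

# Count the arrivals that land in each unit time interval
counts = np.bincount(arrival_times.astype(int))

print( counts[:50000].mean() )                 # Average arrivals per interval: close to 1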
The scipy name for the Poisson distribution is "poisson". Let's generate and plot some data from a Poisson distribution with an arrival rate of 1 per time unit:
In [28]:
np.random.seed(12)  # Seed numpy's random generator (used by scipy)

arrival_rate_1 = stats.poisson.rvs(size=10000,  # Generate Poisson data
                                   mu=1)        # Average arrival rate 1

# Print table of counts
print( pd.crosstab(index="counts", columns=arrival_rate_1) )

# Plot histogram
pd.DataFrame(arrival_rate_1).hist(range=(-0.5, max(arrival_rate_1)+0.5),
                                  bins=max(arrival_rate_1)+1)
Out[28]:
The histogram shows that when arrivals are relatively infrequent, it is rare to see more than a couple of arrivals in each time period. When the arrival rate is high, it becomes increasingly rare to see a low number of arrivals and the distribution starts to look more symmetric:
In [29]:
np.random.seed(12)  # Seed numpy's random generator (used by scipy)

arrival_rate_10 = stats.poisson.rvs(size=10000,  # Generate Poisson data
                                    mu=10)       # Average arrival rate 10

# Print table of counts
print( pd.crosstab(index="counts", columns=arrival_rate_10) )

# Plot histogram
pd.DataFrame(arrival_rate_10).hist(range=(-0.5, max(arrival_rate_10)+0.5),
                                   bins=max(arrival_rate_10)+1)
Out[29]:
As with other discrete probability distributions, we can use cdf() to check the probability of achieving more or less than a certain number of successes and pmf() to check the probability of obtaining a specific number of successes:
In [30]:
stats.poisson.cdf(k=5,    # Check the probability of 5 arrivals or less
                  mu=10)  # With arrival rate 10
Out[30]:
In [31]:
stats.poisson.pmf(k=10,   # Check the prob of exactly 10 arrivals
                  mu=10)  # With arrival rate 10
Out[31]:
Wrap Up
Python's scipy library contains functions that make it easy to work with a wide range of probability distributions, including many that we did not discuss in this lesson. Probability distribution functions are useful for generating random data, modeling random events and aiding with statistical tests and analysis.
In the next few lessons, we'll learn how to carry out common statistical tests with Python.