Bayesian Inference in Python
Bayes' theorem uses prior knowledge or experience to produce better estimates as new evidence arrives. Mathematically speaking, it relates the conditional probabilities of events.
Bayes’ theorem is stated mathematically as the following equation:
P(A|B) = P(B|A)P(A) / P(B)
- A and B are events, and P(B) ≠ 0.
- P(A) and P(B) are the probabilities of observing A and B independently of each other.
- P(A|B) is the conditional probability, i.e. the probability of occurrence of A given that B is true (has already occurred).
- P(B|A) is the conditional probability of B given that A is true.
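As a quick numerical illustration of the formula (the numbers below are hypothetical, chosen only for the example): suppose a disease has 1% prevalence, a test detects it with probability 0.95, and it gives false positives with probability 0.10. Bayes' theorem then gives the probability of actually having the disease after a positive test:

```python
# Hypothetical numbers for illustration only.
p_d = 0.01          # P(A): prior probability of having the disease
p_pos_d = 0.95      # P(B|A): test positive given disease
p_pos_nod = 0.10    # P(B|not A): false-positive rate

# P(B): total probability of a positive test, over both cases
p_pos = p_pos_d * p_d + p_pos_nod * (1 - p_d)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 3))  # ≈ 0.088
```

Despite the accurate test, the posterior probability is still small because the prior (the 1% prevalence) dominates.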
Bayesian Inference is a method of statistical inference in which we update the probability of our hypothesis (prior) H as more information (data) D becomes available, finally arriving at the posterior probability. Bayes' theorem lays the foundation of Bayesian Inference. Expressed mathematically:
P(H|D) = P(H)P(D|H) / P(D)
- P(H) is the probability of the hypothesis before we see the data: the prior.
- P(H|D) is what we want to compute, the probability of the hypothesis after we see the data: the posterior.
- P(D|H) is the probability of the data under the hypothesis: the likelihood.
- P(D) is the probability of the data under any hypothesis.
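To make the update concrete, here is a small sketch with two discrete hypotheses (the coins, priors, and data below are assumptions made for illustration): a fair coin (P(heads) = 0.5) versus a biased coin (P(heads) = 0.35), with equal priors, after observing 7 heads in 20 tosses:

```python
from math import comb

# Two competing hypotheses about a coin, with equal priors (assumed setup).
priors = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.35}
heads, n = 7, 20  # observed data D: 7 heads in 20 tosses

# P(D|H): binomial likelihood of the observed data under each hypothesis
likelihood = {h: comb(n, heads) * p**heads * (1 - p)**(n - heads)
              for h, p in p_heads.items()}

# P(D): total probability of the data under any hypothesis
p_data = sum(likelihood[h] * priors[h] for h in priors)

# P(H|D) = P(H) P(D|H) / P(D)
posterior = {h: priors[h] * likelihood[h] / p_data for h in priors}
print(posterior)
```

Since 7/20 = 0.35 exactly, the data favor the biased coin, and its posterior probability rises above its 0.5 prior.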
Consider a coin-toss example in which the true probability of heads is 0.35. We take the prior, i.e. the probability distribution of heads coming up, to be a Uniform(0,1) distribution, and then, on the basis of the data, arrive at the posterior distribution of heads coming up. What we notice is that the posterior gets more concentrated and more accurate as we increase the number of trials.
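This update can be sketched in a few lines. Since Uniform(0,1) is the Beta(1,1) distribution, the posterior after observing Bernoulli tosses is Beta(1 + heads, 1 + tails), which lets us watch the posterior sharpen as trials accumulate (the trial counts and seed below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_p = 0.35  # true probability of heads

# Independent experiments of increasing size; the posterior std dev shrinks.
for n in (10, 100, 1000):
    tosses = rng.binomial(1, true_p, size=n)
    heads = tosses.sum()
    a, b = 1 + heads, 1 + n - heads   # Beta posterior parameters
    mean = a / (a + b)                # posterior mean estimate of P(heads)
    sd = np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))  # posterior std dev
    print(n, round(mean, 3), round(sd, 3))
```

With 1000 tosses the posterior mean sits close to the true 0.35 and its spread is far narrower than after 10 tosses.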
Now, assuming the prior to be a Bernoulli(0.5) distribution instead and combining the data with this new prior, we have:
INFLUENCE OF THE PRIOR:
The prior influences the result of our analysis. As the example above makes obvious, the influence of the prior is more dominant when the data volume is small. The prior's effect eventually subsides as the number of trials grows (at which point frequentist inference methods might be a better option).
HOW TO CHOOSE A PRIOR:
This is quite a subjective question. Some analysts use a non-informative prior (such as the Uniform(0,1) in our first example). Unless we are quite sure of our prior knowledge, it is not recommended to use strongly informative priors in the analysis.
The resultant probability distribution, which summarizes both the prior and the data, is the posterior.
HIGHEST POSTERIOR DENSITY (HPD) INTERVAL
A highly useful tool for summarizing the spread of the posterior density: it is defined as the shortest interval containing a given portion of the probability density.
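Given posterior samples, a simple HPD estimate follows directly from that definition: sort the samples and take the shortest window containing the desired fraction of them. A minimal sketch (the function name and the normal-distribution demo are illustrative choices, not from the original article):

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (an HPD estimate)."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.floor(prob * n))      # number of samples inside the interval
    widths = s[k:] - s[:n - k]       # width of every candidate interval
    i = np.argmin(widths)            # index of the shortest one
    return s[i], s[i + k]

# Demo on standard-normal samples: the 95% HPD is roughly (-1.96, 1.96).
rng = np.random.default_rng(0)
lo, hi = hpd_interval(rng.normal(size=20_000), prob=0.95)
print(round(lo, 2), round(hi, 2))
```

For skewed posteriors the HPD interval differs from the equal-tailed interval, which is exactly when it is most informative.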
There are several ways to compute the posterior numerically, which can be broadly classified as:
- Non-Markovian Methods
- Markovian Methods
GRID COMPUTING: This is a brute-force approach, used mainly when we cannot compute the posterior analytically. For a single-parameter model, the grid approach is as follows:
- Define a reasonable interval for the parameter (the prior should give you a hint).
- Place a grid of points (generally equidistant) on that interval.
- For each point in the grid, we multiply the likelihood and the prior.
- Normalize the computed values (divide the result at each point by the sum over all points).
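The four steps above can be sketched for the coin-toss model (the data, 6 heads in 9 tosses, and the grid size are assumptions for illustration):

```python
import numpy as np

heads, n = 6, 9  # assumed data: 6 heads in 9 tosses

# 1. A reasonable interval for theta = P(heads): [0, 1].
# 2. A grid of equidistant points on that interval.
grid = np.linspace(0, 1, 200)

# 3. Multiply likelihood and prior at each grid point.
prior = np.ones_like(grid)                          # Uniform(0,1) prior
likelihood = grid**heads * (1 - grid)**(n - heads)  # Bernoulli likelihood
unnormalized = likelihood * prior

# 4. Normalize by the sum over all grid points.
posterior = unnormalized / unnormalized.sum()

print(grid[np.argmax(posterior)])  # posterior mode, close to 6/9
```

Every grid point is evaluated regardless of how little probability it carries, which is exactly why this approach breaks down in higher dimensions.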
The grid computing approach does not scale well as the number of model parameters (dimensions) grows, since the number of grid points explodes.
There are other non-Markovian methods as well, such as the QUADRATIC (LAPLACE) METHOD and VARIATIONAL METHODS.
There is also a family of Markovian methods known as MCMC (Markov Chain Monte Carlo) methods. Here as well, we compute the prior and the likelihood at individual points to approximate the whole posterior distribution, but MCMC methods outperform the grid approximation because they are designed to spend more time in higher-probability regions than in lower ones.
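A minimal member of this family is the Metropolis algorithm, sketched below for the same coin-toss posterior (the data, step size, and iteration counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
heads, n = 6, 9  # assumed data: 6 heads in 9 tosses

def log_post(theta):
    """Log of prior * likelihood (unnormalized posterior), Uniform(0,1) prior."""
    if not 0 < theta < 1:
        return -np.inf
    return heads * np.log(theta) + (n - heads) * np.log(1 - theta)

samples = []
theta = 0.5  # starting point
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.1)  # random-walk proposal
    # Accept with probability min(1, posterior ratio): the chain thereby
    # spends more time in higher-probability regions than in lower ones.
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

burned = samples[1000:]            # discard burn-in
print(np.mean(burned))             # posterior mean, close to 7/11
```

Only the ratio of posterior values is needed, so the normalizing constant P(D), which is what makes the posterior hard to compute, cancels out entirely.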
# Revisiting the coin toss Example
Using the PyMC3 library (a Python library for probabilistic programming):
Download the IPython notebook used in the above example from GitHub.