Probability Basics for Machine Learning


  • Joint probability, Marginal probability, Conditional probability.

  • Sum rule, Product rule, Bayes’ Theorem.

  • Prior probability, Posterior probability,  Maximum Likelihood Estimation.

  • Expectation.

1.Joint probability, Marginal probability, Conditional probability.

We shall assume that there are two random variables \mathbf{X} and \mathbf{Y}. \mathbf{X} can take any of the values \mathit{\{x_1,x_2,\dots,x_i,\dots,x_m\}}, and \mathbf{Y} can take any of the values \mathit{\{y_1,y_2,\dots,y_j,\dots,y_l\}}.

(Figure 1)

We shall define:
1>. \mathit{p(X=x_i)} denotes the probability that \mathbf{X} takes the specific value \mathit{x_i}.
2>. \mathit{p(X)} denotes the probability distribution of \mathbf{X}.

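The probability that \mathbf{X} takes the value \mathit{x_i} and \mathbf{Y} takes the value \mathit{y_j} is written \mathit{p(X=x_i,Y=y_j)} and is called the joint probability of \mathit{X=x_i} and \mathit{Y=y_j}. Let \mathit{n_{ij}} be the number of points for which \mathit{X=x_i} and \mathit{Y=y_j}, that is, the number of points falling in cell \mathit{i,j} of Figure 1. The joint probability is given by this count as a fraction of the total number of points, and hence

\mathit{p(X =x_i, Y=y_j) = \frac{n_{ij}}{N}}        (1.1)

Here \mathit{N=\sum_{i}\sum_{j} n_{ij}} is the total number of points, not the number of cells \mathit{m \times l}. Similarly, the probability that \mathit{X} takes the value \mathit{x_i} regardless of \mathit{Y} is denoted \mathit{p(X=x_i)} and is given by the fraction of the total number of points that fall in column \mathit{i}. Writing \mathit{c_i} for the number of points in column \mathit{i}, we have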

\mathit{p(X =x_i) = \frac{c_i}{N}}      (1.2)

Since the number of instances in column \mathit{i} in Figure 1 is just the sum of the number of instances in each cell of that column, we have \mathit{c_i=\displaystyle\sum_{j} n_{ij}} and therefore, from (1.1) and (1.2), we have:

\mathit{p(X=x_i)=\displaystyle\sum_{j=1}^{l} p(X=x_i,Y =y_j) }       (1.3)

which is the sum rule of probability. In this context, \mathit{p(X=x_i)} is called the marginal probability.
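
The counting definitions (1.1) to (1.3) can be checked on a small table. The sketch below is only an illustration: the counts in `n` are made up, and each inner list stands for one column \mathit{i} of Figure 1. Exact fractions are used so that the sum rule can be compared with `==`.

```python
from fractions import Fraction

# Hypothetical counts (not from the post): n[i][j] is the number of points
# with X = x_i and Y = y_j. Here m = 2 values of X and l = 3 values of Y.
n = [[3, 1, 2],
     [4, 6, 4]]

# N is the total number of points, i.e. the sum of all cell counts.
N = sum(sum(row) for row in n)

# Joint probability (1.1): p(X = x_i, Y = y_j) = n_ij / N
joint = [[Fraction(n_ij, N) for n_ij in row] for row in n]

# c_i = sum_j n_ij, and the marginal (1.2): p(X = x_i) = c_i / N
c = [sum(row) for row in n]
marginal_x = [Fraction(c_i, N) for c_i in c]

# Sum rule (1.3): p(X = x_i) = sum_j p(X = x_i, Y = y_j)
sum_rule_x = [sum(row) for row in joint]

print(N)                         # 20
print(marginal_x == sum_rule_x)  # True
```

Note that `N` is 20 here while the table has only 2 × 3 = 6 cells, which is why \mathit{N} must be read as the number of points rather than the number of cells.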

Now consider only those instances for which \mathit{X=x_i}. The fraction of these instances for which \mathit{Y=y_j} is denoted \mathit{p(Y=y_j | X=x_i)} and is called the conditional probability of \mathit{Y=y_j} given \mathit{X=x_i}. It then follows that

\mathit{p(Y=y_j | X=x_i) = \frac{n_{ij}}{c_i}}     (1.4)


2.Sum rule, Product rule, Bayes’ Theorem.

Since \mathit{\frac{n_{ij}}{c_i} \cdot \frac{c_i}{N}=\frac{n_{ij}}{N}}, equations (1.1), (1.2), and (1.4) yield:

\mathit{p(X =x_i, Y=y_j) = p(Y=y_j | X=x_i)p(X =x_i)}     (1.5)

which is called the product rule of probability.
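
As a quick numerical check of the conditional probability (1.4) and the product rule (1.5), here is a minimal sketch. The count table `n` and the helper names are assumptions chosen for illustration, not part of the original derivation.

```python
from fractions import Fraction

# Hypothetical counts: n[i][j] is the number of points with X = x_i, Y = y_j.
n = [[3, 1, 2],
     [4, 6, 4]]
N = sum(sum(row) for row in n)   # total number of points
c = [sum(row) for row in n]      # c_i: number of points with X = x_i

def joint(i, j):
    """p(X = x_i, Y = y_j) = n_ij / N, equation (1.1)."""
    return Fraction(n[i][j], N)

def marginal_x(i):
    """p(X = x_i) = c_i / N, equation (1.2)."""
    return Fraction(c[i], N)

def cond_y_given_x(j, i):
    """p(Y = y_j | X = x_i) = n_ij / c_i, equation (1.4)."""
    return Fraction(n[i][j], c[i])

# Product rule (1.5), checked for every cell of the table.
product_rule_holds = all(
    joint(i, j) == cond_y_given_x(j, i) * marginal_x(i)
    for i in range(len(n))
    for j in range(len(n[0]))
)
print(product_rule_holds)  # True
```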

For the sake of simplicity, we may write \mathit{p(x_i)} to denote the distribution evaluated at the particular value \mathit{x_i}. Correspondingly, we denote the marginal probability distribution and the joint probability distribution by \mathit{p(X)} and \mathit{p(X,Y)}, respectively.

With these more compact notations, we rewrite the two prominent rules of probability theory as:

Sum rule         \mathit{p(X)=\displaystyle\sum_{Y} p(X,Y) }      (1.6)
Product rule       \mathit{p(X, Y) = p(Y| X)p(X)}      (1.7)

These two concise rules form the basis of the probability theory that we use throughout this series of posts.

From the product rule, since \mathit{p(X,Y) = p(Y,X)}, we obtain \mathit{p(Y|X)p(X) = p(X|Y)p(Y)}. Dividing both sides by \mathit{p(X)}, assuming \mathit{p(X) \neq 0}, gives:

\mathit{p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}}     (1.8)

which is called Bayes’ theorem. It plays a central role in pattern recognition and machine learning.
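
To see Bayes’ theorem (1.8) at work, the sketch below computes \mathit{p(Y|X)} in two ways: directly from a joint distribution, and through \mathit{p(X|Y)p(Y)/p(X)}. The joint table and the labels `x1`, `x2`, `y1`, `y2` are hypothetical values picked only for this example.

```python
from fractions import Fraction

# Hypothetical joint distribution p(X, Y); the four entries sum to 1.
joint = {
    ("x1", "y1"): Fraction(1, 10), ("x1", "y2"): Fraction(3, 10),
    ("x2", "y1"): Fraction(2, 10), ("x2", "y2"): Fraction(4, 10),
}
xs = ["x1", "x2"]
ys = ["y1", "y2"]

# Sum rule (1.6): marginals of X and Y.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

def p_x_given_y(x, y):
    """p(X | Y) = p(X, Y) / p(Y), from the product rule."""
    return joint[(x, y)] / p_y[y]

def bayes_y_given_x(y, x):
    """Bayes' theorem (1.8): p(Y | X) = p(X | Y) p(Y) / p(X)."""
    return p_x_given_y(x, y) * p_y[y] / p_x[x]

direct = joint[("x1", "y2")] / p_x["x1"]   # p(Y = y2 | X = x1) from the joint
via_bayes = bayes_y_given_x("y2", "x1")    # the same quantity via (1.8)

print(direct, via_bayes)  # 3/4 3/4
```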

3.Prior probability, Posterior probability, Maximum Likelihood Estimation.

1>.Prior probability: The probability assigned to an event that reflects established beliefs about the event before the arrival of new evidence or information. Prior probabilities are the original probabilities of an outcome, which will be updated with new information to create the posterior probability.

2>.Posterior probability: The revised probability of an event occurring after taking new information into consideration. The posterior probability is normally calculated by updating the prior probability using Bayes’ theorem.

Example 1:

We shall suppose there are 100 students, recorded as the dataset \mathit{ D= \{(a_1,b_1),(a_2,b_2),\dots,(a_i,b_i),\dots,(a_{100},b_{100})\}} with \mathit{a_i,b_i \in \{0,1\}}. The random variable \mathit{A} denotes a student's sex and \mathit{B} denotes what the student wears, so \mathit{p(A)} and \mathit{p(B)} are the corresponding probability distributions.

\mathit{a_i=0} means the student is a girl and \mathit{a_i=1} means a boy. \mathit{b_i=0} means the student wears a skirt and \mathit{b_i=1} means the student wears pants.


Now suppose we see a student \mathit{j}, with record \mathit{(a_j,b_j)}, wearing pants (\mathit{b_j=1}). We want the probability that student \mathit{j} is a girl (\mathit{a_j=0}), namely \mathit{p(A=0|B=1)}.

According to Bayes’ theorem, \mathit{p(A|B) = p(B|A)p(A) / p(B)} .

Hence, \mathit{p(A=0|B=1) = p(B=1|A=0)p(A=0) / p(B=1)} .

\mathit{p(A=0)} denotes the probability that a student is a girl irrespective of what she wears; this is the prior probability. Correspondingly, \mathit{p(A=0|B=1)} is the posterior probability.

\mathit{p(B=1)} denotes the probability that a student wears pants irrespective of sex.

\mathit{p(B=1|A=0)} denotes the probability that a girl wears pants.

We can compute the three probabilities on the right-hand side by maximum likelihood estimation.


Informally, maximum likelihood estimation chooses the parameter values under which the observed dataset \mathit{D} is most probable. For binary variables such as \mathit{A} and \mathit{B}, the maximum likelihood estimate of each probability is simply the corresponding relative frequency in \mathit{D}. Therefore:

\mathit{p(A=0) = \frac{\sum_{i=1}^{100} I(a_i=0)}{100}}

\mathit{p(B=1) = \frac{\sum_{i=1}^{100} I(b_i=1)}{100}}

\mathit{p(B=1|A=0) = \frac{\sum_{i=1}^{100} I(a_i=0, b_i=1)}{\sum_{i=1}^{100} I(a_i=0)}}

\mathit{I(*)} is the indicator function: \mathit{I(*)=1} when the condition “*” is true, and \mathit{I(*)=0} otherwise.
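
The three estimates above can be reproduced on a concrete dataset. The composition of `D` below (60 boys in pants, 20 girls in pants, 20 girls in skirts) is an assumption made up for this sketch; the post itself does not give the actual counts. The posterior obtained from Bayes’ theorem is compared with the fraction counted directly among the students wearing pants.

```python
from fractions import Fraction

# Hypothetical dataset D of 100 records (a_i, b_i):
# a = 0 girl, a = 1 boy; b = 0 skirt, b = 1 pants.
D = [(1, 1)] * 60 + [(0, 1)] * 20 + [(0, 0)] * 20

def I(condition):
    """Indicator function: 1 if the condition is true, 0 otherwise."""
    return 1 if condition else 0

n_total = len(D)
n_girls = sum(I(a == 0) for a, b in D)
n_pants = sum(I(b == 1) for a, b in D)
n_girls_in_pants = sum(I(a == 0 and b == 1) for a, b in D)

# Relative-frequency (maximum likelihood) estimates.
p_girl = Fraction(n_girls, n_total)                      # prior p(A=0)
p_pants = Fraction(n_pants, n_total)                     # p(B=1)
p_pants_given_girl = Fraction(n_girls_in_pants, n_girls) # p(B=1|A=0)

# Posterior p(A=0|B=1) via Bayes' theorem, and by direct counting.
posterior = p_pants_given_girl * p_girl / p_pants
direct = Fraction(n_girls_in_pants, n_pants)

print(p_girl, p_pants, p_pants_given_girl)  # 2/5 4/5 1/2
print(posterior, direct)                    # 1/4 1/4
```

With these made-up counts, seeing pants lowers the probability that the student is a girl from the prior 2/5 to the posterior 1/4.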



4.Expectation.

One of the most important operations involving probabilities is that of finding weighted averages of functions. The average value of some function \mathit{f(x)} under a probability distribution \mathit{p(x)} is called the expectation of \mathit{f(x)} and is denoted by \mathbb{E}\mathit{[f]}. For a discrete distribution, it is given by:

 \mathbb{E}\mathit{[f]=\displaystyle\sum_{x} p(x)f(x)}     (1.9)

so that the average is weighted by the relative probabilities of the different values of \mathit{x}. Furthermore, the conditional expectation with respect to a conditional distribution is given by:

\mathbb{E}\mathit{[f|y]=\displaystyle\sum_{x} p(x|y)f(x)}    (1.10)
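
A small worked example of (1.9) and (1.10) follows. The joint distribution and the choice \mathit{f(x)=x^2} are assumptions made for illustration only.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y), x in {0, 1, 2}, y in {0, 1}.
joint = {
    (0, 0): Fraction(1, 10), (1, 0): Fraction(2, 10), (2, 0): Fraction(2, 10),
    (0, 1): Fraction(3, 10), (1, 1): Fraction(1, 10), (2, 1): Fraction(1, 10),
}
xs = [0, 1, 2]
ys = [0, 1]

def f(x):
    """Example function whose expectation we compute."""
    return x * x

# Marginals from the sum rule.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

def expectation():
    """E[f] = sum_x p(x) f(x), equation (1.9)."""
    return sum(p_x[x] * f(x) for x in xs)

def conditional_expectation(y):
    """E[f|y] = sum_x p(x|y) f(x), equation (1.10), with p(x|y) = p(x,y)/p(y)."""
    return sum(joint[(x, y)] / p_y[y] * f(x) for x in xs)

print(expectation())               # 3/2
print(conditional_expectation(0))  # 2
print(conditional_expectation(1))  # 1
```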

In the case of continuous variables, expectations are expressed in terms of an integration with respect to the corresponding probability density:

\mathbb{E}\mathit{[f]=\int p(x)f(x)dx}    (1.11)
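
For a continuous variable, the integral of \mathit{p(x)f(x)} can be approximated numerically. The sketch below uses the midpoint rule, which is one simple choice among many; the uniform density on [0, 1] and \mathit{f(x)=x^2} are assumptions for the example, for which the exact expectation is 1/3.

```python
def expectation_midpoint(p, f, a, b, steps=1000):
    """Approximate the integral of p(x) * f(x) over [a, b] by the midpoint rule."""
    dx = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * dx   # midpoint of the k-th sub-interval
        total += p(x) * f(x) * dx
    return total

def uniform_density(x):
    """Density of the uniform distribution on [0, 1] (illustrative choice)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

approx = expectation_midpoint(uniform_density, lambda x: x * x, 0.0, 1.0)
print(abs(approx - 1.0 / 3.0) < 1e-6)  # True
```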


Bishop, C.M. (2006).  Pattern Recognition and Machine Learning.  Springer, New York.

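Zhou, Zhihua (周志华). (2015). Machine Learning (机器学习). Tsinghua University Press, Beijing.

Li, Hang (李航). (2012). Statistical Learning Methods (统计学习方法). Tsinghua University Press, Beijing.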
