Probability theory gives us a language to describe uncertainty.
Probability theory begins with the idea of a random process, which is an “experiment” or procedure whose outcome is not known in advance.
Definition F.1 A random process is a repeatable procedure to obtain an observation from a defined set of outcomes.
Definition F.2 The sample space \(\Omega\) is the set of all possible outcomes of a random process.
Definition F.3 A realization of the random process produces an outcome from the sample space.
Definition F.4 An event is any subset of outcomes \(A \subseteq \Omega\). The probability of an event \(A\) is denoted \(\Pr(A)\).
The probability function \(P(\cdot)\) must satisfy the following three rules.
Theorem F.1 Let \(P\) be a probability function defined on a sample space \(\Omega\). Then:
1. Axiom 1: \(P(A) \geq 0\) for every event \(A \subseteq \Omega\).
2. Axiom 2: \(P(\Omega) = 1\).
3. Axiom 3: For every infinite sequence of disjoint events \(A_1, A_2, ...\), \(P \left( \bigcup_{i = 1}^\infty A_i \right) = \sum_{i = 1}^\infty P(A_i)\).
Some notes on Axiom 3: what does an infinite sequence of disjoint events look like? Such a sequence is difficult to conceptualize. For \(\Omega = \mathbb{R}^+\), one such sequence is \([0, 1), [1, 2), [2, 3), \dots\). For \(\Omega = \{0, 1\}\), one such sequence is \(\{0\}, \{1\}, \emptyset, \emptyset, \emptyset, \dots\).
Definition F.5 For a sample space \(\Omega\), a probability function assigns a real number \(P(A)\) to every event \(A \subseteq \Omega\) consistent with Axioms 1, 2, and 3.
Example F.1 Let \(\Omega = \{\text{H}, \text{T}\}\) represent the outcomes of a fair coin flip.
Define a probability function:
- \(P(\{\text{H}\}) = 0.5\)
- \(P(\{\text{T}\}) = 0.5\)
Then the three axioms of probability are satisfied as follows: Axiom 1 holds because every assigned probability is nonnegative; Axiom 2 holds because \(P(\Omega) = P(\{\text{H}\}) + P(\{\text{T}\}) = 0.5 + 0.5 = 1\); and Axiom 3 holds because the probability of any union of disjoint events equals the sum of their probabilities.
Theorem F.2 \(\Pr(\emptyset) = 0\).
Proof. Write \(\emptyset = \cup_{i = 1}^\infty \emptyset\), a union of disjoint events. By Axiom 3, \(\Pr(\emptyset) = \Pr(\cup_{i = 1}^\infty \emptyset) = \sum_{i = 1}^\infty \Pr(\emptyset)\). This equality holds only if \(\Pr(\emptyset) = 0\), since any positive value would make the right-hand side diverge.
Theorem F.3 For every finite sequence of \(n\) disjoint events \(A_1, A_2, ..., A_n\), \[\begin{equation} \Pr \left( \displaystyle \bigcup_{i = 1}^n A_i \right) = \displaystyle \sum_{i = 1}^n \Pr(A_i). \nonumber \end{equation}\]
Theorem F.4 Addition Rule for Two Disjoint Events
For disjoint events \(A\) and \(B\), \(\Pr ( A \cup B) = \Pr(A) + \Pr(B)\)
Theorem F.5 If event \(A \subseteq B\), then \(\Pr(A) \leq \Pr(B)\).
Theorem F.6 For event \(A\), \(0 \leq \Pr(A) \leq 1\).
Theorem F.7 Addition Rule for Two Events
For any events \(A\) and \(B\), \(\Pr ( A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)\).
Theorem F.8 Addition Rule for Three Events
For any events \(A\), \(B\), and \(C\), \[\begin{align*}
\Pr ( A \cup B \cup C) &= \Pr(A) + \Pr(B) + \Pr(C)\\
&- \left[ \Pr(A \cap B) + \Pr(A \cap C) + \Pr(B \cap C) \right]\\
&+ \Pr(A \cap B \cap C).
\end{align*}\]
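As a quick check, here is a small R sketch that verifies the three-event addition rule on a fair six-sided die; the events \(A\), \(B\), and \(C\) below are made up purely for illustration.

```r
# Illustrative check of the three-event addition rule on a fair six-sided die;
# the events A, B, and C are invented for this example.
omega <- 1:6
A <- c(1, 2, 3)   # roll is at most 3
B <- c(2, 4, 6)   # roll is even
C <- c(3, 6)      # roll is divisible by 3

pr <- function(event) length(event) / length(omega)

pr(union(union(A, B), C))                                        # left-hand side: 5/6
pr(A) + pr(B) + pr(C) -
  (pr(intersect(A, B)) + pr(intersect(A, C)) + pr(intersect(B, C))) +
  pr(Reduce(intersect, list(A, B, C)))                           # right-hand side: 5/6
```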
Definition F.6 Conditional Probability
\(\Pr(A \mid B) = \dfrac{\Pr(A \cap B)}{\Pr(B)}\) for \(\Pr(B) > 0\). If \(\Pr(B) = 0\), then \(\Pr(A \mid B)\) is undefined.
We interpret the conditional probability \(\Pr(A \mid B)\) as the probability of \(A\) given that \(B\) happens (or has already happened). Suppose a bag contains two green marbles and two red marbles. I draw two marbles without replacement and see that the first is green. Then the probability that the second is green, given that the first is green, is
\[ \Pr(\text{second is green} \mid \text{first is green}) = \frac{\Pr(\text{second is green AND first is green})}{\Pr (\text{first is green)}}. \]
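A short simulation sketch of this marble example; the seed and number of simulations are arbitrary choices.

```r
# Simulation sketch of the marble example: two green (G) and two red (R) marbles,
# drawn without replacement (seed and number of simulations chosen arbitrarily).
set.seed(1)
n_sims <- 100000
draws <- replicate(n_sims, sample(c("G", "G", "R", "R"), size = 2))

first_green  <- draws[1, ] == "G"
second_green <- draws[2, ] == "G"

# Estimate Pr(second is green | first is green); the exact answer is 1/3.
mean(first_green & second_green) / mean(first_green)
```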
Definition F.7 Independence of Two Events
Events \(A\) and \(B\) are independent if \(\Pr(A \cap B) = \Pr(A) \Pr(B)\).
If \(\Pr(A) > 0\) and \(\Pr(B) > 0\), then Definition F.6 and Definition F.7 imply that two events are independent if and only if their conditional probabilities equal their unconditional probabilities so that \(\Pr(A \mid B) = \Pr(A)\) and \(\Pr(B \mid A) = \Pr(B)\).
Definition F.8 Independence of \(n\) Events
Events \(A_1, A_2, ..., A_n\) are independent if for every subcollection \(A_{i_1}, ..., A_{i_m}\) with at least two events, \(\Pr(A_{i_1} \cap \cdots \cap A_{i_m}) = \Pr(A_{i_1}) \cdots \Pr(A_{i_m})\).
The “every subset” part of Definition F.8 is subtle, so let’s create a specific example. “Every subset” of \(A\), \(B\), and \(C\) with at least two events includes the following: \(\{A, B\}\), \(\{A, C\}\), \(\{B, C\}\), and \(\{A, B, C\}\).
Definition F.9 To create a partition \(B_1, B_2, ..., B_k\) of the sample space \(\Omega\), divide \(\Omega\) into \(k\) disjoint events \(B_1, B_2, ..., B_k\) so that \(\bigcup_{j = 1}^k B_j = \Omega\).
Theorem F.9 Law of Total Probability
Suppose a partition \(B_1, B_2, ..., B_k\) of the sample space \(\Omega\) where \(\Pr(B_j) > 0\) for \(j = 1, 2, ... , k\). Then
\[ \Pr(A) = \sum_{j = 1}^k \Pr(B_j )\Pr(A \mid B_j). \]
Theorem F.10 Bayes’ Rule
Suppose a partition \(B_1, B_2, ..., B_k\) of the sample space \(\Omega\) where \(\Pr(B_j) > 0\) for \(j = 1, 2, ... , k\). Suppose an event \(A\), where \(\Pr(A) > 0\). Then
\[ \Pr(B_i \mid A) = \dfrac{\Pr(B_i) \Pr(A \mid B_i)}{\sum_{j = 1}^k \Pr(B_j )\Pr(A \mid B_j)}. \]
Theorem F.11 Bayes’ Rule for a simpler partition
Suppose the simple partition \(B\) and \(B^c\) of the sample space \(\Omega\) where \(\Pr(B) > 0\) and \(\Pr(B^c) > 0\). Suppose an event \(A\), where \(\Pr(A) > 0\). Then
\[ \Pr(B \mid A) = \dfrac{\Pr(B) \Pr(A \mid B)}{\Pr(B) \Pr(A \mid B) + \Pr(B^c) \Pr(A \mid B^c)}. \]
Exercise F.1 You’re considering getting tested for a rare disease that 1 in 100,000 people have. If given to a person with the disease, the test will produce a positive result 99% of the time. If given to a person without the disease, the test will produce a positive result 0.1% of the time (i.e., 1 in 1,000). You are given the test and the result comes back positive. Use Bayes’ rule to compute the chance that you have the disease.
Solution
Let \(D\) denote having the disease and \(T\) a positive test.
Compute the marginal probability of a positive test using the law of total probability:
\[ P(T) \;=\; P(T\mid D)P(D) + P(T\mid D^c)P(D^c) = 0.99(10^{-5}) + 0.001(1-10^{-5}) = 9.9\times 10^{-6} + 0.00099999 = 0.00100989. \]
Apply Bayes’ rule
\[ P(D\mid T) \;=\; \frac{P(T\mid D)P(D)}{P(T)} = \frac{0.99 \times 10^{-5}}{0.00100989} \approx 0.0098. \]
So the chance you have the disease given a positive test is about \(0.98\%\) (i.e., less than 1%).
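The same calculation in R, plugging the numbers from the exercise into the law of total probability and Bayes' rule:

```r
# Exercise F.1 in R: posterior probability of disease given a positive test
p_d       <- 1e-5     # Pr(D): 1 in 100,000
p_t_d     <- 0.99     # Pr(T | D)
p_t_not_d <- 0.001    # Pr(T | D^c)

p_t   <- p_t_d * p_d + p_t_not_d * (1 - p_d)   # law of total probability, about 0.00100989
p_d_t <- p_t_d * p_d / p_t                     # Bayes' rule, about 0.0098
p_d_t
```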
A random variable is a numerical summary of the possible outcomes from a random process.
Definition F.10 A random variable \(X\) is a function from a sample space \(\Omega\) to the real numbers:
\[ X: \Omega \rightarrow \mathbb{R} \]
That is, \(X(\omega)\) is the number assigned to the outcome \(\omega\).
The “random” part comes from the fact that the outcome \(\omega\) is not known in advance, so the value \(X(\omega)\) is also uncertain.
We often use random variables to model outcomes of interest.
We usually classify random variables as either discrete (taking countably many values) or continuous (taking values in an interval of the real line).
Example F.2 Let \(X\) be the number of protests in a given country-year.
- Possible values: \(0, 1, 2, \dots\)
- So \(X\) is a discrete random variable.
Let \(Y\) be the time between protests.
- Possible values: any real number \(\ge 0\)
- So \(Y\) is a continuous random variable.
Random variables allow us to define numerical events using real numbers.
For example, if \(X\) is the number of protests:
- \(\{X = 0\}\) is the event that no protests occur.
- \(\{X \ge 5\}\) is the event that at least five protests occur.
These probabilities can be computed once we choose a model for \(X\), as in the sketch below.
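As a sketch, suppose (purely for illustration; nothing above commits to this model) that we modeled \(X\) as \(\text{Poisson}(3)\). Then R's built-in `dpois()` and `ppois()` functions compute the probabilities of such events.

```r
# Hypothetical model for illustration only: X ~ Poisson(3)
dpois(0, lambda = 3)       # Pr(X = 0)
1 - ppois(4, lambda = 3)   # Pr(X >= 5) = 1 - Pr(X <= 4)
```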
A discrete random variable \(X\) has a PMF \(p(x)\) satisfying \(p(x) \ge 0\) for all \(x\) and \(\sum_x p(x) = 1\).
A continuous random variable \(X\) has a PDF \(f(x)\) satisfying \(f(x) \ge 0\) for all \(x\) and \(\int_{-\infty}^{\infty} f(x)\, dx = 1\).
For continuous variables, probabilities are areas under the density curve:
\[ P(a \le X \le b) = \int_a^b f(x)\, dx \]
The CDF \(F(x)\) of a random variable \(X\) is \(F(x) = P(X \le x)\).
Example F.3 Let \(X \sim \text{Bernoulli}(0.7)\). Then:
- PMF: \(p(1) = 0.7\) and \(p(0) = 0.3\)
- CDF: \(F(x) = 0\) for \(x < 0\), \(F(x) = 0.3\) for \(0 \le x < 1\), and \(F(x) = 1\) for \(x \ge 1\)
Let \(Y \sim \text{Exponential}(\lambda = 2)\). Then:
- PDF: \(f(y) = 2 e^{-2y}\) for \(y \ge 0\)
- CDF: \(F(y) = 1 - e^{-2y}\) for \(y \ge 0\)
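In R, these quantities can be evaluated with the built-in `d*()` and `p*()` functions; a Bernoulli is a Binomial with `size = 1`.

```r
# Bernoulli(0.7) PMF (a Binomial with size = 1)
dbinom(1, size = 1, prob = 0.7)   # Pr(X = 1) = 0.7
dbinom(0, size = 1, prob = 0.7)   # Pr(X = 0) = 0.3

# Exponential(rate = 2) PDF and CDF
dexp(0.5, rate = 2)               # f(0.5) = 2 * exp(-1), about 0.736
pexp(0.5, rate = 2)               # Pr(Y <= 0.5) = 1 - exp(-1), about 0.632
```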
Definition F.11 The expected value of a random variable \(X\) is \(\mathbb{E}[X] = \sum_x x \cdot p(x)\) for discrete random variables and \(\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x)\, dx\) for continuous random variables.
Definition F.12 The variance of \(X\) is \(\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\).
Example F.4 Let \(X \sim \text{Poisson}(\lambda = 3)\). Then \(\mathbb{E}[X] = 3\) and \(\text{Var}(X) = 3\).
Let \(Y \sim \text{Exponential}(\lambda = 2)\). Then \(\mathbb{E}[Y] = \frac{1}{2} = 0.5\) and \(\text{Var}(Y) = \frac{1}{4} = 0.25\).
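A quick simulation sketch that checks these moments; the seed and number of draws are arbitrary.

```r
# Simulation check of Example F.4 (seed and number of draws chosen arbitrarily)
set.seed(42)
x <- rpois(100000, lambda = 3)
y <- rexp(100000, rate = 2)

c(mean(x), var(x))   # both should be close to 3
c(mean(y), var(y))   # should be close to 0.5 and 0.25
```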
Example F.5 Let \(X \sim \text{Bernoulli}(0.7)\). Compute \(\mathbb{E}[X]\) and \(\text{Var}(X)\).
Solution.
PMF: \(P(X = 1) = 0.7\), \(P(X = 0) = 0.3\).
Then \(\mathbb{E}[X] = 1(0.7) + 0(0.3) = 0.7\) and \(\mathbb{E}[X^2] = 1^2(0.7) + 0^2(0.3) = 0.7\), so \(\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = 0.7 - 0.49 = 0.21\).
Theorem F.12 (Linearity of Expectation) For any constants \(a, b\) and random variables \(X, Y\), \(\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]\).
Theorem F.13 (Expectation of a Constant) For any constant \(c\), \(\mathbb{E}[c] = c\).
Theorem F.14 (Expectation of a Product) For any random variables \(X\) and \(Y\),
\(\mathbb{E}[XY] = \mathrm{Cov}(X, Y) + \mathbb{E}[X]\,\mathbb{E}[Y]\).
If \(X\) and \(Y\) are independent, then \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\).
Theorem F.15 (Variance Scaling) For any constant \(a\), \(\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)\).
Theorem F.16 (Variance of a Sum) For any random variables \(X\) and \(Y\),
\(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)\).
If \(X\) and \(Y\) are independent, then \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)\).
Theorem F.17 (Covariance Scaling) For any constants \(a, b\), \(\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X, Y)\).
Theorem F.18 (Linearity of Covariance) For any random variables \(X, Y, Z\),
\(\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z)\), and similarly in the second argument.
Theorem F.19 (Independence Implies Zero Covariance) If \(X\) and \(Y\) are independent, then \(\mathrm{Cov}(X, Y) = 0\).
However, \(\mathrm{Cov}(X, Y) = 0\) does not necessarily imply independence.
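A classic counterexample, sketched by simulation: take \(X\) symmetric around zero and \(Y = X^2\). The covariance is (approximately) zero even though \(Y\) is a deterministic function of \(X\).

```r
# Zero covariance without independence: Y = X^2 for X symmetric around zero
set.seed(7)
x <- rnorm(100000)
y <- x^2

cov(x, y)        # close to 0, even though Y is a deterministic function of X
mean(y[x > 1])   # conditional mean of Y given X > 1 ...
mean(y)          # ... differs from the unconditional mean, so X and Y are dependent
```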
Let \(X\) and \(Y\) be random variables.
Example F.6 Let \(X, Y\) be discrete with joint PMF:
| \(X \backslash Y\) | 0 | 1 |
|---|---|---|
| 0 | 0.1 | 0.3 |
| 1 | 0.2 | 0.4 |
Compute \(P(X = 1)\), \(P(Y = 0)\), and \(P(Y = 1 \mid X = 1)\).
Solution.
\(P(X = 1) = 0.2 + 0.4 = 0.6\), \(P(Y = 0) = 0.1 + 0.2 = 0.3\), and \(P(Y = 1 \mid X = 1) = \dfrac{P(X = 1, Y = 1)}{P(X = 1)} = \dfrac{0.4}{0.6} = \dfrac{2}{3}\).
Definition F.13 The covariance between \(X\) and \(Y\) is \(\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]\).
Definition F.14 The correlation between \(X\) and \(Y\) is \(\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\).
Example F.7 Let \(X, Y\) be defined as in Example F.6. Compute \(\text{Cov}(X, Y)\).
Solution.
First compute \(\mathbb{E}[XY] = (1)(1)(0.4) = 0.4\), \(\mathbb{E}[X] = 0.6\), and \(\mathbb{E}[Y] = 0.3 + 0.4 = 0.7\).
Then \(\text{Cov}(X, Y) = 0.4 - (0.6)(0.7) = 0.4 - 0.42 = -0.02\).
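The same computation in R, working directly from the joint PMF table:

```r
# Covariance from the joint PMF in Examples F.6 and F.7
joint <- matrix(c(0.1, 0.3,
                  0.2, 0.4),
                nrow = 2, byrow = TRUE)   # rows index X = 0, 1; columns index Y = 0, 1
x_vals <- c(0, 1)
y_vals <- c(0, 1)

e_xy <- sum(outer(x_vals, y_vals) * joint)   # E[XY] = 0.4
e_x  <- sum(x_vals * rowSums(joint))         # E[X]  = 0.6
e_y  <- sum(y_vals * colSums(joint))         # E[Y]  = 0.7

e_xy - e_x * e_y                             # Cov(X, Y) = -0.02
```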
Theorem F.20 (Law of the Unconscious Statistician) If \(X\) is a random variable and \(g\) is a function, then \(\mathbb{E}[g(X)] = \sum_x g(x) p(x)\) for discrete random variables and \(\mathbb{E}[g(X)] = \int g(x) f(x)\, dx\) for continuous random variables.
Theorem F.21 (Law of Total Expectation) For another variable \(Y\), \(\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]\).
Theorem F.22 (Law of Total Variance) \[ \text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y]) \]
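A simulation sketch of both laws; the model here (\(Y \sim \text{Bernoulli}(0.5)\) with \(X \mid Y \sim \text{Normal}(2Y, 1)\)) is invented purely for illustration.

```r
# Simulation sketch of the laws of total expectation and variance.
# The model below (Y ~ Bernoulli(0.5), X | Y ~ Normal(2Y, 1)) is invented for illustration.
set.seed(123)
n <- 100000
y <- rbinom(n, size = 1, prob = 0.5)
x <- rnorm(n, mean = 2 * y, sd = 1)

cond_mean <- tapply(x, y, mean)   # estimates of E[X | Y = 0] and E[X | Y = 1]
cond_var  <- tapply(x, y, var)    # estimates of Var(X | Y = 0) and Var(X | Y = 1)
p_y       <- table(y) / n         # empirical Pr(Y = 0) and Pr(Y = 1)

mean(x)                           # E[X]
sum(p_y * cond_mean)              # E[E[X | Y]], matches E[X]

var(x)                            # Var(X)
sum(p_y * cond_var) +
  sum(p_y * (cond_mean - sum(p_y * cond_mean))^2)   # E[Var(X|Y)] + Var(E[X|Y])
```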
Theorem F.23 (Change of Variables) Suppose \(X\) is a continuous variable with PDF \(f_X(x)\) and \(Y = g(X)\) is a strictly monotonic transformation. Then the PDF of \(Y\) is \(f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|\).
Example F.8 Let \(X \sim \text{Exponential}(1)\) and \(Y = \log(X)\). Find the PDF of \(Y\).
Solution.
Notice that \(g^{-1}(y) = e^y\). Thus, \(\frac{d}{dy} g^{-1}(y) = e^y\).
\(f_X(x) = e^{-x}\).
Then \(f_Y(y) = f_X(e^y) \cdot e^y = e^{-e^y} \cdot e^y = e^y \cdot e^{-e^y}\).
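As a rough check, we can compare a histogram of simulated values of \(Y = \log(X)\) with the derived density \(f_Y(y) = e^{y - e^y}\):

```r
# Simulation check of Example F.8: Y = log(X) with X ~ Exponential(1)
set.seed(99)
y_sim <- log(rexp(100000, rate = 1))

hist(y_sim, breaks = 100, freq = FALSE, main = "Y = log(X)", xlab = "y")
curve(exp(x - exp(x)), add = TRUE, lwd = 2)   # derived density f_Y(y) = exp(y - exp(y))
```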
Definition F.15 Let \(X \sim \mathcal{N}(\mu, \sigma^2)\). Then \(f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)\). For the normal distribution, \(\mathbb{E}[X] = \mu\) and \(\text{Var}(X) = \sigma^2\).
We have several functions to work with the normal distribution in R.
| Function | Description |
|---|---|
| `dnorm(x)` | PDF: \(f(x)\) |
| `pnorm(x)` | CDF: \(P(X \le x)\) |
| `qnorm(p)` | Quantile: inverse CDF |
| `rnorm(n)` | Simulate from \(\mathcal{N}(\mu, \sigma^2)\) |
Example F.9 Simulate 5 draws from \(\mathcal{N}(2, 1)\).
```r
rnorm(5, mean = 2, sd = 1)
```

```
[1] 1.974739 2.396793 2.512048 2.295439 1.534499
```
Example F.10 Compute \(\Pr(X \le 1.96)\) for \(X \sim \mathcal{N}(0, 1)\).
```r
pnorm(1.96)
```

```
[1] 0.9750021
```
Returns approximately 0.975.
Example F.11 Find the \(x\) that gives \(\Pr(X \le x) = 0.8\) for \(X \sim \mathcal{N}(0, 1)\).
```r
qnorm(0.8)
```

```
[1] 0.8416212
```
Returns approximately 0.84.