Suppose you have $n$ random variables $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, each holding $m$ observations (so each $\mathbf{x}_i$ is a vector of length $m$).
Now, between any two random variables $\mathbf{x}_i$ and $\mathbf{x}_j$, the covariance is defined as:
$$ \sigma_{\mathbf{x}_i, \mathbf{x}_j} = E[(\mathbf{x}_i - \mu_{\mathbf{x}_i})(\mathbf{x}_j - \mu_{\mathbf{x}_j})] $$
(where $\mu_{\mathbf{x}_i}$ and $\mu_{\mathbf{x}_j}$ are the means of $\mathbf{x}_i$ and $\mathbf{x}_j$ respectively.)
Important assumption: For the sake of simplicity, assume that all data is centered, as @njoshi pointed out. In other words, $\mu_{\mathbf{x}_i} = 0$ for every $\mathbf{x}_i$.
So the expression for covariance changes to the following:
$$ \sigma_{\mathbf{x}_i, \mathbf{x}_j} = E[\mathbf{x}_i\mathbf{x}_j] $$
I.e. the covariance of two centered RVs is the expectation of their elementwise product.
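For a quick numerical check of this, here is a minimal numpy sketch (the data and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=1000)
x_j = 0.5 * x_i + rng.normal(size=1000)

# Center both RVs so their means are exactly zero.
x_i -= x_i.mean()
x_j -= x_j.mean()

# For centered RVs, covariance = expectation (mean) of the elementwise product.
cov_manual = np.mean(x_i * x_j)
cov_numpy = np.cov(x_i, x_j, bias=True)[0, 1]  # bias=True divides by m, not m - 1
print(np.isclose(cov_manual, cov_numpy))       # True
```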
Key point: There is a crucial relationship between the expectation of the product of two RVs and the inner product of the RVs. Notice that the RHS of the above expression can be written as:
$$ E[\mathbf{x}_i\mathbf{x}_j] = E[x_{i1} \times x_{j1}, x_{i2} \times x_{j2}, x_{i3} \times x_{j3}, \ldots, x_{im} \times x_{jm}] $$
$$ \therefore E[\mathbf{x}_i\mathbf{x}_j] = \frac{\sum^{m}_{k=1}{x_{ik} \times x_{jk}}}{m} $$
Note that the numerator of the RHS is just the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$.
Thus, the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$ (denoted $\overline{\mathbf{x}_i\mathbf{x}_j}$ from here on) is just their covariance scaled by the number of observations:
$$ \overline{\mathbf{x}_i\mathbf{x}_j} = m\times\sigma_{\mathbf{x}_i, \mathbf{x}_j} $$
(Hold that thought, we'll come back to it.)
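Here is a small numpy sketch of that relationship (illustrative data; `m` is the number of observations per RV):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500                                      # number of observations per RV
x_i = rng.normal(size=m); x_i -= x_i.mean()  # centered RV x_i
x_j = rng.normal(size=m); x_j -= x_j.mean()  # centered RV x_j

inner = np.dot(x_i, x_j)                     # inner product of the two RVs
cov = np.cov(x_i, x_j, bias=True)[0, 1]      # covariance (divisor m)
print(np.isclose(inner, m * cov))            # True: <x_i, x_j> = m * sigma_{x_i, x_j}
```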
Now, consider a matrix $\mathbf{X}$ of size $m \times n$, in which each of the $n$ RVs forms a column:
$$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{21} & x_{31} & \ldots & x_{n1} \\
x_{12} & x_{22} & x_{32} & \ldots & x_{n2} \\
x_{13} & x_{23} & x_{33} & \ldots & x_{n3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{1m} & x_{2m} & x_{3m} & \ldots & x_{nm} \end{bmatrix} $$
(Note that each RV $\mathbf{x}_i$ is expanded above as $[x_{i1}\ x_{i2}\ x_{i3}\ \ldots\ x_{im}]$, i.e. as a column of its $m$ observations.)
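In numpy terms, $\mathbf{X}$ is just the centered RVs stacked as columns. A sketch under the same assumptions (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 4                                 # m observations, n RVs
rvs = [rng.normal(size=m) for _ in range(n)]  # the n RVs
rvs = [x - x.mean() for x in rvs]             # center each RV

X = np.column_stack(rvs)                      # column i of X is the RV x_i
print(X.shape)                                # (100, 4), i.e. (m, n)
```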
So now, if we evaluate $\mathbf{X}^T\mathbf{X}$, we get:
$$ \mathbf{X}^T\mathbf{X} = \begin{bmatrix} \overline{\mathbf{x}_1\mathbf{x}_1} & \overline{\mathbf{x}_1\mathbf{x}_2} & \overline{\mathbf{x}_1\mathbf{x}_3} & \ldots & \overline{\mathbf{x}_1\mathbf{x}_n} \\
\overline{\mathbf{x}_2\mathbf{x}_1} & \overline{\mathbf{x}_2\mathbf{x}_2} & \overline{\mathbf{x}_2\mathbf{x}_3} & \ldots & \overline{\mathbf{x}_2\mathbf{x}_n} \\
\overline{\mathbf{x}_3\mathbf{x}_1} & \overline{\mathbf{x}_3\mathbf{x}_2} & \overline{\mathbf{x}_3\mathbf{x}_3} & \ldots & \overline{\mathbf{x}_3\mathbf{x}_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\overline{\mathbf{x}_n\mathbf{x}_1} & \overline{\mathbf{x}_n\mathbf{x}_2} & \overline{\mathbf{x}_n\mathbf{x}_3} & \ldots & \overline{\mathbf{x}_n\mathbf{x}_n}
\end{bmatrix} $$
where $\overline{\mathbf{x}_i\mathbf{x}_j}$ is, as before, the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$.
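This is easy to verify numerically: each entry of $\mathbf{X}^T\mathbf{X}$ is the dot product of the corresponding pair of columns. Again a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 4
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)                           # center each column (RV)

G = X.T @ X                                   # n x n matrix of pairwise inner products
i, j = 1, 3
print(np.isclose(G[i, j], np.dot(X[:, i], X[:, j])))  # True
```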
Now, from the result above, we know that the inner product of two RVs is just their covariance scaled by $m$. Substituting, we get:
$$ \mathbf{X}^T\mathbf{X} = \begin{bmatrix} m\sigma_{\mathbf{x}_1\mathbf{x}_1} & m\sigma_{\mathbf{x}_1\mathbf{x}_2} & m\sigma_{\mathbf{x}_1\mathbf{x}_3} & \ldots & m\sigma_{\mathbf{x}_1\mathbf{x}_n} \\
m\sigma_{\mathbf{x}_2\mathbf{x}_1} & m\sigma_{\mathbf{x}_2\mathbf{x}_2} & m\sigma_{\mathbf{x}_2\mathbf{x}_3} & \ldots & m\sigma_{\mathbf{x}_2\mathbf{x}_n} \\
m\sigma_{\mathbf{x}_3\mathbf{x}_1} & m\sigma_{\mathbf{x}_3\mathbf{x}_2} & m\sigma_{\mathbf{x}_3\mathbf{x}_3} & \ldots & m\sigma_{\mathbf{x}_3\mathbf{x}_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
m\sigma_{\mathbf{x}_n\mathbf{x}_1} & m\sigma_{\mathbf{x}_n\mathbf{x}_2} & m\sigma_{\mathbf{x}_n\mathbf{x}_3} & \ldots & m\sigma_{\mathbf{x}_n\mathbf{x}_n} \end{bmatrix} $$
Dividing both sides by $m$ gives us the covariance matrix:
$$ \frac{\mathbf{X}^T\mathbf{X}}{m} = \begin{bmatrix} \sigma_{\mathbf{x}_1\mathbf{x}_1} & \sigma_{\mathbf{x}_1\mathbf{x}_2} & \sigma_{\mathbf{x}_1\mathbf{x}_3} & \ldots & \sigma_{\mathbf{x}_1\mathbf{x}_n} \\
\sigma_{\mathbf{x}_2\mathbf{x}_1} & \sigma_{\mathbf{x}_2\mathbf{x}_2} & \sigma_{\mathbf{x}_2\mathbf{x}_3} & \ldots & \sigma_{\mathbf{x}_2\mathbf{x}_n} \\
\sigma_{\mathbf{x}_3\mathbf{x}_1} & \sigma_{\mathbf{x}_3\mathbf{x}_2} & \sigma_{\mathbf{x}_3\mathbf{x}_3} & \ldots & \sigma_{\mathbf{x}_3\mathbf{x}_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{\mathbf{x}_n\mathbf{x}_1} & \sigma_{\mathbf{x}_n\mathbf{x}_2} & \sigma_{\mathbf{x}_n\mathbf{x}_3} & \ldots & \sigma_{\mathbf{x}_n\mathbf{x}_n}
\end{bmatrix} $$
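To sanity-check the whole derivation, here is a sketch comparing $\mathbf{X}^T\mathbf{X}/m$ against numpy's own covariance routine (with `bias=True` so that it also divides by $m$; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 1000, 5
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)                            # center each column (RV)

cov_from_gram = (X.T @ X) / m                  # the result derived above
cov_numpy = np.cov(X, rowvar=False, bias=True) # columns are variables, divisor is m
print(np.allclose(cov_from_gram, cov_numpy))   # True
```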
The covariance matrix is a compact, convenient way to represent the covariances of all pairs of RVs. A few properties worth noting:
- Covariance is symmetric, i.e. $\sigma_{\mathbf{x}_i, \mathbf{x}_j} = \sigma_{\mathbf{x}_j, \mathbf{x}_i}$. Therefore the matrix is also symmetric.
- The diagonal holds the variances, since the covariance of an RV with itself is just the variance of that RV.
- Among $n$ RVs, the number of ways in which they can be paired is $\binom{n}{2}$ - and that is exactly half the number of off-diagonal entries in an $n \times n$ matrix (the remaining half carries the same values, since the matrix is symmetric).
- Being symmetric and positive semi-definite, the covariance matrix enjoys a lot of convenient properties - and it is because of these that we can, among other things, perform PCA and SVD (see the sketch below).
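As a final illustrative sketch (made-up data again), the symmetry and positive semi-definiteness are easy to observe via an eigendecomposition, which is also the doorway to PCA:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 1000, 5
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)                            # centered data matrix

C = (X.T @ X) / m                              # covariance matrix

print(np.allclose(C, C.T))                     # True: symmetric
eigvals, eigvecs = np.linalg.eigh(C)           # eigh is meant for symmetric matrices
print(np.all(eigvals >= -1e-12))               # True: eigenvalues non-negative (PSD)
# The eigenvectors are the principal directions used in PCA; equivalently,
# they are the right singular vectors of the centered X from its SVD.
```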