Statistics 101 – Covariance and Linear Regression (Part 2 of 3)

Statistical inference helps us understand the data, and hypothesis testing helps us determine whether one set of data differs from another. These techniques are important when exploring data sets, as they help guide our analysis. However, they are not enough. Most of the time, we want to understand the relationship between two sets of data, such as how the stock price of GOOGL moves with respect to the S&P 500.

Relationship between two sets of data

When we want to describe the relationship between two sets of data, we can plot the data sets in a scatter plot and look at four characteristics:

Direction – Are the data points sloping upwards or downwards?
Form – Do the data points form a straight line or a curved line?
Strength – Are the data points tightly clustered or spread out?
Outliers – Are there data points far away from the main body of data?

When we are looking to understand the relationship between two sets of quantitative data, we can start with correlation and covariance.

Covariance

Covariance describes how two variables, x and y, vary with respect to each other. It can be described by the following equation:

$latex Covariance = s_{xy} = \frac{1}{n}\sum(x_{i} - \bar{x})(y_{i} - \bar{y}) $

As we can see from the equation, the covariance sums the term (x_i – x̄)(y_i – ȳ) over all data points and divides by n, where x̄ (x bar) is the average x value and ȳ (y bar) is the average y value. The term is positive when x and y are both larger or both smaller than their averages, and negative when one is larger than its average while the other is smaller.

Because the covariance accounts for every data point in the set, a positive covariance indicates that, on balance, the data points are in sync with respect to x and y (small y when x is small, or large y when x is large). Conversely, a negative covariance indicates that, on balance, the data points are out of sync (small y when x is large, or large y when x is small). Since each point is weighted by how far it sits from the averages, a few extreme points can dominate the sum.
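
As a minimal sketch of the equation above, the covariance can be computed by hand in a few lines of Python. The function name `covariance` and the data values are made up for illustration, and the 1/n form is used to match the equation (many libraries divide by n - 1 instead).

```python
def covariance(xs, ys):
    """Covariance using the 1/n form of the equation in the text."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum the product of deviations from the averages, then divide by n.
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


# Made-up illustrative data, not real prices.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 5.0, 4.0, 5.0]    # tends to rise with x
y_down = [5.0, 4.0, 5.0, 4.0, 2.0]  # tends to fall as x rises

cov_up = covariance(x, y_up)
cov_down = covariance(x, y_down)

print(f"{cov_up:.2f}")    # 1.20
print(f"{cov_down:.2f}")  # -1.20
```

The sign flips when the second series is reversed, which is the "in sync" versus "out of sync" behavior described above.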

Covariance is a useful measure for describing the direction of the linear association between two quantitative variables, but it has two weaknesses:

a larger covariance does not always mean a stronger relationship, and
we cannot compare the covariances across different data sets

Covariances depend in part on the scale of x and y in the data: if x takes large values, the covariance will tend to be large too. For example, if we were to compare the covariance of GOOGL and the S&P 500 to the covariance of GOOGL and MSFT, we would find that the first covariance is much bigger. The difference would arise mainly because the S&P 500 is measured in the thousands, whereas MSFT is measured in the hundreds; it does not speak to the strength of the linear association.
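
A short derivation makes the scale dependence concrete. This is a sketch using the 1/n definition above, where a is a hypothetical constant rescaling factor (for example, a = 100 when converting dollars to cents). Since the average of the rescaled values is a times x̄, rescaling x rescales the covariance by the same factor:

$latex s_{(ax)y} = \frac{1}{n}\sum(a x_{i} - a\bar{x})(y_{i} - \bar{y}) = a \cdot s_{xy} $

The scatter plot has exactly the same shape after rescaling, yet the covariance changes by a factor of a.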

Correlation

To address these weaknesses, we can normalize the covariance by dividing it by the product of the standard deviations of the x values and the y values, which gives the correlation coefficient. Typically denoted as ρ (the Greek letter rho) or r, the equation for the correlation coefficient is:

$latex Correlation Coefficient = \rho_{xy} = \dfrac{s_{xy}}{(s_{x} \times s_{y})} $

Where:

s_xy is the covariance of x and y
s_x is the standard deviation of x
s_y is the standard deviation of y

The correlation coefficient will always turn out to be a value between -1 and 1, and measures both the direction and the strength of the linear association. One important distinction to note is that correlation does not measure the slope of the relationship — a large correlation only speaks to the strength of the relationship.
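
The distinction between strength and slope can be checked with a small sketch. The function name `correlation` and the data are made up for illustration, and the 1/n form is assumed for both the covariance and the standard deviations (the divisor cancels in the ratio, so n - 1 would give the same result).

```python
import math


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / (s_x * s_y)


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
shallow = [2.0 * v for v in x]        # exact line with slope 2
steep = [10.0 * v + 3.0 for v in x]   # exact line with slope 10
noisy = [2.0, 4.0, 5.0, 4.0, 5.0]     # upward trend with scatter

r_shallow = correlation(x, shallow)
r_steep = correlation(x, steep)
r_noisy = correlation(x, noisy)

print(f"{r_shallow:.3f} {r_steep:.3f} {r_noisy:.3f}")
```

Both exact lines give a correlation of 1 (up to floating-point rounding) even though their slopes differ by a factor of five, while the scattered series gives a smaller positive value.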

Some key points on correlation are:

Correlation measures the direction and strength of the linear association between two quantitative variables. The sign (positive or negative) indicates the direction, and the absolute value (close to 1 is strong, close to 0 is weak) indicates the strength
Outliers should be noted and investigated to determine whether they should be included or excluded
Correlation has symmetry: correlation of x and y is the same as correlation of y and x
Correlation is unitless and normalized

Correlation is often presented in a correlation matrix, where the correlations of the pairs of values are reported in a table.
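
One way to build such a table is sketched below. The series names `a`, `b`, and `c` and their values are hypothetical, and the `correlation` helper is written out by hand here so the example needs nothing beyond the standard library.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient using the 1/n form for covariance and standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return s_xy / (s_x * s_y)


# Hypothetical series of equal length; none of them is constant.
series = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 5.0, 4.0, 5.0],
    "c": [5.0, 3.0, 4.0, 2.0, 1.0],
}

names = list(series)
# matrix[i][j] holds the correlation between series i and series j.
matrix = [[correlation(series[r], series[c]) for c in names] for r in names]

# Print a small table with one row and one column per series.
print(" " * 6 + "".join(f"{name:>8}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<6}" + "".join(f"{value:>8.2f}" for value in row))
```

The diagonal is 1 because every series is perfectly correlated with itself, and the table mirrors across the diagonal because the correlation of x and y equals the correlation of y and x.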

Linear Regression

Correlation and covariance are quantitative measures of the strength and direction of the relationship between two variables, but they do not account for the slope of the relationship. In other words, we do not know how a change in one variable could impact the other variable.

Regression is the technique that fills this void — it allows us to make a best guess as to how one variable affects the other. The simplest linear regression allows us to fit a “line of best fit” to the scatter plot and use that line (or model) to describe the relationship between the two variables. The equation for that line is:

$latex y = \beta_{0} + \beta_{1}x + \epsilon $

Where:

y is the dependent variable
x is the independent variable
Betas are the coefficients (or constants) in the equation
- β₀ is the y-intercept of the line
- β₁ is the slope of the line
epsilon (ε) is the error (or residual) term

You have probably seen this equation many times before, in high school (y = mx + b) and in the Capital Asset Pricing Model (CAPM) (E(r_i) = r_F + (E(r_M) - r_F) * β_i).

At a high level, the equation describes how the observed data is affected by systematic relationships (β₀ + β₁x), and by “randomness” (ε). Randomness could come from measurement error, random chance, or systematic relationships not accounted for in the variables present. The regression chooses the β₀ and β₁ that minimize the sum of squared errors between the actual y values and the y values predicted by the line of best fit.
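
For the simple one-variable case, the coefficients that minimize the sum of squared errors have a well-known closed form, which ties the regression back to the covariance and correlation defined earlier. The hats are notation introduced here to mark values estimated from the data, and s_x² is the variance of x (the square of its standard deviation):

$latex \hat{\beta}_{1} = \frac{s_{xy}}{s_{x}^{2}} = \rho_{xy}\frac{s_{y}}{s_{x}}, \qquad \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x} $

In words, the slope is the correlation rescaled by the ratio of the two standard deviations, which is why the correlation alone cannot tell us the slope.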

For example, if we regress GOOGL returns on S&P 500 returns, we will find a systematic relationship between the two, described by β₁ or “beta”. We will also find that the relationship between the two is not perfectly described by the model, as there are firm-specific risks involved. If Sundar Pichai does something unethical and the stock price tanks, that cannot be accounted for by the variables present, and it goes into the error term. If Bloomberg glitches and reports a wrong number, that would also go into the error term.
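
A regression of this kind can be sketched in plain Python. The return figures below are made up for illustration and are not actual GOOGL or S&P 500 data; `fit_line` is a hypothetical helper that uses the standard least-squares formulas for a single independent variable.

```python
def fit_line(xs, ys):
    """Return (beta_0, beta_1) for the least-squares line y = beta_0 + beta_1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # The 1/n factors cancel in the ratio, so plain sums are enough.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Made-up period returns (as decimals), for illustration only.
market_returns = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
stock_returns = [0.014, -0.021, 0.020, 0.002, -0.015, 0.027]

beta_0, beta_1 = fit_line(market_returns, stock_returns)

# Residuals are the part of each stock return the fitted line does not explain.
residuals = [y - (beta_0 + beta_1 * x)
             for x, y in zip(market_returns, stock_returns)]

print(f"intercept: {beta_0:.4f}, beta: {beta_1:.2f}")
print("residuals:", [f"{e:.4f}" for e in residuals])
```

The residuals are where firm-specific events and data glitches would show up: they are whatever is left over once the market's move has been accounted for.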


Jason Oh is a management consultant at Novantas with expertise in scaling profitability for retail banks (consumer / commercial finance) and diversified financial service firms (credit card / wealth management / direct bank).

