Regression is one of the most **useful statistical methods**. It is used in **many different fields**, including economics, marketing, and psychology. It can be applied to almost any dataset, provided you have adequate variables to represent your data.

Regression finds the *best-fit line* for your data using the least-squares method. The **least-squares regression line** minimizes the sum of the squares of the errors (residuals) between the observed values and those predicted by the line.

What are errors and residuals, and why do we care about their sum? An error is the difference between an observed value and its corresponding predicted value; when the prediction comes from a fitted line, that difference is called a residual. Least squares does not make every residual zero. It minimizes the sum of the *squared* residuals, and a line fitted with an intercept always has residuals that sum to zero, even when no observation lies exactly on the line. What indicates an accurate model is that the individual residuals are small.
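A minimal sketch of this on made-up numbers: fit a line with the closed-form least-squares formulas and inspect the residuals. The data values are invented for illustration.

```python
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares slope and intercept for a simple regression.
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) \
        / sum((a - mean_x) ** 2 for a in x)
intercept = mean_y - slope * mean_x

residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]

# Individual residuals are nonzero, yet (with an intercept in the model)
# their sum is zero up to floating-point rounding.
print([round(r, 2) for r in residuals])
print(sum(residuals))  # ~0.0
```

Notice that no point has to lie exactly on the line for the residuals to cancel out in the sum; only the squared residuals tell you how tight the fit is.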

## Equivalent to zero

When the least-squares regression line is calculated, the algorithm finds a line that best fits the data points on the graph.

The algorithm finds the line whose predictions are, overall, as close as possible to the actual values: among all possible straight lines, it picks the one with the **smallest sum of squared differences** from the observed data.

What does that mean?

A best-fitting line has the least amount of error between what it predicts and what is actually observed. For example, a line that predicted every income as $0 would have a *large error*: someone making **$100,000 per year** would be told they earn $0!

The variance of the differences describes how **much individual data points differ** from their corresponding predicted values. For example, if all people with similar incomes sit close to the line, the differences are small and tightly clustered, so their variance is low.
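A sketch of this idea on hypothetical data: compare two candidate lines by the variance of their residuals. The slope/intercept pairs below are made up for illustration, not fitted values.

```python
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]

def residual_variance(slope, intercept):
    """Variance of the differences between observed and predicted y."""
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    mean = sum(residuals) / len(residuals)
    return sum((r - mean) ** 2 for r in residuals) / len(residuals)

var_good = residual_variance(2.0, 0.0)  # a line that tracks the data
var_flat = residual_variance(0.0, 5.0)  # a flat line that ignores x
print(var_good, var_flat)  # the better-fitting line has the smaller variance
```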

## Maximally spread out

A **linear regression model** assumes that the dependent variable (y-axis) is a *linear function* of the independent variables (x-axis).

You can imagine this as drawing a straight line that passes as close as possible to all of the points on the graph.

A linear model always produces a straight line, so if the underlying relationship is curved, the line will systematically miss the data. Linear regression fits well only when a straight line can track the data.

How can we determine if a linear regression model is good or not? The answer lies in the residuals, which are simply the difference between an observation’s y-value and the predicted y-value corresponding to a given x-value.

## Dependent on the Y value

Residuals are the differences between the **observed dependent variable values** and the **predicted dependent variable values** based on the independent variables.

As mentioned before, in least-squares regression, the *predicted value* for the dependent variable is found by fitting a line to all of the data points.

So, in other words, fitting a regression line produces an equation: plug in any x-value and it returns the predicted y-value.
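To make that concrete, here is a tiny sketch of a fitted line acting as a prediction function. The coefficients are hypothetical, as if produced by a least-squares fit:

```python
# Hypothetical coefficients, as if returned by a least-squares fit.
slope, intercept = 1.96, 0.14

def predict(x):
    """The regression line: a predicted y-value for any x-value."""
    return slope * x + intercept

print(round(predict(3.0), 2))   # a prediction for an observed x
print(round(predict(10.0), 2))  # and equally for an unseen x
```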

The residuals are just the differences between these two numbers. More simply put, a residual is what remains when you take an observed y-value and subtract the y-value the line predicts for the same x.

What does depend on the y-values is the size of the residuals: the more accurate the predictions, the smaller the residuals will be.

## Independent of the X values

A residual is the difference between a data point’s value and the value predicted by the **regression line**. For example, if a data point has a y-value of 3 and the *regression line predicts* a y-value of 2, then the residual is 1.

Fitting a line does not make all residuals zero; what a healthy model shows is residuals that are **independent of the x-values**. A common way to check this is to plot the residuals against x: they should look like random scatter, with no trend or pattern. If they are not, then your model may have an issue.

The best way to understand this concept is through an example. Suppose the residuals grow steadily as x increases from 1 to 3. That trend means the residuals depend on x, which suggests the straight line is missing structure in the data. Patternless residuals, by contrast, indicate the linear model has captured the relationship.
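One way to sketch this check in code, on made-up data: fit a line by the usual closed-form formulas, then measure how the residuals co-vary with x. For a least-squares fit with an intercept this covariance is zero up to rounding; a clearly nonzero value on real data would signal a pattern.

```python
# Toy data and a hand-rolled simple linear regression fit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.8, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
        / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]

# Covariance between x and the residuals: ~0 for a least-squares fit.
cov_xr = sum((a - mx) * r for a, r in zip(x, residuals)) / n
print(cov_xr)
```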

## The predicted Y values should be close to the observed Y values

A good way to check if your regression line is correct is to see if the predicted y values are close to the observed y values.

By x values, we mean the value that goes into the equation for the regression line. In this case, it is the weight of the fish. By y values, we mean the value that comes out of the equation for the regression line. In this case, it is the length of the fish.

If all the x values were equal (say 1 kg), the regression would have no variation to learn a slope from: the fitted equation would collapse to a constant such as Y = 1, and every fish would be predicted to have a length of 1 m no matter what its weight was. Useful predictions require variation in x.

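A short sketch on made-up fish data (weights in kg, lengths in m, all values hypothetical): when x varies, the fitted line's predicted lengths stay close to the observed lengths.

```python
# Hypothetical fish measurements: weight (kg) as x, length (m) as y.
weights = [0.5, 1.0, 1.5, 2.0, 2.5]
lengths = [0.30, 0.42, 0.55, 0.66, 0.80]

n = len(weights)
mx = sum(weights) / n
my = sum(lengths) / n
slope = sum((w - mx) * (l - my) for w, l in zip(weights, lengths)) \
        / sum((w - mx) ** 2 for w in weights)
intercept = my - slope * mx

# Predicted length vs. observed length for each fish.
for w, l in zip(weights, lengths):
    print(w, round(slope * w + intercept, 3), l)
```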