4.2.2.5 Lab - Evaluating Fit Errors in Linear Regression Answers

Objectives

In this lab, you will become familiar with the concepts of evaluating fit errors in linear regression.

Part 1: Import the Libraries and Data
Part 2: Calculate the Errors

Scenario / Background

In statistics, linear regression is a way to model the relationship between a dependent variable y and an independent variable x. The goal of regression is to find a model that describes the data as accurately as possible.

In this lab, you will use the sales data and the result of the linear regression from a previous lab to evaluate the accuracy of the model.

Required Resources

1 PC with Internet access
Python libraries: pandas, numpy, and sklearn
Datafiles: stores-dist.csv

Part 1: Import the Libraries and Data

In this part, you will import the libraries and the data from the file stores-dist.csv.

Step 1: Import the libraries.

In this step, you will import the following libraries:

numpy as np
pandas as pd

# Code Cell 1

# This lab produces some minor warnings that can be ignored.
# They appear because some libraries are updated more often than others,
# and the system is letting the user know that some functions will be deprecated soon.
# The following code prevents the warnings from being displayed; comment it out
# to see the warnings.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

Step 2: Import the data.

In this step, you will import the data from the file stores-dist.csv, change the column headings, and verify that the file is imported correctly.

The column headings annual net sales and number of stores in district are renamed to make data processing easier:

annual net sales to sales
number of stores in district to stores

# Code Cell 2

# Import the file stores-dist.txt
salesDist = pd.read_csv('./Data/stores-dist.txt')

# Change the column headings
salesDist.columns = ['district','sales','stores']

# Verify the imported data
# ...

The district column is not necessary for evaluating the linear regression fit, so it can be dropped.

# Code Cell 3
# Drop the district column.
sales = salesDist.drop('district',axis=1)

# Verify that the district column has been dropped.
# ...

Part 2: Calculate the Errors

In this part, you will use numpy to generate a regression line for the analyzed data. You will also calculate the centroid for this dataset. The centroid is the point whose coordinates are the mean of x and the mean of y. A simple linear regression line fitted by least squares always passes through the centroid.

You will also use sklearn.metrics to evaluate the linear regression model. You will calculate the R2 score, the mean squared error (MSE), and the mean absolute error (MAE).
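The centroid property can be checked numerically. The following is a minimal sketch that uses hypothetical sample values (not the data from the lab file) and the illustrative names x_demo, y_demo, and line, so that it does not overwrite the lab variables:

```python
import numpy as np

# Hypothetical sample data, used only for illustration
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Centroid: the point (mean of x, mean of y)
centroid_x = x_demo.mean()
centroid_y = y_demo.mean()

# Least-squares line of order 1
line = np.poly1d(np.polyfit(x_demo, y_demo, 1))

# The fitted line evaluated at the mean of x should equal the mean of y
passes_through_centroid = bool(np.isclose(line(centroid_x), centroid_y))
print(centroid_x, centroid_y, passes_through_centroid)
```

If the check prints False for a first-order least-squares fit, the x and y arrays are most likely misaligned.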

Step 1: Assign the x and y variables.

Assign sales from the dataframe as the dependent variable y, and stores from the dataframe as the independent variable x.

# Code Cell 4
# dependent variable for y axis
y = sales.sales
# independent variable for x axis
#x = ...

Step 2: Calculate the y values in the model

In a previous lab, you fitted a polynomial model with np.polyfit, which returns the vector of coefficients that minimizes the squared error. By wrapping that vector in np.poly1d, you get a polynomial p that computes the corresponding y value for each value of x in the estimated model.

To recall the slope and y-intercept of the line, use the variable p. The coefficients of p are listed in descending order of power: for a first-order polynomial, the first coefficient is the slope (m) and the second coefficient is the y-intercept (b). Indexing works by power instead, so p[1] is the slope and p[0] is the y-intercept.

# Code Cell 5
# compute the y values from the polynomial model for each x value
order = 1
p = np.poly1d(np.polyfit(x, y, order))

print('The array p(x) stores the calculated y value from the polynomial model for each x value,\n\n{}.'.format(p(x)))
print('\nThe vector of coefficients p describes this regression model:\n{}'.format(p))
print('\nThe zeroth order term (y-intercept or b) is stored in p[0]: {}.'.format(p[0]))
print('\nThe first order term (slope or m) is stored in p[1]: {}.'.format(p[1]))
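Because np.poly1d stores its coefficients in descending order but is indexed by power, it is easy to mix up the slope and the intercept. The sketch below uses a hypothetical line y = 2x + 5 (the name demo is illustrative and not part of the lab) to show both views:

```python
import numpy as np

# Hypothetical line y = 2x + 5, built directly from its coefficients
demo = np.poly1d([2.0, 5.0])

print(demo.coeffs)  # coefficients in descending order of power: slope, then intercept
print(demo[1])      # coefficient of x**1 (slope m)
print(demo[0])      # coefficient of x**0 (y-intercept b)
print(demo(3))      # value of the line at x = 3
```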

Step 3: Use different measures to evaluate models.

In this step, you will use sklearn to evaluate the model. sklearn offers a range of measures. You will calculate the R2 score, mean squared error (MSE), and mean absolute error (MAE) using the functions in sklearn.metrics.

To calculate each measure, pass the values of y, the observed values imported from the file stores-dist.csv, as the first argument. As the second argument, pass the values of p(x), which were calculated from your first-order polynomial model in the form:

y = mx + b

where m is p[1] and b is p[0] in the poly1d results.

The R2 (coefficient of determination) regression score function indicates the goodness of fit of the model, that is, how well the model explains the observed outcomes. The best possible score for R2 is 1.0, and a model that always predicts the mean of y scores 0.0.

# Code Cell 6
from sklearn.metrics import r2_score
r2 = r2_score(y, p(x))
r2
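To see what the score measures, R2 can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch with hypothetical observed and predicted values (y_obs and y_pred are illustrative names, not lab variables). With default arguments, r2_score should agree with this result up to floating-point rounding.

```python
import numpy as np

# Hypothetical observed values and model predictions
y_obs = np.array([3.0, 5.0, 7.0, 10.0])
y_pred = np.array([2.5, 5.5, 7.5, 9.5])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_obs - y_pred) ** 2)

# Total sum of squares: variation of the observations around their mean
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
```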

The mean squared error (MSE) is the average of the squared differences between the observed values and the values predicted by the model. This number is always non-negative, and values closer to zero are better.

# Code Cell 7
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, p(x))
mse
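The MSE calculation is short enough to reproduce directly with numpy. The sketch below uses hypothetical values (obs and pred are illustrative names) and is meant only to show the arithmetic behind mean_squared_error:

```python
import numpy as np

# Hypothetical observed values and model predictions
obs = np.array([3.0, 5.0, 7.0, 10.0])
pred = np.array([2.0, 5.0, 8.0, 12.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4
squared_errors = (obs - pred) ** 2

# MSE is the mean of the squared errors
mse_manual = squared_errors.mean()
print(mse_manual)  # 1.5
```

Note that the MSE is expressed in the squared units of y, so it is not directly comparable to the values of y themselves.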

The mean absolute error (MAE) is a measure of how close predictions are to the observed outcomes. It is the average of the absolute differences between the predicted values and the true values.

# Code Cell 8
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, p(x))
mae
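The MAE and the MSE react differently to large errors, which can be seen in a small hand calculation. The following sketch uses hypothetical values in which one prediction is off by much more than the others (actual and predicted are illustrative names):

```python
import numpy as np

# Hypothetical observed values and predictions; the last prediction is far off
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([9.0, 21.0, 29.0, 45.0])

errors = actual - predicted  # 1, -1, 1, -5

# MAE: mean of the absolute errors
mae_manual = np.abs(errors).mean()

# MSE on the same data, for comparison
mse_compare = (errors ** 2).mean()

print(mae_manual)   # 2.0
print(mse_compare)  # 7.0
```

Squaring gives the single large error far more weight in the MSE than in the MAE, so the MAE is the less sensitive of the two to outliers.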

All these measures help you determine how well your model can make predictions. In this lab, you evaluated only one model, simple linear regression.