Hands On ML

Chapter 2 End to End ML Project

Selected Performance Measure

Root Measure Square Error
$$
RMSE(X,h)=\sqrt{\frac{1}{m}(h(x^i)-y^i)^2}
$$
generally the preferred performance measure for regression tasks

Mean Absolute Error
$$
MAE(X,h) =\frac{1}{m}\displaystyle\sum_{i=1}^{m}|h(x^i)-y^i|
$$
If there are many outlier districts

Check assumptions

make sure the task is about regression or classification, actually, if a task is to put the exist data into exist category , it is the classification task.

Check dataset

1
2
3
4
5
6
import pandas as pd

housing = load_housing_data()
housing.head() // describe first 4 datas of the dataset
housing.info()// describe numerical types of the dataset
houding.describe()// describe the distribution of datas

Create Test Set

we have to pick randomly for a subset of raw data as the test set to prevent overfitting. However there are some considerations

  • If we purely use random method , next time we run the code , the test set may pick up some samples from learning set.
  • we have to ensure our test set have cover all kinds of data (near) equally.

So the basic way to pick up random data and ensure the same data picked up next time we run the code is using hash function by ourselves , or using library like scikit.

1
train_test_split(housing,test_size=0.2,random_state=42)

and we have to shuffle the test data to fit all the category.

1
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)

Data Visualization

A good way to check the data is visualization for data correlations

we can visual the data in combination way: the Longitude Latitude and population in circle, also with the price

1
2
3
4
housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,
s=housing["population"]/100,label="population",figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"),colorbar=True,)
plt.legend()

Looking for Correlations

we can use

housing.corr() looking for linear correlation ships between one to several data sets.

in this case, is the median house value to all the sets of data.

The correlations may go completely wrong on none linear relations .

Another way to check correlation between attributes is to use scatter_matrix from pandas library

1
housing.plot(kind="scatter",x="median_income",y="median_house_value",alpha=0.1)

Try out Attribute Combinations

we can add new attribute to the data in need

lets add rooms per household, bedrooms per room and population per household to data

1
2
3
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"]=housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

and show it using the correlation method again

1
2
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

Preparing Data

Instead of doing it manually, we do it using functions.

create a copy of training set from original one.

1
2
housing = strat_train_set.drop("median_house_value",axis = 1)
housing_labels = strat_train_set["median_house_value"].copy()

Data Cleaning

most of the machine learning stuff cannot work with missing data, so to duel with this there are three optins

  • get rid of the corresponding districts
  • get rid of whole attribute
  • set the value to some value

For these options, we can use dropna(), drop() , and fillna() to accomplish task.

1
2
3
4
housing.dropna(subset=["total_bedrooms"]) //option1
housing.drop("total_bedrooms",axis=1) //option2
median = housing["total_bedrooms"].median() //option3
housing["total_bedrooms"].fillna(median,inplace=True)

If we choose to use option 3 this time by using imputer API in scikit .

1
2
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")

now fit the imputer to tanning data

1
2
housing_num = housing.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)

to check the values are missing we have to check the result of fit in statistics_ instance, and we also want to ensure there is no missing values , so also apply imputer to all the numerical attributes.

now we can use the trained imputer to replace the missing vlaues

1
2
x = imputer.transform(housing_num)
housing_tr = pd.DataFrame(x,columns=housing_num.columns)

The result can put into pandas data frame

1
housing_tr = pd.DataFrame(x,columns=housing_num.columns)

Handling Text and Categorical Attributes

for none numerical attribute we have two tries , either abandoned it or change it from text to numerical type.

A basic encoding method is to use pandas factorize() method to mapping text to number

1
2
3
housing_cat_encoded,housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
>>>>array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

and we also can check the catagories

1
2
housing_categories
>>Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

and we can notice that the encoding is not so robust as the <1H OCEAN is encode into 1 and Near OCEAN is also encode into 1 , and we can tell they are not similar.

Scikit provide another solution for this

1
2
3
4
5
from sklearn.preprocessing import CategoricalEncoder # or get from notebook
cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1,1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot

this will transform from text to 2-D array in one shot.

Custom Transformers

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from sklearn.base import BaseEstimator, TransformerMixin

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
list(housing.columns).index(col)
for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Feature Scaling

the ML algorithm didn’t perform well on different scale of the numerical , for e.g. the number of rooms ranges from about 6 to 39320 while the median income only in range of 0 to 15. There are two ways to solve this situation:

  1. min-max scaling
  2. standarlizaiton

min-max scaling or normalization: shift all the scale into ranging from 0 to 1. we do this by subtracting the min value and dividing by the max minus the min.

standardization: first subtract the mean value, and divides by the variance so the resulting distribution has unit variance.

Transformation Pipelines

1
2
3
4
5
6
7
8
9
10
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
('imputer', Imputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

Select and Train model

Training and Evaluating on the training set

now we can implemented model after all manipulating to our data. Lets try the Linear regression model.

1
2
3
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)

and show the results

1
2
3
4
5
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
1
2
Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
189747.55849879]

let’s check the root mean square error

1
2
3
4
5
6
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
1
68628.19819848922

not very good result using the prediction compare to our original result

let’s try Decision Tree Regressor model

1
2
3
4
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

and check the RMSE result

1
2
3
4
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
1
0.0

0.0? Apparently we are overfitting our model, this is not a good result.

Better Evaluation Using Cross-Validation

But how could we confirmed 0.0 is due to overfitting instead this model is very “good” ? In order to distinguish this situation , we should use Cross validation method to our training set.

This method will again , separate our training set randomly into test set and compare to the prediction and result.

1
2
3
4
5
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

let’s try Decision tree again

1
2
3
4
5
6
def display_scores(scores):
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
1
2
3
4
5
Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
71115.88230639 75585.14172901 70262.86139133 70273.6325285
75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

now we can see the RMSE is not so good now, we can confirmed the previous result is due to overfitting.

Let’s try another model Random Forest Regressor

1
2
3
4
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

check is RMSE

1
2
3
4
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
1
21933.31414779769

and also using the cross validation method

1
2
3
4
5
6
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
1
2
3
4
5
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
50922.14870785 56482.50703987 51864.52025526 49760.85037653
55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122

now we can see the MEAN is much better than the Decision Tree’s MEAN.

Fine-Tune Your Model

There are may be many combinations of hyperparameter if we search manually thus we can use the API provide by scikit

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from sklearn.model_selection import GridSearchCV

param_grid = [
# try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

and show the result

1
grid_search.best_params_
1
{'max_features': 8, 'n_estimators': 30}

we can also let it show the best estimator directly

1
2
3
4
5
6
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=30, n_jobs=None, oob_score=False, random_state=42,
verbose=0, warm_start=False)

Also we can let it show the all results

1
2
3
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

if we are dueling with only a few parmeters then Grid Search is fine, otherwise we may consider using randomized search with RandomizedSearchCV

with this method you can

  • control the iteration numbers
  • more combination of parameters
1
2
3
4
5
6
7
8
9
10
11
12
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
'n_estimators': randint(low=1, high=200),
'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
1
2
3
4
5
6
7
8
9
10
11
12
RandomizedSearchCV(cv=5, error_score='raise-deprecating',
estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_iter=10, n_jobs=None,
param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002055C0CFA58>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000002057D30FCF8>},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score='warn', scoring='neg_mean_squared_error',
verbose=0)
1
2
3
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)

and shows the results

1
2
3
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
1
2
3
4
5
6
7
8
9
10
49150.657232934034 {'max_features': 7, 'n_estimators': 180}
51389.85295710133 {'max_features': 5, 'n_estimators': 15}
50796.12045980556 {'max_features': 3, 'n_estimators': 72}
50835.09932039744 {'max_features': 5, 'n_estimators': 21}
49280.90117886215 {'max_features': 7, 'n_estimators': 122}
50774.86679035961 {'max_features': 3, 'n_estimators': 75}
50682.75001237282 {'max_features': 3, 'n_estimators': 88}
49608.94061293652 {'max_features': 5, 'n_estimators': 100}
50473.57642831875 {'max_features': 3, 'n_estimators': 150}
64429.763804893395 {'max_features': 5, 'n_estimators': 2}

Ensemble Methods

Another way to fine tune system is to combine the models that perform best, because compare to individual model this will usually perform better.

Analyze the best Models and Their Errors

For random Forest Regressor can indicate the relative importance of each attribute for making accurate predictions.

1
2
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
1
2
3
4
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

and display these importance scores next to their attribute names:

1
2
3
4
5
6
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[(0.3661589806181342, 'median_income'),
(0.1647809935615905, 'INLAND'),
(0.10879295677551573, 'pop_per_hhold'),
(0.07334423551601242, 'longitude'),
(0.0629090704826203, 'latitude'),
(0.05641917918195401, 'rooms_per_hhold'),
(0.05335107734767581, 'bedrooms_per_room'),
(0.041143798478729635, 'housing_median_age'),
(0.014874280890402767, 'population'),
(0.014672685420543237, 'total_rooms'),
(0.014257599323407807, 'households'),
(0.014106483453584102, 'total_bedrooms'),
(0.010311488326303787, '<1H OCEAN'),
(0.002856474637320158, 'NEAR OCEAN'),
(0.00196041559947807, 'NEAR BAY'),
(6.028038672736599e-05, 'ISLAND')]

with these info we can decide which parameter to drop.

Evaluate Your System on the Test Set

1
2
3
4
5
6
7
8
9
10
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
1
final_rmse
1
47730.22690385927

We can compute a 95% confidence interval for the test RMSE:

(A confidence interval gives an estimated range of values which is likely to include an unknown population parameter)

http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

1
2
3
4
5
6
7
8
9
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)

np.sqrt(stats.t.interval(confidence, m - 1,
loc=np.mean(squared_errors),
scale=stats.sem(squared_errors)))
1
array([45685.10470776, 49691.25001878])
Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×