Root Mean Square Error $$ \mathrm{RMSE}(X,h)=\sqrt{\frac{1}{m}\displaystyle\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2} $$ is generally the preferred performance measure for regression tasks.
Mean Absolute Error $$ \mathrm{MAE}(X,h)=\frac{1}{m}\displaystyle\sum_{i=1}^{m}\left|h(x^{(i)})-y^{(i)}\right| $$ may be preferable if there are many outlier districts, since it is less sensitive to large errors than RMSE.
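Both measures are easy to verify numerically. A minimal NumPy sketch (the toy prediction and label arrays below are made up for illustration):

```python
import numpy as np

def rmse(predictions, labels):
    # square the errors, average over the m instances, then take the root
    return np.sqrt(np.mean((predictions - labels) ** 2))

def mae(predictions, labels):
    # average of the absolute errors; less sensitive to outliers than RMSE
    return np.mean(np.abs(predictions - labels))

predictions = np.array([3.0, 5.0, 2.5])
labels = np.array([2.5, 5.0, 4.0])
print(rmse(predictions, labels))
print(mae(predictions, labels))
```

Because RMSE squares the errors before averaging, a single large outlier error inflates it far more than it inflates MAE.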
Check assumptions
Make sure whether the task is regression or classification: if the task is to put existing data into existing categories, it is a classification task.
Check dataset
```python
import pandas as pd

housing = load_housing_data()
housing.head()      # show the first 5 rows of the dataset
housing.info()      # show each column's type and non-null count
housing.describe()  # summarize the distribution of the numerical attributes
```
Create Test Set
We have to pick a random subset of the raw data as the test set, so that we can evaluate the model on data it has never seen. However, there are some considerations:

If we use a purely random method, the next time we run the code the test set may pick up samples that were previously in the training set.

We have to ensure our test set covers all kinds of data (nearly) equally.

So the basic way to pick random data, while ensuring the same data is picked the next time we run the code, is to hash each instance's identifier ourselves, or to use a library function such as Scikit-Learn's with a fixed random seed.
To check which values were filled in, we can inspect the imputer's statistics_ attribute after calling fit. Since we cannot be sure future data will have no missing values, it is safer to apply the imputer to all the numerical attributes.
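A minimal sketch of fitting the imputer and reading statistics_ (using SimpleImputer, the modern name of the Imputer class; the two-column toy DataFrame stands in for the numerical housing attributes):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy stand-in for housing_num, with one missing value per column
housing_num = pd.DataFrame({"rooms":  [2.0, np.nan, 6.0],
                            "income": [1.5, 3.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# the per-column medians learned by fit() live in statistics_
print(imputer.statistics_)
```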
Now we can use the trained imputer to replace the missing values:
```python
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
```
We can notice that this encoding is not very robust: the numeric codes impose an ordering and a notion of distance that do not reflect the real similarity between categories such as "<1H OCEAN" and "NEAR OCEAN".
Scikit-Learn provides another solution for this:
```python
from sklearn.preprocessing import CategoricalEncoder  # or get it from the notebook

cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
```
This transforms the text categories into a one-hot 2-D array in one shot.
```python
from sklearn.base import BaseEstimator, TransformerMixin

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]
```
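These imports and indices are typically used to build a small custom transformer that derives combined attributes. A self-contained sketch under those assumptions (the class name, the add_bedrooms_per_room flag, and the hard-coded indices for the toy matrix are illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# assumed column positions in the toy matrix below
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # derive ratio attributes from the raw counts
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# usage on a toy 2x7 matrix (values 1..14, so no division by zero)
X = np.arange(1.0, 15.0).reshape(2, 7)
X_extra = CombinedAttributesAdder().fit_transform(X)
```

Inheriting from BaseEstimator and TransformerMixin gives the class get_params/set_params and fit_transform for free, so it can drop straight into a pipeline.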
ML algorithms don't perform well when the numerical attributes have very different scales; for example, the total number of rooms ranges from about 6 to 39,320, while the median income only ranges from 0 to 15. There are two common ways to handle this:
min-max scaling
standardization
min-max scaling (normalization): shifts all values into the range 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
standardization: first subtracts the mean value, then divides by the standard deviation so the resulting distribution has zero mean and unit variance.
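Both are available as Scikit-Learn transformers. A toy sketch (the single-column array is made up; note how the outlier 10.0 squashes the min-max-scaled values but affects standardization less drastically):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# min-max scaling: (x - min) / (max - min), so values land in [0, 1]
minmax = MinMaxScaler().fit_transform(X)

# standardization: (x - mean) / std, so the result has
# zero mean and unit variance (no fixed output range)
standard = StandardScaler().fit_transform(X)

print(minmax.ravel())
print(standard.ravel())
```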
Transformation Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
```
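These imports are typically combined into a numerical pipeline. A minimal runnable sketch, assuming a median imputer as the first step and a toy DataFrame standing in for the numerical housing attributes:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# toy stand-in for housing_num
housing_num = pd.DataFrame({"rooms":  [2.0, np.nan, 6.0],
                            "income": [1.5, 3.0, 4.5]})

# each step's output feeds the next: impute medians, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```

The pipeline exposes the same fit/transform interface as a single transformer, so it can later be combined with the categorical branch or fed directly to a model.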
0.0? Apparently we are overfitting our model; this is not a trustworthy result.
Better Evaluation Using Cross-Validation
But how can we confirm that the 0.0 is due to overfitting rather than the model genuinely being this good? To distinguish these cases, we should apply cross-validation to our training set.

This method repeatedly splits the training set into folds, trains on all but one fold, and compares the predictions against the held-out fold.
```python
from sklearn.model_selection import cross_val_score
```

```
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122
```
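A call that produces output of this shape can be sketched as follows; the regressor and the synthetic data here are stand-ins for the housing model and training set:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data in place of housing_prepared / housing_labels
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.rand(200)

tree_reg = DecisionTreeRegressor(random_state=42)
# scoring expects a utility (higher is better), so MSE comes back negated;
# flip the sign and take the root to get per-fold RMSE
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())
```

With cv=10 the model is trained 10 times, each time evaluated on a fold it never saw, so a perfectly memorized training set can no longer produce a misleading 0.0.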
Now we can see that this mean RMSE is much better than the Decision Tree's mean.
Fine-Tune Your Model
Grid Search
There may be many combinations of hyperparameters; rather than searching them manually, we can use the API provided by Scikit-Learn:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
```
and show the result:

```python
grid_search.best_params_
```

```
{'max_features': 8, 'n_estimators': 30}
```
We can also ask it for the best estimator directly.
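For example, a self-contained sketch on synthetic data (the tiny grid and the data are made up; best_estimator_ is the actual GridSearchCV attribute):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
X = rng.rand(60, 4)
y = X.sum(axis=1)

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           {'n_estimators': [3, 10]}, cv=3,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# best_estimator_ is the model refit on the whole training set
# with the best hyperparameter combination found by the search
best_model = grid_search.best_estimator_
print(best_model)
```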