Model Selection: A Classification Approach
Introduction
The beauty of datasets is that different methods can be applied to gain insights. The validity of these insights is evaluated using metrics such as accuracy. In the case of predicting a feature with an imbalanced distribution, an f1 score is most appropriate, as it is more important to evaluate the number of falsely predicted positives and negatives.
Data
The data used is found on the UCI Machine Learning Repository and is a mixture of categorical and numerical fields related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.
Method
The classification goal is to find out whether a client will subscribe to a product called a term deposit. To achieve this, we evaluate three different solutions: Logistic Regression, Gradient Boosting and a feed-forward network. We then score them using the f1 and cross-validation scores to pick the best model. The code used can be found here.
Preprocessing
Encode Categorical Variables
In order to get the categorical columns into a corresponding numerical format, we use one-hot encoding, which creates a column for each category and fills it with '1' or '0' where present. This is preferred over label encoding, which assigns a hierarchy (1, 2, …) to the categories and would result in misleading weight assignments in columns such as job, whose categories, such as blue-collar and white-collar, have no inherent hierarchy. Pandas offers the get_dummies() function that automates this.
import pandas as pd

converted_df = pd.get_dummies(df)
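To see what this does, here is a minimal sketch on a hypothetical toy frame with a job column (the toy data is an assumption for illustration, not the actual dataset):

import pandas as pd

# toy frame: one categorical column with two categories
toy = pd.DataFrame({"job": ["blue-collar", "white-collar", "blue-collar"]})
# get_dummies replaces job with one indicator column per category,
# e.g. job_blue-collar and job_white-collar
print(pd.get_dummies(toy))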
Scaling
Variables measured at different scales contribute unequally to model fitting, and this may lead the trained model to develop a bias. In order to level the training field, we bring all our variables into a common range using a minimum and maximum scale.
from sklearn.preprocessing import MinMaxScaler

col = df[[column]]  # MinMaxScaler expects a 2-D input, hence the double brackets
scaler = MinMaxScaler()
num2 = scaler.fit_transform(col)
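In practice one would scale all the numeric columns at once; a minimal sketch, assuming converted_df is the one-hot encoded frame from the previous step:

from sklearn.preprocessing import MinMaxScaler

# select only the numeric columns and rescale them in place to [0, 1]
numeric_cols = converted_df.select_dtypes(include="number").columns
converted_df[numeric_cols] = MinMaxScaler().fit_transform(converted_df[numeric_cols])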
Choosing Features for Training
At this point, the data is made up of more than 40 columns, each of which impacts our analysis to a different degree. However, the more features a model is fed, the more likely it is to learn the training set too well and fail on a different set (over-fitting). To reduce this, a principal component analysis (PCA) approach is useful, as it returns the components that capture the most variation in the data.
from sklearn.decomposition import PCA

pca_data = PCA(n_components=components).fit(df)
transformed = pca_data.transform(df)
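A common heuristic for choosing components (a sketch, not part of the original pipeline) is to keep enough of them to explain a fixed share of the variance; scikit-learn accepts a float n_components for exactly this:

from sklearn.decomposition import PCA

# inspect how much variance each successive component adds
pca_full = PCA().fit(df)
print(pca_full.explained_variance_ratio_.cumsum())

# keep the smallest number of components explaining ~95% of the variance
pca_data = PCA(n_components=0.95).fit(df)
transformed = pca_data.transform(df)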
Modelling
The following part makes extensive use of modules from the scikit-learn package.
Split Data
For the purpose of training and testing, we use the scikit-learn function model_selection.train_test_split(), which takes as a parameter the desired portion of the data to be withheld for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
Logistic Regression with cross-validation
For cross-validation, we use LogisticRegressionCV(), which uses the default k=5 for the number of folds and offers the method predict() and the attribute scores_, which return an array with the predictions on the testing set and the score of each fold against its held-out data, respectively.
from sklearn.linear_model import LogisticRegressionCV

lr = LogisticRegressionCV(random_state=0).fit(X_train, y_train)
prediction = lr.predict(X_test)
# get the cross validation scores
lr.scores_
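Since lr.scores_ maps each class to an array of shape (n_folds, n_Cs), a short sketch for summarising it (the loop is illustrative, not part of the original code):

import numpy as np

# average the fold scores for each regularisation strength tried
for cls, fold_scores in lr.scores_.items():
    print(cls, np.mean(fold_scores, axis=0))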
Gradient Boost with cross-validation
“Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made”
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and are then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. We further validate this by scoring the model on each cross-validation fold.
import xgboost as xgb

# convert the dataset into a DMatrix, which gives performance and efficiency gains
data_dmatrix = xgb.DMatrix(data=X, label=y)

xg_reg = xgb.XGBClassifier(objective='binary:logistic', colsample_bytree=0.3,
                           learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

params = {"objective": "binary:logistic", 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics="auc", as_pandas=True, seed=123)
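With as_pandas=True, xgb.cv returns a DataFrame of per-round train/test metrics; a quick way to read off the cross-validated AUC (the column name follows from the metrics="auc" setting above):

# the last row reflects the best boosting round kept by early stopping
print(cv_results["test-auc-mean"].iloc[-1])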
Feed Forward Network
Finally, we use a multi-layer perceptron that fits the training data and has a predict() function that returns the predictions.
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
preds = clf.predict(X_test)
Evaluate Models
To understand the f1 score, it is essential to first understand precision and recall. Precision is the number of correctly identified positive results divided by the number of all results predicted positive, while recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive. This is clearly explained in Shruti Saxena's post here.
The f1 score is the harmonic mean of the precision and recall and is defined as:

f1 = 2 × (precision × recall) / (precision + recall)
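For example, a model with precision 0.5 and recall 1.0 gets f1 = 2 × (0.5 × 1.0) / (0.5 + 1.0) ≈ 0.67, noticeably below the arithmetic mean of 0.75: the harmonic mean punishes the weaker of the two measures.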
We calculate the f1 score for our cross-folded result set. This is not ideal on a k-fold validated set with imbalanced data, as the chance of a division by zero (a fold with no predicted positives) is high. Instead, a stratified cross fold is more suitable, as discussed here.
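As a minimal sketch of that alternative (reusing the logistic regression model from earlier; the scoring choice is an assumption):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# stratified folds preserve the class ratio in every fold,
# so no fold is left without positive samples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(lr, X, y, cv=skf, scoring="f1_macro")
print(scores.mean())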
“What we are trying to achieve with the F1-score metric is to find an equal balance between precision and recall, which is useful when we are working with imbalanced datasets ”
from sklearn.metrics import f1_score

f1_score(y_test, preds, average='macro')
Conclusion
From our analysis and scores, the best choice of model is the feed-forward network, which works best with this data. The model used for the logistic regression is a single-layer perceptron with a custom number of inputs and one output ranging from 0 to 1, while the multi-layer perceptron uses multiple layers, each improving on the errors of the previous one.