Practical 2


Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python

Theory:

What is feature selection?

Feature selection is one of the core concepts in machine learning and has a major impact on model performance. The features you use to train your machine learning models strongly influence the performance you can achieve.
Irrelevant or partially relevant features can negatively impact model performance.
Feature selection and data cleaning should therefore be the first and most important steps when designing a model.

Why is it important?
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Fewer data points reduce algorithm complexity and algorithms train faster.


Different Methods of Feature Selection/Elimination:


1. Variance threshold

This method removes features whose variance falls below a certain cutoff. The idea is that when a feature does not vary much within itself, it generally has very little predictive power. Note that the variance threshold does not consider the relationship of a feature with the target variable.
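
A minimal sketch (not part of the original steps) of how scikit-learn's VarianceThreshold could be applied to this practical's dataset; the 0.1 cutoff is an arbitrary illustrative value:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv("mobile_price_classification.csv")   # same file as in step 2 below
X = data.iloc[:, 0:20]                                   # independent columns

selector = VarianceThreshold(threshold=0.1)              # drop features with variance <= 0.1
X_reduced = selector.fit_transform(X)
print(X.columns[selector.get_support()])                 # names of the retained features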

2. Univariate feature selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features (demonstrated in step 4 below).

3. Recursive feature elimination

RFE begins by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again; this repeats until the specified number of features remains. Importance scores are taken from the model's coef_ or feature_importances_ attributes, and a fixed number of features is removed in each iteration.
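
A minimal sketch of RFE with scikit-learn (not part of the original steps): LogisticRegression is an assumed, illustrative estimator, and the dataset is the same file used in step 2.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("mobile_price_classification.csv")
X = data.iloc[:, 0:20]
y = data.iloc[:, -1]

# Any estimator exposing coef_ or feature_importances_ can be used here.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)
print(X.columns[rfe.support_])    # the 10 features RFE keeps
print(rfe.ranking_)               # rank 1 = selected; higher ranks were eliminated earlier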


4. PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller set of components that still contains most of the information in the original set.
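
A minimal sketch of PCA with scikit-learn (not part of the original steps); keeping 5 components is an illustrative choice, and StandardScaler is added because PCA is sensitive to feature scale:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("mobile_price_classification.csv")
X = data.iloc[:, 0:20]

X_scaled = StandardScaler().fit_transform(X)   # standardise features before PCA
pca = PCA(n_components=5)                      # keep 5 principal components
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)           # share of variance captured by each component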


5. Correlation

Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in one feature's value increases the value of the target variable) or negative (an increase in one feature's value decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library (see step 6 below).

Dataset Description:

I have taken the mobile price classification dataset, which has the following features.

battery_power: Total energy a battery can store at one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last during a continuous call
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost).



1. Import necessary packages
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier



2. Import the dataset file

data = pd.read_csv("mobile_price_classification.csv")
data




3. Extract the dependent and independent variables, create training and testing sets, and apply KNN before feature selection

X = data.iloc[:,0:20]  # independent columns
y = data.iloc[:,-1]    # target column, i.e. price_range

X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.2)

knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)
accuracy_score(Y_test,knn.predict(X_test))

Accuracy: 0.9075


4. Apply Univariate Selection
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features




5. Accuracy after applying univariate selection

X_u=data[['ram','px_height','battery_power','px_width','mobile_wt','int_memory']]
Y_u = data['price_range']
X_train,X_test,Y_train,Y_test = train_test_split(X_u,Y_u,test_size=0.2)

del knn
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)
accuracy_score(Y_test,knn.predict(X_test))

Accuracy: 0.935 (increased)


6. Apply Correlation
# get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")



7. Accuracy after applying correlation-based selection
X_c=data[['ram','px_height','battery_power','px_width']]
Y_c = data['price_range']
X_train,X_test,Y_train,Y_test = train_test_split(X_c,Y_c,test_size=0.2)

del knn
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)
accuracy_score(Y_test,knn.predict(X_test))

Accuracy: 0.9225 (increased)


8. Apply feature importance
model = ExtraTreesClassifier()
model.fit(X,y)


#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
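
As a follow-up sketch (not in the original notebook), the highest-importance features reported above could be fed back into the same KNN pipeline used in steps 5 and 7; the exact features chosen, and the resulting accuracy, will depend on the run:

# Sketch: retrain KNN on the four most important features found above.
top_features = feat_importances.nlargest(4).index.tolist()
X_f = data[top_features]
Y_f = data['price_range']
X_train,X_test,Y_train,Y_test = train_test_split(X_f,Y_f,test_size=0.2)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)
print(accuracy_score(Y_test,knn.predict(X_test)))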



Google colab link: click here
