Practical 2
Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python.
Theory:
What is feature selection?
Feature selection is one of the core concepts in machine learning and it has a huge impact on the performance of your model. The data features you use to train your machine learning models strongly influence the performance you can achieve. Irrelevant or partially relevant features can negatively impact model performance, so feature selection and data cleaning should be the first and most important steps when designing a model.

Why is it important?
Reduces Overfitting: less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: less misleading data means modeling accuracy improves.
Reduces Training Time: fewer features reduce algorithm complexity, so algorithms train faster.

Different Methods of Feature Selection/Elimination:
1. Variance threshold
This method removes features whose variance falls below a certain cutoff. The idea is that a feature which does not vary much within itself generally has very little predictive power. Variance Threshold does not consider the relationship of features with the target variable. (A minimal usage sketch is given after this list.)

2. Univariate feature selection
Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

3. Recursive feature elimination (RFE)
RFE works by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again; this repeats until only the specified number of features remains. Feature importances are taken from the model's coef_ or feature_importances_ attribute, and a fixed number of features is removed in each iteration.

4. PCA (Principal Component Analysis)
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the original set.

5. Correlation
Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable). A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.
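Of these methods, the walkthrough below demonstrates univariate selection, correlation and feature importance on the mobile price dataset; variance thresholding is not shown there. The following is a minimal sketch of how it might look with scikit-learn's VarianceThreshold, using a small made-up DataFrame (the column names and values are purely illustrative, not taken from the dataset used later).

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 'const' never varies, so it carries no information.
X_demo = pd.DataFrame({
    'const': [1, 1, 1, 1, 1, 1],                  # zero variance
    'ram':   [512, 1024, 2048, 3000, 4000, 256],
    'fc':    [2, 5, 8, 13, 16, 0],
})

selector = VarianceThreshold(threshold=0.0)   # default threshold: drop only zero-variance features
X_reduced = selector.fit_transform(X_demo)

# Names of the columns that survive the threshold
print(list(X_demo.columns[selector.get_support()]))   # expected: ['ram', 'fc']

On real data the cutoff would normally be chosen after inspecting the per-feature variances (selector.variances_), and the features should be on comparable scales for a single cutoff to be meaningful.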
Dataset Description:
I have taken mobile price classification data with the following features.
battery_power: Total energy a battery can store at one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which the microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in Megabytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are constantly talking on the phone
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).

1. Import necessary packages.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

2. Load the dataset into a DataFrame named data.

data = pd.read_csv('train.csv')   # file name assumed; use the path to the mobile price classification CSV

3. Extract the dependent and independent variables, create the training and testing sets, and apply KNN before feature selection.

X = data.iloc[:,0:20]   # independent columns
y = data.iloc[:,-1]     # target column, i.e. price_range
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
accuracy_score(Y_test, knn.predict(X_test))

Accuracy before feature selection: 0.9075

4. Apply Univariate Selection.

# apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']   # naming the dataframe columns
print(featureScores.nlargest(10, 'Score'))   # print 10 best features

5. Accuracy after applying univariate selection.

X_u = data[['ram','px_height','battery_power','px_width','mobile_wt','int_memory']]
Y_u = data['price_range']
X_train, X_test, Y_train, Y_test = train_test_split(X_u, Y_u, test_size=0.2)
del knn
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
accuracy_score(Y_test, knn.predict(X_test))

Accuracy: 0.935 (increased)

6. Apply Correlation.

# get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
# plot heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")

7. Accuracy after applying correlation selection.

X_c = data[['ram','px_height','battery_power','px_width']]
Y_c = data['price_range']
X_train, X_test, Y_train, Y_test = train_test_split(X_c, Y_c, test_size=0.2)
del knn
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
accuracy_score(Y_test, knn.predict(X_test))

Accuracy: 0.9225 (increased)

8. Apply Feature Importance.

model = ExtraTreesClassifier()
model.fit(X, y)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Google colab link: click here
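Recursive feature elimination and PCA are described in the theory section (and PCA is imported in step 1) but are not demonstrated above. The sketch below shows one way they could be applied to the same X and y from step 3 and evaluated with KNN, in the same spirit as steps 5 and 7; it is not part of the original walkthrough, and the choice of estimator, the number of features/components, and the use of scaling are assumptions (the resulting accuracies will vary from run to run).

from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Recursive feature elimination: RFE needs an estimator that exposes coef_ or
# feature_importances_, so an ExtraTreesClassifier (already imported above) is used here.
rfe = RFE(estimator=ExtraTreesClassifier(n_estimators=100), n_features_to_select=10)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_]        # the 10 features RFE keeps
X_train, X_test, Y_train, Y_test = train_test_split(X[rfe_features], y, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
print("RFE + KNN accuracy:", accuracy_score(Y_test, knn.predict(X_test)))

# PCA: the features are on very different scales (ram in MB vs. 0/1 flags), so they are
# standardized before projecting onto the first 10 principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10).fit_transform(X_scaled)
X_train, X_test, Y_train, Y_test = train_test_split(X_pca, y, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
print("PCA + KNN accuracy:", accuracy_score(Y_test, knn.predict(X_test)))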