In the article published last week, we went through the concepts and nuances of various classification algorithms. To cross all the t's, it makes sense to actually get down to coding them against real problems. I went looking for a Kaggle competition that hadn't seen much discussion, which means very little community participation. Somehow the first thing that caught my eye was Predicting which candidates will attend the interview, and I took the bait. Well, it turned out the data was not in the best shape, so a lot of time went into cleaning it.

For those of you who aren't familiar with ML, I wanted to present the results up front, so that you have the necessary information before you start wandering through the details:

So in this case XGB, at 70% accuracy, performs best.

I haven't dotted all the i's yet, so in the next article I will focus on XGB alone to improve its performance on this data. What I intend to do is: first, tune the hyperparameters with cvsearch to pick the best values; second, implement a pipeline to avoid data leakage; and third, build an ensemble model to squeeze some further improvement out of it; a rough sketch of that setup follows below. Thank you for reading. Please feel free to connect with me on Twitter and LinkedIn.
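To make the plan concrete, here is a minimal sketch of what that tuning-plus-pipeline step might look like. I am reading "cvsearch" as scikit-learn's cross-validated parameter search (GridSearchCV), and the parameter grid below is purely illustrative, not a tuned result:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# The scaler and the classifier live in one Pipeline, so each CV fold fits the
# scaler on its own training split only; that is what prevents data leakage.
pipe = Pipeline([('scale', StandardScaler()), ('xgb', XGBClassifier())])
param_grid = {  # illustrative values only
    'xgb__max_depth': [3, 5, 7],
    'xgb__n_estimators': [100, 300],
    'xgb__learning_rate': [0.05, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
# search.fit(X, Y)  # with X and Y as prepared by the script below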

If you enjoy coding, you can reproduce the results by running the code on your own system. Download the data from the competition page. Cut and paste this code into your IDE and save it in the same directory as the data file. I used Python 3 and scikit-learn.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 17 09:04:39 2019
@author: sshekhar
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from matplotlib import pyplot
def clean_date(date):
    # Normalize the free-form date strings into a uniform d/m/y shape
    date = date.str.strip()
    date = date.str.split("&").str[0]                # keep only the first date when two are listed
    date = date.str.replace('–', '/')
    date = date.str.replace('.', '/', regex=False)   # literal dot; as a regex '.' would match everything
    date = date.str.replace('Apr', '04')
    date = date.str.replace('-', '/')
    date = date.str.replace(' ', '/')
    date = date.str.replace('//+', '/', regex=True)  # collapse runs of slashes into one
    return date
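# A quick illustration with a made-up sample that mirrors the formats in this file:
#   clean_date(pd.Series(['13.04.2016 & 14.04.2016'])) yields '13/04/2016/'
# (only the first date survives the '&' split; the trailing '/' comes from the space
#  before '&' and is one of the date problems dealt with below)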
df_raw = pd.read_csv('./Interview.csv')
df_raw.head()
# Removing empty variables
# I'll go ahead and put all this work in a new df so I have an original copy if I need to go back for any reason.
interview_df = df_raw.drop(['Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27'], axis = 1)
# Renaming variables to strings that are a little easier to work with.
interview_df.columns = ['Date', 'Client', 'Industry', 'Location', 'Position', 'Skillset',
                        'Interview_Type', 'cand_ID', 'Gender', 'Cand_Loc', 'Job_Loc', 'Venue',
                        'Native_Loc', 'Permission', 'Unsch_meeting', 'Pre_interview_call', 'Alt_phone',
                        'Resume_Printout', 'Clarify_Venue', 'Interview_call_Letter', 'Expected', 
                        'Attended', 'Marital_Status']
print(interview_df.shape)
print(interview_df.head())
#Let's lowercase every column value and strip any surrounding whitespace (one pass does both)
interview_df = pd.concat([interview_df[c].astype(str).str.lower().str.strip() for c in interview_df.columns], axis = 1)
#Clean the date column
interview_df['Date'] = clean_date(interview_df['Date'])
print(interview_df['Date'].unique())
#One or more rows have a null date value which has to be removed
#Let's find all the columns that have null values
print(interview_df.loc[:, interview_df.isna().any()])
#Row 1233 is a null row, so let's drop it
interview_df.drop(interview_df.index[[1233]], inplace = True)
#There are 3 more problems with the date column: one, some of the years are 2 digits and others are 4;
#two, some of the dates are projected into the future (2020, 2021, 2022 and 2023); and three, some of the years have a trailing '/'
#To address problem one, I will break the date column down into three columns 'day', 'month' and 'year', then prepend '20' to all the 2-digit year values
#To address problem two, I will replace all the future year values with '2019', because someone apparently dragged a '=previous_value+1' formula through the sheet
#I will ignore the third problem because the solution to the first one takes care of it
interview_df['day'] = interview_df['Date'].str.split("/").str[0]
interview_df['month'] = interview_df['Date'].str.split("/").str[1]
interview_df['year'] = interview_df['Date'].str.split("/").str[2]
print(interview_df['year'].unique())
future_years=['2020','2021','2022','2023']
print(interview_df.loc[interview_df['year'].isin(future_years)])
interview_df['year'].replace(['16', '15','2020','2021','2022','2023'], ['2016', '2015','2019','2019','2019','2019'], inplace = True)
# Finally I create the new date column using cleaned values
interview_df['date'] = pd.to_datetime(pd.DataFrame({'year': interview_df['year'],'month': interview_df['month'],'day': interview_df['day']}), format = '%Y-%m-%d', errors='coerce')
#Make sure the date column really is datetime: pd.to_datetime already returns datetime64[ns],
#and errors='coerce' has turned anything unparseable into NaT, so no extra cast is needed
interview_df.drop(['Date', 'year', 'month', 'day'], axis = 1, inplace = True)
for c in interview_df.columns:
    print(c)
    print(interview_df[c].unique())
print(interview_df.dtypes)
#The next column, Client, has three redundant entries; let's replace them:
#aon hewitt gurgaon with aon hewitt, hewitt with aon hewitt, and standard chartered bank chennai with standard chartered bank
interview_df['Client'].replace(['standard chartered bank chennai', 'aon hewitt gurgaon', 'hewitt'], 
                              ['standard chartered bank', 'aon hewitt', 'aon hewitt'], inplace = True)
#Industry column looks OK but Location has one bad entry
interview_df['Location'].replace(['- cochin-'], ['cochin'], inplace = True)
#The Candidate ID column values contain the word 'candidate'. We don't need it, so let's strip it out;
#the actual cast to a numeric type happens further down, once the encoding is done
interview_df['cand_ID'] = interview_df['cand_ID'].str.replace('candidate', '', regex=False).str.strip()
#Let's address the Interview type column
interview_df['Interview_Type'].replace(['scheduled walk in', 'sceduled walkin'],['scheduled walkin', 'scheduled walkin'], inplace = True)
# I wonder why cochin is always messed up?
for col in ('Cand_Loc', 'Job_Loc', 'Venue', 'Native_Loc'):
    interview_df[col].replace(['- cochin-'], ['cochin'], inplace = True)
#The Permission column has a few values like na, nan, not yet and yet to confirm; I will replace them all with to be decided (tbd)
interview_df['Permission'].replace(['na', 'not yet', 'yet to confirm', 'nan'],['tbd', 'tbd', 'tbd','tbd'], inplace = True)
#Let's do the same with the next two columns
interview_df['Unsch_meeting'].replace(['na', 'nan', 'not sure', 'cant say'],['tbd', 'tbd', 'tbd','tbd'], inplace = True)
interview_df['Pre_interview_call'].replace(['nan', 'na','no dont'],['tbd', 'tbd','no'], inplace = True)
#For the Alt_phone column let's replace all the na, nan etc. variants with no
interview_df['Alt_phone'].replace(['nan', 'no i have only thi number','na'],['no', 'no','no'], inplace = True)
#For Resume_Printout, Clarify_Venue and Interview_call_Letter we will replace all the na variants with tbd
interview_df['Resume_Printout'].replace(['nan', 'no- will take it soon','not yet','na'],['tbd', 'tbd','tbd','tbd'], inplace = True)
interview_df['Clarify_Venue'].replace(['nan', 'no- i need to check','na'],['tbd', 'tbd','tbd'], inplace = True)
interview_df['Interview_call_Letter'].replace(['nan', 'havent checked','need to check','not sure','yet to check','not yet','na'],['tbd', 'tbd','tbd','tbd','tbd','tbd','tbd'], inplace = True)
#The Expected column has misleading entries; let's make them uniform - yes or no
interview_df['Expected'].replace(['uncertain', 'nan','11:00 am','10.30 am'],['no', 'no','yes','yes'], inplace = True)
#Take care of the Skillset column
#For now I am just dropping it
interview_df.drop(['Skillset'], axis = 1, inplace = True)
#There is one other piece of information we want to extract from the date column. For the purpose of the interview, let's assume interviewees will be more comfortable with an interview that falls on a weekend day. So I will add another column called extn_weekend to the dataframe
## Adding more time columns
date_series = interview_df.date  # already datetime64[ns] after the cleaning above
# dt.weekday_name and dt.week were removed from newer pandas, so spell each feature out explicitly
interview_df['Date_Year'] = date_series.dt.year
interview_df['Date_Month'] = date_series.dt.month
interview_df['Date_Week'] = date_series.dt.isocalendar().week
interview_df['Date_Day'] = date_series.dt.day
interview_df['Date_Weekday_Name'] = date_series.dt.day_name()
interview_df['Date_Dayofweek'] = date_series.dt.dayofweek
interview_df['Date_Dayofyear'] = date_series.dt.dayofyear
interview_df['extn_weekend'] = np.where(interview_df['Date_Dayofweek'] > 4, 1, 0)  # dayofweek counts Monday as 0, so >4 means Saturday or Sunday
#Now let's look at the unique values again and convert categorical values to numerical ones
#Later on we will normalize those numerical values so that we get values between -1 and 1; most algorithms prefer it that way
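# (The normalization itself is not applied in this script. As a preview, one common option,
#  my assumption rather than anything shown here, is sklearn's MinMaxScaler:
#      from sklearn.preprocessing import MinMaxScaler
#      X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
#  which rescales every feature column into the [-1, 1] range.)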
print(interview_df.dtypes)
#It makes sense to categorize the following columns manually, since they hold the values yes, no and tbd: Permission, Unsch_meeting, Pre_interview_call, Resume_Printout, Clarify_Venue, Interview_call_Letter. I want yes to count as a stronger indicator than tbd, and tbd stronger than no.
ordered_cols = ['Permission', 'Unsch_meeting', 'Pre_interview_call',
                'Resume_Printout', 'Clarify_Venue', 'Interview_call_Letter']
for col in ordered_cols:
    # values outside the declared categories become NaN, exactly as set_categories would do
    interview_df[col] = pd.Categorical(interview_df[col], categories=['no', 'tbd', 'yes'], ordered=True)
#We will treat two other columns the same way in terms of importance to the model - Expected and Attended
for col in ('Expected', 'Attended'):
    interview_df[col] = pd.Categorical(interview_df[col], categories=['no', 'yes'], ordered=True)
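# Quick sanity check of the ordering: the position of each category doubles as its
# future integer code, so this should print {0: 'no', 1: 'tbd', 2: 'yes'}
print(dict(enumerate(interview_df['Permission'].cat.categories)))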
#Now we are ready to convert all string values to numerics
#interview_df_with_dummies = pd.get_dummies(interview_df)
obj_df = interview_df.select_dtypes(include=['object']).copy()
#Let's drop candidate id here, as it makes no sense to one-hot encode it. We will add it back later.
obj_df.drop(['cand_ID'], axis = 1, inplace = True)
#obj_df.head()
#obj_df.columns
#interview_df.dtypes
modeling_df = pd.get_dummies(obj_df)
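# get_dummies one-hot encodes every remaining object column: a column such as Gender
# with values 'male' and 'female' becomes two 0/1 columns, Gender_male and Gender_female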
modeling_df.head()
cat_df = interview_df.select_dtypes(include=['category']).copy()
#obj_df_onehotencoding['Permission'] = cat_df.Permission.cat.codes
#obj_df_onehotencoding.drop(['Permission'], axis = 1, inplace = True)
for col in cat_df.columns:
     modeling_df[col] = cat_df[col].cat.codes
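# Note: values that fell outside the declared categories became NaN when the columns
# were cast to Categorical, and .cat.codes encodes NaN as -1, so these ordinal
# columns may contain -1 alongside the expected codes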
#Assess if any column is missing from interview_df
print(interview_df.columns, interview_df.dtypes)
for c in modeling_df.columns:
     print(c)
#Now add back the columns that are still missing, cand_ID and extn_weekend, straight from interview_df
modeling_df['cand_ID'] = interview_df['cand_ID']
modeling_df['cand_ID'] = pd.to_numeric(modeling_df["cand_ID"])
modeling_df['extn_weekend'] = interview_df['extn_weekend']
print(modeling_df.dtypes, modeling_df.head() )
#Now we are ready to try different algorithms
#Let's separate the target column (Attended) from the features; each model is scored with 10-fold cross-validation below rather than a single train/validation split
Y=modeling_df['Attended']
modeling_df.drop(['Attended'], axis = 1, inplace = True)
print(modeling_df.dtypes)
X=modeling_df
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))
models.append(('XGB', XGBClassifier()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True is required for random_state to take effect
    cv_results = cross_val_score(model, X, Y, cv=kfold,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()