Feature Extraction

Feature extraction is the process of deriving features (characteristics, properties, attributes) from raw data. See the examples below.

Suppose we have a Timestamp variable like this:

From this Timestamp variable we can extract the year, month, day, hour, and day name.
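
As a rough sketch of that idea, the snippet below splits one timestamp into those parts using only the standard library. The raw value and the format string are assumptions made for illustration; real data may be laid out differently.

```python
from datetime import datetime

# Hypothetical raw value; the format string is an assumption about its layout.
raw = "2021-02-05 07:45:55"
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Each component of the timestamp becomes a separate feature.
features = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "hour": ts.hour,
    "day_name": ts.strftime("%A"),
}
print(features)
```

The pandas version later in this post does the same thing for a whole column at once through the .dt accessor.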

Now let's look at another example. Suppose we know the names of the Titanic passengers:

From the Name variable we can extract a Title variable:
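
A minimal sketch of this extraction, using only the standard library, might look like the following. The three names come from the Titanic sample rows; the helper extract_title and the pattern are illustrative assumptions, not the only way to do it.

```python
import re

# Three names from the Titanic data; the pattern is an illustrative assumption.
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]

# A title is taken to be the run of letters between a space and a period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first title-like token in a name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = [extract_title(name) for name in names]
print(titles)
```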

As you can see, we extracted new features from raw data. Of course, we can extract more. The only limit is our imagination. A new feature does not always have to make obvious sense; just try extracting new ones. You may not see the meaning of a newly created feature, yet it may still turn out to matter.

Let's get coding!

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import date
from matplotlib import pyplot as plt
from statsmodels.stats.proportion import proportions_ztest
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)

def load():
    data = pd.read_csv("titanic.csv")
    return data





#Feature Extraction aims to reduce the number of features in a dataset by 
#creating new features from the existing ones (and then discarding the original 
#features). This new, reduced set of features should then be able to summarize 
#most of the information contained in the original set of features.

df = load()
print(df.head())
'''
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
'''

Binary feature extraction. Our goal is to generate new binary variables (0/1, True/False, in other words flags or booleans) from existing variables.

#Let's try to create a new binary column which consists of 1 (Cabin is known) 
#and 0 (Cabin is unknown(NaN))
df["NEW_CABIN_BOOL"] = df["Cabin"].notnull().astype('int')
print(df[['Cabin','NEW_CABIN_BOOL']].head())
'''
  Cabin  NEW_CABIN_BOOL
0   NaN               0
1   C85               1
2   NaN               0
3  C123               1
4   NaN               0
'''

#NOTE: Column name could be 'FLAG_NEW_CABIN' instead of 'NEW_CABIN_BOOL'





#After creating our new column, let's analyze the relationship between 
#target(Survived) variable and this new column.
print(df.groupby("NEW_CABIN_BOOL").agg({"Survived": "mean"}))
'''
                Survived
NEW_CABIN_BOOL          
0               0.299854
1               0.666667
'''

As you can see, passengers whose Cabin is not NaN survived about 67% of the time, compared with about 30% for the rest. It looks like NEW_CABIN_BOOL is related to the Survived column. We need to test this hypothesis.

Hypothesis testing

We want to test whether our new variable NEW_CABIN_BOOL has a statistically significant effect on the Survived variable.

H0: NEW_CABIN_BOOL has no significant effect on the Survived column.

HA: NEW_CABIN_BOOL has a significant effect on the Survived column.

To test this, we will use the proportions_ztest() function. We pass it two arguments: count and nobs.
count holds the number of survivors when NEW_CABIN_BOOL is 1 and when it is 0. nobs holds the number of rows (observations) when NEW_CABIN_BOOL = 1 and when NEW_CABIN_BOOL = 0.
proportions_ztest() returns test_stat and pvalue. If the p-value is below the significance level (say, 0.05), we reject H0 in favor of the alternative hypothesis (HA).

test_stat, pvalue = proportions_ztest(count=[df.loc[df["NEW_CABIN_BOOL"] == 1, "Survived"].sum(),
                                             df.loc[df["NEW_CABIN_BOOL"] == 0, "Survived"].sum()],

                                      nobs=[df.loc[df["NEW_CABIN_BOOL"] == 1, "Survived"].shape[0],
                                            df.loc[df["NEW_CABIN_BOOL"] == 0, "Survived"].shape[0]])




#Since pvalue < 0.05, we reject H0 in favor of HA. In other words, 
#NEW_CABIN_BOOL has a statistically significant effect on the Survived column.
print(f'Test stat: {test_stat:.4f}, P value: {pvalue:.4f}') 
# Test stat: 9.4597, P value: 0.0000
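
To make the test less of a black box, here is a hand-rolled sketch of a pooled two-proportion z-test using only the standard library. It is meant to mirror what proportions_ztest() computes with its default settings, but treat that correspondence as an assumption rather than a guarantee. The counts (136 of 204 and 206 of 687) are also assumptions, chosen because they reproduce the survival rates printed above.

```python
import math

def two_proportion_ztest(successes, totals):
    """Pooled two-sample z-test for equal proportions (two-sided)."""
    s1, s2 = successes
    n1, n2 = totals
    p1, p2 = s1 / n1, s2 / n2
    # Under H0 both groups share one proportion, estimated from the pooled data.
    pooled = (s1 + s2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts, chosen to reproduce the survival rates printed above.
z_stat, p_value = two_proportion_ztest(successes=(136, 206), totals=(204, 687))
print(f"Test stat: {z_stat:.4f}, P value: {p_value:.4f}")
```

With these counts the statistic should land close to the value reported by proportions_ztest() above, which suggests the library call is doing essentially this calculation.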

Now let's try to extract another binary feature. The SibSp and Parch columns count the relatives a passenger has on board. So if SibSp + Parch = 0 for a person, we can say that he or she is alone:

df.loc[((df['SibSp'] + df['Parch']) > 0), "NEW_IS_ALONE"] = "NOT ALONE"
df.loc[((df['SibSp'] + df['Parch']) == 0), "NEW_IS_ALONE"] = "ALONE"

print(df.groupby("NEW_IS_ALONE").agg({"Survived": "mean"}))
'''
              Survived
NEW_IS_ALONE          
ALONE         0.303538
NOT ALONE     0.505650
'''

As you can see, passengers who were not alone survived about 51% of the time, while passengers who were alone survived about 30% of the time. So whether or not a person is alone seems to be related to our target variable (Survived). We therefore need to test it:

H0: NEW_IS_ALONE has no significant effect on the Survived column.

HA: NEW_IS_ALONE has a significant effect on the Survived column.

test_stat, pvalue = proportions_ztest(count=[df.loc[df["NEW_IS_ALONE"] == "ALONE", "Survived"].sum(),
                                             df.loc[df["NEW_IS_ALONE"] == "NOT ALONE", "Survived"].sum()],

                                      nobs=[df.loc[df["NEW_IS_ALONE"] == "ALONE", "Survived"].shape[0],
                                            df.loc[df["NEW_IS_ALONE"] == "NOT ALONE", "Survived"].shape[0]])



#Since pvalue < 0.05, we reject H0 in favor of HA. In other words, 
#being alone affects the person's survival probability.
print(f'Test stat: {test_stat:.4f}, P value: {pvalue:.4f}')  
# Test stat: -6.0704, P value: 0.0000

Text feature extraction: we will try to derive new features from variables that contain text.

#Character Count: Let's count the characters (letters, spaces and punctuation)
#in the 'Name' column and create a new variable from there.
df["NEW_NAME_LETTER_COUNT"] = df["Name"].str.len()



#Word Count: Look at the word count in the 'Name' column.
df["NEW_NAME_WORD_COUNT"] = df["Name"].apply(lambda x: len(str(x).split(" ")))



print(df[['Name', 'NEW_NAME_LETTER_COUNT', 'NEW_NAME_WORD_COUNT']].head())
'''
                                                Name  NEW_NAME_LETTER_COUNT  NEW_NAME_WORD_COUNT
0                            Braund, Mr. Owen Harris                     23                    4
1  Cumings, Mrs. John Bradley (Florence Briggs Th...                     51                    7
2                             Heikkinen, Miss. Laina                     22                    3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)                     44                    7
4                           Allen, Mr. William Henry                     24                    4
'''
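
One detail worth knowing about the word count: split(" ") and split() behave differently when a string contains consecutive spaces. The short sketch below uses a made-up name with a double space to show the difference; the string is hypothetical and not taken from the dataset.

```python
# Hypothetical name with an accidental double space after the comma.
name = "Allen,  Mr. William Henry"

# split(" ") cuts at every single space, so the double space yields an empty string.
words_single_space = name.split(" ")

# split() with no argument treats any run of whitespace as one separator.
words_any_whitespace = name.split()

print(len(words_single_space), words_single_space)
print(len(words_any_whitespace), words_any_whitespace)
```

If the text column may contain irregular spacing, split() is probably the safer choice for a word count.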




###########################################
#Extracting Special Titles
###########################################

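#Doctor: count the words in each name that start with "Dr".
df["NEW_DR"] = df["Name"].apply(
    lambda name: len([word for word in name.split() if word.startswith("Dr")])
)

#As you can see, the rows flagged here are more likely to survive. However,
#there are only 10 of them (and startswith("Dr") also matches any other word
#beginning with "Dr"), so we shouldn't draw firm conclusions for now.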
print(df.groupby("NEW_DR").agg({"Survived": ["mean", "count"]}))

'''
       Survived      
           mean count
NEW_DR               
0       0.38252   881
1       0.50000    10
'''





#Miss, Mr, Mrs:

#We will use RegEx to extract Miss, Mr and Mrs.
#You can watch the video in the link to understand what RegEx is.
#https://www.youtube.com/watch?v=rhzKDrUiJVk

#We are looking for things like " Miss.", " Mr.", " Mrs."
#If expand=False, it returns a Series;
#if it is True, it returns a DataFrame.
#The pattern is a raw string so that "\." reaches the regex engine unchanged.
df["NEW_TITLE"] = df.Name.str.extract(r" ([A-Za-z]+)\.", expand=False)


#As can be seen, some titles have very high frequencies, such as
#"Master", "Miss", "Mr", "Mrs".
#Therefore, for example, we can use these high-count titles 
#to fill the NaN values in the 'Age' column.
print(df[["NEW_TITLE", "Survived", "Age"]].groupby(["NEW_TITLE"]).agg(
    {"Survived": "mean", "Age": ["count", "mean"]}))
'''
           Survived   Age           
               mean count       mean
NEW_TITLE                           
Capt       0.000000     1  70.000000
Col        0.500000     2  58.000000
Countess   1.000000     1  33.000000
Don        0.000000     1  40.000000
Dr         0.428571     6  42.000000
Jonkheer   0.000000     1  38.000000
Lady       1.000000     1  48.000000
Major      0.500000     2  48.500000
Master     0.575000    36   4.574167
Miss       0.697802   146  21.773973
Mlle       1.000000     2  24.000000
Mme        1.000000     1  24.000000
Mr         0.156673   398  32.368090
Mrs        0.792000   108  35.898148
Ms         1.000000     1  28.000000
Rev        0.000000     6  43.166667
Sir        1.000000     1  49.000000
'''
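
The idea of filling missing ages from the title can be sketched as follows. This is a minimal example on hypothetical toy data, and using the median per title is just one possible choice (the mean from the table above would work the same way).

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic columns used above.
toy = pd.DataFrame({
    "NEW_TITLE": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, None, 20.0, None, 24.0],
})

# Median age per title, aligned row by row with the original frame.
title_median_age = toy.groupby("NEW_TITLE")["Age"].transform("median")

# Fill only the missing ages; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(title_median_age)
print(toy)
```

Rare titles with no known ages at all would still be left as NaN by this approach, so they may need a separate fallback.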

Date feature extraction. If we have date and time data, we can extract other features from it. Let's look at the course_reviews.csv file.

dff = pd.read_csv("course_reviews.csv")
print(dff.head())
'''
   Rating            Timestamp             Enrolled  Progress  Questions Asked  Questions Answered
0     5.0  2021-02-05 07:45:55  2021-01-25 15:12:08       5.0              0.0                 0.0
1     5.0  2021-02-04 21:05:32  2021-02-04 20:43:40       1.0              0.0                 0.0
2     4.5  2021-02-04 20:34:03  2019-07-04 23:23:27       1.0              0.0                 0.0
3     5.0  2021-02-04 16:56:28  2021-02-04 14:41:29      10.0              0.0                 0.0
4     4.0  2021-02-04 15:00:24  2020-10-13 03:10:07      10.0              0.0                 0.0
'''




#The dtype of 'Timestamp' is object. To manipulate a date, we need to save it
#as datetime, therefore we will change the dtype.
print(dff.info())
'''
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rating              4323 non-null   float64
 1   Timestamp           4323 non-null   object 
 2   Enrolled            4323 non-null   object 
 3   Progress            4323 non-null   float64
 4   Questions Asked     4323 non-null   float64
 5   Questions Answered  4323 non-null   float64
dtypes: float64(4), object(2)
'''




#With the format parameter, we describe how the date is formatted.
#Let's look at one of the values: 2021-02-05 07:45:55
#It is stored as year-month-day hour:minute:second,
#so we write exactly that in the format parameter.
dff["Timestamp"] = pd.to_datetime(dff["Timestamp"], format="%Y-%m-%d %H:%M:%S")




#Now dtype of 'Timestamp' variable is datetime.
print(dff.dtypes)
'''
Rating                       float64
Timestamp             datetime64[ns]
Enrolled                      object
Progress                     float64
Questions Asked              float64
Questions Answered           float64
dtype: object
'''





#Now we can easily manipulate and extract new features.
#year
dff["year"] = dff["Timestamp"].dt.year

# month
dff["month"] = dff["Timestamp"].dt.month

# year diff (relative to today's date, so the values depend on when this runs)
dff["year_diff"] = date.today().year - dff["Timestamp"].dt.year

# month diff
dff["month_diff"] = (
    (date.today().year - dff["Timestamp"].dt.year) * 12
    + date.today().month
    - dff["Timestamp"].dt.month)


# day name
dff["day_name"] = dff["Timestamp"].dt.day_name()




print(dff.head())
'''
   Rating           Timestamp             Enrolled  Progress  Questions Asked  Questions Answered  year  month  year_diff  month_diff  day_name
0     5.0 2021-02-05 07:45:55  2021-01-25 15:12:08       5.0              0.0                 0.0  2021      2          2          30    Friday
1     5.0 2021-02-04 21:05:32  2021-02-04 20:43:40       1.0              0.0                 0.0  2021      2          2          30  Thursday
2     4.5 2021-02-04 20:34:03  2019-07-04 23:23:27       1.0              0.0                 0.0  2021      2          2          30  Thursday
3     5.0 2021-02-04 16:56:28  2021-02-04 14:41:29      10.0              0.0                 0.0  2021      2          2          30  Thursday
4     4.0 2021-02-04 15:00:24  2020-10-13 03:10:07      10.0              0.0                 0.0  2021      2          2          30  Thursday
'''
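
Because date.today() changes every day, the year_diff and month_diff values above are not reproducible. One option, sketched below with the standard library, is to pin a reference date. The helper month_diff and the date chosen here are assumptions for illustration; a reference date in August 2023 happens to reproduce the values printed above.

```python
from datetime import date

def month_diff(reference, earlier):
    """Whole calendar months between two dates, ignoring the day of month."""
    return (reference.year - earlier.year) * 12 + reference.month - earlier.month

# Assumed reference date, pinned so the feature is reproducible.
REFERENCE_DATE = date(2023, 8, 1)

review_date = date(2021, 2, 5)
year_difference = REFERENCE_DATE.year - review_date.year
month_difference = month_diff(REFERENCE_DATE, review_date)
print(year_difference, month_difference)
```

In a modelling pipeline the reference date would typically be the date the data was collected, stored alongside the dataset.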

Feature interactions. We derive a new variable by combining existing variables, either following some logic or simply by trying combinations.

#We are working with the titanic dataset.
df = load()
df.head()



#The smaller the product of age and pclass, the higher the probability 
#of survival. so we can generate a new variable from here.
df["NEW_AGE_PCLASS"] = df["Age"] * df["Pclass"]




#How many people from that family are on the ship? This could be an
#important variable.
df["NEW_FAMILY_SIZE"] = df["SibSp"] + df["Parch"] + 1




#We can create a new variable according to age range and gender.
df.loc[(df["Sex"] == "male") & (df["Age"] <= 21), "NEW_SEX_CAT"] = "youngmale"
df.loc[ (df["Sex"] == "male") & (df["Age"] > 21) & (df["Age"] < 50), "NEW_SEX_CAT"] = "maturemale"
df.loc[(df["Sex"] == "male") & (df["Age"] >= 50), "NEW_SEX_CAT"] = "seniormale"

df.loc[(df["Sex"] == "female") & (df["Age"] <= 21), "NEW_SEX_CAT"] = "youngfemale"
df.loc[(df["Sex"] == "female") & (df["Age"] > 21) & (df["Age"] < 50), "NEW_SEX_CAT"] = "maturefemale"
df.loc[(df["Sex"] == "female") & (df["Age"] >= 50), "NEW_SEX_CAT"] = "seniorfemale"



print(df.head())
'''
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  NEW_AGE_PCLASS  NEW_FAMILY_SIZE   NEW_SEX_CAT
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S            66.0                2    maturemale
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C            38.0                2  maturefemale
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S            78.0                1  maturefemale
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S            35.0                2  maturefemale
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S           105.0                1    maturemale
'''





#We can examine the survival probabilities with groupby().
print(df.groupby("NEW_SEX_CAT")["Survived"].mean())
'''
NEW_SEX_CAT
maturefemale    0.774194
maturemale      0.199288
seniorfemale    0.909091
seniormale      0.134615
youngfemale     0.678571
youngmale       0.250000
Name: Survived, dtype: float64
'''
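
The six df.loc assignments above can also be expressed as one small function, which keeps the age cut-offs in a single place. This is only a sketch: sex_age_category is a hypothetical helper, and it uses the same cut-offs as the code above (up to 21, between 21 and 50, 50 and over).

```python
import math

def sex_age_category(sex, age):
    """Combine sex with an age band using the cut-offs from the code above."""
    if age is None or math.isnan(age):
        return None  # Age unknown: no category can be assigned.
    if age <= 21:
        band = "young"
    elif age < 50:
        band = "mature"
    else:
        band = "senior"
    return band + sex

print(sex_age_category("male", 22.0))
print(sex_age_category("female", 50.0))
print(sex_age_category("male", float("nan")))
```

On the DataFrame this could be applied with a list comprehension over zip(df["Sex"], df["Age"]), which should give the same categories as the df.loc version while leaving rows with a missing Age empty.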

Thanks for reading…
