Fetal Health Classification Using Machine Learning

Abstract

Classify fetal health in order to prevent child and maternal mortality.

Context

Reducing child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress. The UN expects that by 2030 countries will end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under-5 mortality to at least as low as 25 per 1,000 live births.

Alongside child mortality there is, of course, maternal mortality, which accounts for 295,000 deaths during and following pregnancy and childbirth (as of 2017). The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.

In light of the above, cardiotocograms (CTGs) are a simple and affordable way to assess fetal health, allowing healthcare professionals to take action to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading the response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions, and more.

Data

This dataset contains 2126 records of features extracted from cardiotocogram exams, which were then classified by three expert obstetricians into 3 classes:

Normal, Suspect, and Pathological

Goal

Build a multiclass model that classifies CTG features into the three fetal health states.

Dataset link

https://www.kaggle.com/andrewmvd/fetal-health-classification

Importing the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier,VotingClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from mlxtend.classifier import StackingClassifier
from sklearn import model_selection
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline

Reading the dataset

df = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')
df.head()
df.shape
(2126, 22)

The dataset has 2126 rows and 22 columns.

Data information

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability                    2126 non-null   float64
 9   percentage_of_time_with_abnormal_long_term_variability  2126 non-null   float64
 10  mean_value_of_long_term_variability                     2126 non-null   float64
 11  histogram_width                                         2126 non-null   float64
 12  histogram_min                                           2126 non-null   float64
 13  histogram_max                                           2126 non-null   float64
 14  histogram_number_of_peaks                               2126 non-null   float64
 15  histogram_number_of_zeroes                              2126 non-null   float64
 16  histogram_mode                                          2126 non-null   float64
 17  histogram_mean                                          2126 non-null   float64
 18  histogram_median                                        2126 non-null   float64
 19  histogram_variance                                      2126 non-null   float64
 20  histogram_tendency                                      2126 non-null   float64
 21  fetal_health                                            2126 non-null   float64
dtypes: float64(22)

Checking for null values

df.isnull().sum()
baseline value                                            0
accelerations                                             0
fetal_movement                                            0
uterine_contractions                                      0
light_decelerations                                       0
severe_decelerations                                      0
prolongued_decelerations                                  0
abnormal_short_term_variability                           0
mean_value_of_short_term_variability                      0
percentage_of_time_with_abnormal_long_term_variability    0
mean_value_of_long_term_variability                       0
histogram_width                                           0
histogram_min                                             0
histogram_max                                             0
histogram_number_of_peaks                                 0
histogram_number_of_zeroes                                0
histogram_mode                                            0
histogram_mean                                            0
histogram_median                                          0
histogram_variance                                        0
histogram_tendency                                        0
fetal_health                                              0
dtype: int64

There are no null values in the dataset.

Pair plot

Univariate analysis

Checking skewness

df.skew()
baseline value                                             0.020312
accelerations                                              1.204392
fetal_movement                                             7.811477
uterine_contractions                                       0.159315
light_decelerations                                        1.718437
severe_decelerations                                      17.353457
prolongued_decelerations                                   4.323965
abnormal_short_term_variability                           -0.011829
mean_value_of_short_term_variability                       1.657339
percentage_of_time_with_abnormal_long_term_variability     2.195075
mean_value_of_long_term_variability                        1.331998
histogram_width                                            0.314235
histogram_min                                              0.115784
histogram_max                                              0.577862
histogram_number_of_peaks                                  0.892886
histogram_number_of_zeroes                                 3.920287
histogram_mode                                            -0.995178
histogram_mean                                            -0.651019
histogram_median                                          -0.478414
histogram_variance                                         3.219974
histogram_tendency                                        -0.311632
fetal_health                                               1.849934
dtype: float64
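
To make the skewness figures above easier to interpret, here is a minimal pure-Python sketch of a bias-corrected sample skewness. It assumes that `df.skew()` applies this kind of adjusted Fisher-Pearson estimator by default, so treat the exact correction factor as an assumption rather than a statement about pandas internals. The function name and toy values are hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) skewness for n >= 3 values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: treated as having no skew (assumption)
    g1 = m3 / m2 ** 1.5  # uncorrected skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Toy columns (hypothetical values, not taken from the dataset)
symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 0, 1, 10]

print(sample_skewness(symmetric))     # close to zero
print(sample_skewness(right_tailed))  # clearly positive
```

A value near zero indicates a roughly symmetric column, while large positive values, as seen for severe_decelerations and fetal_movement, indicate a long right tail.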

Let's check whether the dataset contains duplicates.

df[df.duplicated()]

The dataset contains duplicate rows.

df_dup = df.drop_duplicates(subset = None , keep = 'first', inplace = False)

After removing the duplicates, we have 2113 rows and 22 columns.

Correlation plot

Target = df["fetal_health"]
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(15, 15))
    ax = sns.heatmap(corr,mask=mask,square=True,linewidths=2.5,cmap="viridis",annot=True)

There is a strong correlation between the baseline value and the histogram mode, histogram median, and histogram mean. The histogram number of peaks and the histogram width are also well correlated.
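
Each cell of the heatmap is a Pearson correlation coefficient. The following is a minimal sketch of that calculation for one pair of columns, assuming the default Pearson method; the function name and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)


# Hypothetical toy values standing in for two strongly related columns
baseline = [120.0, 130.0, 140.0, 150.0]
hist_mean = [118.0, 131.0, 139.0, 152.0]
print(round(pearson_corr(baseline, hist_mean), 3))
```

Values close to 1 or -1 indicate a nearly linear relationship, which is what the heatmap shows for the baseline value and the histogram location statistics.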

Target class counts

sns.countplot(x=Target)
plt.show()

print("Count of type 1.0 fetal health in the dataset ",len(df.loc[df["fetal_health"]==1.0]))
print("Count of type 2.0 fetal health in the dataset ",len(df.loc[df["fetal_health"]==2.0]))
print("Count of type 3.0 fetal health in the dataset ",len(df.loc[df["fetal_health"]==3.0]))

Count of type 1.0 fetal health in the dataset  1655
Count of type 2.0 fetal health in the dataset  295
Count of type 3.0 fetal health in the dataset  176
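
The class counts above can be turned into shares to show how imbalanced the target is. This is a small sketch using only the three counts printed above (which are computed before duplicate removal); the variable names are hypothetical.

```python
# Class counts as printed above (before duplicates are dropped)
counts = {1.0: 1655, 2.0: 295, 3.0: 176}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

Roughly three quarters of the records belong to class 1.0, so plain accuracy can look good even when the minority classes are predicted poorly.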

Splitting the dataset into independent features and the dependent (target) variable

X = df_dup.iloc[:,:-1]
y = df_dup.iloc[:,-1]

Scaling the dataset

scale = StandardScaler()
X = scale.fit_transform(X)
X = pd.DataFrame(X,columns=df_dup.iloc[:,:-1].columns)
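
As a rough illustration of what the scaling step does to a single column, here is a pure-Python sketch of z-score standardization. It assumes the population standard deviation (divisor n) is used, which is my understanding of StandardScaler; the function name and toy values are hypothetical.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std, divisor n
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to rescale
    return [(value - mean) / std for value in column]


# Hypothetical baseline heart-rate values
scaled = standardize([120.0, 130.0, 140.0])
print([round(value, 4) for value in scaled])
```

After this transformation every feature is on a comparable scale, which matters most for distance-based models such as KNN and for the SVM.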

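We use a random oversampler (RandomOverSampler) to address the class imbalance in the target column. Note that the train/test split that follows is performed on X and y, so the resampled X_ros and y_ros are not used to train the models in the rest of this analysis.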

from imblearn.over_sampling import RandomOverSampler
ROS = RandomOverSampler(random_state=42)
X_ros, y_ros = ROS.fit_resample(X,y)
from collections import Counter
print('Resampled dataset shape %s' % Counter(y_ros))

Resampled dataset shape Counter({2.0: 1646, 1.0: 1646, 3.0: 1646})
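
The idea behind random oversampling can be sketched without any library: rows of the smaller classes are re-drawn with replacement until every class is as large as the biggest one. This is an illustrative sketch under that assumption, not the imbalanced-learn implementation; the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate random minority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target_size = max(len(group) for group in by_class.values())
    new_rows, new_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class
        extra = [rng.choice(group) for _ in range(target_size - len(group))]
        for row in group + extra:
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical toy data: 5 rows of class 1.0, 2 of class 2.0, 1 of class 3.0
toy_rows = [[i] for i in range(8)]
toy_labels = [1.0] * 5 + [2.0] * 2 + [3.0]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies, oversampling changes class frequencies but adds no new information about the minority classes.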

import statsmodels.api as sm
X = sm.add_constant(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.2)

print('X_train', X_train.shape)
print('y_train', y_train.shape)

print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (1690, 22)
y_train (1690,)
X_test (423, 22)
y_test (423,)

print("{0:0.2f}% data is in training set".format((len(X_train)/len(y.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(y.index)) * 100))

79.98% data is in training set
20.02% data is in test set

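Function to generate the training report

from sklearn.metrics import classification_report

def get_train_report(model):
    # predict on the training set and summarize per-class metrics
    train_pred = model.predict(X_train)
    return classification_report(y_train, train_pred)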

Function to generate the test report

def get_test_report(model):
    test_pred = model.predict(X_test)
    return(classification_report(y_test, test_pred))
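
The reports produced by these helpers list precision, recall, and F1-score per class. The sketch below shows how those three numbers are derived for one class from true and predicted labels; the function name and the toy labels are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class in a multiclass problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy labels using the same three class codes as the dataset
toy_true = [1, 1, 1, 2, 2, 3]
toy_pred = [1, 1, 2, 2, 3, 3]
for cls in (1, 2, 3):
    print(cls, per_class_scores(toy_true, toy_pred, cls))
```

For this dataset the recall of classes 2.0 and 3.0 is the figure to watch, since a missed suspect or pathological case is the costly error.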

Decision tree classifier

decision_tree_classification = DecisionTreeClassifier(criterion = 'entropy', random_state = 10)
decision_tree = decision_tree_classification.fit(X_train, y_train)
from sklearn.metrics import classification_report
train_report = get_train_report(decision_tree)
print(train_report)
              precision    recall  f1-score   support

         1.0       1.00      1.00      1.00      1320
         2.0       1.00      1.00      1.00       237
         3.0       1.00      1.00      1.00       133

    accuracy                           1.00      1690
   macro avg       1.00      1.00      1.00      1690
weighted avg       1.00      1.00      1.00      1690
test_report = get_test_report(decision_tree)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.95      0.97      0.96       326
         2.0       0.78      0.76      0.77        55
         3.0       0.95      0.83      0.89        42

    accuracy                           0.93       423
   macro avg       0.89      0.86      0.87       423
weighted avg       0.93      0.93      0.93       423

Hyperparameter tuning for the decision tree classifier

dt_model = DecisionTreeClassifier(criterion = 'gini',
                                  max_depth = 5,
                                  min_samples_split = 4,
                                  max_leaf_nodes = 6,
                                  random_state = 10)

# fit the model using fit() on train data
decision_tree = dt_model.fit(X_train, y_train)
train_report = get_train_report(decision_tree)

# print the performance measures
print('Train data:\n', train_report)
test_report = get_test_report(decision_tree)

# print the performance measures
print('Test data:\n', test_report)
Train data:
               precision    recall  f1-score   support

         1.0       0.93      0.97      0.95      1320
         2.0       0.82      0.62      0.71       237
         3.0       0.90      0.92      0.91       133

    accuracy                           0.92      1690
   macro avg       0.88      0.84      0.85      1690
weighted avg       0.91      0.92      0.91      1690

Test data:
               precision    recall  f1-score   support

         1.0       0.92      0.95      0.94       326
         2.0       0.76      0.62      0.68        55
         3.0       0.87      0.81      0.84        42

    accuracy                           0.90       423
   macro avg       0.85      0.79      0.82       423
weighted avg       0.89      0.90      0.89       423
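
The two trees above differ, among other settings, in their split criterion: 'entropy' for the first and 'gini' for the tuned one. As a rough sketch, both are impurity measures computed from the class counts in a node; the function names are hypothetical, and the example node uses the training-set class supports reported above.

```python
import math


def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def entropy(class_counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log2(p) for p in proportions)


# Root node of the training set: supports of classes 1.0, 2.0 and 3.0
root_counts = [1320, 237, 133]
print("gini:", round(gini(root_counts), 3))
print("entropy:", round(entropy(root_counts), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; a split is chosen to reduce the weighted impurity of the child nodes.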

Random forest classifier

rf_classification = RandomForestClassifier(n_estimators = 10, random_state = 10)

# use fit() to fit the model on the train set
rf_model = rf_classification.fit(X_train, y_train)
train_report = get_train_report(rf_model)
print(train_report)
              precision    recall  f1-score   support

         1.0       0.99      1.00      1.00      1320
         2.0       0.99      0.97      0.98       237
         3.0       1.00      0.98      0.99       133

    accuracy                           0.99      1690
   macro avg       1.00      0.98      0.99      1690
weighted avg       0.99      0.99      0.99      1690

test_report = get_test_report(rf_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.93      0.98      0.96       326
         2.0       0.82      0.67      0.74        55
         3.0       0.92      0.79      0.85        42

    accuracy                           0.92       423
   macro avg       0.89      0.81      0.85       423
weighted avg       0.92      0.92      0.92       423

Feature importance plot

important_features = pd.DataFrame({'Features': X_train.columns, 
                                   'Importance': rf_model.feature_importances_})

# sort the dataframe in the descending order according to the feature importance
important_features = important_features.sort_values('Importance', ascending = False)

# create a barplot to visualize the features based on their importance
sns.barplot(x = 'Importance', y = 'Features', data = important_features)

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Feature Importance', fontsize = 15)
plt.xlabel('Importance', fontsize = 15)
plt.ylabel('Features', fontsize = 15)

# display the plot
plt.show()

From the bar plot above, we can see that short-term variability is the most important feature in the dataset.

K-nearest neighbors

from sklearn.metrics import confusion_matrix,roc_curve
knn_classification = KNeighborsClassifier(n_neighbors = 3)

# fit the model using fit() on train data
knn_model = knn_classification.fit(X_train, y_train)
train_report = get_train_report(knn_model)
print(train_report)
              precision    recall  f1-score   support

         1.0       0.96      0.99      0.98      1320
         2.0       0.89      0.79      0.84       237
         3.0       0.94      0.88      0.91       133

    accuracy                           0.95      1690
   macro avg       0.93      0.89      0.91      1690
weighted avg       0.95      0.95      0.95      1690
test_report = get_test_report(knn_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.94      0.96      0.95       326
         2.0       0.70      0.67      0.69        55
         3.0       0.89      0.74      0.81        42

    accuracy                           0.90       423
   macro avg       0.84      0.79      0.81       423
weighted avg       0.90      0.90      0.90       423

Hyperparameter tuning for the KNN classifier

# candidate values for K and for the distance metric (metric names are lowercase)
tuned_paramaters = {'n_neighbors': np.arange(1, 25, 2),
                   'metric': ['hamming','euclidean','manhattan','chebyshev']}
 
# instantiate the 'KNeighborsClassifier' 
knn_classification = KNeighborsClassifier()

knn_grid = GridSearchCV(estimator = knn_classification, 
                        param_grid = tuned_paramaters, 
                        cv = 5, 
                        scoring = 'accuracy')

# fit the model on X_train and y_train using fit()
knn_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for KNN Classifier: ', knn_grid.best_params_, '\n')

Best parameters for KNN Classifier:  {'metric': 'manhattan', 'n_neighbors': 7} 

from sklearn.model_selection import cross_val_score
error_rate = []

# use for loop to build a knn model for each K
for i in np.arange(1,25,2):
    
    # setup a knn classifier with k neighbors
    # use the 'euclidean' metric 
    knn = KNeighborsClassifier(i, metric = 'euclidean')
   
    # fit the model using 'cross_val_score'
    # pass the knn model as 'estimator'
    # use 5-fold cross validation
    score = cross_val_score(knn, X_train, y_train, cv = 5)
    
    # calculate the mean score
    score = score.mean()
    
    # compute error rate 
    error_rate.append(1 - score)

# plot the error_rate for different values of K 
plt.plot(range(1,25,2), error_rate)

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Error Rate', fontsize = 15)
plt.xlabel('K', fontsize = 15)
plt.ylabel('Error Rate', fontsize = 15)
# set the x-axis labels
plt.xticks(np.arange(1, 25, step = 2))

# plot a vertical line across the minimum error rate
plt.axvline(x = 7, color = 'red')

# display the plot
plt.show()

We can see that the optimal value K = 7 obtained from GridSearchCV gives the lowest error rate.
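
To show what the selected configuration (K = 7 with the Manhattan metric) means in practice, here is a minimal pure-Python sketch of a KNN prediction with uniform votes. It is an illustration only, not the scikit-learn implementation; the function names and the toy points are hypothetical, and K = 3 is used so that the toy set stays small.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_rows, train_labels, query, k=7):
    """Predict the majority label among the k training rows nearest to query."""
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: manhattan(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
toy_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]

print(knn_predict(toy_rows, toy_labels, (0.5, 0.5), k=3))
print(knn_predict(toy_rows, toy_labels, (5.5, 5.5), k=3))
```

Because the prediction depends directly on distances, the standardization step applied earlier has a strong influence on which neighbors are found.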

train_report = get_train_report(knn_grid)
print(train_report)
              precision    recall  f1-score   support

         1.0       0.95      0.98      0.96      1320
         2.0       0.84      0.72      0.77       237
         3.0       0.98      0.83      0.90       133

    accuracy                           0.93      1690
   macro avg       0.92      0.84      0.88      1690
weighted avg       0.93      0.93      0.93      1690
test_report = get_test_report(knn_grid)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.91      0.97      0.94       326
         2.0       0.69      0.62      0.65        55
         3.0       0.86      0.60      0.70        42

    accuracy                           0.88       423
   macro avg       0.82      0.73      0.77       423
weighted avg       0.88      0.88      0.88       423

Gaussian naive Bayes classifier

gnb = GaussianNB()
# fit the model using fit() on train data
gnb_model = gnb.fit(X_train, y_train)
test_report = get_test_report(gnb_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.98      0.86      0.91       326
         2.0       0.48      0.87      0.62        55
         3.0       0.78      0.69      0.73        42

    accuracy                           0.84       423
   macro avg       0.75      0.81      0.76       423
weighted avg       0.89      0.84      0.86       423

AdaBoost classifier

ada_model = AdaBoostClassifier(n_estimators = 40, random_state = 10)
ada_model.fit(X_train, y_train)
AdaBoostClassifier(n_estimators=40, random_state=10)
test_report = get_test_report(ada_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.92      0.94      0.93       326
         2.0       0.66      0.69      0.67        55
         3.0       0.94      0.74      0.83        42

    accuracy                           0.89       423
   macro avg       0.84      0.79      0.81       423
weighted avg       0.89      0.89      0.89       423

Gradient boosting classifier

gboost_model = GradientBoostingClassifier(n_estimators = 150, max_depth = 10, random_state = 10)
gboost_model.fit(X_train, y_train)
GradientBoostingClassifier(max_depth=10, n_estimators=150, random_state=10)
test_report = get_test_report(gboost_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.95      0.98      0.97       326
         2.0       0.87      0.82      0.84        55
         3.0       0.95      0.86      0.90        42

    accuracy                           0.94       423
   macro avg       0.92      0.88      0.90       423
weighted avg       0.94      0.94      0.94       423

XGBoost classifier

xgb_model = XGBClassifier(max_depth = 10, gamma = 1)
xgb_model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=1, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
test_report = get_test_report(xgb_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.95      0.98      0.96       326
         2.0       0.88      0.82      0.85        55
         3.0       0.94      0.81      0.87        42

    accuracy                           0.94       423
   macro avg       0.93      0.87      0.89       423
weighted avg       0.94      0.94      0.94       423
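
The "macro avg" and "weighted avg" rows of the report can be reproduced approximately from the per-class rows. The sketch below does this for the XGBoost F1-scores shown above; since it starts from values already rounded to two decimals, the results are approximations. The variable names are hypothetical.

```python
# Per-class F1-scores and supports from the XGBoost test report above
f1_scores = {1.0: 0.96, 2.0: 0.85, 3.0: 0.87}
supports = {1.0: 326, 2.0: 55, 3.0: 42}

# Macro average: every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: every class counts in proportion to its support
total_support = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] for c in f1_scores) / total_support

print(f"macro F1: {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The weighted average is pulled toward the majority class, so the macro average is the more telling summary when the minority classes matter most.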

Support vector machine

svc_model = SVC(kernel='poly',probability=True)
svc_model.fit(X_train,y_train)
SVC(kernel='poly', probability=True)
test_report = get_test_report(svc_model)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.92      0.97      0.95       326
         2.0       0.69      0.60      0.64        55
         3.0       0.90      0.67      0.77        42

    accuracy                           0.89       423
   macro avg       0.84      0.75      0.78       423
weighted avg       0.89      0.89      0.89       423

Voting classifier

clf1 = KNeighborsClassifier(n_neighbors = 7 , weights = 'distance', metric='manhattan' )
clf2 = GradientBoostingClassifier(n_estimators = 150,max_depth = 10,random_state=1)
votingclf = VotingClassifier(estimators=[('knn',clf1),('grb', clf2)],voting='hard')
votingclf = votingclf.fit(X_train,y_train)
test_report = get_test_report(votingclf)
print(test_report)
              precision    recall  f1-score   support

         1.0       0.92      0.98      0.95       326
         2.0       0.82      0.65      0.73        55
         3.0       0.97      0.67      0.79        42

    accuracy                           0.91       423
   macro avg       0.90      0.77      0.82       423
weighted avg       0.91      0.91      0.90       423
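
Hard voting takes, for each sample, the label predicted by most of the base models. The sketch below illustrates this with plain Python. It assumes that ties are resolved in favor of the smallest class label, which is my understanding of the scikit-learn behavior; treat that tie rule as an assumption. The function name and the toy predictions are hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest label (assumption)."""
    voted = []
    for sample_predictions in zip(*predictions_per_model):
        counts = Counter(sample_predictions)
        top = max(counts.values())
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted


# Hypothetical predictions of two models for three samples
knn_pred = [1.0, 2.0, 3.0]
grb_pred = [1.0, 3.0, 3.0]
print(hard_vote([knn_pred, grb_pred]))
```

With only two base models, every disagreement is a tie, so under this tie rule the ensemble leans toward the lower class code. That may be one reason the recall for class 3.0 is lower here than for gradient boosting alone.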

Result

We tried several algorithms on this dataset. The boosting-based algorithms, XGBoost and gradient boosting, perform best, each reaching 94% accuracy on the test set; the F1-scores for XGBoost are 0.96, 0.85, and 0.87 for the three classes, respectively. They are followed by the untuned decision tree classifier, with 93% accuracy on the test data.

I hope you enjoyed the analysis!

You can follow me on Linkedin, Github, and Kaggle.