Predicting House Prices Using Regression Techniques (Machine Learning)

The main thing I want to explain in this article is regression techniques and how to apply a machine learning algorithm to a clean dataset. Our end result is a prediction of the house price from the given factors, together with a measure of how accurate that prediction is.

Life Cycle of a Data Science Project

Data Analysis
Feature Engineering
Feature Selection
Model Building
Model Deployment

Data analysis. In the data analysis phase we learn more about the data: missing values, numerical and categorical variables, the cardinality of the categorical variables, outliers, and the relationships between the independent variables and the dependent variable.

## import some libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option('display.max_columns',None)
## read the dataset
dataset=pd.read_csv('train.csv')
## print the number of rows and columns in the dataset
print(dataset.shape)
## print the first 5 records
dataset.head()

a. Missing values: this part lists the features that have missing values and prints the percentage of NaN values in each of them.

## make a list of features that have null values
features_with_na=[features for features in dataset.columns if dataset[features].isnull().sum()>0]
## print each feature name with its percentage of NaN values
for feature in features_with_na:
    print(feature, np.round(dataset[feature].isnull().mean()*100, 2), '% missing values')
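
To make the calculation concrete, here is a minimal sketch on a tiny made-up table. The frame `toy` and its values are hypothetical and only mimic a few column names from the real dataset. It relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() gives a boolean frame; the column mean of booleans is the share of NaN
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting by share makes it easy to spot the features where most of the values are absent.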

From the output above we can clearly see that several features contain NaN values. Let's plot the relationship between the missing values (in the independent variables) and the sale price (the dependent variable).

for feature in features_with_na:
    data = dataset.copy()
    
    # let's make a variable that indicates 1 if the observation was missing or zero otherwise
    data[feature] = np.where(data[feature].isnull(), 1, 0)
    
    # let's calculate the median SalePrice where the information is missing or present
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(feature)
    plt.show()

Numerical variables

## list of numerical variables
numerical_features = [feature for feature in dataset.columns if dataset[feature].dtypes != 'O']
print('Number of numerical variables: ', len(numerical_features))
## prints 1st 5 rows
dataset[numerical_features].head()

Temporal variables (e.g. date and time variables)

The dataset contains four year variables. From date-time variables like these we extract information such as a number of years or days. In this scenario, one example is the difference in years between the year the house was built and the year it was sold. We will perform this step in Feature Engineering.

# list of variables that contain year information
year_feature = [feature for feature in numerical_features if 'Yr' in feature or 'Year' in feature]
year_feature
# let's explore the content of these year variables
for feature in year_feature:
    print(feature, dataset[feature].unique())
## let's analyze the Temporal Datetime Variables
## We will check whether there is a relation between year the house is sold and the sales price
dataset.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")

## Here we will compare the difference between All years feature with SalePrice
for feature in year_feature:
    if feature!='YrSold':
        data=dataset.copy()
        ## We will capture the difference between year variable and year the house was sold for
        data[feature]=data['YrSold']-data[feature]
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()
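
The year-difference idea can be checked on a few rows by hand. The sketch below uses a made-up frame `toy` with invented values, and the column name `HouseAge` is hypothetical (the loop above overwrites the year column instead of adding a new one).

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration only
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001],
    "YrSold": [2008, 2007, 2008],
    "SalePrice": [208500, 181500, 223500],
})

# Age of the house at the time of sale, in years
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]

# Pearson correlation between age and price on this toy sample
age_price_corr = toy["HouseAge"].corr(toy["SalePrice"])
print(toy[["HouseAge", "SalePrice"]])
print("correlation:", round(age_price_corr, 3))
```

On this toy sample the correlation is negative, meaning the older houses sold for less. Three rows prove nothing about the real data, which is why the scatter plots above are drawn on the full dataset.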

Numerical variables are usually of two types: 1. discrete variables, 2. continuous variables.

1. Discrete variables
## prints count of discrete variables 
discrete_feature=[feature for feature in numerical_features if len(dataset[feature].unique())<25 and feature not in year_feature+['Id']]
print("Discrete Variables Count: {}".format(len(discrete_feature)))
## prints discrete features
discrete_feature
## Let's find the relationship between them and SalePrice
for feature in discrete_feature:
    data=dataset.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

2. Continuous variables

continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature+year_feature+['Id']]
print("Continuous feature Count {}".format(len(continuous_feature)))
## Lets analyse the continuous values by creating histograms to understand the distribution
for feature in continuous_feature:
    data=dataset.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

## We will be using logarithmic transformation
for feature in continuous_feature:
    data=dataset.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data['SalePrice']=np.log(data['SalePrice'])
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.title(feature)
        plt.show()
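
Why take the logarithm at all? A short sketch on synthetic data shows the effect on skewness. The series `prices` is generated, not taken from the dataset, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive variable (log-normal by construction)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=1000))

# Sample skewness before and after the log transformation
skew_before = prices.skew()
skew_after = np.log(prices).skew()

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

The transformed values are much closer to symmetric. For features that contain zeros, which the loop above skips, `np.log1p` is a common alternative, although this article does not use it.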

Outliers. An outlier is an observation that lies far away from the other observations.

for feature in continuous_feature:
    data=dataset.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
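
Box plots show outliers visually. To list them as numbers, one common convention is the 1.5 x IQR rule, which is also the default whisker length in matplotlib box plots. The helper `iqr_outliers` and the sample values below are hypothetical, and this is a sketch of one possible rule rather than the method used in this article.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical living areas with one obviously extreme value
toy_area = pd.Series([1500, 1600, 1700, 1800, 1900, 2000, 9000])
mask = iqr_outliers(toy_area)
print(toy_area[mask])
```

The factor `k` is a tunable assumption: a larger value flags fewer points.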

Categorical variables

categorical_features=[feature for feature in dataset.columns if dataset[feature].dtypes=='O']
categorical_features
dataset[categorical_features].head()
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(dataset[feature].unique())))
## Find out the relationship between each categorical variable and the dependent feature SalePrice
for feature in categorical_features:
    data=dataset.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()
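
A note on counting categories: `len(series.unique())` counts NaN as a category of its own, while `nunique()` skips NaN by default. The sketch below shows the difference on a made-up frame. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns; Alley has missing values
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
})

# unique() includes NaN, so a column with missing values gets one extra "category"
n_with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# nunique() drops NaN by default (dropna=True)
n_without_nan = toy.nunique().to_dict()

print(n_with_nan)
print(n_without_nan)
```

Which count is the right one depends on whether missing values will later be treated as a label of their own, as they are in the feature engineering part of this article.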

FEATURE ENGINEERING

We will perform all of the steps below in Feature Engineering.

Missing values
Temporal variables
Categorical variables: remove rare labels
Standardize the values of the variables to the same range

## Always remember that there is a chance of data leakage, so we need to split the data first
## and then apply feature engineering
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(dataset,dataset['SalePrice'],test_size=0.1,random_state=0)
X_train.shape, X_test.shape
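
What "avoiding leakage" means in practice: any statistic used for a transformation (a median, a minimum, a maximum) is computed on the training part only and then reused on the test part. Here is a minimal sketch with synthetic data. The frame `toy`, its values and the variable `train_median` are hypothetical, and in this sketch the target column is kept out of the feature frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one feature with gaps and a target
toy = pd.DataFrame({
    "LotFrontage": rng.normal(70, 10, size=50),
    "SalePrice": rng.normal(180000, 30000, size=50),
})
toy.iloc[::7, 0] = np.nan  # knock out every 7th LotFrontage value

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    toy[["LotFrontage"]], toy["SalePrice"], test_size=0.2, random_state=0
)

# ... then learn the fill value from the training rows only
train_median = X_train["LotFrontage"].median()

# Apply the same training-set median to both parts
X_train = X_train.fillna({"LotFrontage": train_median})
X_test = X_test.fillna({"LotFrontage": train_median})

print(X_train.shape, X_test.shape)
```

The test rows never influence `train_median`, so the evaluation later on stays honest.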

Missing values

## Let us capture all the nan values
## First lets handle Categorical features which are missing
features_nan=[feature for feature in dataset.columns if dataset[feature].isnull().sum()>0 and dataset[feature].dtypes=='O']
for feature in features_nan:
    print("{}: {}% missing values".format(feature,np.round(dataset[feature].isnull().mean()*100,2)))
## Replace missing value with a new label
def replace_cat_feature(dataset,features_nan):
    data=dataset.copy()
    data[features_nan]=data[features_nan].fillna('Missing')
    return data
dataset=replace_cat_feature(dataset,features_nan)
dataset[features_nan].isnull().sum()
dataset.head()

Now let's check the numerical variables that contain missing values.

numerical_with_nan=[feature for feature in dataset.columns if dataset[feature].isnull().sum()>0 and dataset[feature].dtypes!='O']
## We will print the numerical nan variables and the percentage of missing values
for feature in numerical_with_nan:
    print("{}: {}% missing values".format(feature,np.around(dataset[feature].isnull().mean()*100,2)))

Replace the missing numerical values

for feature in numerical_with_nan:
    ## We will replace by using median since there are outliers
    median_value=dataset[feature].median()
    
    ## create a new feature to capture nan values
    dataset[feature+'nan']=np.where(dataset[feature].isnull(),1,0)
    ## assign the result back instead of using inplace=True on a column selection
    dataset[feature]=dataset[feature].fillna(median_value)
    
dataset[numerical_with_nan].isnull().sum()
dataset.head(50)
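
The same two steps (fill with the median and keep a 0/1 flag for rows that were missing) can also be done with scikit-learn's `SimpleImputer` and its `add_indicator=True` option. This is a swapped-in alternative for illustration, not the code this article uses. The array `X_toy` is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 has gaps, column 1 is complete
X_toy = np.array([
    [65.0, 8450.0],
    [np.nan, 9600.0],
    [80.0, 11250.0],
    [np.nan, 9550.0],
])

# Median imputation; add_indicator appends one 0/1 column
# for every feature that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
```

The output has three columns: the two imputed features followed by the missing-value flag for the first one. Because the imputer is an object with separate `fit` and `transform` steps, it can be fitted on the training rows and reused on the test rows.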

## Temporal Variables (Date Time Variables)
for feature in ['YearBuilt','YearRemodAdd','GarageYrBlt']:
       
    dataset[feature]=dataset['YrSold']-dataset[feature]
dataset.head()
dataset[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()

Numerical variables

Since the numerical variables are skewed, we will apply a log transformation to bring them closer to a normal distribution.

import numpy as np
num_features=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
for feature in num_features:
    dataset[feature]=np.log(dataset[feature])
dataset.head()

Handling rare categorical labels

We will replace the category labels that appear in no more than 1% of the observations with a single label, 'Rare_var'.

categorical_features=[feature for feature in dataset.columns if dataset[feature].dtype=='O']
categorical_features
for feature in categorical_features:
    temp=dataset.groupby(feature)['SalePrice'].count()/len(dataset)
    temp_df=temp[temp>0.01].index
    dataset[feature]=np.where(dataset[feature].isin(temp_df),dataset[feature],'Rare_var')
dataset.head(100)
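
The rare-label step can be written as a small reusable function. The name `group_rare_labels` and the sample series are hypothetical. The logic mirrors the loop above: labels whose share is above the threshold are kept, everything else becomes 'Rare_var'.

```python
import pandas as pd

def group_rare_labels(column: pd.Series, min_share: float = 0.01,
                      rare_label: str = "Rare_var") -> pd.Series:
    """Replace labels whose relative frequency is not above min_share."""
    shares = column.value_counts(normalize=True)
    frequent = shares[shares > min_share].index
    # where() keeps the values for which the condition holds and replaces the rest
    return column.where(column.isin(frequent), rare_label)

# Hypothetical zoning column: 'FV' occurs in 1 of 200 rows (0.5%)
toy = pd.Series(["RL"] * 150 + ["RM"] * 49 + ["FV"])
grouped = group_rare_labels(toy)
print(grouped.value_counts())
```

Grouping rare labels keeps the later encoding from assigning a rank to a category based on just one or two houses.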

for feature in categorical_features:
    labels_ordered=dataset.groupby([feature])['SalePrice'].mean().sort_values().index
    labels_ordered={k:i for i,k in enumerate(labels_ordered,0)}
    dataset[feature]=dataset[feature].map(labels_ordered)
dataset.head(10)
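
The loop above turns every category into an integer rank, ordered by the mean SalePrice of that category. A small sketch with made-up values shows what the mapping looks like. The frame `toy` and the column name `Neighborhood_encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical data: three categories with clearly different price levels
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "C", "B", "C"],
    "SalePrice": [100, 300, 120, 200, 320, 210],
})

# Order the labels from the cheapest to the most expensive category
ordered = toy.groupby("Neighborhood")["SalePrice"].mean().sort_values().index

# Cheapest category gets 0, the next one 1, and so on
mapping = {label: rank for rank, label in enumerate(ordered)}
toy["Neighborhood_encoded"] = toy["Neighborhood"].map(mapping)

print(mapping)
print(toy)
```

Because the mapping is derived from the target, it is safer to compute it on the training rows only and then apply it to the test rows. The code in this article computes it on the full dataset.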

scaling_feature=[feature for feature in dataset.columns if feature not in ['Id','SalePrice'] ]
len(scaling_feature)
scaling_feature

Feature scaling

feature_scale=[feature for feature in dataset.columns if feature not in ['Id','SalePrice']]
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(dataset[feature_scale])

scaler.transform(dataset[feature_scale])
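
With its default settings, `MinMaxScaler` maps every column to the range [0, 1] using x' = (x - min) / (max - min), where min and max are taken per column. The sketch below checks the formula by hand on a hypothetical array `X_toy`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales
X_toy = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [4.0, 800.0],
])

# Manual min-max scaling, column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns contain the same values even though the raw numbers differed by a factor of 200, which is the point of bringing the features to one range.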

Transform the train and test sets, then add back the Id and SalePrice columns.

data = pd.concat([dataset[['Id', 'SalePrice']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(dataset[feature_scale]), columns=feature_scale)],
                    axis=1)
data.to_csv('X_train.csv',index=False)

Drop the 'Id' column

dataset.drop(columns=['Id'],inplace=True)
dataset.head(100)

Drop 'LotFrontagenan', 'MasVnrAreanan', 'GarageYrBltnan'

dataset.drop(columns=['LotFrontagenan','MasVnrAreanan','GarageYrBltnan'],inplace=True)
dataset

X = dataset.iloc[:, :-1].values 
y = dataset.iloc[:, -1].values

Splitting the dataset into the training set and the test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Training the multiple linear regression model on the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Predicting the test set results

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
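
Keep in mind that SalePrice was log-transformed earlier, so the predicted and actual values printed above are on the log scale. To read them as prices, apply the inverse transformation `np.exp`. The numbers in this sketch are hypothetical log-scale predictions, not output of the model.

```python
import numpy as np

# Hypothetical predictions on the log scale, similar in size to log(SalePrice)
y_pred_log = np.array([12.25, 11.90, 12.60])

# np.exp is the inverse of np.log, so this brings the values back to price units
y_pred_price = np.exp(y_pred_log)

np.set_printoptions(precision=0, suppress=True)
print(y_pred_price)
```

With the real model, the same call would be `np.exp(y_pred)`.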

R-squared

from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
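
R-squared compares the squared errors of the model with the squared errors of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on hypothetical numbers and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: errors of the model
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: errors of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values are possible for models that do worse than that.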

Accuracy of our prediction, measured as the R-squared score on the test set: 0.8351288787354434

The dataset was downloaded from

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data