The goal of this project is to build a machine learning model that predicts car prices from their attributes. We will use the cars_price.csv dataset, which contains information on 205 cars, including make, fuel type, engine size, horsepower and other characteristics.

Our aim is to predict car prices as accurately as possible. We will use a linear regression model. Why linear regression? It may be an old algorithm and one of the most basic concepts in machine learning, but it is still effective for predicting values that are continuous in nature. It is also a popular starting point for newcomers to machine learning.

Moving on to the approach, we start by importing the required packages.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
import matplotlib as mpl
from prettytable import PrettyTable
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Next we load the data stored on our local disk.

Loading the data

auto = pd.read_csv("C:/Users/Vishaal Grizzly/Downloads/cars_price.csv")

Checking the dimensions of the imported data

auto.shape
(205, 26)

Checking the head and tail of the data

auto.head()

auto.tail()

Data cleaning

Replacing '?' with NaN

df = auto.replace('?', np.nan)
df

df.describe()
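
As an alternative sketch, pandas can treat '?' as missing while the file is being read, through the na_values parameter of read_csv. The small inline CSV below is made up for illustration and only stands in for cars_price.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for cars_price.csv
sample_csv = io.StringIO(
    "make,horsepower,price\n"
    "brand_a,100,10000\n"
    "brand_b,?,12000\n"
    "brand_c,?,?\n"
)

# na_values tells read_csv to parse '?' as NaN
sample = pd.read_csv(sample_csv, na_values="?")

print(sample.isnull().sum())
print(sample.dtypes)
```

Read this way, the affected columns arrive as numeric dtypes instead of object, so the later astype conversions may become unnecessary.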

Checking for missing values in the data

df.isnull().sum()
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

Checking for duplicates in the data

print(df.loc[df.duplicated()].shape)
(0, 26)
df = df.drop_duplicates()
df.shape
(205, 26)

Listing the data type of each column

df.dtypes
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Filling in missing values where necessary

Handling the values in the normalized-losses column

n_l_data = df[df['normalized-losses']!= '?'] 
n_l_data['normalized-losses']
0      NaN
1      NaN
2      NaN
3      164
4      164
      ... 
200     95
201     95
202     95
203     95
204     95
Name: normalized-losses, Length: 205, dtype: object
mean = n_l_data['normalized-losses'].astype(float).mean()
df['normalized-losses'] = df['normalized-losses'].replace('?', mean).fillna(mean).astype(int)
#df=df.drop(columns="normalized-losses")

Handling the values in the price column

price_data = df[df['price']!= '?']
price_data['price']
0      13495
1      16500
2      16500
3      13950
4      17450
       ...  
200    16845
201    19045
202    21485
203    22470
204    22625
Name: price, Length: 205, dtype: object
mean = price_data['price'].astype(float).mean()
df['price'] = df['price'].replace('?', mean).fillna(mean).astype(int)

Handling the values in the horsepower column. The fill value is the mean of the horsepower column itself; the printed outputs further down were produced with the mean price (13207) as the fill value, so that number appears among the horsepower values there.

hp_data = df[df['horsepower'] != '?']
mean = hp_data['horsepower'].astype(float).mean()
df['horsepower'] = df['horsepower'].replace('?', mean).fillna(mean).astype(int)

Handling the values in the stroke column

stroke_data = df[df['stroke'] != '?']
stroke_data['stroke']
0      2.68
1      2.68
2      3.47
3       3.4
4       3.4
       ... 
200    3.15
201    3.15
202    2.87
203     3.4
204    3.15
Name: stroke, Length: 205, dtype: object
mean = stroke_data['stroke'].astype(float).mean()
df['stroke'] = df['stroke'].replace('?', mean).fillna(mean).astype(float)

Handling the values in the peak-rpm column

peak_rpm_data = df[df['peak-rpm']!='?']
peak_rpm_data['peak-rpm']
0      5000
1      5000
2      5000
3      5500
4      5500
       ... 
200    5400
201    5300
202    5500
203    4800
204    5400
Name: peak-rpm, Length: 205, dtype: object
mean = peak_rpm_data['peak-rpm'].astype(float).mean()
df['peak-rpm'] = df['peak-rpm'].replace('?', mean).fillna(mean).astype(float)

Handling the values in the bore column

bore_data = df[df['bore'] != '?']
bore_data['bore']
0      3.47
1      3.47
2      2.68
3      3.19
4      3.19
       ... 
200    3.78
201    3.78
202    3.58
203    3.01
204    3.78
Name: bore, Length: 205, dtype: object
mean = bore_data['bore'].astype(float).mean()
df['bore'] = df['bore'].replace('?', mean).fillna(mean).astype(float)
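
The same fill-with-mean pattern is repeated for each numeric column above. A minimal sketch of a reusable helper follows; the function name fill_with_mean and the demo frame are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_with_mean(frame, column):
    """Convert a column to numeric and replace missing values with its mean."""
    # errors="coerce" turns '?' or any other non-numeric text into NaN
    numeric = pd.to_numeric(frame[column], errors="coerce")
    return numeric.fillna(numeric.mean())

# Made-up data for illustration only
demo = pd.DataFrame({"bore": ["3.0", "?", "4.0"], "stroke": ["2.0", np.nan, "3.0"]})
for col in ["bore", "stroke"]:
    demo[col] = fill_with_mean(demo, col)

print(demo)
```

Because the helper always takes the mean from the column it is filling, it avoids accidentally using the mean of a different column.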

Handling the values in the num-of-doors column. We will replace the missing values with 'four', since four-door cars are the most common.

df['num-of-doors'] = df['num-of-doors'].replace('?', 'four') 
df

print(df)
     symboling  normalized-losses         make fuel-type aspiration  \
0            3                122  alfa-romero       gas        std   
1            3                122  alfa-romero       gas        std   
2            1                122  alfa-romero       gas        std   
3            2                164         audi       gas        std   
4            2                164         audi       gas        std   
..         ...                ...          ...       ...        ...   
200         -1                 95        volvo       gas        std   
201         -1                 95        volvo       gas      turbo   
202         -1                 95        volvo       gas        std   
203         -1                 95        volvo    diesel      turbo   
204         -1                 95        volvo       gas      turbo   

    num-of-doors   body-style drive-wheels engine-location  wheel-base  ...  \
0            two  convertible          rwd           front        88.6  ...   
1            two  convertible          rwd           front        88.6  ...   
2            two    hatchback          rwd           front        94.5  ...   
3           four        sedan          fwd           front        99.8  ...   
4           four        sedan          4wd           front        99.4  ...   
..           ...          ...          ...             ...         ...  ...   
200         four        sedan          rwd           front       109.1  ...   
201         four        sedan          rwd           front       109.1  ...   
202         four        sedan          rwd           front       109.1  ...   
203         four        sedan          rwd           front       109.1  ...   
204         four        sedan          rwd           front       109.1  ...   

     engine-size  fuel-system  bore  stroke compression-ratio horsepower  \
0            130         mpfi  3.47    2.68               9.0        111   
1            130         mpfi  3.47    2.68               9.0        111   
2            152         mpfi  2.68    3.47               9.0        154   
3            109         mpfi  3.19    3.40              10.0        102   
4            136         mpfi  3.19    3.40               8.0        115   
..           ...          ...   ...     ...               ...        ...   
200          141         mpfi  3.78    3.15               9.5        114   
201          141         mpfi  3.78    3.15               8.7        160   
202          173         mpfi  3.58    2.87               8.8        134   
203          145          idi  3.01    3.40              23.0        106   
204          141         mpfi  3.78    3.15               9.5        114   

     peak-rpm city-mpg  highway-mpg  price  
0      5000.0       21           27  13495  
1      5000.0       21           27  16500  
2      5000.0       19           26  16500  
3      5500.0       24           30  13950  
4      5500.0       18           22  17450  
..        ...      ...          ...    ...  
200    5400.0       23           28  16845  
201    5300.0       19           25  19045  
202    5500.0       18           23  21485  
203    4800.0       26           27  22470  
204    5400.0       19           25  22625  

[205 rows x 26 columns]

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    int32  
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               205 non-null    float64
 19  stroke             205 non-null    float64
 20  compression-ratio  205 non-null    float64
 21  horsepower         205 non-null    int32  
 22  peak-rpm           205 non-null    float64
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              205 non-null    int32  
dtypes: float64(8), int32(3), int64(5), object(10)
memory usage: 40.8+ KB

Ten columns have the object dtype. Where a column contains only two distinct values, we can replace them with 0 and 1.

#Fuel column 
df['fuel-type'].value_counts()
gas       185
diesel     20
Name: fuel-type, dtype: int64
df['fuel-type'] = df['fuel-type'].map({'diesel': 0, 'gas': 1})
df['fuel-type'] = df ['fuel-type'].astype('int64')
df['fuel-type'].value_counts()
1    185
0     20
Name: fuel-type, dtype: int64
#Aspiration column 
df['aspiration'].value_counts()
std      168
turbo     37
Name: aspiration, dtype: int64
df['aspiration'] = df['aspiration'].map({'turbo': 0, 'std':1})
df['aspiration'] = df['aspiration'].astype('int64')
df['aspiration'].value_counts()
1    168
0     37
Name: aspiration, dtype: int64
#Num-of-doors column 
df['num-of-doors'].value_counts()
four    114
two      89
Name: num-of-doors, dtype: int64
df['num-of-doors'].isnull()
0      False
1      False
2      False
3      False
4      False
       ...  
200    False
201    False
202    False
203    False
204    False
Name: num-of-doors, Length: 205, dtype: bool
df['num-of-doors'].unique()
array(['two', 'four', nan], dtype=object)
num_of_na = df['num-of-doors'].isna().sum()
num_of_na
2
mask = df['num-of-doors'].isna()
result = df[mask]
result

df['num-of-doors'] = df['num-of-doors'].fillna('four')
df['num-of-doors'] = df['num-of-doors'].map({'two': 0, 'four': 1})
df['num-of-doors'] = df['num-of-doors'].fillna(0).astype('int64')
df['num-of-doors'].value_counts()
1    116
0     89
Name: num-of-doors, dtype: int64
#engine-location column 
df['engine-location'].value_counts()
front    202
rear       3
Name: engine-location, dtype: int64
df['engine-location'] = df['engine-location'].map({'rear': 0, 'front': 1})
#Price column
df['price'].value_counts()
13207    4
8921     2
18150    2
8845     2
8495     2
        ..
45400    1
16503    1
5389     1
6189     1
22625    1
Name: price, Length: 187, dtype: int64
df['price'].dtypes
dtype('int32')
df['price'][0]
13495
#Body-style 
df=df.drop(columns='body-style')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    int32  
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    int64  
 4   aspiration         205 non-null    int64  
 5   num-of-doors       205 non-null    int64  
 6   drive-wheels       205 non-null    object 
 7   engine-location    205 non-null    int64  
 8   wheel-base         205 non-null    float64
 9   length             205 non-null    float64
 10  width              205 non-null    float64
 11  height             205 non-null    float64
 12  curb-weight        205 non-null    int64  
 13  engine-type        205 non-null    object 
 14  num-of-cylinders   205 non-null    object 
 15  engine-size        205 non-null    int64  
 16  fuel-system        205 non-null    object 
 17  bore               205 non-null    float64
 18  stroke             205 non-null    float64
 19  compression-ratio  205 non-null    float64
 20  horsepower         205 non-null    int32  
 21  peak-rpm           205 non-null    float64
 22  city-mpg           205 non-null    int64  
 23  highway-mpg        205 non-null    int64  
 24  price              205 non-null    int32  
dtypes: float64(8), int32(3), int64(9), object(5)
memory usage: 47.3+ KB

Checking the unique values in each column to prepare for one-hot encoding

for col in df:
    print(col, df[col].unique())
symboling [ 3  1  2  0 -1 -2]
normalized-losses [122 164 158 192 188 121  98  81 118 148 110 145 137 101  78 106  85 107
 104 113 150 129 115  93 142 161 153 125 128 103 168 108 194 231 119 154
  74 186  83 102  89  87  77  91 134  65 197  90  94 256  95]
make ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo']
fuel-type [1 0]
aspiration [1 0]
num-of-doors [0 1]
drive-wheels ['rwd' 'fwd' '4wd']
engine-location [1 0]
wheel-base [ 88.6  94.5  99.8  99.4 105.8  99.5 101.2 103.5 110.   88.4  93.7 103.3
  95.9  86.6  96.5  94.3  96.  113.  102.   93.1  95.3  98.8 104.9 106.7
 115.6  96.6 120.9 112.  102.7  93.   96.3  95.1  97.2 100.4  91.3  99.2
 107.9 114.2 108.   89.5  98.4  96.1  99.1  93.3  97.   96.9  95.7 102.4
 102.9 104.5  97.3 104.3 109.1]
length [168.8 171.2 176.6 177.3 192.7 178.2 176.8 189.  193.8 197.  141.1 155.9
 158.8 157.3 174.6 173.2 144.6 150.  163.4 157.1 167.5 175.4 169.1 170.7
 172.6 199.6 191.7 159.1 166.8 169.  177.8 175.  190.9 187.5 202.6 180.3
 208.1 199.2 178.4 173.  172.4 165.3 170.2 165.6 162.4 173.4 181.7 184.6
 178.5 186.7 198.9 167.3 168.9 175.7 181.5 186.6 156.9 157.9 172.  173.5
 173.6 158.7 169.7 166.3 168.7 176.2 175.6 183.5 187.8 171.7 159.3 165.7
 180.2 183.1 188.8]
width [64.1 65.5 66.2 66.4 66.3 71.4 67.9 64.8 66.9 70.9 60.3 63.6 63.8 64.6
 63.9 64.  65.2 62.5 66.  61.8 69.6 70.6 64.2 65.7 66.5 66.1 70.3 71.7
 70.5 72.  68.  64.4 65.4 68.4 68.3 65.  72.3 66.6 63.4 65.6 67.7 67.2
 68.9 68.8]
height [48.8 52.4 54.3 53.1 55.7 55.9 52.  53.7 56.3 53.2 50.8 50.6 59.8 50.2
 52.6 54.5 58.3 53.3 54.1 51.  53.5 51.4 52.8 47.8 49.6 55.5 54.4 56.5
 58.7 54.9 56.7 55.4 54.8 49.4 51.6 54.7 55.1 56.1 49.7 56.  50.5 55.2
 52.5 53.  59.1 53.9 55.6 56.2 57.5]
curb-weight [2548 2823 2337 2824 2507 2844 2954 3086 3053 2395 2710 2765 3055 3230
 3380 3505 1488 1874 1909 1876 2128 1967 1989 2191 2535 2811 1713 1819
 1837 1940 1956 2010 2024 2236 2289 2304 2372 2465 2293 2734 4066 3950
 1890 1900 1905 1945 1950 2380 2385 2500 2410 2443 2425 2670 2700 3515
 3750 3495 3770 3740 3685 3900 3715 2910 1918 1944 2004 2145 2370 2328
 2833 2921 2926 2365 2405 2403 1889 2017 1938 1951 2028 1971 2037 2008
 2324 2302 3095 3296 3060 3071 3139 3020 3197 3430 3075 3252 3285 3485
 3130 2818 2778 2756 2800 3366 2579 2460 2658 2695 2707 2758 2808 2847
 2050 2120 2240 2190 2340 2510 2290 2455 2420 2650 1985 2040 2015 2280
 3110 2081 2109 2275 2094 2122 2140 2169 2204 2265 2300 2540 2536 2551
 2679 2714 2975 2326 2480 2414 2458 2976 3016 3131 3151 2261 2209 2264
 2212 2319 2254 2221 2661 2563 2912 3034 2935 3042 3045 3157 2952 3049
 3012 3217 3062]
engine-type ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']
num-of-cylinders ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']
engine-size [130 152 109 136 131 108 164 209  61  90  98 122 156  92  79 110 111 119
 258 326  91  70  80 140 134 183 234 308 304  97 103 120 181 151 194 203
 132 121 146 171 161 141 173 145]
fuel-system ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
bore [3.47       2.68       3.19       3.13       3.5        3.31
 3.62       2.91       3.03       2.97       3.34       3.6
 2.92       3.15       3.43       3.63       3.54       3.08
 3.32975124 3.39       3.76       3.58       3.46       3.8
 3.78       3.17       3.35       3.59       2.99       3.33
 3.7        3.61       3.94       3.74       2.54       3.05
 3.27       3.24       3.01      ]
stroke [2.68       3.47       3.4        2.8        3.19       3.39
 3.03       3.11       3.23       3.46       3.9        3.41
 3.07       3.58       4.17       2.76       3.15       3.25542289
 3.16       3.64       3.1        3.35       3.12       3.86
 3.29       3.27       3.52       2.19       3.21       2.9
 2.07       2.36       2.64       3.08       3.5        3.54
 2.87      ]
compression-ratio [ 9.   10.    8.    8.5   8.3   7.    8.8   9.5   9.6   9.41  9.4   7.6
  9.2  10.1   9.1   8.1  11.5   8.6  22.7  22.   21.5   7.5  21.9   7.8
  8.4  21.    8.7   9.31  9.3   7.7  22.5  23.  ]
horsepower [  111   154   102   115   110   140   160   101   121   182    48    70
    68    88   145    58    76    60    86   100    78    90   176   262
   135    84    64   120    72   123   155   184   175   116    69    55
    97   152   200    95   142   143   207   288 13207    73    82    94
    62    56   112    92   161   156    52    85   114   162   134   106]
peak-rpm [5000.         5500.         5800.         4250.         5400.
 5100.         4800.         6000.         4750.         4650.
 4200.         4350.         4500.         5200.         4150.
 5600.         5900.         5750.         5125.36945813 5250.
 4900.         4400.         6600.         5300.        ]
city-mpg [21 19 24 18 17 16 23 20 15 47 38 37 31 49 30 27 25 13 26 36 22 14 45 28
 32 35 34 29 33]
highway-mpg [27 26 30 22 25 20 29 28 53 43 41 38 24 54 42 34 33 31 19 17 23 32 39 18
 16 37 50 36 47 46]
price [13495 16500 13950 17450 15250 17710 18920 23875 13207 16430 16925 20970
 21105 24565 30760 41315 36880  5151  6295  6575  5572  6377  7957  6229
  6692  7609  8558  8921 12964  6479  6855  5399  6529  7129  7295  7895
  9095  8845 10295 12945 10345  6785 11048 32250 35550 36000  5195  6095
  6795  6695  7395 10945 11845 13645 15645  8495 10595 10245 10795 11245
 18280 18344 25552 28248 28176 31600 34184 35056 40960 45400 16503  5389
  6189  6669  7689  9959  8499 12629 14869 14489  6989  8189  9279  5499
  7099  6649  6849  7349  7299  7799  7499  7999  8249  8949  9549 13499
 14399 17199 19699 18399 11900 13200 12440 13860 15580 16900 16695 17075
 16630 17950 18150 12764 22018 32528 34028 37028  9295  9895 11850 12170
 15040 15510 18620  5118  7053  7603  7126  7775  9960  9233 11259  7463
 10198  8013 11694  5348  6338  6488  6918  7898  8778  6938  7198  7788
  7738  8358  9258  8058  8238  9298  9538  8449  9639  9989 11199 11549
 17669  8948 10698  9988 10898 11248 16558 15998 15690 15750  7975  7995
  8195  9495  9995 11595  9980 13295 13845 12290 12940 13415 15985 16515
 18420 18950 16845 19045 21485 22470 22625]
one_hot_encoding = pd.get_dummies(df, columns = ['drive-wheels', 'engine-type','num-of-cylinders','fuel-system', 'make'])
df_final = one_hot_encoding.reset_index(drop = True)
df_final

df_final.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 67 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   symboling                205 non-null    int64  
 1   normalized-losses        205 non-null    int32  
 2   fuel-type                205 non-null    int64  
 3   aspiration               205 non-null    int64  
 4   num-of-doors             205 non-null    int64  
 5   engine-location          205 non-null    int64  
 6   wheel-base               205 non-null    float64
 7   length                   205 non-null    float64
 8   width                    205 non-null    float64
 9   height                   205 non-null    float64
 10  curb-weight              205 non-null    int64  
 11  engine-size              205 non-null    int64  
 12  bore                     205 non-null    float64
 13  stroke                   205 non-null    float64
 14  compression-ratio        205 non-null    float64
 15  horsepower               205 non-null    int32  
 16  peak-rpm                 205 non-null    float64
 17  city-mpg                 205 non-null    int64  
 18  highway-mpg              205 non-null    int64  
 19  price                    205 non-null    int32  
 20  drive-wheels_4wd         205 non-null    uint8  
 21  drive-wheels_fwd         205 non-null    uint8  
 22  drive-wheels_rwd         205 non-null    uint8  
 23  engine-type_dohc         205 non-null    uint8  
 24  engine-type_dohcv        205 non-null    uint8  
 25  engine-type_l            205 non-null    uint8  
 26  engine-type_ohc          205 non-null    uint8  
 27  engine-type_ohcf         205 non-null    uint8  
 28  engine-type_ohcv         205 non-null    uint8  
 29  engine-type_rotor        205 non-null    uint8  
 30  num-of-cylinders_eight   205 non-null    uint8  
 31  num-of-cylinders_five    205 non-null    uint8  
 32  num-of-cylinders_four    205 non-null    uint8  
 33  num-of-cylinders_six     205 non-null    uint8  
 34  num-of-cylinders_three   205 non-null    uint8  
 35  num-of-cylinders_twelve  205 non-null    uint8  
 36  num-of-cylinders_two     205 non-null    uint8  
 37  fuel-system_1bbl         205 non-null    uint8  
 38  fuel-system_2bbl         205 non-null    uint8  
 39  fuel-system_4bbl         205 non-null    uint8  
 40  fuel-system_idi          205 non-null    uint8  
 41  fuel-system_mfi          205 non-null    uint8  
 42  fuel-system_mpfi         205 non-null    uint8  
 43  fuel-system_spdi         205 non-null    uint8  
 44  fuel-system_spfi         205 non-null    uint8  
 45  make_alfa-romero         205 non-null    uint8  
 46  make_audi                205 non-null    uint8  
 47  make_bmw                 205 non-null    uint8  
 48  make_chevrolet           205 non-null    uint8  
 49  make_dodge               205 non-null    uint8  
 50  make_honda               205 non-null    uint8  
 51  make_isuzu               205 non-null    uint8  
 52  make_jaguar              205 non-null    uint8  
 53  make_mazda               205 non-null    uint8  
 54  make_mercedes-benz       205 non-null    uint8  
 55  make_mercury             205 non-null    uint8  
 56  make_mitsubishi          205 non-null    uint8  
 57  make_nissan              205 non-null    uint8  
 58  make_peugot              205 non-null    uint8  
 59  make_plymouth            205 non-null    uint8  
 60  make_porsche             205 non-null    uint8  
 61  make_renault             205 non-null    uint8  
 62  make_saab                205 non-null    uint8  
 63  make_subaru              205 non-null    uint8  
 64  make_toyota              205 non-null    uint8  
 65  make_volkswagen          205 non-null    uint8  
 66  make_volvo               205 non-null    uint8  
dtypes: float64(8), int32(3), int64(9), uint8(47)
memory usage: 39.2 KB
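
get_dummies creates one indicator column per category, so the indicator columns of each encoded feature always sum to 1. The sketch below, on a made-up toy frame, shows the default output next to the drop_first=True variant, which omits the first level of each feature and removes that redundancy. Whether this is worth doing for the car data is left as an assumption to test.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "drive-wheels": ["rwd", "fwd", "4wd", "fwd"],
    "price": [100, 200, 300, 400],
})

# Default: one indicator column per category
full = pd.get_dummies(toy, columns=["drive-wheels"])
# drop_first=True: the first category (in sorted order) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["drive-wheels"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```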

Applying machine learning techniques

from sklearn.model_selection import train_test_split
X = df_final.drop(columns = 'price')
Y = df_final.price
print (X.shape)
print (Y.shape)
(205, 66)
(205,)
#Train/test split, then scaling
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42 )
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() 
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)
print (X_train.shape)
print (X_test.shape)
(164, 66)
(41, 66)
#Building Linear Regression technique
from sklearn.linear_model import LinearRegression 
Lin = LinearRegression()
Lin.fit(X_train, Y_train)
LinearRegression()
Y_pred = Lin.predict(X_test)
from sklearn.metrics import r2_score
r2_score(Y_test, Y_pred)
-3.861645124236883e+22
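
An R2 this far below zero means the predictions on the test set are off by many orders of magnitude. A plausible cause, stated here as an assumption rather than something verified on this dataset, is that the one-hot encoded columns are linearly dependent, which can leave ordinary least squares with unstable coefficients. A minimal sketch of one common remedy, ridge regression inside a pipeline, on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one numeric feature plus a full set of three indicator
# columns that always sum to 1 (linearly dependent with the intercept)
rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)
category = rng.integers(0, 3, size=n)
dummies = np.eye(3)[category]
X_demo = np.column_stack([size, dummies])
y_demo = 5.0 * size + 2.0 * category + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split only; the L2 penalty
# keeps the coefficients bounded even when columns are linearly dependent
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_tr, y_tr)
ridge_r2 = r2_score(y_te, ridge.predict(X_te))
print(ridge_r2)
```

The alpha value is an arbitrary starting point; on real data it would normally be tuned by cross-validation.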
#Gradient boosting regressor
from sklearn.ensemble import GradientBoostingRegressor
gbr= GradientBoostingRegressor(random_state=0) 
gbr.fit(X_train,Y_train)
GradientBoostingRegressor(random_state=0)
Y_pred_gbr=gbr.predict(X_test)
r2_score(Y_test,Y_pred_gbr)
0.9241628097439228
#random forest regressor
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0)
regr.fit(X_train, Y_train)
RandomForestRegressor(random_state=0)
Y_pred_rf= regr.predict(X_test) 
r2_score(Y_test,Y_pred_rf)
0.94359287053183
n_estimators = [5,20,50,100] # number of trees in the random forest
max_features = [ 'sqrt'] # number of features in consideration at every split
max_depth = [2,4,6,8,10,12] # maximum number of levels allowed in each decision tree
min_samples_split = [2, 6, 10] # minimum sample number to split a node
min_samples_leaf = [1, 3, 4] # minimum sample number that can be stored in a leaf  

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(RandomForestRegressor(),
                           param_grid=random_grid)
grid_search.fit(X_train, Y_train)
print(grid_search.best_estimator_)
RandomForestRegressor(max_depth=12, max_features='sqrt', n_estimators=20)
#random forest regressor
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0,bootstrap=False, max_depth=6, max_features='sqrt',min_samples_split=6, n_estimators=20)
regr.fit(X_train, Y_train)

Y_pred_rf= regr.predict(X_test) 
r2_score(Y_test,Y_pred_rf)
0.871234128824367
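
One way to make sure the evaluated model is exactly the one the search selected is to score best_estimator_ directly instead of retyping the hyperparameters. The sketch below uses synthetic data from make_regression and a deliberately small grid; both are assumptions made so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the car features
X_syn, y_syn = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Small illustrative grid
small_grid = {"n_estimators": [20, 50], "max_depth": [4, 8]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),  # fixed seed makes the search repeatable
    param_grid=small_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# With the default refit=True, best_estimator_ is already refitted on all of X_tr
tuned_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))
print(tuned_r2)
```

Passing random_state to the estimator inside GridSearchCV also makes the selected parameters reproducible between runs.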

Random forest regressor

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0)
regr.fit(X_train, Y_train)
Y_pred_rf= regr.predict(X_test) 
r2_score(Y_test,Y_pred_rf)
0.94359287053183

Importing mean absolute error and mean squared error from sklearn.metrics

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
print('mae',mean_absolute_error(Y_test,Y_pred_rf))
print('mse',mean_squared_error(Y_test,Y_pred_rf))
print('r2',r2_score(Y_test,Y_pred_rf))
mae 1407.029491869919
mse 4398181.163008079
r2 0.94359287053183
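
MSE is expressed in squared price units, which makes it hard to read next to MAE. Taking its square root gives the RMSE, which is in the same units as the price. A short sketch using the values printed above:

```python
import math

# Values copied from the output above
mse = 4398181.163008079
mae = 1407.029491869919

# RMSE is the square root of MSE and is in the same units as price
rmse = math.sqrt(mse)
print(f"rmse {rmse:.2f}")
print(f"rmse / mae {rmse / mae:.2f}")
```

RMSE is never smaller than MAE, and a large gap between the two suggests that a few cars have much larger errors than the rest.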

Final thoughts

After working through these steps, we built and evaluated several regression models in Python. Comparing the results, the random forest regressor gives the highest R2 score on the car dataset we used (about 0.94), ahead of gradient boosting (about 0.92), while plain linear regression performed very poorly on the one-hot encoded features.

Here is the link to the full code on GitHub for a better understanding.