In this article we will focus on analyzing IMDb movie review data and try to predict whether a review is positive or negative. Familiarity with some machine learning concepts will help in understanding the code and algorithms used. We will use the popular machine learning framework scikit-learn.

Preparing the dataset:

We will use the dataset from here: http://ai.stanford.edu/~amaas/data/sentiment/

After downloading the dataset, the unnecessary files/folders were removed (load_files, used below, expects one subfolder per class), so the folder structure looks like this:
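aclImdb/
├── test/
│   ├── neg/
│   └── pos/
└── train/
    ├── neg/
    └── pos/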

Loading the data into the program:

We will load and explore the training and test data to understand their nature. In this case, the training and test data have the same format.

from sklearn.datasets import load_files
import numpy as np  # needed for np.bincount below
reviews_train = load_files("aclImdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target

print("Number of documents in train data: {}".format(len(text_train)))
print("Samples per class (train): {}".format(np.bincount(y_train)))

reviews_test = load_files("aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target

print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))

scikit-learn provides load_files for reading this kind of text data. After loading the data, we print the number of documents (train/test) and the number of samples per class (positive/negative), which looks like this:

Number of documents in train data: 25000
Samples per class (train): [12500 12500]
Number of documents in test data: 25000
Samples per class (test): [12500 12500]

We can see a total of 25000 training samples and 25000 test samples, with 12500 per class (pos and neg).
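Before vectorizing, it is worth peeking at a raw document. Note that load_files returns the documents as bytes, and the IMDb reviews contain HTML line-break tags; an optional cleanup step (a small sketch, not part of the pipeline below) could look like this:

print(text_train[1][:200])  # peek at one raw review (a bytes object)

# strip the "<br />" HTML tags embedded in the reviews
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]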

Representing the text data as a bag of words:

We want to count word occurrences as a bag of words, which involves three steps: tokenizing each document, building a vocabulary over all the documents, and counting how often each vocabulary entry occurs in each document. A tiny illustration of the idea follows below.
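As a minimal illustration of these steps on a toy two-document corpus (a hypothetical example, not from the dataset):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the wise man spoke", "the man listened"]
toy_vect = CountVectorizer()
bag = toy_vect.fit_transform(toy)
print(toy_vect.get_feature_names())  # ['listened', 'man', 'spoke', 'the', 'wise']
print(bag.toarray())                 # one row of word counts per document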

To represent the input dataset as a bag of words, we will use CountVectorizer and call its fit and transform methods. CountVectorizer is a transformer that converts the input documents into a sparse matrix of features.

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=5, ngram_range=(2, 2))
X_train = vect.fit(text_train).transform(text_train)
X_test = vect.transform(text_test)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("X_train:\n{}".format(repr(X_train)))
print("X_test: \n{}".format(repr(X_test)))

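# note: on scikit-learn >= 1.0 this method is named get_feature_names_out()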
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

CountVectorizer is used with two parameters:

  1. min_df (=5): sets the minimum number of documents a term must appear in for it to be kept as a feature.
  2. ngram_range (=(2, 2)): the ngram_range parameter is a tuple. It sets the minimum and maximum length of the token sequences to consider. Here both bounds are 2, so only sequences of exactly 2 tokens (bigrams), such as "wise man", are extracted, as illustrated in the sketch below.
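A quick sketch of what ngram_range=(2, 2) does on a single made-up sentence:

from sklearn.feature_extraction.text import CountVectorizer

bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(["the wise man spoke"])
print(bigram_vect.get_feature_names())  # ['man spoke', 'the wise', 'wise man']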

Each entry in the resulting matrix is treated as a feature. The output of the above code snippet looks like this:

Vocabulary size: 129549
X_train:
<25000x129549 sparse matrix of type '<class 'numpy.int64'>'
 with 3607330 stored elements in Compressed Sparse Row format>
X_test: 
<25000x129549 sparse matrix of type '<class 'numpy.int64'>'
 with 3392376 stored elements in Compressed Sparse Row format>
Number of features: 129549

A total of 129549 features were found. Note how sparse the representation is: only 3607330 of the 25000 x 129549 possible entries in X_train are non-zero, roughly 0.1 percent.

Building the model:

We will use LogisticRegression to build the model, since LogisticRegression often works best for high-dimensional sparse data like ours.

While building the model, we need to do two more things:

  1. Grid search: to tune the parameters of LogisticRegression. We want to determine which value of the regularization parameter 'C' gives the best accuracy.
  2. Cross-validation: to avoid overfitting to the training data (a standalone sketch of this step follows the list).
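To see what the cross-validation step does on its own, here is a rough sketch using cross_val_score directly, independent of the grid search (it assumes X_train and y_train from above):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# evaluate one candidate value of C with five-fold cross-validation
scores = cross_val_score(LogisticRegression(C=1), X_train, y_train, cv=5)
print("CV scores: {}".format(scores))
print("Mean CV accuracy: {:.2f}".format(scores.mean()))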

See [2] for more information on grid search and cross-validation.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)

Here we use five-fold cross-validation with GridSearchCV. After fitting on the training data, we can inspect best_score_, best_params_ (the chosen 'C') and best_estimator_ (the model we are going to use).

The output of the above code snippet looks like this:

Best cross-validation score: 0.88
Best parameters:  {'C': 1}
Best estimator:  LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None,  solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

We end up with a model with 'C' = 1 and a cross-validation accuracy of 88 percent.

We want to display the 25 most positive and 25 most negative features.

import matplotlib.pyplot as plt
import mglearn
mglearn.tools.visualize_coefficients(grid.best_estimator_.coef_, feature_names, n_top_features=25)
plt.show()

* mglearn is a helper library that accompanies the book [1]. You can download it here: https://github.com/amueller/mglearn.
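If you would rather not install mglearn, a rough equivalent of the coefficient plot using plain numpy and matplotlib might look like this (a sketch, not the book's implementation):

import numpy as np
import matplotlib.pyplot as plt

coef = grid.best_estimator_.coef_.ravel()
order = np.argsort(coef)
# the 25 most negative and the 25 most positive coefficients
interesting = np.hstack([order[:25], order[-25:]])
colors = ['red' if c < 0 else 'blue' for c in coef[interesting]]
plt.figure(figsize=(15, 5))
plt.bar(np.arange(len(interesting)), coef[interesting], color=colors)
plt.xticks(np.arange(len(interesting)), np.array(feature_names)[interesting],
           rotation=60, ha='right')
plt.ylabel("Coefficient magnitude")
plt.show()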

Making predictions:

Now we will make predictions on our test data using the trained model.

# best_estimator_ has already been refit on the full training set by GridSearchCV
# (refit=True by default), so no extra fit call is needed
lr = grid.best_estimator_
print("Score: {:.2f}".format(lr.score(X_test, y_test)))

The prediction gives a score of 88% on the test data.

Score: 0.88

To check how our model performs on individual examples, we will make one prediction with a positive movie review and one with a negative one.

pos = ["I've seen this story before but my kids haven't. Boy with troubled past joins military, faces his past, falls in love and becomes a man. "
       "The mentor this time is played perfectly by Kevin Costner; An ordinary man with common everyday problems who lives an extraordinary "
       "conviction, to save lives. After losing his team he takes a teaching position training the next generation of heroes. The young troubled "
       "recruit is played by Kutcher. While his scenes with the local love interest are a tad stiff and don't generate enough heat to melt butter, "
       "he compliments Costner well. I never really understood Sela Ward as the neglected wife and felt she should of wanted Costner to quit out of "
       "concern for his safety as opposed to her selfish needs. But her presence on screen is a pleasure. The two unaccredited stars of this movie "
       "are the Coast Guard and the Sea. Both powerful forces which should not be taken for granted in real life or this movie. The movie has some "
       "slow spots and could have used the wasted 15 minutes to strengthen the character relationships. But it still works. The rescue scenes are "
       "intense and well filmed and edited to provide maximum impact. This movie earns the audience applause. And the applause of my two sons."]
print("Pos prediction: {}". format(lr.predict(vect.transform(pos))))

This outputs:

Pos prediction: [1]

Here, 1 means the review was predicted to be positive.
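The class indices come from the folder names that load_files discovered; you can confirm the mapping with:

print(reviews_train.target_names)  # ['neg', 'pos'] -> 0 = negative, 1 = positive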

neg = ["David Bryce\'s comments nearby are exceptionally well written and informative as almost say everything "
       "I feel about DARLING LILI. This massive musical is so peculiar and over blown, over produced and must have "
       "caused ruptures at Paramount in 1970. It cost 22 million dollars! That is simply irresponsible. DARLING LILI "
       "must have been greenlit from a board meeting that said \"hey we got that Pink Panther guy and that Sound Of Music gal... "
       "lets get this too\" and handed over a blank cheque. The result is a hybrid of GIGI, ZEPPELIN, HALF A SIXPENCE, some MGM 40s "
       "song and dance numbers of a style (daisies and boaters!) so hopelessly old fashioned as to be like musical porridge, and MATA HARI "
       "dramatics. The production is colossal, lush, breathtaking to view, but the rest: the ridiculous romance, Julie looking befuddled, Hudson "
       "already dead, the mistimed comedy, and the astoundingly boring songs deaden this spectacular film into being irritating. LILI is"
       " like a twee 1940s mega musical with some vulgar bits to spice it up. STAR! released the year before sadly crashed and now is being "
       "finally appreciated for the excellent film is genuinely is... and Andrews looks sublime, mature, especially in the last half hour......"
       "but LILI is POPPINS and DOLLY frilly and I believe really killed off the mega musical binge of the 60s..... "
       "and made Andrews look like Poppins again... which I believe was not Edwards intention. Paramount must have collectively fainted "
       "when they saw this: and with another $20 million festering in CATCH 22, and $12 million in ON A CLEAR DAY and $25 million in PAINT YOUR WAGON...."
       "they had a financial abyss of CLEOPATRA proportions with $77 million tied into 4 films with very uncertain futures. Maybe they should have asked seer "
       "Daisy Gamble from ON A CLEAR DAY ......LILI was very popular on immediate first release in Australia and ran in 70mm cinemas for months but it failed "
       "once out in the subs and the sticks and only ever surfaced after that on one night stands with ON A CLEAR DAY as a Sunday night double. Thank "
       "god Paramount had their simple $1million (yes, ONE MILLION DOLLAR) film LOVE STORY and that $4 million dollar gangster pic THE GODFATHER "
       "also ready to recover all the $77 million in just the next two years....for just $5m.... incredible!"]
print("Neg prediction: {}". format(lr.predict(vect.transform(neg))))

This outputs:

Neg prediction: [0]

Here, 0 means the review was predicted to be negative.

Full source code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_files
from sklearn.model_selection import GridSearchCV
import numpy as np
import mglearn
import matplotlib.pyplot as plt

reviews_train = load_files("aclImdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target

print("Number of documents in train data: {}".format(len(text_train)))
print("Samples per class (train): {}".format(np.bincount(y_train)))

reviews_test = load_files("aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target

print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))


vect = CountVectorizer(min_df=5, ngram_range=(2, 2))
X_train = vect.fit(text_train).transform(text_train)
X_test = vect.transform(text_test)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("X_train:\n{}".format(repr(X_train)))
print("X_test: \n{}".format(repr(X_test)))

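# note: on scikit-learn >= 1.0 this method is named get_feature_names_out()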
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)

mglearn.tools.visualize_coefficients(grid.best_estimator_.coef_, feature_names, n_top_features=25)
plt.show()

lr = grid.best_estimator_  # already refit on the full training set by GridSearchCV
print("Score: {:.2f}".format(lr.score(X_test, y_test)))

pos = ["I've seen this story before but my kids haven't. Boy with troubled past joins military, faces his past, falls in love and becomes a man. "
       "The mentor this time is played perfectly by Kevin Costner; An ordinary man with common everyday problems who lives an extraordinary "
       "conviction, to save lives. After losing his team he takes a teaching position training the next generation of heroes. The young troubled "
       "recruit is played by Kutcher. While his scenes with the local love interest are a tad stiff and don't generate enough heat to melt butter, "
       "he compliments Costner well. I never really understood Sela Ward as the neglected wife and felt she should of wanted Costner to quit out of "
       "concern for his safety as opposed to her selfish needs. But her presence on screen is a pleasure. The two unaccredited stars of this movie "
       "are the Coast Guard and the Sea. Both powerful forces which should not be taken for granted in real life or this movie. The movie has some "
       "slow spots and could have used the wasted 15 minutes to strengthen the character relationships. But it still works. The rescue scenes are "
       "intense and well filmed and edited to provide maximum impact. This movie earns the audience applause. And the applause of my two sons."]
print("Pos prediction: {}". format(lr.predict(vect.transform(pos))))

neg = ["David Bryce\'s comments nearby are exceptionally well written and informative as almost say everything "
       "I feel about DARLING LILI. This massive musical is so peculiar and over blown, over produced and must have "
       "caused ruptures at Paramount in 1970. It cost 22 million dollars! That is simply irresponsible. DARLING LILI "
       "must have been greenlit from a board meeting that said \"hey we got that Pink Panther guy and that Sound Of Music gal... "
       "lets get this too\" and handed over a blank cheque. The result is a hybrid of GIGI, ZEPPELIN, HALF A SIXPENCE, some MGM 40s "
       "song and dance numbers of a style (daisies and boaters!) so hopelessly old fashioned as to be like musical porridge, and MATA HARI "
       "dramatics. The production is colossal, lush, breathtaking to view, but the rest: the ridiculous romance, Julie looking befuddled, Hudson "
       "already dead, the mistimed comedy, and the astoundingly boring songs deaden this spectacular film into being irritating. LILI is"
       " like a twee 1940s mega musical with some vulgar bits to spice it up. STAR! released the year before sadly crashed and now is being "
       "finally appreciated for the excellent film is genuinely is... and Andrews looks sublime, mature, especially in the last half hour......"
       "but LILI is POPPINS and DOLLY frilly and I believe really killed off the mega musical binge of the 60s..... "
       "and made Andrews look like Poppins again... which I believe was not Edwards intention. Paramount must have collectively fainted "
       "when they saw this: and with another $20 million festering in CATCH 22, and $12 million in ON A CLEAR DAY and $25 million in PAINT YOUR WAGON...."
       "they had a financial abyss of CLEOPATRA proportions with $77 million tied into 4 films with very uncertain futures. Maybe they should have asked seer "
       "Daisy Gamble from ON A CLEAR DAY ......LILI was very popular on immediate first release in Australia and ran in 70mm cinemas for months but it failed "
       "once out in the subs and the sticks and only ever surfaced after that on one night stands with ON A CLEAR DAY as a Sunday night double. Thank "
       "god Paramount had their simple $1million (yes, ONE MILLION DOLLAR) film LOVE STORY and that $4 million dollar gangster pic THE GODFATHER "
       "also ready to recover all the $77 million in just the next two years....for just $5m.... incredible!"]
print("Neg prediction: {}". format(lr.predict(vect.transform(neg))))

More things to consider:

When analyzing text there are more things to consider, such as lemmatization, stemming, and term frequency-inverse document frequency (tf-idf). You can learn how to use them online as well as from [1]. For the purposes of this example project, however, I found that these techniques significantly increased the running time without any significant gain in accuracy. A sketch of the tf-idf variant follows below.
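As one example, swapping in tf-idf weighting only requires replacing the vectorizer; a minimal sketch (the rest of the pipeline stays the same):

from sklearn.feature_extraction.text import TfidfVectorizer

# drop-in replacement for CountVectorizer, with tf-idf weighting
vect = TfidfVectorizer(min_df=5, ngram_range=(2, 2))
X_train = vect.fit_transform(text_train)
X_test = vect.transform(text_test)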

I hope this article has been useful to some, if not many. This is my first article on a machine learning topic and I am by no means an expert in the field; I am still learning. If you liked this article, follow me here or on Twitter.

References:

[1] http://shop.oreilly.com/product/0636920030515.do

[2] http://nbviewer.jupyter.org/github/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb

[3] https://medium.com/@rnbrown/more-nlp-with-sklearns-countvectorizer-add577a0b8c8