In this article we will focus on analysing the IMDb movie review dataset and try to predict whether a review is positive or negative. Familiarity with some basic machine learning concepts will help you follow the code and the algorithms. We will use the popular machine learning library scikit-learn.
Preparing the dataset:
We will use the dataset from here - http://ai.stanford.edu/~amaas/data/sentiment/
After downloading the dataset, the unneeded files and folders were removed, so that only the train and test folders with their pos and neg subfolders remain.
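A minimal cleanup sketch, assuming the archive was extracted into the working directory; the unlabeled train/unsup folder is the main thing that would confuse load_files, which expects one subfolder per class:

import shutil
from pathlib import Path

# The archive ships an unlabeled train/unsup folder; load_files expects one
# subfolder per class, so we delete it (path assumes the archive was extracted
# into the current working directory).
unsup_dir = Path("aclImdb/train/unsup")
if unsup_dir.exists():
    shutil.rmtree(unsup_dir)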
Loading the data into the program:
We will load and inspect the training and test data to understand its nature. In this case, the training and test data have the same format.
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("aclImdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
print("Number of documents in train data: {}".format(len(text_train)))
print("Samples per class (train): {}".format(np.bincount(y_train)))

reviews_test = load_files("aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
scikit-learn provides load_files for reading this kind of text data. After loading the data, we print the number of documents (train/test) and the number of samples per class (positive/negative), which looks like this:
Number of documents in train data: 25000
Samples per class (train): [12500 12500]
Number of documents in test data: 25000
Samples per class (test): [12500 12500]
We can see that there are 25000 training and 25000 test samples in total, with 12500 per class (pos and neg).
Representing the text data as a bag of words:
We want to count word occurrences using a bag-of-words representation, which involves three steps: splitting each document into tokens, building a vocabulary of all the words, and counting how often each word appears in each document.
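As a quick illustration of these steps on a tiny made-up corpus, fit builds the vocabulary and transform produces the matrix of counts:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the movie was good", "the movie was bad, really bad"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)

print(toy_vect.get_feature_names())   # the vocabulary: ['bad', 'good', 'movie', 'really', 'the', 'was']
print(toy_counts.toarray())           # one row of word counts per document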
To represent the input dataset as a bag of words, we will use CountVectorizer: we fit it on the training texts and then call its transform method. CountVectorizer is a transformer that converts the input documents into a sparse matrix of features.
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=5, ngram_range=(2, 2))
X_train = vect.fit(text_train).transform(text_train)
X_test = vect.transform(text_test)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("X_train:\n{}".format(repr(X_train)))
print("X_test: \n{}".format(repr(X_test)))

feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
CountVectorizer is used with two parameters -
- min_df (=5): the minimum number of documents a word must appear in to be kept as a feature.
- ngram_range (=(2, 2)): ngram_range is a tuple that sets the minimum and maximum length of the token sequences to consider. Here both are 2, so only sequences of exactly 2 tokens (bigrams), such as "wise man", are extracted - see the short sketch below.
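A short sketch of what ngram_range=(2, 2) means on a single made-up sentence - only pairs of adjacent tokens make it into the vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(["the wise man watched the movie"])
print(bigram_vect.get_feature_names())
# ['man watched', 'the movie', 'the wise', 'watched the', 'wise man']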
Every entry in the resulting matrix is treated as a feature. The output of the code snippet above looks like this:
Vocabulary size: 129549
X_train:
<25000x129549 sparse matrix of type '<class 'numpy.int64'>'
	with 3607330 stored elements in Compressed Sparse Row format>
X_test: 
<25000x129549 sparse matrix of type '<class 'numpy.int64'>'
	with 3392376 stored elements in Compressed Sparse Row format>
Number of features: 129549
In total, 129549 features were found.
Building the model:
We will use LogisticRegression to build the model, since LogisticRegression often works best on high-dimensional sparse data like ours.
While building the model, we need to do two more things:
- Grid search: to tune the parameters of LogisticRegression. We want to find out which value of the regularization parameter 'C' gives the best accuracy.
- Cross-validation: to avoid overfitting to the training data.
See [2] for more information on grid search and cross-validation.
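Roughly speaking, the grid search in the next snippet is equivalent to the hand-written loop below (a simplified sketch; GridSearchCV also refits the best model on the full training set for us):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

best_score, best_C = 0, None
for C in [0.001, 0.01, 0.1, 1, 10]:
    # 5-fold cross-validation: average accuracy over 5 train/validation splits
    scores = cross_val_score(LogisticRegression(C=C), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_score, best_C = scores.mean(), C
print("Best C: {}, cross-validation score: {:.2f}".format(best_C, best_score))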
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)
Here we use five-fold cross-validation with GridSearchCV. After fitting the training data we can look at best_score_, best_params_ (the chosen value of 'C') and best_estimator_ (the model we are going to use).
The output of the code snippet above looks like this:
Best cross-validation score: 0.88
Best parameters:  {'C': 1}
Best estimator:  LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None,
          penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0,
          warm_start=False)
We end up with a model with C=1 and a cross-validation accuracy of 88 percent.
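If you want to see how each value of C performed rather than just the winner, the per-candidate scores are available in grid.cv_results_ (a standard GridSearchCV attribute):

# Mean validation accuracy for every candidate value of C tried by the grid search
for params, score in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    print("{}: {:.3f}".format(params, score))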
We also want to display the 25 most positive and 25 most negative features.
import matplotlib.pyplot as plt
import mglearn

mglearn.tools.visualize_coefficients(grid.best_estimator_.coef_, feature_names, n_top_features=25)
plt.show()
* mglearn is a helper library that accompanies the book [1]. You can download it here - https://github.com/amueller/mglearn.
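If mglearn is not available, a rough equivalent with plain numpy and matplotlib (a sketch, not the book's exact figure) is to sort the learned coefficients and plot the two extremes:

import numpy as np
import matplotlib.pyplot as plt

coefs = grid.best_estimator_.coef_.ravel()
order = np.argsort(coefs)                        # most negative coefficients first, most positive last
interesting = np.hstack([order[:25], order[-25:]])

plt.figure(figsize=(15, 5))
plt.bar(range(len(interesting)), coefs[interesting])
plt.xticks(range(len(interesting)), np.array(feature_names)[interesting], rotation=60, ha="right")
plt.ylabel("Coefficient magnitude")
plt.show()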
Making predictions:
Now we will make predictions on our test data using the trained model.
lr = grid.best_estimator_
lr.fit(X_train, y_train)
lr.predict(X_test)
print("Score: {:.2f}".format(lr.score(X_test, y_test)))
The prediction gives a score of 88% on the test data.
Score: 0.88
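If you want more detail than a single accuracy number, scikit-learn's classification_report shows per-class precision and recall for the same fitted model (an optional extra):

from sklearn.metrics import classification_report

# Per-class precision and recall on the test set (class 0 = neg, class 1 = pos)
print(classification_report(y_test, lr.predict(X_test), target_names=["neg", "pos"]))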
To check how the model behaves on individual examples, we will make one prediction on a positive movie review and one on a negative one.
pos = ["I've seen this story before but my kids haven't. Boy with troubled past joins military, faces his past, falls in love and becomes a man. " "The mentor this time is played perfectly by Kevin Costner; An ordinary man with common everyday problems who lives an extraordinary " "conviction, to save lives. After losing his team he takes a teaching position training the next generation of heroes. The young troubled " "recruit is played by Kutcher. While his scenes with the local love interest are a tad stiff and don't generate enough heat to melt butter, " "he compliments Costner well. I never really understood Sela Ward as the neglected wife and felt she should of wanted Costner to quit out of " "concern for his safety as opposed to her selfish needs. But her presence on screen is a pleasure. The two unaccredited stars of this movie " "are the Coast Guard and the Sea. Both powerful forces which should not be taken for granted in real life or this movie. The movie has some " "slow spots and could have used the wasted 15 minutes to strengthen the character relationships. But it still works. The rescue scenes are " "intense and well filmed and edited to provide maximum impact. This movie earns the audience applause. And the applause of my two sons."] print("Pos prediction: {}". format(lr.predict(vect.transform(pos))))
This prints -
Pos prediction: [1]
Here 1 means the review was predicted to be positive.
neg = ["David Bryce\'s comments nearby are exceptionally well written and informative as almost say everything " "I feel about DARLING LILI. This massive musical is so peculiar and over blown, over produced and must have " "caused ruptures at Paramount in 1970. It cost 22 million dollars! That is simply irresponsible. DARLING LILI " "must have been greenlit from a board meeting that said \"hey we got that Pink Panther guy and that Sound Of Music gal... " "lets get this too\" and handed over a blank cheque. The result is a hybrid of GIGI, ZEPPELIN, HALF A SIXPENCE, some MGM 40s " "song and dance numbers of a style (daisies and boaters!) so hopelessly old fashioned as to be like musical porridge, and MATA HARI " "dramatics. The production is colossal, lush, breathtaking to view, but the rest: the ridiculous romance, Julie looking befuddled, Hudson " "already dead, the mistimed comedy, and the astoundingly boring songs deaden this spectacular film into being irritating. LILI is" " like a twee 1940s mega musical with some vulgar bits to spice it up. STAR! released the year before sadly crashed and now is being " "finally appreciated for the excellent film is genuinely is... and Andrews looks sublime, mature, especially in the last half hour......" "but LILI is POPPINS and DOLLY frilly and I believe really killed off the mega musical binge of the 60s..... " "and made Andrews look like Poppins again... which I believe was not Edwards intention. Paramount must have collectively fainted " "when they saw this: and with another $20 million festering in CATCH 22, and $12 million in ON A CLEAR DAY and $25 million in PAINT YOUR WAGON...." "they had a financial abyss of CLEOPATRA proportions with $77 million tied into 4 films with very uncertain futures. Maybe they should have asked seer " "Daisy Gamble from ON A CLEAR DAY ......LILI was very popular on immediate first release in Australia and ran in 70mm cinemas for months but it failed " "once out in the subs and the sticks and only ever surfaced after that on one night stands with ON A CLEAR DAY as a Sunday night double. Thank " "god Paramount had their simple $1million (yes, ONE MILLION DOLLAR) film LOVE STORY and that $4 million dollar gangster pic THE GODFATHER " "also ready to recover all the $77 million in just the next two years....for just $5m.... incredible!"] print("Neg prediction: {}". format(lr.predict(vect.transform(neg))))
This prints -
Neg prediction: [0]
Here 0 means the review was predicted to be negative.
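The 0/1 labels come directly from the class folder names that load_files read, so the prediction can also be printed as a class name:

# load_files stores the (sorted) class folder names, here ['neg', 'pos']
print(reviews_train.target_names)

prediction = lr.predict(vect.transform(neg))[0]
print("Predicted class: {}".format(reviews_train.target_names[prediction]))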
Full source code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_files
from sklearn.model_selection import GridSearchCV
import numpy as np
import mglearn
import matplotlib.pyplot as plt

reviews_train = load_files("aclImdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
print("Number of documents in train data: {}".format(len(text_train)))
print("Samples per class (train): {}".format(np.bincount(y_train)))

reviews_test = load_files("aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))

vect = CountVectorizer(min_df=5, ngram_range=(2, 2))
X_train = vect.fit(text_train).transform(text_train)
X_test = vect.transform(text_test)
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("X_train:\n{}".format(repr(X_train)))
print("X_test: \n{}".format(repr(X_test)))
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)

mglearn.tools.visualize_coefficients(grid.best_estimator_.coef_, feature_names, n_top_features=25)
plt.show()

lr = grid.best_estimator_
lr.predict(X_test)
print("Score: {:.2f}".format(lr.score(X_test, y_test)))

pos = ["I've seen this story before but my kids haven't. Boy with troubled past joins military, faces his past, falls in love and becomes a man. "
       "The mentor this time is played perfectly by Kevin Costner; An ordinary man with common everyday problems who lives an extraordinary "
       "conviction, to save lives. After losing his team he takes a teaching position training the next generation of heroes. The young troubled "
       "recruit is played by Kutcher. While his scenes with the local love interest are a tad stiff and don't generate enough heat to melt butter, "
       "he compliments Costner well. I never really understood Sela Ward as the neglected wife and felt she should of wanted Costner to quit out of "
       "concern for his safety as opposed to her selfish needs. But her presence on screen is a pleasure. The two unaccredited stars of this movie "
       "are the Coast Guard and the Sea. Both powerful forces which should not be taken for granted in real life or this movie. The movie has some "
       "slow spots and could have used the wasted 15 minutes to strengthen the character relationships. But it still works. The rescue scenes are "
       "intense and well filmed and edited to provide maximum impact. This movie earns the audience applause. And the applause of my two sons."]
print("Pos prediction: {}".format(lr.predict(vect.transform(pos))))

neg = ["David Bryce's comments nearby are exceptionally well written and informative as almost say everything "
       "I feel about DARLING LILI. This massive musical is so peculiar and over blown, over produced and must have "
       "caused ruptures at Paramount in 1970. It cost 22 million dollars! That is simply irresponsible. DARLING LILI "
       "must have been greenlit from a board meeting that said \"hey we got that Pink Panther guy and that Sound Of Music gal... "
       "lets get this too\" and handed over a blank cheque. The result is a hybrid of GIGI, ZEPPELIN, HALF A SIXPENCE, some MGM 40s "
       "song and dance numbers of a style (daisies and boaters!) so hopelessly old fashioned as to be like musical porridge, and MATA HARI "
       "dramatics. The production is colossal, lush, breathtaking to view, but the rest: the ridiculous romance, Julie looking befuddled, Hudson "
       "already dead, the mistimed comedy, and the astoundingly boring songs deaden this spectacular film into being irritating. LILI is"
       " like a twee 1940s mega musical with some vulgar bits to spice it up. STAR! released the year before sadly crashed and now is being "
       "finally appreciated for the excellent film is genuinely is... and Andrews looks sublime, mature, especially in the last half hour......"
       "but LILI is POPPINS and DOLLY frilly and I believe really killed off the mega musical binge of the 60s..... "
       "and made Andrews look like Poppins again... which I believe was not Edwards intention. Paramount must have collectively fainted "
       "when they saw this: and with another $20 million festering in CATCH 22, and $12 million in ON A CLEAR DAY and $25 million in PAINT YOUR WAGON...."
       "they had a financial abyss of CLEOPATRA proportions with $77 million tied into 4 films with very uncertain futures. Maybe they should have asked seer "
       "Daisy Gamble from ON A CLEAR DAY ......LILI was very popular on immediate first release in Australia and ran in 70mm cinemas for months but it failed "
       "once out in the subs and the sticks and only ever surfaced after that on one night stands with ON A CLEAR DAY as a Sunday night double. Thank "
       "god Paramount had their simple $1million (yes, ONE MILLION DOLLAR) film LOVE STORY and that $4 million dollar gangster pic THE GODFATHER "
       "also ready to recover all the $77 million in just the next two years....for just $5m.... incredible!"]
print("Neg prediction: {}".format(lr.predict(vect.transform(neg))))
A few more things to think about:
When analysing text, there are more techniques worth considering: lemmatization, stemming, term frequency-inverse document frequency (tf-idf) and so on. You can learn how to use them online and from [1]. For the purpose of this example project, however, I found that these techniques noticeably increased the running time without any significant gain in accuracy.
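If you do want to try tf-idf, it is close to a drop-in replacement for CountVectorizer (a sketch; the rest of the pipeline stays the same):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same interface as CountVectorizer, but the counts are rescaled by inverse document frequency
tfidf_vect = TfidfVectorizer(min_df=5, ngram_range=(2, 2))
X_train_tfidf = tfidf_vect.fit_transform(text_train)
X_test_tfidf = tfidf_vect.transform(text_test)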
I hope this article was useful to some, if not many. This is my first article on a machine learning topic, and I am not an expert in the field - I am still learning. If you liked this article, follow me here or on Twitter.
References:
[1] http://shop.oreilly.com/product/0636920030515.do
[3] https://medium.com/@rnbrown/more-nlp-with-sklearns-countvectorizer-add577a0b8c8