# Deep Learning in Practice: Sentiment Analysis of Movie Reviews from Scratch

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
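Download the tarball and unpack it, e.g. with `wget` followed by `tar -zxvf aclImdb_v1.tar.gz`; this produces an aclImdb/ directory whose layout can be checked with: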

`tree aclImdb -L 2`
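The top two levels contain train/ and test/ splits, each with pos/ and neg/ folders holding one review per text file; train/ also has an unsup/ folder of unlabeled reviews, which the scripts below skip. Opening one sample file: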

`vim 1234_10.txt`

`I grew up watching this movie ,and I still love it just as much today as when i was a kid. Don't listen to the critic reviews. They are not accurate on this film.Eddie Murphy really shines in his roll.You can sit down with your whole family and everybody will enjoy it.I recommend this movie to everybody to see. It is a comedy with a touch of fantasy.With demons ,dragons,and a little bald kid with God like powers.This movie takes you from L.A. to Tibet , of into the amazing view of the wondrous temples of the mountains in Tibet.Just a beautiful view! So go do your self a favor and snatch this one up! You wont regret it!`
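Per the dataset's README, file names follow the [id]_[rating].txt pattern, so 1234_10.txt is review number 1234 with a 10/10 rating; reviews rated 7 or higher land in pos/ and 4 or lower in neg/.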

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
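Keras's own loader works from a preprocessed imdb.npz archive of index sequences plus labels. Inspecting such a file in IPython (the path below assumes it sits in the working directory) shows the target format this post reproduces from the raw data: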

```
In [1]: import numpy as np

In [2]: f = np.load('imdb.npz')

In [3]: f.keys()
Out[3]: ['x_test', 'x_train', 'y_train', 'y_test']

In [4]: x_train, y_train, x_test, y_test = f['x_train'], f['y_train'], f['x_test'], f['y_test']

In [5]: len(x_train), len(y_train), len(x_test), len(y_test)
Out[5]: (25000, 25000, 25000, 25000)

In [6]: x_train.shape
Out[6]: (25000,)

In [7]: y_train.shape
Out[7]: (25000,)
...

In [12]: x_train[0:2]
Out[12]:
array([ [23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215],
[23777, 39, 81226, 14, 739, 20387, 3428, 44, 74, 32, 1831, 15, 150, 18, 112, 3, 1344, 5, 336, 145, 20, 1, 887, 12, 68, 277, 1189, 403, 34, 119, 282, 36, 167, 5, 393, 154, 39, 2299, 15, 1, 548, 88, 81, 101, 4, 1, 3273, 14, 40, 3, 413, 1200, 134, 8208, 41, 180, 138, 14, 3086, 1, 322, 20, 4930, 28948, 359, 5, 3112, 2128, 1, 20045, 19339, 39, 8208, 45, 3661, 27, 372, 5, 127, 53, 20, 1, 1983, 7, 7, 18, 48, 45, 22, 68, 345, 3, 2131, 5, 409, 20, 1, 1983, 15, 3, 3238, 206, 1, 31645, 22, 277, 66, 36, 3, 341, 1, 719, 729, 3, 3865, 1265, 20, 1, 1510, 3, 1219, 2, 282, 22, 277, 2525, 5, 64, 48, 42, 37, 5, 27, 3273, 12, 6, 23030, 75120, 2034, 7, 7, 3771, 3225, 34, 4186, 34, 378, 14, 12583, 296, 3, 1023, 129, 34, 44, 282, 8, 1, 179, 363, 7067, 5, 94, 3, 2131, 16, 3, 5211, 3005, 15913, 21720, 5, 64, 45, 26, 67, 409, 8, 1, 1983, 15, 3261, 501, 206, 1, 31645, 45, 12583, 2877, 26, 67, 78, 48, 26, 491, 16, 3, 702, 1184, 4, 228, 50, 4505, 1, 43259, 20, 118, 12583, 6, 1373, 20, 1, 887, 16, 3, 20447, 20, 24, 3964, 5, 10455, 24, 172, 844, 118, 26, 188, 1488, 122, 1, 6616, 237, 345, 1, 13891, 32804, 31, 3, 39870, 100, 42, 395, 20, 24, 12130, 118, 12583, 889, 82, 102, 584, 3, 252, 31, 1, 400, 4, 4787, 16974, 1962, 3861, 32, 1230, 3186, 34, 185, 4310, 156, 2325, 38, 341, 2, 38, 9048, 7355, 2231, 4846, 2, 32880, 8938, 2610, 34, 23, 457, 340, 5, 1, 1983, 504, 4355, 12583, 215, 237, 21, 340, 5, 4468, 5996, 34689, 37, 26, 277, 119, 51, 109, 1023, 118, 42, 545, 39, 2814, 513, 39, 27, 553, 7, 7, 134, 1, 116, 2022, 197, 4787, 2, 12583, 283, 1667, 5, 111, 10, 255, 110, 4382, 5, 27, 28, 4, 3771, 12267, 16617, 105, 118, 2597, 5, 109, 3, 209, 9, 284, 3, 4325, 496, 1076, 5, 24, 2761, 154, 138, 14, 7673, 11900, 182, 5276, 39, 20422, 15, 1, 548, 5, 120, 48, 42, 37, 257, 139, 4530, 156, 2325, 9, 1, 372, 248, 39, 20, 1, 82, 505, 228, 3, 376, 2131, 37, 29, 1023, 81, 78, 51, 33, 89, 121, 48, 5, 78, 16, 65, 275, 276, 33, 141, 199, 9, 5, 1, 3273, 302, 4, 769, 9, 37, 17648, 275, 7, 7, 39, 276, 11, 19, 77, 6018, 22, 5, 336, 406]], dtype=object)

In [13]: y_train[0:2]
Out[13]: array([1, 1])

In [14]: x_test.shape
Out[14]: (25000,)

In [15]: y_test.shape
Out[15]: (25000,)

In [16]: x_test[0:2]
Out[16]:
array([ [10, 432, 2, 216, 11, 17, 233, 311, 100, 109, 27791, 5, 31, 3, 168, 366, 4, 1920, 634, 971, 12, 10, 13, 5523, 5, 64, 9, 85, 36, 48, 10, 694, 4, 13059, 15969, 26, 13, 61, 499, 5, 78, 209, 10, 13, 352, 15969, 253, 1, 106, 4, 3270, 14998, 52, 70, 2, 1839, 11762, 253, 1019, 7655, 16, 138, 12866, 1, 1910, 4, 3, 49, 17, 6, 12, 9, 67, 2885, 16, 260, 1435, 11, 28, 119, 615, 12, 1, 433, 747, 60, 13, 2959, 43, 13, 3080, 31, 2126, 312, 1, 83, 317, 4, 1, 17, 2, 68, 1678, 5, 1671, 312, 1, 330, 317, 134, 14200, 1, 747, 10, 21, 61, 216, 108, 369, 8, 1671, 18, 108, 365, 2068, 346, 14, 70, 266, 2721, 21, 5, 384, 256, 64, 95, 2575, 11, 17, 13, 84, 2, 10, 1464, 12, 22, 137, 64, 9, 156, 22, 1916],
[281, 676, 164, 985, 5696, 1157, 53, 24, 2425, 2013, 1, 3357, 186, 11603, 16, 11, 220, 2572, 2252, 450, 41, 1, 21308, 1203, 587, 908, 118, 3, 182, 295, 47415, 5157, 36, 24, 4486, 975, 5, 294, 426, 24, 7117, 8, 48, 13, 2275, 14, 1, 830, 497, 123, 253, 143, 54, 334, 4, 8891, 2, 131, 10465, 9594, 2252, 1551, 23, 3, 9591, 3, 2517, 88, 1030, 221, 5, 1755, 959, 16, 4628, 2, 2376, 129, 18, 46, 86, 11, 19, 13, 8480, 29, 1, 169, 7, 7, 1, 19, 514, 16, 46, 1515, 633, 895, 835, 3, 51329, 307, 4, 1, 1122, 633, 895, 4, 27000, 49040, 2, 5544, 18, 35402, 364, 1361, 15, 91, 83, 31, 1, 1393, 531, 277, 1, 203, 1099, 5, 1, 1203, 587, 908, 180, 1258, 53, 52, 70, 5696, 124, 3, 324, 289, 2, 284, 3, 9408, 15, 1131, 3664, 15697, 10, 444, 1, 2514, 11836, 4223, 4, 1, 203, 20, 248, 104, 4, 1, 908, 12, 19323, 1, 111, 1034, 39, 760, 46, 2073, 1984, 1134, 5, 1, 3917, 222, 46, 1441, 106, 940, 51, 1, 695, 1332, 6, 2365, 31, 1215, 4, 1, 15171, 8, 325, 3672, 2, 347, 6085, 34, 2727, 24, 220, 17370, 14, 3, 503, 5, 94, 93, 15, 3, 8891, 262, 26, 79, 124, 3, 49, 289, 4, 2006, 5004, 48, 268, 20, 8, 1, 73329, 1825, 464, 5097, 8891, 3, 2146, 354, 4106, 6, 836, 6313, 1236, 130, 1106, 141, 79, 27, 345, 1, 267, 16132, 2, 2295, 2547, 15, 1852, 32, 1725, 807, 415, 838, 4, 1313, 2, 5788, 30, 1, 451, 4, 1, 10257, 1114, 7, 7, 22, 121, 86, 11, 6, 167, 5, 127, 21, 61, 85, 42, 445, 20, 3, 280, 62, 18, 79, 85, 105, 8, 11, 509, 791, 1, 169, 14212, 117, 2, 117, 18, 5696, 1454, 20, 3, 125, 71, 853, 120, 2, 379, 10442, 50, 673, 493, 1, 367, 71, 26, 123, 66, 8, 1008, 4, 9, 463, 1, 4374, 873, 11, 6, 3, 324, 2, 773, 19, 5, 3660, 15, 12, 1012, 5, 166, 32, 308]], dtype=object)

In [17]: y_test[0:2]
Out[17]: array([1, 1])
```

```
numpy==1.15.2
sacremoses==0.0.5
six==1.11.0
```
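Save the pinned dependencies above as requirements.txt and install them with `pip install -r requirements.txt`. The first script walks the raw reviews, tokenizes them, and builds a frequency-ranked word index: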

```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner ([email protected])

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six

from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def build_word_index(input_dir, output_json):
    word_count = OrderedDict()
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(r".*\d+_\d+\.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                if 'unsup' in filepath:
                    continue
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            if word not in word_count:
                                word_count[word] = 0
                            word_count[word] += 1
    words = list(word_count.keys())
    counts = list(word_count.values())

    # sort tokens by frequency, most frequent first
    sorted_idx = np.argsort(counts)
    sorted_words = [words[ii] for ii in sorted_idx[::-1]]

    # rank 1 is the most frequent token; 0 stays reserved for unknown words
    word_index = OrderedDict()
    for ii, ww in enumerate(sorted_words):
        word_index[ww] = ii + 1

    with open(output_json, 'w') as fp:
        json.dump(word_index, fp)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_dir',
                        default='./data/aclImdb/',
                        help='input data directory')
    parser.add_argument('--output_json',
                        default='./data/aclimdb_word_index.json',
                        help='output word index dict json')
    args = parser.parse_args()
    input_dir = args.input_dir
    output_json = args.output_json
    build_word_index(input_dir, output_json)
```

`python build_word_index.py`
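This walks all labeled reviews and writes aclimdb_word_index.json, with rank 1 assigned to the most frequent token. As a quick sketch of what MosesTokenizer does to a sentence (the exact token list may vary with the sacremoses version):

```
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()
# punctuation is split into separate tokens; the apostrophe is escaped
# (e.g. &apos;) unless tokenize() is called with escape=False
print(tokenizer.tokenize("Don't listen to the critic reviews."))
```

The next script maps each review to its sequence of word ranks and attaches the pos/neg label: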

```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner ([email protected])

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six

from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def get_word_index(word_index_path):
    with open(word_index_path) as f:
        return json.load(f)


def build_data_index(input_dir, word_index):
    train_x = []
    train_y = []
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(r".*\d+_\d+\.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                # label from the directory name: pos -> 1, neg -> 0, skip unsup
                if 'pos' in filepath:
                    train_y.append(1)
                elif 'neg' in filepath:
                    train_y.append(0)
                else:
                    continue
                train_list = []
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            # words missing from the index map to 0
                            train_list.append(word_index.get(word, 0))
                train_x.append(train_list)
    return train_x, train_y


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir',
                        default='./data/aclImdb/train/',
                        help='train data directory')
    parser.add_argument('--test_dir',
                        default='./data/aclImdb/test/',
                        help='test data directory')
    parser.add_argument('--word_index_path',
                        default='./data/aclimdb_word_index.json',
                        help='aclimdb word index json')
    parser.add_argument('--output_npz',
                        default='./data/aclimdb.npz',
                        help='output npz')
    args = parser.parse_args()
    train_dir = args.train_dir
    test_dir = args.test_dir
    word_index_path = args.word_index_path
    output_npz = args.output_npz
    word_index = get_word_index(word_index_path)
    train_x, train_y = build_data_index(train_dir, word_index)
    test_x, test_y = build_data_index(test_dir, word_index)
    np.savez(output_npz,
             x_train=np.asarray(train_x),
             y_train=np.asarray(train_y),
             x_test=np.asarray(test_x),
             y_test=np.asarray(test_y))
```
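Assuming the script above is saved as build_data_index.py (the file name here is mine, mirroring the earlier command), running `python build_data_index.py` writes ./data/aclimdb.npz. A minimal sanity check of the result:

```
import numpy as np

# quick look at the freshly built npz (default output path assumed)
f = np.load('./data/aclimdb.npz')
print(f['x_train'].shape, f['y_train'].shape)  # expect (25000,) (25000,)
print(f['x_test'].shape, f['y_test'].shape)    # expect (25000,) (25000,)
```

Finally, a small loader module (saved as aclimdb.py) mimics the Keras API on top of this file: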

```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner ([email protected])

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np


def get_word_index(path='./data/aclimdb_word_index.json'):
    with open(path) as f:
        return json.load(f)


def load_data(path='./data/aclimdb.npz', num_words=None, skip_top=0,
              seed=113, start_char=1, oov_char=2, index_from=3):
    """A simplified version of the original imdb.py load_data function:
    https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
    """
    with np.load(path) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    # shuffle the training and test sets reproducibly
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    # shift every rank up by index_from and prepend the start marker
    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if not num_words:
        num_words = max([max(x) for x in xs])

    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]

    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)
```
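The index bookkeeping above is easiest to see on a toy sequence; the numbers below are invented purely for illustration:

```
# toy walk-through of load_data's index handling, with
# start_char=1, oov_char=2, index_from=3, num_words=10, skip_top=0
x = [3, 1, 42]                      # raw frequency ranks from the npz
shifted = [1] + [w + 3 for w in x]  # prepend start marker -> [1, 6, 4, 45]
capped = [w if 0 <= w < 10 else 2 for w in shifted]
print(capped)                       # rank 42 exceeds num_words -> [1, 6, 4, 2]
```

This is why index 1 marks the start of every loaded review, and why decoding later subtracts index_from.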

```
In [1]: import aclimdb

# Note: the code uses a relative path for the data file aclimdb.npz; if you run this from another directory, pass the path argument explicitly
In [2]: (train_data, train_labels), (test_data, test_labels) = aclimdb.load_data(num_words=10000)

In [3]: train_data[0]
Out[3]:
[1,
7799,
1459,
...
11,
13,
3320,
2]

In [4]: train_labels[0]
Out[4]: 0

In [5]: max([max(sequence) for sequence in train_data])
Out[5]: 9999

In [6]: word_index = aclimdb.get_word_index()

In [8]: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

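# ids 0/1/2 are reserved (padding / start / out-of-vocabulary) and real word
# ranks were shifted up by index_from=3 in load_data, so decoding subtracts 3;
# any id without a match falls back to '?'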
In [9]: decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

In [10]: decoded_review
Out[10]: u'? hi folks < br / > < br / > forget about that movie . john c. should be ashamed that he appears as executive producer in the ? bon ? has never been and will never be an actor and the fx are a joke . < br / > < br / > the first vampires was good ... and it was the only vampires . this thing here just wears the same name . < br / > < br / > just a waste of time thinks ... < br / > < br / > jake ?'

In [11]: import numpy as np

In [13]: def vectorize_sequences(sequences, dimension=10000):
...:     results = np.zeros((len(sequences), dimension))
...:     for i, sequence in enumerate(sequences):
...:         results[i, sequence] = 1
...:     return results
...:

In [14]: x_train = vectorize_sequences(train_data)

In [15]: x_test = vectorize_sequences(test_data)

In [16]: x_train[0]
Out[16]: array([0., 1., 1., ..., 0., 0., 0.])

In [17]: y_train = np.asarray(train_labels).astype('float32')

In [18]: y_test = np.asarray(test_labels).astype('float32')

In [19]: from keras import models
Using TensorFlow backend.

In [20]: from keras import layers

In [21]: model = models.Sequential()

In [22]: model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))

In [23]: model.add(layers.Dense(16, activation='relu'))

In [24]: model.add(layers.Dense(1, activation='sigmoid'))

In [25]: model.compile(optimizer='rmsprop',
...:               loss='binary_crossentropy',
...:               metrics=['accuracy'])

In [26]: model.fit(x_train, y_train, epochs=4, batch_size=512)
Epoch 1/4
25000/25000 [==============================] - 3s 140us/step - loss: 0.4544 - acc: 0.8192
Epoch 2/4
25000/25000 [==============================] - 2s 93us/step - loss: 0.2632 - acc: 0.9077
Epoch 3/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.2053 - acc: 0.9244
Epoch 4/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.1708 - acc: 0.9388
Out[26]: <keras.callbacks.History at 0x206cfdc10>

In [27]: results = model.evaluate(x_test, y_test)
25000/25000 [==============================] - 4s 145us/step

In [28]: results
Out[28]: [0.2953770682477951, 0.88304]

In [29]: model.predict(x_test)
Out[29]:
array([[9.9612302e-01],
[9.5416462e-01],
[1.5807265e-05],
...,
[9.9868757e-01],
[8.4713501e-01],
[5.7828808e-01]], dtype=float32)
```
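To close the loop, here is a minimal sketch of scoring a brand-new review with the trained model. It assumes the `tokenizer` and `word_index` objects from the preprocessing scripts and the `vectorize_sequences` function and `model` from this session are all in scope; `predict_review` is a helper invented here for illustration, not part of Keras:

```
def predict_review(text, num_words=10000, start_char=1, oov_char=2, index_from=3):
    # mirror the pipeline above: tokenize, lowercase, map words to ranks,
    # shift by index_from, replace out-of-vocabulary ids with the OOV marker
    words = [w.lower() for w in tokenizer.tokenize(text.strip())]
    seq = [start_char]
    for w in words:
        idx = word_index.get(w, 0) + index_from
        seq.append(idx if idx < num_words else oov_char)
    x = vectorize_sequences([seq])
    return float(model.predict(x)[0][0])  # probability the review is positive

print(predict_review("I grew up watching this movie and I still love it."))
```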