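6. Recommender System Code Example
Reference: "Collaborative Filtering for Movie Recommendations", a Keras example
Overall flow
Prepare the dataset
Split into training and validation data
Build the model
Train
Inspect the training process
Top-N recommendation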
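6.1. Environment Setup
The code in this chapter relies on the following imports, inferred from the names it uses (pd, np, tf, keras, layers, Path, ZipFile).
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
from zipfile import ZipFile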
6.2. Preparing the Dataset
Download the small MovieLens dataset (ml-latest-small).
Read ratings.csv into a DataFrame with pd.read_csv.
# Download the data from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# and use the ratings.csv file.
movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"
# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    # Named zf rather than zip to avoid shadowing the built-in zip().
    with ZipFile(movielens_zipped_file, "r") as zf:
        # Extract files
        print("Extracting all the files now...")
        zf.extractall(path=keras_datasets_path)
        print("Done!")
ratings_file = movielens_dir / "ratings.csv"
df = pd.read_csv(ratings_file)
df
index | userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
... | ... | ... | ... | ... |
100831 | 610 | 166534 | 4.0 | 1493848402 |
100832 | 610 | 168248 | 5.0 | 1493850091 |
100833 | 610 | 168250 | 5.0 | 1494273047 |
100834 | 610 | 168252 | 5.0 | 1493846352 |
100835 | 610 | 170875 | 3.0 | 1493846415 |
100836 rows × 4 columns
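6.3. Data Preprocessing 1: Re-indexing the IDs
userId and movieId are integer labels, but the numbering has gaps. An Embedding layer looks up rows by position, so the IDs are remapped to contiguous integers starting at 0.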
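The remapping can be seen on a toy list before applying it to the full DataFrame. This is a minimal sketch using only the standard library; the names `raw_ids`, `id_to_index`, and `index_to_id` are illustrative and are not part of the tutorial code.

```python
# Toy IDs with gaps and repeats, standing in for the real userId/movieId columns.
raw_ids = [1, 3, 6, 47, 3, 1, 50]

# dict.fromkeys drops duplicates while keeping first-seen order,
# similar to what Series.unique() does in pandas.
unique_ids = list(dict.fromkeys(raw_ids))

# Forward map (raw ID -> contiguous index) and its inverse.
id_to_index = {raw: i for i, raw in enumerate(unique_ids)}
index_to_id = {i: raw for raw, i in id_to_index.items()}

encoded = [id_to_index[r] for r in raw_ids]
print(encoded)  # [0, 1, 2, 3, 1, 0, 4]
```

The inverse map matters later: predictions come back as contiguous indices and have to be translated to the original movieId before looking up titles.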
user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)
num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])
print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)
index | userId | movieId | rating | timestamp | user | movie |
---|---|---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 | 0 | 0 |
1 | 1 | 3 | 4.0 | 964981247 | 0 | 1 |
2 | 1 | 6 | 4.0 | 964982224 | 0 | 2 |
3 | 1 | 47 | 5.0 | 964983815 | 0 | 3 |
4 | 1 | 50 | 5.0 | 964982931 | 0 | 4 |
... | ... | ... | ... | ... | ... | ... |
100831 | 610 | 166534 | 4.0 | 1493848402 | 609 | 3120 |
100832 | 610 | 168248 | 5.0 | 1493850091 | 609 | 2035 |
100833 | 610 | 168250 | 5.0 | 1494273047 | 609 | 3121 |
100834 | 610 | 168252 | 5.0 | 1493846352 | 609 | 1392 |
100835 | 610 | 170875 | 3.0 | 1493846415 | 609 | 2873 |
100836 rows × 6 columns
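The overview lists a train/validation split, but the code for that step does not appear in this chapter. The following is a minimal sketch under stated assumptions: ratings are min-max scaled into [0, 1] (so that a sigmoid output can match them), rows are shuffled, and the first 90% are used for training. The toy rows and the 90/10 ratio are assumptions, not values confirmed by this text.

```python
import random

# Toy (user, movie, rating) rows standing in for df[["user", "movie", "rating"]].
rows = [(0, 0, 4.0), (0, 1, 0.5), (1, 0, 5.0), (1, 2, 3.0), (2, 1, 2.5),
        (2, 3, 4.5), (3, 2, 1.0), (3, 0, 3.5), (4, 3, 5.0), (4, 1, 2.0)]

min_rating = min(r for _, _, r in rows)
max_rating = max(r for _, _, r in rows)

# Shuffle with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)

# Inputs are (user, movie) index pairs; targets are ratings scaled to [0, 1].
x = [(u, m) for u, m, _ in shuffled]
y = [(r - min_rating) / (max_rating - min_rating) for _, _, r in shuffled]

# Assumed 90/10 split: the first 90% of the shuffled rows train, the rest validate.
split = int(0.9 * len(shuffled))
x_train, x_val = x[:split], x[split:]
y_train, y_val = y[:split], y[split:]

print(len(x_train), len(x_val))
```

As a consistency check on the assumed ratio: 90% of the 100836 rows is 90752, and 90752 / 64 = 1418, which matches the 1418 steps per epoch in the training log if the batch size was 64.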
6.5. Building the Model
The model is built from Embedding layers, the same kind of layer used in word2vec. It is deliberately simple: it looks up a user embedding and a movie embedding, combines them with tf.tensordot, adds a per-user and a per-movie bias, and applies a sigmoid.
EMBEDDING_SIZE = 50


class RecommenderNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        # inputs[:, 0] holds encoded user indices, inputs[:, 1] encoded movie indices.
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        # NOTE: with axes=2 on two (batch, embedding_size) tensors, tensordot
        # contracts both axes and returns a single scalar for the whole batch.
        # A per-sample dot product would instead be
        # tf.reduce_sum(user_vector * movie_vector, axis=1, keepdims=True).
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)


model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)
# learning_rate replaces the deprecated lr argument.
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
)
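For a single (user, movie) pair, the score this kind of model is built around is the sigmoid of the embedding dot product plus the two biases. The sketch below computes that score in plain Python and maps it back to a rating. The function name `predict_rating` is hypothetical, and the default bounds 0.5 and 5.0 assume the MovieLens half-star scale and min-max scaled training targets.

```python
import math


def predict_rating(user_vec, movie_vec, user_bias, movie_bias,
                   min_rating=0.5, max_rating=5.0):
    """Score one (user, movie) pair and map the score back to the rating scale."""
    # Dot product of the two embedding vectors.
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    # Sigmoid squashes the biased score into (0, 1).
    score = 1.0 / (1.0 + math.exp(-(dot + user_bias + movie_bias)))
    # Undo the assumed min-max scaling of the targets.
    return min_rating + score * (max_rating - min_rating)


# Hypothetical 3-dimensional embeddings and biases.
rating = predict_rating([0.2, -0.1, 0.4], [0.5, 0.3, 0.1], 0.05, -0.02)
print(round(rating, 2))
```

With all-zero embeddings and biases the sigmoid returns 0.5, so the prediction sits at the midpoint of the scale (2.75). The biases let the model shift that baseline per user and per movie.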
6.6. Training
Epoch 1/5
1418/1418 [==============================] - 12s 7ms/step - loss: 0.6370 - val_loss: 0.6206
Epoch 2/5
1418/1418 [==============================] - 10s 7ms/step - loss: 0.6135 - val_loss: 0.6168
Epoch 3/5
1418/1418 [==============================] - 10s 7ms/step - loss: 0.6082 - val_loss: 0.6126
Epoch 4/5
1418/1418 [==============================] - 11s 7ms/step - loss: 0.6071 - val_loss: 0.6150
Epoch 5/5
1418/1418 [==============================] - 10s 7ms/step - loss: 0.6078 - val_loss: 0.6123
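The loss settles around 0.61 instead of heading toward 0. One likely contributor, assuming the targets are ratings min-max scaled into [0, 1], is that binary cross-entropy against a fractional target has a non-zero floor: it is minimized when the prediction equals the target, and the minimum is the entropy of that target. A small sketch with a hypothetical target value:

```python
import math


def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one target y in [0, 1] and one prediction p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


y = 0.75  # hypothetical scaled rating
for p in (0.25, 0.5, 0.75, 0.9):
    print(f"p={p:.2f}  loss={bce(y, p):.4f}")
```

Even the perfect prediction p = 0.75 yields a loss of about 0.56 here, so the absolute loss value is better read relative to this floor than relative to zero.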
6.7. Plotting the Training History
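The plotting code for this section is not included here. In Keras, model.fit returns a History object whose history attribute is a dict of per-epoch metrics, and those lists are what would normally be plotted. As a dependency-free sketch, the dict below stands in for that attribute, filled with the values copied from the training log in section 6.6, and prints a small table instead of drawing a figure.

```python
# Loss values copied from the training log in section 6.6.
history = {
    "loss": [0.6370, 0.6135, 0.6082, 0.6071, 0.6078],
    "val_loss": [0.6206, 0.6168, 0.6126, 0.6150, 0.6123],
}

print("epoch  loss    val_loss")
pairs = zip(history["loss"], history["val_loss"])
for epoch, (loss, val_loss) in enumerate(pairs, start=1):
    print(f"{epoch:>5}  {loss:.4f}  {val_loss:.4f}")

# Epoch (1-based) with the lowest validation loss.
val = history["val_loss"]
best_epoch = min(range(len(val)), key=val.__getitem__) + 1
print("lowest val_loss at epoch", best_epoch)
```

Both curves flatten after the third epoch, and the validation loss fluctuates slightly while staying close to the training loss.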
6.8. Top-N Recommendation
movie_df = pd.read_csv(movielens_dir / "movies.csv")
# Let us get a user and see the top recommendations.
user_id = df.userId.sample(1).iloc[0]
# Movies the user has already watched (rated).
movies_watched_by_user = df[df.userId == user_id]
# Movies the user has not watched.
# Negate the isin mask, keep only movies that also appear in ratings.csv
# (the keys of movie2movie_encoded), then map them to encoded indices.
movies_not_watched = movie_df[
    ~movie_df["movieId"].isin(movies_watched_by_user.movieId.values)
]["movieId"]
movies_not_watched = list(
    set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
)
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
# Shape the data for prediction: one (user, movie) row per unwatched movie.
user_encoder = user2user_encoded.get(user_id)
user_movie_array = np.hstack(
    ([[user_encoder]] * len(movies_not_watched), movies_not_watched)
)
# Predict with the trained model and take the movie IDs of the 10 highest scores.
ratings = model.predict(user_movie_array).flatten()
top_ratings_indices = ratings.argsort()[-10:][::-1]
recommended_movie_ids = [
    movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
]
# Print 5 of the user's highest-rated watched movies.
print("Showing recommendations for user: {}".format(user_id))
print("====" * 9)
print("Movies with high ratings from user")
print("----" * 8)
top_movies_user = (
    movies_watched_by_user.sort_values(by="rating", ascending=False)
    .head(5)
    .movieId.values
)
movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
for row in movie_df_rows.itertuples():
    print(row.title, ":", row.genres)
# Print the top 10 recommended movies.
print("----" * 8)
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
    print(row.title, ":", row.genres)
Showing recommendations for user: 174
====================================
Movies with high ratings from user
--------------------------------
French Kiss (1995) : Action|Comedy|Romance
Ace Ventura: Pet Detective (1994) : Comedy
Jurassic Park (1993) : Action|Adventure|Sci-Fi|Thriller
Tombstone (1993) : Action|Drama|Western
Batman (1989) : Action|Crime|Thriller
--------------------------------
Top 10 movie recommendations
--------------------------------
Braveheart (1995) : Action|Drama|War
Taxi Driver (1976) : Crime|Drama|Thriller
Godfather, The (1972) : Crime|Drama
Reservoir Dogs (1992) : Crime|Mystery|Thriller
Star Wars: Episode V - The Empire Strikes Back (1980) : Action|Adventure|Sci-Fi
Princess Bride, The (1987) : Action|Adventure|Comedy|Fantasy|Romance
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) : Action|Adventure
Lawrence of Arabia (1962) : Adventure|Drama|War
Apocalypse Now (1979) : Action|Drama|War
Goodfellas (1990) : Crime|Drama
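Because the final lookup filters movie_df with isin, the rows print in the order they appear in movies.csv, not in order of predicted score. If the ranking itself should be visible, the selected IDs can be kept in score order. The sketch below shows one way to do that with the standard library; the scores, titles, and the helper name `top_n` are hypothetical.

```python
import heapq

# Hypothetical predicted scores for unwatched movies, keyed by movieId.
predicted = {10: 0.62, 20: 0.91, 30: 0.47, 40: 0.88, 50: 0.73}
titles = {10: "Movie A", 20: "Movie B", 30: "Movie C", 40: "Movie D", 50: "Movie E"}


def top_n(scores, n):
    """Return (movie_id, score) pairs for the n highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Iterating over the ranked pairs keeps the best-first order when printing.
for rank, (movie_id, score) in enumerate(top_n(predicted, 3), start=1):
    print(rank, titles[movie_id], f"{score:.2f}")
```

In the tutorial code, the equivalent change would be to loop over recommended_movie_ids and look up each title individually, rather than filtering movie_df once with isin.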