{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# トピックモデルによるクラスタリング\n", "トピックモデルとは文書中の単語出現分布を元に傾向(≒トピックらしきもの)を観察しようとするアプローチで、クラスタリングの一種である。なお、一般的なクラスタリング(例えば[k平均法](https://ja.wikipedia.org/wiki/K平均法))では一つのサンプルが一つのクラスタに属するという前提でグルーピングを行うのに対し、トピックモデルでは一つのサンプルが複数のクラスタを内包しているという前提でグルーピングを行う。次の例を眺めるとイメージをつかみやすいだろう。\n", "\n", "- 例1: [トピックモデル入門:WikipediaをLDAモデル化してみた](https://recruit.gmo.jp/engineer/jisedai/blog/topic-model/)\n", "- 例2: [Wikipedia: Topic model](https://en.wikipedia.org/wiki/Topic_model)\n", "\n", "基本的には文書を BoW (CountVectrizor) やそれの重みを調整した TF-IDF 等の「文書単語行列」を作成し、ここから文書館類似度や単語間類似度を元に集約(≒次元削減)を試みる。文書単語行列の作成方法や次元削減方法、類似度の求め方などで様々なアルゴリズムが提案されている。ここでは (1) Bow + [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) によりトピックモデルを行い、[PyLDAvis](https://github.com/bmabey/pyLDAvis)による可視化を通してトピックを観察してみよう。\n", "\n", "なお、トピックモデルの注意点として、トピックそのものは人手による解釈が求められる点が挙げられる。例えば先に上げた[トピックモデル入門:WikipediaをLDAモデル化してみた](https://recruit.gmo.jp/engineer/jisedai/blog/topic-model/)における図2(下図)では「政治」「スポーツ」「国際」といったトピックが並んでいるが、実際には「4-1. トピック観察」を行う必要がある。実際に観察してみよう。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## データの準備\n", "これまで見てきたいつものやつ。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 34834 100 34834 0 0 100k 0 --:--:-- --:--:-- --:--:-- 103k\n" ] } ], "source": [ "!curl -O https://ie.u-ryukyu.ac.jp/~tnal/2022/dm/static/r_assesment.pkl" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlegraderequiredq_idcomment
0工業数学Ⅰ1TrueQ21 (1)特になし
1工業数学Ⅰ1TrueQ21 (2)正直わかりずらい。むだに間があるし。
2工業数学Ⅰ1TrueQ21 (2)例題を取り入れて理解しやすくしてほしい。
3工業数学Ⅰ1TrueQ21 (2)特になし
4工業数学Ⅰ1TrueQ21 (2)スライドに書く文字をもう少しわかりやすくして欲しいです。
\n", "
" ], "text/plain": [ " title grade required q_id comment\n", "0 工業数学Ⅰ 1 True Q21 (1) 特になし\n", "1 工業数学Ⅰ 1 True Q21 (2) 正直わかりずらい。むだに間があるし。\n", "2 工業数学Ⅰ 1 True Q21 (2) 例題を取り入れて理解しやすくしてほしい。\n", "3 工業数学Ⅰ 1 True Q21 (2) 特になし\n", "4 工業数学Ⅰ 1 True Q21 (2) スライドに書く文字をもう少しわかりやすくして欲しいです。" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import collections\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import spacy\n", "from wordcloud import WordCloud\n", "\n", "nlp = spacy.load(\"ja_ginza\")\n", "\n", "assesment_df = pd.read_pickle('r_assesment.pkl')\n", "assesment_df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlegraderequiredq_idcommentwakati
0工業数学Ⅰ1TrueQ21 (1)特になし特に なし
1工業数学Ⅰ1TrueQ21 (2)正直わかりずらい。むだに間があるし。正直 わかる ずらい むだ 間 ある
2工業数学Ⅰ1TrueQ21 (2)例題を取り入れて理解しやすくしてほしい。例題 取り入れる 理解 する
3工業数学Ⅰ1TrueQ21 (2)特になし特に なし
4工業数学Ⅰ1TrueQ21 (2)スライドに書く文字をもう少しわかりやすくして欲しいです。スライド 書く 文字 もう 少し わかる する
.....................
165データマイニング3FalseQ22課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです...課題 難しい もの 多い 時間 多い とる もらえる 非常 良い かなり きつい ござる
166ICT実践英語Ⅰ3FalseQ22オンラインなどで顔を合わせてやりたかったです。オンライン 顔 合わせる やる
167知能情報実験Ⅲ3TrueQ21 (2)unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。unity 操作方法 説明 最初 行く もらえる もう 少し スムーズ できる 思う
168知能情報実験Ⅲ3TrueQ22それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ...それぞれ 任せる いう 形 進める もの なり 進める オンライン 班 員 指導 全く する...
169知能情報実験Ⅲ3TrueQ22モバイルアプリ班\\r\\nHTML/CSS,JavaScriptなどを用いてアプリケーションを...モバイルアプリ 班 \\r\\n HTML CSS javascript 用いる アプリケーショ...
\n", "

170 rows × 6 columns

\n", "
" ], "text/plain": [ " title grade required q_id \\\n", "0 工業数学Ⅰ 1 True Q21 (1) \n", "1 工業数学Ⅰ 1 True Q21 (2) \n", "2 工業数学Ⅰ 1 True Q21 (2) \n", "3 工業数学Ⅰ 1 True Q21 (2) \n", "4 工業数学Ⅰ 1 True Q21 (2) \n", ".. ... ... ... ... \n", "165 データマイニング 3 False Q22 \n", "166 ICT実践英語Ⅰ 3 False Q22 \n", "167 知能情報実験Ⅲ 3 True Q21 (2) \n", "168 知能情報実験Ⅲ 3 True Q22 \n", "169 知能情報実験Ⅲ 3 True Q22 \n", "\n", " comment \\\n", "0 特になし \n", "1 正直わかりずらい。むだに間があるし。 \n", "2 例題を取り入れて理解しやすくしてほしい。 \n", "3 特になし \n", "4 スライドに書く文字をもう少しわかりやすくして欲しいです。 \n", ".. ... \n", "165 課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです... \n", "166 オンラインなどで顔を合わせてやりたかったです。 \n", "167 unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。 \n", "168 それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ... \n", "169 モバイルアプリ班\\r\\nHTML/CSS,JavaScriptなどを用いてアプリケーションを... \n", "\n", " wakati \n", "0 特に なし \n", "1 正直 わかる ずらい むだ 間 ある \n", "2 例題 取り入れる 理解 する \n", "3 特に なし \n", "4 スライド 書く 文字 もう 少し わかる する \n", ".. ... \n", "165 課題 難しい もの 多い 時間 多い とる もらえる 非常 良い かなり きつい ござる \n", "166 オンライン 顔 合わせる やる \n", "167 unity 操作方法 説明 最初 行く もらえる もう 少し スムーズ できる 思う \n", "168 それぞれ 任せる いう 形 進める もの なり 進める オンライン 班 員 指導 全く する... \n", "169 モバイルアプリ 班 \\r\\n HTML CSS javascript 用いる アプリケーショ... \n", "\n", "[170 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 分かち書き\n", "poses = ['PROPN', 'NOUN', 'VERB', 'ADJ', 'ADV'] #名詞、動詞、形容詞、形容動詞\n", "\n", "assesment_df['wakati'] = ''\n", "for index, comment in enumerate(assesment_df['comment']):\n", " doc = nlp(comment)\n", " wakati_words = []\n", " for token in doc:\n", " if token.pos_ in poses:\n", " wakati_words.append(token.lemma_)\n", " wakati_text = ' '.join(wakati_words)\n", " assesment_df.at[index, 'wakati'] = wakati_text\n", "\n", "assesment_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 文書ベクトルの作成\n", "ここでは CountVectorizer (Bag-of-Words) で作成してみよう。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bow_tf_vector.shape = (170, 738)\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "stop_words = ['こと', '\\r\\n', 'ため', '思う', 'いる', 'ある', 'する', 'なる']\n", "vectorizer = CountVectorizer(stop_words=stop_words)\n", "bow_tf_vector = vectorizer.fit_transform(assesment_df['wakati'])\n", "print('bow_tf_vector.shape = ', bow_tf_vector.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LDAによるトピックモデル解析\n", "sklearnでは [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/decomposition.html?highlight=lda#latent-dirichlet-allocation-lda) として用意されている。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "iteration: 1 of max_iter: 100\n", "iteration: 2 of max_iter: 100\n", "iteration: 3 of max_iter: 100\n", "iteration: 4 of max_iter: 100\n", "iteration: 5 of max_iter: 100\n", "iteration: 6 of max_iter: 100\n", "iteration: 7 of max_iter: 100\n", "iteration: 8 of max_iter: 100\n", "iteration: 9 of max_iter: 100\n", "iteration: 10 of max_iter: 100\n", "iteration: 11 of max_iter: 100\n", "iteration: 12 of max_iter: 100\n", "iteration: 13 of max_iter: 100\n", "iteration: 14 of max_iter: 100\n", "iteration: 15 of max_iter: 100\n", "iteration: 16 of max_iter: 100\n", "iteration: 17 of max_iter: 100\n", "iteration: 18 of max_iter: 100\n", "iteration: 19 of max_iter: 100\n", "iteration: 20 of max_iter: 100\n", "iteration: 21 of max_iter: 100\n", "iteration: 22 of max_iter: 100\n", "iteration: 23 of max_iter: 100\n", "iteration: 24 of max_iter: 100\n", "iteration: 25 of max_iter: 100\n", "iteration: 26 of max_iter: 100\n", "iteration: 27 of max_iter: 100\n", "iteration: 28 of max_iter: 100\n", "iteration: 29 of max_iter: 100\n", "iteration: 30 of max_iter: 100\n", "iteration: 31 of max_iter: 100\n", "iteration: 32 of max_iter: 100\n", "iteration: 33 of max_iter: 100\n", "iteration: 34 of max_iter: 100\n", "iteration: 35 of max_iter: 100\n", "iteration: 36 of max_iter: 100\n", "iteration: 37 of max_iter: 100\n", "iteration: 38 of max_iter: 100\n", "iteration: 39 of max_iter: 100\n", "iteration: 40 of max_iter: 100\n", "iteration: 41 of max_iter: 100\n", "iteration: 42 of max_iter: 100\n", "iteration: 43 of max_iter: 100\n", "iteration: 44 of max_iter: 100\n", "iteration: 45 of max_iter: 100\n", "iteration: 46 of max_iter: 100\n", "iteration: 47 of max_iter: 100\n", "iteration: 48 of max_iter: 100\n", "iteration: 49 of max_iter: 100\n", "iteration: 50 of max_iter: 100\n", "iteration: 51 of max_iter: 100\n", "iteration: 52 of max_iter: 100\n", "iteration: 53 of max_iter: 100\n", "iteration: 54 of max_iter: 100\n", "iteration: 55 of max_iter: 100\n", "iteration: 56 of max_iter: 100\n", "iteration: 57 of max_iter: 100\n", "iteration: 58 of max_iter: 100\n", "iteration: 59 of max_iter: 100\n", "iteration: 60 of max_iter: 100\n", "iteration: 61 of max_iter: 100\n", "iteration: 62 of max_iter: 100\n", "iteration: 63 of max_iter: 100\n", "iteration: 64 of max_iter: 100\n", "iteration: 65 of max_iter: 100\n", "iteration: 66 of max_iter: 100\n", "iteration: 67 of max_iter: 100\n", "iteration: 68 of max_iter: 100\n", "iteration: 69 of max_iter: 100\n", "iteration: 70 of max_iter: 100\n", "iteration: 71 of max_iter: 100\n", "iteration: 72 of max_iter: 100\n", "iteration: 73 of max_iter: 100\n", "iteration: 74 of max_iter: 100\n", "iteration: 75 of max_iter: 100\n", "iteration: 76 of max_iter: 100\n", "iteration: 77 of max_iter: 100\n", "iteration: 78 of max_iter: 100\n", "iteration: 79 of max_iter: 100\n", "iteration: 80 of max_iter: 100\n", "iteration: 81 of max_iter: 100\n", "iteration: 82 of max_iter: 100\n", "iteration: 83 of max_iter: 100\n", "iteration: 84 of max_iter: 100\n", "iteration: 85 of max_iter: 100\n", "iteration: 86 of max_iter: 100\n", "iteration: 87 of max_iter: 100\n", "iteration: 88 of max_iter: 100\n", "iteration: 89 of max_iter: 100\n", "iteration: 90 of max_iter: 100\n", "iteration: 91 of max_iter: 100\n", "iteration: 92 of max_iter: 100\n", "iteration: 93 of max_iter: 100\n", "iteration: 94 of max_iter: 100\n", "iteration: 95 of max_iter: 100\n", "iteration: 96 of max_iter: 100\n", "iteration: 97 of max_iter: 100\n", "iteration: 98 of max_iter: 100\n", "iteration: 99 of max_iter: 100\n", "iteration: 100 of max_iter: 100\n" ] } ], "source": [ "from sklearn.decomposition import LatentDirichletAllocation\n", "\n", "NUM_TOPICS = 20 #トピック数\n", "max_iter = 100 #LDAによる学習回数\n", "lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=max_iter, learning_method='online',verbose=True)\n", "data_lda = lda.fit_transform(bow_tf_vector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## トピックの観察。\n", "- [pyLDAvis](https://github.com/bmabey/pyLDAvis)によりトピックを観察してみよう。\n", "- 下図の左側がトピック分布を表している。丸の大きさがトピック内に含まれる文書数、丸と丸の距離はトピック間の距離。\n", "- 下図の右側が単語の発生頻度を表している。\n", " - トピックを選択するとそのトピックにおける単語の発生頻度を観察できる。\n", " - 単語を選択すると、その単語がどのようにトピック分布上にバラけているかを観察できる。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/tnal/.venv/dm/lib/python3.8/site-packages/past/builtins/misc.py:45: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n", " from imp import reload\n", "/Users/tnal/.venv/dm/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n", "/Users/tnal/.venv/dm/lib/python3.8/site-packages/pyLDAvis/_prepare.py:246: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " default_term_info = default_term_info.sort_values(\n", "/Users/tnal/.venv/dm/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.\n", " warnings.warn(\n", "/Users/tnal/.venv/dm/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.\n", " warnings.warn(\n", "/Users/tnal/.venv/dm/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:819: FutureWarning: 'square_distances' has been introduced in 0.24 to help phase out legacy squaring behavior. The 'legacy' setting will be removed in 1.1 (renaming of 0.26), and the default setting will be changed to True. In 1.3, 'square_distances' will be removed altogether, and distances will be squared by default. Set 'square_distances'=True to silence this warning.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= x y topics cluster Freq\n", "topic \n", "17 -10.878068 10.538728 1 1 41.804803\n", "11 -60.884026 -75.591820 2 1 10.331622\n", "12 -2.336119 -53.289074 3 1 8.402243\n", "15 -61.385319 -12.914634 4 1 6.262956\n", "19 -28.737743 126.296997 5 1 4.654199\n", "8 1.066606 -123.552811 6 1 4.468259\n", "7 45.140720 -7.947869 7 1 3.440363\n", "5 -140.795975 -98.530548 8 1 3.251865\n", "3 61.531681 -83.240562 9 1 3.157605\n", "14 -72.563057 -140.396881 10 1 3.122404\n", "4 15.984491 63.578186 11 1 2.675185\n", "10 -160.851624 68.945145 12 1 1.630948\n", "16 85.961830 51.922935 13 1 1.574962\n", "13 -99.020836 109.936348 14 1 1.199596\n", "2 -176.764130 -8.681684 15 1 0.928375\n", "18 -117.189880 -37.420200 16 1 0.921461\n", "6 108.829918 -22.946203 17 1 0.850258\n", "1 47.588097 121.934814 18 1 0.507192\n", "9 -106.467834 30.176889 19 1 0.462147\n", "0 -50.613804 59.328842 20 1 0.353556, topic_info= Term Freq Total Category logprob loglift\n", "108 よい 18.000000 18.000000 Default 30.0000 30.0000\n", "667 課題 31.000000 31.000000 Default 29.0000 29.0000\n", "669 講義 30.000000 30.000000 Default 28.0000 28.0000\n", "114 わかる 21.000000 21.000000 Default 27.0000 27.0000\n", "437 授業 36.000000 36.000000 Default 26.0000 26.0000\n", ".. ... ... ... ... ... ...\n", "375 延期 0.009928 1.111538 Topic20 -6.6038 0.9267\n", "422 承知 0.009928 2.803466 Topic20 -6.6038 0.0016\n", "241 入れる 0.009928 1.078309 Topic20 -6.6038 0.9571\n", "478 最初 0.009928 6.636561 Topic20 -6.6038 -0.8601\n", "173 メンバー 0.009928 3.081424 Topic20 -6.6038 -0.0929\n", "\n", "[958 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "0 1 0.405249 cm\n", "0 3 0.607873 cm\n", "2 10 0.573755 css\n", "3 1 0.791793 denchu\n", "4 10 0.573756 github\n", "... ... ... ...\n", "728 9 0.235766 面白い\n", "731 1 0.987628 願う\n", "732 1 0.371879 高い\n", "732 2 0.371879 高い\n", "734 5 0.939695 高校\n", "\n", "[620 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[18, 12, 13, 16, 20, 9, 8, 6, 4, 15, 5, 11, 17, 14, 3, 19, 7, 2, 10, 1])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyLDAvis\n", "from pyLDAvis import sklearn as sklearn_vis\n", "\n", "pyLDAvis.enable_notebook()\n", "dash = sklearn_vis.prepare(lda, bow_tf_vector, vectorizer, mds='tsne')\n", "dash" ] } ], "metadata": { "interpreter": { "hash": "880b2a8c90f9e6beae80b56829e3f671fedd58b6d14887184ddce26124cedfbd" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" } }, "nbformat": 4, "nbformat_minor": 4 }