!date
!python --version
Thu Jun  6 03:11:24 AM UTC 2024
Python 3.10.12

24. トピックモデルによるクラスタリング#

トピックモデルとは文書中の単語出現分布を元に傾向(≒トピックらしきもの)を観察しようとするアプローチで、クラスタリングの一種である。なお、一般的なクラスタリング(例えばk平均法)では一つのサンプルが一つのクラスタに属するという前提でグルーピングを行うのに対し、トピックモデルでは一つのサンプルが複数のクラスタを内包しているという前提でグルーピングを行う。次の例を眺めるとイメージをつかみやすいだろう。

基本的には文書を BoW (CountVectrizor) やそれの重みを調整した TF-IDF 等の「文書単語行列」を作成し、ここから文書館類似度や単語間類似度を元に集約(≒次元削減)を試みる。文書単語行列の作成方法や次元削減方法、類似度の求め方などで様々なアルゴリズムが提案されている。ここでは (1) Bow + LDA によりトピックモデルを行い、PyLDAvisによる可視化を通してトピックを観察してみよう。

なお、トピックモデルの注意点として、トピックそのものは人手による解釈が求められる点が挙げられる。例えば先に上げたトピックモデル入門:WikipediaをLDAモデル化してみたにおける図2(下図)では「政治」「スポーツ」「国際」といったトピックが並んでいるが、実際には「4-1. トピック観察」を行う必要がある。実際に観察してみよう。

# spacy, ginzaインストール
!pip install -U ginza ja_ginza

# PyLDAvis
!pip install pyldavis
Collecting ginza
  Downloading ginza-5.2.0-py3-none-any.whl (21 kB)
Collecting ja_ginza
  Downloading ja_ginza-5.2.0-py3-none-any.whl (59.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.1/59.1 MB 14.2 MB/s eta 0:00:00
?25hRequirement already satisfied: spacy<4.0.0,>=3.4.4 in /usr/local/lib/python3.10/dist-packages (from ginza) (3.7.4)
Collecting plac>=1.3.3 (from ginza)
  Downloading plac-1.4.3-py2.py3-none-any.whl (22 kB)
Collecting SudachiPy<0.7.0,>=0.6.2 (from ginza)
  Downloading SudachiPy-0.6.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 21.8 MB/s eta 0:00:00
?25hCollecting SudachiDict-core>=20210802 (from ginza)
  Downloading SudachiDict_core-20240409-py3-none-any.whl (72.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.0/72.0 MB 6.4 MB/s eta 0:00:00
?25hRequirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.0.10)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.0.8)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.0.9)
Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (8.2.3)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.1.3)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.4.8)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.0.10)
Requirement already satisfied: weasel<0.4.0,>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (0.3.4)
Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (0.9.4)
Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (6.4.0)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (4.66.4)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.31.0)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.7.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.1.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (67.7.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (24.0)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.4.0)
Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.25.2)
Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.10/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy<4.0.0,>=3.4.4->ginza) (1.2.0)
Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<4.0.0,>=3.4.4->ginza) (0.7.0)
Requirement already satisfied: pydantic-core==2.18.4 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<4.0.0,>=3.4.4->ginza) (2.18.4)
Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<4.0.0,>=3.4.4->ginza) (4.12.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (2024.6.2)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<4.0.0,>=3.4.4->ginza) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy<4.0.0,>=3.4.4->ginza) (0.1.5)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.10/dist-packages (from typer<0.10.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (8.1.7)
Requirement already satisfied: cloudpathlib<0.17.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from weasel<0.4.0,>=0.1.0->spacy<4.0.0,>=3.4.4->ginza) (0.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy<4.0.0,>=3.4.4->ginza) (2.1.5)
Requirement already satisfied: marisa-trie>=0.7.7 in /usr/local/lib/python3.10/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy<4.0.0,>=3.4.4->ginza) (1.1.1)
Installing collected packages: SudachiPy, plac, SudachiDict-core, ginza, ja_ginza
Successfully installed SudachiDict-core-20240409 SudachiPy-0.6.8 ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.3
Collecting pyldavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 11.0 MB/s eta 0:00:00
?25hRequirement already satisfied: numpy>=1.24.2 in /usr/local/lib/python3.10/dist-packages (from pyldavis) (1.25.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from pyldavis) (1.11.4)
Requirement already satisfied: pandas>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from pyldavis) (2.0.3)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from pyldavis) (1.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from pyldavis) (3.1.4)
Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from pyldavis) (2.10.0)
Collecting funcy (from pyldavis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: scikit-learn>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pyldavis) (1.2.2)
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (from pyldavis) (4.3.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from pyldavis) (67.7.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0.0->pyldavis) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0.0->pyldavis) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0.0->pyldavis) (2024.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0.0->pyldavis) (3.5.0)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim->pyldavis) (6.4.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->pyldavis) (2.1.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0.0->pyldavis) (1.16.0)
Installing collected packages: funcy, pyldavis
Successfully installed funcy-2.0 pyldavis-3.4.1

24.1. データの準備#

これまで見てきたいつものやつ。

!curl -O https://ie.u-ryukyu.ac.jp/~tnal/2022/dm/static/r_assesment.pkl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 34834  100 34834    0     0  15859      0  0:00:02  0:00:02 --:--:-- 15862
import collections

import numpy as np
import pandas as pd
import spacy
from wordcloud import WordCloud

nlp = spacy.load("ja_ginza")

assesment_df = pd.read_pickle('r_assesment.pkl')
assesment_df.head()
title grade required q_id comment
0 工業数学Ⅰ 1 True Q21 (1) 特になし
1 工業数学Ⅰ 1 True Q21 (2) 正直わかりずらい。むだに間があるし。
2 工業数学Ⅰ 1 True Q21 (2) 例題を取り入れて理解しやすくしてほしい。
3 工業数学Ⅰ 1 True Q21 (2) 特になし
4 工業数学Ⅰ 1 True Q21 (2) スライドに書く文字をもう少しわかりやすくして欲しいです。
# 分かち書き
poses = ['PROPN', 'NOUN', 'VERB', 'ADJ', 'ADV'] #名詞、動詞、形容詞、形容動詞

assesment_df['wakati'] = ''
for index, comment in enumerate(assesment_df['comment']):
    doc = nlp(comment)
    wakati_words = []
    for token in doc:
        if token.pos_ in poses:
            wakati_words.append(token.lemma_)
    wakati_text = ' '.join(wakati_words)
    assesment_df.at[index, 'wakati'] = wakati_text

assesment_df
title grade required q_id comment wakati
0 工業数学Ⅰ 1 True Q21 (1) 特になし 特に なし
1 工業数学Ⅰ 1 True Q21 (2) 正直わかりずらい。むだに間があるし。 正直 わかる ずらい むだ 間 ある
2 工業数学Ⅰ 1 True Q21 (2) 例題を取り入れて理解しやすくしてほしい。 例題 取り入れる 理解 する
3 工業数学Ⅰ 1 True Q21 (2) 特になし 特に なし
4 工業数学Ⅰ 1 True Q21 (2) スライドに書く文字をもう少しわかりやすくして欲しいです。 スライド 書く 文字 もう 少し わかる する
... ... ... ... ... ... ...
165 データマイニング 3 False Q22 課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです... 課題 難しい もの 多い 時間 多い とる もらえる 非常 良い かなり きつい ござる
166 ICT実践英語Ⅰ 3 False Q22 オンラインなどで顔を合わせてやりたかったです。 オンライン 顔 合わせる やる
167 知能情報実験Ⅲ 3 True Q21 (2) unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。 unity 操作方法 説明 最初 行く もらえる もう 少し スムーズ できる 思う
168 知能情報実験Ⅲ 3 True Q22 それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ... それぞれ 任せる いう 形 進める もの なり 進める オンライン 班 員 指導 全く する...
169 知能情報実験Ⅲ 3 True Q22 モバイルアプリ班\r\nHTML/CSS,JavaScriptなどを用いてアプリケーションを... モバイルアプリ 班 \r\n HTML CSS javascript 用いる アプリケーショ...

170 rows × 6 columns

24.2. 文書ベクトルの作成#

ここでは CountVectorizer (Bag-of-Words) で作成してみよう。

from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['こと', '\r\n', 'ため', '思う', 'いる', 'ある', 'する', 'なる']
vectorizer = CountVectorizer(stop_words=stop_words)
bow_tf_vector = vectorizer.fit_transform(assesment_df['wakati'])
print('bow_tf_vector.shape = ', bow_tf_vector.shape)
bow_tf_vector.shape =  (170, 740)

24.3. LDAによるトピックモデル解析#

sklearnでは LatentDirichletAllocation として用意されている。

from sklearn.decomposition import LatentDirichletAllocation

NUM_TOPICS = 20 #トピック数
max_iter = 100  #LDAによる学習回数
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=max_iter, learning_method='online',verbose=True)
data_lda = lda.fit_transform(bow_tf_vector)
iteration: 1 of max_iter: 100
iteration: 2 of max_iter: 100
iteration: 3 of max_iter: 100
iteration: 4 of max_iter: 100
iteration: 5 of max_iter: 100
iteration: 6 of max_iter: 100
iteration: 7 of max_iter: 100
iteration: 8 of max_iter: 100
iteration: 9 of max_iter: 100
iteration: 10 of max_iter: 100
iteration: 11 of max_iter: 100
iteration: 12 of max_iter: 100
iteration: 13 of max_iter: 100
iteration: 14 of max_iter: 100
iteration: 15 of max_iter: 100
iteration: 16 of max_iter: 100
iteration: 17 of max_iter: 100
iteration: 18 of max_iter: 100
iteration: 19 of max_iter: 100
iteration: 20 of max_iter: 100
iteration: 21 of max_iter: 100
iteration: 22 of max_iter: 100
iteration: 23 of max_iter: 100
iteration: 24 of max_iter: 100
iteration: 25 of max_iter: 100
iteration: 26 of max_iter: 100
iteration: 27 of max_iter: 100
iteration: 28 of max_iter: 100
iteration: 29 of max_iter: 100
iteration: 30 of max_iter: 100
iteration: 31 of max_iter: 100
iteration: 32 of max_iter: 100
iteration: 33 of max_iter: 100
iteration: 34 of max_iter: 100
iteration: 35 of max_iter: 100
iteration: 36 of max_iter: 100
iteration: 37 of max_iter: 100
iteration: 38 of max_iter: 100
iteration: 39 of max_iter: 100
iteration: 40 of max_iter: 100
iteration: 41 of max_iter: 100
iteration: 42 of max_iter: 100
iteration: 43 of max_iter: 100
iteration: 44 of max_iter: 100
iteration: 45 of max_iter: 100
iteration: 46 of max_iter: 100
iteration: 47 of max_iter: 100
iteration: 48 of max_iter: 100
iteration: 49 of max_iter: 100
iteration: 50 of max_iter: 100
iteration: 51 of max_iter: 100
iteration: 52 of max_iter: 100
iteration: 53 of max_iter: 100
iteration: 54 of max_iter: 100
iteration: 55 of max_iter: 100
iteration: 56 of max_iter: 100
iteration: 57 of max_iter: 100
iteration: 58 of max_iter: 100
iteration: 59 of max_iter: 100
iteration: 60 of max_iter: 100
iteration: 61 of max_iter: 100
iteration: 62 of max_iter: 100
iteration: 63 of max_iter: 100
iteration: 64 of max_iter: 100
iteration: 65 of max_iter: 100
iteration: 66 of max_iter: 100
iteration: 67 of max_iter: 100
iteration: 68 of max_iter: 100
iteration: 69 of max_iter: 100
iteration: 70 of max_iter: 100
iteration: 71 of max_iter: 100
iteration: 72 of max_iter: 100
iteration: 73 of max_iter: 100
iteration: 74 of max_iter: 100
iteration: 75 of max_iter: 100
iteration: 76 of max_iter: 100
iteration: 77 of max_iter: 100
iteration: 78 of max_iter: 100
iteration: 79 of max_iter: 100
iteration: 80 of max_iter: 100
iteration: 81 of max_iter: 100
iteration: 82 of max_iter: 100
iteration: 83 of max_iter: 100
iteration: 84 of max_iter: 100
iteration: 85 of max_iter: 100
iteration: 86 of max_iter: 100
iteration: 87 of max_iter: 100
iteration: 88 of max_iter: 100
iteration: 89 of max_iter: 100
iteration: 90 of max_iter: 100
iteration: 91 of max_iter: 100
iteration: 92 of max_iter: 100
iteration: 93 of max_iter: 100
iteration: 94 of max_iter: 100
iteration: 95 of max_iter: 100
iteration: 96 of max_iter: 100
iteration: 97 of max_iter: 100
iteration: 98 of max_iter: 100
iteration: 99 of max_iter: 100
iteration: 100 of max_iter: 100

24.4. トピックの観察。#

  • pyLDAvisによりトピックを観察してみよう。

  • 下図の左側がトピック分布を表している。丸の大きさがトピック内に含まれる文書数、丸と丸の距離はトピック間の距離。

  • 下図の右側が単語の発生頻度を表している。

    • トピックを選択するとそのトピックにおける単語の発生頻度を観察できる。

    • 単語を選択すると、その単語がどのようにトピック分布上にバラけているかを観察できる。

import pyLDAvis.lda_model
import pyLDAvis
from pyLDAvis import lda_model

pyLDAvis.enable_notebook()
dash = lda_model.prepare(lda, bow_tf_vector, vectorizer, mds='tsne')
dash
/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)