24. Clustering with Topic Models

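A topic model is an approach that tries to observe tendencies (roughly, things that look like topics) based on the distribution of word occurrences in documents, and it is a kind of clustering. Note that general clustering (for example, k-means) groups data under the assumption that each sample belongs to exactly one cluster, whereas a topic model assumes that each sample contains a mixture of several clusters. Looking at the following example should make this easier to picture.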
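The contrast between the two assumptions can be sketched on a toy corpus. The documents below are made up for illustration, and the choice of two clusters/topics is an assumption: k-means returns a single label per document, while LDA returns a row of topic proportions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): two themes, plus one document that mixes both.
docs = [
    "exam exam homework deadline",
    "homework deadline exam",
    "slides audio video slides",
    "video audio slides",
    "exam homework slides video",
]
X = CountVectorizer().fit_transform(docs)

# Hard clustering: exactly one cluster label per document.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Topic model: one row of topic proportions per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)

print(hard_labels)         # one integer per document
print(doc_topic.round(2))  # each row is a distribution over topics
```

Each row of `doc_topic` sums to 1, so a document can be read as "partly topic 0, partly topic 1" instead of being forced into one group.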

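The basic procedure is to turn the documents into a document-term matrix, such as BoW (CountVectorizer) or TF-IDF, which reweights the BoW counts, and then to aggregate it (roughly, dimensionality reduction) based on inter-document or inter-word similarity. Many algorithms have been proposed, differing in how the document-term matrix is built, how the dimensions are reduced, and how similarity is computed. Here we run a topic model with BoW + LDA and observe the topics through visualization with pyLDAvis.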
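To make "document-term matrix" concrete, here is a minimal hand-rolled sketch on a made-up, already tokenized corpus. The IDF used is the plain textbook form log(N / df), chosen for readability; scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical corpus, already split into space-separated words.
docs = [
    "slides clear lecture",
    "slides slides hard",
    "homework hard",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag-of-words: bow[i][j] is the count of vocab[j] in document i.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[w] for w in vocab])

# Document frequency: in how many documents each word appears.
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# TF-IDF: down-weight words that appear in many documents.
idf = [math.log(n_docs / d) for d in df]
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in bow]

print(vocab)
print(bow)
```

Rows are documents and columns are words; LDA in this chapter is fitted on the raw counts (`bow`), not on the reweighted matrix.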

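One caveat of topic models is that the topics themselves have to be interpreted by a human. For example, Figure 2 (the figure below) of the article mentioned earlier, 「トピックモデル入門：WikipediaをLDAモデル化してみた」, lists topics such as "politics", "sports", and "international", but the model does not produce such labels. In practice you need to carry out the step that article calls 「4-1. トピック観察」 (topic observation), that is, look at the words in each topic and name it yourself. Let us try this observation in practice.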

# Install spaCy and GiNZA
!pip install -U ginza ja_ginza

# pyLDAvis
!pip install pyldavis

Installing collected packages: SudachiPy, plac, SudachiDict-core, ginza, ja_ginza
Successfully installed SudachiDict-core-20240409 SudachiPy-0.6.8 ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.3
Successfully installed funcy-2.0 pyldavis-3.4.1

24.1. Preparing the data

This is the usual dataset we have been using so far.

!curl -O https://ie.u-ryukyu.ac.jp/~tnal/2022/dm/static/r_assesment.pkl

import collections

import numpy as np
import pandas as pd
import spacy
from wordcloud import WordCloud

nlp = spacy.load("ja_ginza")

assesment_df = pd.read_pickle('r_assesment.pkl')
assesment_df.head()

	title	grade	required	q_id	comment
0	工業数学Ⅰ	1	True	Q21 (1)	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。
3	工業数学Ⅰ	1	True	Q21 (2)	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。

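# Word segmentation (wakati-gaki): keep only content words, joined by spaces
poses = ['PROPN', 'NOUN', 'VERB', 'ADJ', 'ADV']  # proper nouns, nouns, verbs, adjectives, adverbs

assesment_df['wakati'] = ''
for index, comment in enumerate(assesment_df['comment']):
    doc = nlp(comment)
    wakati_words = []
    for token in doc:
        if token.pos_ in poses:
            wakati_words.append(token.lemma_)  # use the lemma (dictionary form)
    wakati_text = ' '.join(wakati_words)
    # enumerate() yields positions, so this relies on the default 0..n-1 index
    assesment_df.at[index, 'wakati'] = wakati_text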

assesment_df

	title	grade	required	q_id	comment	wakati
0	工業数学Ⅰ	1	True	Q21 (1)	特になし	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。	正直わかるずらいむだ間ある
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。	例題取り入れる理解する
3	工業数学Ⅰ	1	True	Q21 (2)	特になし	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。	スライド書く文字もう少しわかるする
...	...	...	...	...	...	...
165	データマイニング	3	False	Q22	課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです...	課題難しいもの多い時間多いとるもらえる非常良いかなりきついござる
166	ICT実践英語Ⅰ	3	False	Q22	オンラインなどで顔を合わせてやりたかったです。	オンライン顔合わせるやる
167	知能情報実験Ⅲ	3	True	Q21 (2)	unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。	unity 操作方法説明最初行くもらえるもう少しスムーズできる思う
168	知能情報実験Ⅲ	3	True	Q22	それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ...	それぞれ任せるいう形進めるものなり進めるオンライン班員指導全くする...
169	知能情報実験Ⅲ	3	True	Q22	モバイルアプリ班\r\nHTML/CSS，JavaScriptなどを用いてアプリケーションを...	モバイルアプリ班 \r\n HTML CSS javascript 用いるアプリケーショ...

170 rows × 6 columns
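The filtering step above can be isolated as a small function. The sketch below uses a hypothetical `Token` stand-in instead of real spaCy tokens so that it runs without GiNZA installed; the part-of-speech tags are hand-written for illustration and are not actual GiNZA output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Minimal stand-in for the spaCy token attributes used above."""
    text: str
    lemma_: str
    pos_: str


KEEP_POS = {"PROPN", "NOUN", "VERB", "ADJ", "ADV"}


def to_wakati(tokens):
    """Join the lemmas of content words with spaces."""
    return " ".join(t.lemma_ for t in tokens if t.pos_ in KEEP_POS)


# Hand-tagged example phrase "スライドに書いた文字" (tags are illustrative).
tokens = [
    Token("スライド", "スライド", "NOUN"),
    Token("に", "に", "ADP"),
    Token("書い", "書く", "VERB"),
    Token("た", "た", "AUX"),
    Token("文字", "文字", "NOUN"),
]
print(to_wakati(tokens))  # スライド 書く 文字
```

Particles and auxiliaries are dropped, and the inflected verb is replaced by its dictionary form, which keeps the vocabulary of the later document-term matrix small.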

24.2. Building the document vectors

Here we build them with CountVectorizer (bag-of-words).

from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['こと', '\r\n', 'ため', '思う', 'いる', 'ある', 'する', 'なる']
vectorizer = CountVectorizer(stop_words=stop_words)
bow_tf_vector = vectorizer.fit_transform(assesment_df['wakati'])
print('bow_tf_vector.shape = ', bow_tf_vector.shape)

bow_tf_vector.shape =  (170, 740)
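One detail that may be worth checking with Japanese text: CountVectorizer's default `token_pattern` only keeps tokens of two or more characters, so single-character words are silently dropped. The sketch below uses two made-up, already segmented strings; whether short words should be kept is a modeling choice, not something the original pipeline prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Already space-separated text, in the same form as the wakati column.
docs = ["顔 合わせる", "間 ある"]

# Default pattern r"(?u)\b\w\w+\b": tokens need at least two characters.
default_vec = CountVectorizer()
default_vec.fit(docs)

# Alternative pattern that also keeps single-character tokens.
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_short.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(keep_short.vocabulary_))
```

With the default pattern, 顔 and 間 never reach the vocabulary; with the alternative pattern they do.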

24.3. Topic model analysis with LDA

In scikit-learn, LDA is provided as LatentDirichletAllocation.

from sklearn.decomposition import LatentDirichletAllocation

NUM_TOPICS = 20  # number of topics
max_iter = 100   # maximum number of LDA training iterations
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=max_iter, learning_method='online',verbose=True)
data_lda = lda.fit_transform(bow_tf_vector)

iteration: 1 of max_iter: 100
...
iteration: 100 of max_iter: 100
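Before turning to the interactive view, a quick way to read the fitted topics is to list the highest-weighted words of each one. The helper below, `top_words_per_topic`, is a hypothetical name, and the weight matrix and vocabulary are made-up stand-ins for `lda.components_` and the vectorizer's vocabulary.

```python
import numpy as np


def top_words_per_topic(components, feature_names, n_top=3):
    """Return the n_top highest-weighted words for every topic row."""
    feature_names = np.asarray(feature_names)
    # Sort each row ascending, reverse to descending, keep the first n_top.
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in top_idx]


# Hypothetical topic-word weights (2 topics x 5 words).
toy_components = np.array([
    [9.0, 0.1, 4.0, 0.2, 7.0],
    [0.3, 8.0, 0.1, 6.0, 0.5],
])
toy_vocab = ["課題", "スライド", "難しい", "音声", "時間"]

for k, words in enumerate(top_words_per_topic(toy_components, toy_vocab)):
    print(k, words)
```

With the model fitted above, the same helper could be called as `top_words_per_topic(lda.components_, vectorizer.get_feature_names_out(), n_top=10)`. Naming each topic from its word list is still left to the reader.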

24.4. Observing the topics

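Let us observe the topics with pyLDAvis.
The left side of the figure below shows the distribution of topics. The size of each circle reflects how prevalent the topic is in the corpus (its share of all word occurrences, rather than a count of documents), and the distance between circles reflects the distance between topics.
The right side of the figure below shows word frequencies.
- Selecting a topic shows the frequency of words within that topic.
- Selecting a word shows how that word is spread across the topic distribution.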
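The word ranking in the right-hand panel is controlled by the relevance slider. As defined by the authors of LDAvis, the relevance of word w to topic t is λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)). The sketch below applies that formula to made-up probabilities to show how the ranking changes between λ = 1 and λ = 0.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# Hypothetical values: word -> (p(w | topic), p(w) in the whole corpus).
probs = {
    "課題": (0.20, 0.15),
    "unity": (0.05, 0.006),
    "時間": (0.10, 0.12),
}


def rank(lam):
    """Words sorted from most to least relevant for the given lambda."""
    def score(word):
        p_wt, p_w = probs[word]
        return relevance(p_wt, p_w, lam)
    return sorted(probs, key=score, reverse=True)


print(rank(1.0))  # ['課題', '時間', 'unity']
print(rank(0.0))  # ['unity', '課題', '時間']
```

At λ = 1 words are ranked purely by their frequency within the topic, so common words dominate. Lowering λ promotes words that are rare overall but concentrated in the selected topic, which often makes a topic easier to name.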

# pyLDAvis.lda_model is the scikit-learn interface (named pyLDAvis.sklearn before pyLDAvis 3.4)
import pyLDAvis
from pyLDAvis import lda_model

pyLDAvis.enable_notebook()
dash = lda_model.prepare(lda, bow_tf_vector, vectorizer, mds='tsne')
dash

[Interactive pyLDAvis output: topic map on the left, per-topic word frequencies on the right; relevance slider at λ = 1]