24. Clustering with Topic Models

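A topic model is an approach that tries to observe tendencies (roughly, things that look like topics) from the distribution of word occurrences in documents, and it can be regarded as a form of clustering. Ordinary clustering methods such as k-means group samples on the premise that each sample belongs to exactly one cluster, whereas a topic model groups them on the premise that each sample contains a mixture of several clusters. The following example helps build intuition.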

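The basic procedure is to convert the documents into a document-term matrix, either BoW (CountVectorizer) or a reweighted variant such as TF-IDF, and then to aggregate it (roughly, reduce its dimensionality) based on similarities between documents and between words. Many algorithms have been proposed that differ in how the document-term matrix is built, how the dimensionality is reduced, and how similarity is measured. Here we run (1) LDA on BoW and (2) LDA on TF-IDF, and look at what topics each produces.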
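To make "document-term matrix" concrete, here is a minimal sketch that builds BoW counts and TF-IDF weights by hand for a tiny corpus. The corpus is hypothetical, and the IDF formula is the smoothed variant that scikit-learn's TfidfVectorizer uses by default (followed by L2 normalization of each row); other TF-IDF definitions exist.

```python
import math
from collections import Counter

# Toy corpus, already tokenized and space-separated (hypothetical data)
docs = [
    "slide text small",
    "slide explanation clear",
    "assignment hard assignment long",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})

# Bag-of-Words: one row per document, one column per vocabulary word
bow = [[Counter(words)[word] for word in vocab] for words in tokenized]

# Document frequency: in how many documents each word appears
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# Smoothed IDF: words that occur in many documents get a smaller weight
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of zeros are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row

tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in bow]

print(vocab)
for counts, weights in zip(bow, tfidf):
    print(counts, [round(w, 2) for w in weights])
```

In the first document, "slide" ends up with a lower weight than "small" even though both occur once, because "slide" also appears in another document.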

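One caveat with topic models is that the topics themselves have to be interpreted by a human. For example, Figure 2 (shown below) of the article mentioned earlier, 「トピックモデル入門：WikipediaをLDAモデル化してみた」 ("Introduction to topic models: building an LDA model of Wikipedia"), lists topics such as "Politics", "Sports", and "International", but in practice those labels only emerge after a step like that article's "4-1. トピック観察" (topic observation). Let's do that observation ourselves.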

# Install spaCy and GiNZA
!pip install -U ginza ja_ginza

# Package for exporting figures drawn with plotly to image files
#!pip install -U kaleido

Successfully installed SudachiDict-core-20260428 SudachiPy-0.6.11 ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.5

24.1. Preparing the data

This is the same dataset we have been using in the previous chapters.

!curl -O https://ie.u-ryukyu.ac.jp/~tnal/2022/dm/static/r_assesment.pkl

import collections

import numpy as np
import pandas as pd
import spacy
from wordcloud import WordCloud

# With Python 3.12 + spaCy 3.8 + GiNZA 5.2 the model does not work as-is,
# so the following setting is specified explicitly
config = {
    "components": {
        "compound_splitter": {
            "split_mode": "A"
        }
    }
}
nlp = spacy.load("ja_ginza", config=config)

assesment_df = pd.read_pickle('r_assesment.pkl')
assesment_df.head()

	title	grade	required	q_id	comment
0	工業数学Ⅰ	1	True	Q21 (1)	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。
3	工業数学Ⅰ	1	True	Q21 (2)	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。

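# Tokenize each comment into space-separated lemmas (wakati-gaki)
poses = ['PROPN', 'NOUN', 'VERB', 'ADJ', 'ADV']  # proper nouns, nouns, verbs, adjectives, adverbs

assesment_df['wakati'] = ''
# Iterate with the DataFrame's own index labels so that .at[] stays correct
# even if the index is not 0..n-1
for index, comment in assesment_df['comment'].items():
    doc = nlp(comment)
    wakati_words = []
    for token in doc:
        # Keep only content words, in lemma (dictionary) form
        if token.pos_ in poses:
            wakati_words.append(token.lemma_)
    wakati_text = ' '.join(wakati_words)
    assesment_df.at[index, 'wakati'] = wakati_text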

assesment_df

	title	grade	required	q_id	comment	wakati
0	工業数学Ⅰ	1	True	Q21 (1)	特になし	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。	正直わかるずらいむだ間ある
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。	例題取り入れる理解する
3	工業数学Ⅰ	1	True	Q21 (2)	特になし	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。	スライド書く文字もう少しわかるする
...	...	...	...	...	...	...
165	データマイニング	3	False	Q22	課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです...	課題難しいもの多い時間多いとるもらえる非常良いかなりきついござる
166	ICT実践英語Ⅰ	3	False	Q22	オンラインなどで顔を合わせてやりたかったです。	オンライン顔合わせるやる
167	知能情報実験Ⅲ	3	True	Q21 (2)	unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。	unity 操作方法説明最初行くもらえるもう少しスムーズできる思う
168	知能情報実験Ⅲ	3	True	Q22	それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ...	それぞれ任せるいう形進めるものなり進めるオンライン班員指導全くする...
169	知能情報実験Ⅲ	3	True	Q22	モバイルアプリ班\r\nHTML/CSS，JavaScriptなどを用いてアプリケーションを...	モバイルアプリ班 \r\n HTML CSS javascript 用いるアプリケーショ...

170 rows × 6 columns

24.2. Creating document vectors

Here we build them with CountVectorizer (Bag-of-Words).

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stop_words = ['こと', '\r\n', 'ため', '思う', 'いる', 'ある', 'する', 'なる']
vectorizer = CountVectorizer(stop_words=stop_words)
bow_tf_vector = vectorizer.fit_transform(assesment_df['wakati'])
print('bow_tf_vector.shape = ', bow_tf_vector.shape)

bow_tf_vector.shape =  (170, 741)
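One detail that affects the vocabulary size above: CountVectorizer's documented default `token_pattern` is `r"(?u)\b\w\w+\b"`, which keeps only tokens of two or more word characters, and it lowercases input by default. The sketch below applies the same regular expression with the standard `re` module to a hypothetical wakati string, to show which tokens would survive. It only mimics the tokenization step, not the full vectorizer.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical space-separated lemmas, similar in shape to the 'wakati' column
wakati = "モバイル アプリ 班 \r\n HTML CSS 形 進める"

# CountVectorizer lowercases by default, so do the same here
tokens = re.findall(DEFAULT_TOKEN_PATTERN, wakati.lower())
print(tokens)  # ['モバイル', 'アプリ', 'html', 'css', '進める']
```

Single-character words such as 班 or 形 are dropped, and whitespace like '\r\n' never becomes a token, so the '\r\n' entry in `stop_words` probably has no effect. If single-character words should be kept, one option is to pass `token_pattern=r"(?u)\b\w+\b"` to the vectorizer.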

24.3. Topic model analysis with LDA

scikit-learn provides LDA as LatentDirichletAllocation.

from sklearn.decomposition import LatentDirichletAllocation

NUM_TOPICS = 5  # number of topics
max_iter = 100  # maximum number of passes over the data during LDA training
lda = LatentDirichletAllocation(n_components=NUM_TOPICS,
                                max_iter=max_iter,
                                learning_method='online',
                                random_state=123)  # fixing the seed makes the result reproducible
data_lda = lda.fit_transform(bow_tf_vector)
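The array returned by `fit_transform` is what distinguishes a topic model from hard clustering: each row is a distribution over topics for one document, not a single label. The following sketch illustrates this on a small hypothetical English corpus (the corpus, `toy_lda`, and `doc_topic` are illustrative names, not part of the data used above).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): two themes, "lecture slides" and "assignments"
docs = [
    "slides small text slides hard to read",
    "slides explanation clear slides helpful",
    "assignment hard assignment deadline short",
    "assignment deadline long assignment difficult",
    "slides helpful but assignment deadline short",
]

counts = CountVectorizer().fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(counts)

print(doc_topic.shape)           # (number of documents, number of topics)
print(doc_topic.sum(axis=1))     # each row sums to 1 (a topic mixture)
print(doc_topic.argmax(axis=1))  # dominant topic, if a single label is needed
```

Taking `argmax` per row collapses the mixture into a k-means-like hard assignment. The mixture itself keeps the information that a document touches several topics. The same inspection can be applied to `data_lda` from the cell above.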

24.4. Observing the topics

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def plot_top_words(model, feature_names, n_top_words, title):
    """
    Plotly version: show the top words of each LDA topic as horizontal bars.

    Parameters
    ----------
    model : sklearn.decomposition.LatentDirichletAllocation
        An LDA model that has already been fitted
    feature_names : array-like, shape (n_features,)
        Vocabulary obtained from the vectorizer's get_feature_names_out()
    n_top_words : int
        Number of words to show for each topic
    title : str
        Title of the whole figure
    """
    n_topics = model.components_.shape[0]
    n_cols = 5                                # fixed number of columns
    n_rows = int(np.ceil(n_topics / n_cols))  # number of rows follows from the number of topics

    # Prepare a Figure with subplots
    fig = make_subplots(
        rows=n_rows,
        cols=n_cols,
        shared_xaxes=False,
        horizontal_spacing=0.08,
        vertical_spacing=0.06,
        subplot_titles=[f"Topic {i + 1}" for i in range(n_topics)],
    )

    for topic_idx, topic in enumerate(model.components_):
        # Top words and weights for this topic (argsort is ascending, so the heaviest word comes last)
        top_idx = topic.argsort()[-n_top_words:]
        top_features = [feature_names[i] for i in top_idx]
        weights = topic[top_idx]

        row = topic_idx // n_cols + 1
        col = topic_idx % n_cols + 1

        # Add the horizontal bars
        fig.add_trace(
            go.Bar(
                x=weights,
                y=top_features,
                orientation="h",
                marker=dict(line=dict(width=0)),  # no outline, for a cleaner look
            ),
            row=row,
            col=col,
        )

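        # Reverse the y axis so the categories are listed from top to bottom
        # (like matplotlib's barh after invert_yaxis())
        fig.update_yaxes(autorange="reversed", row=row, col=col)

    # Adjust the layout of the whole figure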
    fig.update_layout(
        height=450 * n_rows,
        width=1700,
        title=dict(text=title, x=0.5, xanchor="center", font=dict(size=40)),
        showlegend=False,
        margin=dict(t=120, l=20, r=20, b=20),
    )

    # Use a uniform font size for the subplot titles (one per topic)
    fig.update_annotations(font_size=22)

    fig.show()
    #file_title = title.replace(' ', '_')
    #fig.write_image(f'{file_title}.png')

n_top_words = 10
plot_top_words(lda, vectorizer.get_feature_names_out(), n_top_words, "Topics in LDA model (TF)")
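When a figure is inconvenient (for example in logs or plain-text reports), the same information can be listed as text. The sketch below is one possible helper, not part of the original notebook: `top_words_per_topic` is a hypothetical name, and the small matrix stands in for `lda.components_`. It relies on the fact, noted in the scikit-learn documentation, that each row of `components_` can be read as a word distribution after normalizing it to sum to 1.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top_words):
    """Return [(word, probability), ...] per topic, highest probability first."""
    components = np.asarray(components, dtype=float)
    # Normalize each row so it can be read as a distribution over words
    word_probs = components / components.sum(axis=1, keepdims=True)
    result = []
    for probs in word_probs:
        top_idx = np.argsort(probs)[::-1][:n_top_words]
        result.append([(feature_names[i], float(probs[i])) for i in top_idx])
    return result

# Hypothetical stand-ins for lda.components_ and vectorizer.get_feature_names_out()
toy_components = [[8.0, 1.0, 0.5, 0.5], [0.5, 0.5, 6.0, 3.0]]
toy_features = ["slide", "text", "assignment", "deadline"]

topics = top_words_per_topic(toy_components, toy_features, 2)
for k, words in enumerate(topics, start=1):
    print(f"Topic {k}: " + ", ".join(f"{w} ({p:.2f})" for w, p in words))
# Topic 1: slide (0.80), text (0.10)
# Topic 2: assignment (0.60), deadline (0.30)
```

With the fitted model, the call would presumably look like `top_words_per_topic(lda.components_, vectorizer.get_feature_names_out(), n_top_words)`. Naming each topic from such a list is still the human interpretation step described at the start of this chapter.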

24.5. Document vectors 2 (TF-IDF)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stop_words = ['こと', '\r\n', 'ため', '思う', 'いる', 'ある', 'する', 'なる']
vectorizer2 = TfidfVectorizer(stop_words=stop_words)
tfidf_vector = vectorizer2.fit_transform(assesment_df['wakati'])
print('tfidf_vector.shape = ', tfidf_vector.shape)

tfidf_vector.shape =  (170, 741)

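lda2 = LatentDirichletAllocation(n_components=NUM_TOPICS,
                                 max_iter=max_iter,
                                 learning_method='online',
                                 random_state=123)  # fixing the seed makes the result reproducible

# Fit on the TF-IDF matrix built above, so that this model reflects the TF-IDF weighting
data_lda2 = lda2.fit_transform(tfidf_vector)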

plot_top_words(lda2, vectorizer2.get_feature_names_out(), n_top_words, "Topics in LDA model (TF-IDF)")
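Comparing the two figures by eye can be supplemented with a rough numeric check. The sketch below computes the Jaccard overlap between the top-word sets of every topic pair across two models. The helper names and the toy matrices are hypothetical. With the real models, the inputs would presumably be `lda.components_` and `lda2.components_`. Topic numbers are arbitrary in each model, so Topic 1 of one model need not correspond to Topic 1 of the other, which is why all pairs are compared.

```python
import numpy as np

def top_word_sets(components, feature_names, n_top_words):
    """Set of the n_top_words highest-weight words for each topic."""
    components = np.asarray(components)
    return [
        {feature_names[i] for i in np.argsort(row)[-n_top_words:]}
        for row in components
    ]

def jaccard_matrix(sets_a, sets_b):
    """Entry [i, j] is |A_i & B_j| / |A_i | B_j| for topic i of A and topic j of B."""
    return np.array([[len(a & b) / len(a | b) for b in sets_b] for a in sets_a])

# Hypothetical stand-ins for the components_ of two fitted models
features = ["slide", "text", "assignment", "deadline"]
components_a = [[8.0, 1.0, 0.5, 0.5], [0.5, 0.5, 6.0, 3.0]]
components_b = [[0.4, 0.3, 5.0, 4.0], [7.0, 2.0, 0.2, 0.1]]

overlap = jaccard_matrix(
    top_word_sets(components_a, features, 2),
    top_word_sets(components_b, features, 2),
)
print(overlap)
```

In this toy case, topic 1 of the first model matches topic 2 of the second model (overlap 1.0) and vice versa. An overlap matrix with a clear one-to-one pattern suggests the two inputs lead to similar topics, while a diffuse matrix suggests they differ.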