特徴的な単語の抽出

25. 特徴的な単語の抽出#

ある文書における特徴的な単語とは何だろうか。様々な指標が提案されているが、基本的には (1) 何か特徴を設定し、(2) その重要度を求め、(3) ランキングすることで求める事が多い。最もシンプルなアプローチは (1) 単語毎に、(2) 出現頻度を求め、 (3) 頻出上位を特徴的な単語と捉える方法だ。ここではワードクラウド形式で眺める例と、2文書間の出現頻度分布を眺める例を観察してみよう。

参考: はじめての自然言語処理第6回 OSS によるテキストマイニング

25.1. required#

spacy, sklearn
wordcloud: pip install wordcloud
scattertext: pip install scattertext

# spacy, ginzaインストール
!pip install -U ginza ja_ginza scattertext pandas

#!pip install scattertext pandas

Collecting ginza
  Downloading ginza-5.2.0-py3-none-any.whl.metadata (448 bytes)
Collecting ja_ginza
  Downloading ja_ginza-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Collecting scattertext
  Downloading scattertext-0.2.2-py3-none-any.whl.metadata (581 bytes)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Collecting pandas
  Downloading pandas-3.0.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.5/79.5 kB 1.7 MB/s eta 0:00:00
?25hRequirement already satisfied: spacy<4.0.0,>=3.4.4 in /usr/local/lib/python3.12/dist-packages (from ginza) (3.8.14)
Collecting plac>=1.3.3 (from ginza)
  Downloading plac-1.4.5-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting SudachiPy<0.7.0,>=0.6.2 (from ginza)
  Downloading sudachipy-0.6.11-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (12 kB)
Collecting SudachiDict-core>=20210802 (from ginza)
  Downloading sudachidict_core-20260428-py3-none-any.whl.metadata (2.7 kB)
Requirement already satisfied: numpy>=1.2.6 in /usr/local/lib/python3.12/dist-packages (from scattertext) (2.0.2)
Collecting scipy<1.14.0,>=1.7.0 (from scattertext)
  Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.6/60.6 kB 1.9 MB/s eta 0:00:00
?25hRequirement already satisfied: scikit-learn>=1.4 in /usr/local/lib/python3.12/dist-packages (from scattertext) (1.6.1)
Requirement already satisfied: statsmodels>=0.14.1 in /usr/local/lib/python3.12/dist-packages (from scattertext) (0.14.6)
Collecting flashtext>=2.7 (from scattertext)
  Downloading flashtext-2.7.tar.gz (14 kB)
  Preparing metadata (setup.py) ... ?25l?25hdone
Collecting gensim>=4.0.0 (from scattertext)
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.12/dist-packages (from scattertext) (4.67.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: smart_open>=1.8.1 in /usr/local/lib/python3.12/dist-packages (from gensim>=4.0.0->scattertext) (7.6.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=1.4->scattertext) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=1.4->scattertext) (3.6.0)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.0.15)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.0.13)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.0.13)
Requirement already satisfied: thinc<8.4.0,>=8.3.12 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (8.3.13)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.1.3)
Requirement already satisfied: srsly<3.0.0,>=2.5.3 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.5.3)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.0.10)
Requirement already satisfied: weasel<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.0.0)
Requirement already satisfied: confection<2.0.0,>=1.3.2 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (1.3.3)
Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (0.25.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.32.4)
Requirement already satisfied: pydantic<3.0.0,>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (2.12.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (3.1.6)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (75.2.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from spacy<4.0.0,>=3.4.4->ginza) (26.2)
Requirement already satisfied: patsy>=0.5.6 in /usr/local/lib/python3.12/dist-packages (from statsmodels>=0.14.1->scattertext) (1.0.2)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy<4.0.0,>=3.4.4->ginza) (0.7.0)
Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy<4.0.0,>=3.4.4->ginza) (2.41.4)
Requirement already satisfied: typing-extensions>=4.14.1 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy<4.0.0,>=3.4.4->ginza) (4.15.0)
Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy<4.0.0,>=3.4.4->ginza) (0.4.2)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (3.15)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy<4.0.0,>=3.4.4->ginza) (2026.5.20)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open>=1.8.1->gensim>=4.0.0->scattertext) (2.2.0)
Requirement already satisfied: blis<1.4.0,>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from thinc<8.4.0,>=8.3.12->spacy<4.0.0,>=3.4.4->ginza) (1.3.3)
Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (8.4.0)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (1.5.4)
Requirement already satisfied: rich>=13.8.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (13.9.4)
Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (0.0.4)
Requirement already satisfied: cloudpathlib>=0.7.0 in /usr/local/lib/python3.12/dist-packages (from weasel<2.0.0,>=1.0.0->spacy<4.0.0,>=3.4.4->ginza) (0.24.0)
Requirement already satisfied: httpx>=0.24.0 in /usr/local/lib/python3.12/dist-packages (from weasel<2.0.0,>=1.0.0->spacy<4.0.0,>=3.4.4->ginza) (0.28.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->spacy<4.0.0,>=3.4.4->ginza) (3.0.3)
Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy<4.0.0,>=3.4.4->ginza) (4.13.0)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy<4.0.0,>=3.4.4->ginza) (1.0.9)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy<4.0.0,>=3.4.4->ginza) (0.16.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (4.2.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (2.20.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy<4.0.0,>=3.4.4->ginza) (0.1.2)
Downloading ginza-5.2.0-py3-none-any.whl (21 kB)
Downloading ja_ginza-5.2.0-py3-none-any.whl (59.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.1/59.1 MB 16.0 MB/s eta 0:00:00
?25hDownloading scattertext-0.2.2-py3-none-any.whl (9.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.4/9.4 MB 65.2 MB/s eta 0:00:00
?25hDownloading pandas-3.0.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (10.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.9/10.9 MB 103.1 MB/s eta 0:00:00
?25hDownloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 16.6 MB/s eta 0:00:00
?25hDownloading plac-1.4.5-py2.py3-none-any.whl (22 kB)
Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.2/38.2 MB 48.8 MB/s eta 0:00:00
?25hDownloading sudachidict_core-20260428-py3-none-any.whl (72.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.2/72.2 MB 9.4 MB/s eta 0:00:00
?25hDownloading sudachipy-0.6.11-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 62.2 MB/s eta 0:00:00
?25hBuilding wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... ?25l?25hdone
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9300 sha256=d453ebf93a88778f98fcdd216164815b52b2169ecec773d7f715310b2951146c
  Stored in directory: /root/.cache/pip/wheels/8c/24/da/4d994d7a27cfc73a4e513a669fbeec4a71f871fe245a81977f
Successfully built flashtext
Installing collected packages: SudachiPy, plac, flashtext, SudachiDict-core, scipy, pandas, gensim, scattertext, ginza, ja_ginza
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.3
    Uninstalling scipy-1.16.3:
      Successfully uninstalled scipy-1.16.3
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 3.0.3 which is incompatible.
tsfresh 0.21.1 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.
gradio 5.50.0 requires pandas<3.0,>=1.0, but you have pandas 3.0.3 which is incompatible.
db-dtypes 1.6.0 requires pandas<3.0.0,>=1.5.3, but you have pandas 3.0.3 which is incompatible.
access 1.1.10.post3 requires scipy>=1.14.1, but you have scipy 1.13.1 which is incompatible.
Successfully installed SudachiDict-core-20260428 SudachiPy-0.6.11 flashtext-2.7 gensim-4.4.0 ginza-5.2.0 ja_ginza-5.2.0 pandas-3.0.3 plac-1.4.5 scattertext-0.2.2 scipy-1.13.1

25.2. 利用ライブラリの用意、データセット準備#

事前に、load_r_assesment.ipynb でデータセットを作成し、pkl形式でファイル保存(r_assesment.pkl)しておく。今回は作成済みファイルをダウンロードして利用することにする。

r_assesment.pklは授業評価アンケートの自由記述欄をpd.DataFrame形式で保存したもので、授業名(title)、学年(grade)、必修か否か(required)、質問番号(q_id)、コメント(comment)で構成される。

!curl -O https://ie.u-ryukyu.ac.jp/~tnal/2022/dm/static/r_assesment.pkl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 34834  100 34834    0     0  14502      0  0:00:02  0:00:02 --:--:-- 14502

import collections

import numpy as np
import pandas as pd
import spacy

# Python 3.12 + Spacy 3.8 + Ginza 5.2 の構成だとそのままでは動作しないため、
# 以下の設定を追加指定
config = {
    "components": {
        "compound_splitter": {
            "split_mode": "A"
        }
    }
}
nlp = spacy.load("ja_ginza", config=config)

assesment_df = pd.read_pickle('r_assesment.pkl')
assesment_df.head()

	title	grade	required	q_id	comment
0	工業数学Ⅰ	1	True	Q21 (1)	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。
3	工業数学Ⅰ	1	True	Q21 (2)	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。

25.3. (何故かみんな大好き) ワードクラウド#

分かち書きした文章を用意し、最大フォントサイズや画像サイズを指定するぐらいで作成可能。
wordcloudで日本語を扱う場合、フォント指定が必要。OS毎にフォントの場所が異なるので「Windows wordcolud 日本語」のようにググってみよう。

# 分かち書き
assesment_df['wakati'] = ''
for index, comment in enumerate(assesment_df['comment']):
    doc = nlp(comment)
    wakati_words = []
    for token in doc:
        wakati_words.append(token.lemma_)
    wakati_text = ' '.join(wakati_words)
    assesment_df.at[index, 'wakati'] = wakati_text

assesment_df

	title	grade	required	q_id	comment	wakati
0	工業数学Ⅰ	1	True	Q21 (1)	特になし	特になし
1	工業数学Ⅰ	1	True	Q21 (2)	正直わかりずらい。むだに間があるし。	正直わかるずらい。むだだ間があるし。
2	工業数学Ⅰ	1	True	Q21 (2)	例題を取り入れて理解しやすくしてほしい。	例題を取り入れるて理解するやすいするてほしい。
3	工業数学Ⅰ	1	True	Q21 (2)	特になし	特になし
4	工業数学Ⅰ	1	True	Q21 (2)	スライドに書く文字をもう少しわかりやすくして欲しいです。	スライドに書く文字をもう少しわかるやすいするて欲しいです。
...	...	...	...	...	...	...
165	データマイニング	3	False	Q22	課題が難しいものが多く、時間を多くとってもらえたのは非常に良かったですがかなりきつかったです...	課題が難しいものが多い、時間を多いとるてもらえるたのは非常 ...
166	ICT実践英語Ⅰ	3	False	Q22	オンラインなどで顔を合わせてやりたかったです。	オンラインなどで顔を合わせるてやるたいたです。
167	知能情報実験Ⅲ	3	True	Q21 (2)	unityの操作方法の説明などを最初に行ってもらえたらもう少しスムーズにできたのではないかと思う。	unity の操作方法の説明などを最初に行くてもらえるたもう少し ...
168	知能情報実験Ⅲ	3	True	Q22	それぞれに任せるといった形で進められたものだったのでそれなりに進めやすかったですが、オンライ...	それぞれに任せるというた形で進めるられるたものだたのだそれ ...
169	知能情報実験Ⅲ	3	True	Q22	モバイルアプリ班\r\nHTML/CSS，JavaScriptなどを用いてアプリケーションを...	モバイルアプリ班 \r\n HTML / CSS , javascript などを用い...

170 rows × 6 columns

# フォントのインストールと設定
!apt-get -y install fonts-ipafont-gothic
font_path = '/usr/share/fonts/truetype/fonts-japanese-mincho.ttf'

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-ipafont-mincho
The following NEW packages will be installed:
  fonts-ipafont-gothic fonts-ipafont-mincho
0 upgraded, 2 newly installed, 0 to remove and 2 not upgraded.
Need to get 8,237 kB of archives.
After this operation, 28.7 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-ipafont-gothic all 00303-21ubuntu1 [3,513 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-ipafont-mincho all 00303-21ubuntu1 [4,724 kB]
Fetched 8,237 kB in 0s (26.9 MB/s)
Selecting previously unselected package fonts-ipafont-gothic.
(Reading database ... 118242 files and directories currently installed.)
Preparing to unpack .../fonts-ipafont-gothic_00303-21ubuntu1_all.deb ...
Unpacking fonts-ipafont-gothic (00303-21ubuntu1) ...
Selecting previously unselected package fonts-ipafont-mincho.
Preparing to unpack .../fonts-ipafont-mincho_00303-21ubuntu1_all.deb ...
Unpacking fonts-ipafont-mincho (00303-21ubuntu1) ...
Setting up fonts-ipafont-mincho (00303-21ubuntu1) ...
update-alternatives: using /usr/share/fonts/opentype/ipafont-mincho/ipam.ttf to provide /usr/share/fonts/truetype/fonts-japanese-mincho.ttf (fonts-japanese-mincho.ttf) in auto mode
Setting up fonts-ipafont-gothic (00303-21ubuntu1) ...
update-alternatives: using /usr/share/fonts/opentype/ipafont-gothic/ipag.ttf to provide /usr/share/fonts/truetype/fonts-japanese-gothic.ttf (fonts-japanese-gothic.ttf) in auto mode
Processing triggers for fontconfig (2.13.1-4.2ubuntu5) ...

from wordcloud import WordCloud

#font_path = '/Library/Fonts/Arial Unicode' # macOSでデフォルトであると思われるフォント
#wc = WordCloud(background_color='white', font_path=font_path, max_font_size=100, width=1000, height=500).generate(' '.join(assesment_df['wakati']))
wc = WordCloud(background_color='white', font_path=font_path, max_font_size=100, width=1000, height=500).generate(' '.join(assesment_df['wakati']))
wc.to_image()

../_images/1adfc72bf11b6158a30a7a6a4f2198858adc73333205435024ab6770dec7901e.png

25.4. scattertextによる2文書の傾向比較#

scattertextは、2つの文書（もしくは2つの文書集合）の違いを単語出現分布から観察するのに適した可視化ツールだ。対比させるという点が重要であり、そうではないタスク、例えばある文書を要約する（重要語を抽出する）というタスクには向いていない。対比する文書は1文書単位でも良いし、複数文書でも構わない。

なお、3種類以上を同時に比較することはできない。もしそのような場合に用いたいのであれば、例えば「文書1とそれ以外」「文書2とそれ以外」のように one-vs-rest を複数回実行すると良いだろう。

以下では、授業毎のコメント数上位2科目を比較対象とし、以下の手順で描画する。

上位2科目の dataframe を用意する。
コメントの spacy.nlp 解析結果（Doc形式）を用意する。
- コメント文そのものや分かち書き結果ではなく、Doc型を用意する必要がある。
dataframe と列を指定して scattertext に処理してもらう。

25.4.1. 前処理なし#

# 授業毎のコメント数上位を確認
assesment_df['title'].value_counts()

title
コンピュータシステム      32
プログラミングⅠ        19
技術者の倫理          18
工業数学Ⅰ           16
アルゴリズムとデータ構造    15
データサイエンス基礎      15
プログラミング演習Ⅰ      13
工学基礎演習          12
プロジェクトデザイン       9
情報ネットワークⅠ        7
情報処理技術概論         7
知能情報実験Ⅲ          3
ディジタル回路          1
キャリアデザイン         1
データマイニング         1
ICT実践英語Ⅰ         1
Name: count, dtype: int64

# 上位2科目のみの dataframe を用意。
# (1) 比較対象をカテゴリ名として保存している列（以下では new_df['title']）と、
# (2) 処理対象となる文書（以下では new_df['comment']）を保存すること。
title1 = 'コンピュータシステム'
title2 = 'プログラミングⅠ'
condition1 = assesment_df['title'] == title1
condition2 = assesment_df['title'] == title2
new_df = assesment_df[condition1 | condition2].loc[:,['title', 'comment']]

# コメント文の nlp 解析結果を用意し、new_df に新しい列として保存する。
# new_df['doc'] の中は丸括弧付きで分かち書きされているように出力されるが、中身はDoc形式である点に注意。
docs = []
for comment in new_df['comment']:
    doc = nlp(comment)
    docs.append(doc)

new_df['doc'] = docs
new_df.head()

	title	comment	doc
46	プログラミングⅠ	特になし	(特に, なし)
47	プログラミングⅠ	たまに説明がないコードがあったりしたので少し戸惑った。いずれはやっていくものではあるが、、、	(たまに, 説明, が, ない, コード, が, あっ, たり, し, た, の, で, 少...
48	プログラミングⅠ	できれば、対面を増やして欲しい	(できれ, ば, 、, 対面, を, 増やし, て, 欲しい)
49	プログラミングⅠ	特になし	(特に, なし)
50	プログラミングⅠ	他人の課題を変更できてしまうのが怖い。	(他人, の, 課題, を, 変更, でき, て, しまう, の, が, 怖い, 。)

import scattertext as st

# 用意したdataframeと、比較対象カテゴリを保存している列(title)、Docを保存している列(doc)を指定。
corpus = st.CorpusFromParsedDocuments(new_df,
                                      category_col='title',
                                      parsed_col='doc').build()

# 上記で用意した corpusと、比較対象したいカテゴリ名（title1, title2）を指定。
html = st.produce_scattertext_explorer(corpus,
                                       category=title1,
                                       category_name=title1,
                                       not_category_name=title2)

/usr/local/lib/python3.12/dist-packages/jieba/__init__.py:44: SyntaxWarning: invalid escape sequence '\.'
  re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
/usr/local/lib/python3.12/dist-packages/jieba/__init__.py:46: SyntaxWarning: invalid escape sequence '\s'
  re_skip_default = re.compile("(\r\n|\s)", re.U)
/usr/local/lib/python3.12/dist-packages/jieba/finalseg/__init__.py:78: SyntaxWarning: invalid escape sequence '\.'
  re_skip = re.compile("([a-zA-Z0-9]+(?:\.\d+)?%?)")

# 生成されたHTMLを描画。
from IPython.display import display, HTML

# 単純にdisplay使うだけでは描画できないための工夫
import html as htmllib        # 文字列エスケープ用

iframe = f'''
<iframe
  srcdoc="{htmllib.escape(html)}"
  style="width:100%;height:700px;border:none;"
  sandbox="allow-scripts allow-same-origin">
</iframe>
'''
HTML(iframe)

25.4.2. 前処理あり#

コメント文をそのまま処理してしまうと観察したくない単語（助詞など）が多々現れているため、傾向を掴みづらい結果となってしまった。これまでにも見てきたように品詞を指定して観察するとしよう。このためには、(1) new_df[‘comment’] に含まれるコメントを事前に分かち書きし、そのタイミングで品詞判定をして不要語を削除する方法と、(2) scattertext側でオプション指定する方法がある。ここでは(2)の方法を眺めてみよう。

scattertext側で品詞指定するには、(a) コメント文そのもの、(b) 解析器(spacy.nlp)、(c) 解析クラスを用意する必要がある。(a) は new_df[‘comment’] をそのまま用いれば良い。(b)は既に用意している nlp を用いれば良い。(c)については st.FeatsFromSpacyDoc を継承した子クラスを作成し、その中で解析方法を書く必要がある。

NOTE
- 先程は処理済みDoc型を利用するため st.CorpusFromParsedDocuments() を利用した。今回はテキストと解析機を渡して scattertext 内部で処理するため、st.CorpusFromPandas() を利用している。

class SelectPOS(st.FeatsFromSpacyDoc):
    '''小クラス。
    get_feats() で解析方法を指定する。
    '''
    poses = ['PRPON', 'NOUN', 'VERB', 'ADJ', 'ADV']
    def __init__(self, use_pos=poses):
        super().__init__()
        self._use_pos = use_pos

    def get_feats(self, doc):
        return collections.Counter([c.lemma_ for c in doc if c.pos_ in self._use_pos])

corpus = st.CorpusFromPandas(new_df,
                             category_col='title',
                             text_col='comment',
                             nlp=nlp,
                             feats_from_spacy_doc=SelectPOS()).build()

html2 = st.produce_scattertext_explorer(corpus,
                                       category=title1,
                                       category_name=title1,
                                       not_category_name=title2)

iframe = f'''
<iframe
  srcdoc="{htmllib.escape(html2)}"
  style="width:100%;height:700px;border:none;"
  sandbox="allow-scripts allow-same-origin">
</iframe>
'''
HTML(iframe)

# 描画が被って見づらい場合には、ファイル出力したものをダウンロードし、
# 別途ブラウザで閲覧する方法と見やすいことも。

with open('scattertext_result.html', 'w') as f:
    f.write(html2)