Example: Vectorizing text by token counts (spaCy version)

!date
!python --version

Thu May 30 04:39:14 AM UTC 2024
Python 3.10.12

20. Example: Vectorizing text by token counts (spaCy version)

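The basic procedure is: run morphological analysis to split the text into words, count how often each word occurs, and turn those counts into a vector. Many morphological analyzers are available; this example uses spacy.load("ja_ginza"). Once each text has been converted into a string of space-separated words, the easiest way to finish is scikit-learn's CountVectorizer.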
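As a rough, library-free illustration of the counting step, the sketch below builds count vectors by hand from texts that are already split into space-separated words. The helper name count_vectors is made up for this illustration. Unlike CountVectorizer with its default settings, it keeps every token, including single characters and punctuation.

```python
from collections import Counter


def count_vectors(segmented_texts):
    """Return (vocabulary, count matrix) for space-separated texts.

    Hypothetical helper for illustration only; it is not part of
    spaCy or scikit-learn.
    """
    token_lists = [text.split() for text in segmented_texts]
    # Vocabulary: every distinct token, sorted by code point
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocab, matrix = count_vectors(["特に なし", "理解 する やすい する"])
print(vocab)
print(matrix)
```

Each row of the matrix corresponds to one text and each column to one vocabulary entry, which is the same layout that CountVectorizer produces.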

!pip install -U spacy ja_ginza

Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (3.7.4)
Collecting ja_ginza
  Downloading ja_ginza-5.2.0-py3-none-any.whl (59.1 MB)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.0.10)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.0.8)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.0.9)
Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /usr/local/lib/python3.10/dist-packages (from spacy) (8.2.3)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.1.2)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.4.8)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.0.10)
Requirement already satisfied: weasel<0.4.0,>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (0.3.4)
Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (0.9.4)
Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /usr/local/lib/python3.10/dist-packages (from spacy) (6.4.0)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (4.66.4)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.31.0)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy) (2.7.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.1.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy) (67.7.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (24.0)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (3.4.0)
Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from spacy) (1.25.2)
Collecting sudachipy<0.7.0,>=0.6.2 (from ja_ginza)
  Downloading SudachiPy-0.6.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
Collecting sudachidict-core>=20210802 (from ja_ginza)
  Downloading SudachiDict_core-20240409-py3-none-any.whl (72.0 MB)
Collecting ginza<5.3.0,>=5.2.0 (from ja_ginza)
  Downloading ginza-5.2.0-py3-none-any.whl (21 kB)
Collecting plac>=1.3.3 (from ginza<5.3.0,>=5.2.0->ja_ginza)
  Downloading plac-1.4.3-py2.py3-none-any.whl (22 kB)
Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.10/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy) (1.2.0)
Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.7.0)
Requirement already satisfied: pydantic-core==2.18.2 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (2.18.2)
Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (4.11.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2024.2.2)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.3.0,>=8.2.2->spacy) (0.1.4)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.10/dist-packages (from typer<0.10.0,>=0.3.0->spacy) (8.1.7)
Requirement already satisfied: cloudpathlib<0.17.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from weasel<0.4.0,>=0.1.0->spacy) (0.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy) (2.1.5)
Requirement already satisfied: marisa-trie>=0.7.7 in /usr/local/lib/python3.10/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy) (1.1.1)
Installing collected packages: sudachipy, plac, sudachidict-core, ginza, ja_ginza
Successfully installed ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.3 sudachidict-core-20240409 sudachipy-0.6.8

20.1. Word segmentation with spaCy

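Rather than only splitting the text, the code below also converts each word to its base form with token.lemma_. Use token.text instead if you want to keep the surface form.
Depending on the task, some words or parts of speech may be unnecessary; exclude them in that case.
Depending on the task, it may also help to collapse groups of words into a single token, for example replacing every number with <number> or every emotion word with <emotion>.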
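A minimal sketch of such filtering and aggregation follows. To keep it runnable without a language model, it works on hand-written (lemma, part-of-speech) pairs rather than real spaCy tokens; with spaCy, similar pairs could be built as `[(t.lemma_, t.pos_) for t in nlp(text)]`. The function name normalize, the set of dropped tags, and the label `<NUM>` are all assumptions made for this example.

```python
import re

# Hand-written (lemma, part-of-speech) pairs standing in for spaCy tokens.
# They are illustrative and not actual GiNZA output.
pairs = [("例題", "NOUN"), ("を", "ADP"), ("3", "NUM"), ("問", "NOUN"),
         ("取り入れる", "VERB"), ("。", "PUNCT")]

DROP_POS = {"PUNCT", "ADP"}          # assumed set of unwanted POS tags
NUMBER = re.compile(r"\d+(\.\d+)?")  # plain digit strings only


def normalize(pairs, drop_pos=DROP_POS, number_label="<NUM>"):
    """Drop unwanted parts of speech and collapse numerals into one label."""
    result = []
    for lemma, pos in pairs:
        if pos in drop_pos:
            continue  # exclude this token entirely
        if NUMBER.fullmatch(lemma):
            lemma = number_label  # aggregate all numbers into one token
        result.append(lemma)
    return result


normalized = " ".join(normalize(pairs))
print(normalized)
```

One caveat: CountVectorizer's default tokenization discards non-word characters such as < and >, so a label without brackets may be more robust if the result is passed to it unchanged.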

import spacy

# Sample texts
texts = ['特になし',
        '正直わかりずらい。むだに間があるし。',
        '例題を取り入れて理解しやすくしてほしい。']

# Load the analyzer (spaCy pipeline)
nlp = spacy.load("ja_ginza")

# Example analysis
doc = nlp("正直わかりずらい。")
for token in doc:
    print(f"{token.i=}, {token.text=}, {token.lemma_=}")

token.i=0, token.text='正直', token.lemma_='正直'
token.i=1, token.text='わかり', token.lemma_='わかる'
token.i=2, token.text='ずらい', token.lemma_='ずらい'
token.i=3, token.text='。', token.lemma_='。'

def text2tokens(nlp: spacy.language.Language, text: str, sep: str = " ") -> str:
    """Convert a text into a string of separator-joined base forms.

    Args:
      nlp: Analyzer (pipeline) prepared with spacy.load().
      text: Input text.
      sep: Separator placed between words.

    >>> nlp = spacy.load("ja_ginza")
    >>> result = text2tokens(nlp, "これはテストです")
    >>> result
    'これ は テスト です'
    """
    doc = nlp(text)
    # Use the base form of each token; switch to token.text for the surface form
    tokens = [token.lemma_ for token in doc]
    return sep.join(tokens)

# Usage example
tokens = []
for text in texts:
    tokens.append(text2tokens(nlp, text))

tokens

['特に なし', '正直 わかる ずらい 。 むだ だ 間 が ある し 。', '例題 を 取り入れる て 理解 する やすい する て ほしい 。']

20.2. Vectorization with CountVectorizer

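By default, CountVectorizer uses a 1-gram model, i.e., features based on how often each individual word occurs. The following arguments change this behavior; see the scikit-learn documentation for details.

ngram_range
- Vectorizes on 2-grams, 3-grams, and so on, i.e., on phrases made of consecutive words.
stop_words
- Specifies words to ignore (stop words). A built-in list can also be used, but scikit-learn only ships one for English ('english'), so for Japanese you need to pass your own list.
analyzer
- By default, features are words, where a "word" is a run of two or more word characters; whitespace and punctuation act as separators. As a result, single-character tokens such as を and punctuation such as 。 are dropped.
- Specifying 'char' makes characters the features instead.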

from sklearn.feature_extraction.text import CountVectorizer

# Build feature vectors with 1-grams
vectorizer = CountVectorizer()  # by default, vectorizes by word counts
X = vectorizer.fit_transform(tokens)  # learn the vocabulary and build the count matrix
features = vectorizer.get_feature_names_out()  # vocabulary, in column order
print(f"{features=}")
print(X.toarray())

features=array(['ある', 'する', 'ずらい', 'なし', 'ほしい', 'むだ', 'やすい', 'わかる', '例題', '取り入れる',
       '正直', '特に', '理解'], dtype=object)
[[0 0 0 1 0 0 0 0 0 0 0 1 0]
 [1 0 1 0 0 1 0 1 0 0 1 0 0]
 [0 2 0 0 1 0 1 0 1 1 0 0 1]]

# Build feature vectors with 2-grams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(tokens)
features = vectorizer.get_feature_names_out()
print(f"{features=}")
print(X.toarray())

features=array(['する ほしい', 'する やすい', 'ずらい むだ', 'むだ ある', 'やすい する', 'わかる ずらい',
       '例題 取り入れる', '取り入れる 理解', '正直 わかる', '特に なし', '理解 する'], dtype=object)
[[0 0 0 0 0 0 0 0 0 1 0]
 [0 0 1 1 0 1 0 0 1 0 0]
 [1 1 0 0 1 0 1 1 0 0 1]]
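
Some of the 2-gram features above, such as 'ずらい むだ', join words that are not adjacent in the original sentence. This happens because the default tokenization first discards single-character tokens and punctuation, and only then pairs up the remaining words. The sketch below mimics that behavior with the standard library; it is an approximation for illustration, not scikit-learn's actual implementation, and the function name bigrams is made up.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bigrams(segmented_text):
    """Return the 2-grams formed after filtering out short tokens."""
    tokens = TOKEN_PATTERN.findall(segmented_text)
    # Pair each remaining token with its successor
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


result = bigrams("正直 わかる ずらい 。 むだ だ 間 が ある し 。")
print(result)
```

If 2-grams should respect sentence boundaries, one option is to split the text into sentences before vectorizing, so that no pair spans the 。 between them.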