Example: Vectorizing Text by Token Counts (spaCy Version)

!date
!python --version

Wed Jun  3 02:03:09 AM UTC 2026
Python 3.12.13

20. Example: Vectorizing Text by Token Counts (spaCy Version)

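The basic procedure is to run morphological analysis to split the text into words, count how often each word occurs, and turn those counts into a vector. Many morphological analyzers exist; this example uses spacy.load("ja_ginza"). Once each text has been converted into a string of space-separated words, the easiest route is to hand it to CountVectorizer.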
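
The counting step itself can be sketched in plain Python before bringing in any library. The following is a minimal illustration, not scikit-learn's implementation: the two pre-segmented texts are made up for the example, tokens are obtained with a plain str.split(), and the names docs, counts, vocab and vectors are hypothetical.

```python
from collections import Counter

# Pre-segmented texts (tokens joined by spaces); made up for illustration.
docs = ["これ は テスト です", "テスト は テスト です"]

# Count how often each token occurs in each document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all tokens: one vector dimension per token.
vocab = sorted(set().union(*counts))

# One count vector per document (a missing token counts as 0).
vectors = [[c[token] for token in vocab] for c in counts]

print(vocab)    # ['これ', 'です', 'は', 'テスト']
print(vectors)  # [[1, 1, 1, 1], [0, 1, 1, 2]]
```

Unlike this sketch, CountVectorizer applies its own tokenization rule to the space-separated string, so its vocabulary can differ from a plain split.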

!pip install -U spacy ja_ginza

Requirement already satisfied: spacy in /usr/local/lib/python3.12/dist-packages (3.8.14)
Collecting ja_ginza
  Downloading ja_ginza-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.0.15)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.13)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.0.13)
Requirement already satisfied: thinc<8.4.0,>=8.3.12 in /usr/local/lib/python3.12/dist-packages (from spacy) (8.3.13)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.1.3)
Requirement already satisfied: srsly<3.0.0,>=2.5.3 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.5.3)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.10)
Requirement already satisfied: weasel<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.0.0)
Requirement already satisfied: confection<2.0.0,>=1.3.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.3.3)
Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (0.25.1)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (4.67.3)
Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.2)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.32.4)
Requirement already satisfied: pydantic<3.0.0,>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.12.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.1.6)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from spacy) (75.2.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (26.2)
Collecting sudachipy<0.7.0,>=0.6.2 (from ja_ginza)
  Downloading sudachipy-0.6.11-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (12 kB)
Collecting sudachidict-core>=20210802 (from ja_ginza)
  Downloading sudachidict_core-20260428-py3-none-any.whl.metadata (2.7 kB)
Collecting ginza<5.3.0,>=5.2.0 (from ja_ginza)
  Downloading ginza-5.2.0-py3-none-any.whl.metadata (448 bytes)
Collecting plac>=1.3.3 (from ginza<5.3.0,>=5.2.0->ja_ginza)
  Downloading plac-1.4.5-py2.py3-none-any.whl.metadata (5.9 kB)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy) (0.7.0)
Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy) (2.41.4)
Requirement already satisfied: typing-extensions>=4.14.1 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy) (4.15.0)
Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic<3.0.0,>=2.0.0->spacy) (0.4.2)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.15)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2026.5.20)
Requirement already satisfied: blis<1.4.0,>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from thinc<8.4.0,>=8.3.12->spacy) (1.3.3)
Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (8.4.0)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (1.5.4)
Requirement already satisfied: rich>=13.8.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (13.9.4)
Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer<1.0.0,>=0.3.0->spacy) (0.0.4)
Requirement already satisfied: cloudpathlib>=0.7.0 in /usr/local/lib/python3.12/dist-packages (from weasel<2.0.0,>=1.0.0->spacy) (0.24.0)
Requirement already satisfied: smart-open>=5.2.1 in /usr/local/lib/python3.12/dist-packages (from weasel<2.0.0,>=1.0.0->spacy) (7.6.1)
Requirement already satisfied: httpx>=0.24.0 in /usr/local/lib/python3.12/dist-packages (from weasel<2.0.0,>=1.0.0->spacy) (0.28.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->spacy) (3.0.3)
Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy) (4.13.0)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy) (1.0.9)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx>=0.24.0->weasel<2.0.0,>=1.0.0->spacy) (0.16.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy) (4.2.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy) (2.20.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart-open>=5.2.1->weasel<2.0.0,>=1.0.0->spacy) (2.2.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=13.8.0->typer<1.0.0,>=0.3.0->spacy) (0.1.2)
Downloading ja_ginza-5.2.0-py3-none-any.whl (59.1 MB)
Downloading ginza-5.2.0-py3-none-any.whl (21 kB)
Downloading sudachidict_core-20260428-py3-none-any.whl (72.2 MB)
Downloading sudachipy-0.6.11-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.6 MB)
Downloading plac-1.4.5-py2.py3-none-any.whl (22 kB)
Installing collected packages: sudachipy, plac, sudachidict-core, ginza, ja_ginza
Successfully installed ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.5 sudachidict-core-20260428 sudachipy-0.6.11

20.1. Word segmentation with spaCy

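Besides splitting the text, the code below converts each token to its base (dictionary) form via token.lemma_. Use token.text instead if you want to keep the surface form.
Depending on the task, some words or parts of speech may be unnecessary; exclude them in that case.
Depending on the task, it may also be worth collapsing tokens into classes, for example replacing every number with the single token <数字> ("number") and every emotion word with <感情> ("emotion").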

import spacy

# Sample texts
texts = ['特になし',
        '正直わかりずらい。むだに間があるし。',
        '例題を取り入れて理解しやすくしてほしい。']

# Prepare the analyzer.
# With Python 3.12 + spaCy 3.8 + GiNZA 5.2, loading the model as-is does not work,
# so the following setting is passed explicitly.
config = {
    "components": {
        "compound_splitter": {
            "split_mode": "A"
        }
    }
}
nlp = spacy.load("ja_ginza", config=config)

# Analysis example
doc = nlp("正直わかりずらい。")
for token in doc:
    print(f"{token.i=}, {token.text=}, {token.lemma_=}")

token.i=0, token.text='正直', token.lemma_='正直'
token.i=1, token.text='わかり', token.lemma_='わかる'
token.i=2, token.text='ずらい', token.lemma_='ずらい'
token.i=3, token.text='。', token.lemma_='。'

def text2tokens(nlp: spacy.language.Language, text: str, sep: str = ' ') -> str:
    """Convert a text into a string of separated words (lemmas).

    Args:
      nlp: Analyzer prepared with spacy.load().
      text: Input text.
      sep: Separator placed between words.

    >>> nlp = spacy.load("ja_ginza")
    >>> result = text2tokens(nlp, "これはテストです")
    >>> result
    'これ は テスト です'
    """
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Use the base form; switch to token.text to keep the surface form.
        tokens.append(token.lemma_)
    result = sep.join(tokens)
    return result

# Example run
tokens = []
for text in texts:
    tokens.append(text2tokens(nlp, text))

tokens

['特に なし', '正直 わかる ずらい 。 むだ だ 間 が ある し 。', '例題 を 取り入れる て 理解 する やすい する て ほしい 。']
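
The filtering and aggregation mentioned at the start of this section can be sketched as a small post-processing step. This is only an illustration under stated assumptions: the (lemma, part-of-speech) pairs below are written by hand rather than produced by GiNZA, the set of tags to drop is an arbitrary choice, and the names DROP_POS, NUMBER and filter_and_collapse are hypothetical. In practice the pairs would come from something like [(t.lemma_, t.pos_) for t in nlp(text)].

```python
import re

# Assumed set of part-of-speech tags to discard; adjust to the task.
DROP_POS = {"PUNCT", "ADP", "AUX", "SCONJ"}

# A token made up only of digits, optionally with a decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def filter_and_collapse(pairs, drop_pos=DROP_POS, number_token="<数字>"):
    """Drop unwanted parts of speech and collapse numbers into one token."""
    kept = []
    for lemma, pos in pairs:
        if pos in drop_pos:
            continue  # skip unwanted parts of speech
        if NUMBER.fullmatch(lemma):
            lemma = number_token  # every number becomes the same token
        kept.append(lemma)
    return kept


# Hand-written pairs for illustration (not actual analyzer output).
pairs = [("例題", "NOUN"), ("を", "ADP"), ("3", "NUM"),
         ("問", "NOUN"), ("増やす", "VERB"), ("。", "PUNCT")]
print(filter_and_collapse(pairs))  # ['例題', '<数字>', '問', '増やす']
```

One point to keep in mind when choosing class tokens: angle brackets are not word characters, so with CountVectorizer's default settings <数字> would be counted as 数字.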

20.2. Vectorizing with CountVectorizer

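By default, CountVectorizer uses a 1-gram model, that is, features based on how often each word occurs. The following arguments change this behavior; see the scikit-learn documentation for details.

ngram_range
- Vectorizes on runs of consecutive words (phrases), such as 2-grams and 3-grams.
stop_words
- Specifies words to ignore (stop words). The built-in list covers English only ('english'); for Japanese, pass your own list.
analyzer
- By default the features are words. A word here is a run of two or more word characters matched by the default token_pattern, so punctuation and one-character tokens such as 間 or を are dropped even when they are separated by spaces.
- Specifying 'char' makes characters (character n-grams) the features instead.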
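
To see what the default word analyzer keeps, the pattern can be tried on its own. This is a minimal sketch that applies scikit-learn's documented default token_pattern, r"(?u)\b\w\w+\b", with the standard re module. It imitates only the tokenization step (CountVectorizer also lowercases by default, which has no effect on this Japanese text), and the names DEFAULT_TOKEN_PATTERN, segmented, kept and dropped are hypothetical. The sample string is the second segmented text from section 20.1.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Second segmented text from section 20.1.
segmented = "正直 わかる ずらい 。 むだ だ 間 が ある し 。"

kept = DEFAULT_TOKEN_PATTERN.findall(segmented)
dropped = [t for t in segmented.split() if t not in kept]

print(kept)     # ['正直', 'わかる', 'ずらい', 'むだ', 'ある']
print(dropped)  # ['。', 'だ', '間', 'が', 'し', '。']
```

If one-character words such as 間 matter for the task, passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer keeps them.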

from sklearn.feature_extraction.text import CountVectorizer

# Build feature vectors with 1-grams
vectorizer = CountVectorizer() # by default, vectorizes by word counts
X = vectorizer.fit_transform(tokens) # learn the vocabulary and build the count matrix
features = vectorizer.get_feature_names_out() # vocabulary learned during fitting
print(f"{features=}")
print(X.toarray())

features=array(['ある', 'する', 'ずらい', 'なし', 'ほしい', 'むだ', 'やすい', 'わかる', '例題', '取り入れる',
       '正直', '特に', '理解'], dtype=object)
[[0 0 0 1 0 0 0 0 0 0 0 1 0]
 [1 0 1 0 0 1 0 1 0 0 1 0 0]
 [0 2 0 0 1 0 1 0 1 1 0 0 1]]

# Build feature vectors with 2-grams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(tokens)
features = vectorizer.get_feature_names_out()
print(f"{features=}")
print(X.toarray())

features=array(['する ほしい', 'する やすい', 'ずらい むだ', 'むだ ある', 'やすい する', 'わかる ずらい',
       '例題 取り入れる', '取り入れる 理解', '正直 わかる', '特に なし', '理解 する'], dtype=object)
[[0 0 0 0 0 0 0 0 0 1 0]
 [0 0 1 1 0 1 0 0 1 0 0]
 [1 1 0 0 1 0 1 1 0 0 1]]
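
The 2-gram features above can be reproduced with a few lines of plain Python, which also makes one detail visible: n-grams are formed after punctuation and one-character tokens have been dropped, so a feature such as 'ずらい むだ' spans a sentence boundary. This is a sketch under the assumption that tokenization follows scikit-learn's documented default token_pattern; the names TOKEN and word_ngrams are hypothetical, and this is not the library's own code.

```python
import re

# Same rule as CountVectorizer's documented default token_pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(segmented: str, n: int) -> list[str]:
    """Return the word n-grams of a space-separated string."""
    tokens = TOKEN.findall(segmented)
    # Slide a window of n tokens; texts shorter than n yield no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("正直 わかる ずらい 。 むだ だ 間 が ある し 。", 2))
# ['正直 わかる', 'わかる ずらい', 'ずらい むだ', 'むだ ある']
```

If phrases should not cross sentence boundaries, one option is to split each text into sentences before vectorizing.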