18. Code examples: thesaurus-based, count-based, and inference-based design (raw implementation, NLTK edition)

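Note
- In natural language processing, the workflow differs greatly depending on the tool. This page prioritizes making typical preprocessing steps (sentence splitting, tokenization, stemming or lemmatization, and so on) easy to observe. Easier-to-use tools will be introduced at a later date.
Overall flow
- Preparation
- Thesaurus example
- Bag-of-Words
- Example using sklearn's BoW and TF-IDF
- Word vectorization based on a co-occurrence matrix
- Improving the distributed representation with pointwise mutual information
- Dimensionality reduction with SVD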

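18.1. Preparation

Notes on running the code
- On Google Colab, the Natural Language Toolkit (NLTK) is installed by default, but corpora must be downloaded separately. (Everything can be installed at once, but that takes a fair amount of disk space, so only a minimal set is installed by default.)
- This page targets English documents. Resources must be downloaded separately for each target language. To see a list of what is available, run nltk.download().

# As of May 2024, no installation is required.
#!pip install nltk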

import nltk
nltk.download(['wordnet', 'stopwords', 'punkt', 'punkt_tab'])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

True

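18.2. Thesaurus example

Here we observe how a thesaurus can be used, through a search task under the following scenario.

There are three documents about the iPhone.
Task: find the most appropriate document for the query "how much iphone".

To solve this task, we (1) preprocess each document and (2) search the documents in two different ways.

Preprocessing (1) means, for example, lowercasing everything so that "iPhone" and "iphone" are treated as the same word. An example implementation is given as preprocess_docs().

For the document search (2), we consider two approaches: simple matching and matching with a thesaurus.

Simple matching, simple_matching(), uses the number of word matches against the user query as the score.
Thesaurus-based matching, relation_matching(), first scores by word matching, then adds points using the thesaurus, and uses the sum as the final score.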

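# Preprocessing
from nltk.tokenize import wordpunct_tokenize, sent_tokenize
# <NLTK components used here>
# nltk.corpus.stopwords: a blacklist of words that are unsuitable as features of a text (so-called stop words).
# nltk.sent_tokenize: splits a document (doc) into sentences.
# nltk.wordpunct_tokenize: splits a sentence into words and punctuation (tokenization).
# nltk.stem.wordnet.WordNetLemmatizer.lemmatize: reduces a word to its (approximate) base form.
#   This is lemmatization, which is related to but not the same as stemming.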

import numpy as np

# Example documents (three documents)
docs = []
docs.append("You can get dis-counted price with trade-in.")
docs.append("iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.")
docs.append("From $16.62/mo. or $399 with trade-in.")

def preprocess_docs(docs):
    '''Preprocess a collection of English documents and return the tokens as a list of lists.

    :param docs(list): One string per document; multiple documents as a list.
    :return (list): Result of sentence splitting, tokenization, lemmatization, and stop-word removal.
    '''
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('.')  # Add the period.
    stopwords.append(',')  # Add the comma.
    stopwords.append('')  # Add the empty string.

    result = []
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        temp = []
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in stopwords:
                    temp.append(this_word)
        result.append(temp)
    return result

docs2 = preprocess_docs(docs)
for index in range(len(docs2)):
    print('before: ', docs[index])
    print('after: ', docs2[index])
    print('----')

before:  You can get dis-counted price with trade-in.
after:  ['get', 'dis', '-', 'counted', 'price', 'trade', '-']
----
before:  iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.
after:  ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera']
----
before:  From $16.62/mo. or $399 with trade-in.
after:  ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-']
----
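
The stray hyphens in the output above follow from how wordpunct_tokenize splits text: runs of word characters and runs of punctuation become separate tokens, and "in" is then dropped as a stop word. The sketch below reproduces that splitting with a plain regular expression. The pattern is, to my understanding, the one NLTK's WordPunctTokenizer uses; the helper name split_word_punct is made up for this illustration.

```python
import re

def split_word_punct(text):
    """Split text into runs of word characters and runs of punctuation."""
    # Assumed pattern: word characters, or non-word non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

print(split_word_punct("with trade-in."))
# ['with', 'trade', '-', 'in', '.']
print(split_word_punct("From $16.62/mo."))
# ['From', '$', '16', '.', '62', '/', 'mo', '.']
```

If hyphenated words such as "trade-in" should survive as one token, a different tokenizer or pattern would be needed; that choice is outside what this page covers.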

18.2.1. simple_matching()

# simple matching
def simple_matching(query, docs):
    '''Compute a score as the number of simple word matches.

    :param query(str): Query (search request).
    :param docs(list): Preprocessed documents; each document is a list of tokens.
    :return (list): Score for each document.
    '''
    query = query.split(" ")
    result = []
    for doc in docs:
        score = 0
        for word in doc:
            for key in query:
                if key == word:
                    score += 1
        result.append(score)
    return result

user_query = "how much iphone"
scores = simple_matching(user_query, docs2)
print('simple_matching scores = ', scores)

simple_matching scores =  [0, 1, 0]
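
The same score can be computed by counting tokens once per document instead of comparing every pair. The following is a minimal sketch, not the page's implementation; count_matches is a hypothetical name, and the tokenized documents are copied from the preprocessing output shown earlier.

```python
from collections import Counter

def count_matches(query, docs):
    """Score each tokenized document by how often the query words occur in it."""
    keys = query.split()
    scores = []
    for doc in docs:
        counts = Counter(doc)  # token -> number of occurrences in this document
        scores.append(sum(counts[key] for key in keys))
    return scores

# Tokenized documents copied from the preprocessing output.
tokenized_docs = [
    ['get', 'dis', '-', 'counted', 'price', 'trade', '-'],
    ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera'],
    ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-'],
]
print(count_matches("how much iphone", tokenized_docs))  # [0, 1, 0]
```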

18.2.2. relation_matching()

# relation matching
related_words = {}
related_words['buy'] = ['buy', '$', 'price', 'how much', 'trade-in']
related_words['UX'] = ['UX', 'stylish', 'seamless']

def relation_matching(query, docs, related_words):
    '''Add points for matches against a prepared set of related terms.

    :param query(str): Query (search request).
    :param docs(list): Preprocessed documents; each document is a list of tokens.
    :param related_words(dict): Maps a relation name to its list of related terms.
    :return (list): Score for each document.
    '''
    scores = simple_matching(query, docs)

    query = query.split(" ")
    for q in query:
        for relation in related_words:
            matches = [q in word for word in related_words[relation]]
            if True in matches:
                new_query = ' '.join(related_words[relation])
                temp_scores = simple_matching(new_query, docs)
                print('# q = {}, relation = {} => temp_scores = {}'.format(q, relation, temp_scores))
                scores = list(np.array(scores) + np.array(temp_scores))
    scores = list(scores)
    return scores

scores2 = relation_matching(user_query, docs2, related_words)
print('simple_matching scores = ', scores)
print('relation_matching scores = ', scores2)

# q = how, relation = buy => temp_scores = [1, 0, 2]
# q = much, relation = buy => temp_scores = [1, 0, 2]
simple_matching scores =  [0, 1, 0]
relation_matching scores =  [np.int64(2), np.int64(1), np.int64(4)]
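
In the output above, "how" and "much" each trigger the buy relation, so its points ([1, 0, 2]) are added twice. The test `q in word` is also a substring test, so a short query word could match inside a longer term. If neither behavior is wanted, one possible variant expands the query once with exact token matches. This is only a sketch of that alternative; expand_query and score are hypothetical names that do not appear on this page.

```python
def expand_query(query, related_words):
    """Return the query terms plus the terms of every relation they touch."""
    terms = query.split()
    expanded = list(terms)
    for words in related_words.values():
        # Split multi-word entries such as 'how much' into single tokens.
        relation_terms = set(" ".join(words).split())
        if relation_terms & set(terms):
            expanded += sorted(relation_terms - set(expanded))
    return expanded

def score(terms, docs):
    """Count, for each tokenized document, the occurrences of the given terms."""
    return [sum(doc.count(t) for t in terms) for doc in docs]

related = {
    'buy': ['buy', '$', 'price', 'how much', 'trade-in'],
    'UX': ['UX', 'stylish', 'seamless'],
}
tokenized_docs = [
    ['get', 'dis', '-', 'counted', 'price', 'trade', '-'],
    ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera'],
    ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-'],
]
terms = expand_query("how much iphone", related)
print(score(terms, tokenized_docs))  # [1, 1, 2]
```

The third document still ranks first, as in the original output, but each relation contributes only once.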

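18.3. Bag-of-Words (BoW)

To vectorize text with BoW, we build a vocabulary and count how many times each vocabulary item occurs (or use binary coding of whether it occurs at all).

Below, collect_words_eng() first builds the codebook (vocabulary), and make_vectors_eng() then creates the document vectors. (The two steps are separated in this implementation, but in practice doing both at once is often more efficient.)

After creating the feature vectors, we check the distances and similarities between documents using Euclidean distance (euclidean_distance()), cosine distance (cosine_distance()), and cosine similarity (cosine_similarity()).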

import scipy.spatial.distance as distance

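# BoW
# Example documents (three documents)
docs3 = []
docs3.append("This is test.")
docs3.append("That is test too.")
docs3.append("There are so many many tests.")


# Build the set of term features (codebook) from the document collection
def collect_words_eng(docs):
    '''Build a word codebook from a collection of English documents.
    The documents are processed with a fixed, hard-coded pipeline.
    Making the steps configurable might be more convenient.

    :param docs(list): One string per document; multiple documents as a list.
    :return (list): Unique words after sentence splitting, tokenization, lemmatization, and stop-word removal.
    '''
    codebook = []
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('.')   # Add the period.
    stopwords.append(',')   # Add the comma.
    stopwords.append('')    # Add the empty string.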
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in codebook and this_word not in stopwords:
                    codebook.append(this_word)
    return codebook

codebook = collect_words_eng(docs3)
print('codebook = ',codebook)

codebook =  ['test', 'many']

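# Build document vectors using the codebook as features (direct vector generation)
def make_vectors_eng(docs, codebook):
    '''Build document vectors using the codebook as features (direct vector generation).

    :param docs(list): One string per document; multiple documents as a list.
    :param codebook(list): List of unique words.
    :return (list): Vectors whose features are the occurrence counts of the codebook words.
    '''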
    vectors = []
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        this_vector = []
        fdist = nltk.FreqDist()
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                fdist[this_word] += 1
        for word in codebook:
            this_vector.append(fdist[word])
        vectors.append(this_vector)
    return vectors

vectors = make_vectors_eng(docs3, codebook)
for index in range(len(docs3)):
    print('docs[{}] = {}'.format(index,docs3[index]))
    print('vectors[{}] = {}'.format(index,vectors[index]))
    print('----')

docs[0] = This is test.
vectors[0] = [1, 0]
----
docs[1] = That is test too.
vectors[1] = [1, 0]
----
docs[2] = There are so many many tests.
vectors[2] = [1, 2]
----
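
The text notes that building the codebook and the vectors at the same time is often more efficient. One way this could look is sketched below, assuming the documents are already tokenized, lemmatized, and stripped of stop words. bow_single_pass is a hypothetical name, and the token lists are written by hand to match the codebook shown above.

```python
from collections import Counter

def bow_single_pass(tokenized_docs):
    """Build the codebook and per-document counts in one pass over the tokens."""
    codebook = {}        # word -> column index, in order of first appearance
    counts_per_doc = []
    for tokens in tokenized_docs:
        counts = Counter()
        for token in tokens:
            codebook.setdefault(token, len(codebook))
            counts[token] += 1
        counts_per_doc.append(counts)
    words = list(codebook)
    # Materialize dense vectors only after the full vocabulary is known.
    vectors = [[counts[w] for w in words] for counts in counts_per_doc]
    return words, vectors

# Tokens as they would look after lemmatization and stop-word removal.
tokenized = [['test'], ['test'], ['many', 'many', 'test']]
words, vectors = bow_single_pass(tokenized)
print(words)    # ['test', 'many']
print(vectors)  # [[1, 0], [1, 0], [1, 2]]
```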

def euclidean_distance(vectors):
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(np.linalg.norm(vectors[i] - vectors[j]))
        distances.append(temp)
    return distances

distances = euclidean_distance(vectors)
print('# euclidean_distance')
for index in range(len(distances)):
    print(distances[index])

def cosine_distance(vectors):
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(distance.cosine(vectors[i], vectors[j]))
        distances.append(temp)
    return distances

distances = cosine_distance(vectors)
print('# cosine_distance')
for index in range(len(distances)):
    print(distances[index])


import sklearn.metrics.pairwise as pairwise
distances = pairwise.cosine_similarity(vectors)
print('# cosine_similarity')
for index in range(len(distances)):
    print(distances[index])

# euclidean_distance
[np.float64(0.0), np.float64(0.0), np.float64(2.0)]
[np.float64(0.0), np.float64(0.0), np.float64(2.0)]
[np.float64(2.0), np.float64(2.0), np.float64(0.0)]
# cosine_distance
[np.float64(0.0), np.float64(0.0), np.float64(0.5527864045000421)]
[np.float64(0.0), np.float64(0.0), np.float64(0.5527864045000421)]
[np.float64(0.5527864045000421), np.float64(0.5527864045000421), np.float64(0.0)]
# cosine_similarity
[1.        1.        0.4472136]
[1.        1.        0.4472136]
[0.4472136 0.4472136 1.       ]
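
The three measures above can be checked by hand for the vectors [1, 0] and [1, 2]. The sketch below redoes the arithmetic with the standard library only, which also shows that cosine distance is one minus cosine similarity.

```python
import math

a = [1, 0]   # "This is test."
b = [1, 2]   # "There are so many many tests."

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))

euclid = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
cos_sim = dot / (norm_a * norm_b)   # 1 / sqrt(5)
cos_dist = 1 - cos_sim

print(euclid)               # 2.0
print(round(cos_sim, 7))    # 0.4472136
print(round(cos_dist, 7))   # 0.5527864
```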

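18.4. Example using sklearn's BoW and TF-IDF

This example creates BoW-based and TF-IDF-based feature vectors and observes how they differ. The set of features listed is identical in both cases, but note that TF-IDF adds some gradation to the weights.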

import sklearn.feature_extraction.text as fe_text

def bow(docs):
    '''Generate Bag-of-Words vectors.

    :param docs(list): One string per document; multiple documents as a list.
    :return: Tuple of (document vectors as an array, fitted vectorizer).
    '''
    vectorizer = fe_text.CountVectorizer(stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

vectors, vectorizer = bow(docs)
print('# normal BoW')
print(vectorizer.get_feature_names_out())
print(vectors)

def bow_tfidf(docs):
    '''Generate Bag-of-Words vectors with TF-IDF weighting.

    :param docs(list): One string per document; multiple documents as a list.
    :return: Tuple of (weighted vectors as an array, fitted vectorizer).
    '''
    vectorizer = fe_text.TfidfVectorizer(norm=None, stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

vectors, vectorizer = bow_tfidf(docs)
print('# BoW + tfidf')
print(vectorizer.get_feature_names_out())
print(vectors)

# normal BoW
['11' '16' '399' '4k' '60' '62' 'beautifully' 'cameras' 'counted' 'dis'
 'fps' 'iphone' 'mo' 'price' 'sharp' 'shoots' 'trade' 'video']
[[0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0]
 [1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1]
 [0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0]]
# BoW + tfidf
['11' '16' '399' '4k' '60' '62' 'beautifully' 'cameras' 'counted' 'dis'
 'fps' 'iphone' 'mo' 'price' 'sharp' 'shoots' 'trade' 'video']
[[0.         0.         0.         0.         0.         0.
  0.         0.         1.69314718 1.69314718 0.         0.
  0.         1.69314718 0.         0.         1.28768207 0.        ]
 [1.69314718 0.         0.         1.69314718 1.69314718 0.
  1.69314718 1.69314718 0.         0.         1.69314718 1.69314718
  0.         0.         1.69314718 1.69314718 0.         1.69314718]
 [0.         1.69314718 1.69314718 0.         0.         1.69314718
  0.         0.         0.         0.         0.         0.
  1.69314718 0.         0.         0.         1.28768207 0.        ]]
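
The two distinct weights in the TF-IDF output can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, and relies on norm=None and every term occurring at most once per document, so that each nonzero entry equals the IDF itself. smooth_idf is a hypothetical helper name.

```python
import math

n_docs = 3  # number of documents in docs

def smooth_idf(df, n=n_docs):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n) / (1 + df)) + 1

print(round(smooth_idf(1), 8))  # 1.69314718  (term in one document, e.g. 'iphone')
print(round(smooth_idf(2), 8))  # 1.28768207  (term in two documents: 'trade')
```

The term "trade" appears in two of the three documents, so it receives the lower weight; this is the gradation that plain counts do not show.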

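18.5. Word vectorization based on a co-occurrence matrix

As an example of representing words as feature vectors based on the distributional hypothesis, this section shows code that uses a co-occurrence matrix. Note that here we build word vectors, not document vectors.

preprocess() lowercases the text, inserts a space before each period (so that the period is not attached to a word), removes double quotes, splits the text into words, and builds the vocabulary. For convenience it also prepares dictionaries for both directions, word => id and id => word, and re-expresses the document as a sequence of ids.

create_co_matrix() takes the document as an id sequence and builds the co-occurrence matrix. The rows of this matrix are the word vectors.

most_similar() shows an example of finding, from the co-occurrence matrix, the words closest to the input word.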

import pandas as pd

sentence = 'pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"'
print(f"{sentence=}")
print(f"{len(sentence)=}")

sentence='pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"'
len(sentence)=107

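def preprocess(text):
    """Preprocess the text.
    From 「ゼロから作るDeepLearning2 自然言語処理編」 p.66.

    :param text: Input text.
    :return:
      corpus(numpy.ndarray): The text as a sequence of word ids (ids as defined in id_to_word).
      word_to_id(dict): Dictionary with words as keys, returning ids.
      id_to_word(dict): Dictionary with ids as keys, returning words.
    """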
    text = text.lower()
    text = text.replace('.', ' .')
    text = text.replace('"', '')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess(sentence)
vocab_size = len(word_to_id)
print(f"{corpus=}")
print(f"{word_to_id=}")
print(f"{id_to_word=}")

corpus=array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  0,  1, 13, 14,
        7, 14, 15,  0])
word_to_id={'pandas': 0, 'is': 1, 'an': 2, 'open': 3, 'source': 4, 'programming': 5, 'tools': 6, '.': 7, 'the': 8, 'best': 9, 'way': 10, 'to': 11, 'get': 12, 'via': 13, 'conda': 14, 'install': 15}
id_to_word={0: 'pandas', 1: 'is', 2: 'an', 3: 'open', 4: 'source', 5: 'programming', 6: 'tools', 7: '.', 8: 'the', 9: 'best', 10: 'way', 11: 'to', 12: 'get', 13: 'via', 14: 'conda', 15: 'install'}

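def create_co_matrix(corpus, vocab_size, window_size=1):
    """Build the co-occurrence matrix.
    From 「ゼロから作るDeepLearning2 自然言語処理編」 p.72.

    :param corpus: Sequence of word ids (e.g. the array returned by preprocess()).
    :param vocab_size: Vocabulary size.
    :param window_size: Co-occurrence window (number of words counted on each side).
    :return: Co-occurrence matrix of shape (vocab_size, vocab_size).
    """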
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size+1):
            left_idx = idx - i
            right_idx = idx + i
            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1
            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1
    return co_matrix

co_matrix = create_co_matrix(corpus, vocab_size, window_size=2)
df = pd.DataFrame(co_matrix, index=word_to_id.keys(), columns=word_to_id.keys())
df

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0	2	1	0	0	0	0	0	0	0	0	1	1	1	1	1
is	2	0	1	1	0	0	0	0	0	0	0	0	1	1	1	0
an	1	1	0	1	1	0	0	0	0	0	0	0	0	0	0	0
open	0	1	1	0	1	1	0	0	0	0	0	0	0	0	0	0
source	0	0	1	1	0	1	1	0	0	0	0	0	0	0	0	0
programming	0	0	0	1	1	0	1	1	0	0	0	0	0	0	0	0
tools	0	0	0	0	1	1	0	1	1	0	0	0	0	0	0	0
.	0	0	0	0	0	1	1	0	1	1	0	0	0	1	2	1
the	0	0	0	0	0	0	1	1	0	1	1	0	0	0	0	0
best	0	0	0	0	0	0	0	1	1	0	1	1	0	0	0	0
way	0	0	0	0	0	0	0	0	1	1	0	1	1	0	0	0
to	1	0	0	0	0	0	0	0	0	1	1	0	1	0	0	0
get	1	1	0	0	0	0	0	0	0	0	1	1	0	0	0	0
via	1	1	0	0	0	0	0	1	0	0	0	0	0	0	1	0
conda	1	1	0	0	0	0	0	2	0	0	0	0	0	1	2	1
install	1	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0
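
One row of the table above can be checked by counting neighbors directly. The sketch below recounts the neighbors of "pandas" within a window of 2, using the token sequence that preprocess() produces for this sentence; neighbors is a hypothetical helper, not part of the page's code.

```python
from collections import Counter

# Token sequence after lowercasing, separating periods, and removing quotes.
tokens = ("pandas is an open source programming tools . the best way to get "
          "pandas is via conda . conda install pandas").split()

def neighbors(tokens, target, window_size):
    """Count the words appearing within window_size positions of target."""
    counts = Counter()
    for idx, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, idx - window_size)
        hi = min(len(tokens), idx + window_size + 1)
        for j in range(lo, hi):
            if j != idx:
                counts[tokens[j]] += 1
    return counts

counts = neighbors(tokens, "pandas", window_size=2)
print(len(tokens))         # 21
print(counts["is"])        # 2
print(counts["conda"])     # 1
print(sum(counts.values()))  # 8
```

The counts agree with the "pandas" row of the table: "is" co-occurs twice and six other words once each, for a row sum of 8.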

def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

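def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    """Print the words most similar to the query by cosine similarity.

    :param query(str): Query word.
    :param word_to_id(dict): Dictionary with words as keys, returning ids.
    :param id_to_word(dict): Dictionary with ids as keys, returning words.
    :param word_matrix: Matrix of word vectors, one row per word id (e.g. the co-occurrence matrix).
    :param top(int): Number of top results to print.
    :return: None.
    """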
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(word_to_id)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))
        count += 1
        if count >= top:
            return

print('\n# most_similar() with co_matrix')
user_query = "pandas"
most_similar(user_query, word_to_id, id_to_word, co_matrix)

# most_similar() with co_matrix
[query] pandas
 conda: 0.5477225541919766
 get: 0.4743416451535486
 open: 0.4743416451535486
 via: 0.4743416451535486
 is: 0.4216370186169938

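18.6. Improving the distributed representation with pointwise mutual information

Using the co-occurrence matrix directly as features tends to give too much weight to frequently occurring words such as "the" and "a". To mitigate this, let us introduce pointwise mutual information (PMI).

ppmi(): Positive PMI (positive pointwise mutual information).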
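
For reference, the quantity that ppmi() computes can be written as follows. The notation is an assumption made for this note: C(x, y) is the co-occurrence count of words x and y, C(x) is the row sum for x (S in the code), and N is the sum of all entries of the co-occurrence matrix, so that P(x, y) = C(x, y) / N and P(x) = C(x) / N.

```latex
\mathrm{PMI}(x, y)
  = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
  = \log_2 \frac{C(x, y)\, N}{C(x)\, C(y)},
\qquad
\mathrm{PPMI}(x, y) = \max\bigl(0,\ \mathrm{PMI}(x, y)\bigr)
```

The code additionally adds a small eps inside the logarithm so that a zero count does not produce negative infinity.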

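def ppmi(C, verbose=False, eps=1e-8):
    """Positive PMI (positive pointwise mutual information).
    From 「ゼロから作るDeepLearning2 自然言語処理編」 p.79.

    :param C: Co-occurrence matrix.
    :param verbose(boolean): Flag for printing progress.
    :param eps(float): Small value that keeps np.log2 from returning -inf.
    :return: PPMI matrix with the same shape as C.
    """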
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                # "+ 1" keeps the modulus nonzero when total < 100.
                if cnt % (total//100 + 1) == 0:
                    print('%.1f%% done' % (100*cnt/total))
    return M

M = ppmi(co_matrix)
print('\n# PPMI')
df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())
df2

# PPMI

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0.000000	1.478047	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	1.285402	0.285402	1.70044
is	1.478047	0.000000	1.478047	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.478047	1.478047	0.478047	0.00000
an	1.285402	1.478047	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
open	0.000000	1.478047	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
source	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
programming	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
tools	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	1.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
.	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	0.000000	1.285402	1.285402	0.000000	0.000000	0.000000	1.285402	1.285402	1.70044
the	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	1.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.00000
best	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.00000
way	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.00000
to	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	0.000000	0.000000	0.00000
get	1.285402	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.00000
via	1.285402	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.00000
conda	0.285402	0.478047	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	1.70044
install	1.700440	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.700440	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.700440	0.00000
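
As a spot check, two entries of the PPMI table can be recomputed from the co-occurrence counts. The numbers below are read off the co-occurrence table shown earlier (row sums 8 for "pandas", 7 for "is", 4 for "an"); N = 78 is the sum of all entries, which for 21 tokens and a window of 2 equals 2 * (20 + 19). The eps term is omitted because both counts are nonzero.

```python
import math

N = 78          # sum of all co-occurrence counts
S_pandas = 8    # row sum for 'pandas'
S_is = 7        # row sum for 'is'
S_an = 4        # row sum for 'an'

ppmi_pandas_is = max(0.0, math.log2(2 * N / (S_pandas * S_is)))  # C = 2
ppmi_pandas_an = max(0.0, math.log2(1 * N / (S_pandas * S_an)))  # C = 1

print(round(ppmi_pandas_is, 6))  # 1.478047
print(round(ppmi_pandas_an, 6))  # 1.285402
```

Although "pandas" and "is" co-occur twice, the frequent word "is" gains only a little over "an", which co-occurs once; the raw count ratio of 2 to 1 is reduced.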

#np.set_printoptions(precision=3) # Show 3 decimal places (display only; the data is unchanged)
pd.options.display.precision = 3 # Same as above
print('\n# PPMI with precision=3')
df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())
df2

# PPMI with precision=3

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0.000	1.478	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.285	1.285	1.285	0.285	1.7
is	1.478	0.000	1.478	1.478	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.478	1.478	0.478	0.0
an	1.285	1.478	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
open	0.000	1.478	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
source	0.000	0.000	2.285	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
programming	0.000	0.000	0.000	2.285	2.285	0.000	2.285	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
tools	0.000	0.000	0.000	0.000	2.285	2.285	0.000	1.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.0
.	0.000	0.000	0.000	0.000	0.000	1.285	1.285	0.000	1.285	1.285	0.000	0.000	0.000	1.285	1.285	1.7
the	0.000	0.000	0.000	0.000	0.000	0.000	2.285	1.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.0
best	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.285	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.0
way	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	2.285	2.285	0.000	0.000	0.0
to	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	2.285	0.000	0.000	0.0
get	1.285	1.478	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	0.000	0.000	0.0
via	1.285	1.478	0.000	0.000	0.000	0.000	0.000	1.285	0.000	0.000	0.000	0.000	0.000	0.000	1.285	0.0
conda	0.285	0.478	0.000	0.000	0.000	0.000	0.000	1.285	0.000	0.000	0.000	0.000	0.000	1.285	1.285	1.7
install	1.700	0.000	0.000	0.000	0.000	0.000	0.000	1.700	0.000	0.000	0.000	0.000	0.000	0.000	1.700	0.0

print('\n# most_similar() with PPMI')
most_similar(user_query, word_to_id, id_to_word, M)

# most_similar() with PPMI
[query] pandas
 conda: 0.5733166933059692
 is: 0.5094797611236572
 .: 0.40005457401275635
 get: 0.39511924982070923
 way: 0.3747256100177765

18.7. Dimensionality reduction with SVD

np.linalg.svd(): uses NumPy's linear algebra library.

# svd
U, S, V = np.linalg.svd(M)
print('\n# SVD: dense vectors with all singular values')
print(U)

use_s_values = 2
U2 = U[:,0:use_s_values]
print('\n# SVD: dense vectors with singular values = {}'.format(use_s_values))
print(U2)

print('\n# most_similar() with SVD-2')
most_similar(user_query, word_to_id, id_to_word, U2)

# SVD: dense vectors with all singular values
[[-0.20909968 -0.05742509 -0.4260506  -0.17504837  0.2712971  -0.04672289
   0.18823615 -0.18591174 -0.36170715 -0.02090307  0.33414906 -0.12803775
  -0.20772134 -0.17504339  0.12016748 -0.49748433]
 [-0.20854177  0.06963065 -0.3888203   0.1402168   0.20635465 -0.15388878
  -0.2102995  -0.13104966  0.38273078  0.39125082  0.11626503  0.32180378
   0.35602686 -0.3037927  -0.09017423  0.09666564]
 [-0.23961712  0.26655334 -0.18719402  0.05390574 -0.35268205 -0.2592225
  -0.32773367 -0.09349788 -0.2730225   0.0377553  -0.40234843  0.31079715
  -0.17229767  0.28257638 -0.12586372 -0.25696287]
 [-0.2782892   0.35838643 -0.04367146 -0.32175225  0.05215078 -0.23192607
   0.21995124  0.49517277 -0.11888672 -0.1294459   0.23939379  0.12363093
   0.32118776  0.27299288  0.13509066  0.19793455]
 [-0.3138919   0.39330027  0.14138082  0.21887659  0.3510343  -0.1303136
   0.33645037 -0.16895744  0.4147285  -0.14546476 -0.16639212 -0.12915234
  -0.36436176  0.11097264 -0.11952758 -0.00717432]
 [-0.29259852  0.32289356  0.2055024   0.29696342 -0.23372465  0.05240921
  -0.22590032 -0.37780765 -0.24121043 -0.03001606  0.29355177 -0.3378207
   0.1754489  -0.11105473  0.2770583   0.22205435]
 [-0.29597434  0.18007208  0.3224701  -0.47838306 -0.15175404  0.19339524
  -0.20657967  0.22351523  0.11675979  0.25128582  0.00841147 -0.15087281
  -0.17578007 -0.40954658 -0.2627039  -0.1637667 ]
 [-0.25471854 -0.01025081  0.02076114 -0.05029602  0.36506298  0.47563514
   0.0229755   0.00612366 -0.19963232  0.08954368 -0.58078074  0.02569766
   0.26210678 -0.01673562  0.34222415 -0.00723523]
 [-0.29356492 -0.19401811  0.31909984  0.49437767 -0.02606995  0.1744973
   0.20824735  0.21605654 -0.2066092  -0.09872612  0.19246763  0.389128
   0.11316317 -0.11955503 -0.29818308 -0.22771738]
 [-0.28762156 -0.33267495  0.200779   -0.2494268  -0.28890833  0.02771297
   0.27570227 -0.329145    0.23007524  0.28904447  0.13154629  0.28043836
  -0.10255558  0.28061515  0.3373071   0.00459014]
 [-0.30796608 -0.3980363   0.13565156 -0.2586614   0.28628355 -0.16009486
  -0.41644257 -0.21415797  0.02268608 -0.5027486   0.02590389  0.01947945
   0.08477696  0.10067189 -0.20825593  0.13973713]
 [-0.26908046 -0.35622442 -0.04449803  0.3045457   0.09906036 -0.229743
  -0.24744193  0.4765816   0.00974799  0.2936326  -0.00161449 -0.30523014
  -0.30354518  0.11671898  0.24741183  0.0925235 ]
 [-0.23532367 -0.25857002 -0.19031367 -0.01856818 -0.34355265 -0.2738812
   0.43220395 -0.04992938 -0.06145945 -0.03610538 -0.34404024 -0.3561516
   0.258372   -0.24605696 -0.2454495   0.14502974]
 [-0.132769    0.00144362 -0.31328392  0.02380027 -0.25689903  0.2137918
  -0.00529088  0.15167178  0.13748148 -0.45615938 -0.02669457  0.27819303
  -0.35791093 -0.38731208  0.29279673  0.28283092]
 [-0.12850079 -0.00502957 -0.3181515  -0.0227449   0.01897933  0.46368992
   0.04571127 -0.06436063 -0.18308993  0.2039377   0.152593   -0.05987117
  -0.20431407  0.32411188 -0.46140748  0.44548053]
 [-0.12502334 -0.01886328 -0.2561411   0.08992951 -0.24950773  0.35275292
  -0.11904678  0.12406847  0.44487035 -0.22767347  0.08375221 -0.287249
   0.2736007   0.31247255 -0.00371157 -0.42694545]]

# SVD: dense vectors with singular values = 2
[[-0.20909968 -0.05742509]
 [-0.20854177  0.06963065]
 [-0.23961712  0.26655334]
 [-0.2782892   0.35838643]
 [-0.3138919   0.39330027]
 [-0.29259852  0.32289356]
 [-0.29597434  0.18007208]
 [-0.25471854 -0.01025081]
 [-0.29356492 -0.19401811]
 [-0.28762156 -0.33267495]
 [-0.30796608 -0.3980363 ]
 [-0.26908046 -0.35622442]
 [-0.23532367 -0.25857002]
 [-0.132769    0.00144362]
 [-0.12850079 -0.00502957]
 [-0.12502334 -0.01886328]]

# most_similar() with SVD-2
[query] pandas
 install: 0.9930136799812317
 .: 0.9741653800010681
 conda: 0.9739159345626831
 via: 0.9613599181175232
 the: 0.950492262840271
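
The slicing step above can be illustrated on a small matrix. This is a minimal sketch with made-up values standing in for the PPMI matrix M. It relies on two properties of np.linalg.svd: the third return value is the conjugate transpose of V (named Vh here), and the singular values come back sorted in descending order, which is why keeping the first k columns of U keeps the most important axes. Scaling by the singular values is shown only as an optional variant, not as something this page does.

```python
import numpy as np

# Small symmetric matrix standing in for the PPMI matrix (values are made up).
A = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

U, S, Vh = np.linalg.svd(A)                  # A == U @ diag(S) @ Vh
print(np.allclose(A, U @ np.diag(S) @ Vh))   # True
print(bool(np.all(S[:-1] >= S[1:])))         # True: sorted in descending order

k = 2
word_vectors = U[:, :k]             # same slicing as U2 above
scaled_vectors = U[:, :k] * S[:k]   # optional: weight each axis by its singular value
print(word_vectors.shape)           # (3, 2)
```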