2. Code examples: thesaurus, count-based and inference-based designs (from-scratch implementation, NLTK edition)

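Note
- In natural language processing, the concrete operations differ considerably from one tool to another. This page prioritizes making the typical preprocessing steps (sentence splitting, tokenization, stemming or lemmatization, etc.) easy to observe. Easier-to-use tools will be introduced at a later date.
Overview
- Preparation
- Thesaurus example
- Bag-of-Words
- Example using sklearn's BoW and TF-IDF
- Vectorizing words from a co-occurrence matrix
- Improving the distributed representation with mutual information
- Dimensionality reduction with SVD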

2.1. Preparation

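Notes before running
- You need to install the Natural Language Toolkit (NLTK) and additionally download corpora and other data. (Everything can be installed in one go, but that takes a fair amount of disk space, so only a minimal set is installed by default.)
- Steps
  - Install NLTK.
  - Run nltk.download() from the Python interpreter and download the resources listed below.
    - wordnet and stopwords from the Corpora tab
    - punkt from the Models tab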

!pip install nltk

Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk) (1.15.0)

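import nltk
# Recent NLTK releases may look for 'punkt_tab' instead of 'punkt';
# add it to the list if sent_tokenize raises a LookupError.
nltk.download(['wordnet', 'stopwords', 'punkt'])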

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

True

2.2. Thesaurus example

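preprocess_docs(): an example of text preprocessing.
simple_matching(): computes a score by simple word matching against the user query.
relation_matching(): an example that adds points on top of simple matching by using a thesaurus.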

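# Preprocessing
from nltk.tokenize import wordpunct_tokenize, sent_tokenize
# <NLTK components used here>
# nltk.corpus.stopwords: a blacklist of words that are unsuitable as features of a text, known as stop words.
# nltk.tokenize.sent_tokenize: splits a document (doc) into sentences.
# nltk.tokenize.wordpunct_tokenize: splits a sentence into words and punctuation, known as tokenization.
# nltk.stem.wordnet.WordNetLemmatizer.lemmatize: maps a word to (something close to) its base form, known as lemmatization.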

import numpy as np

# Example documents (three documents)
docs = []
docs.append("You can get dis-counted price with trade-in.")
docs.append("iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.")
docs.append("From $16.62/mo. or $399 with trade-in.")

def preprocess_docs(docs):
    '''Preprocess a collection of English documents and return it as a list of token lists.

    :param docs(list): list of documents, one string per document.
    :return (list): result of sentence splitting, tokenization, lemmatization, and stop-word removal.
    '''
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('.')  # add the period
    stopwords.append(',')  # add the comma
    stopwords.append('')  # add the empty string

    result = []
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        temp = []
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in stopwords:
                    temp.append(this_word)
        result.append(temp)
    return result

docs2 = preprocess_docs(docs)
for index in range(len(docs2)):
    print('before: ', docs[index])
    print('after: ', docs2[index])
    print('----')

before:  You can get dis-counted price with trade-in.
after:  ['get', 'dis', '-', 'counted', 'price', 'trade', '-']
----
before:  iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.
after:  ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera']
----
before:  From $16.62/mo. or $399 with trade-in.
after:  ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-']
----
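
The output above shows a side effect of wordpunct_tokenize: "trade-in" is split into 'trade', '-', 'in', and 'in' is then removed as a stop word. As a minimal sketch, assuming that hyphenated compounds should survive as single tokens, a regular-expression tokenizer could look like the following. The function name tokenize_keep_hyphen and the pattern are illustrative assumptions, not part of NLTK.

```python
import re

# Hypothetical pattern: runs of letters/digits optionally joined by hyphens,
# or any single non-space character that is not a letter or digit.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*|[^\sa-z0-9]")

def tokenize_keep_hyphen(text):
    """Lowercase the text and split it into tokens, keeping hyphenated words whole."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize_keep_hyphen("You can get dis-counted price with trade-in.")
print(tokens)
```

Stop-word removal and lemmatization would still have to be applied to these tokens afterwards, as in preprocess_docs().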

2.2.1. simple_matching()

# simple matching
def simple_matching(query, docs):
    '''Compute a score as the number of simple word matches.

    :param query(str): query (search request).
    :param docs(list): preprocessed documents, one list of tokens per document.
    :return (list): score for each document.
    '''
    query = query.split(" ")
    result = []
    for doc in docs:
        score = 0
        for word in doc:
            for key in query:
                if key == word:
                    score += 1
        result.append(score)
    return result

user_query = "how much iphone"
scores = simple_matching(user_query, docs2)
print('simple_matching scores = ', scores)

simple_matching scores =  [0, 1, 0]
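
simple_matching() counts every pair of (document token, query term) that is equal. The same score can be sketched with collections.Counter. The tokens below are copied from the preprocessing output in section 2.2, and the name simple_matching_counter is an assumption made for this illustration.

```python
from collections import Counter

# Preprocessed tokens, copied from the output of preprocess_docs() above.
docs_tokens = [
    ['get', 'dis', '-', 'counted', 'price', 'trade', '-'],
    ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera'],
    ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-'],
]

def simple_matching_counter(query, docs):
    """Score each document by how often the query terms occur in it."""
    query_terms = query.split(" ")
    scores = []
    for doc in docs:
        counts = Counter(doc)  # token -> number of occurrences in this document
        scores.append(sum(counts[term] for term in query_terms))
    return scores

counter_scores = simple_matching_counter("how much iphone", docs_tokens)
print(counter_scores)
```

Only 'iphone' occurs in a document, so only the second document receives a point.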

2.2.2. relation_matching()

# relation matching
related_words = {}
related_words['buy'] = ['buy', '$', 'price', 'how much', 'trade-in']
related_words['UX'] = ['UX', 'stylish', 'seamless']

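def relation_matching(query, docs, related_words):
    '''Add points for matches against related terms prepared in advance.

    :param query(str): query (search request).
    :param docs(list): preprocessed documents, one list of tokens per document.
    :param related_words(dict): maps a relation name to its list of related terms.
    :return (list): score for each document.
    '''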
    scores = simple_matching(query, docs)

    query = query.split(" ")
    for q in query:
        for relation in related_words:
            matches = [q in word for word in related_words[relation]]
            if True in matches:
                new_query = ' '.join(related_words[relation])
                temp_scores = simple_matching(new_query, docs)
                print('# q = {}, relation = {} => temp_scores = {}'.format(q, relation, temp_scores))
                scores = list(np.array(scores) + np.array(temp_scores))
    scores = list(scores)
    return scores

scores2 = relation_matching(user_query, docs2, related_words)
print('simple_matching scores = ', scores)
print('relation_matching scores = ', scores2)

# q = how, relation = buy => temp_scores = [1, 0, 2]
# q = much, relation = buy => temp_scores = [1, 0, 2]
simple_matching scores =  [0, 1, 0]
relation_matching scores =  [2, 1, 4]
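
In the output above, the 'buy' relation is triggered twice, once by 'how' and once by 'much', so its points are added twice. A minimal sketch of a variant that applies each relation at most once, and that matches whole words rather than substrings, is shown below. The names count_matches and relation_matching_once are assumptions for this illustration; the tokens are copied from the preprocessing output in section 2.2.

```python
docs_tokens = [
    ['get', 'dis', '-', 'counted', 'price', 'trade', '-'],
    ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera'],
    ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-'],
]
related = {
    'buy': ['buy', '$', 'price', 'how much', 'trade-in'],
    'UX': ['UX', 'stylish', 'seamless'],
}

def count_matches(terms, doc_tokens):
    """Number of (document token, term) pairs that are equal."""
    return sum(1 for word in doc_tokens for key in terms if key == word)

def relation_matching_once(query, docs, related_words):
    """Simple matching plus related-term points, each relation counted once."""
    query_terms = query.split(" ")
    scores = [count_matches(query_terms, doc) for doc in docs]
    triggered = set()
    for q in query_terms:
        for relation, phrases in related_words.items():
            if relation in triggered:
                continue
            # Whole-word test against the words of each related phrase.
            if any(q in phrase.split(" ") for phrase in phrases):
                triggered.add(relation)
                expanded = " ".join(phrases).split(" ")
                for i, doc in enumerate(docs):
                    scores[i] += count_matches(expanded, doc)
    return scores

once_scores = relation_matching_once("how much iphone", docs_tokens, related)
print(once_scores)
```

Note that 'trade-in' in the related terms still matches nothing, because the preprocessing splits it into 'trade', '-', 'in'.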

2.3. Bag-of-Words

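collect_words_eng(): builds a word codebook from a collection of English documents.
make_vectors_eng(): builds document vectors whose features are the codebook entries.
euclidean_distance(): Euclidean distance.
cosine_distance(): cosine distance.
pairwise.cosine_similarity(): cosine similarity (provided by sklearn).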

import scipy.spatial.distance as distance

# BoW
# Example documents (three documents)
docs3 = []
docs3.append("This is test.")
docs3.append("That is test too.")
docs3.append("There are so many many tests.")


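# Build the set of term features (codebook) from the document collection
def collect_words_eng(docs):
    '''Build a word codebook from a collection of English documents.
    For simplicity, the processing steps are hard-coded.
    It might be more convenient to make them configurable as needed.

    :param docs(list): list of documents, one string per document.
    :return (list): unique words after sentence splitting, tokenization, lemmatization, and stop-word removal.
    '''
    codebook = []
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('.')   # add the period
    stopwords.append(',')   # add the comma
    stopwords.append('')    # add the empty string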
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in codebook and this_word not in stopwords:
                    codebook.append(this_word)
    return codebook

codebook = collect_words_eng(docs3)
print('codebook = ',codebook)

codebook =  ['test', 'many']

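# Build document vectors whose features are the codebook entries (direct vector construction)
def make_vectors_eng(docs, codebook):
    '''Build document vectors whose features are the codebook entries (direct vector construction).

    :param docs(list): list of documents, one string per document.
    :param codebook(list): list of unique words.
    :return (list): vectors whose components are the occurrence counts of the codebook words.
    '''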
    vectors = []
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        this_vector = []
        fdist = nltk.FreqDist()
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                fdist[this_word] += 1
        for word in codebook:
            this_vector.append(fdist[word])
        vectors.append(this_vector)
    return vectors

vectors = make_vectors_eng(docs3, codebook)
for index in range(len(docs3)):
    print('docs[{}] = {}'.format(index,docs3[index]))
    print('vectors[{}] = {}'.format(index,vectors[index]))
    print('----')

docs[0] = This is test.
vectors[0] = [1, 0]
----
docs[1] = That is test too.
vectors[1] = [1, 0]
----
docs[2] = There are so many many tests.
vectors[2] = [1, 2]
----

def euclidean_distance(vectors):
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(np.linalg.norm(vectors[i] - vectors[j]))
        distances.append(temp)
    return distances

distances = euclidean_distance(vectors)
print('# euclidean_distance')
for index in range(len(distances)):
    print(distances[index])

def cosine_distance(vectors):
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(distance.cosine(vectors[i], vectors[j]))
        distances.append(temp)
    return distances

distances = cosine_distance(vectors)
print('# cosine_distance')
for index in range(len(distances)):
    print(distances[index])


import sklearn.metrics.pairwise as pairwise
similarities = pairwise.cosine_similarity(vectors)
print('# cosine_similarity')
for index in range(len(similarities)):
    print(similarities[index])

# euclidean_distance
[0.0, 0.0, 2.0]
[0.0, 0.0, 2.0]
[2.0, 2.0, 0.0]
# cosine_distance
[0.0, 0.0, 0.5527864045000421]
[0.0, 0.0, 0.5527864045000421]
[0.5527864045000421, 0.5527864045000421, 0.0]
# cosine_similarity
[1.        1.        0.4472136]
[1.        1.        0.4472136]
[0.4472136 0.4472136 1.       ]
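
The three outputs above are related: cosine distance is 1 minus cosine similarity, and both ignore vector length, whereas Euclidean distance does not. As a small worked check in plain Python (the helper name cosine_sim is an assumption for this illustration), the values for docs[0] = [1, 0] and docs[2] = [1, 2] can be recomputed by hand:

```python
import math

def cosine_sim(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc0 = [1, 0]  # "This is test."
doc2 = [1, 2]  # "There are so many many tests."

similarity = cosine_sim(doc0, doc2)   # 1 / sqrt(5)
cos_distance = 1.0 - similarity       # cosine distance
euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(doc0, doc2)))
print(similarity, cos_distance, euclid)
```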

2.4. Example using sklearn's BoW and TF-IDF

Stemming, stop words, and so on can also be specified, but fine-grained control may be harder. (Subjective opinion.)

import sklearn.feature_extraction.text as fe_text

def bow(docs):
    '''Build Bag-of-Words vectors.

    :param docs(list): list of documents, one string per document.
    :return: document vectors and the fitted vectorizer.
    '''
    vectorizer = fe_text.CountVectorizer(stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

vectors, vectorizer = bow(docs)
print('# normal BoW')
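# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
print(vectorizer.get_feature_names())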
print(vectors)

def bow_tfidf(docs):
    '''Build Bag-of-Words vectors weighted by TF-IDF.

    :param docs(list): list of documents, one string per document.
    :return: weighted vectors and the fitted vectorizer.
    '''
    vectorizer = fe_text.TfidfVectorizer(norm=None, stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

vectors, vectorizer = bow_tfidf(docs)
print('# BoW + tfidf')
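# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
print(vectorizer.get_feature_names())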
print(vectors)

# normal BoW
['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']
[[0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0]
 [1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1]
 [0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0]]
# BoW + tfidf
['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']
[[0.         0.         0.         0.         0.         0.
  0.         0.         1.69314718 1.69314718 0.         0.
  0.         1.69314718 0.         0.         1.28768207 0.        ]
 [1.69314718 0.         0.         1.69314718 1.69314718 0.
  1.69314718 1.69314718 0.         0.         1.69314718 1.69314718
  0.         0.         1.69314718 1.69314718 0.         1.69314718]
 [0.         1.69314718 1.69314718 0.         0.         1.69314718
  0.         0.         0.         0.         0.         0.
  1.69314718 0.         0.         0.         1.28768207 0.        ]]
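
Because norm=None is passed, each TF-IDF value above is simply the raw count multiplied by the idf weight. The two distinct values can be reproduced by hand, assuming the smoothed idf that scikit-learn applies by default (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name smoothed_idf is an assumption for this sketch.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
idf_single = smoothed_idf(n_docs, 1)  # a term found in one document, e.g. 'price'
idf_trade = smoothed_idf(n_docs, 2)   # 'trade' is found in two documents
print(idf_single, idf_trade)
```

Every term here occurs at most once per document, so the TF-IDF values equal the idf weights: 'trade', which appears in two of the three documents, gets the lower weight.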

2.5. Vectorizing words from a co-occurrence matrix

preprocess(): text preprocessing.
create_co_matrix(): builds a co-occurrence matrix.
most_similar(): prints the top 5 words by cosine similarity.

import pandas as pd

sentence = 'pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"'
print(sentence)
print('len(sentence) = ', len(sentence))


def preprocess(text):
    """Preprocess a text.
    From 「ゼロから作るDeep Learning 2 自然言語処理編」, p.66.

    :param text(str): input text.
    :return:
      corpus(numpy.ndarray): the text as a sequence of word IDs.
      word_to_id(dict): dictionary that maps a word to its ID.
      id_to_word(dict): dictionary that maps an ID to its word.
    """
    text = text.lower()
    text = text.replace('.', ' .')
    text = text.replace('"', '')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess(sentence)
vocab_size = len(word_to_id)
print(corpus)
print(word_to_id)
print(id_to_word)

pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"
len(sentence) =  107
[ 0  1  2  3  4  5  6  7  8  9 10 11 12  0  1 13 14  7 14 15  0]
{'pandas': 0, 'is': 1, 'an': 2, 'open': 3, 'source': 4, 'programming': 5, 'tools': 6, '.': 7, 'the': 8, 'best': 9, 'way': 10, 'to': 11, 'get': 12, 'via': 13, 'conda': 14, 'install': 15}
{0: 'pandas', 1: 'is', 2: 'an', 3: 'open', 4: 'source', 5: 'programming', 6: 'tools', 7: '.', 8: 'the', 9: 'best', 10: 'way', 11: 'to', 12: 'get', 13: 'via', 14: 'conda', 15: 'install'}

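def create_co_matrix(corpus, vocab_size, window_size=1):
    """Build a co-occurrence matrix.
    From 「ゼロから作るDeep Learning 2 自然言語処理編」, p.72.

    :param corpus(numpy.ndarray): the text as a sequence of word IDs.
    :param vocab_size(int): vocabulary size.
    :param window_size(int): number of words on each side that count as co-occurring.
    :return: co-occurrence matrix of shape (vocab_size, vocab_size).
    """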
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size+1):
            left_idx = idx - i
            right_idx = idx + i
            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1
            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1
    return co_matrix

co_matrix = create_co_matrix(corpus, vocab_size, window_size=2)
df = pd.DataFrame(co_matrix, index=word_to_id.keys(), columns=word_to_id.keys())
df

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0	2	1	0	0	0	0	0	0	0	0	1	1	1	1	1
is	2	0	1	1	0	0	0	0	0	0	0	0	1	1	1	0
an	1	1	0	1	1	0	0	0	0	0	0	0	0	0	0	0
open	0	1	1	0	1	1	0	0	0	0	0	0	0	0	0	0
source	0	0	1	1	0	1	1	0	0	0	0	0	0	0	0	0
programming	0	0	0	1	1	0	1	1	0	0	0	0	0	0	0	0
tools	0	0	0	0	1	1	0	1	1	0	0	0	0	0	0	0
.	0	0	0	0	0	1	1	0	1	1	0	0	0	1	2	1
the	0	0	0	0	0	0	1	1	0	1	1	0	0	0	0	0
best	0	0	0	0	0	0	0	1	1	0	1	1	0	0	0	0
way	0	0	0	0	0	0	0	0	1	1	0	1	1	0	0	0
to	1	0	0	0	0	0	0	0	0	1	1	0	1	0	0	0
get	1	1	0	0	0	0	0	0	0	0	1	1	0	0	0	0
via	1	1	0	0	0	0	0	1	0	0	0	0	0	0	1	0
conda	1	1	0	0	0	0	0	2	0	0	0	0	0	1	2	1
install	1	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0
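
A few entries of the table above can be checked independently. The sketch below recounts co-occurrences with collections.Counter instead of a NumPy matrix; the word-ID sequence is copied from the printed corpus, and the name count_cooccurrences is an assumption for this illustration.

```python
from collections import Counter

# Word-ID sequence copied from the output of preprocess() above.
corpus_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 13, 14, 7, 14, 15, 0]
PANDAS, IS, PERIOD, CONDA = 0, 1, 7, 14

def count_cooccurrences(corpus, window_size):
    """Count (word, neighbor) pairs within window_size positions on each side."""
    pair_counts = Counter()
    for idx, word_id in enumerate(corpus):
        start = max(0, idx - window_size)
        end = min(len(corpus), idx + window_size + 1)
        for other_idx in range(start, end):
            if other_idx != idx:
                pair_counts[(word_id, corpus[other_idx])] += 1
    return pair_counts

pair_counts = count_cooccurrences(corpus_ids, window_size=2)
print(pair_counts[(PANDAS, IS)], pair_counts[(CONDA, PERIOD)], pair_counts[(CONDA, CONDA)])
print(sum(pair_counts.values()))
```

The non-zero diagonal entry for 'conda' arises because the word occurs twice within two positions of itself ("conda . conda").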

def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

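def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    """Print the words with the highest cosine similarity to the query (top 5 by default).

    :param query(str): query word.
    :param word_to_id(dict): dictionary that maps a word to its ID.
    :param id_to_word(dict): dictionary that maps an ID to its word.
    :param word_matrix: matrix of word vectors, one row per word ID (e.g. a co-occurrence matrix).
    :param top(int): how many top-ranked words to print.
    :return: None.
    """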
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(word_to_id)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))
        count += 1
        if count >= top:
            return

print('\n# most_similar() with co_matrix')
user_query = "pandas"
most_similar(user_query, word_to_id, id_to_word, co_matrix)

# most_similar() with co_matrix
[query] pandas
 conda: 0.5477225541919766
 open: 0.4743416451535486
 get: 0.4743416451535486
 via: 0.4743416451535486
 is: 0.4216370186169938

2.6. Improving the distributed representation with mutual information

ppmi(): Positive PMI (positive pointwise mutual information).

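def ppmi(C, verbose=False, eps=1e-8):
    """Positive PMI (positive pointwise mutual information).
    From 「ゼロから作るDeep Learning 2 自然言語処理編」, p.79.

    :param C: co-occurrence matrix.
    :param verbose(boolean): flag for printing progress.
    :param eps(float): small value that keeps np.log2 from returning -inf.
    :return: PPMI matrix with the same shape as C.
    """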
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                # "+ 1" avoids a modulo by zero when total < 100
                if cnt % (total//100 + 1) == 0:
                    print('%.1f%% done' % (100*cnt/total))
    return M

M = ppmi(co_matrix)
print('\n# PPMI')
df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())
df2

# PPMI

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0.000000	1.478047	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	1.285402	0.285402	1.70044
is	1.478047	0.000000	1.478047	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.478047	1.478047	0.478047	0.00000
an	1.285402	1.478047	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
open	0.000000	1.478047	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
source	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
programming	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
tools	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	1.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
.	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	0.000000	1.285402	1.285402	0.000000	0.000000	0.000000	1.285402	1.285402	1.70044
the	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	1.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.000000	0.00000
best	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.00000
way	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	2.285402	0.000000	0.000000	0.00000
to	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	2.285402	0.000000	0.000000	0.00000
get	1.285402	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.285402	2.285402	0.000000	0.000000	0.000000	0.00000
via	1.285402	1.478047	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.00000
conda	0.285402	0.478047	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	0.000000	0.000000	0.000000	0.000000	0.000000	1.285402	1.285402	1.70044
install	1.700440	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.700440	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.700440	0.00000
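
Individual entries of the PPMI table can be recomputed from the co-occurrence table in section 2.5. The sketch below applies the same formula as ppmi() to single cells; the counts (N = 78 and the row sums 8, 7, 8, 4) are read off that table, and the helper name ppmi_value is an assumption for this illustration.

```python
import math

def ppmi_value(c_ij, n_total, s_i, s_j, eps=1e-8):
    """PPMI of one cell: max(0, log2(C_ij * N / (S_i * S_j) + eps))."""
    pmi = math.log2(c_ij * n_total / (s_i * s_j) + eps)
    return max(0.0, pmi)

N = 78                                          # sum of all co-occurrence counts
S_PANDAS, S_IS, S_CONDA, S_OPEN = 8, 7, 8, 4    # row sums from the table

pandas_is = ppmi_value(2, N, S_PANDAS, S_IS)        # co-occur twice
pandas_conda = ppmi_value(1, N, S_PANDAS, S_CONDA)  # co-occur once
pandas_open = ppmi_value(0, N, S_PANDAS, S_OPEN)    # never co-occur
print(pandas_is, pandas_conda, pandas_open)
```

A pair that never co-occurs has a strongly negative PMI, which the max(0, ...) clips to zero; this is why most cells in the table are 0.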

#np.set_printoptions(precision=3) # show 3 decimal places (display only; the data keep full precision)
pd.options.display.precision = 3 # same, for pandas
print('\n# PPMI with precision=3')
df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())
df2

# PPMI with precision=3

	pandas	is	an	open	source	programming	tools	.	the	best	way	to	get	via	conda	install
pandas	0.000	1.478	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.285	1.285	1.285	0.285	1.7
is	1.478	0.000	1.478	1.478	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.478	1.478	0.478	0.0
an	1.285	1.478	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
open	0.000	1.478	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
source	0.000	0.000	2.285	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
programming	0.000	0.000	0.000	2.285	2.285	0.000	2.285	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.0
tools	0.000	0.000	0.000	0.000	2.285	2.285	0.000	1.285	2.285	0.000	0.000	0.000	0.000	0.000	0.000	0.0
.	0.000	0.000	0.000	0.000	0.000	1.285	1.285	0.000	1.285	1.285	0.000	0.000	0.000	1.285	1.285	1.7
the	0.000	0.000	0.000	0.000	0.000	0.000	2.285	1.285	0.000	2.285	2.285	0.000	0.000	0.000	0.000	0.0
best	0.000	0.000	0.000	0.000	0.000	0.000	0.000	1.285	2.285	0.000	2.285	2.285	0.000	0.000	0.000	0.0
way	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	2.285	2.285	0.000	0.000	0.0
to	1.285	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	2.285	0.000	0.000	0.0
get	1.285	1.478	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	2.285	2.285	0.000	0.000	0.000	0.0
via	1.285	1.478	0.000	0.000	0.000	0.000	0.000	1.285	0.000	0.000	0.000	0.000	0.000	0.000	1.285	0.0
conda	0.285	0.478	0.000	0.000	0.000	0.000	0.000	1.285	0.000	0.000	0.000	0.000	0.000	1.285	1.285	1.7
install	1.700	0.000	0.000	0.000	0.000	0.000	0.000	1.700	0.000	0.000	0.000	0.000	0.000	0.000	1.700	0.0

print('\n# most_similar() with PPMI')
most_similar(user_query, word_to_id, id_to_word, M)

# most_similar() with PPMI
[query] pandas
 conda: 0.5733166933059692
 is: 0.5094797611236572
 .: 0.40005457401275635
 get: 0.39511924982070923
 way: 0.3747256100177765

2.7. Dimensionality reduction with SVD

np.linalg.svd(): uses NumPy's linear algebra module.

# svd
# np.linalg.svd returns the third factor already transposed, so M = U @ diag(S) @ V
U, S, V = np.linalg.svd(M)
print('\n# SVD: dense vectors with all singular values')
print(U)

use_s_values = 2
U2 = U[:,0:use_s_values]
print('\n# SVD: dense vectors with singular values = {}'.format(use_s_values))
print(U2)

print('\n# most_similar() with SVD-2')
most_similar(user_query, word_to_id, id_to_word, U2)

# SVD: dense vectors with all singular values
[[-0.20909968 -0.05742509 -0.4260506  -0.17504837  0.2712971   0.04672289
   0.18823615 -0.18591174  0.36170715 -0.02090307  0.33414906  0.12803775
   0.20772134 -0.17504339  0.12016748 -0.49748433]
 [-0.20854177  0.06963065 -0.3888203   0.1402168   0.20635465  0.15388878
  -0.2102995  -0.13104966 -0.38273078  0.39125082  0.11626503 -0.32180378
  -0.35602686 -0.3037927  -0.09017423  0.09666564]
 [-0.23961712  0.26655334 -0.18719402  0.05390574 -0.35268205  0.2592225
  -0.32773367 -0.09349788  0.2730225   0.0377553  -0.40234843 -0.31079715
   0.17229767  0.28257638 -0.12586372 -0.25696287]
 [-0.2782892   0.35838643 -0.04367146 -0.32175225  0.05215078  0.23192607
   0.21995124  0.49517277  0.11888672 -0.1294459   0.23939379 -0.12363093
  -0.32118776  0.27299288  0.13509066  0.19793455]
 [-0.3138919   0.39330027  0.14138082  0.21887659  0.3510343   0.1303136
   0.33645037 -0.16895744 -0.4147285  -0.14546476 -0.16639212  0.12915234
   0.36436176  0.11097264 -0.11952758 -0.00717432]
 [-0.29259852  0.32289356  0.2055024   0.29696342 -0.23372465 -0.05240921
  -0.22590032 -0.37780765  0.24121043 -0.03001606  0.29355177  0.3378207
  -0.1754489  -0.11105473  0.2770583   0.22205435]
 [-0.29597434  0.18007208  0.3224701  -0.47838306 -0.15175404 -0.19339524
  -0.20657967  0.22351523 -0.11675979  0.25128582  0.00841147  0.15087281
   0.17578007 -0.40954658 -0.2627039  -0.1637667 ]
 [-0.25471854 -0.01025081  0.02076114 -0.05029602  0.36506298 -0.47563514
   0.0229755   0.00612366  0.19963232  0.08954368 -0.58078074 -0.02569766
  -0.26210678 -0.01673562  0.34222415 -0.00723523]
 [-0.29356492 -0.19401811  0.31909984  0.49437767 -0.02606995 -0.1744973
   0.20824735  0.21605654  0.2066092  -0.09872612  0.19246763 -0.389128
  -0.11316317 -0.11955503 -0.29818308 -0.22771738]
 [-0.28762156 -0.33267495  0.200779   -0.2494268  -0.28890833 -0.02771297
   0.27570227 -0.329145   -0.23007524  0.28904447  0.13154629 -0.28043836
   0.10255558  0.28061515  0.3373071   0.00459014]
 [-0.30796608 -0.3980363   0.13565156 -0.2586614   0.28628355  0.16009486
  -0.41644257 -0.21415797 -0.02268608 -0.5027486   0.02590389 -0.01947945
  -0.08477696  0.10067189 -0.20825593  0.13973713]
 [-0.26908046 -0.35622442 -0.04449803  0.3045457   0.09906036  0.229743
  -0.24744193  0.4765816  -0.00974799  0.2936326  -0.00161449  0.30523014
   0.30354518  0.11671898  0.24741183  0.0925235 ]
 [-0.23532367 -0.25857002 -0.19031367 -0.01856818 -0.34355265  0.2738812
   0.43220395 -0.04992938  0.06145945 -0.03610538 -0.34404024  0.3561516
  -0.258372   -0.24605696 -0.2454495   0.14502974]
 [-0.132769    0.00144362 -0.31328392  0.02380027 -0.25689903 -0.2137918
  -0.00529088  0.15167178 -0.13748148 -0.45615938 -0.02669457 -0.27819303
   0.35791093 -0.38731208  0.29279673  0.28283092]
 [-0.12850079 -0.00502957 -0.3181515  -0.0227449   0.01897933 -0.46368992
   0.04571127 -0.06436063  0.18308993  0.2039377   0.152593    0.05987117
   0.20431407  0.32411188 -0.46140748  0.44548053]
 [-0.12502334 -0.01886328 -0.2561411   0.08992951 -0.24950773 -0.35275292
  -0.11904678  0.12406847 -0.44487035 -0.22767347  0.08375221  0.287249
  -0.2736007   0.31247255 -0.00371157 -0.42694545]]

# SVD: dense vectors with singular values = 2
[[-0.20909968 -0.05742509]
 [-0.20854177  0.06963065]
 [-0.23961712  0.26655334]
 [-0.2782892   0.35838643]
 [-0.3138919   0.39330027]
 [-0.29259852  0.32289356]
 [-0.29597434  0.18007208]
 [-0.25471854 -0.01025081]
 [-0.29356492 -0.19401811]
 [-0.28762156 -0.33267495]
 [-0.30796608 -0.3980363 ]
 [-0.26908046 -0.35622442]
 [-0.23532367 -0.25857002]
 [-0.132769    0.00144362]
 [-0.12850079 -0.00502957]
 [-0.12502334 -0.01886328]]

# most_similar() with SVD-2
[query] pandas
 install: 0.9930136799812317
 .: 0.9741653800010681
 conda: 0.9739159345626831
 via: 0.9613599181175232
 the: 0.950492262840271
