{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"colab":{"name":"nlp1.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true}},"cells":[{"cell_type":"markdown","metadata":{"id":"mf7eFNWYRoYA"},"source":["# Code examples: thesaurus-based, count-based, and inference-based designs (from-scratch implementation, NLTK edition)\n","- Notes\n","  - NLP workflows differ considerably depending on the tools used. This notebook prioritizes making typical preprocessing steps (sentence splitting, tokenization, stemming, etc.) easy to observe. Easier-to-use tools will be introduced at a later date.\n","- Overview\n","    - Setup\n","    - Thesaurus example\n","    - Bag-of-Words\n","    - Example using sklearn's BoW and TF-IDF\n","    - Word vectorization based on a co-occurrence matrix\n","    - Improving distributed representations with mutual information\n","    - Dimensionality reduction with SVD"]},{"cell_type":"markdown","metadata":{"id":"Dqa9QDJ7RoYG"},"source":["## Setup\n","- Notes for running\n","    - You need to install [Natural Language Toolkit; NLTK](https://www.nltk.org) and additionally download corpora and other data. (Everything can be installed at once, but that takes a fair amount of disk space, so only a minimal set is installed by default.)\n","    - Steps\n","      - Install NLTK.\n","      - Run ``nltk.download()`` from the Python interpreter and download the related corpora listed below.\n","        - wordnet and stopwords under the Corpora tab\n","        - punkt under the Models tab\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"MWObKXywR_i5","executionInfo":{"status":"ok","timestamp":1618203558634,"user_tz":-540,"elapsed":5252,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"cae87c73-7d1b-4beb-fb6b-f8635190aeb4"},"source":["!pip install nltk"],"execution_count":1,"outputs":[{"output_type":"stream","text":["Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)\n","Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk) 
(1.15.0)\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"huRK9h9VSPil","executionInfo":{"status":"ok","timestamp":1618203565909,"user_tz":-540,"elapsed":4910,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"b05a33e6-401a-4aa8-8701-fbefe152a848"},"source":["import nltk\n","nltk.download(['wordnet', 'stopwords', 'punkt'])"],"execution_count":2,"outputs":[{"output_type":"stream","text":["[nltk_data] Downloading package wordnet to /root/nltk_data...\n","[nltk_data]   Unzipping corpora/wordnet.zip.\n","[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data]   Unzipping corpora/stopwords.zip.\n","[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data]   Unzipping tokenizers/punkt.zip.\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{"tags":[]},"execution_count":2}]},{"cell_type":"markdown","metadata":{"id":"e4EZ_PuwRoYH"},"source":["## Thesaurus example\n","- preprocess_docs(): an example of text preprocessing.\n","- simple_matching(): scores each document against the user query by simple word matching.\n","- relation_matching(): in addition to simple matching, adds points using a thesaurus."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"rrF7i9ojRoYH","executionInfo":{"status":"ok","timestamp":1618203580208,"user_tz":-540,"elapsed":2400,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"85e42d11-c987-401a-f82d-01e2e9317b86"},"source":["# Preprocessing\n","from nltk.tokenize import wordpunct_tokenize, sent_tokenize\n","# [Notes on the NLTK components used]\n","# nltk.corpus.stopwords: a blacklist of words excluded because they are poor features for characterizing a text, known as stop words.\n","# nltk.sent_tokenize: splits a document (doc) into sentences.\n","# nltk.wordpunct_tokenize: splits a sentence into words, known as tokenization.\n","# nltk.stem.wordnet.WordNetLemmatizer.lemmatize: converts a word to (something close to) its base form, known as lemmatization (related to, but distinct from, stemming).\n","\n","import numpy as np\n","\n","# Example documents (three documents)\n","docs = []\n","docs.append(\"You can get dis-counted 
price with trade-in.\")\n","docs.append(\"iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.\")\n","docs.append(\"From $16.62/mo. or $399 with trade-in.\")\n","\n","def preprocess_docs(docs):\n","    '''Preprocess the English document collection docs and return the tokenized result as a list of lists.\n","\n","    :param docs(list): list of documents, one string per document.\n","    :return (list): result after sentence splitting, word splitting, lemmatization, and stop-word removal.\n","    '''\n","    stopwords = nltk.corpus.stopwords.words('english')\n","    stopwords.append('.')  # Add the period.\n","    stopwords.append(',')  # Add the comma.\n","    stopwords.append('')  # Add the empty string.\n","\n","    result = []\n","    wnl = nltk.stem.wordnet.WordNetLemmatizer()\n","    for doc in docs:\n","        temp = []\n","        for sent in sent_tokenize(doc):\n","            for word in wordpunct_tokenize(sent):\n","                this_word = wnl.lemmatize(word.lower())\n","                if this_word not in stopwords:\n","                    temp.append(this_word)\n","        result.append(temp)\n","    return result\n","\n","docs2 = preprocess_docs(docs)\n","for index in range(len(docs2)):\n","    print('before: ', docs[index])\n","    print('after: ', docs2[index])\n","    print('----')\n"],"execution_count":3,"outputs":[{"output_type":"stream","text":["before:  You can get dis-counted price with trade-in.\n","after:  ['get', 'dis', '-', 'counted', 'price', 'trade', '-']\n","----\n","before:  iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.\n","after:  ['iphone', '11', 'shoot', 'beautifully', 'sharp', '4k', 'video', '60', 'fps', 'across', 'camera']\n","----\n","before:  From $16.62/mo. 
or $399 with trade-in.\n","after:  ['$', '16', '62', '/', 'mo', '$', '399', 'trade', '-']\n","----\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"T7zIO15ERoYI"},"source":["### simple_matching()\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"yH2IszZMRoYI","executionInfo":{"status":"ok","timestamp":1618203631287,"user_tz":-540,"elapsed":792,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"c90e3031-e766-4f5a-dd56-204edb1a0bdb"},"source":["# simple matching\n","def simple_matching(query, docs):\n","    '''Score each document by the number of simple word matches.\n","\n","    :param query(str): query (search request).\n","    :param docs(list): preprocessed documents, one list of tokens per document (e.g. the output of preprocess_docs()).\n","    :return (list): score for each document.\n","    '''\n","    query = query.split(\" \")\n","    result = []\n","    for doc in docs:\n","        score = 0\n","        for word in doc:\n","            for key in query:\n","                if key == word:\n","                    score += 1\n","        result.append(score)\n","    return result\n","\n","user_query = \"how much iphone\"\n","scores = simple_matching(user_query, docs2)\n","print('simple_matching scores = ', scores)\n"],"execution_count":4,"outputs":[{"output_type":"stream","text":["simple_matching scores =  [0, 1, 0]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"LT5p-w38RoYJ"},"source":["### relation_matching()\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dsWTpzGrRoYJ","executionInfo":{"status":"ok","timestamp":1618203651128,"user_tz":-540,"elapsed":756,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"bfd87fd4-6357-4a74-bad3-17afb9578656"},"source":["# relation matching\n","related_words = {}\n","related_words['buy'] = ['buy', '$', 'price', 'how much', 'trade-in']\n","related_words['UX'] = ['UX', 'stylish', 'seamless']\n","\n","def relation_matching(query, docs, 
related_words):\n","    '''Use pre-defined related terms and add the number of matches to the score.\n","\n","    :param query(str): query (search request).\n","    :param docs(list): preprocessed documents, one list of tokens per document (e.g. the output of preprocess_docs()).\n","    :param related_words(dict): maps a relation name to a list of related terms.\n","    :return (list): score for each document.\n","    '''\n","    scores = simple_matching(query, docs)\n","\n","    query = query.split(\" \")\n","    for q in query:\n","        for relation in related_words:\n","            matches = [q in word for word in related_words[relation]]\n","            if True in matches:\n","                new_query = ' '.join(related_words[relation])\n","                temp_scores = simple_matching(new_query, docs)\n","                print('# q = {}, relation = {} => temp_scores = {}'.format(q, relation, temp_scores))\n","                scores = list(np.array(scores) + np.array(temp_scores))\n","    scores = list(scores)\n","    return scores\n","\n","scores2 = relation_matching(user_query, docs2, related_words)\n","print('simple_matching scores = ', scores)\n","print('relation_matching scores = ', scores2)\n"],"execution_count":5,"outputs":[{"output_type":"stream","text":["# q = how, relation = buy => temp_scores = [1, 0, 2]\n","# q = much, relation = buy => temp_scores = [1, 0, 2]\n","simple_matching scores =  [0, 1, 0]\n","relation_matching scores =  [2, 1, 4]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"Uhc5-DUQRoYK"},"source":["## Bag-of-Words\n","- collect_words_eng(): builds a word codebook from a collection of English documents\n","- make_vectors_eng(): builds document vectors that use the codebook as features\n","- euclidean_distance(): Euclidean distance\n","- cosine_distance(): cosine distance\n","- cosine_similarity(): cosine similarity"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Gbedc9x4RoYK","executionInfo":{"status":"ok","timestamp":1618203699857,"user_tz":-540,"elapsed":609,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"ac4edd80-9bda-48af-a140-3682560db28d"},"source":["import scipy.spatial.distance as distance\n","\n","# BoW\n","# 
Example documents (three documents)\n","docs3 = []\n","docs3.append(\"This is test.\")\n","docs3.append(\"That is test too.\")\n","docs3.append(\"There are so many many tests.\")\n","\n","\n","# Build the set of term features (codebook) from the document collection\n","def collect_words_eng(docs):\n","    '''Build a word codebook from a collection of English documents.\n","    For simplicity, the processing is hard-coded.\n","    It might be easier to use if the steps could be specified as needed.\n","\n","    :param docs(list): list of documents, one string per document.\n","    :return (list): unique words after sentence splitting, word splitting, lemmatization, and stop-word removal.\n","    '''\n","    codebook = []\n","    stopwords = nltk.corpus.stopwords.words('english')\n","    stopwords.append('.')   # Add the period.\n","    stopwords.append(',')   # Add the comma.\n","    stopwords.append('')    # Add the empty string.\n","    wnl = nltk.stem.wordnet.WordNetLemmatizer()\n","    for doc in docs:\n","        for sent in sent_tokenize(doc):\n","            for word in wordpunct_tokenize(sent):\n","                this_word = wnl.lemmatize(word.lower())\n","                if this_word not in codebook and this_word not in stopwords:\n","                    codebook.append(this_word)\n","    return codebook\n","\n","codebook = collect_words_eng(docs3)\n","print('codebook = ',codebook)\n"],"execution_count":6,"outputs":[{"output_type":"stream","text":["codebook =  ['test', 'many']\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OFSt8UnORoYK","executionInfo":{"status":"ok","timestamp":1618203731872,"user_tz":-540,"elapsed":618,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"17146ea6-7ba5-41ac-fae5-8534f790ecc6"},"source":["# Build document vectors that use the codebook as features (direct vector generation)\n","def make_vectors_eng(docs, codebook):\n","    '''Build document vectors that use the codebook as features (direct vector generation).\n","\n","    :param docs(list): list of documents, one string per document.\n","    :param codebook(list): list of unique words.\n","    :return (list): vectors whose features are the occurrence counts of the codebook words.\n","    '''\n","    vectors = []\n","    wnl = nltk.stem.wordnet.WordNetLemmatizer()\n","    for doc in docs:\n","        
this_vector = []\n","        fdist = nltk.FreqDist()\n","        for sent in sent_tokenize(doc):\n","            for word in wordpunct_tokenize(sent):\n","                this_word = wnl.lemmatize(word.lower())\n","                fdist[this_word] += 1\n","        for word in codebook:\n","            this_vector.append(fdist[word])\n","        vectors.append(this_vector)\n","    return vectors\n","\n","vectors = make_vectors_eng(docs3, codebook)\n","for index in range(len(docs3)):\n","    print('docs[{}] = {}'.format(index,docs3[index]))\n","    print('vectors[{}] = {}'.format(index,vectors[index]))\n","    print('----')\n"],"execution_count":7,"outputs":[{"output_type":"stream","text":["docs[0] = This is test.\n","vectors[0] = [1, 0]\n","----\n","docs[1] = That is test too.\n","vectors[1] = [1, 0]\n","----\n","docs[2] = There are so many many tests.\n","vectors[2] = [1, 2]\n","----\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"fCFUHGcdRoYL","executionInfo":{"status":"ok","timestamp":1618203762277,"user_tz":-540,"elapsed":860,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"71fe6798-145d-4684-b142-86a33423eca6"},"source":["def euclidean_distance(vectors):\n","    vectors = np.array(vectors)\n","    distances = []\n","    for i in range(len(vectors)):\n","        temp = []\n","        for j in range(len(vectors)):\n","            temp.append(np.linalg.norm(vectors[i] - vectors[j]))\n","        distances.append(temp)\n","    return distances\n","\n","distances = euclidean_distance(vectors)\n","print('# euclidean_distance')\n","for index in range(len(distances)):\n","    print(distances[index])\n","\n","def cosine_distance(vectors):\n","    vectors = np.array(vectors)\n","    distances = []\n","    for i in range(len(vectors)):\n","        temp = []\n","        for j in range(len(vectors)):\n","            temp.append(distance.cosine(vectors[i], 
vectors[j]))\n","        distances.append(temp)\n","    return distances\n","\n","distances = cosine_distance(vectors)\n","print('# cosine_distance')\n","for index in range(len(distances)):\n","    print(distances[index])\n","\n","\n","import sklearn.metrics.pairwise as pairwise\n","distances = pairwise.cosine_similarity(vectors)\n","print('# cosine_similarity')\n","for index in range(len(distances)):\n","    print(distances[index])\n"],"execution_count":8,"outputs":[{"output_type":"stream","text":["# euclidean_distance\n","[0.0, 0.0, 2.0]\n","[0.0, 0.0, 2.0]\n","[2.0, 2.0, 0.0]\n","# cosine_distance\n","[0.0, 0.0, 0.5527864045000421]\n","[0.0, 0.0, 0.5527864045000421]\n","[0.5527864045000421, 0.5527864045000421, 0.0]\n","# cosine_similarity\n","[1.        1.        0.4472136]\n","[1.        1.        0.4472136]\n","[0.4472136 0.4472136 1.       ]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"jGuHZ15hRoYM"},"source":["## Example using sklearn's BoW and TF-IDF\n","- Stop words can be set directly and stemming can be added through a custom tokenizer, but fine-grained control may be harder. (Subjective opinion)"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"c4T04o3URoYM","executionInfo":{"status":"ok","timestamp":1618203792672,"user_tz":-540,"elapsed":884,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"5c3a3b01-32e2-40ee-8794-69887a4fbce0"},"source":["import sklearn.feature_extraction.text as fe_text\n","\n","def bow(docs):\n","    '''Generate Bag-of-Words vectors.\n","\n","    :param docs(list): list of documents, one string per document.\n","    :return: document vectors (array) and the fitted vectorizer.\n","    '''\n","    vectorizer = fe_text.CountVectorizer(stop_words='english')\n","    vectors = vectorizer.fit_transform(docs)\n","    return vectors.toarray(), vectorizer\n","\n","vectors, vectorizer = bow(docs)\n","print('# normal BoW')\n","print(vectorizer.get_feature_names())\n","print(vectors)\n","\n","def bow_tfidf(docs):\n","    '''Generate Bag-of-Words vectors reweighted with TF-IDF.\n","\n","    :param docs(list): list of documents, one string per document.\n","   
 :return: reweighted vectors (array) and the fitted vectorizer.\n","    '''\n","    vectorizer = fe_text.TfidfVectorizer(norm=None, stop_words='english')\n","    vectors = vectorizer.fit_transform(docs)\n","    return vectors.toarray(), vectorizer\n","\n","vectors, vectorizer = bow_tfidf(docs)\n","print('# BoW + tfidf')\n","print(vectorizer.get_feature_names())\n","print(vectors)\n"],"execution_count":9,"outputs":[{"output_type":"stream","text":["# normal BoW\n","['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']\n","[[0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0]\n"," [1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1]\n"," [0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0]]\n","# BoW + tfidf\n","['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']\n","[[0.         0.         0.         0.         0.         0.\n","  0.         0.         1.69314718 1.69314718 0.         0.\n","  0.         1.69314718 0.         0.         1.28768207 0.        ]\n"," [1.69314718 0.         0.         1.69314718 1.69314718 0.\n","  1.69314718 1.69314718 0.         0.         1.69314718 1.69314718\n","  0.         0.         1.69314718 1.69314718 0.         1.69314718]\n"," [0.         1.69314718 1.69314718 0.         0.         1.69314718\n","  0.         0.         0.         0.         0.         0.\n","  1.69314718 0.         0.         0.         1.28768207 0.        
]]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"WckDMfhZRoYM"},"source":["## Word vectorization based on a co-occurrence matrix\n","- preprocess(): preprocesses the text.\n","- create_co_matrix(): builds the co-occurrence matrix.\n","- most_similar(): prints the top 5 words by cosine similarity."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"EjfrZOVIRoYM","executionInfo":{"status":"ok","timestamp":1618203822642,"user_tz":-540,"elapsed":627,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"055c1cfe-15a5-4e26-dde8-fab5867462dd"},"source":["import pandas as pd\n","\n","sentence = 'pandas is an open source programming tools. The best way to get pandas is via conda. \"conda install pandas\"'\n","print(sentence)\n","print('len(sentence) = ', len(sentence))\n","\n","\n","def preprocess(text):\n","    \"\"\"Preprocess the text.\n","    From 「ゼロから作るDeepLearning2 自然言語処理編」, p.66.\n","\n","    :param text(str): input text.\n","    :return:\n","      corpus(numpy.ndarray): the text as a sequence of word ids (the ids used in id_to_word).\n","      word_to_id(dict): dictionary mapping a word (key) to its id.\n","      id_to_word(dict): dictionary mapping an id (key) to its word.\n","    \"\"\"\n","    text = text.lower()\n","    text = text.replace('.', ' .')\n","    text = text.replace('\"', '')\n","    words = text.split(' ')\n","\n","    word_to_id = {}\n","    id_to_word = {}\n","    for word in words:\n","        if word not in word_to_id:\n","            new_id = len(word_to_id)\n","            word_to_id[word] = new_id\n","            id_to_word[new_id] = word\n","    corpus = np.array([word_to_id[w] for w in words])\n","    return corpus, word_to_id, id_to_word\n","\n","corpus, word_to_id, id_to_word = preprocess(sentence)\n","vocab_size = len(word_to_id)\n","print(corpus)\n","print(word_to_id)\n","print(id_to_word)"],"execution_count":10,"outputs":[{"output_type":"stream","text":["pandas is an open source programming tools. The best way to get pandas is via conda. 
\"conda install pandas\"\n","len(sentence) =  107\n","[ 0  1  2  3  4  5  6  7  8  9 10 11 12  0  1 13 14  7 14 15  0]\n","{'pandas': 0, 'is': 1, 'an': 2, 'open': 3, 'source': 4, 'programming': 5, 'tools': 6, '.': 7, 'the': 8, 'best': 9, 'way': 10, 'to': 11, 'get': 12, 'via': 13, 'conda': 14, 'install': 15}\n","{0: 'pandas', 1: 'is', 2: 'an', 3: 'open', 4: 'source', 5: 'programming', 6: 'tools', 7: '.', 8: 'the', 9: 'best', 10: 'way', 11: 'to', 12: 'get', 13: 'via', 14: 'conda', 15: 'install'}\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":565},"id":"8FhiMV71RoYN","executionInfo":{"status":"ok","timestamp":1618203837270,"user_tz":-540,"elapsed":618,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"0d73890b-d4a5-4925-f1a2-22a07680f390"},"source":["def create_co_matrix(corpus, vocab_size, window_size=1):\n","    \"\"\"共起行列を作成。\n","    「ゼロから作るDeepLearning2 自然言語処理辺」p.72より。\n","\n","    :param corpus(str): テキスト文。\n","    :param vocab_size: 語彙数。\n","    :param window_size: 共起判定の範囲。\n","    :return:\n","    \"\"\"\n","    corpus_size = len(corpus)\n","    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)\n","\n","    for idx, word_id in enumerate(corpus):\n","        for i in range(1, window_size+1):\n","            left_idx = idx - i\n","            right_idx = idx + i\n","            if left_idx >= 0:\n","                left_word_id = corpus[left_idx]\n","                co_matrix[word_id, left_word_id] += 1\n","            if right_idx < corpus_size:\n","                right_word_id = corpus[right_idx]\n","                co_matrix[word_id, right_word_id] += 1\n","    return co_matrix\n","\n","co_matrix = create_co_matrix(corpus, vocab_size, window_size=2)\n","df = pd.DataFrame(co_matrix, index=word_to_id.keys(), 
columns=word_to_id.keys())\n","df"],"execution_count":11,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>pandas</th>\n","      <th>is</th>\n","      <th>an</th>\n","      <th>open</th>\n","      <th>source</th>\n","      <th>programming</th>\n","      <th>tools</th>\n","      <th>.</th>\n","      <th>the</th>\n","      <th>best</th>\n","      <th>way</th>\n","      <th>to</th>\n","      <th>get</th>\n","      <th>via</th>\n","      <th>conda</th>\n","      <th>install</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>pandas</th>\n","      <td>0</td>\n","      <td>2</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>is</th>\n","      <td>2</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>an</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      
<td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>open</th>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>source</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>programming</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>tools</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>.</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n"," 
     <td>1</td>\n","      <td>2</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>the</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>best</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>way</th>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>to</th>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>get</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","    
</tr>\n","    <tr>\n","      <th>via</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>0</td>\n","    </tr>\n","    <tr>\n","      <th>conda</th>\n","      <td>1</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>2</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>2</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>install</th>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>0</td>\n","      <td>1</td>\n","      <td>0</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["             pandas  is  an  open  source  ...  to  get  via  conda  install\n","pandas            0   2   1     0       0  ...   1    1    1      1        1\n","is                2   0   1     1       0  ...   0    1    1      1        0\n","an                1   1   0     1       1  ...   0    0    0      0        0\n","open              0   1   1     0       1  ...   0    0    0      0        0\n","source            0   0   1     1       0  ...   0    0    0      0        0\n","programming       0   0   0     1       1  ...   0    0    0      0        0\n","tools             0   0   0     0       1  ...   0    0    0      0        0\n",".                 0   0   0     0       0  ...   
0    0    1      2        1\n","the               0   0   0     0       0  ...   0    0    0      0        0\n","best              0   0   0     0       0  ...   1    0    0      0        0\n","way               0   0   0     0       0  ...   1    1    0      0        0\n","to                1   0   0     0       0  ...   0    1    0      0        0\n","get               1   1   0     0       0  ...   1    0    0      0        0\n","via               1   1   0     0       0  ...   0    0    0      1        0\n","conda             1   1   0     0       0  ...   0    0    1      2        1\n","install           1   0   0     0       0  ...   0    0    0      1        0\n","\n","[16 rows x 16 columns]"]},"metadata":{"tags":[]},"execution_count":11}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"MRbXiIsKRoYN","executionInfo":{"status":"ok","timestamp":1618203913968,"user_tz":-540,"elapsed":709,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"d0321c3b-9eb1-42a1-e54b-4f50cbdec6d9"},"source":["def cos_similarity(x, y, eps=1e-8):\n","    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)\n","    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)\n","    return np.dot(nx, ny)\n","\n","def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):\n","    \"\"\"Print the words most similar to the query by cosine similarity (top 5 by default).\n","\n","    :param query(str): query word.\n","    :param word_to_id(dict): dictionary mapping a word (key) to its id.\n","    :param id_to_word(dict): dictionary mapping an id (key) to its word.\n","    :param word_matrix: word vector matrix, e.g. the co-occurrence matrix.\n","    :param top(int): number of top results to print.\n","    :return: None.\n","    \"\"\"\n","    if query not in word_to_id:\n","        print('%s is not found' % query)\n","        return\n","\n","    print('[query] ' + query)\n","    query_id = word_to_id[query]\n","    query_vec = word_matrix[query_id]\n","\n","    vocab_size = len(word_to_id)\n","    similarity = np.zeros(vocab_size)\n","    for i in range(vocab_size):\n","        similarity[i] = 
cos_similarity(word_matrix[i], query_vec)\n","\n","    count = 0\n","    for i in (-1 * similarity).argsort():\n","        if id_to_word[i] == query:\n","            continue\n","        print(' %s: %s' % (id_to_word[i], similarity[i]))\n","        count += 1\n","        if count >= top:\n","            return\n","\n","print('\\n# most_similar() with co_matrix')\n","user_query = \"pandas\"\n","most_similar(user_query, word_to_id, id_to_word, co_matrix)"],"execution_count":12,"outputs":[{"output_type":"stream","text":["\n","# most_similar() with co_matrix\n","[query] pandas\n"," conda: 0.5477225541919766\n"," open: 0.4743416451535486\n"," get: 0.4743416451535486\n"," via: 0.4743416451535486\n"," is: 0.4216370186169938\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"JcE9RH1kRoYO"},"source":["## Improving distributed representations with mutual information\n","- ppmi(): Positive PMI (positive pointwise mutual information)."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":599},"id":"K8e08GIORoYP","executionInfo":{"status":"ok","timestamp":1618203923685,"user_tz":-540,"elapsed":711,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"57b90e6a-01fc-4133-a43c-d09a0f12611b"},"source":["def ppmi(C, verbose=False, eps=1e-8):\n","    \"\"\"Positive PMI (positive pointwise mutual information).\n","    From 「ゼロから作るDeepLearning2 自然言語処理編」, p.79.\n","\n","    :param C: co-occurrence matrix.\n","    :param verbose(boolean): flag for printing progress.\n","    :param eps(float): small value that keeps np.log2 from returning -inf.\n","    :return: PPMI matrix (same shape as C, float32).\n","    \"\"\"\n","    M = np.zeros_like(C, dtype=np.float32)\n","    N = np.sum(C)\n","    S = np.sum(C, axis=0)\n","    total = C.shape[0] * C.shape[1]\n","    cnt = 0\n","\n","    for i in range(C.shape[0]):\n","        for j in range(C.shape[1]):\n","            pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)\n","            M[i, j] = max(0, pmi)\n","\n","            if verbose:\n","                cnt += 1\n","                if cnt % (total//100 + 1) == 0:\n","                    print('%.1f%% 
done' % (100*cnt/total))\n","    return M\n","\n","M = ppmi(co_matrix)\n","print('\\n# PPMI')\n","df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())\n","df2"],"execution_count":13,"outputs":[{"output_type":"stream","text":["\n","# PPMI\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>pandas</th>\n","      <th>is</th>\n","      <th>an</th>\n","      <th>open</th>\n","      <th>source</th>\n","      <th>programming</th>\n","      <th>tools</th>\n","      <th>.</th>\n","      <th>the</th>\n","      <th>best</th>\n","      <th>way</th>\n","      <th>to</th>\n","      <th>get</th>\n","      <th>via</th>\n","      <th>conda</th>\n","      <th>install</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>pandas</th>\n","      <td>0.000000</td>\n","      <td>1.478047</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>0.285402</td>\n","      <td>1.70044</td>\n","    </tr>\n","    <tr>\n","      <th>is</th>\n","      <td>1.478047</td>\n","      <td>0.000000</td>\n","      <td>1.478047</td>\n","      <td>1.478047</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      
<td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.478047</td>\n","      <td>1.478047</td>\n","      <td>0.478047</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>an</th>\n","      <td>1.285402</td>\n","      <td>1.478047</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>open</th>\n","      <td>0.000000</td>\n","      <td>1.478047</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>source</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>programming</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      
<td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>tools</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>.</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>1.70044</td>\n","    </tr>\n","    <tr>\n","      <th>the</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>best</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      
<td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>way</th>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>to</th>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>get</th>\n","      <td>1.285402</td>\n","      <td>1.478047</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>2.285402</td>\n","      <td>2.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>via</th>\n","      <td>1.285402</td>\n","      <td>1.478047</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      
<td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>0.00000</td>\n","    </tr>\n","    <tr>\n","      <th>conda</th>\n","      <td>0.285402</td>\n","      <td>0.478047</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.285402</td>\n","      <td>1.285402</td>\n","      <td>1.70044</td>\n","    </tr>\n","    <tr>\n","      <th>install</th>\n","      <td>1.700440</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.700440</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>0.000000</td>\n","      <td>1.700440</td>\n","      <td>0.00000</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["               pandas        is        an  ...       via     conda  install\n","pandas       0.000000  1.478047  1.285402  ...  1.285402  0.285402  1.70044\n","is           1.478047  0.000000  1.478047  ...  1.478047  0.478047  0.00000\n","an           1.285402  1.478047  0.000000  ...  0.000000  0.000000  0.00000\n","open         0.000000  1.478047  2.285402  ...  0.000000  0.000000  0.00000\n","source       0.000000  0.000000  2.285402  ...  0.000000  0.000000  0.00000\n","programming  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.00000\n","tools        0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.00000\n",".            0.000000  0.000000  0.000000  ...  1.285402  1.285402  1.70044\n","the          0.000000  0.000000  0.000000  ...  
0.000000  0.000000  0.00000\n","best         0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.00000\n","way          0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.00000\n","to           1.285402  0.000000  0.000000  ...  0.000000  0.000000  0.00000\n","get          1.285402  1.478047  0.000000  ...  0.000000  0.000000  0.00000\n","via          1.285402  1.478047  0.000000  ...  0.000000  1.285402  0.00000\n","conda        0.285402  0.478047  0.000000  ...  1.285402  1.285402  1.70044\n","install      1.700440  0.000000  0.000000  ...  0.000000  1.700440  0.00000\n","\n","[16 rows x 16 columns]"]},"metadata":{"tags":[]},"execution_count":13}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":599},"id":"8tUoxAYZRoYP","executionInfo":{"status":"ok","timestamp":1618203936692,"user_tz":-540,"elapsed":644,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"e559abab-52b8-47cc-908a-53c1039e55e6"},"source":["#np.set_printoptions(precision=3) # NumPy equivalent: show 3 digits after the decimal point (display only; the stored data is unchanged)\n","pd.options.display.precision = 3 # show 3 digits after the decimal point (display only; the stored data is unchanged)\n","print('\\n# PPMI with precision=3')\n","df2 = pd.DataFrame(M, index=word_to_id.keys(), columns=word_to_id.keys())\n","df2"],"execution_count":14,"outputs":[{"output_type":"stream","text":["\n","# PPMI with precision=3\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>pandas</th>\n","      <th>is</th>\n","      <th>an</th>\n","      <th>open</th>\n","      <th>source</th>\n","      <th>programming</th>\n","      
<th>tools</th>\n","      <th>.</th>\n","      <th>the</th>\n","      <th>best</th>\n","      <th>way</th>\n","      <th>to</th>\n","      <th>get</th>\n","      <th>via</th>\n","      <th>conda</th>\n","      <th>install</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>pandas</th>\n","      <td>0.000</td>\n","      <td>1.478</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>0.285</td>\n","      <td>1.7</td>\n","    </tr>\n","    <tr>\n","      <th>is</th>\n","      <td>1.478</td>\n","      <td>0.000</td>\n","      <td>1.478</td>\n","      <td>1.478</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.478</td>\n","      <td>1.478</td>\n","      <td>0.478</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>an</th>\n","      <td>1.285</td>\n","      <td>1.478</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>open</th>\n","      <td>0.000</td>\n","      <td>1.478</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      
<td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>source</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>programming</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>tools</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>.</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>1.7</td>\n","    </tr>\n","    <tr>\n","      <th>the</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","     
 <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>best</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>way</th>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>to</th>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>get</th>\n","      <td>1.285</td>\n","      <td>1.478</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>2.285</td>\n","      
<td>2.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>via</th>\n","      <td>1.285</td>\n","      <td>1.478</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>0.0</td>\n","    </tr>\n","    <tr>\n","      <th>conda</th>\n","      <td>0.285</td>\n","      <td>0.478</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.285</td>\n","      <td>1.285</td>\n","      <td>1.7</td>\n","    </tr>\n","    <tr>\n","      <th>install</th>\n","      <td>1.700</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.700</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>0.000</td>\n","      <td>1.700</td>\n","      <td>0.0</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["             pandas     is     an   open  ...    get    via  conda  install\n","pandas        0.000  1.478  1.285  0.000  ...  1.285  1.285  0.285      1.7\n","is            1.478  0.000  1.478  1.478  ...  1.478  1.478  0.478      0.0\n","an            1.285  1.478  0.000  2.285  ...  0.000  0.000  0.000      0.0\n","open          0.000  1.478  2.285  0.000  ...  0.000  0.000  0.000      0.0\n","source        0.000  0.000  2.285  2.285  ...  
0.000  0.000  0.000      0.0\n","programming   0.000  0.000  0.000  2.285  ...  0.000  0.000  0.000      0.0\n","tools         0.000  0.000  0.000  0.000  ...  0.000  0.000  0.000      0.0\n",".             0.000  0.000  0.000  0.000  ...  0.000  1.285  1.285      1.7\n","the           0.000  0.000  0.000  0.000  ...  0.000  0.000  0.000      0.0\n","best          0.000  0.000  0.000  0.000  ...  0.000  0.000  0.000      0.0\n","way           0.000  0.000  0.000  0.000  ...  2.285  0.000  0.000      0.0\n","to            1.285  0.000  0.000  0.000  ...  2.285  0.000  0.000      0.0\n","get           1.285  1.478  0.000  0.000  ...  0.000  0.000  0.000      0.0\n","via           1.285  1.478  0.000  0.000  ...  0.000  0.000  1.285      0.0\n","conda         0.285  0.478  0.000  0.000  ...  0.000  1.285  1.285      1.7\n","install       1.700  0.000  0.000  0.000  ...  0.000  0.000  1.700      0.0\n","\n","[16 rows x 16 columns]"]},"metadata":{"tags":[]},"execution_count":14}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3OuF6wnMRoYP","executionInfo":{"status":"ok","timestamp":1618203942208,"user_tz":-540,"elapsed":615,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"453b33df-7c06-435a-e84a-9d71a2df5d12"},"source":["print('\\n# most_similar() with PPMI')\n","most_similar(user_query, word_to_id, id_to_word, M)\n"],"execution_count":15,"outputs":[{"output_type":"stream","text":["\n","# most_similar() with PPMI\n","[query] pandas\n"," conda: 0.5733166933059692\n"," is: 0.5094797611236572\n"," .: 0.40005457401275635\n"," get: 0.39511924982070923\n"," way: 0.3747256100177765\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"1qf7f7ZbRoYQ"},"source":["## Dimensionality reduction with SVD\n","- np.linalg.svd(): 
uses NumPy's linear algebra module (np.linalg).\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ZhHJvY9pRoYQ","executionInfo":{"status":"ok","timestamp":1618203952284,"user_tz":-540,"elapsed":601,"user":{"displayName":"TOMA Naruaki","photoUrl":"","userId":"11747312442870110137"}},"outputId":"7ec3cf9e-d1aa-49f6-9666-af95f57edddc"},"source":["# SVD: M = U @ diag(S) @ V (np.linalg.svd returns the right factor already transposed)\n","U, S, V = np.linalg.svd(M)\n","print('\\n# SVD: dense vectors with all singular values')\n","print(U)\n","\n","# Keep only the first use_s_values columns of U (those paired with the largest singular values).\n","use_s_values = 2\n","U2 = U[:,0:use_s_values]\n","print('\\n# SVD: dense vectors with singular values = {}'.format(use_s_values))\n","print(U2)\n","\n","print('\\n# most_similar() with SVD-2')\n","most_similar(user_query, word_to_id, id_to_word, U2)\n"],"execution_count":16,"outputs":[{"output_type":"stream","text":["\n","# SVD: dense vectors with all singular values\n","[[-0.20909968 -0.05742509 -0.4260506  -0.17504837  0.2712971   0.04672289\n","   0.18823615 -0.18591174  0.36170715 -0.02090307  0.33414906  0.12803775\n","   0.20772134 -0.17504339  0.12016748 -0.49748433]\n"," [-0.20854177  0.06963065 -0.3888203   0.1402168   0.20635465  0.15388878\n","  -0.2102995  -0.13104966 -0.38273078  0.39125082  0.11626503 -0.32180378\n","  -0.35602686 -0.3037927  -0.09017423  0.09666564]\n"," [-0.23961712  0.26655334 -0.18719402  0.05390574 -0.35268205  0.2592225\n","  -0.32773367 -0.09349788  0.2730225   0.0377553  -0.40234843 -0.31079715\n","   0.17229767  0.28257638 -0.12586372 -0.25696287]\n"," [-0.2782892   0.35838643 -0.04367146 -0.32175225  0.05215078  0.23192607\n","   0.21995124  0.49517277  0.11888672 -0.1294459   0.23939379 -0.12363093\n","  -0.32118776  0.27299288  0.13509066  0.19793455]\n"," [-0.3138919   0.39330027  0.14138082  0.21887659  0.3510343   0.1303136\n","   0.33645037 -0.16895744 -0.4147285  -0.14546476 -0.16639212  0.12915234\n","   0.36436176  0.11097264 -0.11952758 -0.00717432]\n"," [-0.29259852  0.32289356  0.2055024   0.29696342 -0.23372465 -0.05240921\n","  -0.22590032 
-0.37780765  0.24121043 -0.03001606  0.29355177  0.3378207\n","  -0.1754489  -0.11105473  0.2770583   0.22205435]\n"," [-0.29597434  0.18007208  0.3224701  -0.47838306 -0.15175404 -0.19339524\n","  -0.20657967  0.22351523 -0.11675979  0.25128582  0.00841147  0.15087281\n","   0.17578007 -0.40954658 -0.2627039  -0.1637667 ]\n"," [-0.25471854 -0.01025081  0.02076114 -0.05029602  0.36506298 -0.47563514\n","   0.0229755   0.00612366  0.19963232  0.08954368 -0.58078074 -0.02569766\n","  -0.26210678 -0.01673562  0.34222415 -0.00723523]\n"," [-0.29356492 -0.19401811  0.31909984  0.49437767 -0.02606995 -0.1744973\n","   0.20824735  0.21605654  0.2066092  -0.09872612  0.19246763 -0.389128\n","  -0.11316317 -0.11955503 -0.29818308 -0.22771738]\n"," [-0.28762156 -0.33267495  0.200779   -0.2494268  -0.28890833 -0.02771297\n","   0.27570227 -0.329145   -0.23007524  0.28904447  0.13154629 -0.28043836\n","   0.10255558  0.28061515  0.3373071   0.00459014]\n"," [-0.30796608 -0.3980363   0.13565156 -0.2586614   0.28628355  0.16009486\n","  -0.41644257 -0.21415797 -0.02268608 -0.5027486   0.02590389 -0.01947945\n","  -0.08477696  0.10067189 -0.20825593  0.13973713]\n"," [-0.26908046 -0.35622442 -0.04449803  0.3045457   0.09906036  0.229743\n","  -0.24744193  0.4765816  -0.00974799  0.2936326  -0.00161449  0.30523014\n","   0.30354518  0.11671898  0.24741183  0.0925235 ]\n"," [-0.23532367 -0.25857002 -0.19031367 -0.01856818 -0.34355265  0.2738812\n","   0.43220395 -0.04992938  0.06145945 -0.03610538 -0.34404024  0.3561516\n","  -0.258372   -0.24605696 -0.2454495   0.14502974]\n"," [-0.132769    0.00144362 -0.31328392  0.02380027 -0.25689903 -0.2137918\n","  -0.00529088  0.15167178 -0.13748148 -0.45615938 -0.02669457 -0.27819303\n","   0.35791093 -0.38731208  0.29279673  0.28283092]\n"," [-0.12850079 -0.00502957 -0.3181515  -0.0227449   0.01897933 -0.46368992\n","   0.04571127 -0.06436063  0.18308993  0.2039377   0.152593    0.05987117\n","   0.20431407  0.32411188 -0.46140748  
0.44548053]\n"," [-0.12502334 -0.01886328 -0.2561411   0.08992951 -0.24950773 -0.35275292\n","  -0.11904678  0.12406847 -0.44487035 -0.22767347  0.08375221  0.287249\n","  -0.2736007   0.31247255 -0.00371157 -0.42694545]]\n","\n","# SVD: dense vectors with singular values = 2\n","[[-0.20909968 -0.05742509]\n"," [-0.20854177  0.06963065]\n"," [-0.23961712  0.26655334]\n"," [-0.2782892   0.35838643]\n"," [-0.3138919   0.39330027]\n"," [-0.29259852  0.32289356]\n"," [-0.29597434  0.18007208]\n"," [-0.25471854 -0.01025081]\n"," [-0.29356492 -0.19401811]\n"," [-0.28762156 -0.33267495]\n"," [-0.30796608 -0.3980363 ]\n"," [-0.26908046 -0.35622442]\n"," [-0.23532367 -0.25857002]\n"," [-0.132769    0.00144362]\n"," [-0.12850079 -0.00502957]\n"," [-0.12502334 -0.01886328]]\n","\n","# most_similar() with SVD-2\n","[query] pandas\n"," install: 0.9930136799812317\n"," .: 0.9741653800010681\n"," conda: 0.9739159345626831\n"," via: 0.9613599181175232\n"," the: 0.950492262840271\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"2nsKV0lhRoYQ"},"source":[""],"execution_count":null,"outputs":[]}]}