!date
!python --version
Thu May  8 03:48:57 AM UTC 2025
Python 3.11.12

Preprocessing for BoW and word2vec

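Word segmentation is slow (about 20 minutes in 當間's environment), so redoing it every time the experiment code is rerun wastes time. This notebook therefore performs only the preprocessing (segmentation and sentence-vector computation) and saves the results to files in pickle format. The machine-learning code can then resume an experiment in just the time it takes to load these files.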
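
The save-once, reload-later idea described above can be sketched as a small helper. This is only an illustration: load_or_build and build_fn are hypothetical names that do not appear in this notebook, and the notebook itself always recomputes and overwrites its output files.

```python
import os
import pickle

def load_or_build(path, build_fn):
    """Return the cached object at `path`, building and saving it if absent.

    `build_fn` is a hypothetical zero-argument callable that performs the
    slow preprocessing and returns a picklable object.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # fast path: reuse the saved result
    result = build_fn()            # slow path: run the preprocessing once
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With a helper like this, the slow step runs only when the pickle file is missing; deleting the file forces a rebuild.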

Running this notebook saves the preprocessed data to the following files.

  • Training data: preprocessed_train.pkl

  • Validation data: preprocessed_val.pkl

  • Test data: preprocessed_test.pkl

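Each pickle file stores a list in which every record is a dictionary. sent1 is the premise and sent2 is the hypothesis. bow_input is the segmented text for BoW (the two sentences joined with [SEP]). w2v_input is the distributed vector for word2vec (the mean word vectors of the two sentences, concatenated). label is the gold label (a string).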

[
  {
    "sent1": "男の子が水たまりで遊んでいる。",
    "sent2": "子供が遊んでいる。",
    "bow_input": "男の子 が 水たまり で 遊ん で いる 。 [SEP] 子供 が 遊ん で いる 。",
    "w2v_input": np.array([...]),  # shape: (600,)
    "label": "entailment"
  },
  ...
]
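
On the machine-learning side, reading one of these files back might look like the following. This is a minimal sketch, not code from this notebook: load_preprocessed is a hypothetical helper name, and it assumes every record has the keys shown above.

```python
import pickle

import numpy as np

def load_preprocessed(path):
    """Load one preprocessed split and unpack it into parallel containers."""
    with open(path, "rb") as f:
        records = pickle.load(f)  # list of dicts, one per sentence pair
    bow_texts = [r["bow_input"] for r in records]             # list of str
    w2v_matrix = np.stack([r["w2v_input"] for r in records])  # (n_records, dim)
    labels = [r["label"] for r in records]                    # list of str
    return bow_texts, w2v_matrix, labels
```

For example, bow_texts, X_w2v, y = load_preprocessed("preprocessed_train.pkl") gives the training split. Since unpickling can execute arbitrary code, only load pickle files that you created yourself.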

After the run, do not forget to download the generated files. If you forget, you will have to run this notebook again.

Prerequisites

  • Download the following files from JGLUE/datasets/jnli-v1.3/ on GitHub and upload them before running this notebook.

    • train-v1.3.json (training dataset)

    • valid-v1.3.json (validation dataset)

    • test-v1.3.json (test dataset)
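
A quick check that the uploads are in place can save a failed run. The sketch below is an illustration only: missing_files and REQUIRED_FILES are hypothetical names, and the file names are the ones the loading code in this notebook opens.

```python
from pathlib import Path

# File names taken from the loading code in this notebook.
REQUIRED_FILES = ["train-v1.3.json", "valid-v1.3.json", "test-v1.3.json"]

def missing_files(names=REQUIRED_FILES, directory="."):
    """Return the names in `names` that do not exist under `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]
```

Calling missing_files() in the first cell and stopping when the returned list is non-empty avoids discovering a missing upload only after GiNZA has finished loading.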

# Install GiNZA (built on spaCy) and its Japanese model
!pip install ginza ja_ginza
Collecting ginza
  Downloading ginza-5.2.0-py3-none-any.whl.metadata (448 bytes)
Collecting ja_ginza
  Downloading ja_ginza-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Requirement already satisfied: spacy<4.0.0,>=3.4.4 in /usr/local/lib/python3.11/dist-packages (from ginza) (3.8.5)
Successfully installed SudachiDict-core-20250129 SudachiPy-0.6.10 ginza-5.2.0 ja_ginza-5.2.0 plac-1.4.5
# ====================================
# ✅ Required libraries
# ====================================
import json
import pickle

import numpy as np
import spacy
from tqdm import tqdm

# ====================================
# ✅ Load GiNZA (this takes a while)
# ====================================
nlp = spacy.load("ja_ginza")

# ====================================
# ✅ Read the input files
# ====================================
def load_jsonl(filepath):
    """Read a JSON Lines file: one JSON object per line."""
    with open(filepath, encoding="utf-8") as f:
        # json.loads replaces eval(): it does not execute the input and it
        # handles JSON literals such as true, false, and null.
        return [json.loads(line) for line in f if line.strip()]

def load_and_prepare(file_path):
    """Return (premise, hypothesis, label) tuples for one dataset split."""
    data = load_jsonl(file_path)
    return [(ex["sentence1"], ex["sentence2"], ex["label"]) for ex in data]

train_data = load_and_prepare("train-v1.3.json")
val_data = load_and_prepare("valid-v1.3.json")
test_data = load_and_prepare("test-v1.3.json")

# ====================================
# ✅ Process the sentences with spaCy and save
# ====================================
def preprocess_and_save(data, output_file):
    """Segment and vectorize each sentence pair, then pickle the results."""
    results = []
    for sent1, sent2, label in tqdm(data):
        # Analyze both sentences with GiNZA
        doc1 = nlp(sent1)
        doc2 = nlp(sent2)

        # Segmented text for BoW (space-separated string)
        tokens1 = [t.text for t in doc1 if not t.is_space]
        tokens2 = [t.text for t in doc2 if not t.is_space]
        bow_joined = " ".join(tokens1 + ["[SEP]"] + tokens2)

        # Mean word vector per sentence for word2vec
        vecs1 = [t.vector for t in doc1 if t.has_vector and not t.is_space]
        vecs2 = [t.vector for t in doc2 if t.has_vector and not t.is_space]
        # Fall back to a zero vector when no token in the sentence has a vector
        if vecs1:
            vec1_avg = np.mean(vecs1, axis=0)
        else:
            vec1_avg = np.zeros(nlp.vocab.vectors_length)
        if vecs2:
            vec2_avg = np.mean(vecs2, axis=0)
        else:
            vec2_avg = np.zeros(nlp.vocab.vectors_length)
        vec_concat = np.concatenate([vec1_avg, vec2_avg])

        results.append({
            "sent1": sent1, # premise
            "sent2": sent2, # hypothesis
            "bow_input": bow_joined, # segmented text for BoW (joined with [SEP])
            "w2v_input": vec_concat, # word2vec input (the two mean vectors concatenated)
            "label": label
        })

    # Save as pickle
    with open(output_file, "wb") as f:
        pickle.dump(results, f)

# ====================================
# ✅ Run: save train / val / test separately
# ====================================
preprocess_and_save(train_data, "preprocessed_train.pkl")
preprocess_and_save(val_data, "preprocessed_val.pkl")
preprocess_and_save(test_data, "preprocessed_test.pkl")
100%|██████████| 20073/20073 [15:05<00:00, 22.16it/s]
100%|██████████| 2434/2434 [01:51<00:00, 21.83it/s]
100%|██████████| 2508/2508 [01:53<00:00, 22.12it/s]
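
Because bow_input is already segmented with spaces, downstream BoW code only needs to split on whitespace. The following is a minimal sketch, assuming scikit-learn is available (it is not used in this notebook); build_bow is a hypothetical helper name. A callable analyzer is passed so that CountVectorizer does not apply its default token pattern, which ignores single-character tokens such as が or で.

```python
from sklearn.feature_extraction.text import CountVectorizer

def build_bow(texts):
    """Fit a bag-of-words model on whitespace-segmented strings.

    Returns the fitted vectorizer and the sparse document-term count matrix.
    """
    # str.split as the analyzer: every whitespace-separated token, including
    # one-character particles and the [SEP] marker, becomes a feature.
    vectorizer = CountVectorizer(analyzer=str.split)
    counts = vectorizer.fit_transform(texts)
    return vectorizer, counts
```

In an experiment, the vectorizer would typically be fitted on the training split only, with vectorizer.transform applied to the validation and test splits so that all three share one vocabulary.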

After the run finishes, download the three pkl files