{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UedTzbMFg8S7"
   },
   "source": [
    "# シンプルなファインチューニング例\n",
    "- やりたいこと\n",
    "  - [20 newsgroups text dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset)を分類タスクとして学習したい。\n",
    "- 方針\n",
    "  - fastTextにより「Wikipedia(en)コーパスの一部」を用いて事前学習する。（異なるソースを元に言語モデルを構築する）\n",
    "    - なお、ここではfastTextで学習するためにどのようにデータを要ししたら良いのかを確認しやすくするためにWikipediaコーパスから事前学習を行っている。しかし自前でWikipedia事前学習するぐらいなら、最初から[FastText](https://fasttext.cc)で公開されている事前学習済みモデルをダウンロードして用いるほうが良い。\n",
    "  - fastText学習済みモデルを用いて、20 newsgroups textの記事をベクトル化する。\n",
    "  - 比較対象としてTF-IDFによるベクトル化も用意する。\n",
    "  - 分類学習にはナイーブベイズ、SVM、NNを用いる。なお、ナイーブベイズは基本的にはカウント情報を想定しているため、fastTextベクトルには適用できないことからTF-IDFのみに適用する。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 243,
     "status": "ok",
     "timestamp": 1622628591595,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "Gi4xg_nAJ5Xi",
    "outputId": "37f61bf2-13a0-4add-b76b-72472569f601"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022年 3月 9日 水曜日 15時52分25秒 JST\n"
     ]
    }
   ],
   "source": [
    "!date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z2fRtSAyxKp-"
   },
   "source": [
    "## 事前学習"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lCUiutDriMiR"
   },
   "source": [
    "### 環境構築\n",
    "- fastTextモデルのために[gensim](https://radimrehurek.com/gensim/)を利用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 8541,
     "status": "ok",
     "timestamp": 1622628600976,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "n1N_-9Zi2u97",
    "outputId": "7b80153b-fdc8-44e1-82d2-45324f704366"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting package metadata (current_repodata.json): done\n",
      "Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
      "Collecting package metadata (repodata.json): done\n",
      "Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
      "\n",
      "PackagesNotFoundError: The following packages are not available from current channels:\n",
      "\n",
      "  - gensim\n",
      "\n",
      "Current channels:\n",
      "\n",
      "  - https://conda.anaconda.org/conda-forge/osx-arm64\n",
      "  - https://conda.anaconda.org/conda-forge/noarch\n",
      "  - https://repo.anaconda.com/pkgs/main/osx-arm64\n",
      "  - https://repo.anaconda.com/pkgs/main/noarch\n",
      "  - https://repo.anaconda.com/pkgs/r/osx-arm64\n",
      "  - https://repo.anaconda.com/pkgs/r/noarch\n",
      "\n",
      "To search for alternate channels that may provide the conda package you're\n",
      "looking for, navigate to\n",
      "\n",
      "    https://anaconda.org\n",
      "\n",
      "and use the search bar at the top of the page.\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#!pip install --upgrade gensim\n",
    "!conda install -c conda-forge gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O6q4nSdxifdA"
   },
   "source": [
    "### データセットの準備\n",
    "- [英語版Wikipediaのダンプデータ](https://dumps.wikimedia.org/enwiki/latest/)をダウンロードし、これを事前学習用コーパスとして利用する。なお、全データを用いると圧縮状態で15GBを超えて待ち時間が長いため、ここでは小規模で提供されているものを指定している。\n",
    "- ダンプデータはbzcatで確認しているように、XML形式で書かれている。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 52062,
     "status": "ok",
     "timestamp": 1622628653035,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "vPcG1mAI2W6o",
    "outputId": "3b84e714-5e5b-4c4c-96b4-4de196b6da4f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  245M  100  245M    0     0   856k      0  0:04:53  0:04:53 --:--:--  860kk    0     0   755k      0  0:05:32  0:00:10  0:05:22  871k0     0   798k      0  0:05:14  0:00:18  0:04:56  870k 0     0   821k      0  0:05:05  0:00:31  0:04:34  848k:01:22  0:03:34  867k55  0:02:07  0:02:48  862k4:55  0:02:11  0:02:44  859k0   852k      0  0:04:54  0:02:22  0:02:32  860k52k      0  0:04:54  0:02:31  0:02:23  864k    0  0:04:54  0:02:49  0:02:05  859k  853k0:01:26  863k2  178M    0     0   854k      0  0:04:53  0:03:33  0:01:20  864k     0  0:04:53  0:04:21  0:00:32  864k 855k      0  0:04:53  0:04:23  0:00:30  860k   0  0:04:53  0:04:42  0:00:11  857k\n"
     ]
    }
   ],
   "source": [
    "file_url=\"https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2\"\n",
    "\n",
    "!curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 17,
     "status": "ok",
     "timestamp": 1622628653036,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "E0A8hag-DAup",
    "outputId": "bbbe3f2e-ac76-4bd7-9f9b-d39fa3ffa43b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.10/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd\" version=\"0.10\" xml:lang=\"en\">\n",
      "  <siteinfo>\n",
      "    <sitename>Wikipedia</sitename>\n",
      "    <dbname>enwiki</dbname>\n",
      "    <base>https://en.wikipedia.org/wiki/Main_Page</base>\n",
      "    <generator>MediaWiki 1.38.0-wmf.23</generator>\n",
      "    <case>first-letter</case>\n",
      "    <namespaces>\n",
      "      <namespace key=\"-2\" case=\"first-letter\">Media</namespace>\n",
      "      <namespace key=\"-1\" case=\"first-letter\">Special</namespace>\n",
      "\n",
      "bzcat: I/O or other error, bailing out.  Possible reason follows.\n",
      "bzcat: Broken pipe\n",
      "\tInput file = enwiki-latest-pages-articles1.xml-p1p41242.bz2, output file = (stdout)\n"
     ]
    }
   ],
   "source": [
    "!bzcat enwiki-latest-pages-articles1.xml-p1p41242.bz2 | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oKVSFGGji34A"
   },
   "source": [
    "### 環境構築2\n",
    "- [multiprocessing](https://docs.python.org/ja/3/library/multiprocessing.html)は、実行環境におけるCPU数（コア数）を確認するために利用。\n",
    "- [gensim.corpora.wikicorpus](https://radimrehurek.com/gensim/corpora/wikicorpus.html)は、Wikipediaのダンプデータから本文データのみを抽出するために利用。\n",
    "- [gensim.models.fasttext](https://radimrehurek.com/gensim/models/fasttext.html)は、FastTextモデル。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1066,
     "status": "ok",
     "timestamp": 1622628654097,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "NqRIiq8l3JtV",
    "outputId": "7db89479-f6e5-4a7a-8f40-e93436a70912"
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'gensim'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[0;32m/var/folders/nc/_3k6g05j2499x9n2cjtmhxl80000gn/T/ipykernel_33837/2467614308.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmultiprocessing\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mgensim\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcorpora\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwikicorpus\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mWikiCorpus\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mgensim\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodels\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfasttext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mFastText\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mFT_gensim\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'gensim'"
     ]
    }
   ],
   "source": [
    "import multiprocessing\n",
    "from gensim.corpora.wikicorpus import WikiCorpus\n",
    "from gensim.models.fasttext import FastText as FT_gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ToUOvs4UkAKJ"
   },
   "source": [
    "ダンプデータから本文抽出する様子。sentencesにlatestの全本文があり、文書数は15025件。1件目の文書に含まれる単語数は50単語。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 102856,
     "status": "ok",
     "timestamp": 1622628756942,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "kANgs9Lan0cY",
    "outputId": "84ef1db0-70ed-429d-b7d8-314d57e07687"
   },
   "outputs": [],
   "source": [
    "!date\n",
    "\n",
    "!curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2\n",
    "wikipedia_data = \"./enwiki-latest-pages-articles11.xml-p6899367p7054859.bz2\"\n",
    "\n",
    "# expand and extarct\n",
    "print(\"get texts from {}\".format(wikipedia_data))\n",
    "wiki = WikiCorpus(wikipedia_data, dictionary={})\n",
    "sentences = list(wiki.get_texts())\n",
    "\n",
    "# 出力確認\n",
    "print(len(sentences))\n",
    "print(len(sentences[0]))\n",
    "print(sentences[0][0:5])\n",
    "\n",
    "!date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uYpTtjuBv7QW"
   },
   "source": [
    "### fastTextによる事前学習\n",
    "- build_vocab() により、まずボキャブラリ（単語一覧）を作成する。\n",
    "- その後、コーパスとそれに対する基本情報、エポック数を指定して学習する。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 728809,
     "status": "ok",
     "timestamp": 1622629485746,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "iwBkobZwEcCb",
    "outputId": "ce982721-e07d-4c1f-a0b0-894eb0b24268"
   },
   "outputs": [],
   "source": [
    "!date\n",
    "\n",
    "# faxtText\n",
    "ft_model = FT_gensim(vector_size=200, window=10, min_count=10, workers=max(1, multiprocessing.cpu_count() - 1))\n",
    "\n",
    "# build the vocabulary\n",
    "print(\"building vocab...\")\n",
    "ft_model.build_vocab(sentences)\n",
    "\n",
    "# train the model\n",
    "print(\"training model...\")\n",
    "ft_model.train(\n",
    "    sentences,\n",
    "    epochs = ft_model.epochs,\n",
    "    total_examples = ft_model.corpus_count,\n",
    "    total_words = ft_model.corpus_total_words\n",
    ")\n",
    "\n",
    "print(\"training done.\")\n",
    "\n",
    "!date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XWzvQVy_waHJ"
   },
   "source": [
    "### 事前学習により得られたモデルの確認\n",
    "- 単語でも文章でもベクトル化できる。\n",
    "- \"hoge\" は元々の文書には存在しない（False）が、ベクトル化できている。（サブワードによる未知語対応）\n",
    "- ベクトル化できたため、類似単語も確認可能。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 13,
     "status": "ok",
     "timestamp": 1622629485746,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "q7RQqAcM6K8B",
    "outputId": "1ce4e773-0015-43bb-a333-79b982ee2cea"
   },
   "outputs": [],
   "source": [
    "# 動作確認\n",
    "print(ft_model.wv['artificial'].shape)\n",
    "print(ft_model.wv['artificial'][:5])\n",
    "print(ft_model.wv[\"more like funchuck,Gave this\"][:5])\n",
    "\n",
    "print(\"===========\")\n",
    "print(\"hoge\" in ft_model.wv.key_to_index)\n",
    "print(ft_model.wv[\"hoge\"][:5])\n",
    "\n",
    "print(\"===========\")\n",
    "print(ft_model.wv.most_similar(\"computer\"))\n",
    "print(ft_model.wv.most_similar(\"programming\"))\n",
    "print(ft_model.wv.most_similar(\"apple\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "75ltLfSbw4a7"
   },
   "source": [
    "## ファインチューニング\n",
    "fastTextによる事前学習を終えた。これを用いて本当にやりたい 20 news 分類学習に移る。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GtruEPCYxUlN"
   },
   "source": [
    "### データセットを用意\n",
    "- 20 newsのデータセットを用意。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 11557,
     "status": "ok",
     "timestamp": 1622629497299,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "fI_I4I8--HH-",
    "outputId": "dabfeed3-ac1e-4238-e50f-5e862667cf85"
   },
   "outputs": [],
   "source": [
    "# fine-tuneing stage.\n",
    "# デーセットの用意\n",
    "# こちらも時間かかるので、変換したデータセットを指定した場所に保存。\n",
    "# 既に保存済みデータセットの利用にも対応。\n",
    "\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "#categories = ['alt.atheism', 'sci.space']\n",
    "categories = ['comp.os.ms-windows.misc',  'comp.sys.mac.hardware',  'misc.forsale']\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)\n",
    "train_text = newsgroups_train.data\n",
    "train_label = newsgroups_train.target\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)\n",
    "test_text = newsgroups_test.data\n",
    "test_label = newsgroups_test.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xZNv8LuaxjOG"
   },
   "source": [
    "### 事前学習モデルによるベクトル化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 31056,
     "status": "ok",
     "timestamp": 1622629528344,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "gLnYJcprFi43",
    "outputId": "58e4d61f-b82e-4ebc-c81b-096ef1585c56"
   },
   "outputs": [],
   "source": [
    "!date\n",
    "# 事前学習したfastTextにより、文章をベクトルに変換\n",
    "def sentence2vector(sentences, model):\n",
    "    vectors = []\n",
    "    for sent in sentences:\n",
    "        vectors.append(model.wv[sent])\n",
    "    return vectors\n",
    "\n",
    "ft_train_vectors = sentence2vector(train_text, ft_model)\n",
    "ft_test_vectors = sentence2vector(test_text, ft_model)\n",
    "\n",
    "!date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "C5kikKNtxqJF"
   },
   "source": [
    "### 分類学習モデルによる学習（fastText版）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 16391,
     "status": "ok",
     "timestamp": 1622629544730,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "jDxHPF4zPXkM",
    "outputId": "8d531db2-b03b-4e5c-e824-49a472d980fd"
   },
   "outputs": [],
   "source": [
    "!date\n",
    "\n",
    "#from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import svm\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "\n",
    "#clf1 = MultinomialNB()\n",
    "clf2 = svm.SVC(gamma='scale')\n",
    "clf3 = MLPClassifier(max_iter=500)\n",
    "clfs = {\"SVM\":clf2, \"NN\":clf3}\n",
    "\n",
    "ft_scores = []\n",
    "for name, clf in clfs.items():\n",
    "  clf.fit(ft_train_vectors, train_label)\n",
    "  score = clf.score(ft_test_vectors, test_label)\n",
    "  ft_scores.append(score)\n",
    "  print(\"ft_score = {} by {}\".format(score,name))\n",
    "\n",
    "!date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gO3b-VVDxxdZ"
   },
   "source": [
    "### 分類学習（TF-IDF版）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 746,
     "status": "ok",
     "timestamp": 1622629545473,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "brDF3Er5GCoW",
    "outputId": "17e7a8cd-a2b5-47a9-e2ea-ed7a203b68cc"
   },
   "outputs": [],
   "source": [
    "# 比較対象の、事前学習なし実験。\n",
    "# BoW + TFIDFによるベクトル生成\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vectorizer = TfidfVectorizer()\n",
    "tfidf_train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n",
    "print(\"train_vectors.shape=\", tfidf_train_vectors.shape)\n",
    "print(\"len(train_label)=\",len(train_label))\n",
    "\n",
    "tfidf_test_vectors = vectorizer.transform(newsgroups_test.data)\n",
    "print(\"test_vectors.shape=\", tfidf_test_vectors.shape)\n",
    "print(\"len(test_label)=\",len(test_label))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 98500,
     "status": "ok",
     "timestamp": 1622629643962,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "AsIvZs7oe1PT",
    "outputId": "16a4a6c4-c4f1-4b74-e414-d168e3aa490b"
   },
   "outputs": [],
   "source": [
    "!date\n",
    "\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import svm\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "\n",
    "clf1 = MultinomialNB()\n",
    "clf2 = svm.SVC(gamma='scale')\n",
    "clf3 = MLPClassifier(max_iter=500)\n",
    "clfs = {\"NB\":clf1, \"SVM\":clf2, \"NN\":clf3}\n",
    "\n",
    "tfidf_scores = []\n",
    "for name, clf in clfs.items():\n",
    "  clf.fit(tfidf_train_vectors, train_label)\n",
    "  score = clf.score(tfidf_test_vectors, test_label)\n",
    "  tfidf_scores.append(score)\n",
    "  print(\"tfidf_scores = {} by {}\".format(score,name))\n",
    "\n",
    "!date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 465,
     "status": "ok",
     "timestamp": 1622629644422,
     "user": {
      "displayName": "TOMA Naruaki",
      "photoUrl": "",
      "userId": "11747312442870110137"
     },
     "user_tz": -540
    },
    "id": "rw9KrgsrHvnW",
    "outputId": "77725488-1339-4798-9ec8-0c1a76e5ea35"
   },
   "outputs": [],
   "source": [
    "!date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "j7q9l-kHGHpu"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyOLlAdQ2Ug0SFBNwGE/ZS0d",
   "collapsed_sections": [],
   "name": "fine-turning.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
