{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.1"
    },
    "colab": {
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "source": [
        "!python --version"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8DTOxTv0X05P",
        "outputId": "1def9a66-262d-4654-b7cb-2f7876b7c3d4"
      },
      "execution_count": 20,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Python 3.10.12\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X8CRRkxzOkh5"
      },
      "source": [
        "# カテゴリデータに対する前処理コード例\n",
        "- ref.\n",
        "    - preprocess methods\n",
        "        - [機械学習のための特徴量エンジニアリング](https://www.oreilly.co.jp/books/9784873118680/)\n",
        "        - [5.3. Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html#normalization)\n",
        "        - [Categorical Data, UNDERSTANDING FEATURE ENGINEERING (PART 2)](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)\n",
        "        - [Category Encoders](https://contrib.scikit-learn.org/category_encoders/index.html)\n",
        "    - data: [YouTuberデータセット公開してみた](https://qiita.com/myaun/items/7e0dd7f3f9d9d2fef497)\n",
        "      - 2024年5月現在、Google Colab のPython 3.10.12 では quilt 周りで不具合あり。そこでquiltは使用せず、別途データセットをファイルとして用意して利用するように修正。以下は youtuber.xlsx を用意してから実行してください。\n",
        "- 全体の流れ\n",
        "    - データセットの準備\n",
        "    - 手法1：one-hotエンコーディング\n",
        "    - 手法2：特徴量ハッシング\n",
        "    - 手法3：BaseNエンコーディング\n",
        "    - 手法4：エビデンス重みエンコーディング\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "H5XIUQWLOkiC"
      },
      "source": [
        "## データセットの準備"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aq7Xzm6aPYjc"
      },
      "source": [
        "#!pip install quilt\n",
        "#!quilt install haradai1262/YouTuber"
      ],
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 957
        },
        "id": "PDK2k7qeOkiD",
        "outputId": "09a198a5-14b8-461e-879b-49d48c9f739a"
      },
      "source": [
        "#from quilt.data.haradai1262 import YouTuber\n",
        "import pandas as pd\n",
        "\n",
        "#df = YouTuber.channel_videos.UUUM_videos()\n",
        "df = pd.read_excel(\"youtuber.xlsx\")\n",
        "df.head()"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "            id                                          title  \\\n",
              "0  R7V5d94XkGQ  【大食い】超高級寿司店で３人で食べ放題したらいくらかかるの!?【大トロ1カン2,000円】   \n",
              "1  2R9_bkcWNd4       【女王集結】女性YouTuberたちと飲みながら本音トークしてみたら爆笑www   \n",
              "2  EU8S-zxS9PI                【悪質】偽物ヒカキン許さねぇ…注意してください！【なりすまし】   \n",
              "3  5wnfkIfw0jE                ツイッターのヒカキンシンメトリーBotが面白すぎて爆笑www   \n",
              "4  -6duBsde_XM    【放送事故】酒飲みながら東海オンエア×ヒカキンで質問コーナーやったらヤバかったwww   \n",
              "\n",
              "                                         description liveBroadcastContent  \\\n",
              "0  提供：ポコロンダンジョンズ\\n\\n\\n\\niOS：https://bit.ly/2sGgOR...                 none   \n",
              "1  しばなんチャンネルの動画\\n\\n\\n\\nhttps://www.youtube.com/wa...                 none   \n",
              "2  ◆チャンネル登録はこちら↓\\n\\n\\n\\nhttp://www.youtube.com/us...                 none   \n",
              "3  ◆チャンネル登録はこちら↓\\n\\n\\n\\nhttp://www.youtube.com/us...                 none   \n",
              "4  提供：モンスターストライク\\n\\n\\n\\n▼キャンペーンサイトはこちら\\n\\n\\n\\nhtt...                 none   \n",
              "\n",
              "                                                tags  \\\n",
              "0  ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...   \n",
              "1  ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...   \n",
              "2  ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...   \n",
              "3  ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...   \n",
              "4  ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...   \n",
              "\n",
              "                publishedAt                                      thumbnails  \\\n",
              "0  2018-06-30T04:00:01.000Z  https://i.ytimg.com/vi/R7V5d94XkGQ/default.jpg   \n",
              "1  2018-06-29T08:00:01.000Z  https://i.ytimg.com/vi/2R9_bkcWNd4/default.jpg   \n",
              "2  2018-06-27T08:38:55.000Z  https://i.ytimg.com/vi/EU8S-zxS9PI/default.jpg   \n",
              "3  2018-06-25T07:46:07.000Z  https://i.ytimg.com/vi/5wnfkIfw0jE/default.jpg   \n",
              "4  2018-06-21T08:00:00.000Z  https://i.ytimg.com/vi/-6duBsde_XM/default.jpg   \n",
              "\n",
              "   viewCount  likeCount  favoriteCount  ...  commentCount  caption  \\\n",
              "0  2244205.0    27703.0              0  ...        8647.0    False   \n",
              "1  1869268.0    30889.0              0  ...        8859.0    False   \n",
              "2  1724625.0    33038.0              0  ...       11504.0    False   \n",
              "3  1109029.0    25986.0              0  ...        6852.0    False   \n",
              "4  1759797.0    33923.0              0  ...        4517.0    False   \n",
              "\n",
              "   definition dimension  duration   projection TopicIds  \\\n",
              "0          hd        2d  PT21M16S  rectangular      NaN   \n",
              "1          hd        2d  PT18M38S  rectangular      NaN   \n",
              "2          hd        2d   PT6M12S  rectangular      NaN   \n",
              "3          hd        2d   PT6M31S  rectangular      NaN   \n",
              "4          hd        2d   PT27M7S  rectangular      NaN   \n",
              "\n",
              "                                    relevantTopicIds idx  \\\n",
              "0  ['/m/02wbm', '/m/019_rr', '/m/019_rr', '/m/02w...   1   \n",
              "1               ['/m/04rlf', '/m/02jjt', '/m/02jjt']   2   \n",
              "2               ['/m/04rlf', '/m/02jjt', '/m/02jjt']   3   \n",
              "3               ['/m/04rlf', '/m/02jjt', '/m/02jjt']   4   \n",
              "4  ['/m/098wr', '/m/019_rr', '/m/02wbm', '/m/019_...   5   \n",
              "\n",
              "                         cid  \n",
              "0  UCZf__ehlCEBPop___sldpBUQ  \n",
              "1  UCZf__ehlCEBPop___sldpBUQ  \n",
              "2  UCZf__ehlCEBPop___sldpBUQ  \n",
              "3  UCZf__ehlCEBPop___sldpBUQ  \n",
              "4  UCZf__ehlCEBPop___sldpBUQ  \n",
              "\n",
              "[5 rows x 21 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-39c6d332-1011-4b6f-a13b-8335b7aae61c\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>title</th>\n",
              "      <th>description</th>\n",
              "      <th>liveBroadcastContent</th>\n",
              "      <th>tags</th>\n",
              "      <th>publishedAt</th>\n",
              "      <th>thumbnails</th>\n",
              "      <th>viewCount</th>\n",
              "      <th>likeCount</th>\n",
              "      <th>favoriteCount</th>\n",
              "      <th>...</th>\n",
              "      <th>commentCount</th>\n",
              "      <th>caption</th>\n",
              "      <th>definition</th>\n",
              "      <th>dimension</th>\n",
              "      <th>duration</th>\n",
              "      <th>projection</th>\n",
              "      <th>TopicIds</th>\n",
              "      <th>relevantTopicIds</th>\n",
              "      <th>idx</th>\n",
              "      <th>cid</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>R7V5d94XkGQ</td>\n",
              "      <td>【大食い】超高級寿司店で３人で食べ放題したらいくらかかるの!?【大トロ1カン2,000円】</td>\n",
              "      <td>提供：ポコロンダンジョンズ\\n\\n\\n\\niOS：https://bit.ly/2sGgOR...</td>\n",
              "      <td>none</td>\n",
              "      <td>['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...</td>\n",
              "      <td>2018-06-30T04:00:01.000Z</td>\n",
              "      <td>https://i.ytimg.com/vi/R7V5d94XkGQ/default.jpg</td>\n",
              "      <td>2244205.0</td>\n",
              "      <td>27703.0</td>\n",
              "      <td>0</td>\n",
              "      <td>...</td>\n",
              "      <td>8647.0</td>\n",
              "      <td>False</td>\n",
              "      <td>hd</td>\n",
              "      <td>2d</td>\n",
              "      <td>PT21M16S</td>\n",
              "      <td>rectangular</td>\n",
              "      <td>NaN</td>\n",
              "      <td>['/m/02wbm', '/m/019_rr', '/m/019_rr', '/m/02w...</td>\n",
              "      <td>1</td>\n",
              "      <td>UCZf__ehlCEBPop___sldpBUQ</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2R9_bkcWNd4</td>\n",
              "      <td>【女王集結】女性YouTuberたちと飲みながら本音トークしてみたら爆笑www</td>\n",
              "      <td>しばなんチャンネルの動画\\n\\n\\n\\nhttps://www.youtube.com/wa...</td>\n",
              "      <td>none</td>\n",
              "      <td>['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...</td>\n",
              "      <td>2018-06-29T08:00:01.000Z</td>\n",
              "      <td>https://i.ytimg.com/vi/2R9_bkcWNd4/default.jpg</td>\n",
              "      <td>1869268.0</td>\n",
              "      <td>30889.0</td>\n",
              "      <td>0</td>\n",
              "      <td>...</td>\n",
              "      <td>8859.0</td>\n",
              "      <td>False</td>\n",
              "      <td>hd</td>\n",
              "      <td>2d</td>\n",
              "      <td>PT18M38S</td>\n",
              "      <td>rectangular</td>\n",
              "      <td>NaN</td>\n",
              "      <td>['/m/04rlf', '/m/02jjt', '/m/02jjt']</td>\n",
              "      <td>2</td>\n",
              "      <td>UCZf__ehlCEBPop___sldpBUQ</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>EU8S-zxS9PI</td>\n",
              "      <td>【悪質】偽物ヒカキン許さねぇ…注意してください！【なりすまし】</td>\n",
              "      <td>◆チャンネル登録はこちら↓\\n\\n\\n\\nhttp://www.youtube.com/us...</td>\n",
              "      <td>none</td>\n",
              "      <td>['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...</td>\n",
              "      <td>2018-06-27T08:38:55.000Z</td>\n",
              "      <td>https://i.ytimg.com/vi/EU8S-zxS9PI/default.jpg</td>\n",
              "      <td>1724625.0</td>\n",
              "      <td>33038.0</td>\n",
              "      <td>0</td>\n",
              "      <td>...</td>\n",
              "      <td>11504.0</td>\n",
              "      <td>False</td>\n",
              "      <td>hd</td>\n",
              "      <td>2d</td>\n",
              "      <td>PT6M12S</td>\n",
              "      <td>rectangular</td>\n",
              "      <td>NaN</td>\n",
              "      <td>['/m/04rlf', '/m/02jjt', '/m/02jjt']</td>\n",
              "      <td>3</td>\n",
              "      <td>UCZf__ehlCEBPop___sldpBUQ</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>5wnfkIfw0jE</td>\n",
              "      <td>ツイッターのヒカキンシンメトリーBotが面白すぎて爆笑www</td>\n",
              "      <td>◆チャンネル登録はこちら↓\\n\\n\\n\\nhttp://www.youtube.com/us...</td>\n",
              "      <td>none</td>\n",
              "      <td>['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...</td>\n",
              "      <td>2018-06-25T07:46:07.000Z</td>\n",
              "      <td>https://i.ytimg.com/vi/5wnfkIfw0jE/default.jpg</td>\n",
              "      <td>1109029.0</td>\n",
              "      <td>25986.0</td>\n",
              "      <td>0</td>\n",
              "      <td>...</td>\n",
              "      <td>6852.0</td>\n",
              "      <td>False</td>\n",
              "      <td>hd</td>\n",
              "      <td>2d</td>\n",
              "      <td>PT6M31S</td>\n",
              "      <td>rectangular</td>\n",
              "      <td>NaN</td>\n",
              "      <td>['/m/04rlf', '/m/02jjt', '/m/02jjt']</td>\n",
              "      <td>4</td>\n",
              "      <td>UCZf__ehlCEBPop___sldpBUQ</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>-6duBsde_XM</td>\n",
              "      <td>【放送事故】酒飲みながら東海オンエア×ヒカキンで質問コーナーやったらヤバかったwww</td>\n",
              "      <td>提供：モンスターストライク\\n\\n\\n\\n▼キャンペーンサイトはこちら\\n\\n\\n\\nhtt...</td>\n",
              "      <td>none</td>\n",
              "      <td>['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...</td>\n",
              "      <td>2018-06-21T08:00:00.000Z</td>\n",
              "      <td>https://i.ytimg.com/vi/-6duBsde_XM/default.jpg</td>\n",
              "      <td>1759797.0</td>\n",
              "      <td>33923.0</td>\n",
              "      <td>0</td>\n",
              "      <td>...</td>\n",
              "      <td>4517.0</td>\n",
              "      <td>False</td>\n",
              "      <td>hd</td>\n",
              "      <td>2d</td>\n",
              "      <td>PT27M7S</td>\n",
              "      <td>rectangular</td>\n",
              "      <td>NaN</td>\n",
              "      <td>['/m/098wr', '/m/019_rr', '/m/02wbm', '/m/019_...</td>\n",
              "      <td>5</td>\n",
              "      <td>UCZf__ehlCEBPop___sldpBUQ</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>5 rows × 21 columns</p>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-39c6d332-1011-4b6f-a13b-8335b7aae61c')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-39c6d332-1011-4b6f-a13b-8335b7aae61c button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-39c6d332-1011-4b6f-a13b-8335b7aae61c');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-a2574381-5fd2-481f-8c25-d56551ec347e\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-a2574381-5fd2-481f-8c25-d56551ec347e')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-a2574381-5fd2-481f-8c25-d56551ec347e button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "df"
            }
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HEhdkmh5OkiE"
      },
      "source": [
        "## 手法1：one-hotエンコーディング(one-hot encoding)\n",
        "- [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)\n",
        "- [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OEN75L4cOkiF",
        "outputId": "c572b1ed-9618-4187-db4b-88a14136fc76"
      },
      "source": [
        "df['cid'].value_counts().head()"
      ],
      "execution_count": 23,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "cid\n",
              "UCZf__ehlCEBPop___sldpBUQ    501\n",
              "UC__AsSnEuyVgO9TWvZE_ziA     501\n",
              "UCmol_xpWkIbQU0ZCuinqpQA     501\n",
              "UCXqocGp__RQ_sTw8EpPDg10A    501\n",
              "UCiJvvLq45i4sC4__dTiV88SQ    501\n",
              "Name: count, dtype: int64"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "scrolled": true,
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "R9dR6atuOkiF",
        "outputId": "f7822546-93c8-4036-fb20-972168e6ff89"
      },
      "source": [
        "# one-hot encoding by pandas\n",
        "\n",
        "one_hot_df = pd.get_dummies(df['cid'], dtype=int)\n",
        "\n",
        "# check the one-hot vector\n",
        "print(one_hot_df.values.shape)\n",
        "print(df['cid'][0])\n",
        "print(one_hot_df.values[0])\n",
        "index = pd.Index(one_hot_df.values[0]).get_loc(1)\n",
        "print('index = ', index)\n",
        "print('cid = ', one_hot_df.columns[index])"
      ],
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(66289, 151)\n",
            "UCZf__ehlCEBPop___sldpBUQ\n",
            "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
            " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
            " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
            " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
            " 0 0 0]\n",
            "index =  88\n",
            "cid =  UCZf__ehlCEBPop___sldpBUQ\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EL4bhNO-OkiG",
        "outputId": "59db2a88-2583-4dcf-81fb-7093d5ecbe0d"
      },
      "source": [
        "# one-hot encoding by sklearn\n",
        "\n",
        "from sklearn import preprocessing\n",
        "encoder = preprocessing.OneHotEncoder()\n",
        "category = df['cid'].values.reshape(-1, 1)\n",
        "encoder.fit(category)\n",
        "one_hot_encoding = encoder.transform(category)\n",
        "\n",
        "# check the one-hot vector\n",
        "print(one_hot_encoding[0:10])\n",
        "print(type(one_hot_encoding))"
      ],
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "  (0, 88)\t1.0\n",
            "  (1, 88)\t1.0\n",
            "  (2, 88)\t1.0\n",
            "  (3, 88)\t1.0\n",
            "  (4, 88)\t1.0\n",
            "  (5, 88)\t1.0\n",
            "  (6, 88)\t1.0\n",
            "  (7, 88)\t1.0\n",
            "  (8, 88)\t1.0\n",
            "  (9, 88)\t1.0\n",
            "<class 'scipy.sparse._csr.csr_matrix'>\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Mth4RqogOkiG"
      },
      "source": [
        "## 手法2：特徴量ハッシング(Feature hashing)\n",
        "- sklearn : [5.2.2 Feature hashing](https://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing)\n",
        "- wikipedia: [Feature hashing](https://en.wikipedia.org/wiki/Feature_hashing)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gBcLvwTFOkiG",
        "outputId": "f6e73ede-a8f5-400b-fcd4-f8b2b126ee6e"
      },
      "source": [
        "from sklearn.feature_extraction import FeatureHasher\n",
        "\n",
        "category = df['cid'].values.reshape(-1, 1)\n",
        "\n",
        "# if want, you can set the size of hash table (=n_features on FeatureHasher)\n",
        "num_of_features = 5\n",
        "hasher = FeatureHasher(n_features=num_of_features, input_type='string')\n",
        "hashed_array = hasher.transform(category)\n",
        "\n",
        "# check the result\n",
        "print(hashed_array.shape)\n",
        "print(df['cid'][0])\n",
        "print(hashed_array.toarray()[0])\n",
        "\n",
        "for i in range(0, len(df), 5000):\n",
        "    print(df['cid'][i], hashed_array.toarray()[i])"
      ],
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(66289, 5)\n",
            "UCZf__ehlCEBPop___sldpBUQ\n",
            "[0. 1. 0. 0. 0.]\n",
            "UCZf__ehlCEBPop___sldpBUQ [0. 1. 0. 0. 0.]\n",
            "UC6wKgAlOeFNqmXV167KERhQ [0. 0. 0. 0. 1.]\n",
            "UC4lZ8vGPy8bwmKILb__YlhzQ [0. 0. 0. 0. 1.]\n",
            "UCKtKKtjaaPKA1Oj8Ldnfsdg [1. 0. 0. 0. 0.]\n",
            "UCdtFmWwPlKiCOEND_95fwiA [ 0.  0.  0.  0. -1.]\n",
            "UC2RdeFmVA1PrDqmFqJMG7hA [0. 0. 0. 1. 0.]\n",
            "UCO06KZjWOe6b1tXrgzzakZA [0. 0. 0. 0. 1.]\n",
            "UCg_Wchs_AGoHrlayD_rhO0Q [ 0. -1.  0.  0.  0.]\n",
            "UC__8H678xX1SNBOM10_ReY6Q [1. 0. 0. 0. 0.]\n",
            "UC2rbyOa3Jo7vGSibqKcRjqw [0. 0. 0. 1. 0.]\n",
            "UCPJOCEIyI3gxXbTqKSsViqg [1. 0. 0. 0. 0.]\n",
            "UCrOnS768WQGgNzvM0wOGa1w [0. 0. 0. 1. 0.]\n",
            "UCjX7kJYLEAdsaCDnTsWK3Wg [0. 0. 1. 0. 0.]\n",
            "UCdb7Jw5rprurSCutjT9BW5A [ 0. -1.  0.  0.  0.]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eX6NIOydOkiH"
      },
      "source": [
        "## 手法3：BaseNエンコーディング(BaseN encoding)\n",
        "- [BaseN](https://contrib.scikit-learn.org/category_encoders/basen.html)\n",
        "- [BASEN ENCODING AND GRID SEARCH IN CATEGORY_ENCODERS](http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vfG4U1TYQPJw",
        "outputId": "ae31ad64-564e-4b66-c982-bbe5a27bdde5"
      },
      "source": [
        "!pip install category_encoders"
      ],
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: category_encoders in /usr/local/lib/python3.10/dist-packages (2.6.3)\n",
            "Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.25.2)\n",
            "Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.2.2)\n",
            "Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.11.4)\n",
            "Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.14.2)\n",
            "Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (2.0.3)\n",
            "Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.6)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2023.4)\n",
            "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2024.1)\n",
            "Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0)\n",
            "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.4.2)\n",
            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.5.0)\n",
            "Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (24.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fKBQwUTGOkiH",
        "outputId": "f544860e-26b9-4507-f95e-629b8746c365"
      },
      "source": [
        "import category_encoders as ce\n",
        "\n",
        "encoder = ce.basen.BaseNEncoder(cols='cid', base=3)\n",
        "result = encoder.fit_transform(df)\n",
        "\n",
        "# check the result\n",
        "columns = result.columns.tolist()\n",
        "columns_name = [s for s in columns if \"cid\" in s]\n",
        "\n",
        "def get_cid_values(df, names, index):\n",
        "    temp = []\n",
        "    for name in names:\n",
        "        temp.append(df[name][index])\n",
        "    return temp\n",
        "\n",
        "for i in range(0, len(df), 5000):\n",
        "    temp = get_cid_values(result, columns_name, i)\n",
        "    print(df['cid'][i], temp)\n"
      ],
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "UCZf__ehlCEBPop___sldpBUQ [0, 0, 0, 0, 1]\n",
            "UC6wKgAlOeFNqmXV167KERhQ [0, 0, 1, 0, 1]\n",
            "UC4lZ8vGPy8bwmKILb__YlhzQ [0, 0, 2, 1, 0]\n",
            "UCKtKKtjaaPKA1Oj8Ldnfsdg [0, 1, 0, 1, 1]\n",
            "UCdtFmWwPlKiCOEND_95fwiA [0, 1, 1, 2, 0]\n",
            "UC2RdeFmVA1PrDqmFqJMG7hA [0, 1, 2, 2, 2]\n",
            "UCO06KZjWOe6b1tXrgzzakZA [0, 2, 1, 0, 0]\n",
            "UCg_Wchs_AGoHrlayD_rhO0Q [0, 2, 2, 0, 2]\n",
            "UC__8H678xX1SNBOM10_ReY6Q [1, 0, 0, 1, 1]\n",
            "UC2rbyOa3Jo7vGSibqKcRjqw [1, 0, 1, 2, 1]\n",
            "UCPJOCEIyI3gxXbTqKSsViqg [1, 1, 0, 0, 2]\n",
            "UCrOnS768WQGgNzvM0wOGa1w [1, 1, 1, 1, 2]\n",
            "UCjX7kJYLEAdsaCDnTsWK3Wg [1, 1, 2, 2, 0]\n",
            "UCdb7Jw5rprurSCutjT9BW5A [1, 2, 1, 1, 0]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NekZN7juOkiI"
      },
      "source": [
        "## 手法4：エビデンス重みエンコーディング(Weight of Evidence)\n",
        "- category_encoder: [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)\n",
        "- [WEIGHT OF EVIDENCE (WOE) AND INFORMATION VALUE EXPLAINED](https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html)\n",
        "- [Weight of evidence and Information Value using Python](https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kB28yKXFOkiI",
        "outputId": "853329cd-fa2f-4343-bedf-2992a10368e7"
      },
      "source": [
        "import category_encoders as ce\n",
        "\n",
        "encoder = ce.woe.WOEEncoder(cols='cid')\n",
        "\n",
        "# ready for evidence\n",
        "target = df['viewCount'] > 10000\n",
        "\n",
        "# calculate WOE\n",
        "result = encoder.fit_transform(df, y=target)\n",
        "\n",
        "# check the result\n",
        "for i in range(0, len(df), 5000):\n",
        "    print(df['cid'][i], '\\t', result['cid'][i])\n"
      ],
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "UCZf__ehlCEBPop___sldpBUQ \t 3.6095662610015764\n",
            "UC6wKgAlOeFNqmXV167KERhQ \t 0.4757909580572925\n",
            "UC4lZ8vGPy8bwmKILb__YlhzQ \t 1.9921283059003998\n",
            "UCKtKKtjaaPKA1Oj8Ldnfsdg \t 1.807796698178421\n",
            "UCdtFmWwPlKiCOEND_95fwiA \t -1.76268251990698\n",
            "UC2RdeFmVA1PrDqmFqJMG7hA \t -0.8357080911509357\n",
            "UCO06KZjWOe6b1tXrgzzakZA \t 0.42925802928263423\n",
            "UCg_Wchs_AGoHrlayD_rhO0Q \t 3.4570742314135954\n",
            "UC__8H678xX1SNBOM10_ReY6Q \t 2.2192798786121486\n",
            "UC2rbyOa3Jo7vGSibqKcRjqw \t 3.022177923131213\n",
            "UCPJOCEIyI3gxXbTqKSsViqg \t -0.2641732825478688\n",
            "UCrOnS768WQGgNzvM0wOGa1w \t -2.1435507408557353\n",
            "UCjX7kJYLEAdsaCDnTsWK3Wg \t -1.794934067712545\n",
            "UCdb7Jw5rprurSCutjT9BW5A \t -1.606225698959651\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mKr2PNPlOkiJ"
      },
      "source": [],
      "execution_count": 29,
      "outputs": []
    }
  ]
}