4. カテゴリデータに対する前処理コード例

4.1. データセットの準備

!pip install quilt
!quilt install haradai1262/YouTuber
Requirement already satisfied: quilt in /usr/local/lib/python3.7/dist-packages (2.9.15)
Requirement already satisfied: pyarrow>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (3.0.0)
Requirement already satisfied: future>=0.16.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (0.16.0)
Requirement already satisfied: packaging>=16.8 in /usr/local/lib/python3.7/dist-packages (from quilt) (20.9)
Requirement already satisfied: requests>=2.12.4 in /usr/local/lib/python3.7/dist-packages (from quilt) (2.23.0)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (1.19.5)
Requirement already satisfied: xlrd>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (1.1.0)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (1.15.0)
Requirement already satisfied: pandas>=0.21.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (1.1.5)
Requirement already satisfied: pyyaml>=3.12 in /usr/local/lib/python3.7/dist-packages (from quilt) (3.13)
Requirement already satisfied: tqdm>=4.11.2 in /usr/local/lib/python3.7/dist-packages (from quilt) (4.41.1)
Requirement already satisfied: appdirs>=1.4.0 in /usr/local/lib/python3.7/dist-packages (from quilt) (1.4.4)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=16.8->quilt) (2.4.7)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.12.4->quilt) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.12.4->quilt) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.12.4->quilt) (2020.12.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.12.4->quilt) (3.0.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.21.0->quilt) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.21.0->quilt) (2018.9)
Downloading package metadata...
haradai1262/YouTuber already installed.
Overwrite? (y/n) n
from quilt.data.haradai1262 import YouTuber
import pandas as pd

df = YouTuber.channel_videos.UUUM_videos()
df.head()
id title description liveBroadcastContent tags publishedAt thumbnails viewCount likeCount favoriteCount dislikeCount commentCount caption definition dimension duration projection TopicIds relevantTopicIds idx cid
0 R7V5d94XkGQ 【大食い】超高級寿司店で3人で食べ放題したらいくらかかるの!?【大トロ1カン2,000円】 提供:ポコロンダンジョンズ\r\r\r\r\niOS:https://bit.ly/2sGg... none ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... 2018-06-30T04:00:01.000Z https://i.ytimg.com/vi/R7V5d94XkGQ/default.jpg 2244205.0 27703.0 0 3667.0 8647.0 False hd 2d PT21M16S rectangular NaN ['/m/02wbm', '/m/019_rr', '/m/019_rr', '/m/02w... 1 UCZf__ehlCEBPop___sldpBUQ
1 2R9_bkcWNd4 【女王集結】女性YouTuberたちと飲みながら本音トークしてみたら爆笑www しばなんチャンネルの動画\r\r\r\r\nhttps://www.youtube.com/... none ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... 2018-06-29T08:00:01.000Z https://i.ytimg.com/vi/2R9_bkcWNd4/default.jpg 1869268.0 30889.0 0 3483.0 8859.0 False hd 2d PT18M38S rectangular NaN ['/m/04rlf', '/m/02jjt', '/m/02jjt'] 2 UCZf__ehlCEBPop___sldpBUQ
2 EU8S-zxS9PI 【悪質】偽物ヒカキン許さねぇ…注意してください!【なりすまし】 ◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/... none ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... 2018-06-27T08:38:55.000Z https://i.ytimg.com/vi/EU8S-zxS9PI/default.jpg 1724625.0 33038.0 0 4298.0 11504.0 False hd 2d PT6M12S rectangular NaN ['/m/04rlf', '/m/02jjt', '/m/02jjt'] 3 UCZf__ehlCEBPop___sldpBUQ
3 5wnfkIfw0jE ツイッターのヒカキンシンメトリーBotが面白すぎて爆笑www ◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/... none ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... 2018-06-25T07:46:07.000Z https://i.ytimg.com/vi/5wnfkIfw0jE/default.jpg 1109029.0 25986.0 0 5063.0 6852.0 False hd 2d PT6M31S rectangular NaN ['/m/04rlf', '/m/02jjt', '/m/02jjt'] 4 UCZf__ehlCEBPop___sldpBUQ
4 -6duBsde_XM 【放送事故】酒飲みながら東海オンエア×ヒカキンで質問コーナーやったらヤバかったwww 提供:モンスターストライク\r\r\r\r\n▼キャンペーンサイトはこちら\r\r\r\r\... none ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... 2018-06-21T08:00:00.000Z https://i.ytimg.com/vi/-6duBsde_XM/default.jpg 1759797.0 33923.0 0 2150.0 4517.0 False hd 2d PT27M7S rectangular NaN ['/m/098wr', '/m/019_rr', '/m/02wbm', '/m/019_... 5 UCZf__ehlCEBPop___sldpBUQ

4.2. 手法1:one-hotエンコーディング(one-hot encoding)

df['cid'].value_counts().head()
UCsX8MJHEI5UukXoF3HLnTvg      501
UCMsuwHzQPFMDtHaoR7_HDxg      501
UC66VyLEdgCot__4w8x__n0CGA    501
UCtLo4nwb3ObCDZ4m8b8u7fA      501
UCOZ7Kq5_VWBC__TtteAcsRBg     501
Name: cid, dtype: int64
# one-hot encoding by pandas

one_hot_df = pd.get_dummies(df['cid'])

# check the one-hot vector
print(one_hot_df.values.shape)
print(df['cid'][0])
print(one_hot_df.values[0])
index = pd.Index(one_hot_df.values[0]).get_loc(1)
print('index = ', index)
print('cid = ', one_hot_df.columns[index])
(66289, 151)
UCZf__ehlCEBPop___sldpBUQ
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]
index =  88
cid =  UCZf__ehlCEBPop___sldpBUQ
# one-hot encoding by sklearn

from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
category = df['cid'].values.reshape(-1, 1)
encoder.fit(category)
one_hot_encoding = encoder.transform(category)

# check the one-hot vector
print(one_hot_encoding[0:10])
print(type(one_hot_encoding))
  (0, 88)	1.0
  (1, 88)	1.0
  (2, 88)	1.0
  (3, 88)	1.0
  (4, 88)	1.0
  (5, 88)	1.0
  (6, 88)	1.0
  (7, 88)	1.0
  (8, 88)	1.0
  (9, 88)	1.0
<class 'scipy.sparse.csr.csr_matrix'>

4.3. 手法2:特徴量ハッシング(Feature hashing)

from sklearn.feature_extraction import FeatureHasher

category = df['cid'].values.reshape(-1, 1)

# if want, you can set the size of hash table (=n_features on FeatureHasher)
num_of_features = 5
hasher = FeatureHasher(n_features=num_of_features, input_type='string')
hashed_array = hasher.transform(category)

# check the result
print(hashed_array.shape)
print(df['cid'][0])
print(hashed_array.toarray()[0])

for i in range(0, len(df), 5000):
    print(df['cid'][i], hashed_array.toarray()[i])
(66289, 5)
UCZf__ehlCEBPop___sldpBUQ
[0. 1. 0. 0. 0.]
UCZf__ehlCEBPop___sldpBUQ [0. 1. 0. 0. 0.]
UC6wKgAlOeFNqmXV167KERhQ [0. 0. 0. 0. 1.]
UC4lZ8vGPy8bwmKILb__YlhzQ [0. 0. 0. 0. 1.]
UCKtKKtjaaPKA1Oj8Ldnfsdg [1. 0. 0. 0. 0.]
UCdtFmWwPlKiCOEND_95fwiA [ 0.  0.  0.  0. -1.]
UC2RdeFmVA1PrDqmFqJMG7hA [0. 0. 0. 1. 0.]
UCO06KZjWOe6b1tXrgzzakZA [0. 0. 0. 0. 1.]
UCg_Wchs_AGoHrlayD_rhO0Q [ 0. -1.  0.  0.  0.]
UC__8H678xX1SNBOM10_ReY6Q [1. 0. 0. 0. 0.]
UC2rbyOa3Jo7vGSibqKcRjqw [0. 0. 0. 1. 0.]
UCPJOCEIyI3gxXbTqKSsViqg [1. 0. 0. 0. 0.]
UCrOnS768WQGgNzvM0wOGa1w [0. 0. 0. 1. 0.]
UCjX7kJYLEAdsaCDnTsWK3Wg [0. 0. 1. 0. 0.]
UCdb7Jw5rprurSCutjT9BW5A [ 0. -1.  0.  0.  0.]

4.4. 手法3:BaseNエンコーディング(BaseN encoding)

!pip install category_encoders
Requirement already satisfied: category_encoders in /usr/local/lib/python3.7/dist-packages (2.2.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (1.4.1)
Requirement already satisfied: pandas>=0.21.1 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (1.1.5)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (0.22.2.post1)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (1.19.5)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (0.5.1)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from category_encoders) (0.10.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.21.1->category_encoders) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.21.1->category_encoders) (2018.9)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.0.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from patsy>=0.5.1->category_encoders) (1.15.0)
import category_encoders as ce

encoder = ce.basen.BaseNEncoder(cols='cid', base=3)
result = encoder.fit_transform(df)

# check the result
columns = result.columns.tolist()
columns_name = [s for s in columns if "cid" in s]

def get_cid_values(df, names, index):
    temp = []
    for name in names:
        temp.append(df[name][index])
    return temp

for i in range(0, len(df), 5000):
    temp = get_cid_values(result, columns_name, i)
    print(df['cid'][i], temp)
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
/usr/local/lib/python3.7/dist-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  elif pd.api.types.is_categorical(cols):
UCZf__ehlCEBPop___sldpBUQ [0, 0, 0, 0, 0, 1]
UC6wKgAlOeFNqmXV167KERhQ [0, 0, 0, 1, 0, 1]
UC4lZ8vGPy8bwmKILb__YlhzQ [0, 0, 0, 2, 1, 0]
UCKtKKtjaaPKA1Oj8Ldnfsdg [0, 0, 1, 0, 1, 1]
UCdtFmWwPlKiCOEND_95fwiA [0, 0, 1, 1, 2, 0]
UC2RdeFmVA1PrDqmFqJMG7hA [0, 0, 1, 2, 2, 2]
UCO06KZjWOe6b1tXrgzzakZA [0, 0, 2, 1, 0, 0]
UCg_Wchs_AGoHrlayD_rhO0Q [0, 0, 2, 2, 0, 2]
UC__8H678xX1SNBOM10_ReY6Q [0, 1, 0, 0, 1, 1]
UC2rbyOa3Jo7vGSibqKcRjqw [0, 1, 0, 1, 2, 1]
UCPJOCEIyI3gxXbTqKSsViqg [0, 1, 1, 0, 0, 2]
UCrOnS768WQGgNzvM0wOGa1w [0, 1, 1, 1, 1, 2]
UCjX7kJYLEAdsaCDnTsWK3Wg [0, 1, 1, 2, 2, 0]
UCdb7Jw5rprurSCutjT9BW5A [0, 1, 2, 1, 1, 0]

4.5. 手法4:エビデンス重みエンコーディング(Weight of Evidence)

import category_encoders as ce

encoder = ce.woe.WOEEncoder(cols='cid')

# ready for evidence
target = df['viewCount'] > 10000

# calculate WOE
result = encoder.fit_transform(df, y=target)

# check the result
for i in range(0, len(df), 5000):
    print(df['cid'][i], '\t', result['cid'][i])
/usr/local/lib/python3.7/dist-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  elif pd.api.types.is_categorical(cols):
UCZf__ehlCEBPop___sldpBUQ 	 3.6095662610015764
UC6wKgAlOeFNqmXV167KERhQ 	 0.4757909580572925
UC4lZ8vGPy8bwmKILb__YlhzQ 	 1.9921283059003998
UCKtKKtjaaPKA1Oj8Ldnfsdg 	 1.807796698178421
UCdtFmWwPlKiCOEND_95fwiA 	 -1.76268251990698
UC2RdeFmVA1PrDqmFqJMG7hA 	 -0.8357080911509357
UCO06KZjWOe6b1tXrgzzakZA 	 0.42925802928263423
UCg_Wchs_AGoHrlayD_rhO0Q 	 3.4570742314135954
UC__8H678xX1SNBOM10_ReY6Q 	 2.2192798786121486
UC2rbyOa3Jo7vGSibqKcRjqw 	 3.022177923131213
UCPJOCEIyI3gxXbTqKSsViqg 	 -0.2641732825478688
UCrOnS768WQGgNzvM0wOGa1w 	 -2.1435507408557353
UCjX7kJYLEAdsaCDnTsWK3Wg 	 -1.794934067712545
UCdb7Jw5rprurSCutjT9BW5A 	 -1.606225698959651