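4. Preprocessing code examples for categorical data¶
ref.
Overall flow
Preparing the dataset
Method 1: One-hot encoding
Method 2: Feature hashing
Method 3: BaseN encoding
Method 4: Weight of Evidence encoding
4.1. Preparing the dataset¶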
!pip install quilt
!quilt install haradai1262/YouTuber
Downloading package metadata...
haradai1262/YouTuber already installed.
Overwrite? (y/n) n
from quilt.data.haradai1262 import YouTuber
import pandas as pd
df = YouTuber.channel_videos.UUUM_videos()
df.head()
(index) | id | title | description | liveBroadcastContent | tags | publishedAt | thumbnails | viewCount | likeCount | favoriteCount | dislikeCount | commentCount | caption | definition | dimension | duration | projection | TopicIds | relevantTopicIds | idx | cid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R7V5d94XkGQ | 【大食い】超高級寿司店で3人で食べ放題したらいくらかかるの!?【大トロ1カン2,000円】 | 提供:ポコロンダンジョンズ\r\r\r\r\niOS:https://bit.ly/2sGg... | none | ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... | 2018-06-30T04:00:01.000Z | https://i.ytimg.com/vi/R7V5d94XkGQ/default.jpg | 2244205.0 | 27703.0 | 0 | 3667.0 | 8647.0 | False | hd | 2d | PT21M16S | rectangular | NaN | ['/m/02wbm', '/m/019_rr', '/m/019_rr', '/m/02w... | 1 | UCZf__ehlCEBPop___sldpBUQ |
1 | 2R9_bkcWNd4 | 【女王集結】女性YouTuberたちと飲みながら本音トークしてみたら爆笑www | しばなんチャンネルの動画\r\r\r\r\nhttps://www.youtube.com/... | none | ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... | 2018-06-29T08:00:01.000Z | https://i.ytimg.com/vi/2R9_bkcWNd4/default.jpg | 1869268.0 | 30889.0 | 0 | 3483.0 | 8859.0 | False | hd | 2d | PT18M38S | rectangular | NaN | ['/m/04rlf', '/m/02jjt', '/m/02jjt'] | 2 | UCZf__ehlCEBPop___sldpBUQ |
2 | EU8S-zxS9PI | 【悪質】偽物ヒカキン許さねぇ…注意してください!【なりすまし】 | ◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/... | none | ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... | 2018-06-27T08:38:55.000Z | https://i.ytimg.com/vi/EU8S-zxS9PI/default.jpg | 1724625.0 | 33038.0 | 0 | 4298.0 | 11504.0 | False | hd | 2d | PT6M12S | rectangular | NaN | ['/m/04rlf', '/m/02jjt', '/m/02jjt'] | 3 | UCZf__ehlCEBPop___sldpBUQ |
3 | 5wnfkIfw0jE | ツイッターのヒカキンシンメトリーBotが面白すぎて爆笑www | ◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/... | none | ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... | 2018-06-25T07:46:07.000Z | https://i.ytimg.com/vi/5wnfkIfw0jE/default.jpg | 1109029.0 | 25986.0 | 0 | 5063.0 | 6852.0 | False | hd | 2d | PT6M31S | rectangular | NaN | ['/m/04rlf', '/m/02jjt', '/m/02jjt'] | 4 | UCZf__ehlCEBPop___sldpBUQ |
4 | -6duBsde_XM | 【放送事故】酒飲みながら東海オンエア×ヒカキンで質問コーナーやったらヤバかったwww | 提供:モンスターストライク\r\r\r\r\n▼キャンペーンサイトはこちら\r\r\r\r\... | none | ['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか... | 2018-06-21T08:00:00.000Z | https://i.ytimg.com/vi/-6duBsde_XM/default.jpg | 1759797.0 | 33923.0 | 0 | 2150.0 | 4517.0 | False | hd | 2d | PT27M7S | rectangular | NaN | ['/m/098wr', '/m/019_rr', '/m/02wbm', '/m/019_... | 5 | UCZf__ehlCEBPop___sldpBUQ |
4.2. Method 1: One-hot encoding¶
df['cid'].value_counts().head()
UCsX8MJHEI5UukXoF3HLnTvg 501
UCMsuwHzQPFMDtHaoR7_HDxg 501
UC66VyLEdgCot__4w8x__n0CGA 501
UCtLo4nwb3ObCDZ4m8b8u7fA 501
UCOZ7Kq5_VWBC__TtteAcsRBg 501
Name: cid, dtype: int64
# one-hot encoding by pandas
one_hot_df = pd.get_dummies(df['cid'])
# check the one-hot vector
print(one_hot_df.values.shape)
print(df['cid'][0])
print(one_hot_df.values[0])
index = pd.Index(one_hot_df.values[0]).get_loc(1)
print('index = ', index)
print('cid = ', one_hot_df.columns[index])
(66289, 151)
UCZf__ehlCEBPop___sldpBUQ
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0]
index = 88
cid = UCZf__ehlCEBPop___sldpBUQ
# one-hot encoding by sklearn
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
category = df['cid'].values.reshape(-1, 1)
encoder.fit(category)
one_hot_encoding = encoder.transform(category)
# check the one-hot vector
print(one_hot_encoding[0:10])
print(type(one_hot_encoding))
(0, 88) 1.0
(1, 88) 1.0
(2, 88) 1.0
(3, 88) 1.0
(4, 88) 1.0
(5, 88) 1.0
(6, 88) 1.0
(7, 88) 1.0
(8, 88) 1.0
(9, 88) 1.0
<class 'scipy.sparse.csr.csr_matrix'>
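To make the mapping in the outputs above concrete, here is a minimal pure-Python sketch of what one-hot encoding does: collect the distinct categories, sort them, give each one a column index, and set only that column to 1. The helper name `one_hot_encode` and the channel IDs are hypothetical and used only for illustration; they are not part of pandas or scikit-learn.

```python
def one_hot_encode(values):
    """Return (sorted categories, one-hot rows) for a list of labels."""
    # one column per distinct category, in sorted order
    categories = sorted(set(values))
    index_of = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index_of[v]] = 1  # exactly one "hot" position per row
        rows.append(row)
    return categories, rows


# hypothetical channel IDs (not taken from the dataset)
cids = ["ch_b", "ch_a", "ch_c", "ch_a"]
categories, rows = one_hot_encode(cids)
print(categories)
for cid, row in zip(cids, rows):
    print(cid, row)
```

The width of each row equals the number of distinct categories, which is why the encoded `cid` column above has 151 columns.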
4.3. Method 2: Feature hashing¶
scikit-learn: 5.2.2 Feature hashing
Wikipedia: Feature hashing
from sklearn.feature_extraction import FeatureHasher
category = df['cid'].values.reshape(-1, 1)
# optionally set the size of the hash table (= n_features of FeatureHasher)
num_of_features = 5
hasher = FeatureHasher(n_features=num_of_features, input_type='string')
hashed_array = hasher.transform(category)
# check the result
print(hashed_array.shape)
print(df['cid'][0])
print(hashed_array.toarray()[0])
for i in range(0, len(df), 5000):
    print(df['cid'][i], hashed_array.toarray()[i])
(66289, 5)
UCZf__ehlCEBPop___sldpBUQ
[0. 1. 0. 0. 0.]
UCZf__ehlCEBPop___sldpBUQ [0. 1. 0. 0. 0.]
UC6wKgAlOeFNqmXV167KERhQ [0. 0. 0. 0. 1.]
UC4lZ8vGPy8bwmKILb__YlhzQ [0. 0. 0. 0. 1.]
UCKtKKtjaaPKA1Oj8Ldnfsdg [1. 0. 0. 0. 0.]
UCdtFmWwPlKiCOEND_95fwiA [ 0. 0. 0. 0. -1.]
UC2RdeFmVA1PrDqmFqJMG7hA [0. 0. 0. 1. 0.]
UCO06KZjWOe6b1tXrgzzakZA [0. 0. 0. 0. 1.]
UCg_Wchs_AGoHrlayD_rhO0Q [ 0. -1. 0. 0. 0.]
UC__8H678xX1SNBOM10_ReY6Q [1. 0. 0. 0. 0.]
UC2rbyOa3Jo7vGSibqKcRjqw [0. 0. 0. 1. 0.]
UCPJOCEIyI3gxXbTqKSsViqg [1. 0. 0. 0. 0.]
UCrOnS768WQGgNzvM0wOGa1w [0. 0. 0. 1. 0.]
UCjX7kJYLEAdsaCDnTsWK3Wg [0. 0. 1. 0. 0.]
UCdb7Jw5rprurSCutjT9BW5A [ 0. -1. 0. 0. 0.]
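The idea behind the output above can be sketched in plain Python: hash each category string, use the hash to pick one of `n_features` buckets, and use the sign of the hash as the stored value. This is only a sketch under assumptions. It uses MD5 from `hashlib` as a stand-in hash, whereas scikit-learn's `FeatureHasher` uses MurmurHash3, so the bucket positions will not match the output above. The helper name `hash_feature` and the channel IDs are hypothetical.

```python
import hashlib


def hash_feature(token, n_features=5):
    """Map one category string to a fixed-length signed vector."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # interpret the first 4 bytes as a signed 32-bit integer (stand-in hash)
    h = int.from_bytes(digest[:4], byteorder="big", signed=True)
    index = abs(h) % n_features          # bucket chosen by the hash
    sign = 1.0 if h >= 0 else -1.0       # sign taken from the hash
    vec = [0.0] * n_features
    vec[index] = sign
    return vec


# hypothetical channel IDs (not taken from the dataset)
for cid in ["ch_a", "ch_b", "ch_c"]:
    print(cid, hash_feature(cid))

# with 151 categories and 5 signed buckets there are at most 10 distinct vectors
distinct = {tuple(hash_feature("ch_%d" % i)) for i in range(151)}
print(len(distinct))
```

Because the vector length is fixed in advance, different categories necessarily share the same vector once there are more categories than bucket and sign combinations. A signed scheme of this kind is presumably why some rows in the output above contain `-1`.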
4.4. Method 3: BaseN encoding¶
!pip install category_encoders
import category_encoders as ce
# pass the target column as a list; base=3 writes each category code in base 3
encoder = ce.basen.BaseNEncoder(cols=['cid'], base=3)
result = encoder.fit_transform(df)
# check the result
columns = result.columns.tolist()
columns_name = [s for s in columns if "cid" in s]
def get_cid_values(df, names, index):
    # collect the values of the encoded cid columns for one row
    temp = []
    for name in names:
        temp.append(df[name][index])
    return temp

for i in range(0, len(df), 5000):
    temp = get_cid_values(result, columns_name, i)
    print(df['cid'][i], temp)
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm
/usr/local/lib/python3.7/dist-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
elif pd.api.types.is_categorical(cols):
UCZf__ehlCEBPop___sldpBUQ [0, 0, 0, 0, 0, 1]
UC6wKgAlOeFNqmXV167KERhQ [0, 0, 0, 1, 0, 1]
UC4lZ8vGPy8bwmKILb__YlhzQ [0, 0, 0, 2, 1, 0]
UCKtKKtjaaPKA1Oj8Ldnfsdg [0, 0, 1, 0, 1, 1]
UCdtFmWwPlKiCOEND_95fwiA [0, 0, 1, 1, 2, 0]
UC2RdeFmVA1PrDqmFqJMG7hA [0, 0, 1, 2, 2, 2]
UCO06KZjWOe6b1tXrgzzakZA [0, 0, 2, 1, 0, 0]
UCg_Wchs_AGoHrlayD_rhO0Q [0, 0, 2, 2, 0, 2]
UC__8H678xX1SNBOM10_ReY6Q [0, 1, 0, 0, 1, 1]
UC2rbyOa3Jo7vGSibqKcRjqw [0, 1, 0, 1, 2, 1]
UCPJOCEIyI3gxXbTqKSsViqg [0, 1, 1, 0, 0, 2]
UCrOnS768WQGgNzvM0wOGa1w [0, 1, 1, 1, 1, 2]
UCjX7kJYLEAdsaCDnTsWK3Wg [0, 1, 1, 2, 2, 0]
UCdb7Jw5rprurSCutjT9BW5A [0, 1, 2, 1, 1, 0]
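The output above is consistent with a two-step procedure: give each category an integer code in order of first appearance, starting at 1, and then write that code as base-3 digits. The following pure-Python sketch reproduces that idea. The helper names `to_base_n` and `base_n_encode` and the `width` parameter are hypothetical and are not the `category_encoders` API; how the library numbers categories internally is an assumption inferred from the output.

```python
def to_base_n(number, base, width):
    """Return the digits of a non-negative integer in the given base, left-padded with zeros."""
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    digits.reverse()  # most significant digit first
    return [0] * (width - len(digits)) + digits


def base_n_encode(values, base=3, width=6):
    """Encode labels as base-N digit lists (assumed ordinal codes start at 1)."""
    codes = {}
    for v in values:
        # assign the next integer code the first time a label is seen
        codes.setdefault(v, len(codes) + 1)
    return [to_base_n(codes[v], base, width) for v in values]


print(to_base_n(1, 3, 6))
print(to_base_n(10, 3, 6))
# hypothetical labels (not taken from the dataset)
print(base_n_encode(["a", "b", "a", "c"], base=3, width=2))
```

With base 3, the 151 channels fit in far fewer columns than the 151 needed by one-hot encoding, at the cost of columns that are no longer individually interpretable.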
4.5. Method 4: Weight of Evidence encoding¶
category_encoders: Weight of Evidence
import category_encoders as ce
encoder = ce.woe.WOEEncoder(cols=['cid'])
# binary target used as the evidence: True when a video has more than 10,000 views
target = df['viewCount'] > 10000
# calculate WOE
result = encoder.fit_transform(df, y=target)
# check the result
for i in range(0, len(df), 5000):
    print(df['cid'][i], '\t', result['cid'][i])
/usr/local/lib/python3.7/dist-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
elif pd.api.types.is_categorical(cols):
UCZf__ehlCEBPop___sldpBUQ 3.6095662610015764
UC6wKgAlOeFNqmXV167KERhQ 0.4757909580572925
UC4lZ8vGPy8bwmKILb__YlhzQ 1.9921283059003998
UCKtKKtjaaPKA1Oj8Ldnfsdg 1.807796698178421
UCdtFmWwPlKiCOEND_95fwiA -1.76268251990698
UC2RdeFmVA1PrDqmFqJMG7hA -0.8357080911509357
UCO06KZjWOe6b1tXrgzzakZA 0.42925802928263423
UCg_Wchs_AGoHrlayD_rhO0Q 3.4570742314135954
UC__8H678xX1SNBOM10_ReY6Q 2.2192798786121486
UC2rbyOa3Jo7vGSibqKcRjqw 3.022177923131213
UCPJOCEIyI3gxXbTqKSsViqg -0.2641732825478688
UCrOnS768WQGgNzvM0wOGa1w -2.1435507408557353
UCjX7kJYLEAdsaCDnTsWK3Wg -1.794934067712545
UCdb7Jw5rprurSCutjT9BW5A -1.606225698959651
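As a rough guide to reading these values: the Weight of Evidence of a category is the natural log of the ratio between the category's share of positive samples and its share of negative samples. A positive value means the category is over-represented among positives (here, videos with more than 10,000 views). The sketch below computes this by hand in pure Python. The helper name `weight_of_evidence`, the smoothing parameter `reg`, and the toy data are hypothetical. The smoothing term is an assumption added to avoid division by zero and may not match the exact formula used by `WOEEncoder`.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets, reg=1.0):
    """Return {category: WoE} for a binary target, with additive smoothing."""
    pos = Counter(c for c, t in zip(categories, targets) if t)
    neg = Counter(c for c, t in zip(categories, targets) if not t)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    woe = {}
    for c in set(categories):
        # smoothed share of all positives / negatives that fall in this category
        pos_share = (pos[c] + reg) / (total_pos + 2 * reg)
        neg_share = (neg[c] + reg) / (total_neg + 2 * reg)
        woe[c] = math.log(pos_share / neg_share)
    return woe


# hypothetical toy data: channel "a" is mostly positive, "b" mostly negative
cats = ["a", "a", "a", "b", "b", "b"]
target = [True, True, True, False, False, True]
woe = weight_of_evidence(cats, target)
for c in sorted(woe):
    print(c, woe[c])
```

Unlike the other three methods, this encoding needs a target variable, so in practice it should be fitted on training data only to avoid leaking the target into the features.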