4. Preprocessing code examples for categorical data

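References:
- preprocess methods
- data: YouTuberデータセット公開してみた (the published YouTuber dataset)

Overall flow:
- Preparing the dataset
- Method 1: one-hot encoding
- Method 2: feature hashing
- Method 3: BaseN encoding
- Method 4: Weight of Evidence (WoE) encoding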

4.1. Preparing the dataset

!pip install quilt
!quilt install haradai1262/YouTuber

from quilt.data.haradai1262 import YouTuber
import pandas as pd

df = YouTuber.channel_videos.UUUM_videos()
df.head()

	id	title	description	liveBroadcastContent	tags	publishedAt	thumbnails	viewCount	likeCount	favoriteCount	dislikeCount	commentCount	caption	definition	dimension	duration	projection	TopicIds	relevantTopicIds	idx	cid
0	R7V5d94XkGQ	【大食い】超高級寿司店で３人で食べ放題したらいくらかかるの!?【大トロ1カン2,000円】	提供：ポコロンダンジョンズ\r\r\r\r\niOS：https://bit.ly/2sGg...	none	['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...	2018-06-30T04:00:01.000Z	https://i.ytimg.com/vi/R7V5d94XkGQ/default.jpg	2244205.0	27703.0	0	3667.0	8647.0	False	hd	2d	PT21M16S	rectangular	NaN	['/m/02wbm', '/m/019_rr', '/m/019_rr', '/m/02w...	1	UCZf__ehlCEBPop___sldpBUQ
1	2R9_bkcWNd4	【女王集結】女性YouTuberたちと飲みながら本音トークしてみたら爆笑www	しばなんチャンネルの動画\r\r\r\r\nhttps://www.youtube.com/...	none	['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...	2018-06-29T08:00:01.000Z	https://i.ytimg.com/vi/2R9_bkcWNd4/default.jpg	1869268.0	30889.0	0	3483.0	8859.0	False	hd	2d	PT18M38S	rectangular	NaN	['/m/04rlf', '/m/02jjt', '/m/02jjt']	2	UCZf__ehlCEBPop___sldpBUQ
2	EU8S-zxS9PI	【悪質】偽物ヒカキン許さねぇ…注意してください！【なりすまし】	◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/...	none	['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...	2018-06-27T08:38:55.000Z	https://i.ytimg.com/vi/EU8S-zxS9PI/default.jpg	1724625.0	33038.0	0	4298.0	11504.0	False	hd	2d	PT6M12S	rectangular	NaN	['/m/04rlf', '/m/02jjt', '/m/02jjt']	3	UCZf__ehlCEBPop___sldpBUQ
3	5wnfkIfw0jE	ツイッターのヒカキンシンメトリーBotが面白すぎて爆笑www	◆チャンネル登録はこちら↓\r\r\r\r\nhttp://www.youtube.com/...	none	['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...	2018-06-25T07:46:07.000Z	https://i.ytimg.com/vi/5wnfkIfw0jE/default.jpg	1109029.0	25986.0	0	5063.0	6852.0	False	hd	2d	PT6M31S	rectangular	NaN	['/m/04rlf', '/m/02jjt', '/m/02jjt']	4	UCZf__ehlCEBPop___sldpBUQ
4	-6duBsde_XM	【放送事故】酒飲みながら東海オンエア×ヒカキンで質問コーナーやったらヤバかったwww	提供：モンスターストライク\r\r\r\r\n▼キャンペーンサイトはこちら\r\r\r\r\...	none	['ヒカキン', 'ヒカキンtv', 'hikakintv', 'hikakin', 'ひか...	2018-06-21T08:00:00.000Z	https://i.ytimg.com/vi/-6duBsde_XM/default.jpg	1759797.0	33923.0	0	2150.0	4517.0	False	hd	2d	PT27M7S	rectangular	NaN	['/m/098wr', '/m/019_rr', '/m/02wbm', '/m/019_...	5	UCZf__ehlCEBPop___sldpBUQ

4.2. Method 1: One-hot encoding

df['cid'].value_counts().head()

UCsX8MJHEI5UukXoF3HLnTvg      501
UCMsuwHzQPFMDtHaoR7_HDxg      501
UC66VyLEdgCot__4w8x__n0CGA    501
UCtLo4nwb3ObCDZ4m8b8u7fA      501
UCOZ7Kq5_VWBC__TtteAcsRBg     501
Name: cid, dtype: int64

# one-hot encoding by pandas

one_hot_df = pd.get_dummies(df['cid'])

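# check the one-hot vector
print(one_hot_df.values.shape)   # (number of rows, number of distinct cid values)
print(df['cid'][0])              # category of the first row
print(one_hot_df.values[0])      # its one-hot vector
# position of the single 1 in the first row
index = one_hot_df.values[0].argmax()
print('index = ', index)
print('cid = ', one_hot_df.columns[index])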

(66289, 151)
UCZf__ehlCEBPop___sldpBUQ
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]
index =  88
cid =  UCZf__ehlCEBPop___sldpBUQ
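
For intuition, the computation behind the result above can be sketched in plain Python: collect the distinct categories, fix a column order, and set a single 1 at the column of each row's category. This is only an illustrative sketch. The helper `one_hot_encode` and the toy ids are hypothetical, and sorting the categories is an assumption about the column order, not a statement about how pandas works internally.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 vector per input value."""
    # Assumption: columns follow the sorted order of the distinct categories
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


# Hypothetical toy ids, not taken from the dataset
toy_cids = ["ch_b", "ch_a", "ch_c", "ch_a"]
categories, rows = one_hot_encode(toy_cids)
print(categories)
for cid, row in zip(toy_cids, rows):
    print(cid, row)
```

The vector length equals the number of distinct categories, which is why the 151 channels in this dataset produce 151 columns.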

# one-hot encoding by sklearn

from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
category = df['cid'].values.reshape(-1, 1)
encoder.fit(category)
one_hot_encoding = encoder.transform(category)

# check the one-hot vector
print(one_hot_encoding[0:10])
print(type(one_hot_encoding))

  (0, 88)	1.0
  (1, 88)	1.0
  (2, 88)	1.0
  (3, 88)	1.0
  (4, 88)	1.0
  (5, 88)	1.0
  (6, 88)	1.0
  (7, 88)	1.0
  (8, 88)	1.0
  (9, 88)	1.0
<class 'scipy.sparse.csr.csr_matrix'>

4.3. Method 2: Feature hashing

sklearn : 5.2.2 Feature hashing
wikipedia: Feature hashing

from sklearn.feature_extraction import FeatureHasher

category = df['cid'].values.reshape(-1, 1)

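# optionally set the size of the hash table (n_features in FeatureHasher);
# 5 columns for 151 channels forces many collisions and is for illustration only
num_of_features = 5
hasher = FeatureHasher(n_features=num_of_features, input_type='string')
hashed_array = hasher.transform(category)

# check the result (convert the sparse matrix to a dense array once)
dense = hashed_array.toarray()
print(hashed_array.shape)
print(df['cid'][0])
print(dense[0])

# entries are +1 or -1 because FeatureHasher alternates the sign by default
for i in range(0, len(df), 5000):
    print(df['cid'][i], dense[i])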

(66289, 5)
UCZf__ehlCEBPop___sldpBUQ
[0. 1. 0. 0. 0.]
UCZf__ehlCEBPop___sldpBUQ [0. 1. 0. 0. 0.]
UC6wKgAlOeFNqmXV167KERhQ [0. 0. 0. 0. 1.]
UC4lZ8vGPy8bwmKILb__YlhzQ [0. 0. 0. 0. 1.]
UCKtKKtjaaPKA1Oj8Ldnfsdg [1. 0. 0. 0. 0.]
UCdtFmWwPlKiCOEND_95fwiA [ 0.  0.  0.  0. -1.]
UC2RdeFmVA1PrDqmFqJMG7hA [0. 0. 0. 1. 0.]
UCO06KZjWOe6b1tXrgzzakZA [0. 0. 0. 0. 1.]
UCg_Wchs_AGoHrlayD_rhO0Q [ 0. -1.  0.  0.  0.]
UC__8H678xX1SNBOM10_ReY6Q [1. 0. 0. 0. 0.]
UC2rbyOa3Jo7vGSibqKcRjqw [0. 0. 0. 1. 0.]
UCPJOCEIyI3gxXbTqKSsViqg [1. 0. 0. 0. 0.]
UCrOnS768WQGgNzvM0wOGa1w [0. 0. 0. 1. 0.]
UCjX7kJYLEAdsaCDnTsWK3Wg [0. 0. 1. 0. 0.]
UCdb7Jw5rprurSCutjT9BW5A [ 0. -1.  0.  0.  0.]
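
The +1 and -1 entries above come from the hashing trick: a hash of the string picks the column, and the sign of the hash picks the sign of the value. The sketch below mimics that idea in plain Python. It is not scikit-learn's implementation: `hashlib.md5` is used here only as a stable stand-in for the MurmurHash3 function that `FeatureHasher` relies on, so the column indices will not match the output above, and `hash_feature` / `hash_row` are hypothetical helper names.

```python
import hashlib


def hash_feature(token, n_features):
    """Map one string to (column index, sign) using a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Interpret the first 4 bytes as a signed 32-bit integer
    h = int.from_bytes(digest[:4], "little", signed=True)
    index = abs(h) % n_features      # which column receives the value
    sign = 1.0 if h >= 0 else -1.0   # alternating sign
    return index, sign


def hash_row(tokens, n_features):
    """Accumulate the signed contributions of all tokens of one sample."""
    row = [0.0] * n_features
    for token in tokens:
        index, sign = hash_feature(token, n_features)
        row[index] += sign
    return row


print(hash_row(["UCZf__ehlCEBPop___sldpBUQ"], 5))
```

Because the output width is fixed in advance, distinct channels can land in the same column. With only 5 columns that is unavoidable, which is the price paid for not having to store a category-to-column mapping.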

4.4. Method 3: BaseN encoding

!pip install category_encoders

import category_encoders as ce

# encode the ordinal id of each channel as base-3 digits
encoder = ce.BaseNEncoder(cols=['cid'], base=3)
result = encoder.fit_transform(df)

# check the result: collect the generated cid_* digit columns
columns_name = [s for s in result.columns if s.startswith('cid')]

for i in range(0, len(df), 5000):
    print(df['cid'][i], result.loc[i, columns_name].tolist())

UCZf__ehlCEBPop___sldpBUQ [0, 0, 0, 0, 0, 1]
UC6wKgAlOeFNqmXV167KERhQ [0, 0, 0, 1, 0, 1]
UC4lZ8vGPy8bwmKILb__YlhzQ [0, 0, 0, 2, 1, 0]
UCKtKKtjaaPKA1Oj8Ldnfsdg [0, 0, 1, 0, 1, 1]
UCdtFmWwPlKiCOEND_95fwiA [0, 0, 1, 1, 2, 0]
UC2RdeFmVA1PrDqmFqJMG7hA [0, 0, 1, 2, 2, 2]
UCO06KZjWOe6b1tXrgzzakZA [0, 0, 2, 1, 0, 0]
UCg_Wchs_AGoHrlayD_rhO0Q [0, 0, 2, 2, 0, 2]
UC__8H678xX1SNBOM10_ReY6Q [0, 1, 0, 0, 1, 1]
UC2rbyOa3Jo7vGSibqKcRjqw [0, 1, 0, 1, 2, 1]
UCPJOCEIyI3gxXbTqKSsViqg [0, 1, 1, 0, 0, 2]
UCrOnS768WQGgNzvM0wOGa1w [0, 1, 1, 1, 1, 2]
UCjX7kJYLEAdsaCDnTsWK3Wg [0, 1, 1, 2, 2, 0]
UCdb7Jw5rprurSCutjT9BW5A [0, 1, 2, 1, 1, 0]
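
The digit lists above read naturally as base-3 numbers: the first channel is 1, and the second printed row, `[0, 0, 0, 1, 0, 1]`, is 10. A minimal sketch of that idea follows, assuming each category receives an ordinal id starting at 1 in order of first appearance and that the id is then written out in base N. The helpers `to_base_n` and `base_n_encode` are hypothetical names, not part of category_encoders, and the library may differ in details such as how many digit columns it emits.

```python
def to_base_n(number, base, width):
    """Write a non-negative integer as `width` base-`base` digits."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base, width):
    """Assign ordinal ids (from 1, by first appearance) and expand to digits."""
    ordinal = {}
    for value in values:
        ordinal.setdefault(value, len(ordinal) + 1)
    return [to_base_n(ordinal[value], base, width) for value in values]


print(to_base_n(1, 3, 6))
print(to_base_n(10, 3, 6))
print(base_n_encode(["a", "b", "a", "c"], base=3, width=2))
```

With base 2 this reduces to binary encoding. The number of columns grows only logarithmically with the number of categories, at the cost of columns that are harder to interpret than one-hot columns.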

4.5. Method 4: Weight of Evidence (WoE) encoding

import category_encoders as ce

encoder = ce.WOEEncoder(cols=['cid'])

# binary target used as the "evidence": does the video exceed 10,000 views?
target = df['viewCount'] > 10000

# fit the encoder and replace each cid with its WoE value
result = encoder.fit_transform(df, y=target)

# check the result
for i in range(0, len(df), 5000):
    print(df['cid'][i], '\t', result['cid'][i])

UCZf__ehlCEBPop___sldpBUQ 	 3.6095662610015764
UC6wKgAlOeFNqmXV167KERhQ 	 0.4757909580572925
UC4lZ8vGPy8bwmKILb__YlhzQ 	 1.9921283059003998
UCKtKKtjaaPKA1Oj8Ldnfsdg 	 1.807796698178421
UCdtFmWwPlKiCOEND_95fwiA 	 -1.76268251990698
UC2RdeFmVA1PrDqmFqJMG7hA 	 -0.8357080911509357
UCO06KZjWOe6b1tXrgzzakZA 	 0.42925802928263423
UCg_Wchs_AGoHrlayD_rhO0Q 	 3.4570742314135954
UC__8H678xX1SNBOM10_ReY6Q 	 2.2192798786121486
UC2rbyOa3Jo7vGSibqKcRjqw 	 3.022177923131213
UCPJOCEIyI3gxXbTqKSsViqg 	 -0.2641732825478688
UCrOnS768WQGgNzvM0wOGa1w 	 -2.1435507408557353
UCjX7kJYLEAdsaCDnTsWK3Wg 	 -1.794934067712545
UCdb7Jw5rprurSCutjT9BW5A 	 -1.606225698959651
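
To make the values above easier to read, here is a sketch of a commonly used smoothed WoE definition: for each category, the log of its share of all positive rows divided by its share of all negative rows. Positive values then mark channels whose videos exceed the view threshold more often than average, and negative values mark the opposite. The function `woe_encode`, the `regularization` constant of 1.0, and the toy data are assumptions for illustration. The smoothing used inside category_encoders may differ in detail.

```python
import math
from collections import Counter


def woe_encode(categories, targets, regularization=1.0):
    """Return {category: WoE} for a binary target (smoothed; assumed form)."""
    total_pos = sum(bool(t) for t in targets)
    total_neg = len(targets) - total_pos
    count = Counter(categories)
    pos = Counter(c for c, t in zip(categories, targets) if t)
    woe = {}
    for cat, n in count.items():
        n_pos = pos[cat]
        n_neg = n - n_pos
        # Smoothed share of positives / negatives that fall in this category
        p = (n_pos + regularization) / (total_pos + 2 * regularization)
        q = (n_neg + regularization) / (total_neg + 2 * regularization)
        woe[cat] = math.log(p / q)
    return woe


# Hypothetical toy data: channel "a" is mostly above the threshold, "b" mostly below
toy_cids = ["a", "a", "a", "b", "b", "b"]
toy_target = [True, True, True, False, False, True]
print(woe_encode(toy_cids, toy_target))
```

Since the encoding is computed from the target, fitting it on the same rows that are later used for training can leak label information. Fitting on a training split only is the safer choice.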
