2. グラフマイニングの例

  • データセット

    • social circles: Facebookから facebook_combined.txt.zip を利用。

    • graph1.txt, graph2.txt, graph_modularity_simple.txt: 小さなグラフ例

  • ツール等

    • Graphviz: ノードとエッジで構成されるグラフ描画ツール。

    • pygraphviz: GraphvizをPythonから使うためのラッパー。

    • networkx: ネットワーク解析パッケージ。

2.1. 環境構築

!apt install libgraphviz-dev
!pip install pygraphviz
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common
  libgvc6-plugins-gtk libxdot4
Suggested packages:
  gvfs
The following NEW packages will be installed:
  libgail-common libgail18 libgraphviz-dev libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libgvc6-plugins-gtk libxdot4
0 upgraded, 8 newly installed, 0 to remove and 39 not upgraded.
Need to get 2,120 kB of archives.
After this operation, 7,128 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgtk2.0-common all 2.24.32-1ubuntu1 [125 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgtk2.0-0 amd64 2.24.32-1ubuntu1 [1,769 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgail18 amd64 2.24.32-1ubuntu1 [14.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgail-common amd64 2.24.32-1ubuntu1 [112 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libxdot4 amd64 2.40.1-2 [15.7 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libgvc6-plugins-gtk amd64 2.40.1-2 [18.2 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libgraphviz-dev amd64 2.40.1-2 [57.3 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgtk2.0-bin amd64 2.24.32-1ubuntu1 [7,536 B]
Fetched 2,120 kB in 1s (1,456 kB/s)
Selecting previously unselected package libgtk2.0-common.
(Reading database ... 160772 files and directories currently installed.)
Preparing to unpack .../0-libgtk2.0-common_2.24.32-1ubuntu1_all.deb ...
Unpacking libgtk2.0-common (2.24.32-1ubuntu1) ...
Selecting previously unselected package libgtk2.0-0:amd64.
Preparing to unpack .../1-libgtk2.0-0_2.24.32-1ubuntu1_amd64.deb ...
Unpacking libgtk2.0-0:amd64 (2.24.32-1ubuntu1) ...
Selecting previously unselected package libgail18:amd64.
Preparing to unpack .../2-libgail18_2.24.32-1ubuntu1_amd64.deb ...
Unpacking libgail18:amd64 (2.24.32-1ubuntu1) ...
Selecting previously unselected package libgail-common:amd64.
Preparing to unpack .../3-libgail-common_2.24.32-1ubuntu1_amd64.deb ...
Unpacking libgail-common:amd64 (2.24.32-1ubuntu1) ...
Selecting previously unselected package libxdot4.
Preparing to unpack .../4-libxdot4_2.40.1-2_amd64.deb ...
Unpacking libxdot4 (2.40.1-2) ...
Selecting previously unselected package libgvc6-plugins-gtk.
Preparing to unpack .../5-libgvc6-plugins-gtk_2.40.1-2_amd64.deb ...
Unpacking libgvc6-plugins-gtk (2.40.1-2) ...
Selecting previously unselected package libgraphviz-dev.
Preparing to unpack .../6-libgraphviz-dev_2.40.1-2_amd64.deb ...
Unpacking libgraphviz-dev (2.40.1-2) ...
Selecting previously unselected package libgtk2.0-bin.
Preparing to unpack .../7-libgtk2.0-bin_2.24.32-1ubuntu1_amd64.deb ...
Unpacking libgtk2.0-bin (2.24.32-1ubuntu1) ...
Setting up libgtk2.0-common (2.24.32-1ubuntu1) ...
Setting up libxdot4 (2.40.1-2) ...
Setting up libgtk2.0-0:amd64 (2.24.32-1ubuntu1) ...
Setting up libgail18:amd64 (2.24.32-1ubuntu1) ...
Setting up libgail-common:amd64 (2.24.32-1ubuntu1) ...
Setting up libgvc6-plugins-gtk (2.40.1-2) ...
Setting up libgraphviz-dev (2.40.1-2) ...
Setting up libgtk2.0-bin (2.24.32-1ubuntu1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.2) ...
/sbin/ldconfig.real: /usr/local/lib/python3.7/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link

Collecting pygraphviz
?25l  Downloading https://files.pythonhosted.org/packages/3a/d6/2c56f09ee83dbebb62c40487e4c972135661b9984fec9b30b77fb497090c/pygraphviz-1.7.zip (118kB)
     |████████████████████████████████| 122kB 7.9MB/s 
?25hBuilding wheels for collected packages: pygraphviz
  Building wheel for pygraphviz (setup.py) ... ?25l?25hdone
  Created wheel for pygraphviz: filename=pygraphviz-1.7-cp37-cp37m-linux_x86_64.whl size=166142 sha256=6bbbb96a20c7c22804e6b3f842eb5a71b5a652364c064af09c58727fae084d8b
  Stored in directory: /root/.cache/pip/wheels/32/59/00/14934a4292c4359eeabcdbf90f33d309b55d0f1be8a1262523
Successfully built pygraphviz
Installing collected packages: pygraphviz
Successfully installed pygraphviz-1.7
import pandas as pd
import networkx as nx
import pygraphviz
import matplotlib.pyplot as plt

2.2. やや大きめの実データを用いたグラフ化例

  • facebook_combined.txt には「どのノード間にエッジがあるか」というリンク情報が保存されている。例えば「0 1」という行なら「ノード0とノード1の間にエッジがある」ことを意味している。

  • このままではグラフではないため、データを一度DataFrameとして読み込み、それを from_edges_to_networks() によりグラフに変換する。

2.2.1. データ準備(facebook_combined.txt)

!curl -O http://snap.stanford.edu/data/facebook_combined.txt.gz
!gzip -d facebook_combined.txt.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  213k  100  213k    0     0   808k      0 --:--:-- --:--:-- --:--:--  808k
!head facebook_combined.txt
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
0 10
filename = "facebook_combined.txt"
df = pd.read_csv(filename, sep=' ')
df[:5]
0 1
0 0 2
1 0 3
2 0 4
3 0 5
4 0 6
# エッジ数の確認(上位10件)

counts = df['0'].value_counts()
counts[:10]
107     1043
1684     778
1912     748
3437     542
0        346
348      225
1941     215
1985     207
483      191
1917     189
Name: 0, dtype: int64

2.2.2. エッジ集合のDataFrameからグラフに変換

# グラフ変換

def from_edges_to_networks(filename):
    """グラフ読み込み
    Graphvizのラッパーであるpygraphvizにデータを渡すため、
    netoworkx.Graph()へノードとエッジ集合を読み込む。
    """
    G = nx.Graph()
    with open(filename) as fh:
        for line in fh:
            edges = list(map(int, line.split(" ")))
            G.add_edge(edges[0], edges[1])
    return G


G = nx.read_edgelist(filename, nodetype=int)

# ノード数、エッジ数の確認
print(nx.number_of_nodes(G), nx.number_of_edges(G))
4039 88234

2.2.3. グラフ描画

直接描画することが困難なサイズなので、一度ファイルに保存し、それを縮小したものを描画している。

# そのまま描画すると潰れてしまう、、
nx.draw(G)
../../_images/graph_example_13_0.png
# save graph

pos = nx.nx_agraph.pygraphviz_layout(G, prog='sfdp', args='-Goverlap=false -Tpng -o facebook.png')


from PIL import Image
from IPython.display import display

img = Image.open("facebook.png")
img_resize = img.resize((256,256)) # 画像リサイズ
img_resize.save("facebook_s.png")

with Image.open("facebook_s.png") as im:
  display(im)
../../_images/graph_example_14_0.png

2.2.4. ネットワーク分析例

  • エッジ数をカウントし、エッジ数あたりの出現回数を描画。

  • 上記をlog-scaleで描画。

# degree distribution
# normal scale

import collections

# ノード毎に有するエッジ数をカウント
count = []
for index_of_node in G.adj:
    count.append(len(G.adj[index_of_node]))

count.sort()
temp = collections.Counter(count)

# エッジ数あたりの出現回数をカウント
degree, freq = zip(*temp.items())

plt.scatter(degree, freq, s=5)
plt.xlim(0, max(degree))
plt.ylim(0, max(freq))
plt.xlabel('degree')
plt.ylabel('freq')
plt.title(filename + ' (degree x freq)')
plt.show()
../../_images/graph_example_16_0.png
# log scale
plt.scatter(degree, freq, s=5)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('degree (log10 scale)')
plt.ylabel('freq (log10 scale)')
plt.title(filename + ' (degree x freq) in log-scale')
plt.show()
../../_images/graph_example_17_0.png

2.2.5. Zipfの法則っぽさを確認

zipf_values = []
for i in range(10,50):
  value = i * freq[i]
  zipf_values.append(value)

zipf_values
[810,
 902,
 948,
 1131,
 1484,
 1230,
 1216,
 1241,
 1296,
 1197,
 1040,
 1323,
 1166,
 1380,
 1320,
 1400,
 1274,
 999,
 1064,
 1160,
 1140,
 1364,
 1120,
 1419,
 1224,
 1505,
 1548,
 1628,
 1102,
 1053,
 1160,
 861,
 1218,
 903,
 836,
 1080,
 1104,
 1128,
 1584,
 1225]
df = pd.DataFrame(zipf_values)
df.plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ea0316f10>
../../_images/graph_example_20_1.png

2.3. ページランクの例

2.3.1. データセットの用意

!curl -O https://raw.githubusercontent.com/naltoma/datamining_intro/master/4-casestudy/graph/graph1.txt
!curl -O https://raw.githubusercontent.com/naltoma/datamining_intro/master/4-casestudy/graph/graph2.txt
!ls
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    20  100    20    0     0    124      0 --:--:-- --:--:-- --:--:--   124
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   122  100   122    0     0    734      0 --:--:-- --:--:-- --:--:--   734
facebook_combined.txt  facebook_s.png  graph2.txt
facebook.png	       graph1.txt      sample_data
!cat graph1.txt
A B
A C
B C
C A
D C

2.3.2. ランダムサーファーモデルによるシミュレーション

import random

def random_surfer(G, max_itr=1000, d=0.15):
    """ランダムサーファーモデルのシミュレーション。
    ランダムなノードから出発し、そのページのout-edgesをランダムに選択して移動する。
    ただし、確率dでページ遷移を辞め、無作為にノードを選んで再出発する。

    :param G: networkx.classes.digraph.DiGraph。有向グラフ。
    :param max_itr: ページ遷移回数。
    :param d: エッジ選択から離脱する確率。
    :return: 各ノードの訪問割合。
    """
    num = len(nx.nodes(G))
    nodes = list(nx.nodes(G))
    reached = {}
    count = 0
    current = random.choice(nodes)
    while True:
        if current in reached:
            reached[current] += 1
        else:
            reached.update({current:1})
        count += 1
        if count == max_itr:
            break
        if d < random.random():
            targets = [target for cur, target in nx.edges(G, current)]
            if len(targets) == 0:
                break
            current = random.choice(targets)
        else:
            current = random.choice(nodes)
            continue
        print(count,reached)

    for node in reached.keys():
        reached[node] /= count
    return reached


filename = "graph1.txt"
G = nx.read_edgelist(filename, create_using=nx.DiGraph())
reached = random_surfer(G, 20)
print(reached)
2 {'A': 1, 'D': 1}
3 {'A': 1, 'D': 1, 'C': 1}
4 {'A': 2, 'D': 1, 'C': 1}
5 {'A': 2, 'D': 1, 'C': 1, 'B': 1}
6 {'A': 2, 'D': 1, 'C': 2, 'B': 1}
7 {'A': 3, 'D': 1, 'C': 2, 'B': 1}
8 {'A': 3, 'D': 1, 'C': 2, 'B': 2}
9 {'A': 3, 'D': 1, 'C': 3, 'B': 2}
10 {'A': 4, 'D': 1, 'C': 3, 'B': 2}
11 {'A': 4, 'D': 1, 'C': 4, 'B': 2}
12 {'A': 5, 'D': 1, 'C': 4, 'B': 2}
13 {'A': 5, 'D': 1, 'C': 5, 'B': 2}
15 {'A': 7, 'D': 1, 'C': 5, 'B': 2}
16 {'A': 7, 'D': 1, 'C': 6, 'B': 2}
17 {'A': 8, 'D': 1, 'C': 6, 'B': 2}
18 {'A': 8, 'D': 1, 'C': 7, 'B': 2}
19 {'A': 9, 'D': 1, 'C': 7, 'B': 2}
{'A': 0.45, 'D': 0.05, 'C': 0.35, 'B': 0.15}

2.3.3. networkx.pagerankによるランク評価

filename = "graph1.txt"
G = nx.read_edgelist(filename, create_using=nx.DiGraph())

# ランク算出
pr = nx.pagerank(G)
print(pr)
{'A': 0.372526246091333, 'B': 0.19582365458881654, 'C': 0.39415009931985023, 'D': 0.037500000000000006}

2.3.4. グラフ描画

# グラフ描画
pos = nx.spring_layout(G)
labels = {}
for node in pr.keys():
    labels[node] = 'Node {}\n{}'.format(node, round(pr[node], 3))

G = nx.relabel_nodes(G, labels)

import graphviz
graphviz.Source(nx.nx_agraph.to_agraph(G))
../../_images/graph_example_30_0.svg

2.4. 少し複雑なグラフでページランク確認

filename = "graph2.txt"
G = nx.read_edgelist(filename, create_using=nx.DiGraph())

# ランク算出
pr = nx.pagerank(G)
print(pr)

# グラフ描画
pos = nx.spring_layout(G)
labels = {}
for node in pr.keys():
    labels[node] = 'Node {}\n{}'.format(node, round(pr[node], 3))

G = nx.relabel_nodes(G, labels)

#nx.nx_agraph.view_pygraphviz(G, prog='sfdp')

import graphviz
graphviz.Source(nx.nx_agraph.to_agraph(G))
#nx.draw_networkx(G)
{'1': 0.12174896105921995, '2': 0.04386978407256124, '3': 0.04386978407256124, '4': 0.04386978407256124, '5': 0.06530818443477784, '6': 0.10217580762337017, '7': 0.07144359556771177, '8': 0.05279920075611914, '15': 0.14450495559625123, '9': 0.03181526582418545, '10': 0.05885906219895707, '11': 0.02188284889175391, '12': 0.02188284889175391, '13': 0.02188284889175391, '14': 0.02188284889175391, '16': 0.13220421915470792}
../../_images/graph_example_32_1.svg

2.5. モジュラリティの確認

2.6. データセットの用意

!curl -O https://raw.githubusercontent.com/naltoma/datamining_intro/master/4-casestudy/graph/graph_modularity_simple.txt
!head graph_modularity_simple.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    84  100    84    0     0    518      0 --:--:-- --:--:-- --:--:--   521
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5

2.6.1. girvan_newman方式のモジュラリティ評価

%matplotlib inline
from networkx.algorithms.community.centrality import girvan_newman

filename = "graph_modularity_simple.txt"
G = nx.read_edgelist(filename)

comp = girvan_newman(G)
print(tuple(sorted(c) for c in next(comp)))

#nx.nx_agraph.view_pygraphviz(G, prog='sfdp')
nx.draw_networkx(G)
(['1', '2', '3', '4', '5'], ['A', 'B', 'C', 'D', 'E'])
../../_images/graph_example_37_1.png
from networkx import community

clusters = []
clusters.append([{'1', '2', '3', '4', '5'}, {'A', 'B', 'C', 'D', 'E'}])
clusters.append([{'2', '3', '4', '5'}, {'1', 'A', 'B', 'C', 'D', 'E'}])
clusters.append([{'3', '4', '5'}, {'2', '1', 'A', 'B', 'C', 'D', 'E'}])
clusters.append([{'4', '5'}, {'2', '3', '1', 'A', 'B', 'C', 'D', 'E'}])
clusters.append([{'5'}, {'2', '3', '4', '1', 'A', 'B', 'C', 'D', 'E'}])

for i, cluster in enumerate(clusters):
    modularity = community.modularity(G, cluster)
    print('modularity(cluster{}) = {}'.format(i, modularity))
modularity(cluster0) = 0.45238095238095233
modularity(cluster1) = 0.2811791383219955
modularity(cluster2) = 0.12244897959183668
modularity(cluster3) = 0.02267573696145122
modularity(cluster4) = -0.018140589569161043