I recently came across a powerful tool called igraph that provides some very capable graph mining functionality. Here are some interesting things that I have found.

## Create a Graph

A graph is composed of nodes and edges, and both can carry a set of properties (name/value pairs). Edges can be directed or undirected, and weights can be attached to them.

```r
> library(igraph)
> # Create a directed graph
> g <- graph(c(0,1, 0,2, 1,3, 0,3), directed=T)
> g
Vertices: 4
Edges: 4
Directed: TRUE
Edges:

[0] 0 -> 1
[1] 0 -> 2
[2] 1 -> 3
[3] 0 -> 3
> # Create a directed graph using adjacency matrix
> m <- matrix(runif(4*4), nrow=4)
> m
          [,1]      [,2]      [,3]      [,4]
[1,] 0.4086389 0.2160924 0.1557989 0.2896239
[2,] 0.4669456 0.1071071 0.1290673 0.3715809
[3,] 0.2031678 0.3911691 0.5906273 0.7417764
[4,] 0.8808119 0.7687493 0.9734323 0.4487252
> g <- graph.adjacency(m > 0.5)
> g
Vertices: 4
Edges: 5
Directed: TRUE
Edges:

[0] 2 -> 2
[1] 2 -> 3
[2] 3 -> 0
[3] 3 -> 1
[4] 3 -> 2
> plot(g, layout=layout.fruchterman.reingold)
```

igraph also provides various convenient ways to create patterned graphs.

```r
> # Create a full graph
> g1 <- graph.full(4)
> g1
Vertices: 4
Edges: 6
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
> # Create a ring graph
> g2 <- graph.ring(3)
> g2
Vertices: 3
Edges: 3
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 1 -- 2
[2] 0 -- 2
> # Combine 2 graphs
> g <- g1 %du% g2
> g
Vertices: 7
Edges: 9
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
[6] 4 -- 5
[7] 5 -- 6
[8] 4 -- 6
> graph.difference(g, graph(c(0,1,0,2), directed=F))
Vertices: 7
Edges: 7
Directed: FALSE
Edges:

[0] 0 -- 3
[1] 1 -- 3
[2] 1 -- 2
[3] 2 -- 3
[4] 4 -- 6
[5] 4 -- 5
[6] 5 -- 6
> # Create a lattice
> g1 = graph.lattice(c(3,4,2))
> # Create a tree
> g2 = graph.tree(12, children=2)
> plot(g1, layout=layout.fruchterman.reingold)
> plot(g2, layout=layout.reingold.tilford)
```

igraph also provides two graph generation mechanisms. A "random graph" creates an edge at random between any two nodes. "Preferential attachment" assigns a higher probability of creating an edge to an existing node that already has a high in-degree (the rich-get-richer model).

```r
> # Generate random graph, fixed probability
> g <- erdos.renyi.game(20, 0.3)
> plot(g, layout=layout.fruchterman.reingold, vertex.label=NA, vertex.size=5)
> # Generate random graph, fixed number of arcs
> g <- erdos.renyi.game(20, 15, type='gnm')
> # Generate preferential attachment graph
> g <- barabasi.game(60, power=1, zero.appeal=1.3)
```
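To make the two generation models concrete, here is a minimal base-R sketch of each. It does not use igraph and is not how igraph implements them; the function names `gnp_adjacency` and `pref_attach_edges` are made up for illustration, vertices are numbered from 1, and the preferential-attachment version assumes one new edge per vertex with attachment probability proportional to in-degree plus `zero_appeal`.

```r
# G(n, p): each of the choose(n, 2) possible undirected edges is included
# independently with probability p. Returns a symmetric 0/1 adjacency matrix.
gnp_adjacency <- function(n, p) {
  adj <- matrix(0L, nrow = n, ncol = n)
  possible <- upper.tri(adj)               # one slot per unordered pair
  adj[possible] <- rbinom(sum(possible), size = 1, prob = p)
  adj + t(adj)                             # mirror to the lower triangle
}

# Simplified preferential attachment: vertex v (v = 2..n) adds one edge to an
# earlier vertex chosen with probability proportional to in-degree + zero_appeal.
# Returns an (n - 1) x 2 matrix of (from, to) pairs.
pref_attach_edges <- function(n, zero_appeal = 1) {
  indeg <- numeric(n)
  edges <- matrix(0L, nrow = n - 1, ncol = 2)
  for (v in 2:n) {
    old <- seq_len(v - 1)
    w <- indeg[old] + zero_appeal
    target <- old[sample.int(length(old), 1, prob = w)]
    edges[v - 1, ] <- c(v, target)
    indeg[target] <- indeg[target] + 1
  }
  edges
}

set.seed(42)
a <- gnp_adjacency(20, 0.3)
sum(a) / 2                      # edge count; 0.3 * choose(20, 2) = 57 on average
pe <- pref_attach_edges(60, zero_appeal = 1.3)
head(sort(table(pe[, 2]), decreasing = TRUE))   # a few early vertices collect most edges
```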

## Basic Graph Algorithms

This section covers how to use igraph to run some very basic graph algorithms.

*Minimum Spanning Tree* algorithm finds a tree that connects all the nodes of a connected graph while keeping the sum of edge weights at a minimum.

```r
> # Create the graph and assign random edge weights
> g <- erdos.renyi.game(12, 0.35)
> E(g)$weight <- round(runif(length(E(g))),2) * 50
> plot(g, layout=layout.fruchterman.reingold, edge.label=E(g)$weight)
> # Compute the minimum spanning tree
> mst <- minimum.spanning.tree(g)
> plot(mst, layout=layout.reingold.tilford, edge.label=E(mst)$weight)
```
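For readers curious about what a minimum spanning tree computation involves, the following is a small base-R sketch of Kruskal's algorithm. It is one standard way to build such a tree, not necessarily the method igraph uses internally; `kruskal_mst` and the toy edge list are invented for this example, and vertices are numbered from 1.

```r
# Kruskal's algorithm: scan edges by increasing weight and keep an edge only
# if it joins two vertices that are not yet in the same component.
kruskal_mst <- function(n, edges) {
  # edges: data frame with columns from, to, weight (vertices numbered 1..n)
  parent <- seq_len(n)                 # union-find forest, each vertex its own root
  find_root <- function(v) {
    while (parent[v] != v) v <- parent[v]
    v
  }
  keep <- logical(nrow(edges))
  for (i in order(edges$weight)) {
    ra <- find_root(edges$from[i])
    rb <- find_root(edges$to[i])
    if (ra != rb) {
      parent[ra] <- rb                 # merge the two components
      keep[i] <- TRUE
    }
  }
  edges[keep, ]
}

toy <- data.frame(
  from   = c(1, 1, 2, 2, 3),
  to     = c(2, 3, 3, 4, 4),
  weight = c(4, 1, 2, 5, 3)
)
tree <- kruskal_mst(4, toy)
tree               # keeps edges 1-3, 2-3 and 3-4
sum(tree$weight)   # 6
```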

*Connected Component* algorithms find the islands of nodes that are
interconnected with each other; in other words, one can traverse from any
node in an island to any other via a path. Notice that connectivity is
symmetric in an undirected graph, but this is not necessarily the case in a
directed graph (i.e., it is possible that nodeA can reach nodeB while nodeB
cannot reach nodeA). Therefore, in a directed graph there is a concept of
"strong" connectivity, where two nodes are considered connected only when
each is reachable from the other. "Weak" connectivity means the nodes are
connected when edge directions are ignored.

```r
> g <- graph(c(0, 1, 1, 2, 2, 0, 1, 3, 3, 4, 4, 5, 5, 3, 4, 6, 6, 7, 7, 8, 8, 6, 9, 10, 10, 11, 11, 9))
> # Nodes reachable from node4
> subcomponent(g, 4, mode="out")
[1] 4 5 6 3 7 8
> # Nodes who can reach node4
> subcomponent(g, 4, mode="in")
[1] 4 3 1 5 0 2
> clusters(g, mode="weak")
$membership
 [1] 0 0 0 0 0 0 0 0 0 1 1 1

$csize
[1] 9 3

$no
[1] 2

> myc <- clusters(g, mode="strong")
> myc
$membership
 [1] 1 1 1 2 2 2 3 3 3 0 0 0

$csize
[1] 3 3 3 3

$no
[1] 4

> mycolor <- c('green', 'yellow', 'red', 'skyblue')
> V(g)$color <- mycolor[myc$membership + 1]
> plot(g, layout=layout.fruchterman.reingold)
```
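The difference between weak and strong connectivity can be checked by hand on the same 12-node edge list. The sketch below is plain base R, not igraph: it builds a reachability matrix with Warshall's transitive closure and then groups nodes. The helper names `reachability` and `label_components` are made up for this illustration, and the 0-based ids from the post are shifted by one to fit R's 1-based indexing.

```r
# Edge list from the connected-component example (0-based ids as in the post).
e <- matrix(c(0,1, 1,2, 2,0, 1,3, 3,4, 4,5, 5,3, 4,6,
              6,7, 7,8, 8,6, 9,10, 10,11, 11,9),
            ncol = 2, byrow = TRUE) + 1    # shift to 1-based indexing
n <- 12
adj <- matrix(FALSE, n, n)
adj[e] <- TRUE                             # adj[i, j]: there is an edge i -> j

# Warshall's transitive closure: reach[i, j] is TRUE when j can be reached
# from i by following edges (every node reaches itself).
reachability <- function(adj) {
  reach <- adj | (diag(nrow(adj)) > 0)
  for (k in seq_len(nrow(adj))) {
    reach <- reach | outer(reach[, k], reach[k, ], "&")
  }
  reach
}

# Number the groups of an equivalence relation given as a logical matrix.
label_components <- function(connected) {
  membership <- integer(nrow(connected))
  id <- 0
  for (v in seq_len(nrow(connected))) {
    if (membership[v] == 0) {
      id <- id + 1
      membership[connected[v, ]] <- id
    }
  }
  membership
}

reach  <- reachability(adj)
strong <- label_components(reach & t(reach))         # mutually reachable
weak   <- label_components(reachability(adj | t(adj)))  # directions ignored

which(reach[5, ]) - 1   # reachable from node4 (0-based): 3 4 5 6 7 8
table(weak)             # two components, sizes 9 and 3
table(strong)           # four components, each of size 3
```

The transitive closure costs on the order of n^3 operations, which is fine for a dozen nodes but is only meant to show the definitions at work.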

*Shortest Path* is one of the most commonly used algorithms
in many scenarios; it aims to find the shortest path from nodeA to
nodeB. igraph uses breadth-first search if the graph is
unweighted (i.e., every weight is 1), Dijkstra's algorithm if the weights are
positive, and the Bellman-Ford algorithm otherwise, to handle negatively
weighted edges.

```r
> g <- erdos.renyi.game(12, 0.25)
> plot(g, layout=layout.fruchterman.reingold)
> pa <- get.shortest.paths(g, 5, 9)[[1]]
> pa
[1] 5 0 4 9
> V(g)[pa]$color <- 'green'
> E(g)$color <- 'grey'
> E(g, path=pa)$color <- 'red'
> E(g, path=pa)$width <- 3
> plot(g, layout=layout.fruchterman.reingold)
```
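For the unweighted case, breadth-first search is simple enough to sketch in a few lines of base R. This is an illustration of the idea only, not igraph's code; `bfs_shortest_path` and the toy adjacency list are hypothetical, and vertices are numbered from 1.

```r
# Breadth-first search on an adjacency list. Returns the vertices on one
# shortest path from `from` to `to`, or integer(0) if `to` is unreachable.
bfs_shortest_path <- function(adj_list, from, to) {
  prev <- rep(NA_integer_, length(adj_list))   # predecessor in the BFS tree
  visited <- rep(FALSE, length(adj_list))
  visited[from] <- TRUE
  queue <- from
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    if (v == to) break
    for (w in adj_list[[v]]) {
      if (!visited[w]) {
        visited[w] <- TRUE
        prev[w] <- v
        queue <- c(queue, w)
      }
    }
  }
  if (!visited[to]) return(integer(0))
  path <- to
  while (path[1] != from) path <- c(prev[path[1]], path)   # walk back to the start
  path
}

# Undirected toy graph: 1-2, 1-3, 2-4, 3-4, 4-5; vertex 6 is isolated.
toy <- list(c(2, 3), c(1, 4), c(1, 4), c(2, 3, 5), c(4), integer(0))
bfs_shortest_path(toy, 1, 5)   # 1 2 4 5
bfs_shortest_path(toy, 1, 6)   # integer(0): no path
```

Because BFS visits nodes in order of hop count, the first time it reaches the target it has found a path with the fewest edges. Ties (here 1-2-4-5 versus 1-3-4-5) are broken by neighbor order.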

## Graph Statistics

There are many statistics that we can look at to get a general idea of the shape of the graph. At the highest level, we can look at summary statistics of the whole graph. These include:

- Size of the graph (number of nodes and edges)
- Density of the graph: is the graph dense (|E| proportional to |V|^2) or sparse (|E| proportional to |V|)?
- Connectedness: is the graph highly connected (a large portion of nodes can reach each other), or is it disconnected (many islands)?
- Diameter of the graph: the longest shortest-path distance between any two nodes
- Reciprocity: in a directed graph, how symmetric the relationships are
- Distribution of in/out "degrees"

```r
> # Create a random graph
> g <- erdos.renyi.game(200, 0.01)
> plot(g, layout=layout.fruchterman.reingold, vertex.label=NA, vertex.size=3)
> # No of nodes
> length(V(g))
[1] 200
> # No of edges
> length(E(g))
[1] 197
> # Density (No of edges / possible edges)
> graph.density(g)
[1] 0.009899497
> # Number of islands
> clusters(g)$no
[1] 34
> # Global cluster coefficient:
> # (closed triplets / all triplets)
> transitivity(g, type="global")
[1] 0.015
> # Edge connectivity, 0 since graph is disconnected
> edge.connectivity(g)
[1] 0
> # Same as graph adhesion
> graph.adhesion(g)
[1] 0
> # Diameter of the graph
> diameter(g)
[1] 18
> # Reciprocity of the graph
> reciprocity(g)
[1] 1
> degree.distribution(g)
[1] 0.135 0.280 0.315 0.110 0.095 0.050 0.005 0.010
> plot(degree.distribution(g), xlab="node degree")
> lines(degree.distribution(g))
```
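Two of the numbers in that run can be reproduced with plain arithmetic, which is a handy sanity check. The snippet below uses only base R and the figures printed above (200 nodes, 197 edges, and the degree distribution); the variable names are mine.

```r
n_nodes <- 200
n_edges <- 197

# Density of an undirected graph: edges present / edges possible.
density <- n_edges / choose(n_nodes, 2)   # 197 / 19900
density                                   # 0.009899497, matching graph.density(g)

# The degree distribution lists the share of nodes with degree 0, 1, 2, ...
# Its mean must equal 2 * |E| / |V|, since every edge adds 1 to two degrees.
dd <- c(0.135, 0.280, 0.315, 0.110, 0.095, 0.050, 0.005, 0.010)
mean_degree <- sum(0:7 * dd)
mean_degree                               # 1.97
2 * n_edges / n_nodes                     # 1.97
```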

Drilling down a level, we can also look at statistics for each pair of nodes, such as:

- Connectivity between two nodes: the number of distinct paths between them that share no edges (i.e., how many edges need to be removed to disconnect them)
- Shortest path between two nodes
- Trust between two nodes (a function of the number of distinct paths and the distance of each path)

```r
> # Create a random graph
> g <- erdos.renyi.game(9, 0.5)
> plot(g, layout=layout.fruchterman.reingold)
> # Compute the shortest path matrix
> shortest.paths(g)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    1    3    1    2    2    1    3    2
 [2,]    1    0    2    2    3    2    2    2    1
 [3,]    3    2    0    2    1    2    2    2    1
 [4,]    1    2    2    0    3    1    2    2    1
 [5,]    2    3    1    3    0    3    1    3    2
 [6,]    2    2    2    1    3    0    2    1    1
 [7,]    1    2    2    2    1    2    0    2    1
 [8,]    3    2    2    2    3    1    2    0    1
 [9,]    2    1    1    1    2    1    1    1    0
> # Compute the connectivity matrix
> M <- matrix(rep(0, 81), nrow=9)
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    0    0    0    0    0    0    0    0
 [2,]    0    0    0    0    0    0    0    0    0
 [3,]    0    0    0    0    0    0    0    0    0
 [4,]    0    0    0    0    0    0    0    0    0
 [5,]    0    0    0    0    0    0    0    0    0
 [6,]    0    0    0    0    0    0    0    0    0
 [7,]    0    0    0    0    0    0    0    0    0
 [8,]    0    0    0    0    0    0    0    0    0
 [9,]    0    0    0    0    0    0    0    0    0
> for (i in 0:8) {
+   for (j in 0:8) {
+     if (i == j) {
+       M[i+1, j+1] <- -1
+     } else {
+       M[i+1, j+1] <- edge.connectivity(g, i, j)
+     }
+   }
+ }
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]   -1    2    2    3    2    3    3    2    3
 [2,]    2   -1    2    2    2    2    2    2    2
 [3,]    2    2   -1    2    2    2    2    2    2
 [4,]    3    2    2   -1    2    3    3    2    3
 [5,]    2    2    2    2   -1    2    2    2    2
 [6,]    3    2    2    3    2   -1    3    2    3
 [7,]    3    2    2    3    2    3   -1    2    3
 [8,]    2    2    2    2    2    2    2   -1    2
 [9,]    3    2    2    3    2    3    3    2   -1
```

## Centrality Measures

At the finest-grained level, we can look at statistics of individual nodes. A centrality score measures the social importance of a node in terms of how "central" it is, based on a number of measures:

- Degree centrality gives a higher score to a node that has a high in/out-degree
- Closeness centrality gives a higher score to a node that has a short path distance to every other node
- Betweenness centrality gives a higher score to a node that sits on many shortest paths between other node pairs
- Eigenvector centrality gives a higher score to a node if it connects to many high-score nodes
- Local cluster coefficient measures how interconnected a node's neighbors are with each other; when they are highly interconnected, the node itself becomes less important.

```r
> # Degree
> degree(g)
[1] 2 2 2 2 2 3 3 2 6
> # Closeness (inverse of average dist)
> closeness(g)
[1] 0.4444444 0.5333333 0.5333333 0.5000000
[5] 0.4444444 0.5333333 0.6153846 0.5000000
[9] 0.8000000
> # Betweenness
> betweenness(g)
[1]  0.8333333  2.3333333  2.3333333
[4]  0.0000000  0.8333333  0.5000000
[7]  6.3333333  0.0000000 18.8333333
> # Local cluster coefficient
> transitivity(g, type="local")
[1] 0.0000000 0.0000000 0.0000000 1.0000000
[5] 0.0000000 0.6666667 0.0000000 1.0000000
[9] 0.1333333
> # Eigenvector centrality
> evcent(g)$vector
[1] 0.3019857 0.4197153 0.4197153 0.5381294
[5] 0.3019857 0.6693142 0.5170651 0.5381294
[9] 1.0000000
> # Now rank them
> order(degree(g))
[1] 1 2 3 4 5 8 6 7 9
> order(closeness(g))
[1] 1 5 4 8 2 3 6 7 9
> order(betweenness(g))
[1] 4 8 6 1 5 2 3 7 9
> order(evcent(g)$vector)
[1] 1 5 2 3 7 4 8 6 9
```
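To see where the degree and closeness numbers come from, here is a small base-R sketch on a toy 5-node path graph (1-2-3-4-5), chosen by me because the answers are easy to verify by eye. It computes all-pairs distances with the Floyd-Warshall recurrence and takes closeness as the inverse of the average distance to the other nodes, which is the definition noted in the comment above. It is not igraph's implementation and assumes a connected, undirected, unweighted graph.

```r
# Toy undirected path graph: 1 - 2 - 3 - 4 - 5
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5))
n <- 5
adj <- matrix(0, n, n)
adj[edges] <- 1
adj <- adj + t(adj)            # symmetric adjacency matrix

# Degree centrality: number of neighbors.
deg <- rowSums(adj)
deg                            # 1 2 2 2 1

# All-pairs shortest distances (Floyd-Warshall): allow paths through
# intermediate vertex k, one k at a time.
d <- ifelse(adj > 0, 1, Inf)
diag(d) <- 0
for (k in seq_len(n)) {
  d <- pmin(d, outer(d[, k], d[k, ], "+"))
}

# Closeness: inverse of the average distance to the other n - 1 nodes.
closeness <- (n - 1) / rowSums(d)
closeness                      # 0.4, 4/7, 2/3, 4/7, 0.4
which.max(closeness)           # 3: the middle of the path is the most central
```

The same check works on the 9-node output above: the node with degree 6 has six neighbors at distance 1 and two nodes at distance 2, so its closeness is 8 / 10 = 0.8.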

From his studies, Drew Conway has found that people with low eigenvector centrality but high betweenness centrality are important gatekeepers, while people with high eigenvector centrality but low betweenness centrality have direct contact with important people. So let's plot eigenvector centrality against betweenness centrality.

```r
> # Create a graph
> g1 <- barabasi.game(100, directed=F)
> g2 <- barabasi.game(100, directed=F)
> g <- g1 %u% g2
> lay <- layout.fruchterman.reingold(g)
> # Plot the eigenvector and betweenness centrality
> plot(evcent(g)$vector, betweenness(g))
> # Label each point with its (0-based) node id; the union has 100 nodes
> text(evcent(g)$vector, betweenness(g), 0:99, cex=0.6, pos=4)
> V(g)[12]$color <- 'red'
> V(g)[8]$color <- 'green'
> plot(g, layout=lay, vertex.size=8, vertex.label.cex=0.6)
```

With these basics of graph mining covered, I will look at some specific examples of social network analysis in future posts.