If you would like to learn more about networks and network analysis, please buy a copy of my book!
Hello everyone. Welcome back to another post in this adventure. On day 8, I used our Wikipedia crawler to create a network edge list related to the band Wilco, my favorite music group.
Data scientists often complain that most of their work is cleaning data, or data wrangling. Network science is no different. If you do not clean your networks, then your network metrics will describe a mixed network of on-topic and off-topic nodes, not the topic you actually care about.
Today, we’re going to take our Wilco network, identify the nodes that are not related to the band Wilco, and then we’re going to remove them. At the end, we will have a pure network, and this will be useful for downstream tasks.
Today’s code is available on GitHub.
Spot Checks
The first thing I always do is a spot check. If I’m working with a small enough network, I’ll visualize it. If it’s a network of thousands of nodes, I’ll look at centralities and PageRank. Today’s network is tiny, so I’ll visualize it.
This quick visualization actually already shows what we need to do. I know the band Wilco, so I have some domain knowledge on this topic. I can see that there are three sections in this network.
I can see that there is some junk on the left, mostly related to Space Quest, but not entirely. And I can see that there is some junk at the bottom related to the node “Procedure word”.
Think of some nodes as bridges. Which nodes do you see that tie the pure section to the junk sections? Try to guess. “Procedure word” is one of them. Look for more.
If you identify these junk bridge nodes and remove them, the network begins to untangle itself. Before we do that, let’s look at the overall network’s Edge Betweenness Centralities.
Edge Betweenness Centrality
Edge Betweenness Centrality measures how many shortest paths pass through each edge. Any two connected nodes, A and B, have at least one shortest path between them. If an edge lies on the shortest paths between many pairs of nodes, then that edge is a bridge between many nodes.
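To make the idea concrete, here is a minimal sketch that assumes the networkx library and uses a made-up toy graph, not the Wilco data. Two triangles are joined by a single edge, and a hypothetical helper named `top_edges` ranks edges by their score. Every shortest path between the two triangles has to cross the joining edge, so it should come out on top.

```python
import networkx as nx

# Toy graph (illustrative only): two triangles joined by one bridge edge.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),  # first cluster
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),  # second cluster
    ("C", "X"),                          # the only link between the clusters
])


def top_edges(graph, n=10):
    """Return the n edges with the highest edge betweenness, highest first."""
    scores = nx.edge_betweenness_centrality(graph)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


ranked = top_edges(G)
for edge, score in ranked:
    print(edge, round(score, 3))
```

The same helper could be pointed at any undirected graph; on a real crawl, the edges at the top of the list are the first candidates to inspect by hand.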
Here are the top ten edges with the highest Edge Betweenness Centrality scores:
Notice that the top one is junk. It’s linking Wilco to Space Quest. Our Wikipedia crawler made a link between the two based on the name: Roger Wilco is the protagonist of Space Quest. The top edge should definitely be removed.
The second link is between Procedure word and Wilco. This also looks like junk that should be removed. Further down, there is also a link between Allied Communication Procedures and Procedure word that should be removed.
But before we do anything with these edges, let’s first manually identify junk nodes, remove them, and then see how the network looks. We’ll revisit Edge Betweenness Centrality again a bit further down.
Manually Removing Nodes
After looking at the network with my eyeballs, I noticed a few junk nodes. Here is how I set them aside, and how I remove them from the network.
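A minimal sketch of this step, assuming networkx. The edges and the drop list below are illustrative stand-ins; the nodes actually removed from the crawled network may differ.

```python
import networkx as nx

# Toy stand-in for the crawled network (illustrative edges, not the real data).
G = nx.Graph()
G.add_edges_from([
    ("Wilco", "Jeff Tweedy"),
    ("Wilco", "Procedure word"),
    ("Procedure word", "Allied Communication Procedures"),
    ("Wilco", "Space Quest"),
])

# Hypothetical drop list; the nodes removed in the real network may differ.
drop_nodes = ["Procedure word"]
G.remove_nodes_from(drop_nodes)

print(sorted(G.nodes))
```

Note that `remove_nodes_from` silently ignores names that are not in the graph, so a typo in the drop list will not raise an error. Comparing the node count before and after is a cheap safeguard.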
The first line specifies the nodes to remove, and the second line removes them. How does the network look now, after removing just a few nodes? First, here is how it originally looked:
And here is how it looks now:
OUTSTANDING. Already, one of the junk clusters has shot off to the left, and the Space Quest stuff is held together with the Wilco network by a single edge.
NOW is the right time to look at Edge Betweenness Centralities again. Dropping just a few key junk nodes was enough to do most of the cleanup, and snipping one edge will be enough to finalize the separation. Here’s the new top ten Edge Betweenness Centralities:
Notice that top edge. Notice how it looks visually on the network. It has a much higher centrality score because it sits between every Wilco node and every remaining Space Quest node. We can easily remove this one edge.
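As a rough sketch of snipping the top edge, again assuming networkx: the toy graph below is invented for illustration, with one on-topic cluster and one junk cluster held together by a single edge.

```python
import networkx as nx

# Toy graph (illustrative edges): a "keep" cluster and a "junk" cluster.
G = nx.Graph()
G.add_edges_from([
    ("Wilco", "Jeff Tweedy"),
    ("Wilco", "Yankee Hotel Foxtrot"),
    ("Jeff Tweedy", "Yankee Hotel Foxtrot"),
    ("Space Quest", "Roger Wilco"),
    ("Space Quest", "Sierra Entertainment"),
    ("Roger Wilco", "Sierra Entertainment"),
    ("Wilco", "Roger Wilco"),  # the single edge holding the clusters together
])

scores = nx.edge_betweenness_centrality(G)
top_edge = max(scores, key=scores.get)  # edge with the highest score
G.remove_edge(*top_edge)                # snip it

print(top_edge, nx.number_connected_components(G))
```

It is worth eyeballing `top_edge` before removing it. On a messier network, the highest-scoring edge could be a legitimate on-topic link.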
The first line is just getting the top edge, by name. The second line is removing the edge from the graph. How does the network look, now?
Beautiful. The three parts of the network have split apart. The Wilco network is in the center. We want to keep that, and disregard the rest.
Let’s look at each connected component, keep the Wilco one, disregard the rest, and persist the data so that we have a clean edge list for later days.
Connected Components
In my book, I describe networks as having connected components that resemble islands and continents. Networks will often have one super cluster, then a bunch of smaller but still sizable clusters, and many isolated nodes.
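The islands-and-continents picture can be sketched on a toy graph, assuming networkx. The node names here are placeholders: one small "continent", one "island", and one isolated node.

```python
import networkx as nx

# Toy graph: a four-node chain, a two-node pair, and a node with no edges.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])
G.add_node("loner")

# Sort components largest first so that index 0 is the biggest cluster.
components = sorted(nx.connected_components(G), key=len, reverse=True)
for i, nodes in enumerate(components):
    print(i, len(nodes), sorted(nodes))
```

Each component is a plain Python set of node names, so it can be inspected, filtered, or passed to `G.subgraph` directly.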
Let’s take a look at the connected components that now exist in this graph.
This shows us that there are three connected components in our graph. The first component has 41 nodes, the second component has ten nodes, and the third component has four nodes. Let’s look at each of them.
This is the first connected component. I can see Jeff Tweedy and several album names. This is definitely the Wilco network. This is the good stuff. Here’s the next component:
This component is clearly related to Space Quest, not Wilco. Finally, here is the third component:
Keep the Wilco Graph
Check the code and you’ll see that the next step is to overwrite G with the Wilco connected component, and then I persist the network edge list as a file for tomorrow.
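A minimal sketch of those two steps, assuming networkx and Python's built-in csv module. The toy edges, the output path, and the column names are all hypothetical. This variation picks the component that contains a known on-topic node instead of relying on component order, which may not match how the original code selects it.

```python
import csv
import os
import tempfile

import networkx as nx

# Toy graph with an on-topic component and an off-topic one (illustrative).
G = nx.Graph()
G.add_edges_from([
    ("Wilco", "Jeff Tweedy"),
    ("Wilco", "Summerteeth"),
    ("Space Quest", "Roger Wilco"),
])

# Keep only the component that contains a known on-topic node.
# .copy() turns the read-only subgraph view into an independent graph.
keep = nx.node_connected_component(G, "Wilco")
G = G.subgraph(keep).copy()

# Persist as a two-column CSV edge list. The csv module quotes names that
# contain commas. The path and column names here are hypothetical.
outfile = os.path.join(tempfile.gettempdir(), "wilco_edgelist_clean.csv")
with open(outfile, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target"])
    writer.writerows(G.edges())
```

Writing through the csv module is a deliberate choice here: Wikipedia page titles can contain commas and spaces, which a naive delimiter-joined file would mangle.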
Now we have a clean Wilco network! With this, I can crawl each of the Wilco-related Wikipedia pages and get clean content. We will use that content for Natural Language Processing, and for creating Entity graphs! More soon!
That’s all for today! Thanks for reading!