Day 47 of #100daysofnetworks
Introducing KARATE CLUB! Graph Machine Learning!
Today, I will be introducing a very special Python library, the winner of the “Cool Name Award”, in my opinion: KARATE CLUB. Yes, there is a Python Library called Karate Club, and it is all about Graph Machine Learning. If you like Graphs and you like Machine Learning, you need to know about this.
Two Approaches to Graphs and ML
In my own book, I describe my approach to Graphs and ML. I care a lot about science and insights, so Interpretable and Explainable Machine Learning is important to me.
Learn about Interpretable Machine Learning:
My ways make well-known models simple to use. You can use my approach with RandomForest, and you can inspect the Feature Importances to learn what it learned about the graph, for instance. RandomForest, XGBoost, Regression, whatever. Have fun.
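To make that concrete, here is a minimal sketch of the general idea, not the exact method from the book. It computes a few per-node features with NetworkX, trains a scikit-learn RandomForest on them, and reads the feature importances. The choice of Zachary's karate club graph, the four features, and the `club` label as the target are all assumptions made for this example.

```python
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

# Zachary's karate club ships with NetworkX; each node carries a 'club' label.
G = nx.karate_club_graph()
nodes = list(G.nodes())

# Illustrative node features (an assumption, not a prescribed feature set).
degree = dict(G.degree())
clustering = nx.clustering(G)
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

feature_names = ["degree", "clustering", "pagerank", "betweenness"]
X = [[degree[n], clustering[n], pagerank[n], betweenness[n]] for n in nodes]
y = [G.nodes[n]["club"] for n in nodes]

# Fixed random_state so the run is repeatable.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Higher importance means the forest leaned on that feature more.
importances = dict(zip(feature_names, clf.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The printed ranking is the interpretable part: it tells you which graph properties the model relied on to separate the two clubs.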
So, I have my own ways of doing ML on graph data that prioritize transparency and insightfulness over prediction, but my way is not the only way, and these other ways are also useful.
The book Graph Machine Learning shows several other ways to do Machine Learning on graph data, such as using Graph Neural Networks. It is one of my favorite books, because it is very different than other ML books.
But the models have different names, so it is less straightforward, and you have to do some additional learning. SCD? What is SCD? It’s impressive? Why? Why do I know all about XGBoost but not SCD? Can XGBoost do billion-scale community detection? No, it cannot.
The models are different, so you have to do more work to learn about them, but I find that fun. That’s no different than traditional ML, it’s just less known and less explored.
If you don’t understand this paragraph, you should read my book and do some reading on model explainability and interpretability. Those are important topics that don’t get enough attention.
Installation and Setup
In 2023, when my book was published, the book Graph Machine Learning came out right about the same time. It was a thrill for me to use the graphs I had created with a completely different toolset, and I learned a lot.
Back then, Karate Club worked fine on my laptop.
I tried to get it set up on my new laptop this week, but it seems not to have kept up with NetworkX, so if you are using a modern NetworkX, Karate Club is going to give you problems.
Instead of getting frustrated and giving up, I found a shortcut for you all. You can be using it in five minutes from now, if you want it bad enough:
Go to deepnote.com and get yourself a free account. I personally use a paid account, as that is useful for consulting.
Create a new project. Call it something related to Karate Club.
On the bottom left, set your Python version to “3.10 for data science”
That is what we need, to get unstuck. The rest are pip installs. I tried to use Karate Club on 3.11 and the install never completed. I tried on 3.12 and it outright failed. I don’t have time to troubleshoot this right now. Please feel free, and let us know if you find a fix for Python 3.12.
But from what I am hearing, Karate Club might just be a bit out of date with NetworkX and might need to catch up. Maybe some of you readers can go support their project. Maybe I will try to get involved, someday.
For our tests and use, you will need to run one pip install command:
!pip install karateclub
If you don’t get an error, we should be good to go.
I used this code for my imports and to test Karate Club.
Here is the code for copy/paste:
from karateclub.community_detection.non_overlapping.scd import SCD
import networkx as nx
import pandas as pd
Green check mark shows the import did not fail and that the installation was successful.
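If the import does fail, it helps to know exactly which versions ended up in the environment. This is a small sketch using only the standard library. The helper name `installed_version` and the package list are just choices for this example.

```python
import sys
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Print the interpreter version and the packages this post relies on.
print("Python:", sys.version.split()[0])
for pkg in ("karateclub", "networkx", "pandas"):
    print(f"{pkg}: {installed_version(pkg)}")
```

Pasting that output into a bug report or a comment makes it much easier for someone else to reproduce a version conflict.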
Quick Guide: Usage
This is a speed guide, not a complete one. This is the FIRST exploration of Karate Club for this blog, and we will do more. We just opened up a whole world of ML models. We are not stopping here.
I am mainly just testing that the install went well, and getting set up for more exploration. You should do the same. Making exploration as simple as possible is the goal right now, not chasing insights.
Today, I want to load SCD, a Community Detection algorithm that is claimed to be as fast and accurate as the Louvain Method. So, let’s try to get it working, real quick. We already did the imports, above.
In this screenshot, I am:
Loading the data into df
Creating a graph G with the edgelist from df
Instantiating SCD as ‘model’
Fitting ‘model’ on G
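Those four steps can be sketched roughly like this. The toy edgelist, the `source` and `target` column names, and the helper `fit_scd` are assumptions for illustration, not the contents of the original screenshot. One extra step is included: to my knowledge, Karate Club expects nodes to be labeled with consecutive integers starting at 0, so the sketch relabels the graph and keeps a lookup back to the original names.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy edgelist; 'source' and 'target' column names are assumptions.
df = pd.DataFrame(
    {
        "source": ["Alice", "Alice", "Rabbit", "Queen"],
        "target": ["Rabbit", "Queen", "Queen", "King"],
    }
)

# Build an undirected graph from the edgelist.
G = nx.from_pandas_edgelist(df, source="source", target="target")

# Relabel nodes to 0..n-1 and keep a lookup back to the original names.
G_int = nx.convert_node_labels_to_integers(G, label_attribute="name")
id_to_name = nx.get_node_attributes(G_int, "name")


def fit_scd(graph, id_to_name):
    """Fit SCD and return {original node name: community id}."""
    # Imported lazily so the graph prep above runs even without Karate Club.
    from karateclub.community_detection.non_overlapping.scd import SCD

    model = SCD()
    model.fit(graph)
    memberships = model.get_memberships()
    return {id_to_name[node]: comm for node, comm in memberships.items()}


# communities = fit_scd(G_int, id_to_name)  # uncomment once karateclub is installed
```

Keeping the `id_to_name` lookup around means you can always translate community assignments back into account names when you explore the results.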
It is that easy. It is that easy. It is that easy! Play with Karate Club. It is that easy to get started today.
And SCD is good. In the next article, I will compare it to Louvain, apples-to-apples. We’ll explore the communities side by side, SCD vs Louvain Cage Match style. Science should be fun.
Ok, that Alice network is a toy network. Let’s try something with meat.
What are we looking at:
I figured out how to crawl Bluesky
I crawled Bluesky and created an edgelist containing 40k nodes and 109k edges
This is not a toy graph. This is real-world graph data. I am in this.
It trained against 27k nodes and 69k edges in 34 seconds
It trained against 41k nodes and 109k edges in 52 seconds
SCD is claimed to be billion scale, so have fun.
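If you want to collect timings like these on your own graphs, a tiny helper is enough. This is a sketch, not how the numbers above were measured. The helper name `timed` is made up for this example, and the workload is a stand-in (NetworkX's greedy modularity communities on a random graph) so that it runs without Karate Club. You would time `model.fit(G)` the same way.

```python
import time

import networkx as nx


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload: a random graph and a NetworkX community algorithm.
G = nx.gnm_random_graph(500, 1500, seed=42)
communities, seconds = timed(nx.community.greedy_modularity_communities, G)

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges: {seconds:.2f}s")
```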
CLICK THIS TO EXPLORE THE NETWORK.
Evaluation
How did SCD do, though? It’s one thing to run a ML model. Did it do what I want?
Let’s explore one account: laskerfdn.bsky.social
I chose this one because it is a foundation, not a person. I doubt a foundation will mind me pointing a bunch of strangers at them. It’s a foundation, so I would expect there to be a community around it. Let’s see how SCD did.
Three. SCD found three nodes connected to that account. It found only two nodes in my own community, which I found odd, and that caused me to look closer.
Let’s see the edges from a subgraph of this community.
In that subgraph, there are only three edges. SCD has put this foundation into a tiny subgraph with few edges.
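Pulling out a community's subgraph is a short operation once you have the memberships. Here is a minimal sketch on a hypothetical five-node graph. The `memberships` dictionary stands in for what `model.get_memberships()` returns, and `community_subgraph` is a helper name invented for this example.

```python
import networkx as nx

# Hypothetical graph and memberships, standing in for the real G and model output.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])
memberships = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}


def community_subgraph(graph, memberships, node):
    """Return the subgraph induced by every node sharing `node`'s community."""
    community_id = memberships[node]
    members = [n for n, c in memberships.items() if c == community_id]
    return graph.subgraph(members)


sub = community_subgraph(G, memberships, "a")
print("nodes:", sorted(sub.nodes()))
print("edges:", sorted(tuple(sorted(e)) for e in sub.edges()))
```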
How do you troubleshoot this? Do you tune the ML?
No, you look at the graph. What does this account’s Ego Graph look like? What edges are in the Ego Graph?
Visually, you should be able to notice that there are more edges. Logically, this doesn’t make sense. A person is directly connected to everyone in their ego graph, so everyone in someone’s ego graph has a high likelihood of being in their community.
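One way to put a number on that intuition is to check what fraction of an account's direct neighbors landed in the same community. This is a sketch under assumptions: `ego_overlap` is a made-up helper, and the star graph and `memberships` dictionary are toy stand-ins for the real graph and model output.

```python
import networkx as nx


def ego_overlap(graph, memberships, node):
    """Fraction of node's direct neighbors that share its community."""
    ego = nx.ego_graph(graph, node, radius=1)
    neighbors = set(ego.nodes()) - {node}
    if not neighbors:
        return 0.0
    same = {n for n in neighbors if memberships.get(n) == memberships[node]}
    return len(same) / len(neighbors)


# Toy example: node 0 is connected to nodes 1..4.
G = nx.star_graph(4)
memberships = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

print(ego_overlap(G, memberships, 0))
```

A low value for a well-connected account is a hint that either the model needs tuning or something upstream (labels, graph construction) deserves a closer look.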
So, first impression, something is wrong. Hey cool, we’re in Test Driven Development land! Fail First! But is this a failure? The model is on default parameters. Does it need to be tuned? Have I forgotten to do something, since I last used this in 2023? Did something break?
It doesn’t matter. This is the entry-point. We now have a way to play with Karate Club and learn about it. We now have a way to experiment with it and tune it. We now have a way to find models we had no idea even existed that may serve our use-cases better than traditional ML models would. What is important is the starting point, the foundation, not hitting a hole in one on your first .fit(X) .predict(). ML doesn’t work like that. You have to do the work.
And that’s all for today! I’m going to go have a nice sunny afternoon. I’ve been wanting to write this and introduce you all to Karate Club in some usable way, and now you have it!
Colab might work. Jupyter was a pain. Deepnote is easy.
Anyway, that’s all for today. Have fun learning.
Need Any Help?
Finally… there’s no fun way to say this. I am out of a job and need to find work. If you work in Data Operations, Data Engineering, Cybersecurity, Data Science, or do anything with Artificial Intelligence or Machine Learning and you think I can be of use to your company, please reach out.
I need full-time work with benefits. I need a good problem to help solve. I enjoy collaborations, but I need to pay mortgage and expenses. I am very good at what I do, and I am very good at making companies more effective. I’m a very friendly person and a good teammate, so let me know if you think of anything. Or pitch in and buy a coffee to support this writing.
Please Support this Blog
I would like to make a special request in this article. This blog has over 600 subscribers. I have written over 40 articles. Each article typically involves about four hours of research and development, so that’s about 200 hours of valuable work and writing that I’ve provided for free, because most important is that I want people to learn this. I am not doing this to make money.
However, these days, there are things that I would like to do. For instance, to play with GraphRAG for AI, it is useful to have access to a Graph Database. The cheapest tier Neo4j instance is about $800/year. I would like to work on GraphRAG and write about it so that you all learn, but I cannot do that without support.
So, I have opened up a few ways for you to support this blog:
If you are a subscriber, please consider converting to a paid subscriber. I provide code, data files, and coding explanations that are absolutely worth more than $8 per month. But I understand that not everyone can afford to pay, and that’s fine. Free is absolutely fine, for those who need free.
If you are a paid or unpaid subscriber and you want more flexibility in your contributions, I set up a ko-fi account. CLICK HERE. You can use this to buy me a coffee ($5 donation) or even to pitch in for a Neo4j Aura instance, which will enable more writing and learning.
And no matter what, if you are here, please buy and read my book. I am working on more book projects, as mentioned on Day 43.
Oh, and please participate in comments. I am a friendly guy. It is lonely in the comments and always weird to me that people don’t seem to want to talk about this stuff. Why not? What is on your mind? Have any cool ideas you want to brainstorm? Don’t be intimidated, for sure. Be creative instead.










