I’m really excited to write this entry for #100daysofnetworks. I’ll cut to the chase. In Python Data Science, Pandas is a well known library for working with data. If you are doing Data Science with Python, there is a good chance you work with Pandas or have worked with Pandas. I use it all the time.
Today’s post is an exciting one (for me, at least):
It turns out, Polars seems to work OK with NetworkX (Python Graph Library)
It also turns out that Polars is much faster than Pandas for doing string lookups, a common preprocessing step when preparing DataFrames for Graph work
I need to experiment more with Polars for use in Natural Language Processing and Network Science, because I am seeing impressive improvements to my workflows.
But there’s more:
I was on a podcast and talked for over an hour about Network Science and Systems thinking about a month ago, and the podcast is now edited and ready!
So, let’s go!
Introducing Polars
Polars is an alternative to Pandas. If you are using Pandas, you should compare the performance of what you commonly do in Pandas against Polars, as I am demonstrating today.
I don’t like the idea of using Polars and Pandas together. I resisted using Polars before, because NetworkX has a function, nx.from_pandas_edgelist(), for creating graphs from Pandas DataFrames, but I didn’t see an equivalent for Polars. So, I expected that I’d be doing a bunch of back and forth between Polars and Pandas, and I’d rather just use Pandas in that case.
Yuki Kakegawa recently wrote a book called Polars Cookbook, which is really helpful for comparing Pandas to Polars (if you are familiar with Pandas). You can read my review of his book here. And you can buy a copy of his book here or on Amazon.
I spent some time over the last few weeks familiarizing myself with Polars, and figuring out how to do some of my common workflows using Polars. Today’s code shows that.
Yuki’s book is great, and it made today’s work very easy and fast for me. Thank you so much! I wish your book complete success! It has really opened me up to the idea of using Polars more, or maybe even replacing Pandas with Polars.
Cagematch! Pandas vs Polars!
Today, I did a very simple comparison of Pandas and Polars, by timing them at doing the same tasks:
Loading a file into a DataFrame (Create DataFrame)
Loading a DataFrame edgelist into a NetworkX graph (Create Graph)
Doing string searches on a DataFrame (Search and Retrieval)
All three of these tasks are important in Natural Language Processing as well as in Network Science and Social Network Analysis. Here’s why:
You need to be able to load data to be able to do anything with it.
You need to be able to create a graph to be able to analyze it
Sometimes, you need to do preprocessing on the data before using it
The results are really impressive. You can get the code here to see the results.
Loading Data
For this comparison, I loaded the largest dataset from #100daysofnetworks, the Arxiv Network Science dataset. The file is 34.6 MB, not an itty bitty toy dataset.
Result: Polars loaded the data in a little over 1/3 of the time Pandas took. Clear winner. 185ms vs 488ms.
Creating a Graph
First, you should see my original confusion when I was able to use a Polars DataFrame directly with Networkx.
Check the notebook to see more. It appears that it actually DID work and looks good. However, I am still skeptical, so I will be doing more validation before I trust it 100%.
However, using a Polars DataFrame rather than Pandas, I was able to create a Graph, and you can see that this is not a toy graph. There are 91,659 nodes and 96,394 edges.
Was creating the graph faster or slower with Polars?
Polars was barely slower. One tenth of a second difference, hardly noticeable. The first edgelist is a Polars DataFrame, and the second one is a Pandas DataFrame.
Doing String Searches
Often, there is some preprocessing involved before creating a graph, such as filtering by a category, or removing junk. I often use string searches to do this. How do the two compare?
The first edgelist is Polars, and the second is Pandas. Polars took 10.2 ms to do a string search on the dataframe, and Pandas took nearly 20x longer.
When I saw this, I was simultaneously excited and filled with dread! I was excited, because I now have a much faster tool for a lot of my actual WORK work. This is going to have a real impact in my work and life.
But it has me wondering:
Should I rewrite my book using Polars for the second edition?
Should I write a second book called “Network Science with Python: Polars Edition”?
Do I completely stop using Pandas today and force myself to learn Polars?
Or do I just go slow and use Polars for the remainder of #100daysofnetworks, and gradually build skill?
I am going to do the latter, and I will consider what to do about my book and future books. I absolutely don’t want to write a book that teaches Network Science using both Polars and Pandas.
Polars Wins, Today!
This was a quick comparison, for me to begin probing where I can make use of Polars. Turns out, it is much more usable than I expected for Network Science. I can create a Graph using a Polars DataFrame, and I can load and search data much quicker using Polars. However, there is a learning curve, so that is going to slow me down, temporarily.
Polars currently seems much faster than Pandas for things I commonly do in NLP and Graph analysis, but I am still nervous about its compatibility with other things I will use. I will build trust and confidence over time.
However, from here on, on #100daysofnetworks, I will be using Polars, not Pandas. I want to test this out, and this is a good place. Now, you’ll get to learn about Polars and Network Science!
Polars is a viable option for Network Science with Python, or at least it is looking that way. I will do some more validation, as I didn’t expect this to work.
One more time, thank you Yuki Kakegawa for writing your excellent book!
Network Thinking Podcast
Another cool thing happened, recently. About a month ago, I did a podcast with my Packt buddy Ali Abidi. He has a podcast called “All About AI with Ali Abidi”. On the podcast, we talked for over an hour about some important topics, such as:
Why I wrote my book
What the writing process was like
What is systems thinking (or network thinking) and why is it important
Why systems thinking is important to cybersecurity (my domain) and other fields
And much more…
Here is the podcast! You can watch it here, directly!
You can follow his podcast here, and you can connect with Ali here. He is a good friend of mine, and I really enjoy his posts.
Finally, Music and Scales
For the previous two articles, I wrote about Networks and Music, and built a very useful tool for comparing scales and for improvisation. I am really happy that there was so much excitement about these two posts. This new tool has really unlocked some new learning for me. It’s hard to explain. It feels like something has literally unlocked.
I really liked this interaction I had with one of the readers/users of the tool.
I love Network Science and Natural Language Processing because they give me tools that help me learn and help me make sense of the world and my own existence. I don’t like just programming. I have been coding since I was six years old. Code is not special to me. I code for outcomes.
I wanted to build a tool that could help me with my inability to memorize music scales, and it became something that is both effective (this has really opened up for me) and fun (I can’t put my guitar down, lately).
I wrote my book to help people understand and interact with the world. And I write these articles to keep the dream alive. Oh, and the blog just passed 500 subscribers! Nice!
It’s been an all-around great week, both personally and professionally. Thank you, to everyone who is a part of my life, and to those who read my words.
That’s All for Today
Thanks to everyone who has been following along with this series. Happy learning! If you would like to learn more about networks and network analysis, please buy a copy of my book!
An earlier post today triggered me looking for an article. It kinda fits in this context :)
https://medium.com/@peternorvig/functional-lifestyles-training-47984a3cd2ba
I spent a good part of 2020 immersed in Haskell and then Rust. As a math nerd, functional programming was truly the first kind of programming that made sense to me. Fast forward to early 2022, I read a Kaggle article on Polars vs Pandas, and I was hooked :) The `explain()` on `LazyFrame`s-- perfect for a query optimizing nerd like me captured my interest :)
Unfortunately, I don't get to code as much as I'd like, but I loved reading your article! Thanks!