Day 20 of #100daysofnetworks

Text as Data, Converting PDFs to Text

Nov 18, 2023

Hello everyone. Today’s post for #100daysofnetworks is going to be very simple, as I want to show how to do just one thing: convert a PDF to text. That’s the goal.

But first, DAY TWENTY!!! NICE MILESTONE!

In my book, I show how to convert raw text into social networks that can be explored. I use the text from Alice in Wonderland to create the coolest social network that I am aware of.

However, I learned this technique a few years prior, when I used the Bible’s book of Genesis to create an ancient social network graph. It’s not much to look at right now, but this is what I created back in 2018 or so.

It’s my first network visualization. It would look much better if I were to recreate it again, now, with all that I have learned. It would be very cool to explore in Graphistry, too. Graphistry is the coolest network visualization software that I am aware of.

The larger circles are nodes with the most edges (connections). So, the largest nodes were God and the most talked-about people in the book of Genesis. Even though I am not religious, the fact that I could convert text into networks felt like a superpower. I used to think of it as “Cliffs Notes on Steroids”, because rather than just reading someone’s summary of what a book is about, I could actually interact with and explore the social network, getting to know the characters and people in another way.

Read this paragraph slowly and understand: This can be done with ANY text whatsoever. If people are mentioned in text, if NER (Named Entity Recognition) can properly extract those entities, then you can build a network off of that text. Any text imaginable. Email. Random PDF files from the internet (be careful of malware). The internet is made of text, and there are files that can be used as well (with care).

That means that this has limitless use. Even audio can be transcribed into text and then text into networks. And this isn’t limited to the English language.

We can use this to explore and understand our world and existence in ways not previously possible. It would be extremely time consuming to map out the book of Genesis manually, to read each verse individually and draw dots and lines for each person mentioned. With NLP, you can create social networks from raw text very quickly, in seconds.
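As a rough illustration of the idea, here is a small sketch of the step that turns extracted names into a network. It assumes NER has already run and produced a list of names per sentence (the list below is made up by hand for illustration, not real NER output), and it links two names whenever they appear in the same sentence. That co-occurrence rule is one common choice, not necessarily the exact approach from the book, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentence_entities):
    """Count how often each pair of names appears in the same sentence."""
    edges = Counter()
    for names in sentence_entities:
        # Deduplicate and sort so ("Alice", "Queen") and ("Queen", "Alice")
        # are counted as the same undirected edge.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


def node_degrees(edges):
    """Count the number of distinct connections each name has."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


# Hand-written stand-in for NER output: one list of names per sentence.
sentence_entities = [
    ["Alice", "Rabbit"],
    ["Alice", "Queen"],
    ["Alice", "Queen", "Rabbit"],
    ["Alice", "Hatter"],
    ["Alice"],  # a sentence with one name adds no edge
]

edges = cooccurrence_edges(sentence_entities)
degrees = node_degrees(edges)

# Names with the most connections would be drawn as the largest circles.
for name, degree in degrees.most_common():
    print(name, degree)
```

The edge counts could then be loaded into a graph library for drawing and exploration; the degree counts are what would drive node size in a visualization like the Genesis one.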

The largest file I have attempted this on was a random 600 page PDF I found on the internet. I tried it the day after my book was published, just to see if the techniques in my book would work with much larger pieces of text. Yup. No problem at all.

In today’s code, I show how to convert a PDF file to text, and then use that text with a sentence tokenizer (a first step in NLP). From that point on, text is data and we have a lot of options with how we can use it.

The code is very simple, because the PyPDF2 library is doing the complicated work, making this easy. I also include ‘tqdm’ in the function, which gives a nice status bar. A 600 page PDF takes a lot longer than a ten page PDF, so it is nice to have.
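A minimal sketch of the kind of function described here might look like the following. It is an illustration under assumptions rather than the exact code from this post: the function names (extract_pages, pdf_to_text, split_sentences) are made up, it assumes PyPDF2 2.x or later (where PdfReader, reader.pages, and page.extract_text() are available), and the regex sentence splitter is only a simple stand-in for a proper NLP sentence tokenizer.

```python
import re


def extract_pages(pages, show_progress=True):
    """Join the text of page objects that expose extract_text()."""
    if show_progress:
        try:
            from tqdm import tqdm  # optional: status bar for long PDFs
            pages = tqdm(pages)
        except ImportError:
            pass  # no tqdm installed, just run without the status bar
    chunks = []
    for page in pages:
        # extract_text() can come back empty for image-only pages
        chunks.append(page.extract_text() or "")
    return "\n".join(chunks)


def pdf_to_text(path, show_progress=True):
    """Read a PDF from disk and return all of its text as one string."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(path)
    return extract_pages(reader.pages, show_progress)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace.

    A stand-in for a real sentence tokenizer; it will trip over
    abbreviations such as "Dr." or "e.g.".
    """
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
```

Usage would be along the lines of split_sentences(pdf_to_text("some_file.pdf")), where the file name is a placeholder. Text extraction quality depends heavily on how the PDF was produced, so scanned documents may return little or no text without OCR.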

That’s it. That’s all I’m going to teach today. We’ll be using this in the coming days to convert a PDF into text and then into a network, and then we’ll explore that network. I’m looking for a PDF I want to use in the demonstration, and I suspect arXiv will be less useful with NER because citations are written differently than how people are named in other documents.

Text as Data

Most of my work involves “text as data”. I’ve gotten very good at extracting information and context out of text. In my work and personal research, the book Text as Data has been really helpful to me. If you are interested in Natural Language Processing (NLP) or in using text to extract networks, this is my current favorite NLP book. I can’t recommend this enough.

This is not a programming book. It is a very useful description of NLP, and it describes how and where techniques can be useful.

Recognizing Others

One of my book readers has been really getting into network analysis and Natural Language Processing, and I want to showcase some of her work. It makes me happy to see others who “get it”, who understand the value and opportunities that exploring networks can give to us. I love that she shows her work and makes cool presentations for each of her analyses.

I also really enjoy her random Network Analysis study posts, like this.

It’s always cool to see people learning to explore networks that are interesting to them, learning more about the things that they care about.

Thank you, Tré Rodríguez-Terry, for showing your learning adventure and progress. It’s so motivating for me to see.

A few others have shared their work with me, as well, like the winners of previous book giveaways. Thank you to everyone who has done this! It motivates me to share more and learn more!

Do you have something you’d like me to showcase? Let me know and I’ll see if it fits!

Music: Just for Fun

Finally, there’s another side of me that I don’t talk about as often. If there is one thing that I am about, it is EXPLORATION. That exploration manifests in a few different ways in my life:

On teams I’ve been on, I’ve been tasked to figure out how to even do something. Not necessarily how to do something best, or how to do something fastest, but how to do something at all. To take an idea and figure out a realistic path from idea to runnable and then to automation. It takes a lot of experimentation to do that. There’s not much that is helpful when you are figuring out how to do new things, and I love being in that kind of idea wilderness.
These network posts are another example. My book is more about network exploration than it is about Graph Machine Learning, and my book is not at all about Graph Databases. I want to show people how to explore networks, because there are insights to find. You should be able to do this without requiring a database, and I show you how.
But my music is where I am most experimental.

So, to me, these are all the same. They’re equally a part of me. I am as much a musician as I am a data person or programmer, and I’ve been playing music for just as long.

A friend of mine mentioned that I should share some of my music on LinkedIn (where we hang out), because it is a part of me, and because it can bring people happiness and maybe even some peace in their life. So, I started doing that.

I won’t do a long introduction like this in later posts. This is just the why. But if I post some new music on LI (I’m not a Youtuber), then I’ll tag it here. Take it or leave it. Maybe you’ll enjoy it. It’s another side of me, an important part of who I am, and it is directly related to learning, creativity, innovation, and exploration.

These are not performances. These are me learning music and exploring sound. I don’t sing. My brain won’t even let me…

I hope you’ll enjoy it. It’s nice seeing that people enjoy my guitar playing so much. I didn’t expect that. Thank you.

That’s All, Folks!

That’s all for today! Next, I’m going to show a bit more about Named Entity Recognition, using text extracted from PDF. Stay tuned! Thanks for reading! If you would like to learn more about networks and network analysis, please buy a copy of my book!
