Hi everyone, happy weekend. I have a fondness for reading literature. Actually, I have a fondness for reading in general, but there are certain authors that I am very fond of. My favorite authors include Ray Bradbury, Robert Silverberg, Arthur C. Clarke, Kurt Vonnegut, Lewis Carroll, and many, many more.
Natural Language Processing and Network Science are cool, but using Natural Language Processing and Network Science to analyze literature is cooler!
I’ve been using both to analyze all kinds of writing for years. All writing is information, and we can analyze language using useful technologies. Language is not a black box. Language is not something that is impossible to measure, analyze, or use.
Today, I am dabbling in the Computational Humanities. Using software engineering, Natural Language Processing, and Network Science to understand literature is a clear example of Computational Humanities.
I think that Computational Humanities is a very important subject that does not receive enough attention. We can use computers to better understand ourselves.
Language and Creativity
I have been reading an incredible book called The Information, and two quotes have really captivated me this week.
“The circle of the English language has a well-defined centre but no discernable circumference.” In the center are the words everyone knows. At the edges, where Murray placed slang and cant and scientific jargon and foreign border crossers, everyone’s sense of the language differs and no one’s can be called “standard.”
This is from page 72, the second paragraph.
When I read this paragraph, I knew exactly what James Gleick was describing, because I have seen it with my own eyes, several times.
He is describing the English language as a network of words, with the rare words on the outskirts, and the common words in the core.
Here is an example: this is the core of the network I will describe in today’s article. Look in the center, and you’ll see these words: in, and, to, that, so, was, for. Node coloring shows the most common words in red.
I learned to do this in 2018, when I was working on some deepfake defense work. We needed to be able to distinguish AI-generated content from human-generated content, and we did quite well at that. During this project, I had a lot of opportunity to look at both human-generated and AI-generated text.
When I read the paragraph about the circle of the English language, I knew exactly what he was talking about, and you can too. This is the overlap between Natural Language Processing and Network Science, and they make for the absolute best of friends.
The second quote that inspired me this week is from Ada Lovelace, on page 112 of the same book.
[Imagination] is that which penetrates into the unseen worlds around us, the worlds of Science. It is that which feels and discovers what is, the real which we see not, which exists for our senses.
Those who have learned to walk on the threshold of the unknown worlds… may then with their fair white wings of Imagination hope to soar further into the unexplored amidst which we live.
I intentionally split this quote into two parts, because there are two halves:
Imagination penetrates into the unseen worlds around us, the worlds of science. Imagination is that which leads to discovery of reality.
Those who learn to walk on the threshold of these unseen and unknown worlds will soar further into the unexplored amidst which we live.
That’s pretty much the whole point of my book, my blog, and my work. I tap into these hidden networks that exist around us and show how to draw out useful insights. This is not just cool. This is how you go further.
I hope these quotes captivate you.
Today’s Experiment
Today, I will attempt to build a vocabulary list from Jane Austen’s texts. She is not my favorite writer, but I have been impressed with her word use for as long as I have known about her. Today, I’ll use her as my vocabulary coach.
This is a fun thing that I have done many times, and I have shown similar experiments on a few days of this series. It’s easy to build word networks, and they are fun to analyze. You can do this with any author, with any text, including your own text.
Today, I want to do something fun and simple. I will use literature as input, and I will generate a vocabulary list as output.
literature → (NLP + Network Science) = vocabulary list
The NLP for this task is simple, mostly text cleaning and splitting the text on spaces, the very basics of NLP, nothing advanced.
Here is the approach:
I used NLTK to identify which Jane Austen texts I could access
I downloaded all of them and smashed them into one single ‘text’ variable
I did some wrangling of the text, removing non-word characters and such
I split the remaining text into words, splitting on whitespace
I used the words to create a word network
I used the degree of each word in the word network (its number of connections) to find the rare words I was looking for.
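The steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not the code from the notebook: it uses only the standard library, the function names are made up for illustration, and it assumes that two words are linked when they sit next to each other in the text, which the steps above do not spell out. The sample sentence is the opening line of Pride and Prejudice, standing in for the full Austen text.

```python
import re
from collections import defaultdict


def clean_and_split(text):
    """Lowercase the text, drop non-word characters, and split on whitespace."""
    text = text.lower()
    # Keep letters, apostrophes, hyphens, and whitespace; turn the rest into spaces.
    text = re.sub(r"[^a-z'\-\s]", " ", text)
    return text.split()


def build_word_network(words):
    """Link each word to the word that follows it (an assumed edge rule)."""
    neighbors = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops such as "very very"
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


def rare_words(neighbors, max_degree=2, prefix="a"):
    """Return words with few distinct neighbors that start with the prefix."""
    return sorted(
        word
        for word, linked in neighbors.items()
        if len(linked) <= max_degree and word.startswith(prefix)
    )


# Stand-in for the combined Austen text.
sample_text = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)

words = clean_and_split(sample_text)
network = build_word_network(words)

print(len(network["a"]))    # degree of the common word "a"
print(rare_words(network))  # low-degree words starting with "a"
```

In this tiny sample, "a" touches seven different words while "acknowledged" touches only two, so "acknowledged" lands on the rare list and "a" does not. If you have NLTK and its Gutenberg corpus installed, `nltk.corpus.gutenberg.fileids()` lists the available texts and `nltk.corpus.gutenberg.raw(fileid)` returns one as a string, so the Austen files can be joined into one `text` and passed through the same functions. The `max_degree` cutoff is a knob to play with, not a recommended value.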
I recommend that you click the code link above and follow along with the notebook, to see the process in action, and the outputs of each step.
In the end, I am able to investigate the layers of the word graph, like peeling an onion, like I described in the previous post.
Looking at the final output, of the rare words that start with the letter A, I see these words:
a-day, a-shooting, abandoned, abatement, abbreviation, abdication, ablest, abolition, abominably, abominate, abominates, abounded, abridge, abridgement, absences, absented, absorbing, abstained, abstracted, abstruse, absurdities, abundant, abuses, acacia, acceded, accelerate, acceptably, accepts, accessions, accidently, accompaniment, accord, accountable, accounting, accumulation, accumulations, accustom, accustomed, aches, achievement, acquaintances, acquire, acquitting, acres, acrostic, actuated, acutely, adherence, adhering, adjusting, administered, administering, admirers, admits, adoption, adored, adoring, advancement, adventuring, adventurous, adversary's, adverse, advertise, advertised, advertising, affects, affirm, affix, afflict, afflicting, afflictions, affluent, affronts, after-days, afterward, againsts, agent, aggrandise, aggrandizement, aggression, aghast, agitating, agriculture, ah, ahead, aids, ailed, aim, aimable, akin, al-fresco, alienable, alienated, alienations, alighted, all-sufficiency, allay, allayed, alleged, allied, allies, alphabetically, alternation, alternatives, ambitious, amended, amiableness, amiably, amicable, amid, amounting, ancestry, anchorage, anecdote, anecdotes, angles, angrily, ankle, ankles, annoyance, annual, annum, antagonist, anticipations, antidote, anxieties, anymore, anyone, apologized, apparatus, appease, appellation, appendages, appetites, applauded, apple-dumpling, apple-dumplings, apple-tart, apple-tarts, apple-trees, applicant, applicants, appointments, appreciating, appreciation, apprehending, apprehensively, approachable, appropriate, approval, approver, apricot, aptitude, archness, argue, argued, arguing, argumentative, aright, arises, aristocratic, armed, army, arranger, arrear, arrives, arrow, articulation, artlessly, ask-, asperity, aspersion, assailed, assemblies, assenting, assertions, assiduously, assistants, assizes, associated, associating, association, assorting, assuage, astonishingly, atoned, atoning, 
attachment's, attacking, attaining, attestation, attorney, au, audacity, auditors, augment, augmentation, augmenting, aunts, auspices, auspicious, auspiciously, authentic, authorise, authorising, autumnal, availed, availing, avenue, avenues, avoidance, avowed, avowedly, awaited, awaken, awakening, awaking, awe, awes, awhile, awkwardnesses
These are the rarest words from Jane Austen’s literature. These are not all difficult words, but they are the rarest in her vocabulary.
These are just the words that start with the letter A. I recommend that you play with the code and explore, to internalize these capabilities.
This was an easy experiment. If you are just getting started, this is a good one for you to start with, because setup is so simple.
Expanding the Experiment
If we wanted to show what was described in the first quote, about the circle of the English language, how would we do it differently? Scroll up, look through the steps. There is one single step that needs to be changed to show what that quote is describing. Here is how I would do it:
Collect as much random text as you can from as many human sources as you can get access to and use. Throughout this series, I’ve shown some useful sources, and data is everywhere. Find what seems useful.
Load all of the data and combine the text into one single ‘text’ variable
Clean the text, removing non-word characters and such
Split the remaining text into words, splitting on whitespace
Use the words to create a word network
Investigate the layers, using degrees or k-corona
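The last step, investigating the layers, can be sketched as follows. This is a rough illustration under stated assumptions, not the notebook's code: the graph is a plain dict of neighbor sets, the function names and the toy graph are made up, and the core numbers come from a simple peeling loop that is fine for small graphs but slow for large ones. The k-corona is the set of nodes in the k-core that have exactly k neighbors inside that core, so the 1-corona picks out the outermost layer of the onion.

```python
def core_numbers(neighbors):
    """Peel the graph by repeatedly removing the lowest-degree node left."""
    degree = {node: len(linked) for node, linked in neighbors.items()}
    remaining = set(neighbors)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # the core level never goes back down
        core[node] = k
        remaining.remove(node)
        for other in neighbors[node]:
            if other in remaining:
                degree[other] -= 1
    return core


def k_corona(neighbors, k, core=None):
    """Nodes of the k-core with exactly k neighbors inside the k-core."""
    if core is None:
        core = core_numbers(neighbors)
    k_core = {node for node, level in core.items() if level >= k}
    return {node for node in k_core if len(neighbors[node] & k_core) == k}


# Toy word graph (made up): three common words in the middle,
# each with one rare word hanging off it.
toy_graph = {
    "and": {"to", "the", "abstruse"},
    "to": {"and", "the", "acacia"},
    "the": {"and", "to", "acrostic"},
    "abstruse": {"and"},
    "acacia": {"to"},
    "acrostic": {"the"},
}

core = core_numbers(toy_graph)
outer_layer = k_corona(toy_graph, 1, core)
inner_layer = k_corona(toy_graph, 2, core)

print(sorted(outer_layer))  # the rare words on the edge
print(sorted(inner_layer))  # the common words in the core
```

On a real word network, the outer layers hold the rare vocabulary and the inner layers hold words like "and" and "to". If you already build your graph with NetworkX, `nx.core_number(G)` and `nx.k_corona(G, k)` do the same job and scale much better than this loop.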
This is Fun Stuff
Computational Humanities is always fun. When I was in college, we relied on CliffsNotes books as study guides to help us in our understanding of literature. Some people used them to cheat, but that is not why they were created.
I think of computational humanities as CliffsNotes on steroids or rocket fuel. It is one thing to read Alice in Wonderland. It is another thing to play in the networks and run simulations, to let the Red Queen have her way, or to turn the White Rabbit into an assassin in an alternate universe.
But these approaches will work on any text. Get creative and run the experiments you want to run.
Another Thank You
My book has been doing really well, and I want to thank all of you who have read it.
Today, I want to give a special thank you to my friend Koo Ping Shung who had some really nice things to say about my book.
You should follow him on Substack and LinkedIn, if you aren’t already. He shares so much useful information and is very active in the Data Science community. You should read and follow his blog as well.
There are a few things that I liked about his review:
“His dedication and commitment to having good content can be seen in all these posts, unlike other authors, not necessarily on topics in #datascience or #artificialintelligence, there is no track record to follow on.”
“It covers a large variety of topics too like #naturallanguageprocessing & #machinelearning together with network analysis which is not an easy topic to write on and thus not a lot of materials on it.”
“There are accompanying #python codes that gives you immediate feedback on your learning or put into actions what is written which can increase your learning efficiency!”
“Personally I feel that network analysis is under-utilized, as in it is a overlooked goldmine of insights that can be used so there is first-mover advantage if you have a good grasp of it.”
“Get the book! :D”
“It widen the variety of technical books in my library! :D”
“It combines the social sciences with data”
I want to address each of these, because they speak to why I write.
I try to keep human innovation moving forward. I’m not writing for myself. I don’t care to read my old writing. I like to keep things progressing, including human knowledge and innovation. So I write, and write, and write. I don’t do this for money. Creativity helps me relax and recover.
Yes, this book and today’s article showcase the value of combining software engineering, network science, and natural language processing. This innovation is powerful.
The code is written so that you can immediately understand if you are moving in the right direction. I try to write my code for this so that it is easy to follow, not so that it is ready for production. That is a different task with a different focus.
This is absolutely under-utilized. So, I don’t mind that my blog only has hundreds of readers while my LinkedIn has tens of thousands of followers. This is a specialized subject, not as broad as Artificial Intelligence or Machine Learning, and certainly not as popular. I find it more exciting, because it is more useful to me than ML and AI both, but that is something I learned, and something you can learn.
It will broaden your technical horizons. Re-read that Ada Lovelace quote. That’s why I do this.
We are combining the social sciences with software engineering. Why are we doing that? To understand ourselves and reality better.
Thank you, Koo Ping Shung! Your review made my day! Thank you for everything that you do!
That’s All for Today
Thanks to everyone who has been following along with this series. Happy learning! If you would like to learn more about networks and network analysis, please buy a copy of my book!