People often ask how networks can be used for something practical, so in this post, I’m going to show you how I made heavy use of Network Science and Data Science while working in Data Operations. If you work in Data Operations, DevOps, or Site Reliability Engineering, this will be useful to you. I have also used these techniques in Data Engineering, to understand various data pipelines.
Absolutely, this is useful in these roles, and I am living proof. But how was this impactful?
Before figuring out that this was possible, it used to take me weeks to dissect and map out a production server, sometimes over a month. Doing server uplifts requires understanding what runs on a server and what its upstream and downstream dependencies are. It’s very complicated, and I had to hold all of this in my head, in various Visio diagrams, and in way too many documents.
After figuring this out, I was able to map out production systems in 1/10 the time. Yes, people ramble on about 10x engineers, but this literally made me 10x faster at the most time-consuming part of uplift/migration work. Testing goes fast once you have a map. This creates the map.
I was hired by one company in 2015 to uplift one very old and very complicated server. I call these “ancient” servers, though it was only 15 years old. All of the documentation was so outdated and incomplete that it was actually harmful to use. The server had close to 200 automated scripts running on it, which together created one very large file. It was a complicated beast, and the uplift project took THIRTEEN MONTHS. Other teams had tried and failed. My team was the first to succeed, and the server ran smoothly after that until I left the company.
So, these are real techniques, used by me, in Data Operations. In fact, in my work, I mapped out thousands of scripts across hundreds of servers, across data centers.
After figuring out these techniques, we were able to do our uplifts much faster, and in one year, we were successful in so many of them that we were recognized by the CEO. This is not imaginary. This helps. Others in the company wanted to learn these techniques from me, but I had not yet written my book, and it’s complicated to teach.
In today’s blog, I am only going to scratch the surface. I used these and many other techniques from 2017 to 2020, and we had a perfect record on our uplifts/migrations, with no problems whatsoever. We received praise for having such boring migrations. If it sounds too good to be true, it’s not. It’s the right tool for the right job.
Code and Data Networks
You can follow along with the code here. I’ve tried to keep the code very simple today, as we are discussing something new. I have never met anyone else who does this, and this is my first time showing how to do it, openly.
I usually call these “Dataflow Networks”, as they show how data is transformed. These days, in Data Engineering, DAGs do this. DAGs are graphs, and they can be visualized, but they can also be analyzed. What I’m showing today extends beyond DAGs and can be used with any code whatsoever. All it needs are inputs and outputs.
Building the Graph
Today, I built a simple imaginary dataflow network, an ultra-simplistic version of the system I was hired to uplift. Instead of close to 200 scripts, I’ve only added a handful, to keep things simple and easy to understand.
I also manually created the graph, like so:
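The full code is at the link above, but here’s a minimal sketch of the construction in NetworkX. It matches the system described below; the names of the two files 5.py produces are placeholders I invented for illustration:

```python
import networkx as nx

# Directed graph, because data flows in one direction.
G = nx.DiGraph()

# 1.py creates 1.json
G.add_edge("1.py", "1.json")

# 2.py creates 2.json
G.add_edge("2.py", "2.json")

# 3.py reads 1.json (input on the left) and writes to MySQL (output on the right)
G.add_edge("1.json", "3.py")
G.add_edge("3.py", "MySQL")

# 4.py reads 2.json and also writes to MySQL
G.add_edge("2.json", "4.py")
G.add_edge("4.py", "MySQL")

# 5.py reads from MySQL and extracts two files
# (extract_a.json and extract_b.json are placeholder names)
G.add_edge("MySQL", "5.py")
G.add_edge("5.py", "extract_a.json")
G.add_edge("5.py", "extract_b.json")

# 6.py combines the two extracts into the final export
G.add_edge("extract_a.json", "6.py")
G.add_edge("extract_b.json", "6.py")
G.add_edge("6.py", "combined_export.json")
```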
I’ve added comments so that you can see each script separately. Look near 3.py and read those comments. Edges always point from producer to consumer: if a script creates a file, the script goes on the left and the output on the right; if a file is an input, the file goes on the left and the script on the right. At the top, I’m using a Directed Graph, because that’s how the data actually flows: inputs get used by scripts to create one final combined_export.json file.
Visualizing the Dataflow Network
After creating this graph, we can visualize it.
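Here’s a quick sketch of the drawing step, assuming the graph G built above (matplotlib is one option; any graph drawing tool works):

```python
import matplotlib.pyplot as plt
import networkx as nx

# Compute node positions; the fixed seed keeps the layout reproducible.
pos = nx.spring_layout(G, seed=42)

# Draw nodes, labels, and directed edges.
nx.draw_networkx(G, pos, node_color="lightblue", node_size=1500,
                 font_size=8, arrows=True)
plt.axis("off")
plt.show()
```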
Even with just that, it’s easy enough to know what this imaginary server does and how it works. At the top left, I can see that two Python scripts create two .json files. Next, 3.py and 4.py write the data to MySQL. 5.py then picks up the data and extracts two things. Then 6.py takes this data and uses it to create a combined export file.
Troubleshooting is Easy
I know, that’s a generalization. However, on ancient, undocumented servers, teams can spend DAYS trying to figure out what broke, what it impacted, and how a cascading failure occurred.
With this, instead of days, it can take MINUTES. After mapping out ancient, undocumented, orphaned servers, it is easy to find the source of the problem. Just start at the known failure and work your way backwards, upstream.
Let’s pretend that combined_export.json is still being created, but the data is stale. What’s upstream from the file? 6.py. The log files from 6.py indicate that the script is running well. OK, let’s go back another step. OH NO! There’s an error in 5.py: it’s been failing, and no new data has been written to the two files it produces. And 6.py doesn’t check for changes in the data before doing its work. Hey, that explains the stale data! Turns out MySQL crashed, causing 5.py to malfunction, causing staleness in the combined export.
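In graph terms, that walk upstream is just following predecessors. A sketch, again assuming the graph G from earlier:

```python
# Walk back one hop at a time, just like in the story above.
print(list(G.predecessors("combined_export.json")))  # ['6.py']
print(list(G.predecessors("6.py")))                  # the two extract files
print(list(G.predecessors("5.py")))                  # ['MySQL']

# Or grab everything upstream at once: nx.ancestors returns every node
# that can reach combined_export.json, i.e. everything it depends on.
print(sorted(nx.ancestors(G, "combined_export.json")))
```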
That’s an easy story to tell, with a network visualization. That is something your manager will understand and appreciate.
This seems like the way data operations should be done, but it usually is not. We didn’t get data science help in data operations. Most analysis was very simple. Most troubleshooting was manual and slow.
This makes understanding complex production systems very easy. You can do this. Anyone can.
Network Science Goodies are Useful
Once the graph has been constructed, everything else I have written about becomes available. You can use ego networks to understand individual nodes and what operates around them. You can use community detection to find subsystems in the overall system. You can identify weaknesses and bolster them. And you can easily find the most critical pieces.
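For example, here’s a sketch of pulling an ego network and detecting communities on the graph from earlier, using NetworkX’s built-ins:

```python
from networkx.algorithms import community

# Ego network: MySQL plus everything directly connected to it
# (undirected=True so both writers and readers are included).
ego = nx.ego_graph(G, "MySQL", radius=1, undirected=True)
print(sorted(ego.nodes()))  # 3.py, 4.py, 5.py, MySQL

# Community detection: rough subsystems in the overall dataflow.
for i, nodes in enumerate(community.greedy_modularity_communities(G.to_undirected())):
    print(i, sorted(nodes))
```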
The critical pieces make intuitive sense. If MySQL goes down, everything that uses MySQL is affected. If 5.py goes down, the final export file is never created. Everything else is important, but less critically so, and that criticality is measurable.
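One simple way to measure it, as a sketch: count how many downstream nodes each node can reach, its blast radius if it fails. Betweenness centrality is another option for spotting bottlenecks:

```python
# Blast radius: how many downstream nodes are affected if this node fails?
blast = {node: len(nx.descendants(G, node)) for node in G.nodes()}
for node, radius in sorted(blast.items(), key=lambda kv: -kv[1]):
    print(node, radius)

# Betweenness centrality: nodes that sit on many paths between others
# (the bottlenecks, like MySQL and 5.py) score highest.
print(nx.betweenness_centrality(G))
```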
Fine, How Do I Start?
It has been an uphill battle getting people in Data Operations to use these techniques, even though they are so obviously useful and informative. I think it’s because there is no tool that just does all of this naturally. There are tools that do SOME of this, and DAGs can be visualized, but there is nothing that covers everything, and there may never be.
That doesn’t mean you give up. You can use these tools to do enough to be useful for whatever you are working on.
Here’s how you start simple:
Find a simple server or workflow that you know well and that has just a few things running on it.
Start at the orchestration, the automation piece. This is often CRON, and CRON will tell you what runs on a server and when. But it doesn’t have to be CRON. Find the orchestrator. Note the scripts.
For each script, identify the inputs and outputs.
Build the graph, like I showed (a starter sketch follows this list).
Read my book and blog and practice whatever looks useful and interesting.
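As a concrete starter, here is what steps 3 and 4 might look like in code. The script and file names are invented for illustration; fill in what you actually find on your server:

```python
import networkx as nx

# Step 3: for each script the orchestrator runs, note its inputs and outputs.
# These entries are made-up examples.
scripts = {
    "nightly_report.py": {"inputs": ["sales.db"], "outputs": ["report.csv"]},
    "upload_report.py": {"inputs": ["report.csv"], "outputs": ["report_uploaded.flag"]},
}

# Step 4: build the graph, with edges pointing from producer to consumer.
G = nx.DiGraph()
for script, io in scripts.items():
    for input_file in io["inputs"]:
        G.add_edge(input_file, script)   # input on the left, script on the right
    for output_file in io["outputs"]:
        G.add_edge(script, output_file)  # script on the left, output on the right

print(list(G.edges()))
```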
Just start small and simple, even with an imaginary system, like I did. Or try something I’ve never tried: map out how your code/system should work before even writing a single line of code. Visualize how you want it to work, then build it. Try the reverse of what I’m doing.
Want More Like This?
Network Science and network analysis are useful for more than just social scientists and data scientists. There is no reason that people in Data Operations should not be using these techniques. This saved me so much time and made my team effective. It made us 100% successful on our uplifts/migrations, and it got us recognition all the way to the top of the company.
But these are different types of networks than I usually show. If this is useful and informative to you, please let me know, and I’ll be encouraged to write more. I might even write a smaller book on just this topic, to help the Data Operations community next.
Dive In and Try Stuff
Just jump in and try stuff. Once you start, there’s no going back to the old ways. This has completely replaced my old methods. I use this in my own work, and in my company.
And if you get stuck or just want to brainstorm an idea, message me on LinkedIn.
This might feel tedious at first, but only at first. This is good “listen to music and get stuff done” work. Sometimes, you just have to dig in. Creating the graphs is simple once you identify the inputs and outputs, and the insights follow almost immediately after.
Fun Piece of Trivia
I’m actually writing this blog post using a laptop I won in 2019 at McAfee as an award for mapping out production systems using these techniques.
That’s All, Folks!
That’s all for today! Thanks for reading! If you would like to learn more about networks and network analysis, please buy a copy of my book!