My Blueprint for Writing Data Stories

Putting structure and soul in technical articles

Juan De Dios Santos

07 Dec 2019 • 7 min read

“Practice makes perfect” is the quintessential motivational line. And while I’m not sure we’ll ever reach the summit of utopian perfection, I’m convinced we can get pretty far up the hill. For the last two years, I’ve been continuously writing about data topics, in particular data stories: stories where I analyze, summarize, and interpret data, using statistics and machine learning (a branch of artificial intelligence), to describe an event, state a fact, or answer a question. For example, in my article titled Interpreting 135 Nights of Sleep with Data, Anomaly Detection, and Time Series, I use my Fitbit’s sleep data to learn about my sleep patterns, while in another one, I make sense of the tweets written in Puerto Rico during its latest protests.

As with most beginners, my first articles were subpar. Although they were readable, I feel they lacked structure and soul. Still, I kept going, and after quite some time and thousands of words poured into a text editor, I’ve shaped a blueprint that serves as the skeleton of almost all my content. While the framework is far from perfect, it has helped me bring order to my writing and organize my thoughts and the ideas I want to share with the world. This guideline, or outline, has become my writer’s companion and an essential step in my creative process.

However, before getting there, I want to repeat that this is neither a perfect guide nor a formula for writing the best data articles. What I’m about to present is a guide I’ve tailored to suit my needs, so please take everything with a grain of salt and apply what you think might benefit you. Let’s get to it.

The guideline, as you’ll see below, is more like a “fill-in-the-blank” document I complete before starting the article. Here, I spit out the ideas, questions, and notes I’ll discuss in the piece. For example, in the “topics to discuss” point, I list the topics I’d like to present, while in “programming languages and code libraries,” I write down the programming language I’ll use. In the following sections, I’ll explain each point in detail.

Tentative title
Main idea
Topics to discuss
- Topic 1
- Topic 2
- ...
- Topic n
Questions to answer
- Question 1
- Question 2
- ...
- Question n
Methods/techniques to apply
- Method 1
- Method 2
- ...
- Method n
Programming languages and code libraries
Article outline
- Introduction
- About the data
- Getting the data
- The tools
- Topic 1
- Topic 2
- Topic n
- Recap and conclusion
Extras
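
For readers who prefer to keep their notes next to their code, the fill-in-the-blank document above could also be held in a small Python structure. This is only a sketch under stated assumptions: the `Blueprint` class, its field names, and the `outline` and `render` methods are hypothetical names chosen for illustration, not part of the guideline itself. The sample values borrow the sleep-article examples used in this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Blueprint:
    """Hypothetical container mirroring the fill-in-the-blank guideline."""

    tentative_title: str = ""
    main_idea: str = ""
    topics: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    methods: List[str] = field(default_factory=list)
    languages_and_libraries: List[str] = field(default_factory=list)
    extras: List[str] = field(default_factory=list)

    def outline(self) -> List[str]:
        """Fixed opening sections, one section per topic, then the recap."""
        opening = ["Introduction", "About the data", "Getting the data", "The tools"]
        return opening + self.topics + ["Recap and conclusion"]

    def render(self) -> str:
        """Return the filled-in guideline as plain text."""
        lines = [
            f"Tentative title: {self.tentative_title}",
            f"Main idea: {self.main_idea}",
        ]
        sections = [
            ("Topics to discuss", self.topics),
            ("Questions to answer", self.questions),
            ("Methods/techniques to apply", self.methods),
            ("Programming languages and code libraries", self.languages_and_libraries),
            ("Article outline", self.outline()),
            ("Extras", self.extras),
        ]
        for heading, items in sections:
            lines.append(heading)
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Sample values borrowed from the sleep article's examples.
sleep_blueprint = Blueprint(
    tentative_title="Interpreting 135 Nights of Sleep",
    main_idea="To analyze my sleep data and discover the patterns hidden in it.",
    topics=["Sleep in general", "Time in bed vs. time sleeping", "Sleep start and end times"],
    questions=["At what time do I go to bed?", "How much time do I spend sleeping?"],
    methods=["Anomaly detection", "Time series analysis"],
    languages_and_libraries=["Python", "R"],
)
print(sleep_blueprint.render())
```

Deriving the outline from the list of topics keeps the two in sync: adding a topic automatically adds a section for it.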

Tentative title

Self-explanatory. In this blank, I’ll write and probably iterate over the article’s title. At the very beginning, it might be somewhat vague, e.g., “Puerto Rico’s Tweets,” but as the material takes shape, so does the title.

Main idea

Here, I’ll state the article’s principal idea, which, in most cases, is the answer to the following question: what do I want to explore? For example, the primary purpose of my sleep article is “To analyze my sleep data and discover the patterns hidden in it.” Simple, without going into detail. The following two points are responsible for that.

Topics to discuss

Now we’re getting into the particulars. Under “topics to discuss,” I’ll list the subjects or main points I want to address, e.g., “sleep in general,” “time in bed vs. time sleeping,” and “sleep start and end times.” Typically, I write down between three and five topics.

Questions to answer

The field of data science is mostly about answers: there’s a question, doubt, or uncertainty, and we study the data to shed light on the issue. Here, I’ll establish those questions. I consider this section the most important one, and the place where I have to be as clear and precise as possible, since the story’s content will come from how I answer them. Some sample questions are “At what time do I go to bed? When do I wake up?”, “How much time do I spend sleeping?”, and “How much time do I spend awake each night?”

Methods and techniques to use

This part is about the “how-to” and the methods I’ll apply to answer my questions. The domains of statistics and machine learning are swarming with algorithms, and as fascinating as they are, it is crucial to select the most appropriate ones for our use case. Examples of methods might be anomaly detection or time series analysis.

The programming language(s) and code libraries

Right after determining the methods I’ll apply, I need to think about the programming language I’ll use and the methods’ implementation. For instance, I prefer to use Python for machine learning and R for exploring the dataset. I also take time to decide which implementation of each algorithm I’ll employ. There are many libraries out there that do similar things in different ways, and each one has its pros and cons, points I need to consider before making my final call.

Sidenote: Some people might do this the other way around, selecting the language and the libraries first, and then the technique. Not me. I prefer to focus on the algorithms first and then worry about the tools.

By this step, I’m already pretty clear about the article’s content, or the soul I previously mentioned. However, none of this says anything about its structure. To address this, I also designed an outline framework that I use not only to arrange the piece but also to connect the bits I just discussed.

Article outline

Introduction

Technical articles, including data stories, can be a bit monotonous and sometimes, honestly, a bit boring. So, I always try to start my pieces in an unexpected way: with an anecdote, a relatable thought, a background story, or the motivation. In other words, I tell the reader why I am writing this. Also, in this segment (maybe in a second paragraph), I introduce the problem I want to investigate and the key questions.

About the data

In this section, I describe the story’s main character: the data. Each dataset out there is unique; it has its own size, characteristics, history, and meaning. So, due to the vital role it plays, I like to say a few words about it: for example, where it comes from, how large it is, and how many features (or columns) it has. Here I also introduce some of its inconsistencies, such as missing information (e.g., in my sleep dataset, I’m missing five days of data). In summary, I want you, the reader, to have a clear idea of the data and know its essence.
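
The descriptive facts mentioned above (size, number of features, missing days) can be collected with a short helper. The sketch below is only one possible way to do it: the `describe_dataset` function, the `date` and `minutes_asleep` column names, and the sample rows are all hypothetical and made up for illustration; they are not taken from the actual sleep dataset.

```python
from datetime import date, timedelta


def describe_dataset(rows, date_key="date"):
    """Summarize row count, feature count, and calendar days with no record.

    Assumes `rows` is a non-empty list of dicts that share the same keys.
    """
    recorded = {row[date_key] for row in rows}
    first, last = min(recorded), max(recorded)
    n_days = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(n_days))
    return {
        "rows": len(rows),
        "features": len(rows[0]),
        "missing_days": [d for d in all_days if d not in recorded],
    }


# Made-up sample: a week of records with two nights missing.
sample_rows = [
    {"date": date(2019, 1, 1), "minutes_asleep": 412},
    {"date": date(2019, 1, 2), "minutes_asleep": 388},
    {"date": date(2019, 1, 4), "minutes_asleep": 430},
    {"date": date(2019, 1, 5), "minutes_asleep": 401},
    {"date": date(2019, 1, 7), "minutes_asleep": 395},
]
summary = describe_dataset(sample_rows)
print(summary["rows"], summary["features"], summary["missing_days"])
```

Note that this only detects gaps between the first and last recorded dates; days missing before or after that range would need the expected start and end passed in explicitly.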

Getting the data

Obtaining the required data can be as easy as downloading a file from a website, or as complicated as having to code a solution to gather it. Regardless of the situation, I explain to the reader how they can access it. If the data comes from a website, I share a link to the source. If the case is the latter, and it requires a more technical solution, I guide the reader through all the necessary steps and share the code I used to obtain it so they can replicate the complete process.

The tools

Back in the “programming languages and code libraries” part, I pointed out the technology I’ll be using to perform my data analysis, and here, I’ll tell this directly to the reader. Besides just writing it down, I also point out the particularities of the platforms I’m about to employ: for instance, the version, how to obtain said software, and its cost (this rarely applies).
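
When the tools are Python libraries, the version information mentioned above can be gathered programmatically rather than typed by hand. The following is a minimal sketch that uses only the standard library; the `tool_versions` helper is a hypothetical name, and the package names passed to it are just examples, so the output depends on what is installed on your machine.

```python
import platform
from importlib import metadata


def tool_versions(packages):
    """Map the interpreter and each named package to its installed version."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = "not installed"
    return versions


# Example package names only; replace them with the libraries the article uses.
for tool, version in tool_versions(["pandas", "scikit-learn"]).items():
    print(f"{tool}: {version}")
```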

The story

The story. The article’s meaty part. No tale is similar to a previous one, and thus, I don’t have an exact way to describe how I do this part. My only remark is that I try to give each of the key questions or main topics a section of its own, and discuss everything related to it there.

The recap and conclusion

The story’s end. Usually, I open this part with another remark or anecdote similar to the one told in the introduction. I believe that with such a comment, I’ll be able to close the circle and remind the reader of the real meaning behind the story and behind all the technicalities I probably described in the previous thousand or two thousand words. (Plus, by this point the reader might be a bit overwhelmed by all the numbers and technical jargon I dropped, so it’s nice to change the mood a bit :P).

After sharing my witty comments and life lessons, I do a quick recap of what we discovered and summarize the answers to the key questions. For example, “…we found out that on average, I sleep XXX hours” or that “according to the time series analysis, on Fridays, I go to bed at 2 am”; the important thing here is keeping it simple and straightforward. Lastly, I like to mention any issues or difficulties encountered, post a link to the project’s source code, and say a few words about things I’d do or add if I ever go back to the project.

Extras

Call this the miscellaneous section. Typically, I use it to remind myself of the code samples presented in the post, to give examples of the “filler” or stock images I’d include, and to link to references.

And that’s the end!

(This article) Recap and conclusion

They say that if you do a task for a thousand hours, you’ll become an expert. Well, if that’s true, I don’t think it applies to the skill of writing. In the last few years, I’ve easily spent over a hundred hours writing, and while I’m far, very far, from being an expert, I’d say that I’ve become quite comfortable with my style. In this guide, I introduced a blueprint I’ve put together to guide me and help me write my data stories.

This framework is based on seven main points: tentative title, main idea, topics, questions to answer, methods/techniques to use, code and libraries, and article outline. By taking some time to brainstorm about these points before starting to smash the keyboard and build what is to become your next best-seller, you’ll be able to better shape the structure and essence of the upcoming piece. But beware! As I said in the beginning, this guide is far from perfect; it simply works for me, so I can’t assure you it’ll have the same effect on you. Nonetheless, let it serve as a base for yours :).

Good luck, happy writing, and thanks for reading.