Discovering the top nouns, verbs, entities and text similarity within the spoken lines of Earth’s Mightiest Heroes

After a long year of waiting, Avengers: Endgame is finally here. Like you and most of the world, I will be rushing to the cinema on day one to catch the movie and see how the Avengers save the world and bring a ten-year story to a close. To calm my nerves and ease the wait, I wanted to relive the previous movie, Infinity War, but in a different, interactive way. And since I am a data guy, of course it had to involve data and a couple of buzzwords.

The answer? Natural Language Processing, or NLP for short.

Using spaCy, an open-source NLP library for Python designed to help us process and understand volumes of text, I analyzed the script of the movie to investigate the following concepts:

  • Overall top 10 verbs, nouns, adverbs and adjectives from the film.
  • Top verbs and nouns spoken by a particular character.
  • Top 30 named entities from the film.
  • The similarity between the spoken lines of each character pair, e.g., the similarity between Thor’s and Thanos’ lines.

In this article, I will discuss and show my findings while explaining with code how I did it with spaCy.

Not interested in code and technical terms? Today is your lucky day! The vocabulary I use here is mostly non-technical and user-friendly, so even if you have no experience in NLP, AI, machine learning or *insert buzzword here*, you should be able to grasp the main ideas and concepts I want to convey. So, feel free to ignore the pieces of code :)

The Mad Titan. Credits: Marvel

Processing the data

The data, or text corpus as it is usually known in NLP, used for the experiment is the script of the movie, available at this link. However, before using the data, I had to clean it up. I removed unnecessary elements such as the comments that describe an action or scene, e.g. “[Thanos crushes the Tesseract, revealing the blue Space Stone…]”, as well as the name of the character who says each line (the name was used to know who said what, but not as part of the corpus used for the analysis). Moreover, as part of the spaCy data processing step, I’m ignoring the terms labeled as stop words, in other words, commonly used words such as “I”, “you”, and “an”. Also, I’m using only the lemma, that is, the canonical form, of each word. For instance, “talks,” “talked,” and “talking” are forms of the same lexeme, and their lemma is “talk”.
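The cleaning step can be sketched roughly as follows. This is a minimal sketch, not the exact code behind the article: it assumes each spoken line in the raw script has the form "SPEAKER: text" and that stage directions are wrapped in square brackets. The function name clean_script and the sample text are made up for illustration.

```python
import re

# Assumption: stage directions are wrapped in square brackets and each
# spoken line looks like "SPEAKER: text". The real script may differ.
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")
SPEAKER_LINE = re.compile(r"^\s*([^:\[\]]+?)\s*:\s*(.+)$")


def clean_script(raw_script):
    """Return a list of (speaker, spoken_text) pairs without stage directions."""
    pairs = []
    for raw_line in raw_script.splitlines():
        line = STAGE_DIRECTION.sub(" ", raw_line)
        line = " ".join(line.split())  # collapse the whitespace left behind
        match = SPEAKER_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Illustrative sample only, not the actual corpus.
sample = """[Thanos crushes the Tesseract, revealing the blue Space Stone]
Thanos: I know what it's like to lose.
Ebony Maw: Hear me, and rejoice. [He steps over the bodies]
"""
cleaned = clean_script(sample)
```

Keeping the speaker next to each line, instead of discarding it, is what later makes the per-character analysis possible while still leaving the names out of the analyzed text.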

To process a piece of text in spaCy, first, we need to load our language model, followed by calling the model on a text corpus. The result is a Doc object, an object that holds the processed text.
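In code, those two steps might look like the sketch below. The model name en_core_web_md is an assumption (any installed English pipeline would do), and the helper names build_doc and content_lemmas are hypothetical. The second helper applies the stop-word filtering and lemmatization described earlier, using the token attributes is_stop, is_punct, is_space and lemma_.

```python
def build_doc(text, model_name="en_core_web_md"):
    """Load a spaCy pipeline and call it on the text, returning a Doc."""
    # Imported here so that content_lemmas below can be tried without spaCy.
    import spacy

    nlp = spacy.load(model_name)  # assumption: this model is installed
    return nlp(text)


def content_lemmas(doc):
    """Lowercased lemmas of tokens that are not stop words, punctuation or spaces."""
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
```

With spaCy and a model installed, content_lemmas(build_doc("He talked and she talks.")) would be expected to reduce both verb forms to the same lemma, though the exact output depends on the model version.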

Now that we have a clean and processed corpus, it’s time to start!

Top 10 verbs, nouns, adverbs and adjectives

Is it possible to know the overall action, or the plot, of the film just by looking at the verbs? The first graph of this article addresses this.

“I know”, “you think” are some of the most common phrases

“Know”, “go”, “come”, “get”, “think”, “tell”, “kill”, “need”, “stop”, and “want”. What can we infer from this? Since I have seen the film a couple of times (which also implies that I am biased), I’m willing to conclude that, according to these verbs, Avengers: Infinity War is about knowing, thinking, and investigating how to go and stop something or someone.

This is how we can obtain the verbs with spaCy:
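A minimal sketch of that idea: iterate over the tokens of a Doc, keep those whose coarse part-of-speech tag (pos_) matches, and count their lemmas. The helper name top_lemmas is hypothetical, and this may differ in detail from the code in the repository.

```python
from collections import Counter


def top_lemmas(doc, pos, n=10):
    """Most common lemmas among non-stop-word tokens with the given POS tag.

    `doc` is a spaCy Doc (or any iterable of tokens exposing pos_, lemma_
    and is_stop); `pos` is a Universal POS tag such as "VERB" or "NOUN".
    """
    counts = Counter(
        token.lemma_.lower()
        for token in doc
        if token.pos_ == pos and not token.is_stop
    )
    return counts.most_common(n)
```

Calling top_lemmas(doc, "VERB") would give the verbs, and the same function with "NOUN", "ADV" or "ADJ" covers the other three charts in this section.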

What about the words that describe the verbs, namely the adverbs?

“I seriously don’t know how you fit your head into that helmet” — Doctor Strange

For a movie that is about stopping a guy from destroying half of the Universe, there was a lot of positivity in the spoken adverbs: words such as “right,” “exactly,” and “better” are examples of that.

So, we know the actions and how they were described. Now it is time to see the nouns.

“You will pay for his life with yours. Thanos will have that stone.” — Proxima Midnight

It is not surprising to see “stones” as the first result; after all, the movie is about them. In second position we have the term “life,” which is the thing that Thanos wants to destroy, followed by “time,” precisely what the Avengers didn’t have (note: “time” could also be attributed to mentions of the Time Stone).

Lastly, I will wrap up this section with the adjectives, or words that describe nouns. Similar to the adverbs, we have terms such as “good” and “right” that convey positivity, and terms like “okay” and “sure” that suggest affirmation.

“I’m sorry, little one.” — Thanos

Top verbs and nouns mentioned by a particular character

Previously, we saw the most common verbs and nouns mentioned in the movie. Although this knowledge gives us a sense of the overall feeling and plot of the film, it does not say much about the personal odyssey of our characters. Therefore, I applied the same procedure used to find the top ten verbs and nouns, but at the character level.
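Working at the character level mostly means grouping the spoken lines by speaker before processing them. A minimal sketch, assuming the cleaned script is available as (speaker, line) pairs; the helper name lines_by_character and the sample pairs are illustrative, not the real data:

```python
from collections import defaultdict


def lines_by_character(pairs):
    """Join the spoken lines of each character into one text per character."""
    grouped = defaultdict(list)
    for speaker, text in pairs:
        grouped[speaker].append(text)
    return {speaker: " ".join(texts) for speaker, texts in grouped.items()}


# Illustrative (speaker, line) pairs, not the actual script.
pairs = [
    ("Thor", "Bring me Thanos."),
    ("Rocket", "Quill, get us out of here."),
    ("Thor", "Rabbit, we need a hammer."),
]
texts = lines_by_character(pairs)
```

Each per-character text can then be turned into its own Doc and counted exactly like the full corpus.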

Since there are many characters in the movie, I selected only some of those who actually say a reasonable number of lines, plus some of my favorites :). These characters are Tony Stark, Doctor Strange, Gamora, Thor, Rocket, Peter Quill (Star-Lord), Ebony Maw, and Thanos. Sorry, Cap, you didn’t make the cut. However, in my GitHub repository (link is at the end of the article), you can find a folder with the graphs of every character.

The next images show the top nouns used by these characters.

What’s with Quill calling Drax so much?

I find it curious and even refreshing that in most cases the top nouns used by our dear heroes are mentions of the members of the crew. For example, Tony said “kid” nine times (referring to Spider-Man), Rocket called Quill (Star-Lord) three times, while Quill himself called (more like screamed at) Drax on seven occasions.

Upon further inspection, we can infer what seems to be the most important thing for each character. In the case of Iron Man, the data suggest that Earth is valuable to him. Similar to him is Gamora, who was always thinking about the higher goal (“life,” “universe,” and “planets”) and ultimately paid for it. Doctor Strange had another objective, protecting his stone, which he mentions repeatedly. Then there’s Thor, who has a personal vendetta against Thanos, saying his name eight times, and a new rabbit best friend. Lastly, there’s the Mad Titan himself, Thanos, who couldn’t stop thinking about gathering the Infinity Stones, or about his daughter.

While the nouns were expressive and significant, the same cannot be said about the verbs. As you will see in the next image, the verbs are not as colorful as the nouns. Words like “know,” “want,” and “get” take most of the top spots. Yet there’s one character who probably had the most unique verbs in the whole corpus: Ebony Maw. In case you don’t remember, Ebony Maw is Thanos’ top henchman. Like the good servant he is, his objective was, besides getting the Time Stone, to preach the mission of his master, a task he carried out using words such as “hear” and “rejoice.” Creepy.

“Hear me, and rejoice. You have had the privilege of being saved by the Great Titan…” — Ebony Maw

As a bonus, here are Groot’s top nouns.

“I am Groot”

Named Entities

So far we have explored the most common verbs, nouns, adverbs, and adjectives our heroes and villains utter throughout this epic motion picture. Yet, to give full meaning to all the words we have been scrutinizing, we need some context, namely, the named entities.

A named entity is, and I quote spaCy’s website, a “real-world object that is assigned a name”, for example a person, a country, a product or a book title. So, knowing the entities means being aware of the things the characters are talking about. In the spaCy package, each entity has a predicted label that categorizes it into one of many types, such as a person, a product, or a work of art, among others (https://spacy.io/api/annotation#named-entities), giving us an extra level of granularity that could be useful to further categorize them. For simplicity, however, I won’t use the entity type, just the entity itself.

These are the top 30 entities.

“MAYEFA YA HU” is the chant of Wakanda’s Jabari warriors

In first place we have Thanos, which is not surprising at all considering the movie is all about him. Following him is his daughter Gamora, one of the central figures in the film. Then, in third position, we have Groot (I’m sure I don’t have to explain why), followed by Tony, the other Avengers, and some locales such as New York, Asgard and Wakanda [Forever]. Besides heroes and places, two of the “six” (see entity no. 14) Infinity Stones, the Time Stone and the Soul Stone, appear on this list (positions 15 and 16 respectively). Surprisingly enough, the Mind Stone, the stone that brought Thanos to Earth, is not part of the list.

To access the entities in spaCy, read the ents property of a Doc like this:
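A minimal sketch of counting entities this way is shown below. It relies only on doc.ents and the text attribute of each entity span; the helper name top_entities is hypothetical, and the predicted type (available as ent.label_) is deliberately ignored, as in the article.

```python
from collections import Counter


def top_entities(doc, n=30):
    """Most common entity texts in doc.ents, ignoring the predicted label."""
    return Counter(ent.text for ent in doc.ents).most_common(n)
```

Because the count is based on the surface text, variants such as a first name and a full name are tallied separately, which can split a character’s mentions across several entries.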

The similarity between the spoken lines of each character pair

Back when we were discussing the top verbs per character, we realized that, unlike the nouns, most of the verbs were very similar and conveyed the same feeling. Terms like “go” and “come” give us the impression of movement, or the feeling that our characters wanted to go to or reach a particular place, and verbs such as “kill” and “stop” imply that there is indeed a huge threat that has to be stopped.

With this thought in mind, and to further investigate the concept of similarity, I calculated the similarity score between the spoken lines of each character pair.

The concept of similarity in NLP describes how close or related the semantic or syntactic meaning of two pieces of text is. Usually, the similarity score ranges from 0 to 1, where 0 means total dissimilarity and 1 means complete similarity (or that both texts are the same). Technically speaking, the similarity is computed by measuring the distance between the word vectors, namely, the multi-dimensional representations of words. For those interested in learning more about the topic, I recommend searching for word2vec, probably the most common algorithm used to generate these word embeddings. The image below presents the similarity matrix.
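To make the "distance between vectors" idea concrete, here is a small sketch of cosine similarity, a common way of comparing word vectors and, as far as I know, the measure spaCy uses by default. The tiny vectors are made up for illustration; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score 1 and perpendicular ones score 0. In general the cosine can also be negative, although scores for natural text tend to fall between 0 and 1.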

Again, Ebony Maw is the most unique character

Honestly, I wasn’t expecting this result. On one hand, I can accept that all the similarities are close to 1 since, after all, the movie has one main plot and associations across the conversations are to be expected. However, what makes me feel uneasy is the fact that the scores are very, very similar. I mean, look at Thanos’ scores; I was hoping to see a substantial difference in the scores of the villain when compared to those of the people who want to stop him. On a more positive note, it’s interesting to see that Peter Parker’s (Spider-Man) scores are the ones that vary the most. After all, he’s just a kid who got caught amid the chaos, so this result was somewhat expected.

This is an example of how to compute the similarity between two Doc objects in spaCy:
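For two Doc objects, the call is simply doc_a.similarity(doc_b). The sketch below extends it to every character pair; the helper name pairwise_similarity and the dictionary layout are assumptions for illustration. Note that meaningful scores require a model that ships with word vectors (for example a medium or large English model), since the small models do not include them.

```python
from itertools import combinations


def pairwise_similarity(docs_by_character):
    """Map each unordered character pair to the similarity of their Docs.

    `docs_by_character` maps a character name to a spaCy Doc holding all
    of that character's spoken lines.
    """
    return {
        (name_a, name_b): docs_by_character[name_a].similarity(
            docs_by_character[name_b]
        )
        for name_a, name_b in combinations(sorted(docs_by_character), 2)
    }
```

The resulting dictionary holds one score per pair, which is enough to fill a symmetric matrix like the one shown above.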

Recap and conclusion

In the film Avengers: Infinity War we follow an assembly of superheroes on their journey to stop Thanos, a menace whose objective is to obliterate half of the Universe’s life. Throughout the movie, we come to learn that most of these heroes have their own motives and motivations to save the world, which is reflected in the way they express themselves. In this article we explored, with the help of Python, NLP, and spaCy, how our heroes, and villains too, express themselves and communicate with each other by studying their spoken lines. By focusing on features such as their most used verbs and nouns, we learned, confirmed and relived Iron Man’s loyalty to Earth, Doctor Strange’s sworn duty to protect the Time Stone, Thor’s thirst for revenge, and Thanos’ ambition to fulfill his destiny.

Happy Avengers week, and see you again once the script of Endgame is available.

The code used to produce this experiment is available on my GitHub, at the following link: juandes/infinity-war-spacy

If you have any questions, comments, or doubts, or just want to chat, leave a comment here or on Twitter and I will be happy to help.