Turns out, the characters DO have distinctive patterns of using words.

The length of each bar shows how much more likely that character is to use that word than the other characters are.

Notice how Cap talks about people (especially Tony). The speech of T’Challa (Black Panther) is marked by noble topics, the opposite of Spiderman, who bumbles around like the teenager that he is. Hulk (Bruce Banner) and Clint (Hawkeye) are both notable for referring to Nat (Black Widow), although for different reasons. Vision and Scarlet Witch talk about very similar themes, which might explain why they seem to gravitate toward each other. Thor’s got his mind set on the bigger picture, leading directly into the events to come in Infinity War. Loki, unsurprisingly, is the character most likely to talk about power. Ultron wants power in an entirely different way, and is more poetic.
All of these patterns were identified by Elle O’Brien, who uses neural networks to generate predictive text for Botnik Studios. The visualization project was initiated during a meetup of Data Viz Jam Sessions, hosted by Nancy Organ.

Now, this is an exercise in visualizing the data using R.

Want to find out how this plot was made? Read on!



Here are the R packages that we will use:

library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)



Some people say that it’s bad form to use this “clear everything” line. I do it routinely at the top of a script to make sure that when I run it, it doesn’t depend on any objects that I accidentally left in the workspace.

rm(list = ls())


This is the folder that contains all the images

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)


Set the font. Note that windowsFonts() is only available on Windows.

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))


Simple version of the character names

character_names <- c("black_panther","black_widow","bucky","captain_america",
                     "falcon","hawkeye","hulk","iron_man",
                     "loki","nick_fury","rhodey","scarlet_witch",
                     "spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")
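
Since everything downstream depends on one image file per character, it can help to check that the files are actually present before reading them. The following is a minimal sketch, not part of the original script: `find_missing_images()` is a hypothetical helper, and the demo uses a throwaway temporary folder rather than the real image folder.

```r
# Hypothetical helper (not from the original script): return the expected
# image file names that are not present in the given folder.
find_missing_images <- function(filenames, folder = ".") {
  filenames[!file.exists(file.path(folder, filenames))]
}

# Demo in a temporary folder where only one of two expected files exists
demo_dir <- tempfile("avengers_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, "thor.jpg")))

missing_demo <- find_missing_images(c("thor.jpg", "loki.jpg"), folder = demo_dir)
missing_demo
```

In the real script, the equivalent call would be `find_missing_images(image_filenames)` after `setwd()`; an empty result means every character has an image.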



Function to read in the image file corresponding to the simple character name

read_image <- function(filename){
  # read the JPEG file into an array of pixel values
  img <- jpeg::readJPEG(filename)
  return(img)
}


Read all the images into one list

all_images <- lapply(image_filenames, read_image)


Assign names to the list of images, so they can be indexed by character

names(all_images) <- character_names


Here’s an example of how easy it is, using those names

# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))


Get the text data

This was collected by Elle O’Brien, using some fancy text mining analysis on the movie scripts.



I know that you won’t be able to load it on your own computer using this line (because you don’t have the file), but Elle might share it. If she wants to share it here, I’ll update this page.

load("Avengers_word_data.RData")


Correct the capitalization of proper names

capitalize <- Vectorize(function(string){
  substr(string,1,1) <- toupper(substr(string,1,1))
  return(string)
})

proper_noun_list <- c("clint","hydra","steve","tony",
                      "sam","stark","strucker","nat","natasha",
                      "hulk","tesseract", "vision",
                      "loki","avengers","rogers", "cap", "hill")

# Run the capitalization function
word_data <- word_data %>%
  mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
  mutate(word = ifelse(word == "jarvis", "JARVIS", word))
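
To see what the capitalization step does without the real data, here is a minimal sketch on a made-up word vector. The names `capitalize_first`, `demo_words`, and `demo_proper` are invented for illustration. It relies on `substr<-` and `toupper()` already being vectorized, so this version skips `Vectorize()`.

```r
# Illustrative variant of the capitalize step (names here are made up)
capitalize_first <- function(string) {
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  string
}

demo_words  <- c("tony", "shield", "nat", "jarvis")
demo_proper <- c("tony", "nat")

# Capitalize only the proper nouns, then handle the all-caps special case
demo_fixed <- ifelse(demo_words %in% demo_proper,
                     capitalize_first(demo_words),
                     demo_words)
demo_fixed[demo_fixed == "jarvis"] <- "JARVIS"
demo_fixed
## [1] "Tony"   "shield" "Nat"    "JARVIS"
```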


Notice that the simplified character names from before don’t match the nicely formatted character names in the text dataframe

unique(word_data$Speaker)
##  [1] "Black Panther"   "Black Widow"     "Bucky"
##  [4] "Captain America" "Falcon"          "Hawkeye"
##  [7] "Hulk"            "Iron Man"        "Loki"
## [10] "Nick Fury"       "Rhodey"          "Scarlet Witch"
## [13] "Spiderman"       "Thor"            "Ultron"
## [16] "Vision"


Make a lookup table to convert shorthand file names to pretty character names

character_labeler <- c(`black_panther` = "Black Panther",
                       `black_widow` = "Black Widow",
                       `bucky` = "Bucky",
                       `captain_america` = "Captain America",
                       `falcon` = "Falcon", `hawkeye` = "Hawkeye",
                       `hulk` = "Hulk", `iron_man` = "Iron Man",
                       `loki` = "Loki", `nick_fury` = "Nick Fury",
                       `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
                       `spiderman`="Spiderman", `thor`="Thor",
                       `ultron` ="Ultron", `vision` ="Vision")


Have two different versions of character names: one for display (pretty) and one for simple organization and for referring to image file names (simple).

convert_pretty_to_simple <- Vectorize(function(pretty_name){
  # pretty_name = "Vision"
  simple_name <- names(character_labeler)[character_labeler==pretty_name]
  # simple_name <- as.vector(simple_name)
  return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
  # simple_name = "vision"
  pretty_name <- character_labeler[simple_name] %>% as.vector()
  return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision","black_panther"))
## [1] "Vision"        "Black Panther"


Add simplified character names to the text data frame

word_data$character <- convert_pretty_to_simple(word_data$Speaker)
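
As an aside, the same pretty-to-simple lookup can be written with `match()`, which is vectorized by itself and returns `NA` for any speaker that is missing from the table. This is a sketch of an alternative, not the code used in this post; the `demo_` objects are made up for illustration.

```r
# Small stand-in for character_labeler (illustration only)
demo_labeler <- c(vision = "Vision",
                  thor = "Thor",
                  black_panther = "Black Panther")

demo_speakers <- c("Thor", "Vision", "Thor", "Black Panther", "Groot")

# match() gives the position of each speaker in the lookup table,
# and those positions index into the table's names
demo_simple <- names(demo_labeler)[match(demo_speakers, demo_labeler)]
demo_simple
## [1] "thor"          "vision"        "thor"          "black_panther"
## [5] NA
```

The `NA` for the unknown speaker makes a typo in the lookup table easy to spot with `anyNA()`.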


Assign a main color for each character

character_palette <- c(`black_panther` = "#51473E",
                       `black_widow` = "#89B9CD",
                       `bucky` = "#6F7279",
                       `captain_america` = "#475D6A",
                       `falcon` = "#863C43", `hawkeye` = "#84707F",
                       `hulk` = "#5F5F3F", `iron_man` = "#9C2728",
                       `loki` = "#3D5C25", `nick_fury` = "#838E86",
                       `rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
                       `spiderman`="#A23A37", `thor`="#323D41",
                       `ultron` ="#64727D", `vision` ="#81414F" )
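
Because the palette is matched to characters by name, a missing or misspelled name would leave that character's bars without its intended color. A quick sanity check might look like the sketch below. The `demo_` objects are made up for illustration; in the real script you would compare `character_names` against `names(character_palette)`.

```r
# Stand-ins for character_names and character_palette (illustration only)
demo_names   <- c("thor", "loki", "vision")
demo_palette <- c(thor = "#323D41", loki = "#3D5C25")

# Characters that have no color assigned
missing_colors <- setdiff(demo_names, names(demo_palette))
missing_colors
## [1] "vision"

# Every entry should be a six-digit hex color
all_hex <- all(grepl("^#[0-9A-Fa-f]{6}$", demo_palette))
all_hex
## [1] TRUE
```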


Make a horizontal bar plot

avengers_bar_plot <- word_data %>%
  group_by(Speaker) %>%
  top_n(5, amount) %>%
  ungroup() %>%
  mutate(word = reorder(word, amount)) %>%
  ggplot(aes(x = word, y = amount, fill = character))+
  geom_bar(stat = "identity", show.legend = FALSE)+
  scale_fill_manual(values = character_palette)+
  scale_y_continuous(name ="Log Odds of Word",
                     breaks = c(0,1,2)) +
  theme(text = element_text(family = "Franklin"),
        # axis.title.x = element_text(size = rel(1.5)),
        panel.grid = element_line(colour = NULL),
        panel.grid.major.y = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white",
                                    colour = "white"))+
  # theme(strip.text.x = element_text(size = rel(1.5)))+
  xlab("")+
  coord_flip()+
  facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot


This is pretty good.

But I want to plot something more ambitious: the character images should show through the bars.
The idea is to display the image only in the area of the bar, cutting it off at the bar endpoint.

To do this, we will display a transparent bar and then, starting at the bar endpoint, plot a white bar extending to the plot edge to cover up the rest of the picture.
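
Here is a rough sketch of that masking idea on toy data, with no picture involved. It is one possible way to lay out the two layers and not necessarily the final code. The data, `plot_edge`, and `bar_half` are assumptions made up for illustration. In a full version, the picture would be drawn first, underneath both layers.

```r
library(ggplot2)

# Toy data (made up for illustration; not Elle's word data)
toy <- data.frame(
  word   = factor(c("power", "throne", "army"),
                  levels = c("army", "throne", "power")),
  amount = c(2.1, 1.4, 0.9)
)
plot_edge <- 2.5   # assumed far edge of the plot area
bar_half  <- 0.45  # half of geom_col's default bar width (0.9)

mask_plot <- ggplot(toy, aes(x = word, y = amount)) +
  # transparent bar: this is where a picture would show through
  geom_col(fill = NA, colour = "grey40") +
  # white bar from the bar endpoint to the plot edge, hiding the rest
  geom_rect(aes(xmin = as.numeric(word) - bar_half,
                xmax = as.numeric(word) + bar_half,
                ymin = amount, ymax = plot_edge),
            inherit.aes = FALSE, fill = "white") +
  coord_flip()
mask_plot
```

The `geom_col()` layer comes first so that the word axis is set up as a discrete scale before the numeric rectangle positions are added.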