#TidyTuesday Behind The Scenes: Matrix Plot
About
For this TidyTuesday I created a simple matrix plot with ggplot2 and a few extension libraries. The plot is quick and easy to create, and in this tutorial I’ll walk through how it was built behind the scenes.
R Libraries
Let’s go ahead and import our libraries. For this graphic, I used the following packages. When it comes to data wrangling and reshaping, I mostly default to dplyr (part of the tidyverse).
The graphic we’ll create uses ggplot2, also part of the tidyverse, along with ggplot extension libraries such as ggimage and ggtext.
Lastly, we’ll use sysfonts and showtext to add some custom fonts. There’s no need to download anything by hand: sysfonts lets us pull fonts from Google’s font library.
#for data wrangling (dplyr) & graphing (ggplot2)
library(tidyverse)
#for plotting
library(ggimage)
library(ggtext)
#to bring in the data
library(tidytuesdayR)
#for fonts
library(sysfonts)
library(showtext)
#to preview data as tables
library(kableExtra)
Importing Data
There are a couple of ways we can import our TidyTuesday data. This week’s release includes three data sets. For our tutorial, we’ll only use two of them, characters and psych_stats, to produce our visual.
The R for Data Science team provides a link to the data, or we can use the tidytuesdayR package with tt_load to download it. Since the files are large and I don’t want to abuse the API rate limit, I’ll opt to read them directly with read.csv.
#alternative option to import with tidytuesdayR
#data <- tidytuesdayR::tt_load(2022, week = 33)
#characters<-data$characters
#ps<-data$psych_stats
#import data with read.csv
characters<-read.csv("https://raw.githubusercontent.com/tashapiro/open-psychometrics/main/data/characters.csv")
ps<-read.csv('https://raw.githubusercontent.com/tashapiro/open-psychometrics/main/data/psych_stats.csv')
#preview characters dataframe, first 5 records
kable(head(characters,5))%>%kable_styling(latex_options = "HOLD_position")
id | name | uni_id | uni_name | notability | link | image_link |
---|---|---|---|---|---|---|
F2 | Monica Geller | F | Friends | 79.7 | https://openpsychometrics.org/tests/characters/stats/F/2 | https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg |
F1 | Rachel Green | F | Friends | 76.7 | https://openpsychometrics.org/tests/characters/stats/F/1 | https://openpsychometrics.org/tests/characters/test-resources/pics/F/1.jpg |
F5 | Chandler Bing | F | Friends | 74.4 | https://openpsychometrics.org/tests/characters/stats/F/5 | https://openpsychometrics.org/tests/characters/test-resources/pics/F/5.jpg |
F4 | Joey Tribbiani | F | Friends | 74.3 | https://openpsychometrics.org/tests/characters/stats/F/4 | https://openpsychometrics.org/tests/characters/test-resources/pics/F/4.jpg |
F3 | Phoebe Buffay | F | Friends | 72.6 | https://openpsychometrics.org/tests/characters/stats/F/3 | https://openpsychometrics.org/tests/characters/test-resources/pics/F/3.jpg |
#preview psych_stats dataframe, first 5 records
kable(head(ps,5))%>%kable_styling(latex_options = "HOLD_position")
char_id | char_name | uni_id | uni_name | question | personality | avg_rating | rank | rating_sd | number_ratings |
---|---|---|---|---|---|---|---|---|---|
F2 | Monica Geller | F | Friends | messy/neat | neat | 95.7 | 9 | 11.7 | 1079 |
F2 | Monica Geller | F | Friends | disorganized/self-disciplined | self-disciplined | 95.2 | 27 | 11.2 | 1185 |
F2 | Monica Geller | F | Friends | diligent/lazy | diligent | 93.9 | 87 | 10.4 | 1166 |
F2 | Monica Geller | F | Friends | on-time/tardy | on-time | 93.8 | 34 | 14.3 | 236 |
F2 | Monica Geller | F | Friends | competitive/cooperative | competitive | 93.6 | 56 | 13.4 | 1168 |
Reshaping & Cleaning Data
One of the most important steps in creating a data visualization is UNDERSTANDING the data you’re working with. I easily spend 10-15 minutes (if not more) combing through the data. Are there missing values? Are values standardized? Is there a data dictionary I can reference to make sense of different fields?
The characters data set contains one row per character, with the character name, universe name, and related links. The psych_stats data set has a many-to-one relationship with characters: each record represents a personality item for a character (and there are ~400 items per character).
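A quick way to confirm that relationship is to count rows per key. The sketch below uses small stand-in data frames (characters_toy and ps_toy are hypothetical names, with a few values copied from the previews above) rather than the full downloads, so treat it as an illustration of the check and not as part of the original workflow.

```r
library(dplyr)

# Toy stand-ins for the real data frames (values copied from the previews above)
characters_toy <- data.frame(
  id   = c("F2", "F1"),
  name = c("Monica Geller", "Rachel Green")
)
ps_toy <- data.frame(
  char_id  = c("F2", "F2", "F2"),
  question = c("messy/neat", "diligent/lazy", "on-time/tardy")
)

# One row per character? TRUE when no id is repeated
one_row_per_character <- !any(duplicated(characters_toy$id))

# How many personality items does each character have?
items_per_character <- ps_toy %>% count(char_id, name = "n_items")

# Characters with no ratings at all (in this toy data, Rachel Green)
unrated <- characters_toy %>% anti_join(ps_toy, by = c("id" = "char_id"))
```

Running the same three checks against the real characters and ps data frames should show whether every character has roughly the same number of items.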
Digging into the personality evaluations, avg_rating in psych_stats never falls below 50, because it is always measured toward whichever personality extreme the character was assigned (e.g. for messy/neat, a character is labeled either messy or neat, and avg_rating is >= 50 for that label). This makes it tricky to compare characters against each other, so let’s clean this up with dplyr.
sc<-ps%>%
  #filter to just see characters from Schitt's Creek. use two personality items
  filter(uni_name=="Schitt's Creek"
         & question %in% c("genuine/sarcastic","cynical/gullible"))%>%
  #grab the last half of the question with sub & regex
  mutate(anchor = sub("^(.+?)\\/","",question))%>%
  #select lets us subset our columns - let's grab the ones we need for plotting
  select(char_id, char_name, question, personality, anchor, avg_rating)
kable(head(sc,5))%>%kable_styling(latex_options = "HOLD_position")
char_id | char_name | question | personality | anchor | avg_rating |
---|---|---|---|---|---|
SsC3 | David Rose | genuine/sarcastic | sarcastic | sarcastic | 86.2 |
SsC3 | David Rose | cynical/gullible | cynical | gullible | 68.6 |
SsC4 | Alexis Rose | cynical/gullible | gullible | gullible | 72.4 |
SsC4 | Alexis Rose | genuine/sarcastic | sarcastic | sarcastic | 57.8 |
SsC1 | Johnny Rose | genuine/sarcastic | genuine | sarcastic | 57.4 |
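To see what the sub() pattern does on its own, here is a small sketch that applies it to a few question strings taken from the tables above. The second pattern, which keeps the left-hand side instead, is my own addition for comparison and is not used in the plot.

```r
# A few question strings taken from the previews above
questions <- c("genuine/sarcastic", "cynical/gullible", "on-time/tardy")

# Same pattern as in the pipeline: drop everything up to and including the "/"
anchors <- sub("^(.+?)\\/", "", questions)

# Mirror image (not used in the tutorial): drop the "/" and everything after it
left_side <- sub("\\/.*$", "", questions)

print(anchors)
print(left_side)
```

The anchor is always the right-hand word of the pair, which is why David Rose's "cynical" rating shows up with "gullible" as its anchor.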
We used dplyr::mutate (with the help of some regex) to create our new anchor field. It holds one of the two personality extremes: the right-hand side of the question. We’ll use this anchor field to rescale avg_rating.
If the character’s personality doesn’t match the anchor, we’ll set our new rating, rescaled, to 100 - avg_rating (e.g. if someone is 60 genuine, they’re now 40 sarcastic). We can use case_when to set up our if/then rules.
sc<-sc%>%
  mutate(rescaled = case_when(anchor!=personality ~ 100-avg_rating,
                              TRUE ~ avg_rating))
kable(head(sc,5))%>%kable_styling(latex_options = "HOLD_position")
char_id | char_name | question | personality | anchor | avg_rating | rescaled |
---|---|---|---|---|---|---|
SsC3 | David Rose | genuine/sarcastic | sarcastic | sarcastic | 86.2 | 86.2 |
SsC3 | David Rose | cynical/gullible | cynical | gullible | 68.6 | 31.4 |
SsC4 | Alexis Rose | cynical/gullible | gullible | gullible | 72.4 | 72.4 |
SsC4 | Alexis Rose | genuine/sarcastic | sarcastic | sarcastic | 57.8 | 57.8 |
SsC1 | Johnny Rose | genuine/sarcastic | genuine | sarcastic | 57.4 | 42.6 |
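The rescaling rule can be checked by hand on a few rows. The sketch below rebuilds three rows from the table above in a toy data frame (toy is a hypothetical name) and uses if_else, which behaves the same as the two-branch case_when when there is only one condition; that swap is my choice for brevity, not what the original code does.

```r
library(dplyr)

# Three rows copied from the table above
toy <- data.frame(
  char_name   = c("David Rose", "David Rose", "Johnny Rose"),
  personality = c("sarcastic", "cynical", "genuine"),
  anchor      = c("sarcastic", "gullible", "sarcastic"),
  avg_rating  = c(86.2, 68.6, 57.4)
)

# Flip the score onto the anchor's side whenever the labels disagree
toy <- toy %>%
  mutate(rescaled = if_else(anchor != personality, 100 - avg_rating, avg_rating))

print(toy)
```

After rescaling, every score for a given question points the same way, so a 31.4 and a 72.4 on gullible can be compared directly.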
The penultimate step in our data reshaping process: we need to convert the data from a long format to a wide format. Currently, each record represents one personality item per character. The end goal is one record per character, with a column for each personality item’s score. This is a perfect use case for tidyr::pivot_wider! It’s almost like the tidyverse has something for every scenario…
sc<-sc%>%
  #subset data again, we can get rid of avg_rating and question
  select(char_id, char_name, anchor, rescaled)%>%
  #reshape data - we want to use this for a matrix plot with x & y points for
  pivot_wider(names_from=anchor, values_from=rescaled)
kable(head(sc,5))%>%kable_styling(latex_options = "HOLD_position")
char_id | char_name | sarcastic | gullible |
---|---|---|---|
SsC3 | David Rose | 86.2 | 31.4 |
SsC4 | Alexis Rose | 57.8 | 72.4 |
SsC1 | Johnny Rose | 42.6 | 45.6 |
SsC2 | Moira Rose | 70.2 | 30.0 |
SsC5 | Stevie Budd | 90.1 | 14.7 |
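For anyone new to pivot_wider, here is a minimal sketch of the long-to-wide step on four rows taken from the tables above. The data frame names long and wide are placeholders for this example only.

```r
library(dplyr)
library(tidyr)

# Long format: one row per character per personality item
long <- data.frame(
  char_name = c("David Rose", "David Rose", "Alexis Rose", "Alexis Rose"),
  anchor    = c("sarcastic", "gullible", "gullible", "sarcastic"),
  rescaled  = c(86.2, 31.4, 72.4, 57.8)
)

# Wide format: one row per character, one column per anchor
wide <- long %>%
  pivot_wider(names_from = anchor, values_from = rescaled)

print(wide)
```

Every column that is not named in names_from or values_from acts as an identifier, which is why dropping question and avg_rating beforehand matters: leaving them in would keep the rows from collapsing to one per character.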
And as a quick finisher, we’ll also include the image links for each character. The characters data set has an image_link field. We can use left_join to combine the two data sets.
sc<-sc%>%left_join(characters%>%select(id, image_link), by=c("char_id"="id"))
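Because the join keys have different names in the two data sets (char_id and id), it can be worth checking that every character actually picked up a link. The sketch below is an assumption-laden illustration: scores and links are toy data frames, and "X0" is a made-up id included only to show what an unmatched row looks like.

```r
library(dplyr)

# Toy scores table; "X0" is a made-up id with no match in links
scores <- data.frame(
  char_id   = c("F2", "F1", "X0"),
  char_name = c("Monica Geller", "Rachel Green", "Hypothetical Character")
)

# Toy lookup table using image links from the characters preview above
links <- data.frame(
  id = c("F2", "F1"),
  image_link = c(
    "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg",
    "https://openpsychometrics.org/tests/characters/test-resources/pics/F/1.jpg"
  )
)

# left_join keeps every row of scores; unmatched rows get NA for image_link
joined <- scores %>% left_join(links, by = c("char_id" = "id"))

# Rows that would have no picture to plot
missing_links <- joined %>% filter(is.na(image_link))
print(missing_links)
```

If missing_links comes back empty on the real data, every character has a picture to plot.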
The Fun Part, Plotting!
Base Plot
Let’s see what our initial plot looks like with our freshly reshaped data.
ggplot(data=sc, mapping=aes(x=sarcastic, y=gullible))+
geom_text(aes(label=char_name))
Font Set Up
Before we rebuild the plot, let’s take a minute to set up our fonts. Using different fonts is such an easy way to elevate the aesthetic of your plot. I love using sysfonts because I can call in different Google fonts without downloading them by hand. To make sure they render properly in our plot, we’ll also call showtext_auto().
#import fonts from sysfonts package
sysfonts::font_add_google("roboto")
sysfonts::font_add_google("DM Serif Display","dm")
showtext_auto()
Starting from Scratch
The base plot isn’t much to look at, but we can start to see our matrix forming. We’ll start from scratch and rebuild it with some new elements.
Let’s draw our matrix lines first with geom_segment. Instead of character names, we’ll introduce our friend ggimage::geom_image to plot the characters’ pictures using the image_link field.
We’ll also add the character names back in with geom_label. Since we don’t want to plot the name over the picture, we’ll shift the y coordinate so the label sits slightly beneath the image.
Finally, we’re going to ditch ggplot’s default theme. To completely clear it out, I like using theme_void. Let’s add our own theme in this step too, to give the plot a dark mode vibe! We can do this by setting plot.background to an element_rect with a black fill.
plot<-ggplot(data=sc, mapping=aes(x=sarcastic, y=gullible))+
  #lines for matrix, use the arrow argument to draw arrowheads at both ends
  geom_segment(mapping=aes(x=0, xend=100, y=50, yend=50),
               arrow=arrow(length=unit(0.1,"inches"), ends="both"),
               color="#FFED47")+
  geom_segment(mapping=aes(y=0, yend=100, x=50, xend=50),
               arrow=arrow(length=unit(0.1,"inches"), ends="both"),
               color="#FFED47")+
  #use geom_image instead of geom_text to plot character pictures
  geom_image(aes(image=image_link), size=0.07)+
  #add character label beneath image, adjust by subtracting a little from y value
  geom_label(aes(label=char_name, y=gullible-7.5),
             fill="black", color="white", size=3.5)+
  #clear out theme
  theme_void()+
  theme(plot.background = element_rect(fill="black", color=NA))
plot
Custom Axis Labels
Since we ditched our axis text with theme_void, we need to add some text back in to guide our audience. Let’s add new labels (e.g. SARCASTIC, GENUINE) at the respective ends of the arrows with geom_text. We can use the angle argument to rotate the labels.
plot<-plot+
  geom_text(mapping=aes(label="GENUINE",x=-10, y=50),
            angle=90, size=5, color="white")+
  geom_text(mapping=aes(label="SARCASTIC",x=110, y=50),
            angle=-90, size=5, color="white")+
  geom_text(mapping=aes(label="GULLIBLE",x=50, y=110),
            size=5, color="white")+
  #white text here too, otherwise the label disappears on the black background
  geom_text(mapping=aes(label="CYNICAL",x=50, y=-10),
            size=5, color="white")
plot
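One design note on fixed labels like these: when constants are placed inside aes(), the layer still inherits the plot's data, so the same label is drawn once per row. ggplot2's annotate() is an alternative that draws a fixed element exactly once. The sketch below is not how the original graphic was built; it uses a two-row toy data frame and plain geom_text in place of images so it runs without ggimage or any downloads.

```r
library(ggplot2)

# Toy data: two characters from the reshaped table
toy <- data.frame(
  char_name = c("David Rose", "Alexis Rose"),
  sarcastic = c(86.2, 57.8),
  gullible  = c(31.4, 72.4)
)

p <- ggplot(toy, aes(x = sarcastic, y = gullible)) +
  # horizontal matrix line, drawn once regardless of how many rows toy has
  annotate("segment", x = 0, xend = 100, y = 50, yend = 50,
           arrow = arrow(length = unit(0.1, "inches"), ends = "both"),
           color = "#FFED47") +
  # fixed axis labels at each end of the arrow
  annotate("text", x = -10, y = 50, label = "GENUINE",
           angle = 90, size = 5, color = "white") +
  annotate("text", x = 110, y = 50, label = "SARCASTIC",
           angle = -90, size = 5, color = "white") +
  # character names stand in for the images in this sketch
  geom_text(aes(label = char_name), color = "white") +
  theme_void() +
  theme(plot.background = element_rect(fill = "black", color = NA))

p
```

The visual result should be close to the geom_text version, with slightly crisper labels since nothing is overplotted.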
Adding in Title
I have a new growing obsession with ggtext. It gives ggplotters a whole new level of flexibility when it comes to adding labels to our plot. To get really fancy, it helps to know some basic HTML and CSS. Here’s how I created the title text:
title = "<span style='font-size:24pt;color:white;font-family:dm;'>**Schitt's**</span><span style='font-size:24pt;color:#FFED47;font-family:dm'> **Creek**</span><br>
<span style='font-size:11pt;color:white;font-family:roboto;'>Character Personality Matrix. Data from the Open-Source Psychometrics Project.</span>"
caption = "<span style='color:white;'>Graphic by </span><span style='color:#FFED47;'>@tanya_shapiro</span>"
I agree, the code is not pretty to look at for this part. But take a look at what happens when we add our new labels with ggplot2::labs and tweak our plot title with ggtext::element_textbox_simple!
plot+
  labs(title=title, caption=caption)+
  #adjust theme
  theme(plot.title=element_textbox_simple(halign=0.5),
        plot.caption=element_textbox_simple(halign=0.95),
        #margin() takes top, right, bottom, left as separate arguments
        plot.margin=margin(20,20,20,20))
That’s a Wrap!
That concludes the behind the scenes for my TidyTuesday plot this week. If you have any questions, please feel free to shoot me a Tweet @tanya_shapiro. Thank you!