I published this paper recently and some folks have asked me to share my code used to produce the figures. So in this post I’ll reproduce the main panel of Figure 1 (below) and also jazz things up with an interactive version of the figure. Here’s the figure:
First we’ll need the output data from the study, which is available on Zenodo and can be grabbed straight from the web using
library(vroom) vroom("https://zenodo.org/record/5813878/files/supply%20contributions.csv?download=1", comment = "#", col_types = cols()) -> supply_contrib head(supply_contrib)
## # A tibble: 6 × 6 ## city supply_index state region division city_population ## <chr> <dbl> <chr> <chr> <chr> <dbl> ## 1 Akron | OH 1 OH Midwest East North C… 198006 ## 2 Albuquerque | NM -0.14 NM West Mountain 560218 ## 3 Amarillo | TX -0.88 TX South West South C… 199924 ## 4 Atlanta | GA 1 GA South South Atlant… 498044 ## 5 Augusta-Richmond | GA 0 GA South South Atlant… 196939 ## 6 Aurora | CO 0.9 CO West Mountain 374114
The data include a
supply_index variable for each of 116 U.S. cities. The supply index (-1 to 1) indicates the share of each city’s water supply coming from surface and groundwater resources. A supply index of 1 means 100% surface water, 0 means even split, and -(1) means 100% groundwater.
The plot is made with
ggplot2, and is essentially showing the distribution of the “supply index” across cities in each major U.S. region. The first thing we need to do is wrangle the data a litte. First, convert the city population variable to units of millions of people (avoid unnecessary zeros in legend). Then convert the
division variables to factors to dictate the order in which these appear in the figure and figure legend.
library(dplyr) supply_contrib %>% mutate(city_population = city_population * 1e-6, region = factor(region, levels = rev(c("Northeast", "South", "Midwest", "West"))), division = factor(division, levels = c("New England", "Middle Atlantic", "South Atlantic", "East North Central", "East South Central", "West North Central", "West South Central", "Mountain", "Pacific"))) -> supply_contrib_plot_ready
One way of plotting these distributions is simply to plot the points in a line:
library(ggplot2) ggplot(supply_contrib_plot_ready, aes(region, supply_index, fill = division, size = city_population)) + geom_point(pch = 21, alpha = 0.8)
This gives a feel for how cities are distributed according to water supply split, but it isn’t a particularly attractive or informative plot. There are some problems, such as multiple points at the extremes of each distribution overlapping and preventing us from viewing the number of cities with either 100% surface or 100% groundwater making u supply.
A useful way to avoid points overlapping in a distribution like this is to jitter the points, which we can do by switching
ggplot(supply_contrib_plot_ready, aes(region, supply_index, fill = division, size = city_population, text = city)) + geom_jitter(pch = 21, alpha = 0.8, width = 0.3, height = 0) -> p_basic_jitter p_basic_jitter
We see that for each region many of the cities are reliant entirely on surface water to meet public water demand. The
height arguments in
geom_jitter specify how much “jittering” is allowed. By setting
height = 0 we ensure the actual point values remain true while allowing some random jittering in the horizontal direction (within each category of data, or each census region in this case) using
width = 0.3.
Next we’ll create a better color scheme with similar colors for cities within each region, providing some further nuance for census divisions. The
RColorBrewer “paired” palette is useful:
library(RColorBrewer) division_pal <- brewer.pal(12, "Paired") division_pal_ordered <- division_pal[c(1, 2, 5, 7, 12, 8, 11, 4, 3)] p_basic_jitter + scale_fill_manual(values = division_pal_ordered) + # remove grey background and gridlines theme_classic() -> p_paired_colors p_paired_colors
Use a simple
coord_flip() to switch the axes. Then make a few other minor adjustments too the layout and add some lines to represent the extremities and mid-point:
p_paired_colors + coord_flip() + scale_x_discrete(expand = expansion(0.4)) + scale_y_continuous(expand = expansion(0.2)) + geom_hline(yintercept = 0, alpha = 0.5, linetype = 2) + geom_hline(yintercept = -1, alpha = 0.5) + geom_hline(yintercept = 1, alpha = 0.5) + theme(axis.line = element_blank(), axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_text(size = 9)) -> p_flipped_and_cleaned p_flipped_and_cleaned
Finally, add some labeling and annotate:
p_flipped_and_cleaned + # set point size breaks for legend scale_size_continuous(breaks = c(1, 4, 8), range = c(1, 12)) + # remove self-explanatory axis names and neaten up legend labels labs(y = NULL, x = NULL, fill = "Census division", size = "City population (mil.)") + # add labels and arrows annotate("text", x = 5, y = 0.5, label = "Surface water dominated", size = 3) + annotate("text", x = 5, y = -0.5, label = "Groundwater dominated", size = 3) + annotate("segment", y = 0.02, yend = 0.98, x = 4.8, xend = 4.8, arrow = arrow(length = unit(0.02, "npc"), type = "closed")) + annotate("segment", y = -0.02, yend = -0.98, x = 4.8, xend = 4.8, arrow = arrow(length = unit(0.02, "npc"), type = "closed")) + annotate("label", x = 0, y = 0, label = "Balanced", size = 3) + annotate("label", x = 0.2, y = 1, label = "100%\nsurface water", size = 3) + annotate("label", x = 0.2, y = -1, label = "100%\ngroundwater", size = 3) + # increase legend point sizes guides(fill = guide_legend(override.aes = list(size=4))) -> p_final p_final
Finally, something we can’t do in a journal article (yet)… an interactive version that allows the reader to hover the cursor over each point to view the city:
library(plotly) ggplotly(p_final, tooltip = "text") %>% hide_legend()