2020 was a year when many scientific facts were distorted into politicized tools, and vaccination was no exception to this polarization. However, data is not subject to such manipulation and distortion; by truthfully visualizing data, we are able to observe clear trends that allow us to extrapolate relationships. Whether vaccination is effective in preventing disease outbreaks or is simply a politicized tool of the left, the data will tell us.
We first load the relevant data in R. If the data were not available in an R package, we would have to scrape it from the web (which I will post about in the future), but luckily the dslabs package gives us access to data about contagious diseases. In this post, we will specifically focus on measles.
library(dplyr)
library(ggplot2)
library(dslabs)
data(us_contagious_diseases)
The measles vaccine was introduced in 1963, so we will begin by examining the raw data for a single year shortly afterwards, 1967. Even for that one year there is a row for every state, roughly 50 in all, which makes it hard to see any remarkable trends in a raw table. Therefore, using the filter function, we keep only the rows where the disease is measles, the count is positive, and the population is not missing. Then we add a rate column (cases per 10,000 people, scaled to a full 52 weeks of reporting) and reorder the states by rate so that the bars appear sorted. Because long category labels such as state names are much easier to read on the y axis, we flip the coordinates so that rate runs along the x axis and state along the y axis.
dat <- us_contagious_diseases %>% filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
mutate(state=reorder(state,rate))
dat %>% ggplot(aes(state, rate)) +
geom_bar(stat="identity") +
coord_flip()
Using this code, we get the following result for the year 1967.
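To make the rate formula and the reordering step concrete, here is a small sketch in base R. The three states and all of their numbers are made up for illustration and are not taken from us_contagious_diseases; only the formula and the reorder call mirror the code above.

```r
# Made-up numbers for three hypothetical states (not from the dataset)
toy <- data.frame(
  state           = c("A", "B", "C"),
  count           = c(1500, 200, 900),
  population      = c(3000000, 1000000, 2000000),
  weeks_reporting = c(26, 52, 39)
)

# Cases per 10,000 people, scaled up to a full 52-week year
toy$rate <- toy$count / toy$population * 10000 * 52 / toy$weeks_reporting

# reorder() sorts the factor levels by rate, lowest first
toy$state <- reorder(toy$state, toy$rate)

print(toy$rate)           # 10 2 6
print(levels(toy$state))  # "B" "C" "A"
```

Note that reorder sorts the levels from the lowest rate to the highest. After coord_flip the first level is drawn at the bottom, so the state with the highest rate ends up at the top of the chart.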

After graphing this, I suddenly became curious whether there is a relationship between the region of a state and its rate of infection! The dataset us_contagious_diseases does not record the region of each state, so we will instead extract that information from the dataset murders and merge it into us_contagious_diseases.
data(murders)
region <- murders %>% select(state, region)
dat <- us_contagious_diseases %>% filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) %>%
left_join(region, by = "state") %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
mutate(state=reorder(state,rate))
dat %>% ggplot(aes(state, rate)) +
geom_col(aes(fill=region)) +
coord_flip()
The result suggests that states in the West tend to have higher rates, while those in the Northeast tend to have the lowest.
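A bar chart coloured by region gives only a visual impression, so it can help to back it up with a per-region summary. The sketch below uses base R's aggregate on a made-up data frame; the states, regions, and rates are hypothetical stand-ins for the joined dat, not real results. The same summary could be written with dplyr's group_by and summarize.

```r
# Toy stand-in for the joined data; all values are made up for illustration
toy <- data.frame(
  state  = c("A", "B", "C", "D", "E", "F"),
  region = c("West", "West", "West", "Northeast", "Northeast", "South"),
  rate   = c(9.0, 4.0, 6.5, 0.5, 1.5, 3.0)
)

# Median rate per region
by_region <- aggregate(rate ~ region, data = toy, FUN = median)

# Sort from the highest median rate to the lowest
by_region <- by_region[order(-by_region$rate), ]
print(by_region)
```

The median is used here rather than the mean so that a single state with an unusually high rate does not dominate its region.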

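Now we turn to the changes before and after vaccination, which is the real purpose of our visualization. Because dat currently holds only the year 1967, we first rebuild it with every year of measles data. We will begin with California, using the following code, this time with the geom_line function.
dat <- us_contagious_diseases %>% filter(disease=="Measles" & !is.na(population)) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting)
dat %>% filter(state == "California" & !is.na(rate)) %>%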
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000")
If we want to see more clearly how the number of cases plummets after vaccination, we can add a final line that marks 1963, like this:
dat %>% filter(state == "California" & !is.na(rate)) %>%
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept=1963, col = "blue")
and we would get the following result.

A clear decrease after 1963, when the vaccine was introduced, is observable. However, this trend might not hold for other states. It would be better if we could see the trends of the individual states.
Recall the visualization principle that a 2D graph with many intersecting lines is preferable to a 3D graph, since we generally want to avoid 3D. What we want to see is the difference after the introduction of vaccination, and a 3D graph would add so much needless detail about the individual states that it would distract us from the general trend. Therefore, we draw every state as a faint line on a single 2D graph.
dat %>%
filter(!is.na(rate)) %>%
ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_vline(xintercept = 1963, col = "blue")

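But the flaw of this graph is that so many lines overlap that only the most visible trends (the steep decline after the introduction of vaccination, or the single spike around 2000) can be observed. Hence, we compute the US average for each year and add it as one more line to show how the overall number of cases changed.
# US average per 10,000: total cases over total population (one common way to define it)
avg <- us_contagious_diseases %>% filter(disease=="Measles") %>% group_by(year) %>%
summarize(us_rate = sum(count, na.rm = TRUE) / sum(population, na.rm = TRUE) * 10000)
dat %>%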
filter(!is.na(rate)) %>%
ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1, col = "black") +
scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_text(data = data.frame(x = 1955, y = 50),
mapping = aes(x, y, label = "US average"), color = "black") +
geom_vline(xintercept = 1963, col = "blue")
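A national rate such as us_rate can be formed in more than one way, and the choice matters. The sketch below contrasts the plain mean of the state rates with a pooled rate (total cases over total population). The two states and their numbers are made up for illustration and are not real data.

```r
# Made-up counts and populations for two hypothetical states
toy <- data.frame(
  state      = c("Small", "Large"),
  count      = c(100, 1000),
  population = c(100000, 10000000)
)

# Per-state rates per 10,000 people
state_rate <- toy$count / toy$population * 10000

# Unweighted mean of the state rates: every state counts equally
mean_of_rates <- mean(state_rate)

# Pooled rate: total cases over total population, so large states weigh more
pooled_rate <- sum(toy$count) / sum(toy$population) * 10000

print(state_rate)     # 10 1
print(mean_of_rates)  # 5.5
print(pooled_rate)    # about 1.09
```

The pooled rate answers the question "how many cases per 10,000 people in the whole country", while the mean of rates treats a small state and a large state as equally important. For a line labelled "US average", the pooled version is usually the more natural reading.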
A line in the middle has been added to show the US average each year:

Another way to convey this information is a tile plot, which we can build with the geom_tile function as follows. This graph is my personal favourite, as it clearly shows the change in each state for each year.
library(RColorBrewer)
dat %>% ggplot(aes(year, state, fill=rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand = c(0,0)) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept = 1963, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle("Measles") +
ylab("") +
xlab("")
We get this beautiful graph:

However, a drawback of this graph is that the colours can be misleading. The same intensity can look stronger and darker in a vivid colour such as red than in a colour that stands out less, such as light purple. Also, although there is a key that maps colour to rate, we can only estimate the values, because our eyes cannot read off the exact RGB value of a tile. Encoding the rate as colour is one of the greatest strengths of this graph but, at the same time, one of its greatest weaknesses.
Regardless, the tile graph remains my favourite, as it conveys the information we wish to convey just as effectively as the other graphs while being more aesthetically pleasing.