Welcome

This is my final write-up for my data visualization independent study with Dr. Mine Çetinkaya-Rundel.


Case Study 1: A Better Visualization

In this assignment, I work to create what I think is a better visualization for a dataset. The dataset, named tornadoes, is a record of tornadoes that occurred in the United States between 1950 and 2015. The original dataset (linked below) contains additional variables that are not relevant for this assignment.

Note that this assignment is exploratory in nature, so I try several visualizations before arriving at the final product; those intermediate visualizations are also included here.

Here is the link to the data.

The original visualization of part of this data consists of two pie charts: one shows the distribution of tornadoes by scale and the other shows the distribution of tornado-related deaths, also by tornado scale. The pie charts were used by ‘Tornado Project Online’ to show that, although tornadoes of scales 4 and 5 are rare, they cause more deaths than all the other tornadoes combined.

Here is the link to the original charts.

Below are screenshots of the pie charts as they appeared on the site on February 7, 2019.

knitr::include_graphics("../www/piechart_one.png", dpi = 400)

knitr::include_graphics("../www/piechart_two.png", dpi = 400)

The tornado scale used here is the Enhanced Fujita (EF) scale. The EF scale is based on the damage the tornado causes and it ranges from 0 to 5, with 5 being the most violent tornadoes. According to the National Weather Service of the United States, the scale of a tornado is determined by first identifying the appropriate damage indicators from a list of 28. For a given indicator, the degree of damage is determined from a list of 8. Each degree of damage is assigned a range of wind speed, and it is from these ranges that the scale is finally decided.
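
To make the wind-speed ranges concrete, here is a small illustration of my own (not part of the dataset or the original assignment): the commonly cited 3-second-gust ranges in mph can be mapped to EF categories with cut(). The dataset already stores an EF rating, so this is purely for intuition.

# Illustration only: map estimated 3-second-gust wind speeds (mph) to EF categories.
ef_from_wind = function(wind_mph) {
  cut(wind_mph,
      breaks = c(65, 85, 110, 135, 165, 200, Inf),
      labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5"),
      include.lowest = TRUE)
}
ef_from_wind(c(80, 120, 190, 210))  # EF0 EF2 EF4 EF5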

I decided to create a better way to visualize the distribution of tornadoes and tornado-related deaths for two reasons: (1) I find it unnecessary to use 3-D (and exploded) pie charts, and a flat, non-exploded pie chart would already be an improvement; but (2) pie charts in general can be confusing to interpret and do not always present the data in an easily digestible way. It can be hard to keep track of the labels (often there are labels for both the percentages and the groups involved) and to see the relationship between the sizes of the slices and the groups they represent.

My idea of a better visualization is a bar graph. From this point forward, I try different visualizations with the goal of coming up with something better than the pie charts.

Below is one that shows the percentage of tornadoes by scale.

# Load packages and the data. (A setup chunk may already load these; shown here for completeness.)
library(readxl)    # read_excel()
library(tidyverse) # ggplot2, dplyr, etc.
tornadoes = read_excel("../Datasets/tornadoes.xls")
ggplot(data = tornadoes) + geom_bar(aes(x = EF))

The bar graph above has counts on the y-axis. The one below uses percentages, which makes the distribution of tornadoes across scales easier to see.

ggplot(data = tornadoes) + 
  geom_bar(mapping = aes(x = EF, y = ..prop..)) + 
  scale_y_continuous(labels=scales::percent) + 
  ylab("Percentage") 

To make it even easier to see the distribution, I add percentages at the top of the individual bars. See below.

ggplot(tornadoes, aes(x = EF)) + 
  geom_bar(aes(y = ..prop..), stat="count", position = position_dodge()) +
  geom_text(aes(label = scales::percent(round(..prop..,3)), y = ..prop..), 
  stat = "count", vjust = -.2, position = position_dodge(.9)) +
  scale_y_continuous(labels = scales::percent) + ylab("Percentage")

I give the graph a title, more descriptive axis labels, and some color to make it more visually appealing.

tornadoes_plot = ggplot(tornadoes, aes(x = EF)) + 
  geom_bar(aes(y = ..prop..), stat="count", position = position_dodge(),
  fill = "maroon") +
  geom_text(aes(label = scales::percent(round(..prop..,3)), y = ..prop..), 
  stat= "count", vjust = -.2, position =     position_dodge(.9)) +
  scale_y_continuous(labels = scales::percent) + 
  ylab("Percentage") + 
  labs(title = "Percentage of Tornadoes by Scale", x = "Scale (0 - 5)") + 
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5))
tornadoes_plot

To visualize tornado-related deaths, I first create a table that sums deaths by EF scale. I then use this table to create a plot of deaths vs. EF scale.

deaths = tornadoes %>% group_by(EF) %>% summarise(deaths = sum(fat))
deaths
## # A tibble: 6 x 2
##      EF deaths
##   <dbl>  <dbl>
## 1     0     23
## 2     1    227
## 3     2    587
## 4     3   1282
## 5     4   2357
## 6     5   1347
deaths_plot = ggplot(deaths, aes(x = EF, y = deaths)) + 
  geom_count() + 
  theme_light() + 
  labs(title = "Deaths vs. Scale", x = "Scale (0 - 5)") + 
  theme(legend.position = "none") + 
  theme(plot.title = element_text(hjust = 0.5))
deaths_plot

Below are the two graphs plotted side by side. On the left (percentage of tornadoes by scale), it can be seen that the number of tornadoes decreases as the scale increases. On the right (deaths vs. scale), it can be seen that, in general, deaths increase with scale. I think it is now easier to see the trend in both graphs: one can notice an increase or a decrease without having to constantly refer back to labels to understand what is happening.

library(gridExtra) # for grid.arrange()
grid.arrange(tornadoes_plot, deaths_plot, ncol = 2)

Working on feedback:

In the graph below, I flip the axes to get a horizontal bar graph that looks better and is easier to read. Besides flipping the axes, I also remove the unnecessary percentage labels from what becomes the horizontal axis and add more descriptive EF labels on what becomes the vertical axis. This is the Percentage of Tornadoes by Scale graph.

tornadoes_plot1 = ggplot(tornadoes, aes(x = EF)) +
  geom_bar(aes(y = ..prop..), stat="count", position = position_dodge(),
  fill = "maroon") +
  geom_text(aes(label = scales::percent(round(..prop..,3)), y = ..prop..),
  stat= "count", hjust = -.1, position = position_dodge(0.9)) +
  scale_y_continuous(labels = scales::percent, limits = c(0,0.5), breaks = NULL) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                   labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5")) +
  labs(title = "Percentage of Tornadoes by Scale", x = NULL, y = NULL) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
tornadoes_plot1

The graph below is also horizontal, but it keeps all the axis labels and does not have percentages on the individual bars. It is the Counts of Tornado-related Deaths by Scale graph. This is a bar graph of two variables (as opposed to one variable vs. tornado counts in the previous one), so it is hard to get the value axis to behave the same way as in the previous graph. I leave it like that for now.

tornadoes_plot2 = ggplot(data = tornadoes) + 
  geom_bar(aes(x = EF, y = fat), stat = "identity", fill = "maroon") +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5")) +
  labs(title = "Counts of Tornado-related Deaths by \n Scale", x = NULL, y = NULL) +
  scale_y_continuous(limits = c(0,2500)) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
tornadoes_plot2
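
As an aside, one way around this difficulty (a sketch only; I do not adopt it here) is to pre-compute the total deaths per scale with summarise(), which turns this back into a one-value-per-bar problem, so the bars can be labelled directly and the value axis dropped, just as in the previous graph.

# Sketch (not used in the write-up): pre-compute totals, then style like the previous graph.
tornadoes %>%
  group_by(EF) %>%
  summarise(total_deaths = sum(fat)) %>%
  ggplot(aes(x = EF, y = total_deaths)) +
  geom_col(fill = "maroon") +
  geom_text(aes(label = total_deaths), hjust = -0.1) +
  scale_y_continuous(limits = c(0, 2800), breaks = NULL) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5")) +
  labs(title = "Tornado-related Deaths by Scale", x = NULL, y = NULL) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()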

It was suggested that I plot EF against death rate (number of deaths per tornado) to get a different graph. I was not able to do this. The variable fat in the dataset is the number of deaths per tornado, so it is already a death rate; in that sense, the graph below is already a distribution of tornado scale by death rate.

I tried different ways to show the relationship between EF and deaths, keeping in mind all the suggestions, but I did not find one that looks better than the one below. The ones I tried looked very strange and were hard to interpret. I also tried to find a way to put the information from both pie charts into one graph, but this was not successful either.

Given these challenges, I decided to leave both graphs as horizontal bar charts for now. Below, the graphs are presented side by side. One can see that as the intensity of a tornado increases, the number of tornadoes decreases (on the left), but the number of tornado-related deaths increases (on the right). This is what the two pie charts were trying to show, now as bar graphs.

grid.arrange(tornadoes_plot1, tornadoes_plot2, ncol = 2)

Edit (February 11):

I figured out how to show all the information in one chart. I grouped deaths by EF scale, calculated the mean number of deaths for each scale, and plotted EF scale against that mean. This can also be thought of as deaths per tornado. It seems to be what the earlier suggestion about plotting EF vs. death rate meant (it is - updated February 20). Below is the graph.

tornadoes %>%
  
  group_by(EF) %>%
  summarize(mean_deaths_per_sc = mean(fat)) %>%
  
  ggplot(data = .) +
  geom_bar(aes(x = EF, y = mean_deaths_per_sc), stat = "identity", fill = "maroon") +
  labs(x = "EF Scale", y = "Deaths per EF Scale", title = "Average Deaths by Scale") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5")) +
  coord_flip()

Correction: the Average Deaths by Scale plot above still does not show the information from both of the pie charts. It only shows the expected number of deaths for a tornado given its EF scale.

While creating that plot, I thought of boxplots. A boxplot could show the same information, so I try one below.

ggplot(data = tornadoes) +
  geom_boxplot(aes(x = EF, y = fat, group = EF)) +
  labs(x = "EF Scale", y = "Deaths", title = "Deaths by Scale") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5")) + 
  coord_cartesian(ylim = c(0,30))

# scale_y_continuous(limits = ...) removes data that fall outside the limits before
# the statistical calculations are performed, so the quartiles and other summaries
# are affected. The alternative is coord_cartesian(ylim = ...), which 'zooms' in
# without removing data or affecting the summaries.

The boxplot above indeed shows the same information. However, it is hard to see where the medians for the lower scales are. I have tried zooming in as much as possible, and I can only make out the medians for EF4 and EF5.
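
One possible workaround, noted here only as a sketch and not used in the final graph, is a log1p-transformed y-axis (available through the scales package that ggplot2 relies on). It keeps all the data while spreading out the small counts, so the individual non-zero death counts at the lower scales are easier to distinguish.

# Sketch only: a log1p y-axis keeps the large outliers without crushing the small counts.
ggplot(data = tornadoes) +
  geom_boxplot(aes(x = EF, y = fat, group = EF)) +
  scale_y_continuous(trans = "log1p") +
  labs(x = "EF Scale", y = "Deaths (log1p scale)", title = "Deaths by Scale") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0", "EF1", "EF2", "EF3", "EF4", "EF5"))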

To show the information from both pie charts, I go back to two side-by-side graphs. This time, however, I use the Average Deaths by Scale graph and a version of the Percentage of Tornadoes by Scale graph. Moreover, the graphs diverge away from each other, giving a diverging bar chart with a shared central axis. I use the Average Deaths by Scale graph rather than the Counts of Tornado-related Deaths by Scale graph because it shows the increasing trend in deaths by tornado scale better. The Average Deaths by Scale graph also corrects the problem that a given scale can have more deaths simply because there were many tornadoes of that scale.

In the code below, I redo the Percentage of Tornadoes by Scale graph. The new graph does not use percentages and does not have numbers stacked on the bars. I am switching to counts because percentages are unnecessary in this particular case: we can get an idea of how the tornadoes are distributed by EF scale just by looking at the size of the bars. If one wants to know the exact number of tornadoes of a given scale, that information is on the axis, which now shows tornado counts.

tornadoes_left = ggplot(data = tornadoes) + 
  geom_bar(mapping = aes(x = EF), fill = "black") + 
  scale_y_continuous(trans = "reverse") + 
  labs(title = "Tornado Count by Scale", y = "Tornado Count\n", x = NULL) + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  coord_flip()

I also redo the Average Deaths by Scale graph so that I can easily merge the two graphs later on. For both graphs, I change the fill color to a combination that works well.

tornadoes_right <- tornadoes %>%
  group_by(EF) %>%
  summarize(mean_deaths_per_sc = mean(fat)) %>%
  
  ggplot(data = .) +
  geom_bar(aes(x = EF, y = mean_deaths_per_sc), stat = "identity", fill = "black") +
  labs(x = NULL, y = "Deaths per EF Scale\n", title = "Average Deaths by Scale") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.ticks.y = element_blank()) +
  scale_x_continuous(breaks = c(0,1,2,3,4,5),
                     labels = c("EF0   ", "EF1   ", "EF2   ", "EF3   ", "EF4   ",
                                "EF5   ")) +
  coord_flip()

Merging the graphs:

library(grid) # textGrob() and gpar()
grid.arrange(tornadoes_left, tornadoes_right, ncol = 2,
             top = textGrob("\n\nThough Fewer, the More Violent Tornadoes Cause More Deaths per Tornado\n",
                            gp = gpar(fontsize = 15, font = 1)))

And this will be my final graph. It clearly shows the information from the two pie charts.


Case Study 2: Femmes 2019

The aim of this assignment was to write code that connects to a survey in order to analyze and visualize the data provided by the surveyed population. To do this, we used RStudio's R Markdown to write the analysis and visualization code and Google Forms to gather the data.

We started by creating the Google Form. It contained eight questions targeted at fifth-grade girls participating in the FEMMES data visualization workshop here at Duke University. For variety, the questions gathered both categorical and numeric data. We then created a Google Sheet to store the responses. In an R Markdown file, we implemented the code for analyzing and visualizing the survey data, using a package called googlesheets to link the Google Sheet with the R Markdown file.

For the visualizations, we included two scatter plots to show correlation between two variables; these were meant to show FEMMES workshop participants the difference between correlated and uncorrelated data. We also included two bar plots to show how one can visualize categorical data and to illustrate some of the fun things one can do in R, for example changing the color of individual bars to virtually any color one prefers. Lastly, we included one animated plot to show students that it is also possible to animate visualizations in R.

Gmail login info: email - femmesworkshop2019@gmail.com
password - femmes2019

Although I provide the login information here, I have changed the code slightly so that one does not have to log in to obtain the data. The code saves the spreadsheet as a CSV file. Using the data this way makes some of the code unnecessary, so it has been commented out.

Below is the actual assignment.

# gs_auth(new_user = TRUE)

To see what is in the Google Drive, use gs_ls().

# quest_res_list <- gs_title("Femmes Questionnaire Responses")
# quest_res_df <- for_gs_sheet <- gs_read(quest_res_list)
# write.csv(quest_res_df, file = "quest_responses_df.csv")
quest_responses_df <- read_csv("../Assignment01/quest_responses_df.csv") %>%
  select(-X1)

Changing column names:

names(quest_responses_df) = c("time_stamp", "num_siblings", "fav_color", "height", "shoe_size", "fav_chocolate", "fav_pet", "fav_season", "day_old") #"reaction_one", "reaction_two", "reaction_three", "reaction_four")
head(quest_responses_df, n = 10)
## # A tibble: 10 x 9
##    time_stamp num_siblings fav_color height shoe_size fav_chocolate fav_pet
##    <chr>             <dbl> <chr>      <dbl>     <dbl> <chr>         <chr>  
##  1 2/23/2019…            6 Blue          52      13.5 Milk chocola… Dog    
##  2 2/23/2019…            1 Other         62       4   Dark chocola… Bird   
##  3 2/23/2019…            1 Blue          48       7   Dark chocola… Dog    
##  4 2/23/2019…            1 Other         52       4   Dark chocola… Dog    
##  5 2/23/2019…            2 Green         48       5.5 Milk chocola… Cat, F…
##  6 2/23/2019…            2 Blue          63       5   Dark chocola… I don'…
##  7 2/23/2019…            1 Other         61       9.5 Milk chocola… Dog, F…
##  8 2/23/2019…            2 Blue          57       3   I don't like… I don'…
##  9 2/23/2019…            2 Other         61       6   Milk chocola… I don'…
## 10 2/23/2019…            2 Green         63       9   Milk chocola… Dog    
## # … with 2 more variables: fav_season <chr>, day_old <dbl>

Correlation between shoe size and number of siblings?

ggplot(quest_responses_df) +
  geom_point(aes(x = num_siblings, y = shoe_size)) +
  geom_smooth(aes(x = num_siblings, y = shoe_size), method = "lm", formula = y ~ x, se = F) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Any Relationship Between Shoe Size and Number of Siblings?", x = "Number of Siblings", y = "Shoe Size") +
  theme_light()

Correlation between shoe size and height?

ggplot(quest_responses_df) + 
  geom_point(aes(x = shoe_size, y = height)) +
  geom_smooth(aes(x = shoe_size, y = height), method = "lm", formula = y ~ x, se = F) +
  labs(title = "Any Relationship Between Shoe Size and Height?", x = "Shoe Size", y = "Height") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))

Your Favorite Color:

color_pallete <- c(
  "Red" = "red",
  "Blue" = "blue",
  "Green" = "green",
  "Pink" = "pink",
  "Other" = "gray88" # maybe something else here?
)

legend_title <- "Color"

ggplot(quest_responses_df) + 
  geom_bar(aes(x = fav_color, fill = fav_color)) +
  labs(x = "Color", title = "Your Favorite Color") +
  theme_light() +
  theme(legend.position = "right",
        legend.title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5)) +
  # a single fill scale: supplying two scale_fill_manual() calls would make
  # the second silently replace the first (and drop the limits)
  scale_fill_manual(legend_title,
                    values = color_pallete,
                    limits = names(color_pallete))

Favorite chocolate:

color_pallete <- c(
  "White chocolate" = "papayawhip",
  "I don't like chocolate 😬" = "black", #arbitrary
  "Milk chocolate" = "tan3",
  "Dark chocolate" = "chocolate4"
)

legend_title <- "Type"

ggplot(quest_responses_df) +
  geom_bar(aes(x = fav_chocolate, fill = fav_chocolate)) +
  labs(x = "Chocolate Type", title = "Your Favorite Chocolate") +
  theme_light() +
  theme(legend.position = "right",
        legend.title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5)) +
  # again, one combined fill scale instead of two competing ones
  scale_fill_manual(legend_title,
                    values = color_pallete,
                    limits = names(color_pallete))

# quest_responses_df_narrow = gather(quest_responses_df, "reaction_one", "reaction_two", "reaction_three", "reaction_four", key = "trial", value = "reaction")
# head(quest_responses_df_narrow)

Animated Change in Reaction:

# animated <- ggplot(quest_responses_df_narrow) + 
#   geom_point(aes(x = time_stamp, y = reaction, color = time_stamp), size = 3) +
#   theme(plot.title = element_text(hjust = 0.5)) +
#   labs(title = "Reaction times for trial", x = "Student", y = "Reaction times") +
#   # labs(title = "Reaction times for trial {frame_state}", x = "Student", y = "Reaction times")
#   # we want to make the title so that it changes for each trial but we don't know how to
#   theme_bw() +
#   theme(
#     axis.ticks.x = element_blank(),
#     axis.text.x = element_blank(),
#     legend.position = "none",
#     plot.title = element_text(hjust = 0.5)
#   ) +
#   transition_states(
#     states = trial,  
#     transition_length = 3,
#     state_length = 2) + 
#   enter_fade() +
#   exit_shrink() +
#   ease_aes('sine-in-out')
# animated
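
On the dynamic title that the comment above asks about: if I read the gganimate documentation correctly, transition_states() exposes a glue variable called closest_state, so something like the following (untested with this data) should update the title per trial.

# labs(title = "Reaction times for {closest_state}",
#      x = "Student", y = "Reaction times")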


Case Study 3: CLT Debugging

Debugging the CLT_mean Shiny applet

The bug: During applet startup, the error message ‘invalid arguments’ briefly appears in the main panel.

Finding the bug: To find the bug, I first considered the error messages. The fact that the errors mentioned ‘arguments’ suggested that the cause lay in how values are passed to the reactive functions that produce the outputs. Moreover, since the same error was repeated, I suspected a single mistake was causing it and its effect was propagating throughout the applet. Looking at the code, the values used repeatedly are those created by renderUI and displayed in the UI. So I read up on renderUI and uiOutput and found that, when they are used to dynamically populate the user controls, the inputs they create can be momentarily non-existent when the applet first launches. So the bug was likely caused by the renderUI inputs.

Fixing the bug: To fix the bug, I used req() to check the validity of all renderUI inputs before using them. req() checks whether an input is available; if it is missing or invalid, it silently stops the reactive instead of letting an error message be displayed. This solution worked. The only problem is that, since req() also stops execution of the code that follows it, using req() kept the graphs from being displayed until the user clicked on the radio buttons a few times. Other options such as validate() seemed to cause the same effect, and I could not find a clear reason why.
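
To illustrate the pattern (a minimal standalone sketch, not the applet's actual code; the input and output names here are made up), this is roughly how a renderUI-created input gets guarded with req() inside the render function that consumes it:

library(shiny)

ui <- fluidPage(
  uiOutput("n_slider"),   # control built on the server with renderUI
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$n_slider <- renderUI({
    sliderInput("n", "Sample size", min = 10, max = 500, value = 50)
  })

  output$hist <- renderPlot({
    req(input$n)  # input$n does not exist until renderUI has run; req() waits silently
    hist(rnorm(input$n))
  })
}

# shinyApp(ui, server)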

Other changes: There are a few things I changed in the code. These are as follows:

  1. I added wellPanel() to separate the user inputs from the additional information in the user input column. I think these look better when separated, as they are not related.

  2. I removed all the req() calls for inputs not created inside renderUI. These always have a default value, so using req() on them is unnecessary.

  3. I removed the error message that appears when the maximum is less than the minimum for the uniform distribution. Instead of showing an error, the maximum now adjusts automatically whenever the minimum is set above it, and the minimum adjusts automatically whenever the maximum is set below it (a sketch of this pattern appears after this list). For this to work, I had to set the highest possible maximum to 21 instead of 20.

Since this adjustment takes a few seconds, an error would briefly appear after, for example, the user sets a minimum greater than the maximum. To hide this, instead of plotting a graph during that time, I display the message ‘plots reloading …’. Once the values have been updated, the actual plot is drawn.

  4. The last change I made was to restyle the title of the CLT graph so that there is some space between the title and the frame of the graph.
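
Here is a rough sketch of the auto-adjust pattern mentioned in item 3 (standalone and simplified; the input IDs and ranges are placeholders, not the applet's real ones):

library(shiny)

ui <- fluidPage(
  sliderInput("minval", "Minimum", min = 0, max = 21, value = 0),
  sliderInput("maxval", "Maximum", min = 0, max = 21, value = 20)
)

server <- function(input, output, session) {
  # nudge the other slider whenever the two would cross
  observeEvent(input$minval, {
    if (input$minval >= input$maxval)
      updateSliderInput(session, "maxval", value = input$minval + 1)
  })
  observeEvent(input$maxval, {
    if (input$maxval <= input$minval)
      updateSliderInput(session, "minval", value = input$maxval - 1)
  })
}

# shinyApp(ui, server)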

Update:

To fix the problem caused by req(), I changed all req() calls so that an input is only required right before it is used. Previously, there was a single line that required all the inputs from renderUI at once. Every time I deleted some of the inputs from this line, the code would work without errors, and I noticed that the inputs left in the line were the same inputs used first by the function executing next. So, instead of requiring all of the inputs together before the function call, I changed the function to call req() each time an input was about to be used. It turns out that, used this way, req() does not prevent the graphs from being drawn.

Additional changes:

  1. During a meeting, we decided to replace the two sliders for the maximum and minimum of the uniform distribution with a single range slider that gives both values. This way, there is no longer any need to check whether the maximum is greater than the minimum, so I removed all the code that did this.

  2. I included windowTitle = "CLT for means" so that the browser tab shows a shorter title.

  3. I redid all the plots with ggplot2. This includes some stylistic changes to the text to match the new plots. I shortened the title of the sampling distribution plot, as the information was redundant: there is a similar description right below the graph.

  4. I changed the layout of the app to include three tabs, a suggestion from my supervisor; the visualizations are now spread over these tabs (a rough sketch of the layout follows this list). I also changed the ordering of the help text in the user input section to shorten the column so it fits the screen. These changes make it easier for the user to see everything in one place without having to scroll up and down.

  5. For aesthetics, I added a background color to the sample distribution plots to match the rest of the plots in the app and changed the fill colors of the other plots.
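
For reference, here is a rough sketch of the tabbed layout described in item 4 (the tab and output names are placeholders, not the app's actual ones):

library(shiny)

ui <- fluidPage(
  titlePanel("CLT for means", windowTitle = "CLT for means"),
  sidebarLayout(
    sidebarPanel(
      # user inputs and help text go here
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Population Distribution", plotOutput("pop_dist")),
        tabPanel("Samples", plotOutput("sample_dist")),
        tabPanel("Sampling Distribution", plotOutput("sampling_dist"))
      )
    )
  )
)

# server <- function(input, output, session) { ... }
# shinyApp(ui, server)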

Link to final product: CLT_mean



Case Study 4: Learn R Tutorial

For this assignment, I created an R tutorial guided by my first case study. In the tutorial, I explain the code I used for that case study and let a potential learner practice it. I also improve the visualizations, so the final products for this assignment look slightly different from those in the first assignment.

Link to the assignment: learn tutorial