Jonathan C. Johnson

MATH 4910/5010 - R Lab 3

In this lab you will get some experience with visualizing data in R.

Objectives:

Learn about the grammar of graphics
Learn how to visualize data using functions from the ggplot2 and plotly packages
Learn how to create 2D and 3D scatter plots in R
Get experience performing basic statistical analysis in R

You can find the R markdown template and the datasets used for this lab on Canvas in the course files under the "R Lab 3" folder. Follow the instructions below. Instructions in green indicate tasks that should be completed in the R markdown file for this lab.

Packages and data

Now that we can load data into R, let's see how we can "look at" our data in R. We will be using functions from the plotly package and from ggplot2, which is part of the tidyverse. Be sure that each of these packages is installed on your machine.

In code block 1, write code to load the tidyverse and plotly packages into R.

We'll also need some data to work with. You should have found five .csv files on Canvas. The first two files, AAPL.csv and MSFT.csv, contain lifetime stock price data for Apple and Microsoft, respectively. The third file, harry_potter.csv, contains information about characters from the book series Harry Potter. The fourth file, spotify_youtube.csv, contains data about 20,000 songs available for streaming on Spotify. The last file, gaia_data.csv, contains data from the Gaia space observatory for approximately 2,500 celestial objects (stars, clusters, stellar clouds). We will also be using one of the built-in datasets, txhousing, which contains data for home sales in Texas from 2000 to 2015.

In code block 2, load the Texas home sales data into the variable homesales by adding the following line of code.

> homesales <- txhousing

Add code to code block 2 that loads the data from AAPL.csv, MSFT.csv, harry_potter.csv, spotify_youtube.csv, and gaia_data.csv into variables called apple, microsoft, hp_chars, spotify, and stars, respectively.

The Grammar of Graphics

There are many ways to visualize data in R, but ggplot2 is one of the most elegant and versatile. ggplot2 implements a system of data visualization called the grammar of graphics. You can read more about the grammar of graphics in this summary of Hadley Wickham's article "A Layered Grammar of Graphics".

The idea of the grammar of graphics is to separate a graph into distinct components. Data is visualized with a default set of data, a default mapping from variables to aesthetics, some number of distinct layers, a scale for each aesthetic, a coordinate system, and a separation of data into subsets (faceting). Each layer is composed of one statistical transformation, one geometric object, one positional adjustment, and optionally a dataset and mapping (if not using the default data and mapping).

Components of the Grammar of Graphics

Defaults
- Data
- Mapping
Layers
- Data
- Mapping
- Statistical Transformation
- Geometry
- Positional Adjustment
Scale
Coordinate System
Faceting

Consider the following graph.

A graph made with ggplot2

Data: This graph uses data from a dataset of information about the settled countries and territories of the world.
Mapping: Here we map the variable Continent to the x-axis.
Statistical Transformation: The y-axis is not mapped to a specific column of the dataset. Instead, we compute the frequency of each distinct value in the Continent column.
Geometry: This graph uses the geometry of a bar graph.
Positional Adjustment: This is hard to see from the graph, but it covers all the ways the location of the graphics was adjusted from the default locations. For example, we could make the bars smaller or move the legend around.
Scale/Coordinate System: This component includes all of the axis scaling and labeling of the graph.
Faceting: This graph does not use any faceting. Below we added faceting by each country's currency minor unit (the number of decimal places used in the country's currency).

A graph with faceting
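
The statistical transformation and faceting described above can be checked in code. The sketch below is only an illustration: it assumes the mpg dataset (shipped with ggplot2) as a stand-in for the countries dataset, which is not part of this lab, and it uses layer_data() to look at the values ggplot2 computed for the bars.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) stands in for the countries dataset.
p <- ggplot(mpg, aes(x = class)) +
    geom_bar()    # geom_bar() counts the rows for each x value by default

# layer_data() returns the values ggplot2 computed for a layer.
computed <- layer_data(p, 1)

# The bar heights match a manual frequency count of the same column.
manual <- as.vector(table(mpg$class))
print(computed$count)
print(manual)

# Faceting splits the same graph into one panel per value of another column.
p_faceted <- p + facet_wrap(vars(drv))
```

Typing p or p_faceted at the console draws the corresponding graph.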

Graphing with ggplot2

To create a graph using ggplot2 we use the ggplot function. The arguments of this function are a default dataset and a default mapping. We already have experience passing datasets as arguments, but passing a mapping as an argument in R is new to us. This is done using the aes function.

The aes function usually takes an 'x' and a 'y' argument, although some geometries need only one of them. Most of the time, we pass the names of the columns we want to use as these arguments.
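
As a small sketch of what a mapping is, the code below builds one with aes and inspects it before attaching it to a plot. The mpg dataset and its columns displ and hwy ship with ggplot2 and are used here only as an assumed stand-in for the lab data.

```r
library(ggplot2)

# aes() does not look at any data; it only records which columns to use.
m <- aes(x = displ, y = hwy)
print(names(m))

# The columns are looked up later, in whatever dataset the plot uses.
# Assumption: mpg (shipped with ggplot2) is used purely for illustration.
p <- ggplot(mpg, m) + geom_point()
```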

Let's start by working with the Apple stock price data from the year 2020. The first column of the apple dataset is the date, stored using the Date data type. These values print like strings, but R knows how to treat them as dates.
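
To get a feel for date columns before plotting, here is a minimal sketch. The prices table is hypothetical and its values are made up; it assumes the dplyr and lubridate packages (both installed with the tidyverse) are available.

```r
library(dplyr)
library(lubridate)

# Hypothetical miniature price table; the values are made up for illustration.
prices <- data.frame(
    Date  = as.Date(c("2019-12-31", "2020-01-02", "2020-06-15", "2021-01-04")),
    Close = c(10, 11, 12, 13)
)

print(class(prices$Date))    # the column's data type
print(year(prices$Date))     # year() extracts the year as a number

# Keep only the rows from 2020, as the lab does with the apple data.
prices_2020 <- filter(prices, year(Date) == 2020)
print(prices_2020)
```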

Type the following code into the R console.

> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))

If you get an empty graph, that's great. That's what's supposed to happen.

The code in the previous step only specified the data and mapping of our graph. We have not told R how we want the data to be displayed. To do that we need to specify a geometry. For a standard scatter plot, we can use the geom_point function.
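
One way to see why a plot without a geometry is empty is to count its layers. This is a rough sketch that assumes the mpg dataset (shipped with ggplot2) in place of the stock data, and that the layers of a plot object can be read with $layers.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) stands in for the stock data.
base <- ggplot(mpg, aes(x = displ, y = hwy))   # data and mapping only
with_points <- base + geom_point()             # adds one layer

print(length(base$layers))          # layers before adding a geometry
print(length(with_points$layers))   # layers after adding geom_point()
```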

Type the following code into the R console.

> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))+geom_point()

Now that's more like it.

Typically when displaying stock price, we use a line graph. To do this, we can use geom_line instead of geom_point. Also, let's use a different color. Notice that the code below uses the British spelling "colour"; ggplot2 accepts "color" as well.

> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close)) + geom_line(colour='darkgreen')

We can add layers by adding more geometries. When adding geometries, if we don't want to use the default data and mapping, we have the option of specifying what data and mapping to use for that geometry.

Let's add the stock price data for Microsoft to this graph.

> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close)) + geom_line(colour='darkgreen') + geom_line(data=filter(microsoft,year(Date)==2020),colour='orange')

We can use functions like xlim(), ylim(), xlab(), ylab(), and labs() to change the scaling and labeling of our graph. Let's scale the y-axis so that it starts at zero. Also, we should label the y-axis with a clearer description. A legend would be nice. Oh yeah, we should also add a title. Then, we can add a nice black and white theme.

Writing a line of code for graphs can get out of hand very quickly. To improve the readability of our code, let's format it using the grammar of graphics as our guide.

Grammar of Graphics Standard Syntax

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
    <GEOM_FUNCTION>(
        data = <DATA>,
        mapping = aes(<MAPPINGS>),
        stat = <STAT>,
        position = <POSITION>
    ) +
    <COORDINATE_FUNCTION> +
    <FACET_FUNCTION>
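
As an illustration of how the placeholders in this template might be filled in, here is a sketch that assumes the mpg dataset (shipped with ggplot2) as example data. The layer's data slot is left out, so the layer falls back to the default dataset.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) is used here only as example data.
p <- ggplot(data = mpg, mapping = aes(x = class)) +
    geom_bar(
        mapping = aes(fill = drv),   # layer-specific mapping
        stat = "count",              # statistical transformation
        position = "stack"           # positional adjustment
    ) +
    coord_flip() +                   # coordinate function
    facet_wrap(vars(year))           # facet function
```

Typing p at the console draws the graph: horizontal stacked bars, with one panel for each model year in mpg.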

Type the code below into the R console.

> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))+
     geom_line(mapping=aes(color='Apple'))+
     geom_line(
         data=filter(microsoft,year(Date)==2020),
         mapping=aes(color='Microsoft')
     )+
     theme_bw()+
     ylim(0,250)+
     ylab("Price per Share at Market Close ($)")+
     labs(title="Stock Price of Apple and Microsoft Stock in 2020")+
     scale_color_manual(name="Stock", values=c('Microsoft'='orange','Apple'='darkgreen'))

Now that's what I call a graph!

Notice that we had to create a legend using the scale_color_manual() function. This was necessary since our data was in two different data sets. When using a single dataset, most geometries can create a legend automatically by defining attributes of the aesthetic.
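
To illustrate the single-dataset case, the sketch below stacks two small tables into one and maps colour to a column. The tables stock_a and stock_b and all of their values are hypothetical, made up only to show the idea.

```r
library(ggplot2)

# Hypothetical prices for two made-up stocks; the numbers are illustrative only.
stock_a <- data.frame(Date = as.Date("2020-01-01") + 0:4, Close = c(10, 11, 12, 11, 13))
stock_b <- data.frame(Date = as.Date("2020-01-01") + 0:4, Close = c(20, 19, 21, 22, 23))

# Add a column naming the source, then stack the two tables into one.
stock_a$Stock <- "A"
stock_b$Stock <- "B"
combined <- rbind(stock_a, stock_b)

# Mapping colour to the Stock column produces the legend automatically.
p <- ggplot(combined, aes(x = Date, y = Close, colour = Stock)) +
    geom_line()
```

With this approach ggplot2 chooses the line colors itself; scale_color_manual() is still available if specific colors are wanted.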

Let's plot the location of our stars using latitude and distance, and let's color code them by type.

> ggplot(stars,aes(x=distance_pars,y=latitude))+geom_point(mapping=aes(colour=type))

In code block 3, write code that uses the data in homesales to make a line graph tracking the total number of homes sold each month in the year 2008 for the cities Austin, Dallas, Houston, and San Antonio. Give your graph appropriate labels. Your graph should look like this.

Graphs with Statistics

There are several types of geometries available. You can read about them on the tidyverse website. Each geometry has certain requirements. Let's try to make a histogram. The geom_histogram function needs only a one-variable aesthetic.

Let's make a histogram of the tempo of songs in spotify.

> ggplot(spotify,aes(x=Tempo))+geom_histogram(fill='purple',colour='black')

Notice that when using geom_histogram, we didn't specify a 'y' variable in our aesthetic. However, a statistic counting the number of entries with a value in a certain range is automatically assigned to 'y'. Also, the 'x' variable was transformed from a continuous range of values into 30 'bins', the default number. These are statistical transformations. We can change how this statistic is computed by manipulating arguments of the geom_histogram function.
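
The binning can be inspected directly with layer_data(), which returns the values ggplot2 computed for a layer. This is a sketch that assumes the faithful dataset (built into R) as a stand-in for spotify.

```r
library(ggplot2)

# Assumption: faithful (built into R) stands in for the spotify data.
p_default <- ggplot(faithful, aes(x = waiting)) + geom_histogram()
p_ten     <- ggplot(faithful, aes(x = waiting)) + geom_histogram(bins = 10)

bins_default <- layer_data(p_default)
bins_ten     <- layer_data(p_ten)

print(nrow(bins_default))   # number of bins used by default
print(nrow(bins_ten))       # number of bins when bins = 10

# Each row is one bin: its left edge, its right edge, and how many values fell in it.
print(bins_ten[, c("xmin", "xmax", "count")])
```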

Try playing around with the bins and binwidth arguments of geom_histogram.

> ggplot(spotify,aes(x=Tempo))+geom_histogram(bins=10,fill='purple',colour='black')

> ggplot(spotify,aes(x=Tempo))+geom_histogram(binwidth=50,fill='purple',colour='black')

Let's break down the histogram by album release type.

> ggplot(spotify,aes(x=Tempo))+
     geom_histogram(
         mapping=aes(alpha=Album_type),
         bins = 10,
         fill='purple',
         colour='black'
     )

Notice when breaking down the histogram, the different categories are stacked. We can change the position of the boxes by changing the position attribute or faceting our data.
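
To see what a positional adjustment changes, we can compare the computed bar coordinates for stacked and dodged bars. The sketch below assumes the mpg dataset (shipped with ggplot2) and uses geom_bar rather than geom_histogram to keep the output short.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) stands in for the spotify data.
base <- ggplot(mpg, aes(x = class, fill = drv))

stacked <- layer_data(base + geom_bar())                     # default position is "stack"
dodged  <- layer_data(base + geom_bar(position = "dodge"))

# Stacked bars start where the previous bar ended, so some have ymin above zero.
print(range(stacked$ymin))

# Dodged bars all start at zero and sit side by side instead.
print(range(dodged$ymin))
```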

Try setting the position attribute to "dodge".

> ggplot(spotify,aes(x=Tempo))+
     geom_histogram(
         mapping=aes(alpha=Album_type),
         bins = 10,
         fill='purple',
         colour='black',
         position="dodge"
     )

We can also facet our data using a function like facet_wrap.

> ggplot(spotify,aes(x=Tempo))+
     geom_histogram(
         bins = 10,
         fill='purple',
         colour='black'
     )+
     facet_wrap(facets=vars(Album_type))

For discrete variables, we can create bar graphs using the geom_bar function.

Create a bar graph of the known wand core types of characters in the hp_chars dataset.

> ggplot(filter(hp_chars,!is.na(Wand_Core)),aes(x=Wand_Core))+
     geom_bar(mapping=aes(fill=Wand_Core))

In code block 4, write code that makes a bar graph of the known houses of characters in the hp_chars dataset. Organize the data by gender.

Another useful statistical geometry is the geom_boxplot() function. To use this function, we must use an aesthetic with a discrete 'x' and numerical 'y'.

Create a box plot of the acousticness of songs in spotify organized by album release type.

> ggplot(spotify,aes(x=Album_type,y=Acousticness))+geom_boxplot()
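
A box plot is also built from a statistical transformation: for each group, ggplot2 computes a five-number summary. The sketch below assumes the mpg dataset (shipped with ggplot2) in place of spotify and compares the computed middle line of each box with the group median.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) stands in for the spotify data.
p <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()

# One row per box: whisker ends, box edges, and the middle line.
boxes <- layer_data(p)
print(boxes[, c("ymin", "lower", "middle", "upper", "ymax")])

# The middle line of each box is the median of that group.
medians <- tapply(mpg$hwy, mpg$class, median)
print(medians)
```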

Choose your three favorite artists in the spotify dataset and write code in code block 5 to filter the data for those three artists, storing the result in a variable. Then, add code to create a box plot of the danceability organized by artist.

3D graphs with plotly

As of this moment ggplot2 does not support 3D scatter plots. For this we can use the plotly package.

Use the following code to create a 3D scatter plot using the number of streams, number of views, and tempo of songs in spotify. We'll organize them by album release type.

> plot_ly(spotify,x=~Stream,y=~Views,z=~Tempo,type="scatter3d", color=~Album_type)
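
The same pattern works with any data frame that has three numeric columns. Here is a sketch that assumes the iris dataset (built into R) in place of spotify. It also sets mode = "markers" explicitly; when the mode is left out, plotly picks markers itself and prints a message saying so.

```r
library(plotly)

# Assumption: iris (built into R) stands in for the spotify data.
p <- plot_ly(
    iris,
    x = ~Sepal.Length,    # the ~ tells plotly to look the column up in the data
    y = ~Sepal.Width,
    z = ~Petal.Length,
    color = ~Species,     # one color per species, with a legend
    type = "scatter3d",
    mode = "markers"      # draw points only
)
```

Typing p at the console opens the interactive plot, which can be rotated and zoomed with the mouse.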

In code block 6, write code that creates a 3D scatter plot of the locations of stellar objects in stars organized by type.

Congratulations! You've mastered visualizing data in R.