MATH 4910/5010 - R Lab 3
In this lab you will get some experience with visualizing data in R.
Objectives:
- Learn about the grammar of graphics
- Learn how to visualize data using functions from the ggplot2 and plotly packages
- Learn how to create 2D and 3D scatter plots in R
- Get experience performing basic statistical analysis in R
You can find the R markdown template and the datasets used for this lab on Canvas in the course files under the "R Lab 3" folder. Follow the instructions below. Instructions in green indicate tasks that should be completed in the R markdown file for this lab.
Packages and data
Now that we can load data into R, let's see how we can "look at" our data in R. We will be using functions from ggplot2, which is part of the tidyverse, and from the plotly package. Be sure that both the tidyverse and plotly are installed on your machine.
In code block 1, write code to load the tidyverse and plotly packages into R.
We'll also need some data to work with.
You should have found five .csv files on Canvas.
The first two files, AAPL.csv and MSFT.csv, contain lifetime stock price data for Apple and Microsoft, respectively.
The third file, harry_potter.csv, contains information about characters from the book series Harry Potter.
The fourth file, spotify_youtube.csv, contains data about 20,000 songs available for streaming on Spotify.
The last file, gaia_data.csv, contains data from the Gaia space observatory for approximately 2,500 celestial objects (stars, clusters, stellar clouds).
We will also be using one of the built-in datasets, txhousing, which contains data for home sales in Texas from 2000 to 2015.
In code block 2, load the Texas home sales data into the variable homesales by adding the following line of code.
> homesales <- txhousing
Add code to code block 2 that loads the data from AAPL.csv, MSFT.csv, harry_potter.csv, spotify_youtube.csv, and gaia_data.csv into variables called apple, microsoft, hp_chars, spotify, and stars, respectively.
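As a reminder of the general pattern, here is a minimal sketch of reading a .csv file into a variable. It assumes the readr function read_csv (part of the tidyverse) and uses a small throwaway file with made-up values, not one of the lab's files. The names demo_path and demo_stock are hypothetical.

```r
library(readr)

# Hypothetical stand-in for one of the lab's .csv files: a tiny file
# with made-up values, written to a temporary location.
demo_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(Date = c("2020-01-02", "2020-01-03"), Close = c(10.5, 11.25)),
  demo_path
)

# read_csv() returns a tibble; assign it to a name with <-.
demo_stock <- read_csv(demo_path, show_col_types = FALSE)
demo_stock
```

For the lab's files, the path would instead point to wherever you saved the download from Canvas.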
The Grammar of Graphics
There are many ways to visualize data in R, but ggplot2 is one of the most elegant and versatile. ggplot2 implements a system of data visualization called the grammar of graphics. You can read more about the grammar of graphics in this summary of Hadley Wickham's article "A Layered Grammar of Graphics".
The idea of the grammar of graphics is to break a graph into separate components. A graph consists of a default dataset, a default mapping from variables to aesthetics, some number of distinct layers, a scale for each aesthetic, a coordinate system, and a separation of the data into subsets (faceting). Each layer is composed of one statistical transformation, one geometric object, one positional adjustment, and optionally a dataset and mapping (if not using the default data and mapping).
Components of the Grammar of Graphics
- Defaults
  - Data
  - Mapping
- Layers
  - Data
  - Mapping
  - Statistical Transformation
  - Geometry
  - Positional Adjustment
- Scale
- Coordinate System
- Faceting
Consider the following graph.
A graph made with ggplot2
Data: This graph uses data from a dataset of information about the settled countries and territories of the world.
Mapping: Here we map the variable Continent to the x-axis.
Statistical Transformation: The y-axis is not mapped to a specific column of the dataset. Instead, we compute the frequency of each distinct value in the Continent column.
Geometry: This graph uses the geometry of a bar graph.
Positional Adjustment: This is hard to see from the graph, but it covers all the ways the location of the graphical elements was adjusted from the default. For example, bars that share the same x value can be stacked or placed side by side.
Scale/Coordinate System: This component includes all of the axis scaling and labeling of the graph.
Faceting: This graph does not use any faceting. Below we added faceting by each country's currency minor unit (the number of decimal places used in a country's currency).
A graph with faceting
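To connect these components to code, here is a rough sketch that spells out each one explicitly. The countries dataset is not among the lab's files, so this uses mpg, a dataset that ships with ggplot2, as a stand-in. Several of these arguments are defaults that we would normally leave out.

```r
library(ggplot2)

# mpg is a built-in ggplot2 dataset, used here as a stand-in
# for the countries data shown in the figures.
p <- ggplot(data = mpg, mapping = aes(x = class)) +  # defaults: data and mapping
  geom_bar(                       # geometry: bars
    stat = "count",               # statistical transformation: count rows per class
    position = "stack"            # positional adjustment (the default for bars)
  ) +
  scale_y_continuous(name = "Number of cars") +  # scale for the y aesthetic
  coord_cartesian() +             # coordinate system (the default)
  facet_wrap(vars(drv))           # faceting: one panel per drive type
p
```

Removing the facet_wrap line gives a single bar graph, like the first figure.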
Graphing with ggplot2
To create a graph using ggplot2, we use the ggplot function.
The arguments of this function are a default dataset and a default mapping.
We already have experience passing datasets as arguments, but passing a mapping as an argument in R is new to us.
This is done using the aes function.
The aes function usually takes an 'x' and a 'y' argument, although some geometries need only one of them.
Most of the time, we pass the names of the columns we want to use as these arguments.
Let's start by working with the Apple stock price data from the year 2020.
The first column of the apple dataset is the date, stored as a date data type.
These values look like strings, but R knows how to treat them as dates.
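To see what the date data type gives us, here is a small sketch using a made-up date rather than the apple data. It assumes the lubridate package, which provides the year function.

```r
library(lubridate)

# A made-up date, not taken from the apple data.
d <- as.Date("2020-03-15")

class(d)                    # the class R uses for calendar dates
year(d)                     # lubridate helper that extracts the year
d + 30                      # date arithmetic works, unlike with plain strings
d > as.Date("2020-01-01")   # comparisons work too
```

This is why we can filter rows by year in the code that follows.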
Type the following code into the R console.
> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))
The code in the previous step only specified the data and mapping of our graph.
We have not told R how we want the data to be displayed.
To do that, we need to specify a geometry.
For a standard scatter plot, we can use the geom_point function.
Type the following code into the R console.
> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))+geom_point()
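If you don't have the apple data loaded yet, the same two steps can be tried on a built-in dataset. This sketch assumes economics, which ships with ggplot2 and has a date column named date. The variable names are hypothetical. It also shows that a plot can be stored in a variable and extended later with +.

```r
library(ggplot2)
library(dplyr)
library(lubridate)

# economics is a built-in ggplot2 dataset; it stands in for apple here.
econ_2000s <- filter(economics, year(date) >= 2000)

# Data and mapping only: this draws axes but no points.
base_plot <- ggplot(econ_2000s, aes(x = date, y = unemploy))
base_plot

# Adding a geometry tells ggplot2 how to draw each row.
point_plot <- base_plot + geom_point()
point_plot
```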
Typically when displaying stock prices, we use a line graph. To do this, we can use geom_line instead of geom_point. Also, let's use a different color. Notice that ggplot2 accepts the British spelling "colour" as well as "color".
> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close)) + geom_line(colour='darkgreen')
We can add layers by adding more geometries. When adding geometries, if we don't want to use the default data and mapping, we have the option of specifying what data and mapping to use for that geometry.
Let's add the stock price data for Microsoft to this graph.
> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close)) + geom_line(colour='darkgreen') + geom_line(data=filter(microsoft,year(Date)==2020),colour='orange')
We can use functions like xlim(), ylim(), xlab(), ylab(), and labs() to change the scaling and labeling of our graph.
Let's scale the y-axis so that it starts at zero.
Also, we should label the y-axis with a clearer description.
A legend would be nice.
Oh yeah, we should also add a title.
Then, we can add a nice black and white theme.
Writing a line of code for graphs can get out of hand very quickly. To improve the readability of our code, let's format it using the grammar of graphics as our guide.
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>(
    data = <DATA>,
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
Type the code below into the R console.
> ggplot(filter(apple,year(Date)==2020),aes(x=Date,y=Close))+
    geom_line(mapping=aes(color='Apple'))+
    geom_line(
      data=filter(microsoft,year(Date)==2020),
      mapping=aes(color='Microsoft')
    )+
    theme_bw()+
    ylim(0,250)+
    ylab("Price per Share at Market Close ($)")+
    labs(title="Stock Price of Apple and Microsoft Stock in 2020")+
    scale_color_manual(name="Stock", values=c('Microsoft'='orange','Apple'='darkgreen'))
Notice that we had to create a legend using the scale_color_manual() function.
This was necessary since our data was in two different datasets.
When using a single dataset, most geometries can create a legend automatically when a variable is mapped to an attribute such as colour in the aesthetic.
Let's plot the location of our stars using latitude and distance, and let's color-code them by type.
> ggplot(stars,aes(x=distance_pars,y=latitude))+geom_point(mapping=aes(colour=type))
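The same idea offers another way to draw two series with a legend: stack the two datasets into one and map colour to a column that records where each row came from. The sketch below assumes dplyr's bind_rows and uses made-up prices for two hypothetical stocks, not the lab's data.

```r
library(ggplot2)
library(dplyr)

# Made-up prices for two hypothetical stocks, for illustration only.
stock_a <- data.frame(day = 1:5, close = c(10, 11, 12, 11, 13))
stock_b <- data.frame(day = 1:5, close = c(20, 19, 21, 22, 21))

# Stack the two data frames and record which one each row came from.
both <- bind_rows(A = stock_a, B = stock_b, .id = "stock")

# Mapping colour to the new column draws one line per stock
# and builds the legend automatically.
p <- ggplot(both, aes(x = day, y = close, colour = stock)) +
  geom_line()
p
```

With this layout no manual colour scale is needed, although one can still be added to pick specific colours.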
In code block 3, write code that uses the data in homesales to make a line graph tracking the total number of homes sold each month in the year 2008 for the cities Austin, Dallas, Houston, and San Antonio. Give your graph appropriate labels. Your graph should look like this.
Graphs with Statistics
There are several types of geometries available.
You can read about them on the tidyverse website.
Each geometry has certain requirements.
Let's try to make a histogram.
The geom_histogram function only needs a one-variable aesthetic.
Let's make a histogram of the tempo of songs in the spotify dataset.
> ggplot(spotify,aes(x=Tempo))+geom_histogram(fill='purple',colour='black')
Notice that when using geom_histogram, we didn't specify a 'y' variable in our aesthetic.
However, a statistic counting the number of entries with a value in a certain range is automatically assigned to 'y'.
Also, the 'x' variable was transformed from a continuous range of values into 30 'bins', the default number.
These are statistical transformations.
We can change how this statistic is computed by manipulating arguments of the geom_histogram function.
Try playing around with the bins and binwidth arguments of geom_histogram.
> ggplot(spotify,aes(x=Tempo))+geom_histogram(bins=10,fill='purple',colour='black')
> ggplot(spotify,aes(x=Tempo))+geom_histogram(binwidth=50,fill='purple',colour='black')
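To look at the numbers behind a histogram, one option is ggplot2's layer_data function, which returns the values computed by the statistical transformation. This sketch uses faithful, a dataset built into R, in place of spotify.

```r
library(ggplot2)

# faithful is a built-in R dataset; `waiting` is the time between eruptions.
p_bins  <- ggplot(faithful, aes(x = waiting)) + geom_histogram(bins = 10)
p_width <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5)

# One row per bin, with the bin edges and the count in that bin.
layer_data(p_bins)[, c("xmin", "xmax", "count")]
layer_data(p_width)[, c("xmin", "xmax", "count")]
```

The bins argument fixes how many bins there are, while binwidth fixes how wide each one is. In both cases the counts add up to the number of rows in the data.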
Let's break down the histogram by album release type.
> ggplot(spotify,aes(x=Tempo))+
    geom_histogram(
      mapping=aes(alpha=Album_type),
      bins = 10,
      fill='purple',
      colour='black'
    )
Notice when breaking down the histogram, the different categories are stacked.
We can change the position of the bars by changing the position argument or by faceting our data.
Try setting the position argument to "dodge".
> ggplot(spotify,aes(x=Tempo))+
    geom_histogram(
      mapping=aes(alpha=Album_type),
      bins = 10,
      fill='purple',
      colour='black',
      position="dodge"
    )
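The difference between stacking and dodging can also be seen in the computed layer data. This is a sketch on mpg, a dataset that ships with ggplot2, using geom_bar rather than the spotify histogram.

```r
library(ggplot2)

# mpg is a built-in ggplot2 dataset, used here in place of spotify.
base <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # the default: bars pile up
p_dodge <- base + geom_bar(position = "dodge")  # bars sit side by side

# Stacked bars can start above zero (on top of another bar),
# while dodged bars all start at zero.
range(layer_data(p_stack)$ymin)
range(layer_data(p_dodge)$ymin)
```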
We can also facet our data using a function like facet_wrap.
> ggplot(spotify,aes(x=Tempo))+
    geom_histogram(
      bins = 10,
      fill='purple',
      colour='black'
    )+
    facet_wrap(facets=vars(Album_type))
For discrete variables, we can create bar graphs using the geom_bar function.
Create a bar graph of the known wand core types of characters in the hp_chars dataset.
> ggplot(filter(hp_chars,!is.na(Wand_Core)),aes(x=Wand_Core))+
    geom_bar(mapping=aes(fill=Wand_Core))
In code block 4, write code that makes a bar graph of the known houses of characters in the hp_chars dataset. Organize the data by gender.
Another useful statistical geometry is the geom_boxplot() function.
To use this function, we must use an aesthetic with a discrete 'x' and numerical 'y'.
Create a box plot of the acousticness of songs in spotify organized by album release type.
> ggplot(spotify,aes(x=Album_type,y=Acousticness))+geom_boxplot()
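A box plot is also built on a statistical transformation: for each group, ggplot2 computes the median, the quartiles, and the whisker ends. As a sketch, here is how to inspect those values using mpg, a dataset that ships with ggplot2, in place of spotify.

```r
library(ggplot2)

# mpg is a built-in ggplot2 dataset: `class` is discrete, `hwy` is numeric.
p <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()
p

# The statistics behind each box: whisker ends, quartiles, and median.
layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]

# The `middle` column matches the per-class medians computed directly.
tapply(mpg$hwy, mpg$class, median)
```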
Choose your three favorite artists in the spotify dataset and write code in code block 5 to filter the data for those three artists, storing the result in a variable. Then, add code to create a box plot of the danceability organized by artist.
3D graphs with plotly
As of this moment, ggplot2 does not support 3D scatter plots. For this we can use the plotly package.
Use the following code to create a 3D scatter plot using the number of streams, number of views, and tempo of songs in spotify. We'll organize them by album release type.
> plot_ly(spotify,x=~Stream,y=~Views,z=~Tempo,type="scatter3d", color=~Album_type)
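The same call can be extended with a title and axis labels. The sketch below uses iris, a dataset built into R, in place of spotify, and assumes plotly's layout function for the labels. It also sets mode to "markers" so that only points are drawn.

```r
library(plotly)

# iris is a built-in R dataset, used here in place of spotify.
fig <- plot_ly(
  iris,
  x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
  color = ~Species,        # one colour and legend entry per species
  type = "scatter3d",
  mode = "markers"         # draw points only
)

# layout() sets the title and the axis labels of the 3D scene.
fig <- layout(
  fig,
  title = "Iris measurements",
  scene = list(
    xaxis = list(title = "Sepal length (cm)"),
    yaxis = list(title = "Sepal width (cm)"),
    zaxis = list(title = "Petal length (cm)")
  )
)
fig
```

The resulting plot is interactive: you can drag to rotate it and scroll to zoom.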
In code block 6, write code that creates a 3D scatter plot of the locations of stellar objects in stars organized by type.
Congratulations! You've mastered visualizing data in R.