MATH 4910/5010 - R Lab 2
In this lab, you will get some experience with data wrangling in R.
Objectives:- Learn how to install and load packages in R
- Learn how to manage files and directories in R and RStudio
- Learn how to read data into R
- Learn how to use data wrangling functions from the tidyverse package
You can find the R markdown template and one of the datasets used for this lab on Canvas in the course files under the "R Lab 2" folder. Follow the instructions below. Instructions in green indicate tasks that should be completed in the R markdown file for this lab.
Installing and Loading R Packages
In this class, we will be working with lots of data. Let's start by loading some data into R. We will be using functions from the tidyverse package to load data. To learn more about this package, I highly encourage you to browse the tidyverse website. This site has lots of examples and cheat sheets that I find helpful.
If you haven't done so already, install the tidyverse package. To install packages, we use the
install.packages()
function like below. Packages only need to be installed once. When you start a new session in R, all of your installed packages will still be there.> install.packages("tidyverse")
To use an installed package, we need to load the package library using the library()
function.
This needs to be done every time you start a new R session.
Add the following code to code block 1.
library(tidyverse)
File and Directory Management
You should have found a file on canvas called imdb.csv
.
This file contains data for approximately 10,000 movies extracted from the IMDB website.
To read this file into R, the folder where the file is located should be (or at least be contained in) the current working directory R is pointed to.
Use the
getwd()
to find out the current working directory.> getwd()
Use the
list.files()
function to see the contents of the current working directory.> list.files()
RStudio also has a built-in file explorer for your convenience. You can find it under the "Files" tab of the lower-right window. Beware that this does not always show the working directory, changing the working directory in R does not affect this window.
The working directory for R markdown files is whatever directory the file is located in.
Use the
getwd()
function in code block 2 to display the current working directory for the lab 2 R markdown file. Then use thelist.files()
function to display the contents of that directory.
I will be assuming that the working directory for R (the one you see in the R console) is the directory where the R markdown file is located.
You can change the working directory using the setwd()
function.
> setwd("C:\Users\jplac\Documents\TDA Course\R_files\Lab 2\")
Create a subdirectory in the working directory called "". This can be done using the code below.
> dir.create("new_dir")
Set the working directory to your newly created folder using the following code.
> setwd("new_dir")
> getwd()
## [1] "C:/Users/jplac/Documents/TDA Course/R_files/Lab 2/new_dir"
To set the working directory back to the parent directory use the following code.
> setwd("../")
> getwd()
## [1] "C:/Users/jplac/Documents/TDA Course/R_files/Lab 2"
Importing Data Into R
Let's load some data.
Most data in this course is stored in comma-separated values files(.csv).
We can load these into R using the read_csv()
function.
In code block 2, use code that loads the data from
imdb.csv
into a variable calledmovies
.movies <- read_csv("imdb.csv")
At first glance, one might think the read_csv()
function loads data into a data frame which we learned about in the last lab.
It turns out this is not exactly true.
Actually, our data is stored in a new class called a tibble, or more precisely, tbl_df
.
Check the class of movies.
> class(movies)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Tibbles inherit all of the functionality of data frames so we can use tibbles in any function that uses data frames. Tibbles are introduced using the tibble package. You don't need to install or load tibble since it's a subpackage of tidyverse.
We can use glimpse()
function to quickly view data in a tibble.
Use
glimpse()
to view the data in movies.> glimpse(movies)
## Rows: 9,849
## Columns: 11
## $ title <chr> "The Shawshank Redemption", "The Godfather", "Ramayan…
## $ year <chr> "1994", "1972", "1993", "1975", "2022", "2008", "1993…
## $ runtime <dbl> 142, 175, 135, 87, 121, 152, 195, 96, 201, 202, 147, …
## $ certificate <chr> "R", "R", "PG", NA, NA, "PG-13", "R", "Approved", "PG…
## $ genre <chr> "Drama", "Crime, Drama", "Animation, Action, Adventur…
## $ director <chr> "['Frank Darabont', 'Tim Robbins', 'Morgan Freeman', …
## $ stars <chr> "['Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'Wil…
## $ rating <dbl> 9.3, 9.2, 9.2, 9.2, 9.1, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0…
## $ metascore <dbl> 82, 100, NA, NA, NA, 84, 95, 97, 94, 90, NA, 86, 95, …
## $ votes <dbl> 2780534, 1935895, 12470, 42018, 13372, 2758250, 13992…
## $ gross <dbl> 28.34, 134.97, NA, NA, NA, 534.86, 96.90, 4.36, 377.8…
I will be providing most of the data in this class. However, there is a lot data readily available for anyone to use. Once such source is the U.S. Census Bureau.
Go to the U.S. Census Bureau website, type "oklahoma education" in the search bar, and find the "S1501 | Educational Attainment" survey data.
Download the data by clicking the icon with three dots in the upper-right hand corner of the report and finding "CSV" in the dropdown list. Move the
.csv
file into the working directory for R. Feel free to rename the file to whatever you like.
In code block 2, add code that loads the census data into a variable called
education
.In code block 3, rename the column labels in
education
using the following code.names(education) <- c("group", "total", "total_error", "tot_percent", "tot_perc_err", "male", "male_error", "m_percent", "m_perc_err", "female", "female_error", "f_percent", "f_perc_err")
In code block 3, add code that glimpses the census data.
In addition to the data I provide, you also have access to datasets already stored in R.
Use the
data()
function to see the datasets R has by default.data()
Sometimes we will want to add data to a dataset.
This can be done using the add_row()
function.
Let's demonstrate this using the chickwts
dataset which comes with R.
Use the
View()
function to look at the data in thechickwts
dataset.View(chickwts)
Use the
add_row()
function to add a row to this dataset and store it in a variablechickens2
.chicken2<-add_row(chickwts,weight=502,feed="trash")
View(chickens2)
Data Wrangling
Now that we've loaded our data into R, you might want to view and manipulate it in different ways. The subpackage dplyr contains several functions useful for this purpose.
Using the select()
function, we can create a new tibble picking and choosing which columns we want to use.
When we call select()
, along with a dataset, we can indicate which columns we want to use.
We can also use the -
operator to get all columns except the one indicated.
Use the
select()
function onmovies
to create smaller datasets. Here's a few examples.> select(movies,title,year,genre,rating)
## # A tibble: 9,849 × 4
## title year genre rating
## <chr> <chr> <chr> <dbl>
## 1 The Shawshank Redemption 1994 Drama 9.3
## 2 The Godfather 1972 Crime, Drama 9.2
## 3 Ramayana: The Legend of Prince Rama 1993 Animation, A… 9.2
## 4 The Chaos Class 1975 Comedy, Drama 9.2
## 5 Daman 2022 Adventure, D… 9.1
## 6 The Dark Knight 2008 Action, Crim… 9
## 7 Schindler's List 1993 Biography, D… 9
## 8 12 Angry Men 1957 Crime, Drama 9
## 9 The Lord of the Rings: The Return of the King 2003 Action, Adve… 9
## 10 The Godfather Part II 1974 Crime, Drama 9
## # ℹ 9,839 more rows
## # ℹ Use `print(n = ...)` to see more rows
> select(movies,title,rating:gross)
## # A tibble: 9,849 × 5
## title rating metascore votes gross
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Shawshank Redemption 9.3 82 2.78e6 28.3
## 2 The Godfather 9.2 100 1.94e6 135.
## 3 Ramayana: The Legend of Prince Rama 9.2 NA 1.25e4 NA
## 4 The Chaos Class 9.2 NA 4.20e4 NA
## 5 Daman 9.1 NA 1.34e4 NA
## 6 The Dark Knight 9 84 2.76e6 535.
## 7 Schindler's List 9 95 1.40e6 96.9
## 8 12 Angry Men 9 97 8.25e5 4.36
## 9 The Lord of the Rings: The Return of the… 9 94 1.91e6 378.
## 10 The Godfather Part II 9 90 1.32e6 57.3
## # ℹ 9,839 more rows
## # ℹ Use `print(n = ...)` to see more rows
> mov2 <- select(movies,title,year,genre,rating)
> mov2
## # A tibble: 9,849 × 9
## title year runtime genre director rating metascore votes gross
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Shawsha… 1994 142 Drama ['Frank… 9.3 82 2.78e6 28.3
## 2 The Godfath… 1972 175 Crim… ['Franc… 9.2 100 1.94e6 135.
## 3 Ramayana: T… 1993 135 Anim… ['Ram M… 9.2 NA 1.25e4 NA
## 4 The Chaos C… 1975 87 Come… ['Ertem… 9.2 NA 4.20e4 NA
## 5 Daman 2022 121 Adve… ['Lenka… 9.1 NA 1.34e4 NA
## 6 The Dark Kn… 2008 152 Acti… ['Chris… 9 84 2.76e6 535.
## 7 Schindler's… 1993 195 Biog… ['Steve… 9 95 1.40e6 96.9
## 8 12 Angry Men 1957 96 Crim… ['Sidne… 9 97 8.25e5 4.36
## 9 The Lord of… 2003 201 Acti… ['Peter… 9 94 1.91e6 378.
## 10 The Godfath… 1974 202 Crim… ['Franc… 9 90 1.32e6 57.3
## # ℹ 9,839 more rows
## # ℹ Use `print(n = ...)` to see more rows
In code block 4, add code that removes the 'certificate', 'director' and 'stars' columns from the movie data and stores the result in a variable
movies2
.In code block 5, add code that removes the error columns from the U.S. census data and stores the result in a variable
education2
.
The filter()
function creates a new tibble from a set of data only using rows that satisfies certain conditions.
When we call filter()
, along with a dataset, we can add any number of conditions to filter the data.
Use the
filter()
function onmovies2
to create smaller datasets. Here's a couple of examples.> filter(movies2, runtime>200)
## # A tibble: 32 × 8
## title year runtime genre rating metascore votes gross
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Lord of the Ring… 2003 201 Acti… 9 94 1.91e6 378.
## 2 The Godfather Part II 1974 202 Crim… 9 90 1.32e6 57.3
## 3 The Message 1976 207 Acti… 8.9 NA 1.00e4 NA
## 4 Kill Bill: The Whole… 2006 247 Acti… 8.8 NA 1.46e4 NA
## 5 Seven Samurai 1954 207 Acti… 8.6 98 3.56e5 0.27
## 6 The Best of Youth 2003 366 Dram… 8.5 89 2.27e4 0.27
## 7 Gone with the Wind 1939 238 Dram… 8.2 97 3.25e5 199.
## 8 Gangs of Wasseypur 2012 321 Acti… 8.2 89 1.00e5 NA
## 9 Satantango 1994 439 Drama 8.2 NA 1.20e4 NA
## 10 Ben-Hur 1959 212 Adve… 8.1 90 2.47e5 74.7
## # ℹ 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
> filter(movies2, year==2000, metascore>90)
## # A tibble: 3 × 8
## title year runtime genre rating metascore votes gross
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Yi Yi: A One and a Tw… 2000 173 Dram… 8.1 94 26741 1.14
## 2 Werckmeister Harmonies 2000 145 Dram… 8 92 15171 0.03
## 3 Crouching Tiger, Hidd… 2000 120 Acti… 7.9 94 276925 128.
Filtering the 'genre' column is not so straightforward.
However, we can use the grepl()
function to help.
> grepl("i","team")
## FALSE
> grepl("cab","Escabar")
## [1] TRUE
Use the
filter()
andgrepl()
functions onmovies2
to create datasets of different genres.> filter(movies2, grepl("Drama",genre))
We can also get the first or last \(n\) rows from a dataset using respectively the slice_head()
and slice_tail()
functions.
Use the
slice_head()
andslice_tail()
functions oneducation2
.> slice_head(education2,n=5)
> slice_tail(education2,n=15)
In code block 5, add code that creates a dataset from
education2
only containing the education by age information (should be rows 2 through 28) and stores it in a variable callededu_age
. Then, glimpse the data.
Notice that some of the entries in our dataset have value NA. There are missing entries in our data. Typically, these are observations which were not made. Missing values don't work like normal values.
> NA==NA
## [1] NA
> NA!=NA
## [1] NA
filter()
like we did before to remove rows with missing values.
> filter(movies2,metascore!=NA)
## # A tibble: 0 × 8
## # ℹ 8 variables: title <chr>, year <chr>, runtime <dbl>, genre <chr>,
## # rating <dbl>, metascore <dbl>, votes <dbl>, gross <dbl>
To remove rows containing missing values we can use the drop_na()
function.
> drop_na(movies2,metascore)
## # A tibble: 7,876 × 8
## title year runtime genre rating metascore votes gross
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Shawshank Redemp… 1994 142 Drama 9.3 82 2.78e6 28.3
## 2 The Godfather 1972 175 Crim… 9.2 100 1.94e6 135.
## 3 The Dark Knight 2008 152 Acti… 9 84 2.76e6 535.
## 4 Schindler's List 1993 195 Biog… 9 95 1.40e6 96.9
## 5 12 Angry Men 1957 96 Crim… 9 97 8.25e5 4.36
## 6 The Lord of the Ring… 2003 201 Acti… 9 94 1.91e6 378.
## 7 The Godfather Part II 1974 202 Crim… 9 90 1.32e6 57.3
## 8 Spider-Man: Across t… 2023 140 Anim… 8.9 86 2.03e5 NA
## 9 Pulp Fiction 1994 154 Crim… 8.9 95 2.13e6 108.
## 10 Inception 2010 148 Acti… 8.8 74 2.45e6 293.
## # ℹ 7,866 more rows
## # ℹ Use `print(n = ...)` to see more rows
In code block 4, add code that creates a dataset from
movies2
only containing movies without missing 'gross' values and store this in a variable calledmoneymovies
.In code block 4, add code that displays all the rows in
moneymovies
for movies rated 9 and above.
The mutate()
function can be used to add columns to your dataset.
When we call mutate()
, along with a dataset, we can add any number of new column labels with a formula for computing the column values.
Use the
mutate()
to compute amount of gross revenue earned per minute (in $million/min) for the movies inmoneymovies
via the following code.> mutate(moneymovies,dol_per_min=gross/runtime)
## # A tibble: 7,015 × 9
## title year runtime genre rating metascore votes gross dol_per_min
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 The Shaw… 1994 142 Drama 9.3 82 2.78e6 28.3 0.200
## 2 The Godf… 1972 175 Crim… 9.2 100 1.94e6 135. 0.771
## 3 The Dark… 2008 152 Acti… 9 84 2.76e6 535. 3.52
## 4 Schindle… 1993 195 Biog… 9 95 1.40e6 96.9 0.497
## 5 12 Angry… 1957 96 Crim… 9 97 8.25e5 4.36 0.0454
## 6 The Lord… 2003 201 Acti… 9 94 1.91e6 378. 1.88
## 7 The Godf… 1974 202 Crim… 9 90 1.32e6 57.3 0.284
## 8 Pulp Fic… 1994 154 Crim… 8.9 95 2.13e6 108. 0.701
## 9 Inception 2010 148 Acti… 8.8 74 2.45e6 293. 1.98
## 10 Fight Cl… 1999 139 Drama 8.8 67 2.22e6 37.0 0.266
## # ℹ 7,005 more rows
## # ℹ Use `print(n = ...)` to see more rows
The arrange()
function sorts your dataset by a chosen column.
When we call arrange()
, along with a dataset, we can indicate which columns we want to sort out data by starting from the first column name given.
By default, this function sorts in ascending order.
To sort in descending order, use the desc()
function.
Use the
arrange()
to sort the data inmoneymovies
. Here are some examples.> arrange(moneymovies,metascore)
## # A tibble: 7,015 × 8
## title year runtime genre rating metascore votes gross
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miss March 2009 90 Come… 5 7 22036 4.54
## 2 Unplanned 2019 109 Biog… 5.5 10 14453 19.0
## 3 Death Wish II 1982 89 Acti… 5.9 11 17221 16.1
## 4 Nine Lives 2016 87 Come… 5.3 11 23713 19.6
## 5 Cocktail 1988 104 Come… 5.9 12 89374 78.2
## 6 Slackers 2002 86 Come… 5.3 12 15026 4.81
## 7 Strange Wilderness 2008 87 Adve… 5.2 12 22435 6.56
## 8 Friday the 13th Part V… 1988 88 Horr… 5.2 13 39997 19.2
## 9 Nothing But Trouble 1991 94 Come… 5.1 13 26511 8.48
## 10 Cannonball Run II 1984 108 Acti… 5 13 16967 28.1
## # ℹ 7,005 more rows
## # ℹ Use `print(n = ...)` to see more rows
> arrange(moneymovies,desc(metascore))
## # A tibble: 7,015 × 8
## title year runtime genre rating metascore votes gross
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Godfather 1972 175 Crime,… 9.2 100 1.94e6 135.
## 2 Casablanca 1942 102 Drama,… 8.5 100 5.89e5 1.02
## 3 Rear Window 1954 112 Myster… 8.5 100 5.06e5 36.8
## 4 Fanny and Alexander 1982 188 Drama 8.1 100 6.58e4 4.97
## 5 Three Colors: Red 1994 99 Drama,… 8.1 100 1.07e5 4.04
## 6 Boyhood 2014 165 Drama 7.9 100 3.62e5 25.4
## 7 Notorious 1946 102 Drama,… 7.9 100 1.04e5 10.5
## 8 The Conformist 1970 113 Drama 7.9 100 3.24e4 0.54
## 9 City Lights 1931 87 Comedy… 8.5 99 1.90e5 0.02
## 10 Army of Shadows 1969 145 Drama,… 8.1 99 2.49e4 0.74
## # ℹ 7,005 more rows
## # ℹ Use `print(n = ...)` to see more rows
> arrange(moneymovies,desc(year),runtime)
## # A tibble: 7,015 × 9
## # A tibble: 7,015 × 8
## title year runtime genre rating metascore votes gross
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Minions: The Rise of … 2022 87 Anim… 6.5 56 77386 370.
## 2 Beast 2022 93 Acti… 5.6 54 36198 31.8
## 3 The Bad Guys 2022 100 Anim… 6.8 64 50836 97.2
## 4 Puss in Boots: The La… 2022 102 Anim… 7.9 73 147748 168.
## 5 M3GAN 2022 102 Horr… 6.4 72 115771 93.9
## 6 Ticket to Paradise 2022 104 Come… 6.1 50 56785 68.3
## 7 Morbius 2022 104 Acti… 5.2 35 140517 73.9
## 8 DC League of Super-Pe… 2022 105 Anim… 7.1 56 71143 93.7
## 9 Lightyear 2022 105 Anim… 6.1 60 116808 118.
## 10 Lyle, Lyle, Crocodile 2022 106 Anim… 6.1 51 12694 46.9
## # ℹ 7,005 more rows
## # ℹ Use `print(n = ...)` to see more rows
The last innovation from the tidyverse package we will learn about is the pipe operator %>%
.
Pipe operators allow us to combine multiple operations into one continuous line of code.
For example, following code displays the top five grossing horror movies of the 1990's along with the year they were released.
Give it a try.
> moneymovies %>% filter(grepl("Horror",genre),year>=1990,year<2000) %>% select(title,year,gross) %>% arrange(desc(gross)) %>% slice_head(n=5)
%>%
.
For all the other commands used we can omit the dataset in the function calls.
The pipe is always executed from left to right.
In code block 6, add one line of code that displays a list of movie titles (just the titles) from the Star Trek franchise in order of release (earliest to latest).
To save data we can use the write_csv()
function.
Use the following line of code to save the data in
moneymovies
to a new file calledimdb_gross.csv
.> write_csv(moneymovies,"imdb_gross.csv")
-
Use what you've learned to answer the following questions using the
moneymovies
dataset.How many action movies, rated 8 and above, were released before the year 1970? Answer in space marked [Answer Here 1].
What is the title of 5th highest grossing movie released in the year 1999? Answer in space marked [Answer Here 2].
What is the mean runtime (rounded to the nearest tenth) of a Sci-fi movie released after (not during) the year 2015? Answer in space marked [Answer Here 3].
Congratulations! You've mastered data wrangling in R.