Jonathan C. Johnson

E-mail: jonathan.johnson10@okstate.edu
Office: MSCS 524
Department of Mathematics
Oklahoma State University
Return to MATH 4910/5010 Homepage

MATH 4910/5010 - R Lab 2

In this lab, you will get some experience with data wrangling in R.

Objectives:
  1. Learn how to install and load packages in R
  2. Learn how to manage files and directories in R and RStudio
  3. Learn how to read data into R
  4. Learn how to use data wrangling functions from the tidyverse package

You can find the R markdown template and one of the datasets used for this lab on Canvas in the course files under the "R Lab 2" folder. Follow the instructions below. Instructions in green indicate tasks that should be completed in the R markdown file for this lab.

Installing and Loading R Packages

In this class, we will be working with lots of data. Let's start by loading some data into R. We will be using functions from the tidyverse package to load data. To learn more about this package, I highly encourage you to browse the tidyverse website. This site has lots of examples and cheat sheets that I find helpful.

  1. If you haven't done so already, install the tidyverse package. To install packages, we use the install.packages() function like below. Packages only need to be installed once. When you start a new session in R, all of your installed packages will still be there.

    > install.packages("tidyverse")

To use an installed package, we need to load the package library using the library() function. This needs to be done every time you start a new R session.

  1. Add the following code to code block 1.

    library(tidyverse)

File and Directory Management

You should have found a file on canvas called imdb.csv. This file contains data for approximately 10,000 movies extracted from the IMDB website. To read this file into R, the folder where the file is located should be (or at least be contained in) the current working directory R is pointed to.

  1. Use the getwd() to find out the current working directory.

    > getwd()

  2. Use the list.files() function to see the contents of the current working directory.

    > list.files()

RStudio also has a built-in file explorer for your convenience. You can find it under the "Files" tab of the lower-right window. Beware that this does not always show the working directory, changing the working directory in R does not affect this window.

The working directory for R markdown files is whatever directory the file is located in.

  1. Use the getwd() function in code block 2 to display the current working directory for the lab 2 R markdown file. Then use the list.files() function to display the contents of that directory.

I will be assuming that the working directory for R (the one you see in the R console) is the directory where the R markdown file is located. You can change the working directory using the setwd() function.

> setwd("C:\Users\jplac\Documents\TDA Course\R_files\Lab 2\")

This is where I'm keeping my R files.

  1. Create a subdirectory in the working directory called "". This can be done using the code below.

    > dir.create("new_dir")

  2. Set the working directory to your newly created folder using the following code.

    > setwd("new_dir")
    > getwd()
    ## [1] "C:/Users/jplac/Documents/TDA Course/R_files/Lab 2/new_dir"

  3. To set the working directory back to the parent directory use the following code.

    > setwd("../")
    > getwd()
    ## [1] "C:/Users/jplac/Documents/TDA Course/R_files/Lab 2"

Importing Data Into R

Let's load some data. Most data in this course is stored in comma-separated values files(.csv). We can load these into R using the read_csv() function.

  1. In code block 2, use code that loads the data from imdb.csv into a variable called movies.

    movies <- read_csv("imdb.csv")

At first glance, one might think the read_csv() function loads data into a data frame which we learned about in the last lab. It turns out this is not exactly true. Actually, our data is stored in a new class called a tibble, or more precisely, tbl_df.

  1. Check the class of movies.

    > class(movies)
    ## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"

Tibbles inherit all of the functionality of data frames so we can use tibbles in any function that uses data frames. Tibbles are introduced using the tibble package. You don't need to install or load tibble since it's a subpackage of tidyverse.

We can use glimpse() function to quickly view data in a tibble.

  1. Use glimpse() to view the data in movies.

    > glimpse(movies)
    ## Rows: 9,849
    ## Columns: 11
    ## $ title       <chr> "The Shawshank Redemption", "The Godfather", "Ramayan…
    ## $ year        <chr> "1994", "1972", "1993", "1975", "2022", "2008", "1993…
    ## $ runtime     <dbl> 142, 175, 135, 87, 121, 152, 195, 96, 201, 202, 147, …
    ## $ certificate <chr> "R", "R", "PG", NA, NA, "PG-13", "R", "Approved", "PG…
    ## $ genre       <chr> "Drama", "Crime, Drama", "Animation, Action, Adventur…
    ## $ director    <chr> "['Frank Darabont', 'Tim Robbins', 'Morgan Freeman', …
    ## $ stars       <chr> "['Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'Wil…
    ## $ rating      <dbl> 9.3, 9.2, 9.2, 9.2, 9.1, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0…
    ## $ metascore   <dbl> 82, 100, NA, NA, NA, 84, 95, 97, 94, 90, NA, 86, 95, …
    ## $ votes       <dbl> 2780534, 1935895, 12470, 42018, 13372, 2758250, 13992…
    ## $ gross       <dbl> 28.34, 134.97, NA, NA, NA, 534.86, 96.90, 4.36, 377.8…

I will be providing most of the data in this class. However, there is a lot data readily available for anyone to use. Once such source is the U.S. Census Bureau.

  1. Go to the U.S. Census Bureau website, type "oklahoma education" in the search bar, and find the "S1501 | Educational Attainment" survey data.

  2. Download the data by clicking the icon with three dots in the upper-right hand corner of the report and finding "CSV" in the dropdown list. Move the .csv file into the working directory for R. Feel free to rename the file to whatever you like.

    Downloading data from U.S. Census Bureau

  1. In code block 2, add code that loads the census data into a variable called education.

  2. In code block 3, rename the column labels in education using the following code.

    names(education) <- c("group", "total", "total_error", "tot_percent", "tot_perc_err", "male", "male_error", "m_percent", "m_perc_err", "female", "female_error", "f_percent", "f_perc_err")

  3. In code block 3, add code that glimpses the census data.

In addition to the data I provide, you also have access to datasets already stored in R.

  1. Use the data() function to see the datasets R has by default.

    data()
    A window should open with a list of datasets.

Sometimes we will want to add data to a dataset. This can be done using the add_row() function. Let's demonstrate this using the chickwts dataset which comes with R.

  1. Use the View() function to look at the data in the chickwts dataset.

    View(chickwts)

  2. Use the add_row() function to add a row to this dataset and store it in a variable chickens2.

    chicken2<-add_row(chickwts,weight=502,feed="trash")
    View(chickens2)

Data Wrangling

Now that we've loaded our data into R, you might want to view and manipulate it in different ways. The subpackage dplyr contains several functions useful for this purpose.

Using the select() function, we can create a new tibble picking and choosing which columns we want to use. When we call select(), along with a dataset, we can indicate which columns we want to use. We can also use the - operator to get all columns except the one indicated.

  1. Use the select() function on movies to create smaller datasets. Here's a few examples.

    > select(movies,title,year,genre,rating)
    ## # A tibble: 9,849 × 4
    ##    title                                         year  genre         rating
    ##    <chr>                                         <chr> <chr>          <dbl>
    ##  1 The Shawshank Redemption                      1994  Drama            9.3
    ##  2 The Godfather                                 1972  Crime, Drama     9.2
    ##  3 Ramayana: The Legend of Prince Rama           1993  Animation, A…    9.2
    ##  4 The Chaos Class                               1975  Comedy, Drama    9.2
    ##  5 Daman                                         2022  Adventure, D…    9.1
    ##  6 The Dark Knight                               2008  Action, Crim…    9
    ##  7 Schindler's List                              1993  Biography, D…    9
    ##  8 12 Angry Men                                  1957  Crime, Drama     9
    ##  9 The Lord of the Rings: The Return of the King 2003  Action, Adve…    9
    ## 10 The Godfather Part II                         1974  Crime, Drama     9
    ## # ℹ 9,839 more rows
    ## # ℹ Use `print(n = ...)` to see more rows
    > select(movies,title,rating:gross)
    ## # A tibble: 9,849 × 5
    ##    title                                     rating metascore  votes  gross
    ##    <chr>                                      <dbl>     <dbl>  <dbl>  <dbl>
    ##  1 The Shawshank Redemption                     9.3        82 2.78e6  28.3
    ##  2 The Godfather                                9.2       100 1.94e6 135.
    ##  3 Ramayana: The Legend of Prince Rama          9.2        NA 1.25e4  NA
    ##  4 The Chaos Class                              9.2        NA 4.20e4  NA
    ##  5 Daman                                        9.1        NA 1.34e4  NA
    ##  6 The Dark Knight                              9          84 2.76e6 535.
    ##  7 Schindler's List                             9          95 1.40e6  96.9
    ##  8 12 Angry Men                                 9          97 8.25e5   4.36
    ##  9 The Lord of the Rings: The Return of the…    9          94 1.91e6 378.
    ## 10 The Godfather Part II                        9          90 1.32e6  57.3
    ## # ℹ 9,839 more rows
    ## # ℹ Use `print(n = ...)` to see more rows
    > mov2 <- select(movies,title,year,genre,rating)
    > mov2
    ## # A tibble: 9,849 × 9
    ##    title        year  runtime genre director rating metascore  votes  gross
    ##    <chr>        <chr>   <dbl> <chr> <chr>     <dbl>     <dbl>  <dbl>  <dbl>
    ##  1 The Shawsha… 1994      142 Drama ['Frank…    9.3        82 2.78e6 28.3
    ##  2 The Godfath… 1972      175 Crim… ['Franc…    9.2       100 1.94e6 135.
    ##  3 Ramayana: T… 1993      135 Anim… ['Ram M…    9.2        NA 1.25e4 NA
    ##  4 The Chaos C… 1975       87 Come… ['Ertem…    9.2        NA 4.20e4 NA
    ##  5 Daman        2022      121 Adve… ['Lenka…    9.1        NA 1.34e4 NA
    ##  6 The Dark Kn… 2008      152 Acti… ['Chris…    9          84 2.76e6 535.
    ##  7 Schindler's… 1993      195 Biog… ['Steve…    9          95 1.40e6 96.9
    ##  8 12 Angry Men 1957       96 Crim… ['Sidne…    9          97 8.25e5 4.36
    ##  9 The Lord of… 2003      201 Acti… ['Peter…    9          94 1.91e6 378.
    ## 10 The Godfath… 1974      202 Crim… ['Franc…    9          90 1.32e6 57.3
    ## # ℹ 9,839 more rows
    ## # ℹ Use `print(n = ...)` to see more rows

  1. In code block 4, add code that removes the 'certificate', 'director' and 'stars' columns from the movie data and stores the result in a variable movies2.

  2. In code block 5, add code that removes the error columns from the U.S. census data and stores the result in a variable education2.

The filter() function creates a new tibble from a set of data only using rows that satisfies certain conditions. When we call filter(), along with a dataset, we can add any number of conditions to filter the data.

  1. Use the filter() function on movies2 to create smaller datasets. Here's a couple of examples.

    > filter(movies2, runtime>200)
    ## # A tibble: 32 × 8
    ##    title                 year  runtime genre rating metascore  votes  gross
    ##    <chr>                 <chr>   <dbl> <chr>  <dbl>     <dbl>  <dbl>  <dbl>
    ##  1 The Lord of the Ring… 2003      201 Acti…    9          94 1.91e6 378.
    ##  2 The Godfather Part II 1974      202 Crim…    9          90 1.32e6  57.3
    ##  3 The Message           1976      207 Acti…    8.9        NA 1.00e4  NA
    ##  4 Kill Bill: The Whole… 2006      247 Acti…    8.8        NA 1.46e4  NA
    ##  5 Seven Samurai         1954      207 Acti…    8.6        98 3.56e5   0.27
    ##  6 The Best of Youth     2003      366 Dram…    8.5        89 2.27e4   0.27
    ##  7 Gone with the Wind    1939      238 Dram…    8.2        97 3.25e5 199.
    ##  8 Gangs of Wasseypur    2012      321 Acti…    8.2        89 1.00e5  NA
    ##  9 Satantango            1994      439 Drama    8.2        NA 1.20e4  NA
    ## 10 Ben-Hur               1959      212 Adve…    8.1        90 2.47e5  74.7
    ## # ℹ 22 more rows
    ## # ℹ Use `print(n = ...)` to see more rows
    > filter(movies2, year==2000, metascore>90)
    ## # A tibble: 3 × 8
    ##    title                 year  runtime genre rating metascore  votes  gross
    ##    <chr>                 <chr>   <dbl> <chr>  <dbl>     <dbl>  <dbl>  <dbl>
    ## 1 Yi Yi: A One and a Tw… 2000      173 Dram…    8.1        94  26741   1.14
    ## 2 Werckmeister Harmonies 2000      145 Dram…    8          92  15171   0.03
    ## 3 Crouching Tiger, Hidd… 2000      120 Acti…    7.9        94 276925 128.

Filtering the 'genre' column is not so straightforward. However, we can use the grepl() function to help.

> grepl("i","team")
## FALSE
> grepl("cab","Escabar")
## [1] TRUE

  1. Use the filter() and grepl() functions on movies2 to create datasets of different genres.

    > filter(movies2, grepl("Drama",genre))

We can also get the first or last \(n\) rows from a dataset using respectively the slice_head() and slice_tail() functions.

  1. Use the slice_head() and slice_tail() functions on education2.

    > slice_head(education2,n=5)
    > slice_tail(education2,n=15)

  1. In code block 5, add code that creates a dataset from education2 only containing the education by age information (should be rows 2 through 28) and stores it in a variable called edu_age. Then, glimpse the data.

Notice that some of the entries in our dataset have value NA. There are missing entries in our data. Typically, these are observations which were not made. Missing values don't work like normal values.

> NA==NA
## [1] NA
> NA!=NA
## [1] NA
This means that we can't use filter() like we did before to remove rows with missing values.
> filter(movies2,metascore!=NA)
## # A tibble: 0 × 8
## # ℹ 8 variables: title <chr>, year <chr>, runtime <dbl>, genre <chr>,
## #   rating <dbl>, metascore <dbl>, votes <dbl>, gross <dbl>

To remove rows containing missing values we can use the drop_na() function.

> drop_na(movies2,metascore)
## # A tibble: 7,876 × 8
##    title                 year  runtime genre rating metascore  votes  gross
##    <chr>                 <chr>   <dbl> <chr>  <dbl>     <dbl>  <dbl>  <dbl>
##  1 The Shawshank Redemp… 1994      142 Drama    9.3        82 2.78e6  28.3
##  2 The Godfather         1972      175 Crim…    9.2       100 1.94e6 135.
##  3 The Dark Knight       2008      152 Acti…    9          84 2.76e6 535.
##  4 Schindler's List      1993      195 Biog…    9          95 1.40e6  96.9
##  5 12 Angry Men          1957       96 Crim…    9          97 8.25e5   4.36
##  6 The Lord of the Ring… 2003      201 Acti…    9          94 1.91e6 378.
##  7 The Godfather Part II 1974      202 Crim…    9          90 1.32e6  57.3
##  8 Spider-Man: Across t… 2023      140 Anim…    8.9        86 2.03e5  NA
##  9 Pulp Fiction          1994      154 Crim…    8.9        95 2.13e6 108.
## 10 Inception             2010      148 Acti…    8.8        74 2.45e6 293.
## # ℹ 7,866 more rows
## # ℹ Use `print(n = ...)` to see more rows

  1. In code block 4, add code that creates a dataset from movies2 only containing movies without missing 'gross' values and store this in a variable called moneymovies.

  2. In code block 4, add code that displays all the rows in moneymovies for movies rated 9 and above.

The mutate() function can be used to add columns to your dataset. When we call mutate(), along with a dataset, we can add any number of new column labels with a formula for computing the column values.

  1. Use the mutate() to compute amount of gross revenue earned per minute (in $million/min) for the movies in moneymovies via the following code.

    > mutate(moneymovies,dol_per_min=gross/runtime)
    ## # A tibble: 7,015 × 9
    ##    title     year  runtime genre rating metascore  votes  gross dol_per_min
    ##    <chr>     <chr>   <dbl> <chr>  <dbl>     <dbl>  <dbl>  <dbl>       <dbl>
    ##  1 The Shaw… 1994      142 Drama    9.3        82 2.78e6  28.3       0.200
    ##  2 The Godf… 1972      175 Crim…    9.2       100 1.94e6 135.        0.771
    ##  3 The Dark… 2008      152 Acti…    9          84 2.76e6 535.        3.52
    ##  4 Schindle… 1993      195 Biog…    9          95 1.40e6  96.9       0.497
    ##  5 12 Angry… 1957      96 Crim…    9          97 8.25e5   4.36      0.0454
    ##  6 The Lord… 2003      201 Acti…    9          94 1.91e6 378.        1.88
    ##  7 The Godf… 1974      202 Crim…    9          90 1.32e6  57.3       0.284
    ##  8 Pulp Fic… 1994      154 Crim…    8.9        95 2.13e6 108.        0.701
    ##  9 Inception 2010      148 Acti…    8.8        74 2.45e6 293.        1.98
    ## 10 Fight Cl… 1999      139 Drama    8.8        67 2.22e6  37.0       0.266
    ## # ℹ 7,005 more rows
    ## # ℹ Use `print(n = ...)` to see more rows

The arrange() function sorts your dataset by a chosen column. When we call arrange(), along with a dataset, we can indicate which columns we want to sort out data by starting from the first column name given. By default, this function sorts in ascending order. To sort in descending order, use the desc() function.

  1. Use the arrange() to sort the data in moneymovies. Here are some examples.

    > arrange(moneymovies,metascore)
    ## # A tibble: 7,015 × 8
    ##    title                    year runtime genre rating metascore votes gross
    ##    <chr>                   <dbl>   <dbl> <chr>  <dbl>     <dbl> <dbl> <dbl>
    ##  1 Miss March               2009      90 Come…    5           7 22036  4.54
    ##  2 Unplanned                2019     109 Biog…    5.5        10 14453 19.0 
    ##  3 Death Wish II            1982      89 Acti…    5.9        11 17221 16.1 
    ##  4 Nine Lives               2016      87 Come…    5.3        11 23713 19.6 
    ##  5 Cocktail                 1988     104 Come…    5.9        12 89374 78.2 
    ##  6 Slackers                 2002      86 Come…    5.3        12 15026  4.81
    ##  7 Strange Wilderness       2008      87 Adve…    5.2        12 22435  6.56
    ##  8 Friday the 13th Part V…  1988      88 Horr…    5.2        13 39997 19.2 
    ##  9 Nothing But Trouble      1991      94 Come…    5.1        13 26511  8.48
    ## 10 Cannonball Run II        1984     108 Acti…    5          13 16967 28.1 
    ## # ℹ 7,005 more rows
    ## # ℹ Use `print(n = ...)` to see more rows
    > arrange(moneymovies,desc(metascore))
    ## # A tibble: 7,015 × 8
    ##    title                year runtime genre   rating metascore  votes  gross
    ##    <chr>               <dbl>   <dbl> <chr>    <dbl>     <dbl>  <dbl>  <dbl>
    ##  1 The Godfather        1972     175 Crime,…    9.2       100 1.94e6 135.  
    ##  2 Casablanca           1942     102 Drama,…    8.5       100 5.89e5   1.02
    ##  3 Rear Window          1954     112 Myster…    8.5       100 5.06e5  36.8 
    ##  4 Fanny and Alexander  1982     188 Drama      8.1       100 6.58e4   4.97
    ##  5 Three Colors: Red    1994      99 Drama,…    8.1       100 1.07e5   4.04
    ##  6 Boyhood              2014     165 Drama      7.9       100 3.62e5  25.4 
    ##  7 Notorious            1946     102 Drama,…    7.9       100 1.04e5  10.5 
    ##  8 The Conformist       1970     113 Drama      7.9       100 3.24e4   0.54
    ##  9 City Lights          1931      87 Comedy…    8.5        99 1.90e5   0.02
    ## 10 Army of Shadows      1969     145 Drama,…    8.1        99 2.49e4   0.74
    ## # ℹ 7,005 more rows
    ## # ℹ Use `print(n = ...)` to see more rows
    > arrange(moneymovies,desc(year),runtime)
    ## # A tibble: 7,015 × 9
    ## # A tibble: 7,015 × 8
    ##    title                   year runtime genre rating metascore  votes gross
    ##    <chr>                  <dbl>   <dbl> <chr>  <dbl>     <dbl>  <dbl> <dbl>
    ##  1 Minions: The Rise of …  2022      87 Anim…    6.5        56  77386 370. 
    ##  2 Beast                   2022      93 Acti…    5.6        54  36198  31.8
    ##  3 The Bad Guys            2022     100 Anim…    6.8        64  50836  97.2
    ##  4 Puss in Boots: The La…  2022     102 Anim…    7.9        73 147748 168. 
    ##  5 M3GAN                   2022     102 Horr…    6.4        72 115771  93.9
    ##  6 Ticket to Paradise      2022     104 Come…    6.1        50  56785  68.3
    ##  7 Morbius                 2022     104 Acti…    5.2        35 140517  73.9
    ##  8 DC League of Super-Pe…  2022     105 Anim…    7.1        56  71143  93.7
    ##  9 Lightyear               2022     105 Anim…    6.1        60 116808 118. 
    ## 10 Lyle, Lyle, Crocodile   2022     106 Anim…    6.1        51  12694  46.9
    ## # ℹ 7,005 more rows
    ## # ℹ Use `print(n = ...)` to see more rows

The last innovation from the tidyverse package we will learn about is the pipe operator %>%. Pipe operators allow us to combine multiple operations into one continuous line of code. For example, following code displays the top five grossing horror movies of the 1990's along with the year they were released. Give it a try.

> moneymovies %>% filter(grepl("Horror",genre),year>=1990,year<2000) %>% select(title,year,gross) %>% arrange(desc(gross)) %>% slice_head(n=5)
The input dataset only needs to indicated once before the first %>%. For all the other commands used we can omit the dataset in the function calls. The pipe is always executed from left to right.

  1. In code block 6, add one line of code that displays a list of movie titles (just the titles) from the Star Trek franchise in order of release (earliest to latest).

To save data we can use the write_csv() function.

  1. Use the following line of code to save the data in moneymovies to a new file called imdb_gross.csv.

    > write_csv(moneymovies,"imdb_gross.csv")

  1. Use what you've learned to answer the following questions using the moneymovies dataset.
    1. How many action movies, rated 8 and above, were released before the year 1970? Answer in space marked [Answer Here 1].

    2. What is the title of 5th highest grossing movie released in the year 1999? Answer in space marked [Answer Here 2].

    3. What is the mean runtime (rounded to the nearest tenth) of a Sci-fi movie released after (not during) the year 2015? Answer in space marked [Answer Here 3].

Congratulations! You've mastered data wrangling in R.