Jonathan C. Johnson

MATH 4910/5010 - R Lab 1

This lab is designed to help you learn the basics of coding in R. Because R is not a prerequisite for the course, this guide assumes no background in the language.

Objectives:

Get familiar with the RStudio interface
Learn basic R expressions
Understand basic R data types
Understand R data structures: vectors and data frames
Understand R functions and control structures

You can find the R markdown template for this lab on Canvas in the course files under the "R Lab 1" folder. Follow the instructions below. Instructions in green indicate tasks that should be completed in the R markdown file for this lab.

Basic R Expressions and Data Types

Expressions can be typed directly into the R console. When you press enter, the result is immediately displayed underneath. These expressions can be numerical, character, or logical.

Type some expressions into the R console to see what happens. Below are some sample expressions along with their output. Code after the '>' symbol indicates commands input in the R console. Expressions following the '##' characters indicate what the R console should output. For a list of R operators, see this webpage.

Numerical Expressions
> 5+6

## [1] 11

> (7/3)^2

## [1] 5.444444

> pi

## [1] 3.141593

Character Expressions
> "Go Pokes!!"

## [1] "Go Pokes!!"

> 'You can also use single quotes'

## [1] "You can also use single quotes"

> 'Just say "No"'

## [1] "Just say \"No\""

Logical Expressions
> 11>23

## [1] FALSE

> (3^2+4^2)==5^2

## [1] TRUE

> !(TRUE & FALSE)

## [1] TRUE

Like most programming languages, expressions can be assigned to variables. The assignment operator in R is '<-'. The following code assigns the value of the expression $4-11$ to the variable 'x'.

> x <- 4-11

You can also assign variables in the opposite direction using '->' as follows.

> "Topology is cool!" -> y

When you assign a variable, no output is produced. To see the value of a variable, just input the variable.

> x

## [1] -7

> y

## [1] "Topology is cool!"

Once defined you can use variables in expressions.

> x^2

## [1] 49

Assign the value of $ {2 \over 3} + {3 \over 4}$ to a variable 'w'. Then, output the value of $6w$.

The class() function returns the data type of a variable or expression.

> x <- 5!=5
> class(x)

## [1] "logical"

> class(3/4)

## [1] "numeric"

> class("3/4")

## [1] "character"

There's also a fourth data type called a factor. A factor stores categorical data: each value belongs to one of a fixed set of categories, called levels. We won't be using factors much in this course.
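
For the curious, here is a minimal sketch of what a factor looks like. The variable name sizes and its values are made up for illustration; factor(), levels(), and table() are base R functions.

```r
# Hypothetical data: shirt sizes recorded for five people.
sizes <- factor(c("small", "large", "medium", "small", "large"))

class(sizes)    # "factor"
levels(sizes)   # "large" "medium" "small" (sorted alphabetically by default)
table(sizes)    # counts how many values fall into each category
```

Each entry of sizes is one of the three levels, which is what makes a factor a natural fit for categorical variables.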

Vectors

Most of the time, we will be working with variables that store several values at once. These are called arrays. Arrays consist of a set of values paired with a set of indices. Arrays are indexed by the positive integers starting at 1. (This is different from most other programming languages, which begin indexing at 0.) For example, here's an array with the first 8 Fibonacci numbers.

Index	1	2	3	4	5	6	7	8
Value	1	1	2	3	5	8	13	21

An array of Fibonacci numbers.

There are a few different types of arrays in R. In this lab, we will focus on a couple of them. Vectors are arrays of values with the same data type. Vectors can be defined using the c() function.

Create a vector with the first eight Fibonacci numbers using the code below.

> c(1,1,2,3,5,8,13,21)

## [1] 1 1 2 3 5 8 13 21
Create a vector with the strings "one", "two", and "three" with the following code.

> c("one","two","three")

## [1] "one" "two" "three"
Vectors always have a single data type. See what happens when you run the following code.

> vect <- c(2,"Susan",TRUE)
> vect

What is the data type of vect?
You can see the length of a vector using the length function.

> vect2 <- c(5,6,7,8)
> length(vect2)

## [1] 4

A common type of vector you may want to define is a vector of numbers incremented by a constant value. The ':' operator can be used to create vectors which increment by 1.

> 1:100

## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

This vector is too long to fit on one line, so the output wraps onto several lines. When displaying vectors like this, the numbers in the square brackets indicate the index of the value displayed first in each row. In fact, you may have noticed that all of the output we've seen so far has '[1]' at the front. This is because R stores information in vectors by default, even when it's just one value.
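
One way to convince yourself that a single value is really a vector of length one is the short check below. The variable name v is just an illustration.

```r
v <- 42               # a single number
length(v)             # 1: v is a vector with one entry
v[1]                  # 42: the first (and only) entry
is.vector(v)          # TRUE
identical(v, c(42))   # TRUE: c(42) builds the same length-one vector
```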

When using the ':' operator we can use different values to start and end at different places. We can look at the documentation for an operator to understand its functionality using '?'.

Check the documentation for ':' by inputting the following command into the R console.

> ?':'
Use ':' to create a sequence which starts at 2.5 and counts down by one to -6.5.

If you want to see specific entries of a vector, you can indicate the indices you want in square brackets.

> x <- c(4,8,15,16,23,42)
> x[3]

## [1] 15

> x[2:4]

## [1] 8 15 16

> x[c(3,5,6)]

## [1] 15 23 42

Use the following code to store a sequence to the variable my_seq.

> my_seq<-c(6, -37, 11, 43, 44, 34, 37, 36, 15, 46, -1, 26, -19, -43, -33, 1, -42, 25, 16, -18, 21, 27, -48, 7, 20, 12, -14, 39, 13, -24, 28)

Write code that displays the 10th, 15th and 20th terms of this sequence.

Another way to create iterative sequences is the seq() function. Just like with operators we can use '?' to look at the documentation for a function.

Check the documentation for 'seq()' by inputting the following command into the R console.

> ?seq
Use 'seq()' to create a sequence which counts by 0.25 from 2 to 4.

Before, when we were using the function c(), we were actually concatenating vectors of length one. However, we can also concatenate longer vectors using the same function.

Try using c() to concatenate some vectors together.

> c(2:7,14:10)

## [1] 2 3 4 5 6 7 14 13 12 11 10

> c(0,100:90)

## [1] 0 100 99 98 97 96 95 94 93 92 91 90

> v1 <- c(9,3/4,-pi)
> c(v1,22,seq(6.3,7.5,0.2))

## [1] 9.000000 0.750000 -3.141593 22.000000 6.300000 6.500000 6.700000
## [8] 6.900000 7.100000 7.300000 7.500000

Vectors can be used in expressions to do computations. Operations will be applied to each value of the vector.

> x <- 1:6
> x*2

## [1] 2 4 6 8 10 12
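
The same entry-by-entry behavior holds for other operators and for many built-in functions. The lines below are a small sketch using the same vector; the expected results are shown as comments.

```r
x <- 1:6
x + 10     # 11 12 13 14 15 16
x^2        # 1 4 9 16 25 36
sqrt(x)    # the square root of each entry
x > 3      # FALSE FALSE FALSE TRUE TRUE TRUE
```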

What happens if we try to type in the following code?

> x <- 1:6
> y <- 0:5
> x * y
As you may have noticed, when more than one vector is used in an expression, R starts by using the first term of every vector. Then, for each subsequent term of the output, the index into each vector is increased by one. But what happens if the vectors have different lengths? Well, let's see.

> x <- 1:6
> y <- c(1,0)
> x * y

That's right, R will cycle through the terms of shorter vectors until all the terms of the longest vector are used.
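
This cycling works the same way for other operators. Here is a small sketch with addition; the names long and short are made up for this example. Writing out the repeated copy of the shorter vector with rep() gives the same result.

```r
long  <- 1:6
short <- c(10, 20)

long + short                   # 11 22 13 24 15 26
long + rep(short, times = 3)   # same result, with the cycling written out
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still cycles through the shorter vector but prints a warning.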

Let $(a_n)$ be the following sequence: \[ a_n=2,1,2,1,2,1,\ldots=\begin{cases}2 & n\text{ is odd}\\ 1 & n\text{ is even}\end{cases} \] Define the terms of the sequence $(b_n)$ as follows: \[ b_n={a_n n^2 \over 2} \] Using vectors, write code that displays the first 50 terms of $(b_n)$ in code block 1.

Data Frames

When analyzing data, we usually don't just have an arbitrary sequence of values. What we usually have is several related variables collected from observations. Each observation has a value for each of these variables (some variables could be missing if they were not observed). Essentially, what we want is a table with a column for each variable and a row for each observation. To handle this type of information we can use an R data type called a data frame.

A data frame is a collection of equal length vectors with a header row of column labels. We usually think of data frames as having each row correspond to some observation and each column correspond to some property of the observations. Data frames can be defined directly using the data.frame() function.

Use the following code to create a data frame called nfc_east.

> city <- c("Dallas","New York","Washington","Philadelphia")
> nickname <- c("Cowboys","Giants","Commanders","Eagles")
> super_bowls <- c(5:3,1)
> nfc_east <- data.frame(city,nickname,super_bowls)
> nfc_east

##           city   nickname super_bowls
## 1       Dallas    Cowboys           5
## 2     New York     Giants           4
## 3   Washington Commanders           3
## 4 Philadelphia     Eagles           1
Confirm the data type of nfc_east using the class() function.
You can use the View() function to see the data inside nfc_east in a worksheet.
The colnames() function shows us the column labels of our data.

> colnames(nfc_east)

## [1] "city" "nickname" "super_bowls"
To see the dimensions of our data we can use the dim() function.

> dim(nfc_east)

## [1] 4 3

This means nfc_east has 4 rows (observations) and 3 columns (variables).
Since our data is indexed by two numbers, a row and a column, we can use ordered pairs to call specific entries in the data frame.

> nfc_east[3,2]

## [1] "Commanders"

We can even call entire rows or columns by leaving one index blank.

> nfc_east[2,]

##       city nickname super_bowls
## 2 New York   Giants           4

> nfc_east[,3]

## [1] 5 4 3 1

If you want to call a column using its label, you can use the '$' operator.

> nfc_east$city

## [1] "Dallas" "New York" "Washington" "Philadelphia"
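
A column pulled out with '$' is an ordinary vector, so everything from the Vectors section applies to it. The sketch below rebuilds nfc_east so that it runs on its own, and assumes R 4.0 or later, where data.frame() keeps strings as character vectors.

```r
# Rebuild the data frame from above so this block is self-contained.
city <- c("Dallas", "New York", "Washington", "Philadelphia")
nickname <- c("Cowboys", "Giants", "Commanders", "Eagles")
super_bowls <- c(5:3, 1)
nfc_east <- data.frame(city, nickname, super_bowls)

class(nfc_east$super_bowls)   # "numeric"
length(nfc_east$city)         # 4
nfc_east$nickname[4]          # "Eagles"
sum(nfc_east$super_bowls)     # 13
```
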
A very neat function you can use to perform basic statistical analysis on data is the summary() function. Give it a try!

> summary(nfc_east)

##     city             nickname          super_bowls
## Length:4           Length:4           Min.   :1.00
## Class :character   Class :character   1st Qu.:2.50
## Mode  :character   Mode  :character   Median :3.50
##                                       Mean   :3.25
##                                       3rd Qu.:4.25
##                                       Max.   :5.00
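
Each number in the super_bowls column of the summary can also be computed on its own. The sketch below does this for the same four values; the variable name sb is just an illustration, and quantile() is the base R function for quartiles.

```r
sb <- c(5, 4, 3, 1)    # the super_bowls column from nfc_east

min(sb)                # 1
quantile(sb, 0.25)     # 2.5  (1st quartile, labelled 25%)
median(sb)             # 3.5
mean(sb)               # 3.25
quantile(sb, 0.75)     # 4.25 (3rd quartile, labelled 75%)
max(sb)                # 5
```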

In code block 2, write code to store the data from the following table into a data frame. Then, summarize the data using the summary() function.

Student	Exam 1	Exam 2	Exam 3
Olive Schaefer	77	76	61
Ishaan Hoover	92	90	97
Virginia Moss	75	94	60
Porter Navarro	95	79	81
Winter Ochoa	91	89	80
Winston Jennings	71	78	85
Palmer Dunn	74	96	84
Dawson McCormick	63	72	73
Macie Stein	82	69	64
Creed Newton	83	98	68

A table of test scores.

What was the mean score for Exam 1? Write your answer in the space marked [Answer Here 1].

We will learn more about data frames in the next lab.

Functions

We've already learned about several functions. Here are a few more convenient ones.

The functions abs(), log(), and sqrt() return, respectively, the absolute value, the natural log, and the square root of a number.

> abs(-3)

## [1] 3

> log(10)

## [1] 2.302585

> sqrt(2)

## [1] 1.414214

The functions sum(), mean(), and sd() return, respectively, the sum, the mean, and the standard deviation of the entries in a vector.

> sum(1:10)

## [1] 55

> mean(1:10)

## [1] 5.5

> sd(1:10)

## [1] 3.02765
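
Note that sd() computes the sample standard deviation, which divides by $n-1$ rather than $n$: \[ s=\sqrt{{(x_1-\bar{x})^2+\cdots+(x_n-\bar{x})^2 \over n-1}} \] The sketch below checks this by hand using only functions introduced so far. The name manual_sd is made up for this example.

```r
x <- 1:10
n <- length(x)

# Sum the squared deviations from the mean, divide by n - 1, take the root.
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

manual_sd   # 3.02765
sd(x)       # 3.02765
```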

The objects between the parentheses of a function are called arguments. To see what arguments can be passed to a function, view its documentation using '?'.

View the documentation for the rnorm() function. What arguments can be passed to the rnorm() function, and what does the function do with them? Write your response in the space marked [Answer Here 2].

Check out the documentation for the rep() function to see what it does.
When passing arguments, R assumes that you are listing them in order. However, you can also pass a specific argument by name using '='. Input the following commands to see what happens.

> rep(3,4)

> rep(4,3)

> rep(times=4,x=3)

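The same idea applies to any function. As a sketch, here it is with seq(), whose first three arguments are named from, to, and by (see ?seq).

```r
seq(1, 10, 3)                     # 1 4 7 10 (arguments matched by position)
seq(from = 1, to = 10, by = 3)    # 1 4 7 10 (arguments matched by name)
seq(by = 3, to = 10, from = 1)    # 1 4 7 10 (order is irrelevant when named)
seq(1, 10, length.out = 4)        # 1 4 7 10 (positions and a name mixed)
```
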
You can also create your own functions.

Define a function plus1() using the following code.

> plus1 <- function(x) x+1
> plus1(7)

## [1] 8
Here's a function that computes dot products.

> dot_prod <- function(x,y) sum(x*y)
> dot_prod(c(1,2,3),c(3,2,1))

## [1] 10
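
Function bodies that need more than one step can be wrapped in curly braces, and arguments can be given default values. The function below is a hypothetical example written for this purpose (power_sum is not a built-in function). The value of the last expression in the body is what the function returns.

```r
# Returns the sum of the entries of x, each raised to the power p.
# If p is not supplied, it defaults to 2.
power_sum <- function(x, p = 2) {
  powers <- x^p
  sum(powers)
}

power_sum(c(1, 2, 3))          # 14, since 1 + 4 + 9 = 14
power_sum(c(1, 2, 3), p = 3)   # 36, since 1 + 8 + 27 = 36
```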

In code block 3, create a function d_s() which can be passed two numerical vector arguments of the same length and returns the distance between them in the standard metric. Recall that given two points $x,y\in\mathbb{R}^n$ where $x=(x_1,\ldots,x_n)$ and $y=(y_1,\ldots,y_n)$, \[ d_s(x,y)=\sqrt{(x_1-y_1)^2+\cdots+(x_n-y_n)^2} . \] Use this function to find the distance between the points $(-1,2,0,5)$ and $(1,-5,1,1)$.

Control Structures

In R, you can get very far just using expressions and functions. However, sometimes you may want to use more complex programming structures. Two of the most basic control structures are if statements, which choose which commands to run based on a condition, and loops, which allow us to execute the same set of commands repeatedly while iterating through data.

Here's the if statement syntax:

if statement:
if (<boolean expression>) {<expression if true>}

if-else statement:
if (<boolean expression>) {<expression if true>} else {<expression if false>}

Here's the loop syntax:

for loop:
for (<iterated element> in <vector>) {<looped expression>}

while statement:
while (<boolean expression>) {<looped expression>}
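
To make the templates concrete, here is one small sketch of each structure. The variable names n, parity, total, k, and m are made up for this example.

```r
# if-else: decide whether n is even or odd.
n <- 7
if (n %% 2 == 0) {parity <- "even"} else {parity <- "odd"}
parity    # "odd"

# for loop: add up the numbers 1 through 5.
total <- 0
for (k in 1:5) {total <- total + k}
total     # 15

# while loop: keep doubling m until it reaches at least 100.
m <- 1
while (m < 100) {m <- m * 2}
m         # 128
```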

Assign a number to the variable x. Test the following code.

> if (x>0) {print("positive")}

> if (x<0) {print("negative")}

> if (x>0) {print("positive");} else {if (x<0) {print("negative");}else{print("zero");}}
Here's an example of some loops. See what they do!

> for(i in 1:10) {print(rep("Hi",i))}

> s <- 0
> i <- 0
> diff <- 1
> while(diff>0.0005) {
diff <- 1/factorial(i);
s <- s+diff;
i=i+1;
}
> s
This loop creates a vector of prime numbers between 1 and 50.

> primes <- c()
> for(i in 1:50) {if(i > 1 & (i%%2 != 0 | i==2) & (i%%3 != 0 | i==3) & (i%%5 != 0 | i==5) & (i%%7 != 0 | i==7)){primes<-c(primes,i);}}
> primes

## [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47
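
The condition in the loop above only tests divisibility by 2, 3, 5, and 7. That is enough here because every composite number up to 50 has a prime factor of at most 7 (the next prime is 11, and $11^2=121>50$). A more general sketch, which does not hard-code the small primes, uses a while loop inside the for loop to try every divisor d with $d^2\le i$. The names is_prime and d are made up for this example.

```r
primes <- c()
for (i in 2:50) {
  is_prime <- TRUE
  d <- 2
  # Try each possible divisor d up to the square root of i.
  while (d * d <= i) {
    if (i %% d == 0) {is_prime <- FALSE}
    d <- d + 1
  }
  if (is_prime) {primes <- c(primes, i)}
}
primes   # 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47
```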

In code block 4, use if statements and for loops to create a vector of all the numbers between 1 and 100 which are both multiples of 3 and factors of 288.

Comments

Sometimes R code can be hard to read. To assist readers, you can add comments to your code using '#'.

Copy the definition below into code block 5.

ps <- function(s,n){sum(s[1:n])}
# The function ps(s,n) returns the sum of the first n terms of the vector s.
Use the ps() function to display the first 10 partial sums of the following sequence. \[ (a_n)=(1,-{1 \over 2},{1 \over 3},-{1 \over 4},{1 \over 5},-{1 \over 6},\ldots) \]

Congratulations! You've mastered the basics of R programming.