How to speak R

Creating objects in R

We just learnt about the differences between the console and scripts, and we commanded R to carry out some simple (and not so simple) analyses for us. For a bit of a recap, type 3 + 5 into the console. This is cool but has major limitations. If we wanted to add that answer to 10 we would have to repeat the process by typing 3 + 5 + 10, because none of what we did was ‘stored’ anywhere. This is where R becomes even more useful: we can assign values to objects. Let’s suppose we have two people, one of whom has three apples and another who has five apples. We can command R to store the sum of all the apples by coding apples.sum <- 3 + 5. Who would have thought you would have to do story sums after grade three!?

3 + 5

## [1] 8

3 + 5 + 10

## [1] 18

apples.sum <- 3 + 5

See how the answer for each command appears directly below it? But what about the last command? Why do we not see 8 anywhere there? That’s because all we have told R to do is to store the sum of 3 + 5 in apples.sum. We haven’t told R to tell us what apples.sum equals. But look here:

apples.sum <- 3 + 5
apples.sum

## [1] 8

We have told R that the value of apples.sum is 3 + 5 and then we have told it to tell us what that value is! This is great! Now suppose we know that the person with three apples is named John (John <- 3) and the person with five apples is named Thandeka (Thandeka <- 5). We could then get the sum of the apples by coding apples.sum <- John + Thandeka:

John <- 3
Thandeka <- 5
apples.sum <- John + Thandeka
apples.sum

## [1] 8

Neat hey? There is a lot going on here but break it down into its basic algebra and it should become easier to understand. The only really new thing is <- which is called the “assignment operator”. Its job is to assign the value on the right to the object on the left. You can type this quickly in RStudio by pushing Alt at the same time as the - (minus) key.
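
To tie this back to the limitation we started with, here is a minimal sketch of reusing a stored value instead of retyping the sum. The name apples.total is made up for this illustration:

```r
# Store the sum once...
apples.sum <- 3 + 5

# ...then reuse it. apples.total is a hypothetical name used only here.
apples.total <- apples.sum + 10
apples.total
```

Because apples.sum holds the value 8, the last line should print 18 without us ever retyping 3 + 5.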

Objects and variables

On the surface there is little difference between these two terms. What are commonly called variables in other coding languages tend to be referred to as objects in R. The terms can be used interchangeably in most instances in R. In statistical contexts, however, ‘variable’ carries the same meaning in R as it does in general statistics.

Objects can be given almost any name but there are some important guidelines and considerations. The text below is taken from an R for Data Science section:

4.2 What’s in a name?

Object names must start with a letter, and can only contain letters, numbers, _ and .. You want your object names to be descriptive, so you’ll need a convention for multiple words. I recommend snake_case where you separate lowercase words with _.

i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention

We’ll come back to code style later, in functions.

You can inspect an object by typing its name:

x
#> [1] 12

Make another assignment:

this_is_a_really_long_name <- 2.5

To inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.

Ooops, you made a mistake! this_is_a_really_long_name should have value 3.5 not 2.5. Use another keyboard shortcut to help you fix it. Type “this” then press Ctrl + ↑. That will list all the commands you’ve typed that start those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.

Make yet another assignment:

r_rocks <- 2 ^ 3

Let’s try to inspect it:

r_rock
#> Error: object 'r_rock' not found
R_rocks
#> Error: object 'R_rocks' not found

There’s an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters.

In general you should note that:

Object names cannot start with a number
Objects cannot carry the same name as R’s reserved words (e.g., if, else, for) and it is best not to reuse the names of existing functions and built-in objects (e.g., c, T, mean, data, df, weights).
As R is a dynamic language it is best to use nouns for object names and verbs for function names.
Be consistent with how you name your variables. There are several style guides available online which dictate how you should format your code and name your objects and functions:
The tidyverse style
The Jean Fan style
The Google style

These might all seem overwhelming but it is good practice to adopt one of these and use it consistently. The most popular one is the tidyverse style. RStudio also supports addins that can check your code style or reformat your code and suggest alternatives.
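
The naming rules above can be checked directly in the console. The sketch below uses a made-up object called leaf_count, together with the base R functions exists() and make.names():

```r
# leaf_count is a hypothetical object name written in snake_case.
leaf_count <- 12

exists("leaf_count")      # TRUE: this object is defined
exists("Leaf_count")      # FALSE: R is case sensitive, so this is a different name

# make.names() shows how base R would repair a name that starts with a digit.
make.names("2nd_sample")  # "X2nd_sample"
```

You will rarely need make.names() yourself, but it is a handy way to see why a name such as 2nd_sample is not allowed.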

Back to coding!

We learnt that algebra works in R but that there are some slight differences, especially regarding the value assignment functionality. We can command R to both assign a value to an object and print the object’s value by enclosing the command in parentheses ():

John <- 3                       # doesn't print anything
Thandeka <- 5                   # doesn't print anything
(apples.sum <- John + Thandeka) # but this does!

## [1] 8

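Objects can have their values manipulated or overwritten by either applying simple arithmetic to them or by assigning new values to them. Note that arithmetic on its own does not change the stored value; the result has to be assigned to an object to be kept: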

apples.sum                        # print current value of apples.sum

## [1] 8

apples.sum * 2                    # compute and print the product of apples.sum and 2

## [1] 16

(apples.sum.2 <- apples.sum * 2)  # compute, store in a new object and print the product of apples.sum and 2

## [1] 16
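
The examples above store the product in a new object. Overwriting the original object works the same way; here is a small sketch, assuming apples.sum starts at 8 as before:

```r
apples.sum <- 8

# R evaluates the right-hand side first (8 * 2), then stores the result
# back in the same object, replacing the old value.
apples.sum <- apples.sum * 2
apples.sum                     # now 16

# Assigning a completely new value discards the old one as well.
apples.sum <- 3
apples.sum                     # now 3
```

Once an object is overwritten the previous value is gone, so use a new name (as with apples.sum.2) whenever you still need the original.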

Comments

Sometimes R code can get confusing to follow (both for you and for someone else) and so it is important to describe exactly what you are doing as you do it. That is where comments come in. A comment is text that is stored in your script but does not affect the code’s output. In R, everything on a line after a # is treated as a comment. You can use comments to describe the code, or you can comment out sections of code within a script that you do not want to run - this is useful for troubleshooting pesky code. To do this in RStudio you can select the lines and then push Ctrl + Shift + C.
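
As a short illustration, the sketch below reuses the apples example with both kinds of comment: one that describes the code and one that switches a line off.

```r
# Total number of apples held by both people.
John <- 3
Thandeka <- 5
apples.sum <- John + Thandeka   # a comment can also follow code on the same line

# The next line is "commented out", so R skips it entirely:
# apples.sum <- apples.sum * 100

apples.sum
```

Because the multiplication is commented out, apples.sum should still print as 8.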

Functions and their arguments

Functions are the heart of what makes programming so effective. A function is essentially a script that someone else (and eventually maybe you) has written: a set of commands that carries out a particular task. ‘Base R’ contains some fundamental functions but more specific functions can be accessed by installing various other R packages. Functions usually take one or more inputs. These are called arguments. If the supplied arguments are valid then the function will generally return a value (or many values) which can either be displayed immediately or assigned to an object for future reference. Let us look at the sqrt() function as an example. Call up the help file for this function by typing ?sqrt into the console and pushing Enter. See how it shows sqrt(x), where under the Arguments section x is described as a numeric or complex vector or array? That basically means that sqrt() needs a numeric input:

sqrt(4) # this works nicely

## [1] 2

a <- "a" # make a equal the letter "a"
sqrt(a) # this doesn't...

## Error in sqrt(a): non-numeric argument to mathematical function

a <- 4  # make a = 4
sqrt(a) # this should work now

## [1] 2

sqrt() is a fairly simple function in that it only accepts one argument. Many functions can accept (and require) more than just one argument. A simple one we can look at is round(). This rounds a numeric value to a defined number of decimal places. Typing args() lets you see what the arguments for a particular function are. If we type

args(round)

## function (x, digits = 0)
## NULL

we’ll see that round() accepts x and digits =. Again, x is a numeric vector or array. digits = refers to the number of decimal places you want after the point. Here digits has a default value of 0, and you will find that many arguments have defaults defined. That means that you do not need to supply every argument, but you can override the defaults when you need to:

a <- pi                   # a = pi = 3.14159
round(a)                  # round a off to the default number of digits

## [1] 3

round(a, digits = 3)      # override the default and round to three decimal places

## [1] 3.142

round(digits = 3, x = a)  # named arguments can be supplied in any order without changing the result

## [1] 3.142

round(a, 3)               # unnamed arguments are matched by position, in the order the function defines them

## [1] 3.142

round(3, a)               # not explicitly naming the arguments is a bad idea: here 3 is rounded and a is used as digits

## [1] 3

Be sure to name your arguments explicitly when dealing with more complicated functions. You can save yourself and the people reading your code a lot of trouble if you do. Most functions do, however, have required arguments (those without a default value), and by convention we supply these first, in the order the function defines them. Eventually you will get the hang of this process and, for the more common functions, you will not need to name each argument.
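
To see where defaults come from, it can help to write a tiny function of your own. The sketch below is only an illustration: round_length() is a hypothetical helper (not part of base R) that wraps round() and sets its own default of one decimal place.

```r
# round_length() is a made-up helper for this example.
# x is required; digits is optional because it has a default value.
round_length <- function(x, digits = 1) {
  round(x, digits = digits)
}

round_length(5.8372)                  # uses the default: 5.8
round_length(5.8372, digits = 2)      # overrides the default: 5.84
round_length(digits = 2, x = 5.8372)  # named arguments in any order: 5.84

args(round_length)                    # shows x and digits = 1
```

The function would fail if x were left out, since nothing in the definition says what value to fall back on.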

Vectors and data types

One of the most common data structures in R is the vector. A vector is an object composed of a series of values, which can be numeric, character, factor, or logical values. Suppose we go out and collect data on the lengths (in centimeters) of leaves for a BIOL 101 prac. We need to collect five measurements from each of three tree species. We can store these data in length.cm:

length.cm <- c(5.8, 4.8, 3.7, 5.3, 4.5, 8.3, 8.8, 9.7, 7.7, 8.1, 15.2, 16.1, 14.3, 12.2, 15.5)
length.cm

##  [1]  5.8  4.8  3.7  5.3  4.5  8.3  8.8  9.7  7.7  8.1 15.2 16.1 14.3 12.2
## [15] 15.5

We can then record the name of each tree in a vector in a similar manner:

names.tree <- c("Spp 1", "Spp 1", "Spp 1", "Spp 1", "Spp 1", "Spp 2", "Spp 2", "Spp 2", "Spp 2", "Spp 2", "Spp 3", "Spp 3", "Spp 3", "Spp 3", "Spp 3")

The quotation marks here are essential. They tell R that each value assigned to the vector is a character string. Without the quotation marks R would try to read Spp 1, Spp 2, and Spp 3 as code (object names) rather than text. This would throw an error because no such objects exist in our current R session and, with the space in them, they are not even valid names.
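
Typing the same label fifteen times is tedious and easy to get wrong. As an optional shortcut, the base R function rep() can build the same vector; this is just a sketch of an alternative to the typed-out version above.

```r
# Repeat each species label five times, in order.
names.tree <- rep(c("Spp 1", "Spp 2", "Spp 3"), each = 5)
names.tree

length(names.tree)   # 15
```

The each = 5 argument repeats every label five times before moving on to the next one, which matches the order in which the leaves were measured.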

We can get an overview of a vector. length() will tell us how many elements are in the vector:

length(length.cm)

## [1] 15

length(names.tree)

## [1] 15

Vectors must be made up of a series of elements of the same data type. To see the data type of a vector we can call class():

class(length.cm)

## [1] "numeric"

class(names.tree)

## [1] "character"

str() gives a compact summary of an object’s structure. This becomes especially handy when we get to larger data frames:

str(length.cm)

##  num [1:15] 5.8 4.8 3.7 5.3 4.5 8.3 8.8 9.7 7.7 8.1 ...

str(names.tree)

##  chr [1:15] "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 2" "Spp 2" ...

Suppose we collected five more measurements from a fourth species. We could add these data to our vectors as follows:

length.cm <- c(length.cm, 22.1, 22.5, 20.3, 25.1, 23.3)
names.tree <- c(names.tree, "Spp 4", "Spp 4", "Spp 4", "Spp 4", "Spp 4")
length.cm

##  [1]  5.8  4.8  3.7  5.3  4.5  8.3  8.8  9.7  7.7  8.1 15.2 16.1 14.3 12.2
## [15] 15.5 22.1 22.5 20.3 25.1 23.3

names.tree

##  [1] "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 2" "Spp 2" "Spp 2"
##  [9] "Spp 2" "Spp 2" "Spp 3" "Spp 3" "Spp 3" "Spp 3" "Spp 3" "Spp 4"
## [17] "Spp 4" "Spp 4" "Spp 4" "Spp 4"

Note that in order to retain the original data contained in the vectors we have to ‘reassign’ the vector to itself together with the new data. If we did not include the original vector inside c(), the new data would overwrite the original data.

Numeric and character are the two fundamental data types in R but there are a few others that are important:

"logical" which stores TRUE and FALSE (boolean) type data
"integer" which stores whole numbers

Besides vectors there are several other important data structures in R. These include list, matrix, data.frame, factor, and array.
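
Both of these extra types can be inspected with class(), just like before. The object names below (is.tall and n.leaves) are made up for this sketch:

```r
# A logical vector: TRUE/FALSE values are typed without quotation marks.
is.tall <- c(TRUE, FALSE, TRUE)
class(is.tall)      # "logical"

# An integer vector: the L suffix tells R to store whole numbers as integers.
n.leaves <- c(12L, 7L, 9L)
class(n.leaves)     # "integer"

# Without the L suffix the same numbers are stored as "numeric".
class(c(12, 7, 9))  # "numeric"
```

In everyday work the difference between integer and numeric seldom matters, but it explains why class() sometimes reports "integer" for imported data.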

Quiz time

Vectors can be of several types but what happens if we mix different types together?

Answer
R will implicitly convert each element to the same data type.

Give the data types of each of these vectors:

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

Answer
character, numeric, character, character.
R does this by trying to find the ‘lowest common denominator’ amongst all the elements without losing any data.

You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?

Answer
logical → numeric → character ← logical
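
You do not have to take the quiz answers on trust. A quick sketch for checking them yourself with class(), plus as.numeric() to convert the ‘tricky’ vector back into numbers:

```r
num_char <- c(1, 2, 3, "a")
class(num_char)      # "character": the numbers become text

num_logical <- c(1, 2, 3, TRUE)
class(num_logical)   # "numeric"
num_logical          # TRUE has been coerced to 1

tricky <- c(1, 2, 3, "4")
class(tricky)        # "character": one quoted value is enough
as.numeric(tricky)   # explicit coercion back to the numbers 1 2 3 4
```

Explicit coercion with as.numeric() only works cleanly when every element looks like a number, as it does in tricky.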

Subsetting vectors

We can call a vector’s name and it will return the entire contents of the vector:

names.tree

##  [1] "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 1" "Spp 2" "Spp 2" "Spp 2"
##  [9] "Spp 2" "Spp 2" "Spp 3" "Spp 3" "Spp 3" "Spp 3" "Spp 3" "Spp 4"
## [17] "Spp 4" "Spp 4" "Spp 4" "Spp 4"

but suppose we only want a few selected values of the vector. We can do this by telling R exactly what we want:

names.tree[2]

## [1] "Spp 1"

names.tree[c(2, 6, 9)]

## [1] "Spp 1" "Spp 2" "Spp 2"
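
Square brackets accept more than a list of positions. The sketch below shows three other common patterns on a shortened copy of length.cm (only the first ten measurements, to keep the output small):

```r
length.cm <- c(5.8, 4.8, 3.7, 5.3, 4.5, 8.3, 8.8, 9.7, 7.7, 8.1)

length.cm[1:3]            # a range of positions: 5.8 4.8 3.7
length.cm[-1]             # a negative index drops that element
length.cm[length.cm > 8]  # a condition keeps matching values: 8.3 8.8 9.7 8.1
```

The last form works because length.cm > 8 produces a logical vector, and R keeps only the elements where it is TRUE.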

Working with individual vectors is great - but suppose we wanted to put them together in table form. We can do that quite easily with the cbind() function (this stands for “column bind”). We bind the two vectors and then pass the result to as.data.frame(). Keep in mind that cbind() first produces a matrix, and a matrix can hold only one data type, so the lengths are coerced to character along the way. as.data.frame() then converts the matrix into an object that has columns and rows much like a spreadsheet does:

data.prac <- as.data.frame(cbind(names.tree, length.cm))
data.prac

##    names.tree length.cm
## 1       Spp 1       5.8
## 2       Spp 1       4.8
## 3       Spp 1       3.7
## 4       Spp 1       5.3
## 5       Spp 1       4.5
## 6       Spp 2       8.3
## 7       Spp 2       8.8
## 8       Spp 2       9.7
## 9       Spp 2       7.7
## 10      Spp 2       8.1
## 11      Spp 3      15.2
## 12      Spp 3      16.1
## 13      Spp 3      14.3
## 14      Spp 3      12.2
## 15      Spp 3      15.5
## 16      Spp 4      22.1
## 17      Spp 4      22.5
## 18      Spp 4      20.3
## 19      Spp 4      25.1
## 20      Spp 4      23.3
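
The printed table looks right, but it is worth checking the column types with str(). The sketch below uses a shortened version of the data and compares the cbind() route with calling data.frame() directly. The names via.cbind and direct are made up for this comparison:

```r
names.tree <- c("Spp 1", "Spp 1", "Spp 2", "Spp 2")
length.cm <- c(5.8, 4.8, 8.3, 8.8)

# Route 1: cbind() builds a matrix first, and a matrix holds a single type,
# so the lengths no longer come through as numbers
# (chr in R 4.x, Factor in older versions).
via.cbind <- as.data.frame(cbind(names.tree, length.cm))
str(via.cbind)

# Route 2: data.frame() keeps each column's own type, so length.cm stays numeric.
direct <- data.frame(names.tree, length.cm)
str(direct)
```

If you plan to calculate anything from length.cm, such as a mean, the second route avoids having to convert the column back to numbers first.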

Whilst we might not always have to do this kind of work (much of the data we will use in R will likely already be formatted in this manner when we import it into R) the principles described here provide a good introduction to the way you need to think in order to code effectively in R. That is all for this section - next we will move on to “data wrangling” with dplyr.

Baby steps

Stuart Demmer

28 July 2018