## Random R: comparing values

Welcome to another episode of “Random R,” where we’ll ask random programming and statistical questions and answer them with R. Today, for whatever reason, let’s say we want to dive into methods for comparing values. We’ll start simple (e.g. is 5 greater than 4? Read on to find out.) and then work our way towards trickier element-wise comparisons among multiple matrices.

## Comparing scalars

Let’s say we have two variables, and we want to see which one is larger. The variables are called scalars because they’re just one value each. We can use either the max function or ifelse.

```
# Create the variables
a <- 5
b <- 4

# Method 1: max
max(a, b)

# Method 2: ifelse
ifelse(a > b, yes = a, no = b)
```

The above code is pretty straightforward. max(a, b) finds the maximum value in the set of a and b. ifelse reads like this: “Is a greater than b? If yes, return a. If no, return b.”

Note that ifelse gives us more flexibility because we can specify what happens when the logical statement a > b is either true or false. The below code is a small modification that gives us the identity of the variable that’s larger, instead of the value of that variable.

```
# Return the identity of the larger variable
ifelse(a > b, yes = "a", no = "b")
```

While ifelse has this nice flexibility here, max excels when you have more than two variables (and for some reason you’re determined to keep them all separate variables and not combine them into a vector, matrix or list… more on that later). Say that instead of just a and b, for example, we have a whole bunch of variables we want to find the maximum for.

```
# Find the maximum among 5 variables
a <- 5 ; b <- 4 ; c <- 10 ; d <- 8 ; e <- 10
max(a, b, c, d, e)
```

The above code lets us find the maximum of the five separate scalars, but working out which scalar (or scalars) holds that maximum would be a nightmare. (We’d probably need nested conditional statements for all the comparisons, where it’s easy to make a logical or syntax mistake. Later on we nest conditionals for just three variables and it already starts getting lengthy.) Let’s try a different approach.

## Comparing vectors and matrices

Finding the identity of the max value above when we’re determined to keep the data as separate scalars is needlessly confusing and giving me a headache… it’s much simpler if we let R know that these variables are related somehow (e.g. they’re all height measurements, or time increments, or number of people moshing at a random moment in a metal show, etc.). We can do that by combining them into a vector. The elements of the vector are our variables.

```
# Combine the 5 variables into a vector called "data"
a <- 5 ; b <- 4 ; c <- 10 ; d <- 8 ; e <- 10
data <- c(a, b, c, d, e)
```

Now we can find the max and ask which position in our vector corresponds to that maximum value. Because our variables are just named after sequential letters in the alphabet, we can just index the built-in letters variable. (If your variables had other names, like “location1”, “location2”, etc., you’d need to have a separate vector of names that you’d then index.)

```
# Find the maximum and its position(s) in the vector
max(data)
which(data == max(data))

# Variable name(s)?
letters[which(data == max(data))]
```
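If your variables aren’t conveniently named after letters, one option is to store the names together with the values. Here’s a minimal sketch; the location names are made up for illustration, and the values are the same five numbers as above.

```r
# Hypothetical names paired with the five values from the example above
site.values <- c(location1 = 5, location2 = 4, location3 = 10,
                 location4 = 8, location5 = 10)

# Names of every element that ties for the maximum
max.names <- names(site.values)[site.values == max(site.values)]
max.names     # "location3" "location5"

# which.max() returns only the first position of the maximum
first.max <- names(which.max(site.values))
first.max     # "location3"
```

Since which.max() only reports the first maximum, the == comparison is the safer choice when ties are possible.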

So far, we’ve just been comparing scalars to each other. We ended by combining multiple scalars into a vector and then finding the max of that vector. But what if we have multiple vectors?

If we want to still just find the single maximum value among whatever we feed into the max function, we’ll do exactly what we did before.

```
# Create the variables (letting R generate random
# numbers)
vector1 <- runif(n = 10, min = 0, max = 100)
vector2 <- runif(n = 10, min = 0, max = 100)
vector3 <- runif(n = 10, min = 0, max = 100)

# What's the maximum value out of those 30 values?
max(vector1, vector2, vector3)

```

But let’s say that instead of wanting to find the single maximum value, we want to compare each element of the vectors to each other and keep the largest value. So we want to look at the first element of vector1, vector2, and vector3 and keep the biggest one, then compare their second elements and keep the largest one, then do the same for the third, etc. For this, we’ll need to use the pmax function, which finds the parallel maxima of the vector inputs it receives. Basically, it performs the max function for each element of the set of vectors you give it.

```
# Find the parallel maxima of our vectors
pmax(vector1, vector2, vector3)
```

The output of pmax is another vector, this one consisting of the parallel maxima of each element in vector1, vector2, and vector3.
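Random numbers make it hard to eyeball the difference between max and pmax, so here’s a small sketch with hand-picked values. The names v1, v2, and v3 are just for this illustration.

```r
v1 <- c(1, 8, 3)
v2 <- c(4, 2, 9)
v3 <- c(5, 5, 5)

# One overall maximum across all nine values
overall.max <- max(v1, v2, v3)
overall.max     # 9

# One maximum per position: max(1, 4, 5), max(8, 2, 5), max(3, 9, 5)
parallel.max <- pmax(v1, v2, v3)
parallel.max    # 5 8 9
```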

As a final example, we can extend this thinking from vectors to matrices and still use pmax.

```
# Create the matrices
matrix1 <- matrix(rnorm(n = 100), ncol = 10)
matrix2 <- matrix(rnorm(n = 100), ncol = 10)
matrix3 <- matrix(rnorm(n = 100), ncol = 10)

# Make a new matrix with the parallel maxima of the
# three inputs
new.matrix <- pmax(matrix1, matrix2, matrix3)

# Look at a subset of each matrix and confirm it
# worked
matrix1[1:3, 1:3]
matrix2[1:3, 1:3]
matrix3[1:3, 1:3]
new.matrix[1:3, 1:3]
```
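Checking a 3 x 3 corner by eye works, but we can also let R confirm every cell. The sketch below is one way to do it, not the only one: it uses smaller matrices with made-up names (m1, m2, m3), and set.seed() is only there so the random numbers are repeatable.

```r
# Fix the random seed so the check is reproducible
set.seed(1)
m1 <- matrix(rnorm(9), ncol = 3)
m2 <- matrix(rnorm(9), ncol = 3)
m3 <- matrix(rnorm(9), ncol = 3)

result <- pmax(m1, m2, m3)

# Stack the matrices into a 3 x 3 x 3 array, then take the
# maximum across the third dimension for each cell
stacked <- array(c(m1, m2, m3), dim = c(3, 3, 3))
check <- apply(stacked, c(1, 2), max)

# TRUE if every cell of the two results agrees
cells.agree <- all(result == check)
cells.agree
```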

## Comparing vectors and matrices to a constant

So far, we’ve been comparing scalars, vectors, and matrices to each other. But what if we have some external value, and we want to keep the values that are closest to it?

For this, we’ll return to ifelse. Our external value will be zero. To keep things simple, we’ll compare two vectors and find the distances that their elements are from zero.

```
# Create the vectors
A <- rnorm(n = 10, mean = 5, sd = 1)
B <- rnorm(n = 10, mean = 5, sd = 2)

# Make a new vector with the elements of A and B
# closest to zero
C <- ifelse(abs(0 - A) < abs(0 - B), yes = A, no = B)

# Check to make sure it worked
A
B
C
```

The nice thing with ifelse is that it’s a concise function for when you have one of two possible outcomes. The story gets more complicated if we want to compare more than two vectors.

```
# Create the three variables
A <- rnorm(4)
B <- rnorm(4)
C <- rnorm(4)

# Nest two ifelse calls. (A plain if() only accepts a
# single TRUE/FALSE, so it can't compare vectors
# element by element.)
D <- ifelse(abs(0 - A) < abs(0 - B) &
            abs(0 - A) < abs(0 - C),
            yes = A,
            no = ifelse(abs(0 - B) < abs(0 - C),
                        yes = B, no = C))

# Compare the vectors to confirm it worked
A
B
C
D
```
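Nesting gets unwieldy fast as the number of vectors grows. One alternative, sketched below with hand-picked values so the answer is easy to verify, is to bind the vectors into the columns of a matrix and ask which column is closest to zero in each row. The names candidates and closest.col are just for this illustration.

```r
A <- c(-3, 0.5, 2, -1)
B <- c(1, -2, 0.1, 4)
C <- c(-0.2, 3, -5, 0.3)

# One column per vector
candidates <- cbind(A, B, C)

# For each row, the column whose value is closest to zero
# (which.min() picks the first column if there's a tie)
closest.col <- apply(abs(candidates), 1, which.min)

# Pull out one value per row using (row, column) pairs
D <- candidates[cbind(seq_len(nrow(candidates)), closest.col)]
D     # -0.2  0.5  0.1  0.3
```

Adding a fourth or fifth vector only means adding another column to cbind, with no extra nesting.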

## Comparing vectors and matrices to a vector or matrix

For our final comparison, let’s say that instead of some constant, e.g. zero, we have a whole set of numbers that we want to compare our vectors or matrices to. The ifelse line is identical for vectors and matrices, so let’s use matrices to be fancy.

```
# Create our matrices
A <- matrix(rnorm(100), ncol = 10)
B <- matrix(rnorm(100), ncol = 10)

# Create the reference matrix
C <- matrix(rnorm(100), ncol = 10)

# Make a new matrix with the elements of A and B
# closest to C
D <- ifelse(abs(C - A) < abs(C - B),
yes = A, no = B)

# Check on a subset of the matrices to confirm
# it worked
A[1:3, 1:3]
B[1:3, 1:3]
C[1:3, 1:3]
D[1:3, 1:3]
```
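Rather than eyeballing a corner of each matrix, we can check every cell at once. This is a sketch under the same setup as above (set.seed() is added only for repeatability): the distance from each chosen value to the reference should equal the smaller of the two candidate distances.

```r
set.seed(2)
A <- matrix(rnorm(100), ncol = 10)
B <- matrix(rnorm(100), ncol = 10)
C <- matrix(rnorm(100), ncol = 10)   # reference matrix
D <- ifelse(abs(C - A) < abs(C - B), yes = A, no = B)

# Distance from each chosen value to the reference
chosen.distance <- abs(C - D)

# Smallest possible distance in each cell
best.distance <- pmin(abs(C - A), abs(C - B))

# TRUE if the chosen value is always the closer one
all.closest <- all(chosen.distance == best.distance)
all.closest
```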

Last example, and it’s a weird one. Let’s say that instead of comparing vectors to vectors or matrices to matrices, we want to compare a vector and a matrix. We’ll return to pmax to keep things simple and just ask which values are larger. With a bit of careful arranging, we can treat a matrix as a set of vectors arranged one after the other, and then we can just let pmax do its thing.

```
# Create the variables
our.vector <- runif(n = 5, min = 0, max = 10)
our.matrix <- matrix(rnorm(10, mean = 5),
                     nrow = length(our.vector))

# Visualize them
our.vector
our.matrix

# Find the larger value (listing the matrix first keeps
# the result in matrix form)
pmax(our.matrix, our.vector)
```

[Some clarification for the code above, because it’s actually pretty easy to make a mistake here. It’s important that the matrix is arranged so the number of rows is the same as the length of the vector, because R makes comparisons down each column, not across each row, when it compares a matrix to a vector. In other words, R will compare our.vector[1] to our.matrix[1,1], then our.vector[2] to our.matrix[2,1], then our.vector[3] to our.matrix[3,1], etc., and it starts over at our.vector[1] when it reaches the top of the next column. So even if our.matrix were arranged so the number of columns was equal to the length of our.vector, R would still run down the rows and wrap along the columns, which is most likely not what you’re trying to do. Just a heads up.]
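Here’s a small sketch of that column-wise behavior with hand-picked numbers, so you can trace each pairing. The names v, m, and wide are made up for this illustration, and the matrix goes first in pmax so the result keeps its matrix shape.

```r
v <- c(10, 20, 30)
m <- matrix(c(1, 25, 3,
              40, 5, 6), nrow = 3)   # 3 rows, 2 columns

# Rows match the vector length: v[1] pairs with row 1,
# v[2] with row 2, and v[3] with row 3, in every column
tall.result <- pmax(m, v)
tall.result
#      [,1] [,2]
# [1,]   10   40
# [2,]   25   20
# [3,]   30   30

# Same numbers arranged as 2 rows and 3 columns: the vector
# still runs down the columns, so the pairings change
wide <- t(m)
wide.result <- pmax(wide, v)

# One way to compare along the rows instead: transpose,
# compare, then transpose back
fixed.result <- t(pmax(t(wide), v))
```

Here wide.result and fixed.result disagree, which is exactly the mistake the note above warns about.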

Thanks for reading, and shoot me a message if you have any ideas for a fun Random R project.

Best,
Matt

## Intro to R: #1 – What is R?

[Hi! I originally wrote this post on 12-15-15 in my blog The Headbanging Behaviorist, but I figured that developing programming skills is important enough for a cross-post. I’ve updated the examples, started using WordPress’s syntax highlighting, and toned down the humor a bit here. For the full silliness, check out the original post.]

# Beginnings

When I first encountered R in 2011, I was a junior in college. I had heard about it from other undergrads and from my TAs, and the conversations varied widely from loving R to hating it. One constant, though, was how powerful R was for data analysis and visualization. Ambitious, I tried downloading R to familiarize myself and learn its quirks.

When I opened it for the first time, though… nothing happened. I was staring at a blank white text box. Where were the buttons? How could I load any data? Why couldn’t I see my data? Everything was so much easier in Excel! I thought I could pick up R and begin learning, sort of like a musical instrument. I’d underestimated the fact that R is a language.

I ended up avoiding R and sticking with friendly statistical software like JMP and SPSS, where you can see your data at all times and there are buttons for mixed effects models and ANOVAs. I understood I was just buying time before I’d have to sit down with R and really learn it, but I was intimidated by feeling clueless. My time ran out a few months after graduation, though: I started collaborating with a PhD student in Germany, and the data were going to be analyzed in R. (The results of that project are here.) I read this book and the student tutored me, and slowly I began to appreciate R. Once I learned how to teach myself, I even grew to love the language.

Now that I understand R, I’m eager to pay it forward to anyone who wants to learn. If you’re determined to learn R from me instead of R Bloggers, Hadley Wickham, or Michael Crawley, here’s a baby steps, 1 MPH introduction to R. This post is for people with little to no programming experience: I’ll explain what R is, some very simple introductory commands, and how to teach yourself. I’ll focus on the core questions I wanted answers to when I was first staring at that blinking cursor as a college student. Later posts will cover more of the basics.

This is the first post in a series on R. This post answers:
– What is R?
– How do I load an Excel file into R?
– How do I find the answers to my questions about R?

This post will teach you how to read this:

```
setwd("C:/Users/matt/Desktop/")
?str
blah <- mean(c(1, 2, 4, 8))
```
[Any code in these boxes can be copied directly into R and run. This example, though, will require there being a user called “matt” on your computer and a “Thesis_data” CSV file on your desktop.]

# What is R? Why not Excel?

R is a programming language. Programming lets you talk more directly to your computer in a language closer to how it operates. You tell the computer what to do. In programs like Excel, meanwhile, some engineer at Microsoft decided the range of actions you’d want to take, so you’re limited to what he or she thought you’d want to do.

With programming, you get rid of the comfortable structure of pressing buttons and a friendly interface in favor of freedom. Your analyses are now limited by your imagination and knowledge of the R language, not what someone else thought was relevant for you. This means you can perform incredibly nuanced analyses, even those that have never been done before. This is ideal for research.

R has extensive built-in and downloadable statistical tools, meaning the commands for linear regression, mixed effects modeling, Fourier transforms, heat maps, bootstrapping, Bayesian stats, and more are a short Google search away. If you get stuck, there’s a large community of researchers and data scientists regularly asking and answering questions about analyses in R, so you’re bound to find an answer.

One of the biggest benefits, though, is how easy it is to test ideas. Let’s say, for example, that you learned about t-tests in a stats class. The test theoretically makes sense, but you want a way to visualize what it really means when you compare two samples. In R, you can easily create random data, so you can create the conditions under which two samples should be significantly versus non-significantly different. You know what you’re putting in, so you can see what comes out. This will prepare you for how to look at real data later. It’s easy to visualize data in R, so you can look at what you’re trying to do.

```
# Create the samples by drawing from two normal distributions
sample1 <- rnorm(15, 20, 2)   # N = 15, mean = 20, sd = 2
sample2 <- rnorm(15, 20.5, 2) # N = 15, mean = 20.5, sd = 2

# Plot the two distributions
plot(density(sample1), lwd = 2, col = "deepskyblue4")
lines(density(sample2), lwd = 2, col = "firebrick2")

# Are the means significantly different?
t.test(sample1, sample2)
```

How does the p-value change if you have more data in each sample? Change the 15s to 100s in lines 2-3 above, rerun the code, and you’re done. What if the two distributions are farther apart from each other? Change the means in lines 2-3 above, rerun the code, and you’re done. You don’t have to trust some guy on the internet telling you what to believe; R lets you test things yourself.

When you program, you leave behind a trail of code that leads to the result or visualization you produced. With a bit of practice, it’s easy to share code, meaning others can replicate your analysis. The code for the above plots, for example, is at the end of this blog post.
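As a rough sketch of the sample-size experiment described above, the code below runs the same t-test with 15 and then 100 values per sample. The seed value is arbitrary and only makes the run repeatable; the names p.small and p.large are made up for this illustration.

```r
set.seed(123)   # arbitrary seed, just for repeatability

# Same means and standard deviation as above, two sample sizes
p.small <- t.test(rnorm(15, 20, 2), rnorm(15, 20.5, 2))$p.value
p.large <- t.test(rnorm(100, 20, 2), rnorm(100, 20.5, 2))$p.value

# Compare the two p-values side by side
c(n15 = p.small, n100 = p.large)
```

With more data, the p-value for a real difference tends to shrink, though any single random draw can buck the trend. Try a few different seeds to see how much it bounces around.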

### Do I have to pay for R? How do I get it?

R is free. You can download it from the R website. I infinitely recommend also downloading RStudio, an interface that makes R easier to use. It organizes your windows (so plots are always in the same place, for example) and highlights the syntax so it’s easier to read.

### Why is R free?

I don’t know. Maybe it’s part of that “knowledge is more important than money” mentality academics have. It might simply be because it’s more efficient to collaborate and exchange information with other researchers if everyone uses the same programming language. If you’re a researcher with a limited budget and there are two equally good programming languages, you’ll probably start using the one that’s free. At any rate, it’s very cool that R developers have made it a priority for R to be accessible to anyone who wants it.

### I took a class in college that used MATLAB, so I’ll just keep using that.

Sure, go ahead. However, note that coding languages like MATLAB, Mathematica, SPSS, SAS, and Stata all require paid licenses. It’s unlikely to be a problem if you’re a grad student at a well-funded university, but don’t code yourself into a corner: many industry jobs don’t want to pay the tens of thousands of dollars for a license, so it might be a safer bet to learn free software like R or Python. [I actually know someone who interviewed at Facebook for a data scientist position, and they wouldn’t accept MATLAB as a coding language – only R or Python.]

I believe R made me into a much better scientist, and I’m clearly heavily biased towards it! Would be nice to get paid to promote R, though… if someone from RStudio sent me a hat, I’d wear it.

### So why do you use R?

R is (currently) unparalleled in its ability to easily run complicated statistical tests and to produce beautiful data visualization. Instead of coding a non-linear least squares regression from scratch, you can download R code that’ll do it for you. Similarly, investing a little time into learning how R plots data can let you produce almost any visualization you can imagine. And as I mentioned before, R’s community is sufficiently large that websites like Stack Overflow constantly have people asking and answering questions about how to code something in R. 99% of my solutions to R questions come from searching through this community.

However, I use R because it fits my research needs: statistics and data visualization. If I were running computation-heavy evolutionary simulations, C++ would be a better bet. If I wanted to do engineering work, MATLAB or Python would be stronger.

If you’re coming from click-based statistical programs like Excel or SPSS, seeing a program that’s just a command line can be a bit of a shock. Think of R as a language and less as a program. Of course you can’t say anything in German when you first start learning it, and mastering all the tones in Mandarin can be frustrating (or hilarious).

Hopefully you’re in RStudio right now. In the TOP LEFT window, you can type whatever you want, and hitting ENTER doesn’t make R run the code. This allows you to write several lines of code before you run anything, which is essential for trickier programming. To run a line of code in this window, press Ctrl + ENTER on Windows or Cmd + ENTER on Mac.

The BOTTOM LEFT window is the console, where you can talk directly to R. When you open RStudio, there’s some text here about R and what version you have. Typing here and then pressing ENTER makes R run what you wrote. This is nice if you just want to type a quick command to check something, e.g. what “x” is equal to. For now, let’s focus on this window.

(The TOP RIGHT window is useful for having extra information about commands, like what arguments they take. For me, the BOTTOM RIGHT window displays plots.)

Below are the very first things you should do in R before you try any data analysis. This code can be copied directly into R and run.

### Numbers

```
5
5 + 5
```

Yes, literally just type the number “5” into that box in the bottom of the screen and hit enter. Unsurprisingly, R says “5” back to you, confirming that 5 = 5 after all. Now try 5 + 5. Great work.

```
x <- 5
x
```

The arrow (<-) is the assignment operator in R; think of it as R’s equals sign. (You can use an actual equals sign if you want, but essentially everyone uses the arrow.) By typing x <- 5 and hitting ENTER, you’ve made a variable x that has the value 5. R will remember this until you tell it to forget, or you close R. Now if you ask R to tell you what x is, it will say 5.
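If you’re wondering how to tell R to forget something, rm() is one way to do it. Here’s a quick sketch that uses ls(), which lists everything R is currently remembering; the names remembered.before and remembered.after are made up for this illustration.

```r
x <- 5

# ls() lists the variables R currently remembers
remembered.before <- "x" %in% ls()
remembered.before    # TRUE

# Tell R to forget x
rm(x)

remembered.after <- "x" %in% ls()
remembered.after     # FALSE
```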

### Commands in R

Congratulations; you just coded! If you want the formal introduction to coding, apparently you’re supposed to type this:

```
print("Hello world!")
```

print() is a command that, well, prints whatever is inside the parentheses. It’s straightforward when you’re printing literally what’s inside of the command, but we can make it more interesting like this:

```
y <- "Hello world!"
print(y)
```

print() is nice but you probably won’t use it that often relative to other functions. A critically important function to know is the combine function (often called concatenate), or c(). This tells R that there are multiple elements to remember.

```
z <- c(1, 2, 4, 8)
z
```

We can now run some pretty standard analyses on that vector of numbers.

```
mean(z)
median(z)
sd(z)
min(z)
max(z)
```

You could also just run it directly on the numbers if you prefer.

```
mean(c(1, 2, 4, 8))
```

Loading an Excel file caused me so much confusion when I was first learning R. It involves thinking a bit like a computer.

##### Step 1: Save the data in a format R will understand

This post from R Bloggers goes into intricate detail on all the file formats R will accept and how to load them. If your data are in Excel, the simplest way to load them in R is to use the .CSV format, or “comma separated values.”

Say you have an Excel sheet you want to open in R. In Excel:
– Make sure there are no spaces in the column names. Change the names from e.g. “Time (seconds)” to “Time_sec” or “Time.sec”

– File –> Save As –> Save as Type –> CSV (Comma delimited)

– Excel will say some features of the workbook might be lost. Say that Yes, you do want to keep using CSV format.

– When you exit, it will ask if you want to save your changes. Go ahead and save, even if you didn’t make any changes.

##### Step 2: Specify the working directory in R

Now you need to tell R where to find the data. When you use R, it focuses on one particular folder on your computer at a time, and you have to tell it which folder to look at. The folder R is looking at is called the “working directory.” You can find out where you currently are by typing getwd(). You can change the working directory with the setwd() command. If you’re on a Windows computer, your data are on the Desktop, and your user name is Matt, you can write this to get to the Desktop:

```
setwd("C:/Users/Matt/Desktop")
```
##### Step 3: Load the data

This step will involve creating a variable called “data” and using the read.csv() function.

```
data <- read.csv("Data.csv", header = TRUE)
```

If there’s a file called Data.csv on the Desktop, R will read it and assign its contents to the variable “data.” The header = TRUE argument tells R that the top row of the data is column names. (If you just imported a table of numbers with no header, for example, you could say header = FALSE instead. You’ll often see these shortened to T and F, but spelling them out is safer because T and F can be overwritten.)
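If you don’t have a CSV handy, here’s a self-contained sketch that writes a tiny made-up file to a temporary location and reads it back both ways, so you can see what the header argument changes. The file contents and variable names are invented for this illustration.

```r
# Write a tiny made-up CSV to a temporary file
example.file <- tempfile(fileext = ".csv")
writeLines(c("Time_sec,Treatment",
             "1.5,A",
             "2.0,B"),
           example.file)

# Top row treated as column names
with.header <- read.csv(example.file, header = TRUE)
names(with.header)   # "Time_sec" "Treatment"
nrow(with.header)    # 2

# Top row treated as data; R invents the names V1 and V2
no.header <- read.csv(example.file, header = FALSE)
names(no.header)     # "V1" "V2"
nrow(no.header)      # 3

# Clean up the temporary file
unlink(example.file)
```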

##### Step 4: Look at the data

Now you can look at the data. You could just type data and hit enter, but R will display everything, so if you have a reasonable amount of data, your screen will become overwhelmed with numbers. A better option is to look at only part of the data.

```
head(data)
tail(data)
data[1:5, c(2, 4)]
```

The head() and tail() commands tell R to only look at the first or last 6 rows of the data.

The last command, data[1:5, c(2, 4)], offers you more fine-tuned control. The brackets [ ] let you subset the data, which means selecting only part of it. The first argument, 1:5, means “rows 1, 2, 3, 4, and 5.” The second argument, c(2, 4), means “columns 2 and 4.”

If you wanted to create a new variable for only columns 1, 3 to 5, and 7 of the data, you could write something like this:

```
data2 <- data[ , c(1, 3:5, 7)]
```

The empty first argument means “all rows.”
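To try these subsetting commands without loading your own file, you can practice on iris, a data set that ships with R. This is just a sketch; the variable names on the left are made up for illustration.

```r
# iris is a built-in data set with 150 rows and 5 columns
first.rows <- head(iris)             # first 6 rows
some.cells <- iris[1:5, c(2, 4)]     # rows 1 to 5, columns 2 and 4
some.cols  <- iris[ , c(1, 3:5)]     # all rows, columns 1, 3, 4, and 5

dim(first.rows)     # 6 5
dim(some.cells)     # 5 2
dim(some.cols)      # 150 4
names(some.cells)   # "Sepal.Width" "Petal.Width"
```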

Finally, here are three commands to get a feel for the data:

```
dim(data)
summary(data)
str(data)
```

dim() will tell you the dimensions of your data, i.e. how many rows and columns there are. summary() will summarize each column of the data, giving you values like the first quartile, median, etc. str() is more useful for when some of your columns have text like “Treatment A,” “Treatment B” and other columns have numbers.

# I still don’t really know how to do anything in R.

That’s ok. Again, think of R as a language instead of a program. It takes a while to gain fluency, but the more you invest in learning, the easier it’ll be to say what you’re thinking.

One of the most important things for me when I first started learning R was to learn where to find answers to my questions. Let’s say you found a function but don’t know how to use it:

```
?mean
```

This will bring up R’s Help file for the mean() function.  There you can find what the function does, as well as what arguments R is looking for.

Say you don’t know what the function is called in R. Let’s say you’re trying to find the command for standard deviation:

```
??"standard deviation"
```

This will give you a list of possible functions that match this. The “stats::sd” option is what you’re looking for. “stats” refers to the package in R, and “sd” is the command.

Finally, say you’re looking for a function for standard error and ??"standard error" only gives you really complicated-sounding options. Type the following into Google:

standard error in R

The first link, not surprisingly, takes you to Stack Overflow, where someone asked this exact question in 2011. The answer is that R doesn’t have a function for standard error, but it’s really easy to write one. I’ll cover writing your own functions in a future blog post. When in doubt, Google what you’re trying to do, followed by “in R.” This is honestly the easiest way to find out how to code something in R.
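For the curious, here’s a minimal sketch of what such a function could look like. It assumes you want the standard error of the mean and that the data have no missing values; the name standard.error is made up, not something built into R.

```r
# Standard error of the mean: the standard deviation divided
# by the square root of the number of values.
# (Hypothetical helper; assumes x has no missing values.)
standard.error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Try it on the same numbers used earlier in this post
se.example <- standard.error(c(1, 2, 4, 8))
se.example
```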

## What are other resources for learning R? No offense.

Here are some invaluable resources that have helped me learn.

– This book was exactly what I needed when I was first learning R. I needed something for an absolute beginner, and this book helped me overcome that initial learning curve.
– Quick-R provides a very useful overview of basic functions in R. I visit their page on graphical parameters all the time.
– The writers behind R-bloggers are incredibly helpful. While they won’t necessarily provide the well-rounded introduction to R you might need, they’re very useful for coding a random, specific analysis that might be hard to find elsewhere. Their 2005 post on shading a polygon was exactly what I needed for the analysis in this blog post. I follow them on Twitter and will read the occasional article that pops up and is relevant to me. I’ve never gone to their website directly, but I always end up there from Googling questions about R.

Honestly, that’s the easiest way to learn how to run a particular analysis in R. Google it and then see what Stack Overflow suggests.

Thanks for reading! This is the first post in a series on R. The next post will be on plotting and simple statistical tests.

Best,
Matt

## Code for the figures in this post:

```
# Create the distributions. We'll list our parameters
# first. All groups will be drawn from a normal
# distribution with standard deviation = 2. We'll
# draw 2000 values to make the distributions nice and
# smooth.
N <- 2000
sd <- 2

# - The means, however, will differ between the groups
mean1 <- 20
mean2 <- 20.5
mean3 <- 17
mean4 <- 23

# Now we actually create the distributions
sample1 <- rnorm(N, mean1, sd)
sample2 <- rnorm(N, mean2, sd)
sample3 <- rnorm(N, mean3, sd)
sample4 <- rnorm(N, mean4, sd)

#--------------------------------------------------
# Plot the figures
# - First, divide the graphics window into one row,
#   two columns
par(mfrow = c(1,2))

# Figure 1: a small difference in means
plot(density(sample1), lwd = 3,
col = "deepskyblue4",
xlim = c(10, 30), ylim = c(0, 0.28),
main = "Small difference in means",
las = 1, cex.main = 1.6, xlab = "Value",
font.axis = 2, font.lab = 2)
lines(density(sample2), lwd = 3,
col = "firebrick2")

# Add a legend with bolded text
par(font = 2)
legend("topleft", bty = 'n', pch = 19,
col = c("deepskyblue4", "firebrick2"),
legend = c(paste0("Mean = ", mean1),
paste0("Mean = ", mean2)),
cex = 1.1)

# Figure 2: a large difference in means
plot(density(sample3), lwd = 3,
col = "deepskyblue4",
xlim = c(10, 30), ylim = c(0, 0.28),
main = "Large difference in means",
las = 1, cex.main = 1.6, xlab = "Value",
font.axis = 2, font.lab = 2)
lines(density(sample4), lwd = 3,
col = "firebrick2")

# Add a legend with bolded text
par(font = 2)
legend("topleft", bty = 'n', pch = 19,
col = c("deepskyblue4", "firebrick2"),
legend = c(paste0("Mean = ", mean3),
paste0("Mean = ", mean4)),
cex = 1.1)

```