[Hi! I originally wrote this post on 12-15-15 in my blog The Headbanging Behaviorist, but I figured that developing programming skills is important enough for a cross-post. I’ve updated the examples, started using WordPress’s syntax highlighting, and toned down the humor a bit here. For the full silliness, check out the original post.]
When I first encountered R in 2011, I was a junior in college. I had heard about it from other undergrads and from my TAs, and the conversations varied widely from loving R to hating it. One constant, though, was how powerful R was for data analysis and visualization. Ambitious, I tried downloading R to familiarize myself and learn its quirks.
When I opened it for the first time, though… nothing happened. I was staring at a blank white text box. Where were the buttons? How could I load any data? Why couldn’t I see my data? Everything was so much easier in Excel! I thought I could pick up R and begin learning, sort of like a musical instrument. I’d underestimated the fact that R is a language.
I ended up avoiding R and sticking with friendly statistical software like JMP and SPSS, where you can see your data at all times and there are buttons for mixed effects models and ANOVAs. I understood I was just buying time before I’d have to sit down with R and really learn it, but I was intimidated by feeling clueless. My time ran out a few months after graduation, though: I started collaborating with a PhD student in Germany, and the data were going to be analyzed in R. (The results of that project are here.) I read this book and the student tutored me, and slowly I began to appreciate R. Once I learned how to teach myself, I even grew to love the language.
Now that I understand R, I’m eager to pay it forward to anyone who wants to learn. If you’re determined to learn R from me instead of R Bloggers, Hadley Wickham, or Michael Crawley, here’s a baby steps, 1 MPH introduction to R. This post is for people with little to no programming experience: I’ll explain what R is, some very simple introductory commands, and how to teach yourself. I’ll focus on the core questions I wanted answers to when I was first staring at that blinking cursor as a college student. Later posts will cover more of the basics.
This is the first post in a series on R. This post answers:
– What is R?
– How do I load an Excel file into R?
– How do I find the answers to my questions about R?
This post will teach you how to read this:
setwd("C:/Users/matt/Desktop/") data <- read.csv("Thesis_data.csv", header = T) head(data) ?str blah <- mean(c(1, 2, 4, 8))
What is R? Why not Excel?
R is a programming language. Programming lets you talk more directly to your computer in a language closer to how it operates. You tell the computer what to do. In programs like Excel, meanwhile, some engineer at Microsoft decided the range of actions you’d want to take, so you’re limited to what he or she thought you’d want to do.
With programming, you get rid of the comfortable structure of pressing buttons and a friendly interface in favor of freedom. Your analyses are now limited by your imagination and knowledge of the R language, not what someone else thought was relevant for you. This means you can perform incredibly nuanced analyses, even those that have never been done before. This is ideal for research.
R has extensive built-in and downloadable statistical tools, meaning the commands for linear regression, mixed effects modeling, Fourier transforms, heat maps, bootstrapping, Bayesian stats, and more are a short Google search away. If you get stuck, there’s a large community of researchers and data scientists regularly asking and answering questions about analyses in R, so you’re bound to find an answer.
One of the biggest benefits, though, is how easy it is to test ideas. Let’s say, for example, that you learned about t-tests in a stats class. The test theoretically makes sense, but you want a way to visualize what it really means when you compare two samples. In R, you can easily create random data, so you can create the conditions under which two samples should be significantly versus non-significantly different. You know what you’re putting in, so you can see what comes out. This will prepare you for how to look at real data later. It’s easy to visualize data in R, so you can look at what you’re trying to do.
# Create the samples by drawing from two normal distributions sample1 <- rnorm(15, 20, 2) # N = 15, mean = 20, sd = 2 sample2 <- rnorm(15, 20.5, 2) # N = 15, mean = 20.5, sd = 2 # Plot the two distributions plot(density(sample1), lwd = 2, col = "deepskyblue4") lines(density(sample2), lwd = 2, col = "firebrick2") # Are the means significantly different? t.test(sample1, sample2)
How does the p-value change if you have more data in each sample? Change the 15s to 100s in lines 2-3 above, rerun the code, and you’re done. What if the two distributions are further apart from each other? Change the means in lines 2-3 above, rerun the code, and you’re done. You don’t have to trust some guy on the internet telling you what to believe; R lets you test things yourself.
Do I have to pay for R? How do I get it?
R is free. You can download it from the R website. I infinitely recommend also downloading R Studio, an interface that makes R easier to use. It organizes your windows (so plots are always in the same place, for example) and highlights the syntax so it’s easier to read.
Why is R free?
I don’t know. Maybe it’s part of that “knowledge is more important than money” mentality academics have. It might simply be because it’s more efficient to collaborate and exchange information with other researchers if everyone uses the same programming language. If you’re a researcher with a limited budget and there are two equally good programming languages, you’ll probably start using the one that’s free. At any rate, it’s very cool that R developers have made it a priority for R to be accessible to anyone who wants it.
I took a class in college that used MATLAB, so I’ll just keep using that.
Sure, go ahead. However, note that coding languages like MATLAB, Mathematica, SPSS, SAS, and Stata all require paid licenses. It’s unlikely to be a problem if you’re a grad student at a well-funded university, but don’t code yourself into a corner: many industry jobs don’t want to pay the tens of thousands of dollars for a license, so it might be a safer bet to learn a free software like R or Python. [I actually know someone who interviewed at Facebook for a data scientist position, and they wouldn’t accept MATLAB as a coding language – only R or Python.]
This sounds like an advertisement for R.
I believe R made me into a much better scientist, and I’m clearly heavily biased towards it! Would be nice to get paid to promote R, though… if someone from RStudio sent me a hat, I’d wear it.
So why do you use R?
R is (currently) unparalleled in its ability to easily run complicated statistical tests and to produce beautiful data visualization. Instead of coding a non-linear least squares regression from scratch, you can download R code that’ll do it for you. Similarly, investing a little time into learning how R plots data can let you produce almost any visualization you can imagine. And as I mentioned before, R’s community is sufficiently large that websites like Stack Overflow constantly have people asking and answering questions about how to code something in R. 99% of my solutions to R questions come from searching through this community.
However, I use R because it fits my research needs: statistics and data visualization. If I was running computation-heavy evolutionary simulations, C++ would be a better bet. If I wanted to do engineering work, MATLAB or Python is stronger.
So I downloaded R. How do I do anything?
If you’re coming from click-based statistical programs like Excel or SPSS, seeing a program that’s just a command line can be a bit of a shock. Think of R as a language and less as a program. Of course you can’t say anything in German when you start learning, or mastering all the tones in Mandarin can be frustrating (or hilarious).
Hopefully you’re in RStudio right now. In the TOP LEFT window, you can type whatever you want, and hitting enter doesn’t make R run the code. This allows you to write several lines of code before you run anything, which is essential for trickier programming. To run a line of code in this window, press ctrl + r on Windows or ctrl + ENTER on Mac.
The BOTTOM LEFT window is the terminal, where you can talk directly to R. When you open RStudio, there’s some text here about R and what version you have. Typing here and then pressing ENTER makes R run what you wrote. This is nice if you just want to type a quick command to check something, e.g. what “x” is equal to. For now, let’s focus on this window.
(The TOP RIGHT window is useful for having extra information about commands, like what arguments they take. For me, the BOTTOM RIGHT window displays plots.)
Below are the very first things you should do in R before you try any data analysis. This code can be copied directly into R and run.
5 5 + 5
Yes, literally just type the number “5” into that box in the bottom of the screen and hit enter. Unsurprisingly, R says “5” back to you, confirming that 5 = 5 after all. Now try 5 + 5. Great work.
x <- 5 x
The arrow (<-) is the equals sign in R. (You can use an actual equals sign if you want, but essentially everyone uses the arrow.) By typing x and hitting ENTER, you’ve made a variable x that has the value 5. R will remember this until you tell it to forget, or you close R. Now if you ask R to tell you what x is, it will say 5.
Commands in R
Congratulations; you just coded! If you want the formal introduction to coding, apparently you’re supposed to type this:
print() is a command that, well prints whatever is inside the parentheses. It’s straightforward when you’re printing literally what’s inside of the command, but we can make it more interesting like this:
y <- "Hello world!" print(y)
print() is nice but you probably won’t use it that often relative to other functions. A critically important function to know is the concatenate function, or c(). This tells R that there are multiple elements to remember.
z <- c(1, 2, 4, 8) z
We can now run some pretty standard analyses on that vector of numbers.
mean(z) median(z) sd(z) min(z) max(z)
You could also just run it directly on the numbers if you prefer.
mean(c(1, 2, 4, 8))
Loading data into R
This one caused me so much confusion when I was first learning R. It involves thinking a bit like a computer.
Step 1: Save the data in a format R will understand
This post from R Bloggers goes into intricate detail on all the file formats R will accept and how to load them. If your data are in Excel, the simplest way to load them in R is to use the .CSV format, or “comma separated values.”
Say you have a Excel sheet you want to open in R. In Excel:
– Make sure there are no spaces in the column names. Change the names from e.g. “Time (seconds)” to “Time_sec” or “Time.sec”
– File –> Save As –> Save as Type –> CSV (Comma delimited)
– Excel will say some features of the workbook might be lost. Say that Yes, you do want to keep using CSV format.
– When you exit, it will ask if you want to save your changes. Go ahead and save, even if you didn’t make any changes.
Step 2: Specify the working directory in R
Now you need to tell R where to find the data. When you use R, it focuses on one particular folder on your computer at a time, and you have to tell it which folder to look at. The folder R is looking at is called the “working directory.” You can find out where you currently are by typing getwd(). You can change the working directory with the setwd() command. If you’re on a Windows computer, your data are on the Desktop, and your user name is Matt, you can write this to get to the Desktop:
Step 3: Load the data
This step will involve creating a variable called “data” and using the read.csv() function.
data <- read.csv("Data.csv", header = T)
If there’s a file called Data.csv on the Desktop, R will take it and assign it to the variable “data.” The header = T argument tells R that the top row of the data is column names. (If you just imported a table of numbers with no header, for example, you could say header = F instead.)
Step 4: Look at the data
Now you can look at the data. You could just type data and hit enter, but R will display everything, so if you have a reasonable amount of data, your screen will become overwhelmed with numbers. A better option is to look at only part of the data.
head(data) tail(data) data[1:5, c(2, 4)]
The head() and tail() commands tell R to only look at the first or last 6 rows of the data.
The last command, data[1:5, c(2, 4)], offers you more fine-tuned control. The brackets [ ] let you subset the data, which means selecting only part of it. The first argument, 1:5, means “rows 1, 2, 3, 4, and 5.” The second argument, c(2, 4), means “columns 2 and 4.”
If you wanted to create a new variable for only columns 1, 3 to 5, and 7 of the data, you could write something like this:
data2 <- data[ , c(1, 3:5, 7)]
The empty first argument means “all rows.”
Finally, here are three commands to get a feel for the data:
dim(data) summary(data) str(data)
dim() will tell you the dimensions of your data, i.e. how many rows and columns there are. summary() will summarize each column of the data, giving you values like the first quartile, median, etc. str() is more useful for when some of your columns have text like “Treatment A,” “Treatment B” and other columns have numbers.
I still don’t really know how to do anything in R.
That’s ok. Again, think of R as a language instead of a program. It takes a while to gain fluency, but the more you invest in learning, the easier it’ll be to say what you’re thinking.
One of the most important things for me when I first started learning R was to learn where to find answers to my questions. Let’s say you found a function but don’t know how to use it:
This will bring up R’s Help file for the mean() function. There you can find what the function does, as well as what arguments R is looking for.
Say you don’t know what the function is called in R. Let’s say you’re trying to find the command for standard deviation:
This will give you a list of possible functions that match this. The “stats::sd” option is what you’re looking for. “stats” refers to the package in R, and “sd” is the command.
Finally, say you’re looking for a function for standard error and ??”standard error” only gives you really complicated-sounding options. Type the following into Google:
standard error in R
The first link, not surprisingly, takes you to Stack Overflow, where someone asked this exact question in 2011. The answer is that R doesn’t have a function for standard error, but it’s really easy to write one. I’ll cover writing your own functions in a future blog post. When in doubt, Google what you’re trying to do, followed by “in R.” This is honestly the easiest way to find out how to code something in R.
What are other resources for learning R? No offense.
Here are some invaluable resources that have helped me learn.
This book was exactly what I needed when I was first learning R. I needed something for an absolute beginner, and this book helped me overcome that initial learning curve.
The writers behind R-bloggers are incredibly helpful. While they won’t necessarily provide the well-rounded introduction to R you might need, they’re very useful for coding a random, specific analysis that might be hard to find elsewhere. Their 2005 post on shading a polygon was exactly what I needed for the analysis in this blog post. I follow them on Twitter and will read the occasional article that pops up and is relevant to me.
I’ve never gone to their website directly, but I always end up there from Googling questions about R. Honestly, that’s the easiest way to learn how to run a particular analysis in R. Google it and then see what Stack Overflow suggests.
Thanks for reading! This is the first post in a series on R. The next post will be on plotting and simple statistical tests.
Code for the figures in this post:
# Create the distributions. We'll list our parameters # first. All groups will be drawn from a normal # distribution with standard deviation = 2. We'll # draw 2000 values to make the distributions nice and # smooth. N <- 2000 sd <- 2 # - The means, however, will differ between the groups mean1 <- 20 mean2 <- 20.5 mean3 <- 17 mean4 <- 23 # Now we actually create the distributions sample1 <- rnorm(N, mean1, sd) sample2 <- rnorm(N, mean2, sd) sample3 <- rnorm(N, mean3, sd) sample4 <- rnorm(N, mean4, sd) #-------------------------------------------------- # Plot the figures # - First, divide the graphics window into one row, # two columns par(mfrow = c(1,2)) # Figure 1: a small difference in means plot(density(sample1), lwd = 3, col = "deepskyblue4", xlim = c(10, 30), ylim = c(0, 0.28), main = "Small difference in means", las = 1, cex.main = 1.6, xlab = "Value", font.axis = 2, font.lab = 2) lines(density(sample2), lwd = 3, col = "firebrick2") # Add a legend with bolded text par(font = 2) legend("topleft", bty = 'n', pch = 19, col = c("deepskyblue4", "firebrick2"), legend = c(paste0("Mean = ", mean1), paste0("Mean = ", mean2)), cex = 1.1) # Figure 2: a large different in means plot(density(sample3), lwd = 3, col = "deepskyblue4", xlim = c(10, 30), ylim = c(0, 0.28), main = "Large difference in means", las = 1, cex.main = 1.6, xlab = "Value", font.axis = 2, font.lab = 2) lines(density(sample4), lwd = 3, col = "firebrick2") # Add a legend with bolded text par(font = 2) legend("topleft", bty = 'n', pch = 19, col = c("deepskyblue4", "firebrick2"), legend = c(paste0("Mean = ", mean3), paste0("Mean = ", mean4)), cex = 1.1)