Garrett, K.A., P.D. Esker, and A.H. Sparks. 2007. Introduction to the R Programming Environment. The Plant Health Instructor. DOI:10.1094/PHI-A-2007-1226-02.
An Introduction to the R Programming Environment
K. A. Garrett, P. D. Esker, and A. H. Sparks
Student Learning Goals
After completion of this module:
- Students will understand some of the range of potential applications of R in Biology.
- Students will be able to:
- use R in interactive sessions,
- use basic programming in R for graphics and dataset manipulation,
- generate random numbers and use statistical distributions.
Feedback
We would appreciate feedback for improving this paper and information about how it has been used for study and teaching. Please send your feedback to kgarrett@ksu.edu. Please include the following text in the e-mail subject line, "Feedback on R Modules", to make sure your comments are received.
Introduction
R is a free software environment that includes a set of base packages for graphics, math, and statistics. You can also make use of specialized packages contributed by R users or write your own new functions. R was developed as a part of the GNU project from the S language, http://cm.bell-labs.com/cm/ms/departments/sia/S/.
R performs many of the same statistical analyses as SAS, http://www.sas.com/. The trade-off in use of R versus SAS might revolve around the fact that R is free while SAS may have better technical support and testing for features such as linear models in packages like SAS Proc Mixed.
Biologists might be most interested in using R for statistical analysis, ecological modeling (Bolker, in press), and in bioinformatics applications using Bioconductor (Gentleman et al. 2005). Also a set of Plant Health Instructor documents have been prepared to illustrate the use of R in epidemiology and ecology: Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting.
For a more comprehensive introduction to R, see the book: An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics, by Venables et al. (2007), especially Appendix A/doc/manuals which gives a sample R session illustrating many features. This book is available on-line through the R website, http://www.r-project.org, or a slightly older bound version is available for purchase with money going to support free software development and documentation. Crawley (2007) provides a substantial reference for R.
Installing R
Both source code and binaries for Windows, Mac OS X, and several Linux distributions are available. You can install R on your computer from the R website, http://www.r-project.org/.
- Click on the 'CRAN mirror' link in the menu on the left side of the page.
- Pick a location for downloading, preferably a geographic location near you.
- Pick the operating system that you are using, Windows, Mac, or Linux.
- Perhaps start with 'base' software.
- Peruse information in the README file.
-
If using Windows, click on the R-installation .exe file to install.
Note that the setup wizard gives you an option for where the R directory will be. You might just want to use the default and keep track of where it is for future reference.
R as a Deluxe Calculator
Now that you have installed R on your computer, you can
begin trying it out. To start using R, open the interactive R
window by clicking on the R icon. Using R interactively, try
out common arithmetic operators: +,
-, /, *
For example, if 3*3 is entered on the command
line, 9 is returned on the following line.
Up and down arrows can be used to scroll among previous commands if you want to repeat them or edit them for re-entry.
Try entering the following commands to see the output
3^3
log10(100)
exp(4)
log10(10^4)
Creating objects and assigning values
Objects are assigned values using <- , an
arrow formed out of < and -. (An
equal sign, =, can also be used.) For example,
the following command assigns the value 5 to the
object x.
x <- 5
After this assignment, the object x
‘contains’ the value 5. Another assignment to the same object
will change the content.
x <- 107
You can check the content of an object by simply entering the name of the object on an interactive command line. Try that throughout these examples to see what the results are of the different operations and functions illustrated.
R is case sensitive – that is, it treats
data15 and Data15 as completely
different objects.
This can be a hassle if you write code in a separate file using a word processor that automatically changes case at the beginning of sentences, etc. In Word, you can change 'autocorrect' options in the Menu under Tools, Autocorrect options,… Using an editor like Notepad is a better option when simply writing R code.
An editor is also supplied within the R window through the menu under File, New Script. The simplest way to develop commands in R and then implement them may be to write the code in the R Editor window and then submit the code through the Edit, Run… options.
Free editors designed for R are available that highlight code syntax as in this document and may make working with R easier in other ways as well. Sciviews-R is a Graphical User Interface (GUI) for R for Windows. For Linux and BSD, two Unix-like operating systems, RKWard is a transparent front end for R based on KDE, a desktop environment for Unix systems, that strives to make R easier to use. This program was used while creating this HTML document to highlight the R-code. Other programs such as RCommander, and Emacs/ESS are available as well. While the use of programs such as these will not be covered in this document, you may find them helpful to you as you work through these documents or in the future if you choose to use R in your research.
Comments should be added to programs to clarify steps,
define variables, etc., so that anyone reading a program now
or in the future can more easily make sense of what the
program does. The part of a line that begins with
# does not get treated as code, rather as a
comment. For example, after the following two lines of code,
x still has the value 5.
x <- 5 # x <- 6
Try creating some other objects and assigning values to them. Naming of objects is very flexible, though names must begin with a letter and problems can be avoided if you don’t name objects with the same name as an existing function in R. Picking a name that has a clear meaning is particularly helpful for programs that contain many lines. If the R workspace is saved (more about this below), it is important to take note of what objects have been created, so that old objects are not accidentally used incorrectly in future sessions.
Some names have already been assigned in R, like
mean, function, etc.
For example, ‘pi’ is already defined (π =
3.1416).
Try entering the following command line,
pi.
An object can be assigned a set of numbers, as for example:
x12 <- c(10,6,8)
Here c combines the numbers 10, 6, and 8 into
a vector.
Operations can then be performed on the whole set of
numbers. For example, for the object x12 created
above, check the results of the following:
x12 * 10 x12/3
The numbers in x can be indexed individually, for example:
x12[2] # gives the second entry in x12 x12[1] # gives the first entry in x12
Or the numbers in x can be indexed in subsets. Consider the results of the following two commands:
x12[1:3] # gives the first through third entries in x12 x12[c(1,3)] # gives the first and third entries in x12
What is the difference between the these two commands?
Logical variables can be used in evaluation of the entries
in an object. Logical variables take on the values
TRUE or FALSE. Try:
x12 < 3
Logical variables can also be assigned to objects, for example:
y <- x12 < 3
Here the object y is assigned the vector of
TRUE and FALSE values corresponding
to whether the entries in x12 are less than three or
not.
y
Output
[1] FALSE FALSE FALSE
R includes many functions, especially for statistical summaries and analyses.
Try these functions on x12 and other objects
that you create.
mean(x12) # calculates the mean of the entries in x12 var(x12) # calculates the variance of the entries in x12 sum(x12) cumsum(x12) # check what this function is doing… median(x12) length(x12) sum(y) # logical values can be 'coerced' to numeric, where FALSE = 0 and TRUE = 1 # - here the sum is a count of 'TRUE's
The last example can be modified to calculate proportions or percentages. For example, if y is a set of logical variables indicating whether plants are heavily infected,
sum(y)/length(y)
gives the proportion of plants heavily infected.
sum(y) gives the number of plants infected and
length(y) gives the total number of plants.
The help function gives information about how a function
works. For example, help(mean).
The example function gives examples of applications of a
function, example(length).
Throughout these examples, try using the help
and example commands to better understand what
programs are doing.
ls() # gives a list of all the objects in your working directory rm() # removes the objects indicated. # For example the object x is removed by rm(x)
You might want to clear out old objects that you don’t need anymore to avoid confusion.
Graphics
Simple Plotting
Producing a simple x-y plot, or scatter plot, is easy in
R. Use help(plot) to get a description of
options available.
DisSev <- c(0,12,67,34,23,28) RelYield <- c(100,103,79,89,85,97) plot(x=DisSev, y=RelYield, xlab='Disease Severity (%)', ylab='Relative Yield (%)')
To draw a line through the points, from the smallest value
of DisSev to the largest value of DisSev, the necessary
ordering of the values can be determined using the
order command and then DisSev and RelYield can
be indexed by these ordered values. The first command lines
provide an introduction to order and the related
command rank.
# Set up a simple example vector ex8 <- c(103,101,102) # The order command give the order in which entries # must be drawn to put them in order from smallest to largest order(ex8) # First the 2nd entry must be drawn, then the 3rd, then the 1st. # In contrast, the rank command gives the rank of the entries # from smallest to largest. rank(ex8) # The first entry comes 3rd, the next entry comes first, and # the next entry comes second. # Now use the order command and plot the DisSev data. DisSevOrd <- order(DisSev) plot(DisSev[DisSevOrd],RelYield[DisSevOrd], type='l')
Advanced plotting
Using color in plots
Suppose you would like to use color in your plots to make
them easier to read or more attractive. Colors are easily
added to plots using the col='color name'
statement in the plot() command.
plot(DisSev, RelYield, xlab='Disease Severity', ylab='Relative Yield (%)', col='orange', pch=15, )
To find a list of all colors available for use in R
graphics type colors() Likewise, calling
different numbers for pch in the
plot command produces different shapes for the
plotting points.
Using subscripts and superscripts in graph labels
To add subscripts to your x or y labels enclose the text
in [] brackets, for example H[2]O.
To superscript text use a tilde, ~, before the
text and use the ^ before the part of the text
to be raised, for example ~ cm^3.
plot(x=DisSev, y=RelYield, xlab = quote(Size [subscript]), ylab = quote(Volume ~ cm^3), col='purple', pch=17 )
Interactive plotting features
Sometimes it's useful to specify the range for values
represented in a graph rather than just using the default
ranges produced by R. You can control the range of values on
the x (horizontal) and y (vertical) axes using
xlim and ylim.
plot(DisSev, RelYield, xlab='Disease Severity (%)', ylab='Relative Yield (%)', xlim=c(0,100), ylim=c(0,100))
Variations on the plot() function are called
when it is applied to different types of objects. For
example, when plot() is applied to categorical
data, a boxplot is produced. When applied to the output from
the linear regression function lm,
plot() produces a set of four regression
diagnostic plots.
R has interactive features, too. To see the ID for three points, try clicking on three points on the graph after entering the following:
identify(DisSev,RelYield, n=3)
To get the (x,y) coordinates for any four locations on the
graph, try clicking any four places on the graph after
entering locator(4). The coordinates will
appear after all the locations have been clicked on.
You can also plot a particular function instead of just a set of variables.
curve(x^2,from=0,to=10,xlab='Time',ylab='Population size')
or
curve(exp(x),from=0,to=10,xlab='Time',ylab='Population size')
More examples of graphing capabilities in R are available on the web, http://addictedtor.free.fr/graphiques/, and detailed instructions and illustrations are available in Murrell (2005).
Other Plant Health Instructor documents give examples of more complicated plots: Modeling Plant Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting and Validation for Plant Pathology.
Saving Graphical Output
Graphs can be saved in a PDF file (or a range of other formats) for later use and addition to documents. For example, the following command would save graphics output to the PDF file as named.
pdf(file='testplot.pdf')
Then, when output to that file is completed, the following command will turn off the ‘device’ that was moving output to the named file.
dev.off()
Using the interactive R windows, the simplest way to save a graph may be to right click on the graphics window and select the format for saving the graph from the menu. This operation depends on your operating system. Options may vary.
Loops
A loop can be used to perform the same function repeatedly. For example:
for(i in 1:3){x12[i] <- x12[i] + 1}
for(i in 1:3){}sets up the loop to run over three values of i, 1, 2, and 3 for whatever commands are entered in the curly brackets {}.x12[i]indicates the ith entry in x12. When i = 1,x12[i]is the same asx12[1]. When i = 2,x12[i]is the same asx12[2].-
x12[i] <- x12[i] + 1Here the value in x12[i] is replaced by the value in x12[i] plus 1.
In this case the same result is obtained more simply
by:
x12 <- x12 + 1
since R will perform the function on each entry of
x12.
A generalization if the function is to be applied to each entry of x12 would be the following.
for(i in 1:length(x12)){x12[i] <- x12[i] + 1}
Here i runs from one to the length of the x12
vector.
The use of a temporary variable named i is arbitrary, another name could be used.
for(Joe in 1:length(x12)){x12[Joe] <- x12[Joe] + 1}
You might choose to apply a function only to entries for which a certain criterion is true.
for(i in 1:length(x12)){if(x12[i] < 4){x12[i] <- x12[i] + 1}}
Here the if() statement controls which
entries in the vector x12 are modified.
In this case, only those entries that are less than 4 have 1 added.
Another way to accomplish the same thing would be as follows.
index12 <- x12 < 4 x12[index12] <- x12[index12] + 1
Minimizing the use of loops can make R programs more efficient. Often the use of a loop can be avoided if operations are performed on whole objects at a time, as in:
x12 <- x12 + 1
Working With Data Sets
Suppose you would like to combine two sets of values into a single object.
dai <- c(0,1,2,5,10,20) # dai might indicate days after inoculation pinf <- c(0,0,0,5,25,80) # pinf might indicate percent infection
Try
rbind(dai,pinf) cbind(dai,pinf)
Compare to
rbind(c(dai,pinf))
You can create arrays using the following commands
z5 <- array(0,c(5,3)) z6 <- array(-1,c(5,3,2))
Sometimes arrays with entries 0 or -1 are useful for starting values, where the array will later get filled in with other values. For example, 0 might be the starting infection level or -1 might be a placeholder indicating that infection levels haven’t been entered yet.
Data Structures
Venables et al. (2007) recommend keeping variables
in data frames for easier organization. See what
df5 looks like
df5 <- data.frame(dai,pinf)
Data frames need to be composed of objects with the same length. If objects don't have the same length, you might get an error or you may get surprising results. Check the following example.
var1 <- 1:10
The above command assigns the sequence of numbers between
1 and 10 to the object named var1.
var2 <- 1:2 df6 <- data.frame(var1,var2)
Did df6 look the way you expected it to? This
shows how different variable length can produce confusing
outcomes when the data.frame command is used.
You can indicate dai within df5
by df5$dai, df5[1], or
df5[[1]]. The name dai is only
maintained with the call df5[1]
Lists can also be used to gather objects together and the objects don't need to be of the same length.
list6 <- list(var1,var2)
Consider list6[[1]] and
list6[[2]]
The interactive R environment in Windows has a data editing window available through the Menu under Edit.
Moving to and From Files
Reading Data into R
R users can construct data sets in programs such as Microsoft Excel or Notepad and then bring those files into R for further analyses. Data input in R is handled through a series of options, including:
read.tableread.csvread.csv2read.delimread.delim2
For further information on the differences amongst these
methods, use the help() or ?
options in R, for example: ?read.table.
The primary information that is required to bring the data into R is the location of the file (e.g., a folder on your hard drive or elsewhere), and information regarding how the data are formatted. The following examples will help explain some of these differences. For the purpose of this tutorial, a Windows system was used and a default “temp” folder was created on the C-drive, but we will also show how to bring data from a source such as “My Documents”, dealing with folder names that incorporate spaces. Examples for importing to a Mac OSX or Linux computer will also be provided. To save the four example files, right click using your mouse, and then select 'Save Link As' or 'Save Target As'.
The two .txt files are spain1.txt and spain2.txt. The two .csv files are spain2.csv and spain3.csv, one file with missing data coded as "?". Remember where you saved these files, because later you will need to tell R where to find them.
In all the example files, there is a heading for maximum and minimum temperature in °F and °C. The files include 141 rows and four columns. Here is an example of reading a .txt file with no missing data.
Working in Windows With No Spaces in File Path
#spain1 will be the name used in R as a data frame #read.table begins by looking for the file in the location you describe and header=T #indicates that there was a heading for this file (header=FALSE is default) spain1<-read.table('c:/temp/spain1.txt', header=T) #To examine the first five rows spain1[1:5,]
Output
MAX MIN MAXC MINC
1 48.6 32.0 9.2 0.0
2 49.3 46.0 9.6 7.8
3 50.7 40.3 10.4 4.6
4 51.8 48.2 11.0 9.0
5 53.2 44.2 11.8 6.8
Working in Windows When the File Path or Name Contains Spaces
If a file is in a folder with spaces in its name, “\” can be used to indicate the presence of a space. For example, if the file is located in a folder named “My Documents”, the following code can be used:
spain1<-read.table('c:/Documents\ and Settings/"username"/My\ Documents/spain2.txt',header=T) # Now, read in spain2 using the .csv format; because of the .csv format, header=T is # assumed; if there was no header, then a header=FALSE would need to be defined spain2<-read.csv('c:/temp/spain2.csv') #Again, as a check, you can examine a subset of the data as: spain2[1:5,]
Output
MAX MIN MAXC MINC
1 47.8 42.1 8.8 5.6
2 46.0 41.0 7.8 5.0
3 45.3 39.2 7.4 4.0
4 48.9 38.5 9.4 3.6
5 46.0 32.7 7.8 0.4
Working in Mac OSX or Linux When the File Path or Name Contains No Spaces
Mac OSX and Linux share the same syntax, with a command generally the same as for Windows except for the file structure syntax. Since a typical user will not have "write" access to the root directory, the files should be stored in your /home directory. In this example, we have created a "temp" folder in the user's /home/"username" directory. In this case, to minimize typing and help reduce the chance of mistakes, a tilde "~" is used in place of "/home/username/".
#if you have a Unix or Mac based operating system and a temp folder in your home directory, #you may bring the dataset in as: spain1<-read.table('~/temp/spain1.txt',header=T)
To import data from a file or directory with spaces in the name, a "\" must be placed in the file path where there are spaces for R to understand how to read the directory structure. Suppose the file is called "2003 Data" and is in a directory called "field work.csv" in your home documents folder. The syntax would be "~/Documents/2003\ Data/field\ work.csv"
Once the data are imported, the commands are the same for the different operating systems. The only difference is in the directory structure syntax used to indicate the location of the files.
Working With Missing Data Values
#In the last example, there are two missing observations in the first row and these were #coded as "?" in the original data entry. R uses "NA" to represent missing observations, #but instead of having change "?" to "NA" before read in the table, you can tell R what #the missing data looks like and R will change "?" to "NA": spain3<-read.csv('c:/temp/spain3.csv', na.strings="?") spain3[1:5,]
Output
MAX MIN MAXC MINC
1 NA 42.1 NA 5.6
2 46.0 41.0 7.8 5.0
3 45.3 39.2 7.4 4.0
4 48.9 38.5 9.4 3.6
5 46.0 32.7 7.8 0.4
Working With Many Data Files
If you work with many data files it may be convenient to
set a working directory using setwd().
For example:
setwd('c:/temp') #Read in the spain2.txt file from this folder as: spain2<-read.table('spain2.txt',header=T) spain2[1:5,]
Output
MAX MIN MAXC MINC
1 47.8 42.1 8.8 5.6
2 46.0 41.0 7.8 5.0
3 45.3 39.2 7.4 4.0
4 48.9 38.5 9.4 3.6
5 46.0 32.7 7.8 0.4
Character Data
Objects can also include character data.
x87 <- c('A','B','G')
Names can be generated by pasting character and number entries together.
names14 <- paste('X', 1:5, sep='-')
Generating Random Values
A series of 20 fair (equal probability of head=1 or tail=0) coin flips can be generated by the random binomial generator
rbinom(n=20, size=1, prob=0.5) # size indicates the number of flips in each of the 20 trials
Samples from a set of numbers or names can be drawn.
sample(20) # this arranges 1:20 in random order sample(20,5) # this draws 5 samples from 1:20 sample(20, replace=T)
The last example draws 20 samples from 1:20 with
replacement, while the default is to sample without
replacement. Use the help command to find more
information about sample and
paste.
x24 <- paste('trt',1:10,sep='') sample(x24,5)
Example: Generating Randomized Treatment Maps
You can use random numbers and sampling in R to generate maps of treatments for experiments. For example, a completely randomized design assigns a set of treatments to experimental units at random throughout the set of experimental units. This is in contrast to a randomized complete block design, where each treatment is assigned to one experimental unit within a block. The easiest ways to conceptualize such experimental designs may be in the context of a field study in which a unit of land or a plot is an experimental unit or in terms of a greenhouse study in which a pot on a bench is an experimental unit.
For a completely randomized design, suppose there are
ntrt treatments labeled 1 through
ntrt, and nrepl replicates. The set
of random treatment assignments can be generated by the
following code, where arbitrary values of ntrt
and nrepl are supplied as an example:
ntrt <- 8 nrepl <- 2 sample(rep(x=1:ntrt,times=nrepl)) # These treatments can be assigned to a map with particular dimensions # by putting the values in a matrix of desired dimension temp <- sample(rep(x=1:ntrt,times=nrepl)) # note that the function ‘rep’ already exists in R # and is not to be confused with our new variable ‘nrepl’ # try ‘help(rep)’ for more information matrix(temp,nrow=nrepl,ncol=ntrt)
For a randomized complete block design, each treatment has
to be assigned once within each block before the same
treatment can be assigned again. If nrepl = 1
(an unrealistic case), then the order of treatments can be
randomized by:
ntrt <- 8 nrepl <- 1 sample(1:ntrt) # Suppose nrepl is greater than 1 nrepl <- 5 randout <- sample(1:ntrt) for (j in 2:nrepl){randout <- c(randout,sample(1:ntrt))} # If the blocks should appear in the map as columns matrix(randout,nrow=ntrt,ncol=nrepl)
Statistical Distributions
Statistical distributions are often used as a tool to simplify comparisons between groups, such as comparisons of the mean and variance of plant biomass under different experimental treatments. R includes four types of functions for manipulating several distributions useful in biological modeling:
- random generation of variables from the distribution,
- returning values of the density,
- returning percentiles for the distribution, and
- returning quantiles of the distribution.
One of the most commonly used distributions is the normal distribution (aka the Gaussian distribution, aka the bell-shaped curve). The normal distribution is described completely by two parameters, the mean and the standard deviation. The following commands illustrate some properties of the Normal distribution.
# Illustrate a normal distribution # with mean 0 and standard deviation 1 # by drawing 100,000 random # observations from the distribution hist(rnorm(n=100000,mean=0,sd=1)) # If your computer is fast enough # you can quickly draw 10,000,000 # (or more) random observations and revel in the power hist(rnorm(n=10000000,mean=0,sd=1)) # You can get an idea of why small sample # sizes only supply limited # information about a process by # sampling just a few random # observations at a time, # while keeping the x axis range constant hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4), xlab='Observation',main='Example') # The mfrow command adjusts how many # graphs are illustrated at once par(mfrow=c(2,3)) # A loop can be used to illustrate several # samples at once from the # normal distribution with mean 0 and standard deviation 1 for(j in 1:6){ hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4), xlab='Observation',main='Example') } # Compare the normal distributions # for a range of different parameters hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=0,sd=2),main='Mean=0, SD=2', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=0,sd=10),main='Mean=0, SD=10', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=1),main='Mean=100, SD=1', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=2),main='Mean=100, SD=2', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=10),main='Mean=100, SD=10', xlab='Observation',xlim=c(-40,140)) # For reference, generate a single histogram # of a normal distribution # with mean 0 and standard deviation 1 again par(mfrow=c(1,1)) hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1', xlab='Observation') # The density of the random distribution # at any given value indicates # how likely values in that range are # What is the value of the density at the mean, 0? dnorm(0,mean=0,sd=1) # What is the value of the density # for an unlikely part of the range, 4? dnorm(4,mean=0,sd=1) # The values of the density don't # provide the probability of those # values per se, but they give an idea # of the relative probability # The percentile provides information # about the probability that # values from a normal distribution # fall below a particular value # What is the probability that values fall below -1? pnorm(-1,mean=0,sd=1) # What is the probability that values fall below 0? # Or, phrased differently, # what proportion of values fall below 0? pnorm(0,mean=0,sd=1) # In statistical analyses, a normal distribution may be used # to represent potential observations under a null hypothesis. # If observations fall far in the tails of the distribution, this # suggests that the null hypothesis is not true. For example, # the distribution might represent the mean difference between # responses to two treatments, where the mean difference is 0 # under the null hypothesis of no difference. # For this normal distribution, what is the probability # of an observation below -1.96? pnorm(-1.96,mean=0,sd=1) # What is the probability that an observation falls above 1.96? 1 - pnorm(1.96,mean=0,sd=1) # Note that these are frequently used cut-offs, such that the # probability an observation falls either below -1.96 or above 1.96 adds up to 0.05. # Here 0.05 represents an acceptably small probability that the # extreme mean difference could be observed for a sample even # if the mean difference is actually 0. # The quantile function allows you to ask # a related but different type of question # Below what value does half the distribution fall? qnorm(0.5,mean=0,sd=1) # Below what value does 2.5% of the distribution fall? qnorm(0.025,mean=0,sd=1) # Above what value does 2.5% of the distribution fall? qnorm((1-0.025),mean=0,sd=1)
# What is the value of the density # for an unlikely part of the range, 4? dnorm(4,mean=0,sd=1) # The values of the density don't # provide the probability of those # values per se, but they give an # idea of the relative probability # The percentile provides information # about the probability that # values from a normal distribution # fall below a particular value # What is the probability that values fall below -1? pnorm(-1,mean=0,sd=1) # What is the probability that values fall below 0? # Or, phrased differently, # what proportion of values fall below 0? pnorm(0,mean=0,sd=1) # The quantile function allows you # to ask a related but different # type of question # Below what value does half the distribution fall? qnorm(0.5,mean=0,sd=1) # Below what value does 5% of the distribution fall? qnorm(0.05,mean=0,sd=1) # Below what value does 2.5% of the distribution fall? qnorm(0.025,mean=0,sd=1) # Above what value does 2.5% of the distribution fall? qnorm((1-0.025),mean=0,sd=1)
Try repeating the same types of analyses for another
distribution, the binomial, using the functions
rbinom, dbinom,
pbinom, and qbinom. The binomial
distribution has two parameters, the probability of success,
prob, and the number of trials,
size. The binomial distribution is often illustrated in terms of a series of flips of a coin, where success might be defined in terms of the coin flip yielding 'heads', and where
prob=0.5 if the coin is fair. The binomial
distribution can only take on non-negative integer values, in
contrast to the normal distribution which includes any number
from -∞ to ∞.
hist(rbinom(n=100000,size=10,prob=.5))
The Use of Statistical Distributions in Epidemiology
Applications of several distributions in ecology and epidemiology are illustrated using R in other Plant Health Instructor documents.
The exponential model can be used to describe disease progress over time or the declining probability of disease with increasing distance from an inoculum source.
The uniform distribution can be used to generate a random number across a continuous range in which every value is equally likely. Drawing from a uniform distribution is used for selecting initial coordinates of infections and for selecting directions for dispersal in a spatial epidemic simulator.
The binomial distribution can be used to describe the number of ‘successes’ out of a fixed number of trials. For example, it can be used to generate rain events in a disease simulation.
Example: The Beta-binomial Distribution for Analysis of Spatial Pattern
The binomial distribution can also be used to describe the
number of diseased plants within a sampling area. For
example, if a sampling quadrat contains size=40
plants and each plant has probability prob=0.1
of being infected, then the distribution of the number of
plants infected within a quadrat is:
hist(rbinom(n=100000,size=40,prob=0.1),main='Number infected') # This could also be expressed # in terms of the proportion # of plants infected by dividing by size=40 hist(rbinom(n=100000,size=40,prob=0.1)/40,main='Proportion infected') # On average, how many quadrats # will have 10% or fewer plants # infected (4 out of 40 or fewer)? pbinom(4,size=40,prob=0.1)
But when estimates of the proportion infection per quadrat are collected in the field, the binomial distribution may not always provide a good fit. For example, if disease is aggregated, the number of quadrats with high disease incidence and low disease incidence will be higher than would be expected for a binomial distribution. One way to deal with this lack of fit, and to provide a parameter that describes the degree of aggregation, is to fit disease incidence in quadrats using a beta-binomial distribution (Hughes and Madden 1993).
For the binomial distribution, the probability parameter is constant within any given realization. (That is, varying the probability parameter results in a different distribution.) For the beta-binomial distribution, the probability parameter is not constant but instead is described by the beta distribution.
The beta distribution is a very flexible distribution describing probabilities between 0 and 1 with two shape parameters. R provides four different types of functions for describing the beta distribution and a range of other distributions, as illustrated for the normal distribution above.
hist(rbeta(n=100000,shape1=1,shape2=1))
# The next command allows you to plot # figures in 2 rows and 3 columns par(mfrow=c(2,3)) # Try plotting several parameter # combinations to compare the # distributions for values 0.5, 1, 2, 3, and 4 # for each shape parameter hist(rbeta(n=100000,shape1=1,shape2=1)) # Now repeat the hist command with different parameters
# We can generate beta-binomial # random variables in R by using # beta random variables as parameters # in the binomial function par(mfrow=c(1,1)) hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=1,shape2=3)))
The shape1 parameter is often notated as α
and the shape2 parameter as β. Then θ = 1/(α +
β) measures the variability in the number of events, where
events might be, for example, the number of infected plants
per quadrat. So, as the sum of the two shape parameters
becomes larger, the variability decreases. You can see that
this makes sense because as the shape parameters become
larger, the beta distribution becomes more centralized.
hist(rbeta(n=100000,shape1=2,shape2=1),main='shape1=2,shape2=1', xlim=c(0,1)) # …as compared to… hist(rbeta(n=100000,shape1=26,shape2=30),main='shape1=32,shape2=30', xlim=c(0,1))
Since the beta distribution supplies the probability parameter in the beta-binomial distribution, the less variable the beta distribution is, the less variable the beta-binomial will be.
hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=2,shape2=1)), main='shape1=2,shape2=1') # …as compared to… hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=26,shape2=30)), main='shape1=32,shape2=30')
Statistical Analyses
One of the most common uses of R is for statistical analysis of data. There are a number of references available describing analyses such as regression, analysis of variance, nonparametric analyses, and resampling approaches in R. Applications of statistical analyses are also illustrated in the following Plant Health Instructor documents: Modeling Plant Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting and Validation for Plant Pathology.
Writing Functions
You can create new custom functions in R to do more complicated calculations.
As a simple example, suppose you want to write a function that will return the squared value of whatever value you enter into the function. You can create such a function with the following command.
square.it <- function(x){x^2}
Note that those are curly brackets around
x^2. Here x is used temporarily and
the intention is not usually to have an object named
x read in, though that is what will happen if no
value for x is specified when the function is
applied (which can lead to confusion). Try these commands,
checking the value of x and y98
after each command.
x <- 2 y98 <- square.it(x=5) y98 <- square.it(x) y98 <- square.it(x=3)
Note that the value in the object x does not
change when the function is applied, unless the output of the
function is assigned to x.
x <- square.it(x=5)
More complicated functions can be made that include
multiple lines of commands split by a semi-colon or on
separate lines. The variables created inside the function,
temp1 and temp2, are also just used
temporarily.
fun91 <- function(x,y){temp1 <- x*y; temp2 <- log10(temp1); temp2}
Or, equivalently, to make the function easier to read:
fun91 <- function(x,y){ temp1 <- x*y temp2 <- log10(temp1) temp2 }This example gives the value in
temp2 as output.
Try:
fun91(x=10,y=100)
The function can be applied to any objects x
and y that make sense. For example:
fun91(x=c(1,2),y=c(10,100))
The function can also be applied to pre-existing objects.
z1 <- c(1,2) z2 <- c(10,100) fun91(x=z1,y=z2)
Closing a Session
Type q() to close the R window.
You have the option of saving objects and functions if you expect to use them again. In that case make sure you have made note of the names of objects so that there will be no confusion in future sessions.
Acknowledgments
We appreciate the very helpful comments of P. Garfinkel, J. Hadam, S. Pethybridge, PHI Reviewers, and members of the 2006 course in Ecology and Epidemiology of Plant Pathogens at KSU. It’s also a pleasure to acknowledge support by the U.S. National Science Foundation under Grants DEB-0130692, DEB-0516046, EF-0525712 (as part of the joint NSF-NIH Ecology of Infectious Disease program) and DBI-0630726, by the Ecological Genomics Initiative of Kansas through NSF Grant No. EPS-0236913 with matching funds from the Kansas Technology Enterprise Corporation, by the Office of Science (Program in Ecosystem Research), U.S. Department of Energy, Grant No. DE-FG02-04ER63892, by the U.S. Agency for International Development for the Sustainable Agriculture and Natural Resources Management Collaborative Research Support Program (SANREM CRSP) under terms of Cooperative Agreement Award No. EPP-A-00-04-00013-00 to the Office of International Research and Development at Virginia Tech and for the Integrated Pest Management CRSP, by USDA grant 2002-34103-11746, and by the NSF Long Term Ecological Research Program at Konza Prairie. This is Kansas State Experiment Station Contribution No. 07-310-J.
References
-
Bolker, B. In press . Ecological Models and Data in R. Princeton University Press. http://www.zoo.ufl.edu/bolker/emdbook.
-
Crawley, M.J. 2007. The R Book. Wiley.
-
Esker, P.D., Sparks, A.H., Antony, G., Bates, M., Dall' Acqua, W., Frank, E., Huebel, L., Segovia, V., and Garrett, K.A. 2007. Ecology and epidemiology in R: Modeling dispersal gradients. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2007-1226-03
-
Esker, P.D., Sparks, A.H., Campbell, L., Guo, Z., Rouse, M., Silwal, S.D., Tolos, S., Van Allen, B., and Garrett, K.A. 2008. Ecology and epidemiology in R: Disease Forecasting.. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2008-0129-01
-
Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., and Dudoit, S., eds. 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York.
-
Hughes, G., and Madden, L.V. 1993. Using the beta-binomial distribution to describe aggregated patterns of disease incidence. Phytopathology 83:759-763.
-
Murrell, P. 2006. R Graphics. Chapman & Hall/CRC, Boca Raton.
-
R Development Core Team. 2007. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
-
Sparks, A..H., P.D. Esker, G. Antony, L. Campbell, E.E. Frank, L. Huebel, M.N. Rouse, B.Van Allen, and K.A. Garrett, 2008. Ecology and epidemiology in R: Introduction to spatial analysis. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2008-0129-03.
-
Sparks, A.H., Esker, P.D., Bates, M, Dall' Acqua, W., Guo, Z., Segovia, V. Silwal, S.D., Tolos, S., and Garrett, K.A. 2008. Ecology and epidemiology in R: Disease Progress over Time. The Plant Health Instructor. Online: DOI:10.1094/PHI-A-2008-0129-02
-
Venables, W.N., Smith, D.M., and the R Development Core Team. 2007. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.6.1. http://cran.r-project.org/doc/manuals/R-intro.pdf.