APSnet Education Center Advanced Topics | Return to the Topics Index | Ecology and Epidemiology in R



Garrett, K.A., P.D. Esker, and A.H. Sparks. 2007. Introduction to the R Programming Environment. The Plant Health Instructor. DOI:10.1094/PHI-A-2007-1226-02.

An Introduction to the R Programming Environment

K. A. Garrett, P. D. Esker, and A. H. Sparks

Student Learning Goals

After completion of this module:

  • Students will understand some of the range of potential applications of R in Biology.
  • Students will be able to:
    1. use R in interactive sessions,
    2. use basic programming in R for graphics and dataset manipulation,
    3. generate random numbers and use statistical distributions.

Feedback

We would appreciate feedback for improving this paper and information about how it has been used for study and teaching. Please send your feedback to kgarrett@ksu.edu. Please include the following text in the e-mail subject line, "Feedback on R Modules", to make sure your comments are received.

Introduction

R is a free software environment that includes a set of base packages for graphics, math, and statistics. You can also make use of specialized packages contributed by R users or write your own new functions. R was developed as a part of the GNU project from the S language, http://cm.bell-labs.com/cm/ms/departments/sia/S/.

R performs many of the same statistical analyses as SAS, http://www.sas.com/. The trade-off in use of R versus SAS might revolve around the fact that R is free while SAS may have better technical support and testing for features such as linear models in packages like SAS Proc Mixed.

Biologists might be most interested in using R for statistical analysis, ecological modeling (Bolker, in press), and in bioinformatics applications using Bioconductor (Gentleman et al. 2005). Also a set of Plant Health Instructor documents have been prepared to illustrate the use of R in epidemiology and ecology: Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting.

For a more comprehensive introduction to R, see the book: An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics, by Venables et al. (2007), especially Appendix A/doc/manuals which gives a sample R session illustrating many features. This book is available on-line through the R website, http://www.r-project.org, or a slightly older bound version is available for purchase with money going to support free software development and documentation. Crawley (2007) provides a substantial reference for R.

Installing R

Both source code and binaries for Windows, Mac OS X, and several Linux distributions are available. You can install R on your computer from the R website, http://www.r-project.org/.

  1. Click on the 'CRAN mirror' link in the menu on the left side of the page.
  2. Pick a location for downloading, preferably a geographic location near you.
  3. Pick the operating system that you are using, Windows, Mac, or Linux.
  4. Perhaps start with 'base' software.
  5. Peruse information in the README file.
  6. If using Windows, click on the R-installation .exe file to install.

    Note that the setup wizard gives you an option for where the R directory will be. You might just want to use the default and keep track of where it is for future reference.

R as a Deluxe Calculator

Now that you have installed R on your computer, you can begin trying it out. To start using R, open the interactive R window by clicking on the R icon. Using R interactively, try out common arithmetic operators: +, -, /, *

For example, if 3*3 is entered on the command line, 9 is returned on the following line.

Up and down arrows can be used to scroll among previous commands if you want to repeat them or edit them for re-entry.

Try entering the following commands to see the output

3^3
log10(100)
exp(4)
log10(10^4)

Creating objects and assigning values

Objects are assigned values using <- , an arrow formed out of < and -. (An equal sign, =, can also be used.) For example, the following command assigns the value 5 to the object x.

x <- 5

After this assignment, the object x ‘contains’ the value 5. Another assignment to the same object will change the content.

x <- 107

You can check the content of an object by simply entering the name of the object on an interactive command line. Try that throughout these examples to see what the results are of the different operations and functions illustrated.

R is case sensitive – that is, it treats data15 and Data15 as completely different objects.

This can be a hassle if you write code in a separate file using a word processor that automatically changes case at the beginning of sentences, etc. In Word, you can change 'autocorrect' options in the Menu under Tools, Autocorrect options,… Using an editor like Notepad is a better option when simply writing R code.

An editor is also supplied within the R window through the menu under File, New Script. The simplest way to develop commands in R and then implement them may be to write the code in the R Editor window and then submit the code through the Edit, Run… options.

Free editors designed for R are available that highlight code syntax as in this document and may make working with R easier in other ways as well. Sciviews-R is a Graphical User Interface (GUI) for R for Windows. For Linux and BSD, two Unix-like operating systems, RKWard is a transparent front end for R based on KDE, a desktop environment for Unix systems, that strives to make R easier to use. This program was used while creating this HTML document to highlight the R-code. Other programs such as RCommander, and Emacs/ESS are available as well. While the use of programs such as these will not be covered in this document, you may find them helpful to you as you work through these documents or in the future if you choose to use R in your research.

Comments should be added to programs to clarify steps, define variables, etc., so that anyone reading a program now or in the future can more easily make sense of what the program does. The part of a line that begins with # does not get treated as code, rather as a comment. For example, after the following two lines of code, x still has the value 5.

x <- 5
# x <- 6

Try creating some other objects and assigning values to them. Naming of objects is very flexible, though names must begin with a letter and problems can be avoided if you don’t name objects with the same name as an existing function in R. Picking a name that has a clear meaning is particularly helpful for programs that contain many lines. If the R workspace is saved (more about this below), it is important to take note of what objects have been created, so that old objects are not accidentally used incorrectly in future sessions.

Some names have already been assigned in R, like mean, function, etc.

For example, ‘pi’ is already defined (π = 3.1416).

Try entering the following command line, pi.

An object can be assigned a set of numbers, as for example:

x12 <- c(10,6,8)

Here c combines the numbers 10, 6, and 8 into a vector.

Operations can then be performed on the whole set of numbers. For example, for the object x12 created above, check the results of the following:

x12 * 10
x12/3

The numbers in x can be indexed individually, for example:

x12[2] # gives the second entry in x12
x12[1] # gives the first entry in x12

Or the numbers in x can be indexed in subsets. Consider the results of the following two commands:

x12[1:3] # gives the first through third entries in x12
x12[c(1,3)] # gives the first and third entries in x12

What is the difference between the these two commands?

Logical variables can be used in evaluation of the entries in an object. Logical variables take on the values TRUE or FALSE. Try:

x12 < 3

Logical variables can also be assigned to objects, for example:

y <- x12 < 3

Here the object y is assigned the vector of TRUE and FALSE values corresponding to whether the entries in x12 are less than three or not.

y 

Output

[1] FALSE FALSE FALSE

R includes many functions, especially for statistical summaries and analyses.

Try these functions on x12 and other objects that you create.

mean(x12) # calculates the mean of the entries in x12
var(x12)  # calculates the variance of the entries in x12
sum(x12)
cumsum(x12)  # check what this function is doing…
median(x12)
length(x12)
sum(y)  #  logical values can be 'coerced' to numeric, where FALSE = 0 and TRUE = 1
        # - here the sum is a count of 'TRUE's

The last example can be modified to calculate proportions or percentages. For example, if y is a set of logical variables indicating whether plants are heavily infected,

sum(y)/length(y)

gives the proportion of plants heavily infected. sum(y) gives the number of plants infected and length(y) gives the total number of plants.

The help function gives information about how a function works. For example, help(mean).

The example function gives examples of applications of a function, example(length).

Throughout these examples, try using the help and example commands to better understand what programs are doing.

ls() 
# gives a list of all the objects in your working directory
rm()
# removes the objects indicated.  
# For example the object x is removed by
rm(x)

You might want to clear out old objects that you don’t need anymore to avoid confusion.

Graphics

Simple Plotting

Producing a simple x-y plot, or scatter plot, is easy in R. Use help(plot) to get a description of options available.

DisSev <- c(0,12,67,34,23,28)
RelYield <- c(100,103,79,89,85,97)
plot(x=DisSev, y=RelYield, xlab='Disease Severity (%)',
ylab='Relative Yield (%)')

To draw a line through the points, from the smallest value of DisSev to the largest value of DisSev, the necessary ordering of the values can be determined using the order command and then DisSev and RelYield can be indexed by these ordered values. The first command lines provide an introduction to order and the related command rank.

# Set up a simple example vector

ex8 <- c(103,101,102)

# The order command give the order in which entries
# must be drawn to put them in order from smallest to largest

order(ex8)

# First the 2nd entry must be drawn, then the 3rd, then the 1st.
# In contrast, the rank command gives the rank of the entries
# from smallest to largest.

rank(ex8)

# The first entry comes 3rd, the next entry comes first, and
# the next entry comes second.

# Now use the order command and plot the DisSev data.

DisSevOrd <- order(DisSev)
plot(DisSev[DisSevOrd],RelYield[DisSevOrd], type='l')

Advanced plotting

Using color in plots

Suppose you would like to use color in your plots to make them easier to read or more attractive. Colors are easily added to plots using the col='color name' statement in the plot() command.

plot(DisSev, RelYield,
  xlab='Disease Severity',
  ylab='Relative Yield (%)',
  col='orange',
  pch=15,
)

To find a list of all colors available for use in R graphics type colors() Likewise, calling different numbers for pch in the plot command produces different shapes for the plotting points.

Using subscripts and superscripts in graph labels

To add subscripts to your x or y labels enclose the text in [] brackets, for example H[2]O. To superscript text use a tilde, ~, before the text and use the ^ before the part of the text to be raised, for example ~ cm^3.

plot(x=DisSev, y=RelYield,
  xlab = quote(Size [subscript]),
  ylab = quote(Volume ~ cm^3),
  col='purple',
  pch=17
)

Interactive plotting features

Sometimes it's useful to specify the range for values represented in a graph rather than just using the default ranges produced by R. You can control the range of values on the x (horizontal) and y (vertical) axes using xlim and ylim.

plot(DisSev, RelYield, xlab='Disease Severity (%)',
ylab='Relative Yield (%)', xlim=c(0,100), ylim=c(0,100))

Variations on the plot() function are called when it is applied to different types of objects. For example, when plot() is applied to categorical data, a boxplot is produced. When applied to the output from the linear regression function lm, plot() produces a set of four regression diagnostic plots.

R has interactive features, too. To see the ID for three points, try clicking on three points on the graph after entering the following:

identify(DisSev,RelYield, n=3)

To get the (x,y) coordinates for any four locations on the graph, try clicking any four places on the graph after entering locator(4). The coordinates will appear after all the locations have been clicked on.

You can also plot a particular function instead of just a set of variables.

curve(x^2,from=0,to=10,xlab='Time',ylab='Population size')

or

curve(exp(x),from=0,to=10,xlab='Time',ylab='Population size')

More examples of graphing capabilities in R are available on the web, http://addictedtor.free.fr/graphiques/, and detailed instructions and illustrations are available in Murrell (2005).

Other Plant Health Instructor documents give examples of more complicated plots: Modeling Plant Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting and Validation for Plant Pathology.

Saving Graphical Output

Graphs can be saved in a PDF file (or a range of other formats) for later use and addition to documents. For example, the following command would save graphics output to the PDF file as named.

pdf(file='testplot.pdf')

Then, when output to that file is completed, the following command will turn off the ‘device’ that was moving output to the named file.

dev.off()

Using the interactive R windows, the simplest way to save a graph may be to right click on the graphics window and select the format for saving the graph from the menu. This operation depends on your operating system. Options may vary.

Loops

A loop can be used to perform the same function repeatedly. For example:

for(i in 1:3){x12[i] <- x12[i] + 1}
  1. for(i in 1:3){} sets up the loop to run over three values of i, 1, 2, and 3 for whatever commands are entered in the curly brackets {}.
  2. x12[i] indicates the ith entry in x12. When i = 1, x12[i] is the same as x12[1]. When i = 2, x12[i] is the same as x12[2].
  3. x12[i] <- x12[i] + 1

    Here the value in x12[i] is replaced by the value in x12[i] plus 1.

In this case the same result is obtained more simply by:

x12 <- x12 + 1

since R will perform the function on each entry of x12.

A generalization if the function is to be applied to each entry of x12 would be the following.

for(i in 1:length(x12)){x12[i] <- x12[i] + 1}

Here i runs from one to the length of the x12 vector.

The use of a temporary variable named i is arbitrary, another name could be used.

for(Joe in 1:length(x12)){x12[Joe] <- x12[Joe] + 1}

You might choose to apply a function only to entries for which a certain criterion is true.

for(i in 1:length(x12)){if(x12[i] < 4){x12[i] <- x12[i] + 1}}

Here the if() statement controls which entries in the vector x12 are modified.

In this case, only those entries that are less than 4 have 1 added.

Another way to accomplish the same thing would be as follows.

index12 <- x12 < 4
x12[index12] <- x12[index12] + 1

Minimizing the use of loops can make R programs more efficient. Often the use of a loop can be avoided if operations are performed on whole objects at a time, as in:

x12 <- x12 + 1

Working With Data Sets

Suppose you would like to combine two sets of values into a single object.

dai <- c(0,1,2,5,10,20)  # dai might indicate days after inoculation
pinf <- c(0,0,0,5,25,80)  # pinf might indicate percent infection 

Try

rbind(dai,pinf)
cbind(dai,pinf)

Compare to

rbind(c(dai,pinf))

You can create arrays using the following commands

z5 <- array(0,c(5,3))
z6 <- array(-1,c(5,3,2))

Sometimes arrays with entries 0 or -1 are useful for starting values, where the array will later get filled in with other values. For example, 0 might be the starting infection level or -1 might be a placeholder indicating that infection levels haven’t been entered yet.

Data Structures

Venables et al. (2007) recommend keeping variables in data frames for easier organization. See what df5 looks like

df5 <- data.frame(dai,pinf)

Data frames need to be composed of objects with the same length. If objects don't have the same length, you might get an error or you may get surprising results. Check the following example.

var1 <- 1:10

The above command assigns the sequence of numbers between 1 and 10 to the object named var1.

var2 <- 1:2
df6 <- data.frame(var1,var2)

Did df6 look the way you expected it to? This shows how different variable length can produce confusing outcomes when the data.frame command is used.

You can indicate dai within df5 by df5$dai, df5[1], or df5[[1]]. The name dai is only maintained with the call df5[1]

Lists can also be used to gather objects together and the objects don't need to be of the same length.

list6 <- list(var1,var2)

Consider list6[[1]] and list6[[2]]

The interactive R environment in Windows has a data editing window available through the Menu under Edit.

Moving to and From Files

Reading Data into R

R users can construct data sets in programs such as Microsoft Excel or Notepad and then bring those files into R for further analyses. Data input in R is handled through a series of options, including:

For further information on the differences amongst these methods, use the help() or ? options in R, for example: ?read.table.

The primary information that is required to bring the data into R is the location of the file (e.g., a folder on your hard drive or elsewhere), and information regarding how the data are formatted. The following examples will help explain some of these differences. For the purpose of this tutorial, a Windows system was used and a default “temp” folder was created on the C-drive, but we will also show how to bring data from a source such as “My Documents”, dealing with folder names that incorporate spaces. Examples for importing to a Mac OSX or Linux computer will also be provided. To save the four example files, right click using your mouse, and then select 'Save Link As' or 'Save Target As'.

The two .txt files are spain1.txt and spain2.txt. The two .csv files are spain2.csv and spain3.csv, one file with missing data coded as "?". Remember where you saved these files, because later you will need to tell R where to find them.

In all the example files, there is a heading for maximum and minimum temperature in °F and °C. The files include 141 rows and four columns. Here is an example of reading a .txt file with no missing data.

Working in Windows With No Spaces in File Path

#spain1 will be the name used in R as a data frame
#read.table begins by looking for the file in the location you describe and header=T
#indicates that there was a heading for this file (header=FALSE is default)

spain1<-read.table('c:/temp/spain1.txt', header=T)

#To examine the first five rows

spain1[1:5,]

Output


   MAX  MIN MAXC MINC
1 48.6 32.0  9.2  0.0
2 49.3 46.0  9.6  7.8
3 50.7 40.3 10.4  4.6
4 51.8 48.2 11.0  9.0
5 53.2 44.2 11.8  6.8

Working in Windows When the File Path or Name Contains Spaces

If a file is in a folder with spaces in its name, “\” can be used to indicate the presence of a space. For example, if the file is located in a folder named “My Documents”, the following code can be used:

spain1<-read.table('c:/Documents\ and Settings/"username"/My\ Documents/spain2.txt',header=T)

# Now, read in spain2 using the .csv format; because of the .csv format, header=T is 
# assumed; if there was no header, then a header=FALSE would need to be defined

spain2<-read.csv('c:/temp/spain2.csv')

#Again, as a check, you can examine a subset of the data as:

spain2[1:5,]

Output


    MAX  MIN MAXC MINC
1 47.8 42.1  8.8  5.6
2 46.0 41.0  7.8  5.0
3 45.3 39.2  7.4  4.0
4 48.9 38.5  9.4  3.6
5 46.0 32.7  7.8  0.4

Working in Mac OSX or Linux When the File Path or Name Contains No Spaces

Mac OSX and Linux share the same syntax, with a command generally the same as for Windows except for the file structure syntax. Since a typical user will not have "write" access to the root directory, the files should be stored in your /home directory. In this example, we have created a "temp" folder in the user's /home/"username" directory. In this case, to minimize typing and help reduce the chance of mistakes, a tilde "~" is used in place of "/home/username/".

#if you have a Unix or Mac based operating system and a temp folder in your home directory,
#you may bring the dataset in as:
spain1<-read.table('~/temp/spain1.txt',header=T)

To import data from a file or directory with spaces in the name, a "\" must be placed in the file path where there are spaces for R to understand how to read the directory structure. Suppose the file is called "2003 Data" and is in a directory called "field work.csv" in your home documents folder. The syntax would be "~/Documents/2003\ Data/field\ work.csv"

Once the data are imported, the commands are the same for the different operating systems. The only difference is in the directory structure syntax used to indicate the location of the files.

Working With Missing Data Values

#In the last example, there are two missing observations in the first row and these were 
#coded as "?" in the original data entry.  R uses "NA" to represent missing observations, 
#but instead of having change "?" to "NA" before read in the table, you can tell R what 
#the missing data looks like and R will change "?" to "NA":

spain3<-read.csv('c:/temp/spain3.csv', na.strings="?")

spain3[1:5,]

Output


   MAX  MIN MAXC MINC
1   NA 42.1   NA  5.6
2 46.0 41.0  7.8  5.0
3 45.3 39.2  7.4  4.0
4 48.9 38.5  9.4  3.6
5 46.0 32.7  7.8  0.4

Working With Many Data Files

If you work with many data files it may be convenient to set a working directory using setwd().

For example:

setwd('c:/temp')

#Read in the spain2.txt file from this folder as:

spain2<-read.table('spain2.txt',header=T)
spain2[1:5,]

Output


    MAX  MIN MAXC MINC
1 47.8 42.1  8.8  5.6
2 46.0 41.0  7.8  5.0
3 45.3 39.2  7.4  4.0
4 48.9 38.5  9.4  3.6
5 46.0 32.7  7.8  0.4

Character Data

Objects can also include character data.

x87 <- c('A','B','G')

Names can be generated by pasting character and number entries together.

names14 <- paste('X', 1:5, sep='-')

Generating Random Values

A series of 20 fair (equal probability of head=1 or tail=0) coin flips can be generated by the random binomial generator

rbinom(n=20, size=1, prob=0.5)  
# size indicates the number of flips in each of the 20 trials

Samples from a set of numbers or names can be drawn.

sample(20)  # this arranges 1:20 in random order
sample(20,5)  # this draws 5 samples from 1:20
sample(20, replace=T)

The last example draws 20 samples from 1:20 with replacement, while the default is to sample without replacement. Use the help command to find more information about sample and paste.

x24 <- paste('trt',1:10,sep='')
sample(x24,5)

Example: Generating Randomized Treatment Maps

You can use random numbers and sampling in R to generate maps of treatments for experiments. For example, a completely randomized design assigns a set of treatments to experimental units at random throughout the set of experimental units. This is in contrast to a randomized complete block design, where each treatment is assigned to one experimental unit within a block. The easiest ways to conceptualize such experimental designs may be in the context of a field study in which a unit of land or a plot is an experimental unit or in terms of a greenhouse study in which a pot on a bench is an experimental unit.

For a completely randomized design, suppose there are ntrt treatments labeled 1 through ntrt, and nrepl replicates. The set of random treatment assignments can be generated by the following code, where arbitrary values of ntrt and nrepl are supplied as an example:

ntrt <- 8
nrepl <- 2
sample(rep(x=1:ntrt,times=nrepl))
# These treatments can be assigned to a map with particular dimensions
#   by putting the values in a matrix of desired dimension
temp <- sample(rep(x=1:ntrt,times=nrepl))
# note that the function ‘rep’ already exists in R
# and is not to be confused with our new variable ‘nrepl’
# try ‘help(rep)’ for more information
matrix(temp,nrow=nrepl,ncol=ntrt)

For a randomized complete block design, each treatment has to be assigned once within each block before the same treatment can be assigned again. If nrepl = 1 (an unrealistic case), then the order of treatments can be randomized by:

ntrt <- 8
nrepl <- 1
sample(1:ntrt)
# Suppose nrepl is greater than 1
nrepl <- 5
randout <- sample(1:ntrt)
for (j in 2:nrepl){randout <- c(randout,sample(1:ntrt))}
# If the blocks should appear in the map as columns
matrix(randout,nrow=ntrt,ncol=nrepl)

Statistical Distributions

Statistical distributions are often used as a tool to simplify comparisons between groups, such as comparisons of the mean and variance of plant biomass under different experimental treatments. R includes four types of functions for manipulating several distributions useful in biological modeling:

  1. random generation of variables from the distribution,
  2. returning values of the density,
  3. returning percentiles for the distribution, and
  4. returning quantiles of the distribution.

One of the most commonly used distributions is the normal distribution (aka the Gaussian distribution, aka the bell-shaped curve). The normal distribution is described completely by two parameters, the mean and the standard deviation. The following commands illustrate some properties of the Normal distribution.

# Illustrate a normal distribution 
# with mean 0 and standard deviation 1
# by drawing 100,000 random
# observations from the distribution

hist(rnorm(n=100000,mean=0,sd=1))

# If your computer is fast enough
# you can quickly draw 10,000,000
# (or more) random observations and revel in the power

hist(rnorm(n=10000000,mean=0,sd=1))

# You can get an idea of why small sample
# sizes only supply limited
# information about a process by
# sampling just a few random 
# observations at a time,
# while keeping the x axis range constant

hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4),
  xlab='Observation',main='Example')

# The mfrow command adjusts how many
# graphs are illustrated at once

par(mfrow=c(2,3))

# A loop can be used to illustrate several
# samples at once from the 
# normal distribution with mean 0 and standard deviation 1

for(j in 1:6){
  hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4),
    xlab='Observation',main='Example')
}

# Compare the normal distributions
# for a range of different parameters

hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1',
  xlab='Observation',xlim=c(-40,140))
hist(rnorm(n=100000,mean=0,sd=2),main='Mean=0, SD=2',
  xlab='Observation',xlim=c(-40,140))
hist(rnorm(n=100000,mean=0,sd=10),main='Mean=0, SD=10',
  xlab='Observation',xlim=c(-40,140))
hist(rnorm(n=100000,mean=100,sd=1),main='Mean=100, SD=1',
  xlab='Observation',xlim=c(-40,140))
hist(rnorm(n=100000,mean=100,sd=2),main='Mean=100, SD=2',
  xlab='Observation',xlim=c(-40,140))
hist(rnorm(n=100000,mean=100,sd=10),main='Mean=100, SD=10',
  xlab='Observation',xlim=c(-40,140))

# For reference, generate a single histogram
# of a normal distribution 
# with mean 0 and standard deviation 1 again

par(mfrow=c(1,1))
hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1',
  xlab='Observation')

# The density of the random distribution
# at any given value indicates
# how likely values in that range are
# What is the value of the density at the mean, 0?

dnorm(0,mean=0,sd=1)

# What is the value of the density
# for an unlikely part of the range, 4?

dnorm(4,mean=0,sd=1)

# The values of the density don't
# provide the probability of those
# values per se, but they give an idea
# of the relative probability
# The percentile provides information
# about the probability that
# values from a normal distribution
# fall below a particular value
# What is the probability that values fall below -1?

pnorm(-1,mean=0,sd=1)

# What is the probability that values fall below 0?
# Or, phrased differently,
# what proportion of values fall below 0?

pnorm(0,mean=0,sd=1)

# In statistical analyses, a normal distribution may be used
# to represent potential observations under a null hypothesis. 
# If observations fall far in the tails of the distribution, this
# suggests that the null hypothesis is not true.  For example,
# the distribution might represent the mean difference between
# responses to two treatments, where the mean difference is 0
# under the null hypothesis of no difference.
# For this normal distribution, what is the probability
# of an observation below -1.96?

pnorm(-1.96,mean=0,sd=1)

# What is the probability that an observation falls above 1.96?

1 - pnorm(1.96,mean=0,sd=1)

# Note that these are frequently used cut-offs, such that the
# probability an observation falls either below -1.96 or above 1.96 adds up to 0.05.
# Here 0.05 represents an acceptably small probability that the
# extreme mean difference could be observed for a sample even
# if the mean difference is actually 0.


# The quantile function allows you to ask
# a related but different type of question
# Below what value does half the distribution fall?

qnorm(0.5,mean=0,sd=1)

# Below what value does 2.5% of the distribution fall?
qnorm(0.025,mean=0,sd=1)

# Above what value does 2.5% of the distribution fall?
qnorm((1-0.025),mean=0,sd=1)
# Illustrate a normal distribution # with mean 0 and standard deviation 1 # by drawing 100,000 random # observations from the distribution hist(rnorm(n=100000,mean=0,sd=1)) # If your computer is fast enough # you can quickly draw 10,000,000 # (or more) random observations and revel in the power hist(rnorm(n=10000000,mean=0,sd=1)) # You can get an idea of why small sample # sizes only supply limited # information about a process by # sampling just a few random # observations at a time, # while keeping the x axis range constant hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4), xlab='Observation',main='Example') # The mfrow command adjusts how many # graphs are illustrated at once par(mfrow=c(2,3)) # A loop can be used to illustrate several # samples at once from the # normal distribution with mean 0 and standard deviation 1 for(j in 1:6){ hist(rnorm(n=3,mean=0,sd=1),xlim=c(-4,4), xlab='Observation',main='Example') } # Compare the normal distributions # for a range of different parameters hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=0,sd=2),main='Mean=0, SD=2', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=0,sd=10),main='Mean=0, SD=10', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=1),main='Mean=100, SD=1', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=2),main='Mean=100, SD=2', xlab='Observation',xlim=c(-40,140)) hist(rnorm(n=100000,mean=100,sd=10),main='Mean=100, SD=10', xlab='Observation',xlim=c(-40,140)) # For reference, generate a single histogram # of a normal distribution # with mean 0 and standard deviation 1 again par(mfrow=c(1,1)) hist(rnorm(n=100000,mean=0,sd=1),main='Mean=0, SD=1', xlab='Observation') # The density of the random distribution # at any given value indicates # how likely values in that range are # What is the value of the density at the mean, 0? dnorm(0,mean=0,sd=1)
 
# What is the value of the density
# for an unlikely part of the range, 4?
dnorm(4,mean=0,sd=1)

# The values of the density don't
# provide the probability of those
# values per se, but they give an
# idea of the relative probability
# The percentile provides information
# about the probability that
# values from a normal distribution
# fall below a particular value
# What is the probability that values fall below -1?
pnorm(-1,mean=0,sd=1)

# What is the probability that values fall below 0?
# Or, phrased differently,
# what proportion of values fall below 0?
pnorm(0,mean=0,sd=1)

# The quantile function allows you
# to ask a related but different
# type of question
# Below what value does half the distribution fall?
qnorm(0.5,mean=0,sd=1)

# Below what value does 5% of the distribution fall?
qnorm(0.05,mean=0,sd=1)

# Below what value does 2.5% of the distribution fall?
qnorm(0.025,mean=0,sd=1)

# Above what value does 2.5% of the distribution fall?
qnorm((1-0.025),mean=0,sd=1)

Try repeating the same types of analyses for another distribution, the binomial, using the functions rbinom, dbinom, pbinom, and qbinom. The binomial distribution has two parameters, the probability of success, prob, and the number of trials, size. The binomial distribution is often illustrated in terms of a series of flips of a coin, where success might be defined in terms of the coin flip yielding 'heads', and where prob=0.5 if the coin is fair. The binomial distribution can only take on non-negative integer values, in contrast to the normal distribution which includes any number from -∞ to ∞.

hist(rbinom(n=100000,size=10,prob=.5))

The Use of Statistical Distributions in Epidemiology

Applications of several distributions in ecology and epidemiology are illustrated using R in other Plant Health Instructor documents.

The exponential model can be used to describe disease progress over time or the declining probability of disease with increasing distance from an inoculum source.

The uniform distribution can be used to generate a random number across a continuous range in which every value is equally likely. Drawing from a uniform distribution is used for selecting initial coordinates of infections and for selecting directions for dispersal in a spatial epidemic simulator.

The binomial distribution can be used to describe the number of ‘successes’ out of a fixed number of trials. For example, it can be used to generate rain events in a disease simulation.

Example: The Beta-binomial Distribution for Analysis of Spatial Pattern

The binomial distribution can also be used to describe the number of diseased plants within a sampling area. For example, if a sampling quadrat contains size=40 plants and each plant has probability prob=0.1 of being infected, then the distribution of the number of plants infected within a quadrat is:

hist(rbinom(n=100000,size=40,prob=0.1),main='Number infected')

# This could also be expressed
# in terms of the proportion 
# of plants infected by dividing by size=40

hist(rbinom(n=100000,size=40,prob=0.1)/40,main='Proportion infected')

# On average, how many quadrats
# will have 10% or fewer plants 
# infected (4 out of 40 or fewer)?

pbinom(4,size=40,prob=0.1)

But when estimates of the proportion infection per quadrat are collected in the field, the binomial distribution may not always provide a good fit. For example, if disease is aggregated, the number of quadrats with high disease incidence and low disease incidence will be higher than would be expected for a binomial distribution. One way to deal with this lack of fit, and to provide a parameter that describes the degree of aggregation, is to fit disease incidence in quadrats using a beta-binomial distribution (Hughes and Madden 1993).

For the binomial distribution, the probability parameter is constant within any given realization. (That is, varying the probability parameter results in a different distribution.) For the beta-binomial distribution, the probability parameter is not constant but instead is described by the beta distribution.

The beta distribution is a very flexible distribution describing probabilities between 0 and 1 with two shape parameters. R provides four different types of functions for describing the beta distribution and a range of other distributions, as illustrated for the normal distribution above.

hist(rbeta(n=100000,shape1=1,shape2=1))

 
# The next command allows you to plot
# figures in 2 rows and 3 columns

par(mfrow=c(2,3))

# Try plotting several parameter
# combinations to compare the 
# distributions for values 0.5, 1, 2, 3, and 4
# for each shape parameter

hist(rbeta(n=100000,shape1=1,shape2=1))

# Now repeat the hist command with different parameters
 
# We can generate beta-binomial
# random variables in R by using
# beta random variables as parameters
# in the binomial function

par(mfrow=c(1,1))
hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=1,shape2=3)))

The shape1 parameter is often notated as α and the shape2 parameter as β. Then θ = 1/(α + β) measures the variability in the number of events, where events might be, for example, the number of infected plants per quadrat. So, as the sum of the two shape parameters becomes larger, the variability decreases. You can see that this makes sense because as the shape parameters become larger, the beta distribution becomes more centralized.


hist(rbeta(n=100000,shape1=2,shape2=1),main='shape1=2,shape2=1',
  xlim=c(0,1))

# …as compared to…

hist(rbeta(n=100000,shape1=26,shape2=30),main='shape1=32,shape2=30',
  xlim=c(0,1))

Since the beta distribution supplies the probability parameter in the beta-binomial distribution, the less variable the beta distribution is, the less variable the beta-binomial will be.

hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=2,shape2=1)),
  main='shape1=2,shape2=1')

# …as compared to…

hist(rbinom(n=100000,size=10,prob=rbeta(n=100000,shape1=26,shape2=30)),
  main='shape1=32,shape2=30')

Statistical Analyses

One of the most common uses of R is for statistical analysis of data. There are a number of references available describing analyses such as regression, analysis of variance, nonparametric analyses, and resampling approaches in R. Applications of statistical analyses are also illustrated in the following Plant Health Instructor documents: Modeling Plant Disease Progress Over Time, Modeling Dispersal Gradients, Introduction to Spatial Analysis, Disease Forecasting and Validation for Plant Pathology.

Writing Functions

You can create new custom functions in R to do more complicated calculations.

As a simple example, suppose you want to write a function that will return the squared value of whatever value you enter into the function. You can create such a function with the following command.

square.it <- function(x){x^2}

Note that those are curly brackets around x^2. Here x is used temporarily and the intention is not usually to have an object named x read in, though that is what will happen if no value for x is specified when the function is applied (which can lead to confusion). Try these commands, checking the value of x and y98 after each command.

x <- 2
y98 <- square.it(x=5)
y98 <- square.it(x)
y98 <- square.it(x=3)

Note that the value in the object x does not change when the function is applied, unless the output of the function is assigned to x.

x <- square.it(x=5)

More complicated functions can be made that include multiple lines of commands split by a semi-colon or on separate lines. The variables created inside the function, temp1 and temp2, are also just used temporarily.

fun91 <- function(x,y){temp1 <- x*y; temp2 <- log10(temp1); temp2}

Or, equivalently, to make the function easier to read:

fun91 <- function(x,y){
  temp1 <- x*y
  temp2 <- log10(temp1)
  temp2
}
This example gives the value in temp2 as output. Try:
fun91(x=10,y=100)

The function can be applied to any objects x and y that make sense. For example:

fun91(x=c(1,2),y=c(10,100))

The function can also be applied to pre-existing objects.

z1 <- c(1,2)
z2 <- c(10,100)
fun91(x=z1,y=z2)

Closing a Session

Type q() to close the R window.

You have the option of saving objects and functions if you expect to use them again. In that case make sure you have made note of the names of objects so that there will be no confusion in future sessions.

Acknowledgments

We appreciate the very helpful comments of P. Garfinkel, J. Hadam, S. Pethybridge, PHI Reviewers, and members of the 2006 course in Ecology and Epidemiology of Plant Pathogens at KSU. It’s also a pleasure to acknowledge support by the U.S. National Science Foundation under Grants DEB-0130692, DEB-0516046, EF-0525712 (as part of the joint NSF-NIH Ecology of Infectious Disease program) and DBI-0630726, by the Ecological Genomics Initiative of Kansas through NSF Grant No. EPS-0236913 with matching funds from the Kansas Technology Enterprise Corporation, by the Office of Science (Program in Ecosystem Research), U.S. Department of Energy, Grant No. DE-FG02-04ER63892, by the U.S. Agency for International Development for the Sustainable Agriculture and Natural Resources Management Collaborative Research Support Program (SANREM CRSP) under terms of Cooperative Agreement Award No. EPP-A-00-04-00013-00 to the Office of International Research and Development at Virginia Tech and for the Integrated Pest Management CRSP, by USDA grant 2002-34103-11746, and by the NSF Long Term Ecological Research Program at Konza Prairie. This is Kansas State Experiment Station Contribution No. 07-310-J.

References

  1. Bolker, B. In press . Ecological Models and Data in R. Princeton University Press. http://www.zoo.ufl.edu/bolker/emdbook.

  2. Crawley, M.J. 2007. The R Book. Wiley.

  3. Esker, P.D., Sparks, A.H., Antony, G., Bates, M., Dall' Acqua, W., Frank, E., Huebel, L., Segovia, V., and Garrett, K.A. 2007. Ecology and epidemiology in R: Modeling dispersal gradients. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2007-1226-03

  4. Esker, P.D., Sparks, A.H., Campbell, L., Guo, Z., Rouse, M., Silwal, S.D., Tolos, S., Van Allen, B., and Garrett, K.A. 2008. Ecology and epidemiology in R: Disease Forecasting.. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2008-0129-01

  5. Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., and Dudoit, S., eds. 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York.

  6. Hughes, G., and Madden, L.V. 1993. Using the beta-binomial distribution to describe aggregated patterns of disease incidence. Phytopathology 83:759-763.

  7. Murrell, P. 2006. R Graphics. Chapman & Hall/CRC, Boca Raton.

  8. R Development Core Team. 2007. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

  9. Sparks, A..H., P.D. Esker, G. Antony, L. Campbell, E.E. Frank, L. Huebel, M.N. Rouse, B.Van Allen, and K.A. Garrett, 2008. Ecology and epidemiology in R: Introduction to spatial analysis. The Plant Health Instructor. On-line: DOI:10.1094/PHI-A-2008-0129-03.

  10. Sparks, A.H., Esker, P.D., Bates, M, Dall' Acqua, W., Guo, Z., Segovia, V. Silwal, S.D., Tolos, S., and Garrett, K.A. 2008. Ecology and epidemiology in R: Disease Progress over Time. The Plant Health Instructor. Online: DOI:10.1094/PHI-A-2008-0129-02

  11. Venables, W.N., Smith, D.M., and the R Development Core Team. 2007. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.6.1. http://cran.r-project.org/doc/manuals/R-intro.pdf.