Author
Affiliation

Alex Kaizer

University of Colorado-Anschutz Medical Campus

This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Labs collection, the mini-lecture content delivered at the start of class before breaking out into small groups to work on the homework assignment.

What’s on the docket this week?

In Week 3 we are introducing the basics of building our own functions in R by starting with the basic set-up. We will also introduce the apply family of functions which are similar to loops, but allow for the potential of faster code that may also be more compact.

Writing Functions

Chapter 2.4 of our R textbook introduces functions in the R context and how to write your own. The end of Chapter 2.5 also includes a nice graphic that breaks down what our textbook considers the 5 components of a function.

In brief, a function has a simple set-up:

Code
my_function <- function(arguments) {
  the body of your function that will do stuff
}

In the above code chunk we are creating a function called my_function that will ultimately do whatever we want in the body of our function. Functions can have additional information passed into the function environment by the arguments.

While not universally true, functions generally operate in their own environment. When using R Studio we can look at what objects are stored in our “Global Environment”. So specifying a value, such as

Code
my_value <- 'stats_is_great'

will then appear in our Global Environment until we either remove it or replace it (e.g., using rm(my_value) or remove(my_value)). However, specifying something within a function will generally not appear in the Global Environment. For example, let’s define a function whose sole purpose is to return stats_is_great, which is saved in an object within the function:

Code
grrreat <- function(){
  my_value_fnx <- 'stats_is_great'
  return( my_value_fnx ) #this will return whatever is stored in my_value_fnx for the output of our function
}

grrreat() #let's check out our function
[1] "stats_is_great"

If you try running the two snippets of code in R Studio, you should see my_value appear in the Global Environment, but my_value_fnx does not since it is housed within the function grrreat().

Make Our Own Mean/Average Function

To get some practice at making functions, I think it is nice to start with things where we can directly check our work with existing functions or where we know what the answer should be. For this example we will write a function to calculate the mean, and then check our work with the actual mean function.

Let’s start by writing our function to calculate a sample mean. I like to start my functions with a little commented header to (1) describe what the function does and (2) define what the argument are. This helps me whenever I come back to functions after some time has passed:

Code
sample_mean <- function(dat){
### This is a function to calculate the sample mean
# dat: a vector of values to calculate the mean for
  n <- length(dat) # calculate the sample size
  sum_x <- sum(dat) # sum up all values in dat
  sample_mean <- sum_x / n # calculate the sample mean
  return( sample_mean ) # this isn't necessary here since R should return the last line of code, but I like to include it to be explicit about what I want returned
}

Now that we’ve written a function, let’s generate a sample of data to compare our calculation to the existing mean function:

Code
set.seed(515)
samp_dat <- rnorm( n=50, mean=14, sd=10 )

sample_mean( dat = samp_dat )
[1] 15.64743
Code
mean( samp_dat )
[1] 15.64743

Thankfully, the results match! In future weeks we will discuss some more advanced pieces of building a function (e.g., returning multiple pieces of output), but for now we can use this as a foundation for writing functions.

Additional Function Notes

Our function for the mean above was pretty simple in that we only had one argument (dat) and one piece of information returned (sample_mean). We can write functions that have multiple arguments for a given task and/or return multiple objects. Further, you can also return objects including more than just a vector of information, such as a list object where multiple types of objects can be stored (more on this in a few weeks).

The Apply Family

One area our R Textbook is a bit lacking is a discussion of the apply family of functions in R. DataCamp has a nice tutorial that delves into some of the specifics, of which we’ll highlight a few pieces below.

There are multiple functions that exist in this family: apply, sapply, lapply, mapply, etc. The choice of the apply function to use generally depends on the structure of the data you are working with, and also the format of the output you’d like (e.g., a vector versus a list).

The apply() function

The apply function operates on arrays, which for our purposes will generally mean 2-dimension matrices. It has three major arguments you can review in the help documentation for greater detail, but they include:

  • X: the array to do some calculation on
  • MARGIN: either 1 to indicate applying the function to each row, 2 to indicate applying the function to each column, or c(1,2) to indicate applying the function to both rows and columns (i.e., each element of the array)
  • FUN: the function you wish to apply the rows and/or columns in your array

For example, let’s simulate 50 observations from a random exponential distribution and place them into a matrix with 10 rows and 5 columns:

Code
set.seed(515)
my_mat <- matrix( rexp(n=50, rate=0.2), nrow=10, ncol=5) # here we are simulating the data using the "rexp" function within the "matrix" function itself so that we can include this as one line of code

my_mat # let's see what our data looks like
           [,1]       [,2]        [,3]       [,4]        [,5]
 [1,] 12.217353 17.2375973  0.14967508  1.4993488  1.49949970
 [2,]  8.658003  4.7737718  1.18769738  3.3486665  9.63063688
 [3,] 19.308967  0.5294249  0.08559512  2.2290962 10.03797446
 [4,]  3.589007  1.8239578  2.38238175  2.4613284  0.05722932
 [5,]  3.008182  2.5826566  0.79034221  3.5895691  0.90203442
 [6,]  5.557042  1.4607429  3.96179643  0.9797761  0.38814744
 [7,]  1.783477  3.7350945  3.17351118 20.2186521  6.46094538
 [8,]  5.765822  4.0633058  4.18369018  3.3436216  6.23468792
 [9,]  2.993530  0.4073399  7.43696442  1.0924128  5.81720596
[10,]  3.288153  2.0768742 23.82502088  9.3102057  2.14733849

Now let’s calculate the mean value of each row (MARGIN=1) and column (MARGIN=2):

Code
apply( X=my_mat, MARGIN=1, FUN=mean) # calculate the mean for each row
 [1] 6.520695 5.519755 6.438212 2.062781 2.174557 2.469501 7.074336 4.718225
 [9] 3.549491 8.129518
Code
apply( X=my_mat, MARGIN=2, FUN=mean) # calculate the mean for each column
[1] 6.616954 3.869077 4.717667 4.807268 4.317570

We see that we have 10 means for the 10 rows, and 5 means for the 5 columns.

For Real, What Good is MARGIN=c(1,2)

For MARGIN=c(1,2) there admittedly may not be as many uses with a 2-dimensional array. But if we wanted to concatenate some string to each value we could consider using the function.

As an example, let’s say we wanted to concatenate a “$” as though these were prices. To do this we may want to consider using the paste0() function, but we have to remember that it needs to be written in a way that can apply to each element of my_mat separately. To do this, we may first want to write a function that concatenates a dollar sign to whatever string or number is given, then use the apply() function:

Code
money_money_money <- function(x){ paste0('$',x) }
apply( X=my_mat, MARGIN=c(1,2), FUN = money_money_money)
      [,1]                [,2]                 [,3]                 
 [1,] "$12.2173525666105" "$17.2375972560587"  "$0.149675077271274" 
 [2,] "$8.65800315572815" "$4.77377176263785"  "$1.1876973762149"   
 [3,] "$19.308967021175"  "$0.529424920678139" "$0.0855951215619483"
 [4,] "$3.58900716041631" "$1.82395784649998"  "$2.38238175399601"  
 [5,] "$3.00818230491132" "$2.58265662286664"  "$0.790342205933128" 
 [6,] "$5.55704173116321" "$1.46074291623291"  "$3.96179643438614"  
 [7,] "$1.78347686072811" "$3.73509448742561"  "$3.17351118428633"  
 [8,] "$5.76582155635065" "$4.0633058367693"   "$4.18369018451726"  
 [9,] "$2.9935301374644"  "$0.40733985370025"  "$7.43696442082858"  
[10,] "$3.28815302578732" "$2.07687421934679"  "$23.8250208819721"  
      [,4]                 [,5]                 
 [1,] "$1.49934881060734"  "$1.49949970077926"  
 [2,] "$3.34866651333869"  "$9.63063687975419"  
 [3,] "$2.22909616772085"  "$10.0379744633557"  
 [4,] "$2.46132835591099"  "$0.0572293228469789"
 [5,] "$3.58956913397824"  "$0.902034419123083" 
 [6,] "$0.979776070453227" "$0.388147439807653" 
 [7,] "$20.2186521411616"  "$6.46094538250423"  
 [8,] "$3.34362160181627"  "$6.23468791579312"  
 [9,] "$1.09241279074922"  "$5.81720596158362"  
[10,] "$9.31020567151618"  "$2.14733849463794"  

We could also define the function directly within the apply statement, or consider rounding each value to 2 decimal places before adding the dollar sign, or do both!

Code
# Define the function within the apply statement:
apply(  X=my_mat, MARGIN=c(1,2), FUN = function(x){ paste0('$',x) })
      [,1]                [,2]                 [,3]                 
 [1,] "$12.2173525666105" "$17.2375972560587"  "$0.149675077271274" 
 [2,] "$8.65800315572815" "$4.77377176263785"  "$1.1876973762149"   
 [3,] "$19.308967021175"  "$0.529424920678139" "$0.0855951215619483"
 [4,] "$3.58900716041631" "$1.82395784649998"  "$2.38238175399601"  
 [5,] "$3.00818230491132" "$2.58265662286664"  "$0.790342205933128" 
 [6,] "$5.55704173116321" "$1.46074291623291"  "$3.96179643438614"  
 [7,] "$1.78347686072811" "$3.73509448742561"  "$3.17351118428633"  
 [8,] "$5.76582155635065" "$4.0633058367693"   "$4.18369018451726"  
 [9,] "$2.9935301374644"  "$0.40733985370025"  "$7.43696442082858"  
[10,] "$3.28815302578732" "$2.07687421934679"  "$23.8250208819721"  
      [,4]                 [,5]                 
 [1,] "$1.49934881060734"  "$1.49949970077926"  
 [2,] "$3.34866651333869"  "$9.63063687975419"  
 [3,] "$2.22909616772085"  "$10.0379744633557"  
 [4,] "$2.46132835591099"  "$0.0572293228469789"
 [5,] "$3.58956913397824"  "$0.902034419123083" 
 [6,] "$0.979776070453227" "$0.388147439807653" 
 [7,] "$20.2186521411616"  "$6.46094538250423"  
 [8,] "$3.34362160181627"  "$6.23468791579312"  
 [9,] "$1.09241279074922"  "$5.81720596158362"  
[10,] "$9.31020567151618"  "$2.14733849463794"  
Code
# Let's add some rounding to the mix:
apply(  X=my_mat, MARGIN=c(1,2), FUN = function(x){ paste0('$', round(x,digits=2) ) })
      [,1]     [,2]     [,3]     [,4]     [,5]    
 [1,] "$12.22" "$17.24" "$0.15"  "$1.5"   "$1.5"  
 [2,] "$8.66"  "$4.77"  "$1.19"  "$3.35"  "$9.63" 
 [3,] "$19.31" "$0.53"  "$0.09"  "$2.23"  "$10.04"
 [4,] "$3.59"  "$1.82"  "$2.38"  "$2.46"  "$0.06" 
 [5,] "$3.01"  "$2.58"  "$0.79"  "$3.59"  "$0.9"  
 [6,] "$5.56"  "$1.46"  "$3.96"  "$0.98"  "$0.39" 
 [7,] "$1.78"  "$3.74"  "$3.17"  "$20.22" "$6.46" 
 [8,] "$5.77"  "$4.06"  "$4.18"  "$3.34"  "$6.23" 
 [9,] "$2.99"  "$0.41"  "$7.44"  "$1.09"  "$5.82" 
[10,] "$3.29"  "$2.08"  "$23.83" "$9.31"  "$2.15" 

We see from the above output, that the first statement recreates our result with the function defined within apply. In the second result we see we successfully rounded to 2 decimal places (although R has this nasty default habit of rounding up to the desired number of digits, so extra steps are needed if we want exactly 2 decimal places).

The lapply() and sapply() Functions

The lapply function applies a given function to every element of a list (or vector or data frame, etc.) and returns a list as a result. The sapply function is a wrapper to the lapply function that is a “user-friendly version” (see the help documentation for this quote) that attempts to return a vector or matrix. The “right” function to use will depend on what format you desire the output to be in (potentially to help complete some next step for plotting, summarizing, etc.). Or, in other words, there is no one right function…just different ones that are more useful in certain situations!

A lot of the set-up for loops that we discussed in the lab for last week can be similarly set-up for sapply. For example, we could use an index approach and recreate our column-specific means from apply in the previous section:

Code
sapply( X=1:5, FUN = function(x) mean( my_mat[,x]) ) # notice how we define the function within sapply here to keep it on one line
[1] 6.616954 3.869077 4.717667 4.807268 4.317570
Code
apply( X=my_mat, MARGIN=2, FUN=mean) # check our results
[1] 6.616954 3.869077 4.717667 4.807268 4.317570

And, for completeness, we can see that lapply returns the means in a list:

Code
lapply( X=1:5, FUN = function(x) mean( my_mat[,x]) )
[[1]]
[1] 6.616954

[[2]]
[1] 3.869077

[[3]]
[1] 4.717667

[[4]]
[1] 4.807268

[[5]]
[1] 4.31757

We can also define X as values to be entered into a function. For example, perhaps we want to simulate sample sizes of 5, 10, and 15 from an F-distribution. Here we will use lapply so we can see the results are actually 5, 10, and 15 in length:

Code
lapply( X=c(5,10,15), FUN = function(x) rf(n=x, df1=15, df2=45) )
[[1]]
[1] 1.4579813 1.0352903 1.3878726 0.9470038 0.9811442

[[2]]
 [1] 0.8471400 1.3279725 1.1581030 0.7450090 0.8182207 1.4929248 0.2743359
 [8] 0.3669082 0.7668954 1.1297963

[[3]]
 [1] 1.3295476 0.9085403 0.5368008 0.9059368 1.0249046 1.1086250 0.4956291
 [8] 1.1187265 0.7603356 0.5230847 1.6806868 1.1895109 0.7753558 1.0400509
[15] 0.9721089

In fact, if we tried using sapply we’d see the results are also returned as a list since this is R’s best guess as to the most appropriate object type!

Combining loops and apply functions?

Hopefully by now we’ve caught on that R is verrrry flexible. In fact, we can combine different approaches or the same approach multiple times. For example, we can nest for loops, nest apply statements, or loop through some apply statements!

For example, Homework 3, Exercise 4, asks us to examine the central limit theorem at play with the mean from binomially simulated data increasing from \(n=10\) up to \(n=50\) in increments of 10 over 500 simulations.

For this lab, let’s examine a related problem where we simulate from Poisson distributed data with \(\lambda=3\) at sample sizes of \(n=10\), \(n=30\), and \(n=50\) with 500 simulations. We’ll calculate the mean for each simulation, store the results, and then calculate the mean and standard deviation of \(\bar{x}\) across all 500 simulations using an apply statement. We’ll explore 2 different approaches based on nested loops or using sapply().

for loops solution

If we wish to use for loops, we will likely need to consider nesting two of the loops so we can (1) go through our 3 sample sizes and (2) go through the 500 simulations. For the sample sizes we will need to define a vector to reference, and also index the simulation so we can more easily store the results in a matrix we initialize for the results:

Code
set.seed(515)
nsim <- 500
sizeVec <- c(10,30,50)
meanMatrix <- matrix(NA, nrow=nsim, ncol=length(sizeVec))

for( j in 1:length(sizeVec) ){ # our outer loop will go through the sample sizes
  for( i in 1:nsim){ # our inner loop will go through the number of sims we desire
    poisData <- rpois(n=sizeVec[j], lambda=3) # simulate the data
    meanMatrix[i,j] <- mean(poisData) # save the sample mean in the ith row (1 to 500) and jth column (1 to 3)
  }
}

Let’s first take a peek at the first few rows of our matrix using the head() function, then use the apply() function to calculate the column mean for each sample size:

Code
head(meanMatrix) # see the top few rows of our matrix
     [,1]     [,2] [,3]
[1,]  2.6 3.033333 3.54
[2,]  3.6 2.833333 2.98
[3,]  3.6 2.666667 2.90
[4,]  3.1 3.166667 2.92
[5,]  2.7 2.500000 2.94
[6,]  3.0 2.733333 2.90
Code
apply(X=meanMatrix, MARGIN=2, FUN=mean)
[1] 2.982200 2.975933 3.010560
Code
apply(X=meanMatrix, MARGIN=2, FUN=sd)
[1] 0.5317788 0.3067211 0.2472593

We see that the average of the sample means is pretty close to the true mean of \(\lambda=3\) across all sample sizes, but that as the sample size increases the standard deviation of the sample mean (also known as the standard error) decreases. In other words, there is less variability among all estimated means when we have a larger sample size.

sapply Solution

We can also nest our sapply functions within one another, and it will automatically create a matrix with 3 columns and 500 rows to store the results in. WARNING, this gets a bit nesty… (get it, nasty…but with nesting…I’ll show myself out):

Code
set.seed(515)
nsim <- 500
sizeVec <- c(10,30,50)

meanMatrix_sapply <- sapply( X=sizeVec, FUN= function(x) sapply(X=1:500, FUN=function(y) mean(rpois(n=x,lambda=3)) ) )

Before we look at the first few rows and check that the mean and SD are the same as above, let’s break down what the heck is happening here:

sapply( X=sizeVec, FUN= function(x) sapply(X=1:500, FUN=function(y) mean(rpois(n=x,lambda=3)) ) )

The red text is our “outer” sapply statement, which is essentially going through our sample size vector. It is applying its function(x) to the blue “inner” sapply statement. This inner statement is then going through our vector 1:500 to generate 500 samples where is it applying the function(y) (note the use of y instead of x as our function’s argument to differentiate it) to the actual generation of the Poisson sample of size x (i.e., \(n\)) and calculate its mean.

PHEW!! That is…a lot. In cases like this, it may just be easier (and potentially more intuitive) to use two loops (or a different approach you’ve stumbled upon that involves neither!).

Let’s now double check our results to see they match the above for loop results:

Code
head(meanMatrix_sapply)
     [,1]     [,2] [,3]
[1,]  2.6 3.033333 3.54
[2,]  3.6 2.833333 2.98
[3,]  3.6 2.666667 2.90
[4,]  3.1 3.166667 2.92
[5,]  2.7 2.500000 2.94
[6,]  3.0 2.733333 2.90
Code
apply(X=meanMatrix_sapply, MARGIN=2, FUN=mean)
[1] 2.982200 2.975933 3.010560
Code
apply(X=meanMatrix_sapply, MARGIN=2, FUN=sd)
[1] 0.5317788 0.3067211 0.2472593

Cha-ching! They do match, so even this more potentially chaotic approach worked out.