An introduction to basic R functions, data types, and structures
Hello again! Today we’re going to dive into the idea of functions in coding. Functions are a fundamental building block in coding generally, and you’ve already encountered and used many of them written by others (e.g. dplyr::select()
, base::mean()
etc). Today we’ll discuss why they are so useful, how they work, and how to write and use your very own functions!
Generally, a function does all the regular coding you’re used to, but it is built using abstract, general data structures rather than concrete (locally loaded/stored) data. What does this mean? In brief, a function is a shell of code that is written in general terms and by itself does not do anything or manipulate any data until we later “run it.” When we write a function, we are actually telling the computer to store the code that we write in memory for later use.
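A minimal sketch of this idea, using a hypothetical function add_two (my own illustration, not from any package): running the definition only stores the code, and nothing is computed until the function is called.

```r
# Hypothetical example: running these lines only stores the code under the name add_two
add_two <- function(x) {
  return(x + 2)
}

# Nothing has been computed yet; calling the function is what runs the stored code
add_two(3)
```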
This makes functions really convenient for repeated use when we want to do the same basic thing many times, typically with different data. Instead of copy-pasting the same code over and over, we can simply re-run a function over and over. This is both easier to follow and read, and “safer” in terms of avoiding mistakes. A general rule of thumb in coding is that if you have to copy-paste any significant portion of code to do something two or more times, you’re better off just building a function and running it twice! This is good coding practice for two reasons: first, it actually saves you time in the long run (it may seem unnecessary and difficult at first, but the more practice you get the more use you’ll see for it), and second, it saves you from your own silly mistakes when copy-pasting. You’d be amazed at how easy it is to cause yourself all kinds of headaches simply because you accidentally wrote over an object you needed, mislabeled something, or forgot to change some small part of the code when copy-pasting. You lose a lot of time (and sanity) tracking down these small mistakes because they often don’t actually produce any code errors, just wrong answers.
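As a rough illustration of the copy-paste rule of thumb, here is a small sketch. The function rescale01 and the vectors heights and weights are hypothetical names made up for this example.

```r
# Hypothetical helper: rescale a numeric vector so it runs from 0 to 1
rescale01 <- function(v) {
  return((v - min(v)) / (max(v) - min(v)))
}

# Made-up data for illustration
heights <- c(150, 160, 170)
weights <- c(50, 60, 90)

# One function, reused on different data, instead of two copy-pasted formulas
rescale01(heights)
rescale01(weights)
```

If the formula ever needs to change, it changes in one place only, which is exactly where copy-pasted versions tend to drift apart.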
Anything in R that has ...()
is a function. For example, when we type test()
we are indicating to R that ‘test’ is a function it should already know about and we want it to now run that function. If we were to type just test
on the other hand, we would be indicating to R that ‘test’ is an object, an actual piece of computer memory that stores some data.
#since I have not told R what "test" or "test()" may be both of these lines cause an error
#you can see from the error messages how "()" changes R's expectations
test
Error in eval(expr, envir, enclos): object 'test' not found
test()
Error in test(): could not find function "test"
The general structure of all functions is the following:
my_function <- function(arguments) {
#fill in your genius code here
return(output)
}
my_function
First let’s just break down each component, and then below I’ll give you some more specific things to keep in mind. First, on the left, we have the name:
my_function
- Function’s Name: just as with objects, we give a function a name so we can later refer back to it. In the above example, I’ve very creatively named it “my_function”.
my_function <- function(...) {}
- this tells R that the object my_function
will be a function. You can see from the handy color-coding in your text editor that both the function()
and return()
language are somewhat special. This syntax is specially reserved in R
to construct functions, and all functions that you write must have this structure. Elements inside the parentheses function(...)
are referred to as inputs or arguments of the function - they are the data you “pass in” for the function to work with. Everything inside the following brackets {}
forms the shell code that the function will perform when it is later called.
return(...)
- this tells R what to “return” from the function. In other words, when we run the function what does it give us back?
Functions are always a two-stage process. First, we “declare the function,” writing out the function constructor syntax and filling in all the code we want the function to perform, then initializing the function by running this constructor code. At this stage all we are doing is telling R to save the contents of your function to memory. It WILL NOT run/evaluate any of the internal code at this stage. Here we are simply writing the basic shell that we will use later. The second stage is where we “call the function.” This is where we actually ask R
to run the code inside the function and thus where we would need to supply the arguments (that do not have a default, more on this below) in order for things to work.
What does this mean practically? If you have any errors, R
will not find them until you call the function and try and run it on actual data! Additionally, if you edit your function in any way, you must declare it again (rerun the constructor code) for R
to know you made any changes (we have to resave the code it associates with the object that is our function name).
#let's say I have a typo and write "frist" inside the function, instead of "first"
#since R has no idea what "frist" is, this will produce an error
#but just establishing the function and running the following reports no errors !
# why? because R has not actually evaluated the code
broken <- function(first, second) {
sum <- frist+second
return(sum)
}
#when we actually call the function however, it complains that it has no idea what "frist" is
broken(3,2)
Error in broken(3, 2): object 'frist' not found
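The other practical point above, that an edited function must be declared again before R knows about the change, can be sketched the same way. The function greet is a hypothetical example.

```r
# First declaration (hypothetical example function)
greet <- function(name) {
  return(paste("Hello", name))
}
first_result <- greet("R")
first_result

# Changing the text of the function does nothing on its own;
# the new version only exists once this definition has been run again
greet <- function(name) {
  return(paste("Goodbye", name))
}
second_result <- greet("R")
second_result
```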
Most functions require some input data that they’ll use internally. Others may not require any inputs. For example “Sys.time()” doesn’t need any data to tell you what time it is. When arguments are written without a “default” they are mandatory and the function will not run without them. For these arguments, R
is expecting some form of input data and will halt with an error if it doesn’t receive it when the function is called. On the other hand, we can give arguments a default when we set up the function, so that the function will proceed using the default value even when no data is supplied for them. Of course, the user can always overwrite a default in the “function call,” meaning that any argument passed in when the function is called will always trump the default.
Note that in R
you do not specify what type/class of data these arguments will be, unlike in some other languages. This enables great flexibility! An example is near the end of this post.
#I set x to have a default of 4
#I have set y to have no default
#this function "returns" or gives back to the user simply the sum of x and y
ex_1 <- function(x = 4, y) {
out <- x+y
return(out)
}
#because I do not tell it what y is going to be it gives me an error
ex_1()
Error in ex_1(): argument "y" is missing, with no default
#when I do not tell it what x is, it's no problem however! just uses default x = 4
ex_1(y = 1)
[1] 5
#i can always overwrite a default directly in the "function call"
ex_1(x = 0, y = 1)
[1] 1
#some functions have no inputs
Sys.timezone()
[1] "America/Los_Angeles"
Arguments can be “passed in” to a function by position/order or by name. Passing in arguments by order allows you to avoid a little typing when “calling the function” (running it). Instead of fully typing out the names of each of the function’s arguments, you can pass in data directly according to the order specified when the function was declared, and R
will rely on the order of the arguments to figure out which inputted data refers to which function argument.
#this function simply returns the first argument divided by the second
ex_2 <- function(first, second) {
return(first/second)
}
#passing in based on order assumes first = 1, second = 3
ex_2(1,3)
[1] 0.3333333
ex_2(3,1)
[1] 3
Alternatively, you can always use the name of the arguments to clearly tell R what refers to what. This might be a tad more typing, but it also allows you to not worry about remembering what the correct order should be.
#passing in based on name
#most intuitively:
ex_2(first = 1, second = 2)
[1] 0.5
#but this also works,
ex_2(second = 2, first = 1)
[1] 0.5
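The two styles can also be mixed. A small sketch, redefining ex_2 so the block runs on its own: R first matches any named arguments, then fills the remaining arguments with the unnamed values in order.

```r
# Same function as above, repeated so this block is self-contained
ex_2 <- function(first, second) {
  return(first/second)
}

# "second" is matched by name, so the unnamed value 1 is assigned to "first"
mixed_a <- ex_2(1, second = 4)
mixed_b <- ex_2(second = 4, 1)
mixed_a
mixed_b
```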
Functions always “return” some “outputs/values”. This is very general - you can return anything from a simple number all the way to a very complex plot, and you can return as many objects as you want of whatever type! This is the real power of a function - we can get a bunch of coding done all in one nice clean chunk and return all kinds of results at one time.
In general, it is good coding practice to explicitly state what to return using return(...)
. That said sometimes you may see functions that don’t have an explicit return statement. R
is kind enough to assume that it should return the value of the last expression it evaluated (the last line), but I consider this poor coding practice (many other languages would not let you do this). I really don’t recommend skipping the explicit return statement. For readability of your code and to avoid any confusion between what R does for you and what you want it to do, just stick with return().
In R
, if you want to return multiple objects, you must wrap them inside a list object. Notice that naming the elements of your list will enable you to use the more convenient $
syntax rather than the gross [[...]][...]
of lists.
#nicely labeled means i can access the different results using $
ex_3 <- function(a, b, c) {
sum <- a+b+c
frac <- (a+b)/c
out <- list(sum = sum,
fraction = frac)
return(out)
}
test <- ex_3(1,1,2)
class(test)
[1] "list"
test$fraction
[1] 1
test$sum
[1] 4
test
$sum
[1] 4
$fraction
[1] 1
#more lazy now, means more headache in the [[]][] indexing later :(
ex_3 <- function(a, b, c) {
sum <- a+b+c
frac <- (a+b)/c
return(list(sum,frac))
}
test <- ex_3(1,1,2)
class(test)
[1] "list"
test
[[1]]
[1] 4
[[2]]
[1] 1
test[[1]]
[1] 4
Here’s an example of how R
will take the last line of your code as your intended output if you do not explicitly state it. Again, I wouldn’t recommend this, but you may sometimes see others do it.
#note that if i do not explicitly write a return statement we get the last line of code
frac <- function(A,B) {
#we can add code here that R will run, but it will not output because it is not returned!
A+B
#if we don't use return R will assume the last line of code is what we want
A/B
}
frac(6,7)
[1] 0.8571429
[1] 4.000000 2.000000 1.333333
[1] 1 1 1
The reason for having to wrap everything in a list is beyond your current need-to-know level of comp-sci, but if you’re interested, it is roughly as follows: computers require data assigned to the same piece of memory to be of the same type. This means that you cannot save both a number and a character to the same vector (you might recall R
will just turn the number into a character so they match). This is an issue for functions because they return a single object that is associated with a single piece of computer memory. So if you want your function to output data of different types, what do you do?! Enter lists to save the day! Recall that list objects allow you to combine all different types of data in one object. The way they do this is by saving each element to a separate piece of memory, so the elements can stay different types, and then saving the pointers (locations) to each of these separate objects together in the list object’s own piece of memory. Since pointers are all of the same type, we can save them all together!
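The difference between a vector and a list is easy to check directly. A short sketch with made-up values:

```r
# A vector must hold a single type, so the number is silently turned into a character
mixed_vector <- c(1, "a")
class(mixed_vector)

# A list keeps each element as its own object with its own type
mixed_list <- list(1, "a")
class(mixed_list[[1]])
class(mixed_list[[2]])
```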
In fancy codespeak, we refer to the name of a function as its “namespace.” Because R
is very flexible, it will allow you to write directly over almost anything you want. You should therefore be careful not to overwrite the “namespace” of standard/commonly used functions. In other words, there is nothing preventing you from doing the following:
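A minimal sketch of what such an overwrite could look like. The vector ex is a hypothetical stand-in, chosen here so that its mean is 4.5; running rm(mean) afterwards removes the overwritten copy and brings the usual mean() back.

```r
# Hypothetical data, chosen so that the mean is 4.5
ex <- c(3, 4, 5, 6)

# The standard function works as expected
mean(ex)

# Nothing stops us from reusing the name "mean" for our own function
mean <- function() {
  return("Don't do this!")
}

# "mean" now refers to our version, which takes no arguments
mean()
```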
[1] 4.5
[1] "Don't do this!"
#and this will now produce an error because R thinks "mean()" takes NO arguments/inputs
mean(ex)
Error in mean(ex): unused argument (ex)
#we can save ourselves by clearly indicating the namespace by telling R what
#package/environment to look in... but this is bad just don't do it
base::mean(ex)
[1] 4.5
Another thing to note quickly is that all packages are just someone else’s functions that you are loading into your own environment! You can now understand some common warnings you have perhaps encountered when loading packages.
“The following objects are masked from package:…” indicates that you just loaded a package that has a function with the same namespace! R somewhat infamously has a lot of older packages that supply functions of the same name, but that do totally different things. For example when we load dplyr
, we are told “filter” and “lag” are masked from “package:stats.” In other words, dplyr
has a function, dplyr::filter
, that is writing over the function stats::filter
. Sometimes, you may also encounter error messages where R
is clearly confused about which function it should be using because they both have the same name. To clarify which namespace you are referring to you can use the following syntax: packagename::functionname()
. This may happen annoyingly frequently for dplyr::select
unfortunately.
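As a small sketch of the packagename::functionname() syntax using only packages that ship with R (so dplyr does not need to be installed): stats::filter() is the time-series filter that dplyr::filter() masks, and calling it with the prefix always reaches the stats version. The vector x is made up for illustration.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter; with three equal weights of 1/3
# it computes a centered three-point moving average (NA at both ends)
smoothed <- stats::filter(x, rep(1/3, 3))
as.numeric(smoothed)
```

With dplyr loaded, a bare filter(x, rep(1/3, 3)) would be sent to dplyr::filter() instead, which expects a data frame and does something entirely different.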
The last step to understand functions requires diving a bit into the weeds, but bear with me. Let’s back up a little bit: recalling our basics, we learned that in code we “initialize” an object by telling the computer we will use some name, say “x” or “my_dataframe”, to refer back to some particular amount of computer memory that will store a type of data among the fundamental data types. In R you will nearly always be working with logical (often called boolean), numeric (decimal or whole), and character (often called string) data. In more traditional forms of coding, code is written and understood in a hierarchical structure, with different levels that expand upward and build on those below, like a Russian doll. In these languages, when we initialize data, the computer therefore also needs to be told how “accessible” this data is - at what level is the object accessible? Only the current level, or also the higher, more expanded levels above it? This is referred to as “scope” - the “scope” of an object is where it is defined/accessible.
In R, we (thankfully) do not use this hierarchical structure! So if this is a little confusing, fear not! Instead, nearly everything is “globally defined.” This means that the object is in the “global environment,” meaning it’s accessible everywhere and we don’t have to worry about any of this hierarchical business and different permissions to use different objects. The downside of this is that it’s very easy to write over something later on in your code without realizing it. For example, on line 15 you may use “x” to refer to a big vector of data, and then later on line 142 you accidentally use “x” again to refer to just the number 4, and now you’ve written over your big vector of data and lost it. Having everything globally defined in R is therefore a double-edged sword - it saves a lot of pain when declaring data and setting things up, but it can lead to careless mistakes that are a big headache.
So why does this matter for functions? Functions are one of the very few places in R where scope actually comes into play and things are NOT globally defined anymore. Instead, the scope of all the code internal to a function (so everything inside the brackets {}
) is limited to the function itself. More concretely, if inside a function I declare a variable “my_var”, I run the function, and then try to use “my_var” outside the function, R will immediately complain it has no idea what “my_var” is. This variable was defined inside the function and its scope is limited to the function; it is not accessible or known beyond the internals of the function. This is beneficial because it means we can run functions and use names inside them that may be the same as ones we use outside the function (globally), but we will not overwrite our original objects!
It’s super important to understand the scope of functions to understand how they work and how you can use them. You will not be able to access any data from inside the function unless you pass it out specifically!
#globally defined x
x <- 4
x
[1] 4
#now what happens if i try and write-over x inside a function?
test_scope <- function() {
x = 6
internal_only = "won't see the light of day"
return(x)
}
#returns 6 as expected
test_scope()
[1] 6
#did I write over x?? No! hurrah
x
[1] 4
#I only write over x if i do so explicitly in the global environment
x <- test_scope()
x
[1] 6
#does R know what internal_only is? Nope!
#even though it was created by the function when i ran it,
#it's lost forever because i did not return it
internal_only
Error in eval(expr, envir, enclos): object 'internal_only' not found
However… I am compelled to also point out that in R
though what a function does internally is not accessible in the global environment (e.g. I can’t access internal_only
outside the function), the reverse is not true. Functions can access and use what is in the global environment (though, as we saw, assigning to a name inside a function does not write over objects in the global environment). This is a bit of a quirk in R
and would not work in more strictly hierarchical languages. What does this mean practically? You can write a function that uses a global variable that is not passed in as an argument! This, however, is generally very poor coding practice. Why? If the globally defined variable is removed or changes value, the output of the function will change, or the function will not even run. And yet, we changed nothing about the function itself! This type of inconsistency is highly vulnerable to user error and is dangerous for that reason. I’d say it should pretty much always be avoided. (Only in advanced use cases where functions are nested within each other would I ever recommend doing this.)
#globally define my_var
my_var <- 1234
bad_ex <- function(w) {
#i can still access my_var inside the function even though it wasn't passed in!
prod = w*my_var
return(prod)
}
#this will still run no problem :O
bad_ex(w = 1)
[1] 1234
#now i go code 100 more lines and at some point redefine my_var
my_var = 12
#and i run my exact same code as before, nothing about the function has changed
#but i get a different answer
bad_ex(w = 1)
[1] 12
#even worse, if i remove my_var from the global environment the function breaks entirely
rm(my_var)
bad_ex(w=1)
Error in bad_ex(w = 1): object 'my_var' not found
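A sketch of the safer alternative, assuming a hypothetical function good_ex with an extra argument named multiplier: the value is passed in explicitly, so the function no longer depends on whatever happens to be in the global environment.

```r
# Hypothetical safer version: everything the function needs is passed in as an argument
good_ex <- function(w, multiplier) {
  prod <- w * multiplier
  return(prod)
}

my_var <- 1234
good_ex(w = 1, multiplier = my_var)

# Removing my_var later cannot break the function itself;
# the call simply has to supply a value
rm(my_var)
good_ex(w = 1, multiplier = 12)
```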
This brings us to debugging. Debugging within a function might seem difficult at first because when we declare the function, R does not evaluate the internal code and will not find any errors. Then, when we later call the function and run the internal code, we don’t get to see R go line by line, making it quite hard to see where things broke down when we run into errors. Typically, if you get an error but can’t tell what’s causing it, and then go back inside the function to try to run any of the internal code on its own that uses the function’s arguments, R will complain that it doesn’t know what data you’re using since you didn’t pass it in! Yikes… So how to debug a function?
Typically, the approach is to run the internal code of the function yourself line by line to find the issue. In order for this to work, you will of course need to initialize the arguments, because otherwise R will have no idea what they are. I recommend explicitly initializing them inside the function at the very top. By running these lines yourself, you’re actually making everything global! This allows you to find where issues are, but also means you’ve just stored all the objects internal to the function in the global environment. Once you’ve fixed the issue, you should therefore remove these objects from the global environment and be sure to delete the lines at the top where you initialized the arguments. Otherwise, your function will disregard the user-specified arguments every time.
################ Debugging
#so this won't run because "broken" tries to use "C" before we've declared it
#but if you just run the function R will not complain!
broken <- function(A,B) {
sum = A+B
prod = A*B
broken = A/B - C
C = sqrt(A^2 + B^2)
return(list(sum = sum,
prod = prod,
broken = broken,
C= C))
}
#R only finds the issue when you ask it to actually run the function
broken(6,7)
Error in A/B - C: non-numeric argument to binary operator
#this error message is a little opaque and let's say
#it is not immediately obvious to you what went wrong, what to do?
#if you try to run line by line within the function:
#R will complain that it has no idea what A and B are
#So go back inside your function and directly declare A and B
#then test your code
broken <- function(A,B) {
# XXXX REMINDER DELETE LATER AND RM FROM ENVIRONMENT
A = 1; B = 2
#now R knows what A and B are and we can run the following lines internally
sum = A+B #good
prod = A*B #good
C = sqrt(A^2 + B^2)
broken = A/B - C # ah ha we found the problem!
return(list(sum = sum,
prod = prod,
broken = broken,
C= C))
}
broken(1,2)
$sum
[1] 3
$prod
[1] 2
$broken
[1] -1.736068
$C
[1] 2.236068
#so now we found the problem you'd fix it and BE SURE TO DELETE the stored inputs
#both from your environment and from the function code
#otherwise every time you run your function it will use what you set internally and ignore the inputs you have
rm(A, B)
fixed<- function(A,B) {
sum = A+B
prod = A*B
C = sqrt(A^2 + B^2)
broken = A/B - C
return(list(sum = sum,
prod = prod,
broken = broken,
C= C))
}
fixed(6,7)
$sum
[1] 13
$prod
[1] 42
$broken
[1] -8.362402
$C
[1] 9.219544
#note ofc that if you don't delete those internal inputs though: you're stuck with 6 and 7
dontdothis<- function(A,B) {
A = 6
B = 7
sum = A+B
prod = A*B
C = sqrt(A^2 + B^2)
broken = A/B - C
return(list(sum = sum,
prod = prod,
broken = broken,
C= C))
}
dontdothis(5,6) #still gives us the results for 6 and 7!
$sum
[1] 13
$prod
[1] 42
$broken
[1] -8.362402
$C
[1] 9.219544
You can also help yourself debug by printing things out. This way you can see what the function was doing internally even when it was called. This is especially handy if your function has a long computation time and you want some updates or verification it’s going according to plan.
#use either print() or cat()
#cat() is nice because it prints out like a warning/error message would without quotes
#print() prints out as if something is being returned notice the [1]
#and is a little less clean to use but works just fine
#not a major difference so do whatever you like
testprint <- function(A,B) {
#when using cat you need to tell it to start a new line for the next print out using \n
cat("cat: A is", A, "and B is", B, "returning their sum: \n")
#print is a little uglier because c() turns A into a string since all elements
#of a vector must be the same type (all strings or "characters" in R)
print(x = c("print + c(): A is",A, "B is"))
#or cleaning that up a bit by also using paste()
print(x = paste("print + paste(): A is",A, "B is"))
return(A+B)
}
testprint(4,5)
cat: A is 4 and B is 5 returning their sum:
[1] "print + c(): A is" "4" "B is"
[1] "print + paste(): A is 4 B is"
[1] 9
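For the long-computation case mentioned above, here is a rough sketch of progress printing inside a loop. The function slow_sum is a hypothetical stand-in for something that takes much longer per step.

```r
# Hypothetical stand-in for a slow computation: report progress at every step
slow_sum <- function(values) {
  total <- 0
  for (i in seq_along(values)) {
    total <- total + values[i]
    # cat() prints immediately, even though nothing has been returned yet
    cat("step", i, "of", length(values), "- running total:", total, "\n")
  }
  return(total)
}

slow_sum(c(2, 4, 6))
```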
The real power of functions lies in their infinite flexibility. You can write a function to take in any input and give you any output!
As a totally arbitrary example, I can write a function that will take a generalized input object x and:
- if it is a character, output the number of characters in it
- if it is a single number, return the log
- if it is a vector, return the mean and standard deviation
- if it is a matrix, return its inverse
my_funct <- function(x) {
if(identical(class(x), "character")) {
return(nchar(x))
} else if(identical(class(x), "numeric") && length(x) == 1) {
#issue 1: anticipate any issues with this?
return(log(x))
} else if(identical(class(x), "numeric")) {
#is not character or numeric with length 1, but is numeric (must be vector)
return(list(sd = sd(x),
mean = mean(x)))
} else {
#is none of the above, we assume it's a matrix
#issue 2: any possible issues with this?
return(solve(x))
}
}
my_funct("hello there")
[1] 11
my_funct(4)
[1] 1.386294
log(4)
[1] 1.386294
my_funct(x = c(1,2,3))
$sd
[1] 1
$mean
[1] 2
[,1] [,2] [,3]
[1,] -0.2480476 2.719875 0.1602922
[2,] 0.2261868 -1.186238 -0.5585234
[3,] -1.2666374 2.333505 -0.6239111
#issue 1?
my_funct(-1)
[1] NaN
Error in solve.default(x): 'a' (3 x 4) must be square
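One possible way to handle issue 1, sketched with a hypothetical helper safe_log that is not part of the function above: check the input first and stop with an informative message instead of quietly returning NaN. The same idea (checking that a matrix is square before calling solve()) would apply to issue 2.

```r
# Hypothetical helper: validate the input before taking the log
safe_log <- function(x) {
  if (!is.numeric(x) || length(x) != 1) {
    stop("x must be a single number")
  }
  if (x <= 0) {
    stop("x must be positive to take its log")
  }
  return(log(x))
}

safe_log(4)

# tryCatch() lets us capture the error message instead of halting the script
message_text <- tryCatch(safe_log(-1), error = function(e) conditionMessage(e))
message_text
```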
Congrats you made it to the very end! I encourage you to tinker around and try and write (and also break) your own functions to better understand how they work.