11 Core functions
11.1 The dataset we are working on:
11.2 tidyverse
functions
After having imported the data into R
, there is usually a bit of tidying that has to be done. We are going to learn how to do this through the tidyverse
package. This package is fundamental to many operations in R
. One of the main components of the tidyverse
package is the %>%
, also known as the pipe. The pipe allows us to access an object and apply functions to it in sequence. For example, if the object is me
, the first function could be to wake_up()
, then to the me that has woken up, shower()
, and then to the me who has woken up and taken a shower, to drink_coffee()
. This would look like below:
%>%
me wake_up() %>%
shower() %>%
drink_coffee()
We’ll go through some of the main functions in the tidyverse
package. However, first, it can be useful to change the names of the variables to something that does not have space, as it eases the work with the variables. To do this, load the package janitor
and use the function clean_names
. Here, we apply the pipe to the function, which basically means; go into the object costofliving
and use the function clean_names
on it.
library(janitor)
<- costofliving %>%
costofliving clean_names()
glimpse(costofliving)
Rows: 578
Columns: 7
$ city <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ cost_of_living_index <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
11.3 select
The function select
picks out some variables from the dataset.
%>%
dataset select(variable)
Recall that the dataset had 7 variables; city
, cost_of_living_index
, rent_index
, cost_of_living_plus_rent_index
, groceries_index
, restaurant_price_index
and local_purchasing_power_index
. If we only want a dataset with the two variables cost_of_living_index
and groceries_index
, we can use the select
function.
%>%
costofliving select(cost_of_living_index, groceries_index) # Picking only these two variables
Rows: 578
Columns: 2
$ cost_of_living_index <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03, 120.47, 1…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17, 122.56, 1…
It can also be used to remove variables that we do not want in the dataset. If we want to remove more than one variable, we have to add a c
and wrap the variables in parantheses.
%>%
costofliving select(-cost_of_living_index)
Rows: 578
Columns: 6
$ city <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
%>%
costofliving select(-c(cost_of_living_index, groceries_index))
Rows: 578
Columns: 5
$ city <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
11.4 filter
The function filter
picks out some observations from the dataset.
%>%
dataset filter(variable_name == value)
Curious about which cities that have a cost of living below 100 (meaning that they are cheaper than New York)? Ask for all units that score less than <
100 on the cost_of_living_index
.
%>%
costofliving filter(cost_of_living_index < 100)
Rows: 481
Columns: 7
$ city <chr> "Trondheim, Norway", "Tromso, Norway", …
$ cost_of_living_index <dbl> 99.43, 98.99, 97.61, 94.49, 93.91, 92.9…
$ rent_index <dbl> 37.74, 37.19, 46.27, 53.22, 108.42, 87.…
$ cost_of_living_plus_rent_index <dbl> 70.52, 70.02, 73.55, 75.15, 100.72, 90.…
$ groceries_index <dbl> 95.11, NA, 91.92, 82.98, 97.05, 98.46, …
$ restaurant_price_index <dbl> 103.21, 103.47, 105.77, 106.66, 93.40, …
$ local_purchasing_power_index <dbl> 88.00, NA, 74.84, 70.22, 133.16, 111.73…
Or if you want all the cities with equal =
or higher than >
100 on the cost_of_living_index
, use the code below.
%>%
costofliving filter(cost_of_living_index >= 100)
Rows: 10
Columns: 7
$ city <chr> "Zurich, Switzerland", "Zug, Switzerlan…
$ cost_of_living_index <dbl> 131.24, 128.13, 123.99, 122.03, 120.47,…
$ rent_index <dbl> 69.26, 72.12, 44.99, 59.55, NA, 46.12, …
$ cost_of_living_plus_rent_index <dbl> 102.19, 101.87, 86.96, 92.74, 77.01, NA…
$ groceries_index <dbl> 136.14, 132.61, 129.17, 122.56, 141.33,…
$ restaurant_price_index <dbl> 132.52, NA, 119.80, 127.01, 116.95, 120…
$ local_purchasing_power_index <dbl> 129.79, 143.40, 111.96, NA, 15.40, 112.…
Want to know about only Oslo? Then filter out the row on the city
variable that has the value Oslo, Norway
(in quotation marks ""
). Notice that we use two equals signs ==
. This is to not confuse R
into thinking that city equals Oslo, Norway.
%>%
costofliving filter(city == "Oslo, Norway")
Rows: 1
Columns: 7
$ city <chr> "Oslo, Norway"
$ cost_of_living_index <dbl> 102.33
$ rent_index <dbl> 46.39
$ cost_of_living_plus_rent_index <dbl> 76.11
$ groceries_index <dbl> 97.62
$ restaurant_price_index <dbl> NA
$ local_purchasing_power_index <dbl> 85.18
Want more cities at the same time? Shift ==
with %in%
and wrap the units into parantheses with a c
in front. Also, remember the quotation marks ""
.
%>%
costofliving filter(city %in% c("Oslo, Norway",
"Bergen, Norway",
"Trondheim, Norway"))
Rows: 3
Columns: 7
$ city <chr> "Oslo, Norway", "Bergen, Norway", "Tron…
$ cost_of_living_index <dbl> 102.33, 100.38, 99.43
$ rent_index <dbl> 46.39, 34.84, 37.74
$ cost_of_living_plus_rent_index <dbl> 76.11, 69.66, 70.52
$ groceries_index <dbl> 97.62, 96.22, 95.11
$ restaurant_price_index <dbl> NA, 103.51, 103.21
$ local_purchasing_power_index <dbl> 85.18, 86.96, 88.00
11.5 rename
Would you like to change the name of a variable? Use rename
. First, write the new name of the variable, then the name of the old variable.
%>%
dataset rename(new_variable_name = old_variable_name)
In this case, I change the name an abbreviation of the variable name, so that cost_of_living_index
becomes cofi
.
<- costofliving %>%
costofliving rename(cofi = cost_of_living_index)
Rows: 578
Columns: 7
$ city <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ cofi <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
11.6 separate
and unite
Separate and unite allows us to split or collect two variables based on a separator.
%>%
dataset separate(variable_name, into = c("new_variable_name1", "new_variable_name2"), sep = "separator")
%>%
dataset unite(new_variable_name, c("old_variable_name1", "old_variable_name2"), sep = "separator")
A useful thing to do here would be to separate the city
variable into city
and country
. The separator in this case is a comma ,
, but it could also have been for example a dot, a space or a word such as “and”.
<- costofliving %>%
costofliving separate(city, into = c("city", "country"), sep = ", ")
Rows: 578
Columns: 8
$ city <chr> "Hamilton", "Zurich", "Basel", "Zug", "…
$ country <chr> "Bermuda", "Switzerland", "Switzerland"…
$ cofi <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
11.7 mutate
If you would like to make a new variable, mutate
is your go-to. First, write the name of the new variable, then add an equals sign =
, and put the operation that you need to create your new variable.
%>%
dataset mutate(variable_name = operation())
For example, if we want a new variable with the restaurant purchasing power as a share of the local purchasing power, we can make a new variable where we divide the first by the second.
%>%
costofliving mutate(share_restaurant_purchasing_power = restaurant_price_index/local_purchasing_power_index)
Rows: 578
Columns: 9
$ city <chr> "Hamilton", "Zurich", "Basel", "Zug"…
$ country <chr> "Bermuda", "Switzerland", "Switzerla…
$ cofi <dbl> NA, 131.24, NA, 128.13, 123.99, 122.…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 9…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, …
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.9…
$ share_restaurant_purchasing_power <dbl> 1.9541735, 1.0210340, 1.1741236, NA,…
11.7.1 ifelse
If you want to make a variable that takes the value 1 when the country is Norway, match it with ifelse
. The ifelse
function works like this:
If condiction
, then give the new variable the given value, if not, give it the other value. In this case, if the variable country
has value “Norway”, give the new variable norway
value 1
, else give the new variable value 0
.
%>%
costofliving mutate(norway = ifelse(country == "Norway", 1, 0))
Rows: 578
Columns: 9
$ city <chr> "Hamilton", "Zurich", "Basel", "Zug", "…
$ country <chr> "Bermuda", "Switzerland", "Switzerland"…
$ cofi <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
$ norway <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
You can also overwrite new variables with the function mutate
. For example, if you aren’t happy with the fact that the country
variable is a “character”, you can change it by using the function as_factor
together with mutate
. Notice in the function below that the name of the new variable is the same as the name of the old variable, country
, thus we overwrite the old one.
class(costofliving$country)
[1] "character"
<- costofliving %>%
costofliving mutate(country = as_factor(country))
class(costofliving$country)
[1] "factor"
11.8 group_by
and summarise
At last, we can use the summarise
function to make summaries of the dataset. This function works very well together with the function group_by
.
%>%
dataset group_by(variable) %>%
summarise(new_variable_name = function(variable_name))
While summarise
gives us aggregates of the data, the group_by
specifies which units the aggregates should be by. For example, if we ask for the sum
of the cost of living in general in the dataset, we could just use summarise
directly.
%>%
costofliving summarise(cofi = sum(cofi, na.rm = TRUE))
# A tibble: 1 × 1
cofi
<dbl>
1 28253.
A more useful distinction is, however, to look at the cost of living for each country. Then, we group_by
country first, and then use summarise
and sum
. When using group_by
, remember to ungroup
afterwards.
I also use arrange
along with desc
in this example in order to sort the countries in declining order from the ones with highest “cofi” (highest cost of living) to the lowest.
%>%
costofliving group_by(country) %>%
summarise(cofi = sum(cofi, na.rm = TRUE)) %>%
arrange(desc(cofi)) %>%
ungroup()
# A tibble: 118 × 2
country cofi
<fct> <dbl>
1 " United States" 4886.
2 <NA> 3602.
3 "United Kingdom" 2325.
4 "Germany" 1396.
5 "Italy" 1081.
6 "Canada" 1067.
7 "India" 878.
8 "Netherlands" 832.
9 "Australia" 624.
10 "Switzerland" 624.
# ℹ 108 more rows
Summarise also works with other descriptive functions such as:
min
: Minimum valuemax
: Maximum valuemedian
: Medianmean
: Mean (average)sd
: Standard deviation
So, for example, to find the average cost of living in lagre cities in the respective countries, use group_by
, summarise
and mean
.
%>%
costofliving group_by(country) %>%
summarise(cofi = mean(cofi, na.rm = TRUE)) %>%
arrange(desc(cofi)) %>%
ungroup()
# A tibble: 118 × 2
country cofi
<fct> <dbl>
1 Switzerland 125.
2 Lebanon 120.
3 Norway 101.
4 Iceland 97.6
5 Israel 89.6
6 Bahamas 86.0
7 Denmark 84.9
8 Singapore 84.0
9 Luxembourg 83.0
10 Japan 80.6
# ℹ 108 more rows
11.9 Piping together
As you saw in the part above, pipes can be put in sequence. This is one of the great advantages of using pipes. Lots of code can be executed in a single “storyline” that says “first do this, then do this, then do this”, and so on. Thus tidyverse
is usually seen as a pedagogical tool because it’s supposedly quite intuitive, and the code also becomes very readable. Consider the code below and decide for yourself whether you find it readable.
%>%
costofliving select(city, country, rent_index, restaurant_price_index, groceries_index) %>%
filter(country %in% c("Norway", "Germany", "Spain")) %>%
mutate(total_cost_index = rent_index + restaurant_price_index + groceries_index) %>%
group_by(country) %>%
mutate(max_cost = max(total_cost_index, na.rm = TRUE)) %>%
ungroup() %>%
mutate(highest_city = ifelse(total_cost_index == max_cost,
"High living costs",
"Low living costs")) %>%
group_by(highest_city) %>%
summarise(average_rent = mean(rent_index))
# A tibble: 3 × 2
highest_city average_rent
<chr> <dbl>
1 High living costs 38.6
2 Low living costs 26.5
3 <NA> NA
11.10 Saving the dataset
When you’ve loaded a dataset into R
and tidied it up a bit, you might want to save it until next time. Which format you choose depends partly on your own preferences, and party what you’d like to do further with the dataset. If you want to send it to someone else, a .csv
is probably best. If not, I recommend .rds
because it allows you to give a name to the object when you read it into R
. But some people prefer .rda
and .RData
, and it’s nice to vary a bit to see what you like best.
write_csv(costofliving, file = "../datafolder/costofliving2.csv") # To save a .csv file
saveRDS(costofliving, file = "../datafolder/costofliving2.rds") # To save a .rds file
save(costofliving, file = "../datafolder/costofliving2.rda") # To save a .rda file
save(costofliving, file = "../datafolder/costofliving2.RData") # To save a .RData file