11  Core functions

Author

Solveig Bjørkholt

11.1 The dataset we are working on:

11.2 tidyverse functions

After having imported the data into R, there is usually a bit of tidying that has to be done. We are going to learn how to do this through the tidyverse package. This package is fundamental to many operations in R. One of the main components of the tidyverse package is the %>%, also known as the pipe. The pipe allows us to access an object and apply functions to it in sequence. For example, if the object is me, the first function could be to wake_up(), then to the me that has woken up, shower(), and then to the me who has woken up and taken a shower, to drink_coffee(). This would look like below:

me %>%
  wake_up() %>%
  shower() %>%
  drink_coffee()

We’ll go through some of the main functions in the tidyverse package. However, first, it can be useful to change the names of the variables to something that does not have space, as it eases the work with the variables. To do this, load the package janitor and use the function clean_names. Here, we apply the pipe to the function, which basically means; go into the object costofliving and use the function clean_names on it.

library(janitor)

costofliving <- costofliving %>%
  clean_names()

glimpse(costofliving)
Rows: 578
Columns: 7
$ city                           <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ cost_of_living_index           <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index                <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …

11.3 select

The function select picks out some variables from the dataset.

dataset %>%
  select(variable)

Recall that the dataset had 7 variables; city, cost_of_living_index, rent_index, cost_of_living_plus_rent_index, groceries_index, restaurant_price_index and local_purchasing_power_index. If we only want a dataset with the two variables cost_of_living_index and groceries_index, we can use the select function.

costofliving %>%
  select(cost_of_living_index, groceries_index) # Picking only these two variables
Rows: 578
Columns: 2
$ cost_of_living_index <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03, 120.47, 1…
$ groceries_index      <dbl> 157.89, 136.14, 137.07, 132.61, 129.17, 122.56, 1…

It can also be used to remove variables that we do not want in the dataset. If we want to remove more than one variable, we have to add a c and wrap the variables in parantheses.

costofliving %>%
  select(-cost_of_living_index)
Rows: 578
Columns: 6
$ city                           <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index                <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
costofliving %>%
  select(-c(cost_of_living_index, groceries_index))
Rows: 578
Columns: 5
$ city                           <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …

11.4 filter

The function filter picks out some observations from the dataset.

dataset %>%
  filter(variable_name == value)

Curious about which cities that have a cost of living below 100 (meaning that they are cheaper than New York)? Ask for all units that score less than < 100 on the cost_of_living_index.

costofliving %>%
  filter(cost_of_living_index < 100)
Rows: 481
Columns: 7
$ city                           <chr> "Trondheim, Norway", "Tromso, Norway", …
$ cost_of_living_index           <dbl> 99.43, 98.99, 97.61, 94.49, 93.91, 92.9…
$ rent_index                     <dbl> 37.74, 37.19, 46.27, 53.22, 108.42, 87.…
$ cost_of_living_plus_rent_index <dbl> 70.52, 70.02, 73.55, 75.15, 100.72, 90.…
$ groceries_index                <dbl> 95.11, NA, 91.92, 82.98, 97.05, 98.46, …
$ restaurant_price_index         <dbl> 103.21, 103.47, 105.77, 106.66, 93.40, …
$ local_purchasing_power_index   <dbl> 88.00, NA, 74.84, 70.22, 133.16, 111.73…

Or if you want all the cities with equal = or higher than > 100 on the cost_of_living_index, use the code below.

costofliving %>%
  filter(cost_of_living_index >= 100)
Rows: 10
Columns: 7
$ city                           <chr> "Zurich, Switzerland", "Zug, Switzerlan…
$ cost_of_living_index           <dbl> 131.24, 128.13, 123.99, 122.03, 120.47,…
$ rent_index                     <dbl> 69.26, 72.12, 44.99, 59.55, NA, 46.12, …
$ cost_of_living_plus_rent_index <dbl> 102.19, 101.87, 86.96, 92.74, 77.01, NA…
$ groceries_index                <dbl> 136.14, 132.61, 129.17, 122.56, 141.33,…
$ restaurant_price_index         <dbl> 132.52, NA, 119.80, 127.01, 116.95, 120…
$ local_purchasing_power_index   <dbl> 129.79, 143.40, 111.96, NA, 15.40, 112.…

Want to know about only Oslo? Then filter out the row on the city variable that has the value Oslo, Norway (in quotation marks ""). Notice that we use two equals signs ==. This is to not confuse R into thinking that city equals Oslo, Norway.

costofliving %>%
  filter(city == "Oslo, Norway")
Rows: 1
Columns: 7
$ city                           <chr> "Oslo, Norway"
$ cost_of_living_index           <dbl> 102.33
$ rent_index                     <dbl> 46.39
$ cost_of_living_plus_rent_index <dbl> 76.11
$ groceries_index                <dbl> 97.62
$ restaurant_price_index         <dbl> NA
$ local_purchasing_power_index   <dbl> 85.18

Want more cities at the same time? Shift == with %in% and wrap the units into parantheses with a c in front. Also, remember the quotation marks "".

costofliving %>%
  filter(city %in% c("Oslo, Norway",
                     "Bergen, Norway",
                     "Trondheim, Norway"))
Rows: 3
Columns: 7
$ city                           <chr> "Oslo, Norway", "Bergen, Norway", "Tron…
$ cost_of_living_index           <dbl> 102.33, 100.38, 99.43
$ rent_index                     <dbl> 46.39, 34.84, 37.74
$ cost_of_living_plus_rent_index <dbl> 76.11, 69.66, 70.52
$ groceries_index                <dbl> 97.62, 96.22, 95.11
$ restaurant_price_index         <dbl> NA, 103.51, 103.21
$ local_purchasing_power_index   <dbl> 85.18, 86.96, 88.00

11.5 rename

Would you like to change the name of a variable? Use rename. First, write the new name of the variable, then the name of the old variable.

dataset %>%
  rename(new_variable_name = old_variable_name)

In this case, I change the name an abbreviation of the variable name, so that cost_of_living_index becomes cofi.

costofliving <- costofliving %>%
  rename(cofi = cost_of_living_index)
Rows: 578
Columns: 7
$ city                           <chr> "Hamilton, Bermuda", "Zurich, Switzerla…
$ cofi                           <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index                <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …

11.6 separate and unite

Separate and unite allows us to split or collect two variables based on a separator.

dataset %>%
  separate(variable_name, into = c("new_variable_name1", "new_variable_name2"), sep = "separator")

dataset %>%
  unite(new_variable_name, c("old_variable_name1", "old_variable_name2"), sep = "separator")

A useful thing to do here would be to separate the city variable into city and country. The separator in this case is a comma ,, but it could also have been for example a dot, a space or a word such as “and”.

costofliving <- costofliving %>%
  separate(city, into = c("city", "country"), sep = ", ")
Rows: 578
Columns: 8
$ city                           <chr> "Hamilton", "Zurich", "Basel", "Zug", "…
$ country                        <chr> "Bermuda", "Switzerland", "Switzerland"…
$ cofi                           <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index                <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …

11.7 mutate

If you would like to make a new variable, mutate is your go-to. First, write the name of the new variable, then add an equals sign =, and put the operation that you need to create your new variable.

dataset %>%
  mutate(variable_name = operation())

For example, if we want a new variable with the restaurant purchasing power as a share of the local purchasing power, we can make a new variable where we divide the first by the second.

costofliving %>%
  mutate(share_restaurant_purchasing_power = restaurant_price_index/local_purchasing_power_index)
Rows: 578
Columns: 9
$ city                              <chr> "Hamilton", "Zurich", "Basel", "Zug"…
$ country                           <chr> "Bermuda", "Switzerland", "Switzerla…
$ cofi                              <dbl> NA, 131.24, NA, 128.13, 123.99, 122.…
$ rent_index                        <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 5…
$ cost_of_living_plus_rent_index    <dbl> 124.22, 102.19, NA, 101.87, 86.96, 9…
$ groceries_index                   <dbl> 157.89, 136.14, 137.07, 132.61, 129.…
$ restaurant_price_index            <dbl> 155.22, 132.52, 130.95, NA, 119.80, …
$ local_purchasing_power_index      <dbl> 79.43, 129.79, 111.53, 143.40, 111.9…
$ share_restaurant_purchasing_power <dbl> 1.9541735, 1.0210340, 1.1741236, NA,…

11.7.1 ifelse

If you want to make a variable that takes the value 1 when the country is Norway, match it with ifelse. The ifelse function works like this:

If condiction, then give the new variable the given value, if not, give it the other value. In this case, if the variable country has value “Norway”, give the new variable norway value 1, else give the new variable value 0.

costofliving %>%
  mutate(norway = ifelse(country == "Norway", 1, 0))
Rows: 578
Columns: 9
$ city                           <chr> "Hamilton", "Zurich", "Basel", "Zug", "…
$ country                        <chr> "Bermuda", "Switzerland", "Switzerland"…
$ cofi                           <dbl> NA, 131.24, NA, 128.13, 123.99, 122.03,…
$ rent_index                     <dbl> 96.10, 69.26, 49.38, 72.12, 44.99, 59.5…
$ cost_of_living_plus_rent_index <dbl> 124.22, 102.19, NA, 101.87, 86.96, 92.7…
$ groceries_index                <dbl> 157.89, 136.14, 137.07, 132.61, 129.17,…
$ restaurant_price_index         <dbl> 155.22, 132.52, 130.95, NA, 119.80, 127…
$ local_purchasing_power_index   <dbl> 79.43, 129.79, 111.53, 143.40, 111.96, …
$ norway                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …

You can also overwrite new variables with the function mutate. For example, if you aren’t happy with the fact that the country variable is a “character”, you can change it by using the function as_factor together with mutate. Notice in the function below that the name of the new variable is the same as the name of the old variable, country, thus we overwrite the old one.

class(costofliving$country)
[1] "character"
costofliving <- costofliving %>%
  mutate(country = as_factor(country))

class(costofliving$country)
[1] "factor"

11.8 group_by and summarise

At last, we can use the summarise function to make summaries of the dataset. This function works very well together with the function group_by.

dataset %>%
  group_by(variable) %>%
  summarise(new_variable_name = function(variable_name))

While summarise gives us aggregates of the data, the group_by specifies which units the aggregates should be by. For example, if we ask for the sum of the cost of living in general in the dataset, we could just use summarise directly.

costofliving %>%
  summarise(cofi = sum(cofi, na.rm = TRUE))
# A tibble: 1 × 1
    cofi
   <dbl>
1 28253.

A more useful distinction is, however, to look at the cost of living for each country. Then, we group_by country first, and then use summarise and sum. When using group_by, remember to ungroup afterwards.

I also use arrange along with desc in this example in order to sort the countries in declining order from the ones with highest “cofi” (highest cost of living) to the lowest.

costofliving %>%
  group_by(country) %>%
  summarise(cofi = sum(cofi, na.rm = TRUE)) %>%
  arrange(desc(cofi)) %>%
  ungroup()
# A tibble: 118 × 2
   country           cofi
   <fct>            <dbl>
 1 " United States" 4886.
 2  <NA>            3602.
 3 "United Kingdom" 2325.
 4 "Germany"        1396.
 5 "Italy"          1081.
 6 "Canada"         1067.
 7 "India"           878.
 8 "Netherlands"     832.
 9 "Australia"       624.
10 "Switzerland"     624.
# ℹ 108 more rows

Summarise also works with other descriptive functions such as:

  • min : Minimum value
  • max : Maximum value
  • median : Median
  • mean : Mean (average)
  • sd : Standard deviation

So, for example, to find the average cost of living in lagre cities in the respective countries, use group_by, summarise and mean.

costofliving %>%
  group_by(country) %>%
  summarise(cofi = mean(cofi, na.rm = TRUE)) %>%
  arrange(desc(cofi)) %>%
  ungroup()
# A tibble: 118 × 2
   country      cofi
   <fct>       <dbl>
 1 Switzerland 125. 
 2 Lebanon     120. 
 3 Norway      101. 
 4 Iceland      97.6
 5 Israel       89.6
 6 Bahamas      86.0
 7 Denmark      84.9
 8 Singapore    84.0
 9 Luxembourg   83.0
10 Japan        80.6
# ℹ 108 more rows

11.9 Piping together

As you saw in the part above, pipes can be put in sequence. This is one of the great advantages of using pipes. Lots of code can be executed in a single “storyline” that says “first do this, then do this, then do this”, and so on. Thus tidyverse is usually seen as a pedagogical tool because it’s supposedly quite intuitive, and the code also becomes very readable. Consider the code below and decide for yourself whether you find it readable.

costofliving %>%
  select(city, country, rent_index, restaurant_price_index, groceries_index) %>%
  filter(country %in% c("Norway", "Germany", "Spain")) %>%
  mutate(total_cost_index = rent_index + restaurant_price_index + groceries_index) %>%
  group_by(country) %>%
  mutate(max_cost = max(total_cost_index, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(highest_city = ifelse(total_cost_index == max_cost, 
                               "High living costs", 
                               "Low living costs")) %>%
  group_by(highest_city) %>%
  summarise(average_rent = mean(rent_index))
# A tibble: 3 × 2
  highest_city      average_rent
  <chr>                    <dbl>
1 High living costs         38.6
2 Low living costs          26.5
3 <NA>                      NA  

11.10 Saving the dataset

When you’ve loaded a dataset into R and tidied it up a bit, you might want to save it until next time. Which format you choose depends partly on your own preferences, and party what you’d like to do further with the dataset. If you want to send it to someone else, a .csv is probably best. If not, I recommend .rds because it allows you to give a name to the object when you read it into R. But some people prefer .rda and .RData, and it’s nice to vary a bit to see what you like best.

write_csv(costofliving, file = "../datafolder/costofliving2.csv") # To save a .csv file

saveRDS(costofliving, file = "../datafolder/costofliving2.rds") # To save a .rds file

save(costofliving, file = "../datafolder/costofliving2.rda") # To save a .rda file

save(costofliving, file = "../datafolder/costofliving2.RData") # To save a .RData file