12  Missing values

Author

Solveig Bjørkholt

12.1 The dataset we are working on (after data cleaning in the previous chapter):

costofliving %>%
  head()
# A tibble: 6 × 8
  city     country      cofi rent_index cost_of_living_plus_re…¹ groceries_index
  <chr>    <fct>       <dbl>      <dbl>                    <dbl>           <dbl>
1 Hamilton Bermuda       NA        96.1                    124.             158.
2 Zurich   Switzerland  131.       69.3                    102.             136.
3 Basel    Switzerland   NA        49.4                     NA              137.
4 Zug      Switzerland  128.       72.1                    102.             133.
5 Lugano   Switzerland  124.       45.0                     87.0            129.
6 Lausanne Switzerland  122.       59.6                     92.7            123.
# ℹ abbreviated name: ¹​cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
#   local_purchasing_power_index <dbl>

What are the NA entries that appear in some cells of the dataset? They are called “missing values”. A missing value marks a cell where there should ideally have been a value, but none was recorded. This can happen because we could not find data on that particular thing, because somebody refused to give us the information we needed, or because a mistake in our code produced NA.
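As a small illustration of the last case, the sketch below shows one common way code can produce NA: converting text to numbers when one entry is not a valid number. The vector `prices` is made up for this example and is not part of the costofliving data.

```r
# Hypothetical toy values, not taken from the costofliving data
prices <- c("131.2", "n/a", "124.0")

# as.numeric() cannot parse "n/a", so that element becomes NA.
# R normally prints a coercion warning here; we silence it for the example.
prices_num <- suppressWarnings(as.numeric(prices))

prices_num

# is.na() flags the element that went missing
is.na(prices_num)
```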

To see how many missing values there are in our dataset, we can use the function is.na together with the function table. is.na returns TRUE for each cell that has a missing value and FALSE otherwise, and table counts how many times TRUE and FALSE occur. As the output below shows, we have 682 missing values in our data.

costofliving %>%
  is.na() %>%
  table()
.
FALSE  TRUE 
 3942   682 
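The total above does not say which variables the missing values sit in. One possible way to break the count down per column is to combine is.na with colSums. The sketch below uses a small made-up data frame called `toy` (an assumption for illustration, not the real costofliving data), so the numbers are not those of our dataset.

```r
# Hypothetical toy data, not the real costofliving dataset
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(131.2, NA, 124.0, NA),
  rent_index = c(69.3, 49.4, NA, 59.6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
na_share
```

The same two calls should work on costofliving itself, since is.na returns a TRUE/FALSE matrix with one column per variable.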

Want to remove all rows that have one or more missing values? Use the function na.omit. Keep in mind that this can discard a lot of data: here, only 187 rows remain.

costofliving %>%
  na.omit()
# A tibble: 187 × 8
   city          country  cofi rent_index cost_of_living_plus_…¹ groceries_index
   <chr>         <fct>   <dbl>      <dbl>                  <dbl>           <dbl>
 1 Zurich        "Switz… 131.        69.3                  102.            136. 
 2 Lugano        "Switz… 124.        45.0                   87.0           129. 
 3 Bergen        "Norwa… 100.        34.8                   69.7            96.2
 4 Trondheim     "Norwa…  99.4       37.7                   70.5            95.1
 5 Reykjavik     "Icela…  97.6       46.3                   73.6            91.9
 6 Tel Aviv-Yafo "Israe…  94.5       53.2                   75.2            83.0
 7 San Francisco " Unit…  93.9      108.                   101.             97.0
 8 Oakland       " Unit…  92.9       87.8                   90.5            98.5
 9 Santa Clara   " Unit…  89.4       90.4                   89.9           101. 
10 Seattle       " Unit…  88.5       65.8                   77.9            87.3
# ℹ 177 more rows
# ℹ abbreviated name: ¹​cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
#   local_purchasing_power_index <dbl>
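To make it easier to see what na.omit does row by row, here is a minimal sketch on a made-up data frame (`toy` is hypothetical, not the costofliving data). It also shows complete.cases, a base R function that returns TRUE for rows without any missing values and can be used to get the same result.

```r
# Hypothetical toy data, not the real costofliving dataset
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(131.2, NA, 124.0, 122.1),
  rent_index = c(69.3, 49.4, NA, 59.6)
)

# Drop every row that has at least one NA in any column
complete <- na.omit(toy)
complete

# Rows before and after
nrow(toy)
nrow(complete)

# Equivalent selection using a logical index of complete rows
same_rows <- toy[complete.cases(toy), ]
```

Rows B and C are removed because each has a missing value in one column, even though their other columns are filled in.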

If you only want to remove rows that have missing values in particular variables, use drop_na from the tidyr package and list those variables. Rows with missing values in other variables are kept, as the NA entries in the output below show.

costofliving %>%
  drop_na(country, city, cofi)
# A tibble: 427 × 8
   city      country      cofi rent_index cost_of_living_plus_…¹ groceries_index
   <chr>     <fct>       <dbl>      <dbl>                  <dbl>           <dbl>
 1 Zurich    Switzerland 131.        69.3                  102.            136. 
 2 Zug       Switzerland 128.        72.1                  102.            133. 
 3 Lugano    Switzerland 124.        45.0                   87.0           129. 
 4 Lausanne  Switzerland 122.        59.6                   92.7           123. 
 5 Beirut    Lebanon     120.        NA                     77.0           141. 
 6 Bern      Switzerland 118.        46.1                   NA              NA  
 7 Stavanger Norway      105.        35.4                   72.2           102. 
 8 Oslo      Norway      102.        46.4                   76.1            97.6
 9 Bergen    Norway      100.        34.8                   69.7            96.2
10 Trondheim Norway       99.4       37.7                   70.5            95.1
# ℹ 417 more rows
# ℹ abbreviated name: ¹​cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
#   local_purchasing_power_index <dbl>
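The difference between the two approaches can be sketched on a small made-up data frame (`toy` is hypothetical, not the costofliving data). The example assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical toy data, not the real costofliving dataset
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(131.2, NA, 124.0, 122.1),
  rent_index = c(69.3, 49.4, NA, 59.6)
)

# Only rows with a missing cofi are dropped; row C stays,
# even though its rent_index is missing
only_cofi <- drop_na(toy, cofi)
only_cofi

# With no variables listed, drop_na() checks every column,
# which removes the same rows as na.omit()
all_columns <- drop_na(toy)
all_columns
```

Listing variables is useful when an analysis only needs a few columns, because rows are not thrown away for gaps in variables you never use.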