12 Missing values
12.1 The dataset we are working on (after data cleaning in previous chapter):
costofliving %>%
  head()
# A tibble: 6 × 8
city country cofi rent_index cost_of_living_plus_re…¹ groceries_index
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 Hamilton Bermuda NA 96.1 124. 158.
2 Zurich Switzerland 131. 69.3 102. 136.
3 Basel Switzerland NA 49.4 NA 137.
4 Zug Switzerland 128. 72.1 102. 133.
5 Lugano Switzerland 124. 45.0 87.0 129.
6 Lausanne Switzerland 122. 59.6 92.7 123.
# ℹ abbreviated name: ¹cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
# local_purchasing_power_index <dbl>
What are those NA things that occur in some cells of the dataset? They are called “missing values”. A missing value marks a cell where a value should ideally have been recorded but is not. This can happen, for example, because we could not find data on that particular thing, because somebody refused to give us the information we needed, or because we made a mistake in the code that generated NA.
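To make the behaviour of NA concrete, here is a minimal sketch in base R. The vector x and the strings being converted are hypothetical illustration values, not taken from the costofliving dataset. It shows that NA propagates through calculations and that a conversion mistake in code can create NA.

```r
# NA marks a value that is missing; most arithmetic on it yields NA again.
x <- c(131, NA, 124)   # hypothetical values

x + 1                  # the NA stays NA
mean(x)                # NA, because one input is unknown
mean(x, na.rm = TRUE)  # 127.5, computed from the two known values

# A coding mistake can also create NA: text that is not a number
# cannot be converted, so as.numeric() returns NA (with a warning,
# which is suppressed here to keep the output clean).
converted <- suppressWarnings(as.numeric(c("96.1", "unknown")))
converted
```

Many summary functions accept na.rm = TRUE to skip missing values, which is often a gentler option than deleting rows.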
To see how many missing values there are in our dataset, we can use the function is.na() along with the function table(). is.na() gives the value TRUE if a cell has a missing value and FALSE otherwise, while table() counts how many times TRUE and FALSE occur. The output below shows that we have 682 missing values in our data.
costofliving %>%
  is.na() %>%
  table()
.
FALSE TRUE
3942 682
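The total alone does not say where the missing values are. As a hedged sketch, the same is.na() result can be summed per column with base R's colSums(). The data frame toy below is a small hypothetical stand-in, not the real costofliving data.

```r
# Hypothetical toy data, not the real costofliving dataset.
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(NA, 131, NA, 128),
  rent_index = c(96, 69, 49, NA)
)

# Total number of missing cells (sum() counts each TRUE as 1).
total_missing <- sum(is.na(toy))

# Number of missing cells in each column.
missing_per_column <- colSums(is.na(toy))

total_missing
missing_per_column
```

Running colSums(is.na(costofliving)) on the real data would give the same kind of per-variable breakdown.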
Want to remove all rows that have one or more missing values? Use the function na.omit().
costofliving %>%
  na.omit()
# A tibble: 187 × 8
city country cofi rent_index cost_of_living_plus_…¹ groceries_index
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 Zurich "Switz… 131. 69.3 102. 136.
2 Lugano "Switz… 124. 45.0 87.0 129.
3 Bergen "Norwa… 100. 34.8 69.7 96.2
4 Trondheim "Norwa… 99.4 37.7 70.5 95.1
5 Reykjavik "Icela… 97.6 46.3 73.6 91.9
6 Tel Aviv-Yafo "Israe… 94.5 53.2 75.2 83.0
7 San Francisco " Unit… 93.9 108. 101. 97.0
8 Oakland " Unit… 92.9 87.8 90.5 98.5
9 Santa Clara " Unit… 89.4 90.4 89.9 101.
10 Seattle " Unit… 88.5 65.8 77.9 87.3
# ℹ 177 more rows
# ℹ abbreviated name: ¹cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
# local_purchasing_power_index <dbl>
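Removing every incomplete row can be costly. The counts shown earlier add up to 3942 + 682 = 4624 cells, which across 8 columns is 578 rows, so only 187 of 578 rows survive here. A minimal sketch for checking this before deleting anything uses base R's complete.cases(). The toy data frame is hypothetical.

```r
# Hypothetical toy data, not the real costofliving dataset.
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(NA, 131, NA, 128),
  rent_index = c(96, 69, 49, NA)
)

# complete.cases() is TRUE for rows that contain no NA at all.
n_complete <- sum(complete.cases(toy))

# na.omit() keeps exactly those rows.
cleaned <- na.omit(toy)

n_complete
nrow(cleaned)
cleaned
```

Comparing n_complete with nrow(toy) shows how many rows would be lost before committing to the removal.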
If you would rather remove rows only when specific variables are missing, use drop_na() from the tidyr package and list the variables to check. Rows with a missing value in any of the listed variables are removed; missing values in the other variables are left in place.
costofliving %>%
  drop_na(country, city, cofi)
# A tibble: 427 × 8
city country cofi rent_index cost_of_living_plus_…¹ groceries_index
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 Zurich Switzerland 131. 69.3 102. 136.
2 Zug Switzerland 128. 72.1 102. 133.
3 Lugano Switzerland 124. 45.0 87.0 129.
4 Lausanne Switzerland 122. 59.6 92.7 123.
5 Beirut Lebanon 120. NA 77.0 141.
6 Bern Switzerland 118. 46.1 NA NA
7 Stavanger Norway 105. 35.4 72.2 102.
8 Oslo Norway 102. 46.4 76.1 97.6
9 Bergen Norway 100. 34.8 69.7 96.2
10 Trondheim Norway 99.4 37.7 70.5 95.1
# ℹ 417 more rows
# ℹ abbreviated name: ¹cost_of_living_plus_rent_index
# ℹ 2 more variables: restaurant_price_index <dbl>,
# local_purchasing_power_index <dbl>
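Note that rows such as Beirut and Bern remain in the output above even though they have NA in other columns, because only country, city and cofi were checked. The following sketch reproduces that behaviour on a small hypothetical data frame and, for comparison, shows a base-R filter that should select the same rows. It assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical toy data, not the real costofliving dataset.
toy <- data.frame(
  city       = c("A", "B", "C", "D"),
  cofi       = c(NA, 131, NA, 128),
  rent_index = c(96, 69, 49, NA)
)

# Drop rows only where cofi is missing; NA in rent_index is tolerated.
kept <- drop_na(toy, cofi)

# Base-R filter on the same condition, for comparison.
kept_base <- toy[!is.na(toy$cofi), ]

kept
```

City D stays in the result despite its missing rent_index, which mirrors what happens to Beirut in the real data.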