23 Databases and SQL

Author

Solveig Bjørkholt

When we have large amounts of structured data, we often place it in a database. In companies that rely on data (which is a large amount of companies in these days), it is very normal to have databases. One database might for example contain lots of tables detailing information on the work-life on Norwegian inhabitants, including one table on salaries and one on absence. The unit in these tables would be the same – Norwegian inhabitants – which makes the tables relational. You could merge them and create one big table with all variables. However, since this would overload both us and the computer with unnecessary information and make everything slower, we split it up into tables and pick and choose what we need from each table instead.

There are several different databases, for example MySQL, Oracle and PostgreSQL. All of these databases are relational. Then we also have non-relational databases such as MongoDB, where the relationship between different data sources is shown in a different way. Non-relational databases work better for semi-structured and unstructured data. We are not going to go into the details on these different databases here, but it’s good to know that they exist.

23.1 SQLite

Databases can be large-scale and tricky, but they don’t have to be. There is a smooth way into the world of databases, and that is SQLite. SQLite is a program that allows you to create databases for yourself, your friends and others who might be interested, and it’s quite well integrated with R.

We will give an example on how to create your own database using SQLite. First, we create two datasets to add to the database .

dataset_salaries <- as_tibble(list(id = c(1, 2, 3, 4, 5),
                                   age = c(43, 53, 25, 37, 41),
                                   occupation = c("carpenter", "doctor", "accountant", "priest", "reseptionist"),
                                   salary = c(5300, 6800, 2100, 1600, 1300)))

dataset_absence <- as_tibble(list(id = c(1, 2, 3, 4, 5),
                                  age = c(43, 53, 25, 37, 41),
                                  absence_this_year = c(1, 0, 0, 0, 1)))

Then, we need to get the two packages DBI to work with databases in R and RSQLite to work with SQLite in R. If this is the first time you use these packages, remember to install them first using install.packages.

library(DBI)
library(RSQLite)

Warning: package 'RSQLite' was built under R version 4.2.3

In the next step, we create a database and save it on our computer. If you want to just save it in the short-term memory and do not care about whether you save your data in the long run, you can write `“:memory:”``in the last argument.

con <- dbConnect(RSQLite::SQLite(), "../datafolder/worklife.sqlite")

We now have an empty database which we can fill with the datasets we made above. To add dataset_salaries and dataset_absence, use dbCreateTable. It takes an R dataframe and converts it to a table inside the database. This function takes three arguments: dbCreateTable(conn, name, field):

Conn: The DBI connection object.
Name: The name of the new table.
Field: The data you want to insert. It has to be either a data frame or a character vector.

dbListTables will then show the tables in the database.

dbWriteTable(con, "salaries", dataset_salaries)
dbWriteTable(con, "absence", dataset_absence)

dbListTables(con)

To add more data to your database, use can use dbWriteTable.

salaries_datanew <- as_tibble(list(id = c(6, 7),
                                   age = c(31, 73),
                                   occupation = c("doctor", "self-employed"),
                                   salary = c(7100, 1100)))

dbWriteTable(con, name = "salaries", value = salaries_datanew, append = TRUE)

And to see what the table contains and/or fetch it into R, use dbReadTable.

dbReadTable(con, "salaries")

   id age    occupation salary
1   6  31        doctor   7100
2   7  73 self-employed   1100
3   6  31        doctor   7100
4   7  73 self-employed   1100
5   6  31        doctor   7100
6   7  73 self-employed   1100
7   6  31        doctor   7100
8   7  73 self-employed   1100
9   6  31        doctor   7100
10  7  73 self-employed   1100
11  6  31        doctor   7100
12  7  73 self-employed   1100

df <- dbReadTable(con, "salaries")

To remove a table from a database, use dbRemoveTable.

dbRemoveTable(con, "absence")

dbListTables(con)

[1] "salaries"

To disconnect from the database:

dbDisconnect(con)

To learn more on how work with SQLite in R, you can visit this page.

23.2 SQL

SQL stands for Structured Query Language. It’s the computer language used to store, manipulate and retrieve data from relational databases. This is done through “SQL queries”, a chuck of code that tells the computer to do something with a database, for example fetch some data from it.

The above examples uses R-code to work with databases, but this is actually just a wrapper around SQL. Behind the curtains, they are not R-queries but SQL-queries. It’s generally good to know a little SQL, because then you can work with databases from any program. Here, we offer a very brief introduction.

To work with SQL in R, first load the package sqldf.

library(sqldf)

Loading required package: gsubfn

Loading required package: proto

Now, you can write SQL-code in R.

sqldf('SELECT age
       FROM salaries
       WHERE occupation = "doctor"')

Of course, this all depends on the kind of database you use again (MySQL, Oracle, PostgreSQL, etc.). If we wanted to use SQL to read from the worklife.sqlite file, we could use the dbGetQuery function, add the name of the database-connection object and then the SQL-code. For example, the code below extracts the variable age from the table object salaries and filters the variable to where occupation is “doctor”. So we get the ages of the people in the dataset who are doctors.

con <- dbConnect(RSQLite::SQLite(), "../datafolder/worklife.sqlite")

dbGetQuery(con,
           'SELECT age
            FROM salaries
            WHERE occupation = "doctor"')

dbDisconnect(con)

To learn more about how to make SQL queries in R, visit this link

23.3 Publicly available datasets

There are lots of datasets out there, and many of them are even structured. Some of them might be useful for your projects. The department of political science has a webpage with many relevant datasets sorted by topic.

Other examples include:

Many researchers also publish data for their papers in Harvard dataverse, and there can be useful avenues for using these data on other things.