SQLite, json and denormalization
I was reading SQLite’s documentation and I discovered an extension called JSON1, which, you guess, allows us to manipulate JSON inside SQLite.
In this article, we’ll see how we can use this extension on a denormalized database. And the benefits it can have on the read performances
The code used to generate the database is available here
What is denormalization?
According to Wikipedia, “denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.”
I’ll add that it’s a mechanism in which we’re allowing ourselves to not use the goods practices taught by our teachers #ThugLife. We forget about 3rd normal form by putting data from joins tables into “normals tables”. If it’s still not clear, the following example should help.
The tables
Instead of using the classical example with posts and tags, this time, we’ll use wines and bars. We’ll have a many-to-many relationship where each bar can serve multiple wines and each wine can be served into multiple bars.
So, we’ll need four tables :
- wines
- bars_normalized
- bars_wines
- bars_denormalized
CREATE TABLE IF NOT EXISTS wines (
id TEXT PRIMARY KEY,
name TEXT,
country TEXT,
year INTEGER
);
CREATE TABLE IF NOT EXISTS bars_normalized (
id TEXT PRIMARY KEY,
name TEXT
);
CREATE TABLE IF NOT EXISTS bars_wines (
bar_id TEXT,
wine_id TEXT
);
CREATE TABLE IF NOT EXISTS bars_denormalized (
id TEXT PRIMARY KEY,
name TEXT,
country TEXT,
wines_ids JSON /* will contain: ["wine-uuid-1","wine-uuid-234"...] */
);
JSON1 presentation
Let’s look at how we can do usuals operations on bar_denormalized.
Inserting
INSERT INTO bars_denormalized (id, name, wines_ids)
VALUES (
'random-uuid',
'La taverne',
/* The json() function validate the provided JSON
and also remove useless spaces. */
json('["wine-uuid-1", "wine-uuid-45", "wine-uuid-423"]')
);
Reading
SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
/* 1 - The json_each() function gives us the ability
to iterate over a field */
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
/* 2 - When a field has been selected with json_each(),
we can use json_each.value to get the value of each item
of the array */
WHERE json_each.value = 'wine-uuid-1';
The query above will, for each bar, iterate over the wines_ids
array. And for each item of this array, it will verify if this value equals wine-uuid-1
.
Finally, this query is returning all bars that serve wines having wine-uuid-1
as an identifier.
If we had an object we could have used the json_extract
function. And for an array of objects, the json_tree
function.
Updating and deleting
Both operations will almost act the same. In the first case, we’ll push a new item into the wines_ids
array and in the second case, we’ll remove one.
UPDATE bars_denormalized
SET wines_ids = (
/* 2 - We insert the data at the given index, in other words,
the end of the array */
SELECT json_insert(
bars_denormalized.wines_ids,
/* 1 - We get the length of the array, which will be the index
of the new item */
'$[' || json_array_length(bars_denormalized.wines_ids) || ']',
'wine-uuid-297'
)
)
WHERE id = 'bar-uuid-3';
The json_insert
function takes three parameters :
- the field we want insert a new value
- the index where the value should be inserted
- the value
/* 3 - We update wines_ids with our new array */
UPDATE bars_denormalized
SET wines_ids = (
/* 2 - We return an array without the item that should be removed */
SELECT json_remove(bars_denormalized.wines_ids,
'$[' || (
/* 1 - We search the index of the item to remove */
SELECT json_each.key
FROM json_each(bars_denormalized.wines_ids)
WHERE json_each.value = 'wine-uuid-297'
) || ']'
)
WHERE id = 'bar-uuid-3';
The json_remove
function takes two parameters :
- the field we want to remove a value
- the index of the value that we want to remove
Queries
In order to test the performances, of both data models, we’ll perform two queries :
- obtain the names of the bars serving wines that are more recent than 2018
- obtain the names of the wines of Portuguese’s wines served in UK’s bars
With these two queries, we’ll be able to see how to get data ob each table.
For the normalized version, we’ll make a joint between the wines, bars_normalized and bars_wines tables. After the join, we’ll apply a filter based on our needs.
SELECT DISTINCT bars_normalized.id, bars_normalized.name
FROM bars_normalized
LEFT OUTER JOIN bars_wines
ON bars_normalized.id = bars_wines.bar_id
LEFT OUTER JOIN wines
ON wines.id = bars_wines.wine_id
WHERE year > 2018;
SELECT DISTINCT wines.id, wines.name
FROM wines
LEFT OUTER JOIN bars_wines
ON wines.id = bars_wines.wine_id
LEFT OUTER JOIN bars_normalized
ON bars_normalized.id = bars_wines.bar_id
WHERE bars_normalized.country = 'United Kingdom'
AND wines.country = 'Portugal';
For the denormalized version, we’ll use the functions which are provided by the JSON1 extension, json_each
in this case.
SELECT DISTINCT bars_denormalized.id, name
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
WHERE json_each.value IN (
SELECT id
FROM wines
WHERE year > 2018
);
SELECT DISTINCT wines.id, wines.name
FROM wines
WHERE id IN (
SELECT DISTINCT json_each.value AS id
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
WHERE bars_denormalized.country = 'United Kingdom'
) AND wines.country = 'Portugal';
Let’s talk about the query above, or more precisely about the sub-query. If we just read it, it can be strange to see that I’m using the json_each.value
inside the SELECT
whereas I will get this information into the FROM
which is the next line. That’s the point! It’s necessary to understand that relatinnal databases are treating queries. They are executed like described in the following schema.
Even if it can be strange according to the reading direction, it’s perfectly valid on a technical side.
Results
Reads
I’ve made tests with various parameters. On each query, I changed some parameters. On queries below, numbers in parentheses represents : (number of bars, number of wines, number of wines per bar).
In ordet to get the results, I ran each queries five times and I took the median value.
Normalized | Denormalized | |
---|---|---|
Query 1 (500, 1000, 100) | 874ms | 74ms |
Query 1 (5000, 10000, 100) | 9143ms | 569ms |
Query 1 (500, 1000, 500) | 4212ms | 197ms |
Query 2 (500, 1000, 100) | 726ms | 16ms |
Query 2 (5000, 10000, 100) | 8648ms | 83ms |
Query 2 (500, 1000, 500) | 4132ms | 27ms |
Obviously, the denormalized version is much faster than the normalized one. Joints are heavy operations. For normalized databases, duration is correlated to the join table size. Let’s take query 1 as an example :
- 50000 entrée → 874ms
- 500000 entrée → 9143ms (x10 based on previous)
- 250000 entrée → 4212ms (÷2 based on previous)
We can observe similar results on query n°2. When we increase the amount of data, the duration grows linearly. We don’t see the same results with the denormalized form which can better handle the growth of data.
In addition, we can see that query n°2 is more performant than query n°1. It means that making query on the table that doesn’t contain the ids array. So, in my example, if I need to often load data related to the bar, I should put bars_ids
inside table wines
instead of keeping wines_ids
.
Tables size
These statistics are provided by sqlite3_analyzer. The denormalized version is two times lighter than the normalized one which includes bars_normalized and bars_wines. We’re winning on performances and on database size.
Conclusion
Denormalization is a great solution to gain space and reading performances. But, we have to care more about writing operations. For example, during deletions, we will not be able to use the cascade deletions provided by RDBMS.
Through this article, I wanted to show that denormalization isn’t a concept reserved for NoSQL. Based on my researches, MySQL/MariaDB and PostgreSQL also provide JSON manipulations. I’m not saying that NoSQL databases are bad, they can be more adapted in certain use cases, Redis and ElasticSearch are perfect examples. But it’s not necessary to throw away solutions that resist time.
Thanks for reading.