SQLite, JSON and denormalization

I was reading SQLite's documentation and I discovered an extension called JSON1 which, as you might guess, allows us to manipulate JSON inside SQLite.
In this article, we'll see how we can use this extension on a denormalized database, and the benefits it can have on read performance.
The code used to generate the database is available here.
What is denormalization?
According to Wikipedia, "denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data."
I'll add that it's a mechanism in which we allow ourselves to ignore the good practices taught by our teachers #ThugLife. We forget about third normal form by putting data from join tables into "normal" tables. If it's still not clear, the following example should help.
The tables
Instead of using the classical example with posts and tags, this time we'll use wines and bars. We'll have a many-to-many relationship where each bar can serve multiple wines and each wine can be served in multiple bars.
So, we'll need four tables:
- wines
- bars_normalized
- bars_wines
- bars_denormalized
CREATE TABLE IF NOT EXISTS wines (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    year INTEGER
);
CREATE TABLE IF NOT EXISTS bars_normalized (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT
);
CREATE TABLE IF NOT EXISTS bars_wines (
    bar_id TEXT,
    wine_id TEXT
);
CREATE TABLE IF NOT EXISTS bars_denormalized (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    wines_ids JSON /* will contain: ["wine-uuid-1","wine-uuid-234"...] */
);
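To make the examples concrete, here is a minimal sketch of how the normalized side could be populated (every name and uuid below is made up for illustration):

INSERT INTO wines (id, name, country, year)
VALUES ('wine-uuid-1', 'Quinta do Vale', 'Portugal', 2019);
INSERT INTO bars_normalized (id, name, country)
VALUES ('bar-uuid-3', 'La taverne', 'United Kingdom');
/* the join table links a bar to each wine it serves */
INSERT INTO bars_wines (bar_id, wine_id)
VALUES ('bar-uuid-3', 'wine-uuid-1');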
JSON1 presentation
Let's look at how we can perform the usual operations on bars_denormalized.
Inserting
INSERT INTO bars_denormalized (id, name, country, wines_ids)
VALUES (
    'random-uuid',
    'La taverne',
    'United Kingdom',
    /* The json() function validates the provided JSON
       and also removes useless spaces. */
    json('["wine-uuid-1", "wine-uuid-45", "wine-uuid-423"]')
);
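As a small illustration, json() both rejects malformed input and minifies valid input; this can be checked directly in the sqlite3 shell:

SELECT json(' ["wine-uuid-1",   "wine-uuid-45"] ');
/* returns: ["wine-uuid-1","wine-uuid-45"]
   while json('[broken') would raise a "malformed JSON" error */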
Reading
SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
/* 1 - The json_each() function gives us the ability
   to iterate over a JSON field */
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
/* 2 - Once a field has been selected with json_each(),
   we can use json_each.value to get the value of each item
   of the array */
WHERE json_each.value = 'wine-uuid-1';
The query above will, for each bar, iterate over the wines_ids array and check whether it contains wine-uuid-1; only the bars serving wine-uuid-1 are returned. Note that JSON1 also provides other lookup functions, such as json_extract and json_tree.
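For instance, json_extract() fetches the item at a given path; a minimal sketch, reusing the hypothetical bar-uuid-3 row:

/* '$[0]' points to the first slot of the array */
SELECT json_extract(wines_ids, '$[0]') AS first_wine_id
FROM bars_denormalized
WHERE id = 'bar-uuid-3';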
Updating and deleting
Both operations act almost the same way. In the first case, we'll push a new item into the wines_ids array; in the second, we'll remove one.
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We insert the data at the given index, in other words,
       at the end of the array */
    SELECT json_insert(
        bars_denormalized.wines_ids,
        /* 1 - We get the length of the array, which will be the index
           of the new item */
        '$[' || json_array_length(bars_denormalized.wines_ids) || ']',
        'wine-uuid-297'
    )
)
WHERE id = 'bar-uuid-3';
The json_insert() function takes three arguments:
- the field in which we want to insert a new value
- the path of the index where the value should be inserted
- the value itself
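As a side note, recent versions of SQLite (3.18.0 and later, if I'm not mistaken) accept the '$[#]' path notation, which points right past the end of an array, so the append can be written without computing the length first:

/* same append as above, using the '#' array-end notation */
UPDATE bars_denormalized
SET wines_ids = json_insert(wines_ids, '$[#]', 'wine-uuid-297')
WHERE id = 'bar-uuid-3';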
/* 3 - We update wines_ids with our new array */
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We return an array without the item that should be removed */
    SELECT json_remove(
        bars_denormalized.wines_ids,
        '$[' || (
            /* 1 - We search the index of the item to remove */
            SELECT json_each.key
            FROM json_each(bars_denormalized.wines_ids)
            WHERE json_each.value = 'wine-uuid-297'
        ) || ']'
    )
)
WHERE id = 'bar-uuid-3';
The json_remove() function takes two arguments:
- the field from which we want to remove a value
- the path of the index of the value that we want to remove
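To double-check the removal, a quick verification query (same hypothetical ids as above):

/* 'wine-uuid-297' should be gone and the array one item shorter */
SELECT wines_ids, json_array_length(wines_ids) AS nb_wines
FROM bars_denormalized
WHERE id = 'bar-uuid-3';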
Queries
In order to test the performance of both data models, we'll run two queries:
- obtain the names of the bars serving wines more recent than 2018
- obtain the names of the Portuguese wines served in UK bars
With these two queries, we'll be able to see how to fetch data starting from each table.
For the normalized version, we'll make a join between the wines, bars_normalized and bars_wines tables. After the join, we'll apply a filter based on our needs.
SELECT DISTINCT bars_normalized.id, bars_normalized.name
FROM bars_normalized
LEFT OUTER JOIN bars_wines
    ON bars_normalized.id = bars_wines.bar_id
LEFT OUTER JOIN wines
    ON wines.id = bars_wines.wine_id
WHERE wines.year > 2018;

SELECT DISTINCT wines.id, wines.name
FROM wines
LEFT OUTER JOIN bars_wines
    ON wines.id = bars_wines.wine_id
LEFT OUTER JOIN bars_normalized
    ON bars_normalized.id = bars_wines.bar_id
WHERE bars_normalized.country = 'United Kingdom'
  AND wines.country = 'Portugal';
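A side note: bars_wines was created without any index, so the joins above scan the whole table. If you want to experiment on your side, indexing the join table (hypothetical index names below) should narrow the gap measured later in this article:

CREATE INDEX IF NOT EXISTS idx_bars_wines_bar_id ON bars_wines (bar_id);
CREATE INDEX IF NOT EXISTS idx_bars_wines_wine_id ON bars_wines (wine_id);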
For the denormalized version, we'll use the functions provided by the JSON1 extension, mainly json_each().
SELECT DISTINCT bars_denormalized.id, name
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
WHERE json_each.value IN (
    SELECT id
    FROM wines
    WHERE year > 2018
);

SELECT DISTINCT wines.id, wines.name
FROM wines
WHERE id IN (
    SELECT DISTINCT json_each.value AS id
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE bars_denormalized.country = 'United Kingdom'
)
  AND wines.country = 'Portugal';
Let's talk about the query above, or more precisely about its sub-query. If we just read it, it can seem strange that json_each.value appears in the SELECT clause while the json_each() function is only introduced afterwards, in the FROM clause. Even if it looks odd in reading order, it's perfectly valid from a technical point of view: the FROM clause is evaluated before the SELECT clause.
Results
Reads
I ran tests with various parameters, changing some of them on each query. In the table below, the numbers in parentheses represent (number of bars, number of wines, number of wines per bar).
In order to get the results, I ran each query five times and took the median value.
|                            | Normalized | Denormalized |
| -------------------------- | ---------- | ------------ |
| Query 1 (500, 1000, 100)   | 874ms      | 74ms         |
| Query 1 (5000, 10000, 100) | 9143ms     | 569ms        |
| Query 1 (500, 1000, 500)   | 4212ms     | 197ms        |
| Query 2 (500, 1000, 100)   | 726ms      | 16ms         |
| Query 2 (5000, 10000, 100) | 8648ms     | 83ms         |
| Query 2 (500, 1000, 500)   | 4132ms     | 27ms         |
Obviously, the denormalized version is much faster than the normalized one. Joins are heavy operations, and for the normalized database, the duration is correlated to the size of the join table (number of bars × wines per bar). Let's take query 1 as an example:
- 50,000 entries → 874ms
- 500,000 entries → 9143ms (×10 compared to the previous line)
- 250,000 entries → 4212ms (÷2 compared to the previous line)
We can observe similar results on query 2: when we increase the amount of data, the duration grows linearly. The denormalized form doesn't follow the same pattern and handles the growth of data much better.
In addition, we can see that query 2 is more performant than query 1: querying from the table that doesn't hold the ids array is faster. So, in my example, if I often need to load data related to the bars, I should put a bars_ids array on the wines table rather than a wines_ids array on the bars table.
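Here is what that alternative could look like, as a hypothetical sketch (this wines_denormalized table isn't part of the benchmark above):

CREATE TABLE IF NOT EXISTS wines_denormalized (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    year INTEGER,
    bars_ids JSON /* will contain: ["bar-uuid-1","bar-uuid-17"...] */
);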
Table sizes

These statistics are provided by sqlite3_analyzer. The denormalized version is two times lighter than the normalized one, which includes both bars_normalized and bars_wines. So we're winning on performance and on database size.
Conclusion
Denormalization is a great way to save space and improve read performance. But we have to be more careful about write operations: during deletions, for example, we won't be able to rely on the cascade deletions provided by the RDBMS.
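Removing a wine, for instance, means scrubbing its id from every bar by hand; a minimal sketch, using the json_group_array() aggregate from JSON1 and the hypothetical ids from earlier:

/* no ON DELETE CASCADE can reach inside a JSON array,
   so we rebuild every wines_ids without the deleted id */
DELETE FROM wines WHERE id = 'wine-uuid-297';
UPDATE bars_denormalized
SET wines_ids = (
    SELECT json_group_array(value)
    FROM json_each(bars_denormalized.wines_ids)
    WHERE value <> 'wine-uuid-297'
);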
Through this article, I wanted to show that denormalization isn't a concept reserved for NoSQL. Based on my research, MySQL/MariaDB and PostgreSQL also provide JSON manipulation. I'm not saying that NoSQL databases are bad, they can be better suited to certain use cases, Redis and ElasticSearch being perfect examples. But it's not necessary to throw away solutions that have stood the test of time.
Thanks for reading.