In this article, we'll see how we can use this extension on a denormalized database, and the benefits it can bring to read performance.
The code used to generate the database is available here
According to Wikipedia, "denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data."
I'll add that it's a mechanism in which we allow ourselves to ignore the good practices taught by our teachers #ThugLife. We forget about the third normal form by moving data from join tables into "normal" tables. If it's still not clear, the following example should help.
Instead of using the classical example with posts and tags, this time we'll use wines and bars. We'll have a many-to-many relationship where each bar can serve multiple wines and each wine can be served in multiple bars.
So, we'll need four tables:

```sql
CREATE TABLE IF NOT EXISTS wines (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    year INTEGER
);

CREATE TABLE IF NOT EXISTS bars_normalized (
    id TEXT PRIMARY KEY,
    name TEXT
);

CREATE TABLE IF NOT EXISTS bars_wines (
    bar_id TEXT,
    wine_id TEXT
);

CREATE TABLE IF NOT EXISTS bars_denormalized (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    wines_ids JSON /* will contain: ["wine-uuid-1","wine-uuid-234"...] */
);
```
Let's look at how we can perform the usual operations on `bars_denormalized`.
```sql
INSERT INTO bars_denormalized (id, name, wines_ids)
VALUES (
    'random-uuid',
    'La taverne',
    /* The json() function validates the provided JSON and also removes useless spaces. */
    json('["wine-uuid-1", "wine-uuid-45", "wine-uuid-423"]')
);
```
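The behaviour of `json()` is easy to verify from Python's built-in `sqlite3` module. This is a minimal sketch against an in-memory database; note how the function minifies valid JSON and rejects malformed input:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# json() minifies the text: useless spaces are removed.
minified = conn.execute(
    """SELECT json('["wine-uuid-1",  "wine-uuid-45" ]')"""
).fetchone()[0]
print(minified)  # ["wine-uuid-1","wine-uuid-45"]

# json() validates its input: malformed JSON raises an error
# instead of being silently stored.
try:
    conn.execute("SELECT json('[not json')")
    valid = True
except sqlite3.OperationalError:
    valid = False
print(valid)  # False
```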
```sql
SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
/* 1 - The json_each() function gives us the ability to iterate over a field */
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
/* 2 - Once a field has been expanded with json_each(), we can use json_each.value to get the value of each item of the array */
WHERE json_each.value = 'wine-uuid-1';
```
The query above will, for each bar, iterate over the `wines_ids` array and, for each item of this array, check whether the value equals `wine-uuid-1`. In the end, the query returns all bars that serve the wine identified by `wine-uuid-1`.
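To make this concrete, here's a self-contained sketch using Python's built-in `sqlite3` module (the bar ids and names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.executemany(
    "INSERT INTO bars_denormalized (id, name, wines_ids) VALUES (?, ?, json(?))",
    [
        ("bar-1", "La taverne", '["wine-uuid-1", "wine-uuid-45"]'),
        ("bar-2", "Le comptoir", '["wine-uuid-45", "wine-uuid-423"]'),
    ],
)

# json_each() expands wines_ids into one row per array item,
# so the WHERE clause can filter on individual values.
rows = conn.execute("""
    SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE json_each.value = 'wine-uuid-1'
""").fetchall()
print(rows)  # [('bar-1', 'La taverne')]
```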
If we had stored an object instead of an array of strings, we could have used the json_extract function. And for an array of objects, we could combine json_each with json_extract on each value.
Both operations act almost the same way. In the first case, we'll push a new item into the `wines_ids` array, and in the second case, we'll remove one.
```sql
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We insert the data at the computed index, in other words, at the end of the array */
    SELECT json_insert(
        bars_denormalized.wines_ids,
        /* 1 - We get the length of the array, which will be the index of the new item */
        '$[' || json_array_length(bars_denormalized.wines_ids) || ']',
        'wine-uuid-297'
    )
)
WHERE id = 'bar-uuid-3';
```
The json_insert function takes three parameters:
- the field into which we want to insert a new value
- the index where the value should be inserted
- the value
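Here's a minimal sketch of this append pattern, run from Python's built-in `sqlite3` module (the bar and wine ids are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.execute("""INSERT INTO bars_denormalized (id, name, wines_ids)
    VALUES ('bar-uuid-3', 'La taverne', json('["wine-uuid-1"]'))""")

# json_array_length() gives the index of the first free slot,
# so the path '$[length]' appends at the end of the array.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_insert(
            bars_denormalized.wines_ids,
            '$[' || json_array_length(bars_denormalized.wines_ids) || ']',
            'wine-uuid-297'
        )
    )
    WHERE id = 'bar-uuid-3'
""")

wines = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'bar-uuid-3'"
).fetchone()[0]
print(wines)  # ["wine-uuid-1","wine-uuid-297"]
```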
```sql
/* 3 - We update wines_ids with our new array */
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We build an array without the item that should be removed */
    SELECT json_remove(
        bars_denormalized.wines_ids,
        '$[' || (
            /* 1 - We search for the index of the item to remove */
            SELECT json_each.key
            FROM json_each(bars_denormalized.wines_ids)
            WHERE json_each.value = 'wine-uuid-297'
        ) || ']'
    )
)
WHERE id = 'bar-uuid-3';
```
The json_remove function takes two parameters:
- the field from which we want to remove a value
- the index of the value we want to remove
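And the matching removal pattern, again sketched with Python's `sqlite3` module on throwaway data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.execute("""INSERT INTO bars_denormalized (id, name, wines_ids)
    VALUES ('bar-uuid-3', 'La taverne',
            json('["wine-uuid-1", "wine-uuid-297", "wine-uuid-45"]'))""")

# The inner SELECT finds the array index (json_each.key) of the item
# to drop; json_remove() then rebuilds the array without it.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_remove(
            bars_denormalized.wines_ids,
            '$[' || (
                SELECT json_each.key
                FROM json_each(bars_denormalized.wines_ids)
                WHERE json_each.value = 'wine-uuid-297'
            ) || ']'
        )
    )
    WHERE id = 'bar-uuid-3'
""")

wines = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'bar-uuid-3'"
).fetchone()[0]
print(wines)  # ["wine-uuid-1","wine-uuid-45"]
```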
In order to test the performance of both data models, we'll run two queries:
- obtain the names of the bars serving wines more recent than 2018
- obtain the names of the Portuguese wines served in UK bars
With these two queries, we'll be able to see how to get data from each table.
For the normalized version, we'll perform a join between the wines, bars_normalized and bars_wines tables. After the join, we'll apply a filter based on our needs.
```sql
SELECT DISTINCT bars_normalized.id, bars_normalized.name
FROM bars_normalized
LEFT OUTER JOIN bars_wines ON bars_normalized.id = bars_wines.bar_id
LEFT OUTER JOIN wines ON wines.id = bars_wines.wine_id
WHERE year > 2018;
```
```sql
SELECT DISTINCT wines.id, wines.name
FROM wines
LEFT OUTER JOIN bars_wines ON wines.id = bars_wines.wine_id
LEFT OUTER JOIN bars_normalized ON bars_normalized.id = bars_wines.bar_id
WHERE bars_normalized.country = 'United Kingdom'
AND wines.country = 'Portugal';
```
For the denormalized version, we'll use the functions provided by the JSON1 extension, `json_each` in this case.
```sql
SELECT DISTINCT bars_denormalized.id, name
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
WHERE json_each.value IN (
    SELECT id FROM wines WHERE year > 2018
);
```
```sql
SELECT DISTINCT wines.id, wines.name
FROM wines
WHERE id IN (
    SELECT DISTINCT json_each.value AS id
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE bars_denormalized.country = 'United Kingdom'
)
AND wines.country = 'Portugal';
```
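Before benchmarking, it's worth checking that both models agree. Here's a tiny fixture (hypothetical ids, names and years) where query n°1 is run in both its normalized and denormalized forms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wines (id TEXT PRIMARY KEY, name TEXT, country TEXT, year INTEGER);
CREATE TABLE bars_normalized (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE bars_wines (bar_id TEXT, wine_id TEXT);
CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON);

INSERT INTO wines VALUES
    ('w1', 'Vinho Verde', 'Portugal', 2019),
    ('w2', 'Bordeaux', 'France', 2015);
INSERT INTO bars_normalized VALUES ('b1', 'La taverne'), ('b2', 'Le zinc');
INSERT INTO bars_wines VALUES ('b1', 'w1'), ('b2', 'w2');
INSERT INTO bars_denormalized VALUES
    ('b1', 'La taverne', 'United Kingdom', json('["w1"]')),
    ('b2', 'Le zinc', 'France', json('["w2"]'));
""")

# Query 1, normalized: two joins, then filter on the wine year.
normalized = conn.execute("""
    SELECT DISTINCT bars_normalized.id, bars_normalized.name
    FROM bars_normalized
    LEFT OUTER JOIN bars_wines ON bars_normalized.id = bars_wines.bar_id
    LEFT OUTER JOIN wines ON wines.id = bars_wines.wine_id
    WHERE wines.year > 2018
""").fetchall()

# Query 1, denormalized: json_each() replaces the join table.
denormalized = conn.execute("""
    SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE json_each.value IN (SELECT id FROM wines WHERE year > 2018)
""").fetchall()

print(normalized == denormalized, normalized)  # True [('b1', 'La taverne')]
```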
Let's talk about the query above, or more precisely about its sub-query. Reading it top to bottom, it can look strange that I'm using `json_each.value` inside the `SELECT` even though it only gets defined in the `FROM` on the next line. That's the point! It's necessary to understand that relational databases don't process queries in reading order: they execute them as described in the following schema.
Even if it can seem strange given the reading direction, it's perfectly valid on the technical side.
I've run tests with various parameters, changing some of them for each query. In the results below, the numbers in parentheses represent: (number of bars, number of wines, number of wines per bar).
In order to get the results, I ran each query five times and took the median value.
| Query | Normalized | Denormalized |
|---|---|---|
| Query 1 (500, 1000, 100) | 874ms | 74ms |
| Query 1 (5000, 10000, 100) | 9143ms | 569ms |
| Query 1 (500, 1000, 500) | 4212ms | 197ms |
| Query 2 (500, 1000, 100) | 726ms | 16ms |
| Query 2 (5000, 10000, 100) | 8648ms | 83ms |
| Query 2 (500, 1000, 500) | 4132ms | 27ms |
Obviously, the denormalized version is much faster than the normalized one. Joins are heavy operations. For normalized databases, the duration is correlated with the size of the join table. Let's take query 1 as an example:
- 50,000 entries → 874ms
- 500,000 entries → 9143ms (×10 compared to the previous)
- 250,000 entries → 4212ms (÷2 compared to the previous)
We can observe similar results for query n°2. When we increase the amount of data, the duration grows linearly with it. We don't see the same behaviour with the denormalized form, which handles data growth much better.
In addition, we can see that query n°2 performs better than query n°1. It means that querying from the table that doesn't hold the ids array is faster. So, in my example, if I often need to load data related to bars, I should put a `bars_ids` array inside the `wines` table instead of keeping `wines_ids` inside `bars_denormalized`.
These statistics are provided by sqlite3_analyzer. The denormalized version is two times lighter than the normalized one (which includes bars_normalized and bars_wines). We win on performance and on database size.
Denormalization is a great way to save space and improve read performance. But we have to pay more attention to write operations. For example, during deletions, we won't be able to rely on the cascade deletes provided by RDBMS.
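For instance, removing a wine entirely takes two statements in the denormalized model, since no foreign key links the array items back to the wines table. This is a minimal sketch (made-up ids; triggers would be another option):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wines (id TEXT PRIMARY KEY, name TEXT, country TEXT, year INTEGER);
CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON);
INSERT INTO wines VALUES ('w1', 'Vinho Verde', 'Portugal', 2019);
INSERT INTO bars_denormalized VALUES
    ('b1', 'La taverne', 'United Kingdom', json('["w1", "w2"]'));
""")

wine_id = "w1"
# 1 - Delete the wine itself.
conn.execute("DELETE FROM wines WHERE id = ?", (wine_id,))
# 2 - Manually scrub the id from every bar's array: no ON DELETE CASCADE
#     can reach inside a JSON column.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_remove(
            wines_ids,
            '$[' || (
                SELECT json_each.key FROM json_each(wines_ids)
                WHERE json_each.value = :wid
            ) || ']'
        )
    )
    WHERE EXISTS (
        SELECT 1 FROM json_each(wines_ids) WHERE json_each.value = :wid
    )
""", {"wid": wine_id})

wines_ids = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'b1'"
).fetchone()[0]
print(wines_ids)  # ["w2"]
```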
Through this article, I wanted to show that denormalization isn't a concept reserved for NoSQL. Based on my research, MySQL/MariaDB and PostgreSQL also provide JSON manipulation functions. I'm not saying that NoSQL databases are bad; they can be better suited to certain use cases, and Redis and Elasticsearch are perfect examples. But it's not necessary to throw away solutions that have stood the test of time.
Thanks for reading.