In this article, we'll see how we can use this extension on a denormalized database, and the benefits it can bring to read performance.
The code used to generate the database is available here
According to Wikipedia, "denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data."
I'll add that it's a mechanism in which we allow ourselves to ignore the good practices taught by our teachers #ThugLife. We forget about the third normal form by moving data from join tables into "normal" tables. If it's still not clear, the following example should help.
Instead of using the classical example with posts and tags, this time we'll use wines and bars. We'll have a many-to-many relationship where each bar can serve multiple wines and each wine can be served in multiple bars.
So, we'll need four tables:

```sql
CREATE TABLE IF NOT EXISTS wines (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    year INTEGER
);

CREATE TABLE IF NOT EXISTS bars_normalized (
    id TEXT PRIMARY KEY,
    name TEXT
);

CREATE TABLE IF NOT EXISTS bars_wines (
    bar_id TEXT,
    wine_id TEXT
);

CREATE TABLE IF NOT EXISTS bars_denormalized (
    id TEXT PRIMARY KEY,
    name TEXT,
    country TEXT,
    wines_ids JSON /* will contain: ["wine-uuid-1","wine-uuid-234"...] */
);
```
Let's look at how we can perform the usual operations on `bars_denormalized`.
```sql
INSERT INTO bars_denormalized (id, name, wines_ids)
VALUES (
    'random-uuid',
    'La taverne',
    /* The json() function validates the provided JSON and also removes useless spaces. */
    json('["wine-uuid-1", "wine-uuid-45", "wine-uuid-423"]')
);
```
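The behaviour of `json()` is easy to verify from Python's built-in `sqlite3` module. This is a minimal sketch against an in-memory database; note how the function minifies valid JSON and rejects malformed input:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# json() minifies the text: useless spaces are removed.
minified = conn.execute(
    """SELECT json('["wine-uuid-1",  "wine-uuid-45" ]')"""
).fetchone()[0]
print(minified)  # ["wine-uuid-1","wine-uuid-45"]

# json() validates its input: malformed JSON raises an error
# instead of being silently stored.
try:
    conn.execute("SELECT json('[not json')")
    valid = True
except sqlite3.OperationalError:
    valid = False
print(valid)  # False
```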
```sql
SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
/* 1 - The json_each() function gives us the ability to iterate over a field */
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
/* 2 - Once a field has been expanded with json_each(), we can use json_each.value to get the value of each item of the array */
WHERE json_each.value = 'wine-uuid-1';
```
The query above will, for each bar, iterate over the `wines_ids` array and, for each item of this array, check whether the value equals `wine-uuid-1`. In the end, the query returns all bars that serve the wine identified by `wine-uuid-1`.
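To make this concrete, here's a self-contained sketch using Python's built-in `sqlite3` module (the bar ids and names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.executemany(
    "INSERT INTO bars_denormalized (id, name, wines_ids) VALUES (?, ?, json(?))",
    [
        ("bar-1", "La taverne", '["wine-uuid-1", "wine-uuid-45"]'),
        ("bar-2", "Le comptoir", '["wine-uuid-45", "wine-uuid-423"]'),
    ],
)

# json_each() expands wines_ids into one row per array item,
# so the WHERE clause can filter on individual values.
rows = conn.execute("""
    SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE json_each.value = 'wine-uuid-1'
""").fetchall()
print(rows)  # [('bar-1', 'La taverne')]
```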
If we had stored an object instead of an array of strings, we could have used the json_extract function. And for an array of objects, we could combine json_each with json_extract on each value.
Both operations act almost the same way. In the first case, we'll push a new item into the `wines_ids` array, and in the second case, we'll remove one.
```sql
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We insert the data at the computed index, in other words, at the end of the array */
    SELECT json_insert(
        bars_denormalized.wines_ids,
        /* 1 - We get the length of the array, which will be the index of the new item */
        '$[' || json_array_length(bars_denormalized.wines_ids) || ']',
        'wine-uuid-297'
    )
)
WHERE id = 'bar-uuid-3';
```
The json_insert function takes three parameters:
- the field into which we want to insert a new value
- the index where the value should be inserted
- the value
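Here's a minimal sketch of this append pattern, run from Python's built-in `sqlite3` module (the bar and wine ids are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.execute("""INSERT INTO bars_denormalized (id, name, wines_ids)
    VALUES ('bar-uuid-3', 'La taverne', json('["wine-uuid-1"]'))""")

# json_array_length() gives the index of the first free slot,
# so the path '$[length]' appends at the end of the array.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_insert(
            bars_denormalized.wines_ids,
            '$[' || json_array_length(bars_denormalized.wines_ids) || ']',
            'wine-uuid-297'
        )
    )
    WHERE id = 'bar-uuid-3'
""")

wines = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'bar-uuid-3'"
).fetchone()[0]
print(wines)  # ["wine-uuid-1","wine-uuid-297"]
```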
```sql
/* 3 - We update wines_ids with our new array */
UPDATE bars_denormalized
SET wines_ids = (
    /* 2 - We build an array without the item that should be removed */
    SELECT json_remove(
        bars_denormalized.wines_ids,
        '$[' || (
            /* 1 - We search for the index of the item to remove */
            SELECT json_each.key
            FROM json_each(bars_denormalized.wines_ids)
            WHERE json_each.value = 'wine-uuid-297'
        ) || ']'
    )
)
WHERE id = 'bar-uuid-3';
```
The json_remove function takes two parameters:
- the field from which we want to remove a value
- the index of the value we want to remove
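And the matching removal pattern, again sketched with Python's `sqlite3` module on throwaway data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON)""")
conn.execute("""INSERT INTO bars_denormalized (id, name, wines_ids)
    VALUES ('bar-uuid-3', 'La taverne',
            json('["wine-uuid-1", "wine-uuid-297", "wine-uuid-45"]'))""")

# The inner SELECT finds the array index (json_each.key) of the item
# to drop; json_remove() then rebuilds the array without it.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_remove(
            bars_denormalized.wines_ids,
            '$[' || (
                SELECT json_each.key
                FROM json_each(bars_denormalized.wines_ids)
                WHERE json_each.value = 'wine-uuid-297'
            ) || ']'
        )
    )
    WHERE id = 'bar-uuid-3'
""")

wines = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'bar-uuid-3'"
).fetchone()[0]
print(wines)  # ["wine-uuid-1","wine-uuid-45"]
```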
In order to test the performance of both data models, we'll run two queries:
- obtain the names of the bars serving wines more recent than 2018
- obtain the names of the Portuguese wines served in UK bars
With these two queries, we'll be able to see how to get data from each table.
For the normalized version, we'll perform a join between the wines, bars_normalized and bars_wines tables. After the join, we'll apply a filter based on our needs.
```sql
SELECT DISTINCT bars_normalized.id, bars_normalized.name
FROM bars_normalized
LEFT OUTER JOIN bars_wines ON bars_normalized.id = bars_wines.bar_id
LEFT OUTER JOIN wines ON wines.id = bars_wines.wine_id
WHERE year > 2018;
```
```sql
SELECT DISTINCT wines.id, wines.name
FROM wines
LEFT OUTER JOIN bars_wines ON wines.id = bars_wines.wine_id
LEFT OUTER JOIN bars_normalized ON bars_normalized.id = bars_wines.bar_id
WHERE bars_normalized.country = 'United Kingdom'
AND wines.country = 'Portugal';
```
For the denormalized version, we'll use the functions provided by the JSON1 extension, `json_each` in this case.
```sql
SELECT DISTINCT bars_denormalized.id, name
FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
WHERE json_each.value IN (
    SELECT id FROM wines WHERE year > 2018
);
```
```sql
SELECT DISTINCT wines.id, wines.name
FROM wines
WHERE id IN (
    SELECT DISTINCT json_each.value AS id
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE bars_denormalized.country = 'United Kingdom'
)
AND wines.country = 'Portugal';
```
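Before benchmarking, it's worth checking that both models agree. Here's a tiny fixture (hypothetical ids, names and years) where query n°1 is run in both its normalized and denormalized forms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wines (id TEXT PRIMARY KEY, name TEXT, country TEXT, year INTEGER);
CREATE TABLE bars_normalized (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE bars_wines (bar_id TEXT, wine_id TEXT);
CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON);

INSERT INTO wines VALUES
    ('w1', 'Vinho Verde', 'Portugal', 2019),
    ('w2', 'Bordeaux', 'France', 2015);
INSERT INTO bars_normalized VALUES ('b1', 'La taverne'), ('b2', 'Le zinc');
INSERT INTO bars_wines VALUES ('b1', 'w1'), ('b2', 'w2');
INSERT INTO bars_denormalized VALUES
    ('b1', 'La taverne', 'United Kingdom', json('["w1"]')),
    ('b2', 'Le zinc', 'France', json('["w2"]'));
""")

# Query 1, normalized: two joins, then filter on the wine year.
normalized = conn.execute("""
    SELECT DISTINCT bars_normalized.id, bars_normalized.name
    FROM bars_normalized
    LEFT OUTER JOIN bars_wines ON bars_normalized.id = bars_wines.bar_id
    LEFT OUTER JOIN wines ON wines.id = bars_wines.wine_id
    WHERE wines.year > 2018
""").fetchall()

# Query 1, denormalized: json_each() replaces the join table.
denormalized = conn.execute("""
    SELECT DISTINCT bars_denormalized.id, bars_denormalized.name
    FROM bars_denormalized, json_each(bars_denormalized.wines_ids)
    WHERE json_each.value IN (SELECT id FROM wines WHERE year > 2018)
""").fetchall()

print(normalized == denormalized, normalized)  # True [('b1', 'La taverne')]
```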
Let's talk about the query above, or more precisely about its sub-query. Reading it top to bottom, it can look strange that I'm using `json_each.value` inside the `SELECT` even though it only gets defined in the `FROM` on the next line. That's the point! It's necessary to understand that relational databases don't process queries in reading order: they execute them as described in the following schema.
Even if it can seem strange given the reading direction, it's perfectly valid on the technical side.
I've run tests with various parameters, changing some of them for each query. In the results below, the numbers in parentheses represent: (number of bars, number of wines, number of wines per bar).
In order to get the results, I ran each query five times and took the median value.
| Query | Normalized | Denormalized |
|---|---|---|
| Query 1 (500, 1000, 100) | 874ms | 74ms |
| Query 1 (5000, 10000, 100) | 9143ms | 569ms |
| Query 1 (500, 1000, 500) | 4212ms | 197ms |
| Query 2 (500, 1000, 100) | 726ms | 16ms |
| Query 2 (5000, 10000, 100) | 8648ms | 83ms |
| Query 2 (500, 1000, 500) | 4132ms | 27ms |
Obviously, the denormalized version is much faster than the normalized one. Joins are heavy operations. For normalized databases, the duration is correlated with the size of the join table. Let's take query 1 as an example:
- 50,000 entries → 874ms
- 500,000 entries → 9143ms (×10 compared to the previous)
- 250,000 entries → 4212ms (÷2 compared to the previous)
We can observe similar results for query n°2. When we increase the amount of data, the duration grows linearly with it. We don't see the same behaviour with the denormalized form, which handles data growth much better.
In addition, we can see that query n°2 performs better than query n°1. It means that querying from the table that doesn't hold the ids array is faster. So, in my example, if I often need to load data related to bars, I should put a `bars_ids` array inside the `wines` table instead of keeping `wines_ids` inside `bars_denormalized`.
These statistics are provided by sqlite3_analyzer. The denormalized version is two times lighter than the normalized one (which includes bars_normalized and bars_wines). We win on performance and on database size.
Denormalization is a great way to save space and improve read performance. But we have to pay more attention to write operations. For example, during deletions, we won't be able to rely on the cascade deletes provided by RDBMS.
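For instance, removing a wine entirely takes two statements in the denormalized model, since no foreign key links the array items back to the wines table. This is a minimal sketch (made-up ids; triggers would be another option):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wines (id TEXT PRIMARY KEY, name TEXT, country TEXT, year INTEGER);
CREATE TABLE bars_denormalized (
    id TEXT PRIMARY KEY, name TEXT, country TEXT, wines_ids JSON);
INSERT INTO wines VALUES ('w1', 'Vinho Verde', 'Portugal', 2019);
INSERT INTO bars_denormalized VALUES
    ('b1', 'La taverne', 'United Kingdom', json('["w1", "w2"]'));
""")

wine_id = "w1"
# 1 - Delete the wine itself.
conn.execute("DELETE FROM wines WHERE id = ?", (wine_id,))
# 2 - Manually scrub the id from every bar's array: no ON DELETE CASCADE
#     can reach inside a JSON column.
conn.execute("""
    UPDATE bars_denormalized
    SET wines_ids = (
        SELECT json_remove(
            wines_ids,
            '$[' || (
                SELECT json_each.key FROM json_each(wines_ids)
                WHERE json_each.value = :wid
            ) || ']'
        )
    )
    WHERE EXISTS (
        SELECT 1 FROM json_each(wines_ids) WHERE json_each.value = :wid
    )
""", {"wid": wine_id})

wines_ids = conn.execute(
    "SELECT wines_ids FROM bars_denormalized WHERE id = 'b1'"
).fetchone()[0]
print(wines_ids)  # ["w2"]
```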
Through this article, I wanted to show that denormalization isn't a concept reserved for NoSQL. Based on my research, MySQL/MariaDB and PostgreSQL also provide JSON manipulation functions. I'm not saying that NoSQL databases are bad; they can be better suited to certain use cases, and Redis and Elasticsearch are perfect examples. But it's not necessary to throw away solutions that have stood the test of time.
Thanks for reading.