
Optimizing Database Query Performance: Solving Text Search Challenges with Table Partitioning

In the development of the "antikollektor" project, which aims to provide a user-friendly interface for checking judicial databases, I encountered a major challenge related to how fast the database could respond to queries. Specifically, queries involving text searches across a large number of rows were taking an unacceptably long time to execute. In this article, I will share the steps taken to optimize database performance by implementing a table partitioning strategy, addressing the issues encountered and achieving significant improvements in query speed.

The Issue: Slow Text Search Queries

Users typically search for court cases by defendant name within a specified date range. The average request was taking approximately 5 seconds to execute.

casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%ivanov%sergey%' order by date desc offset 30 limit 10;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=194015.73..194015.85 rows=1 width=558) (actual time=5397.335..5406.489 rows=10 loops=1)
   ->  Gather Merge  (cost=194012.46..194015.73 rows=28 width=558) (actual time=5384.596..5393.766 rows=40 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=193012.44..193012.48 rows=14 width=558) (actual time=5266.087..5266.094 rows=30 loops=3)
               Sort Key: date DESC
               Sort Method: quicksort  Memory: 73kB
               Worker 0:  Sort Method: quicksort  Memory: 55kB
               Worker 1:  Sort Method: quicksort  Memory: 61kB
               ->  Parallel Seq Scan on cases  (cost=0.00..193012.17 rows=14 width=558) (actual time=34.175..5265.729 rows=41 loops=3)
                     Filter: ((date >= '2023-01-01'::date) AND (date <= '2023-06-30'::date) AND ((defendant)::text ~~* '%ivanov%sergey%'::text))
                     Rows Removed by Filter: 820492
 Planning Time: 1.535 ms
 JIT:
   Functions: 8
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 14.455 ms, Inlining 0.000 ms, Optimization 3.085 ms, Emission 47.067 ms, Total 64.606 ms
 Execution Time: 5409.240 ms
(18 rows)

In this plan you can see that the longest part of the request is the parallel sequential scan on the "cases" table.

The reasons for this were the server's limited resources, the large size of the table, and the fact that the wildcard search on the text field forced PostgreSQL to read a substantial number of rows, only to return a small subset of them at the end, after applying the offset and limit.

Ways to Solve the Issue

Indexing Text Fields

One approach to improve performance would be to index the "defendant" field. However, a regular B-tree index cannot be used for pattern searches with a leading wildcard (e.g., ilike '%ivanov%sergey%'), making this solution ineffective in our case.
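
For illustration, here is a minimal sketch of why this fails (the index name is hypothetical): a B-tree index helps with equality and prefix lookups, but a pattern that starts with '%' cannot use it, so the planner still falls back to a sequential scan.

-- hypothetical index; it will not help a leading-wildcard pattern
create index cases_defendant_idx on cases (defendant);

-- the leading '%' prevents index usage, so PostgreSQL still scans the table
explain select * from cases where defendant ilike '%ivanov%sergey%';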

Request Partitioning by Key Field

Another approach to improve performance would be splitting the date range in the query into smaller ranges and executing separate queries for each range in sorted order.

The average request:

select * from cases
where date >= '2023-01-01'
  and date <= '2023-06-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc
offset 30 limit 10

is transformed into several requests:

select * from cases
where date >= '2023-06-01'
  and date <= '2023-06-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc

select * from cases
where date >= '2023-05-01'
  and date <= '2023-05-31'
  and defendant ilike '%ivanov%sergey%'
order by date desc

select * from cases
where date >= '2023-04-01'
  and date <= '2023-04-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc
...

This approach could allow us to stop querying once we had retrieved the required number of rows, significantly reducing the total query time in some cases. However, the results must always be sorted by the field used for splitting, such as "date", so that the per-range results can simply be concatenated in order; otherwise this approach does not work.
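
A minimal PL/pgSQL sketch of how such split requests could be driven is shown below. The project's actual application-side code is not shown in this article, so every identifier here (tmp_found, needed_rows, the date bounds) is purely illustrative.

do $$
declare
    needed_rows int  := 30 + 10;            -- offset + limit from the original request
    found_rows  int  := 0;
    batch_rows  int;
    range_end   date := date '2023-07-01';  -- exclusive upper bound of the newest range
    range_start date;
begin
    create temp table tmp_found (like cases);

    -- walk month-sized ranges from newest to oldest
    while range_end > date '2023-01-01' and found_rows < needed_rows loop
        range_start := (range_end - interval '1 month')::date;

        insert into tmp_found
        select c.*
        from cases c
        where c.date >= range_start
          and c.date <  range_end
          and c.defendant ilike '%ivanov%sergey%';

        get diagnostics batch_rows = row_count;
        found_rows := found_rows + batch_rows;
        range_end  := range_start;          -- move one month back
    end loop;
end $$;

-- the requested page is then taken from the accumulated rows
select * from tmp_found order by date desc offset 30 limit 10;

The loop stops as soon as offset + limit rows have been collected, which is exactly what makes this approach faster when matches appear in the most recent months.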

Table Partitioning

Instead of splitting the query, I chose to partition the database table into smaller tables based on the "date" field, with each table containing rows for one month only; a DDL sketch follows the example tables below.

Data table before partitioning:

Table "cases"
id date defendant
1 2023-03-02 Ivanov Ivan Ivanovich
2 2023-03-02 Petrov Petr Petrovich
3 2023-04-06 Smirnova Anna Aleksandrovna
4 2023-05-12 Sokolova Elena Igorevna
5 2023-05-13 Kuznetsov Dmitriy Sergeevich
6 2023-06-01 Popova Mariya Vladimirovna
... ... ...

Data tables after partitioning:

Table "cases_202303"
id date defendant
1 2023-03-02 Ivanov Ivan Ivanovich
2 2023-03-02 Petrov Petr Petrovich
... ... ...

Table "cases_202304"
id date defendant
3 2023-04-06 Smirnova Anna Aleksandrovna
... ... ...

Table "cases_202305"
id date defendant
4 2023-05-12 Sokolova Elena Igorevna
5 2023-05-13 Kuznetsov Dmitriy Sergeevich
... ... ...

Table "cases_202306"
id date defendant
6 2023-06-01 Popova Mariya Vladimirovna
... ... ...
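
Such a monthly layout could be declared, for example, with PostgreSQL's built-in range partitioning (available since version 10). This is only a sketch under that assumption: the article does not state whether declarative partitioning or independent per-month tables were used, the column definitions are simplified, and "cases_old" is a hypothetical name for the original table.

-- simplified column list; partition names mirror the tables shown above
create table cases (
    id        bigint,
    date      date not null,
    defendant text
) partition by range (date);

create table cases_202305 partition of cases
    for values from ('2023-05-01') to ('2023-06-01');

create table cases_202306 partition of cases
    for values from ('2023-06-01') to ('2023-07-01');

-- existing rows are then copied into the partitioned table, e.g.:
-- insert into cases select * from cases_old;

With declarative partitioning, queries against the parent table are automatically pruned to the relevant partitions by the date condition, and the individual partitions can also be queried directly, as shown later in the article.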

This partitioning strategy offers the same benefits as request partitioning, but also allows frequently used tables and indexes to be cached more efficiently. Additionally, I implemented a cache that allows me to skip duplicate requests.

After implementing table partitioning, the same average query now takes less than 1 second to execute in the best-case scenario: when the defendant's name is found in the first (most recent) partitions.

  1. skip the request to table "cases_202306" because its result (21 rows) was found in the cache (see the sketch after this list);
  2. fetch 19 more rows from table "cases_202305";
    casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%ivanov%sergey%' order by date desc limit 19;
                                                              QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.474..278.481 rows=19 loops=1)
       ->  Sort  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.472..278.475 rows=19 loops=1)
             Sort Key: date DESC
             Sort Method: quicksort  Memory: 44kB
             ->  Seq Scan on cases_202305  (cost=0.00..4113.45 rows=5 width=467) (actual time=4.048..278.263 rows=24 loops=1)
                   Filter: ((defendant)::text ~~* '%ivanov%sergey%'::text)
                   Rows Removed by Filter: 54492
     Planning Time: 14.386 ms
     Execution Time: 278.582 ms
    (9 rows)
        
  3. stop processing and return 10 rows;
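
The article does not show how the cache is implemented, so the following is a purely hypothetical sketch: one simple way is to key cached results by partition name and search pattern, so that a repeated request can skip scanning a partition that was already searched.

-- hypothetical cache table; every name here is made up for illustration
create table search_cache (
    partition_name text not null,
    pattern        text not null,
    row_count      int  not null,          -- how many rows matched in that partition
    cached_at      timestamptz default now(),
    primary key (partition_name, pattern)
);

-- before scanning a partition, check whether the same search was already done
select row_count
from search_cache
where partition_name = 'cases_202306'
  and pattern = '%ivanov%sergey%';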

Let's consider the worst-case scenario: when the defendant's name is not found in the database.

Before partitioning:

casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%noname%' order by date desc;
...
 Execution Time: 5871.037 ms

After partitioning:

casesdbv4=# explain analyze select * from cases_202306 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 275.426 ms

casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 217.230 ms

casesdbv4=# explain analyze select * from cases_202304 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 239.350 ms

casesdbv4=# explain analyze select * from cases_202303 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 282.054 ms

casesdbv4=# explain analyze select * from cases_202302 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 283.139 ms

casesdbv4=# explain analyze select * from cases_202301 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 190.038 ms

Analyzing these requests, we can observe the following: before partitioning, this request took about 6 seconds, as PostgreSQL had to read all rows in the specified date range. After partitioning, executing six requests against the partitions took only about 1.5 seconds in total. This improvement was possible because PostgreSQL operates much faster on smaller tables and caches them much more efficiently, even though it reads the same number of rows.

Conclusion

In conclusion, optimizing database query performance often requires innovative solutions tailored to the specific challenges faced. In this case, table partitioning proved to be an effective strategy for improving query speed and overall system performance. By sharing this experience, I hope to inspire others to explore creative solutions to their own technical challenges. I welcome feedback and suggestions from readers to further improve this approach and future projects.
