
Optimizing Database Query Performance: Solving Text Search Challenges with Table Partitioning

In the development of the "antikollektor" project, which aims to provide a user-friendly interface for checking judicial databases, I encountered a major challenge related to how fast the database could respond to queries. Specifically, queries involving text searches across a large number of rows were taking an unacceptably long time to execute. In this article, I will share the steps taken to optimize database performance by implementing a table partitioning strategy, addressing the issues encountered and achieving significant improvements in query speed.

The Issue: Slow Text Search Queries

Users typically search for court cases by defendant name within a specified date range. The average request was taking approximately 5 seconds to execute.

casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%ivanov%sergey%' order by date desc offset 30 limit 10;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=194015.73..194015.85 rows=1 width=558) (actual time=5397.335..5406.489 rows=10 loops=1)
   ->  Gather Merge  (cost=194012.46..194015.73 rows=28 width=558) (actual time=5384.596..5393.766 rows=40 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=193012.44..193012.48 rows=14 width=558) (actual time=5266.087..5266.094 rows=30 loops=3)
               Sort Key: date DESC
               Sort Method: quicksort  Memory: 73kB
               Worker 0:  Sort Method: quicksort  Memory: 55kB
               Worker 1:  Sort Method: quicksort  Memory: 61kB
               ->  Parallel Seq Scan on cases  (cost=0.00..193012.17 rows=14 width=558) (actual time=34.175..5265.729 rows=41 loops=3)
                     Filter: ((date >= '2023-01-01'::date) AND (date <= '2023-06-30'::date) AND ((defendant)::text ~~* '%ivanov%sergey%'::text))
                     Rows Removed by Filter: 820492
 Planning Time: 1.535 ms
 JIT:
   Functions: 8
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 14.455 ms, Inlining 0.000 ms, Optimization 3.085 ms, Emission 47.067 ms, Total 64.606 ms
 Execution Time: 5409.240 ms
(18 rows)

In this plan you can see that the longest part of the request is the parallel sequential scan on the "cases" table.

The reasons for this were the server's limited resources, the large size of the table, and the fact that the wildcard search on the text field forced PostgreSQL to read a substantial number of rows, only to return a small subset of them at the end, after applying the offset and limit.

Ways to Solve the Issue

Indexing Text Fields

One approach to improve performance would be to index the "defendant" field. However, a regular B-tree index cannot be used for pattern searches with a leading wildcard (e.g., ilike '%ivanov%sergey%'), making this solution ineffective in our case.
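
For illustration, here is a minimal sketch of why this fails (the index name is hypothetical): a B-tree index helps with equality and prefix lookups, but a pattern that starts with '%' cannot use it, so the planner still falls back to a sequential scan.

-- hypothetical index; it will not help a leading-wildcard pattern
create index cases_defendant_idx on cases (defendant);

-- the leading '%' prevents index usage, so PostgreSQL still scans the table
explain select * from cases where defendant ilike '%ivanov%sergey%';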

Request Partitioning by Key Field

Another approach to improve performance would be splitting the date range in the query into smaller ranges and executing separate queries for each range in sorted order.

The average request:

select * from cases
where date >= '2023-01-01'
  and date <= '2023-06-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc
offset 30 limit 10

is transformed into several requests:

select * from cases
where date >= '2023-06-01'
  and date <= '2023-06-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc

select * from cases
where date >= '2023-05-01'
  and date <= '2023-05-31'
  and defendant ilike '%ivanov%sergey%'
order by date desc

select * from cases
where date >= '2023-04-01'
  and date <= '2023-04-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc
...

This approach could allow us to stop querying once we had retrieved the required number of rows, significantly reducing the total query time in some cases. However, the results must always be sorted by the field used for splitting, such as "date", so that the per-range results can simply be concatenated in order; otherwise this approach does not work.
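
A minimal PL/pgSQL sketch of how such split requests could be driven is shown below. The project's actual application-side code is not shown in this article, so every identifier here (tmp_found, needed_rows, the date bounds) is purely illustrative.

do $$
declare
    needed_rows int  := 30 + 10;            -- offset + limit from the original request
    found_rows  int  := 0;
    batch_rows  int;
    range_end   date := date '2023-07-01';  -- exclusive upper bound of the newest range
    range_start date;
begin
    create temp table tmp_found (like cases);

    -- walk month-sized ranges from newest to oldest
    while range_end > date '2023-01-01' and found_rows < needed_rows loop
        range_start := (range_end - interval '1 month')::date;

        insert into tmp_found
        select c.*
        from cases c
        where c.date >= range_start
          and c.date <  range_end
          and c.defendant ilike '%ivanov%sergey%';

        get diagnostics batch_rows = row_count;
        found_rows := found_rows + batch_rows;
        range_end  := range_start;          -- move one month back
    end loop;
end $$;

-- the requested page is then taken from the accumulated rows
select * from tmp_found order by date desc offset 30 limit 10;

The loop stops as soon as offset + limit rows have been collected, which is exactly what makes this approach faster when matches appear in the most recent months.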

Table Partitioning

Instead of splitting the query, I chose to partition the database table into smaller tables based on the "date" field, with each table containing rows for one month only; a DDL sketch follows the example tables below.

Data table before partitioning:

Table "cases"
id date defendant
1 2023-03-02 Ivanov Ivan Ivanovich
2 2023-03-02 Petrov Petr Petrovich
3 2023-04-06 Smirnova Anna Aleksandrovna
4 2023-05-12 Sokolova Elena Igorevna
5 2023-05-13 Kuznetsov Dmitriy Sergeevich
6 2023-06-01 Popova Mariya Vladimirovna
... ... ...

Data tables after partitioning:

Table "cases_202303"
id date defendant
1 2023-03-02 Ivanov Ivan Ivanovich
2 2023-03-02 Petrov Petr Petrovich
... ... ...

Table "cases_202304"
id date defendant
3 2023-04-06 Smirnova Anna Aleksandrovna
... ... ...

Table "cases_202305"
id date defendant
4 2023-05-12 Sokolova Elena Igorevna
5 2023-05-13 Kuznetsov Dmitriy Sergeevich
... ... ...

Table "cases_202306"
id date defendant
6 2023-06-01 Popova Mariya Vladimirovna
... ... ...
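
Such a monthly layout could be declared, for example, with PostgreSQL's built-in range partitioning (available since version 10). This is only a sketch under that assumption: the article does not state whether declarative partitioning or independent per-month tables were used, the column definitions are simplified, and "cases_old" is a hypothetical name for the original table.

-- simplified column list; partition names mirror the tables shown above
create table cases (
    id        bigint,
    date      date not null,
    defendant text
) partition by range (date);

create table cases_202305 partition of cases
    for values from ('2023-05-01') to ('2023-06-01');

create table cases_202306 partition of cases
    for values from ('2023-06-01') to ('2023-07-01');

-- existing rows are then copied into the partitioned table, e.g.:
-- insert into cases select * from cases_old;

With declarative partitioning, queries against the parent table are automatically pruned to the relevant partitions by the date condition, and the individual partitions can also be queried directly, as shown later in the article.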

This partitioning strategy offers the same benefits as request partitioning, but also allows frequently used tables and indexes to be cached more efficiently. Additionally, I implemented a cache that allows me to skip duplicate requests.

After implementing table partitioning, the same average query now takes less than 1 second to execute in the best-case scenario: when the defendant's name is found in the first (most recent) partitions.

  1. skip the request to table "cases_202306" because its result (21 rows) was found in the cache (see the sketch after this list);
  2. fetch 19 more rows from table "cases_202305";
    casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%ivanov%sergey%' order by date desc limit 19;
                                                              QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.474..278.481 rows=19 loops=1)
       ->  Sort  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.472..278.475 rows=19 loops=1)
             Sort Key: date DESC
             Sort Method: quicksort  Memory: 44kB
             ->  Seq Scan on cases_202305  (cost=0.00..4113.45 rows=5 width=467) (actual time=4.048..278.263 rows=24 loops=1)
                   Filter: ((defendant)::text ~~* '%ivanov%sergey%'::text)
                   Rows Removed by Filter: 54492
     Planning Time: 14.386 ms
     Execution Time: 278.582 ms
    (9 rows)
        
  3. stop processing and return 10 rows;
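
The article does not show how the cache is implemented, so the following is a purely hypothetical sketch: one simple way is to key cached results by partition name and search pattern, so that a repeated request can skip scanning a partition that was already searched.

-- hypothetical cache table; every name here is made up for illustration
create table search_cache (
    partition_name text not null,
    pattern        text not null,
    row_count      int  not null,          -- how many rows matched in that partition
    cached_at      timestamptz default now(),
    primary key (partition_name, pattern)
);

-- before scanning a partition, check whether the same search was already done
select row_count
from search_cache
where partition_name = 'cases_202306'
  and pattern = '%ivanov%sergey%';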

Let's consider the worst-case scenario: when the defendant's name is not found in the database.

Before partitioning:

casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%noname%' order by date desc;
...
 Execution Time: 5871.037 ms

After partitioning:

casesdbv4=# explain analyze select * from cases_202306 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 275.426 ms

casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 217.230 ms

casesdbv4=# explain analyze select * from cases_202304 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 239.350 ms

casesdbv4=# explain analyze select * from cases_202303 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 282.054 ms

casesdbv4=# explain analyze select * from cases_202302 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 283.139 ms

casesdbv4=# explain analyze select * from cases_202301 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 190.038 ms

Analyzing these requests, we can observe the following: before partitioning, this request took about 6 seconds, as PostgreSQL had to read all rows in the specified date range. After partitioning, executing six requests against the partitions took only about 1.5 seconds in total. This improvement was possible because PostgreSQL operates much faster on smaller tables and caches them much more efficiently, even though it reads the same number of rows.

Conclusion

In conclusion, optimizing database query performance often requires innovative solutions tailored to the specific challenges faced. In this case, table partitioning proved to be an effective strategy for improving query speed and overall system performance. By sharing this experience, I hope to inspire others to explore creative solutions to their own technical challenges. I welcome feedback and suggestions from readers to further improve this approach and future projects.
