While developing the "antikollektor" project, which provides a user-friendly interface for searching judicial databases, I ran into a major challenge: queries involving text searches across a large number of rows were taking unacceptably long to execute. In this article, I will share the steps I took to optimize database performance by implementing a table partitioning strategy, the issues I encountered along the way, and the significant improvements in query speed that resulted.
The Issue: Slow Text Search Queries
Users typically search for court cases by defendant name within a specified date range. A typical query was taking approximately 5 seconds to execute:
```
casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%ivanov%sergey%' order by date desc offset 30 limit 10;
                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=194015.73..194015.85 rows=1 width=558) (actual time=5397.335..5406.489 rows=10 loops=1)
   ->  Gather Merge  (cost=194012.46..194015.73 rows=28 width=558) (actual time=5384.596..5393.766 rows=40 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=193012.44..193012.48 rows=14 width=558) (actual time=5266.087..5266.094 rows=30 loops=3)
               Sort Key: date DESC
               Sort Method: quicksort  Memory: 73kB
               Worker 0:  Sort Method: quicksort  Memory: 55kB
               Worker 1:  Sort Method: quicksort  Memory: 61kB
               ->  Parallel Seq Scan on cases  (cost=0.00..193012.17 rows=14 width=558) (actual time=34.175..5265.729 rows=41 loops=3)
                     Filter: ((date >= '2023-01-01'::date) AND (date <= '2023-06-30'::date) AND ((defendant)::text ~~* '%ivanov%sergey%'::text))
                     Rows Removed by Filter: 820492
 Planning Time: 1.535 ms
 JIT:
   Functions: 8
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 14.455 ms, Inlining 0.000 ms, Optimization 3.085 ms, Emission 47.067 ms, Total 64.606 ms
 Execution Time: 5409.240 ms
(18 rows)
```
In this explain output, you can see that the longest part of the query is the parallel sequential scan on the "cases" table.
The reasons for this were the server's limited resources, the large size of the table, and the fact that PostgreSQL had to read a substantial number of rows to evaluate the wildcard search on the text field, only to return a small subset of them after applying the offset and limit.
Ways to Solve the Issue
Indexing Text Fields
One approach to improving performance would be to index the "defendant" field. However, a conventional B-tree index cannot be used for wildcard searches in a text field (e.g., ilike '%ivanov%sergey%'), making this solution ineffective in our case.
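To see why, consider a minimal sketch (the index name is hypothetical):

```sql
-- Hypothetical B-tree index on the text column:
CREATE INDEX idx_cases_defendant ON cases (defendant);

-- A B-tree lookup needs a key prefix or range to descend the tree.
-- A pattern with a leading wildcard provides neither, so this query
-- is still executed as a sequential scan of the whole table:
SELECT * FROM cases WHERE defendant ILIKE '%ivanov%sergey%';
```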
Request Partitioning by Key Field
Another approach would be to split the date range in the query into smaller ranges and execute a separate query for each range, from the newest range to the oldest.
The typical query:
```sql
select * from cases
where date >= '2023-01-01' and date <= '2023-06-30'
  and defendant ilike '%ivanov%sergey%'
order by date desc offset 30 limit 10;
```
is transformed into several queries:
```sql
select * from cases where date >= '2023-06-01' and date <= '2023-06-30' and defendant ilike '%ivanov%sergey%' order by date desc;
select * from cases where date >= '2023-05-01' and date <= '2023-05-31' and defendant ilike '%ivanov%sergey%' order by date desc;
select * from cases where date >= '2023-04-01' and date <= '2023-04-30' and defendant ilike '%ivanov%sergey%' order by date desc;
...
```
This approach would allow us to stop querying as soon as the required number of rows had been retrieved, significantly reducing the total query time in some cases. However, the sort must always be on the field used for splitting (here, "date") for this approach to remain correct; a sketch of the idea is shown below.
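As a rough illustration, here is what this could look like as a PL/pgSQL function. The function name and signature are hypothetical, and offset/limit handling is left to the application:

```sql
-- A sketch of query partitioning by date range, assuming the
-- monolithic "cases" table from above. Collects rows month by month,
-- newest first, and stops as soon as enough rows have been gathered.
CREATE OR REPLACE FUNCTION search_cases(p_from date, p_to date,
                                        p_pattern text, p_needed int)
RETURNS SETOF cases AS $$
DECLARE
    range_end   date := p_to;
    range_start date;
    found       int  := 0;
    rec         cases%ROWTYPE;
BEGIN
    WHILE range_end >= p_from AND found < p_needed LOOP
        -- Clamp the current month's start to the overall lower bound.
        range_start := greatest(date_trunc('month', range_end)::date, p_from);
        FOR rec IN
            SELECT * FROM cases
            WHERE date BETWEEN range_start AND range_end
              AND defendant ILIKE p_pattern
            ORDER BY date DESC
        LOOP
            RETURN NEXT rec;
            found := found + 1;
        END LOOP;
        range_end := range_start - 1;  -- last day of the previous month
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Usage: a query with "offset 30 limit 10" needs at most 40 matches:
-- SELECT * FROM search_cases('2023-01-01', '2023-06-30',
--                            '%ivanov%sergey%', 40) OFFSET 30 LIMIT 10;
```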
Table Partitioning
Instead of splitting the query, I chose to partition the database table into smaller tables based on the "date" field, with each table containing rows for one month only.
Data table before partitioning:
Table "cases"id | date | defendant |
---|---|---|
1 | 2023-03-02 | Ivanov Ivan Ivanovich |
2 | 2023-03-02 | Petrov Petr Petrovich |
3 | 2023-04-06 | Smirnova Anna Aleksandrovna |
4 | 2023-05-12 | Sokolova Elena Igorevna |
5 | 2023-05-13 | Kuznetsov Dmitriy Sergeevich |
6 | 2023-06-01 | Popova Mariya Vladimirovna |
... | ... | ... |
Data tables after partitioning:
Table "cases_202303"id | date | defendant |
---|---|---|
1 | 2023-03-02 | Ivanov Ivan Ivanovich |
2 | 2023-03-02 | Petrov Petr Petrovich |
... | ... | ... |
Table "cases_202304"
id | date | defendant |
---|---|---|
3 | 2023-04-06 | Smirnova Anna Aleksandrovna |
... | ... | ... |
Table "cases_202305"
id | date | defendant |
---|---|---|
4 | 2023-05-12 | Sokolova Elena Igorevna |
5 | 2023-05-13 | Kuznetsov Dmitriy Sergeevich |
... | ... | ... |
Table "cases_202306"
id | date | defendant |
---|---|---|
6 | 2023-06-01 | Popova Mariya Vladimirovna |
... | ... | ... |
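For reference, this layout can be declared with PostgreSQL's built-in range partitioning. The article's actual schema certainly has more columns, and whether the project uses declarative partitioning or plain per-month tables is an implementation detail; a minimal sketch of the declarative variant:

```sql
-- Parent table partitioned by the "date" column, reduced to the
-- fields shown above:
CREATE TABLE cases (
    id        bigint,
    date      date NOT NULL,
    defendant text
) PARTITION BY RANGE (date);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE cases_202303 PARTITION OF cases
    FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE cases_202304 PARTITION OF cases
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
CREATE TABLE cases_202305 PARTITION OF cases
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');
CREATE TABLE cases_202306 PARTITION OF cases
    FOR VALUES FROM ('2023-06-01') TO ('2023-07-01');
```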
This partitioning strategy offers the same benefits as query partitioning, but also allows frequently used tables and their indexes to be cached more efficiently. Additionally, I implemented a cache that lets me skip duplicate queries.
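The cache design is not detailed here; one hypothetical shape is a small table keyed by partition and search pattern:

```sql
-- Hypothetical request cache: stores the rows a given pattern matched
-- in a given monthly partition, so a repeated search can skip that
-- partition's query entirely.
CREATE TABLE search_cache (
    partition_name text  NOT NULL,
    pattern        text  NOT NULL,
    result_rows    jsonb NOT NULL,  -- cached matching rows
    PRIMARY KEY (partition_name, pattern)
);
```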
After implementing table partitioning, the same typical query takes less than 1 second to execute in the best-case scenario, when the defendant's name is found in the first (most recent) partitions. For example:
- skip the query to table "cases_202306", because its result (21 rows) was found in the cache;
- fetch 19 rows from table "cases_202305" (offset 30 plus limit 10 means 40 rows are needed in total; 21 are already covered by the cache, so only 19 more are required):

```
casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%ivanov%sergey%' order by date desc limit 19;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.474..278.481 rows=19 loops=1)
   ->  Sort  (cost=4113.51..4113.52 rows=5 width=467) (actual time=278.472..278.475 rows=19 loops=1)
         Sort Key: date DESC
         Sort Method: quicksort  Memory: 44kB
         ->  Seq Scan on cases_202305  (cost=0.00..4113.45 rows=5 width=467) (actual time=4.048..278.263 rows=24 loops=1)
               Filter: ((defendant)::text ~~* '%ivanov%sergey%'::text)
               Rows Removed by Filter: 54492
 Planning Time: 14.386 ms
 Execution Time: 278.582 ms
(9 rows)
```

- stop processing and return the last 10 rows.
Let's consider the worst-case scenario: when the defendant's name is not found in the database.
Before partitioning:
```
casesdbv2=# explain analyze select * from cases where date >= '2023-01-01' and date <= '2023-06-30' and defendant ilike '%noname%' order by date desc;
...
 Execution Time: 5871.037 ms
```
After partitioning:
```
casesdbv4=# explain analyze select * from cases_202306 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 275.426 ms

casesdbv4=# explain analyze select * from cases_202305 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 217.230 ms

casesdbv4=# explain analyze select * from cases_202304 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 239.350 ms

casesdbv4=# explain analyze select * from cases_202303 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 282.054 ms

casesdbv4=# explain analyze select * from cases_202302 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 283.139 ms

casesdbv4=# explain analyze select * from cases_202301 where defendant ilike '%noname%' order by date desc;
...
 Execution Time: 190.038 ms
```
Analyzing these queries, we observed the following: before partitioning, the query took about 6 seconds, as PostgreSQL had to read all rows in the specified date range. After partitioning, executing six queries against the partitions took only about 1.5 seconds in total. This improvement was possible because PostgreSQL operates much faster on smaller tables and caches them much more efficiently, even though it reads the same number of rows.
Conclusion
In conclusion, optimizing database query performance often requires innovative solutions tailored to the specific challenges faced. In this case, table partitioning proved to be an effective strategy for improving query speed and overall system performance. By sharing this experience, I hope to inspire others to explore creative solutions to their own technical challenges. I welcome feedback and suggestions from readers to further improve this approach and future projects.