
Dataset Analysis and Sorting Speedup

In this article, I discuss how to analyze and sort data while conserving computational resources.

Problem

It is extremely rare to need to find the median or another percentile of a large set of unordered numbers, especially one whose minimum, maximum, and distribution of values are unknown. When such a situation does arise, however, it can cause a lot of headaches, since obtaining an element by index normally requires sorting the entire set. The situation worsens when there is a lot of data and not enough free memory, and the time spent on sorting can also become significant.

Simple Solution

What can be done to retrieve an element by index from a sorted set without sorting the entire set? You can analyze the data to determine approximately where the element with the desired index is located, thus simplifying the search. Typically, such an analysis means dividing the overall range of numbers into a suitable number of subranges, reading the series of numbers, counting how many elements fall into each subrange, and thereby determining which subrange contains the desired element. You can then select the elements of that subrange, sort them, and obtain the desired element by index.
However, if you do not know the overall range of the numbers or their distribution, you cannot decide in advance which subranges to use, and reading the entire set an extra time just to obtain this information is undesirable. In this situation, I propose determining the ranges automatically while reading the dataset.
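As a rough sketch of one way such an adaptive analysis can work (my own reconstruction, not necessarily the exact steps of the linked implementation; the seeding strategy and the parameter `r` are assumptions):

```python
from bisect import bisect_left, insort

def analyze_ranges(data, r=64):
    """One pass over the data: the first r distinct values seed the range
    boundaries, after which every element is counted into its range.

    Returns (boundaries, counts), where counts[i] is the number of
    elements x with boundaries[i-1] < x <= boundaries[i]; the first and
    last ranges also absorb anything below or above all boundaries.
    """
    boundaries = []   # sorted upper edges of the ranges
    counts = {}       # upper edge -> number of elements in that range
    for x in data:
        if len(boundaries) < r and x not in counts:
            insort(boundaries, x)   # a new value becomes a new boundary
            counts[x] = 0
        # the range whose upper edge is the smallest boundary >= x;
        # values beyond the last boundary fall into the last range
        i = min(bisect_left(boundaries, x), len(boundaries) - 1)
        counts[boundaries[i]] += 1
    return boundaries, [counts[b] for b in boundaries]
```

With `r` fixed, the per-element lookup is logarithmic in `r`, so the whole pass stays linear in the size of the dataset.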
The complexity of the process is "O(n)".
In the best case, we get a list of ranges containing almost the same number of elements. Usually, the ranges will be close in size.
For a certain sequence of numbers at the beginning of the set, it might happen that some ranges will contain orders of magnitude more elements than others. In the worst case, we get a list of ranges where all but one contain one element, and the remaining one contains all the other elements. Essentially, we return to the original task, but we can merge the smaller ranges and process the larger one again.
For clarity, you can find the Python implementation of the algorithm at the link: ranges.py

How to Use Range Information to Select an Element by Index

Now, with an ordered list of ranges with the number of elements in each, we can determine the exact range where the element with the desired index is located. Then, we select the numbers from the set only within this range, sort them, and obtain the required element. Sorting will be faster, and less memory will be needed.
Similarly, you can select not just one element but a slice. To do this, you need to find several ranges containing the required elements, but the logic of the process remains the same.
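To make the procedure concrete, here is a minimal sketch (my own, not the linked implementation; it assumes the boundaries/counts layout produced by a range-analysis pass, with each range's upper edge inclusive):

```python
def select_by_index(data, boundaries, counts, k):
    """Return the k-th smallest element (0-based) without sorting
    everything: locate the range holding index k, then sort only it."""
    # 1. Walk the counts to find which range holds global index k.
    seen = 0
    for i, c in enumerate(counts):
        if seen + c > k:
            break
        seen += c
    lo = boundaries[i - 1] if i > 0 else None   # exclusive lower edge
    hi = boundaries[i]                           # inclusive upper edge
    # 2. Select only the elements of that range (edge ranges also
    #    absorb values below the first / above the last boundary).
    if i == 0:
        bucket = [x for x in data if x <= hi]
    elif i == len(boundaries) - 1:
        bucket = [x for x in data if x > lo]
    else:
        bucket = [x for x in data if lo < x <= hi]
    # 3. Sort the (much smaller) bucket and index into it.
    return sorted(bucket)[k - seen]
```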
Let:
  • O - complexity of analysis
  • T - complexity of selecting elements by value
  • S - complexity of sorting
  • n - number of elements in the dataset
  • r - number of ranges
Then, the complexity of the entire process of selecting an element will be:
  • "O(n) + T(n) + S(n-r+1)" in the worst case
  • "O(n) + T(n/2) + S(n/r)" in the general case
  • "O(n) + T(1) + S(1)" in the best case
As we can see, it makes sense to start this process when the speed gain from sorting justifies the additional resource costs of data analysis and selection.

Other Uses of Range Information

Grouping Data in a Set

Range information can be used not only to get elements by index but also for other purposes. For example, you can group elements in a dataset in one pass since we know where elements from specific ranges should be.
Grouping algorithm:
  1. "Pick up" the first element from the set.
  2. By comparing the element's value with the ranges in the list, determine its native group, i.e., the section of the set where the element should be located.
  3. Find an element in the native group that should not be there.
  4. Place the "picked-up" element in place of the found one, and "pick up" the found element.
  5. Repeat the process from step 2 for all elements in the dataset.
Python implementation: groups.py
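The steps above amount to a cycle-placement pass (in the style of an American flag sort); a sketch, assuming the boundaries/counts layout from the analysis step:

```python
from bisect import bisect_left

def group_in_place(data, boundaries, counts):
    """Group `data` in place in one pass: elements of range i end up in
    the contiguous section [start[i], start[i+1]); sections themselves
    remain unsorted. Returns the list of section start offsets."""
    n = len(counts)
    start = [0] * n                     # section offsets from the counts
    for i in range(1, n):
        start[i] = start[i - 1] + counts[i - 1]
    nxt = start[:]                      # next free slot in each section
    end = [start[i] + counts[i] for i in range(n)]

    def bucket(x):
        # index of the native range of x (last range absorbs overflow)
        return min(bisect_left(boundaries, x), n - 1)

    for i in range(n):
        while nxt[i] < end[i]:
            x = data[nxt[i]]            # "pick up" an element
            b = bucket(x)
            while b != i:               # not native here: send it home,
                data[nxt[b]], x = x, data[nxt[b]]   # pick up the displaced one
                nxt[b] += 1
                b = bucket(x)
            data[nxt[i]] = x            # native: drop it in place
            nxt[i] += 1
    return start
```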
As a result, we get a dataset where elements close in value are grouped together, opening up additional possibilities:
  • Faster sorting of the entire dataset;
  • "Lazy" sorting;

Faster Sorting of the Entire Dataset

Since we can now sort the grouped dataset in parts, it will be sorted faster, although, due to the overhead of the additional steps, the effect will only be noticeable on very large datasets.
If the complexity of grouping is denoted as G, then the complexity of sorting the entire set will be:
  • "O(n) + G(n) + S(n-r+1) + S(1) * (r-1)" in the worst case
  • "O(n) + G(n) + S(n/r) * r" in the best case
The more evenly the elements are distributed across the ranges during the analysis, the closer we are to the "best case".
Even better, the dataset can be sorted in parallel across different threads or processes since the groups of elements do not intersect. Each thread or process can safely sort its group of elements.
Python implementation: multiprocess_sorter.py
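A minimal sketch of the parallel variant (my own, using threads for brevity; the `start` offsets are assumed to come from the grouping pass, and with CPython's GIL a real speedup would require `ProcessPoolExecutor` instead, as in the linked implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sort_groups(data, start):
    """Sort each pre-grouped section concurrently. The sections do not
    overlap, so every worker can safely sort its own slice."""
    ends = start[1:] + [len(data)]
    with ThreadPoolExecutor() as pool:
        # one sorted() call per section; map returns results in order
        parts = pool.map(lambda ab: sorted(data[ab[0]:ab[1]]),
                         zip(start, ends))
        return [x for part in parts for x in part]
```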

"Lazy" Sorting

Why sort the entire dataset when you can sort groups of elements as needed?
Analyze the set and group the elements, but sort each group only when an element from it is actually needed.
Python implementation: lazy_sorter.py
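A minimal sketch of the idea (my own names; it assumes the data has already been grouped and `start` holds the group offsets):

```python
class LazySorted:
    """Index into a grouped dataset, sorting each group only when an
    element from it is first requested."""

    def __init__(self, data, start):
        self.data = list(data)
        self.bounds = start + [len(data)]     # group edges
        self.sorted_flags = [False] * len(start)

    def __getitem__(self, k):
        # find the group containing global index k
        for g in range(len(self.sorted_flags)):
            a, b = self.bounds[g], self.bounds[g + 1]
            if a <= k < b:
                if not self.sorted_flags[g]:  # sort it on first touch
                    self.data[a:b] = sorted(self.data[a:b])
                    self.sorted_flags[g] = True
                return self.data[k]
        raise IndexError(k)
```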

Simple File Sorting Without Loading Files Entirely Into Memory

Using range information, you can sort files by loading them into memory in parts. The algorithm is as follows:
  1. Read the file for the first time, analyzing the ranges of values.
  2. Read it a second time and write the values into separate files for each range.
  3. If any of the files turns out to be too large, apply the same process to it.
  4. Finally, load the small files into memory one at a time, sort each, and append the result to the final output file.
Python implementation: file_sorter.py
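A sketch of the scatter-and-merge part of the algorithm (my own, not the linked implementation; it assumes one number per line and boundaries from a first analysis pass, and omits step 3, recursing on oversized parts, for brevity):

```python
import os
import tempfile
from bisect import bisect_left

def sort_file(src_path, dst_path, boundaries):
    """Sort a file of numbers without holding it all in memory: scatter
    lines into a temporary file per range, then sort each small file in
    memory and concatenate the results."""
    tmpdir = tempfile.mkdtemp()
    parts = [os.path.join(tmpdir, f"part_{i}.txt")
             for i in range(len(boundaries))]
    handles = [open(p, "w") for p in parts]
    try:
        with open(src_path) as src:       # second pass: scatter by range
            for line in src:
                x = float(line)
                b = min(bisect_left(boundaries, x), len(boundaries) - 1)
                handles[b].write(line)
    finally:
        for h in handles:
            h.close()
    with open(dst_path, "w") as dst:      # load, sort, append each part
        for p in parts:
            with open(p) as f:
                values = sorted(float(line) for line in f)
            dst.writelines(f"{v}\n" for v in values)
            os.remove(p)
    os.rmdir(tmpdir)
```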
