
Dataset Analysis and Sorting Speedup

In this article, I discuss how to analyze and sort data while conserving computational resources.

Problem

It is extremely rare to need to find the median or another percentile of a large set of unordered numbers, especially one whose minimum, maximum, and distribution of values are unknown. When such a situation does arise, however, it can cause a lot of headaches, since obtaining an element by index normally requires sorting the entire set. The situation worsens when there is a lot of data and not enough free memory, and the time spent on sorting can also become significant.

Simple Solution

What can be done to retrieve an element by index from a sorted set without sorting the entire set? You can analyze the data to determine approximately where the element with the desired index is located, thus simplifying the search. Typically, such an analysis means dividing the overall range of numbers into a suitable number of subranges, reading the series of numbers, counting how many elements fall into each subrange, and thereby determining which subrange contains the desired element. You can then select the elements of that subrange, sort them, and obtain the desired element by index.
However, if you do not know the overall range of the numbers or their distribution, you cannot decide in advance which subranges to use, and reading the entire set an extra time just to obtain this information is undesirable. In this situation, I propose determining the ranges automatically while reading the dataset.
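As a rough sketch of one way such an adaptive analysis can work (my own reconstruction, not necessarily the exact steps of the linked implementation; the seeding strategy and the parameter `r` are assumptions):

```python
from bisect import bisect_left, insort

def analyze_ranges(data, r=64):
    """One pass over the data: the first r distinct values seed the range
    boundaries, after which every element is counted into its range.

    Returns (boundaries, counts), where counts[i] is the number of
    elements x with boundaries[i-1] < x <= boundaries[i]; the first and
    last ranges also absorb anything below or above all boundaries.
    """
    boundaries = []   # sorted upper edges of the ranges
    counts = {}       # upper edge -> number of elements in that range
    for x in data:
        if len(boundaries) < r and x not in counts:
            insort(boundaries, x)   # a new value becomes a new boundary
            counts[x] = 0
        # the range whose upper edge is the smallest boundary >= x;
        # values beyond the last boundary fall into the last range
        i = min(bisect_left(boundaries, x), len(boundaries) - 1)
        counts[boundaries[i]] += 1
    return boundaries, [counts[b] for b in boundaries]
```

With `r` fixed, the per-element lookup is logarithmic in `r`, so the whole pass stays linear in the size of the dataset.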
The complexity of the process is "O(n)".
In the best case, we get a list of ranges containing almost the same number of elements. Usually, the ranges will be close in size.
For a certain sequence of numbers at the beginning of the set, it might happen that some ranges will contain orders of magnitude more elements than others. In the worst case, we get a list of ranges where all but one contain one element, and the remaining one contains all the other elements. Essentially, we return to the original task, but we can merge the smaller ranges and process the larger one again.
For clarity, you can find the Python implementation of the algorithm at the link: ranges.py

How to Use Range Information to Select an Element by Index

Now, with an ordered list of ranges with the number of elements in each, we can determine the exact range where the element with the desired index is located. Then, we select the numbers from the set only within this range, sort them, and obtain the required element. Sorting will be faster, and less memory will be needed.
Similarly, you can select not just one element but a slice. To do this, you need to find several ranges containing the required elements, but the logic of the process remains the same.
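To make the procedure concrete, here is a minimal sketch (my own, not the linked implementation; it assumes the boundaries/counts layout produced by a range-analysis pass, with each range's upper edge inclusive):

```python
def select_by_index(data, boundaries, counts, k):
    """Return the k-th smallest element (0-based) without sorting
    everything: locate the range holding index k, then sort only it."""
    # 1. Walk the counts to find which range holds global index k.
    seen = 0
    for i, c in enumerate(counts):
        if seen + c > k:
            break
        seen += c
    lo = boundaries[i - 1] if i > 0 else None   # exclusive lower edge
    hi = boundaries[i]                           # inclusive upper edge
    # 2. Select only the elements of that range (edge ranges also
    #    absorb values below the first / above the last boundary).
    if i == 0:
        bucket = [x for x in data if x <= hi]
    elif i == len(boundaries) - 1:
        bucket = [x for x in data if x > lo]
    else:
        bucket = [x for x in data if lo < x <= hi]
    # 3. Sort the (much smaller) bucket and index into it.
    return sorted(bucket)[k - seen]
```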
Let:
  • O - complexity of analysis
  • T - complexity of selecting elements by value
  • S - complexity of sorting
  • n - number of elements in the dataset
  • r - number of ranges
Then, the complexity of the entire process of selecting an element will be:
  • "O(n) + T(n) + S(n-r+1)" in the worst case
  • "O(n) + T(n/2) + S(n/r)" in the general case
  • "O(n) + T(1) + S(1)" in the best case
As we can see, it makes sense to start this process when the speed gain from sorting justifies the additional resource costs of data analysis and selection.

Other Uses of Range Information

Grouping Data in a Set

Range information can be used not only to get elements by index but also for other purposes. For example, you can group elements in a dataset in one pass since we know where elements from specific ranges should be.
Grouping algorithm:
  1. "Pick up" the first element from the set.
  2. By comparing the element's value with the ranges in the list, determine its native group, i.e., the section of the set where the element should be located.
  3. Find an element in the native group that should not be there.
  4. Place the "picked-up" element in place of the found one, and "pick up" the found element.
  5. Repeat the process from step 2 for all elements in the dataset.
Python implementation: groups.py
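The steps above amount to a cycle-placement pass (in the style of an American flag sort); a sketch, assuming the boundaries/counts layout from the analysis step:

```python
from bisect import bisect_left

def group_in_place(data, boundaries, counts):
    """Group `data` in place in one pass: elements of range i end up in
    the contiguous section [start[i], start[i+1]); sections themselves
    remain unsorted. Returns the list of section start offsets."""
    n = len(counts)
    start = [0] * n                     # section offsets from the counts
    for i in range(1, n):
        start[i] = start[i - 1] + counts[i - 1]
    nxt = start[:]                      # next free slot in each section
    end = [start[i] + counts[i] for i in range(n)]

    def bucket(x):
        # index of the native range of x (last range absorbs overflow)
        return min(bisect_left(boundaries, x), n - 1)

    for i in range(n):
        while nxt[i] < end[i]:
            x = data[nxt[i]]            # "pick up" an element
            b = bucket(x)
            while b != i:               # not native here: send it home,
                data[nxt[b]], x = x, data[nxt[b]]   # pick up the displaced one
                nxt[b] += 1
                b = bucket(x)
            data[nxt[i]] = x            # native: drop it in place
            nxt[i] += 1
    return start
```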
As a result, we get a dataset where elements close in value are grouped together, opening up additional possibilities:
  • Faster sorting of the entire dataset;
  • "Lazy" sorting;

Faster Sorting of the Entire Dataset

Since we can now sort the grouped dataset in parts, it will be sorted faster, although, due to the overhead of the additional steps, the effect will only be noticeable on very large datasets.
If the complexity of grouping is denoted as G, then the complexity of sorting the entire set will be:
  • "O(n) + G(n) + S(n-r+1) + S(1) * (r-1)" in the worst case
  • "O(n) + G(n) + S(n/r) * r" in the best case
The more evenly the elements are distributed across the ranges during the analysis, the closer we are to the "best case".
Even better, the dataset can be sorted in parallel across different threads or processes since the groups of elements do not intersect. Each thread or process can safely sort its group of elements.
Python implementation: multiprocess_sorter.py
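A minimal sketch of the parallel variant (my own, using threads for brevity; the `start` offsets are assumed to come from the grouping pass, and with CPython's GIL a real speedup would require `ProcessPoolExecutor` instead, as in the linked implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sort_groups(data, start):
    """Sort each pre-grouped section concurrently. The sections do not
    overlap, so every worker can safely sort its own slice."""
    ends = start[1:] + [len(data)]
    with ThreadPoolExecutor() as pool:
        # one sorted() call per section; map returns results in order
        parts = pool.map(lambda ab: sorted(data[ab[0]:ab[1]]),
                         zip(start, ends))
        return [x for part in parts for x in part]
```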

"Lazy" Sorting

Why sort the entire dataset when you can sort groups of elements as needed?
Analyze the set and group the elements, but sort each group only when an element from it is actually needed.
Python implementation: lazy_sorter.py
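A minimal sketch of the idea (my own names; it assumes the data has already been grouped and `start` holds the group offsets):

```python
class LazySorted:
    """Index into a grouped dataset, sorting each group only when an
    element from it is first requested."""

    def __init__(self, data, start):
        self.data = list(data)
        self.bounds = start + [len(data)]     # group edges
        self.sorted_flags = [False] * len(start)

    def __getitem__(self, k):
        # find the group containing global index k
        for g in range(len(self.sorted_flags)):
            a, b = self.bounds[g], self.bounds[g + 1]
            if a <= k < b:
                if not self.sorted_flags[g]:  # sort it on first touch
                    self.data[a:b] = sorted(self.data[a:b])
                    self.sorted_flags[g] = True
                return self.data[k]
        raise IndexError(k)
```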

Simple File Sorting Without Loading Files Entirely Into Memory

Using range information, you can sort files by loading them into memory in parts. The algorithm is as follows:
  1. Read the file for the first time, analyzing the ranges of values.
  2. Read it a second time and write the values into separate files for each range.
  3. If any of the files turns out to be too large, apply the same process to it.
  4. Finally, load the small files into memory one at a time, sort each, and append the result to the final output file.
Python implementation: file_sorter.py
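A sketch of the scatter-and-merge part of the algorithm (my own, not the linked implementation; it assumes one number per line and boundaries from a first analysis pass, and omits step 3, recursing on oversized parts, for brevity):

```python
import os
import tempfile
from bisect import bisect_left

def sort_file(src_path, dst_path, boundaries):
    """Sort a file of numbers without holding it all in memory: scatter
    lines into a temporary file per range, then sort each small file in
    memory and concatenate the results."""
    tmpdir = tempfile.mkdtemp()
    parts = [os.path.join(tmpdir, f"part_{i}.txt")
             for i in range(len(boundaries))]
    handles = [open(p, "w") for p in parts]
    try:
        with open(src_path) as src:       # second pass: scatter by range
            for line in src:
                x = float(line)
                b = min(bisect_left(boundaries, x), len(boundaries) - 1)
                handles[b].write(line)
    finally:
        for h in handles:
            h.close()
    with open(dst_path, "w") as dst:      # load, sort, append each part
        for p in parts:
            with open(p) as f:
                values = sorted(float(line) for line in f)
            dst.writelines(f"{v}\n" for v in values)
            os.remove(p)
    os.rmdir(tmpdir)
```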
