Contributor: Olesia Slavska
Organisation: Open2C
Mentors: Nezar Abdennur and Anton Goloborodko
Open2C is a community dedicated to creating and maintaining open-source tools for studying the 3D structure of chromosomes and analyzing genomic data. Our main interest lies in the analysis of Hi-C and related datasets, which capture the 3D organization of chromosomes. We prioritize making our tools user-friendly, adaptable to new analytical methods, and scalable to large, cutting-edge datasets.
Pairtools is a software tool developed by the Open2C community for processing and analyzing paired-end sequencing data generated by Hi-C and other chromosome conformation capture techniques. It is specifically designed to handle the raw pairs of reads produced during Hi-C experiments, allowing users to efficiently filter, sort, and manipulate the paired-end data to study 3D genome organization. Pairtools is optimized for scalability, enabling the analysis of large genomic datasets while providing flexibility and ease of use for researchers working on 3D chromosome biology.
The main format pairtools uses for storing Hi-C contacts is a gzipped .tsv, which can slow down both computation and storage. One idea for speeding this up was to implement an alternative binary pairs format based on Apache Parquet, improving I/O performance and reducing the storage footprint.
Discussions with my mentors during the community bonding period also highlighted the importance of optimizing I/O operations for scalability, which would involve integrating technologies such as Apache Arrow, Polars, or Dask for more efficient data handling.
Converting pairs.gz files into Parquet was a bit of a challenge, but in the end we did it successfully. Metadata handling still needs work, and we plan to address it in the future. What we discovered during the first coding weeks, however, is that the really big challenge is processing (especially sorting) large files (approximately 7-10 GB of compressed gzip). Unfortunately, most of the tools we initially wanted to use could not handle them at all: we would wait 5-10 hours to see whether a computation would finish, and in most cases it did not. So we shifted our goals.
Processing big data was tough, and sorting had a special place in it. We wanted to find another way of doing it and improve the performance. The goal became to create our own merge sort based on pandas and Parquet, where the key challenge is that the input table cannot fit into memory.
Pairtools already uses the Unix sort utility, which is very good: it is quick thanks to its efficient algorithms and use of system resources. It employs optimized algorithms such as merge sort and quicksort, which have good average-case performance, uses efficient I/O, and can take advantage of both system memory and disk to handle large datasets. It is hard to beat in terms of runtime, temporary memory usage, and so on. But we tried, and here we are.
(Spoiler: in the end Unix sort won everything)
We were not sure which of the tools would be successful in the end, so there are 2-3 variants of each implementation. All of them can be found in the gists.
What we created in the end:
- A converter from the pairs.gz format to the .parquet format
- A sort of pairs.gz that saves the result as a Parquet file
- A merge sort similar to the Unix utility, implemented with pandas and the .parquet format
While implementing a pairs.gz-to-Parquet converter within pairtools, we wanted to consider metadata handling, performance, runtime, and user experience, so we tried to take advantage of what each tool does best.
DuckDB provides seamless integration, as it automatically manages metadata during the read and write processes. This makes it easy to use SQL queries to optimize performance, but it can obscure some lower-level operations, which might create challenges later, when we need more detailed control.
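As a rough illustration, here is a minimal sketch of a DuckDB-based conversion. The paths, column set, and compression settings are assumptions for illustration only; the real .pairs header is richer, and the actual converter handles it differently.

```python
import gzip
import duckdb

PAIRS = "input.pairs.gz"    # hypothetical input path
OUT = "output.parquet"      # hypothetical output path

# Count the '#'-prefixed header lines so read_csv can skip them.
with gzip.open(PAIRS, "rt") as f:
    n_header = 0
    for line in f:
        if not line.startswith("#"):
            break
        n_header += 1

# DuckDB auto-detects the gzip compression from the file extension and
# writes Parquet metadata itself; the column set below is a minimal
# assumed .pairs body, not the full pairtools schema.
duckdb.sql(f"""
    COPY (
        SELECT * FROM read_csv(
            '{PAIRS}',
            delim='\t',
            header=false,
            skip={n_header},
            columns={{'readID': 'VARCHAR', 'chrom1': 'VARCHAR', 'pos1': 'BIGINT',
                      'chrom2': 'VARCHAR', 'pos2': 'BIGINT',
                      'strand1': 'VARCHAR', 'strand2': 'VARCHAR',
                      'pair_type': 'VARCHAR'}}
        )
    ) TO '{OUT}' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```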
On the other hand, Dask can be particularly advantageous for processing large files. By breaking datasets into smaller chunks, Dask manages metadata efficiently and enables parallel processing that conserves memory. However, this chunking can introduce overhead that slows down operations on smaller files.
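A hedged sketch of the Dask route might look like the following; the paths, column list, and partition size are placeholders. Because gzip is not splittable, the compressed file has to be read as a single partition and repartitioned afterwards.

```python
import dask.dataframe as dd

# Hypothetical column names for the body of a .pairs file
PAIRS_COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2",
                 "strand1", "strand2", "pair_type"]

# gzip is not splittable, so blocksize=None tells Dask to read the
# compressed file as one partition instead of trying to split it.
df = dd.read_csv(
    "input.pairs.gz",          # hypothetical path
    sep="\t",
    comment="#",               # drop the '#'-prefixed pairs header
    names=PAIRS_COLUMNS,
    header=None,
    compression="gzip",
    blocksize=None,
)

# Repartition so downstream work happens on memory-sized chunks,
# then write one Parquet file per partition.
df = df.repartition(partition_size="256MB")
df.to_parquet("pairs_parquet/", engine="pyarrow", compression="zstd")
```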
Polars is an exciting option with its focus on performance and efficient metadata handling. It provides detailed insight into data types and schemas, making quick adjustments during conversion easy. Though still maturing, Polars is designed to handle large datasets swiftly, which makes it a strong candidate for medium-sized files. Unfortunately, it could not handle our largest datasets.
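For completeness, a minimal Polars sketch is shown below. It assumes a recent Polars release in which read_csv transparently decompresses gzip, and it uses the eager API, so the whole table must fit in memory; the paths and column names are again placeholders.

```python
import polars as pl

# Hypothetical column names for the body of a .pairs file
PAIRS_COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2",
                 "strand1", "strand2", "pair_type"]

# read_csv decompresses the gzip stream in memory, so this eager path
# suits small and medium files rather than our largest inputs.
df = pl.read_csv(
    "input.pairs.gz",          # hypothetical path
    separator="\t",
    comment_prefix="#",        # skip the '#'-prefixed pairs header
    has_header=False,
    new_columns=PAIRS_COLUMNS,
)
df.write_parquet("output.parquet", compression="zstd")
```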
Overall, we built all of these versions and experimented with different formats and parallelization strategies. The results of each can be found in the Conclusion.

The Unix sort utility is a powerful tool that can sort files larger than memory by using an external, hierarchical merge sort. We wanted to implement something similar using Python and pandas to sort large tables that exceed memory capacity.
Unix sort operates in two distinct phases:
- Partitioning and sorting: the input file is divided into blocks that comfortably fit into memory. Each block is sorted individually and then written to a temporary file. This ensures that all of the data eventually gets sorted, broken down into manageable blocks.
- Merging sorted chunks: the sorted blocks are merged into the final sorted output. Since each block is already sorted, the algorithm performs an efficient merge. However, because all blocks together are still too large to fit into memory, they are processed in smaller chunks: at any point, the set of active chunks fits into memory. The chunks are loaded, merged, and sorted incrementally, with each newly merged chunk appended to the output.
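To make the two phases concrete, here is a minimal, self-contained Python sketch of the same scheme over plain text lines, using only the standard library. It is an illustration of the idea rather than what Unix sort or pairtools actually does: heapq.merge performs the k-way merge while keeping just one pending line per run in memory, and lines are compared lexicographically, as sort without key options would.

```python
import heapq
import tempfile
from pathlib import Path

def external_sort(lines, out_path, block_size=1_000_000):
    """Two-phase external sort over an iterable of newline-terminated lines.

    Phase 1: sort memory-sized blocks and spill each to a temporary run file.
    Phase 2: k-way merge the sorted runs with heapq.merge, which keeps only
    one pending line per run in memory at any time.
    (Sketch only: temporary run files are not cleaned up here.)
    """
    tmp_dir = Path(tempfile.mkdtemp())
    run_paths = []

    def spill(block):
        block.sort()
        path = tmp_dir / f"run_{len(run_paths)}.txt"
        path.write_text("".join(block))
        run_paths.append(path)

    # Phase 1: partition and sort
    block = []
    for line in lines:
        block.append(line)
        if len(block) >= block_size:
            spill(block)
            block = []
    if block:
        spill(block)

    # Phase 2: merge the sorted runs into the final output
    runs = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for f in runs:
        f.close()
```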
To replicate this behavior in Python using pandas, we can follow a similar two-step process. However, there are some challenges specific to memory management when dealing with large datasets. The key issue is that not all of the sorted data can be output immediately, since some later partitions may still contain entries that are "smaller" than the entries in the current merged-chunk.
So the idea is:
- Partitioning and sorting: load the input table in blocks that each fit into memory, sort each block with pandas' built-in sorting functions, and save each sorted block to a temporary file.
- Merging sorted chunks: load the first chunk from each sorted block, so that all loaded chunks together fit into memory, and mark the end of each chunk with a flag. Merge the loaded chunks into a single sorted merged-chunk. Here is the tricky part: not every entry of the merged-chunk should be written out immediately, because chunks loaded later may still contain entries smaller than some entries in the current merged-chunk.
To handle this, we added a column called "last_in_chunk" to each chunk, in which every entry is marked False except the last one, which is marked True. After merging, the merged-chunk is split at the first entry where last_in_chunk == True. The upper part can be safely written to the output file, since we know it is fully sorted and smaller than any future entries. The lower part (containing entries that may still need to be compared with future chunks) is carried over into the next iteration, with last_in_chunk reset to False for all of its entries. This repeats until all chunks of all blocks have been processed.
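Below is a minimal sketch of this merging phase with pandas and pyarrow, under a few assumptions: SORT_COLS is a placeholder for the actual sort keys, write_output is a hypothetical callback that appends a DataFrame to the output (for example a wrapper around a pyarrow ParquetWriter), and each round loads the next chunk from every block that still has data. It illustrates the scheme rather than reproducing the project's exact implementation.

```python
import pandas as pd
import pyarrow.parquet as pq

# Placeholder sort keys for illustration; the real sort uses pairtools' keys.
SORT_COLS = ["chrom1", "chrom2", "pos1", "pos2"]

def merge_sorted_blocks(block_paths, write_output, chunk_rows=500_000):
    """Merge pre-sorted Parquet blocks that together do not fit into memory.

    Each round loads the next chunk from every block that still has data,
    marks the last row of every freshly loaded chunk with last_in_chunk=True,
    merges those chunks with the carried-over remainder, and flushes
    everything up to the first chunk-end marker, which is guaranteed to be
    <= all still-unread rows.
    """
    readers = [pq.ParquetFile(p).iter_batches(batch_size=chunk_rows)
               for p in block_paths]

    def next_chunk(reader):
        """Load one more chunk from a block, flagging its final row."""
        try:
            batch = next(reader)
        except StopIteration:
            return None
        df = batch.to_pandas()
        df["last_in_chunk"] = False
        df.iloc[-1, df.columns.get_loc("last_in_chunk")] = True
        return df

    carry = pd.DataFrame()
    while True:
        # Load the next chunk from every block that is not yet exhausted.
        fresh = [c for c in (next_chunk(r) for r in readers) if c is not None]
        if not fresh and carry.empty:
            break

        frames = ([carry] if not carry.empty else []) + fresh
        merged = pd.concat(frames, ignore_index=True)
        merged = merged.sort_values(SORT_COLS, ignore_index=True)

        boundaries = merged.index[merged["last_in_chunk"]]
        if len(boundaries) == 0:
            # No unread data remains anywhere: flush the rest and finish.
            write_output(merged.drop(columns=["last_in_chunk"]))
            break

        # Rows up to the first chunk-end marker are <= every unread row,
        # so they can be written out; the rest is carried to the next round.
        cut = boundaries[0] + 1
        write_output(merged.iloc[:cut].drop(columns=["last_in_chunk"]))
        carry = merged.iloc[cut:].copy()
        carry["last_in_chunk"] = False
        # Note: if the blocks' value ranges are very uneven, carry can grow;
        # a production implementation would need to guard against that.
```

A caller would pass the temporary block files produced in the partitioning phase plus a writer callback; the flushed pieces arrive already in global order, so they can simply be appended to the output file.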
The original goal of this project was to create a new binary pairs format to improve performance and storage efficiency; it evolved into finding a quicker way to sort pairs. As the project progressed, it became clear that leveraging Parquet files and developing a Unix-like merge sort could offer a faster, more scalable solution. While the final implementation outperformed existing alternatives in many areas, it unfortunately did not reach the performance improvements we initially aimed for.
Despite this, the project yielded valuable insights into data structures, file formats, and algorithmic limitations. These lessons will guide future exploration of sorting methods, with a focus on refining the merge sort or combining it with other approaches to better meet performance and efficiency goals.
| Tool / input (1.4 GB pairs.gz) | Max memory usage | Time |
|---|---|---|
| Pandas, pairs.gz (Unix-like merge sort) | 9.64 GB | 5.08 min |
| DuckDB, pairs.gz | 52.78 GB | 5.50 min |
| Pairtools (Unix sort utility) | 2.13 GB | 4.14 min |
| Pandas, Parquet | 62.5 GB | 6.23 min |
| DuckDB, Parquet | 50.3 GB | 6.50 min |
| Dask, Parquet* | 42.55 GB | 6.21 min |
| Polars, Parquet* | 46.7 GB | 2.12 min |

\* sorting only, without the conversion from pairs.gz to .parquet
I want to say a special thank you to my mentor, Anton Goloborodko, for all the support and guidance throughout my GSoC journey. Anton, your mentorship has been invaluable: I have learned so much from you, and your advice made a real difference.
To the Open2C team, thank you for creating such a positive and collaborative environment. It’s been a great experience being part of the group, and I’m grateful for all the opportunities and encouragement along the way.