Modern data storage and processing systems handle massive volumes of data, making efficient storage and access crucial. Data compression addresses this by reducing dataset sizes, which in turn lowers storage costs and speeds up data reads. In columnar data formats (like Parquet or ORC), similar values are stored together, enabling very high compression ratios. For example, Parquet’s columnar storage often yields higher compression than row-based formats (All About Parquet Part 05 - Compression Techniques in Parquet - DEV Community). Compressed data means less I/O (disk and network) for the same information, which can dramatically improve query performance. In fact, using Parquet or ORC can shrink datasets to a fraction of their raw size.
Graphics Processing Units (GPUs) offer massive parallelism and high memory bandwidth, theoretically enabling order-of-magnitude speedups for data analytics. However, mainstream analytical databases like ClickHouse, DuckDB, and BigQuery remain largely CPU-based. This report examines the technical reasons behind the limited GPU usage in traditional OLAP databases. We compare CPU vs GPU performance for real-world analytical queries, analyzing how factors like data volume, memory access patterns, and architecture affect outcomes. We then contrast this with GPU-native analytic engines – OmniSci (MapD/Heavy.AI), BlazingSQL (RAPIDS), and related technologies – highlighting their architecture, performance benchmarks, and practical challenges. Engineering trade-offs, scalability considerations, and the suitability of GPU acceleration for startups vs. enterprises are discussed, supported by findings from academic research.
The `DecorrelatePredicateSubquery` rule in Apache DataFusion is responsible for rewriting correlated subqueries in `WHERE`/`HAVING` clauses (specifically `IN` and `EXISTS` predicates, including their negations) into semijoin or antijoin operations. This transforms a nested query into a flat, join-based plan for execution. To achieve this, the rule employs a carefully orchestrated recursion strategy that handles subqueries within subqueries (nested subqueries) and coordinates with DataFusion’s optimizer driver to avoid duplicate traversals.
Top-Down Invocation: The rule is registered to run in a top-down manner. In its implementation of `OptimizerRule`, it overrides `apply_order` to return `Some(ApplyOrder::TopDown)` (decorrelate_predicate_subquery.rs - source).
Apache Arrow defines a standardized in-memory columnar format composed of primitive and nested data types. Each Arrow array is backed by one or more contiguous buffers (blocks of memory) and optional metadata such as length and null count (Internal structure of Arrow objects • Arrow R Package) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). This report provides a deep dive into all Arrow physical layouts in version 17.0.0, covering their memory structure, Rust implementation, code examples, and performance trade-offs. We will explore primitives, variable-length types, list types, struct, union, dictionary encoding, and run-end encoding.
Apache Arrow defines a standardized columnar in-memory format to enable high-performance analytics across languages. It emphasizes contiguous, aligned memory and minimal metadata, allowing zero-copy sharing and efficient vectorized processing (Arrow Columnar Format — Apache Arrow v19.0.1). This answer breaks down Arrow’s physical memory layout fields (validity bitmap, offsets, data buffers, type IDs, etc.), and explains how these map to the Rust implementation (using the `arrow` and `arrow-array` crates). We’ll cover buffer structures (Buffer 0/1/2), validity bitmaps, alignment/padding rules, nested types (List, Struct, Union), offset buffers for variable-length types, dictionary encoding, and run-end encoding. Finally, we highlight how Arrow’s Rust crates represent these concepts with structs, enums, and safe memory management. Diagrams and code examples are included.
Advanced Database Engine Techniques in Rust:
Modern high-performance database systems like Umbra use a combination of low-level techniques to maximize performance. We’ll explore three such concepts – pointer swizzling, optimized versioned latching, and adaptive compilation – with clear explanations and Rust code examples. Each section includes a standalone demo and then shows how these techniques integrate into a mini database component (like a B+-tree and query executor). We’ll discuss design considerations (performance, concurrency, correctness) and use Rust’s low-level features (`unsafe`, atomics, custom memory layouts) where appropriate.
Building a serverless ETL pipeline that efficiently loads structured data into Amazon Aurora is a common requirement. AWS Lambda, combined with DuckDB, provides a powerful way to perform in-memory analytics and bulk insert data into Aurora.
This guide demonstrates how to:
- Read a CSV file from Amazon S3
- Process the data in-memory using DuckDB
- Bulk-insert the results into Amazon Aurora
While Rust provides `tokio::join!` and `std::sync::Barrier` for synchronizing tasks, a `WaitGroup` (available in the `crossbeam` crate, or easily built on top of `tokio` primitives) offers a lightweight, efficient way to wait for multiple tasks to complete, similar to Go’s `sync.WaitGroup`.
Using `tokio::sync::Notify` to implement a simple `WaitGroup`: