Trino, the distributed SQL query engine formerly known as PrestoSQL, is engineered for high-performance, interactive analytics across a multitude of heterogeneous data sources.1 Its architecture is particularly well-suited for querying large datasets residing in data lakes, whether deployed on-premises using HDFS or in the cloud on object storage systems like Amazon S3, Google Cloud Storage, or Azure Blob Storage.2 A key capability enabling this is Trino's schema-on-read approach, allowing users to query data in various formats directly where it resides, without requiring upfront transformation and loading into a proprietary storage system.3
JSON (JavaScript Object Notation) has become ubiquitous in modern data ecosystems, frequently used for API payloads, application logs, configuration files, and semi-structured data exchange.7 However, querying JSON efficiently at scale presents significant challenges compared to optimized columnar formats like Parquet or ORC. These challenges stem from JSON's inherently flexible and often variable schema, the potential for deep nesting, and the computational overhead associated with parsing text-based structures into queryable formats. Indeed, parsing JSON can consume a substantial portion, sometimes cited as 80-90%, of the total processing time in big data applications.8
This report provides a detailed technical analysis of Trino's architecture, optimizations, and comparative performance specifically tailored for processing JSON files within data lake environments like S3 and HDFS. It delves into the engine's internal mechanisms, memory management strategies, comparative standing against other relevant query engines, and the potential role of advanced CPU features like SIMD in accelerating JSON parsing. The objective is to offer system architects and engineers a comprehensive understanding of how Trino tackles the complexities of JSON data at scale.
Trino's efficiency in handling diverse workloads, including JSON processing, is rooted in its core distributed architecture and specific optimizations targeting data retrieval and processing.
Trino employs a Massively Parallel Processing (MPP) architecture, distributing query execution across a cluster of nodes.2 This cluster comprises a single coordinator node and multiple worker nodes.10 The coordinator acts as the brain of the cluster, responsible for receiving client queries, parsing SQL, analyzing semantics, optimizing the query plan, scheduling distributed tasks, and managing worker node status.2 It serves as the sole point of contact for client applications.11
The workers execute the tasks assigned by the coordinator. These tasks involve retrieving data from underlying sources and performing computations like filtering, transformations, and aggregations.10 A fundamental design principle is that, by default, data flows in a pipelined fashion between operators and workers, primarily residing in memory without intermediate persistence to disk.11 This streaming model minimizes I/O latency during query execution.
Central to Trino's ability to query diverse systems is its connector architecture.3 Connectors act as plugins that abstract the specifics of underlying data sources, allowing Trino to interact with relational databases, NoSQL systems, and data lake storage.13 For data lakes built on HDFS or S3, the Hive connector is paramount. It's crucial to understand that the Hive connector enables Trino to read data organized according to Hive conventions (i.e., using a Hive Metastore Service (HMS) for metadata and storing data files in specified locations) but it does not use the Hive query execution engine (like MapReduce or Tez).5 Trino replaces the Hive runtime with its own MPP engine, leveraging the existing metastore and data files.14 Modern table formats like Apache Iceberg are also increasingly used in data lakes and accessed via dedicated Trino connectors, offering improved metadata management and transactional capabilities compared to traditional Hive tables.15
Trino provides several mechanisms for handling JSON data, aligning with SQL standards while offering flexibility:
- SQL/JSON Functions: Trino implements standard SQL/JSON functions, namely json_exists, json_query, and json_value, introduced around release 383.16 These functions operate primarily on JSON data represented as character strings (VARCHAR) or binary strings (VARBINARY).16 They utilize a powerful JSON path language to navigate and extract specific elements or values from within the JSON structure, supporting operations like member access ($.key), array indexing ([*], [index]), and filtering (?(...)).7
- Native JSON Type: Beyond treating JSON as strings, Trino supports a native JSON data type.18 This type can represent any valid JSON value (object, array, string, number, boolean, null) and allows for more structured handling within the engine. It can be used in table definitions, for instance, when connecting to sources like MongoDB 19, or as a target type when casting from other types like VARCHAR or structured types like ROW or MAP.16
- Hive Connector JSON Support: The Hive connector facilitates reading JSON files stored in the data lake. Historically, this often involved using Hive SerDes (Serializer/Deserializer) like org.apache.hadoop.hive.serde2.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe.22 Trino also includes its own native readers for formats like JSON, aiming for compatibility with Hive's behavior.23 Notably, early versions of the Hive connector might have returned complex Hive types (arrays, maps) as JSON-encoded strings, whereas later versions materialize them directly into Trino's corresponding structural types (ARRAY, MAP).25
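The SQL/JSON functions and casts described above can be sketched as follows. This is an illustrative fragment only: the table `raw_events` and column `payload` are hypothetical, and exact wrapper-clause support should be checked against the Trino release in use.

```sql
-- 'payload' is assumed to be a VARCHAR column holding JSON text.
SELECT
  json_exists(payload, 'lax $.user.id')                          AS has_user_id,
  json_value(payload, 'lax $.user.name')                         AS user_name,
  json_query(payload, 'lax $.events[*].type' WITH ARRAY WRAPPER) AS event_types
FROM raw_events;

-- Moving between VARCHAR, the native JSON type, and structured types:
SELECT CAST(json_parse(payload) AS MAP(VARCHAR, JSON))
FROM raw_events;
```

`json_parse` materializes the string into the native JSON type, after which casts to structured types such as MAP or ROW become available.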
A critical consideration arises from the goal of maintaining compatibility between Trino's native Hive readers and traditional Hive behavior. Specifically, to replicate Hive's handling of duplicate keys within a JSON object (where Hive typically returns the value associated with the last occurrence of the key), Trino's native JSON reader may need to parse JSON messages twice.23 This double parsing stems from an internal architectural constraint related to the immutability of data blocks once written; the engine cannot easily "unwrite" or overwrite the first key-value pair encountered.23 Consequently, benchmarks comparing the native reader against the traditional Hive SerDe for certain workloads, such as GZ-compressed JSON files at terabyte scale, have shown the native reader to be substantially slower (e.g., 18 seconds vs. 8 seconds in one reported test).23 This presents a trade-off: using the native reader ensures closer behavioral compatibility with Hive but may incur a performance penalty, while using the external SerDe might be faster but could lead to subtle differences in handling edge cases or limit the applicability of certain internal Trino optimizations.
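To pin the traditional SerDe behavior described above, the table can be defined on the Hive side (via HiveQL against the metastore) and then read through Trino's Hive connector. A minimal sketch, with hypothetical schema, bucket, and column names:

```sql
-- Hive DDL (executed in Hive, not Trino); Trino then reads the table
-- through the shared metastore.
CREATE EXTERNAL TABLE lake.events_json (
  id      BIGINT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3a://my-bucket/events/';
```

Whether Trino's native JSON reader or the SerDe path is used for such a table depends on the Trino version and connector configuration, which is where the compatibility-versus-speed trade-off above becomes operationally relevant.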
Predicate pushdown is a fundamental optimization technique where filtering conditions (predicates) from the WHERE clause of a SQL query are pushed down from the query engine to the underlying data source connector, and potentially even into the storage system itself.26 This reduces the amount of data that needs to be scanned, transferred over the network, and processed by Trino's workers, leading to significant performance improvements.26 A successful pushdown is often indicated by the absence of a ScanFilterProject operation for the corresponding filter in the query's EXPLAIN plan.26
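Checking for pushdown is typically done by inspecting the plan. A hedged sketch (catalog, schema, and column names are hypothetical):

```sql
EXPLAIN
SELECT id
FROM hive.lake.events
WHERE event_date = DATE '2024-01-01';
-- If the predicate is pushed into the connector (e.g., via partition
-- pruning), the plan shows a plain TableScan for this table rather than
-- a ScanFilterProject node carrying the filter.
```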
For nested data structures, including JSON that has been parsed into Trino's internal ROW type or handled via the native JSON type, Trino employs a powerful optimization called dereference pushdown.27 This optimization focuses on queries that access subfields within nested structures (e.g., SELECT event.payload.user_id...). The key ideas are:
- Early Projection: The query planner pushes the projection of the required subfields (the dereference operation, .) as close to the data source scan as possible in the execution plan.27
- Connector-Level Pruning: For connectors interacting with columnar storage formats (like ORC or Parquet), this pushdown allows the connector to read only the specific subfield columns required by the query, skipping entire columns corresponding to unused parts of the nested structure.27
- Predicate Pushdown on Subfields: Filters applied to these nested fields (e.g., WHERE event.payload.country = 'US') are also pushed down. In columnar formats, the readers can leverage statistics stored within the files (min/max values, bloom filters) to prune entire files, row groups (Parquet), or stripes (ORC) that cannot possibly match the predicate.27
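The mechanics above can be illustrated with a query against a nested ROW column (all identifiers here are hypothetical; the behavior applies when the underlying files are columnar, e.g., Parquet or ORC):

```sql
-- 'event' is assumed to be a ROW-typed column with a nested 'payload' row.
SELECT event.payload.user_id
FROM hive.lake.events
WHERE event.payload.country = 'US';
-- With dereference pushdown, the reader fetches only the
-- payload.user_id and payload.country subcolumns, and file/row-group/
-- stripe statistics can eliminate chunks where country = 'US' is impossible.
```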
While dereference pushdown significantly optimizes access after parsing or when data is already structured, applying predicates directly to raw JSON content during the scan is more challenging. SQL/JSON functions like json_value(data, 'lax $.status') = 'ACTIVE' inherently involve parsing the JSON string data within Trino before the comparison can occur.16 True pushdown of such a predicate into the raw file scan (i.e., skipping rows based on JSON content before full parsing) is generally limited for plain JSON files on S3/HDFS. This is because, unlike columnar formats, standard JSON files lack the necessary structure and statistics for efficient pre-parse filtering.26 Optimization for raw JSON relies more heavily on minimizing the data sent to the parser (projection/dereference pushdown) and filtering efficiently immediately after parsing.
Dynamic filtering offers an indirect optimization route for JSON queries involving joins.28 If a table containing JSON data is joined with a smaller dimension table, and filters are applied to the dimension table, Trino can collect the join key values from the dimension table during the build phase. These values can then be broadcast to the workers scanning the JSON table and pushed down into the scan after the join build is complete. This allows pruning JSON rows that won't match the join condition, reducing processing in later stages. The effectiveness depends on connector support; the Hive connector, for instance, supports dynamic filter pushdown.28
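A sketch of the join shape that benefits from dynamic filtering (table and column names are illustrative):

```sql
-- dim_users is small and filtered; fact_events holds the JSON payloads.
SELECT f.payload
FROM hive.lake.fact_events f
JOIN hive.lake.dim_users d
  ON f.user_id = d.user_id
WHERE d.country = 'US';
-- During the join build, Trino collects the surviving d.user_id values
-- and pushes them as a dynamic filter into the scan of fact_events,
-- pruning rows before the expensive JSON processing stages.
```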
Therefore, while Trino employs multiple strategies, filtering efficiency for raw JSON files primarily hinges on reducing the scope of data before or immediately after parsing, rather than sophisticated predicate evaluation within the raw JSON scan itself.
Lazy parsing, the technique of deferring the parsing of data until it is explicitly required, can significantly improve performance by avoiding unnecessary work. While Trino doesn't heavily advertise a sophisticated, field-level lazy JSON parsing mechanism comparable to specialized libraries 29, its architecture incorporates principles that achieve similar effects:
- Projection and Dereference Pushdown as Implicit Laziness: By pushing down projections (selecting specific columns) and dereferences (accessing specific subfields), Trino effectively avoids parsing and materializing portions of JSON documents that are not needed for the query outcome.26 This acts as a form of coarse-grained laziness, ensuring only relevant data sections are processed by the parser.
- Pipelined Execution: Trino's execution model processes data in pipelined stages using pages (batches of rows).2 As soon as a page of JSON data is read, parsed, and filtered by one operator stage, it can be immediately streamed to the next stage in the query plan (e.g., for aggregation or join). This avoids waiting for entire files or splits to be processed before downstream work can begin, reducing overall query latency and enabling efficient, streamed row processing.
The efficiency gains observed when querying JSON in Trino are likely more attributable to the combination of its MPP architecture, aggressive projection/dereference pushdown, and pipelined execution model rather than an intricate lazy parsing strategy within the JSON decoding logic itself. The emphasis is on minimizing the data fed to the parser and processing the parsed results efficiently in a streaming manner.11 The lack of detailed documentation or prominent features around fine-grained lazy JSON parsing, coupled with performance discussions focusing on other aspects like double-parsing or architectural parallelism 9, supports this view.
Trino achieves data processing parallelism by dividing the data of a table into smaller, manageable chunks called splits.4 The coordinator node is responsible for identifying these splits and assigning them to available worker nodes for concurrent processing.2
For data stored in file systems like HDFS or object stores like S3, accessed through connectors such as Hive or Iceberg, splits typically correspond to individual files or byte ranges (blocks) within larger files.4 The specific logic for determining splits resides within the connector implementation.13 The Hive connector, for example, uses configuration properties like hive.max-split-size (target size for splits), hive.max-initial-split-size (size for initial splits from large files), and hive.max-initial-splits (limit on initially generated splits) to control how data is divided, influencing the degree of parallelism and scheduling behavior.31 Workers process their assigned splits concurrently, enabling parallel reading, parsing, and filtering of JSON data distributed across the data lake.4
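These knobs live in the Hive catalog configuration file. The values below are illustrative only, and availability of individual properties varies by Trino version:

```properties
# etc/catalog/hive.properties — illustrative values, not recommendations
hive.max-split-size=64MB
hive.max-initial-split-size=16MB
hive.max-initial-splits=200
```

Smaller initial splits increase early parallelism (useful for interactive latency), while larger steady-state splits reduce per-split scheduling overhead on big scans.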
A significant challenge arises when processing large JSON files in parallel using byte-offset based splitting. Standard JSON syntax, particularly for files containing a single large object or array spanning multiple lines, is not inherently designed to be split at arbitrary byte boundaries.32 If a split starts or ends in the middle of a JSON string, number, or structural element, naive parsing from the split's start offset will fail.
Connectors reading such formats must implement strategies to handle these boundaries correctly. While specific implementation details for Trino's JSON readers are not fully exposed in the provided materials, common approaches include:
- Whole File Splitting: For formats known to be non-splittable (like gzipped files or potentially standard multi-line JSON), the connector might treat entire files as single splits. This ensures valid parsing but limits parallelism for very large individual files. However, the existence of split-size configurations 31 suggests Trino can split large files, implying a more sophisticated mechanism.
- Record Boundary Detection: A reader assigned a byte-range split must align itself to record boundaries, which typically costs extra I/O. The conventional approach (used, for example, by Hadoop's line-based record readers) is to scan forward from the split start to the next record boundary, skipping any record that began in the preceding split, and to read past the split end so that the last record starting within the range is fully captured. This requires careful logic within the file reader.
- Using Splittable Formats (JSON Lines): The most robust solution is to use a format designed for splittability. JSON Lines (also known as newline-delimited JSON or NDJSON), where each line in the file is a complete, independent JSON object, is ideal for parallel processing.35 Since each line can be parsed independently, files can be split cleanly along newline boundaries without ambiguity. Trino's Hive connector supports SerDes like org.apache.hadoop.hive.serde2.JsonSerDe 22, which commonly handle the JSON Lines format. Trino's own logging can output in JSON Lines format 35, indicating engine awareness of the format.
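A table over JSON Lines data can be declared directly in Trino through the Hive connector. The catalog, schema, and bucket names below are hypothetical:

```sql
-- Underlying files contain one complete JSON object per line, e.g.:
--   {"id": 1, "status": "ACTIVE"}
--   {"id": 2, "status": "DELETED"}
-- Newline-delimited records let the connector split large files cleanly.
CREATE TABLE hive.lake.events_jsonl (
  id     BIGINT,
  status VARCHAR
)
WITH (
  external_location = 's3://my-bucket/events-jsonl/',
  format = 'JSON'
);
```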
The ability to efficiently parallelize the processing of large JSON files in Trino is therefore heavily influenced by the chosen file format. While Trino's connectors likely employ boundary-handling logic for standard JSON, using JSON Lines significantly simplifies parallel reading and parsing, making it the recommended format for large-scale JSON datasets in data lakes intended for processing with engines like Trino. Difficulties reported in forums regarding querying standard JSON from S3 might often stem from these inherent splitting and parsing complexities.32
Trino's performance characteristics are significantly shaped by its core design as an in-memory, distributed processing engine, contrasting with architectures that rely more heavily on intermediate disk I/O.
Trino is fundamentally designed as an in-memory MPP query engine.9 This means that during the execution of a query, intermediate data generated by one stage (e.g., filtering) is typically held in the memory of the worker nodes and streamed directly (pipelined) to the next stage (e.g., aggregation or join) without being written to disk by default.11 This approach minimizes the I/O overhead and latency associated with disk access during query processing. It contrasts sharply with execution models such as classic MapReduce, which persists intermediate data between stages, or engines that frequently spill intermediate data to disk during shuffles or when memory limits are reached.12
While Trino must initially read source data from storage systems like S3 or HDFS (an I/O operation), it employs several strategies to minimize this initial I/O and avoid subsequent I/O bottlenecks:
- Scan Optimization: Techniques like predicate pushdown 26, projection pushdown, and dereference pushdown 27 reduce the volume of data that needs to be read from the source storage in the first place. Dynamic filtering can further prune data scanned based on join conditions.28
- Avoiding Intermediate Disk I/O: The most significant I/O reduction comes from the in-memory, pipelined execution model.11 By keeping intermediate results in RAM and streaming them between workers and stages, Trino avoids the substantial latency penalties associated with writing and reading temporary data to/from local disks. This is particularly advantageous for interactive queries where low response times are critical.3
Processing JSON data within this in-memory framework presents specific memory management challenges:
- Variable and Unpredictable Size: Unlike binary columnar formats (e.g., Parquet) where column data types and compression allow for more predictable memory footprints, raw text formats like JSON can be highly variable in structure and size. Parsing JSON strings into Trino's internal in-memory representations (using types like VARCHAR, BIGINT, DOUBLE, ARRAY, MAP, ROW, or the native JSON type 18) can lead to less predictable memory consumption per row or per page.
- Trino's Memory Management: Trino manages memory allocation across the cluster using configured memory pools. Administrators can set limits on memory usage per query per node (e.g., query.max-memory-per-node) and total memory per node (query.max-total-memory-per-node) through configuration files.39 Resource groups provide a more granular mechanism for controlling concurrency and memory usage based on query characteristics or user groups.39 The simple Memory connector also demonstrates node-level memory awareness with its memory.max-data-per-node setting.40 Internally, Trino uses efficient data structures like Block and Page to manage data batches in memory.13
- JVM Garbage Collection: As Trino runs on the Java Virtual Machine (JVM) 42, its memory management is subject to Java's garbage collection (GC) processes. High memory usage, common when processing large datasets in memory, can lead to increased GC activity and potential pauses, impacting performance. Careful tuning of JVM options (heap size, GC algorithm) is often necessary for optimizing Trino deployments, especially under heavy load or when dealing with memory-intensive formats like JSON.1
The reliance on in-memory processing, while beneficial for speed, makes Trino susceptible to OutOfMemory (OOM) errors if a query's memory requirements exceed the configured limits or available resources.1 This risk can be elevated when processing large, complex, or unpredictably sized JSON datasets, or during memory-intensive operations like joins involving large tables or high-cardinality aggregations. Unlike engines that can spill excess data to disk, Trino (in its default execution mode) will typically fail the query if it runs out of memory. This underscores the importance of adequate cluster provisioning, careful memory configuration tuning, and potentially query rewriting to manage memory consumption effectively when working with large JSON workloads in Trino.12 The Memory connector documentation explicitly warns that data stored in memory is lost if a worker fails, highlighting the inherent volatility of purely in-memory approaches without fault tolerance mechanisms.40
The architectural differences between Trino and engines like Apache Spark, which readily spill intermediate data to disk, lead to distinct trade-offs:
- Trino (Default Mode):
  - Pros: Optimized for low-latency, interactive SQL queries due to in-memory pipelined execution; potentially lower resource overhead as it avoids disk I/O for intermediate data.12
  - Cons: Higher risk of OOM errors for memory-intensive queries; historically lacked mid-query fault tolerance (the query fails if a worker dies mid-execution).12
- Spark:
  - Pros: More resilient to memory pressure by spilling intermediate data to disk, allowing larger-than-memory jobs to complete; robust fault tolerance mechanisms built-in (recomputes lost partitions).12 Stronger capabilities for complex, programmatic transformations beyond SQL (using Scala, Python, R).38
  - Cons: Higher latency for interactive queries due to potential disk spilling and greater scheduling overhead (lazy evaluation, stage-based execution).12 Disk spilling significantly impacts performance.12
This creates a classic trade-off: Trino traditionally favored speed for interactive analytics, while Spark prioritized resilience and scale for large batch ETL/ML workloads.12 However, this distinction is becoming less rigid. Trino has introduced a fault-tolerant execution (FTE) mode, designed to provide resilience against worker failures and potentially handle queries exceeding available memory by leveraging techniques like intermediate data buffering (potentially to shared storage, differing from Spark's typical local disk spill).38 Concurrently, Spark continues to enhance its performance for interactive SQL workloads (e.g., Spark SQL, Photon engine in Databricks). Benchmarks comparing modern versions can show surprising results; for instance, one comparison using TPC-DS showed Hive 4 on MR3 (an engine capable of spilling) achieving slightly better total execution time than Trino 453, although Trino maintained a better average query response time.43 This highlights that performance is highly dependent on the specific workload, data characteristics, and cluster configuration. The choice between Trino and Spark increasingly depends on the specific mix of interactive vs. batch workloads, ecosystem integration needs (e.g., dbt support 38), tolerance for configuration complexity, and operational familiarity, rather than a simple "in-memory vs. disk-spilling" dichotomy.
Evaluating Trino's JSON processing capabilities requires comparison with other engines commonly used for querying data lakes, particularly Apache Drill and ClickHouse, each with distinct architectural philosophies.
- Trino: As established, an MPP distributed SQL query engine focused on interactive analytics and data federation across diverse sources, including data lakes.3 It separates compute and storage, relying on connectors.
- Apache Drill: A schema-free distributed SQL query engine designed explicitly for querying complex, semi-structured, and evolving data formats like JSON and Parquet, alongside NoSQL databases (MongoDB, HBase) and file systems (HDFS, S3).6 It emphasizes flexibility and runtime schema discovery.45 It can query data on S3.46
- ClickHouse: A high-performance, open-source columnar database management system optimized for Online Analytical Processing (OLAP) and real-time analytics.48 While primarily a database, it can query external data stored in formats like JSON on systems like S3 using specialized table functions (s3, s3Cluster) or the S3 table engine.36 Its core strength lies in extremely fast aggregation and filtering performance on columnar data.
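For concreteness, querying JSON Lines on S3 from ClickHouse with the `s3` table function might look like the following (bucket, paths, and columns are placeholders):

```sql
-- ClickHouse: explicit structure is supplied inline with the function call.
SELECT status, count() AS n
FROM s3(
  'https://my-bucket.s3.amazonaws.com/events/*.jsonl',
  'JSONEachRow',
  'id UInt64, status String'
)
GROUP BY status;
```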
The engines differ significantly in how they handle schema discovery for JSON data:
- Trino: Implements schema-on-read primarily through its connector architecture. For Hive-compatible tables (including those storing JSON), schema is typically defined in a metastore (like HMS or AWS Glue).5 Trino reads this metadata to understand the table structure. For JSON stored as VARCHAR, the SQL/JSON functions (json_exists, json_query, json_value) provide runtime flexibility for extracting data based on JSON paths.16 The native JSON type allows defining columns that hold arbitrary JSON, offering a more structured approach within Trino itself.18
- Drill: Excels at schema-on-read; it's a core design principle.44 Drill discovers the schema of JSON data dynamically at query time, without requiring predefined metadata. It can handle variations in structure and data types between different JSON documents within the same dataset.44 To manage potential type inconsistencies (e.g., a field being a number in one document and a string in another, or numbers appearing with and without decimals), Drill offers session settings like store.json.read_numbers_as_double (treat all numbers as DOUBLE) or store.json.all_text_mode (treat everything as VARCHAR).44 Users employ the USE schema_name command to set the context for queries.52
- ClickHouse: Traditionally operates with a schema-on-write model for its internal tables. When querying external JSON files on S3 using the s3 table function or S3 engine, the table structure (schema) must typically be explicitly defined in the CREATE TABLE statement or within the function call itself using the structure parameter.49 While it handles the JSONEachRow format well 36, this explicit definition requirement offers less flexibility than Drill or Trino's metastore approach for evolving schemas. ClickHouse's native JSON type (marked production-ready as of v25.3) provides a more dynamic option, automatically creating subcolumns for discovered paths.21 However, for optimal performance, it benefits from configuration hints (e.g., max_dynamic_paths, specifying types for known paths) to avoid storing excessive paths in a less efficient generic structure.21
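Drill's session-level escape hatches for type inconsistency, mentioned above, are set with `ALTER SESSION`. A sketch with a hypothetical `dfs.lake` workspace and illustrative field names:

```sql
-- Coerce inconsistent JSON types at read time.
ALTER SESSION SET `store.json.all_text_mode` = true;
ALTER SESSION SET `store.json.read_numbers_as_double` = true;

USE dfs.lake;
-- Drill requires a table alias to disambiguate nested column references.
SELECT t.payload.userId
FROM `events.json` t;
```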
This comparison reveals a spectrum: Drill offers maximum runtime flexibility for highly variable or unknown JSON schemas. Trino provides a balance, leveraging metastores for known structures while offering powerful SQL functions for runtime interpretation of string-based JSON. ClickHouse generally requires more upfront schema knowledge for external data, reflecting its database origins and focus on predictable performance, although its native JSON type adds significant dynamic capabilities.
The effectiveness of filtering and parallel processing also varies:
- Trino: Possesses robust predicate pushdown capabilities, including the highly effective dereference pushdown for nested structures.27 Dynamic filtering optimizes join performance.28 Parallel reading is achieved through splitting data based on connector logic (e.g., Hive connector split sizing 31), but optimal performance for large files relies heavily on using splittable formats like JSON Lines.
- Drill: Also implements predicate pushdown. Its native understanding of JSON structure during the scan phase could potentially allow for more effective filtering deep within complex JSON documents compared to engines that must first parse the entire object into a generic representation. Drill supports querying S3 files.46 However, documentation suggests manually splitting very large, single JSON objects (not JSON Lines) for better performance 51, hinting at potential challenges with automatically handling split boundaries for non-line-delimited complex JSON.
- ClickHouse: Predicate pushdown capabilities for external data sources queried via functions like s3 appear less sophisticated than for internal tables. Filtering often relies on path-based wildcards (*, {}) provided in the S3 path specification, which prunes files before reading.49 More complex filtering based on JSON content likely occurs within ClickHouse after the relevant data has been read from S3. Parallelism can be achieved by using the s3Cluster function, which distributes the S3 file processing across nodes in a ClickHouse cluster.49 The S3 table engine might offer tighter integration but comes with limitations like lack of support for ALTER, SAMPLE, or indexes.49 ClickHouse's strength lies in its incredibly fast internal processing engine once data is loaded into memory.
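The two ClickHouse mechanisms named above, path-based pruning and `s3Cluster` parallelism, can be combined in one call (cluster name, bucket, and schema are placeholders):

```sql
-- Brace/wildcard expansion prunes objects before any data is read;
-- 'analytics' is a hypothetical cluster over which the scan is distributed.
SELECT count()
FROM s3Cluster(
  'analytics',
  'https://my-bucket.s3.amazonaws.com/events/{2023,2024}/*.jsonl',
  'JSONEachRow',
  'id UInt64, status String'
);
```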
In essence, Trino provides strong general-purpose pushdown with specific optimizations for nested data. Drill's schema-aware JSON reader may offer advantages for filtering within variable JSON structures during the scan. ClickHouse likely relies more on basic file/path pruning for external S3 JSON, compensating with its exceptional post-scan processing speed.
Handling complex JSON poses challenges for all engines:
- Trino: Manages nesting using JSON path navigation within SQL/JSON functions 16 and benefits from dereference pushdown for optimized access.27 Extremely deep nesting or highly variable structures can still lead to performance degradation due to increased parsing complexity and potentially intricate query plans. The native JSON type offers an alternative representation. Issues like the native Hive reader's double-parsing for duplicate key compatibility 23 and potential errors when encountering invalid JSON 24 are relevant considerations.
- Drill: Is explicitly designed to handle complex, nested, and variable JSON structures effectively.44 It provides intuitive dot and array notation for navigating nested elements.44 Functions like flatten() (to unnest arrays) and kvgen() (to extract key-value pairs) are provided to simplify querying complex objects.51 An experimental "Union Type" feature (exec.enable_union_type) aims to directly support fields containing heterogeneous data types.44 Despite this focus, Drill documentation notes potential difficulties managing gigabyte-sized single JSON objects, recommending pre-splitting.51
- ClickHouse: Standard JSONExtract* functions can become cumbersome for very deep or complex paths.56 The native JSON type is the intended solution 21, storing paths in a flattened manner and allowing dynamic typing. However, it imposes performance-related limits on the number of distinct dynamic paths (max_dynamic_paths) and distinct data types per path (max_dynamic_types) within a single data part (the unit of storage in MergeTree tables).21 Exceeding these limits causes remaining paths/types to be stored in a less optimized generic subcolumn, potentially degrading query performance.21 For extremely variable schemas with many distinct keys, alternative approaches like Uber's pairwise array storage pattern (storing keys and values in separate parallel arrays) have been explored.54
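Drill's `flatten()` and `kvgen()` helpers, referenced above, compose as in this hedged sketch (file, workspace, and field names are illustrative):

```sql
-- Unnest an array of line items into one row per element.
SELECT f.item.sku AS sku
FROM (
  SELECT flatten(t.items) AS item
  FROM dfs.lake.`orders.json` t
) f;

-- Turn an object with variable keys into rows of {key, value} maps.
SELECT flatten(kvgen(t.attributes)) AS kv
FROM dfs.lake.`orders.json` t;
```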
The underlying architectures lead to different resource utilization patterns:
- Trino: Primarily relies on in-memory processing, minimizing intermediate disk I/O for low latency but increasing susceptibility to OOM errors if memory limits are exceeded.11 Its optional fault-tolerant execution mode adds resilience.38
- Drill: Also appears to focus on in-memory execution, leveraging an efficient in-memory JSON data model.45 Documentation is less explicit about its disk-spilling capabilities compared to Trino or Spark, but it likely faces similar memory pressures for large, complex operations.
- ClickHouse: As a full-fledged columnar database, ClickHouse has highly optimized mechanisms for managing both memory and disk I/O, including efficient compression and data skipping indexes for its internal tables. When querying external S3 data, it reads the required data into its own execution engine, where it can leverage these robust resource management capabilities. While still consuming memory, it is likely better equipped to handle situations where intermediate data exceeds available RAM by utilizing disk more effectively than purely in-memory engines, potentially at the cost of some latency compared to a fully in-memory operation.
This reinforces the trade-off: Trino and likely Drill prioritize in-memory speed, potentially sacrificing robustness under extreme memory pressure (unless FTE is used in Trino). ClickHouse offers the resource management capabilities of a mature database, potentially handling memory constraints more gracefully by leveraging disk, but its external data access might involve different latency characteristics.
The following table summarizes the key characteristics of Trino, Apache Drill, and ClickHouse concerning JSON workloads in data lakes:
Feature/Aspect | Trino | Apache Drill | ClickHouse |
---|---|---|---|
Engine Architecture | MPP, Coordinator/Worker, Compute/Storage Separation | MPP, Schema-Free, Distributed | Columnar Database (MPP capable), Integrated Storage/Compute |
Primary Use Case | Interactive SQL Analytics, Data Federation | Flexible SQL on NoSQL/Files/Lakes, Schema Evolution | OLAP, Real-time Analytics, Fast Aggregations |
Schema-on-Read (JSON) | Metastore (Hive/Iceberg) + SQL/JSON functions + Native JSON type | Dynamic Runtime Discovery, Handles Variability Natively | Explicit Schema (s3 func/engine) / Dynamic Native JSON type (with limits) |
Predicate Pushdown (JSON) | Strong (esp. Dereference Pushdown), Dynamic Filtering | Schema-Aware JSON Reader (potential for deep filtering) | Limited External (Path Wildcards), Relies on Fast Internal Filtering |
Split Processing (JSON) | Connector-based (e.g., Hive), Relies on Splittable Formats (JSON Lines) | Format/Size Dependent (Manual Split Recommended for Large Single Objects) | s3Cluster for Parallelism, S3 Engine Integration |
Nested/Variable JSON Handling | Good (SQL/JSON functions, Dereference), Native JSON type | Excellent (Native Design, flatten/kvgen), Experimental Union Type | Good (Native JSON type, limits apply), JSONExtract* functions |
Memory Model | Primarily In-Memory, Pipelined | Primarily In-Memory Focus | Efficient Memory + Disk Management (Database Architecture) |
I/O Pattern | Minimize Intermediate Disk I/O | Minimize Intermediate Disk I/O | Read External + Efficient Internal Processing (Mem+Disk) |
Fault Tolerance (Default) | Query Fails (FTE Optional) | Less Explicitly Documented (Likely Query Fails) | Database-level Resilience (Replication, etc.) |
Modern CPUs include Single Instruction, Multiple Data (SIMD) instruction sets (like SSE, AVX on x86; NEON on ARM) that allow a single instruction to perform the same operation on multiple data elements in parallel. This capability holds significant potential for accelerating computationally intensive tasks, including aspects of JSON parsing.
Parsing JSON involves several repetitive, low-level operations that are amenable to SIMD acceleration:
- Lexical Analysis Speedup: SIMD instructions can process multiple bytes or characters simultaneously, dramatically speeding up tasks such as identifying structural characters (braces, brackets, commas, and colons), finding the quotes that delimit strings, locating escape characters (backslashes), identifying whitespace for skipping, and validating character sequences (e.g., ensuring only valid characters appear outside strings).8
- Increased Throughput: Libraries like simdjson demonstrate that leveraging SIMD can push JSON parsing throughput into the gigabytes-per-second range on a single CPU core, significantly outperforming traditional scalar parsers.8
- Instruction Reduction & Efficiency: By processing data in parallel, SIMD reduces the total number of instructions executed compared to scalar code, which can lead to lower CPU utilization and potentially reduced power consumption for the parsing task.8
- Specialized Libraries: The development of libraries like simdjson provides readily available, highly optimized implementations that perform full JSON validation according to standards while achieving state-of-the-art speeds.8 simdjson employs techniques like building an indexed view directly over the raw input bytes, avoiding the overhead of creating intermediate data structures.58
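To build intuition for simdjson's indexed-view approach, the toy sketch below mimics its first stage (structural indexing) with NumPy's vectorized byte operations standing in for SIMD lanes: every byte of the input is classified in a data-parallel pass, with no per-byte branching. This is a deliberate simplification of the real algorithm (which handles escaped-quote runs and UTF-8 validation far more carefully); the function name is illustrative.

```python
import numpy as np

def structural_index(buf: bytes) -> list[int]:
    """Return positions of JSON structural characters in buf, computed
    with whole-buffer (SIMD-like) comparisons rather than a scalar loop."""
    data = np.frombuffer(buf, dtype=np.uint8)
    # Build one boolean mask per structural character and OR them together.
    mask = np.zeros(len(data), dtype=bool)
    for ch in b'{}[],:':
        mask |= data == ch
    # Suppress structural characters inside string literals: a byte is
    # "in a string" when an odd number of unescaped quotes precede it.
    quotes = data == ord('"')
    escaped = np.zeros(len(data), dtype=bool)
    escaped[1:] = data[:-1] == ord('\\')
    in_string = (np.cumsum(quotes & ~escaped) % 2) == 1
    mask &= ~in_string
    return np.flatnonzero(mask).tolist()
```

A later stage would walk this index to materialize (or lazily navigate) the document, which is how simdjson avoids building an intermediate parse tree over the raw bytes.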
Despite the proven benefits of SIMD for JSON parsing demonstrated by libraries like simdjson, there is no clear evidence in the reviewed documentation or public discussions (like GitHub issues) that Trino's core JSON parsing mechanisms currently utilize SIMD instructions for acceleration.23 Performance discussions around Trino's JSON handling typically focus on higher-level architectural aspects (MPP, memory management), query optimization (pushdowns), or specific reader/connector behaviors (like the Hive native reader's compatibility issues) rather than low-level SIMD parsing techniques.
Several factors might explain the apparent absence of SIMD optimization for JSON parsing within Trino:
- Implementation Complexity: Effectively utilizing SIMD often requires writing platform-specific intrinsic functions or assembly code, which adds complexity to development and maintenance.
- Java Ecosystem: Trino is primarily written in Java.2 While modern Java versions are improving support for native code and vector operations (notably through Project Panama and its incubating Vector API), historically, achieving peak SIMD performance directly within Java or through JNI (Java Native Interface) wrappers has been more challenging and potentially less performant than native C/C++ implementations.
- Architectural Focus: Trino's performance strategy heavily emphasizes its distributed MPP architecture, efficient data shuffling (or lack thereof via pipelining), connector-based optimizations like pushdowns, and memory management.9 These higher-level optimizations might provide sufficient performance gains for many typical data lake query workloads, reducing the perceived need for micro-optimizing the parsing step itself with SIMD.
- Bottleneck Location: In a distributed query involving network I/O, data shuffling, joins, and aggregations, the raw JSON parsing step might not always be the primary performance bottleneck.62 Optimizing parsing with SIMD would yield limited overall query improvement if other stages dominate the execution time.
The provided information does not explicitly confirm or deny the use of SIMD for JSON parsing in Apache Drill. For ClickHouse, the picture is somewhat clearer: it is written in C++ with a strong focus on leveraging hardware capabilities, and the simdjson project itself lists ClickHouse among its users, so SIMD-accelerated JSON processing is very likely present in parts of its engine. Drill's status regarding SIMD usage remains unclear from the available context.
While powerful, SIMD is not a universal solution, and its application to JSON parsing faces inherent limitations:
- Variable and Nested Structure: JSON's flexible, nested structure contrasts with the uniform, flat data arrays where SIMD typically excels. Parsing JSON requires handling variable-length strings, nested objects/arrays, and conditional logic (e.g., distinguishing keys from values, numbers from strings), which often necessitates branching. Branching within SIMD code paths can significantly reduce efficiency, as different lanes in a vector might need to take different paths.8
- Number Parsing Bottleneck: Accurately parsing decimal representations of floating-point numbers into binary formats (like double) is computationally intensive and can become a bottleneck, even with SIMD acceleration, especially if the JSON data contains a high density of such numbers.60 Parsing decimal integers is generally much faster.60
- String Handling Complexity: Processing strings containing escape sequences (\n, \", \uXXXX) requires scalar logic, interrupting purely parallel SIMD processing.
- State Management: Maintaining the parser's state (e.g., current nesting level, expecting a key vs. a value) across parallel SIMD lanes adds complexity to the implementation.
The effectiveness of SIMD for JSON parsing depends heavily on the specific characteristics of the JSON data and the sophistication of the parsing algorithm in minimizing branching and handling complex cases efficiently.
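The number-parsing cost noted above is easy to observe even with a purely scalar parser. The micro-benchmark below (Python's stdlib json module, purely illustrative; absolute timings vary by machine and interpreter) compares an integer-heavy and a float-heavy document of equal element count. It does not measure SIMD code, but it shows why number density matters independently of lexical-analysis speed.

```python
import json
import timeit

def parse_cost(payload: str, number: int = 20) -> float:
    """Best-of-three wall-clock cost of a single json.loads() on payload."""
    runs = timeit.repeat(lambda: json.loads(payload), number=number, repeat=3)
    return min(runs) / number

# Two arrays with the same element count: one of decimal integers, and one
# of decimal floats whose conversion to binary doubles is the costly step.
int_doc = json.dumps(list(range(20_000)))
float_doc = json.dumps([i / 7 for i in range(20_000)])

int_cost = parse_cost(int_doc)
float_cost = parse_cost(float_doc)
# On typical CPython builds the float-heavy document parses noticeably
# slower, mirroring the bottleneck reported for SIMD parsers as well.
```

The same effect is reported for simdjson: structural scanning reaches gigabytes per second, but dense decimal floats pull throughput back toward the cost of number conversion.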
Considering these factors, while SIMD offers substantial potential for accelerating the raw parsing component, its apparent absence in Trino's JSON processing suggests that Trino's overall performance strategy relies more on its distributed architecture, memory management, and higher-level query optimizations. The integration of SIMD would require significant engineering effort, potentially involving native code integration, and its benefits would need to be carefully weighed against whether parsing is the dominant bottleneck in real-world distributed JSON query scenarios executed by Trino.
Trino presents a robust and high-performance engine for querying JSON data stored within data lakes on platforms like S3 and HDFS. Its core strengths lie in its Massively Parallel Processing (MPP) architecture, which distributes work across multiple workers, and its in-memory, pipelined execution model, which minimizes intermediate I/O latency, making it well-suited for interactive exploration and analysis.9 The flexible connector architecture allows seamless access to various storage systems and metastores (like Hive and Iceberg) 6, while standard SQL/JSON functions and a native JSON type provide powerful tools for navigating and extracting data from complex JSON structures.16 Furthermore, optimizations like dereference pushdown significantly accelerate queries accessing nested JSON fields by minimizing data read and processed.27
However, achieving optimal performance with JSON in Trino requires awareness of certain challenges and trade-offs. The native Hive JSON reader, while aiming for compatibility, may exhibit lower performance than traditional Hive SerDes due to internal implementation constraints related to handling duplicate keys.23 Trino's reliance on in-memory processing makes it sensitive to memory limits, potentially leading to query failures (OOM errors) for very large or complex JSON workloads if resources are not adequately provisioned or queries are not carefully tuned.1 Efficient parallel processing of large JSON files is also highly dependent on the file format; JSON Lines (NDJSON) is strongly preferred over standard multi-line JSON objects/arrays due to its inherent splittability, which aligns well with Trino's split-based parallelism.22 Lastly, unlike some specialized libraries, Trino does not appear to leverage low-level SIMD optimizations for accelerating the raw JSON parsing step, relying instead on its broader architectural strengths.
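The splittability property that makes JSON Lines preferable can be demonstrated directly. The sketch below (illustrative Python, not Trino's actual reader) applies the usual split convention for newline-delimited formats: a worker owns every record whose first byte falls inside its assigned byte range, skips the partial record at the front of the range (the previous split finishes it), and reads past the end of the range to complete its last record. Standard multi-line JSON offers no such safe split points, which is why it forces single-threaded reads.

```python
import json

def read_split(buf: bytes, start: int, end: int) -> list:
    """Parse the JSON Lines records that begin inside [start, end)."""
    records = []
    pos = start
    # Unless the split begins at the file start or right after a newline,
    # we are mid-record: skip ahead; the previous split owns that record.
    if start > 0 and buf[start - 1:start] != b'\n':
        nl = buf.find(b'\n', start)
        if nl == -1:
            return records
        pos = nl + 1
    while pos < end:
        nl = buf.find(b'\n', pos)
        line = buf[pos:nl] if nl != -1 else buf[pos:]
        if line.strip():
            records.append(json.loads(line))
        pos = (nl + 1) if nl != -1 else len(buf)
    return records
```

Because every record is owned by exactly one split, workers can process arbitrary byte ranges of the same file in parallel and their outputs union to the full dataset with no coordination beyond assigning the ranges.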
Compared to Apache Drill and ClickHouse, Trino occupies a compelling middle ground for many data lake use cases involving JSON. Drill offers superior flexibility for handling highly variable or unknown schemas at runtime 44, while ClickHouse provides the raw speed and robust resource management of a columnar database, particularly excelling at aggregations, but generally requires more explicit schema definition for external data.21 Trino shines in scenarios requiring interactive SQL analytics across federated sources, combining good performance with reasonable flexibility for JSON data, especially when leveraging metastores and appropriate file formats.
In conclusion, Trino is a capable and performant engine for querying JSON in data lakes. Its effectiveness hinges on leveraging its distributed execution model, optimizing data access through pushdowns, ensuring sufficient memory resources, and crucially, utilizing splittable JSON formats like JSON Lines for large datasets. Understanding these architectural nuances and operational considerations is key to successfully deploying Trino for demanding JSON analytics workloads.
- Shopify's Path to a Faster Trino Query Execution: Infrastructure, accessed May 5, 2025, https://shopify.engineering/faster-trino-query-execution-infrastructure
- Trino (SQL query engine) - Wikipedia, accessed May 5, 2025, https://en.wikipedia.org/wiki/Trino_(SQL_query_engine)
- What is Trino? - Starburst, accessed May 5, 2025, https://www.starburst.io/blog/what-is-trino/
- ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines - Onehouse, accessed May 5, 2025, https://www.onehouse.ai/blog/apache-spark-vs-clickhouse-vs-presto-vs-starrocks-vs-trino-comparing-analytics-engines
- Hive connector — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/connector/hive.html
- What Is Trino And Why Is It Great At Processing Big Data - Seattle Data Guy, accessed May 5, 2025, https://www.theseattledataguy.com/what-is-trino-and-why-is-it-great-at-processing-big-data/
- Advanced analytics with SQL and Trino, accessed May 5, 2025, https://trinodb.github.io/presentations/presentations/sql-adv-analytics/index.html
- Parsing Gigabytes of JSON per Second - arXiv, accessed May 5, 2025, https://arxiv.org/html/1902.08318v7
- Understanding Trino and Presto: Core Features Explained - CelerData, accessed May 5, 2025, https://celerdata.com/glossary/understanding-trino-and-presto
- Starburst | Trino-based architecture - Starburst documentation, accessed May 5, 2025, https://docs.starburst.io/introduction/architecture.html
- Introducing the Trino spooling protocol - Starburst, accessed May 5, 2025, https://www.starburst.io/blog/trino-spooling-protocol/
- Why is Trino is faster than Spark? - Starburst, accessed May 5, 2025, https://www.starburst.io/community/forum/t/why-is-trino-is-faster-than-spark/38
- Connectors — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/develop/connectors.html
- A gentle introduction to the Hive connector - Trino, accessed May 5, 2025, https://trino.io/blog/2020/10/20/intro-to-hive-connector.html
- Iceberg connector — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/connector/iceberg.html
- JSON functions and operators — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/functions/json.html
- Release 383 (1 Jun 2022) — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/release/release-383.html
- Data types — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/language/types.html
- How to get in presto/trino as JSON object columns from mongo? - Stack Overflow, accessed May 5, 2025, https://stackoverflow.com/questions/74834458/how-to-get-in-presto-trino-as-json-object-columns-from-mongo
- Unnest elements of each object in Trino/Presto - Stack Overflow, accessed May 5, 2025, https://stackoverflow.com/questions/77645907/unnest-elements-of-each-object-in-trino-presto
- JSON Data Type | ClickHouse Docs, accessed May 5, 2025, https://clickhouse.com/docs/sql-reference/data-types/newjson
- Object storage file formats — Trino 435 Documentation, accessed May 5, 2025, https://ta.thinkingdata.cn/trino-docs/connector/object-storage-file-formats.html
- Performance degradation with Hive HCat native implementation ..., accessed May 5, 2025, trinodb/trino#19396
- Hive JSON and OpenX JSON native readers data corruption · Issue #22278 · trinodb/trino, accessed May 5, 2025, trinodb/trino#22278
- Release 0.78 — Trino 474 Documentation, accessed May 5, 2025, https://trino.io/docs/current/release/release-0.78.html
- Pushdown — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/optimizer/pushdown.html
- Faster Queries on Nested Data - Trino, accessed May 5, 2025, https://trino.io/blog/2020/08/14/dereference-pushdown.html
- Dynamic filtering — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/admin/dynamic-filtering.html
- ilikechooros/lazyjson: Lazy JSON parser for embedded systems.… - PlatformIO Registry, accessed May 5, 2025, https://registry.platformio.org/libraries/ilikechooros/lazyjson
- HDFS vs S3: Amazon EMR, Spark, Trino, and open data lakehouse - Starburst, accessed May 5, 2025, https://www.starburst.io/blog/hdfs-vs-s3/
- Number of splits using the Hive connector - Showcase - Starburst forum, accessed May 5, 2025, https://www.starburst.io/community/forum/t/number-of-splits-using-the-hive-connector/363
- Trino S3 partitions folder structure - Stack Overflow, accessed May 5, 2025, https://stackoverflow.com/questions/75507496/trino-s3-partitions-folder-structure
- Any hack pulling json files from s3 into trino via hive connector : r/dataengineering - Reddit, accessed May 5, 2025, https://www.reddit.com/r/dataengineering/comments/1fnf4ar/any_hack_pulling_json_files_from_s3_into_trino/
- Trino not able to create table from JSON file - sql - Stack Overflow, accessed May 5, 2025, https://stackoverflow.com/questions/75338550/trino-not-able-to-create-table-from-json-file
- Logging — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/admin/logging.html
- Working with JSON | ClickHouse Docs, accessed May 5, 2025, https://clickhouse.com/docs/en/integrations/data-formats/json/loading
- Apache Spark Alternatives: 7 Powerful Competitors (2025), accessed May 5, 2025, https://www.chaosgenius.io/blog/apache-spark-alternatives/
- Why use Spark at all? : r/dataengineering - Reddit, accessed May 5, 2025, https://www.reddit.com/r/dataengineering/comments/y8o1sy/why_use_spark_at_all/
- Resource groups — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/admin/resource-groups.html
- Memory connector — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/connector/memory.html
- all classes - trino-spi 457 javadoc, accessed May 5, 2025, https://javadoc.io/doc/io.trino/trino-spi/457/allclasses-index.html
- Deploying Trino — Trino 475 Documentation, accessed May 5, 2025, https://trino.io/docs/current/installation/deployment.html
- Performance Evaluation of Hive 4 on MR3 and Trino using the TPC-DS Benchmark, accessed May 5, 2025, https://www.datamonad.com/post/2024-08-01-hive-4.0-performance-1.11/
- JSON Data Model - Apache Drill, accessed May 5, 2025, https://drill.apache.org/docs/json-data-model/
- Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage, accessed May 5, 2025, https://drill.apache.org/
- S3 Storage Plugin - Apache Drill, accessed May 5, 2025, https://drill.apache.org/docs/s3-storage-plugin/
- Configuring Apache Drill for Querying AWS S3 - RightData, accessed May 5, 2025, https://getrightdata.zendesk.com/hc/en-us/articles/360033261511-Configuring-Apache-Drill-for-Querying-AWS-S3
- Data Lake | ClickHouse Docs, accessed May 5, 2025, https://clickhouse.com/docs/use-cases/data-lake
- Integrating S3 with ClickHouse, accessed May 5, 2025, https://clickhouse.com/docs/integrations/s3
- What's a Data Lake and What Does It Mean for My Open Source ClickHouse Stack? | Altinity, accessed May 5, 2025, https://altinity.com/wp-content/uploads/2025/01/Whats-a-Data-Lake-and-What-Does-It-Mean-for-My-Open-Source-ClickHouse-Stack-2025-01-22.pdf
- QuerySurge and Apache Drill - JSON Files - Customer Support, accessed May 5, 2025, https://querysurge.zendesk.com/hc/en-us/articles/115004668446-QuerySurge-and-Apache-Drill-JSON-Files
- USE - Apache Drill, accessed May 5, 2025, https://drill.apache.org/docs/use/
- Using SQL to Query JSON Files with Apache Drill - Rittman Mead, accessed May 5, 2025, https://www.rittmanmead.com/blog/2016/11/using-sql-to-query-json-files-with-apache-drill/
- How to store JSON in ClickHouse® the right way - Propel Data, accessed May 5, 2025, https://www.propeldata.com/blog/how-to-store-json-in-clickhouse-the-right-way
- Trino-Pinot-Connector throws error when a table with a JSON dimensional fields with invalid JSON is queried. #19663 - GitHub, accessed May 5, 2025, trinodb/trino#19663
- JSON Functions | ClickHouse Docs, accessed May 5, 2025, https://clickhouse.com/docs/sql-reference/functions/json-functions
- Demystifying JSON Data With ClickHouse | ChistaDATA Blog, accessed May 5, 2025, https://chistadata.com/ingesting-json-data-in-clickhouse/
- Fast JSON Processing in Real-time Systems: simdjson and Zero-Copy Design | Estuary, accessed May 5, 2025, https://estuary.dev/blog/fast-json-processing-simdjson/
- simdjson/simdjson: Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks - GitHub, accessed May 5, 2025, https://github.com/simdjson/simdjson
- Performance Notes - Simdjson, accessed May 5, 2025, https://simdjson.org/api/0.5.0/md_doc_performance.html
- JSON parsing performance has quite a lot of overhead · Issue #6334 · prestodb/presto - GitHub, accessed May 5, 2025, prestodb/presto#6334
- trino query performance · trinodb trino · Discussion #24360 - GitHub, accessed May 5, 2025, trinodb/trino#24360
- Handle fields that are not able to be inferred by returning the raw json · Issue #7308 · trinodb/trino - GitHub, accessed May 5, 2025, trinodb/trino#7308