Trino’s JSON architecture: Trino (formerly PrestoSQL) is a distributed MPP query engine whose workers scan data in parallel and pipeline results in memory. JSON in a data lake (e.g. files on S3 or HDFS) is typically handled via the Hive connector, which treats JSON files as line-oriented text. Each JSON object is expected to be one record – conventionally one object per line (NDJSON). Trino splits large JSON files into byte-range segments for parallel reading, and readers align on record boundaries (newlines) so that no JSON object is cut in half between workers. This ensures each split yields whole JSON records for valid parsing. Internally, Trino uses a LinePageSource to read text files and locate record boundaries (newline positions), so each worker thread reads a chunk of the file and emits complete JSON rows. If a record extends past the end of a split, the reader that started the record continues past the split boundary until the end-of-record delimiter (the newline), while the next reader skips forward to the first complete record in its range – guaranteeing correctness across split boundaries. The result is that JSON files can be scanned in parallel by many workers, each processing a different byte range of the file, which greatly increases throughput in a data lake environment.
Hive SerDe vs native JSON reader: In the past, Trino relied on Hive’s JSON SerDes (e.g. the HCatalog `JsonSerDe` or the OpenX JSON SerDe) to parse JSON text into columns. Over the past couple of years, the Trino project introduced a native JSON reading implementation for the Hive connector, inspired by the OpenX SerDe logic but optimized for Trino’s execution model. This native reader (called `OpenXJsonDeserializer` in Trino’s code) can handle complex nested JSON (maps, arrays, structs) and match it to Trino column types without invoking Hive libraries at runtime. The native path brought performance improvements but initially had some quirks – for example, it had to maintain Hive’s behavior for duplicate JSON keys. As Trino co-creator Dain Sundstrom noted, “to maintain compatibility with Hive’s duplicate JSON key behavior, we [ended] up having to parse messages twice” in the new implementation. In other words, early versions of Trino’s native JSON reader were doing redundant work (double-parsing each JSON record) to ensure that if a JSON object had duplicate fields, the result would match Hive (which takes the last occurrence of the key). This caused a performance regression: a benchmark at 1 TB (gzip-compressed JSON) showed the new “native” JSON reader took ~18 s vs ~8 s with the old Hive SerDe. The Trino team addressed this in late 2023 by enhancing the engine’s data structures. They added a `resetTo` API on the internal `BlockBuilder` (which accumulates column values in memory) and updated the JSON reader to use it. This change lets Trino “un-write” or roll back partially parsed data when a duplicate key is encountered, instead of starting over. With these fixes (released in Trino 440), the JSON reader avoids double parsing while preserving Hive semantics efficiently. The Trino 440 release notes explicitly highlight “Improve performance when reading JSON files”, and the issue tracking this improvement was closed as fixed in PR #19816.
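To make the setup concrete, here is a minimal sketch of declaring an external table over NDJSON files so the native reader handles parsing at scan time. The catalog, schema, and column names are hypothetical; the `OPENX_JSON` format option is documented for recent Hive connector releases, but verify it against your Trino version:

```sql
-- Hypothetical external table over NDJSON files; each line is one JSON object.
-- The native OpenX JSON reader maps JSON keys to these declared columns.
CREATE TABLE hive.logs.events (
    event_id  BIGINT,
    user_name VARCHAR,
    tags      ARRAY(VARCHAR),
    payload   ROW(ip VARCHAR, latency_ms DOUBLE)
)
WITH (
    format = 'OPENX_JSON',
    external_location = 's3://example-bucket/events/'
);
```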
Predicate pushdown and lazy parsing: Trino’s execution engine applies pushdown optimizations even for JSON data. At query planning time, Trino knows which columns (JSON fields) and which rows (via filters) are needed. The native JSON reader uses this to avoid unnecessary work. For example, if a query only selects a few JSON fields, the reader will lazily parse only those fields from each JSON object, skipping over the others. It does this with a streaming JSON parser that reads the text byte by byte and extracts values for the requested keys, ignoring irrelevant keys. For a large nested JSON object, Trino does not build a full in-memory DOM; instead, it streams through the JSON and populates each output column’s value as it is encountered. This streamed row processing means Trino can stop parsing a given record early once all needed data has been found. If a `WHERE` clause filter is on a JSON field that appears early in the object, Trino can evaluate the predicate during parsing and skip the rest of that object if the predicate fails, effectively pushing the filter down into the parsing stage. Combined with split-wise filtering (skipping entire files or partitions that don’t match a predicate) and even storage-level pushdown (discussed below), Trino minimizes the JSON it actually parses. This kind of lazy, selective parsing is crucial for very large nested JSON documents, as it avoids materializing huge in-memory objects only to drop most of their content later.
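As a sketch of what this means in practice (reusing the hypothetical table above), only the referenced fields are materialized from each record:

```sql
-- Only event_id, user_name, and payload.latency_ms are parsed per record;
-- the remaining JSON keys are skipped by the streaming parser, and a failing
-- filter lets the reader abandon the rest of the record early.
SELECT user_name, payload.latency_ms AS latency_ms
FROM hive.logs.events
WHERE event_id = 42;
```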
Parallel reading and split handling: As mentioned, Trino uses the Hive connector’s splitting logic to break large JSON datasets into manageable chunks. Each worker is assigned splits (file segments) and uses one or more threads to parse JSON rows from its segments. The JSON parser in Trino is stateless per reader, so multiple threads can safely parse different portions of the same file. (Notably, a 2022 bug around the thread-safety of the JSON SerDe was fixed – each split/reader now gets its own JSON deserializer instance to avoid concurrency issues.) Trino ensures that split start and end offsets align with record boundaries. For typical JSON text files (one JSON object per line), this means each split starts reading at the beginning of a line, and the previous split’s reader stopped at a newline. If a JSON record is extremely large (larger than the split size), the framework treats the entire record as one unit so it isn’t split between two workers: the reader extends past its split’s nominal end until it finds the record’s terminating newline, guaranteeing well-formed JSON input to the parser. This approach allows Trino to process huge JSON files in parallel while keeping JSON tokenization correct. In summary, recent developments in Trino (the past ~2 years) have significantly optimized JSON processing: a new native JSON parser with less overhead, adherence to Hive semantics (duplicate keys, null handling) with better efficiency, predicate pushdown during parsing, and robust parallel scan capabilities. All these enhancements make Trino much more performant on raw JSON in data lakes than earlier iterations, as confirmed by user benchmarks and release notes.
Additionally, Trino can leverage Amazon S3 Select when reading JSON from S3 object storage. This is an optimization contributed by AWS in late 2022 that pushes down JSON filtering and projection to the S3 service itself. With S3 Select enabled, Trino offloads the parsing work to S3: the S3 API will read the JSON object and return only the fields and records that match the query’s filters. By retrieving “only the required data from Amazon S3,” Trino can skip downloading and parsing large swathes of JSON text. AWS reported up to 9× faster queries on JSON data at 3 TB scale by using Trino with S3 Select pushdown, compared to Trino’s normal JSON scan path. This pushdown effectively eliminates much of the JSON I/O and parsing cost for eligible queries. It’s an example of how Trino’s engine internals integrate with storage-layer capabilities to optimize JSON handling even further (though it applies specifically to S3 as a source).
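In the releases that shipped this integration, the pushdown was toggled with a Hive catalog configuration property (`hive.s3select-pushdown.enabled`) or the corresponding session property. The sketch below assumes the catalog is named `hive`; the exact property names should be checked against your Trino version, since the feature has evolved across releases:

```sql
-- Enable S3 Select pushdown for this session (property name as documented
-- for the releases that included the feature; assumption: catalog is 'hive').
SET SESSION hive.s3_select_pushdown_enabled = true;

-- Projection and filter can now be evaluated by S3 itself, so Trino only
-- receives the matching fields and rows instead of the full JSON objects.
SELECT event_id, user_name
FROM hive.logs.events
WHERE user_name = 'alice';
```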
Trino’s in-memory MPP design: Trino is a pure in-memory massively-parallel engine, meaning it pipelines all query operations in memory without materializing intermediate data to disk under normal execution. Each stage of a query exchanges data via an in-memory shuffle (using network buffers), and operators (scan, filter, join, aggregate, etc.) process data in a streaming fashion. This design was inherited from Presto and optimized for low-latency analytics. By avoiding writes to disk for intermediate results, Trino eliminates the substantial I/O overhead that disk-based engines incur during query execution. For example, when joining or aggregating data, Trino builds hash tables and sorted state in RAM. The result is that Trino can quickly move data through the query plan, only reading from the external source (e.g. S3 or HDFS) and writing final results (if needed) at the end. There is no “spill to HDFS” for shuffles or temp storage in the normal path – a stark contrast to MapReduce or classical Hadoop, where each stage writes to disk.
This in-memory model greatly reduces I/O bottlenecks. Reading JSON data from a data lake is often I/O-bound – reading terabytes of text from remote storage can be slow. Trino mitigates this by doing one read pass and no extra writes. In an I/O-bound scenario, eliminating additional read/write cycles (like those an engine incurs when it writes intermediates to disk) can cut total I/O by half or more. Trino’s coordinator also tries to schedule splits to maximize data locality (in Hadoop environments, Trino can schedule a split on the same node as the HDFS DataNode holding the data), further reducing network I/O. When JSON data is compressed (commonly the case, e.g. gzip/bzip2 JSON files), Trino workers decompress and parse on the fly in streaming mode. While compressed JSON cannot be split in the middle of a block (since gzip is not splittable), Trino will still distribute files across workers. Each worker’s CPU will be busy with decompression and JSON parsing (a CPU-intensive task), but Trino’s design means that as soon as a JSON record is parsed and satisfies the query conditions, it flows into joins or aggregations without hitting disk. This overlaps CPU parsing with the network shuffle in a pipeline, often leading to better utilization and lower latency.
By contrast, I/O-bound models like Apache Spark’s execution engine tend to use disk both for shuffle and as a safety net for memory overflow. Spark, for example, breaks a query into stages; after a stage, data is typically written to disk (or spilled to disk if it doesn’t fit in memory) before being read by the next stage. This provides fault tolerance (recovering from node failures by re-reading shuffle data) but at the cost of heavy I/O. Scanning JSON in Spark will involve reading the JSON text, parsing it (like Trino, Spark uses in-memory parsing with Jackson), then writing the parsed rows to disk if they need to be shuffled or if memory pressure is high, and later reading them back. Thus, Spark inserts additional I/O steps that Trino avoids. In practice, Trino’s approach yields lower latency for interactive queries, as there is no wait for disk operations – data flows directly through memory buffers from source to sink. A comparison of Presto/Trino vs Spark noted that Trino “uses an in-memory buffered shuffle to exchange intermediate task results without persisting to disk, speeding up query performance,” whereas Spark’s stage-based model relies on disk I/O for shuffles.
Memory management for raw JSON scanning: Processing raw JSON (a text-based, non-columnar format) is memory-intensive. Trino manages this by using a tiered memory system with per-task and per-query limits. As it reads JSON, Trino allocates buffers for file input and uses a BlockBuilder to accumulate rows in columnar form (even text JSON is turned into Trino’s columnar in-memory format – slices for strings, long arrays for numbers, etc.). Each worker will parse JSON in batches (pages) perhaps on the order of tens of thousands of rows at a time, to balance throughput and memory footprint. Trino’s memory model is largely on-heap (since it’s Java), but the engine uses optimized structures to avoid creating too many small objects. For example, when reading a JSON field that is a string, Trino will create a single large byte array for a batch of values and slice it rather than instantiate many Java String objects. This reduces GC pressure and keeps memory contiguous. The lazy parsing mentioned earlier also plays a role – by not materializing fields or entire records that aren’t needed, Trino saves memory. If a JSON document has 50 fields but the query only needs 5, Trino will never allocate memory for the other 45 fields.
Trino’s “pure in-memory” philosophy means it historically would rather throw an error than spill to disk if a query’s memory exceeds available RAM. However, recent versions have introduced optional spilling and fault-tolerant execution modes. Trino 411+ added exchange materialization and spill for certain operators (e.g. sorting, aggregations) in order to handle extreme cases and to allow query retries. This is not the default mode and usually not used for small/medium queries. By default, Trino assumes you have enough memory in the cluster to hold the working set of the query. This is unlike Spark, which by design will spill to disk when a dataset doesn’t fit in memory. In practice, for moderate-sized JSON analytics (that fit in cluster memory), Trino will outperform Spark by avoiding those disk spills. If the JSON dataset is massive (far exceeding memory), Spark might complete the query (with lots of spilling) whereas Trino might fail unless the new spilling mode is enabled. It’s a trade-off between raw speed and ability to handle unlimited data. That said, Trino’s new fault-tolerant mode does allow it to spill and continue for large data, bringing it closer to Spark’s resilience but without losing the inherent in-memory speed for typical cases.
Another difference is how these engines utilize the CPU. Trino’s operators are usually tight loops in Java processing data, whereas Spark often generates Java bytecode for operations (Whole-Stage Codegen) to optimize execution. Both approaches keep things in memory, but Trino’s pipelined execution can more effectively stream data (particularly when reading JSON, Trino can start filtering and projecting the early rows while still reading later ones). Spark’s pipeline is broken at stage boundaries, which can idle some CPU while waiting for disk I/O. In summary, Trino’s in-memory model minimizes I/O by doing just one read of the JSON and no other disk activity, versus an I/O-bound model like Spark’s, which does multiple read/write cycles per query stage. This design choice makes Trino very efficient for interactive JSON queries, as long as the working set fits in RAM.
Memory vs disk in Drill and ClickHouse: Apache Drill, like Trino, was designed for interactive analysis and uses a vectorized in-memory execution engine. Drill reads JSON into columnar value vectors in memory and pipelines execution without mandatory disk writes, though it can spill to disk in out-of-memory situations for some operators. Drill’s focus on schema-free JSON meant it had to manage potentially very large nested structures in memory. It does so by reading data in manageable record batches (e.g. up to 64K records at a time) – much like Trino’s pages – and using an off-heap memory allocator to store values in a columnar format. Drill uses direct memory and minimizes Java object overhead (Apache Arrow’s memory format in fact originated from Drill’s value vector code). This gives it an advantage in handling larger JSON records without garbage-collection pauses. In contrast, Trino until recently used mostly on-heap memory (with sophisticated pooling to mitigate GC costs). In practice, both Trino and Drill aim to do as much as possible in memory; neither will, for example, write each JSON record out to disk after parsing – they immediately operate on it (filtering, etc.) in RAM.
ClickHouse is a bit different: it’s a database system that usually stores data on disk in a columnar format. If one is querying JSON via ClickHouse, often the JSON has been ingested and converted into ClickHouse’s native columnar storage (so the query isn’t reading raw JSON text at runtime). In that case, ClickHouse’s execution is fully in-memory for processing (it memory-maps or reads the needed columns, then uses vectorized in-memory operations). If ClickHouse is directly reading JSON (say via its `file` table function or Kafka engine), it will parse the JSON on the fly. Being an in-memory columnar processor as well, ClickHouse parses and immediately converts JSON fields into column values stored in memory buffers. It heavily optimizes memory usage and can even use SIMD instructions in C++ to speed up parsing (more on that below). So ClickHouse also minimizes extra I/O – it does the parsing in memory and does not spill that data to disk (unless the query is doing an enormous sort that requires temporary files, but that’s an edge case). Overall, Trino, Drill, and ClickHouse all strive to be CPU-bound rather than I/O-bound when querying JSON: they push the limits of in-memory processing to avoid slow disk access. The main difference is that Spark (and MapReduce/Hive) were willing to use disk as a crutch for big data, whereas Trino’s philosophy is to throw more memory and streaming computation at the problem to keep interaction fast.
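For illustration, a direct read over NDJSON with ClickHouse’s `file()` table function might look like the following sketch (the path and column structure are hypothetical):

```sql
-- Parse NDJSON on the fly: the third argument maps JSON keys to typed columns.
SELECT user_name, count() AS events, avg(latency_ms) AS avg_latency
FROM file('events/*.ndjson', 'JSONEachRow',
          'event_id UInt64, user_name String, latency_ms Float64')
GROUP BY user_name
ORDER BY events DESC;
```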
It’s worth noting that while being in-memory accelerates JSON handling, the parsing of JSON itself is CPU-intensive. On a per-record basis, JSON parsing is slower than reading the same data in a binary format (like Parquet) due to the cost of interpreting text. Thus, engines compensate by parallelizing and optimizing code. A user of Drill once observed it could process JSON around 100–200 MB/s per core on simple queries (actual speed depends on JSON complexity). Trino’s recent improvements and pushdown can elevate its effective throughput by not parsing unnecessary fields and by leveraging features like S3 Select (which can dramatically cut down data to parse). In one test, Trino with pushdown was able to scan JSON from S3 at over 3.5 GB/s aggregated (essentially limited by S3’s ability to filter out data). These numbers underscore that I/O can be the bottleneck (reading 3.5 GB/s requires many parallel S3 fetches), but with pushdown and in-memory pipelining, Trino ensures that the JSON parsing step is not doing wasted work on extraneous data.
In summary, Trino’s in-memory MPP execution reduces I/O bottlenecks by eliminating unnecessary disk writes and by smartly pushing down and pruning JSON data at the source. Compared to engines that lean on disk (Spark) or that don’t push down as much, Trino spends more time doing useful CPU work (parsing and computing) rather than waiting on I/O. This design is a key reason why Trino excels at interactive querying of data lake files.
Schema-on-read and flexibility: One of the biggest differences among Trino, Apache Drill, and ClickHouse is how they handle schema-on-read for JSON. Apache Drill is renowned for its schema-free JSON querying. You can point Drill at a directory of JSON files and immediately run SQL queries without defining a schema up front – Drill will automatically interpret the JSON structure. It dynamically discovers fields as it parses the JSON, effectively treating the data itself as the schema. This allows great flexibility with semi-structured data (you can query new JSON fields that weren’t known ahead of time). Trino, by contrast, typically requires a schema to be defined (e.g. via a Hive table or a predefined `CREATE TABLE` with columns corresponding to JSON keys). Trino’s Hive connector does support an `OPENX_JSON` format where the schema is provided in the DDL and Trino will parse JSON accordingly. Thus, Trino is schema-on-read in the sense that the JSON file isn’t stored with a fixed schema, but the user or metastore must supply a schema for the query to treat the JSON fields as table columns. If a JSON field is missing or extra, Trino might return null for missing fields or ignore unknown fields. Drill’s approach is more flexible: it can handle missing fields and even varying schemas between JSON documents gracefully, since it doesn’t expect a fixed set of columns – it discovers whatever is present.
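A sketch of Drill’s point-and-query style (the workspace and file path are hypothetical; no DDL is required anywhere):

```sql
-- Drill infers the structure while reading; nested fields are addressed
-- with dot notation, and no schema was declared up front.
SELECT t.event_id, t.payload.ip AS ip
FROM dfs.`/data/lake/events/2025-01-01.json` t
LIMIT 10;
```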
ClickHouse is the least schema-on-read of the three. ClickHouse typically works best when data is loaded into its own tables with a defined schema. It does have the ability to read JSON directly (for example, using the `JSONEachRow` input format or table functions), but you must define how to map JSON keys to columns in a query or table definition. It won’t automatically infer arbitrary new fields at query time. In other words, ClickHouse prefers schema-on-write: you design the schema (possibly denormalizing JSON into a wide table) for performance. As one expert noted, if the JSON structure is known or predictable, it’s often best in ClickHouse to extract those fields into separate columns rather than store raw JSON, because that yields much faster querying with indexes and compression. So, in terms of schema flexibility: Drill offers the most ad-hoc schema-on-read for JSON (no upfront schema needed), Trino requires a schema to be specified but can then read JSON files accordingly, and ClickHouse is usually used by first transforming JSON into a structured schema for analysis (though it can do on-the-fly JSON function queries too, with performance caveats).
JSON parsing and internal representation: All three engines ultimately parse JSON text into an internal form to query it. Drill uses a columnar vectorized execution engine (based on an Arrow-like memory format). When Drill reads JSON, it uses a high-performance JSON reader that parses values and writes them into value vectors (one vector per JSON field/column). Drill’s JSON reader is quite sophisticated: it can handle deeply nested JSON by representing nested structures as repeated (list) or struct vectors. Under the hood, Drill’s JSON parser was built in Java (originally using Jackson’s streaming API) but optimized to minimize object creation. It effectively does a streaming parse much like Trino does, reading tokens and writing values into memory buffers. Drill was designed from the ground up for complex data types, so it has SQL extensions like `FLATTEN()` for nested arrays and dot notation for accessing nested fields. This makes Drill very capable for JSON analytics – you can, for example, select `customer.name` directly from a JSON document, or flatten an array of addresses into multiple rows. Drill navigates the JSON structure dynamically at runtime. The cost is that this dynamic parsing can be CPU-heavy (since it may check the structure of each record to see what fields exist). Drill’s execution remains in-memory after parsing, using vectorized operators on the columnar data.
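For example, a sketch of `FLATTEN()` on a nested array (the file path and field names are hypothetical):

```sql
-- Each element of the addresses array becomes its own output row.
SELECT t.person.name AS name,
       FLATTEN(t.person.addresses) AS addr
FROM dfs.`/data/people.json` t;
```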
Trino’s JSON handling, as described, either uses Hive SerDes or its native OpenX JSON parser. In either case, once a JSON row is parsed, Trino represents it in its Page and Block objects (which are columnar). For example, an `id` field might go into a Block of BIGINTs, an `address` field which is an array might become a Block that contains an array of nested row Blocks, etc. Trino’s execution engine then treats these Blocks like any other table data (it can join on them, filter, etc., even if they are nested structures). However, Trino’s SQL syntax for nested JSON is not as transparent as Drill’s. Typically, if you have nested JSON, you define the table with that nested type (Trino supports `ROW` and `ARRAY` column types). Querying nested data might involve using `element_at()` or `UNNEST` to dig into arrays. It’s powerful, but you need to know the schema. If instead the JSON is stored as a single text column and you want to parse it at query time, Trino offers functions like `json_extract()` or `JSON_VALUE()` to retrieve data by a JSON path. Those functions invoke a JSON parser on the fly for that value, which is less efficient than having done a full parse at scan time. So the preferred method in Trino is to use the Hive connector’s JSON table format, so that parsing happens at scan time and the engine deals with typed columns thereafter.
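A short sketch of that query-time parsing path (table and column names hypothetical); each call re-parses the `raw_json` string, which is why scan-time parsing is preferred:

```sql
SELECT
  json_extract(raw_json, '$.store.book')   AS books,  -- returns a JSON value
  JSON_VALUE(raw_json, 'lax $.store.name') AS store   -- returns a VARCHAR scalar
FROM hive.staging.raw_events;
```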
ClickHouse, if reading raw JSON using the `JSONEachRow` format, will parse each line of JSON into columns using its SIMD-accelerated C++ parser (built on libraries like RapidJSON and simdjson). Once in columns, ClickHouse stores the data in its native columnar format (in memory for a transient query, or on disk if inserting into a table). For querying, ClickHouse provides many JSON functions (like the `JSONExtract*` family) that can parse JSON strings and extract values. These functions are also based on simdjson for speed. However, using them on the fly (on raw JSON text stored in a column) every time you query can be slow compared to having the data already split into columns. Thus, the recommended approach is often to ingest JSON into a structured table. In that scenario, the heavy JSON parse happens once during ingestion, and queries thereafter operate on efficient binary columns.
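A sketch of query-time extraction in ClickHouse (table and key names hypothetical); the functions are simdjson-backed, but they re-parse the text on every query:

```sql
SELECT
    JSONExtractUInt(raw, 'user', 'id')     AS user_id,   -- keys as varargs
    JSONExtractString(raw, 'user', 'name') AS user_name
FROM raw_events
WHERE JSONHas(raw, 'user');
```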
Predicate pushdown and filtering: When it comes to filtering JSON data, Trino and Drill both attempt predicate pushdown at the file scan level, but the mechanisms differ. Trino can push down file-level predicates (e.g., if your JSON files are partitioned by date and your query has `WHERE date='2025-01-01'`, Trino will only read the relevant files or directories). For record-level predicates on JSON fields, Trino’s native reader can apply them during parsing as described – effectively skipping parsing of unnecessary records. This isn’t a “pushdown” to an external system, but it is an early filter in the scan operator. The most striking pushdown Trino offers is with S3 Select, where the S3 service does the filtering. Drill, on the other hand, has no equivalent to S3 Select. Any filtering on JSON fields in Drill is done by Drill itself after parsing the records (Drill’s JSON reader doesn’t have an index to skip ahead within a file). However, Drill can do partition pruning if the JSON files are in partitioned directories (similar to Hive). Drill’s execution engine might also short-circuit the parsing of a record if a filter is false – for example, given `WHERE json_field = 5`, Drill could conceivably detect early that the field is not 5 and skip further nested parsing of that record (though the specifics of that optimization are not well documented). In general, Drill’s strength was not predicate pushdown, but rather brute-force reading and vectorized processing. It relies on being able to scan lots of JSON quickly, rather than skipping content. ClickHouse again is a different animal: if data is in ClickHouse columns (after ingestion), predicate pushdown is trivial – it will use indexes, min/max values, etc., to skip blocks that don’t match, just as with any structured data. If you are querying raw JSON via `JSONEachRow` without ingesting, ClickHouse has to parse each JSON record and then apply the filter. There’s no external “server” to push the JSON filter to (unless you consider something like S3 Select outside of ClickHouse, but ClickHouse doesn’t integrate with S3 Select). So in that raw scenario, ClickHouse will parse everything, but because its parser is extremely fast in C++, it can still be very efficient.
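A minimal sketch of the partition-pruning case (hypothetical table partitioned on an `event_date` column):

```sql
-- Only the files under event_date=2025-01-01/ are listed and scanned;
-- every other partition is pruned before any JSON is parsed.
SELECT count(*)
FROM hive.logs.events_by_day
WHERE event_date = DATE '2025-01-01';
```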
One notable area is nested and complex structures. Drill was particularly good at querying deeply nested JSON because it could project paths inside the JSON without requiring them to be flattened. You could have wildly nested data and still query it in SQL (Drill’s engine would create matching nested vectors of data). Trino can also handle nested data, but the schema must reflect it and queries can become complex (using `CROSS JOIN UNNEST` to expand arrays, etc.). ClickHouse historically did not support arbitrarily nested JSON structures in a single table row – you would typically normalize or flatten it. ClickHouse does have a `Nested` type, but that is more like a shortcut for multiple arrays with a common index (it’s not a general JSON nested object type). If you gave ClickHouse a JSON document with 5 levels of nesting and asked to query something deep inside, you’d likely either use JSON functions (which parse the text at query time) or load it into a structured table with multiple tables or columns representing the hierarchy. Thus, for deeply nested or schema-less JSON analytics, Apache Drill often required the least upfront work – you could just query it. A user could run `SELECT flatten(person.addresses) AS addr FROM json_data` in Drill to explode an inner array, whereas in Trino you’d need to know the structure and do something like `SELECT t.addr FROM mytable CROSS JOIN UNNEST(person.addresses) AS t(addr) ...`. The end result is similar, but Drill made it feel more like the JSON is directly queryable.
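Filled out into a complete (still hypothetical) query, the Trino side looks like this, assuming `person` is declared as a `ROW` type containing an `ARRAY` field:

```sql
-- Expand the nested array: one output row per address element.
SELECT m.id, a.addr
FROM mytable AS m
CROSS JOIN UNNEST(m.person.addresses) AS a(addr);
```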
Performance comparisons and benchmarks: Comparing the raw performance of Trino vs Drill vs ClickHouse on JSON workloads is nuanced, because each excels under different conditions:
- Trino vs Drill: Drill’s JSON parsing was highly optimized for its time, but Drill is an older project (its peak development was in the mid-2010s). Trino’s recent enhancements (in 2023–2024) likely closed the gap or surpassed Drill for scanning JSON, especially when using pushdown. A specific comparison by a user or blog is hard to find, but consider that Drill could scan JSON at ~100–200 MB/s per core in 2017; Trino’s native JSON reader after its improvements is probably in the same ballpark per core in Java. Both scale out linearly with more nodes. Drill’s advantage was zero-config schema and on-the-fly querying. However, that flexibility sometimes meant extra overhead: Drill might spend time discovering schema or handling varying types. If the JSON data has a consistent schema and the query is known, Trino’s more static approach (with a defined table schema) can be faster because it doesn’t have to infer anything – it simply applies the parsing logic. In terms of the query engine beyond parsing, Trino’s scheduler and optimizer are more advanced today (cost-based optimization, dynamic filtering, etc.), which can make a difference if the JSON data is large but the query is selective. For example, Trino can dynamically filter using join keys (even on JSON fields, once they are exposed as columns) to avoid unnecessary scanning of a second dataset. Drill had no such advanced runtime filtering. That said, for straightforward full scans of JSON, their performance might be comparable, with Drill possibly faster if the query involves heavy use of nested data (since it was built exactly for that case). As of 2024, Trino has the momentum – ongoing work continues to improve its JSON performance (the Trino 475 docs explicitly call out improvements for “queries with filters on ROW columns stored in Parquet” and JSON, indicating continued focus).
- Trino vs ClickHouse: ClickHouse is known for being extremely fast for analytics, but mostly when data is in its optimized columnar format. If we consider ingested JSON (converted to columns), ClickHouse would likely outperform Trino because the problem reduces to scanning columnar data with all of ClickHouse’s low-level optimizations (vectorized C++ processing, indexing, etc.). However, if we focus on reading raw JSON files (the data lake scenario), the comparison is between Trino’s Java-based JSON scan and ClickHouse’s JSONEachRow scan. Here, ClickHouse’s use of highly optimized C++ libraries (simdjson, etc.) can give it a big edge in pure scan speed. For instance, simdjson can parse gigabytes per second on a single core under ideal conditions, and ClickHouse has reported JSON scanning at multiple GB/s in some cases. Trino’s Java parser, while improved, is unlikely to reach that throughput per core (Java is typically slower for this kind of byte-level work). So for a full scan of a massive JSON dataset with no filtering, a compiled engine like ClickHouse will have a raw parsing speed advantage. One must also consider parallelism: Trino can distribute the scan across many nodes reading from S3; ClickHouse can also distribute if the JSON is spread across cluster nodes or read in parallel from storage. A real-world benchmark (Onehouse.ai’s 2025 analytics engine comparison) noted that ClickHouse and another C++ engine (StarRocks) showed very high performance on internal benchmarks, whereas Trino was competitive but more latency-focused. We don’t have specific JSON query numbers from that source, but it’s fair to say ClickHouse’s strength is raw throughput (at the cost of requiring data to be structured or pre-loaded for best results).
- Drill vs ClickHouse: Drill and ClickHouse are rarely compared directly, since their use cases differ (Drill for flexible exploration, ClickHouse for high-speed aggregation). If both are reading raw JSON, ClickHouse’s simdjson-based parser will likely outstrip Drill’s Java parser in speed. However, Drill can query arbitrary nested JSON without a predefined schema, which ClickHouse cannot do easily. So if the task is “ad-hoc analyze this unknown JSON”, Drill shines. If the task is “ingest this JSON and run repeated queries”, ClickHouse would ingest it and then answer very fast. A benchmark by MapR in 2016 showed Drill outperforming some SQL-on-Hadoop engines for JSON queries, but that was before ClickHouse (and Trino) became popular.
In terms of memory and I/O handling:
- Trino and Drill both rely on distributed execution reading from a data lake. They benefit from data locality if on Hadoop. Both will read each file once (no caching unless the user or engine caches at a higher level). Drill can cache metadata about JSON structure to skip reading all files for schema, but not the data itself. Trino can use its filesystem cache feature to cache frequently accessed files in memory across queries. ClickHouse, if used in a data-lake style (reading external files), doesn’t cache by itself (you’d rely on OS cache). But usually ClickHouse would store data internally, meaning after initial load, data is kept in a format that’s efficient to read (compressed columns) and often hot in OS page cache or RAM. So for repeated queries, ClickHouse might end up reading from memory-mapped files (essentially an in-memory read) whereas Trino/Drill reading JSON will re-read the text each time (unless an external cache like Alluxio or Trino’s cache is used). This gives ClickHouse an edge in repeated workload scenarios.
Benchmark examples: In an AWS benchmark (TPC-DS-like queries on 3 TB of JSON data), Trino with S3 Select was 2–9× faster than without pushdown. This suggests that when Trino can avoid parsing many JSON rows (by offloading the filter to S3), its performance jumps significantly. A Drill user on Stack Overflow noted that using Drill on 160k small JSON files was suboptimal and that combining them into larger files (100–200 MB each) yielded about 100–200 MB/s per-core scan speed. This is actually quite decent, but it underscores that small files hurt performance due to per-file overhead (Drill and Trino both prefer fewer, larger files to amortize startup costs). ClickHouse’s documentation and user reports indicate that its JSON functions (simdjson-based) can parse JSON much faster than naive methods – simdjson itself claims throughput over 4× that of RapidJSON (a typical fast C++ parser) and can stream-parse at gigabyte speeds. In practice, one user found that even with these fast libraries, it was still faster in ClickHouse to normalize nested JSON into separate columns rather than query it as raw JSON on the fly. That tells us that while ClickHouse can parse JSON quickly, the real speed comes once the data is in structured form (where it can use indexes on columns, skip scanning irrelevant columns entirely, etc.).
To summarize the comparison:
- Apache Drill: excels in ease of querying semi-structured JSON. No upfront schema – just point and query. It can handle deeply nested and varying JSON structures elegantly. Its vectorized execution gives good performance, but being Java-based and now older, its JSON parsing might not match a modern optimized approach. Drill is best when you need flexibility and are exploring JSON data or integrating with many formats quickly. Predicate pushdown is limited to partition pruning; it generally brute-force scans JSON (albeit efficiently, in memory). Drill’s memory usage is optimized via off-heap buffers, but if data is huge, you might need to increase its memory or accept some spilling.
- Trino: now offers a strong balance – you define the schema (which can include nested types), and Trino will efficiently scan and filter JSON. It supports some pushdown (especially with S3 Select, or by skipping fields/rows during the parse). Trino’s strength is complex query processing (joins, aggregations) across JSON and other data sources together, using its advanced optimizer. It may require more upfront work (like setting up Hive tables or schemas), but once that’s done, it’s very powerful for interactive analysis. Recent improvements have vastly sped up JSON throughput, addressing earlier performance gaps. Trino is likely faster than Drill for large-scale joins or multi-step pipelines involving JSON, thanks to its cost-based optimizer and parallel execution refinements.
- ClickHouse: is the fastest for pure aggregation and scanning, but only after you’ve structured the data. If you can convert your JSON into a table (even with a `Nested` type or flattened columns), ClickHouse will deliver sub-second query times on large data thanks to its engine optimizations. However, for ad hoc JSON where the structure isn’t predetermined, ClickHouse is not as convenient: it expects, or at least benefits greatly from, a schema. It also doesn’t connect to heterogeneous sources the way Trino or Drill do – it’s not designed as a federated query engine, but rather as a high-performance DB. For JSON specifically, ClickHouse leverages SIMD parsing to make reading JSON as fast as possible in C++, and it provides powerful JSON functions (`JSONExtract` can even return nested structures). In scenarios where one might weigh ClickHouse vs Trino on a data lake: if low-latency queries on known JSON fields are needed repeatedly, load them into ClickHouse; if flexibility or one-time analysis on arbitrary JSON is needed, Trino (or Drill) is more suitable.
Finally, any standardized benchmark must be taken with context. Each of these engines can be tuned. For instance, enabling compression pushdown, adjusting batch sizes, or using specialized formats (Parquet instead of JSON) can change the picture. But focusing on JSON, the above outlines the general trade-offs: Drill for flexibility, Trino for a mix of performance and federation, and ClickHouse for sheer speed when schema is fixed.
The role of SIMD in JSON parsing: SIMD (Single Instruction, Multiple Data) allows a processor to perform the same operation on multiple pieces of data simultaneously. For parsing JSON – which is essentially processing a sequence of characters – SIMD can greatly accelerate scanning and tokenizing the text. A conventional JSON parser checks one byte at a time (e.g. “is this character a `{` or `}` or a digit?”), which is slow due to branching on each byte. A SIMD-accelerated parser, like simdjson, loads 16 or 32 bytes at once into a wide register and uses vector instructions to detect JSON structural characters (quotes, braces, commas) in parallel. For example, simdjson can validate that brackets/braces are properly matched by processing many characters in one go, and it finds token boundaries much faster than a byte-by-byte loop. The benefit is dramatic: simdjson’s authors report parsing throughput over 4× that of traditional parsers (RapidJSON) and can reach gigabytes per second on modern CPUs. In practical terms, SIMD turns the CPU-intensive part of JSON parsing (lexical analysis) into a much more cache-friendly and parallel process. It can quickly skip whitespace, locate the next key or value, and even validate UTF-8 strings at very high speed.
SIMD in Trino: Trino is written in Java, and it currently does not use a SIMD-optimized JSON library like simdjson. Java’s standard libraries (Jackson, etc.) don’t utilize SIMD explicitly. The Trino native JSON reader is built in Java and relies on the JVM’s bytecode execution. While modern JVM JITs can auto-vectorize very simple loops, parsing JSON is complex and involves branch-heavy logic that typically isn’t auto-vectorized. There have been efforts in the Java ecosystem to create faster JSON parsers (e.g. jsoniter, DSL-JSON) which use techniques like partial vectorization or optimized bit loops. It’s possible Trino’s OpenX JSON code uses some low-level tricks (like bit masks to find certain characters), but it’s constrained by what pure Java can do. There is no indication in Trino’s release notes or code that simdjson (a C++ library) has been integrated. Doing so would require JNI (Java Native Interface) or the forthcoming Panama APIs, which introduce complexity and potential overheads that the Trino team has apparently deemed not worth it (at least as of 2024). Instead, Trino improved JSON performance via algorithmic changes (like the `resetTo` avoidance of double parsing) and pushdown, rather than trying to massively speed up the character-by-character parsing itself in Java.
SIMD in competing engines: ClickHouse and some other database engines do use SIMD for JSON. ClickHouse specifically bases its JSON functions on simdjson. This means that when you call something like `JSONExtractUInt(json, 'user', 'id')` in ClickHouse, under the hood it uses simdjson’s on-demand parser to quickly navigate to `user.id` in the JSON string using vectorized instructions. Likewise, when reading the `JSONEachRow` format, ClickHouse leverages simdjson/RapidJSON with vectorized optimizations to parse each row. This gives ClickHouse a big speed advantage in raw parsing. Apache Drill, being Java, did not use SIMD (the same limitation as Trino). Apache Spark also uses Jackson for JSON, which is not SIMD-accelerated. However, note that Spark can benefit from Project Tungsten’s off-heap and vectorized processing for the post-parse part of execution (and some vectorized Parquet/ORC decoding), but for JSON text it still relies on standard parsing. There has also been interest in the Arrow community in exposing native JSON parsing to Java through Arrow’s C++ library, which ships a native C++ JSON reader. In the future, Trino could potentially use such a native parser via JNI to get SIMD benefits, but this isn’t in place yet.
Benefits of SIMD for JSON: The primary benefits are speed and lower CPU usage. By processing in one instruction what would otherwise take ~16 byte-at-a-time steps, a SIMD parser can dramatically reduce parsing time, which is often the bottleneck in JSON queries. For example, simdjson can parse NDJSON at 3.5 GB/s on a single machine (under ideal conditions), meaning it can handle extremely large JSON feeds in real time. SIMD also tends to use fewer branches, which avoids misprediction penalties on modern CPUs. It treats parsing as more of a data transformation problem (with precomputed lookup tables, etc.), which is ideal for modern hardware. Another benefit is that SIMD implementations often come with additional goodies – simdjson, for instance, does a two-stage parse (stage 1 finds all structural indexes, stage 2 builds the actual JSON values) which is very cache-efficient. This organization means that even for complex JSON, a lot of the heavy lifting (stage 1) runs at close to memory-bandwidth limits thanks to SIMD, and stage 2 benefits from having all the structure identified up front.
Limits and challenges with SIMD on nested/variable JSON: Despite its speed, SIMD for JSON isn’t a silver bullet, especially for very nested or irregular JSON. A key limitation is that JSON data isn’t a simple uniform array of numbers – it’s a hierarchical structure with variable lengths. SIMD works best when the work is regular and predictable (e.g., parsing a fixed number of bytes in a known pattern). In JSON, after the initial tokenization, you have to interpret the structure: match braces, identify which value belongs to which key, and so on. Branchy logic is inevitable when you have conditionals like “if key == X then do Y” or when handling numbers vs strings vs booleans differently. SIMD can accelerate the initial scan for tokens, but once you start building the actual objects or navigating the hierarchy, the code becomes more scalar. For instance, simdjson uses SIMD to quickly identify all the positions of `:` and `"` and `{` in the text, but it still has to iterate through those positions to assemble the JSON tree. For very deeply nested JSON, performance may become bottlenecked by pointer-chasing through the structure rather than by the initial scan. Another limitation is memory overhead: simdjson’s approach of indexing all structural characters uses extra memory proportional to the size of the JSON (to store the indices). For large nested JSON, this index itself can be large (though still linear in the input size).
When JSON fields are out-of-order or have missing fields, a naive SIMD approach won’t skip unnecessary data by itself – it will still parse everything. For example, if you only care about one field, simdjson will still parse the whole JSON (though you can stop early if you recognize you found what you need, but general SIMD parsing reads most of it anyway). This is where Trino’s lazy approach can actually save work: if Trino knows only field “foo” is needed, it can try to jump to “foo” and ignore others (though in pure Java without an index, it still has to scan char by char until it finds “foo”). Simdjson does have an “on-demand” API that somewhat mimics lazy parsing – it lets you search for a key on the fly without materializing the full DOM, but under the hood it uses the prebuilt structural index to jump quickly. The caveat is that extremely nested JSON with many tiny elements could diminish SIMD efficiency: each tiny element might still incur overhead to process individually after the bulk scan. SIMD shines with large text and bulk operations, but after that stage, the fine-grained processing of many small values can become the bottleneck.
Another structural limitation is that SIMD benefits are tied to hardware. If you run on older CPUs without wide SIMD registers, you get less benefit. Simdjson automatically selects the best instruction set available (SSE4.2, AVX2, AVX-512, etc.). But Trino’s cluster might have mixed CPUs or Java might not leverage those fully. Also, using SIMD via native code in a Java engine introduces complexity in deployment (one must ship native binaries for each architecture). It’s not impossible (Spark’s Arrow-based UDFs do similar for columnar data), but it’s added complexity.
Comparing engines in the SIMD context: ClickHouse clearly gains from SIMD – part of why it’s so fast for JSON is exactly because it can use AVX2 instructions, etc. Drill and Trino, being Java, traditionally did not have that advantage. There have been interesting developments, though: for example, Meta’s C++ execution engine Velox, which is used with PrestoDB (the Presto line that remained at Meta), integrates simdjson for some operations. The simdjson GitHub page notes that Meta’s Velox and even StarRocks (another database) use simdjson. So the idea of SIMD-accelerated JSON is catching on in the query engine world.
Trino might eventually get similar benefits if it leverages Velox or Arrow for certain connectors (the Velox project provides a C++ execution engine that could in principle underlie Trino, but Trino is currently pure Java, whereas Velox is being used with Meta’s Presto). As of now, Trino doesn’t list SIMD JSON parsing as a feature, whereas ClickHouse explicitly documents simdjson usage.
One more limit of SIMD with JSON is the conversion of numbers from text. While SIMD can identify digit runs quickly, converting a string of digits into an actual 64-bit number involves arithmetic that can’t be fully vectorized (there are tricks like handling 8 digits at a time with SIMD, but it gets complex). Simdjson does some of this with parallel algorithms, but certain parts of JSON (like parsing a large decimal number, or parsing date strings) remain relatively slow and scalar. So if a JSON dataset has many numbers or dates in string format, SIMD helps read the characters, but the conversion step may dominate. In Java engines this is often a hotspot – converting strings to numbers – and it is hard to vectorize in Java. Some libraries use hand-tuned algorithms (e.g., fast double-parsing algorithms) to speed it up.
In summary, SIMD acceleration can greatly benefit JSON parsing by speeding up the tokenization and validation phase. Engines like ClickHouse (and libraries like simdjson) harness this to parse JSON at blazing speeds (multiple GB/s). However, Trino’s current implementation doesn’t directly use SIMD, instead focusing on high-level optimizations (pushing down filters, avoiding unnecessary parsing) to mitigate JSON costs. The limitation of SIMD in the context of nested JSON is that after the initial SIMD-powered scan, dealing with the hierarchical structure and lots of tiny fields can become the bottleneck – essentially, JSON parsing has parts that are amenable to SIMD and parts that are not. The parts that aren’t (like building the actual nested objects or performing per-value logic) mean that beyond a point, an engine still has to do work proportional to the complexity of the JSON. Another limitation is integration: an engine like Trino would need to bridge between managed code and native SIMD code, which has its own overhead (context switches, data copying, etc.) that could eat into the gains unless done carefully (e.g., processing large buffers at a time to amortize JNI overhead).
Despite these limits, the trend is clear: wherever possible, JSON parsing is being enhanced with SIMD. ClickHouse’s documentation even notes simdjson is “rigorous in parsing” and distinguishes nested fields properly while being fast. Engineers must balance the complexity – for a Java-based engine, an alternate route is to offload JSON scanning to something like S3 Select (using the cloud service’s native optimized C++ under the hood), which Trino has done. That’s effectively another way to leverage native speedups without integrating at the code level. Going forward, if Trino or others integrate Arrow’s native parsers, we may see more SIMD usage.
Bottom line: SIMD can massively speed up JSON parsing (e.g., simdjson ~4× faster than non-SIMD parsers), but it primarily helps the low-level byte processing. Engines like Trino that don’t yet use SIMD have focused on reducing the amount of JSON that needs to be parsed (via pushdowns and lazy evaluation) to compensate. Engines like ClickHouse that fully exploit SIMD show extremely high raw performance, although even they recommend structuring data for best results. The irregular, nested nature of JSON means pure SIMD parsing will always face diminishing returns at some point – you can tokenize quickly, but still have to handle the tokens in a CPU-friendly way. The most successful approaches, such as simdjson’s, combine SIMD scanning with clever indexing so that accessing a nested field becomes more like a binary search or pointer lookup (almost like an index lookup) rather than a naive traversal. This works very well for repeated queries or known access patterns.
For Trino users, the takeaway is that the engine achieves JSON performance through software techniques rather than hardware-specific ones, at least for now. It will parse JSON efficiently across many nodes in parallel and try not to do more work than necessary. ClickHouse or a simdjson-based approach will parse faster per core, but might require more upfront work (like defining schema or loading data). As data engines evolve, we may see Trino incorporate more native acceleration, possibly using something like SIMD for its JSON and CSV readers (especially as Java becomes more friendly to native vector calls via Panama). Until then, the current state is as described above: Trino – optimized in Java and via pushdown, Drill – pure Java and schema-flexible, ClickHouse – C++ with SIMD optimizations, each with their trade-offs on semi-structured JSON data.
Sources:
- Trino Hive connector improvements and JSON parsing performance
- AWS S3 Select pushdown in Trino (performance results and mechanism)
- Apache Drill’s schema-free JSON querying and usage of columnar execution
- ClickHouse using simdjson for JSON parsing and performance considerations
- Explanation of SIMD json parsing approach (simdjson) and its benefits
- Onehouse analytics engine comparison (Trino vs ClickHouse vs Spark) for context on engine designs
- Stack Overflow discussions on Drill JSON performance and ClickHouse JSON storage tips