SuperDB is an open-source analytical SQL database that eliminates the traditional divide between relational tables and JSON data. By making dynamic typing the architectural foundation while maintaining strong type safety, SuperDB enables developers to query heterogeneous data without the fragility of schema inference or the complexity of separate type systems. Currently in active development by Brimdata, the system shows significant promise for organizations dealing with mixed structured and semi-structured data.
SuperDB's revolutionary approach centers on its super-structured data model, where every value carries its own fully-specified type information. Unlike traditional databases that require predefined schemas or suffer from brittle schema inference, SuperDB treats data as self-describing sequences of typed values. This means a single query can seamlessly process records with different structures - one row might contain a simple string field while the next contains a complex nested array, and SuperDB handles both naturally.
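A minimal sketch makes this concrete (the field names and values here are hypothetical): two newline-delimited JSON records with different shapes flow through one query, and typeof exposes the distinct type of each value:
printf '{"user":"ana","tags":["x","y"]}\n{"user":"bo","age":41}\n' | super -s -c 'put shape := typeof(this)' -
No schema declaration or inference step is involved; each value simply carries its own type.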
The system implements SuperSQL, an extended SQL dialect that maintains backward compatibility with classic SQL while adding powerful pipe operators for linear data transformations. This dual approach lets developers write traditional SELECT statements or compose operations using the more intuitive pipe syntax: from users | where created_at > 2024-01-01 | aggregate count() by department. The pipe model proves particularly effective for complex transformations that would require nested subqueries in standard SQL.
SuperDB supports 30 primitive types including all standard numeric types (int8 through int64, uint8 through uint64, float32, float64), temporal types (time, duration), network types (ip, net), and specialized types like uuid and error. Complex types include records, arrays (which can be heterogeneous), sets, maps, and unions. This rich type system enables precise data representation without forcing artificial constraints.
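As a rough illustration in SUP text (field names are hypothetical; the literal syntax follows the examples shown later in this article), complex types nest freely and a heterogeneous array takes on a union element type:
{id: 1, tags: |["alpha", "beta"]|, readings: [1, 2.5, "n/a"]}   // set plus a union-typed array
{id: 2, owner: {name: "ana", addr: 192.168.1.10}}               // nested record with an ip value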
For macOS and Linux users, Homebrew provides the simplest installation path:
brew install brimdata/tap/super
Alternatively, developers can build from source with Go 1.24+:
go install github.com/brimdata/super/cmd/super@main
After installation, verify the setup with a simple test:
echo '"hello, world"' | super -s -
To initialize a SuperDB data lake for persistent storage, set up the environment and create your first pool:
export SUPER_DB_LAKE=./mydata
super db init
super db create production_data
echo '{"status": "active", "value": 42}' | super db load -use production_data -
The system requires no external dependencies, auxiliary databases, or schema registries. Everything needed for operation comes built into the single super binary, making deployment remarkably straightforward compared to traditional analytical databases.
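Once loaded, the pool can be queried in place. A minimal sketch using the super db query form shown later in this article and the pool created above:
super db query 'from production_data | where status = "active" | count()'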
The super command-line tool serves as the primary interface for all SuperDB operations. Its flexible syntax supports both file processing and streaming operations:
# Basic query execution
super -c 'from data.json | where status = "active" | count()'
# Format conversion
super -f parquet input.json > output.parquet
# Pretty-printed output
super -S messy_data.json # 4-space indented SUP format
# Streaming pipeline
cat logs.json | super -c 'where grep("ERROR") | aggregate count() by service' -
Key command-line flags optimize different workflows:
-c "query": Specify SuperSQL query
-f format: Output format (sup, json, csv, parquet, arrow, bsup, csup)
-s: Human-readable SUP output
-j: JSON output
-S: Pretty-printed SUP with indentation
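These flags compose. For example (the input file name here is hypothetical), the following filters with -c and emits JSON with -j:
super -j -c 'where status = "active"' data.bsup > active.json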
The system automatically detects input formats, allowing seamless processing of mixed data sources:
super -c 'count()' data.json other.csv third.parquet
SuperSQL provides two syntactic approaches for different use cases. Classic SQL mode maintains familiarity for traditional database users:
SELECT customer_id, SUM(amount) as total
FROM sales
WHERE date >= '2024-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000
ORDER BY total DESC
LIMIT 10;
The pipe syntax offers a more linear, composable approach:
from sales
| where date >= 2024-01-01
| aggregate sum(amount) as total by customer_id
| where total > 1000
| order by total desc
| limit 10
SuperSQL supports operator inference for interactive use. These shortcuts accelerate exploratory analysis:
# Search shorthand
?error # Implies: search error
# Field assignment
status := "active" # Implies: put status := "active"
# Boolean expression
amount > 100 # Implies: where amount > 100
# Aggregation
count() by service # Implies: aggregate count() by service
The this keyword references the current value in the pipeline, enabling powerful transformations:
from api_response
| put total := this.items | sum()
| where this.total > threshold
| put percentage := this.value / this.total * 100
SuperDB introduces three native formats optimized for different use cases. SUP (Super) provides human-readable text representation ideal for development and debugging. BSUP (Super Binary) offers efficient binary encoding similar to Avro, optimized for streaming and storage. CSUP (Super Columnar) delivers columnar storage for analytical workloads, competing with Parquet performance.
The format selection significantly impacts performance:
# Convert to optimized binary format
super -f bsup large_dataset.json > optimized.bsup
# Use columnar format for analytics
super -f csup analytics_data.json > columnar.csup
# Query optimized data
super -c 'aggregate sum(revenue) by region' optimized.bsup
SuperDB's type system seamlessly handles heterogeneous data:
// SUP format examples
{name: "John", age: 30} // Record
[1, "two", 3.0] // Heterogeneous array
2024-02-01T00:00:00Z // Time literal
5m30s // Duration
192.168.1.1/24 // Network CIDR
|["apple", "banana"]| // Set
error("division by zero") // First-class error
SuperDB's pipeline operators enable sophisticated data manipulations. The fork operator splits data into parallel processing branches:
fork (
where type = "order" | aggregate sum(amount) by customer
) (
where type = "return" | aggregate count() by reason
) | merge timestamp
The switch operator routes data conditionally:
switch status
case "pending" (
put priority := "high" | output to "urgent.bsup"
)
case "complete" (
aggregate count() by outcome
)
default (
put status := "unknown"
)
User-defined functions extend SuperSQL capabilities:
func calculate_tax(amount, rate): (
amount * rate / 100
)
func business_quarter(date): (
"Q" + string(ceil(month(date) / 3))
)
from sales
| put tax := calculate_tax(amount, 8.5)
| put quarter := business_quarter(order_date)
| aggregate sum(tax) by quarter
User-defined operators create reusable pipeline components:
op EnrichCustomer(): (
join (
from customers
) on this.customer_id = customers.id
| put customer_name := customers.name
| put customer_tier := customers.tier
)
from orders | EnrichCustomer() | where customer_tier = "gold"
SuperDB's data lake provides Git-like semantics for data management with ACID properties:
# Initialize and configure lake
export SUPER_DB_LAKE=/data/superdb/lake
super db init
# Create pools (similar to tables/collections)
super db create raw_events
super db create processed_metrics
# Load data with automatic format detection
super db load -use raw_events events_2024*.json
super db load -use raw_events s3://bucket/events.parquet
# Query across pools
super db query "from raw_events | where timestamp > yesterday() | count()"
The lake architecture supports branching and versioning similar to Git:
# Create development branch
super db branch create dev
# Switch to branch
super db branch use dev
# Merge changes back
super db branch merge dev main
SuperDB provides an extensive library of built-in functions. Mathematical functions handle numeric operations:
abs(value) -- Absolute value
ceil(number) -- Ceiling
floor(number) -- Floor
round(number, n) -- Round to n decimal places
pow(base, exp) -- Exponentiation
sqrt(number) -- Square root
String manipulation functions process text data:
len(string) -- String length
upper(string) -- Convert to uppercase
lower(string) -- Convert to lowercase
split(string, delimiter) -- Split into array
replace(string, old, new) -- String replacement
regexp_replace(str, pattern, replacement) -- Regex replacement
Time and date functions handle temporal data:
now() -- Current timestamp
date_part("hour", timestamp) -- Extract time component
bucket(time, 5m) -- Group into 5-minute buckets
date_trunc("day", timestamp) -- Truncate to day boundary
Type functions enable dynamic type handling:
typeof(value) -- Get type of value
cast(value, type) -- Type conversion
is_error(value) -- Check if value is error
has(field) -- Check field existence
coalesce(val1, val2, ...) -- Return first non-null value
Optimizing SuperDB performance requires understanding format characteristics and query patterns. Binary formats consistently outperform text formats - BSUP provides 3-5x better performance than JSON for most operations, while CSUP excels at analytical queries with selective column access.
For production workloads, pre-convert data to optimized formats:
# One-time conversion for frequently queried data
super -f bsup raw_data.json > optimized.bsup
# Use columnar format for analytical queries
super -f csup historical_data.json > analytics.csup
# Query optimized data
super -c 'complex_analytical_query' analytics.csup
Query optimization techniques leverage SuperSQL's capabilities:
-- Use grep for multi-field text search instead of multiple OR conditions
from logs | where grep("error", message, description, notes)
-- Minimize data movement with early filtering
from large_dataset
| where timestamp > recent_cutoff -- Filter early
| put expensive_calc := complex_function(field)
-- Leverage type information for optimization
from mixed_data
| switch typeof(this) (
case "record" => process_record()
case "array" => process_array()
)
SuperDB excels at heterogeneous data integration. Common patterns include unified log analysis across multiple formats:
from "app.json", "nginx.log", "metrics.csv"
| where grep("ERROR") or severity > "WARN"
| put source := filepath()
| aggregate count() by source, hour := bucket(timestamp, 1h)
ETL pipelines benefit from SuperDB's schema flexibility:
from "s3://raw-data/events-*.json"
| where is_valid(data)
| put enriched := lookup_enrichment(data.id)
| put processed_at := now()
| aggregate metrics by category
| output to "s3://processed/daily-metrics.bsup"
API data processing leverages native HTTP support:
from "https://api.example.com/v2/data"
| unnest items
| where items.status = "active"
| put api_timestamp := now()
| join (from local_db) on items.id = local_db.id
SuperDB treats errors as first-class values rather than exceptions, enabling robust data pipelines:
from problematic_data
| put safe_calc := this.value / this.divisor
| switch (
case is_error(safe_calc) => put result := 0, error_flag := true
default => put result := safe_calc, error_flag := false
)
| where not error_flag or log_errors
Common troubleshooting approaches for performance issues include checking data format efficiency, monitoring memory usage during complex operations, and validating query logic with smaller datasets:
# Profile query execution
super -c 'explain query' data.bsup
# Monitor resource usage
time super -c 'complex_query' large_dataset.bsup
# Test with sample data
head -1000 full_data.json | super -c 'test_query' -
For production deployments, establish clear data governance patterns. Use BSUP format for operational data requiring fast access, CSUP for analytical workloads with column-selective queries, and SUP only for debugging and development.
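As an illustrative sketch of that split (file and directory paths are hypothetical), the same source data can be materialized once per role:
super -f bsup events_today.json > /data/operational/events.bsup   # fast operational access
super -f csup events_today.json > /data/analytics/events.csup     # column-selective analytics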
Implement robust error handling in all pipelines:
op SafeTransform(field): (
put original := field
| put transformed := risky_operation(field)
| where is_error(transformed)
=> put result := original, status := "failed"
=> put result := transformed, status := "success"
)
Design modular, reusable components using user-defined operators:
op StandardizeTimestamps(): (
put timestamp := case(
has(created_at) => time(created_at)
has(date) => time(date + "T00:00:00Z")
default => now()
)
)
op ValidateRequiredFields(fields): (
put validation := all(fields, has)
| where not validation => error("Missing required fields")
)
When migrating from traditional SQL databases, leverage SuperDB's SQL compatibility while gradually adopting pipe syntax:
-- Start with familiar SQL
SELECT * FROM customers WHERE region = 'APAC'
-- Gradually introduce pipes
from customers | where region = 'APAC'
-- Evolve to leverage SuperDB features
from customers
| where region = 'APAC'
| put revenue_tier := case(
annual_revenue > 1000000 => "enterprise"
annual_revenue > 100000 => "mid-market"
default => "smb"
)
For JSON-heavy workloads, preserve structure during migration:
# Direct JSON ingestion without transformation
super db load -use json_archive existing_mongodb_export.json
# Query nested structures naturally
super db query "from json_archive | where metadata.version > 2"
The SuperDB community provides multiple support channels. The official documentation at https://superdb.org/docs/ offers comprehensive reference material with interactive examples. The GitHub repository at https://github.com/brimdata/super contains source code, issue tracking, and contribution guidelines. The community Slack workspace provides real-time support and discussion forums.
For learning, follow this recommended path: start with the installation guide and basic super command tutorial, explore interactive examples in the documentation, experiment with your own datasets in development, and engage with the community for advanced use cases.
As a pre-GA system, SuperDB has several limitations to consider. Performance optimization continues with active work on vectorization and parallel processing. SQL compatibility is improving toward full Postgres compatibility. Production features like comprehensive monitoring and management tools are still developing.
The development roadmap prioritizes completing vectorized runtime implementation, achieving full SQL standard compliance, improving performance benchmarking and optimization, and expanding enterprise features for production deployments.
Despite current limitations, SuperDB shows remarkable promise for organizations dealing with heterogeneous data. Its unified approach to structured and semi-structured data, combined with powerful query capabilities and active development, positions it as a potentially transformative technology in the analytics database landscape. Senior engineers should evaluate it for development and experimental workloads while monitoring its progress toward production readiness.