Hazelcast Platform Architecture Guide

Hazelcast Platform 5.x has emerged as the definitive solution for distributed computing architectures, combining in-memory data grid capabilities with stream processing, microservices support, and enterprise-grade security in a unified platform. This comprehensive analysis reveals how organizations across industries are leveraging Hazelcast's capabilities to build resilient, high-performance systems that scale from startup applications to mission-critical enterprise deployments processing billions of events per second.

The platform's evolution to a unified architecture in version 5.x eliminates the complexity of managing separate systems for caching, stream processing, and messaging, while introducing breakthrough capabilities like Thread-Per-Core (TPC) architecture that delivers linear scaling and vector search for AI workloads. Real-world implementations demonstrate performance improvements of up to 200x over traditional approaches, with organizations like BNP Paribas achieving 84,000 requests per second at 0.8ms average response time.

Distributed caching architectures and patterns

Modern caching architectures must balance performance, consistency, and operational complexity. Hazelcast's distributed caching capabilities provide multiple deployment topologies optimized for different use cases, from embedded mode delivering microsecond latencies to client-server architectures supporting polyglot microservices environments.

The embedded topology co-locates application logic with cache data in the same JVM, eliminating network latency entirely and providing the fastest possible data access. This approach proves ideal for latency-sensitive applications where microsecond performance matters, such as high-frequency trading systems and real-time fraud detection. However, it limits deployments to Java applications and increases memory footprint per application instance.
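A minimal sketch of the embedded topology: the member starts inside the application JVM, so map access never leaves the process boundary. The cluster name and map contents below are illustrative placeholders.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class EmbeddedCacheExample {
    public static void main(String[] args) {
        // The member runs inside the application JVM, so reads are served from
        // local or cluster memory without a separate client-to-cluster network hop.
        Config config = new Config();
        config.setClusterName("orders-cluster"); // hypothetical cluster name

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, String> quotes = hz.getMap("quotes");
        quotes.put("AAPL", "189.50");
        System.out.println(quotes.get("AAPL"));

        hz.shutdown();
    }
}
```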

Client-server topology separates application clients from dedicated Hazelcast clusters, enabling independent scaling, polyglot language support, and resource isolation. While introducing 1-2ms network latency overhead, this architecture supports complex microservices environments where different services may require different scaling characteristics. The near cache enhancement bridges this gap by providing client-side LRU caching that delivers microsecond response times while maintaining cluster-wide consistency through automatic invalidation.
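A hedged sketch of a client-server setup with near cache enabled for one map; the cluster name, member address, and map name are hypothetical.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ClientNearCacheExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.setClusterName("orders-cluster");               // hypothetical cluster name
        clientConfig.getNetworkConfig().addAddress("10.0.0.1:5701"); // hypothetical member address

        // The near cache keeps recently read entries on the client; the cluster
        // invalidates stale copies automatically when the owning member changes them.
        NearCacheConfig nearCache = new NearCacheConfig("products")
                .setInMemoryFormat(InMemoryFormat.OBJECT)
                .setInvalidateOnChange(true);
        clientConfig.addNearCacheConfig(nearCache);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<Long, String> products = client.getMap("products");
        products.get(42L); // first read goes to the cluster; repeated reads are served locally
        client.shutdown();
    }
}
```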

Cache access patterns determine both performance characteristics and architectural complexity. The cache-aside pattern places full control in application hands, making it suitable for legacy system integration and gradual cache adoption. Applications manually manage cache population and invalidation, trading implementation complexity for maximum flexibility in cache provider selection and data consistency management.
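The cache-aside flow might look like the following sketch, where CustomerRepository and Customer are hypothetical stand-ins for the real persistence layer.

```java
import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class CustomerService {
    private final HazelcastInstance hazelcast;
    private final CustomerRepository repository; // hypothetical DAO for the backing database

    public CustomerService(HazelcastInstance hazelcast, CustomerRepository repository) {
        this.hazelcast = hazelcast;
        this.repository = repository;
    }

    // Cache-aside read: check the cache first, fall back to the database on a miss,
    // then populate the cache with a TTL so stale entries age out.
    public Customer findCustomer(String id) {
        IMap<String, Customer> cache = hazelcast.getMap("customers");
        Customer cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        Customer loaded = repository.findById(id);
        if (loaded != null) {
            cache.set(id, loaded, 10, TimeUnit.MINUTES);
        }
        return loaded;
    }

    // Cache-aside write: update the database, then invalidate so the next read reloads.
    public void updateCustomer(Customer customer) {
        repository.save(customer);
        hazelcast.getMap("customers").delete(customer.getId());
    }
}
```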

Read-through and write-through patterns shift responsibility to the cache provider through Hazelcast's MapLoader and MapStore interfaces. Read-through automatically populates the cache on a miss, while write-through updates both the cache and the backend store synchronously. These patterns keep persistence concerns out of application code and provide the strong consistency guarantees needed for financial applications and other mission-critical data integrity scenarios.
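A read-through/write-through store could be implemented roughly as follows; OrderDao and Order are hypothetical placeholders for the actual database access code.

```java
import java.util.Collection;
import java.util.Map;
import com.hazelcast.map.MapStore;

// Hazelcast calls load() on a cache miss (read-through) and store() synchronously
// on map.put() when the write delay is zero (write-through).
public class OrderMapStore implements MapStore<Long, Order> {

    private final OrderDao dao = new OrderDao(); // hypothetical DAO

    @Override
    public Order load(Long key) {                 // read-through on cache miss
        return dao.findById(key);
    }

    @Override
    public Map<Long, Order> loadAll(Collection<Long> keys) {
        return dao.findByIds(keys);
    }

    @Override
    public Iterable<Long> loadAllKeys() {         // return null to skip eager pre-loading
        return null;
    }

    @Override
    public void store(Long key, Order value) {    // write-through on map.put()
        dao.upsert(key, value);
    }

    @Override
    public void storeAll(Map<Long, Order> entries) {
        entries.forEach(dao::upsert);
    }

    @Override
    public void delete(Long key) {
        dao.deleteById(key);
    }

    @Override
    public void deleteAll(Collection<Long> keys) {
        keys.forEach(dao::deleteById);
    }
}
```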

Write-behind patterns provide significant performance improvements for write-heavy workloads through asynchronous backend updates. Configuration parameters like write-delay-seconds > 0 enable batching strategies that dramatically improve throughput while accepting eventual consistency trade-offs. Gaming platforms and social media applications benefit from this approach, though potential data loss during system failures requires careful consideration.
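Switching the same store to write-behind is a configuration change rather than a code change; the class name and tuning values below are illustrative.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;

public class WriteBehindConfigExample {
    public static Config writeBehindConfig() {
        // write-delay-seconds > 0 switches the MapStore from write-through to
        // write-behind: updates are buffered and flushed asynchronously in batches.
        MapStoreConfig storeConfig = new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.example.OrderMapStore") // hypothetical MapStore class
                .setWriteDelaySeconds(5)                   // flush at most every 5 seconds
                .setWriteBatchSize(1000)                   // in batches of up to 1000 entries
                .setWriteCoalescing(true);                 // keep only the latest update per key

        MapConfig mapConfig = new MapConfig("orders").setMapStoreConfig(storeConfig);
        return new Config().addMapConfig(mapConfig);
    }
}
```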

Multi-tier caching strategies create layered approaches combining local caches (L1), near cache (L2), distributed cache (L3), and persistent stores (L4). This architecture optimizes for both performance and cost by placing frequently accessed data in faster storage tiers while maintaining comprehensive data availability. CDN integration extends this pattern to edge locations, combining HTTP-level caching for static content with near cache for dynamic, personalized content delivery.

Transaction management and database integration

Distributed transaction management requires careful balance between consistency guarantees and performance characteristics. Hazelcast Platform 5.x provides multiple transaction consistency models tailored to different business requirements and performance constraints.

ONE_PHASE transactions offer local tracking with commit logs suitable for non-critical scenarios accepting eventual consistency. This lightweight approach minimizes coordination overhead while maintaining basic durability guarantees. TWO_PHASE transactions enhance durability through commit log replication across cluster members, following traditional distributed transaction protocols with prepare and commit phases. This approach ensures ACID compliance essential for production financial systems and mission-critical applications.
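A minimal sketch of a TWO_PHASE transaction moving a balance between two entries; the map name and amounts are illustrative, and both accounts are assumed to already exist.

```java
import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.transaction.TransactionContext;
import com.hazelcast.transaction.TransactionOptions;
import com.hazelcast.transaction.TransactionalMap;

public class TransferExample {
    // TWO_PHASE: the commit log is replicated to backup members during the prepare
    // phase, so a member failure between prepare and commit can be recovered.
    public static void transfer(HazelcastInstance hz, String from, String to, long amount) {
        TransactionOptions options = new TransactionOptions()
                .setTransactionType(TransactionOptions.TransactionType.TWO_PHASE)
                .setTimeout(30, TimeUnit.SECONDS);

        TransactionContext context = hz.newTransactionContext(options);
        context.beginTransaction();
        try {
            TransactionalMap<String, Long> accounts = context.getMap("accounts");
            long source = accounts.getForUpdate(from); // locks the entry for this transaction
            long target = accounts.getForUpdate(to);
            if (source < amount) {
                throw new IllegalStateException("insufficient funds");
            }
            accounts.put(from, source - amount);
            accounts.put(to, target + amount);
            context.commitTransaction();
        } catch (RuntimeException e) {
            context.rollbackTransaction();
            throw e;
        }
    }
}
```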

XA transaction support provides full distributed transaction coordination across multiple resource managers, enabling integration with Java EE containers and complex multi-database scenarios. The implementation includes automatic coordinator selection, transaction ID pooling to handle connector limitations, and idempotent commit operations supporting failure recovery scenarios.

Event sourcing and CQRS patterns leverage Hazelcast's distributed capabilities for modern architectural approaches. Events stored as immutable state changes in distributed event stores enable automatic replay for aggregate reconstruction while providing built-in auditing capabilities. The platform's pub/sub messaging supports both orchestration-based saga patterns with central coordinators and choreography-based approaches using event-driven coordination between microservices.

Database integration through MapStore and MapLoader interfaces provides comprehensive connectivity patterns supporting both relational and NoSQL databases. The Generic MapStore offers zero-code configuration for standard database operations, while custom implementations enable sophisticated business logic integration. MapStore offloading in Platform 5.2+ runs store and load operations on dedicated executors so that slow database calls do not degrade cluster performance, achieving up to 45% better write performance.

Change Data Capture (CDC) integration using Debezium 2.7.x engine provides real-time database synchronization with support for MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, and Cassandra. This approach enables event-driven architectures without modifying existing systems, processing millions of events per second with guaranteed at-least-once delivery and offset management for fault tolerance.

Performance characteristics vary significantly across patterns: write-through with TWO_PHASE transactions prioritizes strong consistency over performance, while write-behind patterns with CDC integration optimize for high throughput with eventual consistency. The choice depends on business requirements, with financial systems typically choosing consistency while gaming and social platforms optimize for performance.

Microservices integration and event processing

Service mesh integration represents a crucial capability for modern microservices architectures. Hazelcast successfully integrates with Istio service mesh through careful configuration of TCP port 5701 handling and service discovery coordination. The implementation supports both embedded and client-server topologies with automatic sidecar injection and mutual TLS authentication.

Key implementation patterns include headless Kubernetes services for Hazelcast discovery, proper RBAC configuration for Kubernetes API access, and careful port naming conventions (tcp-hazelcast) for protocol detection. Network partitioning challenges in mesh environments require attention to load balancing complexities with stateful services and coordination between Hazelcast and service mesh discovery mechanisms.
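A member configuration for Kubernetes discovery might look like the following sketch, assuming a headless service named hazelcast-service in a production namespace; both names are placeholders.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;

public class KubernetesDiscoveryExample {
    public static Config kubernetesConfig() {
        Config config = new Config();
        config.setClusterName("orders-cluster");             // hypothetical cluster name

        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);          // multicast is unavailable in most clusters
        join.getKubernetesConfig()
            .setEnabled(true)
            .setProperty("namespace", "production")           // hypothetical namespace
            .setProperty("service-name", "hazelcast-service"); // headless service used for discovery
        return config;
    }
}
```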

API gateway caching patterns provide multiple architectural approaches optimized for different deployment scenarios. The reverse proxy cache pattern positions cache between gateway and services for transparent HTTP-level caching. Sidecar cache patterns deploy cache as container companions in Kubernetes pods with shared network namespaces. Embedded cache patterns integrate directly within gateway applications for lowest latency data access.

Inter-service communication leverages Hazelcast's messaging capabilities for event-driven architectures. Topics provide pub/sub messaging with reliable delivery backed by Ringbuffers, while queues enable point-to-point message delivery with ordering guarantees. The aggregator message pattern coordinates HTTP requests in load-balanced environments using distributed locks and CompletableFuture for asynchronous processing with timeout mechanisms.
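A short sketch of both messaging styles; the topic, queue, and payload names are hypothetical.

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.topic.ITopic;
import com.hazelcast.topic.Message;

public class OrderEventsExample {
    public static void wire(HazelcastInstance hz) {
        // Reliable topics are backed by a Ringbuffer, so slower subscribers can
        // catch up without losing messages as long as the buffer has not wrapped.
        ITopic<String> orderEvents = hz.getReliableTopic("order-events");
        orderEvents.addMessageListener((Message<String> message) ->
                System.out.println("received: " + message.getMessageObject()));
        orderEvents.publish("ORDER_CREATED:42"); // hypothetical event payload

        // Point-to-point delivery: exactly one consumer takes each queued item.
        hz.getQueue("order-commands").offer("SHIP:42");
    }
}
```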

Spring Boot integration provides comprehensive auto-configuration support for both client and embedded modes. The framework automatically configures HazelcastInstance beans with environment-specific settings via application.yml, while @EnableCaching annotations enable declarative caching patterns. Dependency management through spring-integration-hazelcast enables event-driven channel adapters and executor-based message channels.
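A minimal Spring Boot sketch assuming spring-boot-starter-cache and Hazelcast are on the classpath so auto-configuration supplies the HazelcastInstance; ProductService and its stub loader are hypothetical.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.stereotype.Service;

// With Hazelcast on the classpath (plus hazelcast.yaml or hazelcast-client.yaml),
// Spring Boot auto-configures a HazelcastInstance and a Hazelcast-backed CacheManager.
@SpringBootApplication
@EnableCaching
public class CatalogApplication {
    public static void main(String[] args) {
        SpringApplication.run(CatalogApplication.class, args);
    }
}

@Service
class ProductService {
    // The result is stored in the distributed "products" map; repeated calls with
    // the same id are served from the cache instead of hitting the backing store.
    @Cacheable("products")
    public String findProduct(long id) {
        return loadFromDatabase(id);
    }

    private String loadFromDatabase(long id) {
        return "product-" + id; // hypothetical stub for a real repository call
    }
}
```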

Real-time event processing through the integrated Jet engine delivers industry-leading performance, with a single node processing 10 million events per second at sub-10ms latencies. Cluster deployments scale to billions of events per second with guaranteed ordering and exactly-once processing semantics. CQRS implementations separate command and query responsibilities through event stores, event buses using Hazelcast topics, and snapshot optimizations for query performance.

Stream processing pipelines support complex event-driven scenarios including fraud detection with real-time scoring, IoT sensor data processing with pattern recognition, and real-time analytics with windowing operations. The platform's cooperative multithreading approach optimizes resource utilization while maintaining millisecond-level latency guarantees essential for responsive applications.
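A small windowed-aggregation pipeline sketch using the built-in test source; a production pipeline would read from Kafka, CDC, or a map journal instead. Note that the Jet engine is disabled by default in Platform 5.x and must be enabled explicitly.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import com.hazelcast.jet.pipeline.test.TestSources;

public class WindowedCountExample {
    public static void main(String[] args) {
        // Counts synthetic events in one-second tumbling windows and logs the results.
        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(TestSources.itemStream(1_000))   // ~1,000 generated events/second
                .withNativeTimestamps(0)
                .window(WindowDefinition.tumbling(1_000))  // 1-second windows
                .aggregate(AggregateOperations.counting())
                .writeTo(Sinks.logger());

        Config config = new Config();
        config.getJetConfig().setEnabled(true);            // Jet engine is off by default in 5.x
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getJet().newJob(pipeline).join();
    }
}
```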

High availability and cloud deployment

Multi-data center deployments require sophisticated replication strategies balancing performance, consistency, and operational complexity. WAN replication supports both active-passive failover scenarios and active-active geographic distribution with configurable consistency models. The WanBatchReplication mechanism optimizes network utilization through configurable count and time-based thresholds, while Delta Synchronization using Merkle Trees minimizes bandwidth by replicating only changed entries.

Kubernetes integration through the Platform Operator 5.15+ provides enterprise-grade lifecycle management with Custom Resource Definitions enabling Kubernetes-native cluster management. The operator handles configuration, scaling, and recovery automatically while supporting WAN replication across regions and simplified persistence management. Deployment methods range from production-ready operator installations to Helm charts for development scenarios.

Cloud provider integrations span AWS EKS/EC2/ECS with zone-aware deployment, Azure AKS/VM Scale Sets with auto-scaling, and Google Cloud GKE/Compute Engine with service account authentication. Each integration includes native discovery plugins (hazelcast-aws, hazelcast-azure, hazelcast-gcp) and supports auto-scaling based on both traditional CPU/memory metrics and custom Hazelcast-specific metrics like memory utilization and request rates.

Container strategies leverage official Docker images available on Docker Hub with Alpine Linux bases for minimal footprint. Production configurations emphasize StatefulSets for persistent identity, headless services for discovery, and proper resource limits preventing over-provisioning. Auto-scaling configurations combine Horizontal Pod Autoscaler with custom metrics for memory utilization (80% scale-up threshold) and CPU-based scaling for compute-intensive workloads.

Cost optimization strategies include Hazelcast Cloud Standard's pay-as-you-go pricing with automatic scaling and cluster pausing during inactivity. Self-managed optimizations leverage spot instances for non-critical workloads, scheduled scaling during off-peak hours, and memory-optimized instance types for cache-heavy applications. Resource planning follows the formula: Total Memory = Active Data × (1 + Backup Count) × 1.4 overhead factor.

Security and compliance frameworks

Enterprise security requires comprehensive authentication, authorization, and encryption capabilities. Hazelcast's JAAS-based security framework supports LDAP integration for enterprise directory services, X.509 certificate authentication for mutual TLS, and custom authentication protocols through socket interceptors. Role-Based Access Control (RBAC) provides fine-grained permissions on data structures with configurable client authorization policies and endpoint-based network access controls.

Data protection encompasses both encryption at rest and in transit. Hot Restart Persistence supports configurable disk encryption for persisted data, while application-level encryption enables custom serialization before caching. Cloud provider integration supports AWS KMS, Azure Key Vault, and GCP KMS for centralized key management. TLS/SSL configuration enables end-to-end encryption for member-to-member and client-to-member communication with mutual authentication, configurable cipher suites, and secure WAN replication across data centers.
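TLS configuration is an Enterprise feature; a member-side setup might look like the following sketch, with keystore paths and passwords as placeholders.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.SSLConfig;

public class TlsConfigExample {
    // Enables TLS for member-to-member and client-to-member traffic with mutual
    // authentication; all paths and passwords below are hypothetical placeholders.
    public static Config tlsConfig() {
        SSLConfig sslConfig = new SSLConfig().setEnabled(true);
        sslConfig.setProperty("keyStore", "/opt/hazelcast/certs/member.keystore");
        sslConfig.setProperty("keyStorePassword", "changeit");
        sslConfig.setProperty("trustStore", "/opt/hazelcast/certs/member.truststore");
        sslConfig.setProperty("trustStorePassword", "changeit");
        sslConfig.setProperty("mutualAuthentication", "REQUIRED"); // enforce mutual TLS

        Config config = new Config();
        config.getNetworkConfig().setSSLConfig(sslConfig);
        return config;
    }
}
```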

Compliance framework support addresses multiple regulatory requirements through comprehensive audit logging and controls. GDPR compliance includes data subject rights implementation, detailed processing records, privacy-by-design features, and configurable retention policies. SOC 2 compliance encompasses security controls through multi-factor authentication, availability controls via high availability monitoring, confidentiality through access restrictions, and processing integrity with transaction logs. HIPAA compliance implements required administrative, physical, and technical safeguards.

Audit logging captures comprehensive security events including authentication attempts, authorization decisions, data access patterns, administrative actions, and security policy changes. The network security architecture emphasizes segmentation through isolated subnets, restrictive firewall policies, VPC peering for secure connectivity, and service mesh integration for zero-trust networking principles.

Performance optimization and scaling strategies

Thread-Per-Core (TPC) architecture represents a fundamental advancement in distributed system design, moving beyond traditional SEDA approaches that suffer from context switching overhead and thread parking/unparking bottlenecks. TPC assigns one thread per CPU core handling all operations non-blockingly, delivering linear scaling with core count and up to 10x performance improvement on hardware whose core count does not match the 8-core assumptions of the traditional threading model.

Memory management strategies balance performance, capacity, and operational complexity. High-Density Memory Store (HDMS) eliminates garbage collection pressure through off-heap storage supporting terabytes of data per member. Memory planning formulas account for backup counts and overhead: memory requirements increase from 100% for backup count 0 to 200% for backup count 1, with 40% failure headroom recommended for production deployments.
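A worked example of the sizing rule for a hypothetical 100 GB working set with one synchronous backup and the recommended 40% headroom.

```java
public class MemoryPlanning {
    public static void main(String[] args) {
        double activeDataGb = 100;      // hypothetical primary data volume
        int backupCount = 1;            // one synchronous backup copy
        double overheadFactor = 1.4;    // 40% headroom for failures and overhead

        // Total Memory = Active Data x (1 + Backup Count) x overhead factor
        double totalGb = activeDataGb * (1 + backupCount) * overheadFactor;
        System.out.printf("Cluster memory needed: %.0f GB%n", totalGb); // 280 GB
    }
}
```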

Network optimization requires attention to both hardware and software configurations. TCP buffer tuning follows the formula: Buffer Size = Round Trip Time × Network Bandwidth, with Hazelcast defaults of 128KB typically requiring adjustment for high-throughput scenarios. Hardware recommendations emphasize dedicated 10Gbps+ NICs per member, uniform cluster hardware preventing bottlenecks, and single-member-per-machine deployment avoiding context switching overhead.
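A worked example of the bandwidth-delay product for a hypothetical 10 Gbps link with 1 ms round-trip time, showing why the 128KB default often needs raising.

```java
public class TcpBufferSizing {
    public static void main(String[] args) {
        double bandwidthBitsPerSec = 10_000_000_000d; // 10 Gbps link
        double rttSeconds = 0.001;                    // 1 ms round-trip time

        // Buffer Size = Round Trip Time x Network Bandwidth (converted to bytes)
        double bufferBytes = (bandwidthBitsPerSec / 8) * rttSeconds;
        System.out.printf("Recommended buffer: %.0f KB%n", bufferBytes / 1024); // ~1221 KB
    }
}
```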

Cluster sizing guidelines establish baseline hardware requirements: minimum 8 CPU cores, 16GB RAM, and 10Gbps network connectivity, with AWS c5.2xlarge representing reference architecture standards. Capacity planning parameters include active data size, backup requirements, query patterns, concurrent processing needs, and high availability requirements determining final cluster configuration.

Performance benchmarking demonstrates competitive advantages: Hazelcast shows 56% better performance than Redis without near cache and 5x improvement with near cache enabled. The platform's multi-threaded architecture scales linearly beyond 32 concurrent threads, where Redis's single-threaded design creates bottlenecks. Streaming performance scales to a billion events per second across a cluster, with 99.99th-percentile latency under 16ms at 1M events/second throughput.

Industry applications and business impact

Financial services implementations demonstrate mission-critical capabilities across trading, risk management, and fraud detection scenarios. Major investment banks leverage Hazelcast as centralized APIs for front, middle, and back-office systems providing golden source market data with microsecond latencies. NYC investment banks process hundreds of thousands of messages per second for margin systems, settlement systems, and liquidity management, while London banks manage external FX brokerages with 99.9% uptime over multi-year periods.

Fraud detection systems process millions of payment transactions per second, with major credit card providers reducing annual fraud write-downs by $100 million through enhanced accuracy from similarity search and composite scoring algorithms. These systems manage 2TB+ customer data with projected scaling to 5TB using WAN replication for operational continuity across regional data centers.

E-commerce platforms achieve remarkable scale improvements: a global retailer with $18.3 billion annual sales scaled from hundreds of nodes to 6 servers with 168GB total cache space while maintaining sub-second personalized shopping experiences. The platform handles extreme burst traffic during product launches and Black Friday events through elastic scaling and real-time inventory management across multiple locations.

Gaming industry implementations like Gamesys demonstrate embedded architecture advantages for real money gaming platforms. The system handles tens of thousands of concurrent players with zero data loss guarantees through fast read/write operations, proportional scaling, and message broker capabilities via Topics. Rolling upgrades achieve zero downtime while reducing complexity compared to NoSQL/RDBMS alternatives.

Banking modernization projects showcase transformation potential: BNP Paribas Bank Polska achieved 12x faster data reconciliation (20 minutes vs 4 hours), 200x performance improvement in internet banking over SOA approaches, and production performance of 84,000 requests/second with 0.8ms average response time. The implementation maintained zero downtime across two years with automatic network partition handling.

Migration strategies and modernization approaches

Legacy system modernization follows proven methodologies balancing business continuity with technical advancement. Gradual migration strategies progress through assessment, proof-of-concept development, pilot implementation, phased rollout, and full deployment with comprehensive rollback capabilities. The BNP Paribas migration exemplified this approach, using weekend migration windows to minimize business impact while keeping core systems stable behind modern abstraction layers.

Data Migration Tool (DMT) capabilities support comprehensive migration scenarios from IMDG 4.x/5.x to Platform 5.3.x+ across infrastructure types including on-premise to cloud and Community to Enterprise Edition transitions. The three-cluster architecture approach (source, target, migration) ensures data integrity while providing built-in rollback procedures for risk mitigation.

Performance testing methodologies leverage Hazelcast Simulator for stress testing, benchmarking, and load generation with configurable workloads matching realistic scenarios. Testing frameworks include JUnit integration, pipeline-based assertions with timeout handling, and fault injection capabilities simulating network failures and member crashes. These tools validate performance improvements and ensure production readiness.

Risk mitigation strategies encompass operational and technical considerations through automated backup and recovery, split-brain protection, rolling upgrades, and multi-data center WAN replication. Technical risk mitigation includes graceful degradation mechanisms, circuit breaker patterns, comprehensive health monitoring, and established performance baselines with continuous SLA tracking.

Decision frameworks and architectural guidance

Topology selection depends on performance requirements and operational constraints. Choose embedded mode for ultra-low latency requirements under 1ms in Java-only environments where tight application coupling is acceptable. Select client-server architecture for microservices environments requiring polyglot language support, independent service lifecycle management, and centralized data management where 1-2ms network latency is acceptable.

Caching pattern selection follows business requirement priorities. Implement cache-aside for legacy integration and gradual adoption scenarios, read-through for new applications with simplified architecture requirements, write-through for strong consistency needs, and write-behind for high write throughput with acceptable eventual consistency. Refresh-ahead patterns suit predictable access patterns with zero cache miss tolerance.

Transaction pattern decisions balance consistency requirements with performance needs. Choose ONE_PHASE for applications tolerating eventual consistency prioritizing performance, TWO_PHASE for strong consistency requirements with multiple external systems, and XA transactions for Java EE container integration with multiple transactional resources requiring cross-platform coordination.

Scaling and deployment strategies consider both current requirements and future growth projections. Memory management recommendations include HDMS for datasets exceeding 100GB, appropriate eviction policies (LRU, TTL), efficient serialization formats (Avro, Protocol Buffers), and near cache implementation for read-heavy workloads. Operational excellence requires Infrastructure as Code deployments, comprehensive monitoring with alerting, established runbooks, and capacity management based on application-specific metrics.

Conclusion

This comprehensive architectural analysis demonstrates Hazelcast Platform 5.x's capability to serve as the foundation for modern distributed applications across industries, providing unified solutions for caching, stream processing, microservices integration, and data management while maintaining enterprise-grade security and operational excellence standards. The platform's evolution toward simplified deployment models, enhanced performance through TPC architecture, and comprehensive cloud-native integration positions it as the definitive choice for organizations building next-generation distributed systems that demand both performance and reliability at scale.
