I ran into some problems trying to deploy a Vert.x based HTTP service, using Hazelcast as the cluster manager, on AWS ECS Fargate (a container orchestration service where you deploy docker containers without worrying about the underlying virtual machines - kind of like Kubernetes, but simpler perhaps?).
Hazelcast has good documentation on their website on using Hazelcast on AWS ECS, but all of it assumes that you are configuring the cluster with their YAML configuration language - which unfortunately doesn't appear to be supported by Vert.x: setting the system property vertx.hazelcast.config will only take an XML file, which Vert.x will try to parse to configure Hazelcast, and it will choke on a YAML configuration file. The docs also assume you'll host the configuration file on an EFS volume - which is another hassle to set up.
Other public documentation that does use XML configuration doesn't use the Hazelcast AWS plugin, but some other discovery mechanism that is either abandoned or relies on non-trivial external infrastructure.
My requirements are to keep it as simple as possible:
- One docker container with a straight-forward Vert.x application
- I'm using a shaded jar with Vert.x included, but probably any other Vert.x deployment should work.
- Use only supported software for the Hazelcast cluster.
- Vert.x (as of current 4.2.1) uses Hazelcast 4.2.2, and I'm also adding the latest hazelcast-aws library (3.4 as of this writing). Hazelcast 5 will have the AWS support built-in, but I'm not sure when we're going to get that for Vert.x.
- CloudFormation for setting up and managing the cluster
- All configuration has to be in environment variables - no additional files to upload and manage in EFS or S3, as CloudFormation can't do that.
- Side note: Terraform allows you to orchestrate files in storage, but I'm not using it for reasons that are out of scope for this discussion.
And that's it - if I need more setup than a single CloudFormation template, I'll look somewhere else.
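For reference, pulling in the AWS discovery strategy should be a single extra dependency in the POM, next to the usual Vert.x and vertx-hazelcast dependencies - a sketch, using the versions mentioned above:

```xml
<!-- Hazelcast AWS discovery plugin; 3.4 is the version discussed above for Hazelcast 4.2.x -->
<dependency>
  <groupId>com.hazelcast</groupId>
  <artifactId>hazelcast-aws</artifactId>
  <version>3.4</version>
</dependency>
```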
As I've mentioned above, my application is built into a single JAR that can be deployed with any supporting JVM without needing to install the "Vert.x distribution". If you do use a distribution, the configuration will be somewhat different (you might want to base it on the official Vert.x image, though that is currently based on Java 8 - so maybe don't do that either).
The application is built and packaged using Maven, and I'm using maven-shade-plugin
to pack all dependencies (including Vert.x,
Hazelcast and the Hazelcast plugins) into a single JAR and also to create a MANIFEST.MF
file that automatically runs my verticle
when the JAR is "run", using this maven-shade-plugin
transformer configuration:
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
  <manifestEntries>
    <Main-Class>io.vertx.core.Launcher</Main-Class>
    <Main-Verticle>my.main.verticle</Main-Verticle>
    <Version>${project.version}</Version>
  </manifestEntries>
</transformer>
An issue with shading multiple Hazelcast discovery plugins is that the plugins expose themselves using a service loader configuration
file in the JAR META-INF
directory. When shading multiple plugins, each will offer itself in the identically named
META-INF/services/com.hazelcast.spi.discovery.DiscoveryStrategyFactory
configuration file, and shading all of them together will
cause only the first such file to be created in the JAR - and this is often the "Multicast" discovery strategy, as it is built-in
to the main Hazelcast JAR.
When this issue is triggered, you'd get errors in the application log of the form:
There is no discovery strategy factory to create 'DiscoveryStrategyConfig{properties={}, className='com.hazelcast.aws.AwsDiscoveryStrategy', discoveryStrategyFactory=null}'
Hopefully with Hazelcast 5 - where all the plugins are in the same JAR - this won't be an issue, but until then you need to take care of the service files when shading, using the AppendingTransformer:
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/services/com.hazelcast.spi.discovery.DiscoveryStrategyFactory</resource>
</transformer>
This will bunch up all the discovery factory configurations into a single file so that Hazelcast can find all of them.
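If you're curious what the AppendingTransformer actually does, it simply concatenates the identically named service files from each JAR. Roughly, as a sketch (the directory layout and factory class names here are purely illustrative):

```shell
# Simulate two plugin JARs, each shipping a service file with the same name
# (the com.example factory names are made up for illustration)
mkdir -p plugin-a/META-INF/services plugin-b/META-INF/services
svc=com.hazelcast.spi.discovery.DiscoveryStrategyFactory
echo 'com.example.MulticastFactory' > plugin-a/META-INF/services/$svc
echo 'com.example.AwsFactory' > plugin-b/META-INF/services/$svc

# What AppendingTransformer effectively emits into the shaded JAR:
# both factories listed in one file, one per line
cat plugin-a/META-INF/services/$svc plugin-b/META-INF/services/$svc > merged
cat merged
```

You can inspect the real merged file in your shaded JAR with `unzip -p target/your-app-shaded.jar META-INF/services/com.hazelcast.spi.discovery.DiscoveryStrategyFactory` - it should list both the multicast and the AWS factories.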
Because we are already deep in AWS, I'm just using the current LTS Corretto JVM as the base image to package my JAR file, with this Dockerfile:
FROM amazoncorretto:17
RUN yum install -y unzip && rm -rf /var/cache/yum # or other linux deps you want
WORKDIR /srv/app
ADD src/main/resources/docker-entrypoint.sh /docker-entrypoint.sh
ENTRYPOINT [ "/docker-entrypoint.sh" ]
ARG JAR_NAME
ADD target/$JAR_NAME.jar /srv/app/app.jar
I'm using Maven to build the docker image as well, in which case the plugin configuration looks like this (though you may have other fine ideas):
<plugin>
  <groupId>com.spotify</groupId>
  <artifactId>dockerfile-maven-plugin</artifactId>
  <version>1.4.2</version>
  <executions>
    <execution>
      <id>default</id>
      <goals>
        <goal>build</goal>
        <goal>push</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <repository>your.docker.registry.possibly.aws.ecr/your-tag</repository>
    <tag>${project.version}</tag>
    <buildArgs>
      <!-- we use the output from maven-shade-plugin, which should be configured appropriately -->
      <JAR_NAME>${project.artifactId}-${project.version}-shaded</JAR_NAME>
    </buildArgs>
  </configuration>
</plugin>
The docker-entrypoint.sh
script is where the interesting stuff happens:
#!/bin/bash -x

if [ -n "$HZ_CLUSTER_XML" ]; then
  base64 -d <<<"$HZ_CLUSTER_XML" > /cluster.xml
  APP_ARGS=-cluster
  JVM_OPTIONS="$JVM_OPTIONS \
    -Dvertx.hazelcast.config=/cluster.xml \
    -Dhazelcast.http.healthcheck.enabled=true \
    "
  # Hazelcast really wants these things for Java >= 9
  JVM_OPTIONS="$JVM_OPTIONS \
    --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED \
    --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED \
    --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED \
    "
fi

[ -z "$JAVA_HOME" ] && JAVA_HOME=/usr
[ -x "$JAVA_HOME/bin/java" ] || { echo "No Java runtime found!"; exit 5; }

# exec so the JVM becomes PID 1 and receives container stop signals directly
exec $JAVA_HOME/bin/java $JVM_OPTIONS -jar /srv/app/app.jar $APP_ARGS "$@"
The main thing that happens here is that we detect Hazelcast XML configuration content in the HZ_CLUSTER_XML environment variable and if so - set up Hazelcast clustering for Vert.x (if you omit that environment variable, you get non-clustered local SharedData instances that work fine but don't actually share data - it's good for testing).
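To sanity-check the encoding contract between the deployment and the entrypoint, you can reproduce the Base64 round trip locally (the XML here is a trimmed-down stand-in for the real configuration):

```shell
# Encode a minimal stand-in configuration, as the task definition will do with Fn::Base64
HZ_CLUSTER_XML=$(base64 <<'EOF'
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
  <cluster-name>my-cluster</cluster-name>
</hazelcast>
EOF
)

# Decode it the same way docker-entrypoint.sh does (writing to ./cluster.xml here)
base64 -d <<<"$HZ_CLUSTER_XML" > cluster.xml
grep '<cluster-name>' cluster.xml
```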
Now you can create your Fargate cluster using CloudFormation - configuration is code!
I do use YAML for my configuration file as it is saner to read and write than JSON (and XML, but sometimes we can't help it). The configuration example below assumes that more basic entities (such as the VPC and load balancer) have already been configured - I use this snippet as a nested stack in a larger CloudFormation setup that has a lot of other things and a few more standard HTTP services that hook into the load balancer. But it should be easy to extrapolate or even use as is.
AWSTemplateFormatVersion: '2010-09-09'
Description: My Fargate cluster
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: Select a VPC that allows instances access to the Internet.
  VPCipv4Prefix:
    Description: The VPC network IPv4 prefix
    Type: String
  VPCipv6Prefix:
    Description: The VPC network IPv6 prefix
    Type: String
  RouteTable:
    Description: The VPC main routing table where subnets can attach for network access
    Type: String
  DesiredCapacity:
    Type: Number
    Default: 3
    Description: Number of instances to launch in your ECS cluster.
  MaxCapacity:
    Type: Number
    Default: 10
    Description: Maximum number of instances that can be launched in your ECS cluster.
  LoadBalancerListener:
    Description: The public load balancer listener
    Type: String
  LoadBalancerSecurityGroup:
    Description: The load balancer security group reference
    Type: String
  EcrRepository:
    Description: The ECR repository for the service
    Type: String
  ContainerTag:
    Description: 'The image to deploy for the service (default: latest)'
    Type: String
    Default: latest
Resources:
  MyCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: my-cluster
  MyFargateExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: my-ecs-role
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
  MyLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: fargate/my-service # could be anything, I just like prefixes that look like directories
  MyTaskPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
              - logs:DescribeLogStreams
            Resource:
              - arn:aws:logs:*:*:*
          # for hazelcast
          - Effect: Allow
            Action:
              - ec2:DescribeNetworkInterfaces
              - ecs:ListTasks
              - ecs:DescribeTasks
            Resource:
              - "*"
  MyTaskRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: my-task-ecs-role
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - Ref: MyTaskPolicy
  MySecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: My Task Security Group
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - Description: Allow access from load balancer
          IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
  MySecurityGroupSelfIngress: # allow access to myself, so cluster nodes can communicate
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref MySecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref MySecurityGroup
  MySubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      AvailabilityZone: !Select [ 0, !GetAZs { Ref: "AWS::Region" } ]
      CidrBlock: !Sub ${VPCipv4Prefix}.10.0/24
      Ipv6CidrBlock: !Sub "${VPCipv6Prefix}10::/64"
      VpcId: !Ref VpcId
  MySubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      AvailabilityZone: !Select [ 1, !GetAZs { Ref: "AWS::Region" } ]
      CidrBlock: !Sub ${VPCipv4Prefix}.11.0/24
      Ipv6CidrBlock: !Sub "${VPCipv6Prefix}11::/64"
      VpcId: !Ref VpcId
  MySubnetC:
    Type: AWS::EC2::Subnet
    Properties:
      AvailabilityZone: !Select [ 2, !GetAZs { Ref: "AWS::Region" } ]
      CidrBlock: !Sub ${VPCipv4Prefix}.12.0/24
      Ipv6CidrBlock: !Sub "${VPCipv6Prefix}12::/64"
      VpcId: !Ref VpcId
  MySubnetRoutingA:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref RouteTable
      SubnetId: !Ref MySubnetA
  MySubnetRoutingB:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref RouteTable
      SubnetId: !Ref MySubnetB
  MySubnetRoutingC:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref RouteTable
      SubnetId: !Ref MySubnetC
  MyTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 10
      HealthCheckPath: /
      HealthCheckTimeoutSeconds: 5
      UnhealthyThresholdCount: 2
      HealthyThresholdCount: 2
      Name: my-vertx-target-group
      Port: 8080
      Protocol: HTTP
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 60 # default is 300
      TargetType: ip
      VpcId: !Ref VpcId
  MyALBListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      Actions:
        - Type: forward
          TargetGroupArn: !Ref MyTargetGroup
      Conditions:
        - Field: path-pattern
          Values:
            - /my-api/v1
            - /my-api/v1/*
      ListenerArn: !Ref LoadBalancerListener
      Priority: 1
  MyTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    DependsOn:
      - MyLogGroup
    Properties:
      Family: my-task
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: 512
      Memory: 1GB
      ExecutionRoleArn: !Ref MyFargateExecutionRole
      TaskRoleArn: !Ref MyTaskRole
      ContainerDefinitions:
        - Name: my-task-container
          Image: !Sub ${EcrRepository}:${ContainerTag}
          Environment:
            # I find this one useful
            - Name: JVM_OPTIONS
              Value: >
                -XX:+CrashOnOutOfMemoryError
            # configure our cluster!
            - Name: HZ_CLUSTER_XML
              Value:
                Fn::Base64: !Sub |
                  <hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.7.xsd" xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                    <properties>
                      <property name="hazelcast.discovery.enabled">true</property>
                    </properties>
                    <cluster-name>my-cluster</cluster-name>
                    <network>
                      <join>
                        <multicast enabled="false"/>
                        <aws enabled="true"/>
                      </join>
                      <interfaces enabled="true">
                        <interface>${VPCipv4Prefix}.*.*</interface>
                      </interfaces>
                    </network>
                    <multimap name="__vertx.subs">
                      <backup-count>1</backup-count>
                      <value-collection-type>SET</value-collection-type>
                    </multimap>
                    <map name="__vertx.haInfo">
                      <backup-count>1</backup-count>
                    </map>
                    <map name="__vertx.nodeInfo">
                      <backup-count>1</backup-count>
                    </map>
                    <cp-subsystem>
                      <cp-member-count>0</cp-member-count>
                      <semaphores>
                        <semaphore>
                          <name>__vertx.*</name>
                          <jdk-compatible>false</jdk-compatible>
                          <initial-permits>1</initial-permits>
                        </semaphore>
                      </semaphores>
                    </cp-subsystem>
                  </hazelcast>
          PortMappings:
            - ContainerPort: 8080
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref MyLogGroup
              awslogs-stream-prefix: ecs
  MyService:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: my-service
      Cluster: !Ref MyCluster
      TaskDefinition: !Ref MyTaskDefinition
      DeploymentConfiguration:
        MinimumHealthyPercent: 100
        MaximumPercent: 200
      DesiredCount: !Ref DesiredCapacity
      HealthCheckGracePeriodSeconds: 30
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            - !Ref MySubnetA
            - !Ref MySubnetB
            - !Ref MySubnetC
          SecurityGroups:
            - !Ref MySecurityGroup
      LoadBalancers:
        - ContainerPort: 8080
          ContainerName: my-task-container
          TargetGroupArn: !Ref MyTargetGroup
  MyAutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MinCapacity: !Ref DesiredCapacity
      MaxCapacity: !Ref MaxCapacity
      ResourceId: !Sub service/${MyCluster}/${MyService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      # the ECS service-linked role in your own account
      RoleARN: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS
  MyAutoScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: my-autoscaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref MyAutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 20
        ScaleOutCooldown: 20
        TargetValue: 75
The "magic" here is in embedding the Hazelcast XML configuration into the task definition using a Base64 encoded environment variable.
All the rest is just bog-standard CloudFormation for a Fargate cluster - which also took me a while to set up properly, and you are welcome to that as well.