Table of Contents
- Overview
- Service base - language and framework
- Build, test, deploy - CI/CD with Docker
- Cloud Service Providers
- Patterns
- Phases of implementation
This is a general plan and preliminary spec for Updater's Microservices architecture. The overall Microservices system will be responsible for performing much of the business logic, data manipulation (reconciliation, normalization, references), and service request fulfillment.
- Isolate everything
- failures shouldn't bring down other services (Bulkheading)
- use asynchronous messaging between services (events and tasks)
- Single Responsibility Principle
- Do ONE THING, and DO IT WELL
- business domains and responsibilities are not entangled
- LOC stays small; testing is easier
- maintenance and frequent continuous deployments are easier
- Own your state EXCLUSIVELY
- maintain your own persistent data stores, caches, etc
- share anything via message bus events
- Location Transparency
- services can be recycled, relocated, upgraded, or scaled, and may fail at any time
- should be addressable through a virtual network - usually a cluster DNS entry
- addressable as a single unit no matter how many instances exist or where they're located
There are 3 main types of services we'll employ, differentiated by how they're executed and how they communicate with other services.
All service types emit events and tasks, and manage their own state.
Consumes asynchronous events and tasks via message bus.
This is the general, long-running worker, communicating solely by asynchronous messaging via the event bus and queues, processing a single function, business or otherwise.
Responds to upstream synchronous requests for entity data via RESTful, CQRS or GraphQL endpoints.
These serve synchronous requests from upstream systems such as the core API or publicly accessible clients. They lie on the edge of the Microservices ecosystem and require an Authorization header to identify the calling user.
Finite running jobs, launched by a state machine or workflow engine.
These are single tasks grouped into collective systems that complete complex workflows; such a workflow may form a state machine, which may also adhere to the Saga Pattern. They are launched when needed, perform their function, and exit.
Events are published by any service type and follow the naming convention `<entity>.<verb>`, where the verb is past tense. These represent "things that have happened" and are subscribed to by any interested service. Examples:
user.authed
mover.created
address.received
address.reconciled
Any service can subscribe to message topics by their names and process the payload of these events.
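Below is a minimal sketch of how a service might emit and subscribe to such events, assuming RabbitMQ is the broker and the `amqplib` package is used; the `events` exchange name, the queue name, and the payload shape are illustrative assumptions, not part of this spec.

```js
// Hedged sketch: emit and consume events over a RabbitMQ topic exchange.
// Exchange/queue names and payload shape are assumptions for illustration.
const amqp = require('amqplib');

async function emitEvent(name, payload) {
  // name follows the <entity>.<verb> convention, e.g. 'mover.created'
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertExchange('events', 'topic', { durable: true });
  ch.publish('events', name, Buffer.from(JSON.stringify(payload)), { persistent: true });
  await ch.close();
  await conn.close();
}

async function subscribeToEvents(pattern, handler) {
  // pattern can match a topic family, e.g. 'address.*'
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertExchange('events', 'topic', { durable: true });
  const { queue } = await ch.assertQueue('addressing-service.events', { durable: true });
  await ch.bindQueue(queue, 'events', pattern);
  ch.consume(queue, (msg) => {
    handler(msg.fields.routingKey, JSON.parse(msg.content.toString()));
    ch.ack(msg);
  });
}
```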
Tasks are published by any service type and follow the naming convention `<service>.<entity>.<request>`. These represent "things that need to be done" and by which service; this is an asynchronous way of requesting work. Examples:
auth.user.authorize
movers.mover.created
addressing.address.get
addressing.address.reconcile
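As a sketch of the task side, the same broker can carry durable work queues named after the convention above; again the `amqplib` usage, queue options, and handler are assumptions for illustration.

```js
// Hedged sketch: request and perform tasks over a durable work queue.
// Queue names follow <service>.<entity>.<request>, e.g. 'addressing.address.reconcile'.
const amqp = require('amqplib');

async function requestTask(taskName, payload) {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue(taskName, { durable: true });
  ch.sendToQueue(taskName, Buffer.from(JSON.stringify(payload)), { persistent: true });
  await ch.close();
  await conn.close();
}

async function workOn(taskName, handler) {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue(taskName, { durable: true });
  ch.prefetch(1); // one unacknowledged task per worker at a time
  ch.consume(taskName, async (msg) => {
    try {
      await handler(JSON.parse(msg.content.toString()));
      ch.ack(msg); // done: remove the task from the queue
    } catch (err) {
      ch.nack(msg, false, true); // requeue so the backlog stays replayable
    }
  });
}
```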
- excellent tooling, profiling and debugging
- npm hosts > 100k packages
- most devs already know JS
- concurrency historically meant callback hell, but Promises and async generators alleviate this
- mature; JS evolves slowly due to older standard
- inexplicit error handling (throw/catch or vague callbacks)
- performance on the rise, but the dynamism of the runtime can cause hindrances
- many frameworks available
SenecaJS is an application framework that splits business logic into separate, composable blocks that communicate with each other (async or sync) regardless of the communication mechanism between them.
Behind-the-scenes plumbing for transport (MQ or RPC), auth, etc. is abstracted away from the functional building blocks of a Seneca app via plugins.
- Web - map web routes to actions
- Message - map message events to actions
- User - map JWT tokens to identity context
- AWS Lambda - invoke Lambda for actions
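A minimal sketch of a Seneca building block follows; the pattern fields (`role`, `cmd`) and the commented-out AMQP transport plugin are assumptions for illustration, not a prescribed layout.

```js
// Hedged sketch of a Seneca service block: business logic as pattern-matched
// actions, with transport layered on via plugins.
const seneca = require('seneca')();

// Business logic: one composable action, unaware of how it is invoked.
seneca.add({ role: 'addressing', cmd: 'reconcile' }, (msg, reply) => {
  // ... reconcile msg.address against reference data ...
  reply(null, { reconciled: true, address: msg.address });
});

// Plumbing: plugins decide whether this action is reached via web routes,
// message-bus events, Lambda invocations, etc. For example (assumed plugin):
// seneca.use('seneca-amqp-transport');
// seneca.listen({ type: 'amqp', pin: 'role:addressing' });

// Invoking the action looks the same regardless of transport:
seneca.act(
  { role: 'addressing', cmd: 'reconcile', address: '123 Main St' },
  (err, result) => {
    if (err) throw err;
    console.log(result);
  }
);
```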
What would it take for someone to spin up a new service?
This can be done right on a CI build server with multiple `docker-compose.yml` files that extend a common one and build against dependent services, using environment variables to distinguish local, CI server, and production clusters. However, Runnable.com looks like a less fragile way of doing the same while eliminating the CI server.
In-house services shouldn't have a direct dependency on any other running service, only on events from the pub/sub event bus, simplifying the use case where `docker-compose` is leveraged in lieu of Runnable.com.
- install Docker toolset, clone Git repos
- create new Git branch (automatically launching a Preview Environment)
- spin up a local dev server via `docker-compose`, which:
  - attaches the service to a local event bus, database server, and other services (e.g. SMTP) needed to build/test on the local machine
  - local-specific config envs should vary minimally or not at all if kept within the confines of the `docker-compose` environment
- push changes to Git, which automatically:
  - runs unit tests
  - updates/deploys to the Preview Environment
  - notifies Slack, Jira, GitHub PR, etc.
  - runs functional and end-to-end tests
Runnable.com Pros
- Runnable aims to let you run all your End-to-End tests continuously
- cross-team changes can be validated by connecting your environment to other services in co-development
- support is quick to respond
Runnable.com Caveats
- Runnable.com documentation isn't complete
- using `docker-compose.yml` for your Preview Environments requires support assistance (for now)
- supports RabbitMQ only (unless you pay for Enterprise support and install it on AWS)
How would it operate in terms of deployment?
- code review on PR (enforceable via new GitHub PR rules or team convention)
- merge to master, which automatically:
  - runs unit tests
  - updates/deploys to the Preview Environment
  - notifies Slack, JIRA, GitHub PR, etc.
  - runs functional and end-to-end tests
  - triggers the deployment script from a Docker image repository notification
- upon a new image built by a `master` deploy, a webhook calls the deployment script
- Docker stacks and deploys are managed via Docker Cloud + Docker BYOH
For zero downtime Docker container updates, you need:
- at least 3 running container instances per image
- at least 5 compute nodes to spread all containers across
Tools like https://flywaydb.org/ make database migrations easier for developers
- migrate at service startup, fast fail on schema inconsistencies
- able to drop and rebuild entire database in test environments
- able to drop everything but the schemas for clean start during testing
- cluster safe - locking migrations
- does NOT support Drops for rollbacks by design
- https://flywaydb.org/documentation/faq.html#downgrade
- schema migrations shouldn't apply destructive changes alongside the code that depends on them
- use snapshots if rollbacks are needed
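A rough sketch of the "migrate at service startup, fast fail" behaviour, assuming the Flyway CLI is baked into the service's Docker image and already configured (via `flyway.conf` or environment variables); the wrapper below is illustrative, not a prescribed integration.

```js
// Hedged sketch: run Flyway migrations before the service starts serving,
// and fail fast if the schema can't be brought to a consistent state.
const { spawnSync } = require('child_process');

function migrateOrDie() {
  const result = spawnSync('flyway', ['migrate'], { stdio: 'inherit' });
  if (result.status !== 0) {
    console.error('flyway migrate failed - refusing to start the service');
    process.exit(1);
  }
}

migrateOrDie();
// ...only now attach to the message bus / start listening for requests...
```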
- Docker 1.13 Swarm
- Kubernetes - I've had issues w/ inter-container discovery via cluster DNS
- AWS Elastic Container - pricier
Dependencies that each of our services requires but that we don't want to maintain can be outsourced to relevant cloud service providers. These include:
- RDBMS (Heroku Postgres or Google Spanner)
- REDIS K/V (Redis Cloud or AWS ElastiCache)
- SMTP (SendGrid or Mandrill by Mailchimp)
Note: Assuming we host our cluster environment on AWS, it may be beneficial to use as many compatible services from the AWS ecosystem as possible, for performance and cost reasons.
These types of dependent services can be spun up by `docker-compose` or configured within Runnable.com, and seeded for local development and Feature Acceptance testing. Publicly available Docker images that also work for test:
Cloud service providers the cluster itself requires:
- Event Message Bus and Queue - Amazon SNS + SQS or RabbitMQ
  RabbitMQ is AMQP-compliant and would not require updating each service's queue subscription mechanism if we moved cloud providers.
- Cluster Logging (Elastic Cloud or Logit.io)
  Elastic is more mature.
- Metrics and Monitoring (Prometheus.io and Grafana - hosted version coming soon)
  Prometheus is built into Docker 1.13 and provides pull-based metrics.
- Secrets and Config Management (HashiCorp Vault)
  Already being implemented at Updater.
Having every component in the ecosystem adhere to reactive patterns lends all moving parts toward decoupling, failure isolation, scalability, and overall resiliency.
By communicating between services through topics and queues, we isolate services from each other's failures. If a consuming service goes down, the queuing system will keep a replayable backlog.
An Authentication Service is a gateway service, employed to handle user authentication events. Given credentials received by web or mobile clients, this service handles authentication and emits the `user.authed` event with a JWT payload.
Tasks to be performed or events to be handled are done on behalf of an identity, which is authenticated in the case of a non-`system` identity:
- `system` - for maintenance jobs, etc.
- `user-mover` - for movers who are requesting the moving services provided
- `user-client` - for companies operating through dashboards that provide mover data
- `user-business` - for major companies operating through dashboards that provide mover services
If any service performs a request on behalf of an external entity, a JSON Web Token (JWT) will be generated by the Authentication Service and emitted to the Microservices cluster as a `user.authed` event with the JWT as part of the payload.
This token can be used by any service to:
- cache currently authed users in memory
- embed in HTTP headers when accessing external services
- embed in message headers when emitting subsequent events performed on behalf of said user in session
Global roles (as provided by the `user.authed` event) are mapped to authz roles specific to the service handling the event.
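A sketch of what that mapping might look like in a service consuming `user.authed`, assuming the `jsonwebtoken` package; the claim names, role names, and mapping table are illustrative assumptions.

```js
// Hedged sketch: verify the JWT carried by a user.authed event and map the
// global roles in its claims to this service's own authz roles.
const jwt = require('jsonwebtoken');

// Hypothetical mapping of global roles to roles local to this service.
const ROLE_MAP = {
  system: 'address:admin',
  'user-mover': 'address:read',
  'user-client': 'address:read-bulk',
  'user-business': 'address:admin',
};

const authedUsers = new Map(); // in-memory cache of currently authed users

function onUserAuthed(event) {
  // Verify the token before trusting its claims (key and algorithm are assumptions).
  const claims = jwt.verify(event.token, process.env.JWT_PUBLIC_KEY);
  const localRoles = (claims.roles || []).map((r) => ROLE_MAP[r]).filter(Boolean);
  authedUsers.set(claims.sub, { token: event.token, roles: localRoles });
}
```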
All 3 patterns are implemented as synchronous services for upstream systems and clients.
GraphQL and REST provide CRUD operations on resource entities, while CQRS is a pattern to request an action on a resource and receive the result of the action's operation in the response payload.
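To make the distinction concrete, here is a hedged sketch contrasting a REST-style CRUD endpoint with a CQRS-style command endpoint, assuming Express; the routes, payload shapes, and the `loadAddress`/`reconcileAddress` stubs are hypothetical.

```js
// Hedged sketch: REST-style CRUD vs CQRS-style command endpoints in Express.
const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical data-access stubs for illustration.
const loadAddress = async (id) => ({ id, street: '123 Main St' });
const reconcileAddress = async (id, body) => ({ status: 'reconciled', address: { id, ...body } });

// REST/GraphQL style: CRUD on a resource entity.
app.get('/addresses/:id', async (req, res) => {
  const address = await loadAddress(req.params.id);
  res.json(address);
});

// CQRS style: request an action on a resource; the response carries the
// result of that action's operation.
app.post('/addresses/:id/commands/reconcile', async (req, res) => {
  const result = await reconcileAddress(req.params.id, req.body);
  res.json({ status: result.status, address: result.address });
});

app.listen(3000);
```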
For long-running business transactions (hours, days) that pass through multiple services, we may have a business need to handle them like a state machine, with accompanying rules to roll back or fail all of them. Each change at each service will have to maintain reversal rules, and a broadcast message upon business transaction success or failure will need to be designed to trigger those rules upon failure (see the sketch after the list below).
https://medium.com/@roman01la/confusion-about-saga-pattern-bbaac56e622
The state machine mechanics can be managed by frameworks like:
- Netflix Conductor or
- AWS Step-Functions - visual workflow
and dispatched on one-time-use compute nodes such as:
- AWS Lambda
- Iron Workers
- or internal cluster managed tasks as dispatched by Swarm or Kubernetes
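As a sketch of the compensation idea described above, each step of a saga can be paired with a reversal rule and rolled back in reverse order on failure; the steps and context here are made up, and in practice the orchestration would live in Conductor, Step Functions, or a cluster-managed job rather than inline code.

```js
// Hedged sketch: a saga as a list of steps, each paired with a reversal rule.
// On failure, completed steps are undone in reverse order and a failure event
// would be broadcast for interested services.
async function runSaga(steps, ctx) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.run(ctx);
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        await done.undo(ctx); // reversal rule maintained by the owning service
      }
      // e.g. emit a 'move.transaction.failed' event here
      return { ok: false, reason: err.message };
    }
  }
  // e.g. emit a 'move.transaction.completed' event here
  return { ok: true };
}

// Usage with made-up steps:
runSaga(
  [
    { run: async () => { /* reserve move date */ }, undo: async () => { /* release it */ } },
    { run: async () => { /* charge deposit */ }, undo: async () => { /* refund it */ } },
  ],
  { moverId: 42 }
);
```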
- automated testing, deploying, rolling image updates
- set up configuration/environment management (Vault or otherwise)
- cluster monitoring, logging, metrics, alerts with dashboard(s)
- design message bus event and task specs, health checks, and reporting checkpoints
- set up the pub/sub event bus
- Docker base images for service types and triggered jobs
- coding conventions; choose linting libraries to enforce them
- test coverage requirements, set up code coverage plugins accordingly
- decide on testing frameworks for unit, functional, and e2e
- decide on persistence, caching, HTTP endpoints for reporting, management
- build services
- optimize for HA, performance