Microservices Architecture for Updater


Overview

This is a general plan and preliminary spec for Updater's Microservices architecture. The overall Microservices system will be responsible for performing much of the business logic, data manipulation (reconciliation, normalization, references), and service request fulfillment.

Rules

  • Isolate everything
    • failures shouldn't bring down other services (Bulkheading)
    • use asynchronous messaging between services (events and tasks)
  • Single Responsibility Principle
    • Do ONE THING, and DO IT WELL
    • business domains and responsibilities are not entangled
    • LOC stays small; testing is easier
    • maintenance and frequent continuous deployments are easier
  • Own your state EXCLUSIVELY
    • maintain your own persistent data stores, caches, etc
    • share anything via message bus events
  • Location Transparency
    • services can be recycled, relocated, upgraded, or scaled, and may fail, without callers needing to know
    • should be addressable through a virtual network - usually a cluster DNS entry
    • addressable as a single unit no matter how many instances exist or where they run

Service types

There are 3 main types of services we'll employ, differentiated by how they're executed and how they communicate with other services.

All service types emit events and tasks, and manage their own state.

worker

Consumes asynchronous events and tasks via message bus.

This is the general, long-running worker, communicating solely by asynchronous messaging via the event bus and queues, processing a single function, business or otherwise.

gateway

Responds to upstream synchronous requests for entity data via RESTful, CQRS or GraphQL endpoints.

These services handle synchronous requests from upstream systems such as the core API or publicly accessible clients. They lie on the edge of the Microservices ecosystem and require an Authorization header to identify the calling user.

step

Finite-running jobs, launched by a state machine or workflow engine.

These are single tasks grouped into collective systems that complete complex workflows; such a workflow may be modeled as a state machine and may adhere to the Saga Pattern. Steps are launched when needed, perform their function, and exit.
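
As a minimal sketch (not a prescribed implementation), a step could be a short-lived Node script that reads its input, performs one unit of work, and exits with a status code the workflow engine can act on. The STEP_PAYLOAD env var and doWork function below are hypothetical.

```js
// step: one finite unit of work, launched by a workflow engine, then exits.
// The payload shape and doWork() are hypothetical placeholders.
async function doWork(payload) {
  // ... perform the single task this step is responsible for ...
  return { ok: true, input: payload };
}

async function main() {
  const payload = JSON.parse(process.env.STEP_PAYLOAD || '{}');
  try {
    const result = await doWork(payload);
    console.log(JSON.stringify(result));
    process.exit(0); // success -> the workflow engine advances the state machine
  } catch (err) {
    console.error(err);
    process.exit(1); // failure -> the workflow engine can retry or compensate
  }
}

main();
```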

Message types

events

Events are published by any service type and follow the naming convention <entity>.<verb>, where the verb is past tense. These represent "things that have happened" and are subscribed to by any interested service. Examples:

  • user.authed
  • mover.created
  • address.received
  • address.reconciled

Any service can subscribe to message topics by their names and process the payload of these events.
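
For illustration, here is a minimal sketch of publishing and subscribing to such events over RabbitMQ using the amqplib package; the exchange name 'events', the queue name, and the payload shape are assumptions, not part of the spec.

```js
// Minimal sketch: publish and subscribe to events over RabbitMQ,
// assuming amqplib and a topic exchange named 'events'.
const amqp = require('amqplib');

async function main() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertExchange('events', 'topic', { durable: true });

  // Publish a "thing that has happened" using the <entity>.<verb> convention.
  ch.publish('events', 'mover.created',
    Buffer.from(JSON.stringify({ moverId: '123' })), { persistent: true });

  // Subscribe: an interested service binds its own queue to the topics it cares about.
  const { queue } = await ch.assertQueue('addressing.mover-events', { durable: true });
  await ch.bindQueue(queue, 'events', 'mover.*');
  ch.consume(queue, (msg) => {
    console.log(msg.fields.routingKey, JSON.parse(msg.content.toString()));
    ch.ack(msg);
  });
}

main().catch(console.error);
```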

tasks

Tasks are published by any service type and follow the naming convention <service>.<entity>.<request>. These represent "things that need to be done" and identify which service should do them. This is an asynchronous way of requesting work. Examples:

  • auth.user.authorize
  • movers.mover.created
  • addressing.address.get
  • addressing.address.reconcile
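
As a sketch of how a task might be requested and fulfilled asynchronously (again assuming RabbitMQ + amqplib; the queue name follows the convention above and the payload shape is an assumption):

```js
// Minimal sketch: request work via a durable task queue named after the task.
const amqp = require('amqplib');

async function requestReconciliation(address) {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  const queue = 'addressing.address.reconcile';
  await ch.assertQueue(queue, { durable: true });
  // The publisher doesn't care which instance of the addressing service picks this up.
  ch.sendToQueue(queue, Buffer.from(JSON.stringify(address)), { persistent: true });
  await ch.close();
  await conn.close();
}

// Consumer side (inside the addressing service): ack only after the work succeeds,
// so unfinished tasks are redelivered if this instance dies.
async function consumeReconciliations(ch) {
  const queue = 'addressing.address.reconcile';
  await ch.assertQueue(queue, { durable: true });
  ch.consume(queue, async (msg) => {
    const address = JSON.parse(msg.content.toString());
    // ... reconcile the address, persist to this service's own store ...
    ch.ack(msg);
  }, { noAck: false });
}
```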

Service base - language and framework

NodeJS Stats

  • excellent tooling, profiling and debugging
  • npm hosts > 100k packages
  • most devs already know JS
  • concurrency via callbacks can devolve into "callback hell", but promises and async generators alleviate this
  • mature; JS evolves slowly due to the age of the standard
  • inexplicit error handling (throw/catch or vague callbacks)
  • performance on the rise, but the dynamism of the runtime can cause hindrances
  • many frameworks available

SenecaJS for NodeJS

SenecaJS is an application framework that organizes business logic into separate, composable blocks that communicate with each other (async or sync) regardless of the communication mechanism between them.

Behind-the-scenes plumbing for transport (MQ or RPC), auth, etc. is abstracted away from the functional building blocks of a Seneca app via plugins.
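
The canonical SenecaJS example shows the idea: business logic is declared as pattern-matched actions and invoked with act, regardless of how the message arrives (HTTP, queue, or in-process).

```js
// Canonical SenecaJS pattern: actions are matched by message patterns.
const seneca = require('seneca')();

seneca.add({ role: 'math', cmd: 'sum' }, (msg, respond) => {
  respond(null, { answer: msg.left + msg.right });
});

seneca.act({ role: 'math', cmd: 'sum', left: 1, right: 2 }, (err, result) => {
  if (err) return console.error(err);
  console.log(result.answer); // 3
});
```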

Relevant plugins

  • Web - map web routes to actions
  • Message - map message events to actions
  • User - map JWT tokens to identity context
  • AWS Lambda - invoke Lambda for actions

Build, test, deploy - CI/CD with Docker

What would it take for someone to spin up a new service?

Feature Acceptance - Branch based previews and E2E testing via Runnable.com

This can be done right on a CI build server with multiple docker-compose.yml files that extend a common one, building against dependent services using environment variables that differ between local, CI server, and production clusters. However, Runnable.com looks like a less fragile way of doing the same while eliminating the CI server.

In-house services shouldn't have a direct dependency on any other running service, only on events from the pub/sub event bus; this simplifies the case where docker-compose is leveraged in lieu of Runnable.com.

  1. install the Docker toolset, clone the Git repos
  2. create a new Git branch (automatically launching a Preview Environment)
  3. spin up a local dev server via docker-compose (see the sketch after this list), that:
  • attaches the service to a local event bus, database server, and other services (e.g. SMTP) needed to build/test on the local machine
  • local-specific config env vars should vary minimally, or not at all, if kept within the confines of the docker-compose environment
  4. push changes to Git, which automatically:
  • runs unit tests
  • updates/deploys to the Preview Environment
  • notifies Slack, Jira, GitHub PR, etc
  • runs functional and end-to-end tests
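
A minimal docker-compose.yml for step 3 might look like the following sketch; the service name, images, and environment variables are illustrative assumptions, not a prescribed setup.

```yaml
# Illustrative local dev compose file (names, images, and env vars are assumptions).
version: '3'
services:
  addressing:
    build: .
    environment:
      AMQP_URL: amqp://rabbitmq
      DATABASE_URL: postgres://postgres:postgres@db/addressing
    depends_on:
      - rabbitmq
      - db
  rabbitmq:
    image: rabbitmq:3-management
  db:
    image: postgres:9.6
  smtp:
    image: mailhog/mailhog
```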

Runnable.com Pros

  • Runnable aims to let you run all your End-to-End tests continuously
  • cross-team changes can be validated by connecting your environment to other services in co-development
  • support is quick to respond

Runnable.com Caveats

  • Runnable.com documentation isn't complete
  • using docker-compose.yml for your Preview Environments requires support assistance (for now)
  • supports RabbitMQ only (unless you pay for Enterprise support and install it on AWS)

Production Deployment

How would it operate in terms of deployment?

Docker image update

  1. code review on PR (enforceable via new GitHub PR rules or team convention)
  2. merge to master, which automatically:
  • runs unit tests
  • updates/deploys to the Preview Environment
  • notifies Slack, Jira, GitHub PR, etc
  • runs functional and end-to-end tests
  3. trigger the deployment script from a Docker image repository notification
  • upon a new image, built by the master deploy, a webhook calls the deployment script
  • Docker stacks and deploys are managed via Docker Cloud + Docker BYOH

Zero downtime

For zero downtime Docker container updates, you need:

  • at least 3 running container instances per image
  • at least 5 compute nodes to spread all containers across

Database migration

Tools like https://flywaydb.org/ make database migrations easier for developers:

  • migrate at service startup, fast fail on schema inconsistencies
  • able to drop and rebuild entire database in test environments
  • able to drop everything but the schemas for clean start during testing
  • cluster safe - locking migrations
  • does NOT support Drops for rollbacks by design

Cluster infrastructure options:

Cloud Service Providers

Internal MicroService

Dependencies that each of our services requires, but that we don't want to maintain ourselves, can be outsourced to relevant cloud service providers. These include:

Note: Assuming we host our cluster environment on AWS, it may be beneficial to use as many compatible services from the AWS ecosystem for performance and cost reasons.

These types of dependent services can be spun up by docker-compose or configured within Runnable.com and seeded for local development and Feature Acceptance testing. Publicly available Docker images also work for testing:

Cluster

Cloud service providers the cluster itself requires:

  • Event Message Bus and Queue - Amazon SNS + SQS or RabbitMQ

    RabbitMQ is AMQP compliant and would not require updating each service's queue subscription mechanism if we moved cloud providers.

  • Cluster Logging (Elastic Cloud or Logit.io)

    Elastic is more mature.

  • Metrics and Monitoring (Prometheus.io and Grafana - hosted version coming soon)

    Prometheus support is built into Docker 1.13, and it provides pull-based metrics.

  • Secrets and Config Management (Hashicorp Vault)

    Already being implemented at Updater.

Patterns

Reactive

Having every component in an ecosystem adhere to reactive patterns keeps all moving parts decoupled and improves failure isolation, scalability, and overall resiliency.

By communicating between services through topics and queues, we isolate services from each other's failures. If a consuming service goes down, the queuing system will keep a replayable backlog.

Authentication

An Authentication Service is a gateway service employed to handle user authentication. Given credentials received from web or mobile clients, this service handles authentication and emits the user.authed event with a JWT payload.

Tasks to be performed and events to be handled are done on behalf of an identity, which is authenticated in the case of a non-system identity:

  • system for maintenance jobs, etc
  • user-mover for movers who are requesting the moving services provided
  • user-client for companies operating through dashboards that provide mover data
  • user-business for major companies operating through dashboards that provide mover services

If any service performs a request on behalf of an external entity, a JSON Web Token (JWT) will be generated by the Authentication Service and emitted to the Microservices cluster as a user.authed event with the JWT as part of the payload.

This token can be used by any service to:

  • cache currently-authed users in memory
  • embed the token in HTTP headers when accessing external services
  • embed the token in message headers when emitting subsequent events performed on behalf of the user's session
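
As a sketch of the token flow using the jsonwebtoken package (the claim names, secret source, and emitEvent helper are assumptions; in practice the secret would come from Vault):

```js
// Sketch of issuing a JWT and emitting user.authed; claim names, secret source,
// and emitEvent() are assumptions, not the actual auth service implementation.
const jwt = require('jsonwebtoken');

function authenticate(credentials, emitEvent) {
  // ... verify credentials against the auth service's own user store ...
  const user = { id: 'u-123', roles: ['user-mover'] };

  const token = jwt.sign(
    { sub: user.id, roles: user.roles },
    process.env.JWT_SECRET,
    { expiresIn: '1h' }
  );

  emitEvent('user.authed', { userId: user.id, roles: user.roles, token });
  return token;
}

// Any downstream service can verify the token and cache the identity:
function identityFromToken(token) {
  return jwt.verify(token, process.env.JWT_SECRET); // throws if invalid/expired
}
```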

Authorization

Authorization is the mapping of global roles (as provided by the user.authed event) to authz roles specific to the service handling the event.
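
A sketch of what that per-service mapping could look like; the global and local role names below are placeholders:

```js
// Hypothetical per-service mapping of global roles (from user.authed) to
// service-local authz roles; the role names are placeholders.
const GLOBAL_TO_LOCAL_ROLES = {
  'user-mover':    ['address:read', 'address:submit'],
  'user-client':   ['address:read'],
  'user-business': ['address:read', 'address:report'],
  'system':        ['address:read', 'address:submit', 'address:admin'],
};

function localRoles(globalRoles) {
  return globalRoles.flatMap((role) => GLOBAL_TO_LOCAL_ROLES[role] || []);
}

function can(globalRoles, permission) {
  return localRoles(globalRoles).includes(permission);
}
```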

REST vs CQRS vs GraphQL

All 3 patterns are implemented as synchronous services for upstream systems and clients.

GraphQL and REST provide CRUD operations on resource entities, while CQRS is a pattern to request an action on a resource and receive the result of the action's operation in the response payload.
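
To make the contrast concrete, the hypothetical Express routes below compare a REST resource endpoint with a CQRS-style command endpoint; a GraphQL gateway would instead expose a single /graphql route accepting queries and mutations. The route shapes and handler bodies are illustrative, not a spec.

```js
// Hypothetical Express routes contrasting REST (resource CRUD) with a
// CQRS-style command endpoint; handler bodies are placeholders.
const express = require('express');
const app = express();
app.use(express.json());

// REST: CRUD on a resource entity.
app.get('/addresses/:id', (req, res) => {
  res.json({ id: req.params.id /* ...load from this service's own store... */ });
});

// CQRS: request an action (command) and return the result of that operation.
app.post('/addresses/:id/reconcile', (req, res) => {
  // ...run the reconcile command, then respond with its outcome...
  res.json({ id: req.params.id, status: 'reconciled' });
});

app.listen(3000);
```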

Complex workflows and Long-running business transactions

Saga Pattern

For long-running business transactions (hours, days) that pass through multiple services, we may have a business need to handle them like a state machine with accompanying rules to roll back or fail the transaction as a whole. Each service that makes a change will have to maintain reversal rules, and a broadcast message upon business transaction success or failure will need to be designed to trigger those rules upon failure.

https://medium.com/@roman01la/confusion-about-saga-pattern-bbaac56e622
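
A minimal sketch of the compensation idea: each step pairs a forward action with a reversal rule, and a failure triggers the completed steps' compensations in reverse order. The step contents in the usage example are illustrative only.

```js
// Illustrative saga runner: each step pairs its action with a compensation
// (reversal rule). On failure, completed steps are compensated in reverse order.
async function runSaga(steps) {
  const completed = [];
  try {
    for (const step of steps) {
      await step.action();
      completed.push(step);
    }
    return { ok: true };
  } catch (err) {
    for (const step of completed.reverse()) {
      await step.compensate(); // e.g. triggered by a broadcast failure event
    }
    return { ok: false, error: err };
  }
}

// Hypothetical usage with stubbed steps:
runSaga([
  { action: async () => console.log('reserve moving date'), compensate: async () => console.log('release moving date') },
  { action: async () => console.log('charge deposit'),      compensate: async () => console.log('refund deposit') },
]).then(console.log);
```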

State machines

The state machine mechanics can be managed by frameworks like:

and dispatched on one-time-use compute nodes such as:

Phases of implementation

  1. automated testing, deploying, rolling image updates
  2. set up configuration environment management (Vault or otherwise)
  3. cluster monitoring, logging, metrics, alerts with dashboard(s)
  4. design message bus event and task specs, health check and reporting checkpoints
  5. set up the pub/sub event bus
  6. Docker base images for service types and triggered jobs
  • coding conventions; choose linting libraries to enforce them
  • test coverage requirements; set up code coverage plugins accordingly
  • decide on testing frameworks for unit, functional, and e2e
  • decide on persistence, caching, HTTP endpoints for reporting, management
  7. build services
  8. optimize for HA, performance