Table of Contents
- Overview
- Service base - language and framework
- Build, test, deploy - CI/CD with Docker
- Cloud Service Providers
- Patterns
- Phases of implementation
This is a general plan and preliminary spec for Updater's Microservices architecture. The overall Microservices system will be responsible for performing much of the business logic, data manipulation (reconciliation, normalization, references), and service request fulfillment.
- Isolate everything
- failures shouldn't bring down other services (Bulkheading)
- use asynchronous messaging between services (events and tasks)
- Single Responsibility Principle
- Do ONE THING, and DO IT WELL
- business domains and responsibilities are not entangled
- LOC stays small; testing is easier
- maintenance and frequent continuous deployments are easier
- Own your state EXCLUSIVELY
- maintain your own persistent data stores, caches, etc
- share anything via message bus events
- Location Transparency
- services can be recycled, relocated, upgraded, or scaled, and may fail at any time
- should be addressable through a virtual network - usually a cluster DNS entry
- addressable as a single unit no matter how many instances exist or where they're located
There are 3 main types of services we'll employ, differentiated by how they're executed and how they communicate with other services.
All service types emit events and tasks, and manage their own state.
Consumes asynchronous events and tasks via message bus.
This is the general, long-running worker, communicating solely by asynchronous messaging via the event bus and queues, processing a single function, business or otherwise.
Responds to upstream synchronous requests for entity data via RESTful, CQRS or GraphQL endpoints.
These serve synchronous requests from upstream systems such as the core API or publicly accessible clients. They lie on the edge of the Microservices ecosystem and require an Authorization header to identify the calling user.
Finite running jobs, launched by a state machine or workflow engine.
These are single tasks grouped into collective systems that complete complex workflows; such a workflow may form a state machine, which may also adhere to the Saga Pattern. They are launched when needed, perform their function, and exit.
Events are published by any service type and follow the naming convention `<entity>.<verb>`, where the verb is past tense. These represent "things that have happened" and are subscribed to by any interested service. Examples:
user.authed
mover.created
address.received
address.reconciled
Any service can subscribe to message topics by their names and process the payload of these events.
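Below is a minimal sketch of how a service might emit and subscribe to such events, assuming RabbitMQ is the broker and the `amqplib` package is used; the `events` exchange name, the queue name, and the payload shape are illustrative assumptions, not part of this spec.

```js
// Hedged sketch: emit and consume events over a RabbitMQ topic exchange.
// Exchange/queue names and payload shape are assumptions for illustration.
const amqp = require('amqplib');

async function emitEvent(name, payload) {
  // name follows the <entity>.<verb> convention, e.g. 'mover.created'
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertExchange('events', 'topic', { durable: true });
  ch.publish('events', name, Buffer.from(JSON.stringify(payload)), { persistent: true });
  await ch.close();
  await conn.close();
}

async function subscribeToEvents(pattern, handler) {
  // pattern can match a topic family, e.g. 'address.*'
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertExchange('events', 'topic', { durable: true });
  const { queue } = await ch.assertQueue('addressing-service.events', { durable: true });
  await ch.bindQueue(queue, 'events', pattern);
  ch.consume(queue, (msg) => {
    handler(msg.fields.routingKey, JSON.parse(msg.content.toString()));
    ch.ack(msg);
  });
}
```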
Tasks are published by any service type and follow the naming convention `<service>.<entity>.<request>`. These represent "things that need to be done" and by which service; this is an asynchronous way of requesting work. Examples:
auth.user.authorize
movers.mover.created
addressing.address.get
addressing.address.reconcile
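As a sketch of the task side, the same broker can carry durable work queues named after the convention above; again the `amqplib` usage, queue options, and handler are assumptions for illustration.

```js
// Hedged sketch: request and perform tasks over a durable work queue.
// Queue names follow <service>.<entity>.<request>, e.g. 'addressing.address.reconcile'.
const amqp = require('amqplib');

async function requestTask(taskName, payload) {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue(taskName, { durable: true });
  ch.sendToQueue(taskName, Buffer.from(JSON.stringify(payload)), { persistent: true });
  await ch.close();
  await conn.close();
}

async function workOn(taskName, handler) {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue(taskName, { durable: true });
  ch.prefetch(1); // one unacknowledged task per worker at a time
  ch.consume(taskName, async (msg) => {
    try {
      await handler(JSON.parse(msg.content.toString()));
      ch.ack(msg); // done: remove the task from the queue
    } catch (err) {
      ch.nack(msg, false, true); // requeue so the backlog stays replayable
    }
  });
}
```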
- excellent tooling, profiling and debugging
- npm hosts > 100k packages
- most devs already know JS
- concurrency historically meant callback hell, but Promises and async generators alleviate this
- mature; JS evolves slowly due to older standard
- inexplicit error handling (throw/catch or vague callbacks)
- performance on the rise, but the dynamism of the runtime can cause hindrances
- many frameworks available
SenecaJS is an application framework that splits business logic into separate, composable blocks that communicate with each other (async or sync) regardless of the communication mechanism between them.
Behind-the-scenes plumbing for transport (MQ or RPC), auth, etc. is abstracted away from the functional building blocks of a Seneca app via plugins.
- Web - map web routes to actions
- Message - map message events to actions
- User - map JWT tokens to identity context
- AWS Lambda - invoke Lambda for actions
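A minimal sketch of a Seneca building block follows; the pattern fields (`role`, `cmd`) and the commented-out AMQP transport plugin are assumptions for illustration, not a prescribed layout.

```js
// Hedged sketch of a Seneca service block: business logic as pattern-matched
// actions, with transport layered on via plugins.
const seneca = require('seneca')();

// Business logic: one composable action, unaware of how it is invoked.
seneca.add({ role: 'addressing', cmd: 'reconcile' }, (msg, reply) => {
  // ... reconcile msg.address against reference data ...
  reply(null, { reconciled: true, address: msg.address });
});

// Plumbing: plugins decide whether this action is reached via web routes,
// message-bus events, Lambda invocations, etc. For example (assumed plugin):
// seneca.use('seneca-amqp-transport');
// seneca.listen({ type: 'amqp', pin: 'role:addressing' });

// Invoking the action looks the same regardless of transport:
seneca.act(
  { role: 'addressing', cmd: 'reconcile', address: '123 Main St' },
  (err, result) => {
    if (err) throw err;
    console.log(result);
  }
);
```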
What would it take for someone to spin up a new service?
This can be done right on a CI build server with multiple `docker-compose.yml` files that extend a common one and build against dependent services, using environment variables to distinguish local, CI server, and production clusters. However, Runnable.com looks like a less fragile way of doing the same while eliminating the CI server.
In-house services shouldn't have a direct dependency on any other running service, only on events from the pub/sub event bus, simplifying the use case where `docker-compose` is leveraged in lieu of Runnable.com.
- install Docker toolset, clone Git repos
- create new Git branch (automatically launching a Preview Environment)
- spin up a local dev server via `docker-compose`, which:
  - attaches the service to a local event bus, database server, and other services (e.g. SMTP) needed to build/test on the local machine
  - local-specific config envs should vary minimally or not at all if kept within the confines of the `docker-compose` environment
- push changes to Git, which automatically:
  - runs unit tests
  - updates/deploys to the Preview Environment
  - notifies Slack, Jira, GitHub PR, etc.
  - runs functional and end-to-end tests
Runnable.com Pros
- Runnable aims to let you run all your End-to-End tests continuously
- cross-team changes can be validated by connecting your environment to other services in co-development
- support is quick to respond
Runnable.com Caveats
- Runnable.com documentation isn't complete
- using `docker-compose.yml` for your Preview Environments requires support assistance (for now)
- supports RabbitMQ only (unless you pay for Enterprise support and install it on AWS)
How would it operate in terms of deployment?
- code review on PR (enforceable via new GitHub PR rules or team convention)
- merge to master, which automatically:
  - runs unit tests
  - updates/deploys to the Preview Environment
  - notifies Slack, JIRA, GitHub PR, etc.
  - runs functional and end-to-end tests
  - triggers the deployment script from a Docker image repository notification
- upon a new image built by a `master` deploy, a webhook calls the deployment script
- Docker stacks and deploys are managed via Docker Cloud + Docker BYOH
For zero downtime Docker container updates, you need:
- at least 3 running container instances per image
- at least 5 compute nodes to spread all containers across
Tools like https://flywaydb.org/ make database migrations easier for developers
- migrate at service startup, fast fail on schema inconsistencies
- able to drop and rebuild entire database in test environments
- able to drop everything but the schemas for clean start during testing
- cluster safe - locking migrations
- does NOT support Drops for rollbacks by design
- https://flywaydb.org/documentation/faq.html#downgrade
- schema migrations shouldn't apply destructive changes alongside the code that depends on them
- use snapshots if rollbacks are needed
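A rough sketch of the "migrate at service startup, fast fail" behaviour, assuming the Flyway CLI is baked into the service's Docker image and already configured (via `flyway.conf` or environment variables); the wrapper below is illustrative, not a prescribed integration.

```js
// Hedged sketch: run Flyway migrations before the service starts serving,
// and fail fast if the schema can't be brought to a consistent state.
const { spawnSync } = require('child_process');

function migrateOrDie() {
  const result = spawnSync('flyway', ['migrate'], { stdio: 'inherit' });
  if (result.status !== 0) {
    console.error('flyway migrate failed - refusing to start the service');
    process.exit(1);
  }
}

migrateOrDie();
// ...only now attach to the message bus / start listening for requests...
```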
- Docker 1.13 Swarm
- Kubernetes - I've had issues w/ inter-container discovery via cluster DNS
- AWS Elastic Container - pricier
Dependencies that each of our services requires but that we don't want to maintain can be outsourced to relevant cloud service providers. These include:
- RDBMS (Heroku Postgres or Google Spanner)
- REDIS K/V (Redis Cloud or AWS ElastiCache)
- SMTP (SendGrid or Mandrill by Mailchimp)
Note: Assuming we host our cluster environment on AWS, it may be beneficial to use as many compatible services from the AWS ecosystem as possible, for performance and cost reasons.
These types of dependent services can be spun up by `docker-compose` or configured within Runnable.com, and seeded for local development and Feature Acceptance testing. Publicly available Docker images that also work for test:
Cloud service providers the cluster itself requires:
- Event Message Bus and Queue - Amazon SNS + SQS or RabbitMQ
  RabbitMQ is AMQP-compliant and would not require updating each service's queue subscription mechanism if we moved cloud providers.
- Cluster Logging (Elastic Cloud or Logit.io)
  Elastic is more mature.
- Metrics and Monitoring (Prometheus.io and Grafana - hosted version coming soon)
  Prometheus is built into Docker 1.13 and provides pull-based metrics.
- Secrets and Config Management (HashiCorp Vault)
  Already being implemented at Updater.
Having every component in the ecosystem adhere to reactive patterns lends all moving parts toward decoupling, failure isolation, scalability, and overall resiliency.
By communicating between services through topics and queues, we isolate services from each other's failures. If a consuming service goes down, the queuing system will keep a replayable backlog.
An Authentication Service is a gateway service, employed to handle user authentication events. Given credentials received by web or mobile clients, this service handles authentication and emits the `user.authed` event with a JWT payload.
Tasks to be performed or events to be handled are done on behalf of an identity, which is authenticated in the case of a non-`system` identity:
- `system` - for maintenance jobs, etc.
- `user-mover` - for movers who are requesting the moving services provided
- `user-client` - for companies operating through dashboards that provide mover data
- `user-business` - for major companies operating through dashboards that provide mover services
If any service performs a request on behalf of an external entity, a JSON Web Token (JWT) will be generated by the Authentication Service and emitted to the Microservices cluster as a `user.authed` event with the JWT as part of the payload.
This token can be used by any service to:
- cache currently authed users in memory
- embed in HTTP headers when accessing external services
- embed in message headers when emitting subsequent events performed on behalf of said user in session
Global roles (as provided by the `user.authed` event) are mapped to authz roles specific to the service handling the event.
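A sketch of what that mapping might look like in a service consuming `user.authed`, assuming the `jsonwebtoken` package; the claim names, role names, and mapping table are illustrative assumptions.

```js
// Hedged sketch: verify the JWT carried by a user.authed event and map the
// global roles in its claims to this service's own authz roles.
const jwt = require('jsonwebtoken');

// Hypothetical mapping of global roles to roles local to this service.
const ROLE_MAP = {
  system: 'address:admin',
  'user-mover': 'address:read',
  'user-client': 'address:read-bulk',
  'user-business': 'address:admin',
};

const authedUsers = new Map(); // in-memory cache of currently authed users

function onUserAuthed(event) {
  // Verify the token before trusting its claims (key and algorithm are assumptions).
  const claims = jwt.verify(event.token, process.env.JWT_PUBLIC_KEY);
  const localRoles = (claims.roles || []).map((r) => ROLE_MAP[r]).filter(Boolean);
  authedUsers.set(claims.sub, { token: event.token, roles: localRoles });
}
```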
All 3 patterns are implemented as synchronous services for upstream systems and clients.
GraphQL and REST provide CRUD operations on resource entities, while CQRS is a pattern to request an action on a resource and receive the result of the action's operation in the response payload.
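To make the distinction concrete, here is a hedged sketch contrasting a REST-style CRUD endpoint with a CQRS-style command endpoint, assuming Express; the routes, payload shapes, and the `loadAddress`/`reconcileAddress` stubs are hypothetical.

```js
// Hedged sketch: REST-style CRUD vs CQRS-style command endpoints in Express.
const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical data-access stubs for illustration.
const loadAddress = async (id) => ({ id, street: '123 Main St' });
const reconcileAddress = async (id, body) => ({ status: 'reconciled', address: { id, ...body } });

// REST/GraphQL style: CRUD on a resource entity.
app.get('/addresses/:id', async (req, res) => {
  const address = await loadAddress(req.params.id);
  res.json(address);
});

// CQRS style: request an action on a resource; the response carries the
// result of that action's operation.
app.post('/addresses/:id/commands/reconcile', async (req, res) => {
  const result = await reconcileAddress(req.params.id, req.body);
  res.json({ status: result.status, address: result.address });
});

app.listen(3000);
```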
For long-running business transactions (hours, days) that pass through multiple services, we may have a business need to handle them like a state machine, with accompanying rules to roll back or fail all of them. Each change at each service will have to maintain reversal rules, and a broadcast message upon business transaction success or failure will need to be designed to trigger those rules upon failure (see the sketch after the list below).
https://medium.com/@roman01la/confusion-about-saga-pattern-bbaac56e622
The state machine mechanics can be managed by frameworks like:
- Netflix Conductor or
- AWS Step-Functions - visual workflow
and dispatched on one-time-use compute nodes such as:
- AWS Lambda
- Iron Workers
- or internal cluster managed tasks as dispatched by Swarm or Kubernetes
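As a sketch of the compensation idea described above, each step of a saga can be paired with a reversal rule and rolled back in reverse order on failure; the steps and context here are made up, and in practice the orchestration would live in Conductor, Step Functions, or a cluster-managed job rather than inline code.

```js
// Hedged sketch: a saga as a list of steps, each paired with a reversal rule.
// On failure, completed steps are undone in reverse order and a failure event
// would be broadcast for interested services.
async function runSaga(steps, ctx) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.run(ctx);
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        await done.undo(ctx); // reversal rule maintained by the owning service
      }
      // e.g. emit a 'move.transaction.failed' event here
      return { ok: false, reason: err.message };
    }
  }
  // e.g. emit a 'move.transaction.completed' event here
  return { ok: true };
}

// Usage with made-up steps:
runSaga(
  [
    { run: async () => { /* reserve move date */ }, undo: async () => { /* release it */ } },
    { run: async () => { /* charge deposit */ }, undo: async () => { /* refund it */ } },
  ],
  { moverId: 42 }
);
```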
- automated testing, deploying, rolling image updates
- set up configuration/environment management (Vault or otherwise)
- cluster monitoring, logging, metrics, alerts with dashboard(s)
- design message bus event and task specs, health checks, and reporting checkpoints
- set up the pub/sub event bus
- Docker base images for service types and triggered jobs
- coding conventions; choose linting libraries to enforce them
- test coverage requirements, set up code coverage plugins accordingly
- decide on testing frameworks for unit, functional, and e2e
- decide on persistence, caching, HTTP endpoints for reporting, management
- build services
- optimize for HA, performance