One year ago, my consultancy was hired to construct a modern backend for a travel site that serves millions of unique visitors each month, after the previous team's attempt had failed. This is the story of our success.
If your business is based on content, then you must solve two separate problems: 1) how to create new content and 2) how to serve that content via your web site, your mobile applications, and perhaps third parties. If your business has tenure, then your content creation suite is likely varied, and your data is probably spread across many different data sources. Worse, each of your applications accesses each data source directly, through inconsistent APIs. If you want your business to grow faster than your application teams can drive it, then you need to give third parties access to your data in a manner that you control.
This is the state in which we found our client. Their rich content was stuck behind Ruby on Rails applications, WordPress instances, and various other data stores. They needed 1) to create new and richer content faster and 2) to serve their content to new web and mobile applications at scale.
Their previous attempts to solve this problem had failed. Those teams had tried to solve both problems simultaneously. With so many stakeholders, they failed to get consensus and churned.
We took a different approach. We decoupled the two problems. We left the content creation workflow alone and instead introduced a new serving layer atop the storage layer using a data model that could be easily understood and served via a truly RESTful API. With this architecture, we enabled the client to build new web and mobile applications quickly, moving independently of the content creation tooling development.
This is the story of what we delivered to the client and how it satisfies their current and foreseeable future needs.
As a travel company with a global brand, our client had accumulated a wealth of content, some generated in house, and some integrated from third parties.
- In-house objective content and subjective narratives about places and points of interest are created in a custom Ruby on Rails content management system backed by a Postgres database.
- Articles and news are created in various WordPress instances backed by MySQL databases.
- Books and e-books are sold in a custom e-commerce application backed by a Microsoft SQL Server database.
- Booking services for lodgings are provided by Booking.com and Hostelworld.
- Booking services for tours are provided by G Adventures and Viator.
- Additional objective content such as hours of operation or location (address or latitude/longitude) data from Google Places and Factual.com would also need to be integrated.
With so many disparate data sources, we needed an organizing principle. As a travel site, internal clients access the data either by navigating or searching via the web or mobile applications. We needed to support navigation following a simple geographic (place) hierarchy: world, continents, countries, states/regions, cities, and neighborhoods.
Points of interest would be indexed by the places that contain them. Articles, news, books, and e-books would be indexed by the places mentioned within them. Individual lodgings would be indexed by the places in which they reside. Tours would be indexed by the places that they visit.
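As a rough sketch of this organizing principle (the types and fields below are illustrative, not our actual schema), the hierarchy and the indexing relationships look something like this:

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative sketch of the place hierarchy and the indexing idea;
# names and fields are hypothetical, not the production schema.

@dataclass
class Place:
    id: str
    name: str
    kind: str                  # "continent", "country", "state", "city", "neighborhood"
    parent_id: Optional[str]   # None only for the "world" root

@dataclass
class PointOfInterest:
    id: str
    name: str
    containing_place_ids: List[str]   # indexed by the places that contain it

@dataclass
class Article:
    id: str
    title: str
    mentioned_place_ids: List[str]    # indexed by the places mentioned within it

def breadcrumb(place_id: str, places_by_id: dict) -> List[str]:
    """Walk up the hierarchy, e.g. neighborhood -> city -> ... -> world."""
    names: List[str] = []
    place = places_by_id.get(place_id)
    while place is not None:
        names.append(place.name)
        place = places_by_id.get(place.parent_id)
    return list(reversed(names))
```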
Our client runs one of the most visited travel sites, so we needed to support 30–60 GET requests per second at launch. New mobile applications were also being built, so demand would be unpredictable. We needed to build a system that would scale elastically.
Given the failure of the previous team, we knew that we needed to show results quickly. Moving quickly meant leveraging popular open source software, well-thought-out standards, and cloud infrastructure. Our client was already using Amazon Web Services (AWS) to run dozens of Ruby on Rails applications, each on a separate EC2 instance, so the decision to host our backend on AWS was easy.
We needed a high degree of scalability, support for multiple programming languages, and clear isolation of services, so we opted for a microservice architecture: Docker for containerization, JSON for data interchange, and the principles of the 12-factor application.
We needed to design our APIs first so that clients of the services we were building could be developed in parallel with those services. We opted to write our API specifications using the RAML modeling language and JSON Schema, a vocabulary for annotating and validating JSON documents. We wrote tooling around the Ramlfications RAML parser and the jsonschema validation library.
Prior to implementing services, we wrote example requests and responses along with their schemas. This proved invaluable in expediting client development and in identifying inconsistencies between our implementation and our specifications.
To structure our APIs, we standardized on JSON API, a well-structured approach to building JSON-based web APIs. While many alternatives exist (including HAL, JSON-LD, Collection+JSON, and SIREN), JSON API provided structure and direction with acceptable overhead.
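To make the structure concrete, here is a minimal, illustrative JSON API-style response for a point of interest, validated with the jsonschema library much as our tooling validated example responses. The schema shown is a simplified stand-in, not one of our real specs:

```python
from jsonschema import validate  # pip install jsonschema

# Simplified, illustrative JSON Schema for a JSON API single-resource document.
POI_DOCUMENT_SCHEMA = {
    "type": "object",
    "required": ["data"],
    "properties": {
        "data": {
            "type": "object",
            "required": ["type", "id", "attributes"],
            "properties": {
                "type": {"enum": ["points-of-interest"]},
                "id": {"type": "string"},
                "attributes": {
                    "type": "object",
                    "required": ["name"],
                    "properties": {
                        "name": {"type": "string"},
                        "latitude": {"type": "number"},
                        "longitude": {"type": "number"},
                    },
                },
            },
        },
    },
}

example_response = {
    "data": {
        "type": "points-of-interest",
        "id": "poi-123",
        "attributes": {
            "name": "Statue of Liberty",
            "latitude": 40.6892,
            "longitude": -74.0445,
        },
    }
}

# Raises jsonschema.ValidationError if the example drifts from the schema.
validate(instance=example_response, schema=POI_DOCUMENT_SCHEMA)
```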
My team consisted of two Ruby/Elixir developers, one senior Python engineer, and two Java/Scala engineers. Standardizing on a single language would have to wait. We allowed each developer to use the language of his or her choice, as long as the service ran in a Docker container and followed the JSON API standard. Later, after demonstrating success, we would standardize on one or two languages.
To manage our containers, we opted for Kubernetes, an open-source system for automating deployment, scaling, and management of containerized applications. We chose Kubernetes over its competitors because of its maturity and growing user base.
Over the course of several months, we implemented a microservice for each data set. Each microservice consists of two sub-services: 1) a single synchronization sub-service that indexes data by location and writes it into an AWS RDS Postgres/PostGIS datastore, and 2) a stateless, replicable API sub-service that serves JSON data from that datastore. Exactly one instance of each synchronization sub-service runs per cluster, while the API sub-services scale with CPU load via Kubernetes auto-scaling. This allows us to adapt to increasing or decreasing demand without human intervention.
Our first microservice, written in Python, serves places (e.g. New York City) and points of interest (e.g. the Statue of Liberty). The synchronization sub-service replicates data from a remote, non-hosted Postgres database into an AWS RDS Postgres database. The API sub-service uses the Flask framework, the gevent.pywsgi web server, and the psycopg PostgreSQL adapter. Gevent provides coroutines that allow the web server to scale without the overhead of heavyweight threads.
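A minimal sketch of the shape of that API sub-service, using Flask, gevent.pywsgi, and psycopg2; the table, query, and environment variable names are illustrative, not the production code:

```python
from gevent import monkey; monkey.patch_all()  # patch the stdlib for cooperative I/O

import os

import psycopg2
from flask import Flask, jsonify
from gevent.pywsgi import WSGIServer

app = Flask(__name__)

@app.route("/places/<place_id>")
def show_place(place_id):
    # Illustrative: one connection per request, no pooling.
    # Configuration comes from the environment, per the 12-factor approach.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT id, name FROM places WHERE id = %s", (place_id,))
            row = cur.fetchone()
    finally:
        conn.close()

    if row is None:
        return jsonify({"errors": [{"status": "404", "title": "Not Found"}]}), 404
    return jsonify({"data": {"type": "places", "id": str(row[0]),
                             "attributes": {"name": row[1]}}})

if __name__ == "__main__":
    # gevent's WSGI server handles each request in a lightweight greenlet.
    WSGIServer(("0.0.0.0", 8080), app).serve_forever()
```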
Our second microservice, written in Elixir, serves lodgings from booking.com and HostelWorld. The synchronization sub-service polls the web sites for changes several times each day, and indexes the data into an AWS RDS Postgres database. The API sub-service uses the Phoenix web framework. Elixir lacks the maturity of the other languages, which required us to extend open source monitoring and serialization libraries in order to use them in our project. However, Elixir proved to be a solid performer with low memory requirements.
Our third microservice, written in Ruby, serves activities and tours from Viator and G Adventures. The synchronization sub-service polls the web sites for changes several times each day, and indexes the data into an AWS RDS Postgres database. The API sub-service uses the Rails web framework. This service proved to be a memory hog.
Our fourth microservice, written in Scala, serves articles and news from the MySQL databases managed by WordPress. Unlike the other microservices, it does not have a separate synchronization layer, since the source data is already indexed by location via a custom WordPress plugin. Instead, the API sub-service reads directly from a read replica of the source WordPress database and serves the data using the spray framework.
Our fifth microservice, written in Elixir, serves the inventory of books and e-books from a custom e-commerce system backed by Microsoft SQL Server. The synchronization sub-service polls a custom XML feed several times each day, and indexes the data into an AWS RDS Postgres database.
Logging in Kubernetes is supported with several cluster add-ons: 1) Elasticsearch, a distributed, open source search and analytics engine; 2) Fluentd, an open source log data collector; and 3) Kibana, an open source log data visualization platform. We frequently ran out of space when storing Elasticsearch data locally within the Kubernetes cluster, so we are moving that storage outside of the cluster.
To get the most out of logging, we standardized on JSON-formatted log messages with a required minimum set of fields. We validated the log messages against a JSON Schema that we created.
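A sketch of the idea using Python's standard logging module; the field set and service name here are illustrative examples of the kind of minimum set we required, not the exact list from our schema:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object with required fields."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "places-api",          # hypothetical service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # write to stdout for Fluentd to collect
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("places-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("synchronization started")
```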
To monitor our Kubernetes cluster and provide alerts, we used the Prometheus toolkit. Unlike many monitoring tools, Prometheus operates using a pull model, so each service must implement a standard API endpoint that provides metrics.
We implemented a small set of standard metrics for each sub-service. Each synchronization sub-service provides a simple binary flag indicating whether synchronization is in progress. Each API sub-service provides a request-time histogram. We use labels to identify request types for fine-grained insight into response time.
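A sketch of what those metrics look like with the Python prometheus_client library; the metric and label names are illustrative:

```python
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Binary flag set by the synchronization sub-service.
SYNC_IN_PROGRESS = Gauge(
    "sync_in_progress", "1 while a synchronization run is active, else 0")

# Request-time histogram labeled by request type for fine-grained insight.
REQUEST_SECONDS = Histogram(
    "api_request_seconds", "API request latency in seconds", ["request_type"])

def run_synchronization():
    SYNC_IN_PROGRESS.set(1)
    try:
        time.sleep(2)                 # stand-in for the real indexing work
    finally:
        SYNC_IN_PROGRESS.set(0)

def handle_request(request_type: str):
    with REQUEST_SECONDS.labels(request_type=request_type).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)           # exposes /metrics for Prometheus to pull
    run_synchronization()
    for _ in range(100):
        handle_request("place_by_id")
    time.sleep(600)                   # keep /metrics up for scraping in this demo
```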
We analyze and visualize the metrics that Prometheus collects using Grafana. We augmented the Prometheus metrics with exceptional events from our Elasticsearch logs using Grafana annotations. During early development, this allowed us to count exceptions and to correlate them with operational metrics, which helped us find bugs quickly.
For longer term analysis and visualization of exceptions, we used Airbrake, which has better support for seeing long term trends.
Prior to the availability of actual application traffic, we tested our services under artificial load using the open source Locust load testing tool. Combined with the Prometheus response time metrics that we visualized in Grafana, this gave us insight into the performance and behavior of our system.
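A Locust test file is just Python; a minimal sketch along these lines (written against the current Locust API, with example endpoints rather than our real routes) is enough to generate meaningful load:

```python
# Illustrative locustfile; run with: locust -f locustfile.py --host <staging URL>
from locust import HttpUser, between, task

class ApiUser(HttpUser):
    wait_time = between(1, 3)   # seconds between simulated user actions

    @task(3)
    def list_points_of_interest(self):
        # Hypothetical JSON API-style listing endpoint, filtered by place.
        self.client.get("/points-of-interest?filter[place]=new-york-city")

    @task(1)
    def show_place(self):
        self.client.get("/places/new-york-city")
```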
Once we had actual traffic hitting our production site, we were able to replay that live traffic into our staging cluster using Gor in order to continuously test our system with real data.
To direct traffic to the individual microservices, we used the open source proxy Nginx.
In front of our Nginx instance, we used the open-source API gateway and microservices management tool Kong to authorize and rate-limit API requests from third parties.
Overall the project was a major success.
Within three months, with five developers and one DevOps person, we had a proof of concept up and running. Two months later, we had support for orchestration with Kubernetes. Our developers were able to create a new simple service and deploy it to any of three Kubernetes clusters (QA, staging, and production) in under two hours, without any assistance or support from DevOps.
More importantly, our client was able to realize tangible and quantifiable benefits.
First, the point-of-interest web pages were reimplemented to use the new microservices. Now, 100% of that traffic is directed to the new Kubernetes production cluster, using Kubernetes auto-scaling for elasticity and efficiency. Development was fast thanks to the design-first approach and the use of mock servers.
Second, for the first time in company history, the company is able to offer third parties API access to its custom content. This will open new revenue streams.
Third, applications are being ported from the old AWS EC2/CloudFormation/Chef-based infrastructure to the new Kubernetes clusters. This will lower operating costs, since Kubernetes provides greater EC2 instance utilization.
On the down side, as noted in What I Wish I Had Known Before Scaling Uber to 1000 Services, there are real costs to supporting multiple language environments. First, expertise in each environment must be acquired; this split the team into camps and slowed code review and bug fixing. Second, deficiencies in dependencies had to be fixed separately in each language environment; this occurred quite frequently and slowed adoption of new monitoring and tracing tools. We will migrate away from Ruby and one of the other languages.
Unlocking the value of content takes more than simply bolting a JSON API onto an existing application or web service. It requires a well-considered data model, a clean API, and a scalable infrastructure. But with containerization, cloud compute and storage services, open source tools, and a solid architecture and engineering team, you can indeed unlock that value.
Comments
Q: Did you consider a ‘serverless’ AWS API Gateway/Lambda/DynamoDB, Node.js based solution? Curious what you think of those technologies separately or together. Also, can you elaborate on Kubernetes having a lower operating cost via better EC2 utilization?
A: We did consider a “serverless” AWS API Gateway/Lambda/DynamoDB, Node.js based solution, but we ran into some obstacles. First, we wanted our API specs to be useful for validating examples and the contract with the services, yet we could not specify our API using Swagger (which is what API Gateway uses): the Swagger spec does not allow recursive definitions or “union” types, and we needed both to represent geometry in JSON. We had standardized on GeoJSON, which cannot be defined in Swagger. Second, our data was already stored in Postgres, MySQL, and MS SQL Server; migrating to DynamoDB was not reasonable. Finally, we did not want to commit to Amazon API Gateway and Lambda so early in their development cycle.
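To illustrate the union-type issue: a GeoJSON geometry is one of several shapes (Point, Polygon, and so on), which JSON Schema expresses with oneOf, a construct Swagger 2.0 lacks. A simplified sketch (real GeoJSON schemas cover more geometry types and constraints):

```python
from jsonschema import validate

# Simplified union of two GeoJSON geometry types via oneOf.
GEOMETRY_SCHEMA = {
    "oneOf": [
        {   # Point: a single [longitude, latitude] position
            "type": "object",
            "required": ["type", "coordinates"],
            "properties": {
                "type": {"enum": ["Point"]},
                "coordinates": {
                    "type": "array",
                    "items": {"type": "number"},
                    "minItems": 2, "maxItems": 2,
                },
            },
        },
        {   # Polygon: an array of linear rings of positions
            "type": "object",
            "required": ["type", "coordinates"],
            "properties": {
                "type": {"enum": ["Polygon"]},
                "coordinates": {
                    "type": "array",
                    "items": {
                        "type": "array",
                        "items": {
                            "type": "array",
                            "items": {"type": "number"},
                            "minItems": 2, "maxItems": 2,
                        },
                    },
                },
            },
        },
    ]
}

# Passes: matches exactly one branch of the union.
validate({"type": "Point", "coordinates": [-74.0445, 40.6892]}, GEOMETRY_SCHEMA)
```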
Q: Did you put the database into Kubernetes as well, or just the application services?
A: The databases are outside Kubernetes. We used Amazon RDS hosted databases.
Q: How are you authenticating and authorizing the API clients and dealing with request rate limiting and scraping?
A: We used Kong to address those needs.
Much also depends on your architecture.
If you plan on having many duplicated shared-nothing Docker-based services where failure of a service just means having the client retry and be connected to another service, then you probably don't need Erlang’s fault-tolerance and can use Go, which creates a single standalone binary that can be deployed by just copying it. Your architecture can scale by adding more Docker containers.
On the other hand, if you need strong fault-tolerance, the ability to hot-load code updates, sophisticated distributed-system facilities, and support for a very large number of concurrent connections (as opposed to straight-line per-process performance), Erlang is your game.