This gist details the decisions I made, the tests I conducted, the observations recorded, and the directions I took while building out a back end API service for the new Atelier retail app in the SDC (System Design Capstone) project for Hack Reactor.
Table of Contents
The following is the template and some additional prompts that will be used for the journal moving forward.
Goals & Motivations
Every action you take should be driven by some challenge you are facing. Many times this will be simply the requirement of the project. However, you should be able to frame each challenge in your own terms. Own the motivations in the context of your own project.
Actions Taken
This portion should be the most detailed. Make sure to be technically precise and remove any ambiguity related to the action you are taking.
To start off our project, we had to decide which database would serve as the primary and which as the secondary, and how to actually use both.
I chose MongoDB as the primary database (i.e., typically the first one a request encounters), as it is faster for simple read/write actions. However, I am setting it up to serve essentially as a caching layer, with all of the data ultimately stored and reconciled in Postgres as the source of truth.
Postgres is generally better for insertions/updates and for the more complex queries made possible by table joins, so such queries will be forwarded directly to it.
Actions where Postgres is faster will be directed straight to Postgres, and likewise for Mongo. However, Mongo will not hold all of the data at all times: if it is missing requested data, it will fetch it from Postgres, and data written to Mongo will be pushed to Postgres in a separate asynchronous process while the original call returns.
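As a rough sketch of that read path (the model names ReviewCache and Review and their file paths are hypothetical placeholders, not the project's actual schema), the Mongo-first lookup with a Postgres fallback might look something like this:

// Read-through pattern: check the Mongo cache first, fall back to Postgres as the source of truth.
const ReviewCache = require('./models/mongo/reviewCache'); // hypothetical Mongoose model
const Review = require('./models/postgres/review');        // hypothetical Sequelize model

async function getReview(reviewId) {
  const cached = await ReviewCache.findOne({ review_id: reviewId }).lean();
  if (cached) return cached;

  const row = await Review.findByPk(reviewId); // Postgres always holds the full dataset
  if (row) {
    // Warm the cache off the critical path; the response does not wait on this write.
    ReviewCache.create(row.get({ plain: true }))
      .catch((err) => console.error('cache write failed:', err));
  }
  return row;
}

module.exports = { getReview };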
I don't know yet if I can or will have any metrics monitoring, but if one system is slowing down under high request rates, I can then switch to the other system to balance loads temporarily.
ORM Considerations
In addition to this database selection, I have decided to use ORM (object relational mapping) frameworks for each one (Mongoose for MongoDB, and Sequelize for Postgres). At this stage I am not sure if using ORMs will carry a performance penalty, but it seemed like a good way to design a server that is agnostic about which databases it is using, abstracting out the database implementation details. Not only can other databases be swapped in as primary/secondary in the future, but if they use the same ORM (e.g. switching from Postgres to MySQL), for the most part only connection details need to be changed in the higher-level database controller code in the server.
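For example (the connection strings here are placeholders), swapping the SQL store from Postgres to MySQL under Sequelize would mostly mean changing the dialect and connection details handed to the ORM, while the models and queries stay the same:

const { Sequelize } = require('sequelize');

// Postgres today...
const db = new Sequelize('reviews', 'user', 'password', { host: 'localhost', dialect: 'postgres' });

// ...and if it were ever swapped for MySQL, only this line would need to change:
// const db = new Sequelize('reviews', 'user', 'password', { host: 'localhost', dialect: 'mysql' });

module.exports = db;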
Since the end goal of our project is to stress test the databases and server routes in a variety of ways, and at a variety of levels, we need a database populated with a dataset scaled to simulate real-world conditions. For the ratings and reviews component, I was given 4 CSV files comprising data dumped from a third-party source, with the file sizes and entry counts shown below:
TODO: Screenshot of spreadsheet breakdown
At this stage of the project, I had to work through an ETL (Extract, Transform, Load) process in order to populate my database, in the following ways:
Extract the CSV data into a workable format, such as JSON/JavaScript.
Transform the data from the arrangement of the original data source to fit my database, which has a different structure. This includes re-deriving metadata that the API calls need but that is not supplied in the CSV files.
Load the transformed data into my database.
Although this process could be done for each database chosen, it made sense to me to only do it fully for the Postgres database, as it serves as the ultimate 'source of truth'. Conversion methods written later to push data back and forth between the two databases could then be used to generate whatever is needed in MongoDB, either on the fly or as pre-cached data.
Extract
Initially, I performed ETL with a stream reading from the CSV file, converting rows to JSON, and writing straight to the database. Although this worked fine for small sets of data, the rate was too much for Sequelize when processing large files, and the process quickly stalled out.
I broke the process up to better calibrate the load on Sequelize. The stream is now mostly used to build up a cache of data and requests that can be held in memory and emptied at a slower rate. This system consists of asynchronous caches (with up to 100 running at a time) that each execute a synchronous queue of work (up to 10,000 entries each).
In the event that this isn't enough, I can always add some system monitoring or another algorithm to either cache holder that adjusts the rate at which queued callbacks are invoked. A placeholder method is already in use, so stubbing in such behavior would be easy within the existing system.
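A simplified sketch of that batching approach (the model name, the assumption of a simple unquoted CSV, and the naive comma split are placeholders; the batch sizes mirror the numbers above):

const fs = require('fs');
const readline = require('readline');
const { Review } = require('./models'); // hypothetical Sequelize model

const BATCH_SIZE = 10000;   // rows handled per synchronous queue
const MAX_IN_FLIGHT = 100;  // asynchronous batches allowed at once

async function loadCsv(path) {
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  const inFlight = new Set();
  let batch = [];

  for await (const line of rl) {
    const [id, productId, rating, body] = line.split(','); // naive parse, for illustration only
    batch.push({ id, productId, rating, body });

    if (batch.length >= BATCH_SIZE) {
      const rows = batch;
      batch = [];
      const pending = Review.bulkCreate(rows).finally(() => inFlight.delete(pending));
      inFlight.add(pending);
      // Throttle: once the cap is hit, wait for a slot instead of piling work onto Sequelize.
      if (inFlight.size >= MAX_IN_FLIGHT) await Promise.race(inFlight);
    }
  }

  if (batch.length) await Review.bulkCreate(batch);
  await Promise.all(inFlight);
}

module.exports = { loadCsv };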
// Debugging a NODE_ENV check that was unexpectedly failing:
console.log('Environment:', process.env.NODE_ENV);
console.log(process.env.NODE_ENV === 'development'); // false
console.log(`Expect ${'development'.length}:`, process.env.NODE_ENV.length); // 11 (including a space at the end)
Now that the database is up and running, I need to get some basic metrics of how fast my read queries are.
Checking my test database will be helpful for establishing a baseline metric for database speed in its most efficient case: a simple operation, a single request, on a small database of a consistent and predictable form.
I also need basic metrics of how fast my read queries are in a situation closer to real life, so I will run the same tests on my production database. This will help establish a baseline for database speed in the same simple, single-request case, but now on a large database with a less predictable form.
Databases Tested
Test database (small # of entries, known state, faster runs)
Production database (large # of entries, unknown state, slower runs)
Query times are checked through the server APIs, and will be re-run after each improvement is made so as to see the effects of the various improvements.
Categories Tested
For each database, I will look up a few records via API paths that tap into each of the following categories:
Review
Product Metadata
Product Reviews
For the test database, I might as well get the most efficient responses to establish a best-case baseline, so I will look up the earliest few records for each of the categories.
For the production-scale database, I might as well get the least efficient responses to establish a worst-case baseline, so I will first count how many records each table holds and then look up the last few records for each of the categories.
Review: 5,765,293 records
Product: 1,000,011 records
Metadata: records
Note that I am only including tables that are likely to be used in the API. Basic tables such as products, users, characteristics, etc. are likely to only be obtained IN COMBINATION with other tables.
DBMS Performance Testing
Goals & Motivations
Now that the databases are fully formed and connected to the server with API routes, it is time to test the speed of various database queries in isolation.
Actions Taken
For each API route, a request was made roughly 10 times using Postman, and the total time from request to response was recorded and averaged.
The following images show sample Postman output for the API paths tested on the test database:
Review Routes
Product Reviews Routes
Product Reviews Metadata Routes
Service Stress Testing
Goals & Motivations
Another testing goal is local stress testing of the databases to use as a quantitative basis for improvements. These quantitative metrics were RPS (requests per second), latency, and error rate.
Actions Taken
K6 was chosen as my testing framework for stress testing the database. I first set it up and learned the ins and outs of the system by running and calibrating it on my test database. This seemed like a good idea as I had fewer variables to consider and could focus purely on the inputs and outputs of all of my database route paths. I could test reads, writes, and updates without worrying about contaminating my larger database, and the iterations while experimenting with the system were fast. Additionally, since my test database is created fresh and populated with seed data automatically as a start point before any test, it can be easy to automate this and test predictable database states.
Tests were done with a constant request load sustained over time, as well as a ramping request load rate. The targeted test metrics are as follows:
Constant Request Rates (held for 1 min)
TBD (image of test config)
Ramping Request Rates (each held for 1 min)
TBD (image of test config)
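As a sketch of what these two k6 configurations look like (the route, port, and the specific rates and durations are placeholder values until the real configs are captured above):

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    // Constant load: hold a fixed arrival rate for 1 minute.
    constant_load: {
      executor: 'constant-arrival-rate',
      rate: 100,              // requests per second (placeholder)
      timeUnit: '1s',
      duration: '1m',
      preAllocatedVUs: 200,
    },
    // Ramping load: step the arrival rate up, holding each level for 1 minute.
    ramping_load: {
      executor: 'ramping-arrival-rate',
      startRate: 1,
      timeUnit: '1s',
      preAllocatedVUs: 500,
      startTime: '1m',        // begin after the constant scenario finishes
      stages: [
        { target: 10, duration: '1m' },
        { target: 100, duration: '1m' },
        { target: 1000, duration: '1m' },
      ],
    },
  },
};

export default function () {
  const res = http.get('http://localhost:3000/reviews/1'); // placeholder route
  check(res, { 'status is 200': (r) => r.status === 200 });
}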
For each type of loading, the service was tested for the following cases, if possible:
In order to maintain good practices, I am still avoiding touching the 'production' database created from ETL, and instead am cloning the database through dump and restore commands.
Actions Taken
The development database generated from ETL was copied to a new database by running the following dump command:
pg_dump -U postgres -d sourcedb -f sourcedb.sql
and then the following restore command to generate the database to stress test on a fresh instance:
psql -U postgres -d targetdb -f sourcedb.sql
It is worth noting that while ETL took a VERY long time to run, these commands, which tap into native methods and lower-level implementations, were much faster. For example, to copy the entire 5.1 GB of CSV data:
Dump Time: 31 seconds
*.SQL File Size: 3.7 GB
Load Time: 6 min
Once this was completed, I could then take my existing testing setups and merely point them to the new database location to run the desired tests.
Achievements
A new database was deployed and tested, allowing me to not only get current metrics for a database, but have a nice A/B comparison between a smaller and larger database based on the same schema.
I will run these tests after each improvement to the database is made in order to more easily see where and how much each improvement made a difference. The improvements I will make for Postgres are as follows:
Indexing - Many queries are using the public IDs in order to do database lookups. These are currently not indexed.
Table joins - Current queries are chaining promises to find data in tables based on prior query lookup results
Max Connections - 400
Shared Buffer - 250 MB
Pool max - 400
Goal from Improvements
The target read speed is for all queries to return in under 50 ms, but ideally as quickly as 10 ms.
The service is expected to reach 1000 rps with under a 1% error rate and under 2000 ms latency.
Overall Results Observed - Test Database
IDs chosen are meant to include a range of results that may affect times, such as one-to-many and many-to-many relations (e.g. reviews with characteristics or photos). I chose IDs covering a range of such states.
The following are the queries and resulting execution times on the test-scale database initially and after each improvement:
The following are the initial stress test results on the test-scale service for constant load:
The following are the final stress test results on the test-scale service for constant load:
The following are the initial stress test results on the test-scale service for ramped load:
The following are the final stress test results on the test-scale service for ramped load:
IDs chosen are meant to include a range of results that may affect times, such as one-to-many and many-to-many relations (e.g. reviews with characteristics or photos). In this 'worst case' set of tests, I aimed to have all IDs (from low to high within the total range in the DB) include table joins with BOTH at least one characteristic and at least one photo.
Some interesting general notes: the first query always takes substantially longer than the rest (by about a factor of 5), as if the database needs to warm up. Additionally, if the same query is repeated within a short enough period of time, even with other queries in between, the time seems to drop in a converging pattern, as if some sort of caching system is coming into play. This is apparent when viewing each of the timed DB queries. These two factors (# of queries since the first one, # of repeats of the same query) have a MUCH larger effect than how high of an ID number is queried within the database.
The following are the queries and resulting execution times on the production-scale database initially and after each improvement:
The following are the stress test results on the production-scale service for constant load comparing the initial to final states, with key improvement steps noted:
The following are the stress test results on the production-scale service for ramping load comparing the initial to final states, with key improvement steps noted:
Results Impacts
Reflections
Indexing resulted in the greatest improvement, reducing query times by 55.94% to 99.87%. Table joins also helped, but to a lesser extent, reducing query times by up to 65.1%.
Increasing max connections improved the speed and reliability of queries doing less work and returning less data, but it also produced more errors as the connections became overwhelmed, which caused a higher failure rate and inconsistent results; some queries took a long time or never finished when this occurred. Increasing the max buffer helped somewhat, but the results were still rather mixed. This shows the importance of carefully tuning the database configuration, as increasing capacities may not lead to the desired improvements.
The following are the queries and resulting execution times on the test-scale database initially:
The following are the initial stress test results on the test-scale service for constant load:
The following are the initial stress test results on the test-scale service for ramped load:
Results Observed - Production-Scale Database
The following are the queries and resulting execution times on the production-scale database initially:
The following are the initial stress test results on the production-scale service for constant load:
The following are the initial stress test results on the production-scale service for ramped load:
(Missed this before I moved on to indexing foreign keys. Oops!)
Reflections
Some interesting general notes: the first query always takes substantially longer than the rest (by about a factor of 5), as if the database needs to warm up. Additionally, if the same query is repeated within a short enough period of time, even with other queries in between, the time seems to drop in a converging pattern, as if some sort of caching system is coming into play. This is apparent when viewing each of the timed DB queries. These two factors (# of queries since the first one, # of repeats of the same query) have a MUCH larger effect than how high of an ID number is queried within the database.
Also, when the number of entries in the database increases from a few dozen to millions, the slower queries end up flipping (as expected) to the ones that require more lookups and joins.
Additionally, what look like reasonable times on the test database end up clearly over the 10-50 ms target limit for queries on the full-scale database. This shows the importance of testing database design with realistically scaled data, in addition to just the mechanical workings of the schema and queries.
The first bit of low-hanging fruit for improving the database is to index strategically. In the tables, there is a public set of IDs which are NOT used as the primary keys, since they might change. Since the API relies on these public IDs, these non-indexed keys are ALWAYS the starting point for any request, rather than the indexed primary keys.
Additionally, it turns out that although I took care to use primary keys as foreign key references within the database, Postgres DOES NOT automatically index foreign key columns on the referencing side, only where the keys are defined. This means that any 1:m and n:m tables need their foreign key columns indexed. This article explains the issue well.
Actions Taken
I indexed the following keys in the Postgres database, either in the table definitions (per the test database case), or by modifying the existing database (per the production database case where the existing data must be preserved):
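As an illustration of the two approaches (the table and column names here are hypothetical stand-ins, not the actual list of keys), with Sequelize an index can either live in the model definition or be added to an existing table through the query interface:

const { Sequelize, DataTypes } = require('sequelize');
const sequelize = new Sequelize('postgres://user:password@localhost:5432/reviews'); // placeholder

// Test database case: declare the indexes in the model definition.
const Review = sequelize.define('review', {
  id: { type: DataTypes.INTEGER, primaryKey: true, autoIncrement: true },
  review_id: { type: DataTypes.INTEGER },  // public-facing ID used by the API (hypothetical)
  product_id: { type: DataTypes.INTEGER }, // foreign key on the referencing side (hypothetical)
}, {
  indexes: [
    { fields: ['review_id'] },
    { fields: ['product_id'] },
  ],
});

// Production database case: add indexes to the existing tables without touching the data.
async function addIndexes() {
  const qi = sequelize.getQueryInterface();
  await qi.addIndex('reviews', ['review_id']);
  await qi.addIndex('reviews', ['product_id']);
}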
Results Observed
The following are the queries and resulting execution times on the test-scale database:
The following are the queries and resulting execution times on the production-scale database:
The following are the stress test results on the test-scale service for constant load:
The following are the stress test results on the test-scale service for ramped load:
The following are the stress test results on the production-scale service for constant load:
The following are the stress test results on the production-scale service for ramped load:
Postgres Read Queries - Performance and Stress Testing - Max Database Connections Results
The Postgres database on my local computer has a default max connection of 100. I chose to increase it initially to 400. It sounded like some calibration would be needed to find the optimal number of connections, so I decided to do the following:
Run a constant load in K6 to note completion % and latency
Do this for a constant load of 1 minute at 1 rps, 10 rps, 100 rps & 1000 rps in order to see improvements and to see where the 'elbow' in efficiency rounds off. This was first done on a single review for a faster/simpler case, and then re-run on a product reviews path for the maximum complexity of a read, since that involves many table joins across 1:m and n:m tables on many reviews.
Once confidence was gained that 400 is a reasonable starting point, max connections would be increased by doubling, testing only the largest rps load reached on the product reviews path. A binary-search strategy will be used: an iteration or two of these large increases are checked until improvement decreases or flatlines, then a max connections value between that state and the last best one is tested. This is repeated a couple of times until a satisfactory setting is reached.
Results of First K6 Run on Review Reads after Max Connections Increased to 400
As expected, at low request volumes this change had no noticeable effect on latency, but once volume reached the levels that previously crashed tests, there was a noticeable improvement, and a higher rps load could be reached.
It is also worth noting that there were more timeouts, occurring earlier in the runtime, which makes sense if more requests are getting through and triggering another bottleneck further down the pipeline.
Once I worked out that the ideal max_connections seemed to be about 400, I re-ran the tests while adjusting pool max, which is a Sequelize setting. It was a bit unclear to me exactly how it differs, apart from throttling requests somewhat by capping the number of connections each app instance can hold open.
The default max pool connection is 5, so for a large jump I first tried increasing it to the current max_connections.
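A sketch of that change in the Sequelize connection options (the connection string and the secondary pool values are placeholders):

const { Sequelize } = require('sequelize');

const sequelize = new Sequelize('postgres://user:password@localhost:5432/reviews', {
  pool: {
    max: 400,       // raised from the default of 5 to match max_connections
    min: 0,
    acquire: 30000, // ms to wait for a free connection before erroring
    idle: 10000,    // ms an unused connection may sit before being released
  },
});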
For the choice of 250 MB: the AWS server has 1 GB of RAM.
https://www.postgresql.org/docs/9.1/runtime-config-resource.html
'If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system.'
Goals & Motivations
Actions Taken
Results Observed
The following are the queries and resulting execution times on the test-scale database:
The following are the queries and resulting execution times on the production-scale database:
The following are the stress test results on the test-scale service for constant load:
The following are the stress test results on the test-scale service for ramped load:
The following are the stress test results on the production-scale service for constant load:
The following are the stress test results on the production-scale service for ramped load:
Postgres Read Queries - Performance and Stress Testing - Max Pool Connections Results
Is this the same as max_connections?
https://betterprogramming.pub/how-to-avoid-bottlenecks-in-node-js-applications-8085d86b6b2e
For the choice of 400 max pool:
'The downside of this approach is that when the connection pool limit is reached, any new request will have to wait for a connection to be released by the previous ones.
This is why the pool size should be chosen carefully. If your data is not very large or you do not expect many simultaneous users, you might not encounter any problems. Otherwise, this may easily become the bottleneck of your application.
...
The most important aspects to take into account in order to find the right value are the expected application load and concurrency level.
...
By default, this pool is limited to 5
...
When max_connections is reached, the server rejects any other connection — as opposed to what happens at the application layer, where requests are queued.
Finding a good value for max_connections is complex'
Effective Cache size
'Setting effective_cache_size to 1/2 of total memory would be a normal conservative setting,'
Goals & Motivations
Actions Taken
Results Observed
The following are the queries and resulting execution times on the test-scale database:
The following are the queries and resulting execution times on the production-scale database:
The following are the stress test results on the test-scale service for constant load:
The following are the stress test results on the test-scale service for ramped load:
The following are the stress test results on the production-scale service for constant load:
The following are the stress test results on the production-scale service for ramped load:
I had to decide whether and how to enforce authentication, and the methodology for doing so. One option was whether to use the default global admin profile or to generate a new profile with the minimum necessary privileges. I chose to create a new profile, as customizations to it were then less likely to cause system issues that are hard to reverse. Additionally, it provided another layer of security, as that user only had the specific privileges needed to CRUD within the database. In the event that it was compromised, a hacker couldn't drop the database, create new tables, etc.
===
I needed to securely open access to the database running on my EC2 instance, in order to run databases on separate servers from my central Express API app, without allowing anyone who has the IP address to perform database CRUD operations.
This was achieved by first defining a role/user that requires a password. Next, that user was granted only specific privileges on a specific database; this limited role is the profile through which my Express app controls the database. Finally, once this secure access was created, I edited the Postgres config to allow connections from any IP address, but also to require every connection to supply a password with md5 authentication.
This lets me securely access a database stored on a separate EC2 instance from my Express app, since a user with limited privileges and an encrypted password must be supplied in order to interact with the database. Moving the database away from the Express app is the first step in scaling the database horizontally by running multiple database instances, rather than a single one on the same server as the Express app.
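A sketch of that one-time setup as a Node script run with superuser credentials (the role name, passwords, and database name are placeholders; the actual GRANT list should match what the app truly needs):

const { Sequelize } = require('sequelize');

// Connect as the superuser only for this one-time setup step.
const admin = new Sequelize('postgres://postgres:adminpassword@localhost:5432/reviews');

async function createLimitedRole() {
  await admin.query("CREATE ROLE api_user WITH LOGIN PASSWORD 'app-password'");
  await admin.query('GRANT CONNECT ON DATABASE reviews TO api_user');
  // CRUD only: no ability to create or drop tables, or drop the database.
  await admin.query('GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO api_user');
  await admin.query('GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO api_user');
}

createLimitedRole().then(() => admin.close());

The pg_hba.conf change described above would then amount to a host rule along the lines of: host all all 0.0.0.0/0 md5, so connections from any address are accepted but only with password authentication.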
Server Deployment - Postgres - Performance and Stress Testing - Initial
Loader IO was used.
All getProductReview tests completed predictably, even though this API route usually takes longer. Meanwhile, the getReview route had issues at 100 rps and 1000 rps. Some tests behaved strangely: halfway through a run, such as at 100 rps, the RPS increased to double what was specified and the system stalled out. Making a new, identical test would exhibit different behavior, such as low response rates initially, then a spike, then a crash. For 1000 rps on getReview, it crashed while ramping up.
The following key takeaway is a sample question that I could have answered better. I ultimately answered it, but indirectly and with some nudging needed. I need to catch on to the key point and not get distracted with the details.
We are a software company looking for someone to come in and help us rework our back end to better scale with our app. If we hired you for this, how would you go about helping us expand our system?
I would first make sure I understand the existing system and run various metrics-based tests on its current behavior. I should learn where it performs well, where the bottlenecks are, what the system limitations are, and how those stack up against the company's demands for the system. Only after this is done should I begin thinking about how to improve their system.
These two terms have very different connotations to interviewers, and I should be careful which one I use where. Microservice indicates that I am working full stack and will have more limits to my database knowledge. Database indicates that I am looking to specialize on the back end and should be expected to know the ins & outs of databases to a much higher level.
For example, I had limits to what I could do and understand when optimizing databases through system configurations such as max connections and buffer allotment. This is a more difficult and indeterminate problem that must be calibrated to each OS and hardware setup, and it requires a depth of knowledge in those areas that I do not have and am not currently seeking out.
Putting Myself Down
I mentioned my failures on the project (failing to finish Mongo, not finishing cloud testing yet), but with the intent of explaining why I ended up in a different place than expected, or why some knowledge is partial. I figured it was better to be up front about it and give context to answers, but it still came out more negative than it needed to. Leslie had some ideas for how I could achieve those ends better.
Mongo/Postgres
For Mongo/Postgres, my intent was to interact much more with Mongo for main traffic, but I never did because two databases were too much. So I had designed a system for two, implemented only one, and the one I implemented, while more fundamental and important to get going first, was of secondary interest.
I could instead say that:
Mongo was designed, but Postgres was implemented first for the reasons I did so. No need to mention Mongo intentions.
The Postgres implementation & optimization alone met the requirements, so there was no need to build out Mongo, but it is an option if needed in the future.
Cloud Testing
Just mention what I did, not what I didn't do. Again, bring up intentions as plans that could be fleshed out in the future, otherwise, say less and say constructively that 'I don't know' if asked (see below). While context is handy, one problem with bringing up extra material is that interviewers tend to drill down (e.g. ask 'why' 3 times to get to the real answer), so bringing it up will likely induce further questioning.
When I Don't Know an Answer
Talk. More. Slowly. To. Buy. Time... ;-P
Also, apparently when I am reasoning out loud, it sounds too confident as if I know the answer (or think I do) and if it is wrong, then it comes across as bullshitting. A different approach would be to say things like 'that is an interesting problem, let me think about it...', 'how about' and reason through more basic setup steps as well, similar to if I were whiteboarding.
Some Technical Strategies/Answers to Remember
Caching Layer Update Strategies
Whether for Redis, or for using Mongo as a caching database, these methods would be handled in a similar way.
Reads are rather straightforward: Store the data in the caching layer with an associated key (e.g. API request), and for every read, check for the existence of this key first and use the cached data if it exists.
CUD is trickier.
I had already worked out that for an update, you write to Postgres first, and if it is successful, update the caches on the return. If changing a single piece of data, this is straightforward enough, but what if I change a user name, which is associated with many reviews? In that case, there is much more to update (or invalidate for a fresh cache) on the return.
I had already reasoned through having an active async process pushing writes from the cached DBs to Postgres (or to a primary DB that is cloned for load-balanced servers). For our app, I reasoned that data accuracy was OK to lag a bit in the areas that change often. This same strategy can be used for updating caches more directly from CUD operations: the user already has the (theoretically) updated state and doesn't need to receive anything back, yet the return response sits on the critical path.
The response can be handled by removing the cache work from the critical path. The flags for updating caches merely trigger an async update operation, and the handler continues on back to the client. The update operation can do the cache updates on a different part of the server (or a different server) to avoid bogging down the allotted system resources.
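A sketch of what that looks like in an Express handler (the route, model names, and fields are hypothetical, reusing the Mongoose/Sequelize split described earlier):

const express = require('express');
const ReviewCache = require('./models/mongo/reviewCache'); // hypothetical Mongoose model
const User = require('./models/postgres/user');            // hypothetical Sequelize model

const app = express();
app.use(express.json());

app.put('/users/:id/name', async (req, res) => {
  try {
    // Write to Postgres (the source of truth) first.
    await User.update({ name: req.body.name }, { where: { id: req.params.id } });

    // Respond immediately; the client already has the new state.
    res.sendStatus(204);

    // Off the critical path: invalidate any cached reviews tied to this user.
    ReviewCache.deleteMany({ user_id: req.params.id })
      .catch((err) => console.error('cache invalidation failed:', err));
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});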
Round Robin vs. Weighted Load Balancer Strategies
If you have a load balancer handling four identical servers, go Round Robin, as there is no penalty for switching between them. However, if one server handled by the load balancer is itself a load balancer to other servers, use a weighted strategy that weights the child balancer more heavily. This taps more evenly into ALL of the servers that can handle API requests, since the child balancer can then use Round Robin across its own child servers. See the toy sketch below.
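As a toy illustration of the weighting idea (hosts and weights are made up): if one target is itself a balancer fronting three child servers, giving it three slots in the rotation keeps every leaf server receiving a similar share of requests.

// Simple weighted round-robin: expand each target by its weight, then rotate through the list.
const targets = [
  { host: 'app-1', weight: 1 },
  { host: 'app-2', weight: 1 },
  { host: 'app-3', weight: 1 },
  { host: 'child-balancer', weight: 3 }, // itself balances three more app servers
];

const rotation = targets.flatMap((t) => Array(t.weight).fill(t.host));
let next = 0;
const nextTarget = () => rotation[next++ % rotation.length];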