This gist details the decisions I made, the tests I conducted, the observations I recorded, and the directions I took while building out an API for Project Greenfield.
Today was the first day of the project. I spent the morning picking a Front End Capstone (FEC) project for the client and performing the project's initial setup.
Given the challenge of having to rebuild the Reviews portion of the overall api for the Greenfield Project (codenamed 'Kartify'), my first step was to choose which Database Management System (DBMS) and Relational DBMS (RDBMS) would be best for storing the data provided by the old api. The RDBMS will be used primarily throughout the project; however, the new api may be refactored to utilize the NoSQL DBMS as a learning experience, given enough time.
To start, I researched the most popular SQL and NoSQL databases (PostgreSQL, Neo4j, MongoDB, etc.). Upon understanding the general concepts of each, I designed a schema to visualize which sort of implementation would best suit my data: https://imgur.com/ENEJK3f
I found that designing a schema in a SQL-esque relational manner was easiest, so I decided to use PostgreSQL, which my research had shown to be a very popular RDBMS. As for a NoSQL DBMS, I decided to go with Cassandra, mainly for its high scalability and flexible data storage.
Given the requirement of choosing an alternative, NoSQL DBMS for the api, I had to pick a secondary DBMS that would fit the project's needs. I ended up choosing Cassandra due to its comparative advantage in scalability thanks to clusters, as well as its use of Bloom filters to reduce the time required for expensive I/O operations. Now that I've chosen a NoSQL DBMS, I must research its architecture in depth and design a schema.
I researched Cassandra's architecture in depth via documentation and various third-party articles. After acquiring enough knowledge of how it works and what it looks like, I began diagramming a schema on a whiteboard, then switched to Google Drawings.
I completed my Cassandra schema in Google Drawings. I created a column family for every table that I had in my SQL schema. Due to the lack of many-to-many relationships in my SQL schema, the data mapped over easily: https://imgur.com/8v3HEZG
For the second day of the project, I focused on attempting to seed my database with data from the old api, but ended up pivoting to containerizing my application and database with docker.
After choosing Postgres as my RDBMS, my next step was to actually initialize it with the data provided by the old api. The data is available within three separate CSV files from which it must be extracted and loaded into the database.
I installed Postgres and wrote a schema derived from my schema design. I realized that I couldn't load schemas in the same manner as I was used to with MySQL, so I resorted to copy/pasting commands from my schema file into the postgres cli to create the database with the necessary tables.
I successfully created a schema which I used to initialize my postgres database. However, I decided to defer loading data into it until I had containerized my application thus far (database and express server) with docker.
Since I knew that I would need to containerize my application before deployment, I decided to begin setting up docker before getting too deep into development. Specifically, I wanted to containerize my postgres database before I loaded any data into it, so that I wouldn't have to waste time loading the same data once again when I re-initialized it inside of a docker container.
I began by going through the 'Get Started' section of the docker docs, where I read the 'Docker Overview' and then completed the 'Quickstart' tutorial. Then I finished setting up my server before starting the process of containerizing my application. Next, I wrote two files:
- Dockerfile - builds an image for my express server
- docker-compose.yml - instructs docker to start a container for Postgres (official image via Docker Hub) and a container for my express server (image that I built)
To make my image accessible to my docker-compose.yml file, I had to create a Docker Hub repo for the image that I built (alecchampaign/server) then push the built image to the repo.
When I decided to test my docker setup to make sure that my containers were working as expected, I realized that I didn't know how to connect to my DB service. I found out that I had to edit my docker-compose.yml to map the port for postgres and run the following command on my localhost to connect to the container:
psql -h localhost -U admin -d reviews_api
I successfully containerized both my server and my postgres database!
Now that my application was containerized, my next step was to seed my database with data from the old api.
Since my database is running in a container, I needed to map a volume on my local file system to a volume in my container. That way, I could easily reference the CSVs containing the old api data from within the database container during the ETL process.
To preface, I read a great article by Orhan Yarar on docker volumes where he explained how to map a volume to a container via docker-compose. I went down a couple of rabbit holes when trying to figure out why docker-compose was erroring out whenever I ran docker-compose up. As it turns out, the nesting of my volumes block was incorrect. After fixing the indentation, docker-compose up ran without any errors.
I was able to map the volume containing the seed data on my local machine to the docker container containing my postgres database!
With the volume containing the initial data mapped to my database container, my next step was to begin the Extract, Transform, Load (ETL) process to seed my database with data from the old api.
I began by installing a useful webapp called pgAdmin which provides me with a GUI tool for interacting with postgres databases. I started pgAdmin and connected it to my database running within a docker container. Next, I ran psql -h localhost -U admin -d reviews_api -f schema.sql in my terminal to load a schema into my reviews_api database. I then verified that my schema loaded successfully by inspecting it in pgAdmin.
The next step was to seed my data by utilizing pgAdmin's import feature, which allowed me to import a file into the database by specifying its type (CSV) and the delimiter (a comma in this case). I encountered a problem where the order of the columns in my schema was different from the order of the headers in the CSV file, resulting in the data being inserted in the incorrect order. To fix this, I had to refactor my schema's columns to match the order of the headers in the CSV files.
Another issue I encountered was messy data, wherein different data types were used within different entries for the same columns. I fixed this manually by changing the data types of bad entries (i.e. 'false' to '0').
Perhaps the largest challenge I encountered was cripplingly slow load times when seeding my Ratings table with data from the api. My hypothesis was that the table was taking especially long to seed since it had multiple foreign keys referencing tables which already had large amounts of data. To remedy this, I indexed each foreign key in my schema. This made a huge performance impact, as I was able to seed the table in under 400 seconds versus the 3,000+ seconds it took earlier.
I successfully refactored my schema to accept data from the api and seeded my database with said data. My database was optimized for performance in the process.
Today's goal is to connect my express server to my database and fix an issue where my database is not persisting data despite having a volume mapped.
After having verified that both my database and server function as expected in isolation (aside from the fact that my database is not persisting data after the docker container is restarted), I needed to connect the two so that the server will be able to respond to requests with data and write data to the database.
I began by installing a node module called 'pg-promise' which would allow me to make a connection to my postgres database and consume said connection in my server via a promise. However, I ended up switching to a separate, more well-known module called 'pg' after discovering that pg-promise wasn't working as I expected and was an overall pain to use. I then set up a database connection in a separate db.js file with the correct postgres configuration and exported said connection to my server, where I set a timeout that waited for the postgres docker container to be up and running before attempting to connect. Had I not set a timeout, I would have encountered a well-known issue where the server would continue to throw an error stating that it failed to connect on subsequent requests, even if it had connected successfully.
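For reference, a minimal sketch of what this kind of db.js can look like, assuming the compose service is named db and reusing the admin user and reviews_api database from earlier (the delay value is arbitrary):

```js
const { Pool } = require('pg');

// Connection details for the postgres container; 'db' assumes that's the
// service name in docker-compose.yml, and the credentials mirror the ones
// used with psql earlier.
const pool = new Pool({
  host: 'db',
  port: 5432,
  user: 'admin',
  database: 'reviews_api',
});

// Wait a few seconds for the postgres container to finish starting up
// before attempting the first connection.
const connect = () =>
  new Promise((resolve) => setTimeout(resolve, 5000)).then(() => pool.connect());

module.exports = { pool, connect };
```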
My database and server were successfully connected, and as such my server is now ready to serve requests.
My database still lacked the ability to persist data after container restarts, which introduced a major inconvenience. Every time I restarted my container I had to re-seed my database, and since I'm dealing with a very large dataset, this process took a considerable amount of time. As such, my next priority was to persist this data across container reloads so that I wouldn't need to waste any more time re-seeding my database.
Before my refactor, I was manually re-seeding my database by loading in data from a dump.sql file which contained a manual backup of my database generated via the pg_dump utility. I decided to abandon this approach entirely by mapping a volume from the default location where postgres saves a database's data (/var/lib/postgresql/data) to a location on my file system. I also scrapped my previous volume binding, as I realized that it wasn't actually serving any purpose.
My database now persists data across container restarts!
With the database fully-functional, the next step was to begin setting up API routes.
I needed to set up a testing suite so that I could be sure that my API was responding to requests according to whatever params/queries were attached. I decided to write (or at least stub out) integration tests ahead of actually coding the functionality on the server, so as to follow TDD best practices.
I started by installing mocha/chai and an API testing framework called 'chakram' which is built on top of the former. I immediately faced a challenge where I was unable to get tests to wait for promises to resolve before evaluating assertions, which I mitigated by returning the promises from my tests; mocha interprets a returned promise by waiting for it to resolve before evaluating. Once I saw that mocha had the ability to stall evaluation, I realized that there was no need to use an API testing framework (chakram).
Another issue I faced was getting mocha to wait long enough for a response, since it took my server a while to query my database. This can be fixed either by adding a .timeout(SOME_AMOUNT_OF_TIME) method invocation on each it() call or by simply running mocha with a --timeout SOME_AMOUNT_OF_TIME flag.
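As a rough illustration of the pattern (shown with axios, which I settled on later; the port, URL, and response shape are placeholders):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('GET /reviews/:product_id/list', function () {
  // Give the server extra time, since the underlying db queries can be slow.
  this.timeout(10000);

  it('responds with a list of reviews', function () {
    // Returning the promise tells mocha to wait for it to resolve before
    // evaluating the assertions.
    return axios.get('http://localhost:3000/reviews/5/list').then((res) => {
      expect(res.status).to.equal(200);
      expect(res.data).to.be.an('object');
    });
  });
});
```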
I was able to mostly write out the entire test suite for my /reviews/:product_id/list endpoint. Some tests regarding the sort query are only stubs, since I hadn't figured out exactly what my results should look like.
With my tests stubbed out, I was able to start programming my server to handle requests to my /reviews/:product_id/list endpoint, which would allow users to receive a list of reviews for a given product.
I started by making a database query with a template literal that inserted values from queries on the request. Once the db query resolved, I returned the data to the user. The largest challenge I faced was getting the photos for each review, as this required me to make a query to another table, photos, for each review in the .then() of the first db query to reviews. Due to growing asynchronous complexity, my server was sending a response before waiting for the photos queries to resolve. I solved this problem by fetching the photos for each review inside of a promise, which resolved only once a query to photos resolved. Once this promise resolved, the response was sent.
I was able to serve many of the types of request to the /reviews/:product_id/list endpoint which I needed to handle, but I was not able to finish writing out the remaining logic for sorting reviews based on the sort request query.
With my first endpoint's test suite stubbed and my server-side logic started, I needed to finish writing the rest of the server logic.
Although I wrote out much of my server logic for handling requests to the /reviews/:product_id/list endpoint, I still needed to handle requests containing a sort query so that users could sort reviews by newest, relevance, and helpfulness.
I realized that I had made a mistake in my previous fix for the problem where my server would send responses to requests before waiting for db queries to photos to resolve, resulting in many reviews not having photos. Although I was querying photos for each review, apparently my promise resolved as soon as the query for the first review completed.
After much trial and error, I solved this problem by collecting each query to photos in an array and resolving the outer promise only once every query in that array had resolved, via a call to Promise.all(). Only once that promise resolved would the response be sent, forcing the server to wait until all queries to photos had resolved before sending a response.
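In sketch form, assuming the pg pool exported from db.js and illustrative column names, the fix looked roughly like this:

```js
// pool is the pg Pool exported from db.js (path assumed for illustration).
const { pool } = require('./db');

// Kick off a photos query for every review, collect the promises, and only
// resolve once Promise.all has resolved every one of them.
const attachPhotos = (reviews) => {
  const photoQueries = reviews.map((review) =>
    pool
      .query('SELECT id, url FROM photos WHERE review_id = $1', [review.id])
      .then(({ rows }) => ({ ...review, photos: rows }))
  );
  return Promise.all(photoQueries);
};

// In the route handler: the response is only sent after every photos query resolves.
// attachPhotos(reviews).then((reviewsWithPhotos) => res.send(reviewsWithPhotos));
```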
Another issue I faced was that I was sending data in the response that was received directly from the database, which contained some information that I didn't necessarily want to send back to the user (such as a reviewer's email address). I overcame this by instead sending back a new object whose properties were scraped from the data received from the db query. This way, I was in full control over exactly what data was sent back to the user. I also found that I needed to clean up some of the data, as responses expected values such as an empty string ('') for empty fields, but the database sent back 'null' or null instead.
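Something along these lines, with illustrative field names rather than my exact schema:

```js
// Normalize null-ish db values to the empty string the response expects.
const clean = (value) => (value === null || value === 'null' ? '' : value);

// Build the response object explicitly instead of passing the db row straight
// through: sensitive fields like the reviewer's email are simply never copied.
const toResponseReview = (row) => ({
  review_id: row.id,
  rating: row.rating,
  summary: clean(row.summary),
  body: clean(row.body),
  recommend: row.recommended,
  // row.email is intentionally omitted.
});
```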
Lastly, I completed the logic for the /reviews/:product_id/list endpoint by sorting reviews according to a sort request query, which could contain the values newest, relevance, or helpfulness.
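One way to express that mapping (the column names and the exact weighting for relevance here are illustrative, not a spec):

```js
// Map each supported sort value onto an ORDER BY clause for the reviews query.
const orderByForSort = (sort) => {
  switch (sort) {
    case 'newest':
      return 'ORDER BY date DESC';
    case 'helpfulness':
      return 'ORDER BY helpfulness DESC';
    case 'relevance':
    default:
      return 'ORDER BY helpfulness DESC, date DESC';
  }
};
```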
I was able to finish writing my server-side logic to handle any kind of GET request to the /reviews/:product_id/list endpoint.
The next endpoint I needed to handle GET requests for was /reviews/:product_id/meta, so that the client could fetch metadata regarding all the reviews for a given product (i.e. total ratings, how many reviews recommended the product versus how many didn't, etc.). In order to follow TDD best practices, I decided to write out my integration tests before the actual server-side logic.
Once again, I used mocha/chakram (although I should just remove chakram now that I realize I don't need it). Stubbing out tests was pretty straightforward, although I did run into another minor issue regarding asynchronicity; unlike my last test suite, I only wanted to make one request and test different aspects of the same response. I solved this by using mocha's before() hook and returning a promise that made the API request before any tests ran.
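In outline (URL and assertions are placeholders):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('GET /reviews/:product_id/meta', function () {
  let response;

  // Make the request once; returning the promise makes mocha wait for it to
  // resolve before any of the tests below run.
  before(function () {
    return axios.get('http://localhost:3000/reviews/5/meta').then((res) => {
      response = res;
    });
  });

  it('responds with a 200 status code', function () {
    expect(response.status).to.equal(200);
  });

  it('includes ratings metadata', function () {
    expect(response.data).to.have.property('ratings');
  });
});
```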
My test suite for my second endpoint was finished!
The overarching theme for this day was completing the /reviews/:product_id/meta endpoint.
The next endpoint I needed to handle GET requests for was /reviews/:product_id/meta, so that the client could fetch metadata regarding all the reviews for a given product (i.e. total ratings, how many reviews recommended the product versus how many didn't, etc.).
I began by formulating a SQL query which would pull all the information from the database that I would need to fulfill the GET request. This was difficult because getting characteristic data required querying three separate tables. I came up with the following:
select recommended, review_id, value, name, rating, value, characteristic_id
from (
  select *
  from (
    select id, recommended, rating
    from reviews
    where product_id = 5
  ) as reviews
  inner join ratings on ratings.review_id = reviews.id
) as ratings
inner join characteristics as char on char.id = ratings.characteristic_id;
After designing a query came the tedious task of morphing the data into a proper response object. This involved the 'caching' of different data as I iterated over each entry to keep track of what I had already encountered while scraping the query results for data to add to the response object.
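The gist of that iteration, with illustrative field names rather than my exact response shape:

```js
// Walk the joined rows once, 'caching' which reviews have already been seen so
// each review's rating and recommended flag are only counted a single time.
const buildMeta = (rows) => {
  const meta = { ratings: {}, recommended: {}, characteristics: {} };
  const seenReviews = new Set();

  rows.forEach((row) => {
    if (!seenReviews.has(row.review_id)) {
      seenReviews.add(row.review_id);
      meta.ratings[row.rating] = (meta.ratings[row.rating] || 0) + 1;
      meta.recommended[row.recommended] = (meta.recommended[row.recommended] || 0) + 1;
    }
    // Accumulate characteristic values so averages can be computed afterwards.
    if (!meta.characteristics[row.name]) {
      meta.characteristics[row.name] = { id: row.characteristic_id, values: [] };
    }
    meta.characteristics[row.name].values.push(row.value);
  });

  return meta;
};
```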
I developed the ability for my app to handle GET requests to the /reviews/:product_id/meta endpoint!
Although I planned to develop my /reviews/:product_id endpoint to allow users to POST new reviews, I ended up going down a rabbit hole of attempting to fix a serialization problem with the tables in my database.
Knowing that I needed to handle POST requests to the /reviews/:product_id endpoint, I began by writing integration tests before setting up the actual endpoint.
Setting up the test suite for this endpoint was a similar process to setting up the test suites for the other endpoints, except this time I had to make a POST request to the endpoint which I was testing, then make a GET request to the /reviews/:product_id/list endpoint to check that my data had been successfully POSTed. I also chose to deprecate chakram and relied directly on mocha/chai for testing and axios for AJAX.
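A condensed version of that pattern (the request body and response shape are placeholders):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('POST /reviews/:product_id', function () {
  this.timeout(10000);

  it('saves a review that then shows up in the list', function () {
    // Placeholder body for a new review submission.
    const review = { rating: 5, summary: 'Great product', body: 'Would buy again', recommend: true };

    return axios
      .post('http://localhost:3000/reviews/5', review)
      .then(() => axios.get('http://localhost:3000/reviews/5/list'))
      .then((res) => {
        // Avoid assuming the exact response shape; just check the review is present.
        expect(JSON.stringify(res.data)).to.include('Great product');
      });
  });
});
```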
I designed a full test suite for ensuring that my /reviews/:product_id endpoint works as expected when I start developing it.
I needed to set up a /reviews/:product_id endpoint to handle POST requests, so that users could submit new reviews for products.
First, I realized that dates were stored in a different format in the database than the format they were posted in. To mitigate this, I simply converted the date to a yyyy/mm/dd format.
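A small helper along these lines does the trick (assuming the posted date is something Date can parse, e.g. an ISO string):

```js
// Convert a posted date into the yyyy/mm/dd format used in the database.
const toDbDate = (input) => {
  const date = new Date(input);
  const pad = (n) => String(n).padStart(2, '0');
  return `${date.getFullYear()}/${pad(date.getMonth() + 1)}/${pad(date.getDate())}`;
};
```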
Next, I decided that I needed to make a varying number of queries due to having to insert into three tables. To minimize network latency, I used a SQL TRANSACTION BLOCK to batch all of my queries into a single database query.
I also realized that I would need to insert a varying number of values into the characteristics and photos tables depending on how many photos/ratings were included in the request. To solve this issue, I iterated over the request's photos/ratings and appended a set of values to the corresponding insert for each one. Using a template literal, I was able to add these inserts into the TRANSACTION BLOCK.
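Roughly, the shape of that batched query, assuming the pg pool from db.js, default sequence names, and illustrative column names (real input would also need escaping or parameterization):

```js
// pool is the pg Pool exported from db.js (path assumed for illustration).
const { pool } = require('./db');

// Build one multi-statement query wrapped in a transaction so the reviews,
// photos, and ratings inserts travel to the database together.
const insertReview = (productId, review) => {
  const photoInserts = review.photos
    .map((url) => `INSERT INTO photos (review_id, url) VALUES (currval('reviews_id_seq'), '${url}');`)
    .join('\n');

  const ratingInserts = Object.entries(review.characteristics)
    .map(([charId, value]) =>
      `INSERT INTO ratings (review_id, characteristic_id, value) VALUES (currval('reviews_id_seq'), ${charId}, ${value});`)
    .join('\n');

  return pool.query(`
    BEGIN;
    INSERT INTO reviews (product_id, rating, summary, body, recommended)
    VALUES (${productId}, ${review.rating}, '${review.summary}', '${review.body}', ${review.recommend});
    ${photoInserts}
    ${ratingInserts}
    COMMIT;
  `);
};
```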
Lastly, I faced one more major issue: I realized that, since my data was seeded from a CSV file, the SERIAL pointer for each table was pointing at the first id by default, regardless of how many entries were already in the table. This meant that when I tried to insert new data, it would try to insert into a row which already contained data, resulting in a duplicate id conflict. I decided that I needed to solve this issue with my database before I could continue developing the endpoint.
I ran into an issue with my database where it would try to insert values into rows which already contained data, causing a duplicate id conflict. Although I developed an INSERT INTO query which I'm fairly confident will work as intended, I decided that I needed to solve the issue with my database first before continuing to develop the endpoint.
I relied on SERIAL (AUTO_INCREMENT in MySQL) functionality for inserting new data with the correct id in my database. However, since I seeded from CSV files, the pointer that keeps track of which id to insert next was attempting to insert data at ids which had already been used. To save new reviews in the database, I needed to solve this issue.
The strategy I chose was to initially load all data from the CSV files into 'temporary' table equivalents, from which I would copy every column except the id into the actual tables, which had their own SERIAL ids. Then, I would add the foreign key constraints to the actual tables and drop the temporary tables.
Since I wanted to do this all within the schema.sql file, I had to bind a volume containing the CSV files to my postgres docker container so that it had access to all the data it needed to copy.
Through a semi-complex refactor of my schema, I was able to seed my data in such a way that the SERIAL pointers pointed at the correct index.