SDC Engineering Journal

This gist details the decisions I made, the tests I conducted, the observations recorded, and the directions I took while building out an API for Project Greenfield.

Goals

Achievements

Reflections

Daily Summary

Database Design and Creation

December 30, 2019

Overview

Today was the first day of the project. I spent the morning picking a Front End Capstone (FEC) project for the client and performing the project's initial setup.

Log

11:00am - Choosing an RDBMS and DBMS for the Reviews API

Challenge/Motivation

Given the challenge of rebuilding the Reviews portion of the overall API for the Greenfield project (codenamed 'Kartify'), my first step was to choose which Database Management System (DBMS) and Relational DBMS (RDBMS) would be best for storing the data provided by the old API. The RDBMS will be the primary store throughout the project. However, given enough time, the new API may be refactored to use the NoSQL DBMS as a learning experience.

Actions Taken

To start, I researched the most popular SQL and NoSQL databases (PostgreSQL, Neo4j, MongoDB, etc.). Once I understood the general concepts of each, I designed a schema to visualize which sort of implementation would best suit my data: https://imgur.com/ENEJK3f

Results Observed

I found that designing the schema in a relational, SQL-style manner was easiest, so I decided to use PostgreSQL, which my research showed to be a very popular RDBMS. As for a NoSQL DBMS, I decided to go with Cassandra, mainly for its high scalability and flexible data storage.

5:30pm - Researching and designing schema for Cassandra database

Challenge/Motivation

Given the requirement of choosing an alternative NoSQL DBMS for the API, I had to pick a secondary DBMS that would fit the project's needs. I ended up choosing Cassandra due to its comparative advantage in scalability thanks to clusters, as well as its use of Bloom filters to reduce the time required for expensive I/O operations. Now that I had chosen a NoSQL DBMS, I needed to research its architecture in depth and design a schema.

Actions Taken

I researched Cassandra architecture in-depth via documentation and various third party articles. After acquiring enough knowledge of how it works and what it looks like, I began diagramming a schema on a whiteboard, then switched to Google Drawings.

Results Observed

I completed my Cassandra schema in Google Drawings, creating a column family for every table in my SQL schema. Because my SQL schema has no many-to-many relationships, the data mapped over easily: https://imgur.com/8v3HEZG

December 31, 2019

Overview

For the second day of the project, I focused on attempting to seed my database with data from the old API, but ended up pivoting to containerizing my application and database with Docker.

Log

10:00am - Initializing Postgres database with old API data

Challenge/Motivation

After choosing Postgres as my RDBMS, my next step was to actually initialize it with the data provided by the old API. The data is available in three separate CSV files, from which it must be extracted and loaded into the database.

Actions Taken

I installed Postgres and wrote a schema file derived from my schema design. I realized that I couldn't load schemas in the same way I was used to with MySQL, so I resorted to copying and pasting commands from my schema file into the psql CLI to create the database with the necessary tables.

Results Observed

I successfully created a schema, which I used to initialize my Postgres database. However, I decided to defer loading data into it until I had containerized my application so far (database and Express server) with Docker.

3:00pm - Set up Docker

Challenge/Motivation

Since I knew I would need to containerize my application before deployment, I decided to begin setting up Docker before getting too deep into development. Specifically, I wanted to containerize my Postgres database before loading any data into it, so that I wouldn't have to waste time loading the same data again when re-initializing it inside a Docker container.

Actions Taken

I began by going through the 'Get Started' section of the Docker docs, where I read the 'Docker Overview' and completed the 'Quickstart' tutorial. I then finished setting up my server before starting to containerize my application. Next, I wrote two files:

  1. Dockerfile - builds an image for my Express server
  2. docker-compose.yml - instructs Docker to start a container for Postgres (official image via Docker Hub) and a container for my Express server (an image that I built)

To make my image accessible to my docker-compose.yml file, I had to create a Docker Hub repo for the image that I built (alecchampaign/server) and then push the built image to the repo.

When I tested my Docker setup to make sure my containers were working as expected, I realized that I didn't know how to connect to my DB service. I found out that I had to edit my docker-compose.yml to map the port for Postgres and run the following command on my localhost to connect to the container:

psql -h localhost -U admin -d reviews_api

Results Observed

I successfully containerized both my server and my Postgres database!

January 2, 2020

Overview

Now that my application was containerized, my next step was to seed my database with data from the old API.

Log

10:00am - Mounting a volume to my Postgres database

Challenge/Motivation

Since my database is running in a container, I needed to map a volume on my local file system to a volume in my container. That way, I could easily reference the CSVs containing the old API data from within the database container during the ETL process.

Actions Taken

To preface, I read a great article by Orhan Yarar on Docker volumes in which he explained how to map a volume to a container via docker-compose. I then went down a couple of rabbit holes trying to figure out why docker-compose was erroring out whenever I ran docker-compose up. As it turned out, the nesting of my volumes block was incorrect; after fixing the indentation, docker-compose up ran without any errors.

Results Observed

I was able to map the volume containing the seed data on my local machine to the Docker container running my Postgres database!

11:30am - Seeding database with initial data

Challenge/Motivation

With the volume containing the initial data mapped to my database container, my next step was to begin the Extract, Transform, Load (ETL) process to seed my database with data from the old api.

Actions Taken

I began by installing a useful web app called pgAdmin, which provides a GUI for interacting with Postgres databases. I started pgAdmin and connected it to my database running within a Docker container. Next, I ran psql -h localhost -U admin -d reviews_api -f schema.sql in my terminal to load a schema into my reviews_api database, then verified that the schema loaded successfully by inspecting it in pgAdmin.

The next step was to seed my data using pgAdmin's import feature, which allowed me to import a file into the database by specifying its type (CSV) and the delimiter (a comma in this case). I encountered a problem where the order of the columns in my schema was different from the order of the headers in the CSV file, resulting in data being inserted into the wrong columns. To fix this, I had to refactor my schema's columns to match the order of the headers in the CSV files.

Another issue I encountered was messy data, where different data types were used across entries in the same column. I fixed this manually by changing the values of bad entries (e.g. 'false' to '0').

Perhaps the largest challenge I encountered was cripplingly slow load times when seeding my Ratings table with data from the API. My hypothesis was that the table was taking especially long to seed because it had multiple foreign keys referencing tables which already held large amounts of data. To remedy this, I indexed each foreign key in my schema. The improvement had a huge performance impact: I was able to seed the table in under 400 seconds versus the 3,000+ seconds it took earlier.

Results Observed

I successfully refactored my schema to accept data from the API and seeded my database with said data. My database was optimized for performance in the process.

January 4, 2020

Overview

Today's goal is to connect my Express server to my database and fix an issue where my database is not persisting data despite having a volume mapped.

Log

10:00am - Connecting my Express server to the Postgres database

Challenge/Motivation

After verifying that both my database and server functioned as expected in isolation (aside from the fact that the database was not persisting data after the Docker container restarted), I needed to connect the two so that the server could respond to requests with data and write data to the database.

Actions Taken

I began by installing a node module called 'pg-promise', which would allow me to connect to my Postgres database and consume that connection in my server via a promise. However, I ended up switching to a separate, more well-known module called 'pg' after discovering that pg-promise wasn't working as I expected and was an overall pain to use. I then set up a database connection in a separate db.js file with the correct Postgres configuration and exported the connection to my server, where I set a timeout that waited for the Postgres Docker container to be up and running before attempting to connect. Had I not set the timeout, I would have hit a well-known issue where the server keeps throwing an error stating that it failed to connect on subsequent requests, even if it had connected successfully.
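As a rough sketch of what that db.js looks like (the 'pg' Pool API is real, but the host, credentials, database name, and delay length here are assumptions rather than my exact configuration):

```js
// db.js - minimal sketch, not the exact project file
const { Pool } = require('pg');

// Connection settings are illustrative; in the containerized setup the host
// would be the docker-compose service name rather than localhost.
const pool = new Pool({
  host: 'localhost',
  port: 5432,
  user: 'admin',
  database: 'reviews_api',
});

// Wait a few seconds so the Postgres container has time to come up before
// the first connection attempt, as described above.
const connect = () =>
  new Promise((resolve) => setTimeout(() => resolve(pool.connect()), 5000));

module.exports = { pool, connect };
```

The server then requires this module and calls pool.query() once connect() has resolved.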

Results Observed

My database and server were successfully connected, and as such my server is now ready to serve requests.

3:00am - Making my Postgres database persistent

Challenge/Motivation

My database still lacked the ability to persist data after container restarts, which introduced a major inconvenience. Every time I restarted my container I had to re-seed my database, and since I'm dealing with a very large dataset, this process took a considerable amount of time. As such, my next priority was to persist the data across container restarts so that I wouldn't waste any more time re-seeding my database.

Actions Taken

Before my refactor, I was manually re-seeding my database by loading data from a dump.sql file, which contained a manual backup of my database generated via the pg_dump utility. I abandoned this approach entirely by mapping the default location where Postgres saves a database's data (/var/lib/postgresql/data) to a volume on my file system. I also scrapped my previous volume binding, as I realized it wasn't actually serving any purpose.

Results Observed

My database now persists data across container restarts!

January 6, 2020

Overview

With the database fully functional, the next step was to begin setting up API routes.

10:30am - Designing test suite

Challenge/Motivation

I needed to set up a test suite so that I could be sure my API was responding to requests according to whatever params/queries were attached. I decided to write (or at least stub out) integration tests ahead of actually coding the functionality on the server, so as to follow TDD best practices.

Actions Taken

I started by installing mocha/chai and an API testing framework called 'chakram', which is built on top of the former. I immediately faced a challenge where I was unable to get tests to wait for promises to resolve before evaluating assertions. I mitigated this by returning the promises, which tells mocha to wait for them to resolve before evaluating the test. Once I saw that mocha could stall evaluation on its own, I realized there was no need for a separate API testing framework (chakram).

Another issue I faced was getting mocha to wait long enough for a response, since it took my server a while to query my database. I solved this by adding a .timeout(SOME_AMOUNT_OF_TIME) invocation to each it() call, or simply by running mocha with a --timeout SOME_AMOUNT_OF_TIME flag.
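For illustration, a test stub in that style looks roughly like the sketch below; the URL, port, and use of axios are assumptions (chakram's request helpers could be swapped in the same way):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('GET /reviews/:product_id/list', function () {
  it('responds with a list of reviews', function () {
    // Give the slow database query more than mocha's default 2000ms.
    this.timeout(10000);
    // Returning the promise makes mocha wait for it to resolve.
    return axios.get('http://localhost:3000/reviews/5/list').then((res) => {
      expect(res.status).to.equal(200);
      expect(res.data.results).to.be.an('array');
    });
  });
});
```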

Results Observed

I was able to write out most of the test suite for my /reviews/:product_id/list endpoint. Some tests regarding the sort query are only stubs, since I hadn't yet figured out exactly what the results should look like.

2:00pm - Setting up /reviews/:product_id/list endpoint

Challenge/Motivation

With my tests stubbed out, I was able to start programming my server to handle the /reviews/:product_id/list endpoint, which would allow users to receive a list of reviews for a given product.

Actions Taken

I started by making a database query with a template literal that inserted values from the request's query parameters. Once the db query resolved, I returned the data to the user. The largest challenge I faced was getting the photos for each review, as this required a query to another table, photos, for each review inside the .then() of the first db query to reviews. Due to the growing asynchronous complexity, my server was sending a response before the photos queries had resolved. I solved this by fetching the photos for each review inside a promise that resolved only once a query to photos resolved; only then was the response sent.

Results Observed

I was able to serve many of the types of requests to the /reviews/:product_id/list endpoint that I needed to handle, but I was not able to finish the remaining logic for sorting reviews based on the sort request query.

January 7, 2020

Overview

With my first endpoint's test suite stubbed and my server-side logic started, I needed to finish writing the rest of the server logic.

10:30am - Continuing setting up /reviews/:product_id/list endpoint

Challenge/Motivation

Although I wrote out much of my server logic for handling requests to the /reviews/:product_id/list endpoint, I still needed to handle requests containing a sort query so that users could sort reviews by newest, relevance, and helpfulness.

Actions Taken

I realized that I had made a mistake in my previous fix for the problem where my server would send responses before waiting for the db queries to photos to resolve, resulting in many reviews missing their photos. Although I was querying photos for each review, my promise apparently resolved as soon as the query for the first review completed.

After much trial and error, I solved this problem by pushing each photos query into an array and resolving the promise only once the whole array of queries had resolved via a call to Promise.all(). Only once that promise resolved would the response be sent, forcing the server to wait until all queries to photos had resolved before responding.
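A simplified sketch of that fix, assuming a pg-style db.query() and the reviews/photos tables from my schema (the exact columns are illustrative):

```js
// reviews, db, req, and res are assumed to be in scope inside the route
// handler; db is a pg client/pool and reviews is the first query's rows.
const photoQueries = reviews.map((review) =>
  db.query('SELECT id, url FROM photos WHERE review_id = $1', [review.id])
);

// Only send the response once every photos query has resolved.
Promise.all(photoQueries)
  .then((results) => {
    reviews.forEach((review, i) => {
      review.photos = results[i].rows;
    });
    res.send({ product_id: req.params.product_id, results: reviews });
  })
  .catch((err) => res.status(500).send(err.message));
```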

Another issue I faced was that I was sending response data received directly from the database, which contained some information I didn't necessarily want to send back to the user (such as a reviewer's email address). I overcame this by instead sending back a new object whose properties were scraped from the data returned by the db query. This way, I was in full control of exactly what data was sent back to the user. I also found that I needed to clean up some of the data, as responses expected values such as an empty string ('') for empty review responses, but the database sent back 'null' or null instead.
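Roughly, that shaping step looks like the sketch below; the column and field names are assumptions based on the API's shape, but the idea is to copy only whitelisted fields and normalize null/'null' to '':

```js
// rows is assumed to be the array returned by the reviews db query.
const toResponseReview = (row) => ({
  review_id: row.id,
  rating: row.rating,
  summary: row.summary || '',
  body: row.body || '',
  recommend: row.recommended,
  // The client expects '' for an empty seller response, not null or 'null'.
  response: row.response === null || row.response === 'null' ? '' : row.response,
  date: row.date,
  reviewer_name: row.reviewer_name, // note: the email column is deliberately omitted
  helpfulness: row.helpfulness,
});

const results = rows.map(toResponseReview);
```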

Lastly, I completed the logic for the /reviews/:product_id/list endpoint by sorting reviews according to the sort request query, which could contain the values newest, relevance, or helpfulness.
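One way to express that sorting, sketched with in-memory comparators (the field names and the exact definition of 'relevance' are assumptions):

```js
// reviews is assumed to be the array of shaped review objects built above.
const comparators = {
  newest: (a, b) => new Date(b.date) - new Date(a.date),
  helpfulness: (a, b) => b.helpfulness - a.helpfulness,
  // 'relevance' here is a simple blend: helpfulness first, then recency.
  relevance: (a, b) =>
    b.helpfulness - a.helpfulness || new Date(b.date) - new Date(a.date),
};

const sort = req.query.sort || 'relevance';
reviews.sort(comparators[sort] || comparators.relevance);
```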

Results Observed

I was able to finish writing my server-side logic to handle any kind of GET request to the /reviews/:product_id/list endpoint.

7:00pm - Setting up test suite for /reviews/:product_id/meta endpoint

Challenge/Motivation

The next endpoint I needed to handle GET requests for was /reviews/:product_id/meta, so that the client could fetch metadata regarding all the reviews for a given product (e.g. total ratings, how many reviews recommended the product versus how many didn't, etc.). In order to follow TDD best practices, I decided to write out my integration tests before the actual server-side logic.

Actions Taken

Once again, I used mocha/chakram (although I should just remove chakram now that I realize I don't need it). Stubbing out tests was pretty straightforward, although I did run into another minor issue regarding asynchronicity; unlike my last test suite, I only wanted to make one request and test different aspects of the same response. I solved this by using mocha's before() hook and returning a promise that made the API request before any tests ran.
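A sketch of that pattern (again, the URL and response shape are assumptions):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('GET /reviews/:product_id/meta', function () {
  let response;

  // Returning the promise makes mocha wait for the request to finish
  // before any of the it() blocks run.
  before(function () {
    this.timeout(10000);
    return axios.get('http://localhost:3000/reviews/5/meta').then((res) => {
      response = res;
    });
  });

  it('responds with a 200 status', function () {
    expect(response.status).to.equal(200);
  });

  it('includes a ratings breakdown', function () {
    expect(response.data.ratings).to.be.an('object');
  });
});
```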

Results Observed

My test suite for my second endpoint was finished!

January 8, 2020

Overview

The overarching theme for this day was completing the /reviews/:product_id/meta endpoint.

2:00pm - Setup /reviews/:product_id/meta endpoint

Challenge/Motivation

The next endpoint I needed to handle GET requests for was /reviews/:product_id/meta, so that the client could fetch metadata regarding all the reviews for a given product (e.g. total ratings, how many reviews recommended the product versus how many didn't, etc.).

Actions Taken

I began by formulating a SQL query that would pull all the information from the database that I would need to fulfill the GET request. This was difficult because getting characteristic data required querying three separate tables. I came up with the following:

 select recommended, review_id, value, name, rating, value, characteristic_id
 from (
   select *
   from (
     select id, recommended, rating
     from reviews
     where product_id = 5
   ) as reviews
   inner join ratings on ratings.review_id = reviews.id
 ) as ratings
 inner join characteristics as char on char.id = ratings.characteristic_id;

After designing the query came the tedious task of morphing the data into a proper response object. This involved 'caching' various data as I iterated over each row, keeping track of what I had already encountered while scraping the query results for data to add to the response object.
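In outline, that morphing step works something like the sketch below, where rows are the results of the query above; the exact response shape and field names are assumptions:

```js
// productId and rows are assumed to be in scope from the route handler
// and the db query above.
const meta = {
  product_id: productId,
  ratings: {},          // e.g. { 4: 10, 5: 22 }
  recommended: {},      // e.g. { true: 25, false: 7 }
  characteristics: {},  // e.g. { Fit: { id: 1, total: 120, count: 30 } }
};
const countedReviews = new Set(); // 'cache' of reviews already tallied

rows.forEach((row) => {
  // Each review appears once per characteristic, so only count its rating
  // and recommendation the first time it is seen.
  if (!countedReviews.has(row.review_id)) {
    countedReviews.add(row.review_id);
    meta.ratings[row.rating] = (meta.ratings[row.rating] || 0) + 1;
    meta.recommended[row.recommended] = (meta.recommended[row.recommended] || 0) + 1;
  }
  const characteristic = meta.characteristics[row.name] ||
    { id: row.characteristic_id, total: 0, count: 0 };
  characteristic.total += row.value;
  characteristic.count += 1;
  meta.characteristics[row.name] = characteristic;
});
```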

Results Observed

I developed the ability for my app to handle GET requests to the /reviews/:product_id/meta endpoint!

January 9, 2020

Overview

Although I planned to develop my /reviews/:product_id endpoint to allow users to POST new reviews, I ended up going down a rabbit hole attempting to fix a SERIAL id problem with the tables in my database.

10:30am - Write tests for posting reviews

Challenge/Motivation

Knowing that I needed to handle POST requests to the /reviews/:product_id endpoint, I began by writing integration tests before setting up the actual endpoint.

Actions Taken

Setting up the test suite for this endpoint was a similar process to setting up the test suites for the other endpoints, except this time I had to make a POST request to the endpoint under test and then a GET request to /reviews/:product_id/list to check that my data had been successfully POSTed. I also chose to drop chakram and rely directly on mocha/chai for testing and axios for AJAX.
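A condensed sketch of that POST-then-GET check (the request body fields and URL are assumptions):

```js
const axios = require('axios');
const { expect } = require('chai');

describe('POST /reviews/:product_id', function () {
  it('persists a new review', function () {
    this.timeout(10000);
    const review = {
      rating: 5,
      summary: 'Test summary',
      body: 'Test body',
      recommend: true,
      name: 'tester',
      email: 'tester@example.com',
      photos: [],
      characteristics: {},
    };
    return axios
      .post('http://localhost:3000/reviews/5', review)
      .then(() => axios.get('http://localhost:3000/reviews/5/list'))
      .then((res) => {
        const summaries = res.data.results.map((r) => r.summary);
        expect(summaries).to.include('Test summary');
      });
  });
});
```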

Results Observed

I designed a full test suite for ensuring that my /reviews/:product_id endpoint works as expected when I start developing it.

12:00am - Develop /reviews/:product_id endpoint

Challenge/Motivation

I needed to set up a /reviews/:product_id endpoint to handle POST requests, so that users could submit new reviews for products.

Actions Taken

First, I realized that dates are stored in the database in a different format than the one they are posted in. To handle this, I simply converted the date to yyyy/mm/dd format.
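The conversion itself is a small helper along these lines (assuming the posted date arrives as something the Date constructor accepts):

```js
const toDbDate = (input) => {
  const date = new Date(input);
  const pad = (n) => String(n).padStart(2, '0');
  return `${date.getFullYear()}/${pad(date.getMonth() + 1)}/${pad(date.getDate())}`;
};

// e.g. toDbDate('January 9, 2020') -> '2020/01/09'
```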

Next, I realized I needed to make a varying number of queries because I had to insert into three tables. To minimize network latency, I used a SQL transaction block to batch all of my inserts into a single database query.

I also realized that I would need to insert a varying number of values into the characteristics and photos tables depending on how many photos/ratings were included in the request. To solve this, I iterated over the request's photos/ratings and appended a set of values to the corresponding insert for each one. Using a template literal, I was able to add these inserts into the transaction block.
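Put together, the batched transaction looks roughly like this sketch; the table/column names, sequence name, and use of currval() are assumptions, and in a hardened version the values would be parameterized or escaped rather than interpolated directly:

```js
// db and res are assumed to be in scope (pg client/pool and Express response);
// the sample values below would come from req.params/req.body in the handler.
const productId = 5;
const review = { rating: 4, summary: 'Nice', body: 'Works well', recommend: true };
const photos = ['http://example.com/1.jpg', 'http://example.com/2.jpg'];

// One INSERT per photo, appended into the transaction via a template literal.
const photoInserts = photos
  .map((url) =>
    `INSERT INTO photos (review_id, url) VALUES (currval('reviews_id_seq'), '${url}');`)
  .join('\n');

const transaction = `
  BEGIN;
  INSERT INTO reviews (product_id, rating, summary, body, recommended)
  VALUES (${productId}, ${review.rating}, '${review.summary}', '${review.body}', ${review.recommend});
  ${photoInserts}
  COMMIT;
`;

db.query(transaction)
  .then(() => res.sendStatus(201))
  .catch((err) => res.status(500).send(err.message));
```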

Lastly, I faced one more major issue: since my data was seeded from CSV files, the SERIAL sequence for each table was still pointing at the first id by default, regardless of how many entries were already in the table. This meant that when I tried to insert new data, it would attempt to insert a row with an id that already existed, resulting in a duplicate id conflict. I decided I needed to solve this database issue before I could continue developing the endpoint.

Results Observed

I ran into an issue with my database where it would try to insert rows with ids that already existed, causing a duplicate id conflict. Although I developed an INSERT INTO query which I'm fairly confident will work as intended, I decided that I needed to solve the issue with my database first before continuing to develop the endpoint.

7:00pm - Solve database id conflict issue

Challenge/Motivation

I relied on SERIAL (AUTO_INCREMENT in MySQL) functionality for inserting new data with the correct id in my database. However, since I seeded from CSV files, the sequence keeping track of which id to insert next was handing out ids that had already been used. To save new reviews in the database, I needed to solve this issue.

Actions Taken

The strategy I chose was to initially load all data from the CSV files into 'temporary' table equivalents, then copy every column except the id from the temporary tables into the actual tables, which had their own SERIAL ids. Then I would add the foreign key constraints to the actual tables and drop the temporary tables.

Since I wanted to do this all within the schema.sql file, I had to bind a volume containing the CSV files to my Postgres Docker container so that it had access to all the data it needed to copy.

Results Observed

Through a semi-complex refactor of my schema, I was able to seed my data in such a way that the SERIAL sequences pointed at the correct next id.

January 10, 2020

Overview

11:00am - Refactor database id conflict solution

Challenge/Motivation

12:30pm - Finish /reviews/:product_id endpoint

Challenge/Motivation

3:00pm - Write tests for and setup /reviews/helpful/:product_id endpoint

Challenge/Motivation

3:46pm - Write tests for and setup /

4:31pm - Optimize DBMS

7:00pm - Deploy app on AWS
