Finding the Next Unicorn

You work at a venture capital firm, HardBank, which is hell-bent on finding the next unicorn startup to invest in. The firm has appointed you to build a tool to search through all the startups for patterns of success and failure. If the tool works, it will also be opened up for public access.

Your general partner, Mahayoshi Sonny, has dumped on you a JSON file of company profiles (https://ufile.io/gzpnra8msk1s902) that his data team has assiduously compiled over many years.

He’d like you to build an API server, with documentation and a backing relational database, that will allow any client to navigate that sea of data easily and intuitively. The front-end team will later use that documentation to build the front-end clients.

The choice of language and server framework is at your discretion; you just need to explain your decision later. As for the database, you can pick any, so long as it is relational. In the dataset, each company has many attributes, but for this project the following are the most important:

  1. name
  2. crunchbase_url
  3. homepage_url
  4. category_code
  5. number_of_employees
  6. founded_year
  7. founded_month
  8. founded_day
  9. deadpooled_year (deadpooled means that it’s closed down)
  10. deadpooled_month
  11. deadpooled_day
  12. tag_list
  13. email_address
  14. overview
  15. total_money_raised
  16. acquisition (if it's been acquired)

The rest of it is under a field called “relationships”. The only relational data relevant to this task are:

  1. relationships (investors/founders)
  2. acquisitions (other companies it's acquired)
  3. funding_rounds
  4. competitions (other competing companies)

You can safely ignore the rest. You'll now need to design the querying interface, design your schema (it doesn’t have to follow the list above), extract, transform and load this data into your database, and implement an API server on top of it. No front-end work is needed.
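
To make the schema and ETL expectations concrete, here is a minimal sketch, assuming Python with the standard-library sqlite3 module; the table layout, column names, and the `unicorns.db` filename are illustrative only, and any relational database or ORM would serve just as well.

```python
# etl.py -- minimal loader sketch; assumes the JSON file is one array of
# company objects and that year/month/day fields are integers when present.
# Table and column names here are illustrative, not prescriptive.
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    id                  INTEGER PRIMARY KEY,
    name                TEXT NOT NULL,
    crunchbase_url      TEXT,
    homepage_url        TEXT,
    category_code       TEXT,
    number_of_employees INTEGER,
    founded_on          DATE,      -- year/month/day collapsed into one column
    deadpooled_on       DATE,
    total_money_raised  TEXT
);
CREATE TABLE IF NOT EXISTS people (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS company_people (
    company_id INTEGER REFERENCES companies(id),
    person_id  INTEGER REFERENCES people(id),
    role       TEXT          -- e.g. 'founder' or 'investor'
);
"""

def to_date(year, month, day):
    """Collapse separate year/month/day fields into an ISO date, if present."""
    if not year:
        return None
    return f"{year:04d}-{month or 1:02d}-{day or 1:02d}"

def load(json_path: str, db_path: str = "unicorns.db") -> None:
    with open(json_path) as f:
        companies = json.load(f)
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    for c in companies:
        conn.execute(
            "INSERT INTO companies (name, crunchbase_url, homepage_url, category_code,"
            " number_of_employees, founded_on, deadpooled_on, total_money_raised)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (c.get("name"), c.get("crunchbase_url"), c.get("homepage_url"),
             c.get("category_code"), c.get("number_of_employees"),
             to_date(c.get("founded_year"), c.get("founded_month"), c.get("founded_day")),
             to_date(c.get("deadpooled_year"), c.get("deadpooled_month"), c.get("deadpooled_day")),
             c.get("total_money_raised")),
        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    import sys
    load(sys.argv[1])
```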

The operations the front-end team would need you to support are:

  • Get a list of all the companies filtered by (one possible query interface is sketched after this list):
    • Number of funding rounds (support more than, less than, within range, exact match)
    • Total amount of funding raised (support more than, less than, within range, exact match)
    • Founding date (support after, before, within range, exact match)
    • Deadpool date (support after, before, within range, exact match)
    • The id of a person (which you can assign) and their relationship to the company (e.g. all companies invested in by Peter Thiel, or all companies ever founded by Elon Musk)
    • Another company's id (which you can assign) and its relationship to the company (e.g. all companies acquired by Google)
  • Update the company information and its relationships
  • Delete a company
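
As an illustration of the filters above, here is a sketch of one possible query interface, assuming Flask; the parameter names (`funding_rounds_gt`, `founded_after`, `person_id`, and so on) are hypothetical, and `query_companies` is a placeholder for the real database query.

```python
# app.py -- sketch of one possible filter interface for GET /companies,
# e.g. /companies?funding_rounds_gt=3&founded_after=2010-01-01&person_id=42
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/companies")
def list_companies():
    filters = {
        "funding_rounds_gt":    request.args.get("funding_rounds_gt", type=int),
        "funding_rounds_lt":    request.args.get("funding_rounds_lt", type=int),
        "funding_rounds_eq":    request.args.get("funding_rounds_eq", type=int),
        "raised_gt":            request.args.get("raised_gt", type=float),
        "raised_lt":            request.args.get("raised_lt", type=float),
        "founded_after":        request.args.get("founded_after"),    # ISO dates
        "founded_before":       request.args.get("founded_before"),
        "deadpooled_after":     request.args.get("deadpooled_after"),
        "deadpooled_before":    request.args.get("deadpooled_before"),
        "person_id":            request.args.get("person_id", type=int),
        "person_relationship":  request.args.get("person_relationship"),   # e.g. founder
        "company_id":           request.args.get("company_id", type=int),
        "company_relationship": request.args.get("company_relationship"),  # e.g. acquirer
    }
    # Translate the non-empty filters into WHERE clauses and joins against the
    # schema sketched earlier; ranges fall out of combining _gt and _lt params.
    companies = query_companies({k: v for k, v in filters.items() if v is not None})
    return jsonify(companies)

def query_companies(filters):
    """Placeholder for the actual database query; returns a list of dicts."""
    raise NotImplementedError
```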

Meanwhile, HardBank is cocksure that this application will explode in popularity and thus become a target of abuse from scrapers and bots. Design an auditing and rate-limiting system into the application to prevent such abuse.
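
A minimal sketch of one way to add auditing and rate limiting, again assuming Flask; the fixed-window counter, the per-IP limit, and the `audit_log` table are all illustrative choices. A production setup would more likely use a shared store such as Redis behind proper middleware.

```python
# abuse_guard.py -- per-IP fixed-window rate limiting plus an audit log.
# Single-process sketch: counters live in memory, and an audit_log table
# (ip, method, path, status, requested_at) is assumed to exist.
import time
from collections import defaultdict
from flask import request, jsonify

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100        # illustrative limit
_counters = defaultdict(lambda: [0.0, 0])   # ip -> [window_start, count]

def register_abuse_guard(app, db_conn):
    @app.before_request
    def rate_limit():
        ip = request.remote_addr or "unknown"
        window_start, count = _counters[ip]
        now = time.time()
        if now - window_start > WINDOW_SECONDS:   # start a fresh window
            _counters[ip] = [now, 1]
            return None
        if count >= MAX_REQUESTS_PER_WINDOW:      # reject once the window is full
            return jsonify({"error": "rate limit exceeded"}), 429
        _counters[ip][1] = count + 1
        return None

    @app.after_request
    def audit(response):
        # Append-only audit trail: who called what, when, and with what outcome.
        db_conn.execute(
            "INSERT INTO audit_log (ip, method, path, status, requested_at)"
            " VALUES (?, ?, ?, ?, ?)",
            (request.remote_addr, request.method, request.full_path,
             response.status_code, time.time()),
        )
        db_conn.commit()
        return response
```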

In your repository, you will need to document the API interface, the commands to run the ETL (extract, transform and load) script that takes the JSON file as input and writes to your database, and the commands to set up your server and database. You may use Docker to ensure a uniform setup across environments.

Deployment

It'd be great if you could deploy this on the free tier of any cloud hosting platform (e.g. a free dyno on Heroku), so that we can easily access the application via a URL.

Bonus Section

Fortunately for you, you work with a team of extremely smart front-end developers who know how to delight users. On top of your functioning API server, these front-end developers have built an interface that has attracted a fast-growing user base.

Like Pied Piper from Silicon Valley, you now have funding, and millions of users eager to use your product.

Before pouring marketing dollars into advertising your site, you want to know how your API can scale to handle the incoming load. Therefore, you need to:

  1. Systematically measure, by means of a load generator, the scalability limits of one of your API endpoints in its current implementation, and record its performance at various load levels. You are free to come up with your own performance metrics (a minimal load-generator sketch follows this list).
  2. Identify and state the bottleneck in your system (e.g. database and/or API)
  3. Propose measures to improve the performance. You need not implement these measures.
  4. Estimate the effectiveness of each measure. How much would performance improve after the measure has been applied?
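
For the first item, here is a minimal load-generator sketch using only the Python standard library; the endpoint URL, request counts, and concurrency levels are placeholders, and a dedicated tool such as wrk, k6, or Locust would do the same job.

```python
# loadgen.py -- closed-loop load generator that measures throughput and latency
# for one endpoint at increasing concurrency levels; URL and counts are placeholders.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:5000/companies?funding_rounds_gt=3"   # hypothetical endpoint
REQUESTS_PER_LEVEL = 500

def one_request(_):
    # Time a single request end to end.
    start = time.perf_counter()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    return time.perf_counter() - start

def run_level(concurrency):
    # Fire REQUESTS_PER_LEVEL requests with a fixed number of concurrent workers.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(REQUESTS_PER_LEVEL)))
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "concurrency": concurrency,
        "throughput_rps": REQUESTS_PER_LEVEL / elapsed,
        "p50_ms": 1000 * statistics.median(latencies),
        "p95_ms": 1000 * latencies[int(0.95 * len(latencies)) - 1],
    }

if __name__ == "__main__":
    for level in (1, 5, 10, 25, 50, 100):
        print(run_level(level))
```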

Prepare a short report containing the above and add the report to your repository. Along with the report, include any code used to measure the performance.
