Youtube URL: https://www.youtube.com/watch?v=JQDHz72OA3c
How to design a url shortener
Ask questions to clarify uncertainties
- What is the length of the URL which is shortener
- What is the volume of the traffic
- Should this be a single system or scalable
- Volume: Twitter has 300M users/month. I we have 10% of users, that's 30M/month
- Length of identifer: Assume 7 characters for the unique identifier after the domain uri
- long_url: 2KB (2048 characters)
- short_url: 18 bytes (18 chars) -> www.us.com/1234567
- created_at: 7 bytes (7 char) -> 1519129853500 (13 in this example)
- exp_at: 7 bytes (7 char)
- TOTAL: 2.032 KB / entry
For 30M users, that's 60.7GB/month, 0.7TB/yr, 3.6TB/5 yr
- B62 (base 62) - Takes 0-9a-zA-Z. Allows for 62^7 numbers or 3.5 trillion combinations
- B10 (base 10) - Only allows for 10^7 numbers or 10 million combinations
- If using 1000 combos/sec. Will take us 110 years to exhaust 3.5T/(1000606024365 combos/yr)
- MD5 Hash - will give you more than 7 characters. Need to take first 7 chars. Leads to collisions and data corruption
Approach: Use Base 62. Has most combinations.
- RDBMS (Relational Database Management System) has ACID properties
- Cons: Hard to scale even with horizontal scaling as lots of reads and writes and difficulty spreading out related data across nodes
- ACID
- Atomicity - All changes are made or none at all. If debit is made, corresponding credit is made to other account
- Consistency - Data is consistent before and after operation. Total funds are the same before and after op
- Isolation - Intermediate state is invisible to other transactions. Funds show up in one account or the other, not both or neither
- Durability - Changes persist and are not undone if system fails. Funds will not be reversed
- NoSQL
- Pro:
- Easily scalable - collections are self-contained and not coupled relationally
- Highly available
- Con:
- Eventual consistency - Needs time to replicate to another node
- Pro:
Approach: Use NoSQL since easier to scale
- Technique1 - Using random number to B62. Could generating the same random number. Need to check before inserting
- Con: Okay for single app. If multiple, and have same random number with diff longUrl, both would insert. Can use insert if not present logic. For high scalability, we can consider a different solution
- Technique2 - Use MD5 hash. Generates the same hash for same longUrl. Could collide if only using first 7 letters if longUrl is the same or if first 7 chars are the same. Fine if longUrl is the same. Bad if longUrl is different.
- Technique3
- Use a counter for a new value for every request
- Problem: If many app servers, they all call 1 counter service (single point of failure). Or if they're in different regions, then need two counter services. Need someone to organize the counters
- Apache Zookeeper - Distributed service manager.
- Coordinates between services
- Elects who is main and secondary, who is down, registers and gives names to servers, who is absent, who is present, coordinating between servers, keeps config data for each server, keeps coordinated file system for each server to connect with
- Use a counter for a new value for every request
- Loadbalancer to shortener apps to noSQL Zookeeper who gives counter services a range of say 100,000 numbers to count up from.
- Whenever each servers are online, they talk to zookeeper. Zookeeper assigns them an ID and 100K range.
- If counter server goes down, zookeeper will mark it down and unused
- New counter servers talk to zookeeper and gets a new range
- Once the counters are all used up, ZK marks it as used and give it a new range
- Guarantees: No collission is guaranteed, high performance, no need to check DB before entering
- Can use NoSQL or RDBMS but NoSQL better for scalability
- Create short url (give long url)
- Get long url (give short url)