Skip to content

Instantly share code, notes, and snippets.

@dt
Created January 19, 2017 23:31
Show Gist options
  • Save dt/b4b421e6003a9617c115dd79fb877a20 to your computer and use it in GitHub Desktop.
Save dt/b4b421e6003a9617c115dd79fb877a20 to your computer and use it in GitHub Desktop.
rekik86 @rekik86 Jun 22 2016 16:51
Hi David
David Taylor @dt Jun 22 2016 16:51
Hey
rekik86 @rekik86 Jun 22 2016 16:52
Thanks for offering your help
I'm new to RDBM
This is just a homework actually
need a bit of advice from an expert :grinning:
is that ok ?
David Taylor @dt Jun 22 2016 16:55
Heh, not sure to what extent I count as an “expert", but I’ll help if I can.
rekik86 @rekik86 Jun 22 2016 16:55
haha ok
lets give a try
Heres the thing
Say I built simple web application where users
can enter the URL of text files and the word(s) they would like to count in those texts. Since any new counting might take a long time, I'm using a relational database to cache calculated data for the next calls. How should I arrange the database?
Then there are additional questions
how should I refine the system if:
The DB volume is limited
There might be many different URLs that point to identical texts (i.e., copies of the same text)
Many users refer to the same texts, but everyone is looking for different words
My servers can handle hard-drives volume of up to 1024GB, but the expected volume of my DB is estimated by
5000GB (I can use as many servers as I need).
starting with the first question, I created 3 tables: one for all words, one for all URLs, one for all occurrences of a word in a URL
does that sound right ?
David Taylor @dt Jun 22 2016 17:00
so, working backwards: if total size exceeds what any single server can hold, you’ll need sharding — tradtitional dbs don’t really support this, so you’d have to store data in different DBs and figure out which one to query using application logic (hashing keys or something) — cockroachdb is somewhat different in that it shards across servers automatically so your application can treat many servers as one, monolithic db
as far as schema: yeah, could work
rekik86 @rekik86 Jun 22 2016 17:02
re total size: is that what is called partitionning ?
David Taylor @dt Jun 22 2016 17:02
though it sounds like you want all occurrence of a word in a text, and a text may have many URLs
rekik86 @rekik86 Jun 22 2016 17:03
indeed
David Taylor @dt Jun 22 2016 17:03
so you might want a 4th table for URL -> text id
yeah, partitioning or sharding, same thing
rekik86 @rekik86 Jun 22 2016 17:04
good point
in the answers I started writing down, I though about using a hashing algo (MD5 for instance) to check the text files instead of creating a new table
do you think that would work ?
David Taylor @dt Jun 22 2016 17:06
sure
rekik86 @rekik86 Jun 22 2016 17:06
great, so I'm not too far off :grinning:
oh right, Im actually stuck on the next one
Many users refer to the same texts, but everyone is looking for different words
is it a case of creating a new table for each popular text ?
David Taylor @dt Jun 22 2016 17:10
I might go with something like:
CREATE TABLE texts (hash VARCHAR PRIMARY KEY, body TEXT);
CREATE TABLE urls (url VARCHAR PRIMARY KEY, text_hash VARCHAR NOT NULL REFERENCES texts);
CREATE TABLE counts (text_hash VARCHAR NOT NULL REFERENCES texts, word VARCHAR NOT NULL, count int, PRIMARY KEY (text_hash, word));
you could keep track which user submitted a URL on the urls table i guess
if you needed manual sharding (i.e. not cockroach) you could partition counts pretty easily by the text_hash too
rekik86 @rekik86 Jun 22 2016 17:15
ok one sec I'm trying to understand your tables
the first one is storing the body of the text ?
David Taylor @dt Jun 22 2016 17:16
i was… guess you could ditch that if you don’t need it
all you really need is the unique hash to refer to from the other tables with foreign keys
rekik86 @rekik86 Jun 22 2016 17:17
ok I get it
sounds good
David Taylor @dt Jun 22 2016 17:19
if the composite primary key on counts is too fancy, you could just change it to a UNIQUE(text_hash, word) and get the same effect (ensuring there’s only one count for each word-text pair)
rekik86 @rekik86 Jun 22 2016 17:19
what does UNIQUE do ?
David Taylor @dt Jun 22 2016 17:20
UNIQUE on a column says that a value in that column is unique
like, if you had a users table and said the email column was unique, you couldn’t end up with two users for the same email
(which would make, e.g. a password recovery impossible)
rekik86 @rekik86 Jun 22 2016 17:21
indeed
really interesting
David Taylor @dt Jun 22 2016 17:21
at the table level, you can say UNIQUE (col_a, col_b) and you get that the combination is unique
so you can have at-most-one row for a given combination of values in those columns
above, that’s used to ensure that for a given word in a given text, you only have one count of occurances
rekik86 @rekik86 Jun 22 2016 17:22
get it
David Taylor @dt Jun 22 2016 17:23
PRIMARY KEY implies UNIQUE and NOT NULL
rekik86 @rekik86 Jun 22 2016 17:25
is there not an automated way to get unicity with a ID (number), which would be incremented for any new text_hash,word combination ?
David Taylor @dt Jun 22 2016 17:26
sure, you could have an id INT PRIMARY KEY SERIAL
rekik86 @rekik86 Jun 22 2016 17:26
any pros/cons ?
rekik86 @rekik86 Jun 22 2016 17:33
I have what I need I think anyway
Thanks so much David
_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment