rekik86 @rekik86 Jun 22 2016 16:51 | |
Hi David | |
David Taylor @dt Jun 22 2016 16:51 | |
Hey | |
rekik86 @rekik86 Jun 22 2016 16:52 | |
Thanks for offering your help | |
I'm new to RDBMSs | |
This is just a homework assignment actually | |
need a bit of advice from an expert :grinning: | |
is that ok ? | |
David Taylor @dt Jun 22 2016 16:55 | |
Heh, not sure to what extent I count as an “expert”, but I’ll help if I can. | |
rekik86 @rekik86 Jun 22 2016 16:55 | |
haha ok | |
let's give it a try | |
Here's the thing | |
Say I built a simple web application where users can enter the URLs of text files and the word(s) they would like to count in those texts. Since any new count might take a long time, I'm using a relational database to cache the calculated data for later calls. How should I arrange the database? | |
Then there are additional questions | |
how should I refine the system if: | |
The DB volume is limited | |
There might be many different URLs that point to identical texts (i.e., copies of the same text) | |
Many users refer to the same texts, but everyone is looking for different words | |
My servers can each handle a hard-drive volume of up to 1024GB, but the expected volume of my DB is estimated at 5000GB (I can use as many servers as I need). | |
starting with the first question, I created 3 tables: one for all words, one for all URLs, one for all occurrences of a word in a URL | |
does that sound right ? | |
David Taylor @dt Jun 22 2016 17:00 | |
so, working backwards: if the total size exceeds what any single server can hold, you’ll need sharding. Traditional DBs don’t really support this, so you’d have to store data in different DBs and figure out which one to query using application logic (hashing keys or something). CockroachDB is somewhat different in that it shards across servers automatically, so your application can treat many servers as one, monolithic DB | |
as far as schema: yeah, could work | |
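A minimal sketch of the application-side hashing mentioned above, assuming PostgreSQL and an arbitrary count of 4 shards; the resulting index would pick which database connection the application uses (the URL is a made-up example):
-- Derive a shard index from the first byte of the MD5 of the lookup key,
-- so the application knows which of the 4 databases holds the cached data.
SELECT get_byte(decode(md5('http://example.com/some.txt'), 'hex'), 0) % 4 AS shard;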
rekik86 @rekik86 Jun 22 2016 17:02 | |
re total size: is that what is called partitioning ? | |
David Taylor @dt Jun 22 2016 17:02 | |
though it sounds like you want all occurrences of a word in a text, and a text may have many URLs | |
rekik86 @rekik86 Jun 22 2016 17:03 | |
indeed | |
David Taylor @dt Jun 22 2016 17:03 | |
so you might want a 4th table for URL -> text id | |
yeah, partitioning or sharding, same thing | |
rekik86 @rekik86 Jun 22 2016 17:04 | |
good point | |
in the answers I started writing down, I thought about using a hashing algorithm (MD5, for instance) to identify identical text files instead of creating a new table | |
do you think that would work ? | |
David Taylor @dt Jun 22 2016 17:06 | |
sure | |
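A minimal sketch of that idea, assuming PostgreSQL's built-in md5() function: hashing the downloaded body gives a stable identifier for a text, so two URLs that point at identical files resolve to the same key.
-- md5() of the fetched body identifies the text independently of its URL.
SELECT md5('The quick brown fox jumps over the lazy dog') AS text_hash;
-- returns 9e107d9d372bb6826bd81d3542a419d6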
rekik86 @rekik86 Jun 22 2016 17:06 | |
great, so I'm not too far off :grinning: | |
oh right, I'm actually stuck on the next one | |
Many users refer to the same texts, but everyone is looking for different words | |
is it a case of creating a new table for each popular text ? | |
David Taylor @dt Jun 22 2016 17:10 | |
I might go with something like: | |
CREATE TABLE texts (hash VARCHAR PRIMARY KEY, body TEXT); | |
CREATE TABLE urls (url VARCHAR PRIMARY KEY, text_hash VARCHAR NOT NULL REFERENCES texts); | |
CREATE TABLE counts (text_hash VARCHAR NOT NULL REFERENCES texts, word VARCHAR NOT NULL, count int, PRIMARY KEY (text_hash, word)); | |
you could keep track of which user submitted a URL in the urls table, I guess | |
if you needed manual sharding (i.e., not Cockroach) you could partition counts pretty easily by the text_hash too | |
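A hypothetical usage sketch against the schema above (the URL and body are made-up example values; the hash is the MD5 of that body): populate the cache once, then serve later requests with a join from urls to counts.
-- Cache the result of one crawl.
INSERT INTO texts (hash, body)
  VALUES ('9e107d9d372bb6826bd81d3542a419d6', 'The quick brown fox jumps over the lazy dog');
INSERT INTO urls (url, text_hash)
  VALUES ('http://example.com/fox.txt', '9e107d9d372bb6826bd81d3542a419d6');
INSERT INTO counts (text_hash, word, count)
  VALUES ('9e107d9d372bb6826bd81d3542a419d6', 'fox', 1);
-- A later request for the same URL and word is answered from the cache if this returns a row.
SELECT c.count
  FROM urls u
  JOIN counts c ON c.text_hash = u.text_hash
 WHERE u.url = 'http://example.com/fox.txt'
   AND c.word = 'fox';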
rekik86 @rekik86 Jun 22 2016 17:15 | |
ok one sec I'm trying to understand your tables | |
the first one is storing the body of the text ? | |
David Taylor @dt Jun 22 2016 17:16 | |
I was… I guess you could ditch that if you don’t need it | |
all you really need is the unique hash to refer to from the other tables with foreign keys | |
rekik86 @rekik86 Jun 22 2016 17:17 | |
ok I get it | |
sounds good | |
David Taylor @dt Jun 22 2016 17:19 | |
if the composite primary key on counts is too fancy, you could just change it to a UNIQUE(text_hash, word) constraint and get the same effect (ensuring there’s only one count for each word-text pair) | |
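Spelled out, that variant keeps the same columns as the counts table above and only moves the uniqueness into a plain constraint:
CREATE TABLE counts (
  text_hash VARCHAR NOT NULL REFERENCES texts,
  word      VARCHAR NOT NULL,
  count     INT,
  UNIQUE (text_hash, word)  -- at most one count row per word-text pair
);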
rekik86 @rekik86 Jun 22 2016 17:19 | |
what does UNIQUE do ? | |
David Taylor @dt Jun 22 2016 17:20 | |
UNIQUE on a column says that each value in that column can appear at most once | |
like, if you had a users table and said the email column was unique, you couldn’t end up with two users for the same email | |
(which would make, e.g., password recovery impossible) | |
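A small sketch of that users example (a hypothetical table, not part of the schema above); the second INSERT would be rejected because of the duplicate email:
CREATE TABLE users (
  id    SERIAL PRIMARY KEY,
  email VARCHAR NOT NULL UNIQUE  -- at most one account per email address
);
INSERT INTO users (email) VALUES ('someone@example.com');
INSERT INTO users (email) VALUES ('someone@example.com');  -- fails: duplicate email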
rekik86 @rekik86 Jun 22 2016 17:21 | |
indeed | |
really interesting | |
David Taylor @dt Jun 22 2016 17:21 | |
at the table level, you can say UNIQUE (col_a, col_b), which makes the combination of the two values unique | |
so you can have at-most-one row for a given combination of values in those columns | |
above, that’s used to ensure that for a given word in a given text, you only have one count of occurrences | |
rekik86 @rekik86 Jun 22 2016 17:22 | |
get it | |
David Taylor @dt Jun 22 2016 17:23 | |
PRIMARY KEY implies UNIQUE and NOT NULL | |
rekik86 @rekik86 Jun 22 2016 17:25 | |
is there not an automated way to get uniqueness with an ID (number) that would be incremented for any new text_hash, word combination ? | |
David Taylor @dt Jun 22 2016 17:26 | |
sure, you could have an id SERIAL PRIMARY KEY | |
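Sketched out, that version of the counts table might look like the following; the serial id only gives each row a distinct number, so the UNIQUE constraint is still what prevents two rows for the same word-text pair:
CREATE TABLE counts (
  id        SERIAL PRIMARY KEY,                -- auto-incremented surrogate key
  text_hash VARCHAR NOT NULL REFERENCES texts,
  word      VARCHAR NOT NULL,
  count     INT,
  UNIQUE (text_hash, word)                     -- still required: one count per word-text pair
);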
rekik86 @rekik86 Jun 22 2016 17:26 | |
any pros/cons ? | |
rekik86 @rekik86 Jun 22 2016 17:33 | |
I think I have what I need anyway | |
Thanks so much David | |