dt · January 19, 2017 23:31
diff --git a/gistfile1.txt b/gistfile1.txt
 rekik86 @rekik86 Jun 22 2016 16:51
 Hi David

 David Taylor @dt Jun 22 2016 16:51
 Hey

 rekik86 @rekik86 Jun 22 2016 16:52
 Thanks for offering your help
 I'm new to RDBM
 This is just a homework actually
 need a bit of advice from an expert :grinning:
 is that ok ?

 David Taylor @dt Jun 22 2016 16:55
 Heh, not sure to what extent I count as an “expert", but I’ll help if I can.

 rekik86 @rekik86 Jun 22 2016 16:55
 haha ok
 lets give a try
 Heres the thing
 Say I built simple web application where users
 can enter the URL of text files and the word(s) they would like to count in those texts. Since any new counting might take a long time, I'm using a relational database to cache calculated data for the next calls. How should I arrange the database?
 Then there are additional questions
 how should I refine the system if:
 The DB volume is limited
 There might be many different URLs that point to identical texts (i.e., copies of the same text)
 Many users refer to the same texts, but everyone is looking for different words
 My servers can handle hard-drives volume of up to 1024GB, but the expected volume of my DB is estimated by
 5000GB (I can use as many servers as I need).
 starting with the first question, I created 3 tables: one for all words, one for all URLs, one for all occurrences of a word in a URL
 does that sound right ?

 David Taylor @dt Jun 22 2016 17:00
 so, working backwards: if total size exceeds what any single server can hold, you’ll need sharding — tradtitional dbs don’t really support this, so you’d have to store data in different DBs and figure out which one to query using application logic (hashing keys or something) — cockroachdb is somewhat different in that it shards across servers automatically so your application can treat many servers as one, monolithic db
 as far as schema: yeah, could work

 rekik86 @rekik86 Jun 22 2016 17:02
 re total size: is that what is called partitionning ?

 David Taylor @dt Jun 22 2016 17:02
 though it sounds like you want all occurrence of a word in a text, and a text may have many URLs

 rekik86 @rekik86 Jun 22 2016 17:03
 indeed

 David Taylor @dt Jun 22 2016 17:03
 so you might want a 4th table for URL -> text id
 yeah, partitioning or sharding, same thing

 rekik86 @rekik86 Jun 22 2016 17:04
 good point
 in the answers I started writing down, I though about using a hashing algo (MD5 for instance) to check the text files instead of creating a new table
 do you think that would work ?

 David Taylor @dt Jun 22 2016 17:06
 sure

 rekik86 @rekik86 Jun 22 2016 17:06
 great, so I'm not too far off :grinning:
 oh right, Im actually stuck on the next one
 Many users refer to the same texts, but everyone is looking for different words
 is it a case of creating a new table for each popular text ?

 David Taylor @dt Jun 22 2016 17:10
 I might go with something like:
 CREATE TABLE texts (hash VARCHAR PRIMARY KEY, body TEXT);

 CREATE TABLE urls (url VARCHAR PRIMARY KEY, text_hash VARCHAR  NOT NULL REFERENCES texts);

 CREATE TABLE counts (text_hash VARCHAR NOT NULL REFERENCES texts, word VARCHAR NOT NULL, count int, PRIMARY KEY (text_hash, word));
 you could keep track which user submitted a URL on the urls table i guess
 if you needed manual sharding (i.e. not cockroach) you could partition counts pretty easily by the text_hash too

 rekik86 @rekik86 Jun 22 2016 17:15
 ok one sec I'm trying to understand your tables
 the first one is storing the body of the text ?

 David Taylor @dt Jun 22 2016 17:16
 i was… guess you could ditch that if you don’t need it
 all you really need is the unique hash to refer to from the other tables with foreign keys

 rekik86 @rekik86 Jun 22 2016 17:17
 ok I get it
 sounds good

 David Taylor @dt Jun 22 2016 17:19
 if the composite primary key on counts is too fancy, you could just change it to a UNIQUE(text_hash, word) and get the same effect (ensuring there’s only one count for each word-text pair)

 rekik86 @rekik86 Jun 22 2016 17:19
 what does UNIQUE do ?

 David Taylor @dt Jun 22 2016 17:20
 UNIQUE on a column says that a value in that column is unique
 like, if you had a users table and said the email column was unique, you couldn’t end up with two users for the same email
 (which would make, e.g. a password recovery impossible)

 rekik86 @rekik86 Jun 22 2016 17:21
 indeed
 really interesting

 David Taylor @dt Jun 22 2016 17:21
 at the table level, you can say UNIQUE (col_a, col_b) and you get that the combination is unique
 so you can have at-most-one row for a given combination of values in those columns
 above, that’s used to ensure that for a given word in a given text, you only have one count of occurances

 rekik86 @rekik86 Jun 22 2016 17:22
 get it

 David Taylor @dt Jun 22 2016 17:23
 PRIMARY KEY implies UNIQUE and NOT NULL

 rekik86 @rekik86 Jun 22 2016 17:25
 is there not an automated way to get unicity with a ID (number), which would be incremented for any new text_hash,word combination ?

 David Taylor @dt Jun 22 2016 17:26
 sure, you could have an id INT PRIMARY KEY SERIAL

 rekik86 @rekik86 Jun 22 2016 17:26
 any pros/cons ?

 rekik86 @rekik86 Jun 22 2016 17:33
 I have what I need I think anyway
 Thanks so much David
 _
	rekik86 @rekik86 Jun 22 2016 16:51
	Hi David

	David Taylor @dt Jun 22 2016 16:51
	Hey

	rekik86 @rekik86 Jun 22 2016 16:52
	Thanks for offering your help
	I'm new to RDBM
	This is just a homework actually
	need a bit of advice from an expert :grinning:
	is that ok ?

	David Taylor @dt Jun 22 2016 16:55
	Heh, not sure to what extent I count as an “expert", but I’ll help if I can.

	rekik86 @rekik86 Jun 22 2016 16:55
	haha ok
	lets give a try
	Heres the thing
	Say I built simple web application where users
	can enter the URL of text files and the word(s) they would like to count in those texts. Since any new counting might take a long time, I'm using a relational database to cache calculated data for the next calls. How should I arrange the database?
	Then there are additional questions
	how should I refine the system if:
	The DB volume is limited
	There might be many different URLs that point to identical texts (i.e., copies of the same text)
	Many users refer to the same texts, but everyone is looking for different words
	My servers can handle hard-drives volume of up to 1024GB, but the expected volume of my DB is estimated by
	5000GB (I can use as many servers as I need).
	starting with the first question, I created 3 tables: one for all words, one for all URLs, one for all occurrences of a word in a URL
	does that sound right ?

	David Taylor @dt Jun 22 2016 17:00
	so, working backwards: if total size exceeds what any single server can hold, you’ll need sharding — tradtitional dbs don’t really support this, so you’d have to store data in different DBs and figure out which one to query using application logic (hashing keys or something) — cockroachdb is somewhat different in that it shards across servers automatically so your application can treat many servers as one, monolithic db
	as far as schema: yeah, could work

	rekik86 @rekik86 Jun 22 2016 17:02
	re total size: is that what is called partitionning ?

	David Taylor @dt Jun 22 2016 17:02
	though it sounds like you want all occurrence of a word in a text, and a text may have many URLs

	rekik86 @rekik86 Jun 22 2016 17:03
	indeed

	David Taylor @dt Jun 22 2016 17:03
	so you might want a 4th table for URL -> text id
	yeah, partitioning or sharding, same thing

	rekik86 @rekik86 Jun 22 2016 17:04
	good point
	in the answers I started writing down, I though about using a hashing algo (MD5 for instance) to check the text files instead of creating a new table
	do you think that would work ?

	David Taylor @dt Jun 22 2016 17:06
	sure

	rekik86 @rekik86 Jun 22 2016 17:06
	great, so I'm not too far off :grinning:
	oh right, Im actually stuck on the next one
	Many users refer to the same texts, but everyone is looking for different words
	is it a case of creating a new table for each popular text ?

	David Taylor @dt Jun 22 2016 17:10
	I might go with something like:
	CREATE TABLE texts (hash VARCHAR PRIMARY KEY, body TEXT);

	CREATE TABLE urls (url VARCHAR PRIMARY KEY, text_hash VARCHAR NOT NULL REFERENCES texts);

	CREATE TABLE counts (text_hash VARCHAR NOT NULL REFERENCES texts, word VARCHAR NOT NULL, count int, PRIMARY KEY (text_hash, word));
	you could keep track which user submitted a URL on the urls table i guess
	if you needed manual sharding (i.e. not cockroach) you could partition counts pretty easily by the text_hash too

	rekik86 @rekik86 Jun 22 2016 17:15
	ok one sec I'm trying to understand your tables
	the first one is storing the body of the text ?

	David Taylor @dt Jun 22 2016 17:16
	i was… guess you could ditch that if you don’t need it
	all you really need is the unique hash to refer to from the other tables with foreign keys

	rekik86 @rekik86 Jun 22 2016 17:17
	ok I get it
	sounds good

	David Taylor @dt Jun 22 2016 17:19
	if the composite primary key on counts is too fancy, you could just change it to a UNIQUE(text_hash, word) and get the same effect (ensuring there’s only one count for each word-text pair)

	rekik86 @rekik86 Jun 22 2016 17:19
	what does UNIQUE do ?

	David Taylor @dt Jun 22 2016 17:20
	UNIQUE on a column says that a value in that column is unique
	like, if you had a users table and said the email column was unique, you couldn’t end up with two users for the same email
	(which would make, e.g. a password recovery impossible)

	rekik86 @rekik86 Jun 22 2016 17:21
	indeed
	really interesting

	David Taylor @dt Jun 22 2016 17:21
	at the table level, you can say UNIQUE (col_a, col_b) and you get that the combination is unique
	so you can have at-most-one row for a given combination of values in those columns
	above, that’s used to ensure that for a given word in a given text, you only have one count of occurances

	rekik86 @rekik86 Jun 22 2016 17:22
	get it

	David Taylor @dt Jun 22 2016 17:23
	PRIMARY KEY implies UNIQUE and NOT NULL

	rekik86 @rekik86 Jun 22 2016 17:25
	is there not an automated way to get unicity with a ID (number), which would be incremented for any new text_hash,word combination ?

	David Taylor @dt Jun 22 2016 17:26
	sure, you could have an id INT PRIMARY KEY SERIAL

	rekik86 @rekik86 Jun 22 2016 17:26
	any pros/cons ?

	rekik86 @rekik86 Jun 22 2016 17:33
	I have what I need I think anyway
	Thanks so much David
	_