store attribute that has multiple values
CREATE TABLE Products (
-- ...
account_id VARCHAR(100),
-- ...
)
- We should not store account_id separated by comma
- hard to query, update, maintain
- Use intersection table to inverse the dependencies
- Store tree structure
- Depends on Parent
- child holds parent id
- ex.) get all childs for the parent: need to issue JOIN for every traverse
- hard to maintain (specifically deletion the node)
- Path Enumeration
- store path (ex.
1/2/3/
1/4/6/7/
) information
- store path (ex.
- Nested Set
- Closure Table
- Create additional table to store tree structure
CREATE TABLE TreePaths ( ancestor BIGINT, descendant BIGINT, PRIMARY KEY (ancestor, descendant), FOREIGN KEY (ancestor) REFERENCES Comments(comment_id), FOREIGN KEY (descendant) REFERENCES Comments(comment_id), );
- Create additional table to store tree structure
Use GraphQL might be much more faster?
- How to define primary key
- Add
id
column to all tables- it causes creation of redundunt key
- it might allows undesired duplicate rows
- hard to understand the meaning of key
- Use the column name that's easy to understand
- id => comment_id, user_id, product_id
- compound key if needed
- Simplify database architecture
- Not using foreign key
- to Update/Delete the row that refers other table column, need to delete the data in the dependent table beforehand
- Use foreign key
- Able to do Cascade Update
CREATE TABLE Bugs ( reported_by BIGINT, status VARCHAR(20) NOT NULL, FOREIGN KEY(reported_by) REFERENCES Accounts(account_id) ON UPDATE CASCADE ON DELETE RESTRICT, FOREIGN KEY (status) REFERENCES BugStatus(status) ON UPDATE CASCADE ON DELETE SET DEFAULT );
- Able to do Cascade Update
Support mutable attribute
- Create flexible attribute tabled
- what's wrong with above architecture:
- unable to set
required
attribute - all the attribute data type become same
- for instance, reported_date, version should be defined as date, float etc, but unable to define like that cause all attributes should have the same datatype (most likely VARCHAR etc)
- we need to handle the datatype is valid on application side
- unable to set
- If the mutable data structure is needed, let's use NoSQL etc.
- Single Table inheritance
- create Table that has all the attributes as a column
- cons: some rows might have NULL values for some attributes.
- create Table that has all the attributes as a column
- Concrete Table inheritance
- follow STi, but create table for each subclasses
- ex.) BugsAttributes, FeatureRequestAttributes etc
- follow STi, but create table for each subclasses
- Refer multiple parent tables
- don't use foreign key
CREATE TABLE Comments (
comment_id SERIAL PRIMARY KEY,
issue_type VARCHAR(20),
-- holds Bugs or FeatureRequests
issue_id BIGINT
);
- if we define the Comment table like above, we cannot use foreign key cause it refers two tables (Bugs or FeatureRequests).
- and also, if we query the comments from issue_id, need to JOIN two parent tables
- zombie comments (comments that refers non-exist rows) might exist cause it cannot use foreign key.
- Define intersection table
- Alternatively, we can use Common Parent table like this:
- define attributes that has multiple values
- ex.) issue labels
- it may has release version, severity, assignee, ETA etc
- ex.) issue labels
- define multiple columns
CREATE TABLE Bugs (
bug_id SERIAL PRIMARY KEY,
details VARCHAR(200),
-- holds Bugs or FeatureRequests
labels1 VARCHAR(20),
labels2 VARCHAR(20),
labels3 VARCHAR(20),
);
- in above case, we can holds up to 3 labels
- if the issue has only 1 or 2 labels, means some values become NULL
- what's the problem?
- to search the row that has "performance" label
- need to query like
WHERE label1 = 'performance' OR ...
- need to query like
- same as above, update and delete will be bothersome too
- some labels could have the same value
- it'd be bothersome too when we increase the number of available label (ex. label4, label5)
- to search the row that has "performance" label
- Create Dependent Table that has label values
CREATE TABLE Labels (
label_id SERIAL PRIMARY KEY,
label VARCHAR(20),
-- holds Bugs or FeatureRequests
PRIMARY KEY (label_id, label)
FOREIGN KEY (label_id) REFERENCES Labels(label_id)
);
- Increase Scalability in case of the Table becomes huge
- Split Table
- Tables like: Bugs_2008, Bugs_2009 etc
- what's the problems?
- need to have some tricky logics for the boundary values
- ex.) register row to Bugs_2010 cause the date of data is Jan 3.
- later realized the date of the data was 2009 Dec
- in that case, we need to delete dat in Bugs_2010 and copy to Bugs_2009
- hard to query among multiple tables
- need to have some tricky logics for the boundary values
- Partitioning table
- Horizonal Partitioning:
CREATE TABLE Bugs (bug_id SERIAL PRIMARY KEY, -- ... ) PARTITION BY HASH (YEAR (date_reported)) PARTITIONS 4;
- Virtical Partitioning:
- split table by column
- it's beneficial when we use huge datatype like BLOB, TEXT etc
- Horizonal Partitioning:
- Introduce Dependent Table
Use float values
- Using FLOAT data type
- what's wrong?
- float is not an actual value, is a value approximated by the power of 2
- Using NUMERIC or DECIMAL
- limit column values to specific values
- ex.) let
status
column can holds only OPEN, RESOLVED, CLOSED etc.
- ex.) let
- Define limited value by column definition
CREATE TABLE Bugs (
bug_id SERIAL PRIMARY KEY,
status VARCHAR(20) CHECK (status IN ('NEW', 'IN PROGRESS', "RESOLVED"))
);
- what's the problems?
- Unable to know the available values on the application side
- to delete the allowed value, need to do:
- remove constraint
- update target value
- apply new constraint
- Creata Mater table that holds the alloed value
- Store media files (ex. pics, videos)
We can use two ways based on the application situations:
-
Using BLOB
-
store filepath of the actual files
- hard to maintain transaction, rollback etc.
-
alternatively, some database supports the logic that can handle the external files transparently
- Oracle: BFILE
- SQL SERVER 2008: FILESTREAM
- Increase the performance
- Applying the index without analysis
- if we apply the index too much, DB needs to maintain the index on each insertion
- if we apply a few index, query costs would be huge
- Analysis
- check slow query by analysis tools (ex. pt-query-digest)
- check the index is used or not by EXPLAIN the actual queries
- Allow the row that has lacked columns
- ex.) for storing the information of employee, some has middle name but others not
- Use NULL as a general value
- what's the problem
- WHERE NOT cannot fetch NULL values
- Unable to use prepared statement for NULL value
- Avoid Using NULL and use alternative values that means NULL
- -1, "" etc
- if we use alternative values, means the application need to have above domain knowledges
- not use NULL value as much as possible
- if impossible, use NULL in a scalar formulas
- TRUE and NULL = NULL
- FALSE and NULL = FALSE
- TRUE or NULL = TRUE
- FALSE or NULL = NULL
- Use COALESCE to return non-NULL if the value is NULL
- get the row that has the maximum value in the group
- refer non-grouped value
SELECT product_id, MAX(date_reported) AS latest, bug_id
FROM Bugs GROUP BY product_id;
- what's the problem:
- product_id , date_reported are correct, for these columns, the value would be the unique
- buf for bug_id, it's out of scope of GROUP, means it may return ambiguous values
- the behaviour differs for each database, for MySQL and SQLite, it'll returns the ambiguous values, others return error in general.
- when using GROUP BY, we need to follow
Single Value Rule
.
- Use subquery
- Use derived table
- fetch random row
ORDER BY RAND() LIMIT n
- performance down, cause DB cannot utilize index fo this randomized sort
- Specify random value between the smallest and largest
CEIL(RAND() * (SELECT MAX(id) From Bugs)) AS rand_id
- Get all ids firstly, then specify the random id from the result
- Full text search
- Using pattern match statement
LIKE %crash%
- what's the problem
- cannot use index for that search
- Using thirdparty searchengine like Sphinx, Lucene, Elasticsearch
- use MyISAM FULLTEXT INDEX
- available for CHAR, VARCHAR, TEXT only
- SQL Server also supports full text index
- PostgreSQL has TSVECTOR type that supports full text search
- SQLite supports FTS
- Decrease the number of query
- Forcibly resolve the problem with one query
- SELECT, JOIN, JOIN, JOIN...
- it may leads to undesired Cartesian Products
- use UNION statement to combine the each qeury result
SELECT p.product_id, 'FIXED' as status, COUNT(p.id) AS bug_count FROM Bugs
UNION ALL
SELECT p.product_id, 'OPEN' as status,COUNT(p.id) AS bug_count FROM Bugs
- If possible, better to divide one huge query to multiple queries and do divide and conquer for more maintainability
- Reduce the size of query
- Using asterisk
- what's wrong
- if you use JOIN and both tables has the same name column, it may leads to undesired error
- JOIN Book table and Author table, get
$row["title"]
=> if both Book and Author table has "title" column, you'll get untrusted results
- JOIN Book table and Author table, get
- to reduce the consumption of NW bandwidth, better not to fetch the non-necessary data
- if you use JOIN and both tables has the same name column, it may leads to undesired error
- specify the required column name explicitly, instead of using asterisk
- Storing password into the database
- storing password as a plain text
- HUGE SECURITY RISK
- Storing Hash with Solt
- Prepared statement
- passing the variables when preparing the query
- filtering the query on the application side
- don't fill the lacked value,
- ex.) the lacked row of auto-increment column
- reusing the pseudo key may leads to undesired error
- we can use UUID alternatively
// omit
- let's create documentation and required information to make DB much more durable, and for preparing unexpected situations
- ER diagrams
- DDL
- infrastructure (conf, logging)
- NW
- seed data
- PIT backup
- unittest for DB(table, column exist or not, check constraint, check ORM, check seed data)
- don't make DAO =~ Model in MVC context
- let models hold domain knowledge and decouping with DAO