Navigating Database Technologies for Hierarchical and Social Network Data Management
In today's digital landscape, efficient data management is pivotal, especially when dealing with complex structures like trees and graphs. These structures, foundational in computer science, dictate how data is stored, accessed, and manipulated across various platforms, from social networks to organizational systems. This article provides a comprehensive exploration of tree and graph structures and their implementation in different database systems, assessing their effectiveness in contexts ranging from hierarchical data management in traditional databases to sophisticated relationship mapping in graph databases.
Tree Structures
- Characteristics: Trees epitomize a hierarchical model, with each node (except the root) having exactly one parent, forming a unique path from the root.
- Use Cases: Ideal for organizational charts, file systems, and taxonomies.
- Limitations: The single-parent restriction and scalability issues can make complex, interconnected datasets hard to model.
Graph Structures
- Characteristics: Graphs represent a network model without hierarchical constraints, allowing multiple and varied relationships between nodes.
- Use Cases: Best suited for social networks, geographic information systems, and network analysis.
- Limitations: Graph operations can be complex and resource-intensive.
Neo4j. Renowned for its robustness in cloud environments, Neo4j leads in graph database technology but limits horizontal scaling in its community edition.
Microsoft Azure Cosmos DB. Comparable to Neo4j in capabilities, it shines in cloud-based deployments, especially for mature projects.
OrientDB. Now community-driven, OrientDB supports cluster nodes efficiently with moderate resource requirements: about 4 GB of RAM is enough to run a cluster node.
TerminusDB, TypeDB, and others. These databases bring unique features to the table, though they are less widely adopted; considerations like community support and licensing play a crucial role in selection.
To manage comments in a social network context, we can use a table structure in PostgreSQL like this:
CREATE TABLE comments (
id UUID PRIMARY KEY,
post_id UUID NOT NULL REFERENCES posts(id),
user_id UUID NOT NULL REFERENCES users(id),
parent_comment_id UUID REFERENCES comments(id),
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP
);
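PostgreSQL does not create indexes on foreign key columns automatically, so the recursive lookups and pagination below benefit from a couple of supporting indexes. A minimal sketch (the index names are illustrative, not prescribed by the schema above):

-- speeds up walking parent/child relationships
CREATE INDEX idx_comments_parent ON comments (parent_comment_id);
-- supports keyset pagination within a post, newest first
CREATE INDEX idx_comments_post_created ON comments (post_id, created_at DESC);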
We can retrieve a comment along with all of its parent comments using a recursive query. Here's an example, with _id and _max_recursion as placeholders, that limits the recursion depth to guard against cycles and runaway recursion:
WITH RECURSIVE recursive_cte AS (
SELECT
id,
parent_comment_id,
1 AS level
FROM
comments
WHERE
id = _id
UNION
SELECT
c.id,
c.parent_comment_id,
rec.level + 1 AS level
FROM
recursive_cte rec
INNER JOIN comments c ON rec.parent_comment_id = c.id
WHERE
rec.level < _max_recursion
)
SELECT r.id, r.parent_comment_id, r.level
FROM recursive_cte r;
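The same recursive technique works in the opposite direction, walking from a comment down through its replies. A sketch under the same assumptions (_id and _max_recursion remain placeholders, as above):

WITH RECURSIVE replies AS (
SELECT id, parent_comment_id, 1 AS level
FROM comments
WHERE id = _id
UNION ALL
SELECT c.id, c.parent_comment_id, r.level + 1
FROM replies r
INNER JOIN comments c ON c.parent_comment_id = r.id
WHERE r.level < _max_recursion
)
SELECT id, parent_comment_id, level FROM replies;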
To load comments page by page:
SELECT * FROM comments
WHERE post_id = $post_id AND created_at < $last_created_at
ORDER BY created_at DESC
LIMIT $page_size;
This approach uses the created_at field to paginate through comments, effectively implementing keyset pagination.
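Since created_at is not unique, two comments sharing a timestamp could be skipped or duplicated at page boundaries. A common refinement, sketched here with an extra $last_id placeholder, adds the primary key as a tie-breaker using a row-value comparison:

SELECT * FROM comments
WHERE post_id = $post_id
AND (created_at, id) < ($last_created_at, $last_id)
ORDER BY created_at DESC, id DESC
LIMIT $page_size;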
For very large comment tables, partitioning can help (a sketch of the range approach follows this list):
- Range Partitioning: Comments can be partitioned based on the creation date (e.g., by month or year).
- List Partitioning: Partitioning can be done based on post_id, where each partition contains comments for a subset of posts.
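A minimal sketch of range partitioning by month (table and partition names are illustrative; note that PostgreSQL requires the partition key to be part of the primary key):

CREATE TABLE comments_partitioned (
id UUID NOT NULL,
post_id UUID NOT NULL,
user_id UUID NOT NULL,
parent_comment_id UUID,
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, created_at)  -- partition key included; FKs omitted in this sketch
) PARTITION BY RANGE (created_at);

CREATE TABLE comments_2024_01 PARTITION OF comments_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');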
For identifying popular comments, a materialized view can be used:
CREATE MATERIALIZED VIEW popular_comments AS
SELECT c.*, COUNT(l.id) AS likes_count
FROM comments c
LEFT JOIN likes l ON l.comment_id = c.id
GROUP BY c.id
ORDER BY likes_count DESC;
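A materialized view is a snapshot and must be refreshed explicitly. With a unique index in place, the refresh can run without blocking readers; a minimal sketch (the index name is illustrative):

-- a unique index is required for CONCURRENTLY
CREATE UNIQUE INDEX popular_comments_id_idx ON popular_comments (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY popular_comments;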
Scaling strategies for this kind of workload include:
- Vertical Scaling: Enhancing server hardware capabilities.
- Horizontal Scaling: Implementing read replicas, data sharding, and partitioning.
- Optimization Techniques: Query optimization, efficient schema design, and appropriate indexing.
To assess PostgreSQL's capabilities in handling and querying hierarchical data, a comprehensive test was conducted. The test involved creating a large dataset using the ltree extension in PostgreSQL, followed by a series of queries to evaluate performance. Four key scripts were used in the experiment, each serving a distinct purpose.
The first script focused on generating a substantial volume of hierarchical data. It created a table big_tree_test with an ltree[] column to represent hierarchical paths. Key points of this script include:
- Table and Index Creation: Setting up a table with an ltree[] column and creating a B-tree index on it.
- Data Insertion: Inserting a large volume of rows to simulate a complex hierarchical structure.
Performance Metrics:
- Rows Generated: 21,510,000
- Time Taken: Approximately 45 minutes
- Memory Usage: About 5.859 GiB
- CPU Utilization: Utilized only one core, indicating an area for potential optimization.
A shortened version of the script:
create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';
insert into big_tree_test
select
t1.id as id,
null as parent_id,
'#' || t1.id || ' root' || t1.r
from (
select id, pg_temp.random_int(1, _root_rows) r
from generate_series(1, _root_rows) id
) t1;
-- after that we can create child comments whose parent_id references the rows generated above
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' l' || i || ' ' || t1.id || ' sub' || t1.r
from (
select _root_rows + _rows_per_group * 1 + id as id,
pg_temp.random_int(1, _root_rows + _rows_per_group * 1) p, -- random parent_id
pg_temp.random_int(1, _rows_per_group) r -- just random number
from generate_series(1, _rows_per_group) id
) t1;
The full SQL, including generation of the ltree paths:
do
language plpgsql
$$
declare
_root_rows int = 1500000;
_rows_per_group int = 200;
_max_group int = 100000;
_very_deep int = 10000;
begin
drop table if exists big_tree_test;
CREATE TABLE big_tree_test (
id serial PRIMARY KEY,
parent_id int NULL,
content text NOT NULL,
permissions ltree[]
);
create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';
insert into big_tree_test
select
t1.id as id,
null as parent_id,
'#' || t1.id || ' root' || t1.r,
cast((concat('{"n', t1.id,'"}')) as ltree[]) as permissions
from (
select id, pg_temp.random_int(1, _root_rows) r
from generate_series(1, _root_rows) id
) t1;
-- the most readable approach is to pick parents from the already created "pages" of rows:
-- first one page, then two, three, ... otherwise parent_id could point to a row that does not exist yet
FOR i IN 0..(_max_group - 1) LOOP
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' l' || i || ' ' || t1.id || ' sub' || t1.r,
cast((
SELECT
array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
FROM big_tree_test ss
WHERE id = t1.p
) as ltree[]) AS permissions
from (
select _root_rows + _rows_per_group * i + id as id,
pg_temp.random_int(1, _root_rows + _rows_per_group * i) p,
pg_temp.random_int(1, _rows_per_group) r
from generate_series(1, _rows_per_group) id
) t1;
END LOOP;
-- to create very deep comments, pick parent_id from among the most recently created rows
FOR i IN 1..(_very_deep) LOOP
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' dd' || i || ' ' || t1.id || ' sub' || t1.r,
cast((
SELECT
array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
FROM big_tree_test ss
WHERE id = t1.p
) as ltree[]) AS permissions
from (
select _root_rows + _rows_per_group * _max_group + i as id,
pg_temp.random_int(_root_rows + _rows_per_group * _max_group + i * 9 / 10, _root_rows + _rows_per_group * _max_group + i - 1) p,
pg_temp.random_int(1, _rows_per_group) r
) t1;
END LOOP;
end;
$$;
-- the bad one: a B-tree is only usable for whole-array equality comparisons
CREATE INDEX idx_big_tree_test_permissions ON big_tree_test USING btree(permissions);
-- the good one: GiST supports the ltree operators (@>, <@, ~)
CREATE INDEX idx_big_tree_gist_permissions ON big_tree_test USING gist(permissions);
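On PostgreSQL 13 and later, the GiST opclass for ltree[] accepts a siglen parameter that trades index size for signature precision, which is relevant to the lossy-recheck behavior discussed below. A hedged sketch (the value 100 is an assumption, not what this test used):

-- larger siglen: bigger index, fewer false matches to recheck
CREATE INDEX idx_big_tree_gist_siglen ON big_tree_test
USING gist (permissions gist__ltree_ops (siglen = 100));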
Size of the data with both indexes: 9.9 GiB
Without the B-tree index: 6.4 GiB
Without the GiST index as well: 4.9 GiB
Creating the GiST index:
- Time: 6m 37s
- CPU: 1 core
- RAM: up to 2.2 GiB
The second script performed a full table scan using the nlevel function on the permissions column.
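For reference, nlevel returns the number of labels in an ltree path; a tiny illustration with a made-up path:

SELECT nlevel('n42.s_0_100.s_1_200'::ltree);  -- returns 3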
Performance Metrics:
- Execution Time: 1 minute and 11 seconds
- This time reflects the duration for a full scan without the benefit of an optimized index for the operation.
SELECT
*,
nlevel(t1.permissions[1]) as level
FROM big_tree_test t1
ORDER BY level DESC
Here and throughout the article: psql (16.1 (Debian 16.1-1.pgdg120+1))
Gather Merge (cost=3422605.51..5514001.53 rows=17925000 width=136)
Workers Planned: 2
-> Sort (cost=3421605.49..3444011.74 rows=8962500 width=136)
Sort Key: (nlevel(permissions[1])) DESC
-> Parallel Seq Scan on big_tree_test t1 (cost=0.00..548625.25 rows=8962500 width=136)
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
The third script executed a query using the ~ operator for pattern matching against the permissions column.
Performance Metrics:
- Rows Fetched: 10,000
- First Run Time: 19 seconds
- Subsequent Run Time: 5 seconds
- The improved performance on the second run demonstrates PostgreSQL's efficiency in caching and query optimization.
SELECT
COUNT(*)
FROM big_tree_test t1
WHERE '*.s_16340_4768122.*' ~ t1.permissions
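For readers unfamiliar with lquery, the ~ operator matches a path against a pattern, where * stands for zero or more labels; a small illustration with made-up values:

SELECT 'n1.s_0_2.s_1_3'::ltree ~ '*.s_0_2.*'::lquery;          -- true
-- with an ltree[] operand, the match succeeds if any array element matches
SELECT '*.s_0_2.*'::lquery ~ ARRAY['n1.s_0_2.s_1_3'::ltree];   -- true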
Aggregate (cost=8265.99..8266.00 rows=1 width=8)
-> Bitmap Heap Scan on big_tree_test t1 (cost=101.09..8260.61 rows=2151 width=0)
Recheck Cond: ('*.s_16340_4768122.*'::lquery ~ permissions)
-> Bitmap Index Scan on idx_big_tree_gist_permissions (cost=0.00..100.55 rows=2151 width=0)
Index Cond: (permissions ~ '*.s_16340_4768122.*'::lquery)
https://explain.dalibo.com/plan/c6bg7g1f2cbacgd3
- Recheck Cond ('*.s_16340_4768122.*'::lquery ~ t1.permissions): during the bitmap heap scan, the database had to recheck the rows fetched by the index scan against this condition. The recheck is necessary because the bitmap index scan is lossy: it can fetch rows that appear to match based on the index but do not actually satisfy the condition.
- Rows Removed by Index Recheck (9,602,264): the number of rows initially identified by the index scan as potential matches but discarded upon rechecking the condition. The initial index scan fetched many rows that did not actually meet the query condition.
https://wiki.postgresql.org/wiki/Bitmap_Indexes#General
The final script used the @> operator to fetch specific hierarchical data.
Performance Metrics:
- Rows Fetched: 10,000
- Execution Time: Approximately 200 milliseconds
- This query's speed showcases the effectiveness of GiST indexes in handling ltree operations.
SELECT
*
FROM big_tree_test t1
WHERE 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000' @> t1.permissions
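For context on the operator: between two ltree values, @> tests whether the left path is an ancestor of (or equal to) the right one; with an ltree[] on the right, it tests whether the left path is an ancestor of any array element. A small illustration with made-up values:

SELECT 'a.b'::ltree @> 'a.b.c'::ltree;                          -- true
SELECT 'a.b'::ltree @> ARRAY['x.y'::ltree, 'a.b.c'::ltree];     -- true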
Bitmap Heap Scan on big_tree_test t1 (cost=101.09..8260.61 rows=2151 width=132)
Recheck Cond: ('n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree @> permissions)
-> Bitmap Index Scan on idx_big_tree_gist_permissions (cost=0.00..100.55 rows=2151 width=0)
Index Cond: (permissions <@ 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree)
https://explain.dalibo.com/plan/5g3a7g5eb172gb04