Start clickhouse-client and connect to your ClickHouse server:
$ clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password>
You should be greeted by the ClickHouse prompt:
ClickHouse client version 21.12.3.32 (official build).
Connecting to localhost:9000 as user admin.
Connected to ClickHouse server version 21.12.3 revision 54452.
localhost :)
At this prompt we can execute our queries. Here we are going to use the s3 table function:
SELECT *
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
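For reference, the general form of the s3 table function is (credentials and compression are optional):
s3(path [, aws_access_key_id, aws_secret_access_key], format, structure [, compression])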
Notice that we surrounded the column names with ` characters. This allows us to access column names containing dots (".") and other reserved characters.
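For instance, if the Parquet schema contained nested fields flattened to dotted names (the column names below are purely hypothetical), the structure argument could look like this:
SELECT `meta.id`, `meta.name`
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`meta.id` Int64, `meta.name` String' -- hypothetical dotted column names
)
LIMIT 10;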
Glob patterns also let us select from multiple files at once. For example, we can select all the files between 2022-01-11 10:00 and 12:00 (noon), i.e. hours 10 and 11, from a partitioned dataset using the following query:
SELECT *
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day=11/hour={10..11}/*.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
In addition to the access patterns above, we can also use the _path or _file virtual columns when filtering.
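A quick way to see which object each row came from is to select these virtual columns directly (a sketch reusing the placeholder bucket and schema from above):
SELECT
_path, -- path of the source object
_file, -- name of the source file
count() AS rows
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/year=*/month=*/day=*/hour=*/*.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`column_name_a` Int64, `column_name_b` String'
)
GROUP BY _path, _file
LIMIT 10;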
To illustrate, say we have a file structure in S3 with files like this:
year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet
This layout is quite common for partitioned datasets produced by systems such as PrestoDB or Apache Hive.
What if we wanted to access only the files in a given date range, but the date is not stored inside the Parquet files themselves? Then we can use the virtual _path column, which contains the string above, and parse the date out of it. Here is how to parse such a string into a DateTime:
SELECT
parseDateTimeBestEffortOrNull(
replaceRegexpOne(
'year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet',
'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
'\\1-\\2-\\3 \\4:00:00'
)
) as path_date;
We can then construct queries that filter efficiently on specific DateTime periods:
SELECT column_name_b,
parseDateTimeBestEffortOrNull(
replaceRegexpOne(
_path,
'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
'\\1-\\2-\\3 \\4:00:00'
)
) as path_date
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/year=*/month=*/day=*/hour=*/*.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`column_name_a` Int64, `column_name_b` String'
)
WHERE toYYYYMMDD(path_date) BETWEEN 20220113 AND 20220114
LIMIT 10;
This produces results like:
┌─column_name_b──────────────┬───────────path_date─┐
│ 540ea21f-a8b8-9fa703509eeb │ 2022-01-13 22:00:00 │
│ 3356eb45-35ef-09a726b18726 │ 2022-01-13 22:00:00 │
│ f8034960-6361-bf20826518c9 │ 2022-01-13 22:00:00 │
│ cc0bc4fb-1247-ac6d3e0d58c4 │ 2022-01-13 22:00:00 │
│ ec6dcafc-43d9-66afb8cdd48c │ 2022-01-13 22:00:00 │
│ f082fbe0-eae7-78da244c9649 │ 2022-01-13 22:00:00 │
│ 5b7c8ecc-6298-3dcbfce7b3ce │ 2022-01-13 22:00:00 │
│ 823e0606-905c-e0307813ed8a │ 2022-01-13 22:00:00 │
│ 99c9bcd3-437d-09ab6493324d │ 2022-01-13 22:00:00 │
│ eec25407-9766-aeebccb33adc │ 2022-01-13 22:00:00 │
└────────────────────────────┴─────────────────────┘
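The _file virtual column can be used in the same way. For example, to read rows from a single object by its file name (reusing the example file name from the layout above), a query could look like this:
SELECT column_name_a, column_name_b
FROM s3(
'http://s3.amazonaws.com/[bucket_name]/year=*/month=*/day=*/hour=*/*.parquet',
'<access-key>',
'<secret-key>',
'Parquet',
'`column_name_a` Int64, `column_name_b` String'
)
-- _file matches the object's file name without the directory part
WHERE _file = '5c454b8b35e8466aafc74960847970a0.parquet'
LIMIT 10;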