"MRT data in clickhouse"
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123" # HTTP interface
      - "9000:9000" # Native/TCP interface
    environment:
      CLICKHOUSE_DB: "mrt"
      CLICKHOUSE_USER: "default"
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
      CLICKHOUSE_PASSWORD: "clickhouse"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

Introduction

Route collector projects (RIPE RIS, Routeviews, ...) collect BGP data in MRT files. This is a binary file format that you can only process with specific programs. In this article we describe one alternative way to work with this data.

When working with this data, analysts often parse the raw MRT data for each lookup they want to do: parse -> grep -> process. One alternative is to insert the data into a database and query that.

In this article, we show the basics of how you can index MRT data in clickhouse and analyse it. This still needs infrastructure (clickhouse), but we want to show this approach because it is flexible and may fit in your analysis stack. If you are more used to a data science approach, stay tuned: we plan to have a file format for that type of analysis ready at the RIPE meeting.

Indexing the data in a database makes querying more ergonomic. Some people do this already (e.g. https://anuragbhatia.com/post/2025/02/analysing-transit-free-networks/), but there have been few write-ups of how to do it. Below, you can see the CLI commands for working with a dump file. After that, we sketch the outline of how you can get the data into clickhouse and how you query it.

"Parse and Grep": When you know what RIB/bview file you need to search in, or you are investigating the updates for a short time period, it is very fast to parse the files and filter the lines that you need. The example below uses bgpreader (from bgpstream). But you can also use monocle (from bgpkit).

$ bgpreader -m -d singlefile -o rib-file=bview.20250902.1600.gz | grep '193.0.14.0/24'
WARN: No time window specified, defaulting to all available data
TABLE_DUMP2|1756828800|B|80.77.16.114|34549|193.0.14.0/24|34549 25152|IGP|80.77.16.114|0|0|34549:200 34549:10000|NAG||
TABLE_DUMP2|1756828800|B|165.16.221.66|37721|193.0.14.0/24|37721 25152|IGP|165.16.221.66|0|0|37721:2000 37721:6002 37721:6003 37721:6004 37721:11000 37721:11100 37721:11101|NAG||
TABLE_DUMP2|1756828800|B|185.210.224.254|49432|193.0.14.0/24|49432 25152|IGP|185.210.224.254|0|1000||NAG||
TABLE_DUMP2|1756828800|B|193.33.94.251|58057|193.0.14.0/24|58057 1836 25152|IGP|193.33.94.251|0|0|1836:20000 1836:110 1836:3000 1836:3020 58057:65010 25152:1016|NAG||
...
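The same filter with monocle would look roughly like this (a sketch: monocle's default output is also pipe-separated, so the same grep works, but the exact column layout differs from bgpreader's):

$ monocle parse bview.20250902.1600.gz | grep '193.0.14.0/24'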

How to load BGP data in clickhouse

  • Run a clickhouse database and have the CLI working (out of scope here).
    • See compose.yml at the top of this gist for a compose file; a start command is shown after this list.
  • Download MRT RIB dumps (or updates)
  • Create the table
  • Import
  • Query ✨
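If you use the compose.yml above, starting clickhouse and checking that the CLI can reach it is roughly this (assuming Docker Compose v2 and a locally installed clickhouse-client):

$ docker compose up -d
$ clickhouse-client --host=localhost --user=default --password=clickhouse -q "SELECT version()"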

Downloading MRT RIB dumps

You can download the MRT archives from https://data.ris.ripe.net/. Each collector has a directory (e.g. https://data.ris.ripe.net/rrc00/) in which there is a directory per month. For example one link for the rrc00 multi-hop collector is https://data.ris.ripe.net/rrc00/2025.09/bview.20250905.0000.gz.

When indexing in a database, I recommend starting with a small number of dumps.
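For a single dump, a plain download of one of the files linked above is enough, for example:

$ wget https://data.ris.ripe.net/rrc00/2025.09/bview.20250905.0000.gz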

There are also tools that automate the download, e.g. mrt-downloader:

$ uvx mrt-downloader ~/tmp/mrt/bviews \
  2025-09-02T16:00 \
  2025-09-02T16:00:01 \
  --rib-only \
  --project ris --project routeviews \
  --create-target
Downloading updates from 2025-09-02 16:00:00 to 2025-09-02 16:00:01 to /home/user/tmp/bviews
Partitioning directories by collector and month
Gathering index for 77 collectors: RRC00, RRC01, RRC02, RRC03, RRC04, RRC05, RRC06, RRC07, RRC08, RRC09, RRC10, RRC11, RRC12, RRC13, RRC14, RRC15, RRC16, RRC18, RRC19, RRC20, ...

This downloads all files in the time range (the last second is exclusive). If you want only a specific collector, use --collector (e.g. --collector rrc13 for MSK-IX); a sketch follows below.
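Restricting the earlier command to rrc13 would then look roughly like this (same flags as before, only --collector added):

$ uvx mrt-downloader ~/tmp/mrt/bviews \
  2025-09-02T16:00 \
  2025-09-02T16:00:01 \
  --rib-only \
  --project ris \
  --collector rrc13 \
  --create-target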

Creating the table

The tool that we will use here (monocle) has a pipe-separated output format. The table layout may vary depending on what you want to do. For this example I use:

CREATE TABLE rib
(
    `operation` String,
    `timestamp` DateTime,
    `peer_ip` String,
    `peer_as` String,
    `prefix` String,
    `as_path` String,
    `source` String,
    `next_hop` String,
    `_` String,
    `__` String,
    `communities` String,
    `___` String,
    `____` String,
    `_____` String,
    `______` String
)
ENGINE = MergeTree
ORDER BY (prefix, peer_ip, peer_as)

This is the simplest schema possible. You can use a more complicated schema, which may reduce storage, make querying easier, etc. At the end, I will show an improved schema. But for now, let's keep it simple :).
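After creating the table, you can check the resulting column layout with, for example:

DESCRIBE TABLE rib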

Importing data

Importing data is "ugly" in this case:

  • Use monocle (see its installation instructions) to parse the file
  • Connect to clickhouse, set up the CSV separator, load data from STDIN.

This command assumes you are in the directory where you downloaded the RIB files. It parses each of them, configures clickhouse to expect | as the CSV separator, and pipes the data into clickhouse:

for bview in $(find . -name \*.gz); do
    monocle parse $bview | clickhouse-client --host=localhost --database=mrt --user=default --password=clickhouse -q "SET format_csv_delimiter = '|'; INSERT INTO rib FORMAT CSV"
done
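A quick sanity check that the import worked (the count depends on how many dumps you loaded):

$ clickhouse-client --host=localhost --database=mrt --user=default --password=clickhouse -q "SELECT count() FROM rib"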

Querying the DB

Routes for a specific prefix:

SELECT *
FROM rib
WHERE prefix = '193.0.14.0/24'
LIMIT 2

Query id: 7ea7a828-987b-496e-8fe4-d886177ca741

   ┌─operation─┬───────────timestamp─┬─peer_ip────────┬─origin_as─┬─prefix────────┬─as_path─────────────────────────────┬─source─┬─next_hop───────┬─_─┬─__─┬─communities──────────────────────────────────────────────────────────────────────────────────────────────┬─___───┬─____─┬─_____─┬─______─┐
1. │ A         │ 2025-09-01 08:00:00 │ 86.104.125.108 │ 33823     │ 193.0.14.0/24 │ 33823 25152                         │ IGP    │ 86.104.125.150 │ 0 │ 0  │ 25152:1074 39107:25152 39107:650 39107:200 39107:500 25152:1074:39107 39107:39107:25152                  │ false │      │       │        │
2. │ A         │ 2025-09-01 08:00:00 │ 80.81.196.156  │ 34927     │ 193.0.14.0/24 │ 34927 59900 12297 201986 8932 25152 │ IGP    │ 80.81.196.156  │ 0 │ 0  │ 34927:130 34927:163 25152:1070:57463 57463:64700:25152 65000:0:100 65000:0:200 65000:0:202 65000:0:12297 │ false │      │       │        │
   └───────────┴─────────────────────┴────────────────┴───────────┴───────────────┴─────────────────────────────────────┴────────┴────────────────┴───┴────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┴──────┴───────┴────────┘

2 rows in set. Elapsed: 0.015 sec. Processed 270.27 thousand rows, 42.65 MB (18.16 million rows/s., 2.87 GB/s.)
Peak memory usage: 72.72 MiB.

Prefixes transiting an ASN:

SELECT
    countDistinct(prefix),
    multiIf(position(prefix, ':') > 0, 'ipv6', 'ipv4') AS afi
FROM rib
WHERE position(as_path, ' 1299 ') > 0
GROUP BY ALL

Query id: 5661cc1d-189d-4038-b536-d691ff0854e5

   ┌─countDistinct(prefix)─┬─afi──┐
1. │                211018 │ ipv6 │
2. │                965376 │ ipv4 │
   └───────────────────────┴──────┘

2 rows in set. Elapsed: 1.022 sec. Processed 443.89 million rows, 24.11 GB (434.54 million rows/s., 23.60 GB/s.)
Peak memory usage: 2.62 GiB.
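Because the AS path is a plain string in this schema, a query on the origin AS needs a bit of string manipulation. A sketch that counts the prefixes originated by AS25152 (the last element of the path; AS-sets are ignored here):

SELECT countDistinct(prefix)
FROM rib
WHERE splitByChar(' ', as_path)[-1] = '25152'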

Et cetera

Conclusion

In this article I showed how you can import BGP dump files into clickhouse. It is relatively simple and it works. Queries are very fast when they match the order in which the data is stored. There is also room for improvement: clickhouse means running infrastructure, while the same queries could be possible on pre-processed files. And, because of the way the data is imported, it has little structure: AS paths are a string, which is both easy (you can do a regular expression match) and limiting (anything more structured, such as looking at the origin AS, needs extra string manipulation).

In the future we will show more methods you can use to look at this data, including a workflow where we use joins to easily calculate statistics for multiple networks.

Improvements:

Earlier on I mentioned a number of possible improvements to the schema. Some are in the list below. If they are struck through, they are included in the updated schema and commands below. Be careful with every change that you make: they only help if the database or your queries actually use them.

  • enum column for operation: operation Enum8('A' = 1, 'W' = 2)
  • LowCardinality column for dictionary encoding for columns with a low number of unique values (e.g. peer_ip)
  • Add secondary indices INDEX prefixes prefix TYPE set(100) GRANULARITY 2
  • order by truncated time to optimise queries ORDER BY (toStartOfFifteenMinutes(timestamp), prefix, peer_ip)
  • Partition the table PARTITION BY toStartOfDay(timestamp)
CREATE TABLE rib
(
    `operation` Enum8('A' = 1, 'W' = 2),
    `timestamp` DateTime,
    `peer_ip` LowCardinality(String),
    `peer_as` LowCardinality(String),
    `prefix` String,
    `as_path` Array(String), -- recall: can contain AS-sets
    `source` LowCardinality(String),
    `next_hop` String,
    `_` String,
    `__` String,
    `communities` Array(String),
    `___` String,
    `____` String,
    `_____` String,
    `______` String
)
ENGINE = MergeTree
ORDER BY (toStartOfFifteenMinutes(timestamp), as_path[-1], prefix)
PARTITION BY toStartOfDay(timestamp)

The import command then becomes:

for bview in $(find . -name \*.gz); do
 monocle parse $bview | clickhouse-client --host=localhost --database=mrt --user=default --password=clickhouse -q "
 SET format_csv_delimiter = '|'; INSERT INTO rib 
 SELECT 
     operation, timestamp, peer_ip, peer_as, prefix,
     splitByChar(' ', as_path_raw) as as_path,
     source, next_hop, col9, col10,
     splitByChar(' ', raw_communities) as communities,
     col12, col13, col14, col15
 FROM input('
     operation String, timestamp DateTime, peer_ip String,
     peer_as String, prefix String,
     as_path_raw String,
     source String, next_hop String, col9 String, col10 String,
     raw_communities String, col12 String, col13 String, col14 String, col15 String
 ') FORMAT CSV"
done
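With this schema, the transit query from earlier can use array functions instead of substring matching, for example (a sketch; AS-set members would still end up inside a single array element):

SELECT
    countDistinct(prefix),
    multiIf(position(prefix, ':') > 0, 'ipv6', 'ipv4') AS afi
FROM rib
WHERE has(as_path, '1299')
GROUP BY ALL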