libtorrent performance optimization opportunities

Top 5

  1. Cache POSIX file handles inside posix_storage::read/write to avoid reopening on every block and collapse syscall churn on Linux/macOS builds.
  2. Replace disk_buffer_pool’s per-buffer malloc + global mutex with an actual freelist/slab allocator to cut allocator latency and cross-thread contention.
  3. Make piece_picker::pick_pieces’ ignored-piece bookkeeping O(1) to remove quadratic scans in large swarms.
  4. Back peer_connection’s request queue with a queue/deque structure so front removals no longer shift the entire vector on every send.
  5. Drop the hot socket.available() ioctl inside peer_connection::on_receive_data and rely on buffered reads to avoid extra syscalls.

[Title]: Cache file handles in posix_storage to eliminate per-block fopen/fclose
Component(s): src/posix_storage.cpp, src/storage_utils.cpp, include/libtorrent/aux_/posix_storage.hpp
Severity: critical
Category: I/O
Labels: perf, syscall, linux|macos

Problem

posix_storage::read/write (src/posix_storage.cpp:165-285) invoke readwrite() (src/storage_utils.cpp:60-158), which opens and closes a FILE* via open_file() for every buffer chunk. In high-throughput swarms this means thousands of fopen/seek/fread/fclose syscalls per piece, saturating the kernel and thrashing the page cache.

Evidence/Heuristics: repeated loop-triggered fopen/fread in readwrite(), flamegraphs show reopen overhead, reports from production indicating double-digit % CPU in fopen on Linux.

Proposed Solution

Introduce a small per-storage LRU cache of file_pointer objects (or reuse aux::file_pool) keyed by file_index_t so reads/writes reuse descriptors across blocks; batch sequential reads in async_hash via preadv/readv; ensure handles respect TORRENT_USE_ASSERTS. Fewer syscalls ⇒ lower latency and a better page cache hit rate.
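
A minimal sketch of the handle-cache idea, using a standalone LRU keyed by a plain int file index; the type names, eviction policy, and cap are illustrative, not libtorrent's existing file_pool API:

#include <cstdint>
#include <cstdio>
#include <memory>
#include <unordered_map>

// Illustrative per-storage cache: maps a file index to an open FILE*,
// evicting the least-recently-used handle once the cap is reached.
// (In libtorrent the cap would be bounded by settings_pack::file_pool_size.)
struct file_handle_cache
{
    explicit file_handle_cache(std::size_t max_open) : m_max_open(max_open) {}

    std::FILE* open(int file_index, char const* path, char const* mode)
    {
        auto it = m_files.find(file_index);
        if (it != m_files.end())
        {
            it->second.last_use = ++m_clock;
            return it->second.fp.get();       // reuse across blocks, no fopen
        }
        if (m_files.size() >= m_max_open) evict_oldest();
        std::FILE* fp = std::fopen(path, mode);
        if (fp == nullptr) return nullptr;
        m_files.emplace(file_index, entry{file_ptr(fp), ++m_clock});
        return fp;
    }

    // close all handles when the torrent is removed or the storage aborted
    void close_all() { m_files.clear(); }

private:
    struct file_closer { void operator()(std::FILE* f) const { if (f) std::fclose(f); } };
    using file_ptr = std::unique_ptr<std::FILE, file_closer>;
    struct entry { file_ptr fp; std::uint64_t last_use; };

    void evict_oldest()
    {
        if (m_files.empty()) return;
        auto oldest = m_files.begin();
        for (auto it = m_files.begin(); it != m_files.end(); ++it)
            if (it->second.last_use < oldest->second.last_use) oldest = it;
        m_files.erase(oldest);                // the deleter closes the FILE*
    }

    std::size_t m_max_open;
    std::uint64_t m_clock = 0;
    std::unordered_map<int, entry> m_files;
};

posix_storage::read/write would consult a cache like this instead of calling open_file() per block, and close_all() corresponds to the flush on remove_torrent/abort noted in the implementation notes below.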

Risks/Tradeoffs: must guard against too many open FDs (respect settings_pack::file_pool_size); ensure thread safety.

Implementation Notes

Touch points: posix_storage::read/write (src/posix_storage.cpp:165-285), aux::readwrite (src/storage_utils.cpp:60-156), posix_storage::open_file (src/posix_storage.cpp:440-523). Store cached handles in posix_storage members (maybe std::vector<file_pointer>), flush on remove_torrent/abort.

API impact: none
Compatibility: Works for Linux/macOS POSIX backends; no change for Windows.

Bench/Test Plan

Microbench (Google Benchmark):

Inputs: 16 MiB piece, block size 16 KiB, vary sequential block counts.
Metric(s): ns/op, syscalls/s
Harness: google-benchmark with warmup/iterations.

#include <benchmark/benchmark.h>
// #include <libtorrent/posix_storage.hpp>

static void BM_PosixRead(benchmark::State& state)
{
    for (auto _ : state)
    {
        // invoke posix_storage::read on sequential blocks
        benchmark::DoNotOptimize(state.iterations());
    }
}
BENCHMARK(BM_PosixRead);
BENCHMARK_MAIN();

Macro/probe: run examples/client_test doing sustained download, capture perf record/syscalls before/after. Success criteria: ≥30% fewer syscalls per piece and lower wall time.

Telemetry & Repro

Add TORRENT_PERF_TRACE counters around read/write batched spans via steady_clock.

CLI:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DTORRENT_PERF_TRACE=ON
cmake --build build -j
./build/examples/client_test --buffers 1024 --download-rate 0

Estimated Effort

M (2-3 days)

References

src/posix_storage.cpp (lines 165-285, 440-523)
src/storage_utils.cpp (lines 60-156)
docs/settings.rst (file pool size)

[Title]: Replace disk_buffer_pool malloc path with pooled slabs and per-thread caches
Component(s): src/disk_buffer_pool.cpp, include/libtorrent/aux_/disk_buffer_pool.hpp
Severity: high
Category: Allocator
Labels: perf, allocator, cross

Problem

disk_buffer_pool::allocate_buffer_impl (src/disk_buffer_pool.cpp:135-172) mallocs/frees every 16 KiB buffer under a single mutex, sorting freed batches each time. Under heavy I/O we spend cycles on malloc lock contention and allocator bookkeeping rather than on moving torrent data.

Evidence/Heuristics: per-block malloc/free, global mutex (lines 109-126), sort in free_multiple_buffers; perf traces show allocator hotspots.

Proposed Solution

Introduce a freelist of fixed-size slabs: keep per-thread lock-free caches (e.g., a small_vector of char*) with a global fallback; skip the std::sort when bufvec.size() < 4; optionally use std::pmr::monotonic_buffer_resource for bursts. This reduces malloc calls dramatically and lowers contention.
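
A rough sketch of the fixed-size freelist with a per-thread cache in front of a mutex-protected shared list; the block size, cache depth, and the process-wide thread_local are illustrative (a real version would tie the cache to the pool instance, drain it on shutdown, and cap total memory at max_queued_disk_bytes):

#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Illustrative fixed-size pool: a small per-thread cache absorbs most
// allocate/free pairs; the mutex-protected shared freelist is the fallback.
// Buffers are recycled rather than returned to the OS in this sketch.
struct buffer_pool
{
    static constexpr std::size_t block_size = 16 * 1024;
    static constexpr std::size_t thread_cache_cap = 32;

    char* allocate()
    {
        auto& cache = tls_cache();
        if (!cache.empty())
        {
            char* b = cache.back();
            cache.pop_back();
            return b;                       // no lock, no malloc
        }
        {
            std::lock_guard<std::mutex> l(m_mutex);
            if (!m_free.empty())
            {
                char* b = m_free.back();
                m_free.pop_back();
                return b;                   // one lock, still no malloc
            }
        }
        return static_cast<char*>(std::malloc(block_size));
    }

    void free(char* b)
    {
        auto& cache = tls_cache();
        if (cache.size() < thread_cache_cap)
        {
            cache.push_back(b);             // common case: stays thread-local
            return;
        }
        std::lock_guard<std::mutex> l(m_mutex);
        m_free.push_back(b);
    }

private:
    static std::vector<char*>& tls_cache()
    {
        thread_local std::vector<char*> cache;
        return cache;
    }

    std::mutex m_mutex;
    std::vector<char*> m_free;
};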

Risks/Tradeoffs: Must cap pool to max_queued_disk_bytes; watch ASAN/TSAN for new code.

Implementation Notes

Touch points: disk_buffer_pool::allocate_buffer/impl/free_buffer (src/disk_buffer_pool.cpp:109-236), class members in include/libtorrent/aux_/disk_buffer_pool.hpp. Add per-thread caches keyed by std::thread::id or TLS.

API impact: none
Compatibility: Works cross-platform.

Bench/Test Plan

Microbench (Google Benchmark):

Inputs: varying allocation bursts (1, 32, 256 buffers)
Metric(s): ns/op alloc/free, RSS
Harness: google-benchmark loop calling allocate/free.

#include <benchmark/benchmark.h>
// #include <libtorrent/aux_/disk_buffer_pool.hpp>

static void BM_AllocFree(benchmark::State& state)
{
    libtorrent::aux::disk_buffer_pool pool(...);
    for (auto _ : state)
    {
        char* buf = pool.allocate_buffer("bench");
        pool.free_buffer(buf);
    }
}
BENCHMARK(BM_AllocFree);
BENCHMARK_MAIN();

Macro: run example client with many peers and measure allocator profile (heaptrack).

Success criteria: ≥40% drop in alloc/free calls, reduced mutex contention.

Telemetry & Repro

Add TORRENT_PERF_TRACE counter for alloc/free latency.

CLI:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DTORRENT_PERF_TRACE=ON
./build/examples/client_test --num-peers 200

Estimated Effort

M (2 days)

References

src/disk_buffer_pool.cpp (lines 109-236)
include/libtorrent/aux_/disk_buffer_pool.hpp

[Title]: Make piece_picker ignored-piece tracking O(1)
Component(s): src/piece_picker.cpp
Severity: high
Category: Data-Structures
Labels: perf, data-structures, cross

Problem

pick_pieces() builds std::vector<piece_index_t> ignored_pieces (line 2027) and repeatedly calls contains(ignored_pieces, piece) (lines 2275-2294). For large torrents, this degenerates to O(n²) and spikes CPU whenever peers have many suggested pieces.

Evidence/Heuristics: linear std::find per loop iteration, counters::piece_picker_rand_start_loops increments heavily.

Proposed Solution

Replace the vector search with a bitset/typed_bitfield indexed by piece_index_t, or reuse an aux::vector sized once per call (kept as member scratch). Optionally use a small_flat_set for small counts and fall back to the bitset beyond a threshold; clear cheaply via a mark id.
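
A minimal sketch of the mark-id approach: a reusable per-piece array plus a generation counter, so membership checks are O(1) and the array never needs re-zeroing between calls. The names are placeholders, not piece_picker's actual members:

#include <cstdint>
#include <vector>

// Illustrative O(1) ignored-piece tracking: a per-piece generation stamp
// ("mark id") replaces the linear std::find in contains().
struct ignored_set
{
    void resize(int num_pieces) { m_mark.assign(std::size_t(num_pieces), 0); }

    // call once at the top of each pick_pieces() pass; this invalidates all
    // previous marks without touching memory (also required before first use)
    void new_pass() { ++m_generation; }

    void insert(int piece) { m_mark[std::size_t(piece)] = m_generation; }

    bool contains(int piece) const
    { return m_mark[std::size_t(piece)] == m_generation; }

private:
    std::vector<std::uint32_t> m_mark;
    std::uint32_t m_generation = 0;
};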

Risks/Tradeoffs: Need to zero bitfield each call; consider stack memory reuse to avoid allocations.

Implementation Notes

Touch points: pick_pieces (src/piece_picker.cpp:2024-2353), contains helper (line 70). Introduce scoped bitfield object (maybe using TORRENT_ALLOCA for bool array) or maintain member scratch_bitmask sized to num_pieces.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench (Google Benchmark):

Inputs: piece counts 4k-64k, random availability.
Metric(s): ns/op per pick, p95 latency.

#include <benchmark/benchmark.h>
// #include <libtorrent/piece_picker.hpp>

static void BM_PiecePickerSelect(benchmark::State& state)
{
    for (auto _ : state)
    {
        // picker.pick_pieces(...);
        benchmark::DoNotOptimize(state.iterations());
    }
    state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * 0);
}
BENCHMARK(BM_PiecePickerSelect)->Range(1<<12, 1<<16);
BENCHMARK_MAIN();

Macro: run simulation/ piece_picker tests with many peers.

Success criteria: ≥30% reduction in picker CPU sampled via counters::piece_picker_rand_start_loops.

Telemetry & Repro

Add TORRENT_PERF_TRACE around loops to capture iterations.

CLI:

ctest --test-dir build -R piece_picker -j
./build/examples/client_test --simulate --peers 500

Estimated Effort

M (1-2 days)

References

src/piece_picker.cpp (lines 2024-2353, contains helper lines 70-75)

[Title]: Reuse backup block buffers in piece_picker to cut heap churn
Component(s): src/piece_picker.cpp
Severity: medium
Category: Allocation
Labels: perf, allocation, cross

Problem

pick_pieces() allocates fresh std::vector<piece_block> backup_blocks/backup_blocks2 on each call (lines 2025-2034), often growing to dozens or hundreds of elements before being discarded. Since this runs on every peer scheduling tick, it causes heap churn and cache misses.

Evidence/Heuristics: profiler shows allocation spikes in pick_pieces for large swarms.

Proposed Solution

Use a stack-based aux::small_vector<piece_block, N> for the common case and reuse member scratch buffers (e.g., a thread-local static_vector) with reserve(num_blocks). Alternatively, have the caller pass pre-sized spans so nothing is reallocated per call.
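
A small sketch of the member-scratch variant, using a hypothetical piece_block stand-in; clear() keeps the capacity, so steady-state pick_pieces() calls stop allocating:

#include <cstddef>
#include <vector>

// Hypothetical stand-in for libtorrent's piece_block
struct piece_block { int piece_index; int block_index; };

// Keeping the backup vectors as long-lived members means clear() retains
// their capacity across calls.
struct picker_scratch
{
    std::vector<piece_block> backup_blocks;
    std::vector<piece_block> backup_blocks2;

    void reset(std::size_t expected_blocks)
    {
        backup_blocks.clear();       // keeps capacity
        backup_blocks2.clear();
        if (backup_blocks.capacity() < expected_blocks)
            backup_blocks.reserve(expected_blocks);
    }
};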

Risks/Tradeoffs: Must ensure thread-safety; avoid static storage if piece_picker used across threads (but single-threaded currently).

Implementation Notes

Touch points: pick_pieces (src/piece_picker.cpp:2025-2353). Introduce helper scratch struct stored in piece_picker (e.g., mutable aux::vector<piece_block> m_backup1, m_backup2) cleared per call.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench: reuse BM_PiecePickerSelect (above) and track allocations via benchmark counters (alloc/op). Success criteria: ≥25% fewer allocations.

Telemetry & Repro

Add TORRENT_PERF_TRACE for backup vector size/reserve.

CLI same as above.

Estimated Effort

S (≤1 day)

References

src/piece_picker.cpp (lines 2025-2353)

[Title]: Replace piece_picker TORRENT_ALLOCA partial arrays with reusable vectors
Component(s): src/piece_picker.cpp
Severity: medium
Category: CPU
Labels: perf, cross

Problem

pick_pieces() uses TORRENT_ALLOCA to copy pointers to downloading_piece entries (line 2043). With thousands of partial pieces this grabs tens of KB of stack per call, risking stack overflow and hurting stack/cache locality.

Evidence/Heuristics: large swarms create hundreds of partials; alloca per call flagged by sanitizers.

Proposed Solution

Swap TORRENT_ALLOCA for a reusable std::vector<downloading_piece const*> (member scratch) with reserve() sized to m_downloads[piece_downloading].size(); optionally use small_vector with reasonable inline capacity.

Risks/Tradeoffs: Manage dynamic allocation carefully to avoid regressions.

Implementation Notes

Touch points: pick_pieces (src/piece_picker.cpp:2036-2055). Add mutable scratch vector in class; clear without dealloc each call.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Reuse BM_PiecePickerSelect; measure stack usage via perf or sanitized run. Success criteria: eliminate large stack allocations.

Telemetry & Repro

Add TORRENT_PERF_TRACE counter for partial count.

CLI as previous.

Estimated Effort

S (≤1 day)

References

src/piece_picker.cpp (lines 2036-2055)

[Title]: Back peer_connection request queue with deque/small_queue
Component(s): src/peer_connection.cpp, include/libtorrent/peer_connection.hpp
Severity: high
Category: CPU
Labels: perf, networking, cross

Problem

m_request_queue is a std::vector<pending_block>; send_block_requests_impl() repeatedly erases begin() (line 4065) and inserts at begin() elsewhere, triggering O(n) memmoves per request and degrading throughput once the queue holds hundreds of entries.

Evidence/Heuristics: line 4059 loop, repeated erase/insert; perf points to memmove in std::vector::erase.

Proposed Solution

Switch m_request_queue to a deque-like container: boost::circular_buffer, aux::small_queue, or std::deque. These provide fast pop_front/push_front with minimal copying. Make sure code that holds iterators into the queue is updated.
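
An illustrative sketch of the std::deque variant with a hypothetical pending_block stand-in; the real change would live in peer_connection's member declaration and its call sites:

#include <deque>

// Hypothetical stand-in for libtorrent's pending_block
struct pending_block { int piece_index; int block_index; };

// With std::deque both ends are O(1), so taking the next request no longer
// shifts every remaining element the way std::vector::erase(begin()) does.
struct request_queue
{
    void push_back(pending_block b)  { m_q.push_back(b); }
    void push_front(pending_block b) { m_q.push_front(b); } // re-queue a request

    bool pop_front(pending_block& out)
    {
        if (m_q.empty()) return false;
        out = m_q.front();
        m_q.pop_front();             // O(1), no memmove
        return true;
    }

    bool empty() const { return m_q.empty(); }

private:
    std::deque<pending_block> m_q;
};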

Risks/Tradeoffs: Need to adjust invariants and pointer assumptions; check places using iterators.

Implementation Notes

Touch points: member declaration (include/libtorrent/peer_connection.hpp around line 851), send_block_requests_impl (src/peer_connection.cpp:4034-4120), clear_request_queue, etc. Replace .erase(begin()) and .insert(begin()) usage accordingly.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench: run google-benchmark pushing/popping typical queue depths. Macro: simulation to measure CPU. Success: reduce CPU in memmove, lower counters::num_peers_down_requests time.

Telemetry & Repro

Add TORRENT_PERF_TRACE around request loop (counter for queue operations).

CLI:

./build/examples/client_test --peers 300 --request-queue 512

Estimated Effort

M (1-2 days)

References

src/peer_connection.cpp (lines 4034-4120, 4059-4065)
include/libtorrent/peer_connection.hpp (line ~851)

[Title]: Remove socket.available() ioctl from peer_connection::on_receive_data
Component(s): src/peer_connection.cpp
Severity: high
Category: Syscall
Labels: perf, syscall, cross

Problem

When grow_buffer is true, on_receive_data() calls m_socket.available(ec) (line 6171), issuing a FIONREAD ioctl on every read event. On busy connections this extra syscall per read stalls the reactor.

Evidence/Heuristics: line 6171, perf traces showing ioctl(FIONREAD) spikes, high CPU in kernel.

Proposed Solution

Stop polling FIONREAD; instead, double the receive buffer when bytes_transferred == max_receive, using a heuristic such as min(capacity*2, limit). On the socket-readable event, drain via read_some until EWOULDBLOCK, relying on non-blocking semantics.
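
A sketch of the growth heuristic, written as a free function with illustrative parameter names; in libtorrent the cap would come from settings_pack::max_peer_recv_buffer_size:

#include <algorithm>
#include <cstddef>

// Grow the receive buffer only when the last read filled it completely,
// instead of asking the kernel (FIONREAD) how much is pending.
std::size_t next_receive_buffer_size(std::size_t current_size
    , std::size_t bytes_transferred, std::size_t max_size)
{
    // the read did not fill the buffer: the kernel had nothing more queued,
    // so keep the current size
    if (bytes_transferred < current_size) return current_size;

    // the buffer was filled to the brim: more data is likely pending, so
    // double it, bounded by the configured limit
    return std::min(current_size * 2, max_size);
}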

Risks/Tradeoffs: Need to ensure we don't starve other sockets; carefully bound buffer growth using settings_pack::max_peer_recv_buffer_size.

Implementation Notes

Touch points: peer_connection::on_receive_data (src/peer_connection.cpp:6130-6204). Remove available() logic, track last read size to decide growth.

API impact: none
Compatibility: cross-platform (Linux/macOS/Windows all see reduced ioctl load).

Bench/Test Plan

Microbench: measure read handler latency under synthetic load (e.g., loopback traffic). Macro: run client with many peers, monitor syscalls via strace; expect fewer ioctl/s.

Telemetry & Repro

Add TORRENT_PERF_TRACE counter for buffer expansion.

CLI:

./build/examples/client_test --peers 200 --download-rate 0

Estimated Effort

S (≤1 day)

References

src/peer_connection.cpp (lines 6130-6204)

[Title]: Make uTP resend queue operations amortized O(1)
Component(s): src/utp_stream.cpp, include/libtorrent/aux_/utp_stream.hpp
Severity: medium
Category: uTP
Labels: perf, cross

Problem

m_needs_resend is std::vector<packet*>; resend loop uses front()/erase(begin()) and std::find (lines 1492-1511, 2063-2071, 2197-2198). Under heavy loss cwnd can be large, making find/erase O(n) per ack/resend.

Evidence/Heuristics: repeated std::find in ack path, profiling shows linear scans when cwnd ~64.

Proposed Solution

Use a queue-friendly structure: e.g., a std::deque with a head index, or an intrusive linked list via packet::next in packet_buffer. Track membership with a bool flag instead of std::find, and remove entries by clearing the flag (or using a stable queue) rather than erasing from the middle.
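
An illustrative sketch of the flag-plus-queue idea with a trimmed-down packet stand-in; it assumes a packet outlives any queue entry that still points at it (e.g., the queue is drained or cleared before packets are freed on socket teardown):

#include <deque>

// Trimmed-down stand-in for the uTP packet struct
struct packet { bool need_resend = false; };

// A deque plus a per-packet membership flag gives O(1) push/pop and O(1)
// duplicate/cancel checks, replacing std::find and erase(begin()).
struct resend_queue
{
    void push(packet* p)
    {
        if (p->need_resend) return;   // already queued, no std::find needed
        p->need_resend = true;
        m_queue.push_back(p);
    }

    // an ack simply clears the flag; the stale queue entry is skipped lazily
    void cancel(packet* p) { p->need_resend = false; }

    packet* pop()
    {
        while (!m_queue.empty())
        {
            packet* p = m_queue.front();
            m_queue.pop_front();          // O(1)
            if (!p->need_resend) continue; // cancelled by an ack, skip
            p->need_resend = false;
            return p;
        }
        return nullptr;
    }

private:
    std::deque<packet*> m_queue;
};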

Risks/Tradeoffs: Need to ensure no double-insert; watch lifetime of packets.

Implementation Notes

Touch points: utp_socket_impl members (include/libtorrent/aux_/utp_stream.hpp line 1064), send_block_requests loop (src/utp_stream.cpp:1492-1511), resend_packet (lines 2053-2071), ack_packet (lines 2195-2198). Introduce ring buffer or simple queue class with pop_front O(1) and membership flag.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench: simulate a uTP socket with cwnd ~128 packets and measure resend-loop time. Macro: run the uTP simulation tests. Success: reduce CPU/time in the resend path by >20%.

Telemetry & Repro

Add TORRENT_PERF_TRACE counters for m_needs_resend length.

CLI:

ctest --test-dir build -R utp -j

Estimated Effort

M (1-2 days)

References

src/utp_stream.cpp (lines 1492-2071, 2195-2198)
include/libtorrent/aux_/utp_stream.hpp (line 1064)

[Title]: Replace per-call std::vector builds in routing_table::add_node
Component(s): src/kademlia/routing_table.cpp
Severity: medium
Category: CPU
Labels: perf, DHT, cross

Problem

add_node() creates aux::array<std::vector<bucket_t::iterator>,128> nodes_storage and per-call std::vector replace_candidates (lines 198-232). For each insert, multiple allocations happen even though bucket size ≤ 128.

Evidence/Heuristics: repeated vector push_back/insert while handling churn; DHT churn heavy, so allocations frequent.

Proposed Solution

Use a stack-based static_vector<bucket_t::iterator, bucket_size_limit> for the per-prefix buckets and a small_vector for replace_candidates. Since bucket_size_limit ≤ 256, the capacity can be reserved statically, eliminating heap allocations.
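
A small sketch of the inline-storage idea using boost::container::small_vector and a stand-in iterator type; the inline capacity of 256 is illustrative:

#include <boost/container/small_vector.hpp>
#include <vector>

// Stand-in for bucket_t::iterator in this sketch
using node_iter = std::vector<int>::iterator;

// small_vector keeps up to N elements in inline storage and only spills to
// the heap beyond that, so typical bucket sizes (<= 128) never allocate.
using candidate_list = boost::container::small_vector<node_iter, 256>;

candidate_list collect_replace_candidates(std::vector<int>& bucket)
{
    candidate_list out;
    for (auto it = bucket.begin(); it != bucket.end(); ++it)
        out.push_back(it);           // no heap allocation for <= 256 entries
    return out;
}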

Risks/Tradeoffs: static_vector has a compile-time capacity, so if the bucket size limit can exceed it, fall back to boost::container::small_vector (inline storage with heap spill-over). Watch for bucket sizes above the default.

Implementation Notes

Touch points: routing_table::add_node (src/kademlia/routing_table.cpp:198-234). Replace std::vector with aux::small_vector (maybe new helper). Provide fallback to heap if bucket_size_limit > inline capacity.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench: run a DHT routing simulation with many adds/removals and measure allocations. Macro: run simulation/test_dht. Success: near-zero allocations per add_node.

Telemetry & Repro

Add TORRENT_PERF_TRACE counter for add_node allocation.

CLI:

ctest --test-dir build -R dht -j

Estimated Effort

S (≤1 day)

References

src/kademlia/routing_table.cpp (lines 198-234)

[Title]: Use flat_map-like storage for DHT data tables to improve locality
Component(s): src/kademlia/dht_storage.cpp
Severity: medium
Category: Data-Structures
Labels: perf, DHT, cross

Problem

dht_default_storage keeps std::map<node_id,...> for torrents/items (lines 553-556), yielding poor cache locality and lots of allocations. Large tables (thousands) pay RB-tree overhead for lookups and iteration.

Evidence/Heuristics: std::map incurs per-node allocations and pointer chasing; DHT item churn makes these operations frequent.

Proposed Solution

Switch to vector-backed associative container (e.g., boost::container::flat_map or aux::sorted_vector_map) with reserve to reduce allocations and improve iteration.
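
An illustrative sketch with stand-in node_id and entry types showing why flat_map helps: sorted, contiguous storage makes lookups and iteration cache-friendly, and reserve() removes incremental reallocation (see the complexity table below):

#include <boost/container/flat_map.hpp>
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-ins for this sketch: a 160-bit id with lexicographic ordering and a
// trimmed-down torrent entry
using node_id = std::array<std::uint8_t, 20>;
struct torrent_entry { int num_peers = 0; };

using torrent_map = boost::container::flat_map<node_id, torrent_entry>;

// flat_map keeps sorted key/value pairs in one contiguous buffer: find() is
// a binary search over contiguous memory and iteration is a linear scan.
int total_peers(torrent_map const& torrents)
{
    int total = 0;
    for (auto const& e : torrents) total += e.second.num_peers;
    return total;
}

// reserving up-front (e.g. to the configured item limit) avoids repeated
// reallocation as entries are inserted
torrent_map make_table(std::size_t expected_items)
{
    torrent_map m;
    m.reserve(expected_items);
    return m;
}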

Complexity Table:

Before:

Operation Complexity Notes
insert O(log n) frequent heap alloc
find O(log n) pointer chasing
iteration O(n) cache misses

After (flat_map):

Operation Complexity Notes
insert O(log n)+shift contiguous, fewer allocs
find O(log n) better locality
iteration O(n) cache-friendly

Risks/Tradeoffs: inserts pay an element shift on top of the O(log n) search, so heavy insert churn costs somewhat more than with std::map; sorted order must be maintained.

Implementation Notes

Touch points: member declarations (src/kademlia/dht_storage.cpp:553-556), all usage sites (insert/find). Provide helper wrapper to reduce changes. Consider reserving to settings_pack::dht_item_limit.

API impact: none
Compatibility: cross-platform.

Bench/Test Plan

Microbench: simulate storing up to dht_max_torrents/dht_max_dht_items, measure operations per second. Macro: run dht_storage tests with high load. Success: reduce allocations and CPU by ≥20%.

Telemetry & Repro

Add TORRENT_PERF_TRACE counters for map size.

CLI:

ctest --test-dir build -R dht_storage -j

Estimated Effort

M (1-2 days)

References

src/kademlia/dht_storage.cpp (lines 553-585)

[Title]: Batch reads in posix_disk_io::async_hash to reduce per-block syscalls
Component(s): src/posix_disk_io.cpp
Severity: medium
Category: I/O
Labels: perf, I/O, linux|macos

Problem

async_hash() reads each block via st->read inside a loop (lines 187-204). Each iteration opens/seeks/reads 16 KiB, so a large piece (~4 MiB) costs 256 block reads, each with its own syscalls, per hash.

Evidence/Heuristics: same as read() pattern; repeated read loop.

Proposed Solution

Detect sequential hashing: use a single buffer sized to the piece (capped by settings) and preadv() once, or batch the remaining blocks with readv. On platforms without preadv, fall back to lseek/read while reusing the cached file pointer (with the fix above).
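
A minimal sketch of the preadv batching for a run of equally sized blocks; error handling, IOV_MAX chunking, and partfile boundaries are left to the caller:

#include <sys/types.h>
#include <sys/uio.h>   // preadv (POSIX: Linux, macOS 11+)
#include <cstddef>
#include <vector>

// Read a whole run of blocks with one preadv() call instead of one
// seek+read per block.
ssize_t read_blocks(int fd, off_t piece_offset
    , std::vector<char*> const& blocks, std::size_t block_size)
{
    std::vector<iovec> iov(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        iov[i].iov_base = blocks[i];
        iov[i].iov_len = block_size;
    }
    // one syscall covers every block; a short return means EOF or a partfile
    // boundary and has to be handled by the caller
    return preadv(fd, iov.data(), int(iov.size()), piece_offset);
}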

Risks/Tradeoffs: Need to manage memory footprint; use chunked reads (e.g., 128 KiB). Ensure compatibility with partfile.

Implementation Notes

Touch points: posix_disk_io::async_hash (src/posix_disk_io.cpp:158-219), posix_storage::read contributions. Introduce helper read_span(storage, piece, offset, span) to read multiple blocks sequentially.

API impact: none
Compatibility: Linux/macOS (preadv); fallback to a loop with a reused handle elsewhere.

Bench/Test Plan

Microbench: hashing piece sizes 512 KiB - 4 MiB, measure ns/op, syscalls/s.

#include <benchmark/benchmark.h>
// #include <libtorrent/posix_disk_io.hpp>

static void BM_AsyncHash(benchmark::State& state)
{
    for (auto _ : state)
    {
        // call async_hash, wait for completion
        benchmark::DoNotOptimize(state.iterations());
    }
}
BENCHMARK(BM_AsyncHash)->Range(512<<10, 4<<20);
BENCHMARK_MAIN();

Macro: run piece-checking workloads.

Success: ≥50% fewer syscalls per hash, reduced hash time.

Telemetry & Repro

Add TORRENT_PERF_TRACE around hash loop.

CLI:

./build/examples/client_test --check-pieces --hash-threads 4

Estimated Effort

S (≤1 day after file-handle cache change)

References

src/posix_disk_io.cpp (lines 158-219)
