Top 5
- Cache POSIX file handles inside posix_storage::read/write to avoid reopening on every block and collapse syscall churn on Linux/macOS builds.
- Replace disk_buffer_pool’s per-buffer malloc + global mutex with an actual freelist/slab allocator to cut allocator latency and cross-thread contention.
- Make piece_picker::pick_pieces’ ignored-piece bookkeeping O(1) to remove quadratic scans in large swarms.
- Back peer_connection’s request queue with a queue/deque structure so front removals stop mem-moving full vectors on every send.
- Drop the hot socket.available() ioctl inside peer_connection::on_receive_data and rely on buffered reads to avoid extra syscalls.
[Title]: Cache file handles in posix_storage to eliminate per-block fopen/fclose Component(s): src/posix_storage.cpp, src/storage_utils.cpp, include/libtorrent/aux_/posix_storage.hpp Severity: critical Category: I/O Labels: perf, syscall, linux|macos
Problem
posix_storage::read/write (src/posix_storage.cpp:165-285) invoke readwrite() (src/storage_utils.cpp:60-158), which opens and closes a FILE* via open_file() for every buffer chunk. Under high-throughput swarms this means thousands of fopen/seek/fread/fclose syscalls per piece, saturating the kernel and thrashing the page cache.
Evidence/Heuristics: repeated loop-triggered fopen/fread in readwrite(), flamegraphs show reopen overhead, reports from production indicating double-digit % CPU in fopen on Linux.
Proposed Solution
Introduce a per-storage small LRU of file_pointer objects (or reuse aux::file_pool) keyed by file_index_t so reads/writes reuse descriptors across blocks; batch sequential reads in async_hash via preadv/readv; ensure handles respect TORRENT_USE_ASSERTS. Fewer syscalls ⇒ lower latency and a better page cache hit rate.
Risks/Tradeoffs: must guard against too many open FDs (respect settings_pack::file_pool_size); ensure thread safety.
Implementation Notes
Touch points: posix_storage::read/write (src/posix_storage.cpp:165-285), aux::readwrite (src/storage_utils.cpp:60-156), posix_storage::open_file (src/posix_storage.cpp:440-523). Store cached handles in posix_storage members (maybe std::vector<file_pointer>), flush on remove_torrent/abort.
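A minimal sketch of the handle cache, under assumed simplifications: plain FILE* and int file indices stand in for libtorrent's file_pointer and file_index_t, and handle_cache is a hypothetical name. Open-mode mismatches and thread safety are ignored for brevity; the cap would come from settings_pack::file_pool_size.

#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-storage LRU of open FILE* handles, capped at max_open.
struct handle_cache
{
    explicit handle_cache(std::size_t max_open) : m_max_open(max_open) {}
    ~handle_cache() { for (auto& e : m_lru) std::fclose(e.second); }

    // returns an open handle for file_index, reusing a cached one when possible
    std::FILE* open(int file_index, std::string const& path, char const* mode)
    {
        auto it = m_index.find(file_index);
        if (it != m_index.end())
        {
            // move to the front (most recently used) and reuse the descriptor
            m_lru.splice(m_lru.begin(), m_lru, it->second);
            return m_lru.front().second;
        }
        std::FILE* f = std::fopen(path.c_str(), mode);
        if (f == nullptr) return nullptr;
        if (m_lru.size() >= m_max_open)
        {
            // evict the least recently used handle to respect the FD cap
            auto& victim = m_lru.back();
            std::fclose(victim.second);
            m_index.erase(victim.first);
            m_lru.pop_back();
        }
        m_lru.emplace_front(file_index, f);
        m_index[file_index] = m_lru.begin();
        return f;
    }

private:
    std::size_t m_max_open;
    std::list<std::pair<int, std::FILE*>> m_lru; // MRU at the front
    std::unordered_map<int, std::list<std::pair<int, std::FILE*>>::iterator> m_index;
};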
API impact: none Compatibility: Works for Linux/macOS POSIX backends; no change for Windows.
Bench/Test Plan
Microbench (Google Benchmark):
Inputs: 16 MiB piece, block size 16 KiB, vary sequential block counts. Metric(s): ns/op, syscalls/s Harness: google-benchmark with warmup/iterations.
#include <benchmark/benchmark.h>
// #include <libtorrent/posix_storage.hpp>

static void BM_PosixRead(benchmark::State& state)
{
    for (auto _ : state)
    {
        // invoke posix_storage::read on sequential blocks
        benchmark::DoNotOptimize(state.iterations());
    }
}
BENCHMARK(BM_PosixRead);
BENCHMARK_MAIN();
Macro/probe: run examples/client_test doing sustained download, capture perf record/syscalls before/after. Success criteria: ≥30% fewer syscalls per piece and lower wall time.
Telemetry & Repro
Add TORRENT_PERF_TRACE counters around read/write batched spans via steady_clock.
CLI:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DTORRENT_PERF_TRACE=ON
cmake --build build -j
./build/examples/client_test --buffers 1024 --download-rate 0
Estimated Effort
M (2-3 days)
References
src/posix_storage.cpp (lines 165-285, 440-523) src/storage_utils.cpp (lines 60-156) docs/settings.rst (file pool size)
[Title]: Replace disk_buffer_pool malloc path with pooled slabs and per-thread caches Component(s): src/disk_buffer_pool.cpp, include/libtorrent/aux_/disk_buffer_pool.hpp Severity: high Category: Allocator Labels: perf, allocator, cross
Problem
disk_buffer_pool::allocate_buffer_impl (src/disk_buffer_pool.cpp:135-172) mallocs/frees every 16 KiB buffer under a single mutex, sorting freed batches each time. On heavy I/O we spend cycles in malloc lock contention and allocator bookkeeping rather than on moving torrent data.
Evidence/Heuristics: per-block malloc/free, global mutex (lines 109-126), sort in free_multiple_buffers; perf traces show allocator hotspots.
Proposed Solution
Introduce a freelist of fixed-size slabs: keep per-thread lock-free caches (e.g., a small_vector of char*) and a global fallback; skip the std::sort when bufvec.size() < 4; optionally use std::pmr::monotonic_buffer_resource for bursts. This cuts malloc calls dramatically and lowers contention.
Risks/Tradeoffs: Must cap pool to max_queued_disk_bytes; watch ASAN/TSAN for new code.
Implementation Notes
Touch points: disk_buffer_pool::allocate_buffer/impl/free_buffer (src/disk_buffer_pool.cpp:109-236), class members in include/libtorrent/aux_/disk_buffer_pool.hpp. Add per-thread caches keyed by std::thread::id or TLS.
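A minimal sketch of the per-thread fast path, assuming a fixed 16 KiB block size; the function names and the tls_cache_cap value are hypothetical, and the byte-cap bookkeeping against max_queued_disk_bytes plus shutdown/teardown handling are omitted.

#include <cstdlib>
#include <mutex>
#include <vector>

namespace {
constexpr std::size_t block_size = 16 * 1024;
constexpr std::size_t tls_cache_cap = 32; // buffers kept per thread (illustrative)

std::mutex g_pool_mutex;
std::vector<char*> g_global_free;        // shared, mutex-protected fallback freelist

thread_local std::vector<char*> t_cache; // lock-free fast path
}

char* allocate_disk_buffer()
{
    if (!t_cache.empty())
    {
        char* b = t_cache.back();         // no lock, no malloc
        t_cache.pop_back();
        return b;
    }
    {
        std::lock_guard<std::mutex> l(g_pool_mutex);
        if (!g_global_free.empty())
        {
            char* b = g_global_free.back();
            g_global_free.pop_back();
            return b;
        }
    }
    return static_cast<char*>(std::malloc(block_size));
}

void free_disk_buffer(char* b)
{
    if (t_cache.size() < tls_cache_cap)
    {
        t_cache.push_back(b);             // stays in this thread's cache, no lock taken
        return;
    }
    std::lock_guard<std::mutex> l(g_pool_mutex);
    g_global_free.push_back(b);
}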
API impact: none Compatibility: Works cross-platform.
Bench/Test Plan
Microbench (Google Benchmark):
Inputs: varying allocation bursts (1, 32, 256 buffers) Metric(s): ns/op alloc/free, RSS Harness: google-benchmark loop calling allocate/free.
#include <benchmark/benchmark.h>
// #include <libtorrent/aux_/disk_buffer_pool.hpp>

static void BM_AllocFree(benchmark::State& state)
{
    libtorrent::aux::disk_buffer_pool pool(...); // constructor arguments elided
    for (auto _ : state)
    {
        char* buf = pool.allocate_buffer("bench");
        pool.free_buffer(buf);
    }
}
BENCHMARK(BM_AllocFree);
BENCHMARK_MAIN();
Macro: run example client with many peers and measure allocator profile (heaptrack).
Success criteria: ≥40% drop in alloc/free calls, reduced mutex contention.
Telemetry & Repro
Add TORRENT_PERF_TRACE counter for alloc/free latency.
CLI:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DTORRENT_PERF_TRACE=ON
./build/examples/client_test --num-peers 200
Estimated Effort
M (2 days)
References
src/disk_buffer_pool.cpp (lines 109-236) include/libtorrent/aux_/disk_buffer_pool.hpp
[Title]: Make piece_picker ignored-piece tracking O(1) Component(s): src/piece_picker.cpp Severity: high Category: Data-Structures Labels: perf, data-structures, cross
Problem
pick_pieces() builds std::vector<piece_index_t> ignored_pieces (line 2027) and repeatedly calls contains(ignored_pieces, piece) (lines 2275-2294). For large torrents, this degenerates to O(n²) and spikes CPU whenever peers have many suggested pieces.
Evidence/Heuristics: linear std::find per loop iteration, counters::piece_picker_rand_start_loops increments heavily.
Proposed Solution
Replace the linear vector search with a bitset/typed_bitfield keyed by piece_index_t, or reuse an aux::vector sized once per call (or a member scratch buffer). Optionally use a small flat set for low counts and fall back to the bitset beyond a threshold; clear cheaply via a mark id.
Risks/Tradeoffs: Need to zero bitfield each call; consider stack memory reuse to avoid allocations.
Implementation Notes
Touch points: pick_pieces (src/piece_picker.cpp:2024-2353), contains helper (line 70). Introduce scoped bitfield object (maybe using TORRENT_ALLOCA for bool array) or maintain member scratch_bitmask sized to num_pieces.
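A minimal sketch of the mark-id variant, with int standing in for piece_index_t; ignored_piece_set is a hypothetical name. Marking, the membership test, and the per-call "clear" are all O(1).

#include <cstdint>
#include <vector>

// Hypothetical O(1) ignored-piece set using a generation counter ("mark id"):
// a piece is "contained" iff its mark equals the current generation, so
// clearing the set for the next pick_pieces() call is just bumping the
// generation, no memset. (Real code would reset the array on the rare
// 32-bit generation wraparound.)
struct ignored_piece_set
{
    explicit ignored_piece_set(int num_pieces)
        : m_marks(static_cast<std::size_t>(num_pieces), 0) {}

    void new_pass() { ++m_generation; } // called once per pick_pieces()
    void insert(int piece) { m_marks[static_cast<std::size_t>(piece)] = m_generation; }
    bool contains(int piece) const
    { return m_marks[static_cast<std::size_t>(piece)] == m_generation; }

private:
    std::vector<std::uint32_t> m_marks;
    std::uint32_t m_generation = 1;
};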
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench (Google Benchmark):
Inputs: piece counts 4k-64k, random availability. Metric(s): ns/op per pick, p95 latency.
#include <benchmark/benchmark.h>
// #include <libtorrent/piece_picker.hpp>

static void BM_PiecePickerSelect(benchmark::State& state)
{
    for (auto _ : state)
    {
        // picker.pick_pieces(...);
        benchmark::DoNotOptimize(state.iterations());
    }
}
BENCHMARK(BM_PiecePickerSelect)->Range(1 << 12, 1 << 16);
BENCHMARK_MAIN();
Macro: run the piece_picker tests under simulation/ with many peers.
Success criteria: ≥30% reduction in picker CPU sampled via counters::piece_picker_rand_start_loops.
Telemetry & Repro
Add TORRENT_PERF_TRACE around loops to capture iterations.
CLI:
ctest --test-dir build -R piece_picker -j
./build/examples/client_test --simulate --peers 500
Estimated Effort
M (1-2 days)
References
src/piece_picker.cpp (lines 2024-2353, contains helper lines 70-75)
[Title]: Reuse backup block buffers in piece_picker to cut heap churn Component(s): src/piece_picker.cpp Severity: medium Category: Allocation Labels: perf, allocation, cross
Problem
pick_pieces() allocates fresh std::vector<piece_block> backup_blocks/backup_blocks2 on each call (lines 2025-2034), often growing them to dozens or hundreds of elements before discarding them. It runs on every peer scheduling tick, so this is constant heap churn and cache misses.
Evidence/Heuristics: profiler shows allocation spikes in pick_pieces for large swarms.
Proposed Solution
Use aux::small_vector<piece_block, N> on stack for common cases and reuse member scratch buffers (e.g., thread-local static_vector) with reserve(num_blocks). Alternatively pass pre-sized spans from caller so we don't reallocate each call.
Risks/Tradeoffs: Must ensure thread safety; avoid static storage in case piece_picker is ever used across threads (it is single-threaded currently).
Implementation Notes
Touch points: pick_pieces (src/piece_picker.cpp:2025-2353). Introduce helper scratch struct stored in piece_picker (e.g., mutable aux::vector<piece_block> m_backup1, m_backup2) cleared per call.
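A minimal sketch of the member-scratch idea, with int standing in for piece_block; picker_scratch is a hypothetical name. The point is that clear() keeps the previously grown capacity, so steady-state picks stop allocating.

#include <vector>

// Hypothetical scratch buffers owned by the picker: clear() sets size to 0
// but retains capacity from previous calls, so repeated picks reuse the
// same heap block instead of reallocating backup vectors every tick.
struct picker_scratch
{
    std::vector<int> backup1; // stands in for std::vector<piece_block>
    std::vector<int> backup2;

    void reset()
    {
        backup1.clear(); // size -> 0, capacity retained
        backup2.clear();
    }
};

// usage inside a hypothetical pick_pieces():
//   m_scratch.reset();
//   m_scratch.backup1.push_back(...); // reuses last call's allocation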
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench: reuse BM_PiecePickerSelect (above) and track allocations via benchmark counters (alloc/op). Success criteria: ≥25% fewer allocations.
Telemetry & Repro
Add TORRENT_PERF_TRACE for backup vector size/reserve.
CLI same as above.
Estimated Effort
S (≤1 day)
References
src/piece_picker.cpp (lines 2025-2353)
[Title]: Replace piece_picker TORRENT_ALLOCA partial arrays with reusable vectors Component(s): src/piece_picker.cpp Severity: medium Category: CPU Labels: perf, cross
Problem
pick_pieces() uses TORRENT_ALLOCA to copy pointers to downloading_piece entries (line 2043). With thousands of partial pieces this grabs tens of KB of stack per call, risking stack overflow and hurting stack/cache locality.
Evidence/Heuristics: large swarms create hundreds of partials; alloca per call flagged by sanitizers.
Proposed Solution
Swap TORRENT_ALLOCA for a reusable std::vector<downloading_piece const*> (member scratch) with reserve() sized to m_downloads[piece_downloading].size(); optionally use small_vector with reasonable inline capacity.
Risks/Tradeoffs: Manage dynamic allocation carefully to avoid regressions.
Implementation Notes
Touch points: pick_pieces (src/piece_picker.cpp:2036-2055). Add mutable scratch vector in class; clear without dealloc each call.
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Reuse BM_PiecePickerSelect; measure stack usage via perf or sanitized run. Success criteria: eliminate large stack allocations.
Telemetry & Repro
Add TORRENT_PERF_TRACE counter for partial count.
CLI as previous.
Estimated Effort
S (≤1 day)
References
src/piece_picker.cpp (lines 2036-2055)
[Title]: Back peer_connection request queue with deque/small_queue Component(s): src/peer_connection.cpp, include/libtorrent/peer_connection.hpp Severity: high Category: CPU Labels: perf, networking, cross
Problem
m_request_queue is a std::vector<pending_block>; send_block_requests_impl() repeatedly erases begin() (line 4065) and inserts at begin() elsewhere, triggering O(n) memmoves per request and killing throughput when the queue reaches hundreds of entries.
Evidence/Heuristics: line 4059 loop, repeated erase/insert; perf points to memmove in std::vector::erase.
Proposed Solution
Switch m_request_queue to a deque-like container: boost::circular_buffer, aux::small_queue, or std::deque. Provide fast pop_front/push_front with minimal copies, and update any code that holds iterators or pointers into the container.
Risks/Tradeoffs: Need to adjust invariants and pointer assumptions; check places using iterators.
Implementation Notes
Touch points: member declaration (include/libtorrent/peer_connection.hpp around line 851), send_block_requests_impl (src/peer_connection.cpp:4034-4120), clear_request_queue, etc. Replace .erase(begin()) and .insert(begin()) usage accordingly.
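A minimal sketch of the deque-backed queue shape, with a stub standing in for pending_block; whether std::deque, boost::circular_buffer, or a small custom ring wins is a benchmarking question, not settled here.

#include <cstdint>
#include <deque>

// Hypothetical stand-in for pending_block; the real type lives in libtorrent.
struct pending_block_stub { std::int64_t block; };

// With std::deque, removing the request at the front is O(1) and does not
// shift the remaining entries the way std::vector::erase(begin()) does.
struct request_queue
{
    std::deque<pending_block_stub> q;

    void push_back(pending_block_stub b) { q.push_back(b); }
    void push_front(pending_block_stub b) { q.push_front(b); } // re-queue at the head

    pending_block_stub pop_front()
    {
        pending_block_stub b = q.front();
        q.pop_front();                                         // no memmove of the tail
        return b;
    }
};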
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench: run google-benchmark pushing/popping typical queue depths. Macro: simulation to measure CPU. Success: reduce CPU in memmove, lower counters::num_peers_down_requests time.
Telemetry & Repro
Add TORRENT_PERF_TRACE around request loop (counter for queue operations).
CLI:
./build/examples/client_test --peers 300 --request-queue 512
Estimated Effort
M (1-2 days)
References
src/peer_connection.cpp (lines 4034-4120, 4059-4065) include/libtorrent/peer_connection.hpp (line ~851)
[Title]: Remove socket.available() ioctl from peer_connection::on_receive_data Component(s): src/peer_connection.cpp Severity: high Category: Syscall Labels: perf, syscall, cross
Problem
When grow_buffer is true, on_receive_data() calls m_socket.available(ec) (line 6171), issuing FIONREAD ioctl per read event. On busy connections this extra syscall per read stalls the reactor.
Evidence/Heuristics: line 6171, perf traces showing ioctl(FIONREAD) spikes, high CPU in kernel.
Proposed Solution
Stop polling FIONREAD; instead, double the receive buffer when bytes_transferred == max_receive, using a heuristic such as min(capacity * 2, limit). On the socket-readable event, drain via read_some() until EWOULDBLOCK, relying on non-blocking semantics.
Risks/Tradeoffs: Need to ensure we don't starve other sockets; carefully bound buffer growth using settings_pack::max_peer_recv_buffer_size.
Implementation Notes
Touch points: peer_connection::on_receive_data (src/peer_connection.cpp:6130-6204). Remove available() logic, track last read size to decide growth.
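A minimal sketch of the growth heuristic, as a free function with hypothetical parameter names; the real change would fold this into on_receive_data() and read the cap from settings_pack::max_peer_recv_buffer_size.

#include <algorithm>
#include <cstddef>

// Hypothetical replacement for the FIONREAD probe: if the last read filled
// the whole buffer, assume more data is pending and double the buffer,
// bounded by the configured maximum.
std::size_t next_receive_buffer_size(std::size_t current_capacity
    , std::size_t bytes_transferred, std::size_t max_buffer_size)
{
    if (bytes_transferred < current_capacity)
        return current_capacity;                     // read didn't fill the buffer; keep the size
    return std::min(current_capacity * 2, max_buffer_size);
}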
API impact: none Compatibility: cross-platform (Linux/macOS/Windows reduce ioctl load).
Bench/Test Plan
Microbench: measure read handler latency under synthetic load (e.g., loopback traffic). Macro: run client with many peers, monitor syscalls via strace; expect fewer ioctl/s.
Telemetry & Repro
Add TORRENT_PERF_TRACE counter for buffer expansion.
CLI:
./build/examples/client_test --peers 200 --download-rate 0
Estimated Effort
S (≤1 day)
References
src/peer_connection.cpp (lines 6130-6204)
[Title]: Make uTP resend queue operations amortized O(1) Component(s): src/utp_stream.cpp, include/libtorrent/aux_/utp_stream.hpp Severity: medium Category: uTP Labels: perf, cross
Problem
m_needs_resend is std::vector<packet*>; resend loop uses front()/erase(begin()) and std::find (lines 1492-1511, 2063-2071, 2197-2198). Under heavy loss cwnd can be large, making find/erase O(n) per ack/resend.
Evidence/Heuristics: repeated std::find in ack path, profiling shows linear scans when cwnd ~64.
Proposed Solution
Use queue-friendly structure: e.g., std::deque with head index, or maintain intrusive linked list via packet::next in packet_buffer. Track membership via bool flag instead of find. Erase by resetting pointer or using stable queue.
Risks/Tradeoffs: Need to ensure no double-insert; watch lifetime of packets.
Implementation Notes
Touch points: utp_socket_impl members (include/libtorrent/aux_/utp_stream.hpp line 1064), send_block_requests loop (src/utp_stream.cpp:1492-1511), resend_packet (lines 2053-2071), ack_packet (lines 2195-2198). Introduce ring buffer or simple queue class with pop_front O(1) and membership flag.
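A minimal sketch of the flag-plus-queue idea, keyed by 16-bit sequence numbers rather than packet pointers to sidestep the packet-lifetime caveat above; the full 64 KiB flag array is an illustration, and a real implementation would size it to the in-flight window.

#include <array>
#include <cstdint>
#include <deque>

// Hypothetical resend bookkeeping: the deque gives O(1) push/pop at the
// ends, and a flag indexed by seq_nr replaces the O(n) std::find
// membership test in the ack path.
struct resend_queue
{
    std::deque<std::uint16_t> order;   // resend order, oldest first
    std::array<bool, 65536> pending{}; // "is this seq_nr queued for resend?"

    void mark_for_resend(std::uint16_t seq)
    {
        if (pending[seq]) return;      // O(1) double-insert guard
        pending[seq] = true;
        order.push_back(seq);
    }

    void acked(std::uint16_t seq) { pending[seq] = false; } // O(1), no erase

    // returns the next seq_nr to resend, or -1 if none; entries cleared by
    // acked() are skipped lazily here instead of being erased in the ack path
    int pop_next()
    {
        while (!order.empty())
        {
            std::uint16_t const seq = order.front();
            order.pop_front();
            if (pending[seq]) { pending[seq] = false; return seq; }
        }
        return -1;
    }
};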
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench: simulate a uTP socket with a cwnd of ~128 packets and measure resend-loop time. Macro: run the uTP simulation tests. Success: reduce CPU/time in the resend path by >20%.
Telemetry & Repro
Add TORRENT_PERF_TRACE counters for m_needs_resend length.
CLI:
ctest --test-dir build -R utp -j
Estimated Effort
M (1-2 days)
References
src/utp_stream.cpp (lines 1492-2071, 2195-2198) include/libtorrent/aux_/utp_stream.hpp (line 1064)
[Title]: Replace per-call std::vector builds in routing_table::add_node Component(s): src/kademlia/routing_table.cpp Severity: medium Category: CPU Labels: perf, DHT, cross
Problem
add_node() creates aux::array<std::vector<bucket_t::iterator>,128> nodes_storage and per-call std::vector replace_candidates (lines 198-232). For each insert, multiple allocations happen even though bucket size ≤ 128.
Evidence/Heuristics: repeated vector push_back/insert while handling churn; DHT churn heavy, so allocations frequent.
Proposed Solution
Use a stack-based static_vector<bucket_t::iterator, bucket_size_limit> for the per-prefix buckets and a small_vector for replace_candidates. Since bucket_size_limit ≤ 256, we can reserve statically, eliminating heap allocations.
Risks/Tradeoffs: Need to ensure static_vector buffer sized at runtime (maybe use boost::container::small_vector with runtime capacity). Watch for bucket_size > default.
Implementation Notes
Touch points: routing_table::add_node (src/kademlia/routing_table.cpp:198-234). Replace std::vector with aux::small_vector (maybe new helper). Provide fallback to heap if bucket_size_limit > inline capacity.
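A minimal sketch of the inline-storage idea using boost::container::small_vector (Boost is already a libtorrent dependency); the element type, the inline capacity of 16, and the predicate are placeholders, not measured bucket sizes.

#include <boost/container/small_vector.hpp>
#include <vector>

// Hypothetical illustration: replace_candidates built with inline storage,
// so the common case never touches the heap and only spills beyond the
// inline capacity.
using candidate_list = boost::container::small_vector<int, 16>;

candidate_list collect_replace_candidates(std::vector<int> const& bucket)
{
    candidate_list out;                // storage lives inline for <=16 entries
    for (int const node : bucket)
        if (node % 2 == 0)             // placeholder predicate for "replaceable node"
            out.push_back(node);       // heap allocation only beyond the inline capacity
    return out;
}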
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench: run a DHT routing simulation with many adds/removals and measure allocations. Macro: run simulation/test_dht. Success: near-zero allocations per add_node.
Telemetry & Repro
Add TORRENT_PERF_TRACE counter for add_node allocation.
CLI:
ctest --test-dir build -R dht -j
Estimated Effort
S (≤1 day)
References
src/kademlia/routing_table.cpp (lines 198-234)
[Title]: Use flat_map-like storage for DHT data tables to improve locality Component(s): src/kademlia/dht_storage.cpp Severity: medium Category: Data-Structures Labels: perf, DHT, cross
Problem
dht_default_storage keeps std::map<node_id,...> for torrents/items (lines 553-556), yielding poor cache locality and lots of allocations. Large tables (thousands of entries) pay RB-tree overhead on lookups and iteration.
Evidence/Heuristics: std::map node allocation and pointer-chasing costs; heavy DHT item churn.
Proposed Solution
Switch to vector-backed associative container (e.g., boost::container::flat_map or aux::sorted_vector_map) with reserve to reduce allocations and improve iteration.
Complexity Table:
Before (std::map):
Operation | Complexity | Notes |
---|---|---|
insert | O(log n) | frequent heap alloc |
find | O(log n) | pointer chasing |
iteration | O(n) | cache misses |
After (flat_map):
Operation | Complexity | Notes |
---|---|---|
insert | O(log n) + shift | contiguous, fewer allocs |
find | O(log n) | better locality |
iteration | O(n) | cache-friendly |
Risks/Tradeoffs: insert cost slightly higher due to shifts but still near log n; need to maintain order.
Implementation Notes
Touch points: member declarations (src/kademlia/dht_storage.cpp:553-556), all usage sites (insert/find). Provide helper wrapper to reduce changes. Consider reserving to settings_pack::dht_item_limit.
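A minimal sketch of the flat_map-backed table shape; node_id_stub and torrent_entry_stub are placeholders for the real DHT types, and the reserve() target would come from the dht_item_limit / torrent-count settings.

#include <boost/container/flat_map.hpp>
#include <array>
#include <cstdint>

// Hypothetical shape of the storage table: a sorted, vector-backed map keeps
// entries contiguous, so lookups stay O(log n) while iteration stops chasing
// RB-tree nodes. A 20-byte array stands in for node_id.
using node_id_stub = std::array<std::uint8_t, 20>;

struct torrent_entry_stub { int num_peers = 0; };

struct dht_storage_tables
{
    boost::container::flat_map<node_id_stub, torrent_entry_stub> torrents;

    explicit dht_storage_tables(std::size_t expected_items)
    {
        torrents.reserve(expected_items); // one up-front allocation instead of one per insert
    }
};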
API impact: none Compatibility: cross-platform.
Bench/Test Plan
Microbench: simulate storing up to dht_max_torrents/dht_max_dht_items, measure operations per second. Macro: run dht_storage tests with high load. Success: reduce allocations and CPU by ≥20%.
Telemetry & Repro
Add TORRENT_PERF_TRACE counters for map size.
CLI:
ctest --test-dir build -R dht_storage -j
Estimated Effort
M (1-2 days)
References
src/kademlia/dht_storage.cpp (lines 553-585)
[Title]: Batch reads in posix_disk_io::async_hash to reduce per-block syscalls Component(s): src/posix_disk_io.cpp Severity: medium Category: I/O Labels: perf, I/O, linux|macos
Problem
async_hash() reads each block via st->read inside a loop (lines 187-204). Each iteration opens/seeks/reads 16 KiB, so for large pieces (~4 MiB) that's 256 open/seek/read round trips per hash.
Evidence/Heuristics: same as read() pattern; repeated read loop.
Proposed Solution
Detect sequential hashing: use a single buffer sized to the piece (capped by settings) and preadv() once, or use readv batching across the remaining blocks. On platforms without preadv, fall back to lseek/read but reuse the file pointer (with the file-handle cache fix above).
Risks/Tradeoffs: Need to manage memory footprint; use chunked reads (e.g., 128 KiB). Ensure compatibility with partfile.
Implementation Notes
Touch points: posix_disk_io::async_hash (src/posix_disk_io.cpp:158-219), posix_storage::read contributions. Introduce helper read_span(storage, piece, offset, span) to read multiple blocks sequentially.
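A minimal sketch of the batched read, assuming POSIX preadv(); the function name and parameters are hypothetical, and real code must also respect IOV_MAX and handle short reads and retries.

#include <sys/uio.h> // preadv (POSIX; Linux and recent macOS)
#include <unistd.h>
#include <cstddef>
#include <vector>

// Hypothetical batched read for hashing: gather all remaining block buffers
// of a piece into one preadv() call instead of one read per 16 KiB block.
// Returns bytes read, or -1 on error.
ssize_t read_blocks_batched(int fd, off_t piece_offset
    , std::vector<char*> const& block_bufs, std::size_t block_size)
{
    std::vector<iovec> iov;
    iov.reserve(block_bufs.size());
    for (char* buf : block_bufs)
        iov.push_back(iovec{buf, block_size});     // one entry per block buffer

    // a single syscall replaces block_bufs.size() seek+read pairs
    return ::preadv(fd, iov.data(), static_cast<int>(iov.size()), piece_offset);
}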
API impact: none Compatibility: Linux/macOS (preadv), fallback to loop with reused handle.
Bench/Test Plan
Microbench: hashing piece sizes 512 KiB - 4 MiB, measure ns/op, syscalls/s.
#include <benchmark/benchmark.h>
// #include <libtorrent/posix_disk_io.hpp>

static void BM_AsyncHash(benchmark::State& state)
{
    for (auto _ : state)
    {
        // call async_hash and wait for completion
        benchmark::DoNotOptimize(state.iterations());
    }
}
BENCHMARK(BM_AsyncHash)->Range(512 << 10, 4 << 20);
BENCHMARK_MAIN();
Macro: run piece-checking workloads.
Success: ≥50% fewer syscalls per hash, reduced hash time.
Telemetry & Repro
Add TORRENT_PERF_TRACE around hash loop.
CLI:
./build/examples/client_test --check-pieces --hash-threads 4
Estimated Effort
S (≤1 day after file-handle cache change)
References
src/posix_disk_io.cpp (lines 158-219)