Disclaimer: Don't know much about libarchive... yet!
- When reading a streamed archive using archive_read_open() [1] and archive_read_extract() [2], a callback is called one or more times to read chunks of the archive.
- This creates an issue if (a) your program needs to wait for the next chunk to arrive, and/or (b) you want to process multiple archive streams in the same thread.
- Effectively, archive_read_open() [1] and archive_read_extract() [2] block until the necessary number of archive stream chunks has been read via the callback.
[1] archive_read_open
[2] archive_read_extract
$ git clone https://github.com/libarchive/libarchive.git
$ cd libarchive/
$ /bin/sh build/autogen.sh
$ ./configure
$ make
- Use the source code from this GitHub issue [1] and copy it to example.cpp.
- Note: The name of the output archive is hard-coded as test.tar.gz.
- Try building it:
$ g++ -std=c++11 -O2 -Ilibarchive -L.libs -o example example.cpp -larchive
$ ./example
./example{-r | -w} file[s]
$ ./example -w example.cpp
Compressing -w...
Compressing example.cpp...
$ ls -al test.tar.gz
-rw-rw-r-- 1 simon simon 1406 Oct 16 11:38 test.tar.gz
$ ./example -r example.cpp
Attempting to open example.cpp
calling archive_read_open_memory..
Step 5a: Create a version of the code which attempts to read and extract from two archive streams concurrently:
- Note: The callback business logic is hacked so that the first three read callbacks for an archive only ever return 7 bytes. This helps draw attention to the blocking nature of archive_read_open().
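The hacked callback could look roughly like the sketch below. This is not the code from example2.cpp; MemStream, next_chunk and chunk_sizes are hypothetical names, and a real libarchive read callback also receives a struct archive * as its first argument. The 7-byte and 262,144-byte limits are taken from the log output further down.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-archive state: the whole archive already sits in memory.
struct MemStream {
    const unsigned char *data;  // start of the in-memory archive
    std::size_t size;           // total bytes in the archive
    std::size_t pos;            // bytes already handed out
    int calls;                  // number of callbacks so far
};

// Modeled on the shape of a read callback: point *buffer at the next chunk
// and return its length; 0 signals end of stream.
long next_chunk(MemStream &s, const void **buffer) {
    assert(s.pos <= s.size);
    // The hack: the first three reads hand out at most 7 bytes, later ones 256 KB.
    const std::size_t limit = (s.calls < 3) ? 7 : 262144;
    const std::size_t n = std::min(limit, s.size - s.pos);
    *buffer = s.data + s.pos;
    s.pos += n;
    ++s.calls;
    return static_cast<long>(n);
}

// Drain a stream of total_size bytes and record what each callback returned.
std::vector<long> chunk_sizes(std::size_t total_size) {
    std::vector<unsigned char> archive(total_size);
    MemStream s{archive.data(), archive.size(), 0, 0};
    std::vector<long> sizes;
    const void *buf = nullptr;
    long n;
    do {
        n = next_chunk(s, &buf);
        sizes.push_back(n);
    } while (n > 0);
    return sizes;
}
```

For a 1,689,852-byte stream (the size of test_1.tar.gz), chunk_sizes returns 7, 7, 7, six chunks of 262,144, then 116,967 and finally 0, which are the sizes that appear for that archive in the logs.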
$ cp example.cpp example2.cpp
$ # hack the code
$ cp .libs/libarchive.a .libs/libarchive_test_1a.a
$ cp .libs/libarchive.a .libs/libarchive_test_2b.a
$ cp .libs/libarchive_fe.a .libs/libarchive_test_1b.a
$ cp .libs/libarchive_fe.a .libs/libarchive_test_2a.a
$ ls -al .libs/libarchive*.a
-rw-rw-r-- 1 simon simon 6130184 Oct 17 15:50 .libs/libarchive.a
-rw-rw-r-- 1 simon simon 51168 Oct 17 15:50 .libs/libarchive_fe.a
-rw-rw-r-- 1 simon simon 6130184 Oct 17 15:50 .libs/libarchive_test_1a.a
-rw-rw-r-- 1 simon simon 51168 Oct 17 15:51 .libs/libarchive_test_1b.a
-rw-rw-r-- 1 simon simon 51168 Oct 17 15:51 .libs/libarchive_test_2a.a
-rw-rw-r-- 1 simon simon 6130184 Oct 17 15:51 .libs/libarchive_test_2b.a
$ g++ -DFROM_STREAM -std=c++11 -O2 -Ilibarchive -L.libs -o example2 example2.cpp -larchive && ./example2 -w .libs/libarchive_test_1*.a ; cp test.tar.gz test_1.tar.gz
- compressing: .libs/libarchive_test_1a.a
- archive_entry_new()
- read() = 2097152
- read() = 2097152
- read() = 1935880
- read() = 0
- compressing: .libs/libarchive_test_1b.a
- archive_entry_clear()
- read() = 51168
- read() = 0
$ g++ -DFROM_STREAM -std=c++11 -O2 -Ilibarchive -L.libs -o example2 example2.cpp -larchive && ./example2 -w .libs/libarchive_test_2*.a ; cp test.tar.gz test_2.tar.gz
- compressing: .libs/libarchive_test_2a.a
- archive_entry_new()
- read() = 51168
- read() = 0
- compressing: .libs/libarchive_test_2b.a
- archive_entry_clear()
- read() = 2097152
- read() = 2097152
- read() = 1935880
- read() = 0
$ ls -al test_*.tar.gz
-rw-rw-r-- 1 simon simon 1689852 Oct 17 15:54 test_1.tar.gz
-rw-rw-r-- 1 simon simon 1689589 Oct 17 15:55 test_2.tar.gz
$ tar -ztvf test_1.tar.gz
---------- 0/0 6130184 1969-12-31 16:00 .libs/libarchive_test_1a.a
---------- 0/0 51168 1969-12-31 16:00 .libs/libarchive_test_1b.a
$ tar -ztvf test_2.tar.gz
---------- 0/0 51168 1969-12-31 16:00 .libs/libarchive_test_2a.a
---------- 0/0 6130184 1969-12-31 16:00 .libs/libarchive_test_2b.a
- /* for each archive file: read off disk, and archive_read_open() */
0=id - attempting to open: test_1.tar.gz
0=id - read 1,689,852 bytes into id_buff_archive[0]
0=id - archive_read_new() {}
0=id - archive_read_open() {
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_1a.a // file to extract
0=id - archive_read_extract() {
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 116,967=bytes_available // callback
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_1b.a // file to extract
0=id - archive_read_extract() {
0=id - libarchiveRead() {} = 0=bytes_available // callback
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_EOF
- /* for each archive file: archive_read_close() and archive_read_free() */
0=id - archive_read_close() {}
0=id - archive_read_free() {}
- /* for each archive file: read off disk, and archive_read_open() */
0=id - attempting to open: test_2.tar.gz
0=id - read 1,689,589 bytes into id_buff_archive[0]
0=id - archive_read_new() {}
0=id - archive_read_open() {
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_2a.a // file to extract
0=id - archive_read_extract() {
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_2b.a // file to extract
0=id - archive_read_extract() {
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 116,704=bytes_available // callback
0=id - libarchiveRead() {} = 0=bytes_available // callback
0=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_EOF
- /* for each archive file: archive_read_close() and archive_read_free() */
0=id - archive_read_close() {}
0=id - archive_read_free() {}
- Note: Lines starting 0=id are dealing with test_1.tar.gz, and lines starting 1=id are dealing with test_2.tar.gz.
- Note: We can see that once we start with the callbacks for a particular archive, there's no way to pause execution for the archive (and continue with the other archive).
- Note: Once started, the callbacks for an archive are called repeatedly, one after the other, until enough chunks of the archive have been read to extract the next file within the archive.
- Note: In the two-archive run below, where each archive contains a longer and a shorter file, we can see how the longer file causes many callbacks in succession.
- Note: If each callback presented only e.g. 1,500 more bytes (instead of 256 KB) to libarchive, then there would be very many more callbacks, and more delay while waiting for packets to arrive from the network.
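To put rough numbers on the last note, the callback count can be estimated as one callback per chunk plus a final zero-length callback for end of stream (ignoring the three hacked 7-byte reads). callbacks_needed is a hypothetical helper for this estimate, not part of libarchive.

```cpp
#include <cassert>
#include <cstddef>

// Estimated number of read callbacks needed to deliver total_bytes when each
// callback hands over at most chunk_bytes: one per chunk (rounded up), plus
// one final callback that returns 0 to signal end of stream.
constexpr std::size_t callbacks_needed(std::size_t total_bytes, std::size_t chunk_bytes) {
    return (total_bytes + chunk_bytes - 1) / chunk_bytes + 1;
}
```

For test_1.tar.gz (1,689,852 bytes), callbacks_needed(1689852, 262144) is 8, while callbacks_needed(1689852, 1500) is 1128. With the current API, each of those callbacks would have to block until its packet arrived.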
$ g++ -DFROM_STREAM -std=c++11 -O2 -Ilibarchive -L.libs -o example2 example2.cpp -larchive && ./example2 -r test_1.tar.gz test_2.tar.gz
- /* for each archive file: read off disk, and archive_read_open() */
0=id - attempting to open: test_1.tar.gz
0=id - read 1,689,852 bytes into id_buff_archive[0]
0=id - archive_read_new() {}
0=id - archive_read_open() {
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 7=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id } = ARCHIVE_OK
1=id - attempting to open: test_2.tar.gz
1=id - read 1,689,589 bytes into id_buff_archive[1]
1=id - archive_read_new() {}
1=id - archive_read_open() {
1=id - libarchiveRead() {} = 7=bytes_available // callback
1=id - libarchiveRead() {} = 7=bytes_available // callback
1=id - libarchiveRead() {} = 7=bytes_available // callback
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_1a.a // file to extract
0=id - archive_read_extract() {
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 262,144=bytes_available // callback
0=id - libarchiveRead() {} = 116,967=bytes_available // callback
0=id } = ARCHIVE_OK
1=id - archive_read_next_header() {} = ARCHIVE_OK
1=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_2a.a // file to extract
1=id - archive_read_extract() {
1=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_OK
0=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_1b.a // file to extract
0=id - archive_read_extract() {
0=id - libarchiveRead() {} = 0=bytes_available // callback
0=id } = ARCHIVE_OK
1=id - archive_read_next_header() {} = ARCHIVE_OK
1=id - archive_entry_pathname(entry) {} = .libs/libarchive_test_2b.a // file to extract
1=id - archive_read_extract() {
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id - libarchiveRead() {} = 262,144=bytes_available // callback
1=id - libarchiveRead() {} = 116,704=bytes_available // callback
1=id - libarchiveRead() {} = 0=bytes_available // callback
1=id } = ARCHIVE_OK
- /* for each archive file: archive_read_next_header() and archive_read_extract() */
0=id - archive_read_next_header() {} = ARCHIVE_EOF
1=id - archive_read_next_header() {} = ARCHIVE_EOF
- /* for each archive file: archive_read_close() and archive_read_free() */
0=id - archive_read_close() {}
0=id - archive_read_free() {}
1=id - archive_read_close() {}
1=id - archive_read_free() {}
- Note: Ideally an API change would maintain backwards compatibility.
- Note: Ideally an API change would not expand the already enormous and complicated API.
- Note: This idea is born just from examining the current API, without examining libarchive internals.
- Note: In theory this change would be the lowest touch to the libarchive documentation?
- Note: This would potentially double the number of archive_read_callback() calls? Why? The first would always offer raw bytes, while the second would always return ARCHIVE_WOULDBLOCK, and so on. The more complicated idea #2 attempts to mitigate performance issues, but could end up being overcomplicated?
- Introduce ARCHIVE_WOULDBLOCK as an extra return value along with ARCHIVE_OK [1] et al.
- archive_read_open() [2] can be called repeatedly if ARCHIVE_WOULDBLOCK is returned.
- archive_read_extract() [3] can be called repeatedly if ARCHIVE_WOULDBLOCK is returned.
- archive_read_callback() can return ARCHIVE_WOULDBLOCK if there are no raw bytes currently to pass to libarchive.
- archive_read_next_header() [4] will return ARCHIVE_WOULDBLOCK if archive_read_open() or archive_read_extract() have not been fed enough raw bytes.
[1] ARCHIVE_OK
[2] archive_read_open()
[3] archive_read_extract()
[4] archive_read_next_header()
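The control flow of idea #1 can be sketched with a toy model that does not use libarchive at all. Everything here is hypothetical: TOY_WOULDBLOCK stands in for the proposed ARCHIVE_WOULDBLOCK (which libarchive does not define), toy_read_callback stands in for the read callback, and toy_extract stands in for archive_read_extract(). The point is only to show one thread interleaving two streams.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical status codes for this toy model only.
enum ToyStatus { TOY_OK = 0, TOY_WOULDBLOCK = -100 };

// One toy stream: `remaining` bytes still to arrive, delivered `chunk` at a time.
struct ToyStream {
    int id;
    std::size_t remaining;
    std::size_t chunk;
    bool offered_last_time;  // true if the previous callback handed over bytes
    int callbacks;           // how many times the callback ran
};

// Stand-in for the read callback: hand over one chunk, then report "would
// block" on the next call so that control returns to the caller.
long toy_read_callback(ToyStream &s) {
    ++s.callbacks;
    if (s.offered_last_time) {
        s.offered_last_time = false;
        return TOY_WOULDBLOCK;
    }
    s.offered_last_time = true;
    const std::size_t n = s.remaining < s.chunk ? s.remaining : s.chunk;
    s.remaining -= n;
    return static_cast<long>(n);  // 0 means end of stream
}

// Stand-in for archive_read_extract(): consume bytes until the stream ends,
// but return early whenever the callback would block.
ToyStatus toy_extract(ToyStream &s) {
    for (;;) {
        const long n = toy_read_callback(s);
        if (n == TOY_WOULDBLOCK) return TOY_WOULDBLOCK;  // caller retries later
        if (n == 0) return TOY_OK;                       // whole stream consumed
    }
}

// Round-robin over all streams in one thread; returns the order in which
// the streams were serviced, one digit per toy_extract() call.
std::string run_interleaved(std::vector<ToyStream> &streams) {
    std::string order;
    std::vector<bool> done(streams.size(), false);
    std::size_t pending = streams.size();
    while (pending > 0) {
        for (std::size_t i = 0; i < streams.size(); ++i) {
            if (done[i]) continue;
            order += static_cast<char>('0' + streams[i].id);
            if (toy_extract(streams[i]) == TOY_OK) {
                done[i] = true;
                --pending;
            }
        }
    }
    return order;
}
```

With a 20-byte stream 0 and a 10-byte stream 1, both using 10-byte chunks, run_interleaved services the streams in the order 0, 1, 0, 1, 0. Stream 0 sees five callbacks instead of the three a blocking read would need (two chunks plus end of stream), which illustrates the "double the number of calls" concern in the note above.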
Idea #2: Like idea #1 with new archive_read_offer_bytes() function instead of archive_read_callback()
- Note: This idea is born just from examining the current API, without examining libarchive internals.
- Note: This idea makes the API even more complicated but saves a callback at run-time; likely relatively small performance savings.
- Note: archive_read_open() actually has 4 possible callbacks, so spinning one out into archive_read_offer_bytes() might end up really over complicating the API?
- archive_read_open() would be given NULL as the address for the callback archive_read_callback(), and will therefore always return ARCHIVE_WOULDBLOCK.
- If archive_read_open() or archive_read_extract() or archive_read_next_header() return ARCHIVE_WOULDBLOCK, then the new archive_read_offer_bytes() function should be called (which acts as a substitute for archive_read_callback()).
- The new archive_read_offer_bytes() function returns what archive_read_open() or archive_read_extract() would have returned, including ARCHIVE_WOULDBLOCK if archive_read_offer_bytes() needs to be called again in the future.
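The push-style flow of idea #2 can also be sketched as a libarchive-free toy. All names are hypothetical: toy_offer_bytes stands in for the proposed archive_read_offer_bytes(), toy_extract for archive_read_extract() opened with a NULL read callback, and TOY_WOULDBLOCK for the proposed ARCHIVE_WOULDBLOCK. Unlike idea #1, the caller decides when bytes are handed over, for example whenever a network packet arrives.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes for this toy model only.
enum ToyStatus { TOY_OK = 0, TOY_WOULDBLOCK = -100 };

// Toy reader that "extracts" one entry of `needed` bytes. No read callback is
// registered, so it only makes progress when bytes are pushed into it.
struct ToyReader {
    std::size_t needed;    // bytes still required to finish the entry
    std::size_t received;  // bytes accepted so far
};

// Stand-in for archive_read_extract() with a NULL read callback: it cannot
// pull data, so it reports "would block" until enough bytes were offered.
ToyStatus toy_extract(const ToyReader &r) {
    return r.needed == 0 ? TOY_OK : TOY_WOULDBLOCK;
}

// Stand-in for the proposed archive_read_offer_bytes(): accept a chunk and
// return what the stalled call would have returned.
ToyStatus toy_offer_bytes(ToyReader &r, const void *data, std::size_t len) {
    (void)data;  // a real reader would decode the bytes; the toy only counts them
    const std::size_t used = len < r.needed ? len : r.needed;
    r.needed -= used;
    r.received += used;
    return toy_extract(r);
}
```

A caller with two readers in one thread would call toy_offer_bytes on whichever reader has data available and move on when it returns TOY_WOULDBLOCK, so neither stream can stall the other.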