
@mostlygeek
Created February 1, 2018 05:48
List files from a tar archive
package main

// tweaked from
// original source: https://gist.github.com/indraniel/1a91458984179ab4cf80

import (
	"archive/tar"
	"compress/gzip"
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	flag.Parse()
	sourceFile := flag.Arg(0)
	if sourceFile == "" {
		fmt.Println("Dude, you didn't pass in a tar file!")
		os.Exit(1)
	}
	processFile(sourceFile)
}

func processFile(srcFile string) {
	f, err := os.Open(srcFile)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	defer f.Close()

	var source io.Reader

	// make a gzip reader; if the header check fails, fall back to raw tar
	gzf, err := gzip.NewReader(f)
	if err != nil {
		fmt.Println("Doesn't look like it is compressed... pretending it is a raw tar")
		// gzip.NewReader consumed a few bytes probing for the gzip
		// header, so rewind before reading the file as a raw tar.
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			fmt.Println(err)
			os.Exit(1)
		}
		source = f
	} else {
		source = gzf
	}

	// make a tar reader and walk the entries
	tarReader := tar.NewReader(source)
	for {
		header, err := tarReader.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println(err)
			os.Exit(1)
		}
		switch header.Typeflag {
		case tar.TypeDir:
			continue
		case tar.TypeReg:
			fmt.Println("File:", header.Name)
		default:
			fmt.Printf("%s : %c %s %s\n",
				"Yikes! Unable to figure out type",
				header.Typeflag,
				"in file",
				header.Name,
			)
		}
	}
}
@mostlygeek (Author)

Made this to test an idea... see the original blog entry: https://www.peterbe.com/plog/fastest-way-to-unzip-a-zip-file-in-python

Converted the 34MB zip file into a gzipped tarball. It came out to 34,768,561 bytes, so about the same size. The biggest difference is the block compression and the tar format: there is no need to have the whole archive before processing it. We can process it as a stream of bytes... presumably close to wire speed.
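
That streaming claim is easy to sketch. Here's a minimal, hedged example, assuming the archive is served over HTTP (the URL is made up): resp.Body is just an io.Reader, so the tar entries can be listed without the archive ever touching disk.

package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical URL; the point is that resp.Body is a plain
	// io.Reader, so nothing needs to be written to disk first.
	resp, err := http.Get("https://example.com/symbols.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	gzr, err := gzip.NewReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(gzr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if hdr.Typeflag == tar.TypeReg {
			fmt.Println("File:", hdr.Name)
		}
	}
}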

Some timing information:

$ ls -l
total 518624
drwxr-xr-x  104 bwong  staff       3536 31 Jan 21:32 output
-rwxr-xr-x    1 bwong  staff    2290736 31 Jan 21:45 process
-rw-r--r--    1 bwong  staff       1167 31 Jan 21:47 process.go
-rw-r--r--    1 bwong  staff  193751040 31 Jan 21:46 symbols-2017-11-27T14_15_30.tar
-rw-r--r--    1 bwong  staff   34768561 31 Jan 21:44 symbols-2017-11-27T14_15_30.tar.gz
-rw-r--r--    1 bwong  staff   34709742 31 Jan 21:30 symbols-2017-11-27T14_15_30.zip

$ time ./process symbols-2017-11-27T14_15_30.tar.gz > /dev/null
real    0m0.947s
user    0m0.917s
sys     0m0.025s

$ time ./process symbols-2017-11-27T14_15_30.tar > /dev/null

real    0m0.016s
user    0m0.004s
sys     0m0.006s

This was done on a 2017 13" MBP. These have a really fast SSD and a decently fast processor. The data shows that decompression takes up almost all the time. Even so, we can decompress and process at about 35MB/second (34,768,561 bytes in 0.947s), or roughly 280Mbps. That's not too bad.

It shouldn't be too much more work to spin up a goroutine that checks whether a file exists and, if it does, streams (uploads) the data to S3.
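
A minimal sketch of that idea, assuming a hypothetical uploadToS3 helper (in practice it might wrap the AWS SDK's streaming uploader, which accepts an io.Reader). Each regular file is fed through an io.Pipe to an upload goroutine, so the upload overlaps with decompression:

package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"log"
	"os"
)

// uploadToS3 is a hypothetical stand-in for a real uploader.
// Illustrative only: it just drains the reader instead of talking to S3.
func uploadToS3(name string, r io.Reader) error {
	_, err := io.Copy(io.Discard, r)
	return err
}

func main() {
	gzr, err := gzip.NewReader(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(gzr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		pr, pw := io.Pipe()
		done := make(chan error, 1)
		go func(name string) {
			// The goroutine consumes the pipe while io.Copy below
			// fills it, so the upload overlaps decompression.
			done <- uploadToS3(name, pr)
		}(hdr.Name)
		_, cerr := io.Copy(pw, tr)
		pw.CloseWithError(cerr)
		if err := <-done; err != nil {
			log.Fatal(err)
		}
		if cerr != nil {
			log.Fatal(cerr)
		}
	}
}

Note that entries in a tar stream have to be consumed in order, so this overlaps each upload with its own decompression rather than uploading many files in parallel; truly parallel uploads would need per-entry buffering.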

@peterbe commented Feb 1, 2018

The equivalent of...

with open(fn, 'rb') as f:
    zf = zipfile.ZipFile(f)
    zf.extractall(dest)

with tarfile is...

with tarfile.open(fn) as tar:
    tar.extractall(dest)

Which works out to be the same speed.

In the respective cases, fn is either the 33MB symbols-2017-11-27T14_15_30.zip or the 33MB symbols-2017-11-27T14_15_30.tar.gz.

@mostlygeek (Author)

The point was to process the tar.gz while it was uploading: you don't have to wait for the whole file, write it to disk, and then decompress it. The structure of tar.gz files allows you to process them as a stream.
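
For example, with the program above, the archive never has to hit disk (the URL here is made up, and /dev/stdin assumes a Unix-like OS):

$ curl -sL https://example.com/symbols-2017-11-27T14_15_30.tar.gz | ./process /dev/stdin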
