Created February 1, 2018 05:48
list files from a tar archive
package main

// tweaked from
// original source: https://gist.github.com/indraniel/1a91458984179ab4cf80

import (
	"archive/tar"
	"compress/gzip"
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	flag.Parse()
	sourceFile := flag.Arg(0)
	if sourceFile == "" {
		fmt.Println("Dude, you didn't pass in a tar file!")
		os.Exit(1)
	}
	processFile(sourceFile)
}

func processFile(srcFile string) {
	f, err := os.Open(srcFile)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	defer f.Close()

	var source io.Reader

	// try to make a gzip reader; if that fails, treat the file as a raw tar
	gzf, err := gzip.NewReader(f)
	if err != nil {
		fmt.Println("Doesn't look like it is compressed...pretending it is a raw tar")
		// gzip.NewReader already consumed some bytes from f, so rewind first
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			fmt.Println(err)
			os.Exit(1)
		}
		source = f
	} else {
		source = gzf
	}

	// walk the archive entry by entry, printing the regular files
	tarReader := tar.NewReader(source)
	for {
		header, err := tarReader.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println(err)
			os.Exit(1)
		}

		switch header.Typeflag {
		case tar.TypeDir:
			continue
		case tar.TypeReg:
			fmt.Println("File:", header.Name)
		default:
			fmt.Printf("%s : %c %s %s\n",
				"Yikes! Unable to figure out type",
				header.Typeflag,
				"in file",
				header.Name,
			)
		}
	}
}
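To try it out, save the listing and run it against an archive (the file name main.go is just an assumption here, not part of the gist):

go run main.go symbols-2017-11-27T14_15_30.tar.gz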
The equivalent of...

with open(fn, 'rb') as f:
    zf = zipfile.ZipFile(f)
    zf.extractall(dest)

with tarfile is...

with tarfile.open(fn) as tar:
    tar.extractall(dest)

which works out to be the same speed. In the respective cases, fn is either the 33MB symbols-2017-11-27T14_15_30.zip or the 33MB symbols-2017-11-27T14_15_30.tar.gz.
The point was to process the tar.gz while it was uploading. You don't have to wait for the whole file, write it to disk, and then decompress it; the structure of tar.gz files allows you to process them as a stream.
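As a rough sketch of that idea (nothing here is from the gist: the /upload route, port, and handler name are all made up), the same tar-walking loop can consume an upload body directly, since tar.NewReader only needs an io.Reader:

package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"net/http"
)

// listUpload reads a .tar.gz straight out of the request body,
// printing entry names as the bytes arrive -- nothing touches disk.
func listUpload(w http.ResponseWriter, r *http.Request) {
	gzr, err := gzip.NewReader(r.Body)
	if err != nil {
		http.Error(w, "not a gzip stream", http.StatusBadRequest)
		return
	}
	defer gzr.Close()

	tr := tar.NewReader(gzr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if hdr.Typeflag == tar.TypeReg {
			fmt.Fprintln(w, "File:", hdr.Name)
		}
	}
}

func main() {
	http.HandleFunc("/upload", listUpload)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

With something like curl -T symbols-2017-11-27T14_15_30.tar.gz http://localhost:8080/upload, the entry names start printing while the upload is still in flight.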
Made this to test an idea... see the original blog entry: https://www.peterbe.com/plog/fastest-way-to-unzip-a-zip-file-in-python
Converted the 34MB zip file into a gzipped tarball. It came out to about 34,768,561 bytes, so about the same size. The biggest differences are the block compression and the tar format: there's no need to have the whole archive before processing it. We can process it as a stream of bytes... presumably close to wire speed.
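The standard library's signatures make that difference concrete: archive/zip needs random access plus the total size up front, because the central directory listing the entries sits at the end of the archive, while archive/tar is happy with a plain io.Reader. A sketch (the uncompressed .tar file name is hypothetical, used only for illustration):

package main

import (
	"archive/tar"
	"archive/zip"
	"fmt"
	"log"
	"os"
)

func main() {
	// zip needs an io.ReaderAt and the archive size, because the
	// central directory lives at the *end* of the file.
	zf, err := os.Open("symbols-2017-11-27T14_15_30.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer zf.Close()
	info, err := zf.Stat()
	if err != nil {
		log.Fatal(err)
	}
	zr, err := zip.NewReader(zf, info.Size())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("zip entries:", len(zr.File))

	// tar only needs a plain io.Reader, so entries can be handled
	// front-to-back as the bytes arrive -- no seeking required.
	tf, err := os.Open("symbols-2017-11-27T14_15_30.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	tr := tar.NewReader(tf)
	hdr, err := tr.Next()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("first tar entry:", hdr.Name)
}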
Some timing information: this was done on a 2017 13" MBP, which has a really fast SSD and a decently fast processor. The data shows that decompression takes up almost all of the time. Even so, we can decompress and process at about 35MB/second, or roughly 280Mbps (35 MB/s × 8 bits/byte). That's not too bad.
It shouldn't be too much more work to spin up a goroutine that checks whether a file exists and, if it does, streams (uploads) the data to S3.
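A minimal sketch of the streaming-upload part (the exists-check is left out, and uploadToS3 is a placeholder, not a real API; a real version would hand the reader to an S3 client that accepts an io.Reader, such as the AWS SDK's s3manager.Uploader):

package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// uploadToS3 is hypothetical: it just drains the stream and reports the size.
// A real implementation would pass r as the body of an S3 upload.
func uploadToS3(key string, r io.Reader) error {
	n, err := io.Copy(io.Discard, r)
	fmt.Printf("would upload %s (%d bytes)\n", key, n)
	return err
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: uploader <archive.tar.gz>")
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gzr, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(gzr)

	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}

		// Hand the entry to the uploader through a pipe so the bytes
		// flow straight from the tar stream into the upload goroutine.
		pr, pw := io.Pipe()
		done := make(chan error, 1)
		go func(key string) {
			done <- uploadToS3(key, pr)
		}(hdr.Name)

		_, copyErr := io.Copy(pw, tr)
		pw.CloseWithError(copyErr) // nil error just signals EOF to the reader
		if err := <-done; err != nil {
			log.Fatal(err)
		}
	}
}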