- Request Rate and Performance Considerations (AWS S3 Developer Guide, API Version 2006-03-01)
- How do I ingest a large number of small files from S3? My job looks like it's stalling. (Databricks Cloud support forum thread)
- What is the best way to ingest and analyze a large S3 dataset? (Databricks Cloud support forum thread)
- How can we get S3DistCp running on DBC? (Databricks Cloud support forum thread)
- How do I improve throughput of S3 writes in a Spark Streaming scenario? (Databricks Cloud support forum thread)
- Stall on loading many Parquet files on S3 (Databricks Cloud support forum thread)
- Strategies for reading large numbers of files (Apache Spark users mailing list)
- Dealing with Hadoop's small files problem (Snowplow blog post)
- s3-streamlogger (npm package)
- Maximizing Amazon S3 Performance (slide deck from AWS re:Invent 2013, STG304)
- The Bleeding Edge: Spark, Parquet and S3 (AppsFlyer tech blog post by Arnon Rotem-Gal-Oz)
- Hadoop and S3: 6 Tips for Top Performance (Mortar Data blog post)
- s4cmd - super S3 command-line tool (Python)
- AWS EMR S3DistCp (AWS documentation)
- fetch_and_combine.py - sample Python script to aggregate CloudFront logs on S3
- S3mper: Consistency in the Cloud (Netflix tech blog)
- AWS EMRFS (AWS documentation)
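Several of the threads above, along with fetch_and_combine.py, deal with the small-files problem by aggregating many tiny objects into fewer large ones before Spark (or S3) sees them. A minimal local sketch of that batching step, assuming plain files on disk (the function names and the 128 MB default are illustrative, not from any of the linked tools):

```python
import os


def combine_small_files(paths, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into fewer large ones of roughly
    target_bytes each, cutting per-object overhead on upload to S3."""
    os.makedirs(out_dir, exist_ok=True)
    combined, batch, size = [], [], 0
    for path in paths:
        batch.append(path)
        size += os.path.getsize(path)
        if size >= target_bytes:
            combined.append(_flush(batch, out_dir, len(combined)))
            batch, size = [], 0
    if batch:  # flush the final, possibly undersized batch
        combined.append(_flush(batch, out_dir, len(combined)))
    return combined


def _flush(batch, out_dir, index):
    """Write one combined output file from a batch of small inputs."""
    out_path = os.path.join(out_dir, f"combined-{index:05d}.log")
    with open(out_path, "wb") as out:
        for path in batch:
            with open(path, "rb") as src:
                out.write(src.read())
    return out_path
```

In a real pipeline the combined files would then be uploaded (or the inputs fetched from S3 first, as fetch_and_combine.py does for CloudFront logs); the batching logic is the same either way.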
Last active: August 12, 2019
Reading and Writing Event Streams to S3

Somewhat related, but more specific to Spark:
https://github.com/apache/spark/blob/master/docs/cloud-integration.md

> Important: Cloud Object Stores are Not Real Filesystems. While the stores appear to be filesystems, underneath they are still object stores, and the difference is significant.
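The warning quoted from the Spark cloud-integration docs is worth making concrete. Hadoop-style output committers finish a job by renaming a temporary directory, which on a real filesystem is one atomic metadata operation; on an object store there are no directories, only key prefixes, so a "rename" decomposes into a copy and a delete per object. A toy model of that difference (all class, key, and prefix names here are hypothetical, not S3 API calls):

```python
class MockObjectStore:
    """Toy flat key/value store illustrating why object stores are not
    filesystems: keys merely *look* like paths, and renaming a prefix is a
    copy-then-delete of every object under it, not an atomic pointer swap."""

    def __init__(self):
        self.objects = {}  # key -> bytes

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # "Listing a directory" is really scanning keys by prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, old, new):
        ops = 0
        for key in self.list_prefix(old):
            # Each object is copied to the new key and deleted from the old
            # one; in real S3 each of these is a remote request.
            self.objects[new + key[len(old):]] = self.objects.pop(key)
            ops += 1
        return ops  # O(number of objects), never a single atomic operation


store = MockObjectStore()
for i in range(3):
    store.put(f"output/_temporary/part-{i}", b"data")

# Committing the job: "rename" the temporary prefix into place.
ops = store.rename_prefix("output/_temporary/", "output/")
```

This per-object cost (and the window in which readers can observe a half-renamed prefix) is exactly why the linked Spark docs, S3mper, and EMRFS exist: they either avoid the rename-based commit or paper over the store's weaker guarantees.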