- Request Rate and Performance Considerations (AWS S3 Developer Guide, API Version 2006-03-01)
- How do I ingest a large number of small files from S3? My job looks like it's stalling. (Databricks Cloud support forum thread)
- What is the best way to ingest and analyze a large S3 dataset? (Databricks Cloud support forum thread)
- How can we get S3DistCp running on DBC? (Databricks Cloud support forum thread)
- How do I improve throughput of S3 writes in a Spark Streaming scenario? (Databricks Cloud support forum thread)
- Stall on loading many Parquet files on S3 (Databricks Cloud support forum thread)
- Strategies for reading large numbers of files (Apache Spark users mailing list)
- Dealing with Hadoop's small files problem (Snowplow blog post)
- s3-streamlogger (npm package)
- Maximizing Amazon S3 Performance (slide deck from AWS re:Invent 2013, STG304)
- The Bleeding Edge: Spark, Parquet and S3 (AppsFlyer tech blog post by Arnon Rotem-Gal-Oz)
- Hadoop and S3: 6 Tips for Top Performance (Mortar Data blog post)
- s4cmd - super S3 command-line tool (Python)
- AWS EMR S3DistCp (AWS documentation)
- fetch_and_combine.py - sample Python script to aggregate CloudFront logs on S3
- S3mper: Consistency in the Cloud (Netflix tech blog)
- AWS EMRFS (AWS documentation)
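Several of the threads above, along with fetch_and_combine.py, deal with the small-files problem by aggregating many tiny objects into fewer large ones before Spark (or S3) sees them. A minimal local sketch of that batching step, assuming plain files on disk (the function names and the 128 MB default are illustrative, not from any of the linked tools):

```python
import os


def combine_small_files(paths, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into fewer large ones of roughly
    target_bytes each, cutting per-object overhead on upload to S3."""
    os.makedirs(out_dir, exist_ok=True)
    combined, batch, size = [], [], 0
    for path in paths:
        batch.append(path)
        size += os.path.getsize(path)
        if size >= target_bytes:
            combined.append(_flush(batch, out_dir, len(combined)))
            batch, size = [], 0
    if batch:  # flush the final, possibly undersized batch
        combined.append(_flush(batch, out_dir, len(combined)))
    return combined


def _flush(batch, out_dir, index):
    """Write one combined output file from a batch of small inputs."""
    out_path = os.path.join(out_dir, f"combined-{index:05d}.log")
    with open(out_path, "wb") as out:
        for path in batch:
            with open(path, "rb") as src:
                out.write(src.read())
    return out_path
```

In a real pipeline the combined files would then be uploaded (or the inputs fetched from S3 first, as fetch_and_combine.py does for CloudFront logs); the batching logic is the same either way.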
Last active: August 12, 2019
Reading and Writing Event Streams to S3

Somewhat related, but more specific to Spark:
https://github.com/apache/spark/blob/master/docs/cloud-integration.md

> Important: Cloud Object Stores are Not Real Filesystems. While the stores appear to be filesystems, underneath they are still object stores, and the difference is significant.
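The warning quoted from the Spark cloud-integration docs is worth making concrete. Hadoop-style output committers finish a job by renaming a temporary directory, which on a real filesystem is one atomic metadata operation; on an object store there are no directories, only key prefixes, so a "rename" decomposes into a copy and a delete per object. A toy model of that difference (all class, key, and prefix names here are hypothetical, not S3 API calls):

```python
class MockObjectStore:
    """Toy flat key/value store illustrating why object stores are not
    filesystems: keys merely *look* like paths, and renaming a prefix is a
    copy-then-delete of every object under it, not an atomic pointer swap."""

    def __init__(self):
        self.objects = {}  # key -> bytes

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # "Listing a directory" is really scanning keys by prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, old, new):
        ops = 0
        for key in self.list_prefix(old):
            # Each object is copied to the new key and deleted from the old
            # one; in real S3 each of these is a remote request.
            self.objects[new + key[len(old):]] = self.objects.pop(key)
            ops += 1
        return ops  # O(number of objects), never a single atomic operation


store = MockObjectStore()
for i in range(3):
    store.put(f"output/_temporary/part-{i}", b"data")

# Committing the job: "rename" the temporary prefix into place.
ops = store.rename_prefix("output/_temporary/", "output/")
```

This per-object cost (and the window in which readers can observe a half-renamed prefix) is exactly why the linked Spark docs, S3mper, and EMRFS exist: they either avoid the rename-based commit or paper over the store's weaker guarantees.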