- FAIR data principles and distributed computing
- Data is stored in the cloud and can be read or queried directly with HTTP requests
- Exploiting large geospatial datasets in the cloud efficiently, by transmitting as few bytes as possible
- efficient with cloud storage
- Ability to scale up/out geospatial analyses to cloud scale more easily
- Big Data, tiled processing, STAC, portable/scalable workflows, COG
- Technologies that are designed to work well in the cloud.
- Less configuration
- Data (and eventually analytics, which is not yet achieved) moves from desktop computers to clouds (plural), where it can be accessed through cloud services by expert as well as non-expert users
- To work in the cloud without lift-and-shift (i.e. simply spinning up a VM in the cloud)
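The "as few bytes as possible" point above usually rests on HTTP range requests: a client asks the server for just a window of a large object instead of the whole file. A minimal sketch in Python's standard library (the URL is a placeholder, and the 16 KiB window is an illustrative guess at a typical COG/GeoTIFF header size):

```python
import urllib.request

def range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build an HTTP GET that asks for only bytes start..end (inclusive),
    so a client can read a small slice of a large cloud-hosted file."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={start}-{end}")
    return req

# e.g. fetch only the first 16 KiB of a (hypothetical) large image:
req = range_request("https://example.com/big-image.tif", 0, 16383)
# with urllib.request.urlopen(req) as resp:  # a range-aware server replies
#     header_bytes = resp.read()             # with "206 Partial Content"
```

A server that honours the `Range` header answers with status 206 and only the requested bytes; that is what makes cloud-optimized formats cheap to probe.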
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any Amazon EC2 instance, via both S3 and HTTP.
As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.
- [ARC] Archived Crawl #1 - s3://commoncrawl/crawl-001/ - crawl data from 2008/2010
- [ARC] Archived Crawl #2 - s3://commoncrawl/crawl-002/ - crawl data from 2009/2010
- [ARC] Archived Crawl #3 - s3://commoncrawl/parse-output/ - crawl data from 2012
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
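Because the bucket is public, the prefixes above can also be fetched over plain HTTPS. A small sketch that translates an `s3://` URI into S3's virtual-hosted HTTPS form (generic S3 addressing, not a Common Crawl-specific endpoint):

```python
def s3_to_https(s3_uri: str) -> str:
    """Translate an s3://bucket/key URI into the equivalent HTTPS URL,
    using S3's virtual-hosted addressing: bucket.s3.amazonaws.com/key."""
    assert s3_uri.startswith("s3://"), "expected an s3:// URI"
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    return f"https://{bucket}.s3.amazonaws.com/{key}"

print(s3_to_https("s3://commoncrawl/crawl-data/CC-MAIN-2013-20/"))
```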
```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "This template creates the AWS infrastructure to publish a public data set on S3. It creates an S3 bucket for the dataset, an S3 bucket for access logs, and a policy that allows the Amazon Public Data Set program to read the logs and the public to read the dataset.",
  "Outputs": {},
  "Parameters": {
    "DataSetName": {
      "AllowedPattern": "[a-z0-9\\.\\-_]*",
      "ConstraintDescription": "may only contain lowercase letters, numbers, and ., -, or _ characters",
      "Description": "The name of the dataset's S3 bucket. This will be used to create the dataset and log S3 bucket.",
      "MaxLength": "250",
```
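The `AllowedPattern` and `MaxLength` constraints above can be checked locally before submitting the template. A small sketch using Python's `re` (the helper name is ours, not part of the template):

```python
import re

# AllowedPattern from the template: lowercase letters, digits, ., -, _
DATASET_NAME_RE = re.compile(r"[a-z0-9\.\-_]*")

def is_valid_dataset_name(name: str) -> bool:
    """Return True when the whole name matches the template's AllowedPattern
    and respects its MaxLength of 250 characters."""
    return len(name) <= 250 and DATASET_NAME_RE.fullmatch(name) is not None
```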
```json
{
  "Version": "2012-10-17",
  "Id": "BUCKET_NAME-pds-policy",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:List*",
        "s3:Get*"
```
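The policy grants every read-style S3 call via the `s3:List*` and `s3:Get*` wildcards. As a rough illustration of which concrete actions those patterns cover, `fnmatch`-style globbing behaves like IAM's `*` for this simple case (this is a sketch of the matching idea, not how AWS evaluates policies internally):

```python
from fnmatch import fnmatchcase

# wildcard grants copied from the bucket policy above
GRANTED = ["s3:List*", "s3:Get*"]

def allowed(action: str) -> bool:
    """Check a concrete S3 action against the policy's wildcard grants."""
    return any(fnmatchcase(action, pattern) for pattern in GRANTED)
```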
We are experimenting with making Global Forecast System (GFS) and High-Resolution Rapid Refresh (HRRR) model data publicly available on Amazon S3. This Gist describes where to find the data and how it's organized. To work with the data, use any of AWS's various SDKs or the Command Line Interface.
A rolling four-week archive of 0.25 degree GFS data is available in s3://noaa-gfs-pds.
Browse the data in your browser at http://awsopendata.s3-website-us-west-2.amazonaws.com/noaa-gfs/
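Besides the browser listing, a public bucket like this answers S3's ListObjectsV2 REST call with no credentials at all. A sketch that builds such a listing URL (the prefix and key count are illustrative):

```python
from urllib.parse import urlencode

def list_url(bucket: str, prefix: str = "", max_keys: int = 100) -> str:
    """Build the URL for S3's ListObjectsV2 REST call; public buckets
    such as noaa-gfs-pds answer it anonymously with an XML listing."""
    query = urlencode({"list-type": 2, "prefix": prefix, "max-keys": max_keys})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"

print(list_url("noaa-gfs-pds"))
```

Fetching that URL (e.g. with `curl` or `urllib.request.urlopen`) returns an XML document whose `<Contents>` elements name the objects.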
How to create a publicly-accessible SNS topic that sends messages when objects are added to a public Amazon S3 bucket.
In this case, that's an S3 bucket that is continually updated by the addition of new sensor data. For the purposes of this tutorial, we’ll use s3://noaa-nexrad-level2 – one of our NEXRAD on AWS buckets – as an example.
The SNS topic should be in the same region as the bucket. It will need to have a policy that allows our S3 bucket to publish to it, and anyone to subscribe to it using Lambda or SQS.
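One common shape for that topic access policy is sketched below. The statement IDs, topic name, and `ACCOUNT_ID` are placeholders of ours; the pattern of an `s3.amazonaws.com` service principal restricted by `aws:SourceArn` is the standard way to let one specific bucket publish:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ToPublish",
      "Effect": "Allow",
      "Principal": {"Service": "s3.amazonaws.com"},
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:ACCOUNT_ID:NewNEXRADLevel2Object",
      "Condition": {"ArnLike": {"aws:SourceArn": "arn:aws:s3:::noaa-nexrad-level2"}}
    },
    {
      "Sid": "AllowAnyoneToSubscribe",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sns:Subscribe",
      "Resource": "arn:aws:sns:us-east-1:ACCOUNT_ID:NewNEXRADLevel2Object"
    }
  ]
}
```

The first statement lets the bucket publish object-created events; the second lets anyone subscribe, e.g. with a Lambda function or an SQS queue.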
```sh
# create a local directory for each path listed in urls.txt
while read -r p; do
  mkdir -p "$p"
done < urls.txt
```
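The same mkdir-per-line loop, written in Python for use where a shell is not available (the listing content and target directory here are throwaway examples):

```python
import pathlib
import tempfile

def make_dirs(listing: str, root: pathlib.Path) -> None:
    """Python equivalent of the shell loop: create one directory per
    non-empty line, mkdir -p style (parents created, existing dirs kept)."""
    for line in listing.splitlines():
        line = line.strip()
        if line:
            (root / line).mkdir(parents=True, exist_ok=True)

# demo against a throwaway directory instead of a real urls.txt
root = pathlib.Path(tempfile.mkdtemp())
make_dirs("crawl-data/CC-MAIN-2013-20\ncrawl-002\n", root)
```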
```json
[{"tag_id":"agriculture","tag_text":"agriculture"},
 {"tag_id":"airtravel","tag_text":"air travel"},
 {"tag_id":"arts","tag_text":"arts"},
 {"tag_id":"banking","tag_text":"banking"},
 {"tag_id":"benefits","tag_text":"benefits"},
 {"tag_id":"betterbusinessbureaus","tag_text":"better business bureaus"},
 {"tag_id":"biology","tag_text":"biology"},
 {"tag_id":"business","tag_text":"business"},
 {"tag_id":"businessdevelopment","tag_text":"business development"},
 {"tag_id":"career","tag_text":"career"},
 {"tag_id":"cars","tag_text":"cars"},
 {"tag_id":"challenges","tag_text":"challenges"},
 {"tag_id":"charities","tag_text":"charities"},
 {"tag_id":"childcare","tag_text":"child care"},
 {"tag_id":"children","tag_text":"children"},
 {"tag_id":"citizenship","tag_text":"citizenship"},
 {"tag_id":"college","tag_text":"college"},
 {"tag_id":"commerce","tag_text":"commerce"},
 {"tag_id":"community","tag_text":"community"},
 {"tag_id":"communitydevelopment","tag_text":"community development"},
 {"tag_id":"complaints","tag_text":"complaints"},
 {"tag_id":"conserva
```
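Each entry in the tag list above pairs a machine id with display text, so it parses directly into a lookup table. A sketch over a small valid excerpt (the full file continues past what is shown):

```python
import json

# a short, valid excerpt of the tag list; the real file has many more entries
TAGS_JSON = (
    '[{"tag_id":"airtravel","tag_text":"air travel"},'
    '{"tag_id":"banking","tag_text":"banking"}]'
)

# map tag_id -> human-readable tag_text
tags = {t["tag_id"]: t["tag_text"] for t in json.loads(TAGS_JSON)}
```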