We'll be using python-lambda.
- Create a new directory and call it whatever you want.
- Enter the new directory and run `virtualenv venv` from Terminal. If you don't have virtualenv, you can install it with `pip install virtualenv`.
- Activate the virtualenv with `source venv/bin/activate`.
- Run `(venv) $ pip install python-lambda`.
- Run `lambda init`.
- In `config.yaml`, update `function_name` and `description` (a sketch of these fields follows this list). Don't add your AWS credentials here because this file will get pushed up to GitHub.
- Create a `.gitignore` file if you don't have one and add `.env` to it.
- Add your credentials to an `.env` file like:

  ```
  export AWS_ACCESS_KEY_ID=''
  export AWS_SECRET_ACCESS_KEY=''
  ```

- Run `(venv) $ source .env` to activate them.
- Create/note which AWS bucket you want to save into. Enter it, along with the path you want, into `service.py` (the `bucket` and `bucket_data_path` variables in the example below).
- In `service.py`, replace `'your-file-name.json'` with the name of your file.
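For reference, here's a trimmed sketch of the two `config.yaml` fields that step touches. The exact scaffold `lambda init` generates may differ by python-lambda version, and these values are placeholders:

```yaml
# Placeholder values; the rest of the scaffolded config is omitted
function_name: my_scraper_function
description: Scrapes a page and uploads the results to S3
```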
The `handler` function is invoked in response to an event, so this is what you'll populate. The populated example shows that you can grab variables from the `event.json` file if you so choose; I found I didn't use that in my code, since the file is just for local testing.
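For context, `event.json` is just a JSON file of sample values that gets passed to the handler when you invoke locally. A minimal, entirely hypothetical example (the `url_to_check` key is invented for illustration):

```json
{
    "url_to_check": "https://example.com/some-page"
}
```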
Example `service.py` setup:
```python
import os
import json

import boto3

# Since scrapers can run long, save them as modules and import them
from exampleFile1 import exampleFunction1


# Lambda calls handler(event, context); *args works since we ignore both
def handler(*args):
    # your code here
    # trash this example
    data = exampleFunction1(1)
    # Should print '2'
    print(data)
    # upload some data
    upload_data_s3(data)


def upload_data_s3(data):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('interactives.dallasnews.com')
    bucket_data_path = '2017/some-path'
    bucket.put_object(
        Key=os.path.join(bucket_data_path, 'your-file-name.json'),
        Body=json.dumps(data),
        ACL='public-read',
        ContentType='application/json'
    )


# this block is just for local testing purposes,
# calling the main handler function directly
if __name__ == '__main__':
    handler()
```
You'll be able to access this file in a JS script with something like:
d3.json("https://interactives.dallasnews.com/2017/some-path/your-file-name.json", function(error, data){
// your code here
})
If you are using a populated `event.json` file, you can call:

```
(venv) $ lambda invoke -v
```

and it will run the `handler(event, context)` function.
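If you do want to read from the event, here's a minimal sketch of what that could look like, reusing the hypothetical `url_to_check` key from the `event.json` example above:

```python
def handler(event, context):
    # When run with `lambda invoke -v`, `event` is the parsed
    # contents of event.json; `url_to_check` is a hypothetical key
    url = event.get('url_to_check')
    print('Checking {}'.format(url))
```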
If you aren't using the `event.json` (like I'm not in the example above), simply call:

```
(venv) $ python service.py
```
When you're ready to deploy, run:

```
(venv) $ lambda deploy
```
Navigate to Lambda in the AWS console and then:

- Configure any necessary environment variables (API keys, etc.).
- Triggers > Add trigger. I've selected CloudWatch Events because I'm going to ping a page myself and check for updates.
- Create a new rule. If you want it to fire at a certain time or at certain intervals, select "Schedule expression."
- Use a fancy cron expression and submit your trigger (a couple of example expressions follow this list).
- Under the Configuration tab, in Advanced settings, you can set a timeout if you so choose.
- Hitting the "Test" button will run the scraper and populate the S3 bucket.