We'll be using python-lambda.

- Create a new directory and virtualenv. If you're attaching this scraper to an interactive, you can just add a new folder to the main directory. (Example setup commands follow this list.)
- Install the library:

  ```sh
  (venv) $ pip install python-lambda
  ```

- Scaffold the project:

  ```sh
  (venv) $ lambda init
  ```

- In `config.yaml`, update `function_name` and `description`. Do not add your AWS credentials here because this file will get pushed up to GitHub. (A sketch of the file also follows this list.)
- Instead, add your credentials to a `.env` file like:

  ```sh
  export AWS_ACCESS_KEY_ID=''
  export AWS_SECRET_ACCESS_KEY=''
  ```

  and run `(venv) $ source .env` to activate them.
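
If you're starting from scratch, the directory and virtualenv setup is the usual routine; a minimal sketch (the directory name is a placeholder):

```sh
$ mkdir my-scraper && cd my-scraper
$ virtualenv venv
$ source venv/bin/activate
```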
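
And for reference, the `config.yaml` that `lambda init` scaffolds looks roughly like this; the values here are placeholders, and the exact fields may differ by python-lambda version:

```yaml
region: us-east-1
function_name: my_scraper
handler: service.handler
description: Scrapes a page and uploads the results to S3
runtime: python2.7

# leave aws_access_key_id / aws_secret_access_key out of this file;
# they'll be picked up from your sourced .env instead
```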
The `handler` function is invoked in response to an event, so this is what you'll populate. The scaffolded example will show you that you can grab variables from the `event.json` file if you so choose. I found I didn't use that in my code; it's just for local testing.
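
If you do want to use it, `event.json` is just a JSON object whose keys arrive on the handler's `event` argument. A minimal sketch with an arbitrary key:

```json
{
  "pie": "pumpkin"
}
```

A handler that used it would read `event['pie']` when invoked locally.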
Example `service.py` setup:
```python
import os
import json

import boto3


def handler(event, context):
    # your scraping code here; build up the data you want to publish
    data = {}  # placeholder for your scraped results
    upload_data_s3(data)


def upload_data_s3(data):
    # connect to S3 and write the JSON file to the bucket
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('interactives.dallasnews.com')
    bucket_data_path = '2017/some-path'
    bucket.put_object(
        Key=os.path.join(bucket_data_path, 'your-file-name.json'),
        Body=json.dumps(data),
        ACL='public-read',
        ContentType='application/json'
    )


# this block is just for our testing purposes,
# calling the main handler function directly
if __name__ == '__main__':
    handler(None, None)
```
You'll be able to access this file in a JS script with something like:

```js
d3.json("https://interactives.dallasnews.com/2017/some-path/your-file-name.json", function(error, data) {
  // your code here
});
```
If you are using a populated `event.json` file, you can call:

```sh
(venv) $ lambda invoke -v
```

and it will run the `handler(event, context)` function.

If you aren't using `event.json` (like I'm not in the example above), simply call:

```sh
(venv) $ python service.py
```
When you're ready to deploy, run:

```sh
(venv) $ lambda deploy
```
Navigate to Lambda in the AWS console and then:

- Configure any necessary environment variables (API keys, etc.); see the snippet after this list for reading them in your code.
- Triggers > Add trigger. I've selected CloudWatch Events because I'm going to ping a page myself and check for updates.
- Create a new rule. If you want it to fire at a certain time or at certain intervals, select "Schedule expression".
- Use a fancy cron expression and submit your trigger. (Example expressions follow this list.)
- Under the Configuration tab, in Advanced settings, you can set a timeout if you so choose.
- Hitting the "Test" button will run the scraper and populate the S3 bucket.