We'll be using python-lambda.
- Create a new directory and virtualenv. If you're attaching this scraper to an interactive, you can just add a new folder to the main directory.
- `(venv) $ pip install python-lambda`
- `lambda init`
- In `config.yaml`, update `function_name` and `description` (see the sketch after this list). Do not add your AWS credentials here because this file will get pushed up to GitHub.
- Instead, add your credentials to an `.env` file like:

  ```
  export AWS_ACCESS_KEY_ID=''
  export AWS_SECRET_ACCESS_KEY=''
  ```

  and run `(venv) $ source .env` to activate them. (The `export` is needed so the variables are visible to boto3 when you run your script.)
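For reference, `lambda init` scaffolds a few files for you, including `service.py`, `config.yaml`, and `event.json`. A trimmed `config.yaml` might look something like this (a sketch only; the exact fields vary by python-lambda version, and every value here is a placeholder):

```yaml
region: us-east-1
function_name: my_scraper
handler: service.handler
description: Scrapes a page and uploads JSON to S3
runtime: python2.7
```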
The `handler` function is invoked in response to an event, so this is what you'll populate.
The scaffolded example will show you that you can grab variables from the `event.json` file if you so choose; I found I didn't use that in my code. That file is just for local testing.
Example `service.py` setup:

```python
import os
import json

import boto3


def handler(*args):
    # your scraping code goes here and should build `data`
    data = {}  # placeholder for your scraped data
    # upload some data
    upload_data_s3(data)


def upload_data_s3(data):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('interactives.dallasnews.com')
    bucket_data_path = '2017/some-path'
    bucket.put_object(
        Key=os.path.join(
            bucket_data_path,
            'your-file-name.json'
        ),
        Body=json.dumps(data),
        ACL='public-read',
        ContentType='application/json'
    )


# this call is just for our testing purposes,
# invoking the main handler function locally
if __name__ == '__main__':
    handler()
```

You'll be able to access this file in a JS script with something like:
d3.json("https://interactives.dallasnews.com/2017/some-path/your-file-name.json", function(error, data){
// your code here
})If you are using a populated event.json file, you can call:
`(venv) $ lambda invoke -v`

and it will run the `handler(event, context)` function.
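A minimal sketch of what that looks like, assuming an `event.json` carrying a hypothetical `url` field (my own placeholder, not something python-lambda requires):

```python
# event.json: {"url": "https://example.com/page-to-scrape"}

def handler(event, context):
    # values from event.json arrive in `event` during local testing
    url = event.get('url')
    # scrape `url`, build your data, then upload it as above
```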
If you aren't using the `event.json` (like I'm not in the example above), simply call:

`(venv) $ python service.py`
When you're ready to deploy, run:

`(venv) $ lambda deploy`
Navigate to Lambda in the AWS console, and then:
- Configure any necessary environment variables (API keys, etc.).
- Triggers > Add trigger. I've selected CloudWatch Events because I'm going to ping a page myself and check for updates.
- Create a new rule. If you want it to fire at a certain time or at certain intervals, select "Schedule expression".
- Use a fancy cron expression and submit your trigger (see the examples after this list).
- Under the Configuration tab, in Advanced settings, you can set a timeout if you so choose.
- Hitting the "Test" button will run the scraper and populate the S3 bucket.