We'll be using python-lambda.
- Create a new directory and call it whatever you want.
- Enter the new directory and run `virtualenv venv` from Terminal. If you don't have virtualenv, you can install it with `pip install virtualenv`.
- Activate the virtualenv with `source venv/bin/activate`.
- Run `(venv) $ pip install python-lambda`.
- Run `lambda init`.
- In `config.yaml`, update `function_name` and `description`. Do not add your AWS credentials here because this file will get pushed up to GitHub. (See the sketch after this list.)
- Create a `.gitignore` file if you don't have one and add `.env` to it.
- Add your credentials to an `.env` file like:

  ```
  export AWS_ACCESS_KEY_ID=''
  export AWS_SECRET_ACCESS_KEY=''
  ```

- Run `(venv) $ source .env` to activate them.
- Create/note which AWS bucket you want to save into. Enter it into `bucket_data_path` in `service.py`.
- In `service.py`, replace `'your-file-name.json'` with the name of your file.
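For reference, the `config.yaml` that `lambda init` generates looks roughly like the sketch below; the exact fields can vary by python-lambda version, and `my_scraper` and the description are placeholders. The point of the `.env` step above is that the credential fields here stay empty:

```yaml
region: us-east-1

function_name: my_scraper                         # update this
handler: service.handler
description: Scrapes a page and saves JSON to S3  # and this

# Leave these blank. Credentials come from the .env file instead,
# so they never get committed to GitHub.
aws_access_key_id: ''
aws_secret_access_key: ''
```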
The handler function is invoked in response to an event, so that's the function you'll populate.
A populated event.json will show you that you can grab variables from it if you so choose. I found I didn't use that in my code; event.json is just for local testing.
Example `service.py` setup:
```python
import os
import json

import boto3

# Since scrapers can run long, save them as modules and import them
from exampleFile1 import exampleFunction1


def handler(*args):
    # your code here
    # trash this example
    data = exampleFunction1(1)
    # Should print '2'
    print(data)
    # upload some data
    upload_data_s3(data)


def upload_data_s3(data):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('interactives.dallasnews.com')
    bucket_data_path = '2017/some-path'
    bucket.put_object(
        Key=os.path.join(
            bucket_data_path,
            'your-file-name.json'
        ),
        Body=json.dumps(data),
        ACL='public-read',
        ContentType='application/json'
    )


# this is just for our testing purposes,
# calling the main handler function directly
if __name__ == '__main__':
    handler()
```

You'll be able to access this file in a JS script with something like:
```js
d3.json("https://interactives.dallasnews.com/2017/some-path/your-file-name.json", function(error, data){
    // your code here
})
```

If you are using a populated event.json file, you can call:
```
(venv) $ lambda invoke -v
```

and it will run the `handler(event, context)` function.
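If you do want to read values out of event.json, a minimal sketch looks like this; the `url` key is just a made-up example, not something python-lambda requires:

```python
def handler(event, context):
    # `lambda invoke` parses event.json and passes it in as `event`
    url = event.get('url')
    print(url)


if __name__ == '__main__':
    # mimic the invoke call for quick local testing
    handler({'url': 'https://example.com/feed'}, None)
```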
If you aren't using the event.json (like I'm not in the example above), simply call:

```
(venv) $ python service.py
```
When you're ready to deploy, run:

```
(venv) $ lambda deploy
```
Navigate to Lambda in the AWS console and then:
- Configure any necessary environment variables (API keys, etc.).
- Triggers > Add trigger. I've selected CloudWatch Event because I'm going to ping a page myself and check for updates.
- Create a new rule. If you want it to fire at a certain time or at certain intervals, select "Schedule expression".
- Use a fancy cron expression and submit your trigger. (See the examples after this list.)
- Under the Configuration tab in Advanced Settings, you can set a timeout if you so choose.
- Hitting the "Test" button will run the scraper and populate the S3 bucket.
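For reference, CloudWatch schedule expressions come in two flavors, rate expressions and cron expressions. The schedules below are just placeholders:

```
rate(15 minutes)
cron(0 12 * * ? *)
```

The first fires every 15 minutes; the second fires once a day at 12:00 UTC. AWS cron expressions have six fields and always run in UTC.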