Disclaimer: everything described in this document is my personal opinion; it doesn't have to be true for everyone.
On a first read it's very hard to keep in mind all the variables defined in the Key Information section.
It would be nice to give a brief description of the producer-consumer pattern.
- What is it? A pattern of writing multiple spiders to crawl a website.
- Why do we tend to use it? It simplifies spider logic, helps to achieve concurrency, makes the crawling process more robust, etc.
- Is it mandatory to implement both of them? Yes/no, and why. (A minimal sketch of the split follows right after this list.)
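To make the description concrete, something like the following could even go into the guide. This is only my rough sketch of the split, not the project's actual code: the selectors, URLs, and the local file standing in for HCF are all placeholders.

```python
# Sketch of the producer/consumer split, assuming a plain Scrapy project.
# In the real workflow the shared request store is HCF; a local file stands
# in for it here, just to show how responsibilities are divided.
import scrapy


class ApartmentsProducer(scrapy.Spider):
    """Discovers detail-page URLs (e.g. apartment listings) and hands them off."""
    name = "apartments-producer"
    start_urls = ["https://website.org/apartments/"]  # placeholder domain from the example above

    def parse(self, response):
        for href in response.css("a.listing::attr(href)").getall():
            # In production this would be pushed to an HCF slot instead of a file.
            with open("frontier.txt", "a") as fp:
                fp.write(response.urljoin(href) + "\n")
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


class ApartmentsConsumer(scrapy.Spider):
    """Reads previously discovered URLs and extracts the actual items."""
    name = "apartments-consumer"

    def start_requests(self):
        # In production these requests would be read from HCF batches.
        with open("frontier.txt") as fp:
            for url in fp:
                yield scrapy.Request(url.strip(), callback=self.parse_item)

    def parse_item(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }
```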
It may be a good practice to provide a real example (here is a website, let `DATASET_FULL` be `website.org`; here we're crawling for apartments, so let `collection` be `apartments`; ...).
A quick intro to the frontier (i.e. `HCF`) is highly welcome: links to its API, links to the wiki page about what it is. Describe what it is and why/how we use it. What is `SLOT_PREFIX`?
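My own (possibly wrong) understanding, assuming the spiders use the open-source `hcf-backend` for Scrapy Frontera: the frontier is split into numbered slots, and `SLOT_PREFIX` is the common part of their names, so a producer can spread requests over `<prefix>0 ... <prefix>N` and each consumer job reads one slot. A settings sketch (setting names are from that package as I recall them, values are placeholders, and the scrapy-frontera wiring is omitted):

```python
# Illustrative settings only, assuming the hcf-backend Scrapy Frontera backend.
HCF_AUTH = "<scrapinghub apikey>"
HCF_PROJECT_ID = "12345"

# Producer side: requests are distributed across N slots named
# "<prefix>0", "<prefix>1", ... so several consumers can run in parallel.
HCF_PRODUCER_FRONTIER = "apartments-frontier"
HCF_PRODUCER_SLOT_PREFIX = "apartments"      # this is what SLOT_PREFIX refers to
HCF_PRODUCER_NUMBER_OF_SLOTS = 8

# Consumer side: each consumer job reads one concrete slot.
HCF_CONSUMER_FRONTIER = "apartments-frontier"
HCF_CONSUMER_SLOT = "apartments0"
```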
Every time I meet a statement like 'do this', the question 'why?' arises in my head. Some quick info and, if possible, a link to a detailed explanation would be highly appreciated.
- Should the Bitbucket repo be private or not?
- The Create dataset section is a bit outdated: the `price_modifier` field is not mentioned and it's mandatory.
- "Name the project `ds-DATASET`". Why? For convenience and consistency: all dataset projects are prefixed with `ds-` for some reason.
- "Use cookiecutter to instantiate the spider template source code". Why? Because we have a nice template/wizard with many things already done; it will simplify your dev process and help you avoid many errors.
- Spider managers. What are they? Why do we need them? A link to Managers and the Development Protocol is welcome here, as well as a very brief description.
- Kafka, first mention. What is it? Why do we use it? How do we use it? What particular component uses those `kafka.topic` settings? (A generic illustration follows after this list.)
- Monitoring settings. Why? What if I don't set them?
- Dumpers. Same as Managers.
- Local testing. Describe each command; maybe it's good to mention how it's possible to achieve the same with the Scrapinghub API. Example: "Confirm the producer created requests": either use the convenient script from the `ds-dash-scripts` package or test it with the `HCF` API with `curl ...` (see the sketch after this list).
- Cloud testing. It would be really nice to explain what that entry in `dsclients/clients.json` means, how it is used, by what tools, etc.
- Deployment. A bit outdated since the kumo transition.
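On the Kafka point above: purely as a generic illustration of what a `kafka.topic`-style setting controls (not necessarily how our dumpers actually use it), with the `kafka-python` package a topic is just a named stream you publish serialized items to:

```python
# Generic Kafka illustration; topic name and broker address are hypothetical.
import json
from kafka import KafkaProducer

TOPIC = "apartments-items"  # hypothetical value of a kafka.topic-style setting

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda item: json.dumps(item).encode("utf-8"),
)

# Whatever component owns the kafka.topic setting would publish items like this,
# and downstream consumers subscribed to the same topic would pick them up.
producer.send(TOPIC, {"url": "https://website.org/apartments/1", "price": 1200})
producer.flush()
producer.close()
```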
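And on the local-testing point: I don't know which exact `curl` command the guide intends, but as far as I know the HCF HTTP API can be queried directly to confirm the producer created requests. A sketch with `requests` (endpoint shape per the Scrapinghub frontier API docs; all identifiers are placeholders, and the exact parameters should be double-checked against those docs):

```python
# Read a frontier slot's queue over HTTP; a non-empty response confirms
# the producer actually enqueued requests.
import requests

APIKEY = "<scrapinghub apikey>"
PROJECT = "12345"
FRONTIER = "apartments-frontier"
SLOT = "apartments0"

url = f"https://storage.scrapinghub.com/hcf/{PROJECT}/{FRONTIER}/s/{SLOT}/q"
resp = requests.get(url, auth=(APIKEY, ""))
resp.raise_for_status()
print(resp.text)
```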
As I have written previously, inferring the schema after the spiders run doesn't make sense to me. What if I make a mistake in a spider, forget about some field, or make a typo? The schema would be wrong. My suggestion here is to write the schema by hand, alongside the design document.
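To make that concrete: for the apartments example, a hand-written schema could live next to the design document and sample items could be validated against it. A minimal sketch (field names are hypothetical except `price_modifier`, which the guide already requires):

```python
# A hand-written schema kept alongside the design document.
from jsonschema import validate

APARTMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "title": {"type": "string"},
        "price": {"type": "number"},
        "price_modifier": {"type": "string"},  # the mandatory field the guide forgot to mention
    },
    "required": ["url", "title", "price", "price_modifier"],
}

# Validate a sample item produced by the spider against the schema,
# instead of inferring the schema from whatever the spider happened to output.
sample_item = {
    "url": "https://website.org/apartments/1",
    "title": "Two-bedroom apartment",
    "price": 1200,
    "price_modifier": "per_month",
}
validate(instance=sample_item, schema=APARTMENT_SCHEMA)
```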
Also, it would be nice to explain what `./bin/update_dataset_schemas.py` does, and whether it's possible to do the same with the web UI.
I've been told on #dataservices-development that `./bin/check_spiders.py` doesn't have to be run locally and is intended for cloud usage. However, it's still in the guide. But (!) I found it really useful for testing locally before doing any cloud test.