Created September 17, 2014 15:07
ComPort Architecture draft
Architecture description
-
Aggregator
  Orchestrates the overall process, from fetching to updating the DB:
  schedules and manages aggregator jobs and their stages across the other modules.
  does:
    fetch :all | :latest - accepts a block with a strategy to determine what counts as latest
    jobs :all | :current - (AggregationJob - status, stop, pause, resume)
    update_job_status - hook to be called by other modules
  knows:
    resource_type - :forum | :blog
    resource_url
    resource_download_schema - how to download sections/categories/topics/pages with posts
    resource_parsing_schema - how to extract posts and their attributes from downloaded data
    resource_normalize_schema - how to map parsed data to the internal DB schema
    resource_validation_schema - how to validate parsed data
    strategy - action on errors, callbacks
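The Aggregator interface above could be sketched roughly as follows. This is only an illustration of the does/knows split: the AggregationJob fields, the explicit job_id argument to update_job_status, and the reading of :current as "most recently created job" are assumptions, not part of the draft.

```ruby
# Hypothetical sketch of the Aggregator interface; names not in the draft are assumptions.
class AggregationJob
  attr_reader :id, :stages

  def initialize(id)
    @id = id
    @stages = {} # per-stage status, e.g. { download: 'ok' }
  end
end

class Aggregator
  attr_reader :settings

  def initialize(settings)
    @settings = settings
    @jobs = {}
    @next_id = 0
  end

  # fetch :all | :latest; the optional block would hold the "latest" strategy.
  # Returns the id of the new aggregation job.
  def fetch(scope = :all, &latest_strategy)
    job = AggregationJob.new(@next_id += 1)
    @jobs[job.id] = job
    # A real implementation would hand job.id to the Downloader here.
    job.id
  end

  # jobs :all | :current (assumed here: :current is the most recent job).
  def jobs(scope = :all)
    scope == :current ? @jobs.values.last(1) : @jobs.values
  end

  # Hook called by other modules to report the status of their stage.
  def update_job_status(job_id, stage_status)
    @jobs.fetch(job_id).stages.merge!(stage_status)
  end
end
```

Keeping the per-stage statuses on the aggregation job gives the other modules a single place to report to, which is what the update_job_status hook is for.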
Downloader
  Executes download jobs: iterates over the provided download_schema,
  collecting pages with posts.
  Updates download_job status.
  does:
    get :job_id, :all | :options (options restrict the download)
    jobs :all | :current - (DownloadJob - status, stop, pause, resume)
  knows:
    download_schema - supplied by Aggregator
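Every module exposes jobs with the same four operations (status, stop, pause, resume), so a shared base class is one possible way to model them. A minimal sketch, assuming a hypothetical Job base class and three states (:running, :paused, :stopped) that the draft does not name:

```ruby
# Hypothetical base class for DownloadJob, ParseJob, etc.; states are assumptions.
class Job
  # action => { state it applies to => resulting state }
  TRANSITIONS = {
    pause:  { running: :paused },
    resume: { paused: :running },
    stop:   { running: :stopped, paused: :stopped }
  }.freeze

  attr_reader :id, :status

  def initialize(id)
    @id = id
    @status = :running
  end

  TRANSITIONS.each do |action, moves|
    define_method(action) do
      # Actions that do not apply to the current state leave it unchanged.
      @status = moves.fetch(@status, @status)
      self
    end
  end
end

class DownloadJob < Job; end
```

With the transitions in one table, a stopped job cannot be resumed by accident, and each module only has to subclass Job.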
Parser
  Extracts posts and their attributes (category, topic, title, author, datetime)
  from the download batch.
  Updates parser_job status.
  does:
    parse :job_id - extracts posts and attributes from the job_id batch. Handles
      links, images, emoji, etc.
    jobs :all | :current - (ParseJob - status, stop, pause, resume)
  knows:
    parsing_schema - supplied by Aggregator
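To illustrate what a parsing_schema might drive, here is a minimal sketch that extracts posts from a batch of downloaded pages. The schema layout (one regex per post and per attribute) and the HTML shape are assumptions made for the example; a real parser would more likely use a proper HTML parser with CSS selectors.

```ruby
# Hypothetical parsing schema: a regex for a whole post plus one per attribute.
PARSING_SCHEMA = {
  post: %r{<div class="post">(.*?)</div>}m,
  fields: {
    title:    %r{<h2>(.*?)</h2>},
    author:   %r{<span class="author">(.*?)</span>},
    datetime: %r{<time>(.*?)</time>}
  }
}.freeze

# Returns one hash per post found in the batch; missing attributes are nil.
def parse_batch(pages, schema)
  pages.flat_map do |html|
    html.scan(schema[:post]).map do |(body)|
      schema[:fields].transform_values { |re| body[re, 1] }
    end
  end
end
```

Leaving missing attributes as nil, rather than raising, keeps the decision about incomplete posts with the Validator.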
Normalizer
  Maps parsed data to the DB schema.
  Updates normalizer_job status.
  does:
    normalize :job_id - maps the data in the job_id batch to the DB schema
    jobs :all | :current - (NormalizerJob - status, stop, pause, resume)
  knows:
    normalize_schema - supplied by Aggregator
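A normalize_schema could be as simple as a mapping from parsed attribute to DB column, with an optional conversion. The sketch below assumes hypothetical column names (subject, author_name, posted_at) and conversions; none of them come from the draft.

```ruby
require 'time'

# Hypothetical normalize schema: parsed field => target column and optional conversion.
NORMALIZE_SCHEMA = {
  title:    { column: :subject },
  author:   { column: :author_name, convert: ->(v) { v.strip.downcase } },
  datetime: { column: :posted_at,   convert: ->(v) { Time.parse(v).utc } }
}.freeze

# Builds a DB-ready row from one parsed post; absent fields become nil columns.
def normalize(post, schema)
  schema.each_with_object({}) do |(field, rule), row|
    value = post[field]
    value = rule[:convert].call(value) if rule[:convert] && !value.nil?
    row[rule[:column]] = value
  end
end
```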
Validator
  Validates parsed data and marks it OK/Check.
  Updates validator_job status.
  does:
    validate :job_id - validates the data in the job_id batch
    jobs :all | :current - (ValidatorJob - status, stop, pause, resume)
  knows:
    validation_schema - supplied by Aggregator, :manual | :strategy
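The OK/Check marking could work along these lines: each attribute gets a rule, and a post is marked :check as soon as any rule fails. The rules and the result shape are assumptions for illustration only.

```ruby
# Hypothetical validation schema: one predicate per attribute.
VALIDATION_SCHEMA = {
  title:    ->(v) { v.is_a?(String) && !v.strip.empty? },
  author:   ->(v) { v.is_a?(String) && !v.strip.empty? },
  datetime: ->(v) { v.is_a?(String) && v.match?(/\A\d{4}-\d{2}-\d{2}/) }
}.freeze

# Marks a post :ok when every rule passes, otherwise :check with the failing fields.
def validate(post, schema)
  failed = schema.reject { |field, rule| rule.call(post[field]) }.keys
  { status: failed.empty? ? :ok : :check, failed: failed }
end
```

Returning the failing fields alongside the mark gives a manual reviewer something concrete to check.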
Execution flow
-
1.
  a) Aggregator.new settings = {
       :type => :forum,
       :url => 'www.forum.com',
       :schemas => {},
       :strategy => :pause_on_error,
     }
  b) Aggregator.fetch :all
  c) returns aggregator job_id
  d) calls Downloader
2.
  a) Downloader.get :job_id, settings[:download_schema] = {
       :section_url => '/threads?s=',
       :sections_range => '2..section_end',
       :section_end => "find_css('.section').last",
       :category_page => 'c=',
       :categories_range => '1..categories_end',
       :categories_end => "find_css('.categories-page').last",
       :topics_page_url => 'page=',
       :topics_pages_range => '1..topics_pages_end',
       :topics_pages_end => "find_css('.topics-page').last",
       :topic_url => 'topic=',
       :topic_no => "topics_css('.topic').last",
       :posts_page_url => 'page=',
       :posts_pages_range => '1..posts_pages_end',
       :posts_pages_end => "find_css('.posts-page').last"
     }
  b) On success - call Aggregator.update_job_status(download: 'ok')
     On failure - call Aggregator.update_job_status(download: error)
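Step 2 can be sketched in two parts: turning a range string such as '2..section_end' into concrete URLs, and reporting success or failure back through a callback. The helper names (resolve_range, section_urls, run_download), the lookups hash standing in for the CSS queries, and the notify callable standing in for Aggregator.update_job_status are all assumptions.

```ruby
# Resolve a range string such as '2..section_end' into a Ruby Range.
# `lookups` maps symbolic bounds to numbers; in the draft these would come
# from CSS queries on the downloaded page (not implemented here).
def resolve_range(spec, lookups)
  first, last = spec.split('..').map do |bound|
    bound.match?(/\A\d+\z/) ? bound.to_i : lookups.fetch(bound.to_sym)
  end
  first..last
end

# Build the section URLs described by the schema.
def section_urls(base_url, schema, lookups)
  resolve_range(schema[:sections_range], lookups).map do |n|
    "#{base_url}#{schema[:section_url]}#{n}"
  end
end

# Fetch every URL via the given block, then report the outcome via `notify`.
def run_download(job_id, urls, notify)
  pages = urls.map { |url| yield url }
  notify.call(job_id, { download: 'ok' })
  pages
rescue StandardError => e
  notify.call(job_id, { download: e.message })
  nil
end
```

Passing the fetch step as a block keeps the sketch free of network code; in practice the block would perform the HTTP request.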
3.
  a) Parser.parse :job_id, settings[:parsing_schema] - triggered by the job status change event for :downloader
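The "on job status change event" trigger suggests a small publish/subscribe mechanism between stages. One possible shape, assuming a hypothetical JobEvents class that the draft does not define:

```ruby
# Hypothetical event hub: stages subscribe to status changes of earlier stages.
class JobEvents
  def initialize
    @handlers = Hash.new { |hash, stage| hash[stage] = [] }
  end

  # Register a handler for status changes of the given stage.
  def on(stage, &handler)
    @handlers[stage] << handler
  end

  # Notify every handler registered for the stage.
  def publish(stage, job_id, status)
    @handlers[stage].each { |handler| handler.call(job_id, status) }
  end
end
```

The parser would then subscribe once, for example `events.on(:downloader) { |job_id, status| start_parsing(job_id) if status == 'ok' }`, where start_parsing is a placeholder for Parser.parse.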
https://github.com/tschellenbach/feedly