@jaimeiniesta
Created February 4, 2012 10:47
W3Clove RESTful API draft
This is a draft of the upcoming W3Clove RESTful API.
For now it lets you submit a sitemap or webpage URL for validation, see the results, and ask for a re-check later.
It doesn't support user authentication yet, so you can't manage your list of sitemaps as you can on the web site.
URI params will be passed URL-encoded; here they appear without encoding for legibility.
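As an illustration of the encoding note above, here is a minimal Python sketch of how a client might build the URL for a sitemap lookup. The host name `w3clove.example` is a placeholder (the draft does not name a host), and `sitemap_data_url` is a hypothetical helper, not part of the API.

```python
from urllib.parse import urlencode

# Placeholder host: the draft does not say where the API will live.
BASE = "https://w3clove.example"


def sitemap_data_url(sitemap_url: str) -> str:
    """Build the GET URL for a sitemap, URL-encoding the `url` parameter."""
    query = urlencode({"url": sitemap_url})
    return f"{BASE}/api/sitemaps?{query}"


print(sitemap_data_url("http://example.com/sitemap.xml"))
```

The `:` and `/` characters of the submitted URL end up percent-encoded in the query string, which is why the examples in this draft show them unencoded.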
Single entry point
==================
Shows the entry points for sitemap and webpage submissions:
POST /api/sitemaps
POST /api/web_pages
Sitemap submission
==================
# POST /api/sitemaps
# params:
* url=http://example.com/sitemap.xml
Creates the sitemap and returns the URL where you can get the resource
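A submission could be prepared from Python roughly as follows. This is only a sketch: the host is a placeholder, `build_sitemap_submission` is a hypothetical helper, and sending the params as a form-encoded body is an assumption, since the draft does not say whether they travel in the body or in the query string. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder host: the draft does not name one.
BASE = "https://w3clove.example"


def build_sitemap_submission(sitemap_url: str) -> Request:
    """Prepare (but do not send) a POST /api/sitemaps request."""
    # Assumption: params are sent as a form-encoded request body.
    body = urlencode({"url": sitemap_url}).encode("ascii")
    return Request(f"{BASE}/api/sitemaps", data=body, method="POST")


req = build_sitemap_submission("http://example.com/sitemap.xml")
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would send it; the response is expected to carry the URL of the created resource, in a form the draft does not specify yet.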
Sitemap data
============
# GET /api/sitemaps?url=http://example.com/sitemap.xml
Returns the sitemap data
Sitemap rescraping
==================
# POST /api/sitemaps
# params:
* url=http://example.com/sitemap.xml
* reprocess=true
Asks for re-scraping of the sitemap, resetting its state so it will be re-scraped
Webpage submission
==================
# POST /api/web_pages
# params:
* url=http://example.com
Creates the webpage and returns the URL where you can get the resource
Webpage data
============
# GET /api/web_pages?url=http://example.com
Returns the webpage data
Webpage revalidation
====================
# POST /api/web_pages
# params:
* url=http://example.com
* reprocess=true
Asks for a re-validation of the webpage, resetting its state so it will be re-validated
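The re-scraping and re-validation requests differ from the plain submissions only by the extra `reprocess=true` param. A small Python sketch of that difference, with `form_body` as a hypothetical helper and the form-encoded body as an assumption about how params are sent:

```python
from urllib.parse import urlencode

# Entry points as listed in this draft.
PATHS = {"sitemap": "/api/sitemaps", "web_page": "/api/web_pages"}


def form_body(target_url: str, reprocess: bool = False) -> str:
    """Encode the params for a submission or a reprocess request."""
    params = {"url": target_url}
    if reprocess:
        # Same entry point as a submission, plus the reprocess flag.
        params["reprocess"] = "true"
    return urlencode(params)


print("POST", PATHS["web_page"], form_body("http://example.com"))
print("POST", PATHS["web_page"], form_body("http://example.com", reprocess=True))
```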
A sitemap can show this information:
* url, text, like: "http://example.com/sitemap.xml"
* status, string, can be one of:
  - scraping # sitemap has been created in the database and is on the scraping queue
  - scraping_failed # scraping could not be completed
  - validating # some webpages of this sitemap are pending validation
  - validated_partially # sitemap validation has finished but some of its webpages could not be validated
  - validated # sitemap validation has finished and all its webpages could be validated
* web_pages_count, integer
* web_pages, array of its scraped web_pages with basic info and links to them
* web_pages_pending_validation_count, integer
* validation_errors_count, integer, sum of all validation errors of its web_pages
* validation_warnings_count, integer, sum of all validation warnings of its web_pages
* validation_errors, array of errors found for all its web_pages. Each entry in the array will contain:
  - message_id, string, identifies the error type
  - text, string, explains the error
  - times, integer, how many times this error is found on the scraped web_pages of this sitemap
* validation_warnings, array similar to validation_errors but referring to warnings reported in the validation
* created_at, datetime
* updated_at, datetime
* scraped_at, datetime
There should also be a way to get the web_pages that have each particular error and warning.
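To show how a client might consume the sitemap fields listed above, here is a Python sketch that works on an already-decoded sitemap. It assumes the data arrives as a dictionary keyed by the field names (the draft does not fix a serialization format). Treating `scraping_failed`, `validated_partially` and `validated` as final states is a reading of the status descriptions, not something the draft states, and the `message_id` values are invented for the example.

```python
# Status values copied from the draft; the split into "in progress" and
# "finished" is an interpretation of their descriptions.
IN_PROGRESS = {"scraping", "validating"}
FINISHED = {"scraping_failed", "validated_partially", "validated"}


def is_finished(sitemap: dict) -> bool:
    """Return True once no further scraping or validation is pending."""
    status = sitemap["status"]
    if status not in IN_PROGRESS | FINISHED:
        raise ValueError(f"unknown sitemap status: {status!r}")
    return status in FINISHED


def most_frequent_errors(sitemap: dict, limit: int = 3) -> list:
    """Return (message_id, times) pairs, most frequent first."""
    errors = sorted(sitemap["validation_errors"],
                    key=lambda e: e["times"], reverse=True)
    return [(e["message_id"], e["times"]) for e in errors[:limit]]


# Made-up sample data using the field names from the list above.
sample = {
    "url": "http://example.com/sitemap.xml",
    "status": "validating",
    "validation_errors": [
        {"message_id": "a", "text": "first error", "times": 2},
        {"message_id": "b", "text": "second error", "times": 7},
    ],
}
print(is_finished(sample), most_frequent_errors(sample))
```

A polling client could call `is_finished` after each GET and stop once it returns True.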
A web_page can show this information:
* url, text, like: "http://example.com"
* status, string, can be one of:
  - validating
  - validation_failed
  - validated
* validation_errors_count, integer
* validation_warnings_count, integer
* validation_errors, array of errors reported in the validation. Each entry in the array will contain:
  - message_id, string, identifies the error type
  - text, string, explains the error
  - times, integer, how many times this error is found on the web_page
  - lines, array of integers, line numbers where this error is found on the web_page
* validation_warnings, array similar to validation_errors but referring to warnings reported in the validation
* created_at, datetime
* updated_at, datetime
* validated_at, datetime
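The `lines` field makes it possible to regroup a web_page's errors by source line. A Python sketch, assuming the web_page has been decoded into a dictionary with the field names above (the serialization format is not fixed by this draft); `errors_by_line` is a hypothetical helper and the `message_id` values are invented.

```python
def errors_by_line(web_page: dict) -> dict:
    """Map each line number to the message_ids reported on it, in line order."""
    by_line = {}
    for error in web_page["validation_errors"]:
        for line in error["lines"]:
            by_line.setdefault(line, []).append(error["message_id"])
    return dict(sorted(by_line.items()))


# Made-up sample data using the field names from the list above.
page = {
    "url": "http://example.com",
    "status": "validated",
    "validation_errors": [
        {"message_id": "e1", "text": "first error", "times": 2, "lines": [10, 3]},
        {"message_id": "e2", "text": "second error", "times": 1, "lines": [3]},
    ],
}
print(errors_by_line(page))
```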