Patent Crawler

Patent Crawler for Talos-jobs

  1. A Talos-Job gets created.
  2. The Patent Crawler is triggered once (normally it runs once every day).
  3. The Patent Crawler queries the "company-patent-table" (not defined yet). This should be a table that maps companies to patent assignees; for example, Siemens Healthineers does not hold any patents itself, but Siemens Healthcare does (not required at the beginning).
  4. The Patent Crawler requests the URL with the correct company (the exact name and CSV filter are not yet defined), for example: Siemens.
    a. URL: see below.
    b. The result is a CSV.
  5. The Patent Crawler parses the results and compares the CSV entries with the "patent-link-table" (only patents from the last 5-10 years are added).
  6. "New patents" that are not yet in the "patent-link-table" are added, and the Patent Crawler then crawls them (a sketch of this step follows the list).
  7. A second Step Function for the "patent crawler" -> patent extractor -> extracts abstract, claims & descriptions and saves the results on S3.
  8. S3 trigger -> patent ML workflow.
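
A minimal sketch of steps 5 and 6, assuming the "patent-link-table" uses the patent link as its partition key; attribute names are placeholders and the 5-10 year date filter is omitted for brevity:

```ts
// Sketch: keep only CSV links that are not yet in the "patent-link-table" and store them.
// Table layout (partition key "patentLink") is an assumption, not a finalized schema.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patent-link-table";

export async function addNewPatentLinks(csvLinks: string[]): Promise<string[]> {
  const newLinks: string[] = [];
  for (const link of csvLinks) {
    // Check whether the link is already known.
    const existing = await ddb.send(new GetCommand({ TableName: TABLE, Key: { patentLink: link } }));
    if (!existing.Item) {
      // Unknown patent: remember it and hand it to the crawl/extract step.
      await ddb.send(new PutCommand({
        TableName: TABLE,
        Item: { patentLink: link, addedAt: new Date().toISOString() },
      }));
      newLinks.push(link);
    }
  }
  return newLinks;
}
```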

Patent Crawler Architecture

  • Step Functions
    1. Crawl the CSV and save the links in the table.
    2. Crawl the patent webpage, extract abstract and claims, and upload them to S3.
  • dynamo-table "patent-link-table" (maybe we can include it in the news-link-table)
  • dynamo-table "company-patent-table" mapping for companies -> not required at the beginning.
  • s3-bucket "single-document" saves the results as JSON under /patent/YY/MM/DD (see the upload sketch after this list).
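
A minimal sketch of the upload to the "single-document" bucket, assuming one JSON file per patent; the document fields follow step 7, and the file-name convention is an assumption:

```ts
// Sketch: store one extracted patent as JSON under patent/YY/MM/DD in the
// "single-document" bucket. The s3:ObjectCreated event on this key is what
// triggers the patent ML workflow (step 8).
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "single-document";

interface PatentDocument {
  patentId: string; // e.g. "US1234567B2" (hypothetical example)
  abstract: string;
  claims: string[];
  description: string;
}

export async function savePatentDocument(doc: PatentDocument): Promise<string> {
  const now = new Date();
  const yy = String(now.getUTCFullYear()).slice(-2);
  const mm = String(now.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(now.getUTCDate()).padStart(2, "0");
  const key = `patent/${yy}/${mm}/${dd}/${doc.patentId}.json`;
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: JSON.stringify(doc),
    ContentType: "application/json",
  }));
  return key;
}
```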

URL construction:

  const companyName = 'Siemens Healthineers'.replace(/ /g, '%2B')
  https://patents.google.com/xhr/query?url=assignee%3D${companyName}%26oq%3D${companyName}&exp=&download=true

URL for Siemens: https://patents.google.com/xhr/query?url=assignee%3DSiemens%26oq%3DSiemens&exp=&download=true
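
A minimal sketch that builds this URL and downloads the CSV (step 4); the CSV layout and the "result link" column name are assumptions and should be checked against a real download:

```ts
// Sketch: download the Google Patents CSV for one company and extract the patent links.
// Requires Node 18+ for the global fetch.
export async function downloadPatentLinks(company: string): Promise<string[]> {
  const companyName = company.replace(/ /g, "%2B");
  const url = `https://patents.google.com/xhr/query?url=assignee%3D${companyName}%26oq%3D${companyName}&exp=&download=true`;

  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`CSV download failed with status ${res.status}`);
  }
  const csv = await res.text();

  // Naive CSV handling: assume the first line is the search URL, the second line the header.
  const lines = csv.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length < 3) return [];
  const header = lines[1].split(",");
  const linkIndex = header.findIndex((h) => h.trim().toLowerCase() === "result link");
  if (linkIndex < 0) throw new Error("Column 'result link' not found (check the CSV format)");
  return lines.slice(2).map((l) => l.split(",")[linkIndex]).filter(Boolean);
}
```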

Comment from @philschmid (author):

Crawling the CSV can lead to an error if too many requests are sent too quickly.

On error, wait 10 seconds and try again (see the retry sketch below).
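
A minimal retry sketch for this; the fixed 10-second wait is from the comment above, while the number of retries is an assumption:

```ts
// Sketch: retry an async operation with a fixed 10-second wait between attempts.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await sleep(10_000); // wait 10 seconds and try again
    }
  }
}

// Usage, e.g. around the CSV download:
// const links = await withRetry(() => downloadPatentLinks("Siemens"));
```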
