Patent Crawler

Patent Crawler for Talos-jobs

  1. A Talos-Job gets created.
  2. The Patent Crawler is triggered once (normally it runs once every day).
  3. The Patent Crawler queries the "company-patent-table" (not defined yet). This should be a table that maps companies to patent assignees; for example, Siemens Healthineers does not hold any patents itself, but Siemens Healthcare does (not required at the beginning).
  4. The Patent Crawler requests the URL with the correct company (the exact name and CSV filter are not yet defined), for example: Siemens.
    a. URL: see below.
    b. The result is a CSV.
  5. The Patent Crawler parses the results and compares the CSV entries with the "patent-link-table" (only patents from the last 5-10 years are added).
  6. "New patents" that are not yet in the "patent-link-table" are added, and the Patent Crawler then crawls them (a sketch of this step follows the list).
  7. A second Step Function for the "patent crawler" -> patent extractor -> extracts abstract, claims & descriptions and saves the results on S3.
  8. S3 trigger -> patent ML workflow.
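
A minimal sketch of steps 5 and 6, assuming the "patent-link-table" uses the patent link as its partition key; attribute names are placeholders and the 5-10 year date filter is omitted for brevity:

```ts
// Sketch: keep only CSV links that are not yet in the "patent-link-table" and store them.
// Table layout (partition key "patentLink") is an assumption, not a finalized schema.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patent-link-table";

export async function addNewPatentLinks(csvLinks: string[]): Promise<string[]> {
  const newLinks: string[] = [];
  for (const link of csvLinks) {
    // Check whether the link is already known.
    const existing = await ddb.send(new GetCommand({ TableName: TABLE, Key: { patentLink: link } }));
    if (!existing.Item) {
      // Unknown patent: remember it and hand it to the crawl/extract step.
      await ddb.send(new PutCommand({
        TableName: TABLE,
        Item: { patentLink: link, addedAt: new Date().toISOString() },
      }));
      newLinks.push(link);
    }
  }
  return newLinks;
}
```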

Patent Crawler Architecture

  • Step Functions
    1. Crawl the CSV and save the links in the table.
    2. Crawl the patent webpage, extract abstract and claims, and upload them to S3.
  • dynamo-table "patent-link-table" (maybe we can include it in the news-link-table)
  • dynamo-table "company-patent-table" mapping for companies -> not required at the beginning.
  • s3-bucket "single-document" saves the results as JSON under /patent/YY/MM/DD (see the upload sketch after this list).
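
A minimal sketch of the upload to the "single-document" bucket, assuming one JSON file per patent; the document fields follow step 7, and the file-name convention is an assumption:

```ts
// Sketch: store one extracted patent as JSON under patent/YY/MM/DD in the
// "single-document" bucket. The s3:ObjectCreated event on this key is what
// triggers the patent ML workflow (step 8).
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "single-document";

interface PatentDocument {
  patentId: string; // e.g. "US1234567B2" (hypothetical example)
  abstract: string;
  claims: string[];
  description: string;
}

export async function savePatentDocument(doc: PatentDocument): Promise<string> {
  const now = new Date();
  const yy = String(now.getUTCFullYear()).slice(-2);
  const mm = String(now.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(now.getUTCDate()).padStart(2, "0");
  const key = `patent/${yy}/${mm}/${dd}/${doc.patentId}.json`;
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: JSON.stringify(doc),
    ContentType: "application/json",
  }));
  return key;
}
```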

URL construction:

  const companyName = 'Siemens Healthineers'.replace(/ /g, '%2B')
  https://patents.google.com/xhr/query?url=assignee%3D${companyName}%26oq%3D${companyName}&exp=&download=true

URL for Siemens: https://patents.google.com/xhr/query?url=assignee%3DSiemens%26oq%3DSiemens&exp=&download=true
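
A minimal sketch that builds this URL and downloads the CSV (step 4); the CSV layout and the "result link" column name are assumptions and should be checked against a real download:

```ts
// Sketch: download the Google Patents CSV for one company and extract the patent links.
// Requires Node 18+ for the global fetch.
export async function downloadPatentLinks(company: string): Promise<string[]> {
  const companyName = company.replace(/ /g, "%2B");
  const url = `https://patents.google.com/xhr/query?url=assignee%3D${companyName}%26oq%3D${companyName}&exp=&download=true`;

  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`CSV download failed with status ${res.status}`);
  }
  const csv = await res.text();

  // Naive CSV handling: assume the first line is the search URL, the second line the header.
  const lines = csv.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length < 3) return [];
  const header = lines[1].split(",");
  const linkIndex = header.findIndex((h) => h.trim().toLowerCase() === "result link");
  if (linkIndex < 0) throw new Error("Column 'result link' not found (check the CSV format)");
  return lines.slice(2).map((l) => l.split(",")[linkIndex]).filter(Boolean);
}
```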

Comment from @philschmid (author):

Crawling the CSV can lead to an error if too many requests are sent too quickly.

On error, wait 10 seconds and try again (see the retry sketch below).
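
A minimal retry sketch for this; the fixed 10-second wait is from the comment above, while the number of retries is an assumption:

```ts
// Sketch: retry an async operation with a fixed 10-second wait between attempts.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await sleep(10_000); // wait 10 seconds and try again
    }
  }
}

// Usage, e.g. around the CSV download:
// const links = await withRetry(() => downloadPatentLinks("Siemens"));
```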
