- Talos-Job gets created
- Patent Crawler gets triggered once (normally it runs once every day)
- Patent Crawler queries the "company-patent-table" (not defined yet). This should be a table that maps companies to patent assignees; for example, Siemens Healthineers doesn't hold any patents itself, but Siemens Healthcare does (not required at the beginning).
- Patent Crawler requests the URL with the correct company name (the exact name is not defined yet; filtering happens on the CSV), for example: Siemens
a. URL: see construction below
b. The result is a CSV. The Patent Crawler parses it and compares the entries against the "patent-link-table" (only patents from the last 5-10 years are added).
- "New patents" that are not yet in the "patent-link-table" are added, and the Patent Crawler crawls them (see the diff sketch after this list).
- Second Step Function for the "patent crawler" -> patent extractor -> extracts abstract, claims & description and saves the results to S3.
- S3 trigger -> patent ML workflow.
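A minimal sketch of the diff step described above, assuming the "patent-link-table" uses a `patentId` partition key and that the CSV has already been parsed into rows with a link and a publication date (the row shape and key schema are assumptions; the table is not defined yet):

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patent-link-table";

interface PatentRow {
  patentId: string;        // assumed: publication number from the CSV
  link: string;            // assumed: result link from the CSV
  publicationDate: string; // assumed: ISO date string
}

// Stores rows that are not yet in the table and returns their links,
// skipping anything older than ~10 years (the "last 5-10 years" rule above).
async function diffAndStoreLinks(rows: PatentRow[]): Promise<string[]> {
  const cutoff = new Date();
  cutoff.setFullYear(cutoff.getFullYear() - 10);

  const newLinks: string[] = [];
  for (const row of rows) {
    if (new Date(row.publicationDate) < cutoff) continue;

    // One lookup per row keeps the sketch simple; a batch read would be faster.
    const existing = await ddb.send(
      new GetCommand({ TableName: TABLE, Key: { patentId: row.patentId } })
    );
    if (existing.Item) continue; // already known, nothing to crawl

    await ddb.send(new PutCommand({ TableName: TABLE, Item: { ...row } }));
    newLinks.push(row.link); // handed off to the patent extractor
  }
  return newLinks;
}
```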
- Step Functions (see the state-machine sketch below)
  - crawl CSV and save links in the table
  - crawl the webpage, extract abstract & claims, and upload them to S3
- dynamo-table "patent-link-table" (maybe we can include it in the news-link-table)
- dynamo-table "company-patent-table" mapping for companies -> not required @ beginning.
- s3-bucket "single-document" saves results as JSON under /patent/YY/MM/DD (see the upload sketch below)
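A sketch of how the two crawler steps could be wired as a Step Functions state machine in CDK; the construct names are made up, and the two states are placeholders for the real Lambda tasks:

```ts
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";

const app = new cdk.App();
const stack = new cdk.Stack(app, "PatentCrawlerStack");

// Placeholder states for the two steps from the list above;
// in the real workflow these would be Lambda tasks.
const crawlCsv = new sfn.Pass(stack, "CrawlCsvAndSaveLinks");
const extractPatent = new sfn.Pass(stack, "CrawlPageAndUploadToS3");

new sfn.StateMachine(stack, "PatentCrawlerStateMachine", {
  definitionBody: sfn.DefinitionBody.fromChainable(crawlCsv.next(extractPatent)),
});

app.synth();
```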
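And a sketch of the extractor's upload under the /patent/YY/MM/DD prefix; the bucket name comes from the list above, while the document fields are assumptions:

```ts
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Assumed shape of a single extracted document.
interface PatentDocument {
  patentId: string;
  abstract: string;
  claims: string[];
  description: string;
}

async function saveSingleDocument(doc: PatentDocument): Promise<void> {
  const now = new Date();
  const yy = String(now.getFullYear()).slice(-2);
  const mm = String(now.getMonth() + 1).padStart(2, "0");
  const dd = String(now.getDate()).padStart(2, "0");
  const key = `patent/${yy}/${mm}/${dd}/${doc.patentId}.json`;

  await s3.send(new PutObjectCommand({
    Bucket: "single-document",
    Key: key,
    Body: JSON.stringify(doc),
    ContentType: "application/json", // lets the S3-triggered ML workflow parse it directly
  }));
}
```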
URL construction:
const companyName = 'Siemens Healthineers'.replace(/ /g, '%2B')
const url = `https://patents.google.com/xhr/query?url=assignee%3D${companyName}%26oq%3D${companyName}&exp=&download=true`
URL for Siemens:
https://patents.google.com/xhr/query?url=assignee%3DSiemens%26oq%3DSiemens&exp=&download=true
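Wrapped as a small reusable helper (a hypothetical function name; `%3D`, `%26`, and `%2B` are the double-encoded `=`, `&`, and `+` because the whole query sits inside the `url=` parameter):

```ts
// Builds the CSV download URL for a given assignee/company name.
function buildPatentCsvUrl(companyName: string): string {
  const encoded = companyName.replace(/ /g, "%2B"); // global replace, not just the first space
  return `https://patents.google.com/xhr/query?url=assignee%3D${encoded}%26oq%3D${encoded}&exp=&download=true`;
}

// buildPatentCsvUrl("Siemens Healthineers")
// buildPatentCsvUrl("Siemens") -> the Siemens URL above
```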
When crawling the CSV, an error can occur if too many requests are sent too quickly.
On error, wait 10 seconds and retry (see the retry sketch below).
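A minimal retry sketch for the CSV download following the rule above, assuming a runtime with a global `fetch` (Node 18+); the retry count is an assumption:

```ts
// Fetches the CSV, waiting 10 seconds and retrying when Google Patents
// rejects the request (e.g. when querying too fast or too often).
async function fetchCsvWithRetry(url: string, maxAttempts = 3): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(url);
    if (response.ok) return await response.text();
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, 10_000)); // wait 10 seconds, then retry
    }
  }
  throw new Error(`CSV download failed after ${maxAttempts} attempts: ${url}`);
}
```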