Appreciate your work!
In the 3rd cell you reference `document`, which isn't defined in the two cells above. How did you get it?
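For anyone else wondering: the missing cell presumably parses the EDGAR full-submission .txt into a `document` dict keyed by filing type. A minimal sketch of that step, assuming the standard `<DOCUMENT>`/`<TYPE>` tag layout of EDGAR submission files (the URL and User-Agent header below are illustrative):

```python
import re
import requests

# Illustrative URL (Apple's FY2018 10-K full-submission file); the
# User-Agent header should be whatever identifies you to the SEC.
url = ('https://www.sec.gov/Archives/edgar/data/320193/'
       '000032019318000145/0000320193-18-000145.txt')
raw_10k = requests.get(url, headers={'User-Agent': 'your.name@example.com'}).text

# EDGAR wraps each embedded document in <DOCUMENT>...</DOCUMENT> tags,
# with a <TYPE> line naming the document type.
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')

doc_start_is = [m.end() for m in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [m.start() for m in doc_end_pattern.finditer(raw_10k)]
doc_types = [t[len('<TYPE>'):] for t in type_pattern.findall(raw_10k)]

# Keep only the 10-K body; `document['10-K']` is what the later cells use.
document = {}
for doc_type, start, end in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[start:end]
```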
I am able to get the code to work when I use the download from the SEC website; however, the SEC no longer allows mass loops to collect data (at least that is what I was told). I found a website that has every 10-K already downloaded, and I saved those onto my personal computer. When I use the code, I changed the requests call to an open-and-read of the local file, but now I'm getting the error "KeyError: 'item1a'". I've tried different variants such as "Item 1A." etc. with no luck. Is there another way to get this code to work using the SEC downloads? The downloads are from https://drive.google.com/drive/folders/1tZP9A0hrAj8ptNP3VE9weYZ3WDn9jHic. Thank you!
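A note on that KeyError: it usually means the item regex found no matches at all, so the deduplicated DataFrame simply has no `item1a` row. A small sketch for the local-file case, using the notebook's item pattern (reproduced from memory, verify against your copy) and a hypothetical file path, that fails loudly before the KeyError stage:

```python
import re

# Hypothetical local path -- adjust to wherever the downloaded 10-Ks live.
with open('10-Ks/AAPL_2018_10-K.txt', 'r', encoding='utf-8', errors='ignore') as f:
    raw_10k = f.read()

# IGNORECASE helps when the local copies use different casing than the
# EDGAR .txt files ("Item 1A" vs "ITEM 1A").
regex = re.compile(
    r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1A|1B|7A|7|8))',
    re.IGNORECASE)
matches = list(regex.finditer(raw_10k))

if not matches:
    # This is exactly the situation that later surfaces as KeyError: 'item1a'.
    # The downloaded copy may be rendered HTML or plain text without the
    # ">Item 1A" markup the pattern expects -- inspect the file to see how
    # its item headings are actually written.
    raise ValueError('No item headings matched; check the file format.')
```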
Hi Bill, thank you for sharing this. Have you had the chance to verify that this is a complete list of 10-Ks? Have you had the chance to validate the data? May I kindly ask which website it was? Happy to take it offline if you prefer. Thanks.
May I ask how to remove the footer text, "Apple Inc. | 2018 Form 10-K |", as well as the page numbers, from the generated text?
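One way to handle this, assuming the footer survives into the extracted text as a repeated literal string: strip it with a couple of regex passes after `get_text()`. The function name and patterns below are illustrative, not part of the original notebook:

```python
import re

def strip_footer(text):
    # Remove the literal footer, with or without a trailing page number.
    # The string is specific to this filing; generalize for other companies.
    text = re.sub(r'Apple Inc\. \| 2018 Form 10-K \|\s*\d*', '', text)
    # Remove lines that contain nothing but a page number.
    text = re.sub(r'^\s*\d{1,3}\s*$', '', text, flags=re.MULTILINE)
    return text
```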
Thanks for this! I've followed the steps to get the historic numeric data and made a free API in case anyone else wants the data for training AI, etc.
https://rapidapi.com/alexventisei2/api/sec-api2
I think the line below assumes the same number of matches for every item, which is not necessarily the case (NYT, for example: there are more Item 1A matches than Item 1B, and the approach breaks down). I would also add re.IGNORECASE to the re.compile.
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
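A possible workaround, building on the point above: rather than keeping the last match per item, drop everything up to the end of the table of contents and then keep the first match per item. The TOC-end heuristic here (the first occurrence of Item 8) is an assumption and may need adjusting per filing:

```python
# Assumes test_df has the notebook's columns 'item', 'start', 'end', with
# item values normalized to 'item1a', 'item1b', 'item7', 'item7a', 'item8'.
toc_end = test_df.loc[test_df['item'] == 'item8', 'start'].min()

# Everything after the TOC's last entry is the filing body, so the first
# remaining match of each item should be its real section heading.
body = test_df[test_df['start'] > toc_end]
pos_dat = (body.sort_values('start')
               .drop_duplicates(subset=['item'], keep='first')
               .sort_values('start'))
```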
This was very helpful, thank you for taking the time to post this
Amazing! Thanks for sharing.
I have the HTML URL of a 10-K filing, but I don't know how to get the .txt URL; once I have that, I can use the notebook code above.
Can anyone help me, please?
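If the HTML URL is a standard EDGAR document link, the full-submission .txt URL can usually be derived from it: documents sit under /Archives/edgar/data/&lt;CIK&gt;/&lt;accession-without-dashes&gt;/, and the complete submission file in that directory is the dashed accession number plus .txt. A sketch, assuming that layout:

```python
import re

def html_to_full_submission_txt(html_url):
    """Derive the full-submission .txt URL from an EDGAR document URL."""
    m = re.match(r'(https://www\.sec\.gov/Archives/edgar/data/\d+/(\d{18}))/',
                 html_url)
    if not m:
        raise ValueError('Not a recognized EDGAR document URL')
    base, acc = m.groups()
    # Re-insert the dashes: 10 digits, 2 digits, 6 digits.
    accession = f'{acc[:10]}-{acc[10:12]}-{acc[12:]}'
    return f'{base}/{accession}.txt'

# Example (Apple's FY2018 10-K):
# html_to_full_submission_txt(
#     'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm')
# -> 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt'
```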
Jesus, you saved my life!
@Onapmek - I always thought that as accurate as RegEx is, it is also sensitive, which causes reliability issues. Can the RegEx approach be made more reliable?
Secondly, some of these "Items" contain tables. I wonder how the text can be extracted while excluding the tables? (One possible approach is sketched below.)
@anshoomehra - could you elaborate on this?
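Regarding the table question: the notebook just calls `get_text()` on the item slice, so tables come through as text. One common approach, sketched here with an assumed digits-vs-letters heuristic, is to decompose digit-heavy `<table>` elements before extracting text:

```python
from bs4 import BeautifulSoup

def item_text_without_tables(item_html):
    """Drop financial tables from an item's HTML before extracting text."""
    soup = BeautifulSoup(item_html, 'lxml')
    for table in soup.find_all('table'):
        text = table.get_text()
        digits = sum(ch.isdigit() for ch in text)
        letters = sum(ch.isalpha() for ch in text)
        # Assumed threshold: digit-heavy tables are treated as financial
        # tables and removed; text-heavy tables (often used for layout)
        # are kept so no prose is lost.
        if digits > letters:
            table.decompose()
    return soup.get_text('\n')
```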