anshoomehra/parsing10k.ipynb

Last active November 25, 2025 00:38


How to Parse 10-K Report from EDGAR (SEC)


Taram1980 commented Feb 4, 2021

This was very helpful. I tried it on several example filings, and one of them had a problem getting the correct entry for
Item 7: https://www.sec.gov/Archives/edgar/data/1530721/000153072120000062/0001530721-20-000062.txt
It looks as if it only identifies one entry for Item 7 (in the table of contents, not in the body). Any suggestions?
@pucek80:

To fix this problem, change the regular expression in cell 8 to:
regex = re.compile(r'(>(Item|ITEM)(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})')

What about this:
https://www.sec.gov/Archives/edgar/data/40545/000004054520000009/0000040545-20-000009.txt

lkcao commented May 19, 2021

thanks sooooooo much. This is driving me crazy and you save my ass.

janlukasschroeder commented Jun 18, 2021

you could use the query API from SEC API to batch retrieve 10Ks, then use the render API to download the filings and add your script to extract the data. awesome workflow!

dtelljo commented Aug 18, 2021

I am able to get the code to work when I use the download from the SEC website; however, the SEC no longer allows scripts to loop over filings and collect data in bulk (at least that is what I was told). I found a website which has every 10-K downloaded, and I saved those onto my personal computer. When I use the code, I changed the requests call to opening and reading a file, but now I'm getting the error "KeyError: 'item1a'". I've tried different versions such as "Item 1A." etc. with no luck. Is there another way to get this code to work using SEC downloads? Downloads are from https://drive.google.com/drive/folders/1tZP9A0hrAj8ptNP3VE9weYZ3WDn9jHic. Thank you!

janlukasschroeder commented Sep 28, 2021

Can also be done with the item extraction API now.

from sec_api import ExtractorApi

extractorApi = ExtractorApi("YOUR_API_KEY")

# Tesla 10-K filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"

# get the standardized and cleaned text of section 1A "Risk Factors"
section_text = extractorApi.get_section(filing_url, "1A", "text")

# get the original HTML of section 7 "Management’s Discussion and Analysis of Financial Condition and Results of Operations"
section_html = extractorApi.get_section(filing_url, "7", "html")

print(section_text)
print(section_html)

Docs: https://sec-api.io/docs/sec-filings-item-extraction-api

RudeFerret commented Mar 14, 2022

Hey, thanks for the code. It's wonderful.

One question, I can get information for most sections. However, for Item 1 (business section), I can't seem to get the information.

item_1_raw = document['10-K'][pos_dat['start'].loc['item1']:pos_dat['start'].loc['item1a']]

I receive a NoneType back. Any ideas?

marcelinochamon commented Mar 22, 2022

I am having the same problem. Would be great to have any idea.

sash236 commented Apr 14, 2022

@Onapmek and @marcelinochamon
To fix NoneType, see below - include headers
response = requests.get(url, headers={'User-Agent': 'Mozilla'})

or better a longer one:
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

RudeFerret commented Apr 15, 2022 (edited)

@sash236 This unfortunately doesn't fix the issue for me. I already used a header to successfully retrieve the 10k text file.
I can get sections explained in the example given by OP, but I can't seem to retrieve section 1 itself.

I don't know enough regex to fix the issue, but what I noticed is that the start position for Item 1 is extremely off:

# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat

Outcome:

item     start     end
item2    198174    198182
item1    2360825   2360832

RudeFerret commented Apr 15, 2022 (edited)

@Onapmek Ok, I found out what was going on. When the code looks for Item 1, it also picks up Item 11, Item 12, etc. So it took the position of the last Item 10+ it found, which of course comes after Item 1A or Item 2, and the slice returned None. I have an ugly fix: for the position of Item 1, I selected the last Item 1 match found before the position of the last Item 1A match.
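
The workaround described above can be sketched roughly as follows. This is only an illustration in plain Python, not the notebook's code: the function name `pick_item1_start` and the sample positions are made up, and the matches are assumed to be available as `(item, start)` pairs rather than as the `pos_dat` dataframe.

```python
def pick_item1_start(matches):
    """Return the start of the last 'item1' match that precedes the last 'item1a' match.

    `matches` is a list of (item, start) tuples (a hypothetical stand-in for the
    regex match table built in the notebook).
    """
    last_item1a = max(start for item, start in matches if item == "item1a")
    candidates = [
        start for item, start in matches
        if item == "item1" and start < last_item1a
    ]
    return max(candidates) if candidates else None


# Made-up positions: the final 'item1' row stands in for a stray "Item 11" hit
matches = [
    ("item1", 100),      # table of contents
    ("item1a", 150),     # table of contents
    ("item1", 5000),     # body
    ("item1a", 9000),    # body
    ("item1", 2360825),  # mislabelled "Item 11" far down the filing
]
print(pick_item1_start(matches))
```

Tightening the regex so that "Item 1" cannot match "Item 11" in the first place would be an alternative, but that is not what the comment above describes.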

sash236 commented Apr 15, 2022

@Onapmek - I always thought that, as accurate as regex is, it is also sensitive to small variations, which causes reliability issues. Can the regex approach be made more reliable?
Secondly, some of these "Items" contain tables. I wonder how the text can be extracted while excluding the tables?

@anshoomehra - could you elaborate on this?

xesws commented Jul 1, 2022

appreciate your work!

pratikWokelo commented Dec 5, 2022

In the 3rd cell you reference document, which is not defined in the two cells above it. How did you get it?

mevalerio commented May 16, 2023

I am able to get the code to work when I use the download from the SEC website; however, the SEC no longer allows scripts to loop over filings and collect data in bulk (at least that is what I was told). I found a website which has every 10-K downloaded, and I saved those onto my personal computer. When I use the code, I changed the requests call to opening and reading a file, but now I'm getting the error "KeyError: 'item1a'". I've tried different versions such as "Item 1A." etc. with no luck. Is there another way to get this code to work using SEC downloads? Downloads are from https://drive.google.com/drive/folders/1tZP9A0hrAj8ptNP3VE9weYZ3WDn9jHic. Thank you!

Hi Bill, thank you for sharing this. Have you had a chance to verify that these 10-Ks are a complete list? Have you had a chance to validate the data? May I kindly ask which website it was? Happy to take it offline if you prefer. Thanks

monashjg commented Jun 6, 2023

May I know how to remove the footer information, "Apple Inc. | 2018 Form 10-K |" as well as page number from the generated text?
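
One possible approach, as a rough sketch only: strip the footer with a regular expression after the section text has been extracted. The pattern below assumes the footer always looks exactly like the one quoted above and is followed by the page number; `FOOTER_RE` and `strip_footers` are made-up names, and the pattern would need adjusting for other companies or layouts.

```python
import re

# Assumed footer shape: "Apple Inc. | <year> Form 10-K | <page number>"
FOOTER_RE = re.compile(r"\s*Apple Inc\. \| \d{4} Form 10-K \|\s*\d*")


def strip_footers(text):
    """Remove every footer (and the page number right after it) from the text."""
    return FOOTER_RE.sub("", text)


sample = "Risk text. Apple Inc. | 2018 Form 10-K | 12 More text."
print(strip_footers(sample))
```

Note that the trailing `\d*` would also swallow digits that happen to start the next sentence when a footer has no page number, so it is worth spot-checking the output.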

AlessandroVentisei commented Jun 28, 2023

Thanks for this! I've followed the steps to get historic numeric data and made a free API in case anyone else wants the data for training AI etc.
https://rapidapi.com/alexventisei2/api/sec-api2

thegallier commented Sep 24, 2023

I think the line below assumes the same number of entries for every item, which is not necessarily the case, for example for NYT. In that filing there are more Item 1A matches than Item 1B matches, and the approach does not work. I would also add re.IGNORECASE to the re.compile call.

pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
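
As a rough illustration of both suggestions, the sketch below compiles a simplified pattern with `re.IGNORECASE` and keeps the last match per item without relying on every item appearing the same number of times. The pattern, the `last_positions` helper, and the sample string are assumptions for illustration, not the notebook's actual code.

```python
import re

# Simplified, hypothetical pattern; re.IGNORECASE covers "Item", "ITEM", and "item"
ITEM_RE = re.compile(r">\s*item\s+(1A|1B|7A|7|8)\.?", re.IGNORECASE)


def last_positions(text):
    """Map each item label (e.g. 'item1a') to the start of its last match."""
    positions = {}
    for m in ITEM_RE.finditer(text):
        # Later matches overwrite earlier ones, so the last occurrence wins
        positions["item" + m.group(1).lower()] = m.start()
    return positions


# Item 1A appears three times, Item 7 only twice
sample = "<a>Item 1A.</a> <a>ITEM 7.</a> <b>item 1a.</b> <b>Item 1A.</b> <b>Item 7.</b>"
print(last_positions(sample))
```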

VadarVillage commented Sep 26, 2023

This was very helpful, thank you for taking the time to post this

niravsatani24 commented Sep 30, 2023

Amazing! Thanks for sharing.

rabsher commented Dec 4, 2023

I have the HTML URL, but I don't know how to get the .txt URL of the 10-K filing. Once I have that, I can use the notebook code above.

Can anyone help me, please?
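
Judging only from the URLs posted in this thread, the .txt URL appears to sit in the same folder as the .htm document and to be named after the folder's 18-digit number written with dashes (10-2-6 digits). A minimal sketch under that assumption follows; `txt_url_from_htm_url` is a made-up helper, and the result should be checked against EDGAR before relying on it.

```python
import re


def txt_url_from_htm_url(htm_url):
    """Build the assumed .txt URL for the filing that contains the given .htm document."""
    m = re.search(r"/Archives/edgar/data/(\d+)/(\d{18})/", htm_url)
    if m is None:
        raise ValueError("URL does not look like an EDGAR filing document URL")
    cik, accession = m.groups()
    # Assumed naming: 18 digits -> 10 digits, dash, 2 digits, dash, 6 digits
    dashed = f"{accession[:10]}-{accession[10:12]}-{accession[12:]}"
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/{dashed}.txt"


htm_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"
print(txt_url_from_htm_url(htm_url))
```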

versatile712 commented Mar 19, 2024

Jesus, you saved my life!

Tarun3679 commented Mar 24, 2025

I just tried this, and it does not seem to return anything for the example above?

rabsher commented Mar 24, 2025

I just tried this, and it does not seem to return anything for the example above?

import requests

# The URL must point to the .htm document
url = "https://www.sec.gov/Archives/edgar/data/1571996/000157199624000036/dell-20240202.htm"
headers = {
    "User-Agent": "get it from sec website",  # use the format required by the SEC website
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.sec.gov",
}
response = requests.get(url, headers=headers)
# Replace non-breaking spaces with regular spaces
html_content = response.text.replace("\xa0", " ")

You can use this code to fetch a 10-K file. Once you have the HTML, you can write your own regex function to extract specific content from it, or you can get the complete 10-K filing as text.

Tarun3679 commented Mar 28, 2025

Does anyone know any such similar script to retrieve 10-Q?

john-friedman commented Apr 16, 2025

@Tarun3679
https://github.com/john-friedman/datamule-python

from datamule import Portfolio

portfolio = Portfolio('10q')
portfolio.download_submissions(submission_type='10-Q',ticker='MSFT')

for document in portfolio.document_type('10-Q'):
  document.parse()
  print(document.data)
