@anshoomehra
Last active November 14, 2024 09:21
How to Parse 10-K Report from EDGAR (SEC)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> Intro\n",
"\n",
"\n",
"In this notebook we will apply REGEX & BeautifulSoup to find useful financial information in 10-Ks. In particular, we will extract text from Items 1A, 7, and 7A of 10-K."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> STEP 1 : Import Libraries\n",
"\n",
"Note that, we will need parser for BeautifulSoup, there are many parsers, we will be using 'lxml' which can be pre-installed as follows & it help BeatifulSoup read HTML, XML documents:\n",
"\n",
"!pip install lxml"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import requests to retrive Web Urls example HTML. TXT \n",
"import requests\n",
"\n",
"# Import BeautifulSoup\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# import re module for REGEXes\n",
"import re\n",
"\n",
"# import pandas\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> STEP 2 : Get Apple's [AAPL] 2018 10-K \n",
"\n",
"Though we are using AAPL as example 10-K here, the pipeline being built is generic & can be used for other companies 10-K\n",
" \n",
"[SEC Website URL for 10-K (TEXT version)](https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt)\n",
"\n",
"[SEC Website URL for 10-K (HTML version)](https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/a10-k20189292018.htm)\n",
"It will be good to view/study along in html format to see how the below code would apply.\n",
"\n",
"All the documents can be easily ssearched via CIK or company details via [SEC's search tool](https://www.sec.gov/cgi-bin/browse-edgar?CIK=0000320193&owner=exclude&action=getcompany&Find=Search)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Get the HTML data from the 2018 10-K from Apple\n",
"r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt')\n",
"raw_10k = r.text"
]
},
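{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the SEC now expects automated requests to declare a `User-Agent` (per its fair-access guidelines), so the plain `requests.get` above may be rejected on newer runs. Below is a minimal sketch of the same download with a declared `User-Agent`; the header value is only an example, the SEC asks for a descriptive agent such as a name and email:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same request as above, with a User-Agent header (example value) per SEC fair-access guidelines\n",
"headers = {'User-Agent': 'Your Name yourname@example.com'}\n",
"r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt', headers=headers)\n",
"raw_10k = r.text"
]
},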
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we print the `raw_10k` string we will see that it has many sections. In the code below, we print part of the `raw_10k` string:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105\n",
"<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105\n",
"<ACCEPTANCE-DATETIME>20181105080140\n",
"ACCESSION NUMBER:\t\t0000320193-18-000145\n",
"CONFORMED SUBMISSION TYPE:\t10-K\n",
"PUBLIC DOCUMENT COUNT:\t\t88\n",
"CONFORMED PERIOD OF REPORT:\t20180929\n",
"FILED AS OF DATE:\t\t20181105\n",
"DATE AS OF CHANGE:\t\t20181105\n",
"\n",
"FILER:\n",
"\n",
"\tCOMPANY DATA:\t\n",
"\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC\n",
"\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n",
"\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tELECTRONIC COMPUTERS [3571]\n",
"\t\tIRS NUMBER:\t\t\t\t942404110\n",
"\t\tSTATE OF INCORPORATION:\t\t\tCA\n",
"\t\tFISCAL YEAR END:\t\t\t0930\n",
"\n",
"\tFILING VALUES:\n",
"\t\tFORM TYPE:\t\t10-K\n",
"\t\tSEC ACT:\t\t1934 Act\n",
"\t\tSEC FILE NUMBER:\t001-36743\n",
"\t\tFILM NUMBER:\t\t181158788\n",
"\n",
"\tBUSINESS ADDRESS:\t\n",
"\t\tSTREET 1:\t\tONE APPLE PARK WAY\n",
"\t\tCITY:\t\t\tCUPERTINO\n",
"\t\tSTATE:\t\t\tCA\n",
"\t\tZIP:\t\t\t95014\n",
"\t\tBUSINESS PHONE:\t\t(408) 996-1010\n",
"\n",
"\tMAIL ADDRESS:\t\n",
"\t\tSTREET 1:\t\tONE APPLE PARK WAY\n",
"\t\tCITY:\t\t\tCUPERTINO\n",
"\t\tSTATE:\t\t\tCA\n",
"\t\tZIP:\t\t\t95014\n",
"\n",
"\tFORMER COMPANY:\t\n",
"\t\tFORMER CONFORMED NAME:\tAPPLE COMPUTER INC\n",
"\t\tDATE OF NAME CHANGE:\t19970808\n",
"</SEC-HEADER>\n",
"<DOCUMENT>\n",
"<TYPE>10-K\n",
"<SEQUENCE>1\n",
"<FILENAME>a10-k20189292018.htm\n",
"<DESCRIPTION>10-K\n",
"<TEXT>\n",
"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n",
"<html>\n",
"\t<head>\n",
"\t\t<!-- Document created using Wdesk 1 -->\n",
"\t\t<!-- Copyright\n"
]
}
],
"source": [
"print(raw_10k[0:1300])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> STEP 3 : Apply REGEXes to find 10-K Section from the document\n",
"\n",
"For our purposes, we are only interested in the sections that contain the 10-K information. All the sections, including the 10-K are contained within the `<DOCUMENT>` and `</DOCUMENT>` tags. Each section within the document tags is clearly marked by a `<TYPE>` tag followed by the name of the section."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Regex to find <DOCUMENT> tags\n",
"doc_start_pattern = re.compile(r'<DOCUMENT>')\n",
"doc_end_pattern = re.compile(r'</DOCUMENT>')\n",
"# Regex to find <TYPE> tag prceeding any characters, terminating at new line\n",
"type_pattern = re.compile(r'<TYPE>[^\\n]+')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define Span Indices using REGEXes\n",
"\n",
"Now, that we have the regexes defined, we will use the `.finditer()` method to match the regexes in the `raw_10k`. In the code below, we will create 3 lists:\n",
"\n",
"1. A list that holds the `.end()` index of each match of `doc_start_pattern`\n",
"\n",
"2. A list that holds the `.start()` index of each match of `doc_end_pattern`\n",
"\n",
"3. A list that holds the name of section from each match of `type_pattern`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Create 3 lists with the span idices for each regex\n",
"\n",
"### There are many <Document> Tags in this text file, each as specific exhibit like 10-K, EX-10.17 etc\n",
"### First filter will give us document tag start <end> and document tag end's <start> \n",
"### We will use this to later grab content in between these tags\n",
"doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]\n",
"doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]\n",
"\n",
"### Type filter is interesting, it looks for <TYPE> with Not flag as new line, ie terminare there, with + sign\n",
"### to look for any char afterwards until new line \\n. This will give us <TYPE> followed Section Name like '10-K'\n",
"### Once we have have this, it returns String Array, below line will with find content after <TYPE> ie, '10-K' \n",
"### as section names\n",
"doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]"
]
},
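{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (illustrative, not part of the original walkthrough), the three lists should have the same length, one entry per `<DOCUMENT>` block:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each <DOCUMENT> block contributes one start index, one end index, and one <TYPE> name\n",
"print(len(doc_start_is), len(doc_end_is), len(doc_types))\n",
"\n",
"# Peek at the first few section names, e.g. '10-K', 'EX-10.17', ...\n",
"print(doc_types[0:5])"
]
},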
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a Dictionary for the 10-K\n",
"\n",
"In the code below, we will create a dictionary which has the key `10-K` and as value the contents of the `10-K` section found above. To do this, we will create a loop, to go through all the sections found above, and if the section type is `10-K` then save it to the dictionary. Use the indices in `doc_start_is` and `doc_end_is`to slice the `raw_10k` file."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"document = {}\n",
"\n",
"# Create a loop to go through each section type and save only the 10-K section in the dictionary\n",
"for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):\n",
" if doc_type == '10-K':\n",
" document[doc_type] = raw_10k[doc_start:doc_end]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n<TYPE>10-K\\n<SEQUENCE>1\\n<FILENAME>a10-k20189292018.htm\\n<DESCRIPTION>10-K\\n<TEXT>\\n<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\\n<html>\\n\\t<head>\\n\\t\\t<!-- Document created using Wdesk 1 -->\\n\\t\\t<!-- Copyright 2018 Workiva -->\\n\\t\\t<title>Document</title>\\n\\t</head>\\n\\t<body style=\"font-family:Times New Roman;font-size:10pt;\">\\n<div><a name=\"s3540C27286EF5B0DA103CC59028B96BE\"></a></div><div style=\"line-height:120%;text-align:center;font-size:10pt;\"><div sty'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display excerpt the document\n",
"document['10-K'][0:500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> STEP 3 : Apply REGEXes to find Item 1A, 7, and 7A under 10-K Section \n",
"\n",
"The items in this `document` can be found in four different patterns. For example Item 1A can be found in either of the following patterns:\n",
"\n",
"1. `>Item 1A`\n",
"\n",
"2. `>Item&#160;1A` \n",
"\n",
"3. `>Item&nbsp;1A`\n",
"\n",
"4. `ITEM 1A` \n",
"\n",
"In the code below we will write a single regular expression that can match all four patterns for Items 1A, 7, and 7A. Then use the `.finditer()` method to match the regex to `document['10-K']`.\n",
"\n",
"Note that Item 1B & Item 8 are added to find out end of section Item 1A & Item 7A subsequently."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<_sre.SRE_Match object; span=(38318, 38327), match='>Item 1A.'>\n",
"<_sre.SRE_Match object; span=(39347, 39356), match='>Item 1B.'>\n",
"<_sre.SRE_Match object; span=(46148, 46156), match='>Item 7.'>\n",
"<_sre.SRE_Match object; span=(47281, 47290), match='>Item 7A.'>\n",
"<_sre.SRE_Match object; span=(48357, 48365), match='>Item 8.'>\n",
"<_sre.SRE_Match object; span=(119131, 119140), match='>Item 1A.'>\n",
"<_sre.SRE_Match object; span=(197023, 197032), match='>Item 1B.'>\n",
"<_sre.SRE_Match object; span=(333318, 333326), match='>Item 7.'>\n",
"<_sre.SRE_Match object; span=(729984, 729993), match='>Item 7A.'>\n",
"<_sre.SRE_Match object; span=(741774, 741782), match='>Item 8.'>\n"
]
}
],
"source": [
"# Write the regex\n",
"regex = re.compile(r'(>Item(\\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\\.{0,1})|(ITEM\\s(1A|1B|7A|7|8))')\n",
"\n",
"# Use finditer to math the regex\n",
"matches = regex.finditer(document['10-K'])\n",
"\n",
"# Write a for loop to print the matches\n",
"for match in matches:\n",
" print(match)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that each item is matched twice. This is because each item appears first in the index and then in the corresponding section. We will now have to remove the matches that correspond to the index. We will do this using Pandas in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code below we will create a pandas dataframe with the following column names: `'item','start','end'`. In the `item` column save the `match.group()` in lower case letters, in the ` start` column save the `match.start()`, and in the `end` column save the ``match.end()`. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>item</th>\n",
" <th>start</th>\n",
" <th>end</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>&gt;item 1a.</td>\n",
" <td>38318</td>\n",
" <td>38327</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>&gt;item 1b.</td>\n",
" <td>39347</td>\n",
" <td>39356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>&gt;item 7.</td>\n",
" <td>46148</td>\n",
" <td>46156</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>&gt;item 7a.</td>\n",
" <td>47281</td>\n",
" <td>47290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>&gt;item 8.</td>\n",
" <td>48357</td>\n",
" <td>48365</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" item start end\n",
"0 >item 1a. 38318 38327\n",
"1 >item 1b. 39347 39356\n",
"2 >item 7. 46148 46156\n",
"3 >item 7a. 47281 47290\n",
"4 >item 8. 48357 48365"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Matches\n",
"matches = regex.finditer(document['10-K'])\n",
"\n",
"# Create the dataframe\n",
"test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])\n",
"\n",
"test_df.columns = ['item', 'start', 'end']\n",
"test_df['item'] = test_df.item.str.lower()\n",
"\n",
"# Display the dataframe\n",
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eliminate Unnecessary Characters\n",
"\n",
"As we can see, our dataframe, in particular the `item` column, contains some unnecessary characters such as `>` and periods `.`. In some cases, we will also get unicode characters such as `&#160;` and `&nbsp;`. In the code below, we will use the Pandas dataframe method `.replace()` with the keyword `regex=True` to replace all whitespaces, the above mentioned unicode characters, the `>` character, and the periods from our dataframe. We want to do this because we want to use the `item` column as our dataframe index later on."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>item</th>\n",
" <th>start</th>\n",
" <th>end</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>item1a</td>\n",
" <td>38318</td>\n",
" <td>38327</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>item1b</td>\n",
" <td>39347</td>\n",
" <td>39356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>item7</td>\n",
" <td>46148</td>\n",
" <td>46156</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>item7a</td>\n",
" <td>47281</td>\n",
" <td>47290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>item8</td>\n",
" <td>48357</td>\n",
" <td>48365</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" item start end\n",
"0 item1a 38318 38327\n",
"1 item1b 39347 39356\n",
"2 item7 46148 46156\n",
"3 item7a 47281 47290\n",
"4 item8 48357 48365"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get rid of unnesesary charcters from the dataframe\n",
"test_df.replace('&#160;',' ',regex=True,inplace=True)\n",
"test_df.replace('&nbsp;',' ',regex=True,inplace=True)\n",
"test_df.replace(' ','',regex=True,inplace=True)\n",
"test_df.replace('\\.','',regex=True,inplace=True)\n",
"test_df.replace('>','',regex=True,inplace=True)\n",
"\n",
"# display the dataframe\n",
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove Duplicates\n",
"\n",
"Now that we have removed all unnecessary characters form our dataframe, we can go ahead and remove the Item matches that correspond to the index. In the code below we will use the Pandas dataframe `.drop_duplicates()` method to only keep the last Item matches in the dataframe and drop the rest. Just as precaution ensure that the `start` column is sorted in ascending order before dropping the duplicates."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>item</th>\n",
" <th>start</th>\n",
" <th>end</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>item1a</td>\n",
" <td>119131</td>\n",
" <td>119140</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>item1b</td>\n",
" <td>197023</td>\n",
" <td>197032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>item7</td>\n",
" <td>333318</td>\n",
" <td>333326</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>item7a</td>\n",
" <td>729984</td>\n",
" <td>729993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>item8</td>\n",
" <td>741774</td>\n",
" <td>741782</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" item start end\n",
"5 item1a 119131 119140\n",
"6 item1b 197023 197032\n",
"7 item7 333318 333326\n",
"8 item7a 729984 729993\n",
"9 item8 741774 741782"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop duplicates\n",
"pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')\n",
"\n",
"# Display the dataframe\n",
"pos_dat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set Item to Index\n",
"\n",
"In the code below use the Pandas dataframe `.set_index()` method to set the `item` column as the index of our dataframe."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>start</th>\n",
" <th>end</th>\n",
" </tr>\n",
" <tr>\n",
" <th>item</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>item1a</th>\n",
" <td>119131</td>\n",
" <td>119140</td>\n",
" </tr>\n",
" <tr>\n",
" <th>item1b</th>\n",
" <td>197023</td>\n",
" <td>197032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>item7</th>\n",
" <td>333318</td>\n",
" <td>333326</td>\n",
" </tr>\n",
" <tr>\n",
" <th>item7a</th>\n",
" <td>729984</td>\n",
" <td>729993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>item8</th>\n",
" <td>741774</td>\n",
" <td>741782</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" start end\n",
"item \n",
"item1a 119131 119140\n",
"item1b 197023 197032\n",
"item7 333318 333326\n",
"item7a 729984 729993\n",
"item8 741774 741782"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Set item as the dataframe index\n",
"pos_dat.set_index('item', inplace=True)\n",
"\n",
"# display the dataframe\n",
"pos_dat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> Get The Financial Information From Each Item </b>\n",
"\n",
"The above dataframe contains the starting and end index of each match for Items 1A, 7, and 7A. In the code below, we will save all the text from the starting index of `item1a` till the starting index of `item1b` into a variable called `item_1a_raw`. Similarly, save all the text from the starting index of `item7` till the starting index of `item7a` into a variable called `item_7_raw`. Finally, save all the text from the starting index of `item7a` till the starting of `item8` into a variable called `item_7a_raw`. We can accomplish all of this by making the correct slices of `document['10-K']`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Get Item 1a\n",
"item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item1b']]\n",
"\n",
"# Get Item 7\n",
"item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]\n",
"\n",
"# Get Item 7a\n",
"item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:pos_dat['start'].loc['item8']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have each item saved into a separate variable we can view them separately. For illustration purposes we will display Item 1a, but the other items will look similar."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'>Item 1A.</font></div></td><td style=\"vertical-align:top;\"><div style=\"line-height:120%;text-align:justify;font-size:9pt;\"><font style=\"font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;\">Risk Factors</font></div></td></tr></table><div style=\"line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;\"><font style=\"font-family:Helvetica,sans-serif;font-size:9pt;\">The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item&#160;7, &#8220;Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations&#8221; and the consolidated financial statements and related notes in Part II, Item&#160;8, &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K.</font></div><div style=\"line-height:120%;padding-top:16px;text-align:justify;font-size:9pt;\"><f'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"item_1a_raw[0:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the items looks pretty messy, they contain HTML tags, Unicode characters, etc... Before we can do a proper Natural Language Processing in these items we need to clean them up. This means we need to remove all HTML Tags, unicode characters, etc... In principle we could do this using regex substitutions as we learned previously, but his can be rather difficult. Luckily, packages already exist that can do all the cleaning for us, such as **Beautifulsoup**, let's make use of this to refine the extracted text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> STEP 4 : Apply BeautifulSoup to refine the content"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"### First convert the raw text we have to exrtacted to BeautifulSoup object \n",
"item_1a_content = BeautifulSoup(item_1a_raw, 'lxml')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<html>\n",
" <body>\n",
" <p>\n",
" &gt;Item 1A.\n",
" </p>\n",
" <td style=\"vertical-align:top;\">\n",
" <div style=\"line-height:120%;text-align:justify;font-size:9pt;\">\n",
" <font style=\"font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;\">\n",
" Risk Factors\n",
" </font>\n",
" </div>\n",
" </td>\n",
" <div style=\"line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;\">\n",
" <font style=\"font-family:Helvetica,sans-serif;font-size:9pt;\">\n",
" The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and the consolidated financial statements and related notes in Part II, Item 8, “Financial Statements and Supplementary Data” of this Form 10-K.\n",
" </font>\n",
" </div>\n",
" <div style=\"line-height:120%;padding-top:16px;text-align:justify;fon\n"
]
}
],
"source": [
"### By just applying .pretiffy() we see that raw text start to look oragnized, as BeautifulSoup\n",
"### apply indentation according to the HTML Tag tree structure\n",
"print(item_1a_content.prettify()[0:1000])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">Item 1A.\n",
"\n",
"Risk Factors\n",
"\n",
"The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and the consolidated financial statements and related notes in Part II, Item 8, “Financial Statements and Supplementary Data” of this Form 10-K.\n",
"\n",
"The business, financial condition and operating results of the Company can be affected by a number of factors, whether currently known or unknown, including but not limited to those described below, any one or more of which could, directly or indirectly, cause the Company’s actual financial condition and operating results to vary materially from past, or from anticipated future, financial condition and operating results. Any of these factors, in whole or in part, could materially and adversely affect the Company’s business, financial condition, operating results and stock price.\n",
"\n",
"Because of the following factors, as well as other factors affecting the Company’s financial condition and operating results, past financial performance should not be considered to be a reliable indicator of future performance, and investors should not use historical trends to anticipate results or trends in future periods.\n",
"\n",
"Global and regional economic conditions could materially adversely affect the Comp\n"
]
}
],
"source": [
"### Our goal is though to remove html tags and see the content\n",
"### Method get_text() is what we need, \\n\\n is optional, I just added this to read text \n",
"### more cleanly, it's basically new line character between sections. \n",
"print(item_1a_content.get_text(\"\\n\\n\")[0:1500])"
]
},
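{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, the steps above can be collapsed into one helper. This is a sketch under the same assumptions as this notebook (the filing follows the `<DOCUMENT>`/`<TYPE>` layout and each item header appears twice, first in the index and then in the body); the function name and signature are ours, not part of any library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def extract_item(raw_10k, item_start, item_end):\n",
"    \"\"\"Sketch: return the cleaned text of the 10-K between two item headers.\"\"\"\n",
"    # Isolate the 10-K <DOCUMENT> block, reusing the patterns compiled above\n",
"    starts = [x.end() for x in doc_start_pattern.finditer(raw_10k)]\n",
"    ends = [x.start() for x in doc_end_pattern.finditer(raw_10k)]\n",
"    types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]\n",
"    doc = next(raw_10k[s:e] for t, s, e in zip(types, starts, ends) if t == '10-K')\n",
"\n",
"    # Locate the item headers and keep only the last (body) occurrence of each\n",
"    df = pd.DataFrame([(m.group().lower(), m.start()) for m in regex.finditer(doc)],\n",
"                      columns=['item', 'start'])\n",
"    df['item'] = df['item'].str.replace(r'(>|\\.|\\s|&#160;|&nbsp;)', '', regex=True)\n",
"    pos = df.sort_values('start').drop_duplicates('item', keep='last').set_index('item')\n",
"\n",
"    # Slice the raw HTML between the two headers and strip the tags\n",
"    raw = doc[pos['start'].loc[item_start]:pos['start'].loc[item_end]]\n",
"    return BeautifulSoup(raw, 'lxml').get_text(\"\\n\\n\")\n",
"\n",
"# Example usage:\n",
"# item_1a_text = extract_item(raw_10k, 'item1a', 'item1b')"
]
},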
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <span style=\"color:navy\"> Summary..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen that simply applying REGEX & BeautifulSoup combination we can form a very powerful combination extracting/scarpping content from any web-content very easily. Having said this, not all 10-Ks are well crafted HTML, TEXT formats, example older 10-Ks, hence there may be adjustments needed to adopt to the circumstances."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@evmcheb commented May 23, 2020

<3 you legend

@pucek80 commented Jul 14, 2020

It has been of great help. I've tried it with several sample statements, and one has an issue getting the right Item 7 entry:
https://www.sec.gov/Archives/edgar/data/1530721/000153072120000062/0001530721-20-000062.txt
It seems like it's identifying only one entry for Item 7 (the one in the TOC, not the one in the body). Any suggestions?

@prgtrdr commented Oct 23, 2020

It has been of great help. I've tried it with several sample statements, and one has an issue getting the right Item 7 entry:
https://www.sec.gov/Archives/edgar/data/1530721/000153072120000062/0001530721-20-000062.txt
It seems like it's identifying only one entry for Item 7 (the one in the TOC, not the one in the body). Any suggestions?
@pucek80:

To fix this problem, change the regex in cell 8 to:
regex = re.compile(r'(>(Item|ITEM)(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})')

@kvtoraman

Thanks @anshoomehra, the core logic works. 10-Q files need some work though, since the same item names are used in Parts 1 and 2.

@Taram1980

It has been very helpful. I tried several sample statements, and one has an issue getting the right entry for
Item 7: https://www.sec.gov/Archives/edgar/data/1530721/000153072120000062/0001530721-20-000062.txt
It seems like it identifies only one entry for Item 7 (the one in the table of contents, not the one in the body). Any suggestions?
@pucek80:

To fix this problem, change the regex in cell 8 to:
regex = re.compile(r'(>(Item|ITEM)(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})')

What about this:
https://www.sec.gov/Archives/edgar/data/40545/000004054520000009/0000040545-20-000009.txt

@lkcao commented May 19, 2021

thanks sooooooo much. This was driving me crazy and you saved my ass.

@janlukasschroeder

You could use the query API from SEC API to batch-retrieve 10-Ks, then use the render API to download the filings and add your script to extract the data. Awesome workflow!

@dtelljo commented Aug 18, 2021

I am able to get the code to work when I use the download from the SEC website; however, the SEC is no longer allowing mass loops to collect data (at least that is what I was told). I found a website which has every 10-K, and I saved those onto my personal computer. To use the code I changed the requests call to open and read a file, but now I'm getting an error "KeyError: 'item1a'". I've tried different variations such as "Item 1A." etc. with no luck. Is there another way to get this code to work using SEC downloads? Downloads are from https://drive.google.com/drive/folders/1tZP9A0hrAj8ptNP3VE9weYZ3WDn9jHic. Thank you!
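
For reference, a minimal sketch of the open-and-read swap described above; the path is a placeholder, and everything downstream of raw_10k stays the same:

# Hypothetical local path to a downloaded full-submission .txt filing
with open('filings/0000320193-18-000145.txt', 'r', encoding='utf-8') as f:
    raw_10k = f.read()

A KeyError: 'item1a' usually means the item regex never matched '>Item 1A' (or one of its variants) in that particular file, so no 'item1a' row survives into pos_dat; printing test_df before the deduplication step shows what did match.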

@janlukasschroeder

Can also be done with the item extraction API now.

from sec_api import ExtractorApi

extractorApi = ExtractorApi("YOUR_API_KEY")

# Tesla 10-K filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"

# get the standardized and cleaned text of section 1A "Risk Factors"
section_text = extractorApi.get_section(filing_url, "1A", "text")

# get the original HTML of section 7 "Management’s Discussion and Analysis of Financial Condition and Results of Operations"
section_html = extractorApi.get_section(filing_url, "7", "html")

print(section_text)
print(section_html)

Docs: https://sec-api.io/docs/sec-filings-item-extraction-api

@RudeFerret

Hey, thanks for the code. It's wonderful.

One question, I can get information for most sections. However, for Item 1 (business section), I can't seem to get the information.

item_1_raw = document['10-K'][pos_dat['start'].loc['item1']:pos_dat['start'].loc['item1a']]

I receive a NoneType back. Any ideas?

@marcelinochamon

I am having the same problem. It would be great to have any ideas.

@sash236 commented Apr 14, 2022

@Onapmek and @marcelinochamon
To fix the NoneType, see below - include headers:
response = requests.get(url, headers={'User-Agent': 'Mozilla'})

or better, a longer one:
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

@RudeFerret commented Apr 15, 2022

@sash236 This unfortunately doesn't fix the issue for me. I already used a header to successfully retrieve the 10-K text file.
I can get the sections explained in the example given by OP, but I can't seem to retrieve Section 1 itself.

I don't know enough regex to fix the issue, but what I noticed is that the start position for Item 1 is extremely off:

# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat

Outcome:

item     start     end
item2    198174    198182
item1    2360825   2360832

@RudeFerret commented Apr 15, 2022

@Onapmek Ok, I found out what was going on. The regex also matches Item 11, Item 12, etc. when it's looking for Item 1, so the last 'item1' match kept is an Item 10+ heading, which of course comes after Item 1A or Item 2 and thus returns a None. I have an ugly fix: for the position of Item 1, I selected the position of the last 'Item 1' found before the position of the last 'Item 1A' found, as sketched below.
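
A sketch of that fix using the notebook's variable names (the exact filtering rule here is an interpretation of "latest Item 1 before the latest Item 1A", not the poster's actual code):

# Position of the body's Item 1A (the last 'item1a' match, already in pos_dat)
item1a_start = pos_dat['start'].loc['item1a']

# Among the exact 'item1' matches, keep the last one that occurs before it
item1_rows = test_df[(test_df['item'] == 'item1') & (test_df['start'] < item1a_start)]
item1_start = item1_rows['start'].max()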

@sash236 commented Apr 15, 2022

@Onapmek - I always thought that as accurate as regex is, it is also sensitive, which causes reliability issues - can the regex approach be made more reliable?
Secondly, some of these "Items" consist of tables. I wonder how the text is extracted while excluding the tables?

@anshoomehra - could you elaborate on this?

@xesws commented Jul 1, 2022

appreciate your work!

@pratikWokelo

In the 3rd cell you mention document, which is not defined in the two cells above. How did you get it?

@mevalerio

I am able to get the code to work when I use the download from the SEC website; however, the SEC is no longer allowing mass loops to collect data (at least that is what I was told). ... Downloads are from https://drive.google.com/drive/folders/1tZP9A0hrAj8ptNP3VE9weYZ3WDn9jHic. Thank you!

Hi Bill, thank you for sharing this. Have you had a chance to verify that this is a complete list of 10-Ks? Have you had a chance to validate the data? May I kindly ask which website it was? Happy to take this offline if you prefer. Thanks

@monashjg commented Jun 6, 2023

May I know how to remove the footer information, "Apple Inc. | 2018 Form 10-K |", as well as the page numbers, from the generated text?

@AlessandroVentisei

Thanks for this! I've followed the steps to get historic numeric data and made a free API in case anyone else wants the data for training AI etc.
https://rapidapi.com/alexventisei2/api/sec-api2

@thegallier

I think the line below assumes the same number of entries for all items, which is not necessarily the case (for example, NYT): there are more Item 1A matches than 1B, and the approach does not work. I would also add re.IGNORECASE to the re.compile; see the sketch after the line below.

pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
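
A sketch of both tweaks; the case-insensitive flag covers '>ITEM' variants, and nth(1) keeps the second match per item (index first, body second), dropping items matched only once so a missing section shows up as a missing row rather than a wrong slice:

regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})', re.IGNORECASE)

# Keep the second match per item (index first, body second) instead of the last
pos_dat = test_df.sort_values('start').groupby('item').nth(1)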

@VadarVillage

This was very helpful, thank you for taking the time to post this

@niravsatani24

Amazing! Thanks for sharing.

@rabsher commented Dec 4, 2023

I have the HTML URL of a 10-K filing, but I don't know how to get the TXT URL; with that, I would be able to use the notebook code above.

Can anyone help me, please?

@versatile712

Jesus, you saved my life!
