Using Regular Expressions to Search SEC 10K Filings

Regular expressions, or “regex”, are text matching patterns used for searching text. In the case of SEC 10K filings, regex can greatly assist the search process.



SEC 10K filings contain inconsistencies from year to year and from company to company. This makes it difficult to identify and extract particular sections of text. Regex offers versatile and powerful text search capabilities that can be used for matching almost any pattern of text.

I recently encountered this issue when trying to extract a section of text from a 10K filing. In this article, I show how I found a workable solution using regex in Python. By using carefully selected pattern-matching techniques, regex helped me extract the particular section of text that I was after.

My solution won’t work for all 10K filings, however, as the inconsistencies between filings vary quite a bit and are difficult to anticipate. But it’ll work in many cases.

More importantly, my solution can help you if you’re encountering a similar issue. It can provide a starting point which you can modify as needed for the particular text that you wish to search.

If you’re working with 10K filings and are looking for ideas on how to identify and extract text, or if you simply want to learn more about using regex, then this article is for you!

SEC 10K filings and where to find them

SEC 10K filings are produced annually by all publicly traded companies in the US. They contain plenty of useful information including details of each company’s history, structure, personnel, financial circumstances and operations. Their content is usually a key topic of discussion during earnings calls between company management and analysts, investors and the media.

SEC filings are available from the EDGAR database on the SEC website. EDGAR, or the ‘Electronic Data Gathering, Analysis and Retrieval’ system, offers easy access to all public company filings.

EDGAR is huge, with around 3,000 filings processed each day and with over 40,000 new filers each year. Best of all, EDGAR is free to access and offers a comprehensive and reliable source of information on US public companies.

Individual filings can be obtained directly through EDGAR’s online portal. Searches for company information can be done manually, through third-party APIs or by accessing documents through their URL links — this is the approach that we’ll use.
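For instance, a filing’s full-text URL can be assembled from a company’s CIK number and the filing’s accession number. Here’s a sketch, assuming EDGAR’s archive layout (folder name = accession number with dashes removed); the CIK and accession number below are those of the Tesla filing used later in this article:

```python
def edgar_filing_url(cik, accession):
    # EDGAR stores each filing in a folder named after the accession
    # number with dashes removed; the dashed form names the .txt file
    folder = accession.replace('-', '')
    return (f"https://www.sec.gov/Archives/edgar/data/"
            f"{int(cik)}/{folder}/{accession}.txt")

print(edgar_filing_url('1318605', '0001564590-20-004475'))  # Tesla 10K Dec 2019
```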

The structure (and challenges) of 10K filings

10K filings tend to be large and unwieldy given the scope of information that they contain. They’re also challenging to work with due to the inconsistencies inherent between individual filings and companies. This makes it hard to search through and identify text within them.

10K filings have a number of standardized sections such as ‘Business’, ‘Risk Factors’ and ‘Selected Financial Data’. One section of particular interest is ‘Management’s Discussion and Analysis of Financial Condition’, or the ‘MD&A’. This contains information about a company’s progress, its financial circumstances and its future plans.

We’ll be extracting the MD&A in this article.

Each of the sections in a 10K filing have numbers associated with them called ‘Item numbers’. The MD&A is Item 7.

Appearing just after the MD&A is another section called ‘Quantitative and Qualitative Disclosures About Market Risk’. This is Item 7A. It’s often quite short but also contains useful information, so we’ll extract this as well.

Unfortunately, the length and location of Items 7 and 7A can vary from filing to filing and between companies. When searching through the text of a 10K filing, there’s no straightforward way of knowing exactly where the contents of Item 7 or 7A will be.

Consider, for example, looking for Item 7 in a 10K filing. The first time a reference to “Item 7” appears is typically in the table of contents. The second time may be the section content that we’re after, but not always. There are often references to Item 7 throughout the filing, in commentary or disclosures or even footnotes, and we won’t know in advance exactly how many times “Item 7” appears and which reference relates to the section content that we wish to extract.

We’ll therefore need to be creative in order to find Items 7 and 7A, and we’ll use regex to help us.
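To see why a simple search isn’t enough, consider a toy example (the snippets below are made up, standing in for scattered references in a real filing): a plain case-insensitive search for “Item 7” matches every passing reference, not just the section heading we want.

```python
import re

# Made-up snippets standing in for scattered references in a filing
sample = ("Item 7. Management's Discussion ... as discussed in Item 7 "
          "above ... see Item 7, footnote 3 ... ITEM 7. MANAGEMENT'S DISCUSSION")

# A naive case-insensitive search matches every reference
hits = [m.start() for m in re.finditer(r'item\s7', sample, re.IGNORECASE)]
print(len(hits))  # 4 matches, but only one marks the real section start
```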

What is regex?

Regex, or ‘regular expressions’, refers to sequences of characters that define a search pattern. These patterns are used to match against text in documents that are being searched. In our case, the documents being searched are the 10K filings.

Regex isn’t new, having first been developed in the 1950s. It has since been used extensively for pattern matching in applications involving text editing and lexical analysis.

Today, regex is available in a number of coding languages and text editors. It has a rich feature set which allows for powerful pattern matching and is a versatile tool for text pre-processing in natural language processing (NLP) applications.

Strictly speaking, regex is intended for use in ‘regular languages’, of which HTML is not one. Nevertheless, in our case we use regex to search through HTML texts of the 10K filings due to the inconsistencies that they contain. These inconsistencies make any sort of text search challenging, and the powerful features of regex will help us find what we need.

I assume a basic familiarity with regex in this article. If you’d like to learn more about regex, here is an excellent resource.
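As a quick refresher, here are the regex building blocks this article relies on — \s for whitespace, character classes like [\.\s], alternation with |, and the re.IGNORECASE flag:

```python
import re

# \s = any whitespace, (7|8) = either alternative,
# [\.\s] = a full stop or whitespace, re.IGNORECASE = ignore case
pattern = re.compile(r'item\s(7|8)[\.\s]', re.IGNORECASE)

# findall returns the captured group for each match
print(pattern.findall("Item 7. ... ITEM 8 ... item 9."))  # ['7', '8']
```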

The search pattern

How do we find exactly where Items 7 and 7A appear in a 10K document?

If we were manually perusing an HTML version of the document, we’d simply go to the table of contents and click on the hyperlink. But if we’re automating the process, it isn’t so straightforward.

The key is to look for patterns (text sequences) that identify the particular section of the filing that we’re searching for. In our case, we want to know where the Item 7/7A section starts and ends.

As mentioned, the second occurrence of “Item 7” in a 10K filing may not refer to the section content. So, what else can we look for to give us a better chance of finding Item 7 (the MD&A)?

A pattern that I’ve found to work in many cases is as follows:

“Item 7”, followed immediately by a full stop or space, followed soon after by the title of the MD&A section

This sequence of text is quite specific to the MD&A section content, rather than mere reference to it elsewhere in the 10K filing. We’ll refer to it as the Item 7 Search Pattern.

To find where the MD&A section ends, we simply look for where the following section starts. Since we’re including Item 7A in our extraction, this means we need to find where Item 8 starts (which immediately follows Item 7A in 10K filings). We use a similar approach to that for finding Item 7 and look for a unique sequence of text that identifies the start of Item 8.

For Item 8, the sequence of text that we’ll be looking for is:

“Item 8”, followed immediately by “.” or “ ”, followed soon after by the title of the section, i.e. “Financial Statements and Supplementary Data”

We’ll call this the Item 8 Search Pattern.
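As a standalone illustration (not the article’s actual two-stage implementation, which follows below), the Item 7 Search Pattern could be sketched as a single regex. The .{0,50}? allows a short run of intervening characters between the item number and the title — the 50-character limit is an assumption for illustration:

```python
import re

# "Item 7", then "." or a space, then (soon after) the MD&A title;
# ".{0,50}?" is an assumed allowance for intervening characters
item7_sketch = re.compile(
    r"item\s*7[\.\s].{0,50}?discussion\sand\sanalysis",
    re.IGNORECASE | re.DOTALL)

heading = "Item 7. Management's Discussion and Analysis of Financial Condition"
print(bool(item7_sketch.search(heading)))  # True
```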

I’ve tested this approach on the most recent 5 years of 10K filings for Tesla, Apple, GM, Mastercard and Microsoft. It works in all of these cases. I’ve been able to identify exactly where the MD&A section starts and ends (including Item 7A) and extract it successfully.

As discussed, 10K filings can vary quite a bit so this approach may not work for other companies or filing years. If you find it isn’t working, simply adjust the sequence of text that you’re looking for so that it’s unique in the filing document that you’re searching through.

If you want to find a different section of the 10K filing, such as a different Item number, simply identify the unique sequence of text for that particular Item number.

Now that we know what we’re looking for, let’s dive into the code!

Implementation in Python

I’ve used Python (v3.7.7) to write the code, so you may need to adjust your code if you’re using a different version of Python.


I found this YouTube video and this github resource to be helpful in writing my code, and I’ve adopted some elements from both of them.

In the following, I first set out the full code, then I step through and explain its key sections.


The full code:

#################################################################################
### Code for Searching (Extracting Sections from) SEC 10K Filings Using Regex ###
#################################################################################

# Import libraries
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define URL for the specific 10K filing
URL_text = r'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/0001564590-20-004475.txt' # Tesla 10K Dec 2019

# Grab the response
response = requests.get(URL_text)

# Parse the response (the XML flag works better than HTML for 10Ks)
soup = BeautifulSoup(response.content, 'lxml')

for filing_document in soup.find_all('document'): # The document tags contain the various components of the total 10K filing pack
    
    # The 'type' tag contains the document type
    document_type = filing_document.type.find(text=True, recursive=False).strip()
    
    if document_type == "10-K": # Once the 10K text body is found
        
        # Grab and store the 10K text body
        TenKtext = filing_document.find('text').extract().text
        
        # Set up the regex pattern
        matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|'
                             r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                             r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)
                                             
        matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
        
        # Set columns in the dataframe
        matches_array.columns = ['SearchTerm', 'Start']
        
        # Get the number of rows in the dataframe
        Rows = matches_array['SearchTerm'].count()
           
        # Create a new column in 'matches_array' called 'Selection' and add adjacent 'SearchTerm' (i and i+1 rows) text concatenated
        count = 0 # Counter to help with row location and iteration
        while count < (Rows-1): # Can only iterate to the second last row
            matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
            count += 1
        
        # Set up 'Item 7/8 Search Pattern' regex patterns
        matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
        matches_item8 = re.compile(r'(item\s8\.(consolidated\sfinancial|financial)\s[a-z]*)')
            
        # Lists to store the locations of Item 7/8 Search Pattern matches
        Start_Loc = []
        End_Loc = []
            
        # Find and store the locations of Item 7/8 Search Pattern matches
        count = 0 # Set up counter
        
        while count < (Rows-1): # Can only iterate to the second last row
            
            # Match Item 7 Search Pattern
            if re.match(matches_item7, matches_array.at[count,'Selection']):
                # Column 1 = 'Start' column in 'matches_array'
                Start_Loc.append(matches_array.iloc[count,1]) # Store in list => Item 7 will be the starting location (column '1' = 'Start' column)
            
            # Match Item 8 Search Pattern
            if re.match(matches_item8, matches_array.at[count,'Selection']):
                End_Loc.append(matches_array.iloc[count,1])
            
            count += 1

        # Extract section of text and store in 'TenKItem7'
        TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]
        
        # Clean newly extracted text
        TenKItem7 = TenKItem7.strip() # Remove starting/ending white spaces
        TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
        TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, if you're on Windows)
        TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace '\xa0' (the non-breaking space that '&nbsp;' parses to) with a space
        while '  ' in TenKItem7:
            TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces

        # Print first 500 characters of newly extracted text
        print(TenKItem7[:500])

Stepping through the code:

Import libraries

In addition to regex, we use ‘requests’ to grab the 10K filing, ‘Beautiful Soup’ to do basic parsing and ‘pandas’ to do some data manipulation.

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

Grab and parse the 10K filing

We’ll look at Tesla’s 2019 10K filing (released in early 2020) in this example. You can find the URL for this and other filings on the EDGAR database.

# Define URL for the specific 10K filing
URL_text = r'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/0001564590-20-004475.txt' # Tesla 10K Dec 2019
# Grab the response
response = requests.get(URL_text)
# Parse the response (the XML flag works better than HTML for 10Ks)
soup = BeautifulSoup(response.content, 'lxml')

Loop through the 10K filing to find its text body

10K filings contain various document types, including charts and exhibits, but we only want the text body of the filing.

for filing_document in soup.find_all('document'):
  # The 'type' tag contains the document type
  document_type = filing_document.type.find(text=True, recursive=False).strip()

Working with the 10K text body

The 10K text body has a document type ‘10-K’, and this will appear only once in a given filing (based on post-2009 filing structures). Once this is found, grab it and store it — I use a variable called ‘TenKtext’ for this.

if document_type == "10-K":  # Once the 10K text body is found

  # Grab and store the 10K text body    
  TenKtext = filing_document.find('text').extract().text

Set up the regex pattern – Stage 1 of the search pattern

We use a 2-stage process to implement the search.

The first stage is the following regex pattern (4 components):

1. (item\s(7[\.\s]|8[\.\s])
2. discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition
3. (consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata
4. re.IGNORECASE

Regex pattern in 4 components

Does it look confusing? Unfortunately regex patterns can be hard to decipher at first glance. Let’s go through it by considering the segments of text that we’re trying to match from the Item 7 and Item 8 Search Patterns:

1. “Item 7”, followed (immediately) by “.” (full stop) or “ ” (space), to identify the start of the section of text that we’re after, and similarly for “Item 8” to identify the end of the section. The portion of the regex pattern which matches these is: (item\s(7[\.\s]|8[\.\s])
2. “Discussion and Analysis of (Consolidated) Financial Condition”, which is the title of Item 7 (the MD&A), noting that we include “Consolidated” as an optional word, since it appears in the MD&A title for some filings. This is matched by: discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition
3. “(Consolidated) Financial Statements and Supplementary Data”, which is the title of Item 8, again noting the optional inclusion of “Consolidated”. This is matched by: (consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata
4. Finally, we wish to capture all upper and lower case versions of the above segments of text, as the capitalization approach varies between filings. We do this by including the ‘re.IGNORECASE’ flag.

Description of regex pattern components

This regex pattern will match all occurrences of the above segments of text in the 10K filing.

matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

Once we find the regex matches, we store the results in a pandas dataframe — I call this ‘matches_array’.

I set up the dataframe with two columns, the first with the matched segment of text, the second with its starting position in the 10K text body. I label these columns ‘SearchTerm’ and ‘Start’ respectively.

I also calculate and store the number of rows in the dataframe — we’ll need this later.

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])

# Set columns in the dataframe
matches_array.columns = ['SearchTerm', 'Start']

# Get the number of rows in the dataframe
Rows = matches_array['SearchTerm'].count()

In our example of the 2019 Tesla 10K, the regex matches are as follows (which are stored in the ‘matches_array’ dataframe):

    SearchTerm                                        Start
0   Item 7.                                           61,103
1   Discussion and Analysis of Financial Condition    61,128
2   Item 8.                                           61,293
3   Financial Statements and Supplementary Data       61,305
4   Discussion and Analysis of Financial Condition    220,720
5   ITEM 7.                                           223,520
6   DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION    223,542
7   Item 7.                                           223,944
8   Discussion and Analysis of Financial Condition    223,965
9   Discussion and Analysis of Financial Condition    298,965
10  ITEM 8.                                           314,729
11  FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA       314,738
12  Item 8                                            572,118

Tesla 2019 10K regex matches

There are 13 matches of the regex pattern (numbered 0 to 12) in the 10K filing. The first match is “Item 7” which appears at position 61,103 in the 10K text body. The second match is “Discussion and Analysis of Financial Condition” at position 61,128, and so on.

Set up the pattern sequence and find the extraction locations – Stage 2 of the search process

The second stage of the search process begins by forming the Item 7 and Item 8 Search Patterns from the matches in ‘matches_array’. We do this by joining pairs of matches in the order that they appear in ‘matches_array’, i.e. concatenating the text of adjacent matches (rows i and i+1) in the ‘SearchTerm’ column of ‘matches_array’. We store the results in a new column called ‘Selection’.

Why do we concatenate adjacent rows/matches? In the Item 7/8 Search Patterns, we’re looking for the titles of the Item 7/8 sections appearing ‘soon after’ the item numbers. This means that the title matches will be the next match after the item number matches in ‘matches_array’. Hence, by concatenating adjacent rows/matches, we’re creating the full Search Pattern sequences (where they exist).

Next, we find the second occurrence of each of the Item 7 and Item 8 Search Patterns in the ‘Selection’ column of ‘matches_array’. This will indicate the start and end points respectively (i.e. the numbers in the ‘Start’ column) for the section of text that we wish to extract.

Note that, although the term “Item 7” (or “Item 8”) may occur in various places in the 10K filing, the whole of the Item 7 Search Pattern and Item 8 Search Pattern occur in a more predictable manner.

For the 10K filings that I’ve successfully tested, the Item 7 and Item 8 Search Patterns occur at least twice in each filing. The first occurrence is in the table of contents and the second occurrence is the section content that we wish to extract.

We exploit this feature in our search process. This is why we want the second occurrences of the Item 7 and Item 8 Search Patterns. They mark the position points (i.e. the start and end points respectively) of the section of text we wish to extract.

# Create a new column in 'matches_array' called 'Selection' and add adjacent 'SearchTerm' (i and i+1 rows) text concatenated
count = 0 # Counter to help with row location and iteration
while count < (Rows-1): # Can only iterate to the second last row
  matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
  count += 1

In our 2019 Tesla 10K example, the output from ‘matches_array’ for the ‘Start’ column and the newly created ‘Selection’ column is:

Start      Selection
61,103     item 7.discussion and analysis of financial co…
61,128     discussion and analysis of financial condition…
61,293     item 8.financial statements and supplementary…
61,305     financial statements and supplementary datadis…
220,720    discussion and analysis of financial condition…
223,520    item 7.discussion and analysis of financial co…
223,542    discussion and analysis of financial condition…
223,944    item 7.discussion and analysis of financial co…
223,965    discussion and analysis of financial condition…
298,251    discussion and analysis of financial condition…
314,729    item 8.financial statements and supplementary…
314,738    financial statements and supplementary dataite…
572,118    NaN

Output from ‘matches_array’ showing ‘Selection’ column

The output shows the concatenated text from adjacent matches (rows i and i+1) in ‘matches_array’.

Note that the last entry is null (‘NaN’), since there is no i+1 row available when the i row is the last row in ‘matches_array’ (we only count up to ‘Rows – 1’ in the ‘while’ loop in the above section of code).
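As a side note, the same concatenation (including the trailing NaN) can be done without an explicit loop using pandas’ shift — a sketch with a cut-down dataframe, not the article’s code:

```python
import pandas as pd

# A cut-down stand-in for 'matches_array'
matches_array = pd.DataFrame(
    {'SearchTerm': ['Item 7.', 'Discussion and Analysis', 'Item 8.']})

# shift(-1) aligns each row with the row below, so '+' concatenates
# adjacent matches; the final row has no successor and becomes NaN
matches_array['Selection'] = (matches_array['SearchTerm']
                              + matches_array['SearchTerm'].shift(-1)
                              ).str.lower()

print(matches_array['Selection'].iloc[0])  # item 7.discussion and analysis
```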

We now need to identify which of the concatenated text entries in the ‘Selection’ column match the Item 7 and Item 8 Search Patterns. We do this by using the following regex patterns:

Item 7 Search Pattern: item\s7\.discussion\s[a-z]*
Item 8 Search Pattern: item\s8\.(consolidated\sfinancial|financial)\s[a-z]*

Regex for Item 7 & 8 Search Patterns

We set up list variables, which we call ‘Start_Loc’ and ‘End_Loc’, to store the Item 7 and Item 8 Search Pattern matches respectively. We then select the second item in each of these lists as the start and end positions of the section of text that we wish to extract.

# Set up 'Item 7/8 Search Pattern' regex patterns
matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
matches_item8 = re.compile(r'(item\s8\.(consolidated\sfinancial|financial)\s[a-z]*)')
            
# Lists to store the locations of Item 7/8 Search Pattern matches
Start_Loc = []
End_Loc = []
# Find and store the locations of Item 7/8 Search Pattern matches
count = 0 # Set up counter
while count < (Rows-1): # Can only iterate to the second last row 
  # Match Item 7 Search Pattern
  if re.match(matches_item7, matches_array.at[count,'Selection']):
    # Column 1 = 'Start' column in 'matches_array'
    Start_Loc.append(matches_array.iloc[count,1])
  
  # Match Item 8 Search Pattern
  if re.match(matches_item8, matches_array.at[count,'Selection']):  
    End_Loc.append(matches_array.iloc[count,1])
  
  count += 1

In our 2019 Tesla 10K example, the above code will find our Item 7 and Item 8 Search Patterns as follows:

found Item 7 Search Pattern at: 61,103
found Item 8 Search Pattern at: 61,293
found Item 7 Search Pattern at: 223,520
found Item 7 Search Pattern at: 223,944
found Item 8 Search Pattern at: 314,729
Items 7 and 8 Search Patterns found

The ‘Start_Loc’ and ‘End_Loc’ list variables will be as follows:

[61103, 223520, 223944]
[61293, 314729]
Start_Loc and End_Loc list variables

The second numbers in each of the list variables will mark the positions of the text we wish to extract. So, in our 2019 Tesla 10K example, the start position will be 223,520 (the second entry in the ‘Start_Loc’ list) and the end position will be 314,729 (the second entry in the ‘End_Loc’ list).
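Selecting the second entry is just an index into each list, though it’s worth guarding against filings where a Search Pattern occurs only once (a sketch, using the Tesla location values above):

```python
# Location values from the Tesla example, for illustration
Start_Loc = [61103, 223520, 223944]
End_Loc = [61293, 314729]

# Take the second occurrence where it exists, else fall back to the first
start = Start_Loc[1] if len(Start_Loc) > 1 else Start_Loc[0]
end = End_Loc[1] if len(End_Loc) > 1 else End_Loc[0]
print(start, end)  # 223520 314729
```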

Extract and clean the section of text

We can now extract the section of text that we’re after from our 10K filing, i.e. Items 7 and 7A, as follows:

TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]

We store the extracted section of text in a new variable called ‘TenKItem7’.

It helps to clean up the newly extracted text, which we can do with the following code:

TenKItem7 = TenKItem7.strip() # Remove start/end white space
TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n with space
TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, if you're on Windows)
TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace '\xa0' (parsed '&nbsp;') with space
while '  ' in TenKItem7:
  TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces
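As an aside, Python’s standard html module can handle all HTML entities (not just non-breaking spaces) in one pass — a hedged alternative to the character-by-character replacements above:

```python
import html

# html.unescape converts entities such as &nbsp; and &#8217; in one pass;
# the unescaped &nbsp; becomes '\xa0', which we then map to a plain space
raw = "Item&nbsp;7. Management&#8217;s Discussion"
clean = html.unescape(raw).replace('\xa0', ' ')
print(clean)  # Item 7. Management's Discussion (with a curly apostrophe)
```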

The output

That’s it! We’ve now identified, extracted and cleaned our section of text… so, what does it look like?

For our 2019 Tesla 10K example, the first ~500 characters of our freshly extracted text is as follows:

ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS The following discussion and analysis should be read in conjunction with the consolidated financial statements and the related notes included elsewhere in this Annual Report on Form 10-K. For discussion related to changes in financial condition and the results of operations for fiscal year 2017-related items, refer to Part II, Item 7. Management’s Discussion and Analysis of Financial Condition and …
First 500 characters of extracted Item 7 (MD&A) from Tesla 2019 10K filing

We’ve done it! We’ve successfully extracted Item 7 (and 7A) from our 10K filing and can now use this for analysis or natural language processing applications.

Conclusion

SEC 10K filings are produced annually by all publicly traded companies in the US. They contain lots of useful information for investors or anyone interested in the affairs of those companies.

Unfortunately, 10K filings tend to be large and unwieldy and they contain inconsistencies between individual filings and companies. Searching through and extracting sections of text from them can be challenging.

I recently encountered this issue when working with SEC 10K filings. Fortunately, I found a solution using the powerful text matching features of regular expressions (regex). Using regex, I searched and extracted the section of text that I wanted from a 10K filing.

Regex is available as versatile and effective text pattern matching packages in various coding languages and text editors. It can be used in many applications of text analytics and NLP preprocessing.

I’ve implemented my regex search process in Python, and the resulting code can serve as a useful starting point, or an illustrative use case, for regex applications.

If you’re interested in using regex for searching through SEC 10K filings, then I hope that this article is helpful for you!
