Topic Modeling of Earnings Calls using Latent Dirichlet Allocation (LDA): Efficient Topic Extraction
Topic modeling can streamline text document analysis by extracting the key topics or themes within the documents. It’s an evolving area of natural language processing that helps to make sense of large volumes of text data. In this article, I show how to apply topic modeling to a set of earnings call transcripts using a popular approach called Latent Dirichlet Allocation (LDA).
Contents
- What is topic modeling?
- Latent Dirichlet Allocation (LDA)
- Model development, evaluation and deployment
- Conclusion
Most listed US companies host earnings calls every quarter. These are conference calls where management discusses financial performance and company updates with analysts, investors, and the media.
Earnings calls are important—they highlight valuable information for investors and provide an opportunity for interaction through Q&A sessions.
There are hundreds of earnings calls held each quarter, often accompanied by the release of detailed transcripts. But the sheer volume of those transcripts makes analyzing them a daunting task.
This is where topic modeling can help—it’s a way to streamline the analysis by identifying and extracting the key topics or themes within the data.
In this article, I show how to apply topic modeling to a set of earnings call transcripts using Latent Dirichlet Allocation and implement the model using Python. I also show how topic modeling can require some judgment, and how you can achieve better results by adjusting key parameters.
What is topic modeling?
Topic modeling is a form of unsupervised learning that can be applied to unstructured data.
In the case of text documents, it extracts topics by identifying words or phrases that have a similar meaning and grouping them (into topics) using statistical techniques.
Topic modeling is useful for organizing text documents based on the topics within them, and for identifying the words that make up each topic. It can be helpful in automating a process for classifying documents or for uncovering concealed meaning (hidden semantic structures) within text data.
When applied to natural language, topic modeling requires interpretation of the identified topics—this is where judgment plays a role. The goal is to ensure that the topics and their allocations make sense for the context and purpose of the modeling exercise.
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.
Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up.
In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.
Each LDA topic model requires:
- A set of documents for training the model—the training corpus
- A dictionary of words to form the vocabulary used in the model—this can be derived from the training corpus
Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.
You can learn more about topic modeling and LDA in this easy-to-follow introduction.
In this article, I show how to implement LDA using the gensim package in Python. This is a powerful yet accessible package for topic modeling.
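As a flavor of what this looks like in practice, here's a minimal, self-contained sketch on a toy corpus (the three "documents" and their tokens are invented purely for illustration):

import gensim
import gensim.corpora as corpora

# Three tiny, pre-tokenized 'documents' (hypothetical)
toy_docs = [['revenue', 'growth', 'quarter', 'margin'],
            ['cloud', 'server', 'software', 'growth'],
            ['store', 'customer', 'brand', 'quarter']]
dictionary = corpora.Dictionary(toy_docs) # Map each word to an ID
bow = [dictionary.doc2bow(doc) for doc in toy_docs] # Bag-of-words counts per document
lda = gensim.models.LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=10, random_state=1)
print(lda.print_topics(num_words=3)) # Top words in each of the 2 topics
print(lda.get_document_topics(bow[0])) # Topic mix of the first toy document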
Model development, evaluation and deployment
In the following, I explain the process of training, evaluating, refining, and applying an LDA topic model in Python (v3.7.7).
I first set out the full code, then step through its key sections in the "Stepping through the code" section below.
The full code:
################################################
### TOPIC MODELING Earnings Call Transcripts ###
################################################
### IMPORT LIBRARIES ###
import requests # If directly requesting URLs
from bs4 import BeautifulSoup # If parsing requested earnings call transcripts
import gensim
import gensim.corpora as corpora
from gensim import models
import matplotlib.pyplot as plt
import spacy
from pprint import pprint
from wordcloud import WordCloud
from mpl_toolkits import mplot3d # For 3D plotting
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1500000 # In case max_length is set to lower than this (ensure sufficient memory)
### GRAB THE TRANSCRIPTS BY PARSING URLs - OPTIONAL ###
### NB. This may sometimes not work due to the Seeking Alpha website blocking your web scraping (thinking you're a bot or harmful in some way)
### If the parsing approach doesn't work, download the files manually - see alternative file grab approach below
#URL_text = r'https://seekingalpha.com/article/4371280-dell-technologies-inc-dell-management-on-q2-2021-results-earnings-call-transcript' # Dell Q2 2021
### Grab the response
#response = requests.get(URL_text)
### Parse the response
#soup = BeautifulSoup(response.content, 'lxml')
### Extract the text portion of the transcript (contained within the 'article' tab, extracting text only from within HTML elements)
#ECallTxt = soup.find('article').text
### ALTERNATIVELY, GRAB THE DOCUMENT FROM TEXT FILE ###
FilePath = r'< your local file path >'
### SETTING UP THE TRAINING CORPUS ###
# Transcripts to form the training corpus
DocList = ['ADSK-Q2-2021', 'ANF-Q2-2020', 'APPEF-Q2-2020', 'BBY-Q2-2020', 'CRM-Q2-2021', 'DELL-Q2-2021',
'DE-Q3-2020', 'DG-Q2-2020', 'DLTR-Q2-2020', 'EL-Q4-2020', 'EV-Q3-2020', 'FLWS-Q4-2020', 'GPS-Q2-2020',
'HPQ-Q3-2020', 'INTU-Q4-2020', 'JWN-Q2-2020', 'MRVL-Q2-2021', 'NVDA-Q2-2021', 'SPLK-Q2-2021', 'TD-Q3-2020',
'TOL-Q3-2020', 'TSLA-Q2-2020', 'VMW-Q2-2021', 'WDAY-Q2-2021', 'TGT-Q2-2020', 'BJ-Q2-2020', 'A-Q3-2020', 'HD-Q2-2020',
'KSS-Q2-2020', 'ADMP-Q2-2020', 'FL-Q2-2020', 'GASS-Q2-2020', 'ADI-Q3-2020', 'WMT-Q2-2021', 'CODX-Q2-2020', 'ECC-Q2-2020']
### TEXT PRE-PROCESSING ###
ECallDocuments = [] # List to store all documents in the training corpus as a 'list of lists'
ECallWordCloud = [] # Single list version of the training corpus documents for WordCloud
# Loop through all documents in the training corpus
for doc in DocList:
ECallTxt = open(FilePath + doc + '.txt', 'r').read() # Open text file, including the 'read' flag to convert the file object to a string
# Clean text
ECallTxt = ECallTxt.strip() # Remove white space at the beginning and end
ECallTxt = ECallTxt.replace('\n', ' ') # Replace the \n (new line) character with space
    ECallTxt = ECallTxt.replace('\r', '') # Remove \r (carriage return, if the file was saved on Windows)
    ECallTxt = ECallTxt.replace('\xa0', ' ') # Replace non-breaking spaces ('\xa0', rendered from &nbsp; in HTML) with regular spaces
while ' ' in ECallTxt:
ECallTxt = ECallTxt.replace(' ', ' ') # Remove extra spaces
# Parse document with SpaCy
ECall = nlp(ECallTxt)
ECallDoc = [] # Temporary list to store individual document
# Further cleaning and selection of text characteristics
for token in ECall:
if token.is_stop == False and token.is_punct == False and (token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ =="VERB"): # Retain words that are not a stop word nor punctuation, and only if a Noun, Adjective or Verb
ECallDoc.append(token.lemma_.lower()) # Convert to lower case and retain the lemmatized version of the word (this is a string object)
ECallWordCloud.append(token.lemma_.lower()) # Build the WordCloud list
ECallDocuments.append(ECallDoc) # Build the training corpus 'list of lists'
# Generate and plot WordCloud for full training corpus
wordcloud = WordCloud(background_color="white").generate(','.join(ECallWordCloud)) # NB. 'join' method used to convert the documents list to text format
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
### NUMERIC REPRESENTATION OF TRAINING CORPUS USING BAG OF WORDS AND TF-IDF ###
# Form dictionary by mapping word IDs to words
ID2word = corpora.Dictionary(ECallDocuments)
# Set up Bag of Words and TFIDF
corpus = [ID2word.doc2bow(doc) for doc in ECallDocuments] # Apply Bag of Words to all documents in training corpus
TFIDF = models.TfidfModel(corpus) # Fit TF-IDF model
trans_TFIDF = TFIDF[corpus] # Apply TF-IDF model
### SET UP & TRAIN LDA MODEL ###
SEED = 75 # Set random seed
NUM_topics = 3 # Set number of topics
ALPHA = 0.9 # Set alpha
ETA = 0.35 # Set eta
# Train LDA model on the training corpus
lda_model = gensim.models.LdaMulticore(corpus=trans_TFIDF, num_topics=NUM_topics, id2word=ID2word, random_state=SEED, alpha=ALPHA, eta=ETA, passes=100)
# Print topics generated from the training corpus
pprint(lda_model.print_topics(num_words=10))
### CALCULATE COHERENCE SCORE ###
# Set up coherence model
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=ECallDocuments, dictionary=ID2word, coherence='c_v')
# Calculate and print coherence
coherence_lda = coherence_model_lda.get_coherence()
print('-'*50)
print('\nCoherence Score:', coherence_lda)
print('-'*50)
### PRINT TOPIC WORD CLOUDS ###
topic = 0 # Initialize counter
while topic < NUM_topics:
# Get topics and frequencies and store in a dictionary structure
topic_words_freq = dict(lda_model.show_topic(topic, topn=50)) # NB. the 'dict()' constructor builds dictionaries from sequences (lists) of key-value pairs - this is needed as input for the 'generate_from_frequencies' word cloud function
topic += 1
# Generate Word Cloud for topic using frequencies
wordcloud = WordCloud(background_color="white").generate_from_frequencies(topic_words_freq)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
### GET TOPIC ALLOCATIONS FOR TRAINING CORPUS DOCUMENTS ###
doc_no = 0 # Set document counter
for doc in ECallDocuments:
TFIDF_doc = TFIDF[corpus[doc_no]] # Apply TFIDF model to individual documents
print(lda_model.get_document_topics(TFIDF_doc)) # Get and print document topic allocations
doc_no += 1
print('-'*50)
### APPLY TRAINED MODEL TO NEW TRANSCRIPTS - SET UP ###
# New documents on which to apply trained LDA model
NewDocList = ['EAST-Q2-2020', 'SQBG-Q2-2020', 'TTNP-Q2-2020',
'FSM-Q2-2020', 'SNDL-Q2-2020', 'NVGS-Q2-2020']
NewDocuments = [] # List for new documents as a 'list of lists'
# Loop through new documents
for doc in NewDocList:
ECallTxt = open(FilePath + doc + '.txt', 'r').read() # Opening text file, include the 'read' flag to convert the file to a string
# Clean text
ECallTxt = ECallTxt.strip() # Remove white space at the beginning and end
ECallTxt = ECallTxt.replace('\n', ' ') # Replace the \n (new line) character with space
    ECallTxt = ECallTxt.replace('\r', '') # Remove \r (carriage return, if the file was saved on Windows)
    ECallTxt = ECallTxt.replace('\xa0', ' ') # Replace non-breaking spaces ('\xa0', rendered from &nbsp; in HTML) with regular spaces
while ' ' in ECallTxt:
ECallTxt = ECallTxt.replace(' ', ' ') # Remove extra spaces
# Parse document with SpaCy
ECall = nlp(ECallTxt)
ECallDoc = [] # Temporary list to store individual document
# Further cleaning and selection of text characteristics
for token in ECall:
if token.is_stop == False and token.is_punct == False and (token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ =="VERB"): # Retain words that are not a stop word nor punctuation, and only if a Noun, Adjective or Verb
ECallDoc.append(token.lemma_.lower()) # Convert to lower case and retain the lemmatized version of the word (this is a string)
NewDocuments.append(ECallDoc) # Build the 'list of lists' for the new documents
### APPLY TRAINED MODEL TO NEW TRANSCRIPTS - GET TOPIC ALLOCATIONS ###
NewDocumentTopix = [] # For plotting the new document topics
new_corpus = [ID2word.doc2bow(doc) for doc in NewDocuments] # Apply Bag of Words to the new documents (once, outside the loop)
for doc_no in range(len(new_corpus)):
    TFIDF_doc = TFIDF[new_corpus[doc_no]] # Apply the TF-IDF model fitted on the training corpus
    NewDocumentTopix.append(lda_model.get_document_topics(TFIDF_doc)) # Get the new document topic allocations and store for plotting
    print(NewDocumentTopix[doc_no]) # Print new document topic allocations
print('-'*50)
### PLOTTING NEW TRANSCRIPTS BY TOPICS ###
# Plotting topic distributions of the new transcripts
# Initialize 3D plot
ax = plt.axes(projection='3d')
# Get data points
x_data = []
y_data = []
z_data = []
for data_point in NewDocumentTopix:
    # 'data_point' is one element of NewDocumentTopix (level 1), holding a (topic, allocation) pair for each of the 3 topics (level 2), each pair having 2 elements (level 3)
    x_data.append(data_point[0][1]) # Topic 0 allocation (first pair; the allocation is the pair's 2nd element)
    y_data.append(data_point[1][1]) # Topic 1 allocation (second pair)
    z_data.append(data_point[2][1]) # Topic 2 allocation (third pair)
# Plot the topic allocations in 3D: Topics 0 & 1 form the 'base' and Topic 2 the 'height'
ax.scatter3D(x_data, y_data, z_data)
ax.set_xlabel("Topic 0")
ax.set_ylabel("Topic 1")
ax.set_zlabel("Topic 2")
plt.legend(["Topic Distribution"], loc='best')
plt.show()
# Plot topics data 2D (x and y axes only)
plt.scatter(x_data, y_data, marker='o')
plt.xlabel("Topic 0")
plt.ylabel("Topic 1")
plt.legend(("Topic Distribution"), loc='best')
plt.show()
# Plot topics data 2D (z and y axes only)
plt.scatter(z_data, y_data, marker='o')
plt.xlabel("Topic 2")
plt.ylabel("Topic 1")
plt.legend(("Topic Distribution"), loc='best')
plt.show()
### INVESTIGATING COHERENCE BY VARYING KEY PARAMETERS ###
# Coherence values for varying alpha
def compute_coherence_values_ALPHA(corpus, dictionary, num_topics, seed, texts, start, limit, step):
coherence_values = []
model_list = []
for alpha in range(start, limit, step):
model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=seed, alpha=alpha/10, passes=100)
model_list.append(model)
coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values_ALPHA(dictionary=ID2word, corpus=trans_TFIDF, num_topics=NUM_topics, seed=SEED, texts=ECallDocuments, start=1, limit=10, step=1)
# Plot graph of coherence values by varying alpha
limit=10; start=1; step=1;
x_axis = []
for x in range(start, limit, step):
x_axis.append(x/10)
plt.plot(x_axis, coherence_values)
plt.xlabel("Alpha")
plt.ylabel("Coherence score")
plt.legend(("coherence"), loc='best')
plt.show()
# Coherence values for varying eta
def compute_coherence_values_ETA(corpus, dictionary, num_topics, seed, alpha, texts, start, limit, step):
coherence_values = []
model_list = []
for eta in range(start, limit, step):
model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=seed, alpha=alpha, eta=eta/100, passes=100)
model_list.append(model)
coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values_ETA(corpus=trans_TFIDF, dictionary=ID2word, num_topics=NUM_topics, seed=SEED, alpha=ALPHA, texts=ECallDocuments, start=25, limit=50, step=1)
# Plot graph of coherence values by varying eta
limit=50; start=25; step=1;
x_axis = []
for x in range(start, limit, step):
x_axis.append(x/100)
plt.plot(x_axis, coherence_values)
plt.xlabel("Eta")
plt.ylabel("Coherence score")
plt.legend(("coherence"), loc='best')
plt.show()
# Coherence values for varying number of topics
def compute_coherence_values_TOPICS(corpus, dictionary, alpha, seed, texts, start, limit, step):
coherence_values = []
model_list = []
for num_topics in range(start, limit, step):
model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, alpha=alpha, num_topics=num_topics, random_state=seed, passes=100)
model_list.append(model)
coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values_TOPICS(corpus=trans_TFIDF, dictionary=ID2word, alpha=ALPHA, seed=SEED, texts=ECallDocuments, start=2, limit=10, step=1)
# Plot graph of coherence values by varying number of topics
limit=10; start=2; step=1;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence"), loc='best')
plt.show()
# Coherence values for varying seed
def compute_coherence_values_SEED(corpus, dictionary, alpha, num_topics, texts, start, limit, step):
coherence_values = []
model_list = []
for seed in range(start, limit, step):
model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, alpha=alpha, num_topics=num_topics, random_state=seed, passes=100)
model_list.append(model)
coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values_SEED(corpus=trans_TFIDF, dictionary=ID2word, alpha=ALPHA, num_topics=NUM_topics, texts=ECallDocuments, start=60, limit=125, step=5)
# Plot graph of coherence values by varying seed
limit=125; start=60; step=5;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Random Seed")
plt.ylabel("Coherence score")
plt.legend(("coherence"), loc='best')
plt.show()
Stepping through the code:
Importing libraries
We first import libraries for requesting and parsing earnings call transcripts (requests and BeautifulSoup), text pre-processing (spaCy), displaying results (matplotlib, pprint, and wordcloud), and LDA (gensim).
### IMPORT LIBRARIES ###
import requests # If directly requesting URLs
from bs4 import BeautifulSoup # If parsing requested earnings call transcripts
import gensim
import gensim.corpora as corpora
from gensim import models
import matplotlib.pyplot as plt
import spacy
from pprint import pprint
from wordcloud import WordCloud
from mpl_toolkits import mplot3d # For 3D plotting
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1500000 # In case max_length is set to lower than this (ensure sufficient memory)
Sourcing earnings call transcripts
Earnings call transcripts are available from company websites or through third-party providers. One popular source is the Seeking Alpha website, from which recent transcripts are freely available.
Individual transcripts can be parsed directly through URL links. The following is an example for a Dell earnings call transcript. I store the resulting text in a variable called ECallTxt.
### GRAB THE TRANSCRIPTS BY PARSING URLs ###
URL_text = r'https://seekingalpha.com/article/4371280-dell-technologies-inc-dell-management-on-q2-2021-results-earnings-call-transcript' # Dell Q2 2021 (example)
# Grab the response
response = requests.get(URL_text)
# Parse the response
soup = BeautifulSoup(response.content, 'lxml')
# Extract the text portion of the transcript (contained within the 'article' tab, extracting text only from within HTML elements)
ECallTxt = soup.find('article').text
Unfortunately, this approach doesn't always work; in my experience, the Seeking Alpha website sometimes blocks it, possibly due to its site protection mechanisms. To get around this, I separately downloaded the transcripts and stored them in local text files.
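If you'd prefer to keep scraping, one workaround that sometimes helps (no guarantees, and subject to the site's terms of use) is to send a browser-like User-Agent header with the request:

# Hypothetical header value; substitute a current browser string
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(URL_text, headers=headers)
response.raise_for_status() # Fail loudly if the request was blocked
soup = BeautifulSoup(response.content, 'lxml')
ECallTxt = soup.find('article').text

For the rest of this article, I work from the locally saved files: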
FilePath = r'< your local file path >'
I downloaded a set of 36 earnings call transcripts for this exercise, covering a range of different companies and industries.
I list them in a variable called DocList. These will form the training corpus of our model.
# Transcripts to form the training corpus (36 in total)
DocList = ['ADSK-Q2-2021', 'ANF-Q2-2020', 'APPEF-Q2-2020', ... ,'ECC-Q2-2020']
Text pre-processing
We prepare the transcripts for topic modeling by cleaning them (removing special characters and extra spaces), removing stop words and punctuation, lemmatizing, and selecting the parts of speech we wish to retain.
We'll keep nouns, adjectives, and verbs, which seems to work well for this corpus.
You can learn more about text pre-processing as part of the broader natural language processing workflow.
We prepare each transcript in turn by looping through DocList and collecting the results in two lists:
- A list of lists, made up of the 36 transcripts forming the training corpus, with each transcript being a list of words. I call this ECallDocuments.
- A single list of all the words in the 36 transcripts, which I use later for illustrating the training corpus as a word cloud. I call this ECallWordCloud.
### TEXT PRE-PROCESSING ###
ECallDocuments = [] # List to store all documents in the training corpus as a 'list of lists'
ECallWordCloud = [] # Single list version of the training corpus documents for WordCloud
# Loop through all documents in the training corpus
for doc in DocList:
ECallTxt = open(FilePath + doc + '.txt', 'r').read() # Open text file, including the 'read' flag to convert the file object to a string
# Clean text
ECallTxt = ECallTxt.strip() # Remove white space at the beginning and end
ECallTxt = ECallTxt.replace('\n', ' ') # Replace the \n (new line) character with space
    ECallTxt = ECallTxt.replace('\r', '') # Remove \r (carriage return, if the file was saved on Windows)
    ECallTxt = ECallTxt.replace('\xa0', ' ') # Replace non-breaking spaces ('\xa0', rendered from &nbsp; in HTML) with regular spaces
while ' ' in ECallTxt:
ECallTxt = ECallTxt.replace(' ', ' ') # Remove extra spaces
# Parse document with SpaCy
ECall = nlp(ECallTxt)
ECallDoc = [] # Temporary list to store individual document
# Further cleaning and selection of text characteristics
for token in ECall:
if token.is_stop == False and token.is_punct == False and (token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ =="VERB"): # Retain words that are not a stop word nor punctuation, and only if a Noun, Adjective or Verb
ECallDoc.append(token.lemma_.lower()) # Convert to lower case and retain the lemmatized version of the word (this is a string object)
ECallWordCloud.append(token.lemma_.lower()) # Build the WordCloud list
ECallDocuments.append(ECallDoc) # Build the training corpus 'list of lists'
Inspecting the training corpus
It’s helpful to have an idea of what the training corpus looks like — there are various ways to do this, but I like using word clouds.
# Generate and plot WordCloud for full training corpus
wordcloud = WordCloud(background_color="white").generate(','.join(ECallWordCloud)) # NB. 'join' method used to convert the documents list to text format
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The word cloud of the training corpus shows a number of words that you might expect to see in company earnings calls — words related to customers, earnings periods (“quarter”), financials (“increase” and “growth”), and words to express opinions (“think” and “expect”).
Training the LDA model
To train our LDA model, we first form a dictionary by mapping the training corpus to word IDs. We then convert the words in each transcript to numbers using a bag-of-words representation.
# Map training corpus to IDs
ID2word = corpora.Dictionary(ECallDocuments)
# Convert all transcripts in training corpus using bag-of-words
train_corpus = [ID2word.doc2bow(doc) for doc in ECallDocuments]
We use the gensim package to generate our LDA model. This requires training the model using our training corpus and selecting the number of topics as an input. Since we don’t know how many topics are likely to emerge from the training corpus, let’s start with 5.
NUM_topics = 5 # Set number of topics
# Train LDA model on the training corpus
lda_model = gensim.models.LdaMulticore(corpus=train_corpus, num_topics=NUM_topics, id2word=ID2word, passes=100)
The passes parameter sets the number of iterations over the corpus during training — the higher, the better-defined the topics, albeit at the cost of extra processing time. We set passes=100 to produce better results.
I use the LdaMulticore version of the model, which makes use of parallelization for faster processing. If your machine cannot accommodate this, you can use the standard LDA model offered by gensim instead.
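A minimal sketch of the single-core fallback, assuming the same arguments carry over (they do for the ones used here):

# Standard (single-core) LDA with the same inputs
lda_model = gensim.models.LdaModel(corpus=train_corpus, num_topics=NUM_topics, id2word=ID2word, passes=100)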
Observing the topics
You can observe the key words in each topic that results from the training.
# Print topics generated from the training corpus
pprint(lda_model.print_topics(num_words=4))
Use pprint to print in an easier-to-read format—the above code prints the top 4 keywords in each of the 5 topics generated through training:
TOPIC 0: 0.023*"customer" + 0.019*"year" + 0.014*"think" + 0.012*"quarter"
TOPIC 1: 0.017*"quarter" + 0.015*"year" + 0.012*"store" + 0.011*"customer"
TOPIC 2: 0.015*"test" + 0.010*"think" + 0.009*"go" + 0.009*"question"
TOPIC 3: 0.020*"quarter" + 0.014*"think" + 0.013*"year" + 0.010*"market"
TOPIC 4: 0.020*"cloud" + 0.016*"year" + 0.015*"™" + 0.010*"customer"
Also shown is the weight of each keyword within its topic (the decimal number next to each keyword); this reflects how strongly the word is associated with the topic, rather than a raw frequency count.
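If you prefer the raw (word, weight) pairs over the formatted strings, show_topic returns them directly. A quick sketch:

# Print each topic's top 4 (word, weight) pairs as plain tuples
for topic_id in range(NUM_topics):
    print(topic_id, lda_model.show_topic(topic_id, topn=4))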
How good are these topics?
To answer this question, let’s evaluate our model results.
Model evaluation
There are several ways to evaluate LDA models, and they’re not all based on numbers. Context has a role to play, as does the practical usefulness of the generated topics.
The considerations and challenges of topic model evaluation are discussed further in this article.
In terms of quantitative measures, a common way to evaluate LDA models is through the coherence score.
The coherence score of an LDA model measures the degree of semantic similarity between words in each topic.
All else equal, a higher coherence score is better, as it indicates a higher degree of likeness in the meaning of the words within each topic.
We can measure our model's coherence using the CoherenceModel class within gensim.
### CALCULATE COHERENCE SCORE ###
# Set up coherence model
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=ECallDocuments, dictionary=ID2word, coherence='c_v')
# Calculate and print coherence
coherence_lda = coherence_model_lda.get_coherence()
print('-'*50)
print('\nCoherence Score:', coherence_lda)
print('-'*50)
We choose coherence='c_v' as our coherence method, which has been shown to be effective and is a popular choice. Using this method, coherence scores between 0 and 1 are typical.
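As a sanity check, you can also compute the faster UMass coherence, which scores topics against the corpus itself rather than a sliding window over the texts. Its scale differs (values are typically negative, with scores closer to zero being better), so compare trends rather than raw numbers:

# UMass coherence works from the corpus rather than the tokenized texts
coherence_umass = gensim.models.CoherenceModel(model=lda_model, corpus=train_corpus, dictionary=ID2word, coherence='u_mass').get_coherence()
print('UMass Coherence Score:', coherence_umass)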
The coherence score of our model is:
Coherence Score: 0.289907
This does not appear to be very high… is it possible to improve this?
Model improvement
LDA models can be improved by adjusting the text representation stage or by changing model parameters.
Text representation:
We used a bag-of-words approach to convert our words to numbers. Whilst straightforward, bag-of-words treats every word as equally informative, so frequent but uninformative words can dominate the resulting sparse vectors. This can lead to poor results.
An alternative approach is TF-IDF. This adjusts for words that appear frequently but have low semantic value, relative to words that appear infrequently but with higher semantic value. This tends to produce better results.
We can apply TF-IDF to our model by recalculating our corpus.
# Set up Bag of Words and TFIDF
corpus = [ID2word.doc2bow(doc) for doc in ECallDocuments] # Apply Bag of Words to all documents in training corpus
TFIDF = models.TfidfModel(corpus) # Fit TF-IDF model
trans_TFIDF = TFIDF[corpus] # Apply TF-IDF model
Our training corpus now becomes trans_TFIDF.
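To see what the re-weighting does, you can compare a document's raw counts against its TF-IDF weights (a quick inspection sketch; note that TF-IDF assigns zero weight to words appearing in every document):

bow_doc = corpus[0] # First training document as (word ID, count) pairs
tfidf_weights = dict(TFIDF[bow_doc]) # Same document as {word ID: TF-IDF weight}
for word_id, count in bow_doc[:5]:
    print(ID2word[word_id], count, round(tfidf_weights.get(word_id, 0.0), 4))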
We can now re-train our model and observe the updated coherence score:
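A minimal re-training call, assuming the earlier arguments stay the same:

# Re-train the LDA model on the TF-IDF-weighted corpus (other arguments unchanged)
lda_model = gensim.models.LdaMulticore(corpus=trans_TFIDF, num_topics=NUM_topics, id2word=ID2word, passes=100)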
Coherence Score: 0.437026
That’s an improvement!
Model parameters:
The parameters that we will change to try and improve our model’s coherence are:
- Number of topics — We had arbitrarily selected 5 as a starting point, but we can adjust this.
- Random seed — If this is not set to a specific number, the model results will vary with each run. We can set this to a number that leads to better coherence.
- Alpha — This determines the document-topic density, i.e., the extent to which topics are distributed amongst documents. It is chosen automatically unless we specify it. A low alpha results in fewer topics per document; a high alpha results in more topics per document. This is an area where some judgment is helpful — a more even topic distribution can make it easier to differentiate documents in cases where a single topic would otherwise dominate. This implies selecting a higher alpha in such cases, even if it results in a lower coherence score.
- Eta — This determines the topic-word density, i.e., the extent to which words are distributed amongst topics. It is also chosen automatically unless we specify it. With a high eta, topics are made up of more words from the corpus than with a low eta.
You can explore the effect of changing the above parameters by calculating our model’s coherence for a range of different parameter values and plotting the results.
The following code shows how to calculate coherence for varying values of the alpha parameter.
# Coherence values for varying alpha
def compute_coherence_values_ALPHA(corpus, dictionary, num_topics, seed, texts, start, limit, step):
coherence_values = []
model_list = []
for alpha in range(start, limit, step):
model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=seed, alpha=alpha/10, passes=100)
model_list.append(model)
coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values_ALPHA(dictionary=ID2word, corpus=trans_TFIDF, num_topics=NUM_topics, seed=SEED, texts=ECallDocuments, start=1, limit=10, step=1)
# Plot graph of coherence values by varying alpha
limit=10; start=1; step=1;
x_axis = []
for x in range(start, limit, step):
x_axis.append(x/10)
plt.plot(x_axis, coherence_values)
plt.xlabel("Alpha")
plt.ylabel("Coherence score")
plt.legend(("coherence"), loc='best')
plt.show()
You may need some trial and error as you explore the effects of changing parameters. This is because the parameters interact—changing any one parameter affects the coherence scores produced under the other parameters, and vice versa.
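One way to manage this interaction is a small joint sweep. The sketch below (the grid values are arbitrary, and passes=100 makes it slow) keeps whichever alpha/eta pair scores highest, assuming SEED and NUM_topics are set as above:

best_params, best_score = None, -1.0
for alpha in [0.3, 0.6, 0.9]: # Arbitrary illustrative grid
    for eta in [0.15, 0.35, 0.55]:
        model = gensim.models.LdaMulticore(corpus=trans_TFIDF, id2word=ID2word, num_topics=NUM_topics, random_state=SEED, alpha=alpha, eta=eta, passes=100)
        score = gensim.models.CoherenceModel(model=model, texts=ECallDocuments, dictionary=ID2word, coherence='c_v').get_coherence()
        if score > best_score:
            best_params, best_score = (alpha, eta), score
print('Best (alpha, eta):', best_params, 'with coherence:', best_score)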
In our model's case, I calculated coherence scores across a range of values for each parameter (the plots produced by the code above).
Balancing the effects of the different parameter choices, and using a bit of judgement (particularly with alpha—weighing coherence against document-topic density), I chose the following parameters:
- Number of topics = 3
- Random seed = 75
- Alpha = 0.9
- Eta = 0.35
With these parameter choices, what does our model’s coherence now look like?
Coherence Score: 0.535576
That’s looking good!
Model results
Now that we’ve fine-tuned our model and have selected our parameters, let’s take a look at the topics that it generates (remembering there are now 3 topics). Once again, we can use word clouds.
### PRINT TOPIC WORD CLOUDS ###
topic = 0 # Initialize counter
while topic < NUM_topics:
# Get topics and frequencies and store in a dictionary structure
topic_words_freq = dict(lda_model.show_topic(topic, topn=50)) # NB. the 'dict()' constructor builds dictionaries from sequences (lists) of key-value pairs - this is needed as input for the 'generate_from_frequencies' word cloud function
topic += 1
# Generate Word Cloud for topic using frequencies
wordcloud = WordCloud(background_color="white").generate_from_frequencies(topic_words_freq)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
How do these topics look? At first glance, they appear to have some intuitive sense.
LDA modeling won’t label topics for you. In fact, it won’t tell you much about the topics other than their word distributions and certain metrics like coherence. What the topics imply in a practical sense depends on how you wish to interpret them.
For our 3 topics, let’s take a closer look. We’ll generate the document-topic densities across the 36 documents in our training corpus.
### GET TOPIC ALLOCATIONS FOR TRAINING CORPUS DOCUMENTS ###
doc_no = 0 # Set document counter
for doc in ECallDocuments:
TFIDF_doc = TFIDF[corpus[doc_no]] # Apply TFIDF model to individual documents
print(lda_model.get_document_topics(TFIDF_doc)) # Get and print document topic allocations
doc_no += 1
Below are the document-topic densities for 3 of the transcripts in the training corpus:
Highest Topic 0: [(0, 0.8202308), (1, 0.0998479), (2, 0.07992126)]
Highest Topic 1: [(0, 0.07596862), (1, 0.8596242), (2, 0.06440719)]
Highest Topic 2: [(0, 0.21165603), (1, 0.31834567), (2, 0.46999827)]
The above 3 transcripts have the highest allocations to Topics 0, 1 and 2 respectively.
The first transcript (highest Topic 0) is from NVIDIA Corporation. This is a technology company that manufactures computer hardware and peripherals.
NVIDIA’s GPUs are in fact prominent in hardware setups for modern deep learning applications.
Not surprisingly, NVIDIA’s earnings call features plenty of discussion around cloud computing, servers, and related areas of technology. This suggests that Topic 0 may be technology-related.
The second transcript (highest Topic 1) is from Kohl’s Corporation. This is a retail product and department store company.
Kohl’s earnings call discusses ‘comps’ (financial metric comparisons between earnings periods), retail stores, holiday periods (sales figures), and retail brands. This suggests that Topic 1 is retail-related.
The third transcript (highest Topic 2) is from StealthGas Incorporated. This company specializes in transporting petrochemical and gas products.
Stealth’s earnings call features plenty of discussion around vessels, charters, ships, voyages, and (cubic) capacity. This suggests that Topic 2 is related to logistics.
Based on these observations, we can label the 3 topics as follows:
- Topic 0 — IT & Technology
- Topic 1 — Retail & Customer Management
- Topic 2 — Logistics
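With labels in hand, a small helper (hypothetical, not part of the original pipeline) can tag each training transcript with its dominant topic:

# Map topic IDs to our chosen labels and report each transcript's dominant topic
topic_labels = {0: 'IT & Technology', 1: 'Retail & Customer Management', 2: 'Logistics'}
for doc_no, doc_name in enumerate(DocList):
    doc_topics = lda_model.get_document_topics(TFIDF[corpus[doc_no]])
    top_id, top_prob = max(doc_topics, key=lambda t: t[1]) # Highest-probability topic
    print(f"{doc_name}: {topic_labels[top_id]} ({top_prob:.0%})")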
We’ve now successfully trained our LDA topic model and have identified sensible topics. The model is now ready to deploy on new earnings call transcripts which do not form a part of the training corpus.
Deploying the model on new transcripts
I selected 6 new earnings call transcripts to assess our newly trained topic model.
We go through a similar process of preparing these new transcripts as we did for the training corpus, i.e., cleaning, removing stop words and punctuation, lemmatizing, and selecting the parts of speech to retain. (Refer to the section "APPLY TRAINED MODEL TO NEW TRANSCRIPTS - SET UP" in the full code listing above.)
We can then apply our trained model to the new transcripts.
### APPLY TRAINED MODEL TO NEW TRANSCRIPTS - GET TOPIC ALLOCATIONS ###
NewDocumentTopix = [] # For plotting the new document topics
new_corpus = [ID2word.doc2bow(doc) for doc in NewDocuments] # Apply Bag of Words to the new documents (once, outside the loop)
for doc_no in range(len(new_corpus)):
    TFIDF_doc = TFIDF[new_corpus[doc_no]] # Apply the TF-IDF model fitted on the training corpus
    NewDocumentTopix.append(lda_model.get_document_topics(TFIDF_doc)) # Get the new document topic allocations and store for plotting
    print(NewDocumentTopix[doc_no]) # Print new document topic allocations
Here, I convert the new transcripts to bag-of-words form (new_corpus) and weight them with the TF-IDF model fitted on the training corpus—new documents should be transformed with the already-trained model, rather than a freshly fitted one, so their representations are consistent with the training corpus. I also create a list called NewDocumentTopix to store the new document-topic densities, for plotting and investigating the topic distributions of the new transcripts.
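A quick pretty-printer (again hypothetical) makes these allocations easier to scan:

# Pair each new transcript's name with its topic mix, formatted as percentages
for doc_name, doc_topics in zip(NewDocList, NewDocumentTopix):
    mix = ', '.join(f"Topic {topic_id}: {prob:.0%}" for topic_id, prob in doc_topics)
    print(f"{doc_name} -> {mix}")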
Observing the document-topic densities of the new transcripts
Let’s see how our new transcripts are allocated to our topics. An easy way to do this is to plot the topic distributions. (Refer to the section “PLOTTING NEW TRANSCRIPTS BY TOPICS” in the full code listing above)
The code plots a 3D graph (since there are 3 topics) and two 2D graphs (2 topics at a time) to assist our intuition.
The 3D graph shows how our 6 new transcripts are allocated amongst the 3 topics (each dot represents one transcript). If we think of the 3D graph as a cube, the vertical axis is for Topic 2, while the two horizontal axes forming the 'base' of the cube are for Topic 0 and Topic 1.
Based on the 3D graph, we see that there’s a good distribution of topics amongst the transcripts. This implies decent grounds for differentiation between the transcripts based on the topic allocations.
I’ve marked two of the 6 transcripts in the above charts — one with a (red) circle and the other with a (green) square. Let’s look closer at these two transcripts to see if our topic model makes sense.
The red circle transcript is for Titan Pharmaceuticals. This has a topic distribution of 57% Topic 0, 31% Topic 1, and 12% Topic 2. This implies that the majority of the earnings call discussed technology-related areas and most of the remainder discussed retail or customer-related issues.
Although not quite as obvious as for NVIDIA, the Titan earnings call did spend some time discussing digital communications, testing, design, and commercial operations. There is also some discussion on branding and customers. So, the topic distribution seems reasonable.
The green square transcript is for Sundial Growers, a cannabis grower and distributor. This has a topic distribution of 20% Topic 0, 71% Topic 1, and 9% Topic 2. This implies a majority discussion around Topic 1, retail and customer management, in the earnings call.
There is indeed lots of discussion around retail stores, customers, earnings period comparisons, and branding on the Sundial Growers transcript. This topic distribution certainly seems credible.
Our new transcripts, therefore, appear to have been sensibly allocated to topics based on our trained model.
Better results can be achieved, of course, with a larger training corpus and more attention to parameter fine-tuning.
Nevertheless, this simple exercise demonstrates how effective topic modeling can be, with good results being achievable with relative ease by using tools such as gensim.
Conclusion
Topic modeling is an evolving area of natural language processing.
It helps with streamlining the analysis and classification of text documents by identifying their underlying semantic structure.
Using the popular LDA approach, implemented in Python, I show how to apply topic modeling to company earnings call transcripts.
We see that, with a relatively simple process, we can train a model with a small set of transcripts and use the model to successfully classify new (unseen) transcripts based on their topic distributions.
Topic modeling has a bright future in the evolution of natural language processing.
Along with other emerging technologies, it has the potential to improve countless processes that involve the organization and interpretation of text data.