Topic Modeling in NLP: Conceptualization and Implementation
Picture this: we need to build a classification system for an e-book platform that hosts sociological and scientific research!
The titles, content, and authors are known to us. How do we proceed from here? Should we really read every article in full just to spot similarities and differences?
Here’s an easy way out — Topic Modeling!
As the name suggests, our goal is to find the underlying topics that organize this collection. Topic Modeling is a class of unsupervised Machine Learning techniques, which means we don’t require a set of target values for its implementation.
LDA (Latent Dirichlet Allocation) is the most popular topic modeling technique, and the rest of our discussion will focus on it.
Each topic may be visualized as a pool of words. The words that make up a document are drawn from these pools.
Put simply, words group into topics, and topics mix to form documents.
As technology enthusiasts, we must all be familiar with the word “mining” (say, word type W) in the text processing context. But what if we come across a document (say, D) discussing the adverse effects of “mining” on the environment?
While the former sense belongs to a topic (say, topic Z) comprising computing or automation terms, the latter belongs to a completely different domain!
How do we resolve this misjudgment?
We need to consider two important aspects here:
I. How often does the word “mining” occur in topic Z?
If the occurrence is very frequent, there is a chance that this “mining” (in document D) also belongs to topic Z.
II. How common is topic Z in the rest of document D?
Combining the above-mentioned criteria, we calculate the probability that this word type (in D) came from topic Z.
That is, we multiply the frequency of word type W in topic Z by the number of other words in document D that already belong to Z.
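The multiplication described above can be sketched in a few lines of Python. The counts are made up for illustration and the function name `topic_scores` is hypothetical. Real LDA implementations also add smoothing terms that come from the Dirichlet priors, which this sketch leaves out.

```python
def topic_scores(word_freq_in_topic, doc_topic_counts):
    """Score each topic for one word occurrence, then normalize to probabilities.

    word_freq_in_topic: {topic: how often word type W appears in that topic}
    doc_topic_counts:   {topic: how many other words in document D belong to it}
    """
    raw = {
        topic: word_freq_in_topic[topic] * doc_topic_counts.get(topic, 0)
        for topic in word_freq_in_topic
    }
    total = sum(raw.values())
    if total == 0:
        # No evidence for any topic: fall back to a uniform distribution.
        return {topic: 1 / len(raw) for topic in raw}
    return {topic: score / total for topic, score in raw.items()}


# Made-up counts: "mining" is frequent in the computing topic, but most of
# document D's other words were assigned to the environment topic.
word_freq = {"computing": 30, "environment": 10}
doc_counts = {"computing": 2, "environment": 18}

probs = topic_scores(word_freq, doc_counts)
print(probs)
```

With these numbers the raw scores are 60 for computing and 180 for environment, so “mining” in document D is attributed to the environment topic with probability 0.75, even though the word itself is more frequent in the computing topic.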
Roadmap for LDA
1. Each word is randomly assigned to a topic — initialization can be controlled by Dirichlet priors.
2. Now, find out which term belongs to which topic by calculating the probabilities and picking the highest. This is the topic modeling part of LDA.
3. Step 2 is repeated for every word in every document, over many passes.
4. The topic assignments converge over time to form a word-topic distribution.
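The roadmap can be tried out on a toy corpus. The sketch below uses collapsed Gibbs sampling, one common way to fit LDA; this is my assumption for illustration and not necessarily what any particular library does internally. Note that the sampler draws each word's topic from the computed probabilities instead of always picking the highest. The function name `toy_lda`, the prior values `alpha` and `beta`, and the four tiny documents are all hypothetical.

```python
import random


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: assign every word occurrence to a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d assigned to topic k
    topic_word = [dict() for _ in range(num_topics)]  # occurrences of word w in topic k
    topic_total = [0] * num_topics                    # all words assigned to topic k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1

    # Steps 2-3: repeatedly resample the topic of each word occurrence.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # (how common w is in topic t) * (how common t is in doc d),
                # smoothed by the Dirichlet priors beta and alpha.
                weights = [
                    (topic_word[t].get(w, 0) + beta) / (topic_total[t] + beta * vocab_size)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Step 4: the counts now approximate the word-topic distribution.
    return topic_word, doc_topic


docs = [
    ["data", "mining", "algorithm", "data", "model"],
    ["coal", "mining", "pollution", "river", "coal"],
    ["algorithm", "model", "data", "software"],
    ["river", "pollution", "forest", "coal"],
]
topic_word, doc_topic = toy_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {k}: {top_words}")
```

Because the sampler is random, the exact topics depend on the seed and on the number of iterations; on a corpus this small the result should be read as a demonstration of the mechanics only.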
Implementation of LDA with PySpark
Here, I’ve primarily used the gensim, regex, and nltk libraries for text processing.
1. Select the appropriate data: drop null or empty values
2. Cleaning and preprocessing: tokenize, remove punctuation and stopwords, and normalize the corpus
3. Create the dictionary and corpus needed for Topic Modeling
4. Build the Topic Model
5. View the topics in the LDA model
6. Select the most appropriate number of topics by calculating perplexity and coherence scores
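The six steps can be sketched with gensim as follows. This is a minimal sketch under assumptions, not the original code of this article: the helper names `preprocess` and `build_lda` are hypothetical, the stopword list is a tiny stand-in for the one shipped with nltk, lemmatization is skipped, and the `u_mass` coherence measure is just one of the options gensim offers.

```python
import re

# Tiny illustrative stopword list (an assumption); nltk.corpus.stopwords is the usual choice.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def preprocess(raw_docs):
    """Steps 1-2: drop null/empty documents, tokenize, remove stopwords."""
    texts = []
    for doc in raw_docs:
        if not doc or not doc.strip():
            continue  # step 1: skip None and blank entries
        tokens = re.findall(r"[a-z]+", doc.lower())  # letters only, so punctuation is dropped
        tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
        if tokens:
            texts.append(tokens)
    return texts


def build_lda(texts, num_topics, passes=10, seed=0):
    """Steps 3-6 with gensim (imported here so preprocess() works without it)."""
    from gensim import corpora
    from gensim.models import CoherenceModel, LdaModel

    dictionary = corpora.Dictionary(texts)                 # step 3: token <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]  # step 3: bag-of-words corpus

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=passes, random_state=seed)       # step 4: fit the model

    topics = lda.print_topics(num_words=5)                 # step 5: top words per topic
    coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()  # step 6
    log_perplexity = lda.log_perplexity(corpus)            # step 6: per-word likelihood bound
    return lda, topics, coherence, log_perplexity


texts = preprocess(["Data mining, and text mining!", None, "   "])
print(texts)
```

On a real corpus, `build_lda(texts, num_topics=k)` would be called for several values of `k` and the resulting coherence and perplexity values compared.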
Listing the key takeaways —
I. The number of topics is picked ahead of time, and finalized by weighing several factors such as coherence score and perplexity.
II. Each document is represented as a distribution over topics.
III. Each topic is represented as a distribution over words.
If the coherence score keeps increasing with the number of topics, choose the value at which the scores start to flatten out.
Here’s a summary of a few topics I arrived at from job description data.
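One way to make the "flattening out" rule concrete is a small helper like the one below. The function name `pick_num_topics`, the `min_gain` threshold, and the coherence scores are all made up for illustration; in practice the plot of scores is usually inspected by eye as well.

```python
def pick_num_topics(coherence_by_k, min_gain=0.01):
    """Return the first k after which coherence improves by less than min_gain."""
    if not coherence_by_k:
        raise ValueError("need at least one coherence score")
    ks = sorted(coherence_by_k)
    for current, nxt in zip(ks, ks[1:]):
        if coherence_by_k[nxt] - coherence_by_k[current] < min_gain:
            return current
    return ks[-1]  # still improving at the largest k that was tried


# Made-up coherence scores keyed by number of topics.
scores = {2: 0.35, 4: 0.42, 6: 0.48, 8: 0.485, 10: 0.487}
print(pick_num_topics(scores))
```

Here the gain from 6 to 8 topics is only about 0.005, below the threshold, so the helper settles on 6 topics.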
To dive deeper…
I. Topic visualization can be done through distance maps and histogram plots, using pyLDAvis.
II. The most representative document for each topic can be identified.
III. Topic distribution across documents can also be computed.
IV. A function that computes coherence scores for different argument values passed to the LDA model (for example, the number of topics) may also be implemented.
Word Clouds may be used to capture a bird’s-eye view of the topic.
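Items II and III can be computed directly from the per-document topic distributions. The sketch below assumes each document is given as a list of (topic_id, probability) pairs, which to my understanding is the shape returned by gensim's `LdaModel.get_document_topics`; the function name `summarize` and the sample numbers are hypothetical.

```python
from collections import defaultdict


def summarize(doc_topics):
    """doc_topics: one non-empty list of (topic_id, probability) pairs per document."""
    dominant = []               # dominant topic of each document
    best_doc = {}               # topic_id -> (doc_index, probability) of its best document
    share = defaultdict(float)  # topic_id -> probability summed over documents
    for doc_index, pairs in enumerate(doc_topics):
        dominant.append(max(pairs, key=lambda pair: pair[1])[0])
        for topic_id, prob in pairs:
            share[topic_id] += prob
            if topic_id not in best_doc or prob > best_doc[topic_id][1]:
                best_doc[topic_id] = (doc_index, prob)
    # Average share of each topic across the whole collection.
    mean_share = {t: total / len(doc_topics) for t, total in share.items()}
    return dominant, best_doc, mean_share


# Made-up distributions for three documents and two topics.
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
dominant, best_doc, mean_share = summarize(doc_topics)
print(dominant)    # dominant topic per document
print(best_doc)    # most representative document per topic
print(mean_share)  # topic distribution across documents
```

In this example, documents 0 and 2 are dominated by topic 0, and document 1 is the most representative document for topic 1.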
Here’s a link that discusses the implementation of LDA in more detail.
Words are not directly grouped into topics; rather, the probability of a word belonging to each topic is calculated.
The topic giving the highest probability will be associated with the word.
The choice for the number of topics is subject to a set of often uncorrelated factors like human judgment, perplexity, coherence scores, etc.
For further analysis, we may also extract the most dominant topic in a given document, reduce the dimensionality of the corpus, etc.
To explore the applications of topic modeling, click here
I would also like to share with you all a very interesting read on Topic Modeling by Ted Underwood, University of Illinois.
To conclude, LDA, being a probabilistic technique, is the way to go for finding interesting patterns in a large collection of data! Furthermore, data visualization libraries offer promising ways to represent features and insights from data.
- medium | Topic Modelling with PySpark and Spark NLP by Maria Obedkova
- tedunderwood | Topic modeling made just simple enough by Ted Underwood
- analyticsvidhya | Beginners Guide to Topic Modeling in Python by Shivam Bansal
- towardsdatascience | Topic Modelling in Python with NLTK and Gensim by Susan Li
- freecodecamp | How we Changed Unsupervised LDA to Semi-Supervised GuidedLDA by Vikash Singh