If we could programmatically comprehend everything a person says in a day, we would certainly end up with a great deal of information. But what about the building blocks of those sentences: the words?
This is where the real challenge lies. Computers process numbers with ease, but feeding them raw text as-is only consumes storage without producing any useful information.
Using NLP, we build algorithms that can explore, understand, and work seamlessly with this kind of data.
We begin by preprocessing: splitting the text into a list of words, removing punctuation, stop words and extra spaces, discarding infrequent words, and so on.
How will the computer interpret these words?
We quantify them! We associate every word with a set of real numbers, a vector, just as every product in a store has a price, be it a comb or a mixer. As different as the two are, that shared attribute still leaves room for comparison.
Similarly, in our case of arbitrary words, far too vast to connect through synonyms alone, what is the aspect that we quantify?
We represent words such that their meanings are captured, like humans do! Not the exact dictionary meaning, but a contextual one, obtained by quantifying their semantics.
For instance, when I say the word speak, we know exactly what I'm referring to (the context), even though the dictionary meaning may be ambiguous.
The semantics of a word are embedded across a preset number of vector components; these embeddings form the weights of a neural network, which are adjusted until the model converges. The process is termed word vectorization.
Now, how do we arrive at the number of components? Either resort to reasoned predictions after further exploring the data, or simply pick a value and finalise it based on the outcome.
The latter often gives better results.
Take a look at the following example —
Every dimension represents a point in the vector space, which may or may not be associated with an inherent meaning. Such representations enable:
- Computation of similar words
- Feature extraction for text classification
- Document clustering/grouping
From a neural network standpoint, embeddings are low-dimensional, learned continuous vector representations of discrete data.
Let’s try to understand two instances where word embeddings offer an edge over older methods!
Say we have 10,000 distinct words occurring in varying contexts over a large corpus of about 10 million words. With one-hot encoding, each word would be associated with 10,000 bits, making the transformed vector quite unmanageable.
Another shortcoming of this strategy is that the mapping is ill-informed: 'similar' categories are not placed closer to each other than the rest in the embedding space.
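To make the storage contrast concrete, here is a small sketch comparing a one-hot vector with a dense embedding; the vocabulary size matches the example above, while the word index and embedding dimension are illustrative:

```python
import numpy as np

vocab_size = 10_000   # distinct words, as in the example above
embed_dim = 100       # a typical dense embedding size

# One-hot: each word is a sparse vector with a single 1
word_index = 42                      # hypothetical index of some word
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Dense embedding: the same word is a row in a much smaller matrix
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
dense = embedding_matrix[word_index]

print(one_hot.shape)  # (10000,)
print(dense.shape)    # (100,)
```

The dense row is 100 times smaller, and unlike the one-hot vector, its components can be trained so that similar words end up close together.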
Word vectors mitigate these issues using different approaches, as discussed here — An Intuitive Understanding of Word Embeddings.
The rest of this article deals with the principles, implementation and application of a Word2Vec model.
Introduced by Mikolov et al., Word2Vec is among the most sought-after techniques for word vectorization. It is rooted in the idea of generating distributed representations, which introduce a dependence of words on one another, in contrast to the older NLP approach of treating words as atomic units.
This method follows a prediction-based approach to word embeddings: the probability of occurrence for a word is estimated depending on the context, i.e., the surrounding words. This makes analogies and similarities much easier to capture.
Key principles in Word2Vec
Neural networks are used in Word2Vec to create embeddings in two different prediction-based approaches.
i. CBOW (Continuous Bag of Words) model — This method uses the context words as input to predict the probability distribution of the word corresponding to this context.
Consider the sentence — Take an umbrella when you go out as the sky looks cloudy, it might rain today. Here’s how a CBOW model would interpret this data.
Say the input words are umbrella, out, cloudy, today and rain. The model averages the embeddings of these C context words and returns a probability distribution over the target word.
In the classic architecture the hidden layer is linear, with a Softmax over the output layer producing the probabilities; variants may use ReLU, Sigmoid or tanh between layers.
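A minimal CBOW sketch in PyTorch; the vocabulary size, embedding dimension and context indices below are illustrative, not taken from the article's dataset:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Average the context embeddings, then predict the target word."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context):                 # context: (batch, C) word indices
        avg = self.embed(context).mean(dim=1)   # average over the C context words
        return self.out(avg)                    # logits over the vocabulary

model = CBOW(vocab_size=5000, embed_dim=100)
context = torch.tensor([[11, 42, 7, 99]])       # hypothetical indices of 4 context words
logits = model(context)
print(logits.shape)  # torch.Size([1, 5000])
```

Applying a Softmax to these logits would give the predicted probability distribution over the target word.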
ii. Skip-Gram model — This propagates in a direction opposite to CBOW i.e., it predicts the context/surrounding words given the middle word.
For each context position corresponding to the middle word, the model outputs C probability distributions.
In both cases, the network uses backpropagation to train.
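The pivot-to-context direction of Skip-Gram shows up in how training pairs are extracted. A small sketch (the tokens and window size are illustrative):

```python
def skip_gram_pairs(tokens, window=2):
    """For each pivot word, emit (pivot, context) pairs within the window."""
    pairs = []
    for i, pivot in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # skip the pivot itself
                pairs.append((pivot, tokens[j]))
    return pairs

tokens = ["the", "sky", "looks", "cloudy", "today"]
print(skip_gram_pairs(tokens, window=1))
# [('the', 'sky'), ('sky', 'the'), ('sky', 'looks'), ('looks', 'sky'),
#  ('looks', 'cloudy'), ('cloudy', 'looks'), ('cloudy', 'today'), ('today', 'cloudy')]
```

Each pivot yields up to 2 × window pairs, one per context position.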
Going further, we introduce a Skip-Gram model with Negative Sampling using PyTorch. Instead of computing a full Softmax over the vocabulary, negative sampling trains the model to separate true context pairs from randomly drawn words outside the context window.
Before going into the details of building the model, we will take a look at data preparation and the required formatting for training.
1. We begin by splitting the dataframe into a list of words, the corpus.
['job', 'engineering', 'location', 'of', 'service', 'job', 'and', 'position']
2. The corpus is then filtered, removing stop words, punctuation, and words below a frequency threshold.
['job', 'engineering', 'location', 'service', 'job', 'position']
3. A vocabulary is derived from this, consisting of the distinct words from the corpus.
['job', 'engineering', 'location', 'service', 'position']
4. A dictionary is initialised with the vocabulary words as keys and their corresponding indices as values.
5. Now, the value corresponding to each word in the corpus is looked up in the dictionary and appended to a NumPy array.
array([0, 1, 2, 3, 0, 4])
This is the array that will be used to generate batches for training.
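Steps 3 to 5 can be sketched as follows, starting from the filtered corpus:

```python
import numpy as np

# Filtered corpus from step 2
corpus = ['job', 'engineering', 'location', 'service', 'job', 'position']

# Step 3: vocabulary of distinct words, in order of first appearance
vocab = list(dict.fromkeys(corpus))

# Step 4: word -> index dictionary
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Step 5: map every corpus word to its vocabulary index
encoded = np.array([word_to_idx[w] for w in corpus])

print(vocab)    # ['job', 'engineering', 'location', 'service', 'position']
print(encoded)  # [0 1 2 3 0 4]
```

Note that `dict.fromkeys` preserves insertion order, so the vocabulary keeps the order of first occurrence.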
Generation of true pairs
Pairs of indices within the given context size are now extracted from random positions and labelled 1.
The context size used here is 5.
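A simplified sketch of true-pair generation, assuming the encoded array from the steps above; the exact sampling scheme is illustrative (a smaller context size is used to suit the toy array):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_pairs(encoded, n_pairs, context_size=5):
    """Randomly sample (pivot, context, 1) triples within the context window."""
    pairs = []
    while len(pairs) < n_pairs:
        i = int(rng.integers(len(encoded)))              # random pivot position
        offset = int(rng.integers(1, context_size + 1))  # 1 .. context_size
        j = i - offset if rng.random() < 0.5 else i + offset
        if 0 <= j < len(encoded):                        # keep only in-bounds contexts
            pairs.append((int(encoded[i]), int(encoded[j]), 1))
    return pairs

encoded = np.array([0, 1, 2, 3, 0, 4])
print(true_pairs(encoded, n_pairs=4, context_size=2))
```

Each triple holds the pivot word's index, a context word's index, and the label 1.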
Generation of negative samples
This is done by randomly picking a location across the array and associating it with the middle word (pivot). These are labelled as 0.
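A simplified sketch along these lines; for brevity it does not explicitly exclude words that happen to fall inside the window:

```python
import numpy as np

rng = np.random.default_rng(1)

def negative_samples(encoded, n_samples):
    """Pair a pivot with a word from a random location across the array, labelled 0."""
    samples = []
    for _ in range(n_samples):
        pivot = int(encoded[rng.integers(len(encoded))])
        negative = int(encoded[rng.integers(len(encoded))])  # random position anywhere
        samples.append((pivot, negative, 0))
    return samples

encoded = np.array([0, 1, 2, 3, 0, 4])
print(negative_samples(encoded, n_samples=3))
```
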
Generation of batches for training
This can optimise performance as well as memory usage for the model.
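Mini-batching can be sketched as follows; the batch size and the example pairs are illustrative:

```python
import numpy as np

def batches(pairs, batch_size=2):
    """Yield shuffled mini-batches of (pivot, context, label) triples."""
    pairs = np.array(pairs)
    order = np.random.permutation(len(pairs))
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[order[start:start + batch_size]]
        yield chunk[:, 0], chunk[:, 1], chunk[:, 2]   # pivots, contexts, labels

pairs = [(0, 1, 1), (2, 3, 1), (1, 0, 0), (4, 2, 0)]
for pivots, contexts, labels in batches(pairs):
    print(pivots, contexts, labels)
```

Shuffling before slicing ensures each batch mixes true and negative pairs.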
Building the model
This is a fairly straightforward approach, carried out by multiplying weight matrices to produce an embedding of size much smaller than the vocabulary.
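One possible sketch of such a model, where each weight matrix is an `nn.Embedding` and the score for a (pivot, context) pair is the dot product of the corresponding rows; the sizes below match the toy corpus and are illustrative:

```python
import torch
import torch.nn as nn

class SkipGramNS(nn.Module):
    """Skip-Gram with negative sampling: score a (pivot, context) pair by dot product."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.pivot_embed = nn.Embedding(vocab_size, embed_dim)
        self.context_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, pivot, context):
        # Dot product of the two embeddings -> one logit per pair
        return (self.pivot_embed(pivot) * self.context_embed(context)).sum(dim=1)

model = SkipGramNS(vocab_size=5, embed_dim=3)
logits = model(torch.tensor([0, 2]), torch.tensor([1, 3]))
print(logits.shape)  # torch.Size([2])
```

Using two separate embedding tables for pivot and context words is the standard Skip-Gram design; after training, the pivot table is usually kept as the final word embeddings.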
Training the model
BCE (Binary Cross Entropy) loss is computed by treating the true and false pairs as the two classes.
The optimizer updates the weights accordingly with every pass.
The function BCEWithLogitsLoss() combines the Sigmoid activation and the Binary Cross Entropy loss computation in a single step.
The number of steps or epochs can be decided from the batch size and the corpus size.
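A condensed training-loop sketch along these lines; the toy pairs, embedding sizes and hyperparameters are chosen purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim = 5, 3                      # toy sizes
pivot_embed = nn.Embedding(vocab_size, embed_dim)
context_embed = nn.Embedding(vocab_size, embed_dim)

params = list(pivot_embed.parameters()) + list(context_embed.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()                  # applies the Sigmoid internally

pivots   = torch.tensor([0, 2, 1, 4])
contexts = torch.tensor([1, 3, 0, 2])
labels   = torch.tensor([1.0, 1.0, 0.0, 0.0])     # true pairs vs negative samples

losses = []
for epoch in range(100):
    optimizer.zero_grad()
    logits = (pivot_embed(pivots) * context_embed(contexts)).sum(dim=1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss steadily drops as the embeddings of true pairs align and those of negative pairs diverge.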
The approach using Gensim for CBOW and Skip-Gram is fairly simple as it makes use of an inbuilt function.
The embedding weights (stored in the variable out) are used for further analysis by computing similarities and differences.
Cosine distance is calculated as it gives a better contextual analogy.
A dot product taken between two word vectors can give the similarity between them.
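For illustration, cosine similarity on two hypothetical 3-dimensional word vectors (the values are made up):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings for two related words
v_rain  = np.array([0.9, 0.1, 0.4])
v_cloud = np.array([0.8, 0.2, 0.5])

print(round(cosine_similarity(v_rain, v_cloud), 3))  # 0.985 -> very similar direction
```

Because it normalises by the vector lengths, the cosine depends only on direction, which is what carries the contextual signal.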
Here are some pointers on word embeddings:
i. Ensure that the size of the embedding is much smaller than the size of the vocabulary.
Longer embeddings don't add proportionate information and can inflate the model size.
Shorter embeddings may not represent the semantics well enough.
ii. Since the batches (true/false) are generated randomly, training should be carried out for a sufficient number of iterations.
iii. Cosine distance generally yields a more appropriate result than a Euclidean one, as it takes the direction (context) into account.
iv. Clustering/analogies can be visualised through scatterplots.
v. Estimating the model size beforehand can come in handy when dealing with larger corpora.
Word Embeddings are a powerful tool to bring out patterns in text data. They not only speed up processing, but also optimise memory usage!
- github | The Illustrated Word2Vec by Jay Alammar
- towardsdatascience | From Words to Vectors by Mete Ismayil
- analyticsvidhya | An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec by NSS
- pytorch | PyTorch Documentation