In this article, we’re going to learn how to clean text data.
So let’s get into it.
What is text cleaning?
Firstly, what exactly is text cleaning?
Ultimately, it’s just a process of transforming raw text into a format that’s suitable for textual analysis.
Cleaning text data is imperative for any sort of textual analysis; and naturally, the same applies for sentiment analysis or more broadly, text mining as well.
And this holds regardless of whether you’re conducting sentiment analysis (or other textual analysis) “manually”, or whether you’re using some sort of machine learning algorithms.
Data cleansing is imperative for any sort of analysis. And textual analysis is no exception.
Formally, text cleaning essentially involves vectorising text data.
Put differently, we’re going from, say, a text file or .txt file to some sort of a vector.
Think of a column or a row in an Excel spreadsheet, or in a pandas dataframe.
Text data is, by definition and by construction, unstructured. But the idea is to move away from a blob of text to a format that’s a little more structured.
What does text cleaning look like?
To help you understand what this looks like, let’s actually take a look at what a blob of text looks like. And then see what the cleaned text looks like.
Pre-Cleaning / processing
So right here, you’ve got just a bunch of text…
This is actually an excerpt from a management discussion and analysis or MD&A filing.
And you can see that this looks like any ordinary blob of financial text.
So you’ve got some words. And then you’ve got your punctuation marks. You’ve got some numbers, dates, parentheses, percentage signs, etc.
There’s a variety of different special characters in here. There’s a variety of different words.
And a lot of it is not something we can actually use.
Related Course: Investment Analysis with Natural Language Processing (NLP)
This Article features concepts that are covered extensively in our course on Investment Analysis with Natural Language Processing (NLP).
If you’re interested in learning how to leverage the power of text data for investment analysis while working with real world data, you should definitely check out the course.
Post cleaning / processing
Ultimately, the only thing we’re really interested in is the actual words.
We don’t particularly care about the numbers. Or the symbols.
And nor do we care about punctuation marks.
We’re literally only really interested in the words.
And once you go through the cleaning process, here’s what the cleaned text would look like…
We’ve now moved away from a blob of text to something that’s relatively more structured in that it is a list of words.
This process of transforming text into a list of words is called “tokenization”, and each element inside the list is a “token” (essentially a word). The resulting list is often referred to as a “bag of words”.
You can see that all of the symbols have gone. There are no parentheses and hyphens. All the punctuation marks have gone.
And all the numbers have gone, too.
Now we literally only have the words, without any unwanted characters.
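To make the idea of tokenization a little more concrete, here is a minimal Python sketch. The sample sentence is made up for illustration (it is not the MD&A excerpt), and the regular expression is just one possible way of pulling out the words.

```python
import re

# Hypothetical sample text, invented for illustration only.
raw_text = "Revenue grew 12% in 2020 (up from 9% in 2019), driven by strong demand."

# Keep only runs of letters; numbers, symbols and punctuation are left behind.
tokens = re.findall(r"[A-Za-z]+", raw_text)

print(tokens)
# ['Revenue', 'grew', 'in', 'up', 'from', 'in', 'driven', 'by', 'strong', 'demand']
```

Each element of `tokens` is a token, and the list as a whole is the more structured format we're after.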
How do we achieve this?
Well, it’s actually a three-step process. We also have a video that explains the full process, but you can read about each step further below.
How to clean text data using the 3 Step Process
Step 1: Remove numbers, symbols, and other unwanted characters
The 3 step process on how to clean text data starts with removing all the numbers, symbols, and anything that’s not an alphabetic character from the text.
So we remove literally anything that is not a word.
Why is removing non-alphabetic characters important?
Why is this necessary?
Because if we don’t do this, then we can essentially end up underestimating sentiment, for example.
For instance, one of the ways to estimate sentiment is to use a “proportional counts approach”.
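Sentiment using a proportional counts approach can be estimated as…
$$\text{Sentiment} = \frac{\text{number of (cleaned) words that belong to the sentiment language}}{\text{total number of (cleaned) words in the document}}$$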
Where you take the frequency counts of the (cleaned) words that belong to a sentiment language (the numerator).
And you divide that by the total number of (cleaned) words in that document (the denominator).
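The proportional counts estimate described above can be sketched in a few lines of Python. The function name `proportional_sentiment`, the token list, and the positive word list are all hypothetical and only there for illustration.

```python
def proportional_sentiment(tokens, sentiment_words):
    """Share of tokens that belong to the given sentiment word list."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    matches = sum(1 for token in tokens if token in sentiment_words)
    return matches / len(tokens)


# Hypothetical cleaned tokens and a hypothetical positive word list.
tokens = ["revenue", "growth", "strong", "demand", "improved", "margins", "costs", "rose"]
positive_words = {"growth", "strong", "improved"}

score = proportional_sentiment(tokens, positive_words)
print(score)  # 3 matches out of 8 tokens -> 0.375
```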
Now, the total number of words to you and me intuitively would just include words, right?
But if you don’t eliminate things like the numbers, symbols, and other non-alphabetic characters…
Then it’s possible for the program to essentially include those symbols and numbers as words. So essentially, the number of words would be significantly higher than it actually is.
Because it’s counting the symbols as individual words. You and I, as humans, know they’re not words.
But unless we explicitly code in the requirement to ignore numbers, symbols and punctuation marks, and the like…
Unless we do that, the program is likely going to end up counting those symbols and numbers as words.
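Here is a small sketch of that problem in Python. The sentence and the positive word list are invented for illustration, and splitting on whitespace stands in for a "naive" program that hasn't been told to ignore symbols and numbers.

```python
import re

# Hypothetical sentence, invented for illustration.
raw_text = "Net sales rose 8 % to $ 4.2 billion - a record."

# Naive approach: split on whitespace, so every symbol and number becomes a "word".
naive_tokens = raw_text.split()

# Alphabetic-only approach: keep runs of letters and nothing else.
word_tokens = re.findall(r"[A-Za-z]+", raw_text)

# Hypothetical positive word list.
positive_words = {"rose", "record"}
hits = sum(1 for t in word_tokens if t in positive_words)

naive_score = hits / len(naive_tokens)  # denominator inflated by symbols and numbers
clean_score = hits / len(word_tokens)   # denominator counts only actual words

print(len(naive_tokens), len(word_tokens))  # 12 7
print(naive_score < clean_score)            # True
```

With the same number of sentiment words in the numerator, the inflated denominator pulls the naive estimate down.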
So that’s the technical reason as to why we need to remove non-alphabetic characters.
The fundamental rationale for removing non-alphabetic characters
But there’s also an underlying reasoning or rationale.
And that’s simply because numbers aren’t words. Symbols aren’t words. And punctuation marks aren’t words.
If we’re trying to establish the words which relate to a specific sentiment language, then numbers, symbols and punctuation marks are completely irrelevant.
For us to achieve that objective, we only need the words.
We don’t need all of the other stuff that just happens to be inside a blob of text.
And this is why it’s important for us to remove all of the non-alphabetic characters as the first step in our text cleaning process.
Note that removing unwanted characters is fairly straightforward, using either regular expressions or built-in string methods such as Python’s str.isalpha().
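As a rough sketch of the two options, here is how each might look in Python. The sample sentence is hypothetical, and these are just two simple ways of doing it rather than the only ones. Note that they don't behave identically, so it's worth checking the output on your own text.

```python
import re

# Hypothetical sentence, invented for illustration.
raw_text = "Year-end cash was $3.1m; debt fell 5%."

# Option 1: split on whitespace and keep only the purely alphabetic pieces.
# Anything with a symbol or digit attached (e.g. "Year-end") is dropped entirely.
isalpha_tokens = [t for t in raw_text.split() if t.isalpha()]

# Option 2: use a regular expression to pull out every run of letters.
# Hyphenated words are split up, and stray letters (the "m" in "3.1m") survive.
regex_tokens = re.findall(r"[A-Za-z]+", raw_text)

print(isalpha_tokens)  # ['cash', 'was', 'debt', 'fell']
print(regex_tokens)    # ['Year', 'end', 'cash', 'was', 'm', 'debt', 'fell']
```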
Step 2: Harmonise letter case
The next thing we do as part of how to clean text data using the 3 step process, is to harmonise the letter case.
In an ordinary blob of text, we tend to have a mix of upper case, lower case, and title case text.
And working with text that’s in different cases can be a little bit problematic.
Why is harmonising letter case important?
Harmonising letter case helps us ensure that the words inside a document that belong to a sentiment lexicon or sentiment language are actually picked up.
To give you an example, consider the word “growth” with a capital G – so “Growth”.
To you and me, as humans, that’s the same as growth with a little G (“growth”).
The word is the same.
They both mean the same thing.
Just because we write it with a capital G or we write the whole thing with an upper case as “GROWTH”, or indeed the whole thing with lower case as “growth”…
That doesn’t change the meaning of the word.
It still means growth.
You and I know that as humans.
But computers aren’t as clever as we’d like them to be.
And so for instance, if you were to ask Python whether the text string “growth” is the same as “Growth”, Python will return False.
Because as far as Python is concerned, these two words are not the same because they are spelled slightly differently.
Python doesn’t care that, as far as the English language goes, they are in fact the same thing.
So now imagine the word/text string “growth” being in our positive dictionary or positive sentiment language.
Say we’ve listed that word as lowercase…
Then if there’s a title case version of the word (“Growth”) in the text, the program is not going to pick it up. It’s not going to include that word as one that belongs to the sentiment language.
And the same goes if there was an upper case “GROWTH” – it won’t pick it up.
Because our dictionary includes the word in a lower case as “growth”.
This is why it’s really important for us to make sure that the letter case is consistent across all of the text that we’re working with.
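You can see this for yourself with a quick Python sketch. The one-word positive dictionary and the token list are hypothetical.

```python
# String comparison in Python is case-sensitive.
same = "growth" == "Growth"
print(same)  # False

# Hypothetical positive dictionary, with the word listed in lower case.
positive_words = {"growth"}
tokens = ["Growth", "GROWTH", "growth"]

# Without harmonising the case, only the exact lower case match is picked up.
hits_raw = sum(1 for t in tokens if t in positive_words)

# After converting each token to lower case, all three are picked up.
hits_lower = sum(1 for t in tokens if t.lower() in positive_words)

print(hits_raw, hits_lower)  # 1 3
```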
Choice of letter case
You can choose to work with upper case, or title case, or lower case.
It doesn’t matter which specific case you end up working with, as long as you’re consistent with that case throughout the corpus (and in the word lists you compare it against).
Generally speaking, most people who work with text data tend to harmonise the text data to lower case.
It just happens to be the general convention.
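To illustrate that the specific case doesn't matter as long as it's applied consistently, here is a small sketch. The helper `count_hits` is hypothetical; it applies the same case conversion to both the text and the word list before comparing them.

```python
def count_hits(tokens, lexicon, normalise):
    """Count tokens found in the lexicon after applying the same case to both."""
    normalised_lexicon = {normalise(word) for word in lexicon}
    return sum(1 for token in tokens if normalise(token) in normalised_lexicon)


# Hypothetical tokens and lexicon.
tokens = ["Growth", "GROWTH", "growth", "Decline"]
lexicon = ["growth"]

lower_hits = count_hits(tokens, lexicon, str.lower)
upper_hits = count_hits(tokens, lexicon, str.upper)

print(lower_hits, upper_hits)  # 3 3
```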
Okay, after harmonising the letter case across all words, the last thing we need to do is remove all stopwords.
Step 3: Remove the most common words (stopwords)
The final step of the text cleaning process involves removing the most common words, aka “stopwords”.
Stopwords are the most common words in a given language. And this language can be a general language (e.g., English), or it could be a subject-specific language; for instance, Finance.
The idea is to remove the words that are most commonly used in that language. “a” is a stopword. As is “the”. And “an”, for example.
Why is removing stopwords important?
Ultimately, it’s because the most common words are so common that they add little to no value to any analysis.
It’s also because if we don’t remove stopwords, then we can end up underestimating sentiment.
To see why we might end up underestimating sentiment, think about the proportional counts estimate of sentiment that we talked about earlier in this article.
You’ll recall that the numerator of that estimate is the frequency count of all of the words which belong to a given sentiment language.
And the denominator of the proportional counts estimate is the total number of words in that document.
Here’s the equation again, just in case you missed it:
$$\text{Sentiment} = \frac{\text{number of (cleaned) words that belong to the sentiment language}}{\text{total number of (cleaned) words in the document}}$$
And so if our document has words like “a”, “the”, “and”, etc, then, of course, that’s going to increase the total number of words in the document.
That will increase the denominator of the proportional counts estimate. Which will then naturally decrease the value of the proportional counts based estimate of sentiment.
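Here is a minimal sketch of that effect in Python. The stopword list, the positive word list, and the tokens are all small hypothetical examples; real stopword lists are much longer.

```python
# Small hypothetical word lists, for illustration only.
stopwords = {"a", "an", "the", "and"}
positive_words = {"growth", "strong"}

# Hypothetical tokens that have already been through steps 1 and 2.
tokens = ["the", "company", "reported", "strong", "growth", "and", "a", "solid", "outlook"]

# Step 3: drop the stopwords.
filtered = [t for t in tokens if t not in stopwords]


def proportional_sentiment(words, sentiment_words):
    """Sentiment words divided by total words (0.0 for an empty list)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in sentiment_words) / len(words)


score_before = proportional_sentiment(tokens, positive_words)    # 2 out of 9 words
score_after = proportional_sentiment(filtered, positive_words)   # 2 out of 6 words

print(score_before < score_after)  # True
```

The numerator is the same in both cases. Only the denominator shrinks once the stopwords are gone, which is why the estimate goes up.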
Language specific & subject specific stopwords
Now, importantly, stopwords aren’t necessarily limited to just the most common words in the English language.
Broadly speaking, stopwords can comprise general language specific words. But they can also comprise subject specific words.
So in finance and accounting, for instance, you might think of words like “company”, “firm”, “management”, or “business” as examples of stopwords. Because these are likely words that are extremely common across all documents.
And so you can actually think of these common words as stopwords that are specific to finance and accounting.
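In code, combining the two kinds of stopwords can be as simple as merging two sets. The lists below are tiny hypothetical examples, with the finance words taken from the ones mentioned above.

```python
# Hypothetical general language stopwords (real lists are far longer).
general_stopwords = {"a", "an", "the", "and", "of"}

# Subject specific stopwords for finance and accounting, as mentioned above.
finance_stopwords = {"company", "firm", "management", "business"}

# Combine both sets into one stopword list.
all_stopwords = general_stopwords | finance_stopwords

# Hypothetical tokens, already lower case and alphabetic only.
tokens = ["the", "management", "of", "the", "company", "expects", "growth"]
filtered = [t for t in tokens if t not in all_stopwords]

print(filtered)  # ['expects', 'growth']
```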
Now, while general language stopword lists are available for free, subject specific stopwords (at least at the time of writing) tend to be proprietary.
Some people have created stopword lists that are specific to certain subjects, but they do not allow people to use those stopword lists for free.
A few of them allow free use for academic purposes, but not for commercial purposes. And others don’t allow people to use those lists for anything for free.
Subject specific stopwords can be very important.
But it’s not imperative that you use them.
So it’s not like your sentiment analysis will completely break down if you don’t use the subject specific stopwords.
They certainly can be very useful and important. But they’re not by any means the be-all and end-all of sentiment analysis.
So hopefully you now understand the process of cleaning text data, and perhaps more importantly, you understand why the individual steps are necessary.
Step 3.5 (Bonus / optional): Stem and Lemmatize
When exploring how to clean text data, the preceding 3 steps are imperative.
In addition though, depending on the hypotheses that you’re working with, text cleaning can also include “stemming” or “lemmatizing”. And it can also include the removal of the most common words within the corpus.
Stemming and lemmatizing essentially reduce all of the words down to their core root word.
So for example, the word “managers” would be reduced down to “manager”.
In terms of removing the most common words within the corpus, it’s a simple case of removing the words that are used most commonly across all documents inside the corpus.
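One possible way to find the most common words within a corpus is to count how many documents each word appears in. The sketch below does that with a tiny hypothetical corpus. The rule of dropping words that appear in every single document is an assumption made for illustration, and you may prefer a different threshold.

```python
from collections import Counter

# Hypothetical corpus: each document is a list of already-cleaned tokens.
corpus = [
    ["company", "reported", "strong", "growth"],
    ["company", "announced", "restructuring", "costs"],
    ["company", "growth", "slowed"],
]

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter(word for doc in corpus for word in set(doc))

# Assumed rule: treat words that appear in every document as corpus-level stopwords.
common_words = {word for word, n in doc_freq.items() if n == len(corpus)}

# Remove those words from each document.
trimmed = [[w for w in doc if w not in common_words] for doc in corpus]

print(common_words)  # {'company'}
print(trimmed[0])    # ['reported', 'strong', 'growth']
```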
Wrapping Up How to Clean Text Data
In this article, you’ve learnt the core fundamentals of how to clean text data.
Specifically, you’ve learnt that cleaning text data involves transforming raw text into a format that’s suitable for textual analysis.
This itself is a 3 step process, consisting of:
- removing numbers, symbols, and everything that’s not an alphabetic character,
- harmonising letter case so they all have the same case, be that upper case, title case, or lower case
- removing stopwords
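Putting the three steps together, a minimal sketch in Python might look like the following. The function name `clean_text`, the sample sentence, and the short stopword list are all hypothetical, and this is just one simple way of chaining the steps.

```python
import re

# Small hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to"}


def clean_text(raw_text, stopwords=STOPWORDS):
    """Apply the 3 step cleaning process and return a list of tokens."""
    # Step 1: keep only runs of letters (drops numbers, symbols, punctuation).
    words = re.findall(r"[A-Za-z]+", raw_text)
    # Step 2: harmonise the letter case (lower case, by convention).
    words = [w.lower() for w in words]
    # Step 3: remove stopwords.
    return [w for w in words if w not in stopwords]


# Hypothetical sentence, invented for illustration.
raw_text = "The Company reported GROWTH of 12% in 2020, and a strong outlook."
tokens = clean_text(raw_text)

print(tokens)  # ['company', 'reported', 'growth', 'strong', 'outlook']
```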
Hopefully all of this makes sense.
If any part of this article is not quite clear, please read it again before moving on any further.
Next steps? Discover how all this hard work can be used to create profitable sentiment investing strategies.