How to Clean Text Data (Full Practical Walkthrough)


January 6, 2021 · By Support from Fervent

In this article, we’re going to learn how to clean text data.

So let’s get into it.

What is text cleaning?

Firstly, what exactly is text cleaning?

Ultimately, it’s just a process of transforming raw text into a format that’s suitable for textual analysis.

[Slide: what text cleaning is]

Cleaning text data is imperative for any sort of textual analysis; and naturally, the same applies to sentiment analysis and, more broadly, text mining.

And this holds regardless of whether you’re conducting sentiment analysis (or other textual analysis) “manually”, or whether you’re using some sort of machine learning algorithm.

Data cleansing is imperative for any sort of analysis. And textual analysis is no exception.

Formally, text cleaning essentially involves vectorising text data.

Put differently, we’re going from, say, a raw text (.txt) file to some sort of a vector.

Think of a column or a row in an Excel spreadsheet, or in a pandas dataframe.

Text data, by definition and by construction, is unstructured. But the idea is to move away from a blob of text to a format that’s a little more structured.

What does text cleaning look like?

To help you understand what this looks like, let’s actually take a look at what a blob of text looks like. And then see what the cleaned text looks like.

Pre-Cleaning / Processing

So right here, you’ve got just a bunch of text…

[Slide: raw text, before cleaning]

This is actually an excerpt from a Management Discussion and Analysis (MD&A) filing.

And you can see that this looks like any ordinary blob of financial text.

So you’ve got some words. And then you’ve got your punctuation marks. You’ve got some numbers, dates, parentheses, percentage signs, etc.

There’s a variety of different special characters in here. There’s a variety of different words.

And a lot of it is not something we can actually use.


Related Course: Investment Analysis with Natural Language Processing (NLP)

This article features concepts that are covered extensively in our course on Investment Analysis with Natural Language Processing (NLP).

If you’re interested in learning how to leverage the power of text data for investment analysis while working with real-world data, you should definitely check out the course.


Post-Cleaning / Processing

Ultimately, the only thing we’re really interested in is the actual words.

We don’t particularly care about the numbers. Or the symbols.

And nor do we care about punctuation marks.

We’re literally only really interested in the words.

And once you go through the cleaning process, here’s what the cleaned text would look like…

[Slide: cleaned words, after cleaning]

We’ve now moved away from a blob of text to something that’s relatively more structured in that it is a list of words.

This process – of transforming text into a list of words, or a “bag of words” – is also called “tokenization”. And each element inside the list is a “token” (essentially a word).

You can see that all of the symbols have gone. There are no parentheses or hyphens. All the punctuation marks have gone.

And all the numbers have gone, too.

Now we literally only have the words, without any unwanted characters.
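To make that concrete, here’s a minimal Python sketch of the transformation. The one-line excerpt is made up to stand in for the MD&A text, and the regex approach is just one simple way of doing it:

    import re

    # A made-up, MD&A-style blob of raw text (hypothetical).
    raw_text = "Revenue grew 12.4% in 2020 (see Note 7) - strong growth, we believe."

    # Strip anything that's not a letter from each token, then drop empty tokens.
    tokens = [re.sub(r"[^A-Za-z]", "", word) for word in raw_text.split()]
    tokens = [token for token in tokens if token]

    print(tokens)
    # ['Revenue', 'grew', 'in', 'see', 'Note', 'strong', 'growth', 'we', 'believe']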

How do we achieve this?

Well, it’s actually a three-step process. We also have a video that explains the full process.

Of course, you can also continue reading about the whole process below.

How to clean text data using the 3 Step Process

Step 1: Remove numbers, symbols, and other unwanted characters

The 3 step process on how to clean text data starts with removing all the numbers, symbols, and anything that’s not an alphabetic character from the text.

So we remove literally anything that is not a word.


Why is removing non-alphabetic characters important?

Why is this necessary?

Because if we don’t do this, then we can essentially end up underestimating sentiment, for example.

For instance, one of the ways to estimate sentiment is to use a “proportional counts approach”.

Sentiment using a proportional counts approach can be estimated as…

    \[\phi_{s} = \frac{\sum 1_{w^* \in \psi_s}}{\sum w^*}\]

Where you take the frequency counts of the (cleaned) words that belong to a sentiment language (the numerator).

And you divide that by the total number of (cleaned) words in that document (the denominator).
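In code, the estimate is just a ratio of two counts. Here’s a minimal sketch, using a hypothetical positive lexicon and an already-cleaned list of words:

    # Hypothetical positive sentiment lexicon and a cleaned document.
    positive_lexicon = {"growth", "strong", "improved"}
    cleaned_words = ["revenue", "growth", "was", "strong", "this", "year"]

    # Numerator: frequency count of cleaned words in the sentiment language.
    hits = sum(1 for word in cleaned_words if word in positive_lexicon)

    # Denominator: total number of cleaned words in the document.
    sentiment = hits / len(cleaned_words)

    print(sentiment)  # 2 / 6 = 0.33...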

Now, to you and me, the total number of words would intuitively just include words, right?

But if you don’t eliminate things like the numbers, symbols, and other non-alphabetic characters…

Then it’s possible for the program to include those symbols and numbers as words. So essentially, the number of words would be significantly higher than it actually is.

Because it’s counting the symbols as individual words. You and I, as humans, know they’re not words.

But unless we explicitly code in the requirement to ignore numbers, symbols, punctuation marks, and the like, the program is likely going to end up counting those symbols and numbers as words.

So that’s the technical reason as to why we need to remove non-alphabetic characters.

The fundamental rationale for removing non-alphabetic characters

But there’s also an underlying reasoning or rationale.

And that’s simply because numbers aren’t words. Symbols aren’t words. And punctuation marks aren’t words.

If we’re trying to establish the words which relate to a specific sentiment language, then numbers, symbols and punctuation marks are completely irrelevant.

For us to achieve that objective, we only need the words.

We don’t need all of the other stuff that just happens to be inside a blob of text.

And this is why it’s important for us to remove all of the non-alphabetic characters as the first step in our text cleaning process.

Note that removing unwanted characters is fairly straightforward, using either regular expressions or built-in string methods, for example, Python’s .isalpha() method.
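For instance, here’s a minimal sketch of both options; the token list is made up for illustration. Note that .isalpha() keeps or drops whole tokens, while the regex strips unwanted characters out of each token:

    import re

    words = ["Revenue", "12.4%", "(2020)", "grew", "-", "strongly."]

    # Option 1: keep only tokens made purely of letters.
    alpha_only = [w for w in words if w.isalpha()]
    print(alpha_only)  # ['Revenue', 'grew']

    # Option 2: strip non-alphabetic characters from each token, drop empties.
    stripped = [re.sub(r"[^A-Za-z]", "", w) for w in words]
    stripped = [w for w in stripped if w]
    print(stripped)  # ['Revenue', 'grew', 'strongly']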

Step 2: Harmonise letter case

The next thing we do as part of how to clean text data using the 3 step process is to harmonise the letter case.

In an ordinary blob of text, we tend to have a mix of upper case, lower case, and title case text.

And working with text that’s in different cases can be a little bit problematic.

Why is harmonising letter case important?

Harmonising letter case helps us ensure that the words inside a document that belong to a sentiment lexicon or sentiment language are actually picked up.

To give you an example, consider the word “growth” with a capital G – so “Growth”.

To you and me, as humans, that’s the same as growth with a little G (“growth”).

The word is the same.

They both mean the same thing.

Just because we write it with a capital G or we write the whole thing with an upper case as “GROWTH”, or indeed the whole thing with lower case as “growth”…

That doesn’t change the meaning of the word.

It still means growth.

You and I know that as humans.

But computers aren’t as clever as we’d like them to be.

And so, for instance, if you were to ask Python whether the text string “growth” is the same as “Growth”, Python will return False.

Because as far as Python is concerned, these two strings are not the same: character for character, they are not identical.

Python doesn’t care that, as far as the English language goes, they are in fact the same thing.
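You can verify this in a single line:

    print("growth" == "Growth")  # False: Python's string comparison is case-sensitive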

And so imagine the word/text string “growth” is in our positive dictionary or positive sentiment language.

Say we’ve listed that word as lowercase…

Then if the document contains the title-case word “Growth”, our program is not going to pick it up. It’s not going to include that word as one that belongs to the sentiment language.

And the same goes if there were an upper case “GROWTH”: it won’t be picked up.

Because our dictionary includes the word in a lower case as “growth”.

This is why it’s really important for us to make sure that the letter case in the text we’re working with is consistent and identical throughout.

Choice of letter case

You can choose to work with upper case, or title case, or lower case.

It doesn’t matter which specific case you end up working with, as long as you’re consistent with that case throughout the corpus.

Generally speaking, most people who work with text data tend to harmonise the text data to lower case.

It just happens to be the typical convention and general practice.
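Following that convention, harmonising is a one-liner per token with Python’s .lower() method; the token list here is made up:

    tokens = ["Growth", "GROWTH", "growth", "Revenue"]

    # Harmonise every token to lower case.
    tokens = [token.lower() for token in tokens]

    print(tokens)  # ['growth', 'growth', 'growth', 'revenue']
    print("growth" == "Growth".lower())  # True: the mismatch problem disappears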

Okay, after harmonising the letter case across all words, the last thing we need to do is remove all stopwords.

Step 3: Remove the most common words (stopwords)

The final step of the text cleaning process involves removing the most common words, aka “stopwords”.

Stopwords are the most common words in a given language. And this language can be a general language (e.g., English), or it could be a subject-specific language; for instance, Finance.

The idea is to remove the words that are most commonly used in that language. “a” is a stopword. As is “the”. And “an”, for example.
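As a sketch, here’s one way to remove general English stopwords using NLTK’s freely available list (the token list is made up, and NLTK is just one of several libraries that ship stopword lists):

    import nltk
    from nltk.corpus import stopwords

    # Fetch the stopword list once, if you don't already have it locally.
    nltk.download("stopwords", quiet=True)
    stop_words = set(stopwords.words("english"))

    tokens = ["the", "company", "reported", "a", "strong", "growth", "in", "revenue"]
    tokens = [token for token in tokens if token not in stop_words]

    print(tokens)  # ['company', 'reported', 'strong', 'growth', 'revenue']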

Why is removing stopwords important?

And that’s ultimately because the most common words are so common, that they actually add little to no value to any analysis.

It’s also because if we don’t remove stopwords, then we can end up underestimating sentiment.

To see why we might end up underestimating sentiment, think about the proportional counts estimate of sentiment that we talked about earlier in this article.

You’ll recall that the numerator of that estimate is the frequency count of all of the words which belong to a given sentiment language.

And the denominator of the proportional counts estimate is the total number of words in that document.

Here’s the equation again, just in case you missed it:

    \[\phi_{s} = \frac{\sum 1_{w^* \in \psi_s}}{\sum w^*}\]

And so if our document has words like “a”, “the”, “and”, etc, then, of course, that’s going to increase the total number of words in the document.

That will increase the denominator of the proportional counts estimate, which will then naturally decrease the value of the proportional counts based estimate of sentiment.
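To make that concrete, here’s a toy numeric sketch (the lexicon, stopword list, and mini-document are all hypothetical) showing how stopwords deflate the estimate:

    positive_lexicon = {"growth", "strong"}
    stop_words = {"the", "a", "and", "in", "of"}

    words = ["the", "growth", "in", "revenue", "was", "strong", "and", "broad"]

    def sentiment(tokens):
        # Proportional counts: lexicon hits divided by total token count.
        return sum(1 for t in tokens if t in positive_lexicon) / len(tokens)

    print(sentiment(words))  # 2 / 8 = 0.25 (stopwords inflate the denominator)

    cleaned = [t for t in words if t not in stop_words]
    print(sentiment(cleaned))  # 2 / 5 = 0.40 (the same document, after cleaning)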

Language-specific & subject-specific stopwords

Now, importantly, stopwords aren’t necessarily limited to just the most common words in the English language.

Broadly speaking, stopwords can comprise general, language-specific words. But they can also comprise subject-specific words.

So in finance and accounting, for instance, you might think of words like “company”, “firm”, “management”, or “business” as examples of stopwords. Because these are likely words that are extremely common across all documents.

And so you can actually think of these common words as stopwords that are specific to finance and accounting.

Now, while general language-specific stopword lists are available for free, subject-specific stopword lists (at least at the time of writing) tend to be proprietary.

Some people have created stopword lists that are specific to certain subjects, but they do not allow people to use those stopword lists for free.

A few of them allow free use for academic purposes, but not for commercial purposes. And others don’t allow free use at all.

Subject-specific stopwords can be very important.

But they’re not imperative for you to use.

So it’s not like your sentiment analysis will completely break down if you don’t use the subject specific stopwords.

They certainly can be very useful and important. But they’re not by any means the be-all and end-all of sentiment analysis.

So hopefully you now understand the process of cleaning text data, and perhaps more importantly, you understand why the individual steps are necessary.

Step 3.5 (Bonus / optional): Stem and Lemmatize

When exploring how to clean text data, the preceding 3 steps are imperative.

In addition though, depending on the hypotheses that you’re working with, text cleaning can also include “stemming” or “lemmatizing”. And it can also include the removal of the most common words within the corpus.

Stemming and lemmatizing essentially reduce all of the words down to their root form.

So for example, the word “managers” would be lemmatized down to “manager” (a stemmer would typically cut even further, down to something like “manag”).
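Here’s a minimal sketch of both, using NLTK’s Porter stemmer and WordNet lemmatizer; the exact outputs can vary across library versions:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # The lemmatizer needs the WordNet data the first time round.
    nltk.download("wordnet", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("managers"))          # 'manag' (a crude, rule-based cut)
    print(lemmatizer.lemmatize("managers"))  # 'manager' (an actual dictionary word)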

In terms of removing the most common words within the corpus, it’s a simple case of removing the words that are used most commonly across all documents inside the corpus.
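As a sketch, one simple way to find those corpus-level common words is to count how many documents each word appears in; the toy corpus and the “appears in every document” cut-off below are purely illustrative:

    from collections import Counter

    # A toy corpus of already-cleaned documents (hypothetical).
    corpus = [
        ["company", "growth", "strong", "revenue"],
        ["company", "management", "growth", "risk"],
        ["company", "revenue", "decline", "risk"],
    ]

    # Count the number of documents each word appears in.
    doc_frequency = Counter(word for doc in corpus for word in set(doc))

    # Treat words that appear in every document as corpus-specific stopwords.
    corpus_stopwords = {w for w, n in doc_frequency.items() if n == len(corpus)}

    print(corpus_stopwords)  # {'company'}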

Wrapping Up How to Clean Text Data

In this article, you’ve learnt the core fundamentals of how to clean text data.

Specifically, we learnt that cleaning text data involves transforming raw text into a format that’s suitable for textual analysis.

This itself is a 3 step process (pulled together in a code sketch after this list), including:

  • removing numbers, symbols, and everything that’s not an alphabetic character,
  • harmonising letter case so all words have the same case, be that upper case, title case, or lower case, and
  • removing stopwords.
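Putting the whole 3 step process together, here’s a minimal end-to-end sketch; it uses NLTK’s English stopword list and a regex for Step 1, which is one reasonable set of choices among several:

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    STOP_WORDS = set(stopwords.words("english"))

    def clean_text(raw_text):
        """Apply the 3 step cleaning process to a raw blob of text."""
        # Step 1: remove numbers, symbols, and other non-alphabetic characters.
        tokens = [re.sub(r"[^A-Za-z]", "", w) for w in raw_text.split()]
        tokens = [t for t in tokens if t]

        # Step 2: harmonise everything to lower case.
        tokens = [t.lower() for t in tokens]

        # Step 3: remove the most common words (stopwords).
        return [t for t in tokens if t not in STOP_WORDS]

    print(clean_text("Revenue grew 12.4% in 2020 - a year of strong growth."))
    # ['revenue', 'grew', 'year', 'strong', 'growth']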

Hopefully all of this makes sense.

If any part of this article is not quite clear, please read it again before moving on any further.

Next steps? Discover how all this hard work can be used to create profitable sentiment investing strategies.

Or build your own sentiment investing system by enrolling in the course below.


Related Course: Investment Analysis with Natural Language Processing (NLP)

Do you want to build a rigorous investment analysis system that leverages the power of text data with Python?

Explore the Course
