
How to Estimate Sentiment using a Document Term Matrix

December 3, 2020 By Vash

In this article, we’re going to learn how to estimate sentiment using what’s called a document term matrix or “DTM”.

What is a Document Term Matrix (DTM)?

Firstly, what exactly is a document term matrix?

Well, it’s a matrix. And specifically, it’s a matrix that represents the words that are inside a Corpus.

Slide showcasing what a document term matrix is

So we take our entire Corpus; we take all of the words that are inside our text corpus… and we transform it into a mathematical matrix.

Each column represents one unique word that exists in the entire Corpus across all text documents.

And each row represents a unique text document.

And it’s this particular structure; this particular format, that’s the reason that this thing is called a document term matrix.

Because if you think about a matrix, we typically define a mathematical matrix as  N \times M, where N refers to the number of rows and M refers to the number of columns.

When working with text data, you’ve got N number of documents and M number of terms or words.

So this is an N \times M or a Document \times Term matrix, or simply just a Document Term Matrix (DTM).
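To make that concrete, here’s a minimal sketch in plain Python (standard library only; the toy documents and variable names are purely illustrative) of how one might assemble a bare-bones DTM:

```python
from collections import Counter

# A toy corpus: each document is just a string of text
corpus = {
    "d1": "good results and a happy outlook",
    "d2": "results were bad but the outlook is good",
}

# Tokenise each document into a bag (list) of words
bags = {doc_id: text.split() for doc_id, text in corpus.items()}

# Columns of the DTM: every unique word across the entire corpus
vocabulary = sorted({word for bag in bags.values() for word in bag})

# Rows are documents, columns are terms, values are term counts
dtm = {
    doc_id: [Counter(bag)[term] for term in vocabulary]
    for doc_id, bag in bags.items()
}

print(vocabulary)
print(dtm["d1"])  # term counts for document d1, one per vocabulary column
```

With N = 2 documents and M unique words, this is exactly the N \times M structure described above.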

Document Term Matrix vs. Term Document Matrix

Some people call the DTM a Term Document Matrix. And that’s only because they put the words in the rows, and put the documents in the columns.

It’s just another way of storing that information.

But for the most part, people tend to call it a Document Term Matrix (DTM).
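In code, the two are just transposes of one another. Here’s a tiny sketch, assuming (purely for illustration) that the DTM is held as a pandas DataFrame:

```python
import pandas as pd

# A tiny DTM: documents in rows, terms in columns
dtm = pd.DataFrame(
    [[4, 0, 1], [0, 5, 3]],
    index=["d1", "d2"],
    columns=["w1", "w2", "w3"],
)

# The Term Document Matrix is simply the transpose:
# terms in rows, documents in columns
tdm = dtm.T
print(tdm)
```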

Why use a Document Term Matrix (DTM)?

While it’s possible to estimate sentiment without a Document Term Matrix, a DTM can be extremely useful when working with really large datasets.

Especially for instance, when working on text mining with “Big Data” within the context of NLP for Finance.

RELATED: Sentiment Analysis

Using a DTM creates a coherent structure for what is otherwise an “unstructured” format of data. This in turn allows for better text analysis / text analytics.

The matrix representation of the corpus object allows for estimating sentiment significantly more efficiently, compared to say, if one estimated sentiment iteratively.

What does a DTM look like?

Now let’s think about what this actually looks like.

Firstly, it’s useful to think of a Corpus C as a bunch of documents.

If we see the Corpus object in that way, then we can think of each document within the Corpus as a bag of words or a list of words.

Consider a small “toy” corpus with just 6 documents relating to 3 firms over 2 time periods.

Slide showcasing a representation of the corpus

Here J represents the total number of firms, in this case, three.

And we’re looking at two specific time periods, t1 and t2.

So we’ve got a bunch of text documents, and each document has a list of words, or a bag of words. Because fundamentally, that’s exactly what a text document is. It’s just got a bunch of words.
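As a quick sketch, we might represent that toy corpus in Python as a mapping from documents (firm j at period t) to bags of words; all of the labels and words here are made up for illustration:

```python
# Toy corpus: J = 3 firms over 2 time periods (t1, t2).
# Each document d_{jt} is reduced to its bag (list) of words.
corpus = {
    "d1t1": ["results", "were", "good", "good", "outlook"],
    "d2t1": ["sales", "fell", "sharply", "this", "quarter"],
    "d3t1": ["a", "happy", "and", "amazing", "quarter"],
    "d1t2": ["the", "outlook", "remains", "good"],
    "d2t2": ["the", "recovery", "was", "amazing"],
    "d3t2": ["steady", "but", "unremarkable", "results"],
}

print(corpus["d1t1"])  # the bag of words for firm 1 at time 1
```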


Related: Investment Analysis with Natural Language Processing (NLP) Course

This Article features a concept that is covered extensively in our sentiment investing course on Investment Analysis with Natural Language Processing (NLP).

If you’re interested in leveraging the power of text data for investment analysis, you should definitely check out the course.


Now, since we can think of every single document in this way, what if we just took all of the unique words and “plonked” them into a matrix?

Well, if we did that, we’d have something like this…

Slide showcasing bare bones document term matrix

We’ve still got those same six text documents, across three firms over two time periods.

But rather than looking at individual bags of words for each and every document, we can just get all of the unique words across all documents and place them as individual columns.

The columns represent unique words, which means, of course, that each word shows up only once.

And all of the words that show up (W words in total) come from the entire Corpus.

Thus, that is literally the entire Corpus, just in unique terms.

Essentially, transforming / creating the DTM is akin to tokenization, because we’ve literally got a bunch of tokens now across multiple columns.
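In practice, you’d rarely build the DTM by hand. Here’s a sketch using scikit-learn’s CountVectorizer (one common choice, assumed here for illustration; the article doesn’t prescribe a library), which tokenises the documents and produces the term counts in one step:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "results were good good outlook",
    "sales fell sharply",
    "a happy and amazing quarter",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)  # sparse N x M matrix

print(vectorizer.get_feature_names_out())  # the unique terms (columns)
print(dtm.toarray())                       # term counts per document (rows)
```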

So while previously we could only see our Corpus as either a bunch of files, or a bunch of lists of words…

We can now see the entire Corpus and look at all of the information that’s inside the Corpus.

And this whole thing here is then our document term matrix.

Slide showcasing a document term matrix

The values inside the Document Term Matrix represent the frequency counts of the individual words that show up in individual documents.

Put differently, each value represents a term count (or term frequency).

So if we focus our attention on the first row up there, then what it tells us is that for the document d_{1t1}:

  • w_1 shows up 4 times
  • w_2, w_3, and w_4 don’t show up at all
  • w_5 shows up once
  • w_6 shows up twice
  • W, the final word in the corpus, doesn’t show up

Similarly for document d_{2t1} (i.e., the document for firm 2 at time 1), there’s no occurrence of w_1; five occurrences of w_2; no occurrences of w_3; three occurrences of w_4, and so on and so forth. You get the idea.

We’re literally just counting the number of times a given word occurs, and calling that the term frequency, or term count, or word frequency – call it whatever you fancy.

And of course it’s the same principle and the same interpretation across all documents.

Importantly, notice that the vast majority of values inside the DTM are zeros. This is normal. And the bigger the DTM, the more zeros you’ll see.

That’s because by construction, the DTM is a sparse matrix, and the majority of the values will be 0.
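You can check this directly. A quick sketch, assuming NumPy, that measures the sparsity of a small made-up DTM:

```python
import numpy as np

# A small dense DTM (rows = documents, columns = terms); counts are illustrative
dtm = np.array([
    [4, 0, 0, 0, 1, 2, 0],
    [0, 5, 0, 3, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 2],
])

# Sparsity: the proportion of zero entries
sparsity = (dtm == 0).mean()
print(f"Sparsity: {sparsity:.0%}")  # roughly 67% zeros, even in this tiny DTM
```

This is also why libraries typically store the DTM in a sparse matrix format rather than as a dense array.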

Now this particular Document Term Matrix has all of the unique words in the entire Corpus.

When working with text data, we don’t tend to work with all the words in the corpus; not even all the unique words.

Instead, we tend to only work with “cleaned words”.

Want to go further in Financial Sentiment Analysis?

Get the Investment Analysis with NLP Study Pack (for FREE!).


Document Term Matrix Subset (Cleaned Words)

The Document Term Matrix in its current form, then, isn’t particularly useful for estimating sentiment.

But of course we can create a similar Document Term Matrix using only the cleaned words in the Corpus.

And that might look something like this:

Slide showcasing a document term matrix comprising only cleaned words

So we’ve just got the same format, only this time the documents are cleaned documents, and the words are cleaned words.

The * superscript here denotes “cleaned”, so w^* represents a “cleaned word”, and d^* represents a “cleaned document”.

And this is now a Document Term Matrix of all of the cleaned words in the entire Corpus, across all documents.

The interpretation of this particular Document Term Matrix is of course, identical to the interpretation of the previous DTM.

By the way, in case you’re wondering how to clean text data, we’ve got a separate article for that. Long story short though, text cleaning involves the following (sketched in code right after this list):

  • Removing unwanted characters
  • Harmonising letter case, and
  • Removing stopwords (i.e., the most common words in the Corpus)
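Here’s a minimal sketch of those three steps in Python (the stopword list is a tiny illustrative stand-in; in practice you’d use a fuller list):

```python
import re

# Illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "and", "is", "of", "in"}

def clean(text: str) -> list[str]:
    text = text.lower()                    # harmonise letter case
    text = re.sub(r"[^a-z\s]", " ", text)  # remove unwanted characters
    words = text.split()
    return [w for w in words if w not in STOPWORDS]  # remove stopwords

print(clean("The outlook is GOOD, and results were amazing!"))
# ['outlook', 'good', 'results', 'were', 'amazing']
```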

Estimating Sentiment using a DTM

Now, how exactly do we use a Document Term Matrix to estimate sentiment?

Given that we can use the same approach to create a Document Term Matrix of cleaned words, as we did for just all of the words in the entire Corpus…

We can of course create a Document Term Matrix which only comprises the words which belong to a specific sentiment language (aka sentiment dictionary, sentiment vocabulary).

And that will look something like this:

Slide showcasing a document term matrix comprising cleaned words which belong to a sentiment language

Now we’ve got cleaned words which belong to a specific sentiment language \psi_s.

Of course, the interpretation of this “subset” DTM is identical to the previous versions.

The only difference is, rather than having all of the unique words in the entire Corpus, or indeed all of the unique cleaned words that are inside the entire Corpus…

We now only have the cleaned words which belong to a specific sentiment language.

Now, let’s say for simplicity that we are only looking at positive sentiment.

And our positive sentiment lexicon / vocabulary, again for simplicity, just has three positive words inside it.

Then our DTM might look like this:

Slide showcasing a "toy" DTM with a simple lexicon

So we’ve got “happy”, “good”, and “amazing”, which are the only words in our sentiment language.

And given the interpretation of the document term matrix, and the frequency counts available, we can get an idea of which specific sentiment language words appear in which documents.
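Subsetting the DTM to a sentiment language is just column selection. A sketch, again assuming (for illustration) that the cleaned-word DTM is held as a pandas DataFrame, with made-up counts:

```python
import pandas as pd

# A cleaned-word DTM (rows = documents, columns = cleaned words)
dtm = pd.DataFrame(
    {
        "happy":   [1, 0, 2, 0, 0, 0],
        "good":    [2, 0, 0, 1, 0, 0],
        "amazing": [0, 0, 1, 0, 1, 0],
        "sales":   [0, 3, 0, 0, 1, 0],
        "results": [1, 0, 0, 0, 0, 2],
    },
    index=["d1t1", "d2t1", "d3t1", "d1t2", "d2t2", "d3t2"],
)

# Our toy positive-sentiment lexicon
lexicon = ["happy", "good", "amazing"]

# Keep only the columns that belong to the sentiment language
sentiment_dtm = dtm[lexicon]
print(sentiment_dtm)
```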

Now, why bother with all of this?

Because if you look closely, the sum across every single row here is nothing but our estimate for sentiment.

The sum of each row represents \phi_s, or sentiment, estimated using a frequency counts approach.

Slide showcasing how to estimate sentiment using a DTM

Because remember, the values inside this “subset” document term matrix are literally just the frequency counts of the cleaned words in a given document that belong to a sentiment language.

So if we take a look at our toy example, again, we have the count for positive sentiment!

We’re just calling that “pos_count”.

And each value in the “pos_count” column essentially represents \phi_{pos} (positive sentiment).

Strictly in this case, \phi_{pos} represents positive sentiment, estimated using a frequency counts approach.

Slide showcasing positive sentiment counts
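Here’s a sketch of that row-sum step, reusing the first three documents from the earlier illustrative DTM:

```python
import pandas as pd

# Lexicon-only DTM: term counts for "happy", "good", "amazing" (illustrative)
sentiment_dtm = pd.DataFrame(
    {"happy": [1, 0, 2], "good": [2, 0, 0], "amazing": [0, 0, 1]},
    index=["d1t1", "d2t1", "d3t1"],
)

# phi_pos: positive sentiment estimated via frequency counts,
# i.e. the sum across each row
pos_count = sentiment_dtm.sum(axis=1)
print(pos_count)  # d1t1: 3, d2t1: 0, d3t1: 3
```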

Of course, estimating sentiment using a proportional counts approach is trivial after having obtained the frequency counts based estimates.

Because all we’d need to do is divide every single row in the “pos_count” column by the total number of cleaned words for each document.

It literally is that simple!
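A sketch of that final division, with illustrative totals for the number of cleaned words per document:

```python
import pandas as pd

# Frequency-count estimates of positive sentiment (from the previous sketch)
pos_count = pd.Series({"d1t1": 3, "d2t1": 0, "d3t1": 3})

# Total number of cleaned words per document (illustrative)
total_words = pd.Series({"d1t1": 120, "d2t1": 95, "d3t1": 60})

# Proportional counts: the share of cleaned words that are positive
pos_prop = pos_count / total_words
print(pos_prop)  # e.g. d1t1: 0.025, i.e. 2.5% of its cleaned words are positive
```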


Related Course: Investment Analysis with Natural Language Processing (NLP)

Do you want to build a rigorous investment analysis system that leverages the power of text data with Python?

Explore the Course
