Sentiment Analysis on ANY Length of Text With Transformers (Python)

James Briggs

artificial intelligence, deep learning, huggingface, huggingface transformers, language, machine learning, natural language processing, nlp, python transformers, pytorch, sentiment analysis, sentiment analysis with transformers, sentiment classification, tensorflow, transformers

description

The de facto standard for many natural language processing (NLP) tasks nowadays is the transformer. Text generation? Transformer. Question answering? Transformer. Language classification? Transformer!

However, one problem with many of these models (and it is not restricted to transformer models) is that they cannot process long pieces of text.

Almost every article I write on Medium contains 1000+ words, which, when tokenized for a transformer model like BERT, will produce 1000+ tokens. BERT (and many other transformer models) will consume at most 512 tokens, so anything beyond this length is truncated.

Although I think you may struggle to find value in processing my Medium articles, the same limit applies to many useful data sources, like news articles or Reddit posts.

We will take a look at how we can work around this limitation. In this video, we will find the sentiment of long posts from the /r/investing subreddit. The video covers:

High-Level Approach
Getting Started
  Data
  Initialization
Tokenization
Preparing The Chunks
  Split
  CLS and SEP
  Padding
  Reshaping For BERT
Making Predictions
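
The chunk-preparation steps in the outline (split, CLS and SEP, padding, reshaping) can be sketched in plain Python as follows. This is a minimal illustration and not the code from the video: the function name chunk_token_ids is hypothetical, and the special-token IDs (101, 102, 0) assume the bert-base-uncased vocabulary. In a real pipeline the token IDs would come from a tokenizer and the output lists would be converted to tensors before being passed to the model.

```python
# Assumed special-token IDs (bert-base-uncased convention):
# [CLS] = 101, [SEP] = 102, [PAD] = 0.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
MAX_LEN = 512  # maximum sequence length BERT accepts


def chunk_token_ids(token_ids, max_len=MAX_LEN):
    """Split a long list of token IDs into BERT-sized chunks.

    Returns (input_ids, attention_mask). Each is a list of lists with
    shape (num_chunks, max_len), ready to be turned into a batch.
    """
    body_len = max_len - 2  # leave room for [CLS] and [SEP]
    input_ids, attention_mask = [], []
    # An empty input still yields one chunk holding only [CLS] and [SEP].
    for start in range(0, max(len(token_ids), 1), body_len):
        body = token_ids[start:start + body_len]
        chunk = [CLS_ID] + body + [SEP_ID]      # CLS and SEP
        pad = max_len - len(chunk)              # padding needed
        input_ids.append(chunk + [PAD_ID] * pad)
        attention_mask.append([1] * len(chunk) + [0] * pad)
    return input_ids, attention_mask


if __name__ == "__main__":
    fake_ids = list(range(1000, 2200))  # 1200 made-up token IDs
    ids, mask = chunk_token_ids(fake_ids)
    print(len(ids), len(ids[0]))  # number of chunks, chunk length
```

Each row is then a valid 512-token input. One possible way to get a single sentiment for the whole text (an assumption here, not something this description specifies) is to run the model on every chunk and average the per-chunk class probabilities.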

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5

Here's a link to the Medium article: https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f

And a free access link if you don't have Medium membership: https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f?sk=d4e717eb2ff31fb27ea67019bbb63ad6 ... https://www.youtube.com/watch?v=yDGo9z_RlnE

created

2025-02-21

staked

0.0 LBC

license

Copyrighted (contact publisher)

File size

139436804 Bytes