NLP for beginners | Classifying text using TextBlob

Introduction

Natural language processing (NLP) is an area of increasing attention due to the growing number of applications such as chatbots, machine translation, etc. In a way, the entire intelligent machine revolution is based on the ability to understand and interact with humans.

I've been exploring NLP for some time. My journey started with the NLTK library in Python, which was the recommended library to start with at that time. NLTK is a great library for education and research, but it becomes heavy and tedious for completing even the simplest tasks.

Later, I was introduced to TextBlob, which is built on NLTK and Pattern. A big advantage is that it is easy to learn and offers many features such as sentiment analysis, POS tagging, noun phrase extraction, etc. It has now become my go-to library for performing NLP tasks.

On a side note, there is spaCy, which is widely recognized as one of the most powerful and advanced libraries for implementing NLP tasks. But having worked with both spaCy and TextBlob, I would still suggest TextBlob to a beginner because of its simple interface.

If it is your first step in NLP, TextBlob is the perfect library for you to practice with. The best way to read this article is to follow the code and do the tasks yourself. So let's get started!

Note: This article does not describe NLP tasks in depth. If you want to review the basics and come back here, you can always read this article.

Table of Contents

  1. About TextBlob?
  2. Set up the system
  3. Try NLP tasks with TextBlob
    1. Tokenization
    2. Noun Phrase Extraction
    3. Part-of-speech tagging
    4. Word inflection and lemmatization
    5. N-grams
    6. Sentiment analysis
  4. Other cool things to do with TextBlob
    1. Spell correction
    2. Create a short summary of a text
    3. Translation and language detection
  5. Classifying text using TextBlob
  6. Pros and cons
  7. Final notes

1. About TextBlob?

TextBlob is a Python library and offers a simple API to access its methods and perform basic NLP tasks.

The nice thing about TextBlob objects is that they behave like Python strings, so you can transform them and play with them the same way you would in Python: slicing, changing case, searching, concatenating. Don't worry about the syntax for now; the point is just to give you an idea of how closely TextBlob is related to Python strings.

So, to do these things on your own, let's quickly install TextBlob and start coding.

2. System configuration

Installing TextBlob on your system is a simple task. All you need to do is open the Anaconda prompt (or the terminal if you are using macOS or Ubuntu) and enter the following command:

pip install -U textblob

This will install TextBlob. For the uninitiated: practical work in natural language processing typically uses large bodies of linguistic data, or corpora. To download the necessary corpora, you can run the following command:

python -m textblob.download_corpora

3. NLP Tasks with TextBlob

3.1 Tokenization

Tokenization refers to dividing text or a sentence into a sequence of tokens, which roughly correspond to “words”. This is one of the basic tasks of NLP. To do this using TextBlob, follow these two steps:

  1. Create a textblob object and pass a string to it.
  2. Call functions of textblob to perform a specific task.

So, let's quickly create a textblob object to play with.

from textblob import TextBlob

blob = TextBlob("DataPeaker is a great platform to learn data science. \n It helps community through blogs, hackathons, discussions,etc.")

Now, this block of text can be split into sentences, and each sentence can then be split into words.

3.2 Noun Phrase Extraction

Just as we extracted the words in the previous section, we can instead extract the noun phrases from the textblob. Noun phrase extraction is particularly important when you want to analyze the “who” in a sentence. Let's see an example below.

blob = TextBlob("Analytics Vidhya is a great platform to learn data science.")
for np in blob.noun_phrases:
    print(np)
>> analytics vidhya
great platform
data science

As we can see, the results are not perfect, but we should keep in mind that we are working with machines.

3.3 Part-of-speech tagging

Part-of-speech tagging, or grammatical tagging, is a method of marking the words in a text based on their definition and context. In simple words, it tells whether a word is a noun, an adjective, a verb, etc. This is essentially a more complete version of noun phrase extraction, where we want to find all the parts of speech in a sentence.

Let's check the tags of our textblob.

blob = TextBlob("Analytics Vidhya is a great platform to learn data science.")
for word, tag in blob.tags:
    print(word, tag)
>> Analytics NNS
Vidhya NNP
is VBZ
a DT
great JJ
platform NN
to TO
learn VB
data NNS
science NN

Here, NN represents a noun, DT represents a determiner, etc. You can check the full list of tags here to learn more.
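The tags above follow the Penn Treebank convention. As a reading aid, here is a small sketch that attaches a plain-English description to each tag seen in the output. TAG_MEANINGS and describe are hypothetical helpers written for this article, not part of TextBlob, and the table only covers the tags shown above.

```python
# A few Penn Treebank tags that appear in the output above.
TAG_MEANINGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "TO": "to",
}


def describe(tagged_words):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, TAG_MEANINGS.get(tag, "unknown tag"))
            for word, tag in tagged_words]


sample = [("is", "VBZ"), ("a", "DT"), ("great", "JJ"), ("platform", "NN")]
for word, tag, meaning in describe(sample):
    print(f"{word:10} {tag:4} {meaning}")
```

The same function can be fed blob.tags directly, since that property yields (word, tag) pairs.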

3.4 Word inflection and lemmatization

Inflection is a process of word formation in which characters are added to the base form of a word to express grammatical meanings. Word inflection in TextBlob is very simple, i.e., the words we tokenized from a textblob can easily be changed to singular or plural.

blob = TextBlob("DataPeaker is a great platform to learn data science. \n It helps community through blogs, hackathons, discussions,etc.")
print(blob.sentences[1].words[1])
print(blob.sentences[1].words[1].singularize())

>> helps
help

The TextBlob library also offers a built-in class known as Word. We just need to create a Word object and then call a function on it directly, as shown below.

from textblob import Word
w = Word('Platform')
w.pluralize()
>>'Platforms'

We can also use the tags to inflect a particular type of word, as shown below.

## using tags
for word, pos in blob.tags:
    if pos == 'NN':
        print(word.pluralize())
>> platforms
sciences

Words can be lemmatized using the lemmatize function.

## lemmatization
w = Word('running')
w.lemmatize("v") ## v here represents verb
>> 'run'

3.5 N-grams

A combination of several words together is called an N-gram. N-grams (N > 1) are generally more informative than single words and can be used as features for language modeling. N-grams can be easily accessed in TextBlob using the ngrams function, which returns a list of word lists, each holding n successive words.

blob = TextBlob("Analytics Vidhya is a great platform to learn data science.")
for ngram in blob.ngrams(2):
    print(ngram)
>> ['Analytics', 'Vidhya']
['Vidhya', 'is']
['is', 'a']
['a', 'great']
['great', 'platform']
['platform', 'to']
['to', 'learn']
['learn', 'data']
['data', 'science']
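To see what an n-gram is without any library, here is a plain-Python sketch of the sliding window idea. The ngrams function below is a stand-alone illustration and not TextBlob's own implementation.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = ["Analytics", "Vidhya", "is", "a", "great", "platform"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A list of k words yields k - n + 1 n-grams, so six words give five bigrams and four trigrams.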

3.6 Sentiment analysis

Sentiment analysis is basically the process of determining the attitude or emotion of the writer, i.e., whether it is positive, negative or neutral.

The sentiment property of a textblob returns two values, polarity and subjectivity.

Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement. Subjective sentences generally refer to opinions, personal emotions or judgments, while objective ones refer to factual information. Subjectivity is also a float, in the range [0, 1].

Let's check the sentiment of our blob.

blob = TextBlob("DataPeaker is a great platform to learn data science.")
print(blob)
blob.sentiment
>> DataPeaker is a great platform to learn data science.
Sentiment(polarity=0.8, subjectivity=0.75)

We can see that the polarity is 0.8, which means that the statement is positive, and the subjectivity of 0.75 indicates that it is mostly an opinion rather than factual information.
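If you need a label rather than a number, one possible approach is to bucket the polarity score. The helper below is a hypothetical sketch; the 0.1 threshold is an arbitrary choice for illustration and not something TextBlob defines.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary choice for this sketch, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(label_sentiment(0.8))    # the polarity reported above
print(label_sentiment(-0.5))
print(label_sentiment(0.0))
```

In practice you would pass blob.sentiment.polarity as the argument and tune the threshold on your own data.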

4. Other cool things to do

4.1 Spell correction

Spell checking is an interesting feature that TextBlob offers. It can be accessed using the correct function, as shown below.

blob = TextBlob('DataPeaker is a gret platfrm to learn data scence')
blob.correct()
>> TextBlob("DataPeaker is a great platform to learn data science")

We can also check the list of suggested words and their confidence using the spellcheck function.

blob.words[3].spellcheck()
>> [('great', 0.5351351351351351),
 ('get', 0.3162162162162162),
 ('grew', 0.11216216216216217),
 ('grey', 0.026351351351351353),
 ('greet', 0.006081081081081081),
 ('fret', 0.002702702702702703),
 ('grit', 0.0006756756756756757),
 ('curly', 0.0006756756756756757)]

4.2 Create a short summary of a text

This is a simple trick that uses the things we learned above. First, take a look at the code shown below and try to understand it yourself.

import random
from textblob import TextBlob, Word

blob = TextBlob('DataPeaker is a thriving community for data driven industry. This platform allows '
                'people to know more about analytics from its articles, Q&A forum, and learning paths. Also, we help '
                'professionals & amateurs to sharpen their skillsets by providing a platform to participate in Hackathons.')

# Collect the lemmatized singular nouns (tag NN) from the text
nouns = list()
for word, tag in blob.tags:
    if tag == 'NN':
        nouns.append(word.lemmatize())

print("This text is about...")
# Sample at most 5 nouns so that short texts do not raise a ValueError
for item in random.sample(nouns, min(5, len(nouns))):
    word = Word(item)
    print(word.pluralize())

>> This text is about...
communities
platforms
forums
platforms
industries

Simple, isn't it? What we did above is extract a list of nouns from the text to give the reader a general idea of the things the text relates to.
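Because random.sample picks different nouns on every run, a deterministic variation is to report the most frequent nouns instead. This is only a sketch of that alternative: top_nouns is a hypothetical helper and the noun list is made-up sample data of the kind the loop above might collect.

```python
from collections import Counter


def top_nouns(nouns, k=3):
    """Return the k most frequent nouns, most frequent first."""
    return [noun for noun, _ in Counter(nouns).most_common(k)]


# Hypothetical list of lemmatized nouns, as the loop above might collect them.
nouns = ["community", "industry", "platform", "forum", "platform"]
print(top_nouns(nouns, k=2))
```

Nouns that occur equally often keep the order in which they first appeared in the text.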

4.3 Translation and language detection

Can you guess what is written on the following line?

[Screenshot: a TextBlob created from a line of Arabic text]

Can you also guess what language this is? Don't worry, we can detect it using TextBlob…

blob.detect_language()
>> 'ar'

So, it is Arabic. Now, let's translate it to English using TextBlob so we can find out what is written.

blob.translate(from_lang='ar', to = 'en')
>> TextBlob("that's cool")

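Even if you don't explicitly define the source language, TextBlob will automatically detect the language and translate it to the desired language. Note that both features relied on the Google Translate web service; they were deprecated in TextBlob 0.16 and removed in 0.18, so the calls in this section need an older release.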

blob.translate(to = 'en') ## or you can directly do like this
>> TextBlob("that's cool")

This is really cool !!! 😀

5. Classifying text using TextBlob

Let's build a simple text classification model using TextBlob. For this, we first need to prepare training and test data.

training = [
('Tom Holland is a terrible spiderman.','pos'),
('a terrible Javert (Russell Crowe) ruined Les Miserables for me...','pos'),
('The Dark Knight Rises is the greatest superhero movie ever!','neg'),
('Fantastic Four should have never been made.','pos'),
('Wes Anderson is my favorite director!','neg'),
('Captain America 2 is pretty awesome.','neg'),
('Lets pretend "Batman and Robin" never happened..','pos'),
]
testing = [
('Superman was never an interesting character.','pos'),
('Fantastic Mr Fox is an awesome film!','neg'),
('Dragonball Evolution is simply terrible!!','pos')
]

TextBlob provides a built-in classifiers module to create a custom classifier. So, let's quickly import it and create a basic classifier.

from textblob import classifiers
classifier = classifiers.NaiveBayesClassifier(training)

As you can see above, we have passed the training data to the classifier.

Note that we have used the Naive Bayes classifier here, but TextBlob also offers a decision tree classifier, as shown below.

## decision tree classifier
dt_classifier = classifiers.DecisionTreeClassifier(training)

Now, let's check the accuracy of this classifier on the test dataset. TextBlob also lets us inspect the most informative features.

print (classifier.accuracy(testing))
classifier.show_informative_features(3)
>> 1.0
Most Informative Features
            contains(is) = True              neg : pos    =      2.9 : 1.0
      contains(terrible) = False             neg : pos    =      1.8 : 1.0
         contains(never) = False             neg : pos    =      1.8 : 1.0

So, we can see that if the text contains the word “is”, there is a high probability that the statement is classified as neg.
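The contains(...) names in that output come from bag-of-words features: for every word in the training vocabulary, the classifier records whether the text contains it. The sketch below imitates that idea in plain Python. It is a simplification with naive whitespace tokenization, and contains_features and the three-word vocabulary are hypothetical, not TextBlob's actual extractor.

```python
def contains_features(document, vocabulary):
    """Build bag-of-words features of the form contains(word) -> bool."""
    tokens = set(document.split())
    return {f"contains({word})": word in tokens for word in vocabulary}


# Tiny hypothetical vocabulary; the real one would come from the training texts.
vocabulary = ["is", "terrible", "never"]
features = contains_features("the weather is terrible", vocabulary)
print(features)
```

Seen this way, a very common word such as “is” can end up as a strong feature simply because of how it happens to be distributed in a tiny training set.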

To get a better feel for it, let's try our classifier on a random text.

blob = TextBlob('the weather is terrible!', classifier=classifier)
print (blob.classify())
>> neg

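So, based on the training data above, our classifier labels this text as neg. Keep in mind that the training set is tiny, that its pos and neg labels look swapped relative to the tone of the reviews, and that the most informative feature is the word “is”, so this result says little about real sentiment.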

Note that we could have done some data preprocessing and cleaning here, but my goal was to give you an idea of how text classification can be done using TextBlob.

6. Pros and cons

Pros:

  1. Since it is built on the shoulders of NLTK and Pattern, it is simple for beginners and provides an intuitive interface to NLTK.
  2. Provides translation and language detection powered by Google Translate (not provided by spaCy).

Cons:

  1. It's a bit slower compared to spaCy, but faster than NLTK. (spaCy > TextBlob > NLTK)
  2. It does not provide features like dependency parsing, word vectors, etc., which spaCy provides.

7. Final notes

I hope you had fun learning about this library. TextBlob really does provide a very easy interface for beginners to learn basic NLP tasks.

I would recommend that all beginners start with this library and then, to do advanced work, move on to learning spaCy. We will continue to use TextBlob for initial prototyping in almost all NLP projects.

You can find the full code for this article in my github repository.

Did you find this article useful? Share your opinions / thoughts in the comment section below.
