Interesting tool for text analysis and NLP

This article was published as part of the Data Science Blogathon

Introduction

There are many ways to compare text in Python, but we often want a simple one. Text comparison is needed for many text analysis and natural language processing tasks.

One of the easiest ways to compare text in Python is to use the FuzzyWuzzy library. It returns a score out of 100 based on how similar two strings are; in other words, it gives us a similarity index. The library uses the Levenshtein distance to calculate the difference between two strings.

Levenshtein distance

Levenshtein distance is a string metric for measuring the difference between two strings. The Soviet mathematician Vladimir Levenshtein formulated it, and it is named after him.

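The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev(a, b), where

$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\
1 + \min
\begin{cases}
\operatorname{lev}(\operatorname{tail}(a), b) \\
\operatorname{lev}(a, \operatorname{tail}(b)) \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))
\end{cases} & \text{otherwise,}
\end{cases}
$$

where the tail of some string x is the string of all but the first character of x, and x[n] is the nth character of the string x, counting from 0.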

(Source: https://en.wikipedia.org/wiki/Levenshtein_distance)
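
To make the definition concrete, here is a minimal sketch of the Levenshtein distance in plain Python. It uses the usual row-by-row dynamic programming formulation instead of the recursive definition, because the naive recursion recomputes the same subproblems many times. The function name `levenshtein` is our own and is not part of FuzzyWuzzy.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The distance counts edits, so a lower value means more similar strings; FuzzyWuzzy reports a 0 to 100 similarity score instead.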

FuzzyWuzzy

FuzzyWuzzy is an open-source library developed and released by SeatGeek. You can read their original blog post here. Its simple interface and its scoring metric (a score out of 100) make FuzzyWuzzy convenient for text comparison, and it has numerous applications.

Installation:

pip install fuzzywuzzy
pip install python-Levenshtein

fuzzywuzzy is the library itself. python-Levenshtein is optional, but installing it speeds up the calculations.

Now let's start with the code by importing the necessary libraries.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Necessary imports are made.

#string comparison

#exactly same text
fuzz.ratio('London is a big city.', 'London is a big city.')

Output: 100

Since the two strings are exactly the same here, we get the result 100, indicating identical strings.

#string comparison

#not same text
fuzz.ratio('London is a big city.', 'London is a very big city.')

Output: 89

Since the strings now differ, the score drops to 89. This gives a first look at how FuzzyWuzzy behaves.
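
To see where a number like 89 can come from, here is a small sketch using only the standard library. It assumes the score is computed the way `difflib.SequenceMatcher.ratio()` does it, that is 2*M/T, where M is the number of matched characters and T is the combined length of both strings. `simple_ratio` is a hypothetical helper for illustration, not the library function, and with python-Levenshtein installed the library uses a different backend, so the two may not always agree.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity 2*M/T scaled to 0-100 and rounded to an integer."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(simple_ratio('London is a big city.', 'London is a big city.'))       # 100
print(simple_ratio('London is a big city.', 'London is a very big city.'))  # 89
```

In the second call, 21 characters match and the strings have 21 + 26 = 47 characters in total, so the score is 2 * 21 / 47, which rounds to 89.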

#now let us normalize the case

a1 = "Python Program"
a2 = "PYTHON PROGRAM"
Ratio = fuzz.ratio(a1.lower(),a2.lower())
print(Ratio)

Output: 100

Here, although the two strings differed in case, both were converted to lowercase before the comparison, so the score is 100.

Substring Match

Often we need to compare two strings where one could be a substring of the other. For instance, suppose we are testing a text summarizer and want to check how well it performs. The summarized text will then be roughly a substring of the original text. FuzzyWuzzy has powerful functions to deal with such cases.

#fuzzywuzzy functions to work with substring matching

b1 = "The Samsung Group is a South Korean multinational conglomerate headquartered in Samsung Town, Seoul."
b2 = "Samsung Group is a South Korean company based in Seoul"

Ratio = fuzz.ratio(b1.lower(),b2.lower())
Partial_Ratio = fuzz.partial_ratio(b1.lower(),b2.lower())

print("Ratio:",Ratio)
print("Partial Ratio:",Partial_Ratio)

Output:

Ratio: 64
Partial Ratio: 74

Here, we can see that the score from the partial_ratio function is higher. This indicates that it is able to recognize that the string b2 is largely made up of words from b1.
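
The idea behind a partial ratio can be sketched as follows: compare the shorter string against every window of the same length in the longer string and keep the best score. This is a simplified brute-force illustration built on `difflib`, not the library's actual algorithm, and `simple_partial_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any same-length
    window of the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(simple_partial_ratio("big city", "London is a big city."))  # 100
```

Because "big city" appears verbatim inside the longer string, one window matches it exactly and the score is 100, even though a plain ratio of the two full strings would be much lower.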

Token sort ratio

But the above substring comparison method is not foolproof. Often the same words appear in a different order, so two sentences that say the same thing do not line up character by character. In this case, we use a different function.

c1 = "Samsung Galaxy SmartPhone"
c2 =  "SmartPhone Samsung Galaxy"
Ratio = fuzz.ratio(c1.lower(),c2.lower())
Partial_Ratio = fuzz.partial_ratio(c1.lower(),c2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(c1.lower(),c2.lower())
print("Ratio:",Ratio)
print("Partial Ratio:",Partial_Ratio)
print("Token Sort Ratio:",Token_Sort_Ratio)

Output:

Ratio: 56
Partial Ratio: 60
Token Sort Ratio: 100

Here, the strings are just reordered versions of each other. Both carry the same meaning and mention the same entity. The standard ratio function gives a score of only 56, while token_sort_ratio gives a similarity of 100.

So it is clear that in some situations or applications, the token sort ratio will be more useful.
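
The token sort idea is easy to sketch: split each string into words, sort the words, join them back together, and only then compare. The helper below is a hypothetical illustration built on `difflib`; the library's own preprocessing may differ in details such as punctuation handling.

```python
from difflib import SequenceMatcher


def simple_token_sort_ratio(s1: str, s2: str) -> int:
    """0-100 similarity after lowercasing and sorting the words of each string."""
    def normalize(text: str) -> str:
        # Word order no longer matters once the tokens are sorted.
        return " ".join(sorted(text.lower().split()))

    ratio = SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()
    return int(round(100 * ratio))


print(simple_token_sort_ratio("Samsung Galaxy SmartPhone",
                              "SmartPhone Samsung Galaxy"))  # 100
```

Both inputs become "galaxy samsung smartphone" after sorting, which is why the score is 100.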

Token set ratio

But what if the two strings have very different lengths? The token sort ratio may not work well in this situation. For that, we have the token_set_ratio function.

d1 = "Windows is built by Microsoft Corporation"
d2 = "Microsoft Windows"


Ratio = fuzz.ratio(d1.lower(),d2.lower())
Partial_Ratio = fuzz.partial_ratio(d1.lower(),d2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(d1.lower(),d2.lower())
Token_Set_Ratio = fuzz.token_set_ratio(d1.lower(),d2.lower())
print("Ratio:",Ratio)
print("Partial Ratio:",Partial_Ratio)
print("Token Sort Ratio:",Token_Sort_Ratio)
print("Token Set Ratio:",Token_Set_Ratio)

Output:

Ratio: 41
Partial Ratio: 65
Token Sort Ratio: 59
Token Set Ratio: 100

Ah! A score of 100. The reason is that every word of the string d2 is present in the string d1.

Now, let's slightly modify the string d2.

d1 = "Windows is built by Microsoft Corporation"
d2 = "Microsoft Windows 10"


Ratio = fuzz.ratio(d1.lower(),d2.lower())
Partial_Ratio = fuzz.partial_ratio(d1.lower(),d2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(d1.lower(),d2.lower())
Token_Set_Ratio = fuzz.token_set_ratio(d1.lower(),d2.lower())
print("Ratio:",Ratio)
print("Partial Ratio:",Partial_Ratio)
print("Token Sort Ratio:",Token_Sort_Ratio)
print("Token Set Ratio:",Token_Set_Ratio)

After slightly modifying the text d2, the token set ratio drops to 92. This is because the text "10" is not present in the string d1.
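
A rough sketch of the token set idea, under the assumption that it works on the set of shared words and the words unique to each string: build one string from the shared words alone and two more from the shared words plus each string's leftovers, then take the best pairwise score. The names below are hypothetical and the comparison uses `difflib`, so this is an illustration and not the library's implementation.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def simple_token_set_ratio(s1: str, s2: str) -> int:
    """Best 0-100 score among comparisons built from shared and unshared words."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    common = " ".join(sorted(tokens1 & tokens2))     # words in both strings
    only_in_1 = " ".join(sorted(tokens1 - tokens2))  # leftovers of s1
    only_in_2 = " ".join(sorted(tokens2 - tokens1))  # leftovers of s2

    combined_1 = (common + " " + only_in_1).strip()
    combined_2 = (common + " " + only_in_2).strip()

    return max(
        _ratio(common, combined_1),
        _ratio(common, combined_2),
        _ratio(combined_1, combined_2),
    )


d1 = "Windows is built by Microsoft Corporation"
print(simple_token_set_ratio(d1, "Microsoft Windows"))     # 100
print(simple_token_set_ratio(d1, "Microsoft Windows 10"))  # 92
```

In the first call the shared words "microsoft windows" are all of d2, so one comparison is an exact match. In the second, "microsoft windows" is compared with "microsoft windows 10", which gives 2 * 17 / 37, or 92 after rounding.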

WRatio()

This function helps handle differences in upper and lower case, punctuation and some other variations.

#fuzz.WRatio()

print("Slight change of case:",fuzz.WRatio('Ferrari LaFerrari', 'FerrarI LAFerrari'))

Output:

Slight change of case: 100

Let's try to remove a space.

#fuzz.WRatio()

print("Slight change of case and a space removed:",fuzz.WRatio('Ferrari LaFerrari', 'FerrarILAFerrari'))

Output:

Slight change of case and a space removed: 97

Let's try some punctuation marks.

#handling some random punctuations
g1='Microsoft Windows is good, but takes up lof of ram!!!'
g2='Microsoft Windows is good but takes up lof of ram?'
print(fuzz.WRatio(g1,g2 ))

Output: 99

As we can see, FuzzyWuzzy has a lot of handy functions that can be used for all kinds of text comparison tasks.
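
The case and punctuation results above are consistent with the strings being normalized before they are compared. Here is a minimal sketch of such a preprocessing step, assuming non-alphanumeric characters are replaced by spaces and the text is lowercased and trimmed. `normalize` is a hypothetical helper; WRatio itself does more than this, since it also weighs several of the ratios shown earlier.

```python
import re


def normalize(text: str) -> str:
    """Replace non-alphanumeric characters with spaces, lowercase, and trim."""
    return re.sub(r"\W", " ", text).lower().strip()


# Case differences disappear after normalization.
print(normalize('Ferrari LaFerrari') == normalize('FerrarI LAFerrari'))  # True

# Trailing punctuation is dropped as well.
print(normalize('ram!!!') == normalize('ram?'))  # True
```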

Some suitable applications:

FuzzyWuzzy has some cool applications.

It can be used to compare a longer text with its summary and judge their similarity. This can serve as a measure of how well a text summarizer performs.

Text similarity can also help assess the authenticity of a text, article, news item, book, etc. We often come across incorrect texts or data, and it is usually not possible to verify every piece of text by hand. Using text similarity, multiple texts can be cross-checked against each other.

FuzzyWuzzy can also be used to select the best match for a text among several candidates. So the applications of FuzzyWuzzy are numerous.

Text similarity is an important metric that can be used for various NLP and text analysis purposes. The interesting thing about FuzzyWuzzy is that similarities are given as a score out of 100. This makes scores easy to compare with each other and also produces a new feature that can be used for analytics or ML.

Summary similarity:

#uses of fuzzy wuzzy
#summary similarity

input_text="Text Analytics involves the use of unstructured text data, processing them into usable structured data. Text Analytics is an interesting application of Natural Language Processing. Text Analytics has various processes including cleaning of text, removing stopwords, word frequency calculation, and much more. Text Analytics has gained much importance these days. As millions of people engage in online platforms and communicate with each other, a large amount of text data is generated. Text data can be blogs, social media posts, tweets, product reviews, surveys, forum discussions, and much more. Such huge amounts of data create huge text data for organizations to use. Most of the text data available are unstructured and scattered. Text analytics is used to gather and process this vast amount of information to gain insights. Text Analytics serves as the foundation of many advanced NLP tasks like Classification, Categorization, Sentiment Analysis, and much more. Text Analytics is used to understand patterns and trends in text data. Keywords, topics, and important features of Text are found using Text Analytics. There are many more interesting aspects of Text Analytics, now let us proceed with our resume dataset. The dataset contains text from various resume types and can be used to understand what people mainly use in resumes. Resume Text Analytics is often used by recruiters to understand the profile of applicants and filter applications. Recruiting for jobs has become a difficult task these days, with a large number of applicants for jobs. Human Resources executives often use various Text Processing and File reading tools to understand the resumes sent. Here, we work with a sample resume dataset, which contains resume text and resume category. We shall read the data, clean it and try to gain some insights from the data."

The above is the original text.

output_text="Text Analytics involves the use of unstructured text data, processing them into usable structured data. Text Analytics is an interesting application of Natural Language Processing. Text Analytics has various processes including cleaning of text, removing stopwords, word frequency calculation, and much more. Text Analytics is used to understand patterns and trends in text data. Keywords, topics, and important features of Text are found using Text Analytics. There are many more interesting aspects of Text Analytics, now let us proceed with our resume dataset. The dataset contains text from various resume types and can be used to understand what people mainly use in resumes."
Ratio = fuzz.ratio(input_text.lower(),output_text.lower())
Partial_Ratio = fuzz.partial_ratio(input_text.lower(),output_text.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(input_text.lower(),output_text.lower())
Token_Set_Ratio = fuzz.token_set_ratio(input_text.lower(),output_text.lower())

print("Ratio:",Ratio)
print("Partial Ratio:",Partial_Ratio)
print("Token Sort Ratio:",Token_Sort_Ratio)
print("Token Set Ratio:",Token_Set_Ratio)

Output:

Ratio: 54
Partial Ratio: 79
Token Sort Ratio: 54
Token Set Ratio: 100

We can see the different scores. The partial ratio shows that the texts are quite similar, as they should be. Moreover, the token set ratio is 100, which is expected since the summary is taken entirely from the original text.

Best possible string match:

Let's use the process module to find the best possible match for a string among a list of strings.

#choosing the possible string match

#using process module

query = 'Stack Overflow'

choices = ['Stock Overhead', 'Stack Overflowing', 'S. Overflow',"Stoack Overflow"] 

print("List of ratios: ")

print(process.extract(query, choices))

print("Best choice: ",process.extractOne(query, choices))

Output:

List of ratios: 
[('Stoack Overflow', 97), ('Stack Overflowing', 90), ('S. Overflow', 85), ('Stock Overhead', 64)]
Best choice:  ('Stoack Overflow', 97)

So we get the similarity score of every choice as well as the best match.
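
Conceptually, this kind of lookup scores every candidate against the query and sorts the results. The sketch below shows the idea with a plain `difflib` ratio; `rank_choices` is a hypothetical helper, and because the library scores with WRatio by default, the numbers for the weaker candidates will not match the output above.

```python
from difflib import SequenceMatcher


def rank_choices(query: str, choices: list, limit: int = 5) -> list:
    """Return (choice, score) pairs sorted from best to worst match."""
    def score(choice: str) -> int:
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        return int(round(100 * ratio))

    scored = [(choice, score(choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = ['Stock Overhead', 'Stack Overflowing', 'S. Overflow', 'Stoack Overflow']
ranked = rank_choices('Stack Overflow', choices)
print(ranked[0])  # ('Stoack Overflow', 97)
```

The best match is simply the first element of the ranked list, which corresponds to what extractOne returns.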

Last words

The FuzzyWuzzy library is built on top of the difflib library, and python-Levenshtein is used to speed it up. All in all, FuzzyWuzzy is one of the easiest ways to compare strings in Python.

Check out the code on Kaggle here.

About me:

Prateek Majumder

Data science and analytics | Digital Marketing Specialist | SEO | Content creation

Connect with me on LinkedIn.

My other articles on DataPeaker: Link.

Thanks.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
