Named entity recognition (NER) in Python with spaCy

Natural language processing deals with text data. The amount of text data generated these days is enormous, and this data, if used correctly, can yield valuable results. Some of the most important natural language processing applications are text analysis, part-of-speech tagging, sentiment analysis and named entity recognition.

This large volume of text carries a great deal of information. An important part of analyzing it is the identification of named entities.

What is a named entity?

A named entity is a real-world object that can be identified and denoted with a proper name. Named entities can be a place, a person, an organization, a time, an object or a geographic entity.

For example, Roger Federer, Honda City and Samsung Galaxy S10 are named entities. Named entities are usually instances of entity types. For instance, Roger Federer is an instance of a tennis player / person, Honda City is an instance of a car and Samsung Galaxy S10 is an instance of a mobile phone.

Named entity recognition:

Named entity recognition is the NLP process that deals with identifying and classifying named entities. Raw or structured text is taken as input, and the named entities in it are classified into persons, organizations, places, money, time, etc. In short, named entities are identified and sorted into several predefined classes.

NER systems are developed with various linguistic approaches, as well as statistical and machine learning methods. NER has many applications for projects or commercial purposes.

The NER model first identifies an entity and then categorizes it into the most appropriate class. Some of the common types of named entities are:

1. Organizations:

NASA, CERN, ISRO, etc.

2. Places:

Mumbai, New York, Kolkata.

3. Money:

Billion dollars, 50 pounds sterling.

4. Date:

15 August 2020

5. Person:

Elon Musk, Richard Feynman, Subhas Chandra Bose.
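To make the two steps (identify, then classify) concrete, here is a minimal sketch that uses a plain lookup table, often called a gazetteer. It is a toy illustration only and is not how spaCy or any statistical NER model works; the `GAZETTEER` table and the `find_entities` helper are hypothetical names made up for this example.

```python
import re

# Hypothetical lookup table (gazetteer). A trained model learns this kind of
# knowledge from data instead of reading it from a fixed dictionary.
GAZETTEER = {
    "NASA": "ORG",
    "ISRO": "ORG",
    "Mumbai": "GPE",
    "New York": "GPE",
    "Elon Musk": "PERSON",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (entity text, label, start offset) tuples in order of appearance."""
    found = []
    for name, label in gazetteer.items():
        # Step 1: identify the span. \b keeps a name from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            # Step 2: classify the span with the label stored for that name.
            found.append((match.group(), label, match.start()))
    return sorted(found, key=lambda item: item[2])


entities = find_entities("Elon Musk visited NASA and ISRO before flying to Mumbai.")
for text, label, start in entities:
    print(text, label)
```

A lookup like this only finds names it already knows, which is one reason real NER systems rely on statistical and machine learning methods.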

One important thing about NER models is that their ability to recognize named entities depends on the data on which they have been trained. NER also has many applications.

NER can be used for content classification: the named entities of a text can be compiled and, based on them, the topics of the content can be understood. In academic and research fields, NER can be used to retrieve information more quickly from a wide variety of texts. NER is also very helpful for extracting information from large text data sets.
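As a rough sketch of the content classification idea, the entities of a text can be tallied by label and by mention to get a quick profile of what the text is about. The `entities` list below is hypothetical sample data in the (text, label) shape an NER system might return, and `summarize` is a made-up helper name.

```python
from collections import Counter

# Hypothetical (text, label) pairs, in the shape an NER system might return.
entities = [
    ("ISRO", "ORG"),
    ("India", "GPE"),
    ("Bengaluru", "GPE"),
    ("Department of Space", "ORG"),
    ("India", "GPE"),
]


def summarize(entities, top_n=3):
    """Count entity labels and return the most frequent mentions."""
    label_counts = Counter(label for _, label in entities)
    mention_counts = Counter(text for text, _ in entities)
    return label_counts, mention_counts.most_common(top_n)


label_counts, top_mentions = summarize(entities)
print(label_counts)
print(top_mentions)
```

A text dominated by ORG and GPE mentions of a space agency and its country is likely about that agency, which is the kind of signal a content classifier can build on.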

NER using spaCy:

spaCy is an open source natural language processing library that can be used for various tasks. It has built-in methods for named entity recognition and a fast statistical entity recognition system.

We can use spaCy very easily for NER tasks. Although we often need to train a model on our own data for specific business needs, the general spaCy model works well for many types of text.

Let's start with the code. First, we import spaCy and load the small English model.

import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

Now, we enter our sample text that we will be testing. The text has been taken from the ISRO Wikipedia page.

raw_text="The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."
text1 = NER(raw_text)

Now, we print the data on the NEs found in this text sample.

for word in text1.ents:
    print(word.text,word.label_)

Output:

The Indian Space Research Organisation ORG
the national space agency ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG
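Beyond printing, it is often handy to group the entities by label and keep their character positions. The sketch below relies only on the `ents`, `text`, `label_`, `start_char` and `end_char` attributes of a processed document; `entities_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict


def entities_by_label(doc):
    """Group entity mentions by label, keeping their character offsets."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append((ent.text, ent.start_char, ent.end_char))
    return dict(grouped)


# With the objects from this article: entities_by_label(text1)
```

The offsets let us map every entity back to its exact position in the original string.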

We can see that the named entities in this text have been extracted. If we are unsure what a particular entity label means, we can use the following method.

spacy.explain("ORG")

Output: 'Companies, agencies, institutions, etc.'

spacy.explain("GPE")

Output: 'Countries, cities, states'

Now, we try an interesting visual, which displays the NEs directly in the text.

displacy.render(text1,style="ent",jupyter=True)

Output:

[Image: displaCy rendering of the first sample text with its named entities highlighted]

I'll leave the Kaggle link at the end, so readers can test the code for themselves. In the visual, the named entities are marked directly in the text with contrasting colors, which makes the result easy to read. There is another type of visual, which explores the entire dataset as a whole. See the Kaggle link at the end.
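To get a feel for what such a visual does, here is a minimal sketch that wraps entity spans in HTML `<mark>` tags using character offsets. This is not displaCy's implementation; `highlight` is a hypothetical helper, and it assumes the spans do not overlap.

```python
from html import escape


def highlight(text, spans):
    """Wrap each (start, end, label) span of text in a <mark> tag."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Copy the untouched text before the entity, then the marked entity.
        pieces.append(escape(text[cursor:start]))
        pieces.append(
            f"<mark>{escape(text[start:end])} <small>{escape(label)}</small></mark>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


html = highlight("ISRO is based in Bengaluru.", [(0, 4, "ORG"), (17, 26, "GPE")])
print(html)
```

With a processed document, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in text1.ents]`.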

Let's try the same tasks with some text that contains more named entities.

raw_text2 = "The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and entered the orbit of Mars on 24 September 2014. India thus became the first country to enter the orbit of Mars on its first attempt. It was completed at a record cost of $74 million."

text2 = NER(raw_text2)
for word in text2.ents:
    print(word.text,word.label_)

Output:

The Mars Orbiter Mission PRODUCT
MOM ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY

Here, we get more types of named entities. Let's identify what type they are.

spacy.explain("PRODUCT")

Output: 'Objects, vehicles, foods, etc. (not services)'

spacy.explain("LOC")

Output: 'Non-GPE locations, mountain ranges, bodies of water'

spacy.explain("DATE")

Output: 'Absolute or relative dates or periods'

spacy.explain("ORDINAL")

Output: '"first", "second", etc.'

spacy.explain("MONEY")

Output: 'Monetary values, including unit'
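Instead of explaining one label at a time, we could collect the descriptions for every label found in a document. This is a small sketch: `describe_labels` is a hypothetical helper, and its `explain` parameter is an assumption added so that the lookup function can be swapped out; by default it falls back to `spacy.explain`.

```python
def describe_labels(doc, explain=None):
    """Map each entity label found in doc to its description."""
    if explain is None:
        import spacy  # imported here so the helper only needs spaCy when used with it

        explain = spacy.explain
    labels = sorted({ent.label_ for ent in doc.ents})
    return {label: explain(label) for label in labels}


# With the objects from this article: describe_labels(text2)
```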

Now, we look at the text as a whole in visual form.

displacy.render(text2,style="ent",jupyter=True)

Output:

[Image: displaCy rendering of the second sample text with its named entities highlighted]

Here, the various named entities are shown in contrasting colors, so we can understand the general nature of the text.

NER of a news article

We will extract the text of a news article and perform NER on it.

We will use Beautiful Soup for web scraping purposes.

from bs4 import BeautifulSoup
import requests
import re

Now, we will use the URL of the news article.

URL="https://www.zeebiz.com/markets/currency/news-cryptocurrency-news-today-june-12-bitcoin-dogecoin-shiba-inu-and-other-top-coins-prices-and-all-latest-updates-158490"
html_content = requests.get(URL).text
soup = BeautifulSoup(html_content, "lxml")

Now, we get the body content.

body=soup.body.text

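Now, we use string replacement and a regular expression to clean the text.

# Replace newlines, tabs, carriage returns and non-breaking spaces with spaces
body = body.replace('\n', ' ')
body = body.replace('\t', ' ')
body = body.replace('\r', ' ')
body = body.replace('\xa0', ' ')
# Drop every character that is neither a word character nor whitespace
body = re.sub(r'[^\w\s]', '', body)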
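One thing to keep in mind is that removing all punctuation also removes symbols such as "%" and "$", which can matter for entities like percentages and money. The sketch below compares a gentler cleanup (only collapsing whitespace) with full punctuation stripping. The `sample` string is hypothetical, and `clean_text` and `strip_punctuation` are made-up helper names.

```python
import re


def clean_text(raw):
    """Turn non-breaking spaces into spaces and collapse whitespace runs."""
    text = raw.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


# Hypothetical sample with a newline, a tab and a non-breaking space.
sample = "Bitcoin was down by 6%\n\tafter a\xa0$100 drop."
cleaned = clean_text(sample)
print(cleaned)
print(strip_punctuation(cleaned))
```

Whether to keep punctuation is a judgment call; keeping it gives the model more context to work with, at the cost of noisier scraped text.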

Let's now take a look at the text.

body[1000:1500]
'       View in App    Bitcoin was down by 6 and was trading at Rs 2728815 after hitting days high of Rs 2900208 Source Reuters        Reported By ZeeBiz WebTeam Written By Ravi Kant Kumar      Updated Sat Jun 12 20210646 pm   Patna ZeeBiz WebDesk    RELATED NEWS            Cryptocurrency Latest News Today June 14 Bitcoin leads crypto rally up over 12 after ELON MUSK TWEET Check Ethereum Polka Dot Dogecoin Shiba Inu and other top coins INR price World India updates             Bitcoin law is only'

Now, let's proceed with the recognition of named entities.

text3 = NER(body)
displacy.render(text3,style="ent",jupyter=True)

The visual is very large, but there are some interesting parts that I want to cover.

[Image: part of the displaCy rendering of the news article text]

Now, let's look at some observations.

Bitcoin has been labeled as a geographic location, and Patna as an organization. Apart from a few such cases, most of the text has been classified into the appropriate entity types. So, on the whole, the entity recognition has worked reasonably well.
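One simple way to deal with a handful of known mistakes like these is to post-process the predictions with hand-written corrections. The sketch below is only an illustration: the `OVERRIDES` table, the `apply_overrides` helper and the `predicted` list are all hypothetical, and choosing to drop "Bitcoin" rather than relabel it is an assumption.

```python
# Hypothetical hand-written corrections: map an entity text to the label it
# should carry, or to None when it should not be reported at all.
OVERRIDES = {"Patna": "GPE", "Bitcoin": None}


def apply_overrides(entities, overrides=OVERRIDES):
    """Relabel or drop (text, label) pairs according to the overrides table."""
    corrected = []
    for text, label in entities:
        if text in overrides:
            label = overrides[text]
            if label is None:
                continue  # drop entities we decided not to report
        corrected.append((text, label))
    return corrected


# Hypothetical predictions in (text, label) form.
predicted = [("Bitcoin", "GPE"), ("Patna", "ORG"), ("Reuters", "ORG")]
print(apply_overrides(predicted))
```

With a processed document, the input could be built as `[(ent.text, ent.label_) for ent in text3.ents]`. For anything beyond a few fixes, training the model on domain-specific data is the more robust route.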

NER still has many challenges, and there are many improvements yet to be made. Getting NER right remains a hard problem. Besides spaCy, other NLP platforms include GATE and OpenNLP.

To see the complete code, see this link in Kaggle.

Thus, we can conclude that NER is an important application of NLP and has widespread uses.

About me:

Prateek Majumder

Data science and analytics | Digital Marketing Specialist | SEO | Content creation

Connect with me on LinkedIn.

Thanks.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
