Automatic text summarization using the TextRank algorithm



Introduction

Text summarization is one of those applications of natural language processing (NLP) that is bound to have a huge impact on our lives. With growing digital media and ever-growing publications, who has the time to go through entire articles, documents, and books to decide whether they are useful or not? Thankfully, this technology is already here.

Have you come across the mobile app Inshorts? It is an innovative news app that converts news articles into a 60-word summary. And that is exactly what we are going to learn in this article: automatic text summarization.


Automatic text summarization is one of the most challenging and interesting problems in the field of natural language processing (NLP). It is the process of generating a concise and meaningful summary of text from multiple text resources such as books, news articles, blog posts, research articles, emails, and tweets.

The demand for automated text summarization systems is increasing these days thanks to the availability of large amounts of textual data.

In this article, we will explore the domain of text summarization. We will understand how the TextRank algorithm works, and we will also implement it in Python. Buckle up, this is going to be a fun ride!

Table of Contents

  1. Text summarization approaches
  2. Understanding the TextRank algorithm
  3. Understanding the problem statement
  4. Implementation of the TextRank algorithm
  5. What's Next?

Text summarization approaches


Automatic text summarization gained attention as early as the 1950s. A research paper published by Hans Peter Luhn in the late 1950s, titled "The Automatic Creation of Literature Abstracts", used features such as word frequency and phrase frequency to extract important sentences from the text for summarization purposes.

Another important piece of research, carried out by Harold P. Edmundson in the late 1960s, used methods such as the presence of keywords, words used in the title that also appear in the text, and the location of sentences to extract significant sentences for text summarization. Since then, many important and exciting studies have been published to address the challenge of automatic text summarization.

Text summarization can be broadly divided into two categories: extractive summarization and abstractive summarization.

  1. Extractive summarization: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.
  2. Abstractive summarization: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.

In this article, we will focus on the extractive summarization technique.

Understanding the TextRank algorithm

Before getting started with the TextRank algorithm, there is another algorithm we should become familiar with: the PageRank algorithm. In fact, it actually inspired TextRank! PageRank is used primarily to rank web pages in online search results. Let's quickly understand the basics of this algorithm with the help of an example.

PageRank algorithm

[Image: PageRank illustration. Source: http://www.scottbot.net/HIAL/]

Suppose we have 4 web pages: w1, w2, w3, and w4. These pages contain links pointing to one another. Some pages might have no links at all; these are called dangling pages.


  • Web page w1 has links to w2 and w4
  • w2 has links to w3 and w1
  • w4 has a link only to web page w1
  • w3 has no links and hence will be called a dangling page

To rank these pages, we would have to compute a score called the PageRank score. This score is the probability of a user visiting that page.

To capture the probabilities of users navigating from one page to another, we will create a square matrix M with n rows and n columns, where n is the number of web pages.


Each element of this matrix denotes the probability of a user transitioning from one web page to another. For instance, the cell in row w1 and column w2 contains the probability of transitioning from w1 to w2.


The initialization of the probabilities is explained in the steps below:

  1. The probability of going from page i to page j, i.e., M[ i ][ j ], is initialized with 1 / (number of unique links on web page wi)
  2. If there is no link between pages i and j, the probability will be initialized with 0
  3. If a user has landed on a dangling page, it is assumed that he or she is equally likely to transition to any page. Therefore, M[ i ][ j ] will be initialized with 1 / (number of web pages)

Therefore, in our case, the matrix M will be initialized as follows:

        w1     w2     w3     w4
w1      0      0.5    0      0.5
w2      0.5    0      0.5    0
w3      0.25   0.25   0.25   0.25
w4      1      0      0      0

Finally, the values in this matrix will be updated iteratively to arrive at the web page rankings.
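To make this concrete, here is a minimal sketch (not from the original article) of how the matrix M for our four hypothetical pages could be built and then updated iteratively with NumPy. The link structure follows the bullets above, and 0.85 is the damping factor conventionally used in PageRank:

import numpy as np

# rows = "from" page, columns = "to" page, in the order w1, w2, w3, w4
M = np.array([[0.00, 0.50, 0.00, 0.50],   # w1 links to w2 and w4
              [0.50, 0.00, 0.50, 0.00],   # w2 links to w1 and w3
              [0.25, 0.25, 0.25, 0.25],   # w3 is a dangling page
              [1.00, 0.00, 0.00, 0.00]])  # w4 links only to w1

d = 0.85                    # damping factor
n = M.shape[0]
rank = np.full(n, 1.0 / n)  # start with equal scores for every page

# repeatedly redistribute the scores along the links until they stabilize
for _ in range(100):
    rank = (1 - d) / n + d * (M.T @ rank)

print(dict(zip(['w1', 'w2', 'w3', 'w4'], np.round(rank, 3))))

The page that receives the most incoming probability mass ends up with the highest score, and this is exactly the behaviour TextRank borrows for ranking sentences.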

TextRank algorithm

Now that we have a grasp of PageRank, let's understand the TextRank algorithm. I have listed the similarities between the two algorithms below:

  • Instead of web pages, we use sentences
  • The similarity between any two sentences is used as an equivalent to the web page transition probability
  • The similarity scores are stored in a square matrix, similar to the matrix M used for PageRank

TextRank is an extractive and unsupervised text summarization technique. Let's take a look at the flow of the TextRank algorithm that we will follow:


  • The first step would be to concatenate all the text contained in the articles
  • Then split the text into individual sentences
  • In the next step, we will find the vector representation (word embeddings) for each and every sentence
  • Similarities between the sentence vectors are then calculated and stored in a matrix
  • The similarity matrix is then converted into a graph, with sentences as vertices and similarity scores as edges, for sentence rank calculation
  • Finally, a certain number of top-ranked sentences form the final summary

So, without further ado, let's fire up our Jupyter Notebooks and start coding!

Note: If you want to learn more about graph theory, then I recommend checking out this article.

Understanding the problem statement

Being a major tennis buff, I always try to keep myself updated with what's happening in the sport by religiously going through as many online tennis updates as possible. However, this has proven to be a rather difficult job! There are way too many resources and time is a constraint.

Therefore, I decided to design a system that could prepare a bullet-point summary for me by scanning through multiple articles. How to go about doing this? That's what I'll show you in this tutorial. We will apply the TextRank algorithm on a dataset of scraped articles with the aim of creating a nice and concise summary.


Please note that this is essentially a single-domain, multiple-document summarization task, i.e., we will take multiple articles as input and generate a single bullet-point summary. Multi-domain text summarization is not covered in this article, but feel free to try it out at your end.

You can download the dataset that we will use from here.

Implementation of the TextRank algorithm

So, without any further ado, fire up your Jupyter Notebooks and let's implement what we've learned so far.

Import required libraries

First, import the libraries that we will leverage for this challenge.

import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

Read data

Now let's read our data set. I have provided the link to download the data in the section above (in case you missed it).

df = pd.read_csv("tennis_articles_v4.csv")

Inspect the data

Let's take a quick look at the data.

df.head()

head_1-6168465
We have 3 columns in our dataset: 'article_id', 'article_text', and 'source'. We are most interested in the 'article_text' column as it contains the text of the articles. Let's print a few values of the variable just to see what they look like.

df['article_text'][0]

Output:

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player 
has no problems in openly speaking about it and in a recent interview she said: 'I don't really 
hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts 
or when I'm on the court playing, I'm a competitor and I want to beat every single person whether 
they're in the locker room or across the net...
df['article_text'][1]
BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating 
seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th 
overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had 
his first match-point chance to break serve again at 5-1...
df['article_text'][2]
Roger Federer has revealed that organisers of the re-launched and condensed Davis Cup gave him three days to 
decide if he would commit to the controversial competition. Speaking at the Swiss Indoors tournament where he will 
play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the 
impossibly short time frame to make a decision, he opted out of any commitment...

Now we have 2 options: we can either summarize each article individually, or we can generate a single summary for all the articles. For our purpose, we will go ahead with the latter.

Divide text into sentences

The next step is to break the text into individual sentences. We will use the sent_tokenize() function of the nltk library to do this.

from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

Let's print a few elements of the list sentences.

sentences[:5]

Output:

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
"The Russian player has no problems in openly speaking about it and in a recent
interview she said: 'I don't really hide any feelings too much.",
'I think everyone knows this is my job here.',
"When I'm on the courts or when I'm on the court playing,
I'm a competitor and I want to beat every single person whether they're in the
locker room or across the net.So I'm not the one to strike up a conversation about
the weather and know that in the next few minutes I have to go and try to win a tennis match.",
"I'm a pretty competitive girl."]

Download GloVe Word Embeddings

GloVe word embeddings are vector representations of words. These word embeddings will be used to create vectors for our sentences. We could also have used the Bag-of-Words or TF-IDF approaches to create features for our sentences, but these methods ignore the order of the words (and the number of features is usually pretty large).

We will be using the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors available here. Heads up: the size of these word embeddings is 822 MB.

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

Let's extract the word embeddings, or word vectors.

# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()
len(word_embeddings)
400000

We now have word vectors for 400,000 different terms stored in the dictionary 'word_embeddings'.
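As a quick sanity check (this snippet is illustrative and not part of the original walkthrough), we can look up the vector for an individual word; each entry in the dictionary is a 100-dimensional NumPy array:

# look up the embedding for a single word
print('tennis' in word_embeddings)       # True, assuming the word is in GloVe's vocabulary
print(word_embeddings['tennis'].shape)   # (100,)
print(word_embeddings['tennis'][:5])     # first five components of the vector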

Text preprocessing

It is always a good practice to make your textual data as noise-free as possible. So, let's do some basic text cleaning.

# remove punctuation, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]

Next, let's get rid of the stopwords (commonly used words of a language: is, am, the, of, in, etc.) present in the sentences. If you have not downloaded the nltk stopwords, then run the following line of code:

nltk.download('stopwords')

Now we can import the stopwords.

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Let's define a function to remove these stopwords from our dataset.

# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

We will use clean_sentences to create vectors for the sentences in our data with the help of the GloVe word vectors.

Vector representation of sentences


Now, let's create vectors for our sentences. We will first fetch the vectors (each of size 100 elements) for the constituent words in a sentence and then take the mean/average of those vectors to arrive at a consolidated vector for the sentence.

sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)

Note: For more text preprocessing best practices, you may check out our video course, Natural Language Processing (NLP) using Python.

Preparation of the similarity matrix

The next step is to find similarities between the sentences, and we will use the cosine similarity approach for this challenge. Let's create an empty similarity matrix for this task and populate it with the cosine similarities of the sentences.

Let's first define a zero matrix of dimensions (n * n). We will initialize this matrix with the cosine similarity scores of the sentences. Here, n is the number of sentences.

# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

We will use Cosine Similarity to calculate the similarity between a pair of sentences.

from sklearn.metrics.pairwise import cosine_similarity
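For intuition, cosine similarity is simply the dot product of two vectors divided by the product of their norms. The quick check below (an illustrative sketch, not part of the original code) should match sklearn's result for any pair of non-zero sentence vectors:

a, b = sentence_vectors[0], sentence_vectors[1]

# manual cosine similarity: dot(a, b) / (||a|| * ||b||), assuming neither vector is all zeros
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
via_sklearn = cosine_similarity(a.reshape(1, 100), b.reshape(1, 100))[0, 0]
print(round(manual, 6), round(via_sklearn, 6))  # the two values should agree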

And initialize the matrix with cosine similarity scores.

for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

Application of the PageRank algorithm

Before proceeding further, let's convert the similarity matrix sim_mat into a graph. The nodes of this graph will represent the sentences and the edges will represent the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
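nx.pagerank returns a dictionary mapping each node of the graph (here, a sentence index from 0 to n-1) to its score, which is what we index into in the next step. A quick peek (illustrative snippet, not from the original article):

# scores is a dict: {sentence index: PageRank score}
print(len(scores) == len(sentences))  # one score per sentence
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])  # three highest-scoring indices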

Summary extraction

Finally, it's time to extract the top N sentences based on their rankings for summary generation.

ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])
When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person 
whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the 
weather and know that in the next few minutes I have to go and try to win a tennis match.

Major players feel that a big event in late November combined with one in January before the Australian Open will 
mean too much tennis and too little rest.

Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius 
Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of 
any commitment.

"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the 
Olympic weeks, not necessarily during the tournaments.

Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event 
in London next month.

He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the 
win on his first match point.
The Spaniard broke Anderson twice in the second but didn't get another chance on the South African's serve in the 
final set.

"We also had the impression that at this stage it might be better to play matches than to train.

The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace 
the classic home-and-away ties played four times per year for decades.

Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent.

And there we go! An awesome, neat, concise, and useful summary for our articles.

What's Next?

Automatic text summarization is a hot topic of research, and in this article we have covered just the tip of the iceberg. Going forward, we will explore the abstractive text summarization technique, where deep learning plays a big role. In addition, we can also look into the following summarization tasks:

Problem-specific

  • Multi-domain text summarization
  • Single-document summarization
  • Cross-language text summarization (source in one language and summary in another language)

Algorithm-specific

  • Text summarization using RNNs and LSTMs
  • Text summarization using reinforcement learning
  • Text summarization using Generative Adversarial Networks (GANs)

Final notes

I hope this post helped you understand the concept of automatic text summarization. It has a variety of use cases and has spawned extremely successful applications. Whether it's for leveraging in your business or simply for your own knowledge, text summarization is an approach all NLP enthusiasts should be familiar with.

I will try to cover the abstractive text summarization technique using advanced methods in a future article. Meanwhile, feel free to use the comments section below to let me know your thoughts or ask any questions you might have about this article.

Find the code in this GitHub repository.
