Keyword extraction in Python


This post was released as part of the Data Science Blogathon.



Unstructured data contains a large amount of information. Like energy, when it is harnessed it creates high value for your stakeholders, and several companies are already working hard in this area. There is no doubt that unstructured data is noisy: significant work must be done to clean it, analyze it, and make it meaningful. This post talks about a technique that helps analyze large amounts of text by summarizing content and identifying topics of interest: keyword extraction.

Keyword extraction overview

Keyword extraction is a text analysis technique that lets us obtain important insights on a topic in a short period of time. It helps condense the text down to its most relevant keywords, saving the time of reviewing the entire document. Example use cases are finding topics of interest in a news article, identifying issues in customer feedback, and so on. One of the techniques used for keyword extraction is TF-IDF (term frequency - inverse document frequency).

TF-IDF overview

Term frequency (TF) – how often a term appears in a document. It is measured as: (number of times term t appears in the document) / (total number of words in the document)

Inverse document frequency (IDF) – how relevant a term is across the document. In this post it is measured as: log(total number of sentences / number of sentences containing term t)

TF-IDF – the score that measures the relevance of a word. It is computed as TF * IDF
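As a quick worked example of these formulas (toy numbers, assumed purely for illustration, not taken from the sample document used later): suppose a word appears 5 times in a 100-word document that has 10 sentences, 4 of which contain the word.

```python
import math

# Toy numbers, assumed for illustration
term_count = 5           # occurrences of the word in the document
total_words = 100        # total words in the document
total_sentences = 10     # total sentences in the document
sentences_with_term = 4  # sentences containing the word

tf = term_count / total_words                          # 0.05
idf = math.log(total_sentences / sentences_with_term)  # log(2.5) ≈ 0.916
tf_idf = tf * idf                                      # ≈ 0.046

print(tf, idf, tf_idf)
```

The rarer a word is across sentences, the larger its IDF, so very common words get pushed down even when their raw counts are high.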

We will take this concept and code it line by line in Python, working through all of the above steps on a small text document. Although more advanced techniques for keyword extraction already exist, this post aims to explain the basic concept behind identifying the relevance of words. Let's get started!


1. Import packages

We need tokenize to split the document into sentences, itemgetter to sort the dictionary, and math to take the natural (base e) logarithm

from nltk import tokenize
from operator import itemgetter
import math

2. Declare variables

We will declare a string variable. It will be a placeholder for the sample text document.

doc="I am a graduate. I want to learn Python. I like learning Python. Python is easy. Python is interesting. Learning increases thinking. Everyone should invest time in learning"

3. Remove the stopwords

Stopwords are words that appear frequently and may not be relevant to our analysis. We can remove them using the nltk library

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')  # uncomment on first run
# nltk.download('punkt')      # needed for sent_tokenize in step 5

stop_words = set(stopwords.words('english'))

4. Find the total number of words in the document

This will be needed when calculating the term frequency

total_words = doc.split()
total_word_length = len(total_words)

5. Calculate the total number of sentences

This will be needed when calculating the inverse document frequency.

total_sentences = tokenize.sent_tokenize(doc)
total_sent_len = len(total_sentences)

6. Calculate TF for each word

We will start by counting the occurrences of each non-stopword, and finally divide each count by the result of step 4

tf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        if each_word in tf_score:
            tf_score[each_word] += 1
        else:
            tf_score[each_word] = 1

# Dividing each count by total_word_length
tf_score.update((x, y/int(total_word_length)) for x, y in tf_score.items())
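As a quick hand check on the sample document (no nltk needed): it has 28 whitespace-separated words, and 'Python' (after stripping periods) appears 4 times, so its TF should come out to 4/28 ≈ 0.143.

```python
doc = ("I am a graduate. I want to learn Python. I like learning Python. "
       "Python is easy. Python is interesting. Learning increases thinking. "
       "Everyone should invest time in learning")

words = [w.replace('.', '') for w in doc.split()]
print(len(doc.split()))                          # 28
print(words.count('Python'))                     # 4
print(words.count('Python') / len(doc.split()))  # ≈ 0.143
```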

7. Function to check if a word is present in a list of sentences

This method will be required when calculating the IDF.

def check_sent(word, sentences):
    """Count the number of sentences that contain the given word."""
    final = [word in x.replace('.', '').split() for x in sentences]
    return sum(final)

8. Calculate the IDF for each word.

We will use the function from step 7 to iterate over each non-stopword and store, for each one, the number of sentences it appears in, which is needed for the inverse document frequency.

idf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        # Number of sentences containing the word
        idf_score[each_word] = check_sent(each_word, total_sentences)

# Taking log(total sentences / sentence count) for each word
idf_score.update((x, math.log(int(total_sent_len)/y)) for x, y in idf_score.items())
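Hand-checking the sample document again: sent_tokenize would find 7 sentences, and 'Python' appears in 4 of them, so its IDF should be log(7/4) ≈ 0.56. Here is an nltk-free check using a naive period split as a stand-in for sent_tokenize:

```python
import math

doc = ("I am a graduate. I want to learn Python. I like learning Python. "
       "Python is easy. Python is interesting. Learning increases thinking. "
       "Everyone should invest time in learning")

# Naive sentence split on periods (stand-in for nltk's sent_tokenize)
sentences = [s.strip() for s in doc.split('.') if s.strip()]
hits = sum(1 for s in sentences if 'Python' in s)

print(len(sentences))                   # 7
print(hits)                             # 4
print(math.log(len(sentences) / hits))  # ≈ 0.56
```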


9. Calculate TF * IDF

Since the keys of both dictionaries are the same, we can iterate over one dictionary to get the keys and multiply the corresponding values from both

tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}

10. Create a function to get the top N important words in the document

def get_top_n(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key = itemgetter(1), reverse = True)[:n]) 
    return result
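As a standalone illustration of this function with made-up scores (not the values computed from the sample document), get_top_n simply sorts the dictionary by value and keeps the first n entries:

```python
from operator import itemgetter

def get_top_n(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key=itemgetter(1), reverse=True)[:n])
    return result

# Made-up scores, purely for illustration
scores = {'python': 0.12, 'learning': 0.14, 'easy': 0.10, 'graduate': 0.08}
print(get_top_n(scores, 2))  # {'learning': 0.14, 'python': 0.12}
```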

11. Get the 5 most important words

print(get_top_n(tf_idf_score, 5))


So, this is one of the ways you can build your own keyword extractor in Python! The steps above can be summarized simply as: Document -> Remove stopwords -> Find term frequency (TF) -> Find inverse document frequency (IDF) -> Compute TF * IDF -> Get the top N keywords. Share your thoughts if this post was interesting or helped you in any way. I am always open to improvements and suggestions. You can find the code on GitHub
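The whole pipeline can also be sketched as a single self-contained function. This is a minimal sketch, not the exact code from the steps above: it uses naive period/whitespace tokenization and a small hand-picked stopword set so it runs without nltk, and it lowercases everything:

```python
import math
from operator import itemgetter

def top_n_keywords(doc, n, stop_words=frozenset()):
    """Return the n highest-scoring TF-IDF words in doc."""
    # Naive splits: periods end sentences, whitespace separates words
    sentences = [s.strip().lower() for s in doc.split('.') if s.strip()]
    words = [w.strip('.').lower() for w in doc.split()]
    words = [w for w in words if w and w not in stop_words]

    tf_idf = {}
    for w in set(words):
        tf = words.count(w) / len(words)
        n_sents = sum(1 for s in sentences if w in s.split())
        tf_idf[w] = tf * math.log(len(sentences) / n_sents)
    return dict(sorted(tf_idf.items(), key=itemgetter(1), reverse=True)[:n])

sample = ("I am a graduate. I want to learn Python. I like learning Python. "
          "Python is easy. Python is interesting. Learning increases thinking. "
          "Everyone should invest time in learning")
top5 = top_n_keywords(sample, 5,
                      stop_words={'i', 'am', 'a', 'is', 'to', 'in', 'should'})
print(top5)
```

On the sample document this surfaces 'learning' and 'python' near the top, since they recur across several sentences.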
