This post was released as part of the Data Science Blogathon.
Introduction
Unstructured data contains a large amount of information. Like energy, when it is harnessed it creates high value for your stakeholders. Several companies are already working hard in this area. There is no doubt that unstructured data is noisy, and significant work must be done to clean it, analyze it, and make it meaningful. This post talks about an area that helps analyze large amounts of data by summarizing content and identifying topics of interest: keyword extraction.
Keyword extraction overview
Keyword extraction is a text analysis technique that lets us obtain important knowledge about a subject in a short period of time. It helps condense the text and surface relevant keywords, saving the time of reviewing the entire document. Example use cases are finding topics of interest in a news post and identifying issues based on customer feedback. One of the techniques used for keyword extraction is TF-IDF (term frequency – inverse document frequency).
TF-IDF overview
Term frequency (TF) – how often a term appears in a text. It is measured as (number of times a term t appears in the document) / (total number of words in the document).
Inverse document frequency (IDF) – how relevant a word is across the document. Here, sentences play the role of documents, so it is measured as log(total number of sentences / number of sentences containing term t).
TF-IDF – the relevance of a word is measured by this score. It is computed as TF * IDF.
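As a quick sanity check of the formulas above, here is a tiny worked example on a toy set of sentences (not the sample document used later in this post):

```python
import math

# Toy corpus: three sentences, with sentences acting as the "documents"
sentences = ["python is fun", "python is easy", "learning is fun"]
words = " ".join(sentences).split()

# TF for "python": occurrences / total number of words
tf = words.count("python") / len(words)          # 2 / 9

# IDF for "python": log(total sentences / sentences containing the term)
containing = sum(1 for s in sentences if "python" in s.split())
idf = math.log(len(sentences) / containing)      # log(3 / 2)

print(round(tf * idf, 4))                        # 0.0901
```

A common word like "is" appears in every sentence, so its IDF is log(3/3) = 0, which is exactly how TF-IDF pushes down uninformative words.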
We will use the same concept and code it line by line in Python. We will take a small text document and walk through all the above steps. Although more advanced keyword extraction techniques already exist, this post aims to explain the basic concept behind identifying the relevance of words. Let's get started!
Implementation
1. Import packages
We need tokenize to split the text into sentences, itemgetter to sort the dictionary, and math to perform the log (base e) operation.
from nltk import tokenize
from operator import itemgetter
import math
2. Declare variables
We will declare a string variable. It will be a placeholder for the sample text document.
doc="I am a graduate. I want to learn Python. I like learning Python. Python is easy. Python is interesting. Learning increases thinking. Everyone should invest time in learning"
3. Remove stopwords
Stopwords are words that appear often and may not be relevant to our analysis. We can remove them using the nltk library.
import nltk
nltk.download('stopwords')  # required once for the stopword list
nltk.download('punkt')      # required once for the tokenizers used below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
4. Find the total words in the document.
This will be needed when calculating the term frequency.
total_words = doc.split()
total_word_length = len(total_words)
print(total_word_length)
5. Calculate the total number of sentences
This will be needed when calculating the inverse document frequency.
total_sentences = tokenize.sent_tokenize(doc)
total_sent_len = len(total_sentences)
print(total_sent_len)
6. Calculate TF for each word
We will start by counting the occurrences of each non-stopword, and finally divide each count by the result of step 4.
tf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.', '')
    if each_word not in stop_words:
        if each_word in tf_score:
            tf_score[each_word] += 1
        else:
            tf_score[each_word] = 1

# Dividing by total_word_length for each dictionary element
tf_score.update((x, y / int(total_word_length)) for x, y in tf_score.items())
print(tf_score)
7. Function to count the sentences containing a word
This method will be required when calculating the IDF.
def check_sent(word, sentences):
    # Returns the number of sentences that contain the given word
    return sum(1 for sentence in sentences if word in word_tokenize(sentence))
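To sanity-check the idea behind this helper, here is a standalone, dependency-free sketch. Note that `count_sentences_with` is a simplified stand-in for illustration only: it strips periods and splits on whitespace instead of using NLTK's word_tokenize.

```python
def count_sentences_with(word, sentences):
    # Simplified stand-in for check_sent: strip periods and split on
    # whitespace instead of using NLTK's word_tokenize
    return sum(1 for s in sentences if word in s.replace('.', '').split())

sents = ["I want to learn Python.", "Python is easy.", "Learning increases thinking."]
print(count_sentences_with("Python", sents))  # 2
```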
8. Calculate the IDF for each word.
We will use the function from step 7 to iterate over each word and store the result as the inverse document frequency.
idf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.', '')
    if each_word not in stop_words:
        # Number of sentences in which the word appears
        idf_score[each_word] = check_sent(each_word, total_sentences)

# Performing the log: IDF = log(total sentences / sentences containing the word)
idf_score.update((x, math.log(int(total_sent_len) / y)) for x, y in idf_score.items())
print(idf_score)
9. Calculate TF * IDF
Since both dictionaries have the same keys, we can iterate over one of them to get the keys and multiply the corresponding values.
tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}
print(tf_idf_score)
10. Create a function to get the N most important words in the document
def get_top_n(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key=itemgetter(1), reverse=True)[:n])
    return result
11. Get the 5 most important words
print(get_top_n(tf_idf_score, 5))
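Putting it all together, the whole pipeline can be condensed into one small function. This is a simplified sketch: it uses naive period-based sentence splitting, whitespace tokenization, and lowercasing instead of NLTK, so the scores can differ slightly from the step-by-step version above.

```python
import math
from collections import Counter

def top_keywords(text, n=5, stop_words=frozenset()):
    # Sentences: naive split on '.'; words: whitespace split, lowercased
    sentences = [s.split() for s in text.lower().split('.') if s.strip()]
    words = [w for sent in sentences for w in sent if w not in stop_words]
    tf = Counter(words)
    total = len(words)
    scores = {}
    for word, count in tf.items():
        # Number of sentences containing the word (never zero here,
        # since every counted word came from some sentence)
        containing = sum(1 for sent in sentences if word in sent)
        scores[word] = (count / total) * math.log(len(sentences) / containing)
    return sorted(scores, key=scores.get, reverse=True)[:n]

doc = ("I am a graduate. I want to learn Python. I like learning Python. "
       "Python is easy. Python is interesting. Learning increases thinking. "
       "Everyone should invest time in learning")
stops = {"i", "am", "a", "to", "is", "in", "should", "want", "like"}
print(top_keywords(doc, 5, stops))  # 'learning' and 'python' rank first
```

Because "learning" appears in fewer sentences than "python" while still occurring often, its higher IDF lifts it to the top spot despite the lower raw count.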
Conclusion
And that's it: this is one of the ways you can build your own keyword extractor in Python! The above steps can be summarized simply as: document -> remove stopwords -> find term frequency (TF) -> find inverse document frequency (IDF) -> compute TF * IDF -> get the top N keywords. Share your thoughts if this post was interesting or helped you in any way. I am always open to improvements and suggestions. You can find the code on GitHub.