Feature engineering in NLP | How to do feature engineering in NLP

Contents

Overview

  • Feature engineering in NLP is about understanding the context of the text.
  • In this blog, we will look at some of the common features used in NLP.
  • We will compare the results of a classification task with and without performing feature engineering.

Table of Contents

  1. Introduction
  2. NLP Task Overview
  3. List of features with code
  4. Implementation
  5. Comparison of results with and without feature engineering
  6. Conclusion

Introduction

“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” – Andrew Ng

Feature engineering is one of the most important steps in machine learning. It is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Think of a machine learning algorithm as a child learning: the more accurate the information you provide, the better it will be able to interpret that information. Focusing on our data first will give us better results than focusing only on models. Feature engineering helps us create better data, which helps the model understand it well and produce reasonable results.

NLP is a subfield of artificial intelligence in which machines understand human interaction using natural language. To understand a natural language, it is necessary to understand how we write a sentence and how we express our thoughts using different words, signs, special characters, etc. Basically, we must understand the context of a sentence to interpret its meaning.

If we can use these contexts as features and feed them into our model, then the model will be able to understand the sentence better. Some common features that we can extract from a sentence are the number of words, the number of uppercase words, the number of punctuation marks, the number of unique words, the number of stop words, the average sentence length, etc. We can define these features based on the dataset we are using. In this blog, we will use a Twitter dataset, so we can add some Twitter-specific features like the number of hashtags, the number of mentions, etc. We will discuss them in detail in the next sections.
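As a quick illustration, a few of these counts can be computed with plain Python; this is a minimal sketch on a made-up example sentence, not part of the dataset:

# A few sentence-level counts on a made-up example sentence
sentence = "COVID-19 cases are RISING fast! Stay safe, stay home. #StaySafe"
num_words = len(sentence.split())                               # total words
num_capital_words = sum(w.isupper() for w in sentence.split())  # fully uppercase words
num_unique_words = len(set(sentence.split()))                   # distinct tokens
print(num_words, num_capital_words, num_unique_words)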

NLP Task Overview

To understand the task of feature engineering in NLP, we will implement it on a Twitter dataset. We will use the COVID-19 Fake News Dataset. The task is to classify a tweet as Fake or True. The dataset is divided into train, validation, and test sets. Below is the distribution:

Split        True   Fake   Total
Train        3360   3060   6420
Validation   1120   1020   2140
Test         1120   1020   2140

Feature list

I will list a total of 15 features that we can use for the above dataset; the number of features depends entirely on the type of dataset you are using.

1. Number of characters

Count the number of characters present in a tweet.

def count_chars(text):
    return len(text)

2. Number of words

Count the number of words present in a tweet.

def count_words(text):
    return len(text.split())
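For example, on a made-up tweet:

tweet = "Masks reduce the spread of COVID-19"
print(count_chars(tweet))  # 35 characters, including spaces
print(count_words(tweet))  # 6 words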

3. Number of capital characters

Count the number of uppercase characters present in a tweet.

def count_capital_chars(text):
    count=0
    for i in text:
        if i.isupper():
            count+=1
    return count

4. Number of uppercase words

Count the number of uppercase words present in a tweet.

def count_capital_words(text):
    return sum(map(str.isupper,text.split()))
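Here str.isupper returns True only for tokens whose letters are all uppercase, and sum treats each True as 1, so the one-liner counts fully uppercase words. For example:

print(count_capital_words("BREAKING news about COVID today"))  # 2 -> "BREAKING" and "COVID"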

5. Number of punctuation marks

In this function, we return a dictionary of 32 punctuation marks with their counts; these can be used as standalone features, which I will discuss in the next section.

def count_punctuations(text):
    # the 32 ASCII punctuation characters (same as string.punctuation)
    punctuations = '''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''
    d = dict()
    for i in punctuations:
        d[str(i) + ' count'] = text.count(i)
    return d
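For example, on a made-up tweet, the returned dictionary maps each punctuation mark to its count:

counts = count_punctuations("Wow!!! Really? #stay-safe.")
print(counts['! count'], counts['? count'], counts['# count'])  # 3 1 1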

6. Number of words in quotes

Count the number of words that appear between single quotes or double quotes.

import re

def count_words_in_quotes(text):
    # find spans enclosed in single or double quotes
    x = re.findall(r"'[^']*'|\"[^\"]*\"", text)
    count = 0
    for i in x:
        t = i[1:-1]            # strip the surrounding quotes
        count += count_words(t)
    return count
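A quick usage example on a made-up sentence:

print(count_words_in_quotes('He said "stay home and stay safe" today'))  # 5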

7. Number of sentences

Count the number of sentences in a tweet.

def count_sent(text):
    return len(nltk.sent_tokenize(text))
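Note that nltk.sent_tokenize relies on NLTK's Punkt models, so a one-time download may be needed:

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer
print(count_sent("Stay home. Stay safe!"))  # 2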

8. Number of unique words

Count the number of unique words in a tweet.

def count_unique_words(text):
    return len(set(text.split()))

9. Hashtag count

Since we are using a Twitter dataset, we can count the number of times users used hashtags.

def count_htags(text):
    # match a '#' followed by word characters
    x = re.findall(r'(#\w[A-Za-z0-9]*)', text)
    return len(x)

10. Mention count

On Twitter, people often reply to or mention someone in their tweet; counting the number of mentions can also be treated as a feature.

def count_mentions(text):
    # match an '@' followed by word characters
    x = re.findall(r'(@\w[A-Za-z0-9]*)', text)
    return len(x)
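A quick check of both patterns on a made-up tweet:

tweet = "Thanks @who and @cdcgov for the update #COVID19 #StaySafe"
print(count_htags(tweet))     # 2
print(count_mentions(tweet))  # 2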

11. Stop word count

Here we will count the number of stop words used in a tweet.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def count_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    stopwords_x = [w for w in word_tokens if w in stop_words]
    return len(stopwords_x)
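As with sentence tokenization, the NLTK stopword list and tokenizer models require a one-time download:

import nltk
nltk.download('stopwords')  # one-time download of the stopword lists
nltk.download('punkt')      # required by word_tokenize
print(count_stopwords("This is a tweet about the virus"))  # 4 -> "is", "a", "about", "the"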

12. Average word length

This can be calculated by dividing the number of characters by the number of words.

df['avg_wordlength'] = df['char_count']/df['word_count']

13. Average sentence length

This can be calculated by dividing the word count by the sentence count.

df['avg_sentlength'] = df['word_count']/df['sent_count']

14. Unique words vs. word count feature

This feature is basically the ratio of the number of unique words to the total number of words.

df['unique_vs_words'] = df['unique_word_count']/df['word_count']

15. Stop word count vs. word count feature

This feature is similarly the ratio of the number of stop words to the total number of words.

df['stopwords_vs_words'] = df['stopword_count']/df['word_count']
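One caveat, added here as a sketch on top of the original recipe: these ratio features divide by counts that can be zero for empty or degenerate tweets, producing inf or NaN values. A simple guard is to clean them up afterwards:

import numpy as np

# Replace inf (x/0) and NaN (0/0) from the ratio features with 0
ratio_cols = ['avg_wordlength', 'avg_sentlength', 'unique_vs_words', 'stopwords_vs_words']
df[ratio_cols] = df[ratio_cols].replace([np.inf, -np.inf], np.nan).fillna(0)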

Implementation

You can download the dataset from here. After downloading it, we can start implementing all the features we defined above. We will focus more on feature engineering, so we will keep the rest of the approach simple, using TF-IDF and simple preprocessing. All the code is available in my GitHub repository https://github.com/ahmadkhan242/Feature-Engineering-in-NLP.

  • Reading the train, validation, and test sets with pandas.

    import pandas as pd

    train = pd.read_csv("train.csv")
    val = pd.read_csv("validation.csv")
    test = pd.read_csv("testWithLabel.csv")

    # For this task we will combine the train and validation datasets and then
    # use a simple train test split from sklearn.
    # reset_index so the later index-based merges align row by row
    df = pd.concat([train, val]).reset_index(drop=True)
    df.head()
[Image: first five rows of the combined DataFrame]
  • Applying the previously defined feature extraction functions on the train and test sets.

    df['char_count'] = df["tweet"].apply(lambda x:count_chars(x))
    df['word_count'] = df["tweet"].apply(lambda x:count_words(x))
    df['sent_count'] = df["tweet"].apply(lambda x:count_sent(x))
    df['capital_char_count'] = df["tweet"].apply(lambda x:count_capital_chars(x))
    df['capital_word_count'] = df["tweet"].apply(lambda x:count_capital_words(x))
    df['quoted_word_count'] = df["tweet"].apply(lambda x:count_words_in_quotes(x))
    df['stopword_count'] = df["tweet"].apply(lambda x:count_stopwords(x))
    df['unique_word_count'] = df["tweet"].apply(lambda x:count_unique_words(x))
    df['htag_count'] = df["tweet"].apply(lambda x:count_htags(x))
    df['mention_count'] = df["tweet"].apply(lambda x:count_mentions(x))
    df['punct_count'] = df["tweet"].apply(lambda x:count_punctuations(x))
    df['avg_wordlength'] = df['char_count']/df['word_count']
    df['avg_sentlength'] = df['word_count']/df['sent_count']
    df['unique_vs_words'] = df['unique_word_count']/df['word_count']
    df['stopwords_vs_words'] = df['stopword_count']/df['word_count']
    # SIMILARLY YOU CAN APPLY THEM ON TEST SET
  • Adding some additional features using the punctuation counts

    We will create a DataFrame from the dictionaries returned by the count_punctuations function (stored in the punct_count column) and then merge it with the main DataFrame.

    df_punct = pd.DataFrame(list(df.punct_count))
    test_punct = pd.DataFrame(list(test.punct_count))

    # Merging punctuation DataFrame with main DataFrame
    df = pd.merge(df, df_punct, left_index=True, right_index=True)
    test = pd.merge(test, test_punct, left_index=True, right_index=True)
    # We can drop the "punct_count" column from both df and test DataFrames
    df.drop(columns=['punct_count'], inplace=True)
    test.drop(columns=['punct_count'], inplace=True)
    df.columns
[Image: final column list]

  • Preprocessing

    We perform simple preprocessing steps, such as removing links, usernames, numbers, and double spaces, stripping punctuation, and lowercasing the text.

    import re

    def remove_links(tweet):
        '''Takes a string and removes web links from it'''
        tweet = re.sub(r'http\S+', '', tweet)    # remove http links
        tweet = re.sub(r'bit.ly/\S+', '', tweet) # remove bitly links
        tweet = tweet.strip('[link]')            # remove [links]
        return tweet
    def remove_users(tweet):
        '''Takes a string and removes retweet and @user information'''
        tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove retweet
        tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)     # remove tweeted at
        return tweet
    my_punctuation = '!"$%&\'()*+,-./:;<=>?[]^_`{|}~•@'
    def preprocess(sent):
        sent = remove_users(sent)
        sent = remove_links(sent)
        sent = sent.lower() # lower case
        # re.escape keeps characters like ] and ^ from breaking the character class
        sent = re.sub('[' + re.escape(my_punctuation) + ']+', ' ', sent) # strip punctuation
        sent = re.sub(r'\s+', ' ', sent)    # remove double spacing
        sent = re.sub('([0-9]+)', '', sent) # remove numbers
        sent_token_list = [word for word in sent.split(' ')]
        sent = " ".join(sent_token_list)
        return sent
    df['tweet']   = df['tweet'].apply(lambda x: preprocess(x))
    test['tweet'] = test['tweet'].apply(lambda x: preprocess(x))
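    # A quick sanity check (a sketch on a made-up tweet, not from the original
    # dataset) to see what the cleaning pipeline produces:
    sample = "RT @user123: COVID-19 update!!! See https://example.com #StaySafe"
    print(preprocess(sample))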
  • Text encoding

    We will encode our text data using TF-IDF. We fit and transform it on the train tweet column, transform the test set with the same vectorizer, and then merge the result with all the feature columns.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer            = TfidfVectorizer()
    train_tf_idf_features = vectorizer.fit_transform(df['tweet']).toarray()
    test_tf_idf_features  = vectorizer.transform(test['tweet']).toarray()

    # Converting the above arrays to DataFrames
    train_tf_idf          = pd.DataFrame(train_tf_idf_features)
    test_tf_idf           = pd.DataFrame(test_tf_idf_features)

    # Separating train and test labels from all features
    train_Y               = df['label']
    test_Y                = test['label']

    # Listing all features
    features = ['char_count', 'word_count', 'sent_count',
           'capital_char_count', 'capital_word_count', 'quoted_word_count',
           'stopword_count', 'unique_word_count', 'htag_count', 'mention_count',
           'avg_wordlength', 'avg_sentlength', 'unique_vs_words',
           'stopwords_vs_words', '! count', '" count', '# count', '$ count',
           '% count', '& count', "' count", '( count', ') count', '* count',
           '+ count', ', count', '- count', '. count', '/ count', ': count',
           '; count', '< count', '= count', '> count', '? count', '@ count',
           '[ count', '\\ count', '] count', '^ count', '_ count', '` count',
           '{ count', '| count', '} count', '~ count']

    # Finally merging all features with the above TF-IDF features.
    train = pd.merge(train_tf_idf, df[features], left_index=True, right_index=True)
    test  = pd.merge(test_tf_idf, test[features], left_index=True, right_index=True)
  • Training

    For training, we will use the random forest algorithm from the scikit-learn library.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    X_train, X_test, y_train, y_test = train_test_split(train, train_Y, test_size=0.2, random_state=42)
    # Random Forest Classifier
    clf_model = RandomForestClassifier(n_estimators=1000, min_samples_split=15, random_state=42)
    clf_model.fit(X_train, y_train)
    _RandomForestClassifier_prediction = clf_model.predict(X_test)
    val_RandomForestClassifier_prediction = clf_model.predict(test)
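    To reproduce the accuracy and F1 numbers discussed below, a minimal evaluation sketch (assuming the variables above and standard scikit-learn metrics) looks like this:

    from sklearn.metrics import accuracy_score, f1_score, classification_report

    # Performance on the held-out split of the training data
    print(accuracy_score(y_test, _RandomForestClassifier_prediction))
    print(f1_score(y_test, _RandomForestClassifier_prediction, average='weighted'))
    # Performance on the separate test set
    print(classification_report(test_Y, val_RandomForestClassifier_prediction))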

Results comparison

For comparison, we first train our model on the above dataset using the feature engineering techniques and then without them. In both approaches the dataset is preprocessed using the same method described above, and TF-IDF is used in both to encode the text data. You can use any encoding technique you like, such as word2vec, GloVe, etc.
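For the baseline without feature engineering, only the TF-IDF matrix is fed to the same classifier. Below is a minimal sketch, assuming the same variable names and settings as in the implementation above:

# Baseline: TF-IDF features only, same classifier and split settings as above
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    train_tf_idf, train_Y, test_size=0.2, random_state=42)
baseline_model = RandomForestClassifier(n_estimators=1000, min_samples_split=15, random_state=42)
baseline_model.fit(Xb_train, yb_train)
print(baseline_model.score(Xb_test, yb_test))  # held-out accuracy without hand-crafted features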

1. Without feature engineering techniques

[Image: classification report without feature engineering]
Here, the validation accuracy is the accuracy on the test set.

2. With feature engineering techniques

[Image: classification report with feature engineering]
Here, the validation accuracy is the accuracy on the test set.

From the above results, we can see that the feature engineering techniques helped us increase our F1 score from 0.90 to 0.92 on the train split and from 0.90 to 0.94 on the test set.

Conclusion

The above results show that if we perform feature engineering, we can achieve higher accuracy using classical machine learning algorithms. Transformer-based models are time-consuming and resource-intensive; if we do feature engineering the right way, namely after analyzing our dataset, we can achieve comparable results.

We can also engineer other features, such as the number of emojis used, the types of emojis used, the frequencies of unique words, etc. We can define our own features by analyzing the dataset. I hope you have learned something from this blog; if so, share it with others. Check out my personal machine learning blog (https://code-ml.com/) for new and exciting content across different domains of ML and AI.

About the Author

Mohammad Ahmad (B.Tech)
LinkedIn - https://www.linkedin.com/in/mohammad-ahmad-ai/
Personal Blog - https://code-ml.com/
GitHub - https://github.com/ahmadkhan242
Twitter - https://twitter.com/ahmadkhan_242

The media shown in this article are not the property of DataPeaker and are used at the author's discretion.
