Text classification in natural language processing

This article was published as part of the Data Science Blogathon.

Introduction

Artificial intelligence has improved enormously without requiring changes to the underlying hardware infrastructure: users can run an artificial intelligence program even on an old computer system. The benefits of machine learning are also wide-ranging. Natural language processing (NLP) is the branch of artificial intelligence that gives machines the ability to read, understand and derive meaning from human language. NLP has been very successful in healthcare, media, finance and human resources.

The most common forms of unstructured data are text and speech. Such data is abundant, but extracting useful information from it is difficult, and doing so by hand would take a long time. Written text and speech contain valuable information because we, as intelligent beings, use writing and speech as our main forms of communication. NLP can analyze this data for us and perform tasks such as sentiment analysis, cognitive assistance, spam filtering, fake news identification and real-time language translation.

This article covers how NLP understands text or parts of speech, focusing mainly on word and sequence analysis. It includes text classification, vector semantics and word embedding, probabilistic language models, sequence tagging and speech recognition. We will then look at a sentiment analysis of fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the IMDB site by a user is positive or negative.

Table of contents

  • Understanding what NLP is
  • What is NLP used for?
  • Words and sequences
    • Text classification
    • Vector semantics and word embedding
    • Probabilistic language models
    • Sequence tagging
  • Parsing
  • Semantics
  • Performing semantic analysis on the IMDB movie review data project

NLP is widely used in cars, smartphones, speakers, computers, websites and more. Google Translate, for example, is a machine translation service built on NLP: it takes written or spoken natural language and renders it in the language the user wants. NLP helps Google Translate understand words in context, remove extraneous noise and build CNNs to understand native speech.

NLP is also popular in chatbots. Chatbots are very useful because they reduce the human work of asking customers what they need. NLP chatbots ask sequential questions, such as what the user's problem is and where to find the solution. Apple and Amazon have robust chatbots in their systems. When the user asks a question, the chatbot converts it into phrases that the internal system can understand, called tokens.

The tokens then go to the NLP component, which works out what the user is asking. NLP is also used in information retrieval (IR). An IR system is software that deals with the storage and evaluation of information in large collections of text documents and retrieves only the relevant information. For instance, it is used in Google voice detection to trim unnecessary words.

NLP applications

  • Machine translation, e.g., Google Translate
  • Information retrieval
  • Question answering, e.g., chatbots
  • Summarization
  • Sentiment analysis
  • Social media analytics
  • Big data mining

Words and sequences

An NLP system needs to understand text, symbols and semantics correctly. Many methods help it do so: text classification, vector semantics, word embedding, probabilistic language models, sequence tagging and speech recognition.

  1. Text classification

    Text classification is the process of sorting text into organized groups. Using NLP, a text classifier can automatically analyze text and then assign a set of predefined tags or categories based on its context. It is used for sentiment analysis, topic detection and language detection. There are three main approaches to text classification:

    • Rule-based system
    • Machine-based system
    • Hybrid system

    In the rule-based approach, texts are separated into organized groups using a set of handcrafted linguistic rules. These rules require users to define a list of words that characterize each group. For instance, names like Donald Trump and Boris Johnson would be classified under politics, while names like LeBron James and Ronaldo would be classified under sports.
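As a minimal sketch of the rule-based idea, the snippet below counts keyword hits per category and returns the best-scoring one. The `RULES` table, the function name `classify_by_rules` and the `"unknown"` fallback are illustrative assumptions, not part of the original project.

```python
# Hypothetical keyword lists; a real system would use far larger, curated rules.
RULES = {
    "politics": {"donald trump", "boris johnson"},
    "sports": {"lebron james", "ronaldo"},
}


def classify_by_rules(text, rules=RULES, default="unknown"):
    """Return the category whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in rules.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all.
    return best if scores[best] > 0 else default


label = classify_by_rules("Ronaldo scored twice last night")
```

The weakness of this approach is visible in the code: any text that mentions none of the listed names falls through to the default label, so the rules have to be extended by hand.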

    A machine-based classifier learns to classify from past observations in a dataset. The data is labeled in advance and split into training and test sets. The classifier picks up the classification strategy from the previous inputs and keeps learning. A machine-based classifier uses a bag of words for feature extraction.

    In a bag of words, a vector represents the frequency of each word from a predefined dictionary of words. The classification itself can then be performed with machine learning algorithms such as Naïve Bayes, SVM and deep learning.
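To make the bag-of-words idea concrete, here is a small sketch that turns text into word-count vectors and feeds them to a hand-written multinomial Naïve Bayes classifier with add-one smoothing. The class `TinyNaiveBayes`, the helper `bag_of_words` and the four training sentences are made up for illustration; the original project may well use a library implementation instead.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_features = len(vectors[0])
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            totals = [sum(column) for column in zip(*rows)]
            denominator = sum(totals) + n_features
            self.log_likelihood[c] = [
                math.log((t + 1) / denominator) for t in totals
            ]
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                x * w for x, w in zip(vector, self.log_likelihood[c])
            )
        return max(self.classes, key=score)


# Toy, made-up training data for illustration only.
train_texts = [
    "great movie loved it",
    "wonderful great acting",
    "terrible movie hated it",
    "boring terrible plot",
]
train_labels = ["pos", "pos", "neg", "neg"]
vocabulary = sorted({word for text in train_texts for word in text.split()})

model = TinyNaiveBayes().fit(
    [bag_of_words(text, vocabulary) for text in train_texts], train_labels
)
prediction = model.predict(bag_of_words("great acting", vocabulary))
```

Word order is discarded entirely: only the counts reach the classifier, which is what the name "bag" refers to.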

    [Figure: machine-based system]

    The third approach to text classification is the hybrid approach, which combines the rule-based and machine-based approaches. It uses the rule-based system to create tags and machine learning to train the system and create rules. The machine-generated rule list is then compared with the handcrafted rule list. If the labels do not match, humans improve the list manually. In practice, this is often the most effective way to implement text classification.
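The comparison step of the hybrid approach can be sketched as a simple check for disagreements between the two label sources. The function name `find_label_conflicts` and the example labels are hypothetical; in a real pipeline the two label lists would come from the rule-based and the machine-based systems.

```python
def find_label_conflicts(texts, rule_labels, model_labels):
    """Return the items whose rule-based and machine-based labels differ."""
    return [
        {"text": text, "rule": rule, "model": model}
        for text, rule, model in zip(texts, rule_labels, model_labels)
        if rule != model
    ]


# Made-up example labels; in practice they would come from the two systems.
texts = [
    "Ronaldo scores again",
    "Election results announced",
    "New stadium tax vote",
]
rule_labels = ["sports", "politics", "sports"]
model_labels = ["sports", "politics", "politics"]

# Only the conflicting items are handed to a human for manual review.
needs_review = find_label_conflicts(texts, rule_labels, model_labels)
```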

  2. Vector semantics

    Vector semantics is another form of word and sequence analysis. It defines semantics and interprets the meaning of words to explain features such as similar words and opposite words. The main idea behind vector semantics is that two words are alike if they are used in similar contexts. Vector semantics places words in a multidimensional vector space. It is useful in sentiment analysis.
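The "similar context" idea can be illustrated with co-occurrence counts and cosine similarity. This is only a sketch under simple assumptions: the three-sentence corpus is invented, the context window of two words is an arbitrary choice, and the helper names are not from the original project.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up corpus: "coffee" and "tea" share contexts, "car" does not.
corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the car needs new tires",
]
vectors = cooccurrence_vectors(corpus)
sim_coffee_tea = cosine(vectors["coffee"], vectors["tea"])
sim_coffee_car = cosine(vectors["coffee"], vectors["car"])
```

Because "coffee" and "tea" appear between the same neighbouring words, their vectors point in the same direction, while "coffee" and "car" share no context words at all.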

  3. Word embedding

    Word embedding is another method of word and sequence analysis. Embedding translates sparse vectors into a low-dimensional space that preserves semantic relationships. Word embedding is a type of word representation that allows words with similar meanings to have similar representations. There are two types of word embeddings:

    Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.

    Doc2Vec is similar to Word2Vec, but it analyzes a group of text, such as a page.

  4. Probabilistic language model

    Another approach to word and sequence analysis is the probabilistic language model. The goal of a probabilistic language model is to calculate the probability of a sentence as a sequence of words. For instance, a model might estimate that the word “a” appears after a given preceding word with a probability of 0.00013131 percent.
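One simple way to assign a probability to a sentence is a bigram model, which multiplies the probability of each word given the word before it. The sketch below is an unsmoothed version trained on a three-sentence, made-up corpus; the `<s>` and `</s>` boundary markers and the function names are assumptions chosen for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token that can start a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """P(sentence) as a product of P(word | previous word), unsmoothed."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        if unigrams[previous] == 0:
            return 0.0
        probability *= bigrams[(previous, word)] / unigrams[previous]
    return probability


# Made-up training corpus for illustration only.
corpus = [
    "the movie was great",
    "the movie was boring",
    "the plot was great",
]
unigrams, bigrams = train_bigram_model(corpus)
p_seen = sentence_probability("the movie was great", unigrams, bigrams)
p_scrambled = sentence_probability("great the was movie", unigrams, bigrams)
```

A sentence that follows the word order seen in training gets a non-zero probability, while the scrambled version gets zero because its first bigram never occurred. Real models add smoothing so that unseen word pairs are unlikely rather than impossible.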

  5. Sequence tagging

    Sequence tagging is a typical NLP task that assigns a class or tag to each token in a given input sequence. Suppose someone says “play the Tom Hanks movie”. In sequence tagging, the labels are [play, movie, Tom Hanks]. “Play” determines an action, “movie” is what the action applies to, and “Tom Hanks” is the search entity. The system splits the input into multiple tokens and uses an LSTM to analyze them. There are two forms of sequence tagging: token tagging and span tagging.
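The difference between token tagging and span tagging can be shown with a toy tagger. Note the swap: instead of an LSTM, this sketch uses hard-coded lookup tables (`ACTIONS`, `MEDIA_TYPES`, `KNOWN_ENTITIES`, all hypothetical) so that it stays self-contained; only the input and output shapes resemble what a trained tagger would produce.

```python
# Hypothetical lookup tables standing in for a trained tagger such as an LSTM.
ACTIONS = {"play", "pause", "stop"}
MEDIA_TYPES = {"movie", "song", "album"}
KNOWN_ENTITIES = {("tom", "hanks")}


def tag_tokens(sentence):
    """Token tagging: assign one tag to every token (B-/I- marks entities)."""
    tokens = sentence.lower().split()
    tags = ["O"] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ACTIONS:
            tags[i] = "ACTION"
        elif token in MEDIA_TYPES:
            tags[i] = "MEDIA"
    for entity in KNOWN_ENTITIES:
        n = len(entity)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == entity:
                tags[i] = "B-ENTITY"
                tags[i + 1:i + n] = ["I-ENTITY"] * (n - 1)
    return list(zip(tokens, tags))


def tags_to_spans(tagged):
    """Span tagging: merge B-/I- token tags into labelled multi-token spans."""
    spans, current = [], []
    for token, tag in tagged:
        if tag == "I-ENTITY" and current:
            current.append(token)
            continue
        if current:
            spans.append((" ".join(current), "ENTITY"))
            current = []
        if tag == "B-ENTITY":
            current = [token]
        elif tag != "O":
            spans.append((token, tag))
    if current:
        spans.append((" ".join(current), "ENTITY"))
    return spans


tagged = tag_tokens("play the tom hanks movie")
spans = tags_to_spans(tagged)
```

Token tagging keeps one label per word, so "tom" and "hanks" are tagged separately; span tagging merges them into the single entity "tom hanks".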

    Parsing is a phase of NLP in which the parser determines the syntactic structure of a text by analyzing its constituent words based on an underlying grammar. For instance, “Tom ate an apple” is divided into the proper noun “Tom”, the verb “ate”, the determiner “an” and the noun “apple”. A well-known example of a system that relies on parsing is Amazon Alexa.
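The "Tom ate an apple" example can be sketched with a toy lexicon and a single grammar rule. Everything here is an assumption made for illustration: the `LEXICON` table, the tag names and the one sentence pattern the function accepts. Real parsers rely on much richer grammars or learned models.

```python
# Hypothetical toy lexicon; real parsers learn or encode far richer grammars.
LEXICON = {"tom": "PROPN", "ate": "VERB", "an": "DET", "apple": "NOUN"}


def pos_tag(sentence):
    """Look up a part-of-speech tag for every word, 'UNK' if unknown."""
    return [(word, LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]


def parse_simple_sentence(sentence):
    """Parse a 'PROPN VERB DET NOUN' sentence into a nested (label, children) tree."""
    tagged = pos_tag(sentence)
    tags = [tag for _, tag in tagged]
    if tags != ["PROPN", "VERB", "DET", "NOUN"]:
        raise ValueError(f"sentence does not match the toy grammar: {tags}")
    subject, verb, determiner, noun = tagged
    # S -> NP VP, VP -> VERB NP, NP -> PROPN | DET NOUN
    return ("S", [("NP", [subject]), ("VP", [verb, ("NP", [determiner, noun])])])


tagged_words = pos_tag("Tom ate an apple")
tree = parse_simple_sentence("Tom ate an apple")
```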

    We have discussed how text is classified and how words and sequences are broken down so that an algorithm can understand and categorize them. In this project, we explore a sentiment analysis of fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the IMDB site by a user is positive or negative.

    This project covers text mining techniques such as text embedding, bag of words and word context. We will also introduce a bidirectional LSTM sentiment classifier and look at how to import a labeled dataset from TensorFlow automatically. The project also covers steps such as data cleaning, text processing, balancing the data through sampling, and training and testing a deep learning model to classify text.

Parsing

The parser determines the syntactic structure of a text by analyzing its constituent words based on an underlying grammar. It divides a group of words into its component parts and separates the words.

For more details about parsing, see this article.

Semantics

Text is at the heart of how we communicate. What is really difficult is understanding what is being said in a written or spoken conversation. Understanding books and long articles is even harder. Semantics is a process that seeks to understand linguistic meaning by constructing a model of the principle that the speaker uses to convey meaning. It has been used in customer feedback analysis, article analysis, fake news detection, semantic analysis and more.

Sample application

Here is the code example:

Importing the necessary libraries

# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing required libraries
# (os, numpy and pandas are already imported above)
import matplotlib.pyplot as plt
import nltk
import scipy
import seaborn as sns
sns.set()

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Downloading the necessary file

# This cell takes time; run it only once.
# Split the training set 60%/40%, so we end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
original_train_data, original_validation_data, original_test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Getting the word index from the Keras datasets

# Load the word-to-index mapping from the Keras IMDB dataset
word_index = tf.keras.datasets.imdb.get_word_index(
    path="imdb_word_index.json"
)

In [8]:

{k:v for (k,v) in word_index.items() if v < 20}

Out[8]:

{'with': 16,  'i': 10,  'as': 14,  'it': 9,  'is': 6,  'in': 8,  'but': 18,  'of': 4,  'this': 11,  'a': 3,  'for': 15,  'br': 7,  'the': 1,  'was': 13,  'and': 2,  'to': 5,  'film': 19,  'movie': 17,  'that': 12}
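A word index like the one above is typically used to turn a review into a list of integers before it reaches the model. The sketch below copies the nineteen entries from the output into `word_index_sample` and adds two hypothetical helpers, `encode` and `decode`. Using 0 for unknown words is an assumption made for this example; Keras's own `imdb.load_data` additionally shifts the indices by an offset (`index_from`) to reserve special ids, which this sketch ignores.

```python
# Entries copied from the output above; the real mapping is much larger.
word_index_sample = {
    "the": 1, "and": 2, "a": 3, "of": 4, "to": 5, "is": 6, "br": 7,
    "in": 8, "it": 9, "i": 10, "this": 11, "that": 12, "was": 13,
    "as": 14, "for": 15, "with": 16, "movie": 17, "but": 18, "film": 19,
}

UNKNOWN_ID = 0  # assumption: 0 marks words missing from the mapping


def encode(text, mapping, unknown_id=UNKNOWN_ID):
    """Replace every word with its integer index."""
    return [mapping.get(word, unknown_id) for word in text.lower().split()]


def decode(ids, mapping, unknown_token="<unk>"):
    """Map integer indices back to words."""
    reverse = {index: word for word, index in mapping.items()}
    return " ".join(reverse.get(i, unknown_token) for i in ids)


encoded = encode("This movie was great", word_index_sample)
decoded = decode(encoded, word_index_sample)
```

Lower indices correspond to more frequent words, which is why common words such as "the" and "and" come first in the mapping.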

Comparison of positive and negative reviews


Creating the train and test data


Model and summary of the model


Split data and tune the model


Overview of model performance


Confusion matrix and classification report

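For readers unfamiliar with these metrics, the following sketch computes the four confusion-matrix cells and the precision, recall and F1 score for a binary sentiment task by hand. The labels are made up and the helper names are hypothetical; the project itself imports `confusion_matrix` and `classification_report` from scikit-learn, which report the same quantities.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_f1(counts):
    """Derive precision, recall and F1 from the confusion counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(counts)
```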

Note: The data source and data for this model are publicly available and can be accessed using TensorFlow.

To get the full code and details, follow this GitHub repository.

In conclusion, NLP is a field full of opportunities. It has a tremendous effect on how we analyze text and speech, and it keeps getting better. Extracting knowledge from large datasets was far harder only a few years ago; the rise of NLP techniques has made it feasible and much easier. There are still many opportunities to discover in NLP.
