Regular expressions in Python: a beginner's guide

This post was released as part of the Data Science Blogathon

Regular expressions, also known as “regex” or “regexp”, are used to match strings of text such as particular characters, words, or patterns of characters. This means we can match and extract any string pattern from a text with the help of regular expressions. I have used two terms here, match and extract, and they have slightly different meanings. There may be cases where we want to match a specific pattern but extract only a subset of it. For example, we may want to extract the names of PhD holders from a list of names of people in an organization.

In this case, we match the keyword “Dr. XYZ” and extract only the name, that is, “XYZ”, without the prefix “Dr.”. Regex is very useful for searching through long texts, emails, and documents. It is sometimes called a “programming language for string matching”. Before diving into regex and its implementation in Python, it helps to know where it is applied in the real world.
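The difference between matching and extracting can be sketched in a few lines of Python. The list of names below is invented for illustration, and the pattern is only one possible way to write it: the whole “Dr. XYZ” text is matched, but only the part inside the parentheses is extracted.

```python
import re

# Hypothetical list of names; only some carry the "Dr." prefix.
names = ["Dr. Meera Nair", "Rahul Verma", "Dr. Arjun Rao", "Sneha Kapoor"]

# Match the whole "Dr. <name>" text, but capture only the name in group 1.
pattern = re.compile(r"^Dr\.?\s+(.+)$")

phd_names = []
for name in names:
    m = pattern.match(name)
    if m:                      # m is None when the name has no "Dr." prefix
        phd_names.append(m.group(1))

print(phd_names)  # ['Meera Nair', 'Arjun Rao']
```

The symbols used in this pattern are explained in the wildcard section further down.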

Applications

Form validation

The most common use of regular expressions is form validation, that is, validating emails, passwords, phone numbers, and many other form fields.
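As a rough sketch of form validation, the helpers below check an email address and a phone number. Both patterns are deliberately simplified assumptions (real-world email and phone formats are far more varied), and the function names are made up for this example.

```python
import re

# Simplified, illustrative patterns (assumptions, not production-grade validators).
EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\d{10}")  # assumes a plain 10-digit phone number


def is_valid_email(value):
    # fullmatch succeeds only if the entire value fits the pattern
    return EMAIL_RE.fullmatch(value) is not None


def is_valid_phone(value):
    return PHONE_RE.fullmatch(value) is not None


print(is_valid_email("user@example.com"))  # True
print(is_valid_email("not-an-email"))      # False
print(is_valid_phone("9876543210"))        # True
print(is_valid_phone("98765"))             # False
```

Using fullmatch rather than match means trailing junk after an otherwise valid value is rejected as well.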

Bank account details

You may have noticed that every bank has an IFSC code for each of its branches, and that the code begins with letters identifying the bank. A credit card number typically consists of 16 digits, and the first digits indicate whether the card is Mastercard, Visa, or RuPay. Regex is used in all of these cases.
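A format check of this kind might look like the sketch below. The formats are assumptions stated in the comments, the sample values are invented, and the check only looks at the shape of the text; it says nothing about whether a code or card actually exists.

```python
import re

# Assumed formats, for illustration only:
#  - IFSC: 4 capital letters (bank code), a literal 0, then 6 alphanumeric characters
#  - card number: exactly 16 digits, optionally grouped in fours by single spaces
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
CARD_RE = re.compile(r"\d{4}( ?\d{4}){3}")


def looks_like_ifsc(code):
    return IFSC_RE.fullmatch(code) is not None


def looks_like_card_number(number):
    return CARD_RE.fullmatch(number) is not None


print(looks_like_ifsc("ABCD0123456"))                 # True  (made-up code)
print(looks_like_ifsc("ABC0123456"))                  # False (only 3 letters)
print(looks_like_card_number("1234 5678 9012 3456"))  # True  (made-up number)
print(looks_like_card_number("1234 5678 9012"))       # False (only 12 digits)
```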

Data processing

How can we forget the relevance of regex in data mining? When data is available in unstructured form, that is, as text, it has to be converted to numbers before a model can be trained on it. Regular expressions therefore play an important role in analyzing data, finding patterns in it, and finally performing operations on the dataset.

NLP

NLP is the process by which a computer understands and generates human language. In NLP, regular expressions are used to remove unnecessary words, that is, stop words, from text, which helps to clean the data. Regex is also used to parse text and thus helps prepare it for the algorithm that processes the data.
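Stop-word removal can be sketched with re.sub. The stop-word list here is a tiny, hypothetical one chosen for the example; NLP libraries ship much longer lists.

```python
import re

# A tiny, hypothetical stop-word list.
stop_words = ["is", "the", "of", "in", "a", "an"]

# \b marks a word boundary, so "in" is removed but "Indian" is left alone.
stop_word_re = re.compile(r"\b(?:" + "|".join(stop_words) + r")\b", flags=re.IGNORECASE)

text = "Virat Kohli is one of the greatest players in the Indian cricket team"
without_stop_words = stop_word_re.sub("", text)
cleaned = re.sub(r"\s+", " ", without_stop_words).strip()  # collapse leftover spaces

print(cleaned)  # Virat Kohli one greatest players Indian cricket team
```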

Social media platforms

Platforms such as Google, Facebook, and Twitter provide various search techniques that are different from, and more efficient than, a normal search. Believe me, if you know these techniques, you can explore much more. All of these techniques use regular expressions in the backend to process searches.

You can think of other applications of regex wherever pattern matching is required.

Wildcard patterns

The smallest individual units from which regular expressions are formed are called wildcard patterns. The commonly used patterns are:

^

This wildcard anchors the match at the beginning of a line.

$

This wildcard anchors the match at the end of a line.

.

This wildcard matches any single character on the line (any character except a newline).

\s

This wildcard matches a whitespace character in a string.

\S

This wildcard matches any character that is not whitespace.

\d

This wildcard matches a digit.

*

This wildcard repeats the previous character zero or more times. It is greedy: it matches the longest possible string.

*?

This wildcard also repeats the previous character zero or more times. However, it is lazy: it matches the shortest possible string that fits the pattern.

+

This wildcard repeats the previous character one or more times. It is greedy: it matches the longest possible string that fits the pattern.

+?

This wildcard repeats the previous character one or more times. However, it is lazy: it matches the shortest possible string that fits the pattern.

[aeiou]

Matches any single character in the specified set.

[^XYZ]

Matches any single character not included in the set.

[a-z0-9]

Matches any single character in the range a-z or 0-9.

(

This wildcard marks the beginning of the part of the match to be extracted.

)

This wildcard marks the end of the part of the match to be extracted.
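To see several of these wildcards in action, here is a small sketch that uses re.findall, which returns every match as a list. The sample strings are invented for illustration.

```python
import re

sample = "Born on 5 Nov 1988"

first_word = re.findall(r"^\S+", sample)       # ^ anchors at the start
last_number = re.findall(r"\d+$", sample)      # $ anchors at the end
vowels = re.findall(r"[aeiou]", sample)        # any character in the set
capitals = re.findall(r"[^a-z0-9\s]", sample)  # anything NOT in the set
day = re.findall(r"on (\d+)", sample)          # ( ) extracts only the digits

tagged = "<b>one</b>"
greedy = re.findall(r"<.+>", tagged)   # + takes the longest possible match
lazy = re.findall(r"<.+?>", tagged)    # +? takes the shortest possible match

print(first_word)   # ['Born']
print(last_number)  # ['1988']
print(vowels)       # ['o', 'o', 'o']
print(capitals)     # ['B', 'N']
print(day)          # ['5']
print(greedy)       # ['<b>one</b>']
print(lazy)         # ['<b>', '</b>']
```

The last two lines show the practical difference between the greedy and lazy forms: the same text yields one long match or two short ones.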

Examples

If you want to extract numbers from a document, the regular expression will be: [0-9]+

If you want to extract all characters other than numbers, the regular expression will be: [^0-9]+

To extract a name that begins with “A” and ends with “h”, the regular expression is: ^A[a-zA-Z]+h$

A more complex regular expression, for extracting an email address, is: ^[a-zA-Z][a-zA-Z0-9._+-]+@[A-Za-z]+\.[A-Za-z]+
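The first three patterns can be tried out directly in Python. This is a minimal sketch with invented sample data; the name pattern is written without spaces, since a space inside a pattern is itself a character to be matched.

```python
import re

text = "Order 66 shipped on day 7"

numbers = re.findall(r"[0-9]+", text)       # runs of digits
non_numbers = re.findall(r"[^0-9]+", text)  # runs of everything else

# Hypothetical list of names to filter.
names = ["Ashish", "Amit", "Ankush", "Harsh"]
a_to_h = [n for n in names if re.match(r"^A[a-zA-Z]+h$", n)]

print(numbers)      # ['66', '7']
print(non_numbers)  # ['Order ', ' shipped on day ']
print(a_to_h)       # ['Ashish', 'Ankush']
```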

Building Regex!

Regex can get very complex. Understanding and building complex regular expressions is an art that is learned by practice. You can refer to a dedicated regex tutorial to learn how to build complex regular expressions.

Python implementation

Regex is supported by many programming languages, such as Python, Java, and JavaScript. Although the concept is the same everywhere, you may find some differences between languages.

Now we will look at the various functions Python provides for working with regular expressions, along with their code.

Python provides a built-in regex module called re as part of its standard library, so nothing needs to be installed with pip; you only have to import it. We then store some text in a variable called string.

import re
string = "Virat Kohli is one of the greatest players in the Indian cricket team.\nHe was born on November 5, 1988, in Delhi.\nHe has completed his education at Vishal Bharti School.\nIn 2008, he won the World Cup for India at the Under-19 level. From 2011, he started Test cricket matches. \nHe is currently the captain of all three formats of India.\n In 2017, Virat Kohli got married to Hindi film actress Anushka Sharma.\nVirat has won the Man of the Tour twice, in 2014 and 2016. \nSince 2008, he has represented Delhi in-home teams. \nHe has been awarded the Arjuna Award in recognition of the achievements of international cricket."

match method

This function looks for the RE pattern at the beginning of the string and returns a match object if it is found, or None otherwise. You can retrieve the matched text from the object with the group() function. The syntax of the match function is

re.match(pattern, string, flags)

Here pattern is the regular expression, string is the text to be searched for the pattern, and flags are modifiers. We use flags when we want to apply extra conditions to the matching, such as ignoring case. This is an optional parameter.
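As a small sketch of the optional flags argument, the example below uses two flags from the re module, re.IGNORECASE and re.MULTILINE. The sample text is invented for illustration.

```python
import re

text = "Virat Kohli\nvirat plays cricket"

# Without flags, matching is case-sensitive, so "virat" does not match "Virat".
case_sensitive = re.match(r"virat", text)

# re.IGNORECASE makes the pattern ignore letter case.
ignore_case = re.match(r"virat", text, flags=re.IGNORECASE)

# re.MULTILINE lets ^ match at the start of every line, not only the first.
# Flags are combined with the | operator.
every_line = re.findall(r"^virat", text, flags=re.IGNORECASE | re.MULTILINE)

print(case_sensitive)       # None
print(ignore_case.group())  # Virat
print(every_line)           # ['Virat', 'virat']
```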

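Python code

pattern = r'(^[V].+?)\s'   # a word starting with V at the start of the string, followed by whitespace
print(re.match(pattern, string))          # prints the match object
print(re.match(pattern, string).group())  # prints the matched text (including the trailing whitespace)

Output

<re.Match object; span=(0, 6), match='Virat '>
Virat

The pattern matches because the string starts with a word beginning with V.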

search method

This function looks for the first occurrence of the RE pattern in the given string. It also returns a match object if the pattern is found; otherwise, it returns None. The syntax is

re.search(pattern, string)

Note that match finds a match only at the beginning of the string, while search looks for the first match anywhere in the string.
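The difference between the two functions can be seen by running the same pattern through both. This is a minimal sketch on a single sentence; it also checks for None before calling group(), because calling group() on a failed match raises an error.

```python
import re

sentence = "He was born on November 5, 1988, in Delhi."

at_start = re.match(r"[0-9]+", sentence)   # the sentence does not START with a digit
anywhere = re.search(r"[0-9]+", sentence)  # the first run of digits anywhere

for label, m in (("match", at_start), ("search", anywhere)):
    if m is None:
        print(label, "found nothing")
    else:
        print(label, "found", m.group(), "at position", m.start())

# match found nothing
# search found 5 at position 24
```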

Python code

pattern = r'[0-9]+'
match = re.search(pattern, string)   # returns the match object
print(match.group())                 # prints the matched text

Output

5

This call returns the first number present in the text.

findall method

This function returns all occurrences of the RE pattern in the string as a list. The syntax of findall is

re.findall(pattern, string)

Python code

pattern = r'[0-9]+'
print(re.findall(pattern, string))

Output

['5', '1988', '2008', '19', '2011', '2017', '2014', '2016', '2008']

This function extracts all numbers from the text.
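One detail worth knowing, sketched below with invented sample sentences: when the pattern contains parentheses, findall returns only the extracted groups rather than the whole match, and with more than one group it returns tuples.

```python
import re

sentence = "He won in 2014 and in 2016"

# One group: only the text inside ( ) is returned.
years = re.findall(r"in (\d+)", sentence)

# Two groups: each match becomes a tuple of the extracted parts.
dates = re.findall(r"(\w+) (\d{4})", "November 1988, March 2008")

print(years)  # ['2014', '2016']
print(dates)  # [('November', '1988'), ('March', '2008')]
```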

sub method

This function replaces all occurrences of the RE pattern with a new string. The syntax is:

re.sub(pattern, repl, string)

Python code

pattern = r'(^[V].+?)\s'   # assumed pattern: a word starting with V at the start of a line, plus the whitespace after it
repl = 'Chiku '
print(re.sub(pattern, repl, string, flags=re.MULTILINE))   # re.MULTILINE lets ^ match at every line start

Output

Chiku Kohli is one of the greatest players in the Indian cricket team.
He was born on November 5, 1988, in Delhi.
He has completed his education at Vishal Bharti School.
In 2008, he won the World Cup for India at the Under-19 level. From 2011, he started Test cricket matches.
He is currently the captain of all three formats of India.
 In 2017, Virat Kohli got married to Hindi film actress Anushka Sharma.
Chiku has won the Man of the Tour twice, in 2014 and 2016.
Since 2008, he has represented Delhi in-home teams.
He has been awarded the Arjuna Award in recognition of the achievements of international cricket.

This call replaces Virat with Chiku, Kohli's nickname, wherever Virat appears at the start of a line.
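By default sub replaces every occurrence. As a small sketch on a single sentence, the optional count argument of re.sub limits how many replacements are made.

```python
import re

sentence = "He was born on November 5, 1988, in Delhi."

masked_all = re.sub(r"[0-9]+", "#", sentence)             # replace every number
masked_first = re.sub(r"[0-9]+", "#", sentence, count=1)  # replace only the first one

print(masked_all)    # He was born on November #, #, in Delhi.
print(masked_first)  # He was born on November #, 1988, in Delhi.
```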

These are the most commonly used functions of the re module. You can refer to the re documentation for more details.

Summary

We started with a basic definition of regular expressions and then discussed their various applications. Next, we learned to build regular expressions using wildcards. Finally, we implemented several regex functions in Python.

References

Main photo: https://www.codingforentrepreneurs.com/blog/python-regular-expressions/

About me

Hello! I am Ashish Choudhary. I am pursuing a B.Tech at JC Bose University of Science and Technology. Data science is my passion, and I take pride in writing interesting blogs related to it. Feel free to contact me on LinkedIn.

