Python Regular Expressions Tutorial


Relevance of regular expressions

In recent years, there has been a dramatic shift toward using general-purpose programming languages for data science and machine learning. This was not always the case: a decade ago, the idea would have been met with plenty of skepticism!

This means that more people and institutions are using tools like Python and JavaScript to meet their data needs. This is where regular expressions become really useful. Regular expressions are commonly the default way of cleaning and wrangling data in most of these tools. Whether you are extracting specific parts of text from web pages, making sense of Twitter data, or preparing your data for text mining, regular expressions are often your best option.

Given their applicability, it makes sense to know them and use them properly.

What will you learn from this post?

In this post, I will guide you through the usage, examples, and applications of regular expressions. Regular expressions are very popular with programmers and can be used in many programming languages such as Java, JavaScript, PHP, C++, etc. To build our understanding, I explain the concept using the Python programming language. Toward the end, I solve various problems using regex.


Let's get started!

What is a regular expression and how is it used?

Briefly, a regular expression is a sequence of characters mainly used to find and replace patterns in a string or file. As I mentioned before, they are supported by most programming languages, such as Python, Perl, R, Java, and many others. So learning them helps in multiple ways (more on this later).

Regular expressions use two types of characters:

a) Metacharacters: as the name suggests, these characters have a special meaning, similar to * in a wildcard.

b) Literals (like a, b, 1, 2, …)

In Python, the “re” module provides support for regular expressions. So you need to import the re library before you can use regular expressions in Python.

Use this code --> import re

The most common uses of regular expressions are:

  • Search for a string (search and match)
  • Find a string (findall)
  • Break the string into substrings (split)
  • Replace part of a string (sub)

Let's look at the methods that the “re” library provides to perform these tasks.


What are the different regex methods?

The “re” package provides various methods to query an input string. Here are the most commonly used methods, which I will discuss:

  1. re.match()
  2. re.search()
  3. re.findall()
  4. re.split()
  5. re.sub()
  6. re.compile()

Let's see them one by one.

re.match(pattern, string):

This method finds a match only if it occurs at the beginning of the string. For example, calling match() on the string ‘AV Analytics AV’ and looking for the pattern ‘AV’ will match. However, if we look only for ‘Analytics’, the pattern will not match. Let's do it in Python now.


The match object also gives the start and end positions of the matching pattern ‘AV’ in the string (via start() and end()), which is often a big help when manipulating strings.
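As a minimal sketch of the behaviour described above, using the string ‘AV Analytics AV’ from the text (the variable names are illustrative):

```python
import re

text = 'AV Analytics AV'

# 'AV' sits at the very beginning of the string, so match() succeeds
result = re.match(r'AV', text)
print(result.group(0))                # AV
print(result.start(), result.end())   # 0 2

# 'Analytics' is not at the beginning, so match() returns None
no_match = re.match(r'Analytics', text)
print(no_match)                       # None
```

Because match() returns None when nothing matches, it is worth checking the result before calling group() on it.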

re.search(pattern, string):

It is similar to match(), but it does not limit us to looking for matches only at the beginning of the string. Unlike the previous method, here searching for the pattern ‘Analytics’ will return a match.

Code

result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print(result.group(0))
Output:
Analytics

Here you can see that the search() method can find a pattern at any position in the string, but it returns only the first occurrence of the search pattern.
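To make the difference from match() concrete, here is a small sketch on the string ‘AV Analytics AV’ used earlier (the variable names are illustrative). span() is the match object's method that returns the (start, end) pair.

```python
import re

text = 'AV Analytics AV'

# search() scans the whole string, so 'Analytics' is found in the middle
middle = re.search(r'Analytics', text)
print(middle.group(0), middle.span())   # Analytics (3, 12)

# 'AV' occurs twice, but only the first occurrence is returned
first = re.search(r'AV', text)
print(first.span())                     # (0, 2)
```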

re.findall(pattern, string):

It helps to get a list of all matching patterns. It is not restricted to searching from the beginning or the end. If we use the findall method to search for ‘AV’ in the given string, it will return both occurrences of AV. When searching a string, I would recommend that you always use re.findall(), since it can cover the use cases of both re.search() and re.match().

Code

result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result)

Output:
['AV', 'AV']

re.split(pattern, string, [maxsplit=0]):

This method splits the string by the occurrences of the given pattern.

Code

result=re.split(r'y','Analytics')
result

Output:
['Anal', 'tics']

Above, we have split the string “Analytics” by “y”. The split() method has another argument, “maxsplit”. It has a default value of zero, in which case it makes as many splits as possible; but if we give maxsplit a value, it will split the string at most that many times. Let's see the following example:

Code

result=re.split(r'i','Analytics Vidhya')
print(result)

Output:
['Analyt', 'cs V', 'dhya'] #It has performed all the splits that can be done by pattern "i".

Code

result=re.split(r'i','Analytics Vidhya',maxsplit=1)
result

Output:
['Analyt', 'cs Vidhya']

Here, you can see that we have set maxsplit to 1. As a result, the output has only two values, whereas the first example has three.

re.sub(pattern, repl, string):

It helps to find a pattern and replace it with a new substring. If the pattern is not found, the string is returned unchanged.

Code

result=re.sub(r'India','the World','AV is largest Analytics community of India')
result
Output:
'AV is largest Analytics community of the World'
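The "returned unchanged" case can be checked with a short sketch. The pattern 'Europe' is just an illustrative value that does not occur in the string.

```python
import re

text = 'AV is largest Analytics community of India'

# 'Europe' does not occur in the text, so sub() has nothing to replace
unchanged = re.sub(r'Europe', 'the World', text)
print(unchanged == text)   # True
```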

re.compile(pattern):

We can compile a regex pattern into a pattern object, which can then be used for pattern matching. It is also useful for searching for a pattern again without rewriting it.

Code

import re
pattern=re.compile('AV')
result=pattern.findall('AV Analytics Vidhya AV')
print(result)
result2=pattern.findall('AV is largest analytics community of India')
print(result2)
Output:
['AV', 'AV']
['AV']

Quick summary of various methods:
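As a rough recap, the sketch below runs each of the six methods once on the string ‘AV Analytics AV’. The variable names and the replacement text 'XX' are illustrative choices.

```python
import re

text = 'AV Analytics AV'

matched = re.match(r'AV', text).group(0)            # start of string only
searched = re.search(r'Analytics', text).group(0)   # first match anywhere
found = re.findall(r'AV', text)                     # every match, as a list
pieces = re.split(r' ', text)                       # split on each space
replaced = re.sub(r'AV', 'XX', text)                # replace every match
compiled = re.compile(r'AV')                        # reusable pattern object

print(matched)                  # AV
print(searched)                 # Analytics
print(found)                    # ['AV', 'AV']
print(pieces)                   # ['AV', 'Analytics', 'AV']
print(replaced)                 # XX Analytics XX
print(compiled.findall(text))   # ['AV', 'AV']
```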

So far, we have looked at various regex methods using a constant pattern (fixed characters). But what if we don't have a constant search pattern and want to return a specific set of characters (defined by a rule) from a string? Don't be intimidated.

This can be easily solved by defining an expression with the help of pattern operators (meta and literal characters). Let's look at the most common pattern operators.

What are the most used operators?

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

Operators    Description
.            Matches any single character except the newline ‘\n’.
?            Matches 0 or 1 occurrence of the pattern to its left
+            1 or more occurrences of the pattern to its left
*            0 or more occurrences of the pattern to its left
\w           Matches an alphanumeric character (or underscore), whereas \W (uppercase W) matches a non-alphanumeric character.
\d           Matches digits [0-9], whereas \D (uppercase D) matches non-digits.
\s           Matches a single whitespace character (space, newline, return, tab, form feed), and \S (uppercase S) matches any non-whitespace character.
\b           Boundary between word and non-word; \B is the opposite of \b
[..]         Matches any single character in the square brackets, and [^..] matches any single character not in the square brackets
\            Used for special-meaning characters, like \. to match a period or \+ for the plus sign.
^ and $      ^ and $ match the start and end of the string, respectively
{n,m}        Matches at least n and at most m occurrences of the preceding expression; if we write it as {,m}, it matches from 0 up to m occurrences of the preceding expression.
a|b          Matches either a or b
( )          Groups regular expressions and returns the matched text
\t, \n, \r   Matches tab, newline, return

For more details on the metacharacters “(”, “)”, “|” and others, you can check this link (https://docs.python.org/2/library/re.html).
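A few of the operators in the table are not exercised by the problems in this post, so here is a small sketch of them. The sample strings are made up purely for illustration.

```python
import re

# ? : the 'u' is optional, so both spellings match
optional = re.findall(r'colou?r', 'color colour')
print(optional)   # ['color', 'colour']

# {n,m} : between 2 and 3 digits, taking as many as possible
bounded = re.findall(r'\d{2,3}', '1 22 333 4444')
print(bounded)    # ['22', '333', '444']

# a|b : either alternative
either = re.findall(r'cat|dog', 'cat, dog, bird')
print(either)     # ['cat', 'dog']

# \ : escape the dot so it matches a literal period
escaped = re.findall(r'\d\.\d', '3.5 and 4x2')
print(escaped)    # ['3.5']

# [^..] : any single character that is neither a vowel nor whitespace
negated = re.findall(r'[^aeiou\s]', 'data set')
print(negated)    # ['d', 't', 's', 't']
```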

Now, let's understand the pattern operators by looking at the following examples.

Some examples of regular expressions

Problem 1: Return the first word of a given string

Solution-1 Extract each character (using “\w”)

Code

import re
result=re.findall(r'.','AV is largest Analytics community of India')
print(result)

Output:
['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']

The above also extracts the spaces. To avoid that, use “\w” instead of “.”.

Code

result=re.findall(r'\w','AV is largest Analytics community of India')
print(result)

Output:
['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']

Solution-2 Extract each word (using “*” or “+”)

Code

result=re.findall(r'\w*','AV is largest Analytics community of India')
print(result)

Output:
['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']

Again, the spaces show up as empty strings because “*” matches zero or more occurrences of the pattern to its left. To remove them, we will use “+”.

Code

result=re.findall(r'\w+','AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

Solution-3 Extract the first word (using “^”)

Code

result=re.findall(r'^\w+','AV is largest Analytics community of India')
print(result)

Output:
['AV']

If we use “$” instead of “^”, it will return the word from the end of the string. Let's look at it.

Code

result=re.findall(r'\w+$','AV is largest Analytics community of India')
print(result)
Output:

['India']

Problem 2: Return the first two characters of each word

Solution-1 Extract two consecutive characters from each word, excluding spaces (using “\w”)

Code

result=re.findall(r'\w\w','AV is largest Analytics community of India')
print(result)

Output:
['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']

Solution-2 Extract two consecutive characters at the start of each word, using the word boundary (“\b”)

result=re.findall(r'\b\w.','AV is largest Analytics community of India')
print(result)

Output:
['AV', 'is', 'la', 'An', 'co', 'of', 'In']

Problem 3: Return the domain type of the given email IDs

To explain it simply, I'll go again with a step-by-step approach:

Solution-1 Extract all characters after “@”

Code

# Sample addresses; the local parts (before "@") are placeholders
emails = 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz'
result=re.findall(r'@\w+', emails)
print(result)
Output: ['@gmail', '@test', '@analyticsvidhya', '@rest']

Above, you can see that the “.com”, “.in” part is not extracted. To add it, we will use the following code.

result=re.findall(r'@\w+\.\w+', emails)
print(result)
Output:
['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

Solution-2 Extract only the domain type using “()”

Code

result=re.findall(r'@\w+\.(\w+)', emails)
print(result)
Output:
['com', 'in', 'com', 'biz']

Problem 4: Return the dates from the given string

Here we will use “\d” to extract the digits.

Solution:

Code

result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
Output:
['12-05-2007', '11-11-2011', '12-01-2009']

If you want to extract only the year, parentheses “()” will help you.

Code

result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
Output:
['2007', '2011', '2009']
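The same idea extends to several groups at once. As a sketch, wrapping each part of the date in its own parentheses makes findall() return one tuple per date (the variable names are illustrative):

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups: findall() returns a tuple of the three captured parts per match
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)
# [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]
```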

Problem 5: Return all words in a string that begin with a vowel

Solution-1 Return every word

Code

result=re.findall(r'\w+','AV is largest Analytics community of India')
print(result)

Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

Solution-2 Return words that begin with a vowel (using [])

Code

result=re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

Output:
['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']

Above you can see that it has returned “argest” and “ommunity” from the middle of words. To drop these two, we need to use “\b” for the word boundary.

Solution 3

Code

result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

Output:
['AV', 'is', 'Analytics', 'of', 'India']
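As an aside that the post itself does not cover, the re module's IGNORECASE flag is one way to avoid listing both lowercase and uppercase vowels. A minimal sketch of that variation:

```python
import re

text = 'AV is largest Analytics community of India'

# re.IGNORECASE lets [aeiou] match uppercase vowels as well
vowel_words = re.findall(r'\b[aeiou]\w+', text, flags=re.IGNORECASE)
print(vowel_words)   # ['AV', 'is', 'Analytics', 'of', 'India']
```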

Similarly, we can extract words that begin with a consonant using “^” inside the square brackets.

Code

result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')
print(result)

Output:
[' is', ' largest', ' Analytics', ' community', ' of', ' India']

Above you can see that it has returned words starting with a space. To remove them from the output, include a space inside the square brackets [].

Code

result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')
print(result)

Output:
['largest', 'community']


Problem 6: Validate a phone number (the phone number must have 10 digits and start with 8 or 9)

We have a list of phone numbers in the list “li”, and here we will validate the phone numbers using a regular expression.

Solution

Code

import re
li=['9999999999','999999-999','99999x9999']
for val in li:
    # match() anchors at the start; the length check rejects longer inputs
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print('yes')
    else:
        print('no')
Output:
yes
no
no
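A variation on the solution above, not the approach used in this post: on Python 3.4 or later, fullmatch() requires the whole string to match, so the separate length check can be dropped. The fourth, 11-digit number is an extra test value added here for illustration.

```python
import re

numbers = ['9999999999', '999999-999', '99999x9999', '99999999991']

# First digit 8 or 9, followed by exactly nine more digits
pattern = re.compile(r'[89]\d{9}')

# fullmatch() succeeds only if the entire string fits the pattern
results = ['yes' if pattern.fullmatch(val) else 'no' for val in numbers]
print(results)   # ['yes', 'no', 'no', 'no']
```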

Problem 7: Split a string with multiple delimiters

Solution

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print(result)

Output:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

We can also use the re.sub() method to replace these multiple delimiters with a single one, such as a space “ ”.

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print(result)

Output:
asdf fjdk afed fjek asdf foo
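One caveat worth sketching: if delimiters appear back to back (for example a comma followed by a space), splitting on a single delimiter character leaves empty strings in the result. Adding “+” treats each run of delimiters as one. The input line below is a modified version of the one above, made up for illustration.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'   # some delimiters are followed by a space

# One split per delimiter character: empty strings appear between neighbours
single = re.split(r'[;,\s]', line)
print(single)
# ['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', 'foo']

# One split per run of delimiters
runs = re.split(r'[;,\s]+', line)
print(runs)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```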

Problem 8: Retrieve information from an HTML file

I want to extract information from an HTML file (see sample data below). Here we need to extract the information available between <td> and </td>, except for the first numerical index. I have assumed here that the HTML code below is stored in a string str.

HTML file example (str)

<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>My</td></tr>
<tr align="center"><td>7</td> <td>Michael</td> <td>Emily</td></tr>

Solution:

Code

result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print(result)
Output:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'My'), ('Michael', 'Emily')]

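You can read the HTML file using the urllib2 library, or urllib.request on Python 3 (see code below).

Code

# Python 3; on Python 2 use: import urllib2 and urllib2.urlopen('')
from urllib.request import urlopen
response = urlopen('')
html = response.read()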
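One detail to keep in mind if you run this on Python 3: read() returns bytes rather than a string, and the string patterns used in this post cannot be applied to bytes directly. The sketch below uses a hard-coded bytes value as a stand-in for response.read(), and assumes the page is UTF-8 encoded.

```python
import re

# Stand-in for response.read(), which returns bytes on Python 3
raw = b'<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>'

# Assumption: the page is UTF-8 encoded
html = raw.decode('utf-8')

names = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(names)   # [('Noah', 'Emma')]
```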

Final notes

In this post, we discussed regular expressions, along with the methods and metacharacters used to build them. We also worked through several examples to see their practical uses. Here I have tried to introduce you to regex and to cover the most common methods, which should be enough to solve most regex problems.

Was the post helpful? Let us know your thoughts on this guide in the comment section below.

If you liked what you have just read and want to continue learning about analytics, subscribe to our emails, follow us on Twitter, or like our Facebook page.
