
Regular Expressions

The string module handles basic string manipulation, but advanced routines often need more powerful string processing than it offers. That's when you should consider using the re module (re stands for regular expression).

Regular expressions are strings that contain a mix of text and special characters and that let you define complicated pattern-matching and replacement rules for other strings.

Many regular expression constructs begin with a backslash, and special characters must be preceded by a backslash in order to be matched literally. Because patterns therefore tend to contain a lot of backslashes, they are usually written as raw strings. That means that instead of writing "\\b(usa)\\d", it is much easier to say r"\b(usa)\d".
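
As a minimal sketch of the point above, the following lines compare the two spellings of the same pattern. The sample text "code usa1 ok" is made up for illustration only.

```python
import re

# Both literals describe the same pattern: a word boundary, "usa", then a digit.
escaped = "\\b(usa)\\d"
raw = r"\b(usa)\d"
same_pattern = (escaped == raw)

# Without the r prefix, "\b" is a single backspace character,
# not the two-character word-boundary escape that re expects.
backspace = "\b"

match = re.search(raw, "code usa1 ok")
found = match.group(1)
```

Here same_pattern is true, which shows that the raw string is only a more readable way to type the pattern, not a different kind of object.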

Older versions of Python supported the following regular expression modules, which are now obsolete: regexp, regex, and regsub.

Table 9.3. Special Characters Recognized by the re Module
Special Character What It Matches
. Any character (except newline by default).
^ The start of the string, or of a line (in case of multiline re's).
$ The end of the string, or of a line (in case of multiline re's).
* Zero or more occurrences of the preceding expression.
+ One or more occurrences of the preceding expression.
| Either the preceding re or the following re, whichever matches.
? Zero or one occurrence of the preceding expression.
*? Similar to *, but it matches as few occurrences as possible.
+? Similar to +, but it matches as few occurrences as possible.
?? Similar to ?, but it matches as few occurrences as possible.
{m,n} From m to n occurrences of the preceding expression. It matches as many occurrences as possible.
{m,n}? From m to n occurrences of the preceding expression. It matches as few occurrences as possible.
[list] A set of characters, such as r"[A-Z]".
[^list] Any character that is not in the list.
(re) Matches the regular expression as a group. It specifies logical groups of operations and saves the matched substring.
Anystring The string anystring.
\w Any alphanumeric character or the underscore.
\W Any character that \w does not match.
\d Any decimal digit.
\D Any character that is not a decimal digit.
\b The empty string at the start or end of a word.
\B The empty string anywhere except at the start or end of a word.
\s Matches a whitespace character.
\S Matches any non-whitespace character.
\number The text already matched by the group with that number.
\A Only at the start of the string.
\Z Only at the end of the string.
\\ The literal backslash.
(?:str) Matches str, but the group can't be retrieved after the match.
(?!str) Matches only if not followed by str (for example, r"Andre (?!Lessa)" matches "Andre " only when "Lessa" does not come right after it).
(?=str) Matches only if followed by str.
(?=.*str) Matches only if followed at some point by str (for example, r"Andre (?=.*Lessa)" matches only if it finds something similar to "Andre S Lessa"). This syntax doesn't consume any of the string, so in this example, the re matches only the "Andre " portion of the string.
(?#str) A comment inserted in the middle of the regular expression string; it is ignored.
(?P<name>…) Matches the regular expression that follows the name and makes the matched text available as the group called name.
(?P=name) Matches the same text that the group called name matched earlier.
.* Any number of characters.
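
A few of the table entries are easier to grasp in action. The sketch below is only an illustration: the sample strings and the variable names are invented, and it exercises greedy versus non-greedy repetition, a named group with a backreference, and the two lookahead forms.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, up to the last ">".
greedy = re.search(r"<.*>", text).group(0)
# Non-greedy: .*? stops at the first ">".
lazy = re.search(r"<.*?>", text).group(0)

# Named group plus a backreference to it: find a doubled word.
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "it is is fine")
repeated_word = doubled.group("word")

# Lookahead does not consume text: only "Andre " is part of the match.
ahead = re.search(r"Andre (?=.*Lessa)", "Andre S Lessa").group(0)
# Negative lookahead: no match when "Lessa" follows immediately.
blocked = re.search(r"Andre (?!Lessa)", "Andre Lessa")
```

In this run greedy spans the whole sample string, lazy is just the first tag, and blocked is None.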

For a full definition of the syntax, visit the following link:

http://www.python.org/doc/current/lib/re-syntax.html

Next, you have the regular expression flags. You combine these flags with the bitwise-or operator (|) and pass the result to the re functions.

re.DOTALL (also used as re.S)—  Allows the dot character to match all characters, including newlines.

re.IGNORECASE (also used as re.I)—  Performs case-insensitive matching.

re.LOCALE (also used as re.L)—  Makes \w, \W, \b, and \B depend on the current locale settings.

re.MULTILINE (also used as re.M)—  Applies ^ and $ to each line, rather than only to the string as a whole.

re.VERBOSE (also used as re.X)—  Ignores unescaped whitespace and comments.
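
The following sketch shows how some of these flags change the result of a search and how several flags are combined. The three-line sample text is an assumption made for the example.

```python
import re

text = "first line\nSecond line\nTHIRD line"

# MULTILINE makes ^ match at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Flags are combined with the bitwise-or operator.
pattern = re.compile(r"""
    ^third      # a line that starts with "third", in any case
    \s+ line    # whitespace inside the pattern is ignored in verbose mode
    """, re.IGNORECASE | re.MULTILINE | re.VERBOSE)
hit = pattern.search(text)

# DOTALL lets "." cross newlines.
without_dotall = re.search(r"first.*THIRD", text)
with_dotall = re.search(r"first.*THIRD", text, re.DOTALL)
```

Without re.DOTALL the last pattern cannot get past the first newline, so without_dotall is None while with_dotall is a match object.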

Let's look at our first example of regular expressions. Suppose that you have the following conversation text:

					
oldtext = """
    That terrible dead PARROT sketch must end!
    Oh, Come on! It is a terrific parrot joke.
    I agree, but I don't like to see dead parrot.
    Ok. I will suggest a new terrific parrot sketch."""

				

Okay. Now our challenge is to create an expression that identifies every occurrence of the word "parrot" that meets all the following conditions:

  1. It is preceded by either "terrible" or "terrific" (such as "terrible parrot", "terrific parrot").

  2. It is not immediately preceded by the word "dead".

  3. It is separated from the previous word by whitespace ("terribleparrot" does not work).

  4. It is not followed by the word "joke"; hence, "parrot joke" is an invalid string.

  5. It is followed by whitespace and, right after, by the word "sketch" (neither "parrotsketch" nor "parrot old sketch" is valid).

  6. It is matched without regard to case.

The word "parrot" that gets identified must be replaced with the word "spam".

The following code is a possible solution for this problem:

					
 1: import re
 2: restring = re.compile(
 3:     r"""\b(terrible|terrific)   # group 1: the adjective to keep
 4:         (?!dead)
 5:         \s+                     # whitespace before "parrot"
 6:         parrot
 7:         (?!joke)
 8:         (?=\s+sketch)""",
 9:     re.DOTALL | re.IGNORECASE | re.VERBOSE)
10: newline = restring.sub(r'\1 spam', oldtext)

				

We call the compile function (line 2), which generates a compiled regular expression object called restring. Then, we call the object's sub method (line 10) to substitute the matches found in the text variable that we have already defined (oldtext). The sub() method replaces the entire matched section of the string. Note that the r'\1 spam' argument uses \1 to make sure that the text captured by the first group of parentheses ("terrible" or "terrific", in whatever case it appeared) is placed right before the word "spam".
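
Because sub() replaces the entire match, any text that the pattern consumes and the replacement string doesn't put back is lost. The sketch below, which uses a shortened, made-up sentence and two illustrative pattern names, contrasts consuming a trailing word with merely checking for it through a lookahead.

```python
import re

sentence = "a new terrific parrot sketch"

# The trailing word is part of the match, so sub() removes it too.
consuming = re.compile(r"(terrific)\s+parrot\s+sketch", re.IGNORECASE)
dropped = consuming.sub(r"\1 spam", sentence)

# A lookahead checks for the trailing word without consuming it.
asserting = re.compile(r"(terrific)\s+parrot(?=\s+sketch)", re.IGNORECASE)
kept = asserting.sub(r"\1 spam", sentence)
```

The first call yields "a new terrific spam", whereas the second keeps the word "sketch" in place.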

Regular Expression Functions and Object Methods

The re module implements just one exception—the error exception, which is raised only when a regular expression string is not valid.
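
One way to use this exception is to validate a pattern before relying on it. The helper below is a hypothetical convenience function written for this example; it is not part of the re module.

```python
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

good = is_valid_pattern(r"(usa)\d")
bad = is_valid_pattern(r"(usa")      # the group is never closed
```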

Next, you have the list of available re functions.

re.compile()

Compiles a regular expression pattern string and generates a regular expression object.

							
RegExpObject = compile(string [, flags])

						

For details about the flags argument, check out the previous list of available flags.
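
Compiling pays off when the same pattern is applied to many strings. The following sketch assumes a made-up list of addresses and a simplistic five-digit pattern, and it reuses one compiled object inside a loop.

```python
import re

# Compile once, then reuse the object for several strings.
zip_code = re.compile(r"\b\d{5}\b")
addresses = ["Miami, FL 33101", "no code here", "Austin, TX 73301"]

codes = []
for address in addresses:
    found = zip_code.search(address)
    if found:                      # search() returns None when nothing matches
        codes.append(found.group(0))
```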

Every regular expression object exposes the following attributes and methods:

RegExpObject.search()

Searches for the compiled pattern in the string.

							
MatchObject = RegExpObject.search(string [,startpos] [,endpos])

						

It uses the startpos and endpos arguments to delimit the range of the search.

All functions that return a MatchObject on success return None on failure.
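
The effect of startpos and endpos, and the None result, can be sketched as follows. The sample text and the variable names are chosen only for illustration.

```python
import re

word = re.compile(r"parrot")
text = "dead parrot, new parrot"

first = word.search(text)            # scans the whole string
second = word.search(text, 6)        # starts looking at index 6
limited = word.search(text, 0, 8)    # only indexes 0 through 7 are considered

positions = (first.start(), second.start())
missing = limited is None
```

Testing the result against None before calling any MatchObject method avoids an AttributeError when nothing matches.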

RegExpObject.match()

Checks whether the initial characters of string match the compiled pattern.

							
MatchObject = RegExpObject.match(string [,startpos] [,endpos])

						

It uses the startpos and endpos arguments to delimit the scope of the matching.
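
The difference between match() and search() is easy to overlook, so here is a small sketch with an invented sample string.

```python
import re

number = re.compile(r"\d+")
text = "room 42"

at_start = number.match(text)       # None: the string starts with "room"
anywhere = number.search(text)      # finds "42"
from_index = number.match(text, 5)  # matching begins at index 5
```

With startpos set to 5, match() succeeds because the digits start exactly at that index.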

RegExpObject.findall()

Finds nonoverlapping matches of the compiled pattern in string.

							
MatchList = RegExpObject.findall(string)
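
What the returned list contains depends on the groups in the pattern, as this sketch with made-up input suggests.

```python
import re

text = "x=1, y=22, z=333"

# No groups: the list holds the full text of each match.
numbers = re.compile(r"\d+").findall(text)

# Several groups: the list holds one tuple per match.
pairs = re.compile(r"(\w)=(\d+)").findall(text)

# Matches do not overlap: "aaaa" contains two "aa" matches, not three.
non_overlapping = re.compile(r"aa").findall("aaaa")
```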

						
RegExpObject.split()

Splits the string by the occurrences of the compiled pattern.

							
StringList = RegExpObject.split(string [, maxsplit])
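
As an illustration, the sketch below splits a hypothetical record on commas or semicolons surrounded by optional whitespace, first without a limit and then with maxsplit set to 2.

```python
import re

separator = re.compile(r"\s*[,;]\s*")
record = "spam, eggs ;ham;  bacon"

fields = separator.split(record)
# With maxsplit=2 only the first two separators are used;
# the remainder of the string is returned as the last item.
limited = separator.split(record, maxsplit=2)
```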

						
RegExpObject.sub()

Substitutes the matches of pattern in string with newtext.

							
RegExpObject.sub(newtext, string [, count])

						

The replacements are done count number of times, starting from the left side of string. Omitting the count argument doesn't disable the substitution; it means that every match is replaced.
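
The role of count can be sketched with a short, invented example.

```python
import re

vowels = re.compile(r"[aeiou]")
text = "regular expression"

all_replaced = vowels.sub("*", text)            # count omitted: every match
two_replaced = vowels.sub("*", text, count=2)   # only the two leftmost matches
```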

RegExpObject.subn()

It is similar to sub(). However, it returns a tuple that contains the new string and the number of substitutions executed. As with sub(), omitting the count argument means that every match is replaced.

							
RegExpObject.subn(newtext, string [, count])
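
The returned tuple is convenient to unpack directly. A minimal sketch with an invented sample string:

```python
import re

spaces = re.compile(r"\s+")
new_text, changes = spaces.subn(" ", "too   many    spaces here")
```

Here changes counts every run of whitespace that was replaced, including the single space that was already correct.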

						
re.search()

Searches for the pattern in the string.

							
MatchObject = search(pattern, string [,flags])

						
re.match()

Sees whether the initial characters of string match the pattern.

							
MatchObject = match(pattern, string [,flags])
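
The module-level functions take the pattern string as their first argument, so no explicit compile step is needed. The sketch below uses a made-up sample string.

```python
import re

text = "Spam and eggs"

# The pattern string comes first; the flags argument is optional.
found = re.search(r"EGGS", text, re.IGNORECASE)
not_at_start = re.match(r"eggs", text)
at_start = re.match(r"spam", text, re.IGNORECASE)
```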

						
re.findall()

Finds nonoverlapping matches of pattern in string.

							
MatchList = findall(pattern, string)

						
re.split()

Splits the string by the occurrences of pattern.

							
StringList = split(pattern, string [, maxsplit])

						
re.sub()

Substitutes the matches of pattern in string with newtext.

							
sub(pattern, newtext, string [, count])

						

The replacements are done count number of times, starting from the left side of string.

re.subn()

It is similar to sub(). However, it returns a tuple that contains the new string and the number of substitutions executed.

							
subn(pattern, newtext, string [, count = 0])

						
re.escape()

Returns string with a backslash inserted before each nonalphanumeric character, so that the result can be used as a literal inside a pattern.

							
newstring = escape(string)
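
Escaping matters most when the text to search for comes from outside the program and may contain special characters. The following sketch assumes an invented literal that contains several of them.

```python
import re

literal = "price: $5.00 (approx.)"

# Escape the text so that "$", ".", "(" and ")" are matched literally.
pattern = re.compile(re.escape(literal))
exact = pattern.search("The price: $5.00 (approx.) today")

# Used unescaped, the same text is a different pattern:
# "$" anchors the end of the string and the parentheses form a group.
unescaped = re.search(literal, "The price: $5.00 (approx.) today")
```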

						

Each RegExpObject also implements the following methods and attributes:

RegExpObject.flags—   Returns the flag arguments used at the compilation time of the regular expression object.

RegExpObject.groupindex—   Returns a dictionary that maps symbolic group names to group numbers.

RegExpObject.pattern—   Returns the object's original pattern string.
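
These attributes can be inspected on any compiled object. The sketch below uses a hypothetical date pattern with two named groups.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})", re.IGNORECASE)

source = date.pattern                    # the original pattern string
names = dict(date.groupindex)            # maps group names to group numbers
ignore_case_set = bool(date.flags & re.IGNORECASE)
```

Testing a flag with the bitwise-and operator is safer than comparing flags to a fixed number, because the interpreter may set additional flags on its own.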

Each MatchObject implements the following methods and attributes:

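MatchObject.group([groupid,…])—   Returns the text matched by the groups that you identify by name or number. With a single argument, the result is a string; with several arguments, it is a tuple containing the text matched by each of the groups. With no argument, it returns the whole match.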

MatchObject.groupdict()—   Returns a dictionary that contains all the named subgroups of the match.

MatchObject.groups()—   Returns a tuple that contains all the text matched by all groups.

MatchObject.start([group]) and MatchObject.end([group])—   Return the start and end positions of the substring matched by the group. The end position is the index just past the last matched character.

MatchObject.span([group])—   Returns a tuple that contains both the MatchObject.start and the MatchObject.end values.

MatchObject.pos and MatchObject.endpos—   Returns the pos and endpos values, which were passed to the function when creating it.

MatchObject.string—   Returns the string value, which was passed to the function when creating it.

MatchObject.re—   Returns the RegExpObject that was used to generate the MatchObject instance.
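
The sketch below exercises most of these methods and attributes on a single match. The date pattern and the sample sentence are assumptions made for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2002-01, revised later")

whole = m.group(0)                 # the entire match
year = m.group("year")             # one argument: a single string
both = m.group(1, 2)               # several arguments: a tuple
everything = m.groups()            # text of all groups, in order
by_name = m.groupdict()            # named groups as a dictionary
month_span = m.span("month")       # (start, end) of the month group

# start() and end() index into the original string, kept in m.string.
same_text = m.string[m.start():m.end()] == whole
made_by = m.re is date
```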

Special Note for Python 2.0 Users

The internals of the re module were completely rewritten in Python 2.0. The regular expression engine now lives in a new module called SRE, written by Fredrik Lundh of Secret Labs AB. The change allows Unicode strings to be used in regular expressions along with 8-bit strings. You keep importing re, which remains the front-end module and calls the SRE module internally.




Last updated on 1/30/2002
Python Developer's Handbook, © 2002 Sams Publishing


© 2002, O'Reilly & Associates, Inc.