Regular Expressions

We already know that the string module applies basic manipulation operations to strings. When you develop more advanced routines, however, you might need to extend Python's string-processing capabilities. That's when you should consider using the re module (re stands for regular expression).

Regular expressions are strings containing a mix of text and special characters that let you define complicated pattern-matching and replacement rules for other strings. Some of the special characters that make up regular expressions must be preceded by backslashes in order to be matched. Consequently, regular expressions are usually written as raw strings because they tend to use a lot of backslashes. That means that instead of writing "\\b(usa)\\d", it is much easier to say r"\b(usa)\d".

Older versions of Python supported the following regular expression modules, which are now obsolete: regexp, regex, and regsub.

If you need a full definition of the syntax, visit the following link:

http://www.python.org/doc/current/lib/re-syntax.html

Next, you have the regular expression flags. These flags can be combined with the bitwise OR operator (|) and passed to the re functions.
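As a small, hedged illustration of the two points above, the sketch below checks that the raw string and the doubled-backslash string spell the same pattern, and then combines two flags with the | operator. The sample sentence and the variable names are made up for this example.

```python
import re

# The raw string and the escaped string are the same pattern text.
same_pattern = (r"\b(usa)\d" == "\\b(usa)\\d")   # True

# Flags are OR-ed together and passed as the flags argument.
pattern = re.compile(r"\b(usa)\d", re.IGNORECASE | re.DOTALL)

# Hypothetical sample text; IGNORECASE lets "usa" match "USA".
found = pattern.search("Flight USA7 departs at noon")
code = found.group(1) if found else None         # 'USA'
```

Because each flag is an integer bit mask, any number of them can be combined in a single argument this way.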
Let's look at our first example of regular expressions. Suppose that you have the following conversation text:

    oldtext = """
    That terrible dead PARROT sketch must end!
    Oh, Come on! It is a terrific parrot joke.
    I agree, but I don't like to see dead parrot.
    Ok. I will suggest a new terrific parrot sketch."""

Okay. Now our challenge is to create an expression that is able to identify all the words "parrot" that
The word "parrot" that gets identified must be replaced with the word "spam". The following code is a possible solution for this problem:

     1: import re
     2: restring = re.compile(
     3:     r"""\b(terrible|terrific)
     4:     (?!dead)
     5:     (\s+
     6:     parrot
     7:     (?!joke)
     8:     \s+sketch)""",
     9:     re.DOTALL | re.IGNORECASE | re.VERBOSE)
    10: newline = restring.sub(r'\1 spam', oldtext)

We call the compile function (line 2), which generates a compiled regular expression object called restring. Then, we call that object's sub method (line 10) to substitute the matches found in the text variable that we have already defined (oldtext). The sub() method replaces the entire matched section of the string, so everything the pattern matched after the first group is replaced as well. Note that the r'\1 spam' argument uses \1 to make sure that the text captured by the first group of parentheses ("terrible" or "terrific") is placed right before the word "spam".

Regular Expression Functions and Object Methods

The re module implements just one exception, the error exception, which is raised only when a regular expression string is not valid.

Next, you have the list of available re functions.

re.compile()

Compiles a regular expression pattern string and generates a regular expression object.

    RegExpObject = compile(string [, flags])

For details about the flags argument, check out the previous list of available flags.

Every regular expression object exposes the following attributes and methods:

RegExpObject.search()

Searches for the compiled pattern in the string.

    MatchObject = RegExpObject.search(string [,startpos] [,endpos])

It uses the startpos and endpos arguments to delimit the range of the search.

All functions that are supposed to return a MatchObject on success return None on failure.

RegExpObject.match()

Checks whether the initial characters of string match the compiled pattern.

    MatchObject = RegExpObject.match(string [,startpos] [,endpos])

It uses the startpos and endpos arguments to delimit the scope of the matching.
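The difference between search() and match(), and the effect of the startpos and endpos arguments, can be sketched as follows. The pattern, the sample text, and the variable names are assumptions chosen for this illustration.

```python
import re

word = re.compile(r"parrot", re.IGNORECASE)
text = "dead PARROT sketch"   # hypothetical sample text

# search() scans forward and returns the first match anywhere in the string.
found = word.search(text)
found_span = found.span()            # (5, 11)

# match() succeeds only if the pattern matches at the starting position.
at_start = word.match(text)          # None: the text begins with "dead"
at_offset = word.match(text, 5)      # matches, because index 5 is where "PARROT" begins

# endpos cuts the string short: only text[0:8] ("dead PAR") is examined.
clipped = word.search(text, 0, 8)    # None
```

Since a failed call returns None, the result is normally tested before any of its methods are used.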
RegExpObject.findall()

Finds nonoverlapping matches of the compiled pattern in string.

    MatchList = RegExpObject.findall(string)

RegExpObject.split()

Splits the string by the occurrences of the compiled pattern.

    StringList = RegExpObject.split(string [, maxsplit])

RegExpObject.sub()

Substitutes the matches of pattern in string with newtext.

    RegExpObject.sub(newtext, string [, count])

At most count replacements are made, starting from the left side of string. Omitting the count argument does not mean "perform no substitutions"; it means "replace every match".

RegExpObject.subn()

It is similar to sub(). However, it returns a tuple that contains the new string and the number of substitutions executed. As with sub(), omitting the count argument replaces every match.

    RegExpObject.subn(newtext, string [, count])

re.search()

Searches for the pattern in the string.

    MatchObject = search(pattern, string [,flags])

re.match()

Checks whether the initial characters of string match the pattern.

    MatchObject = match(pattern, string [,flags])

re.findall()

Finds nonoverlapping matches of pattern in string.

    MatchList = findall(pattern, string)

re.split()

Splits the string by the occurrences of pattern.

    StringList = split(pattern, string [, maxsplit])

re.sub()

Substitutes the matches of pattern in string with newtext.

    sub(pattern, newtext, string [, count])

At most count replacements are made, starting from the left side of string.

re.subn()

It is similar to sub(). However, it returns a tuple that contains the new string and the number of substitutions executed.

    subn(pattern, newtext, string [, count = 0])

re.escape()

Escapes, with a backslash, all the nonalphanumeric characters of string.

    newstring = escape(string)

Each RegExpObject also implements the following methods and attributes:
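Two attributes that compiled objects are known to expose are pattern and flags. The sketch below shows them alongside the splitting, finding, substitution, and escaping calls described above. The sample text and all variable names are made up for illustration, and the optional arguments are passed by keyword, which is an assumption about style and not something the signatures above require.

```python
import re

text = "spam, eggs;spam  and spam"   # hypothetical sample text
sep = re.compile(r"[,;\s]+")

# Attributes of the compiled object.
source = sep.pattern                               # the original pattern string
is_case_blind = bool(sep.flags & re.IGNORECASE)    # False: flag was not given

# split(): break the string at every separator, or at most maxsplit times.
words = sep.split(text)                  # ['spam', 'eggs', 'spam', 'and', 'spam']
first_two = sep.split(text, maxsplit=2)  # ['spam', 'eggs', 'spam  and spam']

# findall(): every nonoverlapping match, as a list of strings.
hits = re.findall(r"spam", text)         # ['spam', 'spam', 'spam']

# sub(): count limits the replacements; without it, every match is replaced.
one = re.sub(r"spam", "ham", text, count=1)   # 'ham, eggs;spam  and spam'
every = re.sub(r"spam", "ham", text)          # 'ham, eggs;ham  and ham'

# subn(): same result as sub(), plus the number of substitutions made.
new_text, n = re.subn(r"spam", "ham", text)   # ('ham, eggs;ham  and ham', 3)

# escape(): make a literal string safe to use as a pattern.
literal = re.escape("1+1")                    # the "+" is now backslashed
plus_hits = re.findall(literal, "1+1=2 but 11 is not")   # ['1+1']
```

Without escape(), the "+" in "1+1" would act as a repetition operator and the pattern would match "11" instead of the literal text.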
Each MatchObject implements the following methods and attributes:
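As a hedged sketch of a few of them, the example below uses group(), groups(), start(), end(), and span(), all of which match objects in the re module are known to provide. The pattern and the sample sentence are assumptions chosen to echo the parrot example.

```python
import re

# Hypothetical pattern and text, for illustration only.
m = re.search(r"(terrible|terrific)\s+(parrot)",
              "a terrific parrot joke", re.IGNORECASE)

whole = m.group(0)       # 'terrific parrot': the entire match
adjective = m.group(1)   # 'terrific': text captured by the first group
both = m.groups()        # ('terrific', 'parrot'): all groups as a tuple

begin = m.start()        # 2: index where the match begins
finish = m.end()         # 17: index just past the end of the match
noun_span = m.span(2)    # (11, 17): start and end of the second group
```

Group 0 always refers to the whole match, while groups numbered from 1 follow the order of the opening parentheses in the pattern.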
Special Note for Python 2.0 Users
All the internals of the re module were changed in Python 2.0. The regular expression engine is now located in a new module called SRE, written by Fredrik Lundh of Secret Labs AB. The reason for the change was to allow Unicode strings to be used in regular expressions along with 8-bit strings. Note that the re module continues to be the front-end module, which internally calls the SRE module.
© 2002, O'Reilly & Associates, Inc.