Summary

This chapter showed how to use Python for data parsing and manipulation. You learned how to interpret XML, SGML, and HTML documents and how to parse and manipulate email messages, among other things. As you have seen, Python is an effective and productive tool for parsing and manipulating information from the Web.

Extensible Markup Language (XML) describes a class of data objects called XML documents and partially describes the behavior of the computer programs that process them. For those who want to experiment with XML in Python, the Python/XML package contains everything required for basic XML applications, along with documentation and sample code.

Besides that, the xmllib module serves as the basis for parsing text files formatted in XML. Note that xmllib is not XML 1.0 compliant and doesn't provide any Unicode support. It offers only simple XML support, with ASCII-only element and attribute names.

Many XML-based technologies are available for Python/XML development, such as

SAX—  This is a common event-based interface for object-oriented XML parsers.

The Document Object Model (DOM)—  This is a standard interface for manipulating XML and HTML documents, developed by the World Wide Web Consortium (W3C). 4DOM is a Python library for XML and HTML processing and manipulation that uses the W3C's Document Object Model as its interface.

XSLT—  This is an XML transformation processor based on the W3C's specification.

XML Bookmark Exchange Language (XBEL)—  This is an Internet "bookmarks" interchange format.

SOAP—  This is an XML/HTTP-based protocol for accessing services, objects, and servers in a platform-independent manner. Scarab is a minimal Python SOAP implementation.

PythonPoint—  This has a simple XML markup language for doing presentation slides and converting them to PDF documents.

Pyxie—  This is an Open Source XML processing library for Python.

XML-RPC—  This is a specification and a set of implementations that allow software running on different operating systems and in different environments to make procedure calls over the Internet. Python has its own implementation of XML-RPC.

XDR—  This is a standard for data description and encoding. Protocols such as RPC and NFS use XDR to describe the format of their data.
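To make the difference between the event-based SAX style and the tree-based DOM style concrete, here is a minimal sketch. It assumes Python 3 and the standard library's xml.sax and xml.dom.minidom modules rather than the Python/XML package or 4DOM. The sample document and the names LinkHandler, hrefs_with_sax, and titles_with_dom are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, loosely modeled on a bookmark list.
SAMPLE = (
    "<bookmarks>"
    "<link href='http://www.python.org'>Python</link>"
    "<link href='http://www.w3.org'>W3C</link>"
    "</bookmarks>"
)


class LinkHandler(xml.sax.ContentHandler):
    """SAX style: react to each start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Called once per opening tag; attrs behaves like a mapping.
        if name == "link":
            self.hrefs.append(attrs.get("href"))


def hrefs_with_sax(text):
    """Return the href of every <link> element using SAX events."""
    handler = LinkHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.hrefs


def titles_with_dom(text):
    """DOM style: build the whole tree first, then query it."""
    doc = minidom.parseString(text)
    return [node.firstChild.data for node in doc.getElementsByTagName("link")]


if __name__ == "__main__":
    print(hrefs_with_sax(SAMPLE))
    print(titles_with_dom(SAMPLE))
```

The SAX version never holds the whole document in memory, which suits large inputs. The DOM version is usually easier to work with when you need to navigate or modify the document after parsing.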

But Python's markup support is not limited to XML. It also handles other markup languages.

The sgmllib module implements a parser for a subset of SGML (Standard Generalized Markup Language). Although its implementation is simple, it is powerful enough to serve as the basis for the HTML parser.

The htmllib module defines a parser class that can serve as a base for parsing text files formatted in HTML. Two helper modules are used by htmllib:

  • The htmlentitydefs module defines a dictionary that contains all the definitions for the general entities defined by HTML 2.0.

  • The formatter module is used for generic output formatting by the HTMLParser class of the htmllib module.
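The subclass-and-override pattern that htmllib uses can be sketched as follows. Note the assumptions: htmllib and sgmllib are not available in Python 3, so this sketch uses the html.parser module instead, and html.entities is the Python 3 name for htmlentitydefs. The class name LinkCollector and the function parse_html are hypothetical.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets and visible text from an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into
        # the corresponding characters before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def parse_html(source):
    """Return (list of hrefs, concatenated text) for an HTML string."""
    parser = LinkCollector()
    parser.feed(source)
    parser.close()
    return parser.links, "".join(parser.text)


if __name__ == "__main__":
    page = '<p>Caf&eacute; <a href="http://www.python.org">Python</a></p>'
    print(parse_html(page))
    # The entity table maps entity names to Unicode code points.
    print(name2codepoint["eacute"])
```

As with htmllib, the parser itself does nothing useful until you override the handler methods for the tags and data you care about.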

Apart from markup languages, this chapter also covered mail message manipulation.

MIME (Multipurpose Internet Mail Extensions) is a standard for sending multi-part multimedia data through Internet mail. This standard defines mechanisms for specifying and describing the format of Internet message bodies. Python provides many modules to support MIME messages, including the following:

mimetools—  Provides utility tools for parsing and manipulation of MIME multi-part and encoded messages.

MimeWriter—  Implements a generic file-writing class that is used to create MIME encoded multi-part files (messages).

multifile—  Enables you to treat distinct parts of a text file as file-like input objects.

mailcap—  Reads mailcap files and configures how MIME-aware applications react to files with different MIME types.

mimetypes—  Supports conversions between a filename or URL and the MIME type associated with the filename extension.

quopri—  Performs quoted-printable transport encoding and decoding of MIME quoted-printable data.

mailbox—  Implements classes that allow easy and uniform read access to the various mailbox formats found on UNIX systems.

mimify—  Contains functions to convert and process simple and multi-part mail messages to/from MIME format.

rfc822—  Parses mail headers that are defined by the Internet standard RFC 822.
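As a rough illustration of what these modules do, the sketch below parses headers, decodes a quoted-printable body, and looks up a MIME type. It assumes Python 3, where the email package covers the ground of rfc822, mimetools, and MimeWriter, which are no longer available there. The mimetypes and quopri modules are used under their original names. The message text, addresses, and the function name summarize are invented for the example.

```python
import email
import mimetypes
import quopri

# Hypothetical message with a quoted-printable body in ISO-8859-1.
RAW_MESSAGE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Status report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Caf=E9 opens at 9.\n"
)


def summarize(raw):
    """Return (subject, content type, decoded body) for a raw message."""
    msg = email.message_from_string(raw)
    # decode=True undoes the Content-Transfer-Encoding and yields bytes.
    body_bytes = msg.get_payload(decode=True)
    body = body_bytes.decode(msg.get_content_charset())
    return msg["Subject"], msg.get_content_type(), body


if __name__ == "__main__":
    print(summarize(RAW_MESSAGE))

    # mimetypes maps a filename extension to a (type, encoding) pair.
    print(mimetypes.guess_type("index.html"))

    # quopri works on bytes: non-ASCII bytes become =XX escapes.
    encoded = quopri.encodestring("Caf\u00e9".encode("iso-8859-1"))
    print(encoded, quopri.decodestring(encoded))
```

Headers are read with dictionary-style access, and the lookup is case-insensitive, so msg["subject"] returns the same value as msg["Subject"].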

Python uses the following modules for general data conversions:

netrc—  Parses, processes, and encapsulates the .netrc configuration file format used by the UNIX ftp program and other FTP clients.

mhlib—  Provides a Python interface to access MH folders, mailboxes, and their contents.

base64—  Performs base64 encoding and decoding of arbitrary binary strings into text strings that can be safely emailed or posted.

binhex—  Encodes and decodes files in binhex4 format. This format is commonly used to represent files on Macintosh systems.

uu—  Encodes and decodes files in uuencode format.

binascii—  Implements methods to convert data between binary and various ASCII-encoded binary representations, including binhex, uu, and base64.
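The binary-to-text conversions listed above can be tried out with a short sketch like the following. It assumes Python 3, where these functions take and return bytes objects rather than plain strings. The helper names to_base64_text, from_base64_text, and uu_roundtrip are hypothetical.

```python
import base64
import binascii


def to_base64_text(data):
    """Encode arbitrary bytes as an ASCII-only base64 string."""
    return base64.b64encode(data).decode("ascii")


def from_base64_text(text):
    """Recover the original bytes from a base64 string."""
    return base64.b64decode(text)


def uu_roundtrip(data):
    """Encode one uuencode line with binascii and decode it again.

    binascii.b2a_uu() handles a single line of at most 45 bytes;
    the uu module splits whole files into such lines.
    """
    line = binascii.b2a_uu(data)
    return binascii.a2b_uu(line)


if __name__ == "__main__":
    payload = b"hello"
    text = to_base64_text(payload)
    print(text)
    print(from_base64_text(text))
    print(uu_roundtrip(payload))
    # binascii exposes the same base64 conversion at a lower level;
    # its output ends with a newline.
    print(binascii.b2a_base64(payload))
```

In practice you would normally reach for base64 rather than binascii for base64 work. The binascii functions are the low-level building blocks that the higher-level modules rely on.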


Last updated on 1/30/2002
Python Developer's Handbook, © 2002 Sams Publishing

