XML Processing

The first standard that you will learn how to manipulate in Python is XML. The Web already has a standard for defining markup languages like HTML, which is called SGML. HTML is actually defined in SGML. SGML could have been used as this new standard, and browsers could have been extended with SGML parsers. However, SGML is quite complex to implement and contains a lot of features that are very rarely used. SGML is much more than a Web standard because it was around long before the Web. HTML is an application of SGML, and XML is a subset. SGML also lacks character set support, and it is difficult to interpret an SGML document without having the definition of the markup language (the DTD, or Document Type Definition) available. Consequently, it was decided to develop a simplified version of SGML, which was called XML. The main point of XML is that you, by defining your own markup language, can encode the information of your documents more precisely than is possible with HTML. This means that programs processing these documents can "understand" them much better and therefore process the information in ways that are impossible with HTML (or ordinary text processor documents).

Introduction to XML

The Extensible Markup Language (XML) is a subset of SGML. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. XML describes a class of data objects called XML documents and partially describes the behavior of computer programs that process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language (ISO 8879). By construction, XML documents are conforming SGML documents. An XML parser can check whether an XML document is well-formed without the aid of a DTD.
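A minimal sketch of that last point, assuming Python 3 and the expat bindings in its standard library (xml.parsers.expat). The helper name is_well_formed() and the two sample strings are hypothetical; the check only verifies that the markup is well-formed and never consults a DTD.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, error message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells expat that this is the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return False, str(err)
    return True, None


# Properly nested tags: well-formed, even though no DTD is available.
good = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</NAME></CLIENT></SURVEY>"
# The NAME element is never closed: not well-formed.
bad = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</CLIENT></SURVEY>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Checking validity (whether the document also obeys the rules of a DTD) is a separate step that requires a validating parser.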
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure. A software module called an XML parser is used to read XML documents and provide access to their content and structure. It is assumed that an XML parser is doing its work on behalf of another module, called the application. The XML specification describes the required behavior of an XML parser in terms of how it must read XML data and the information it must provide to the application. For more information, check out

Writing an XML File

As you can see next, it is simple to define your own markup language with XML. The next block of code is the content of a file called survey.xml. This code defines a specific markup language for a given survey.

    <!DOCTYPE SURVEY SYSTEM "SURVEY.DTD">
    <SURVEY>
      <CLIENT>
        <NAME> Lessaworld Corp. </NAME>
        <LOCATION> Pittsburgh, PA </LOCATION>
        <CONTACT> Andre Lessa </CONTACT>
        <EMAIL> webmaster@lessaworld.com </EMAIL>
        <TELEPHONE> (412)555-5555 </TELEPHONE>
      </CLIENT>
      <SECTION SECTION_ID="1">
        <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
          <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
          <Op1>Python</Op1>
          <Op2>Perl</Op2>
        </QUESTION>
        <QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
          <QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
          <Op1>Yes</Op1>
          <Op2>No</Op2>
        </QUESTION>
        <QUESTION QUESTION_ID="3" QUESTION_LEVEL="1">
          <QUESTION_DESC>Did you expect the Spanish inquisition?</QUESTION_DESC>
          <Op1>No</Op1>
          <Op2>Of course not</Op2>
        </QUESTION>
      </SECTION>
    </SURVEY>

In order to complement the XML markup language shown previously, we need a Document Type Definition (DTD), just like the following one. The DTD can be part of the XML file, or it can be stored as an independent file, as we are doing here. Note the first line of the XML file, where we are passing the name of the DTD file (survey.dtd). Also, the XML community appears to be moving toward XML Schemas rather than DTDs.

    <!ELEMENT SURVEY (CLIENT, SECTION+)>
    <!ELEMENT CLIENT (NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT LOCATION (#PCDATA)>
    <!ELEMENT CONTACT (#PCDATA)>
    <!ELEMENT EMAIL (#PCDATA)>
    <!ELEMENT TELEPHONE (#PCDATA)>
    <!ELEMENT SECTION (QUESTION+)>
    <!ELEMENT QUESTION (QUESTION_DESC, Op1, Op2)>
    <!ELEMENT QUESTION_DESC (#PCDATA)>
    <!ELEMENT Op1 (#PCDATA)>
    <!ELEMENT Op2 (#PCDATA)>
    <!ATTLIST SECTION SECTION_ID CDATA #IMPLIED>
    <!ATTLIST QUESTION QUESTION_ID CDATA #IMPLIED
                       QUESTION_LEVEL CDATA #IMPLIED>

Now, let's understand how a DTD works.
For a simple example, like this one, we need two special tags called <!ELEMENT> and <!ATTLIST>. The <!ELEMENT> definition tag is used to define the elements present in the XML file. The general syntax is

    <!ELEMENT NAME CONTENTS>

The first argument (NAME) gives the name of the element, and the second one (CONTENTS) lists the element names that are allowed to be underneath the element that we are defining. The order in which we list the contents is important. When we say, for example,

    <!ELEMENT SURVEY (CLIENT, SECTION+)>

it means that we must have a CLIENT first, followed by a SECTION. Note that we have a special character (the plus sign) just after the second element in the content list. This character, as well as some others, has a special meaning:

    +   the element must appear one or more times
    *   the element may appear zero or more times
    ?   the element is optional (zero or one occurrence)
Note
These characters have similar meanings to what they do in regular expressions. (Of course, not everything you use in an re can be used in a DTD.) Note that #PCDATA (parsed character data) indicates that an element contains text, that is, it carries the actual information. <!ATTLIST>, the other definition tag in the example, defines the attributes of an element. In our DTD, we have three attributes, one for SECTION, and two for QUESTION.

An important difference between XML and SGML is that elements in XML that do not have any contents (like <IMG> and <BR> of HTML) are written like this in XML:

    <IMG SRC="stuff.gif"/>

or in an equivalent format, such as

    <img src="stuff.gif"></img>

Note the slash before the final > in the first form. This means that a program can read the document without knowing the DTD (which is where it says that IMG does not have any contents) and still know that IMG has no separate end tag and that whatever follows IMG is not inside the element.

For more information about XML and Python, check out the XML package. It comes with a Python XML-HOWTO in the doc directory, and very good examples: http://www.python.org/sigs/xml-sig/status.html

Python XML Package

For those who want to play around with XML in Python, there is a Python/XML package that serves several purposes at once. This package contains everything required for basic XML applications, along with documentation and sample code; basically, something easy to compile and install. A release candidate of the latest release of this package is now available as PyXML-0.5.5.tar.gz, dated June 5, 2000. This version contains SAX, the Pyexpat module, sgmlop, the prototype DOM code, and xmlproc, an XML parser written in Python. The individual components contained in the Python/XML package include
The document called Python/XML Reference Guide is the reference manual for the Python/XML package, containing descriptions for several XML modules. For more information, check out the following sites:
Python 2.0 was released with many enhancements to its XML support, including a SAX2 interface and a redesigned DOM interface as part of the xml package. Note that the xml package that is shipped with Python 2.0 contains just a basic set of options for XML development. If you want (or need) to use the full XML package, you should install PyXML. The PyXML distribution also uses the xml package name. That is why PyXML versions 0.6.0 or greater can be used to replace the xml package that is bundled with Python. By doing so, you extend the set of XML functionality available to you. That includes

xmllib

The xmllib module defines a class XMLParser, which serves as the basis for parsing text files formatted in XML. Note that xmllib is not XML 1.0 compliant, and it doesn't provide any Unicode support. It provides simple XML support for ASCII-only element and attribute names. Of course, it probably handles UTF-8 character data without problems.

XMLParser()

The XMLParser class must be instantiated without arguments. This class provides the following interface methods and instance variables:

XML Namespaces

The xmllib module has support for XML namespaces as defined in the XML namespaces proposed recommendation. Tag and attribute names that are defined in an XML namespace are handled as if the name of the tag or attribute consisted of the namespace (that is, the URL that defines the namespace) followed by a space and the name of the tag or attribute. For instance, the tag <html xmlns:html="http://www.w3.org/TR/REC-html40"> is treated as if the tag name was "http://www.w3.org/TR/REC-html40 html", and the tag <html:a href="http://frob.com"> inside the previous element is treated as if the tag name were "http://www.w3.org/TR/REC-html40 a" and the attribute name as if it were "http://www.w3.org/TR/REC-html40 href". An older draft of the XML namespaces proposal is also recognized, but using it triggers a warning.

XML Examples

The next example uses xmllib to parse an XML file. The file being used is the same survey.xml that you saw in the beginning of this chapter. Our goal is to read the file, parse it, and convert it to a structure such as the following:

    Survey of section number 1

    1- What is your favorite language?
        Python
        Perl

    2- Do you use this language at work?
        Yes
        No

    3- Did you expect the Spanish inquisition?
        No
        Of course not

The following code implements a solution for our problem. Remember that XML tags are case sensitive, so the handler method names in the code must match the case of the tag names exactly. In this code, note that attributes are passed to the tag handlers in a dictionary, not in a tuple.

    import xmllib, string

    class myparser(xmllib.XMLParser):
        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.currentquestiondesc = ''
            self.currentOp1 = ''
            self.currentOp2 = ''
            self.currentquestion = ''
            self.currentdata = []

        # Character data arrives in chunks; collect them until the end tag
        def handle_data(self, data):
            self.currentdata.append(data)

        def start_SURVEY(self, attrs):
            print "Survey of section number ",

        def end_SURVEY(self):
            pass

        def start_SECTION(self, attrs):
            print attrs['SECTION_ID']

        def end_SECTION(self):
            pass

        def start_QUESTION(self, attrs):
            self.currentquestion = attrs['QUESTION_ID']

        # All parts of the question have been seen; print them
        def end_QUESTION(self):
            print """
    %(currentquestion)s- %(currentquestiondesc)s
        %(currentOp1)s
        %(currentOp2)s
    """ % self.__dict__

        # Each text-carrying element resets the buffer on its start tag
        # and stores the joined text on its end tag
        def start_QUESTION_DESC(self, attrs):
            self.currentdata = []

        def end_QUESTION_DESC(self):
            self.currentquestiondesc = string.join(self.currentdata, '')

        def start_Op1(self, attrs):
            self.currentdata = []

        def end_Op1(self):
            self.currentOp1 = string.join(self.currentdata, '')

        def start_Op2(self, attrs):
            self.currentdata = []

        def end_Op2(self):
            self.currentOp2 = string.join(self.currentdata, '')

    if __name__ == "__main__":
        filehandle = open("survey.xml")
        data = filehandle.read()
        filehandle.close()

        parser = myparser()
        parser.feed(data)
        parser.close()

Let's see another example. The next one opens our survey.xml file and lists all the questions available. It also tries to find question #4, but because we don't have it, the program reports that to the user.

    import xmllib

    # Raised when the whole section was read without finding question #4
    class QuestionNotFound(Exception):
        pass

    class Parser(xmllib.XMLParser):
        def __init__(self, fileobj=None):
            self.found = 0
            xmllib.XMLParser.__init__(self)
            if fileobj:
                self.load(fileobj)

        # Feed the parser from an open file object, 1KB at a time
        def load(self, fileobj):
            while 1:
                xmldata = fileobj.read(1024)
                if not xmldata:
                    break
                self.feed(xmldata)
            self.close()

        def start_QUESTION(self, attrs):
            question_id = attrs.get("QUESTION_ID")
            print "I found Question #" + question_id
            if question_id == "4":
                self.found = 1

        def end_SECTION(self):
            if not self.found:
                raise QuestionNotFound

    try:
        MyParser = Parser()
        MyParser.load(open("survey.xml"))
    except QuestionNotFound:
        print "I couldn't find Question #4 !!!"

The SAX API

SAX is a common event-based interface for object-oriented XML parsers. The Simple API for XML isn't a standard in the formal sense, but an informal specification designed by David Megginson, with input from many people on the XML-DEV mailing list. SAX defines an event-driven interface for parsing XML. To use SAX, you must create Python class instances that implement a specified interface, and the parser will then call various methods of those objects.

SAX is most suitable for purposes in which you want to read through an entire XML document from beginning to end, and perform some computation, such as building a data structure representing a document, or summarizing information in a document (computing an average value of a certain element, for example). It isn't very useful if you want to modify the document structure in some complicated way that involves changing how elements are nested, though it could be used if you simply want to change element contents or attributes. For example, you would not want to re-order chapters in a book using SAX, but you might want to change the contents of any name elements with the attribute lang equal to greek into Greek letters. Of course, if this is an XML file, we would use the standard attribute xml:lang rather than just lang to store the language.
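The event-driven style described above can be sketched with the xml.sax module found in the current Python 3 standard library. The class name QuestionCollector and the inline SURVEY string (a trimmed copy of survey.xml) are illustrative assumptions, not part of the original examples.

```python
import xml.sax

# A trimmed, inline copy of survey.xml so that the sketch is self-contained.
SURVEY = """<SURVEY>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1><Op2>Perl</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
      <QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
      <Op1>Yes</Op1><Op2>No</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>"""


class QuestionCollector(xml.sax.ContentHandler):
    """Collect (question id, description) pairs and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.questions = []
        self._id = None
        self._text = None  # a list while inside QUESTION_DESC, otherwise None

    def startElement(self, name, attrs):
        if name == "QUESTION":
            self._id = attrs.get("QUESTION_ID")
        elif name == "QUESTION_DESC":
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "QUESTION_DESC":
            self.questions.append((self._id, "".join(self._text)))
            self._text = None


handler = QuestionCollector()
xml.sax.parseString(SURVEY.encode("utf-8"), handler)
for qid, desc in handler.questions:
    print(qid, desc)
```

The handler never sees the document as a whole; it only reacts to the start-tag, text, and end-tag events it cares about, which is why elements such as Op1 and Op2 cost nothing beyond the method call.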
One advantage of SAX is its speed and simplicity. There is no need to expend effort examining elements that are irrelevant to your application. You can therefore write a handler class that ignores all elements that aren't what you need. Another advantage is that you don't have the whole document resident in memory at any one time, which matters if you are processing huge documents. SAX defines four basic interfaces; a SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to implement those interfaces relevant to your application. The SAX interfaces are as follows:
Because Python doesn't support the concept of interfaces, the previous interfaces are implemented as Python classes. The default method implementations are defined to do nothing (the method body is just a Python pass statement), so usually you can simply ignore methods that aren't relevant to your application. The one big exception is the ErrorHandler interface; if you don't provide methods that print a message or otherwise take some action, errors in the XML data will be silently ignored. This is almost certainly not what you want your application to do, so always implement at least the error() and fatalError() methods. xml.sax.saxutils provides an ErrorPrinter class that sends error messages to standard error, and an ErrorRaiser class that raises an exception for any warnings or errors.

Pseudo-code for using SAX looks similar to the following:

    import sys

    # Define your specialized handler classes
    from xml.sax import saxlib

    class docHandler(saxlib.DocumentHandler):
        …

    # Create an instance of the handler classes
    dh = docHandler()

    # Create an XML parser
    parser = …

    # Tell the parser to use your handler instance
    parser.setDocumentHandler(dh)

    # Parse the file; your handler's methods will get called
    parser.parseFile(sys.stdin)

    # Close the parser
    parser.close()

For more information, check out the following sites:

DOM: The Document Object Model

The Document Object Model (DOM) is a standard interface for manipulating XML and HTML documents developed by the World Wide Web Consortium (W3C). 4DOM is a Python library developed by FourThought LLC for XML and HTML processing and manipulation using the W3C's Document Object Model as its interface. 4DOM supports all of DOM level 1 (core and HTML), as well as core, HTML and Document Traversal from level 2. 4DOM also adds some helper components for DOM tree creation and printing, Python integration, whitespace manipulation, and so on. 4DOM is designed to allow developers to rapidly design applications that read, write, or manipulate HTML and XML. Check out http://www.fourthought.com/4Suite/4DOM/

XSL Transformations (XSLT)

This W3C specification defines the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents. XSLT is designed for use as part of XSL, which is a stylesheet language for XML. In addition to XSLT, XSL includes an XML vocabulary for specifying formatting. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another XML document that uses the formatting vocabulary. XSLT is also designed to be used independently of XSL. However, XSLT is not intended as a completely general-purpose XML transformation language. Rather, it is designed primarily for the kinds of transformations that are needed when XSLT is used as part of XSL. XSLT is also good for transforming some custom XML format into XHTML that can be displayed by a browser, for instance. For more information, check out

4XSLT is an XML transformation processor for the XSLT transform language, based on the W3C's specification and written by FourThought LLC. Currently, 4XSLT supports a subset of the final recommendation of XSLT.
For more information, check out the site: http://www.fourthought.com/4Suite/4XSLT/

XBEL: XML Bookmark Exchange Language

The XML Bookmark Exchange Language, or XBEL, is an Internet bookmarks interchange format. It was designed by the Python XML Special Interest Group on the group's mailing list. It grew out of an idea for a demonstration of using Python for XML processing. Mark Hammond contributed the original idea, and other members of the SIG chimed in to add support for their favorite browser features. After debate that deviated from the original idea, compromises were reached that allow XBEL to be a useful language for describing bookmark data for a range of browsers, including the major browsers and a number of less widely used browsers. At that point, the formal DTD was finalized and documentation was written. The formal DTD and the documentation are available online at the following sites:

http://www.python.org/topics/xml/xbel/
http://www.python.org/topics/xml/xbel/docs/html/xbel.html

Supporting software is provided as part of the Python XML package. This software is located in the demo/xbel/ directory of the distribution. This includes command-line programs for converting XBEL instances to other common formats, including the Navigator and Internet Explorer formats. Note that the current release of the Grail Internet browser from CNRI supports XBEL as a native bookmarks format. The script, created by Jürgen Hermann, on the following site, checks the URLs in an XBEL document: http://cscene.org/%7ejh/xml/bookmarks/checkurls.py

RPC: What Is It?

A Remote Procedure Call (RPC) uses the ordinary procedure call mechanism that is familiar to every user in order to hide the intricacies of the network. A client process calls a function on a remote server and suspends itself until it gets back the results. Parameters are passed the same as in any ordinary procedure. The RPC, similar to an ordinary procedure, is synchronous; clients and servers must run concurrently.
Servers must keep up with clients. The process (or thread) that issues the call waits until it gets the results. Behind the scenes, the RPC runtime software collects values for the parameters, forms a message, and sends it to the remote server. (Note that servers must first come up before clients can talk to them.) The server receives the request, unpacks the parameters, calls the procedure, and sends the reply back to the client. Asynchronous processing is limited because it requires threads and tricky code for managing threads.

A procedure call consists of the name of a procedure, its parameters, and the result it returns. Procedure calls are fundamental to how software is organized. Every program is just a single procedure called main; every operating system has a main procedure called a kernel. There's a top level to every program that sits in a loop waiting for something to happen and then distributes control to a hierarchy of procedures that respond. This is at the heart of interactivity and networking; it's at the heart of software. RPC is a very simple extension to the procedure call idea; it says, "let's create connections between procedures that are running in different applications or on different machines." Conceptually, there's no difference between a local procedure call and a remote one, but they are implemented differently, perform differently (RPC is much slower), and therefore are used for different things.

Remote calls are marshaled into a format that can be understood on the other side of the connection. As long as two machines agree on a format, they can talk to each other. That's why Windows machines can be networked with other Windows machines, Macs can talk to Macs, and so on. The value in a standardized cross-platform format for RPC is that it allows UNIX machines to talk to Windows machines and vice versa. A number of formats are possible. One possible format is XML. XML-RPC uses XML as the marshaling format.
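As a rough illustration of marshaling, the following sketch uses the xmlrpc.client module from the Python 3 standard library to turn a call into XML and back again. The procedure name survey.getQuestion and its arguments are hypothetical, and no server is contacted; only the wire format is shown.

```python
import xmlrpc.client

# Client side: marshal a call to a hypothetical remote procedure
# survey.getQuestion(1, "en") into an XML-RPC request document.
request = xmlrpc.client.dumps((1, "en"), methodname="survey.getQuestion")
print(request)

# Server side: unmarshal the request back into Python values.
params, method = xmlrpc.client.loads(request)
print(method, params)

# Server side: marshal the return value as an XML-RPC response.
response = xmlrpc.client.dumps(("What is your favorite language?",),
                               methodresponse=True)

# Client side: unmarshal the response; a response carries no method name.
result, no_method = xmlrpc.client.loads(response)
print(result[0])
```

Because both documents are plain XML text, any platform that can parse XML can take either side of the exchange.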
XML-RPC allows Macs to easily make procedure calls to software running on Windows machines and BeOS machines, as well as all flavors of UNIX and Java, IBM mainframes, PDAs, and so on. With XML it's easy to see what a call is doing, and it's also relatively easy to marshal the internal procedure call format into a remote format.

Simple Object Access Protocol (SOAP)

SOAP is an XML/HTTP-based protocol for accessing services, objects, and servers in a platform-independent manner. For more information, check out

A minimal Python SOAP implementation is located at

This module is derived in part from Andrew Kuchling's xml.marshal code. It implements the SOAP "" serialization using the same API as pickle.py (dump/load).

Scarab is an Open Source communications library implementing protocols, formats, and interfaces for writing distributed applications, with an emphasis on low-end and lightweight implementations. Users can combine Scarab module implementations to build a messaging system to fit their needs, scaling from very simple messaging or data transfer all the way up to where CORBA can take over. Scarab implementations include support for such areas as distributed objects, remote procedure calls, XML messages, TCP transport, and HTTP transport.

PythonPoint

The ReportLab package contains a demo called PythonPoint, which has a simple XML format for presentation slides and can convert them to PDF documents, along with imaginative presentation effects. The demo script that is provided on the Web site illustrates how easily complex XML can be translated into useful PDF. The demo output, pythonpoint.pdf, demonstrates some of the more exotic PDF capabilities: http://www.reportlab.com/demos/demos.html

Pyxie

Pyxie is an Open Source XML processing library for Python developed by Sean McGrath. He has also written a book called XML Processing with Python for Prentice Hall. The book contains a description of the Pyxie library and many sample programs.
Pyxie is heavily based on a line-oriented notation for parsed XML known as PYX. Pyxie includes utilities, known as xmln and xmlv, that generate PYX. PYX is independent of Python and a number of programs processing PYX have appeared in Java, Perl, and JavaScript: http://www.digitome.com/pyxie.html
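To give a feel for the line-oriented notation, here is a minimal sketch that produces PYX-style output with the xml.sax module from the Python 3 standard library. It does not use Pyxie or its xmln utility. It assumes the commonly described PYX conventions: one event per line, with "(" for a start tag, "A" for an attribute, "-" for character data, and ")" for an end tag. The names PyxWriter and to_pyx are hypothetical, and only newline escaping is handled.

```python
import xml.sax


class PyxWriter(xml.sax.ContentHandler):
    """Record each SAX event as one PYX-style line (assumed conventions)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        # Attributes follow the start-tag line, one per line.
        for key in attrs.getNames():
            self.lines.append("A%s %s" % (key, attrs.getValue(key)))

    def characters(self, content):
        # Keep each event on a single line by escaping newlines.
        self.lines.append("-" + content.replace("\n", "\\n"))

    def endElement(self, name):
        self.lines.append(")" + name)


def to_pyx(xml_text):
    """Return the PYX-style lines for an XML document given as a string."""
    handler = PyxWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


pyx = to_pyx('<QUESTION QUESTION_ID="1"><Op1>Python</Op1></QUESTION>')
print("\n".join(pyx))
```

Because every event sits on its own line, output in this style can be filtered with ordinary line-oriented text tools, which helps explain why PYX processors could be written easily in other languages.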
© 2002, O'Reilly & Associates, Inc.