
XML Processing

The first standard that you will learn how to manipulate in Python is XML.

The Web already has a standard for defining markup languages like HTML, which is called SGML. HTML is actually defined in SGML. SGML could have been used as this new standard, and browsers could have been extended with SGML parsers. However, SGML is quite complex to implement and contains a lot of features that are very rarely used.

SGML is much more than a Web standard because it was around long before the Web. HTML is an application of SGML, and XML is a subset of it.

SGML also lacks character set support, and it is difficult to interpret an SGML document without having the definition of the markup language (the DTD, or Document Type Definition) available.

Consequently, it was decided to develop a simplified version of SGML, which was called XML. The main point of XML is that you, by defining your own markup language, can encode the information of your documents more precisely than is possible with HTML. This means that programs processing these documents can "understand" them much better and therefore process the information in ways that are impossible with HTML (or ordinary text processor documents).

Introduction to XML

The Extensible Markup Language (XML) is a subset of SGML. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

XML describes a class of data objects called XML documents and partially describes the behavior of computer programs that process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language (ISO 8879). By construction, XML documents are conforming SGML documents. An XML parser can check whether an XML document is well-formed without the aid of a DTD.
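As a rough illustration of checking well-formedness without a DTD, the sketch below uses the Expat binding from Python's standard library (xml.parsers.expat). The helper name is_well_formed and the two sample strings are made up for this example.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, error message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as exc:
        return False, str(exc)
    return True, None


# Every start tag has a matching end tag: well-formed.
print(is_well_formed("<SURVEY><CLIENT>Lessaworld Corp.</CLIENT></SURVEY>"))
# CLIENT is never closed: not well-formed, and no DTD was needed to tell.
print(is_well_formed("<SURVEY><CLIENT>Lessaworld Corp.</SURVEY>"))
```

Expat is a non-validating parser, so this check says nothing about whether the document follows a particular DTD.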

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup, such as the tags that delimit elements. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML parser is used to read XML documents and provide access to their content and structure. It is assumed that an XML parser is doing its work on behalf of another module, called the application. The XML specification describes the required behavior of an XML parser in terms of how it must read XML data and the information it must provide to the application. For more information, check out

Extensible Markup Language (XML) Recommendation

W3C Recommendation—Extensible Markup Language (XML) 1.0

http://www.w3.org/TR/REC-xml.html

Writing an XML File

As you can see next, it is simple to define your own markup language with XML. The next block of code is the content of a file called survey.xml. This code defines a specific markup language for a given survey.

						
<!DOCTYPE SURVEY SYSTEM  "SURVEY.DTD">
<SURVEY>
  <CLIENT>
     <NAME>         Lessaworld Corp.           </NAME>
     <LOCATION>     Pittsburgh, PA             </LOCATION>
     <CONTACT>      Andre Lessa                </CONTACT>
     <EMAIL>        webmaster@lessaworld.com   </EMAIL>
     <TELEPHONE>   (412)555-5555                </TELEPHONE>
  </CLIENT>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1>
      <Op2>Perl</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
      <QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
      <Op1>Yes</Op1>
      <Op2>No</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="3" QUESTION_LEVEL="1">
      <QUESTION_DESC>Did you expect the Spanish inquisition?</QUESTION_DESC>
      <Op1>No</Op1>
      <Op2>Of course not</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>
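
To give a feel for how a program can use such a file, here is a minimal sketch that reads a shortened copy of the survey with xml.etree.ElementTree from Python's standard library. That module is an assumption of this example, not something the chapter relies on, and the function name summarize is hypothetical. The DOCTYPE line is left out so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# A shortened copy of survey.xml, embedded so the example runs on its own.
SURVEY_XML = """\
<SURVEY>
  <CLIENT>
    <NAME>  Lessaworld Corp.  </NAME>
  </CLIENT>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1>
      <Op2>Perl</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>
"""


def summarize(xml_text):
    """Return the client name and a list of (id, description, options)."""
    root = ET.fromstring(xml_text)
    # Whitespace inside <NAME> is part of the data, so strip it here.
    client = root.findtext("CLIENT/NAME", default="").strip()
    questions = []
    for question in root.iter("QUESTION"):
        options = [question.findtext("Op1"), question.findtext("Op2")]
        questions.append((question.get("QUESTION_ID"),
                          question.findtext("QUESTION_DESC"),
                          options))
    return client, questions


client, questions = summarize(SURVEY_XML)
print(client)
for question_id, description, options in questions:
    print(question_id, description, options)
```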

					

To complement the XML markup language shown previously, we need a Document Type Definition (DTD), such as the following one. The DTD can be part of the XML file, or it can be stored as an independent file, as we are doing here. Note the first line of the XML file, which names the DTD file (SURVEY.DTD). Also note that the XML community appears to be moving toward XML Schemas rather than DTDs.

						
<!ELEMENT SURVEY      (CLIENT, SECTION+)>

<!ELEMENT CLIENT (NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)>
<!ELEMENT NAME         (#PCDATA)>
<!ELEMENT LOCATION     (#PCDATA)>
<!ELEMENT CONTACT      (#PCDATA)> 
<!ELEMENT EMAIL        (#PCDATA)>
<!ELEMENT TELEPHONE    (#PCDATA)>

<!ELEMENT SECTION     (QUESTION+)>
<!ELEMENT QUESTION    (QUESTION_DESC, Op1, Op2)>

<!ELEMENT QUESTION_DESC   (#PCDATA)>
<!ELEMENT Op1              (#PCDATA)>
<!ELEMENT Op2              (#PCDATA)>

<!ATTLIST SECTION    SECTION_ID     CDATA #IMPLIED>
<!ATTLIST QUESTION   QUESTION_ID    CDATA #IMPLIED
                     QUESTION_LEVEL CDATA #IMPLIED>

					

Now, let's understand how a DTD works. For a simple example, like this one, we need two special tags called <!ELEMENT> and <!ATTLIST>.

The <!ELEMENT> definition tag is used to define the elements presented in the XML file. The general syntax is

						
<!ELEMENT NAME CONTENTS>
					

The first argument (NAME) gives the name of the element, and the second one (CONTENTS) lists the element names that are allowed to be underneath the element that we are defining.

The ordering that we use to list the contents is important. When we say, for example,

						
<!ELEMENT SURVEY (CLIENT, SECTION+)>
					

it means that we must have a CLIENT first, followed by one or more SECTION elements. Note that we have a special character (the plus sign) just after the second element in the content list. This character, as well as some others, has a special meaning:

  • A + sign after an element means that it can be included one or more times.

  • A ? sign indicates that the element can be skipped.

  • A * sign indicates an element that can be skipped or included one or more times.

Note

These characters have meanings similar to the ones they have in regular expressions. (Of course, not everything you can use in a regular expression can be used in a DTD.)
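
The analogy with regular expressions can be made concrete. The sketch below hand-translates two content models from the survey DTD into Python re patterns and matches them against a sequence of child element names. This is only an illustration of the +, ?, and sequence rules; it is not how a validating parser works, and the names CLIENT_MODEL, SURVEY_MODEL, and matches_model are hypothetical.

```python
import re

# (NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)
CLIENT_MODEL = re.compile(r"NAME LOCATION (CONTACT )?(EMAIL )?(TELEPHONE )?")
# (CLIENT, SECTION+)
SURVEY_MODEL = re.compile(r"CLIENT (SECTION )+")


def matches_model(pattern, child_names):
    """Check a list of child element names against a content-model pattern."""
    # Each name is followed by one space so that names cannot run together.
    sequence = "".join(name + " " for name in child_names)
    return pattern.fullmatch(sequence) is not None


# CONTACT and TELEPHONE are optional, so leaving them out is allowed.
print(matches_model(CLIENT_MODEL, ["NAME", "LOCATION", "EMAIL"]))
# Order matters: LOCATION may not come before NAME.
print(matches_model(CLIENT_MODEL, ["LOCATION", "NAME"]))
# SECTION+ requires at least one SECTION.
print(matches_model(SURVEY_MODEL, ["CLIENT"]))
print(matches_model(SURVEY_MODEL, ["CLIENT", "SECTION", "SECTION"]))
```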



Note that #PCDATA (parsed character data) indicates an element that carries the information itself, as text, rather than containing other elements.

<!ATTLIST>, the other definition tag in the example, defines the attributes of an element. In our DTD, we have three attributes, one for SECTION, and two for QUESTION.
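
A parser sees these attribute declarations as it reads the DTD. As a small sketch, the code below places the two <!ATTLIST> declarations in an internal DTD subset and uses the AttlistDeclHandler callback of xml.parsers.expat (standard library) to list them. The function name declared_attributes and the one-line sample document are assumptions made for this example.

```python
import xml.parsers.expat

# The ATTLIST declarations from the survey DTD, placed in an internal subset
# so that no external file is needed.
DOCUMENT = """\
<!DOCTYPE SURVEY [
<!ATTLIST SECTION    SECTION_ID     CDATA #IMPLIED>
<!ATTLIST QUESTION   QUESTION_ID    CDATA #IMPLIED
                     QUESTION_LEVEL CDATA #IMPLIED>
]>
<SURVEY><SECTION SECTION_ID="1"><QUESTION QUESTION_ID="1" QUESTION_LEVEL="1"/></SECTION></SURVEY>
"""


def declared_attributes(text):
    """Return (element, attribute, type) for each attribute declared in the DTD."""
    found = []

    def on_attlist(elname, attname, att_type, default, required):
        found.append((elname, attname, att_type))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(text, True)
    return found


for element, attribute, att_type in declared_attributes(DOCUMENT):
    print(element, attribute, att_type)
```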

An important difference between XML and SGML is that elements in XML that do not have any contents (like <IMG> and <BR> of HTML) are written like this in XML:

						
<IMG SRC="stuff.gif"/>
					

or in an equivalent format, such as

						
<img src="stuff.gif"></img>

					

Note the slash before the final > in the first form. It means that a program can read the document without knowing the DTD (which is where it says that IMG does not have any contents) and still know that IMG has no end tag and that whatever follows IMG is not inside the element.
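
A quick way to see that the two spellings of an empty element are equivalent is to parse both and compare the results. The sketch below does this with xml.etree.ElementTree from the standard library (an assumption of this example); describe is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The same empty element written in the short form and with an explicit end tag.
short_form = ET.fromstring('<IMG SRC="stuff.gif"/>')
long_form = ET.fromstring('<IMG SRC="stuff.gif"></IMG>')


def describe(element):
    """Return the tag, attributes, text content, and number of children."""
    return (element.tag, dict(element.attrib), element.text, len(element))


print(describe(short_form))
print(describe(long_form))
print(describe(short_form) == describe(long_form))
```

Writing the start tag alone, as HTML allows for IMG, is rejected by an XML parser because the element is never closed.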

For more information about XML and Python, check out the XML package. It comes with a Python XML-HOWTO in the doc directory, and very good examples:

http://www.python.org/sigs/xml-sig/status.html

Python XML Package

For those who want to play around with XML in Python, there will be a Python/XML package to serve several purposes at once. This package will contain everything required for basic XML applications, along with documentation and sample code—basically, something easy to compile and install.

A release candidate of the latest release of this package is now available as PyXML-0.5.5.tar.gz (GPG signature), dated June 5, 2000. This version contains SAX, the Pyexpat module, sgmlop, the prototype DOM code, and xmlproc, an XML parser written in Python.

The individual components contained in the Python/XML package include

  • A Python implementation of SAX (Simple API for XML)

    A SAX implementation has been written by Lars Marius Garshol. Garshol has also written a draft specification of the Python version of SAX 1.0.

  • An XML-HOWTO containing an overview of Python and XML processing. (This is still being actively revised.)

    Andrew Kuchling is working on this. A first draft of the XML-HOWTO is available, and introduces the SAX interface in tutorial form. A reference manual is available separately.

  • A fairly stable Python interface to James Clark's Expat parser. A Pyexpat C extension has been written by Jack Jansen.

  • Both Python and C implementations of the DOM (Document Object Model).

    Stefane Fermigier's DOM package has been modified to match the final DOM W3C Recommendation.

  • A module to marshal simple Python data types into XML. A module called xml.marshal is available. However, it might end up being superseded by Lotos, WDDX, or some other DTD.

The document called Python/XML Reference Guide is the reference manual for the Python/XML package, containing descriptions for several XML modules. For more information, check out the following sites:

Python/XML Reference Guide

http://www.python.org/doc/howto/xml-ref/

"SAX Implementation," by Lars Marius Garshol

http://www.stud.ifi.uio.no/~lmariusg/download/python/xml/saxlib.html

Draft specification of the Python version of SAX 1.0

http://www.stud.ifi.uio.no/~lmariusg/download/python/xml/sax-spec.html

XML-HOWTO

http://www.python.org/doc/howto/xml/

Pyexpat C extension written by Jack Jansen

ftp://ftp.cwi.nl/pub/jack/python/pyexpat.tgz

DOM Recommendation

http://www.w3.org/TR/REC-DOM-Level-1/

Stefane Fermigier's DOM package

http://www.math.jussieu.fr/~fermigie/python/

Python 2.0 was released with a lot of enhancements concerning the XML support, including a SAX2 interface and a re-designed DOM interface as part of the xml package. Note that the xml package that is shipped with Python 2.0 contains just a basic set of options for XML development. If you want (or need) to use the full XML package, you should install PyXML.

The PyXML distribution also uses the xml package. That's the reason why PyXML versions 0.6.0 or greater can be used to replace the xml package that is bundled with Python. By doing so, you will extend the set of XML functionalities that you can have available. That includes

  • 4DOM, a full DOM implementation from FourThought, Inc

  • The xmlproc validating parser, written by Lars Marius Garshol

  • The sgmlop parser accelerator module, written by Fredrik Lundh

xmllib

The xmllib module defines a class XMLParser, which serves as the basis for parsing text files formatted in XML. Note that xmllib is not XML 1.0 compliant, and it doesn't provide any Unicode support. It provides simple XML support for ASCII-only element and attribute names. It probably handles UTF-8 character data without problems, however.

XMLParser()

The XMLParser class must be instantiated without arguments. This class provides the following interface methods and instance variables:

attributes—  This is a mapping of element names to mappings. The latter mapping maps attribute names that are valid for the element to the default value of the attribute, or to None if there is no default. The default value is the empty dictionary. This variable is meant to be overridden and not extended because the default is shared by all instances of XMLParser.

elements—  This is a mapping of element names to tuples. The tuples contain a function for handling the start and end tag, respectively, of the element, or None if the method unknown_starttag() or unknown_endtag() is to be called. The default value is the empty dictionary. This variable is meant to be overridden and not extended because the default is shared by all instances of XMLParser.

entitydefs—  This is a mapping of entity names to their values. The default value contains definitions for lt, gt, amp, quot, and apos.

reset()—  Resets the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

setnomoretags()—  Stops processing tags. Treats all following input as literal input (CDATA).

setliteral()—  Enters literal mode (CDATA mode). This mode is automatically exited when the close tag matching the last unclosed open tag is encountered.

feed(data)—  Feeds some text to the parser. It is processed insofar as it consists of complete tags; incomplete data is buffered until more data is fed or close() is called.

close()—  Forces processing of all buffered data as if it were followed by an end-of-file mark. This method can be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the base class XMLParser.close().

translate_references(data)—  Translates all entity and character references in data and returns the translated string.

handle_xml(encoding, standalone)—  This method is called when the <?xml ...?> tag is processed. The arguments are the values of the encoding and standalone attributes in the tag. Both encoding and standalone are optional. The values passed to handle_xml() default to None and the string 'no', respectively.

handle_doctype(tag, data)—  This method is called when the <!DOCTYPE...> tag is processed. The arguments are the name of the root element and the uninterpreted contents of the tag, starting after the whitespace that follows the name of the root element.

handle_starttag(tag, method, attributes)—  This method is called to handle start tags for which a start tag handler is defined in the instance variable elements. The tag argument is the name of the tag, and the method argument is the function (method) that should be used to support semantic interpretation of the start tag. The attributes argument is a dictionary of attributes; the key is the attribute name and the value is the attribute value found inside the tag's <> brackets. Character and entity references in the value have been interpreted. For instance, for the start tag <A HREF="http://www.python.org/">, this method would be called as handle_starttag('A', self.elements['A'][0], {'HREF': 'http://www.python.org/'}). The base implementation simply calls method with attributes as the only argument.

handle_endtag(tag, method)—  This method is called to handle end tags for which an end tag handler is defined in the instance variable elements. The tag argument is the name of the tag, and the method argument is the function (method) that should be used to support semantic interpretation of the end tag. For instance, for the end tag </A>, this method would be called as handle_endtag('A', self.elements['A'][1]). The base implementation simply calls method.

handle_data(data)—  This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

handle_charref(ref)—  This method is called to process a character reference of the form &#ref;. Here, ref can be either a decimal number, or a hexadecimal number when preceded by an x. In the base implementation, ref must be a number in the range 0-255. It translates the character to ASCII and calls the method handle_data() with the character as argument. If ref is invalid or out of range, the method unknown_charref(ref) is called to handle the error. A subclass must override this method to provide support for character references outside the ASCII range.

handle_entityref(ref)—  This method is called to process a general entity reference of the form &ref; where ref is the name of a general entity. It looks for ref in the instance (or class) variable entitydefs, which should be a mapping from entity names to corresponding translations. If a translation is found, it calls the method handle_data() with the translation; otherwise, it calls the method unknown_entityref(ref). The default entitydefs defines translations for &amp;, &apos;, &gt;, &lt;, and &quot;.

handle_comment(comment)—  This method is called when a comment is encountered. The comment argument is a string containing the text between the <!-- and --> delimiters, but not the delimiters themselves. For example, the comment <!--text--> will cause this method to be called with the argument text. The default method does nothing.

handle_cdata(data)—  This method is called when a CDATA section is encountered. The data argument is a string containing the text between the <![CDATA[ and ]]> delimiters, but not the delimiters themselves. For example, the section <![CDATA[text]]> will cause this method to be called with the argument text. The default method does nothing, and is intended to be overridden.

handle_proc(name, data)—  This method is called when a processing instruction (PI) is encountered. The name is the PI target, and the data argument is a string containing the text between the PI target and the closing delimiter, but not the delimiter itself. For example, the instruction <?XML text?> will cause this method to be called with the arguments XML and text. The default method does nothing. Note that if a document starts with <?xml ...?>, handle_xml() is called to handle it.

handle_special(data)—  This method is called when a declaration is encountered. The data argument is a string containing the text between the <! and > delimiters, but not the delimiters themselves. For example, the entity declaration <!ENTITY text> will cause this method to be called with the argument ENTITY text. The default method does nothing. Note that <!DOCTYPE ...> is handled separately if it is located at the start of the document.

syntax_error(message)—  This method is called when a syntax error is encountered. The message is a description of what was wrong. The default method raises a RuntimeError exception. If this method is overridden, it is permissible for it to return. This method is only called when the error can be recovered from. Unrecoverable errors raise a RuntimeError without first calling syntax_error().

unknown_starttag(tag, attributes)—  This method is called to process an unknown start tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_endtag(tag)—  This method is called to process an unknown end tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_charref(ref)—  This method is called to process unresolvable numeric character references. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_entityref(ref)—  This method is called to process an unknown entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
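
The handler naming convention described above (start_TAG and end_TAG methods, with unknown_starttag() and unknown_endtag() as fallbacks) can be imitated in a few lines. The sketch below is not xmllib, which is absent from Python 3; it is a hypothetical DispatchingParser class built on xml.parsers.expat that mimics only the dispatch idea and the feed()/close() pair. QuestionLister and its attributes are likewise made up for this example.

```python
import xml.parsers.expat


class DispatchingParser:
    """Hypothetical helper imitating the start_TAG/end_TAG convention."""

    def __init__(self):
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end
        self._parser.CharacterDataHandler = self.handle_data

    def feed(self, data):
        # More input may follow, so this is not the final chunk.
        self._parser.Parse(data, False)

    def close(self):
        # An empty final chunk tells Expat that the input is complete.
        self._parser.Parse("", True)

    def _start(self, tag, attributes):
        method = getattr(self, "start_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            method(attributes)

    def _end(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is None:
            self.unknown_endtag(tag)
        else:
            method()

    # Default hooks do nothing; subclasses override the ones they need.
    def handle_data(self, data):
        pass

    def unknown_starttag(self, tag, attributes):
        pass

    def unknown_endtag(self, tag):
        pass


class QuestionLister(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.ids = []
        self.unknown = []

    def start_QUESTION(self, attributes):
        self.ids.append(attributes["QUESTION_ID"])

    def unknown_starttag(self, tag, attributes):
        self.unknown.append(tag)


lister = QuestionLister()
# The document arrives in two pieces, as it might when read from a socket.
lister.feed('<SECTION SECTION_ID="1"><QUESTION QUESTION_ID="1"/>')
lister.feed('<QUESTION QUESTION_ID="2"/></SECTION>')
lister.close()
print(lister.ids, lister.unknown)
```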

XML Namespaces

The xmllib module has support for XML namespaces as defined in the XML namespaces proposed recommendation.

Tag and attribute names that are defined in an XML namespace are handled as if the name of the tag or attribute consisted of the namespace (that is, the URL that defines the namespace) followed by a space and the name of the tag or attribute. For instance, the tag <html xmlns:html="http://www.w3.org/TR/REC-html40"> is treated as if the tag name were http://www.w3.org/TR/REC-html40 html, and the tag <html:a href="http://frob.com"> inside the previous element is treated as if the tag name were http://www.w3.org/TR/REC-html40 a and the attribute name as if it were http://www.w3.org/TR/REC-html40 href.

An older draft of the XML namespaces proposal is also recognized, but triggers a warning.
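
The "namespace URL, space, local name" convention is easy to observe with a different parser. The sketch below uses xml.parsers.expat with namespace_separator=" ", which reports element names in the same shape. It is an illustration with Expat, not xmllib: Expat leaves an unprefixed attribute such as href unqualified, so attribute handling differs from the description above. The function name qualified_names and the sample document are assumptions of this example.

```python
import xml.parsers.expat

DOCUMENT = (
    '<html:html xmlns:html="http://www.w3.org/TR/REC-html40">'
    '<html:a href="http://frob.com"/>'
    '</html:html>'
)


def qualified_names(text):
    """Return (element name, attribute names) pairs as reported by Expat."""
    seen = []

    def on_start(name, attributes):
        seen.append((name, sorted(attributes)))

    # With a separator, Expat performs namespace processing and reports
    # names as namespace URI + separator + local name.
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return seen


for name, attribute_names in qualified_names(DOCUMENT):
    print(repr(name), attribute_names)
```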

XML Examples

The next example uses xmllib to parse an XML file. The file being used is the same survey.xml that you saw in the beginning of this chapter. Our goal is to read the file, parse it, and convert it to a structure such as the following:

							
Survey of section number 1

1- What is your favorite language?
      Python
      Perl

2- Do you use this language at work?
      Yes
      No

3- Did you expect the Spanish inquisition?
      No
      Of course not

						

The following code implements a solution for our problem. Remember that XML tags are case sensitive, so the handler method names must match the case of the tags exactly (start_Op1, not start_OP1), and the tags in the file must be properly balanced. In this code, note that attributes are passed to the tag handlers in a dictionary, not in a tuple.

							
import xmllib


class myparser(xmllib.XMLParser):
    """Prints each survey question followed by its two options."""

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.currentquestiondesc = ''
        self.currentOp1 = ''
        self.currentOp2 = ''
        self.currentquestion = ''
        self.currentdata = []

    def handle_data(self, data):
        # Character data may arrive in several pieces; collect them all.
        self.currentdata.append(data)

    def start_SURVEY(self, attrs):
        # The trailing comma keeps the section number on the same line.
        print "Survey of section number",

    def end_SURVEY(self):
        pass

    def start_SECTION(self, attrs):
        print attrs['SECTION_ID']

    def end_SECTION(self):
        pass

    def start_QUESTION(self, attrs):
        self.currentquestion = attrs['QUESTION_ID']

    def end_QUESTION(self):
        # All parts of the question are known by now; print them.
        print """

%(currentquestion)s- %(currentquestiondesc)s
      %(currentOp1)s
      %(currentOp2)s
""" % self.__dict__

    # Each text-carrying element empties the buffer at its start tag
    # and stores the collected text at its end tag.
    def start_QUESTION_DESC(self, attrs):
        self.currentdata = []

    def end_QUESTION_DESC(self):
        self.currentquestiondesc = ''.join(self.currentdata)

    def start_Op1(self, attrs):
        self.currentdata = []

    def end_Op1(self):
        self.currentOp1 = ''.join(self.currentdata)

    def start_Op2(self, attrs):
        self.currentdata = []

    def end_Op2(self):
        self.currentOp2 = ''.join(self.currentdata)


if __name__ == "__main__":
    filehandle = open("survey.xml")
    data = filehandle.read()
    filehandle.close()

    parser = myparser()
    parser.feed(data)
    parser.close()

						

Let's see another example. The next one opens our survey.xml file and lists all the questions available. It also tries to find question #4, but because we don't have it, the parser raises an exception and the program prints a message to the user.

							
import xmllib


class QuestionNotFound(Exception):
    pass


class Parser(xmllib.XMLParser):

    def __init__(self, fileobj=None):
        self.found = 0
        xmllib.XMLParser.__init__(self)
        if fileobj:
            self.load(fileobj)

    def load(self, fileobj):
        # Feed the parser in 1 KB chunks, then flush whatever is buffered.
        while 1:
            xmldata = fileobj.read(1024)
            if not xmldata:
                break
            self.feed(xmldata)
        self.close()

    def start_QUESTION(self, attrs):
        question_id = attrs.get("QUESTION_ID")
        print "I found Question #" + question_id
        if question_id == "4":
            self.found = 1

    def end_SECTION(self):
        # Called at </SECTION>; by then every question has been seen.
        if not self.found:
            raise QuestionNotFound


try:
    MyParser = Parser()
    MyParser.load(open("survey.xml"))
except QuestionNotFound:
    print "I couldn't find Question #4 !!!"
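
The same feed-in-chunks pattern exists in other parsers. As a sketch only, the code below repeats the search for a missing question using xml.etree.ElementTree.XMLParser with a custom target object, which is an assumption of this example and unrelated to xmllib. The names QuestionFinder and find_question are hypothetical, and the document is read from an in-memory stream so that the example is self-contained.

```python
import io
import xml.etree.ElementTree as ET

DOCUMENT = (
    '<SECTION SECTION_ID="1">'
    '<QUESTION QUESTION_ID="1"/><QUESTION QUESTION_ID="2"/>'
    '<QUESTION QUESTION_ID="3"/></SECTION>'
)


class QuestionNotFound(Exception):
    pass


class QuestionFinder:
    """Parser target: the parser calls start(), end(), data(), and close()."""

    def __init__(self, wanted_id):
        self.wanted_id = wanted_id
        self.seen = []

    def start(self, tag, attrib):
        if tag == "QUESTION":
            self.seen.append(attrib.get("QUESTION_ID"))

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        # Called once all input has been processed.
        if self.wanted_id not in self.seen:
            raise QuestionNotFound(self.wanted_id)
        return self.seen


def find_question(stream, wanted_id, chunk_size=1024):
    parser = ET.XMLParser(target=QuestionFinder(wanted_id))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    # close() returns whatever the target's close() returns.
    return parser.close()


try:
    find_question(io.StringIO(DOCUMENT), "4", chunk_size=16)
except QuestionNotFound:
    print("I couldn't find Question #4")
```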

						
The SAX API

SAX is a common event-based interface for object-oriented XML parsers. The Simple API for XML isn't a standard in the formal sense, but an informal specification designed by David Megginson, with input from many people on the XML-DEV mailing list. SAX defines an event-driven interface for parsing XML. To use SAX, you must create Python class instances that implement a specified interface, and the parser will then call various methods of those objects.

SAX is most suitable for purposes in which you want to read through an entire XML document from beginning to end, and perform some computation, such as building a data structure representing a document, or summarizing information in a document (computing an average value of a certain element, for example). It isn't very useful if you want to modify the document structure in some complicated way that involves changing how elements are nested, though it could be used if you simply want to change element contents or attributes. For example, you would not want to re-order chapters in a book using SAX, but you might want to change the contents of any name elements with the attribute lang equal to greek into Greek letters. Of course, if this is an XML file, we would use the standard attribute xml:lang rather than just lang to store the language.

One advantage of SAX is speed and simplicity. There is no need to expend effort examining elements that are irrelevant to your application. You can therefore write a class instance that ignores all elements that aren't what you need. Another advantage is that you don't have the whole document resident in memory at any one time, which matters if you are processing huge documents.
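
The "average value of a certain element" case mentioned above can be sketched as follows. This uses the SAX2 interface in Python's standard xml.sax package; the AverageHandler class and the ORDERS/ORDER/PRICE/NOTE vocabulary are invented for the example. Only PRICE elements are examined, and nothing but the collected numbers is kept in memory.

```python
import xml.sax


class AverageHandler(xml.sax.ContentHandler):
    """Collects the numeric content of one element type and averages it."""

    def __init__(self, element_name):
        super().__init__()
        self.element_name = element_name
        self.values = []
        self._buffer = None  # a list while inside the element of interest

    def startElement(self, name, attrs):
        if name == self.element_name:
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several calls, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.element_name and self._buffer is not None:
            self.values.append(float("".join(self._buffer)))
            self._buffer = None

    def average(self):
        if not self.values:
            return None
        return sum(self.values) / len(self.values)


ORDERS = (b"<ORDERS>"
          b"<ORDER><PRICE>10.0</PRICE><NOTE>rush</NOTE></ORDER>"
          b"<ORDER><PRICE>20.0</PRICE></ORDER>"
          b"</ORDERS>")

handler = AverageHandler("PRICE")
xml.sax.parseString(ORDERS, handler)
print(handler.average())
```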

SAX defines four basic interfaces; a SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to implement those interfaces relevant to your application.

The SAX interfaces are as follows:

DocumentHandler—  Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements.

DTDHandler—  Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section 4.7) and unparsed entity declarations (XML spec section 4).

EntityResolver—  Called to resolve references to external entities. If your documents will have no external entity references, you won't need to implement this interface.

ErrorHandler—  Called for error handling. The parser will call methods from this interface to report all warnings and errors.

Because Python doesn't support the concept of interfaces, the previous interfaces are implemented as Python classes. The default method implementations are defined to do nothing—the method body is just a Python pass statement—so usually you can simply ignore methods that aren't relevant to your application. The one big exception is the ErrorHandler interface; if you don't provide methods that print a message or otherwise take some action, errors in the XML data will be silently ignored. This is almost certainly not what you want your application to do, so always implement at least the error() and fatalError() methods. xml.sax.saxutils provides an ErrorPrinter class that sends error messages to standard error, and an ErrorRaiser class that raises an exception for any warnings or errors.
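
As a sketch of an explicit error handler, the code below subclasses ErrorHandler from the xml.sax package in Python's standard library. That is the SAX2 interface, not the saxlib classes described here, and its default handler already raises on error() and fatalError() rather than ignoring them. The class name CollectingErrorHandler and the function check are hypothetical.

```python
import xml.sax


class CollectingErrorHandler(xml.sax.ErrorHandler):
    """Records every problem the parser reports; fatal errors still stop it."""

    def __init__(self):
        self.messages = []

    def warning(self, exception):
        self.messages.append(("warning", str(exception)))

    def error(self, exception):
        self.messages.append(("error", str(exception)))

    def fatalError(self, exception):
        self.messages.append(("fatal", str(exception)))
        # The document cannot be parsed any further, so re-raise.
        raise exception


def check(xml_bytes):
    """Parse xml_bytes and return the list of reported problems."""
    handler = CollectingErrorHandler()
    try:
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler(), handler)
    except xml.sax.SAXParseException:
        pass  # already recorded by fatalError()
    return handler.messages


print(check(b"<a><b/></a>"))
print(check(b"<a><b></a>"))
```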

Pseudo-code for using SAX looks similar to the following:

							
import sys
from xml.sax import saxlib

# Define your specialized handler classes
class docHandler(saxlib.DocumentHandler):
     …
# Create an instance of the handler classes
dh = docHandler()
# Create an XML parser
parser = …
# Tell the parser to use your handler instance
parser.setDocumentHandler(dh)
# Parse the file; your handler's method will get called
parser.parseFile(sys.stdin)
# Close the parser
parser.close()
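
For comparison, here is a runnable sketch of the same steps using the SAX2 interface in Python's standard xml.sax package, where the handler base class is ContentHandler and the parser comes from make_parser(). It is an adaptation, not the saxlib API shown in the pseudo-code. SurveyHandler is a hypothetical class, and the survey is read from an in-memory stream instead of standard input.

```python
import io
import xml.sax

SURVEY = b"""\
<SURVEY>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1>
      <Op2>Perl</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>
"""


# Define your specialized handler class
class SurveyHandler(xml.sax.ContentHandler):
    TEXT_ELEMENTS = ("QUESTION_DESC", "Op1", "Op2")

    def __init__(self):
        super().__init__()
        self.questions = []
        self._current = None  # dictionary for the question being read
        self._text = None     # list of text pieces while inside a text element

    def startElement(self, name, attrs):
        if name == "QUESTION":
            self._current = {"id": attrs.get("QUESTION_ID")}
        elif name in self.TEXT_ELEMENTS:
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name in self.TEXT_ELEMENTS:
            if self._current is not None:
                self._current[name] = "".join(self._text).strip()
            self._text = None
        elif name == "QUESTION":
            self.questions.append(self._current)
            self._current = None


# Create an instance of the handler class
handler = SurveyHandler()
# Create an XML parser
parser = xml.sax.make_parser()
# Tell the parser to use your handler instance
parser.setContentHandler(handler)
# Parse the stream; the handler's methods get called along the way
parser.parse(io.BytesIO(SURVEY))

for question in handler.questions:
    print(question["id"], question["QUESTION_DESC"],
          question["Op1"], question["Op2"])
```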

						

For more information, check out the following sites:

SAX: The Simple API for XML

http://www.python.org/doc/howto/xml/SAX.html

David Megginson's SAX page

Megginson was the primary force behind SAX's development, and implemented the Java version of SAX.

http://www.megginson.com/SAX/

What is an Event-Based Interface?

This page explains what an event-based interface is, and contrasts the event-based SAX with the tree-based Document Object Model (DOM).

http://www.megginson.com/SAX/event.html

Writing an application for a SAX-compliant XML parser

Simon Pepping gives a short overview of the Simple API for XML (SAX). He describes how a SAX-compliant parser and a SAX application interact, and how one should proceed to write a SAX application. The description focuses on the Python implementation of SAX. The examples are written in Python.

http://www.hobby.nl/~scaprea/XML/

DOM: The Document Object Model

The Document Object Model (DOM) is a standard interface for manipulating XML and HTML documents developed by the World Wide Web Consortium (W3C).
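
Because a DOM keeps the whole document in memory as a tree, structural changes such as reordering elements are straightforward. The sketch below uses xml.dom.minidom from Python's standard library, a lightweight DOM implementation chosen as an assumption for this example. The BOOK/CHAPTER document and the function reversed_chapters are invented for illustration.

```python
import xml.dom.minidom

BOOK = "<BOOK><CHAPTER>Two</CHAPTER><CHAPTER>One</CHAPTER></BOOK>"


def reversed_chapters(xml_text):
    """Return the document element with its CHAPTER children in reverse order."""
    document = xml.dom.minidom.parseString(xml_text)
    book = document.documentElement
    # Copy the node list first because the tree is about to change.
    chapters = list(book.getElementsByTagName("CHAPTER"))
    for chapter in chapters:
        book.removeChild(chapter)
    for chapter in reversed(chapters):
        book.appendChild(chapter)
    return book.toxml()


print(reversed_chapters(BOOK))
```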

4DOM is a Python library developed by FourThought LLC for XML and HTML processing and manipulation using the W3C's Document Object Model as its interface. 4DOM supports all of DOM level 1 (core and HTML), as well as core, HTML and Document Traversal from level 2. 4DOM also adds some helper components for DOM tree creation and printing, Python integration, whitespace manipulation, and so on.

4DOM is designed to allow developers to rapidly design applications that read, write, or manipulate HTML and XML. Check out

http://www.fourthought.com/4Suite/4DOM/

XSL Transformations (XSLT)

This W3C specification defines the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents.

XSLT is designed for use as part of XSL, which is a stylesheet language for XML. In addition to XSLT, XSL includes an XML vocabulary for specifying formatting. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another XML document that uses the formatting vocabulary.

XSLT is also designed to be used independently of XSL. However, XSLT is not intended as a completely general-purpose XML transformation language. Rather, it is designed primarily for the kinds of transformations that are needed when XSLT is used as part of XSL. XSLT is also good for transforming some custom XML format into XHTML that can be displayed by a browser, for instance. For more information, check out

http://www.w3.org/TR/xslt

4XSLT is an XML transformation processor based on the W3C's specification, and written by FourThought LLC, for the XSLT transform language. Currently, 4XSLT supports a subset of the final recommendation of XSLT. For more information, check out the site:

http://www.fourthought.com/4Suite/4XSLT/

XBEL—XML Bookmark Exchange Language

The XML Bookmark Exchange Language, or XBEL, is an Internet bookmarks interchange format. It was designed by the Python XML Special Interest Group on the group's mailing list. It grew out of an idea for a demonstration of using Python for XML processing. Mark Hammond contributed the original idea, and other members of the SIG chimed in to add support for their favorite browser features. After debate that deviated from the original idea, compromises were reached that allow XBEL to be a useful language for describing bookmark data for a range of browsers, including the major browsers and a number of less widely used browsers.

At that point, the formal DTD was finalized and documentation was written. The formal DTD and the documentation are available online at the following sites:

http://www.python.org/topics/xml/xbel/

http://www.python.org/topics/xml/xbel/docs/html/xbel.html

Supporting software is provided as part of the Python XML package. This software is located in the demo/xbel/ directory of the distribution. This includes command-line processes for converting XBEL instances to other common formats, including the Navigator and Internet Explorer formats. Note that the current release of the Grail Internet browser from CNRI supports XBEL as a native bookmarks format.
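
To show what processing an XBEL document can look like, here is a minimal sketch that lists the bookmarks in a small XBEL instance. It uses xml.etree.ElementTree from the standard library rather than the tools in demo/xbel/, and the sample document and the function list_bookmarks are made up for this example.

```python
import xml.etree.ElementTree as ET

XBEL = """\
<xbel version="1.0">
  <folder>
    <title>Python</title>
    <bookmark href="http://www.python.org/">
      <title>Python home page</title>
    </bookmark>
    <bookmark href="http://www.python.org/topics/xml/xbel/">
      <title>XBEL</title>
    </bookmark>
  </folder>
</xbel>
"""


def list_bookmarks(xml_text):
    """Return (title, URL) for every bookmark, however deeply it is nested."""
    root = ET.fromstring(xml_text)
    return [(bookmark.findtext("title"), bookmark.get("href"))
            for bookmark in root.iter("bookmark")]


for title, url in list_bookmarks(XBEL):
    print(title, url)
```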

A script created by Jürgen Hermann, available at the following site, checks the URLs in an XBEL document:

http://cscene.org/%7ejh/xml/bookmarks/checkurls.py

RPC—What Is It?

A Remote Procedure Call (RPC) uses the ordinary procedure call mechanism that is familiar to every programmer in order to hide the intricacies of the network.

A client process calls a function on a remote server and suspends itself until it gets back the results. Parameters are passed the same as in any ordinary procedure. The RPC, similar to an ordinary procedure, is synchronous; clients and servers must run concurrently. Servers must keep up with clients. The process (or thread) that issues the call waits until it gets the results. Behind the scenes, the RPC runtime software collects values for the parameters, forms a message, and sends it to the remote server. (Note that servers must first come up before clients can talk to them.) The server receives the request, unpacks the parameters, calls the procedure, and sends the reply back to the client.

Asynchronous processing is limited because it requires threads and tricky code for managing threads. A procedure call is the name of a procedure, its parameters, and the result it returns.

Procedure calls are fundamental to computing. Every program is just a single procedure called main; every operating system has a main procedure called a kernel. There's a top level to every program that sits in a loop waiting for something to happen and then distributes control to a hierarchy of procedures that respond. This is at the heart of interactivity and networking; it's at the heart of software.

RPC is a very simple extension to the procedure call idea; it says, "let's create connections between procedures that are running in different applications or on different machines."

Conceptually, there's no difference between a local procedure call and a remote one, but they are implemented differently, perform differently (RPC is much slower), and therefore are used for different things.

Remote calls are marshaled into a format that can be understood on the other side of the connection. As long as two machines agree on a format, they can talk to each other. That's why Windows machines can be networked with other Windows machines, Macs can talk to Macs, and so on. The value in a standardized cross-platform format for RPC is that it allows UNIX machines to talk to Windows machines and vice versa.

A number of formats are possible. One possible format is XML. XML-RPC uses XML as the marshaling format. It allows Macs to easily make procedure calls to software running on Windows machines and BeOS machines, as well as all flavors of UNIX and Java, IBM mainframes, PDAs, and so on.

With XML, it's easy to see what a call is doing because the message is readable text, and it's also relatively easy to marshal the internal procedure call format into a remote format.
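To see what a marshaled call looks like, the short sketch below uses the dumps and loads functions from the standard library's xmlrpc.client module. The procedure name add and its two arguments are hypothetical; nothing is sent over the network, the call is only converted to XML text and back.

```python
from xmlrpc.client import dumps, loads

# Marshal a call to a hypothetical remote procedure "add" with two arguments.
request_xml = dumps((2, 3), methodname="add")

# The result is plain text, so the procedure name and parameters are
# visible to anyone who reads the message.
print(request_xml)

# The receiving side unpacks the same text into the parameters and the name.
params, method = loads(request_xml)
print(method, params)
```

Because both sides agree on this XML format, the program that unpacks the message does not need to be written in Python or run on the same kind of machine as the sender.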

Simple Object Access Protocol (SOAP)

SOAP is an XML/HTTP-based protocol for accessing services, objects, and servers in a platform-independent manner. For more information, check out

http://www.develop.com/soap

A minimal Python SOAP implementation is located at

http://casbah.org/Scarab/

This module is derived in part from Andrew Kuchling's xml.marshal code. It implements the SOAP "" serialization using the same API as pickle.py (dump/load).

Scarab is an Open Source Communications library implementing protocols, formats, and interfaces for writing distributed applications, with an emphasis on low-end and lightweight implementations. Users can combine Scarab module implementations to build a messaging system to fit their needs, scaling from very simple messaging or data transfer all the way up to where CORBA can take over. Scarab implementations include support for such areas as distributed objects, remote procedure calls, XML messages, TCP transport, and HTTP transport.

PythonPoint

The ReportLab package contains a demo called PythonPoint, which defines a simple XML format for presentation slides and can convert them to PDF documents, along with imaginative presentation effects. The demo script provided on the Web site illustrates how easily complex XML can be translated into useful PDF. The demo output, pythonpoint.pdf, demonstrates some of the more exotic PDF capabilities:

http://www.reportlab.com/demos/demos.html

Pyxie

Pyxie is an Open Source XML processing library for Python developed by Sean McGrath. He has also written a book called XML Processing with Python for Prentice Hall. The book contains a description of the Pyxie library and many sample programs.

Pyxie is heavily based on a line-oriented notation for parsed XML known as PYX. Pyxie includes utilities, known as xmln and xmlv, that generate PYX.
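To give a feel for a line-oriented notation of this kind, here is a minimal sketch that turns SAX parsing events into PYX-style lines. It is not the xmln utility and does not use the Pyxie library; it relies on the standard xml.sax module, and the PyxWriter class and to_pyx function are hypothetical names. It assumes the commonly documented PYX prefixes: "(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag, and "?" for a processing instruction.

```python
import xml.sax


class PyxWriter(xml.sax.ContentHandler):
    """Collect PYX-style lines: one parsing event per line."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for attr_name in attrs.getNames():
            self.lines.append("A%s %s" % (attr_name, attrs.getValue(attr_name)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Write newlines as the two characters \n so that every event
        # stays on a single line. The parser may deliver long text in
        # several chunks, each of which becomes its own "-" line.
        self.lines.append("-" + content.replace("\n", "\\n"))

    def processingInstruction(self, target, data):
        self.lines.append("?%s %s" % (target, data))


def to_pyx(xml_text):
    """Parse an XML string and return its PYX-style lines as a list."""
    handler = PyxWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


if __name__ == "__main__":
    for line in to_pyx('<a href="x">hi</a>'):
        print(line)
```

Since every event sits on its own line, the output can be filtered with ordinary line-based text tools, which is the main attraction of such a notation.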

PYX is independent of Python, and a number of programs that process PYX have appeared in Java, Perl, and JavaScript:

http://www.digitome.com/pyxie.html


Last updated on 1/30/2002
Python Developer's Handbook, © 2002 Sams Publishing



© 2002, O'Reilly & Associates, Inc.