
Handling Other Markup Languages

The first part of this chapter covers XML, which undoubtedly holds great promise for the future of the Internet.

The following pages describe additional modules that support two other markup standards commonly used on the Internet: SGML and HTML.

sgmllib

The sgmllib module implements a parser for a subset of SGML. Although its implementation is simple, it is powerful enough to serve as the basis for the HTML parser.

This module implements the SGMLParser() class.

SGMLParser()

The SGMLParser class is instantiated without arguments. The parser is hardcoded to recognize the following constructs:

  1. Opening and closing tags of the form <tag attr="value" …> and </tag>, respectively.

  2. Numeric character references of the form &#name;.

  3. Entity references of the form &name;.

  4. SGML comments of the form <!--text-->. Note that spaces, tabs, and newlines are allowed between the trailing > and the immediately preceding --.

SGMLParser instances have the following interface methods (note that the interface is similar to the xmllib one):

reset()—  Resets the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

setnomoretags()—  Stops processing tags and treats all following input as literal input (CDATA). (This is only provided so that the HTML tag <PLAINTEXT> can be implemented.)

setliteral()—  Enters literal mode (CDATA mode).

feed(data)—  Feeds some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.

close()—  Forces processing of all buffered data as if it were followed by an end-of-file mark. This method can be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the base class's close() method (SGMLParser.close()).

handle_starttag(tag, method, attributes)—  This method is called to handle start tags for which either a start_tag() or do_tag() method has been defined. The tag argument is the name of the tag converted to lowercase, and the method argument is the bound method that should be used to support semantic interpretation of the start tag. The attributes argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name has been translated to lowercase, and double quotes and backslashes in the value have been interpreted. For instance, for the tag <A HREF="http://www.cwi.nl/">, the attributes list would be [('href', 'http://www.cwi.nl/')]; if no handler were defined for the tag, the parser would instead call unknown_starttag('a', [('href', 'http://www.cwi.nl/')]). The base implementation simply calls method with attributes as the only argument.

handle_endtag(tag, method)—  This method is called to handle end tags for which an end_tag() method has been defined. The tag argument is the name of the tag converted to lowercase, and the method argument is the bound method that should be used to support semantic interpretation of the end tag. If no end_tag() method is defined for the closing element, this handler is not called. The base implementation simply calls method.

handle_data(data)—  This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

handle_charref(ref)—  This method is called to process a character reference of the form &#ref;. In the base implementation, ref must be a decimal number in the range 0–255. It translates the character to ASCII and calls the method handle_data() with the character as argument. If ref is invalid or out of range, the method unknown_charref(ref) is called to handle the error. A subclass must override this method to provide support for named character entities.

handle_entityref(ref)—  This method is called to process a general entity reference of the form &ref;, where ref is the name of a general entity. It looks for ref in the instance (or class) variable entitydefs, which should be a mapping from entity names to corresponding translations. If a translation is found, it calls the method handle_data() with the translation; otherwise, it calls the method unknown_entityref(ref). The default entitydefs defines translations for &amp;, &apos;, &gt;, &lt;, and &quot;.

handle_comment(comment)—  This method is called when a comment is encountered. The comment argument is a string containing the text between the <!-- and --> delimiters, but not the delimiters themselves. For example, the comment <!--text--> will cause this method to be called with the argument 'text'. The default method does nothing.

report_unbalanced(tag)—  This method is called when an end tag is found that does not correspond to any open element.

Tip

To handle all tags in your code, you need to override the following two methods: unknown_starttag() and unknown_endtag().

unknown_starttag(tag, attributes)—  This method is called to process an unknown start tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_endtag(tag)—  This method is called to process an unknown end tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_charref(ref)—  This method is called to process unresolvable numeric character references. Refer to handle_charref() to determine what is handled by default. It is intended to be overridden by a derived class; the base class implementation does nothing.

unknown_entityref(ref)—  This method is called to process an unknown entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
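The fallback behavior described for handle_charref() and handle_entityref() can be sketched without the parser itself. The snippet below is only an illustration and is not sgmllib's code: the names resolve_charref, resolve_entityref, and ENTITYDEFS are hypothetical stand-ins, and returning None stands in for the call to unknown_charref() or unknown_entityref().

```python
# Hypothetical stand-in for the parser's entitydefs mapping.
ENTITYDEFS = {"amp": "&", "apos": "'", "gt": ">", "lt": "<", "quot": '"'}


def resolve_charref(ref):
    """Return the character for a decimal reference in 0-255, else None."""
    try:
        code = int(ref)
    except ValueError:
        return None          # not a decimal number: the "unknown" path
    if not 0 <= code <= 255:
        return None          # out of range: the "unknown" path
    return chr(code)


def resolve_entityref(ref, table=ENTITYDEFS):
    """Return the translation for a named entity, else None."""
    return table.get(ref)


print(resolve_charref("65"))
print(resolve_charref("9999"))
print(resolve_entityref("lt"))
print(resolve_entityref("nbsp"))
```

A subclass that needs more entities would, under the same assumption, simply supply a larger table.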

Apart from overriding or extending the methods listed previously, derived classes can also define methods of the following form to define processing of specific tags. Tag names in the input stream are case independent; the tag occurring in method names must be in lowercase:

start_tag(attributes)—  This method is called to process an opening tag. It has precedence over do_tag(). The attributes argument has the same meaning as described for handle_starttag() previously.

do_tag(attributes)—  This method is called to process an opening tag that does not come with a matching closing tag. The attributes argument has the same meaning as described for handle_starttag() previously.

end_tag()—  This method is called to process a closing tag.

Note that the parser maintains a stack of open elements for which no end tag has been found yet. Only tags processed by start_tag() are pushed on this stack. Defining an end_tag() method is optional for these tags. For tags processed by do_tag() or by unknown_starttag(), no end_tag() method needs to be defined; if one is defined, it will not be used. If both start_tag() and do_tag() methods exist for a tag, the start_tag() method takes precedence.
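The naming convention and the stack rule above can be modeled with a small toy class. This is a simplified sketch rather than the sgmllib implementation: TagDispatcher, open_tag, close_tag, and the events list are hypothetical names introduced for illustration, and elements left open inside a closed element are simply discarded.

```python
class TagDispatcher:
    """Toy model of name-based tag dispatch; not the sgmllib implementation."""

    def __init__(self):
        self.stack = []    # open elements handled by start_<tag>()
        self.events = []   # record of what was called, for inspection

    def open_tag(self, tag, attributes):
        tag = tag.lower()
        start = getattr(self, "start_" + tag, None)
        if start is not None:          # start_<tag>() takes precedence
            self.stack.append(tag)
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:             # no closing tag expected: not stacked
            do(attributes)
            return
        self.events.append(("unknown_start", tag))

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.events.append(("unbalanced", tag))
            return
        # Simplification: drop the element and anything still open inside it.
        while self.stack.pop() != tag:
            pass
        end = getattr(self, "end_" + tag, None)
        if end is not None:            # end_<tag>() is optional
            end()


class Demo(TagDispatcher):
    def start_h1(self, attributes):
        self.events.append(("start_h1", attributes))

    def end_h1(self):
        self.events.append(("end_h1",))

    def do_br(self, attributes):
        self.events.append(("do_br", attributes))


demo = Demo()
demo.open_tag("H1", [])
demo.open_tag("BR", [])
demo.close_tag("h1")
demo.close_tag("p")
print(demo.events)
```

Looking handlers up with getattr() is what lets a subclass add support for a tag just by defining a method with the right name.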

The following example opens an SGML file and collects the information regarding the page title.

						
import sgmllib

filename = "index.html"

class CleanExit(Exception):
    """Raised to stop parsing as soon as the title has been read."""
    pass

class Titlefinder(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def start_title(self, attributes):
        # Start collecting character data once <title> is seen.
        self.data = []

    def end_title(self):
        # </title> reached: join the collected pieces and stop parsing.
        self.title = "".join(self.data)
        raise CleanExit

    def handle_data(self, data):
        # Data is kept only while a title element is open.
        if self.data is not None:
            self.data.append(data)

def get_title(filehandle):
    # Feed the file to the parser in 1KB blocks until the title is found.
    parser = Titlefinder()
    try:
        while 1:
            sgmldata = filehandle.read(1024)
            if not sgmldata:
                break
            parser.feed(sgmldata)
        parser.close()
    except CleanExit:
        return parser.title
    return None

filehandle = open(filename)
title = get_title(filehandle)
filehandle.close()

print "The page's title is: %s" % (title)
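
The sgmllib module is not part of Python 3. As a rough counterpart, and assuming a Python 3 interpreter, the standard html.parser module offers a similar callback interface (handle_starttag(), handle_endtag(), handle_data()). The sketch below applies the same title-collecting idea; the names TitleFinder, find_title, and sample are illustrative only.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.title = "".join(self.parts)

    def handle_data(self, data):
        # Character data may arrive in pieces, so accumulate it.
        if self.in_title:
            self.parts.append(data)


def find_title(markup):
    """Return the title text, or None when the markup has no title."""
    finder = TitleFinder()
    finder.feed(markup)
    finder.close()
    return finder.title


sample = "<html><head><title>Example page</title></head><body>Hi</body></html>"
print(find_title(sample))
```

Unlike the listing above, this version parses the whole input instead of raising an exception to stop early.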

					

htmllib

This module defines a parser class that can serve as a base for parsing text files formatted in the Hypertext Markup Language (HTML). The class is not directly concerned with I/O—it must be provided with input in string form via a method, and makes calls to methods of a formatter object in order to produce output. The HTMLParser class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the SGMLParser class defined in module sgmllib. The HTMLParser implementation supports the HTML 2.0 language as described in RFC 1866. Two implementations of formatter objects are provided in the formatter module.

The following is a summary of the interface defined by sgmllib.SGMLParser:

  1. The interface to feed data to an instance is through the feed() method, which takes a string argument. This can be called with as little or as much text at a time as desired; "p.feed(a); p.feed(b)" has the same effect as "p.feed(a+b)". When the data contains complete HTML tags, these are processed immediately; incomplete elements are saved in a buffer. To force processing of all unprocessed data, call the close() method.

     For example, to parse the entire contents of a file, use

         parser.feed(open('myfile.html').read())
         parser.close()

  2. The interface to define semantics for HTML tags is very simple: derive a class and define methods called start_tag(), end_tag(), or do_tag(). The parser will call these at appropriate moments: start_tag() or do_tag() is called when an opening tag of the form <tag ...> is encountered; end_tag() is called when a closing tag of the form </tag> is encountered. If an opening tag requires a corresponding closing tag, such as <H1>... </H1>, the class should define the start_tag() method; if a tag requires no closing tag, such as <P>, the class should define the do_tag() method.
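
The claim that feeding data in pieces is equivalent to feeding it all at once can be checked with a small experiment. Because htmllib is not available in Python 3, this sketch assumes a Python 3 interpreter and uses the standard html.parser module, whose feed() and close() methods follow the same buffering contract; EventLog and parse_in_chunks are hypothetical names. Character data may be delivered in several calls, so the text is joined before comparing.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record tag events and accumulate character data."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

    def handle_data(self, data):
        self.text.append(data)


def parse_in_chunks(chunks):
    """Feed each chunk in turn and return (tag events, joined text)."""
    log = EventLog()
    for chunk in chunks:
        log.feed(chunk)
    log.close()
    return log.tags, "".join(log.text)


# The second chunk ends in the middle of a closing tag on purpose.
a, b, c = "<h1>Hel", "lo</h", "1><p>world"
whole = parse_in_chunks([a + b + c])
pieces = parse_in_chunks([a, b, c])
print(whole == pieces)
```

The incomplete "</h" at the end of the second chunk is held back until the third chunk completes the tag.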

This module defines a single class: HTMLParser(formatter). This is the basic HTML parser class. It supports all entity names required by the HTML 2.0 specification (RFC 1866). It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements. In addition to tag methods, the HTMLParser class provides some additional methods and instance variables for use within tag methods. They are as follows:

formatter—  This is the formatter instance associated with the parser.

nofill—  This Boolean flag should be true when whitespace should not be collapsed, or false when it should be. In general, this should only be true when character data is to be treated as "preformatted" text, as within a <PRE> element. The default value is false. This affects the operation of handle_data() and save_end().

anchor_bgn(href, name, type)—  This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the href attribute) within the document. The list of hyperlinks is available as the data attribute anchorlist.

anchor_end()—  This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().

handle_image(source, alt[, ismap[, align[, width[, height]]]])—  This method is called to handle images. The default implementation simply passes the alt value to the handle_data() method.

save_bgn()—  Begins saving character data in a buffer instead of sending it to the formatter object. Retrieve the stored data via save_end(). Use of the save_bgn()/save_end() pair cannot be nested.

save_end()—  Ends buffering character data and returns all data saved since the preceding call to save_bgn(). If the nofill flag is false, whitespace is collapsed to single spaces. A call to this method without a preceding call to save_bgn() will raise a TypeError exception.

The following example is a CGI script that outputs to a Web page the Web links found in a given HTML file.

						
import htmllib
import formatter, cgi

form = cgi.FieldStorage()

# Fall back to index.html when no "filename" field was submitted.
try:
    myfile = form["filename"].value
except KeyError:
    myfile = "index.html"

class ParserClass(htmllib.HTMLParser):
    def __init__(self, verbose=0):
        self.anchors = {}    # maps link text to a list of hrefs
        fmt = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, fmt, verbose)

    def anchor_bgn(self, href, name, type):
        # Buffer the text inside the anchor and remember its target.
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        tagtext = self.save_end().strip()
        if self.anchor and tagtext:
            self.anchors[tagtext] = self.anchors.get(tagtext, []) + \
                                      [self.anchor]

htmlfile = open(myfile)
htmldata = htmlfile.read()
htmlfile.close()

parserobj = ParserClass()
parserobj.feed(htmldata)
parserobj.close()

print "Content-type: text/html\n"

# Print each link text followed by the list of URLs it points to.
for key in parserobj.anchors.keys():
    print key, parserobj.anchors[key]

					

htmlentitydefs

The htmlentitydefs module contains a dictionary called entitydefs that contains all the definitions for the general entities defined by HTML 2.0, as demonstrated next:

						
import htmlentitydefs
entitydefs = htmlentitydefs.entitydefs
# Print every entity name together with its replacement text.
for key in entitydefs.keys():
    print key, " = ", entitydefs[key]
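
One use for such a table is expanding entity references in a string. In Python 3 the module is named html.entities; assuming a Python 3 interpreter, the sketch below substitutes every &name; reference found in entitydefs and leaves unknown names untouched. The function name expand_entities and the regular expression are choices made for this example.

```python
import re
from html.entities import entitydefs


def expand_entities(text):
    """Replace &name; references listed in entitydefs; leave others alone."""
    def substitute(match):
        name = match.group(1)
        # Fall back to the original text when the name is not defined.
        return entitydefs.get(name, match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


print(expand_entities("Fish &amp; chips &lt;fresh&gt; &bogus;"))
```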

					

formatter

The formatter module is used for generic output formatting by the HTMLParser class of the htmllib module. This module defines two interfaces, formatter and writer, each with multiple implementations.

Formatter objects transform an abstract flow of formatting events into specific output events on writer objects. Formatters manage several stack structures to allow various properties of a writer object to be changed and restored; writers need not be able to handle relative changes nor any sort of "change back" operation. Specific writer properties which can be controlled via formatter objects are horizontal alignment, font, and left margin indentations. A mechanism is provided that supports providing arbitrary, non-exclusive style settings to a writer as well. Additional interfaces facilitate formatting events that are not reversible, such as paragraph separation. The writer interface is required by the formatter interface.

Writer objects encapsulate device interfaces. Abstract devices, such as file formats, are supported as well as physical devices. The provided implementations all work with abstract devices. The interface makes available mechanisms for setting the properties that formatter objects manage and inserting data into the output.

The Formatter Interface

Interfaces to create formatters are dependent on the specific formatter class being instantiated. The interfaces described below are the required interfaces, which all formatters must support once initialized.

One data element is defined at the module level: AS_IS. This value can be used in the font specification passed to the push_font() method described below, or as the new value to any other push_property() method. Pushing the AS_IS value allows the corresponding pop_property() method to be called without having to track whether the property was changed.

The following attributes are defined for formatter instance objects:

writer—  The writer instance with which the formatter interacts.

end_paragraph(blanklines)—  Closes any open paragraphs and inserts at least blanklines blank lines before the next paragraph.

add_line_break()—  Adds a hard line break if one does not already exist. This does not break the logical paragraph.

add_hor_rule(*args, **kw)—  Inserts a horizontal rule in the output. A hard break is inserted if data is in the current paragraph, but the logical paragraph is not broken. The arguments and keywords are passed on to the writer's send_hor_rule() method.

add_flowing_data(data)—  Provides data that should be formatted with collapsed whitespace. Whitespace from preceding and successive calls to add_flowing_data() is considered as well when the whitespace collapse is performed. The data that is passed to this method is expected to be word wrapped by the output device. Note that any word wrapping still must be performed by the writer object because of the need to rely on device and font information.

add_literal_data(data)—  Provides data that should be passed to the writer unchanged. Whitespace, including newline and tab characters, is considered legal in the value of data.

add_label_data(format, counter)—  Inserts a label that should be placed to the left of the current left margin. This should be used for constructing bulleted or numbered lists. If the format value is a string, it is interpreted as a format specification for counter, which should be an integer. The result of this formatting becomes the value of the label; if format is not a string, it is used as the label value directly. The label value is passed as the only argument to the writer's send_label_data() method. Interpretation of nonstring label values is dependent on the associated writer.

Format specifications are strings that, in combination with a counter value, are used to compute label values. Each character in the format string is copied to the label value, with some characters recognized to indicate a transformation on the counter value. Specifically, the character 1 represents the counter value formatted as an Arabic number, the characters A and a represent alphabetic representations of the counter value in upper- and lowercase, respectively, and I and i represent the counter value in Roman numerals, in upper- and lowercase. Note that the alphabetic and Roman transformations require that the counter value be greater than zero.
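
The format rules above can be sketched as a small stand-alone function. This is an illustration of the described behavior, not the formatter module's own code: format_label, to_letters, to_roman, and ROMAN are hypothetical names, and raising ValueError for a non-positive counter is this sketch's choice, since the text only states that such values are not supported.

```python
ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_letters(counter):
    """Map 1 to a, 26 to z, 27 to aa, and so on (lowercase)."""
    label = ""
    while counter > 0:
        counter, index = divmod(counter - 1, 26)
        label = chr(ord("a") + index) + label
    return label


def to_roman(counter):
    """Return the counter as a lowercase Roman numeral."""
    label = ""
    for value, numeral in ROMAN:
        while counter >= value:
            label += numeral
            counter -= value
    return label


def format_label(fmt, counter):
    """Copy fmt to the label, transforming the characters 1, A, a, I, and i."""
    parts = []
    for char in fmt:
        if char == "1":
            parts.append(str(counter))
        elif char in "aAiI":
            if counter <= 0:
                raise ValueError("counter must be greater than zero")
            text = to_letters(counter) if char in "aA" else to_roman(counter)
            parts.append(text.upper() if char.isupper() else text)
        else:
            parts.append(char)       # any other character is copied as is
    return "".join(parts)


for fmt, counter in [("1.", 3), ("(a)", 2), ("A", 27), ("I.", 4), ("i)", 9)]:
    print(fmt, counter, format_label(fmt, counter))
```

Note that every occurrence of the special characters is transformed, so a format such as "Item 1" would also rewrite its leading I.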

flush_softspace()—  Sends any pending whitespace buffered from a previous call to add_flowing_data() to the associated writer object. This should be called before any direct manipulation of the writer object.

push_alignment(align)—  Pushes a new alignment setting onto the alignment stack. This might be AS_IS if no change is desired. If the alignment value is changed from the previous setting, the writer's new_alignment() method is called with the align value.

pop_alignment()—  Restores the previous alignment.

push_font((size, italic, bold, teletype))—  Changes some or all font properties of the writer object. Properties that are not set to AS_IS are set to the values passed in, whereas others are maintained at their current settings. The writer's new_font() method is called with the fully resolved font specification.

pop_font()—  Restores the previous font.

push_margin(margin)—  Increases the number of left margin indentations by one, associating the logical tag margin with the new indentation. The initial margin level is 0. Changed values of the logical tag must be true values; false values other than AS_IS are not sufficient to change the margin.

pop_margin()—  Restores the previous margin.

push_style(*styles)—  Pushes any number of arbitrary style specifications. All styles are pushed onto the styles stack in order. A tuple representing the entire stack, including AS_IS values, is passed to the writer's new_styles() method.

pop_style([n = 1])—  Pops the last n style specifications passed to push_style(). A tuple representing the revised stack, including AS_IS values, is passed to the writer's new_styles() method.

set_spacing(spacing)—  Sets the spacing style for the writer.

assert_line_data([flag = 1])—  Informs the formatter that data has been added to the current paragraph out-of-band. This should be used when the writer has been manipulated directly. The optional flag argument can be set to false if the writer manipulations produced a hard line break at the end of the output.
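
The push/pop pattern with AS_IS can be illustrated with a toy font stack. The sketch below is not the AbstractFormatter implementation: FontStack and its changes list are hypothetical, AS_IS here is a local sentinel rather than the module's value, and resolving AS_IS to None on an empty stack is an assumption made to keep the example short.

```python
AS_IS = object()   # local stand-in for the module-level AS_IS value


class FontStack:
    """Toy font stack; `changes` records what a writer would be told."""

    def __init__(self):
        self.stack = []
        self.changes = []

    def push_font(self, font):
        if self.stack:
            current = self.stack[-1]
        else:
            current = (None, None, None, None)
        # Fields given as AS_IS inherit the value currently in effect.
        resolved = tuple(
            old if new is AS_IS else new
            for new, old in zip(font, current)
        )
        self.stack.append(resolved)
        self.changes.append(resolved)

    def pop_font(self):
        if self.stack:
            self.stack.pop()
        # Restore the previous font, or None when nothing is left.
        restored = self.stack[-1] if self.stack else None
        self.changes.append(restored)


fonts = FontStack()
fonts.push_font(("h1", 0, 1, 0))            # size, italic, bold, teletype
fonts.push_font((AS_IS, 1, AS_IS, AS_IS))   # switch on italic only
fonts.pop_font()
fonts.pop_font()
print(fonts.changes)
```

Because every push adds exactly one entry, even when nothing changes, callers can always pair a push with a pop without tracking state themselves.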

Formatter Implementations

Two implementations of formatter objects are provided by this module. Most applications can use one of these classes without modification or subclassing.

NullFormatter([writer])—  A formatter that does nothing. If writer is omitted, a NullWriter instance is created. No methods of the writer are called by NullFormatter instances. Formatter implementations that work with a writer but do not need to inherit any behavior should derive from this class.

AbstractFormatter(writer)—  The standard formatter. This implementation has demonstrated wide applicability to many writers, and can be used directly in most circumstances. It has been used to implement a full-featured WWW browser.

The Writer Interface

Interfaces to create writers are dependent on the specific writer class being instantiated. The interfaces described as follows are the required interfaces that all writers must support once initialized. Although most applications can use the AbstractFormatter class as a formatter, the writer must typically be provided by the application.

flush()—  Flushes any buffered output or device control events.

new_alignment(align)—  Sets the alignment style. The align value can be any object, but by convention is a string or None, where None indicates that the writer's preferred alignment should be used. Conventional align values are left, center, right, and justify.

new_font(font)—  Sets the font style. The value of font will be None, indicating that the device's default font should be used, or a tuple of the form (size, italic, bold, teletype). Size will be a string indicating the size of font that should be used; specific strings and their interpretation must be defined by the application. The italic, bold, and teletype values are Boolean indicators specifying which of those font attributes should be used.

new_margin(margin, level)—  Sets the margin level to the integer level and the logical tag to margin. Interpretation of the logical tag is at the writer's discretion; the only restriction on the value of the logical tag is that it not be a false value for non-zero values of level.

new_spacing(spacing)—  Sets the spacing style to spacing.

new_styles(styles)—  Sets additional styles. The styles value is a tuple of arbitrary values; the value AS_IS should be ignored. The styles tuple can be interpreted either as a set or as a stack depending on the requirements of the application and writer implementation.

send_line_break()—  Breaks the current line.

send_paragraph(number)—  Produces a paragraph separation of at least the given number of blank lines, or the equivalent. The number value will be an integer. Note that the implementation will receive a call to send_line_break() before this call if a line break is needed; this method should not include ending the last line of the paragraph. It is only responsible for vertical spacing between paragraphs.

send_hor_rule(*args, **kw)—  Displays a horizontal rule on the output device. The arguments to this method are entirely application- and writer-specific, and should be interpreted with care. The method implementation can assume that a line break has already been issued via send_line_break().

send_flowing_data(data)—  Outputs character data that might be word wrapped and re-flowed as needed. Within any sequence of calls to this method, the writer can assume that spans of multiple whitespace characters have been collapsed to single space characters.

send_literal_data(data)—  Outputs character data that has already been formatted for display. Generally, this should be interpreted to mean that line breaks indicated by newline characters should be preserved and no new line breaks should be introduced. The data can contain embedded newline and tab characters, unlike data provided to the send_flowing_data() interface.

send_label_data(data)—  Sets data to the left of the current left margin, if possible. The value of data is not restricted; treatment of non-string values is entirely application- and writer-dependent. This method will only be called at the beginning of a line.

Writer Implementations

Three implementations of the writer object interface are provided as examples by this module. Most applications will need to derive new writer classes from the NullWriter class.

NullWriter()—  A writer that only provides the interface definition; no actions are taken on any methods. This should be the base class for all writers that do not need to inherit any implementation methods.

AbstractWriter()—  A writer that can be used in debugging formatters, but not much else. Each method simply announces itself by printing its name and arguments on standard output.

DumbWriter([file[, maxcol = 72]])—  A simple writer class that writes output on the file object passed in as file or, if file is omitted, on standard output. The output is simply word wrapped to the number of columns specified by maxcol. This class is suitable for reflowing a sequence of paragraphs.
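
To give a feel for what a word-wrapping writer involves, here is a stand-alone sketch in the spirit of DumbWriter. It does not import the formatter module and is not that module's code: WrappingWriter is a hypothetical class that borrows three method names from the writer interface and is driven by hand rather than by a formatter.

```python
import io


class WrappingWriter:
    """Hypothetical writer that wraps flowing data at maxcol columns."""

    def __init__(self, file, maxcol=72):
        self.file = file
        self.maxcol = maxcol
        self.col = 0                 # current output column

    def send_line_break(self):
        self.file.write("\n")
        self.col = 0

    def send_paragraph(self, number):
        # Vertical spacing only; the line itself is ended by send_line_break().
        self.file.write("\n" * number)
        self.col = 0

    def send_flowing_data(self, data):
        for word in data.split():
            # Start a new line when the word plus a space would not fit.
            if self.col > 0 and self.col + 1 + len(word) > self.maxcol:
                self.file.write("\n")
                self.col = 0
            if self.col > 0:
                self.file.write(" ")
                self.col += 1
            self.file.write(word)
            self.col += len(word)


out = io.StringIO()
writer = WrappingWriter(out, maxcol=10)
writer.send_flowing_data("aaa bbb ccc ddd")
writer.send_line_break()
print(out.getvalue())
```

A real writer would normally derive from NullWriter so that the remaining interface methods are present as no-ops.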

Using the Formatter Module

The following example removes all tags from an HTML file, leaving only the plain text.

						
1: from htmllib import HTMLParser
2: from formatter import AbstractFormatter, DumbWriter
3: htmlfile = open("stuff.html")
4: parser = HTMLParser(AbstractFormatter(DumbWriter()))
5: parser.feed(htmlfile.read())
6: parser.close()
7: htmlfile.close()

					

The DumbWriter class is used here to write all the non-tag contents of htmlfile to standard output.

Note that the data read from the file opened in line 3 can also come from a URL. You just need to import and use the urllib.urlopen function, like this:

						
from urllib import urlopen
htmlfile = urlopen('http://www.lessaworld.com/') 

					


Last updated on 1/30/2002
Python Developer's Handbook, © 2002 Sams Publishing


© 2002, O'Reilly & Associates, Inc.