Handling Other Markup Languages
The first part of this chapter covers XML, which undoubtedly holds great
promise for the future of the Internet.
The following pages describe additional modules that
support two other data format standards commonly used on the Internet: SGML and
HTML.
sgmllib
The sgmllib module implements a parser for a subset of SGML.
Although it has a simple implementation, it is powerful enough to build the
HTML parser.
This module implements the
SGMLParser() class.
SGMLParser()
The SGMLParser class is instantiated without
arguments. The parser is hardcoded to recognize the following
constructs:
-
Opening and closing tags of the form <tag attr="value" ...> and </tag>,
respectively.
-
Numeric character references of the form
&#name;.
-
Entity references of the form
&name;.
-
SGML comments of the form
<!--text-->. Note that spaces, tabs, and newlines are
allowed between the trailing > and the immediately
preceding -.
SGMLParser instances have the following
interface
methods (note that the interface is similar to the xmllib
one):
reset()
Resets the instance. Loses all unprocessed data. This is
called implicitly at instantiation time.
setnomoretags()
Stops processing tags. Treats all following input as literal
input (CDATA). (This is only provided so that the HTML tag
<PLAINTEXT> can be implemented.)
setliteral()
Enters literal mode (CDATA mode).
feed(data)
Feeds some text to the parser. It is processed insofar as it
consists of complete elements; incomplete data is buffered until more data is
fed or close() is called.
close()
Forces processing of all buffered data as if it were followed
by an end-of-file mark. This method can be redefined by a derived class to
define additional processing at the end of the input, but the redefined version
should always call the base class close() method.
handle_starttag(tag, method,
attributes)
This method is called to handle start tags for which either a
start_tag() or do_tag() method has been
defined. The tag argument is the name of the tag converted to lowercase, and
the method argument is the bound method that should be used to support semantic
interpretation of the start tag. The attributes argument is a list of (name,
value) pairs containing the attributes found inside the tag's
<> brackets. The name has been translated to
lowercase, and double quotes and backslashes in the value have been
interpreted. For instance, for the tag <A HREF="http://www.cwi.nl/">, this
method would be called as handle_starttag('a', method, [('href',
'http://www.cwi.nl/')]), where method is the start_a() or do_a() method. The base
implementation simply calls method with attributes as the only
argument.
handle_endtag(tag,
method)
This method is called to handle end tags for which an
end_tag() method has been defined. The tag argument is the
name of the tag converted to lowercase, and the method argument is the bound
method that should be used to support semantic interpretation of the end tag.
If no end_tag() method is defined for the closing element,
this handler is not called. The base implementation simply calls method.
handle_data(data)
This method is called to process arbitrary data. It is
intended to be overridden by a derived class; the base class implementation
does nothing.
handle_charref(ref)
This method is called to process a character reference of the
form &#ref;. In the base implementation,
ref must be a decimal number in the range 0-255. It
translates the character to ASCII and calls the method
handle_data() with the character as argument. If
ref is invalid or out of range, the method
unknown_charref(ref) is called to
handle the error. A subclass must override this method to provide support for
named character entities.
handle_entityref(ref)
This method is called to process a general entity reference
of the form &ref;, where ref is a
general entity reference. It looks for ref in the instance
(or class) variable entitydefs that should be a mapping from entity names to
corresponding translations. If a translation is found, it calls the method
handle_data() with the translation; otherwise, it calls the
method unknown_entityref(ref). The
default entitydefs defines translations for
&amp;, &apos;, &gt;, &lt;,
and &quot;.
handle_comment(comment)
This method is called when a comment is encountered. The
comment argument is a string containing the text between the
<!-- and --> delimiters, but not the
delimiters themselves. For example, the comment
<!--text--> will cause this method to be called with the
argument text. The default method does nothing.
report_unbalanced(tag)
This method is called when an end tag is found that does not
correspond to any open element.
Tip
To handle all tags in your code, you need to override the
following two methods: unknown_starttag() and
unknown_endtag().
unknown_starttag(tag,
attributes)
This method is called to process an unknown start tag. It is
intended to be overridden by a derived class; the base class implementation
does nothing.
unknown_endtag(tag)
This method is called to process an unknown end tag. It is
intended to be overridden by a derived class; the base class implementation
does nothing.
unknown_charref(ref)
This method is called to process unresolvable numeric
character references. Refer to handle_charref() to determine
what is handled by default. It is intended to be overridden by a derived class;
the base class implementation does nothing.
unknown_entityref(ref)
This method is called to process an unknown entity reference.
It is intended to be overridden by a derived class; the base class
implementation does nothing.
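The reference handling described above can be sketched without sgmllib itself. What follows is a minimal, standalone illustration under stated assumptions, not the module's source: the class name ReferenceResolver and the data and unknown lists are hypothetical, and the unknown_*() methods record the reference only so that the result can be inspected (the base class versions do nothing).

```python
# Illustrative sketch only; this is not sgmllib's implementation.
class ReferenceResolver(object):
    """Hypothetical stand-in for a parser's reference handlers."""

    # Same five names as the default entitydefs described above.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # references that could not be resolved

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        # Accept only decimal numbers in the range 0-255.
        try:
            number = int(ref)
        except ValueError:
            self.unknown_charref(ref)
            return
        if not 0 <= number <= 255:
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))

    def handle_entityref(self, ref):
        # Look the name up in the entitydefs mapping.
        if ref in self.entitydefs:
            self.handle_data(self.entitydefs[ref])
        else:
            self.unknown_entityref(ref)

    def unknown_charref(self, ref):
        self.unknown.append("&#%s;" % ref)

    def unknown_entityref(self, ref):
        self.unknown.append("&%s;" % ref)


resolver = ReferenceResolver()
resolver.handle_charref("65")       # decimal reference for "A"
resolver.handle_entityref("amp")    # known entity
resolver.handle_charref("9999")     # out of range
resolver.handle_entityref("nbsp")   # not in the default mapping
text = "".join(resolver.data)
```

In this sketch, the two resolvable references end up in text, while the other two are routed to the unknown_*() methods, which is where a derived class would add its own handling.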
Apart from overriding or extending the methods listed previously,
derived classes can also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the tag occurring in method names must be in lowercase:
start_tag(attributes)
This method is called to process an opening tag. It has
precedence over do_tag(). The attributes argument has the
same meaning as described for handle_starttag()
previously.
do_tag(attributes)
This method is called to process an opening tag that does not
come with a matching closing tag. The attributes argument has the same meaning
as described for handle_starttag() previously.
end_tag()
This method is called to process a closing tag.
Note that the parser maintains a stack of open elements for which
no end tag has been found yet. Only tags processed by
start_tag() are pushed on this stack. Definition of an
end_tag() method is optional for these tags. For tags
processed by do_tag() or by
unknown_tag(), no end_tag() method must
be defined; if defined, it will not be used. If both
start_tag() and do_tag() methods exist
for a tag, the start_tag() method takes precedence.
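The lookup order just described can be illustrated with a small standalone sketch. It assumes a hypothetical TagDispatcher class with an open_tag() method and a log list, none of which belong to sgmllib; only the method-naming scheme (start_tag(), do_tag(), unknown_starttag()) comes from the text above.

```python
class TagDispatcher(object):
    """Hypothetical dispatcher that mimics the lookup order described above."""

    def __init__(self):
        self.stack = []   # open elements handled by start_tag() methods
        self.log = []     # which kind of handler ran for each tag

    def open_tag(self, tag, attributes):
        tag = tag.lower()   # tag names in the input are case independent
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            # Only tags processed by start_tag() are pushed on the stack.
            self.stack.append(tag)
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown", tag))

    # Tag-specific handlers; the tag part of the name must be lowercase.
    def start_h1(self, attributes):
        self.log.append(("start", "h1"))

    def do_br(self, attributes):
        self.log.append(("do", "br"))


dispatcher = TagDispatcher()
dispatcher.open_tag("H1", [])      # handled by start_h1() and pushed
dispatcher.open_tag("BR", [])      # handled by do_br(), not pushed
dispatcher.open_tag("blink", [])   # no handler: unknown_starttag()
```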
The following example opens an SGML file and collects the information
regarding the page title.
import sgmllib
import string

filename = "index.html"

class CleanExit(Exception):
    # Raised to stop parsing as soon as the title has been read.
    pass

class Titlefinder(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def start_title(self, attributes):
        # Start collecting the character data inside <title>.
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise CleanExit

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

def get_title(filehandle):
    Parser = Titlefinder()
    try:
        while 1:
            sgmldata = filehandle.read(1024)
            if not sgmldata:
                break
            Parser.feed(sgmldata)
        Parser.close()
    except CleanExit:
        return Parser.title
    # No <title> element was found.
    return None

filehandle = open(filename)
title = get_title(filehandle)
filehandle.close()
print "The page's title is: %s" % (title)
htmllib
This module defines a parser class that can serve as a base for
parsing text files formatted in the Hypertext Markup Language
(HTML). The class is not directly concerned with I/O; it must
be provided with input in string form via a method, and makes calls to methods
of a formatter object in order to produce output. The
HTMLParser class is designed to be used
as a base class for other classes in order to add functionality, and allows
most of its methods to be extended or overridden. In turn, this class is
derived from and extends the
SGMLParser class defined in module
sgmllib. The HTMLParser implementation
supports the HTML 2.0 language as described in RFC 1866. Two implementations of
formatter objects are provided in the formatter
module.
The following is a summary of the
interface defined by
sgmllib.SGMLParser:
-
a. The interface to feed data to an instance is through the
feed() method, which takes a string
argument. This can be called with as little or as much text at a time as
desired; "p.feed(a); p.feed(b)" has the same effect as
"p.feed(a+b)". When the data contains complete HTML tags,
these are processed immediately; incomplete elements are saved in a buffer. To
force processing of all unprocessed data, call the
close() method.
-
For example, to parse the entire contents of a file, use
parser.feed(open('myfile.html').read())
parser.close()
-
b. The interface to define semantics for HTML tags is very
simple: derive a class and define methods called
start_tag(),
end_tag(), or
do_tag(). The parser will call these
at appropriate moments: start_tag() or
do_tag() is called when an opening tag of the form
<tag ...> is encountered; end_tag()
is called when a closing tag of the form </tag> is
encountered. If an opening tag requires a corresponding closing tag, such as
<H1>... </H1>, the class should define the
start_tag() method; if a tag requires no closing tag, such
as <P>, the class should define the
do_tag() method.
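The buffering behavior of feed() and close() summarized in item a can be pictured with a standalone sketch. This is a deliberately naive model under stated assumptions, not the parser's real scanning code: BufferingFeeder, its events list, and the rule "a tag is complete once its closing > has arrived" are all hypothetical simplifications.

```python
class BufferingFeeder(object):
    """Hypothetical feeder: emits complete pieces, buffers a partial tag."""

    def __init__(self):
        self.rawdata = ""   # unprocessed input
        self.events = []    # ("tag", text) or ("text", text) tuples

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        # Treat the buffered data as if end-of-file followed it.
        self._process(final=True)

    def _process(self, final):
        data = self.rawdata
        pos = 0
        while pos < len(data):
            if data[pos] == "<":
                end = data.find(">", pos)
                if end < 0:
                    break   # incomplete tag: wait for more data
                self.events.append(("tag", data[pos:end + 1]))
                pos = end + 1
            else:
                end = data.find("<", pos)
                if end < 0:
                    end = len(data)
                self.events.append(("text", data[pos:end]))
                pos = end
        if final and pos < len(data):
            # At end of input, flush whatever is left as plain text.
            self.events.append(("text", data[pos:]))
            pos = len(data)
        self.rawdata = data[pos:]


# Feed one tag in two pieces...
chunked = BufferingFeeder()
chunked.feed("<h1 cla")
pending = list(chunked.events)   # nothing processed yet
chunked.feed('ss="x">Title</h1>')
chunked.close()

# ...and the same input in one piece.
whole = BufferingFeeder()
whole.feed('<h1 class="x">Title</h1>')
whole.close()
```

Splitting the input in the middle of a tag produces the same sequence of events as feeding it whole, because the partial tag stays in the buffer until the rest arrives.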
This module defines a single class:
HTMLParser(formatter). This is the basic HTML parser class.
It supports all entity names required by the HTML 2.0 specification (RFC 1866).
It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
In addition to tag methods, the HTMLParser class provides
some additional methods and instance variables for use within tag methods. They
are as follows:
formatter
This is the formatter instance associated with the
parser.
nofill
This Boolean flag should be true when whitespace should not
be collapsed, or false when it should be. In general, this should only be true
when character data is to be treated as "preformatted" text, as
within a <PRE> element. The default value is false.
This affects the operation of handle_data() and
save_end().
anchor_bgn(href, name,
type)
This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag
with the same names. The default implementation maintains a list of hyperlinks
(defined by the href attribute) within the document. The
list of hyperlinks is available as the data attribute
anchorlist.
anchor_end()
This method is called at the end of an anchor region. The
default implementation adds a textual footnote marker using an index into the
list of hyperlinks created by anchor_bgn().
handle_image(source, alt[, ismap[, align[, width[,
height]]]])
This method is called to handle images. The default
implementation simply passes the alt value to the
handle_data() method.
save_bgn()
Begins saving character data in a buffer instead of sending
it to the formatter object. Retrieve the stored data via
save_end(). Use of the
save_bgn()/save_end() pair cannot be
nested.
save_end()
Ends buffering character data and returns all data saved
since the preceding call to save_bgn(). If the nofill flag
is false, whitespace is collapsed to single spaces. A call to this method
without a preceding call to save_bgn() will raise a
TypeError exception.
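The interaction between save_bgn(), save_end(), handle_data(), and the nofill flag can be sketched as follows. This is a simplified standalone model, not htmllib code: the SaveBuffer class and its savedata and sent attributes are hypothetical names chosen for the illustration.

```python
class SaveBuffer(object):
    """Hypothetical model of the save_bgn()/save_end() pair."""

    def __init__(self):
        self.savedata = None   # None means "not currently saving"
        self.nofill = False    # True would keep whitespace untouched
        self.sent = []         # data that would go to the formatter

    def save_bgn(self):
        self.savedata = ""

    def save_end(self):
        data = self.savedata
        self.savedata = None
        if not self.nofill:
            # Collapse runs of whitespace to single spaces.
            data = " ".join(data.split())
        return data

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data    # buffered, not forwarded
        else:
            self.sent.append(data)   # normal path


buf = SaveBuffer()
buf.handle_data("before ")
buf.save_bgn()
buf.handle_data("Python \n  home")
buf.handle_data("   page")
saved = buf.save_end()
```

While saving is active, nothing reaches the sent list; the buffered text comes back from save_end() with its whitespace collapsed because nofill is false.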
The following example is a
CGI script that outputs to a Web page the Web links found
in a given HTML file.
import htmllib
import formatter, string, cgi

form = cgi.FieldStorage()
try:
    myfile = form["filename"].value
except KeyError:
    # No filename field was submitted; fall back to a default file.
    myfile = "index.html"

class ParserClass(htmllib.HTMLParser):
    def __init__(self, verbose=0):
        self.anchors = {}
        fmt = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, fmt, verbose)

    def anchor_bgn(self, href, name, type):
        # Start buffering the anchor text and remember its target.
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        tagtext = string.strip(self.save_end())
        if self.anchor and tagtext:
            self.anchors[tagtext] = self.anchors.get(tagtext, []) + \
                                    [self.anchor]

filename = open(myfile)
htmldata = filename.read()
filename.close()

parserobj = ParserClass()
parserobj.feed(htmldata)
parserobj.close()

print "Content-type: text/html\n"
# Print each anchor text followed by the list of URLs it points to.
for key in parserobj.anchors.keys():
    print key, parserobj.anchors[key]
htmlentitydefs
The htmlentitydefs module contains a dictionary
called
entitydefs that contains all the
definitions for the general entities defined by HTML 2.0, as demonstrated
next:
import htmlentitydefs
entitydefs = htmlentitydefs.entitydefs
# Print every entity name together with its definition.
for key in entitydefs.keys():
    print key, " = ", entitydefs[key]
formatter
The formatter module is used for generic output
formatting by the
HTMLParser class of the
htmllib module. This module supports two
interface definitions, each with multiple implementations: Formatter and
Writer.
Formatter objects transform an abstract flow of formatting
events into specific output events on writer objects. Formatters manage several
stack structures to allow various properties of a writer object to be changed
and restored; writers need not be able to handle relative changes nor any sort
of "change back" operation. Specific writer properties which can
be controlled via formatter objects are horizontal alignment, font, and left
margin indentations. A mechanism is provided that supports providing arbitrary,
non-exclusive style settings to a writer as well. Additional interfaces
facilitate formatting events that are not reversible, such as paragraph
separation. The writer interface is required by the
formatter interface.
Writer objects encapsulate device interfaces. Abstract
devices, such as file formats, are supported as well as physical devices. The
provided implementations all work with abstract devices. The interface makes
available mechanisms for setting the properties that formatter objects manage
and inserting data into the output.
The Formatter Interface
Interfaces to create formatters are dependent on the specific
formatter class being instantiated. The interfaces described as follows are the
required interfaces, which all formatters must support once initialized.
One data element is defined at the module level:
AS_IS. This value can be used in the font specification
passed to the push_font() method described in the following,
or as the new value to any other push_property() method.
Pushing the AS_IS value allows the corresponding
pop_property() method to be called without having to track
whether the property was changed.
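To see why pushing AS_IS keeps push and pop calls balanced, consider the following standalone sketch of an alignment stack. It is an illustration under stated assumptions rather than the formatter module's code: the AS_IS sentinel defined here is a local stand-in, and AlignmentStack and writer_calls are hypothetical names.

```python
AS_IS = object()   # local stand-in for the module-level AS_IS value


class AlignmentStack(object):
    """Hypothetical model of push_alignment()/pop_alignment()."""

    def __init__(self):
        self.stack = []
        self.current = None      # None: the writer's preferred alignment
        self.writer_calls = []   # values a writer would be told about

    def push_alignment(self, align):
        if align is not AS_IS and align != self.current:
            self.current = align
            self.writer_calls.append(align)
        # Every push adds an entry, even when nothing changes, so the
        # caller can always issue a matching pop.
        self.stack.append(self.current)

    def pop_alignment(self):
        if self.stack:
            self.stack.pop()
        previous = self.stack[-1] if self.stack else None
        if previous != self.current:
            self.current = previous
            self.writer_calls.append(previous)


alignments = AlignmentStack()
alignments.push_alignment("center")   # real change
alignments.push_alignment(AS_IS)      # no change, but still pushed
alignments.pop_alignment()            # undoes the AS_IS push
alignments.pop_alignment()            # undoes "center"
```

The caller never has to remember whether the second push changed anything; it pops once per push, and the writer is notified only when the effective value actually changes.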
The following attributes are defined for formatter instance
objects:
writer
The writer instance with which the formatter interacts.
end_paragraph(blanklines)
Closes any open paragraphs and inserts at least blanklines
before the next paragraph.
add_line_break()
Adds a hard line break if one does not already exist. This
does not break the logical paragraph.
add_hor_rule(*args, **kw)
Inserts a horizontal rule in the output. A hard break is
inserted if data is in the current paragraph, but the logical paragraph is not
broken. The arguments and keywords are passed on to the writer's
send_hor_rule() method.
add_flowing_data(data)
Provides data that should be formatted with collapsed
whitespace. Whitespace from preceding and successive calls to
add_flowing_data() is considered as well when the whitespace
collapse is performed. The data that is passed to this method is expected to be
word wrapped by the output device. Note that any word wrapping still must be
performed by the writer object because of the need to rely on device and font
information.
add_literal_data(data)
Provides data that should be passed to the writer unchanged.
Whitespace, including newline and tab characters, is considered legal in the
value of data.
add_label_data(format,
counter)
Inserts a label that should be placed to the left of the
current left margin. This should be used for constructing bulleted or numbered
lists. If the format value is a string, it is interpreted as a format
specification for counter, which should be an integer. The result of this
formatting becomes the value of the label; if format is not a string, it is
used as the label value directly. The label value is passed as the only
argument to the writer's send_label_data() method.
Interpretation of nonstring label values is dependent on the associated writer.
Format specifications are strings that, in combination with a
counter value, are used to compute label values. Each
character in the format string is copied to the label value, with some
characters recognized to indicate a transformation on the counter value.
Specifically, the character
1 represents the counter value formatted
as an Arabic number, the characters
A and
a represent alphabetic representations
of the counter value in upper- and lowercase, respectively, and
I and
i represent the counter value in Roman
numerals, in upper- and lowercase. Note that the alphabetic and Roman
transformations require that the counter value be greater than zero.
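The format specification rules can be made concrete with a short standalone sketch. The function names format_label(), to_letters(), and to_roman() are hypothetical, and the treatment of counters above 26 in the alphabetic case (27 becomes aa) is an assumption of this sketch, not something stated in the text above.

```python
ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(counter):
    """Return counter (> 0) as a lowercase Roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        while counter >= value:
            parts.append(symbol)
            counter -= value
    return "".join(parts)


def to_letters(counter):
    """Return counter (> 0) as letters: 1 -> a, 26 -> z, 27 -> aa."""
    letters = ""
    while counter > 0:
        counter, index = divmod(counter - 1, 26)
        letters = chr(ord("a") + index) + letters
    return letters


def format_label(fmt, counter):
    """Build a label from a format specification and a counter."""
    if not isinstance(fmt, str):
        return fmt   # non-string formats are used as the label directly
    label = ""
    for char in fmt:
        if char == "1":
            label += str(counter)
        elif char in "aA":
            if counter > 0:
                letters = to_letters(counter)
                label += letters.upper() if char == "A" else letters
        elif char in "iI":
            if counter > 0:
                roman = to_roman(counter)
                label += roman.upper() if char == "I" else roman
        else:
            label += char   # every other character is copied as is
    return label


numbered = format_label("1.", 3)
lettered = format_label("A)", 2)
roman_lower = format_label("i.", 4)
roman_upper = format_label("(I)", 9)
```

A numbered list would call such a function once per item with an increasing counter, whereas a bulleted list could pass a nonstring format and leave its interpretation to the writer.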
flush_softspace()
Sends any pending whitespace buffered from a previous call to
add_flowing_data() to the associated writer object. This
should be called before any direct manipulation of the writer object.
push_alignment(align)
Pushes a new alignment setting onto the alignment stack. This
might be AS_IS if no change is desired. If the alignment
value is changed from the previous setting, the writer's
new_alignment() method is called with the align
value.
pop_alignment()
Restores the previous alignment.
push_font((size, italic, bold,
teletype))
Changes some or all font properties of the writer object.
Properties that are not set to AS_IS are set to the values
passed in, whereas others are maintained at their current settings. The
writer's new_font() method is called with the fully resolved
font specification.
pop_font()
Restores the previous font.
push_margin(margin)
Increases the number of left margin indentations by one,
associating the logical tag margin with the new indentation. The initial margin
level is 0. Changed values of the logical tag must be true
values; false values other than AS_IS are not sufficient to
change the margin.
pop_margin()
Restores the previous margin.
push_style(*styles)
Pushes any number of arbitrary style specifications. All
styles are pushed onto the styles stack in order. A tuple representing the
entire stack, including AS_IS values, is passed to the
writer's new_styles() method.
pop_style([n = 1])
Pops the last n style specifications
passed to push_style(). A tuple representing the revised
stack, including AS_IS values, is passed to the writer's
new_styles() method.
set_spacing(spacing)
Sets the spacing style for the writer.
assert_line_data([flag =
1])
Informs the formatter that data has been added to the current
paragraph out-of-band. This should be used when the writer has been manipulated
directly. The optional flag argument can be set to false if the writer
manipulations produced a hard line break at the end of the output.
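The way whitespace is carried across successive add_flowing_data() calls, and released by flush_softspace(), can be sketched in isolation. The class below is a simplified model under stated assumptions: FlowCollapser, its softspace and nospace flags, and the sent list are names chosen for this illustration and should not be read as the formatter's actual internals.

```python
class FlowCollapser(object):
    """Hypothetical model of whitespace collapsing across calls."""

    def __init__(self):
        self.softspace = False   # a space is pending from the last call
        self.nospace = True      # nothing on the line yet: drop leading space
        self.sent = []           # strings that would reach the writer

    def add_flowing_data(self, data):
        if not data:
            return
        prespace = data[:1].isspace()
        postspace = data[-1:].isspace()
        data = " ".join(data.split())   # collapse inner whitespace
        if self.nospace and not data:
            return
        if prespace or self.softspace:
            if not data:
                # Whitespace only: remember it instead of sending it.
                if not self.nospace:
                    self.softspace = True
                return
            if not self.nospace:
                data = " " + data
        self.nospace = False
        self.softspace = postspace
        self.sent.append(data)

    def flush_softspace(self):
        # Send a pending space before the writer is used directly.
        if self.softspace:
            self.softspace = False
            self.nospace = True
            self.sent.append(" ")


flow = FlowCollapser()
flow.add_flowing_data("Hello   ")
flow.add_flowing_data("  world\n")
flow.add_flowing_data("again")
flow.add_flowing_data(" tail ")
flow.flush_softspace()
result = "".join(flow.sent)
```

Trailing whitespace from one call and leading whitespace from the next merge into a single space, and the last pending space is emitted only when flush_softspace() is called.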
Formatter Implementations
Two implementations of formatter objects are provided by this
module. Most applications can use one of these classes without modification or
subclassing.
NullFormatter([writer])
A formatter that does nothing. If writer is omitted, a
NullWriter instance is created. No methods of the writer are called by
NullFormatter instances. Implementations should inherit from this class if
implementing a writer interface but don't need to inherit any
implementation.
AbstractFormatter(writer)
The standard formatter. This implementation has demonstrated
wide applicability to many writers, and can be used directly in most
circumstances. It has been used to implement a full-featured WWW browser.
The Writer Interface
Interfaces to create writers are dependent on the specific writer
class being instantiated. The interfaces described as follows are the required
interfaces that all writers must support once initialized. Although most
applications can use the AbstractFormatter class as a
formatter, the writer must typically be provided by the application.
flush()
Flushes any buffered output or device control events.
new_alignment(align)
Sets the alignment style. The align value can be any object,
but by convention is a string or None, where
None indicates that the writer's preferred alignment should
be used. Conventional align values are
left, center, right, and
justify.
new_font(font)
Sets the font style. The value of font will be
None, indicating that the device's default font should be
used, or a tuple of the form (size, italic, bold, teletype). Size will be a
string indicating the size of font that should be used; specific strings and
their interpretation must be defined by the application. The italic, bold, and
teletype values are Boolean indicators specifying which of those font
attributes should be used.
new_margin(margin, level)
Sets the margin level to the integer level and the logical
tag to margin. Interpretation of the logical tag is at the writer's discretion;
the only restriction on the value of the logical tag is that it not be a false
value for non-zero values of level.
new_spacing(spacing)
Sets the spacing style to spacing.
new_styles(styles)
Sets additional styles. The styles value is a tuple of
arbitrary values; the value AS_IS should be ignored. The
styles tuple can be interpreted either as a set or as a stack depending on the
requirements of the application and writer implementation.
send_line_break()
Breaks the current line.
send_paragraph(blankline)
Produces a paragraph separation of at least the given number
of blank lines, or the equivalent. The blankline value will be an integer. Note
that the implementation will receive a call to
send_line_break() before this call if a line break is
needed; this method should not include ending the last line of the paragraph.
It is only responsible for vertical spacing between paragraphs.
send_hor_rule(*args, **kw)
Displays a horizontal rule on the output device. The
arguments to this method are entirely application- and writer-specific, and
should be interpreted with care. The method implementation can assume that a
line break has already been issued via
send_line_break().
send_flowing_data(data)
Outputs character data that might be word wrapped and
re-flowed as needed. Within any sequence of calls to this method, the writer
can assume that spans of multiple whitespace characters have been collapsed to
single space characters.
send_literal_data(data)
Outputs character data that has already been formatted for
display. Generally, this should be interpreted to mean that line breaks
indicated by newline characters should be preserved and no new line breaks
should be introduced. The data can contain embedded newline and tab characters,
unlike data provided to the send_flowing_data()
interface.
send_label_data(data)
Sets data to the left of the current left margin, if
possible. The value of data is not restricted; treatment of non-string values
is entirely application- and writer-dependent. This method will only be called
at the beginning of a line.
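Because the writer is typically supplied by the application, it helps to see the whole interface in one place. The following standalone sketch defines a hypothetical RecordingWriter that stores each call in a list instead of producing output; it does not inherit from any class of the formatter module, and the calls at the end are issued by hand only to show the shape of the interface.

```python
class RecordingWriter(object):
    """Hypothetical writer: records each interface call as a tuple."""

    def __init__(self):
        self.calls = []

    def flush(self):
        self.calls.append(("flush",))

    def new_alignment(self, align):
        self.calls.append(("new_alignment", align))

    def new_font(self, font):
        self.calls.append(("new_font", font))

    def new_margin(self, margin, level):
        self.calls.append(("new_margin", margin, level))

    def new_spacing(self, spacing):
        self.calls.append(("new_spacing", spacing))

    def new_styles(self, styles):
        self.calls.append(("new_styles", styles))

    def send_line_break(self):
        self.calls.append(("send_line_break",))

    def send_paragraph(self, blankline):
        self.calls.append(("send_paragraph", blankline))

    def send_hor_rule(self, *args, **kw):
        self.calls.append(("send_hor_rule", args, kw))

    def send_flowing_data(self, data):
        self.calls.append(("send_flowing_data", data))

    def send_literal_data(self, data):
        self.calls.append(("send_literal_data", data))

    def send_label_data(self, data):
        self.calls.append(("send_label_data", data))


# Drive the writer by hand, roughly as a formatter might for a heading.
writer = RecordingWriter()
writer.new_alignment("center")
writer.new_font(("h1", 0, 1, 0))   # (size, italic, bold, teletype)
writer.send_flowing_data("Chapter title")
writer.send_line_break()
writer.send_paragraph(1)
writer.flush()
```

A real writer would replace each recording statement with device-specific output; keeping the method names and argument lists as shown is what makes it usable by a formatter.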
Writer Implementations
Three implementations of the writer object interface are provided
as examples by this module. Most applications will need to derive new writer
classes from the NullWriter class.
NullWriter()
A writer that only provides the interface definition; no
actions are taken on any methods. This should be the base class for all writers
that do not need to inherit any implementation methods.
AbstractWriter()
A writer that can be used in debugging formatters, but not
much else. Each method simply announces itself by printing its name and
arguments on standard output.
DumbWriter([file[, maxcol =
72]])
A simple writer class that writes output on the file object
passed in as file or, if file is omitted, on standard output. The output is
simply word wrapped to the number of columns specified by maxcol. This class is
suitable for reflowing a sequence of paragraphs.
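As a rough picture of what "simply word wrapped to the number of columns specified by maxcol" means, here is a minimal standalone sketch. The wrap_words() function is hypothetical and makes no claim to match DumbWriter's implementation; it only shows a greedy wrap that starts a new line whenever the next word would exceed maxcol.

```python
def wrap_words(text, maxcol=72):
    """Greedy word wrap: return a list of lines no wider than maxcol.

    A single word longer than maxcol is kept whole on its own line.
    """
    lines = []
    current = ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > maxcol:
            # The word does not fit: close the line and start a new one.
            lines.append(current)
            current = word
        elif current:
            current += " " + word
        else:
            current = word
    if current:
        lines.append(current)
    return lines


wrapped = wrap_words("the quick brown fox jumps", maxcol=10)
```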
Using the Formatter Module
The following example removes all tags from an HTML file, leaving only the plain text.
1: from htmllib import HTMLParser
2: from formatter import AbstractFormatter, DumbWriter
3: htmlfile = open("stuff.html")
4: parser = HTMLParser(AbstractFormatter(DumbWriter()))
5: parser.feed(htmlfile.read())
6: parser.close()
7: htmlfile.close()
The DumbWriter class is used here to write all the non-tag contents of htmlfile to the standard
output.
Note that the file opened by line 3 can also be a URL. You just need to import and use the urllib.urlopen function, like
this:
from urllib import urlopen
htmlfile = urlopen('http://www.lessaworld.com/')