
Accessing URLs

URL stands for uniform resource locator. URLs are the strings, such as http://www.lessaworld.com/, that you type into your Web browser to jump to a Web page.

Python provides the urllib and urlparse modules as great tools for processing URLs.

Tip

Many applications that parse Web pages break whenever the page design changes. These problems should become less common as more structured formats (such as XML) are used to produce the pages.



The urllib Module

The urllib module is a high-level interface for retrieving data across the World Wide Web, supporting HTTP, FTP, and Gopher connections by using sockets. This module defines functions for writing programs that act as Web clients. It is normally used as a front end to lower-level modules, such as httplib, ftplib, gopherlib, and so on.

To retrieve a Web page, use the urllib.urlopen(url [,data]) function. This function returns a stream object that can be manipulated like any other file object, as the following example illustrates:

>>> import urllib
>>> page = urllib.urlopen("http://www.bog.frb.fed.us")
>>> page.readline()


This stream object has two additional attributes: url and headers. The first is the URL that you are opening, and the second is a dictionary-like object that contains the page headers, as illustrated in the next example.

>>> page.url
'http://www.bog.frb.fed.us'
>>> for key, value in page.headers.items():
...     print key, " = ", value
...
server  =  Microsoft-IIS/4.0
content-type  =  text/html
content-length  =  461
date  =  Thu, 15 Jun 2000 15:31:32 GMT

Next, here are several other functions made available by the urllib module.

urllib.urlretrieve(url [,filename] [,hook])—   Copies a network object to a local file.

>>> urllib.urlretrieve('http://www.lessaworld.com', 'copy.html')


urllib.urlcleanup()—   Cleans up the cache used by urllib.urlretrieve.
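
For example, when urlretrieve() is called without a filename, the page is stored in a temporary local file, which urlcleanup() removes afterward (a minimal sketch; urlretrieve() returns the temporary filename and the page headers):

>>> filename, headers = urllib.urlretrieve('http://www.lessaworld.com')
>>> urllib.urlcleanup()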

urllib.quote(string [,safe])—   Replaces special characters in string using %xx escape codes. The optional safe parameter specifies characters that should not be quoted; its default value is '/'.

>>> urllib.quote('This & that @ home')
'This%20%26%20that%20%40%20home'

urllib.quote_plus(string [,safe])—Works just like quote(), but it replaces spaces with plus signs.
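
Applying it to the same string used in the quote() example shows the difference:

>>> urllib.quote_plus('This & that @ home')
'This+%26+that+%40+home'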

urllib.unquote(string)—   Replaces %xx escapes with their single-character equivalents, reversing the effect of urllib.quote.

>>> urllib.unquote('This%20%26%20that%20%40%20home')
'This & that @ home'

urllib.urlencode(dict)—Converts a dictionary into a URL-encoded string.

>>> params = { 'sex':'female', 'name':'renata lessa'}
>>> urllib.urlencode(params)
'sex=female&name=renata+lessa'

Note

For those who have Python 2.0 installed, keep in mind that the urllib module can scan environment variables for proxy configuration.

Also note that Python 2.0's version of the urllib module supports https:// URLs over SSL.
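
As a minimal sketch of the environment-variable feature (the proxy host here is hypothetical), setting http_proxy before starting Python makes urllib route HTTP requests through that proxy:

$ http_proxy="http://proxy:80"
$ export http_proxy
$ python
>>> import urllib
>>> page = urllib.urlopen('http://www.python.org')  # fetched through proxy:80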



The urlparse Module

The urlparse module manipulates URL strings, parsing them into tuples. It can break a URL into its components, combine components back into a URL, and convert relative addresses to absolute addresses. Basically, it rips URLs apart and is able to put them together again.

Let's take a look at the functions that are provided by this module:

urlparse.urlparse()
syntax: urlparse.urlparse(urlstring [,default_scheme [,allow_fragments]])


Parses a URL into six elements—addressing scheme, network location, path, parameters, query, fragment identifier—returning the following tuple:

>>> import urlparse
>>> urlparse.urlparse('http://www.python.org/FAQ.html')
('http', 'www.python.org', '/FAQ.html', '', '', '')

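A hypothetical URL that uses more of the components fills in the remaining slots (parameters, query, and fragment):

>>> urlparse.urlparse('http://www.python.org/doc/lib/index.html;type=a?name=guido#intro')
('http', 'www.python.org', '/doc/lib/index.html', 'type=a', 'name=guido', 'intro')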

urlparse.urlunparse(tuple)—Constructs a URL string from a tuple as returned by urlparse().
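
For instance, feeding back the tuple produced by the preceding urlparse() call reconstructs the original URL:

>>> urlparse.urlunparse(('http', 'www.python.org', '/FAQ.html', '', '', ''))
'http://www.python.org/FAQ.html'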

urlparse.urljoin(base, url [,allow_fragments])—Combines an absolute URL with a relative URL.

>>> urlparse.urljoin('http://www.python.org', 'doc/lib')
'http://www.python.org/doc/lib'


The next example copies a Web page into a local file:

import urllib

# Open the remote page and a local file for the copy.
pagehandler = urllib.urlopen("http://www.lessaworld.com")
outputfile = open("sitecopy.html", "wb")

# Read the page in 512-byte chunks until it is exhausted.
while 1:
    data = pagehandler.read(512)
    if not data:
        break
    outputfile.write(data)

outputfile.close()
pagehandler.close()


If you are behind a firewall, here's a little trick you can use to make a proxy server handle your connections:

1: import urllib
2: proxies = { 'http': 'http://proxy:80'}
3: urlopener = urllib.FancyURLopener(proxies)
4: htmlpage = urlopener.open('http://www.bog.frb.fed.us')
5: data = htmlpage.readlines()
6: print data


Line 2: Creates a dictionary that identifies the proxy location. Note that proxy:80 corresponds to the name of the proxy server along with the port that it is listening on.

Line 3: Creates an opener object that routes its connections through the proxy.

