XML and Content Management

part of Perl for the Web

The Extensible Markup Language (XML) is usually the first technology invoked when talking about the future of the Web. An offshoot of the Hypertext Markup Language (HTML) and the Standard Generalized Markup Language (SGML) standards, XML was developed as a way to create platform-neutral data representation languages. XML languages for common data needs have already started to appear, and many more are in development. In addition, a major strength of XML is the capability to encode XML data without needing to explicitly declare a language. Because the rules of well-formed XML are strictly defined, any XML document can be parsed using the same tools, regardless of whether the language is known beforehand.

A number of robust XML interfaces have been developed for Perl, and more are likely to be developed in the future. The proliferation of such interfaces is made possible by the standardized accessibility of XML, combined with Perl's excellent text-handling capabilities. Perl's interfaces to XML documents range from the simple and direct to the robust and complex, but so far no clear favorite has emerged. It's possible that a cross-language standard, such as the Document Object Model (DOM), will become a Perl favorite as well. It's just as likely that a specifically Perlish XML interface will emerge the way DBI did as a database interface.

There are many applications for XML interfaces on the Web, but the first wide use of XML might be Web content management. XML fits well into the document-centric Web model, and much of the data currently available on the Web can be translated easily into XML documents. In addition, XML provides a finer grain of control over the way a document is expressed. The same data can be formatted in an infinite number of ways, and alternate views of a single XML file can be created to enable users to exercise control over the way in which they view the data.

XML Now With XML::Simple

XML::Simple is a module that provides a lightweight interface to XML files that is implemented in a Perlish fashion. XML::Simple parses XML files into native Perl data structures that can be accessed using standard Perl functions and operators. The structures then can be altered or updated and translated back to XML files. The XML::Simple module provides a good way to access XML documents that have a simple, predefined structure. Configuration files, database results, and other simple data files are good candidates for translation into XML and for use with XML::Simple.

Unfortunately, not all XML interface situations can be covered by using XML::Simple. The module's reliance on native Perl data types limits its representation of XML structures to those that make sense in a Perlish context. The results of parsing XML files with XML::Simple can be variable, even with files that are well suited to it. Structures that are distinct in XML syntax might map to the same Perl data types, creating ambiguity when the parsed structures are translated back to XML. For simple cases with regular XML structure, however, XML::Simple provides a good way to start accessing XML documents as quickly as possible.

Simple XML Parsing

In most cases, XML is used to store existing data in a platform-independent format. Data is well-suited for storage in XML if it can be stored hierarchically. Listing 16.1 is an example of a hierarchy–a table of contents–that translates well into XML.

Listing 16.1 Table of Contents File

<book>
  <title>Perl for the Web</title>
  <author>Chris Radcliff</author>
  <chapter>
    <number>0</number>
    <title>Introduction</title>
    <section>About this book</section>
    <section>Conventions</section>
  </chapter>
  <chapter>
    <number>1</number>
    <title>Foobar chapter</title>
    <section>How to Foobar</section>
    <section>How to Foobaz</section>
  </chapter>
  <chapter>
    <number>2</number>
    <title>Barbaz chapter</title>
    <section>How to Barbaz</section>
    <section>When not to Barbaz</section>
  </chapter>
  <appendix>
    <number>A</number>
    <title>Alphabet Soup: Reference and Glossary</title>
    <section>XML Bestiary</section>
    <section>Specifications and Organizations</section>
  </appendix>
</book>

Similarly, XML-encoded data that can be represented as a Perl data structure is well-suited for processing by XML::Simple. Because Perl deals in scalars, arrays, and hashes, XML structures that can be meaningfully mapped onto these variables are easier to access using XML::Simple than are structures that are more complex. For instance, an HTML-like document is likely to map awkwardly onto Perl data structures because the tags in such a file are mixed within the values of container tags. A structure such as the following is trivial to represent in Perl:

Listing 16.


An array of name values can be created that mimics the structure directly. However, a mixed structure such as the following would be awkward to represent in Perl:

Listing 16.

    Cats are often given odd names like 
    <name>Hershey</name>, <name>Kahlua</name>, or 

The latter structure, although it's valid XML, mixes unnamed text elements and tagged elements within the <statement> tag, which makes it difficult to create a Perl structure containing all the elements. A simple array of all the elements under <statement> might be created, but it would include all elements in order with no distinction between the text elements and the <name> values. Parsing such a file would require a more robust interface such as XML::DOM. Listing 16.1 is a good candidate for using XML::Simple because the data is stored in a format that can be represented as name/value pairs–a hash, in other words.
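As a rough sketch of the simple case, the following parses a flat list of names with XML::Simple. The <cats> root element and the KeyAttr setting are assumptions made for illustration; only the two names come from the fragment above.

```perl
use strict;
use warnings;
use XML::Simple ();

# Hypothetical document: a flat list of <name> elements under an assumed <cats> root.
my $xml = '<cats><name>Hershey</name><name>Kahlua</name></cats>';

# KeyAttr => [] turns off XML::Simple's array folding so the list stays an array.
my $xs   = XML::Simple->new(KeyAttr => []);
my $cats = $xs->XMLin($xml);

# Repeated <name> elements arrive as an array reference of plain strings.
my @names = @{ $cats->{name} };
print join(', ', @names), "\n";
```

The mixed <statement> structure has no equally direct mapping, because a hash of arrays has no place to record where each piece of text sits relative to the <name> elements.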

The file in Listing 16.1 also is very regular in its structure. Values are stored in elements that are named consistently, which helps when retrieving values programmatically. For instance, all chapter numbers are stored in the element path /book/chapter/number, which is both regular and easily identifiable. Even duplicate data is stored in identifiable sections that can be turned into array structures in Perl. Chapters are found in the element path /book/chapter, for instance, and sections are found through the element path /book/chapter/section. Storing multiples uniformly as arrays makes it possible to access all the values with a foreach loop or use standard array functions such as grep or map.

Accessing a value in an XML::Simple structure is handled through Perl's standard methods for processing variables and references. An object interface is provided for creating the structures, but no object methods are necessary to get values. For instance, retrieving the value of the <title> element from the document in Listing 16.1 could be performed in the following way:

Listing 16.


The result would be a scalar value containing the title. More complex values are returned as array references or hash references, which in turn can be accessed using standard Perl operators. The end result of any accessed path is always the scalar value of the tag or attribute.
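A minimal sketch of this kind of access follows, assuming a trimmed copy of Listing 16.1 held in a string and a variable named $book (both chosen here for illustration). It reads the title as a scalar and then applies map and grep to the chapter array.

```perl
use strict;
use warnings;
use XML::Simple ();

# A trimmed copy of the table of contents from Listing 16.1.
my $xml = <<'XML';
<book>
  <title>Perl for the Web</title>
  <chapter>
    <number>0</number>
    <title>Introduction</title>
    <section>About this book</section>
    <section>Conventions</section>
  </chapter>
  <chapter>
    <number>1</number>
    <title>Foobar chapter</title>
    <section>How to Foobar</section>
    <section>How to Foobaz</section>
  </chapter>
</book>
XML

my $book = XML::Simple->new()->XMLin($xml);

# /book/title is a plain scalar.
my $title = $book->{title};

# /book/chapter is an array reference, so map and grep work as usual.
# Every chapter here has two sections, so each section entry is an array too.
my @chapter_titles = map { $_->{title} } @{ $book->{chapter} };
my @foo_sections   = grep { /Foo/ } map { @{ $_->{section} } } @{ $book->{chapter} };

print "$title\n";
print "  $_\n" for @chapter_titles;
```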

XML Without the XML

One good reason to use XML::Simple is that it provides many of the benefits of XML–named access to data and platform-independent data files among them–without requiring much knowledge of XML arcana. A Perl program can be written that accesses all parts of an XML file without having to know parsing details or file formatting. For instance, Listing 16.2 is a Web application written in Perl Server Pages (PSP) that uses XML::Simple to display the table of contents in Listing 16.1 as HTML. (A modified version of this program can be used as an entry point for the publishing system mentioned later in this chapter.)

Listing 16.2 Table of Contents Display Program

01 <html>
03 <perl>
04 use XML::Simple ();
06 my $xs = XML::Simple->new(forcearray => ['section', 'appendix', 'chapter'],
07                           memshare => 1);
08 my $book = $xs->XMLin("$ENV{DOCUMENT_ROOT}/thebook/toc.xml");
09 </perl>
11 <output>
12 <head>
13 <title>$book->{title}</title>
14 </head>
16 <body bgcolor="white">
17 <h3>$book->{title}</h3>
18 <p>by $book->{author}</p>
20 <loop name="chapter" list="@{$book->{chapter}}">
21 <p>Chapter $chapter->{number}: $chapter->{title}</p>
22 <ul>
23 <loop name="section" list="@{$chapter->{section}}">
24 <li>$section</li>
25 </loop>
26 </ul>
27 </loop>
29 <loop name="appendix" list="@{$book->{appendix}}">
30 <p>Appendix $appendix->{number}: $appendix->{title}</p>
31 <ul>
32 <loop name="section" list="@{$appendix->{section}}">
33 <li>$section</li>
34 </loop>
35 </ul>
36 </loop>
38 </body>
39 </output>
40 </html>

Lines 04–08 of Listing 16.2 set up XML::Simple and parse the XML document into a Perl data structure. Line 06 creates an XML::Simple parser object and stores it in $xs. The forcearray parameter is specified to standardize the way XML::Simple translates the <section>, <appendix>, and <chapter> elements into arrays. The memshare parameter is set to cache parsed XML documents in memory for the life of the Perl interpreter. Line 08 invokes the XMLin method to load and parse the toc.xml file from Listing 16.1 and store the result in $book. The rest of Listing 16.2 doesn't reference XML::Simple; all further interaction is with the Perl data structures referenced by $book.

Lines 11–39 use the PSP <output> tag to mix HTML and Perl variables. Lines 13 and 17 display the book title, and line 18 displays the book author. Lines 20–27 create a new paragraph for each chapter stored in the array reference $book->{chapter}, and lines 23–25 loop through the sections of each chapter and create a bulleted list accordingly. Lines 29–36 do the same for each appendix stored in $book->{appendix}. The end result is an HTML-formatted list, as shown in Figure 16.1.

Figure 16.1 Table of contents display.
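For readers without a PSP environment, the same traversal can be sketched in plain Perl. This is only an approximation of what the template does: it parses a shortened copy of Listing 16.1 from a string, builds the HTML fragment in a variable named $html (an assumption of this sketch), and does no HTML escaping.

```perl
use strict;
use warnings;
use XML::Simple ();

# A shortened copy of Listing 16.1: one chapter and one appendix.
my $xml = <<'XML';
<book>
  <title>Perl for the Web</title>
  <author>Chris Radcliff</author>
  <chapter>
    <number>1</number>
    <title>Foobar chapter</title>
    <section>How to Foobar</section>
  </chapter>
  <appendix>
    <number>A</number>
    <title>Alphabet Soup: Reference and Glossary</title>
    <section>XML Bestiary</section>
  </appendix>
</book>
XML

# forcearray keeps single chapters, appendixes, and sections as arrays.
my $xs   = XML::Simple->new(forcearray => ['section', 'appendix', 'chapter']);
my $book = $xs->XMLin($xml);

my $html = "<h3>$book->{title}</h3>\n<p>by $book->{author}</p>\n";

# Chapters and appendixes share a layout, so one loop handles both.
for my $kind (['Chapter', 'chapter'], ['Appendix', 'appendix']) {
    my ($label, $key) = @$kind;
    for my $part (@{ $book->{$key} || [] }) {
        $html .= "<p>$label $part->{number}: $part->{title}</p>\n<ul>\n";
        $html .= "<li>$_</li>\n" for @{ $part->{section} };
        $html .= "</ul>\n";
    }
}

print $html;
```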

Another bonus to using XML to store the table of contents becomes apparent when the displayed result needs to be changed. For instance, adding a link to each section of the table of contents would be a simple matter of changing line 24 of Listing 16.2 to the following:

<li><a href="show.psp?section=$section">$section</a></li>

The change automatically would be applied to all sections in the table of contents. Similarly, if the underlying data needs to be changed, the changes can be made without the need to alter formatting as well. In fact, changes can be made to the data without any knowledge of the eventual formatting at all. This provides an additional layer of separation for systems using templates–the result is independent data storage, program logic, and display formatting.

XML::Simple Caveats

An XML interface this simple has to have its faults, though. With XML::Simple, the problem lies in the way in which it translates XML structures into Perl variables. To provide an interface without the need for many object methods and complex data types, XML::Simple has to gloss over some of the distinctions between XML structures. The resulting ambiguity between similar structures can cause the XML coming out of XML::Simple to be considerably different from the XML that went in. An example of this effect can be seen by running the XML document from Listing 16.1 through XML::Simple and viewing the result. Listing 16.3 performs the transformation on any XML file specified.

Listing 16.3 XML Simplifier

01 #!/usr/bin/perl
03 require 5.6.0;
04 use strict;
05 use warnings;
07 use XML::Simple ();
09 my $xs = XML::Simple->new();
11 my $file_in = $ARGV[0]
12   or die "Please specify a file to simplify.\n";
14 print $xs->XMLout($xs->XMLin($file_in));

Line 14 of Listing 16.3 is the one that does all the work. It reads the file specified in $ARGV[0] into the XML::Simple parser created in line 09. It then translates the result back into an XML document and prints it. Because XML::Simple doesn't retain all the distinctions between XML elements and attributes, the XML document displayed looks considerably different, as shown in Listing 16.4.

Listing 16.4 Simplified Result

<opt title="Perl for the Web" author="Chris Radcliff">
  <appendix title="Alphabet Soup: Reference and Glossary" number="A">
    <section>XML Bestiary</section>
    <section>Specifications and Organizations</section>
  </appendix>
  <chapter title="Introduction" number="0">
    <section>About this book</section>
    <section>Conventions</section>
  </chapter>
  <chapter title="Foobar chapter" number="1">
    <section>How to Foobar</section>
    <section>How to Foobaz</section>
  </chapter>
  <chapter title="Barbaz chapter" number="2">
    <section>How to Barbaz</section>
    <section>When not to Barbaz</section>
  </chapter>
</opt>

At first glance, the "simplified" XML file in Listing 16.4 would seem to have been modified for the better. The module decided that some aspects of the file were better suited to attributes and changed the format accordingly. The result is a clearer visual representation of the relationship between chapters and sections, with titles and related attributes moved to secondary roles. However, the differences become apparent on closer inspection. The root element has been changed from <book> to <opt>, and the appendix has moved from the end of the table of contents to the front.

These changes wouldn't affect the way XML::Simple represents the XML document, but they would definitely change the way that most other XML parsers would represent the document. In turn, any program that relied on reading the document in the original form might be unable to find information that has been restructured, even though the structure still makes sense from a human standpoint. For instance, a program written using the DOM interface might look for the title of a chapter in the second child element of the <chapter> element, based on the file in Listing 16.1. In the modified file, though, the second child element of a chapter is likely to be a section title.
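Some of these differences can be reduced with options to XMLout. The sketch below uses the RootName and NoAttr options of XML::Simple on a two-element document invented for brevity; it is an illustration rather than a complete fix, because the order of child elements is still not taken from the original file.

```perl
use strict;
use warnings;
use XML::Simple ();

# A cut-down document using only the title and author from Listing 16.1.
my $xml = '<book><title>Perl for the Web</title><author>Chris Radcliff</author></book>';

my $xs   = XML::Simple->new();
my $data = $xs->XMLin($xml);

# Default round trip: the root becomes <opt> and scalars become attributes.
my $default = $xs->XMLout($data);

# RootName restores the root element; NoAttr writes scalars as child elements.
my $closer = $xs->XMLout($data, RootName => 'book', NoAttr => 1);

print $default, "\n", $closer;
```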

In addition, XML::Simple can be sensitive to changes in a document's structure–changes that wouldn't seem to make a difference at first glance. For instance, if the document in Listing 16.1 had only one section listed under one of the chapters, the internal representation of that section would change by default from an array reference to a scalar, and a program that loops over the sections as an array would break. The only way to avoid this kind of breakage is to force every instance of the <section> element to be treated as an array, as the forcearray parameter does in Listing 16.2. Unfortunately, these problems might not show up with the documents used to test an application, so other documents assumed to be in the same format might break an application that worked initially. The solution is to test XML::Simple programs with a wide array of sample documents to ferret out potential problems.
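The single-section case is easy to reproduce. The following sketch parses a one-section chapter twice, once with default settings and once with forcearray; the chapter content and the variable names $loose and $forced are made up for the example.

```perl
use strict;
use warnings;
use XML::Simple ();

# Hypothetical chapter with a single <section> element.
my $xml = '<chapter><title>Short chapter</title><section>Only section</section></chapter>';

# Default behavior: a lone <section> collapses to a plain string.
my $loose = XML::Simple->new()->XMLin($xml);

# With forcearray, the same element is always an array reference.
my $forced = XML::Simple->new(forcearray => ['section'])->XMLin($xml);

print ref($loose->{section})  || 'scalar', "\n";
print ref($forced->{section}) || 'scalar', "\n";
```

Code that dereferences $loose->{section} as an array would fail under strict, while the same code works unchanged against $forced.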

Tools for Creating XML Interfaces

After the limits of XML::Simple are reached, a more robust interface to XML documents becomes necessary. Fortunately, many already have been developed for use with Perl. Unfortunately, each has its own strengths and weaknesses, and some have more of the latter than the former. However, it's usually possible to find a good general-purpose XML interface that is well suited to a particular project, whether the XML needs to be processed as a stream, parsed into a tree structure for random access, or translated into another document directly.

XML::Parser and Expat

The core of most XML interfaces in Perl is the XML::Parser module. XML::Parser was originally written as an interface to the expat parser written in C by James Clark. The module provides an event-based approach to XML parsing–custom Perl subroutines are executed for each element encountered during parsing. For a time, it was Perl's only interface to XML, but other interfaces were soon implemented by writing modules for use as the subroutines XML::Parser calls as it parses the document.

XML::Parser provides a low-level interface to XML processing, one that makes few assumptions about the structure of the XML file or the methods used to access it. It's for this reason that most XML interfaces are built on top of XML::Parser. However, the same reason makes XML::Parser a poor fit for most Web applications, especially in a Perl context. XML::Parser provides few shortcuts to extracting a particular piece of data from an XML file. Thus, even simple XML document access requires a custom parser to be developed. This approach also is less forgiving because adding support for new tags usually requires additional custom subroutines.
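To give a sense of how much custom code even a small task needs, here is a minimal sketch of an event-based parser that collects chapter titles. It uses the Start, End, and Char handlers of XML::Parser; the element stack, the @chapter_titles array, and the shortened document are assumptions of the example.

```perl
use strict;
use warnings;
use XML::Parser;

# A shortened table of contents with only titles.
my $xml = <<'XML';
<book>
  <title>Perl for the Web</title>
  <chapter><title>Introduction</title></chapter>
  <chapter><title>Foobar chapter</title></chapter>
</book>
XML

my @stack;           # names of the elements currently open
my @chapter_titles;  # collected text of /book/chapter/title

my $parser = XML::Parser->new(
    Handlers => {
        # Called at each opening tag: remember where we are in the document.
        Start => sub {
            my ($expat, $element) = @_;
            push @stack, $element;
            push @chapter_titles, '' if "@stack" eq 'book chapter title';
        },
        # Called at each closing tag.
        End => sub { pop @stack },
        # Called for text; may fire more than once per element, so append.
        Char => sub {
            my ($expat, $text) = @_;
            $chapter_titles[-1] .= $text if "@stack" eq 'book chapter title';
        },
    },
);
$parser->parse($xml);

print "$_\n" for @chapter_titles;
```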


One interface to XML that Perl shares with other languages is the DOM. Perl implements the DOM by way of the XML::DOM module. XML documents are parsed into a structure made up of nodes, where each node is an XML element, attribute, or text value. Nodes can be accessed through DOM-standard object methods such as getDocumentElement and getData. The DOM interface enables the creation of XML structures as well–nodes can be created, copied, or relocated through additional object methods.

The existing DOM parser, XML::DOM::Parser, uses XML::Parser to create a native Perl data structure made up of objects and their methods. At the time this book was being written, efforts were underway to provide a faster DOM interface for Perl. A number of candidates written in C have been proposed, including a DOM version of the Sablotron processor and the Xerces parser, and Perl interfaces to them are in development. Just as XML::Parser eventually gave rise to a profusion of robust XML interfaces, XML::DOM will probably lend its interface to a long succession of modules with improved performance.
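A minimal sketch of DOM-style access with XML::DOM follows. It reads chapter titles from a shortened, whitespace-free document defined inline for the example, using getDocumentElement, getElementsByTagName, and getData.

```perl
use strict;
use warnings;
use XML::DOM;

# Shortened document with no whitespace between tags, so each <title>
# has exactly one child: its text node.
my $xml = '<book><title>Perl for the Web</title>'
        . '<chapter><title>Introduction</title></chapter>'
        . '<chapter><title>Foobar chapter</title></chapter></book>';

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse($xml);
my $root   = $doc->getDocumentElement;

# In scalar context, getElementsByTagName returns a node list object.
my @chapter_titles;
my $chapters = $root->getElementsByTagName('chapter');
for my $i (0 .. $chapters->getLength - 1) {
    my $title = $chapters->item($i)->getElementsByTagName('title')->item(0);
    push @chapter_titles, $title->getFirstChild->getData;
}

my $root_name = $root->getNodeName;
print "$root_name: ", join(', ', @chapter_titles), "\n";

# XML::DOM trees contain circular references; dispose releases them.
$doc->dispose;
```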

Other XML Tools

The list of XML tools developed for Perl is long and rapidly expanding. A wide array of parsers, organizers, filters, and object models have been developed in the quest for the ultimate Perl XML interface. An exhaustive list would be impossible to keep current, but a number of projects are worth noting. Check the Comprehensive Perl Archive Network (CPAN) for a more complete list, for documentation, and for download information.

XML::Grove and its newer relative, Orchard, provide a different object model than the DOM, but with the same intent: to represent XML structures in an unambiguous way that enables consistent access to any data within the structures. SAX filters take a different approach, using an event-handler model to process XML files as a series of data events. XSLT processors, such as Sablotron, provide a special subset of functionality based on XSLT. The intent of XSLT isn't to represent the XML structures, but to provide a standard way to specify the translation of XML documents into other documents, including HTML documents and other XML formats. Each interface has already found its niche, and early exploration of the XML space by these modules has paved the way for evolving versions and new modules to come.

A Sample XML-Based Publishing System

Publishing document-based information to multiple targets is a good use for XML on the Web. Often, XML can be used to separate infrequently changing data from the often-updated formatting that helps display it to a wide audience. The data can live much longer than its original Web site or any other displayed instance of it. Thus, XML provides a standard format that can be used and reused to publish the information.

As an example, what better to publish in a book about Perl for the Web than the book itself? The goal is to publish all chapters from the book on the Web in HTML and other useful forms, as illustrated in Figure 16.2.

Figure 16.2 Chapter 1 as an HTML file.

The first step in publishing chapters to the Web is to encode the chapters in a format that is resistant to change over time. The chapters and their associated files are long, and any changes made across the board would take an inordinate amount of time to implement. This makes XML a natural choice for the chapters–they can be stored in a format that specifies only the basic structure of each chapter, with reasonable assurance that the format is usable by a wide variety of display programs. If all else fails, the files themselves should be human-readable to enable hand editing and transcription, if necessary.

After the chapters are available in an XML format, the next step is to write a display application to add the desired formatting. Because the primary target is the Web, an embedded application using templates is a good choice. These templates then can be modified for other formatting situations. Display isn't the only goal, though, so an additional application should be written to facilitate searching through the chapters and displaying only the relevant results. This type of add-on shows the flexibility of the XML approach and sets the stage for more exciting applications of the data in the future.

Simple Book Format

This book consists of a number of chapters that are divided into sections, most of which have titles. The sections are in turn divided into subsections. This pattern is repeated for at least four levels of sections. At any level, the fundamental block of text used in the book is a paragraph, which might be a text paragraph or a block of code. With these aspects of a chapter in mind, the Simple Book Format (SBF) file in Listing 16.5 can be constructed as an outline of a sample chapter.

Listing 16.5 Sample SBF File

01 <chapter>
02 <number>15</number>
03 <title>Sample Chapter</title>
04 <paragraph type="normal">This is some introductory text.</paragraph>
05 <paragraph type="normal">Multiple &lt;b&gt;paragraphs&lt;/b&gt; are possible here.</paragraph>
06 <section>
07 <title>I'm a c-level section</title>
08 <paragraph type="normal">This is the body of the c-level section.</paragraph>
09 <paragraph type="normal">Multiple paragraphs are possible here as well.</paragraph>
10 <section>
11 <title>I'm a d-level section</title>
12 <paragraph type="normal">This is the body of the d-level section.</paragraph>
13 <paragraph type="listing" number="1" title="Sample code">
14 my $foo = "bar";
15 print "Let's raise the $foo a little.\n";
16 </paragraph>
17 </section>
18 <section>
19 <title>I'm another d-level section</title>
20 <paragraph type="normal">This is the body of the second d-level section.</paragraph>
21 </section>
22 </section>
23 <section>
24 <title>I'm another c-level section</title>
25 <paragraph type="normal">This is the body of the second c-level section.</paragraph>
26 </section>
27 </chapter>

The file in Listing 16.5 contains examples of a few notable structures. The chapter number and title are stored in corresponding elements in lines 02 and 03. Opening paragraphs (which would come right after the chapter title in the printed book) are listed without any containing structures on lines 04 and 05. Both paragraphs are given the attribute type="normal", as are most paragraphs in the sample chapter. One exception is the code listing in lines 13–16, which is given the attribute type="listing" as well as number and title attributes. Other types of paragraphs–including figures and untitled code segments–could be specified in this way as well. Note also that line 05 contains escaped characters, which will be expanded into the more-familiar <b> and </b> used by HTML. Embedding formatting codes into the data this way breaks the separation a little, but it makes processing a little easier and won't incur a penalty in our Web-centric applications.

One element in the document can be nested within others of its type. The <section> element might be found under the <chapter> element or another <section> element. This kind of recursion makes processing the document a little more complex, but in this case, the structure of the underlying data demanded it. A beneficial side effect of recursion is the equivalent tags used for a top-level section and all its subsections. This equivalence enables any section to be cut and pasted into another section as needed, without changing the name of its tags.
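Because every level uses the same <section> tag, one recursive subroutine can walk sections to any depth. The sketch below is a hypothetical helper named outline, applied to a shortened copy of Listing 16.5 parsed with XML::Simple; it returns an indented list of section titles.

```perl
use strict;
use warnings;
use XML::Simple ();

# A shortened copy of the SBF file in Listing 16.5.
my $xml = <<'XML';
<chapter>
  <number>15</number>
  <title>Sample Chapter</title>
  <section>
    <title>I'm a c-level section</title>
    <paragraph type="normal">This is the body of the c-level section.</paragraph>
    <section>
      <title>I'm a d-level section</title>
      <paragraph type="normal">This is the body of the d-level section.</paragraph>
    </section>
  </section>
  <section>
    <title>I'm another c-level section</title>
    <paragraph type="normal">This is the body of the second c-level section.</paragraph>
  </section>
</chapter>
XML

my $xs      = XML::Simple->new(forcearray => ['section', 'paragraph']);
my $chapter = $xs->XMLin($xml);

# Hypothetical helper: one indented line per section title, at any depth.
sub outline {
    my ($node, $depth) = @_;
    my @lines;
    for my $section (@{ $node->{section} || [] }) {
        push @lines, ('  ' x $depth) . $section->{title};
        push @lines, outline($section, $depth + 1);   # descend into subsections
    }
    return @lines;
}

my @outline = outline($chapter, 0);
print "$_\n" for @outline;
```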

XML::Simple Templates

After the file format has been determined, a display application is needed to translate the XML data into a format usable by Web browsers. It's hoped that eventually this kind of translation will be handled automatically by the browser using the Extensible Stylesheet Language (XSL), but until that occurs, it's good to develop a server-side translator that produces HTML. Listing 16.6 is an example of a display application that uses PSP templates and XML::Simple to translate any file in the SBF format for Web viewing.

Listing 16.6 Templates with XML::Simple

01 <perl>
02 use XML::Simple ();
04 my $xs = XML::Simple->new(forcearray => ['section', 'paragraph'], 
05                           memshare => 1);
06 my $chapter = $xs->XMLin("$ENV{DOCUMENT_ROOT}/thebook/$QUERY{chapter}.xml");
07 </perl>
09 <include file="$ENV{DOCUMENT_ROOT}/templates/chapter.psp" />
11 <template chapter="$chapter" />

Listing 16.6 might seem unnaturally short, but that's because most of the work is being handled by the display template included in line 09 and called in line 11. The $chapter variable populated in line 06 contains the entire parsed contents of the XML file specified in the chapter query variable, so no other information has to be passed to the template.

Publishing to HTML

If the contents of the data file are provided in an easily accessible format, creating a template for the data becomes simple as well. Some aspects of the template are nontrivial because some of the data is stored recursively, but the rest can be handled with a simple substitution template. Listing 16.7 is a template that displays an entire SBF chapter in HTML (as seen in Figure 16.2).

Listing 16.7 SBF to HTML Full-Chapter Template

01 <tag name="template" accepts="chapter">
02 <output>
03 <html>
04 <head>
05 <title>Chapter $chapter->{number}: $chapter->{title}</title>
06 </head>
08 <body bgcolor="white">
09 <h3>$chapter->{title}</h3>
10 <p><i>part of <a href="/thebook/">Perl for the Web</a></i></p>
12 <loop name="paragraph" list="@{$chapter->{paragraph}}">
13 <p>$paragraph->{content}</p>
14 </loop>
16 <loop name="section" list="@{$chapter->{section}}">
17 <h4>$section->{title}</h4>
18 <loop name="paragraph" list="@{$section->{paragraph}}">
19 <p>$paragraph->{content}</p>
20 </loop>
21 <loop name="section" list="@{$section->{section}}">
22 <h5>$section->{title}</h5>
23 <loop name="paragraph" list="@{$section->{paragraph}}">
24 <if cond="$paragraph->{type} eq 'listing'">
25 <p><b><i>Listing $chapter->{number}.$paragraph->{number} $paragraph->{title}</i></b></p>
26 <p><font color="green"><pre>$paragraph->{content}</pre></font></p>
27 <else />
28 <p>$paragraph->{content}</p>
29 </if>
30 </loop>
31 </loop>
32 </loop>
33 </body>
34 </html>
35 </output>
36 </tag>

The template in Listing 16.7 uses the same template principles outlined in Chapter 13, "Using Templates with Perl Applications." The entire template is enclosed in a <tag> tag on lines 01 and 36. The template accepts the $chapter variable, which contains the XML::Simple parsed SBF file. Parts of that variable are incorporated into HTML using the <output> tag on lines 02 and 35. The loop in lines 12–14 displays the opening paragraphs, and the larger loop in lines 16–32 handles the rest of the recursive sections and their paragraphs. Line 24 creates a special case for paragraphs that are code listings, displaying the title of the listing and formatting the code with a <pre> tag to preserve spacing.

As complex as the template in Listing 16.7 might seem, it still is a valid HTML file that renders correctly in a Web browser. Graphic HTML editors might have varying success in editing the formatting of the file without disturbing the <loop> tags, however, and the use of loops and Perlish variable references blurs the line between text templates and program structure. A clearer distinction could be enforced in an application such as this, but it would probably involve increased complexity in Listing 16.6 without much reduced complexity or readability in the template.

Alternate Templates

Developing a template for simple HTML translation is only the beginning. Additional templates can be created to provide different windows on the same data. Formatting can be altered significantly by simply changing the HTML in the template, and segments of the chapter can be emphasized or excluded based on user selections. Listing 16.8 is a template that shows one section of the chapter at a time with enough context to provide navigation to other sections.

Listing 16.8 SBF to HTML Single-Section Template

01 <tag name="template" accepts="chapter, selection">
02 <output>
03 <html>
04 <head>
05 <title>Chapter $chapter->{number}: $chapter->{title} - $selection</title>
06 </head>
08 <body bgcolor="#9999FF">
09 <table width="80%" border="0" cellspacing="0" cellpadding="5" align="center">
10 <tr>
11 <td bgcolor="white">
12 <h3>$chapter->{title}</h3>
13 <p><i>part of <a href="/thebook/">Perl for the Web</a></i></p>
15 <loop name="paragraph" list="@{$chapter->{paragraph}}">
16 <p>$paragraph->{content}</p>
17 </loop>
19 <loop name="section" list="@{$chapter->{section}}">
20 <h4>$section->{title}</h4>
21 <if cond="$section->{title} eq $selection">
22 <loop name="paragraph" list="@{$section->{paragraph}}">
23 <p>$paragraph->{content}</p>
24 </loop>
25 <loop name="section" list="@{$section->{section}}">
26 <h5>$section->{title}</h5>
27 <loop name="paragraph" list="@{$section->{paragraph}}">
28 <if cond="$paragraph->{type} eq 'listing'">
29 <p><b><i>Listing $chapter->{number}.$paragraph->{number} $paragraph->{title}</i></b></p>
30 <p><font color="green"><pre>$paragraph->{content}</pre></font></p>
31 <else />
32 <p>$paragraph->{content}</p>
33 </if>
34 </loop>
35 </loop>
36 <else />
37 <perl>my $trans = $section->{title}; $trans =~ s/\s/+/g;</perl>
38 <p><font size="2" face="Arial, Helvetica, sans-serif"><a href="section.psp?chapter=$QUERY{chapter}&section=$trans">view this section</a></font></p>
39 </if>
40 </loop>
41 </td>
42 </tr>
43 </table>
44 </body>
45 </html>
46 </output>
47 </tag>

The template in Listing 16.8 differs from Listing 16.7 in only a few respects. It accepts an additional parameter ($selection) in line 01, and it contains some additional formatting to center the text in a display table. The main difference lies in the <if> conditional in line 21, which checks the section title and displays the section's contents only if it's the specified selection. Otherwise, lines 36–38 display a link to the omitted section, as displayed by the same file. The code that utilizes this template can be built into the same application as Listing 16.6, or a separate page can be implemented, as in Listing 16.9.

Listing 16.9 Section Display Page

01 <perl>
02 use XML::Simple ();
04 my $xs = XML::Simple->new(forcearray => ['section', 'paragraph'], 
05                           memshare => 1);
06 my $chapter = $xs->XMLin("$ENV{DOCUMENT_ROOT}/thebook/$QUERY{chapter}.xml");
07 </perl>
09 <include file="$ENV{DOCUMENT_ROOT}/templates/section.psp" />
11 <template chapter="$chapter" selection="$QUERY{section}" />

Again, the differences are minor. Line 09 of Listing 16.9 includes the template from Listing 16.8, and line 11 calls the template with an additional attribute selection, as defined by the query variable section. The rest of the changes are handled by the template. The result is quite different, however, as shown in Figure 16.3.

Figure 16.3 Chapter 1 displayed in sections.
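The selection logic in Listing 16.8 can be sketched in plain Perl as follows. The hash below is a hand-built stand-in for the structure XML::Simple would return, and the chapter15 file name and the "[shown]" marker are placeholders invented for the example.

```perl
use strict;
use warnings;

# A hand-built stand-in for a parsed SBF chapter (titles from Listing 16.5).
my $chapter = {
    number  => 15,
    title   => 'Sample Chapter',
    section => [
        { title => "I'm a c-level section" },
        { title => "I'm another c-level section" },
    ],
};
my $selection = "I'm another c-level section";

# One line per section: the selected title is marked, the others get a link.
my @lines;
for my $section (@{ $chapter->{section} }) {
    if ($section->{title} eq $selection) {
        push @lines, "[shown] $section->{title}";
    }
    else {
        # Same substitution as line 37 of Listing 16.8: whitespace becomes "+".
        (my $trans = $section->{title}) =~ s/\s/+/g;
        push @lines, "section.psp?chapter=chapter15&section=$trans";
    }
}

print "$_\n" for @lines;
```

The substitution handles only whitespace, so section titles containing characters such as & or ? would need fuller URL escaping before being placed in a link.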

One benefit to selectively displaying information from an XML file is the additional formatting that can be added to the result without needing to worry about file size or readability. Large files such as these chapters tend to overwhelm the formatting capabilities of browsers if displayed as anything more complex than a stream of text. A shorter page gives more freedom to the site designer to add navigation and white space to improve the aesthetics of the page. In addition, an important aspect of reading on the Web is the ability to scan a page for interesting information based on headers and links. Because a chapter in its raw form is simply long blocks of text, omitting the section text emphasizes the headers and enables the user to scan through them before content is selected for expansion.

Full-Text Searching

The capability to pinpoint a section within a chapter also provides a benefit for full-text searching applications. In general, search engines on the Web have to address data at the file level, with little knowledge of the structure of any given file. It's sometimes possible to point to a section within an HTML file, but the entire file still has to be downloaded and displayed before the section can be found. With an XML file, granularity can be set as small as needed. In Listing 16.10, for example, a hash table is created. It identifies the occurrence of keywords in named sections within chapter files.

Listing 16.10 Creating a Full-Text Hash Table

01 #!/usr/bin/perl
03 use 5.6.0;
04 use strict;
05 use warnings;
07 use DBI;
08 use XML::Simple ();
10 # connect to the database
11 my $dbh = DBI->connect('dbi:mysql:test','','',{RaiseError => 1});
13 # clear out the old hash table values
14 my $sth = $dbh->prepare('delete from book_search');
15 $sth->execute;
17 # pre-cache the insert statement for later
18 my $sti = $dbh->prepare(qq{INSERT INTO book_search
19                            (keyword, chapter, section, x_count)
20                            VALUES (?,?,?,?)});
22 # open each chapter
23 my $xs = XML::Simple->new(forcearray => ['section', 'paragraph']);
24 my $file = "chapter01";
25 my $chapter = $xs->XMLin("/http/docroot/thebook/$file.xml");
27 # load sections into a Perl hash
28 our %sections;
29 foreach my $paragraph (@{$chapter->{paragraph}})
30 {
31   next unless $paragraph->{type} eq 'normal';
32   $sections{'main'} .= $paragraph->{content};
33 }
35 foreach my $section (@{$chapter->{section}})
36 {
37   my $title = $section->{title};
38   add_paragraphs($title, $section);
39 }
41 # record keywords for each section
42 foreach my $section (keys %sections)
43 {
44   # create a key count hash
45   my %key_count;
47   # break the section text into keywords
48   my $keyword;
49   foreach $keyword (split(/[^A-Za-z-_']/, $sections{$section}))
50   {
51     # increment the hash entry for that keyword
52     $key_count{lc($keyword)}++ if (length($keyword) > 2);
53   }
55   # insert a row for each key_count hash entry
56   my $q_keyword;
57   foreach $keyword (keys %key_count)
58   {
59     $sti->execute($keyword, $file, $section, $key_count{$keyword});
60   }
61 }
62 $sti->finish;
65 $dbh->disconnect;
67 sub add_paragraphs 
68 {
69   my $title = shift;
70   my $section = shift;
72   foreach my $paragraph (@{$section->{paragraph}})
73   {
74     next unless $paragraph->{type} eq 'normal';
75     $sections{$title} .= $paragraph->{content};
76   }
78   foreach my $subsection (@{$section->{section}})
79   {
80     add_paragraphs($title, $subsection);
81   }
82 }

Listing 16.10 builds a hash table in a database to store the relationship between keywords and the sections in which they're found. Lines 22–25 open and parse a chapter file. Lines 28–33 use the %sections hash to associate the initial paragraphs in the chapter with a section called main that can be treated as the default section in display applications. Lines 35–39 do the same for each top-level section in the chapter, using the add_paragraphs subroutine (lines 67–82) to recurse through subsections and associate them with the same section. Lines 41–61 split each section into keywords and insert each keyword into the hash table. Note that the hash table is built for the underlying data rather than for any representation of it. The hash records only the chapter file name and section name as identifiers, so search results can be used to display a section, regardless of its format.
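
The two core steps of this process, folding subsections into their top-level section and counting keywords, can be tried without a database or an XML file. The sketch that follows is an approximation under assumptions: the chapter structure is built by hand with invented content, count_keywords is a hypothetical helper, and the split pattern is written to match the same characters as the one in Listing 16.10.

```perl
use strict;
use warnings;

# Hypothetical chapter data with one nested subsection; content is invented.
my $chapter = {
    paragraph => [ { type => 'normal', content => 'Perl handles text well. ' } ],
    section   => [
        { title     => 'Parsing',
          paragraph => [ { type => 'normal',  content => 'Perl parses XML. ' },
                         { type => 'listing', content => 'my $x = 1;' } ],
          section   => [
              { title     => 'Details',
                paragraph => [ { type => 'normal', content => 'XML parsers differ. ' } ] },
          ] },
    ],
};

my %sections;

# Append normal paragraphs to the named top-level section, then recurse
# into subsections while keeping the same top-level title.
sub add_paragraphs {
    my ($title, $section) = @_;
    foreach my $paragraph (@{ $section->{paragraph} || [] }) {
        next unless ($paragraph->{type} || '') eq 'normal';
        $sections{$title} .= $paragraph->{content};
    }
    add_paragraphs($title, $_) foreach @{ $section->{section} || [] };
}

# Count lowercased keywords longer than two characters.
sub count_keywords {
    my ($text) = @_;
    my %count;
    foreach my $word (split /[^A-Za-z_'-]+/, $text) {
        $count{ lc $word }++ if length($word) > 2;
    }
    return \%count;
}

# Paragraphs before the first section are filed under 'main'.
add_paragraphs('main', { paragraph => $chapter->{paragraph} });
add_paragraphs($_->{title}, $_) foreach @{ $chapter->{section} };

foreach my $title (sort keys %sections) {
    my $count = count_keywords($sections{$title});
    print "$title: $_ x $count->{$_}\n" foreach sort keys %$count;
}
```

Because the subsection text is appended under its parent's title, a keyword that appears in a subsection is counted toward the top-level section, which is the unit the display page can link to.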

The hash table can be used by a standard search page to quickly locate sections with requested keywords and generate links to the appropriate display pages. Listing 16.11 is an example of such a search page. It generates links to the display page from Listing 16.9.

Listing 16.11 Search Page

01 <include file="$ENV{DOCUMENT_ROOT}/templates/book.psp" />
02 <template title="Search">
04 <form action="search.psp" method="get">
05 <output>
06 <p>Search:&nbsp;
07 <input type="text" name="search" size="30" value="$QUERY{search}" />
08 &nbsp;
09 <input type="submit" value="Search" /></p>
10 </output>
11 </form>
13 <if cond="$QUERY{search}">
14 <perl>
15 # break the search up into keywords
16 my $search_where;
17 foreach $keyword (split(/[^A-Za-z-']/,$QUERY{search}))
18 {
19   # add an OR for each keyword
20   next unless (length($keyword) > 2);
21   $search_where .= "OR keyword = '$keyword'\n";
22 }
24 # replace the first OR with an AND
25 if ($search_where)
26 {
27   substr($search_where,0,2) = 'AND (';
28   $search_where .= ")";
29 }
30 else
31 {
32   $search_where = "AND keyword = ''";
33 }
34 </perl>
36 <sql name="search" dbtype="mysql" db="test" action="query">
37 <output>
38 SELECT chapter, section, SUM(x_count) as x_count
39 FROM book_search
40 WHERE x_count IS NOT NULL
41 $search_where
42 GROUP BY chapter, section
43 ORDER BY 3 DESC, chapter
44 </output>
45 </sql>
46 <h4>Results:</h4>
47 <ul>
48 <fetch query="search" fetch="chapter, section, x_count" type="sql">
49 <perl>my $trans = $section; $trans =~ s/\s/+/g;</perl>
50 <output>
51 <li><a href="section.psp?chapter=$chapter&section=$trans">$section</a> in $chapter</li>
52 </output>
53 </fetch>
54 </ul>
55 </if>
56 </template>

The form of Listing 16.11 is almost identical to the search page listed in Chapter 14, "Database-Backed Web Sites." The main difference lies in the SELECT statement in lines 38–43. This statement selects section identifiers from the book_search table, groups them by identifier to combine duplicate results, and orders the results based on the number of keyword occurrences. Other differences are found in lines 49 and 51, which format and display a link to each section specified in the search result. See the description of Listing 14.9 in Chapter 14 for a more complete description of the page. The results of a common search are shown in Figure 16.4.

Figure 16.4

Full-text search result.
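
The search page in Listing 16.11 interpolates each keyword directly into the SQL text, and its split pattern lets apostrophes through, so a keyword such as "don't" would unbalance the quoted string. One possible alternative, sketched here on the assumption that the query is run through DBI directly instead of through the <sql> tag, is to build the same clause with placeholders and pass the keywords as bind values. The build_search_where helper is hypothetical and not part of the listing.

```perl
use strict;
use warnings;

# Turn a raw search string into a WHERE fragment plus its bind values.
# Keywords are lowercased to match the way Listing 16.10 stores them.
sub build_search_where {
    my ($search) = @_;
    my @keywords = grep { length($_) > 2 }
                   map  { lc($_) }
                   split /[^A-Za-z'-]+/, (defined $search ? $search : '');

    # no usable keywords: match nothing, as the original page does
    return ("AND keyword = ''", []) unless @keywords;

    my $where = 'AND (' . join(' OR ', ('keyword = ?') x @keywords) . ')';
    return ($where, \@keywords);
}

my ($where, $bind) = build_search_where("Perl's XML parser");
print "$where\n";
print join(', ', @$bind), "\n";

# With a connected DBI handle in $dbh (not created in this sketch), the
# values would be supplied at execute time rather than interpolated:
#   my $sth = $dbh->prepare("SELECT chapter, section, SUM(x_count) AS x_count
#                            FROM book_search
#                            WHERE x_count IS NOT NULL $where
#                            GROUP BY chapter, section
#                            ORDER BY 3 DESC, chapter");
#   $sth->execute(@$bind);
```

Only the fixed text "keyword = ?" is ever added to the SQL statement, so the content of the search string cannot change the shape of the query.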

Sidebar: The Eminence of Paper

As useful as the Web is for retrieving information, and as useful as XML might be for storing it, paper still reigns supreme when it comes to the display of large amounts of data. Paper has high contrast, high resolution, and only as much flicker as the lighting used to illuminate it. These attributes probably will take decades to replicate with computer screens and other electronic display media. As a result, it should be expected that users will want to view text on paper for a long time to come.

If you're reading this in book format, sit down in front of a computer, pull up the Web site, and compare this to the same page on screen. (If you're reading this on the Web, by all means buy the book and make the same comparison.) Unless you have a huge screen with great resolution, you'll probably notice right off the bat that the page displays more of the book at one time than the screen does. The text also is easier to read; you'll read faster and with better comprehension out of the book than off the Web. (The Web users just skimmed past that, so they won't be offended.) On top of that, the book is more portable overall, uses less power, and has a greater field of view and a wider range of operating temperatures. That's the key to paper's continued reign, and it's not likely to be beaten for deep reading any time soon.

Whatever paper has in terms of readability, though, it loses when the contest is interactivity. It's not searchable. It doesn't have a place to ask questions or make comments. The code segments can't be cut and pasted into a text editor. Because of this, documents are more likely to come from an electronic source such as the Web, even though they might go straight to the printer as soon as they're found (or as soon as they're composed). The point of all this is that the same document is likely to show up both in electronic form and on paper, no matter how ephemeral the electronic form or old-fashioned the paper form. Keep this in mind when designing content for the Web, and you might end up with a site that's well suited for both.


XML is a complex technology, but at its heart, it's just another text format for Perl to parse and represent in applications. A number of interfaces to XML have been written for Perl, from simple parsers such as XML::Simple to complex interfaces such as XML::DOM, and most are based on XML::Parser. On the Web, XML provides a solid format for content storage, and the flexible nature of XML documents enables a finer grain of control over how the content is expressed. A simple content management system can be created to provide multiple modes of expression for a given XML file, including full-text searching of all or part of an XML document.

Page last updated: 15 August 2001