Global Spin: Scaling a Perl Solution

Visitor interest, which is measured in terms of the overall traffic to a site, is the primary goal of most sites, but success can be as overwhelming as it is beneficial. Even with the most efficient architecture and hardware with the highest performance, a site can grow in popularity until it is deluged with requests. For some sites–such as the Internet Archive (http://www.archive.org), which stores gigabytes of archived video and Web content–the content of the site can stress hardware and network connections with even minimal amounts of traffic.

Web applications create their own special concerns. A Web application is likely to use more system resources per request to deliver data than would a static Web page. At the same time, dynamic pages are less likely to be cached by proxy servers and client browsers. As a result, sites that increase their focus on Web applications might see server load increases even when the overall traffic to the site is not increasing. Ironically, this usually means that sites start to perform poorly just as they are starting to provide essential services to their clients.

Adopting a more efficient architecture can provide a marked improvement in performance, but traffic might eventually use up any performance gains realized. The next line of defense is scaling up the hardware of a single Web server machine. Faster processors, more memory, and more storage media can be installed to improve the performance of server software without a change in architecture or configuration. However, the performance gains from such changes are only incremental. In addition, there's a fixed limit to the pace of these improvements, which is governed by the fastest available hardware and software. A more practical limit is enforced by the fact that hardware with the highest performance comes at a premium. These limits put a ceiling on the traffic a single system can support, so, eventually, traffic levels might call for higher performance than a single Web server can provide.

Load Balancing Versus Clustering

Load balancing is based on the idea that duplicated Web server machines scale linearly. In other words, if one machine processes a certain number of requests per second, two machines can process twice as many and ten machines can process ten times as many. In simple cases, this idea works well. For example, a Web site composed mostly of static HTML pages and other static files scales linearly, so additional traffic can be served by adding duplicate Web server machines and then routing traffic to all machines equally.
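The linear-scaling assumption reduces to simple arithmetic, sketched here in Python. The function name and the figures in the usage note are hypothetical, and the calculation holds only as long as scaling really is linear.

```python
import math


def servers_needed(peak_requests_per_sec, per_server_requests_per_sec):
    """Count identical servers needed, assuming perfectly linear scaling."""
    if per_server_requests_per_sec <= 0:
        raise ValueError("per-server capacity must be positive")
    # Round up: a fraction of a server still means one more machine.
    return math.ceil(peak_requests_per_sec / per_server_requests_per_sec)
```

For instance, if one machine handles 150 requests per second, servers_needed(1000, 150) returns 7.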

Clustering, on the other hand, is based on separating a Web server into functional units and shifting the burden of each unit to a different machine. Clustering works well for sites that have a clear dividing line between functions, either as implemented or as viewed by site visitors. For instance, a site might have four distinct Web applications, each of which uses a database and a session manager. Such a site could be divided across six machines, one for each application and one each for the database and the session manager. The hope is that resource conflicts between applications are eliminated by giving each application its own server. Ideally, clustering also enables the applications to scale on their own machines without the need for synchronization between duplicate servers.

External Load Balancing

When balancing traffic across Web servers, one common solution is to implement an external system to route incoming traffic. A simple approach involves changing the DNS entry for a server so that each request for the IP address of a given machine name is answered with one address out of a group of identical machines. For instance, the name www.site.com might be assigned to machine numbers 2, 3, 4, and 5 on a given class C network. When a specific client requests the IP address for www.site.com, it would be given just one of these addresses, which it would interact with over the course of a visit. The hope is that the random assignment of addresses averages out and that each machine on the list receives an equal amount of traffic. Unfortunately, proxy servers for networks such as America Online and RoadRunner might cache only one address, thereby skewing the amount of traffic that a single server receives for a given time period.
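The skew caused by caching proxies can be illustrated with a small Python simulation. This is only a sketch: the addresses (taken from a documentation range), the resolver names, and the client counts are all hypothetical, and addresses are handed out in strict rotation rather than at random to keep the result repeatable.

```python
from collections import Counter
from itertools import cycle


def simulate_dns_rotation(addresses, resolvers):
    """Return the number of clients that end up on each address.

    resolvers is a list of (name, client_count) pairs. Each resolver
    looks up the name once and caches the answer, so every client
    behind it is sent to the same address.
    """
    rotation = cycle(addresses)
    load = Counter({address: 0 for address in addresses})
    for _name, client_count in resolvers:
        load[next(rotation)] += client_count
    return dict(load)
```

With four addresses and the resolvers [("large-isp-proxy", 5000), ("home-a", 1), ("home-b", 1), ("home-c", 1)], the first address receives 5000 clients and each of the others receives one, even though every lookup was answered in turn.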

Load balancing products, such as F5's BigIP, are designed to serve as gateways to the network where a group of Web servers resides. These systems route requests dynamically to a specific server from a list of available Web servers. To a client machine, each request goes to the same machine name and IP address. However, the requests might actually be served by any number of equivalent Web servers. These tools also provide fail-over support–a Web server is taken off the list if it stops responding correctly to incoming requests. This type of load balancing provides more control over the distribution of traffic over servers.
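The core behavior of such a gateway, rotating through a list and skipping servers that have stopped responding, can be sketched in a few lines of Python. The class and server names are hypothetical, and this is not how any particular product works internally; a real balancer would also run its own health checks instead of being told when a server is down.

```python
class Balancer:
    """Round-robin request routing with simple fail-over."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        # Take a server off the list when it stops responding correctly.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

From the client's point of view nothing changes when a server is marked down; the remaining servers simply absorb its share of the requests.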

The main difficulty inherent in a symmetric load-balancing situation is keeping a synchronized copy of all Web applications and their associated data on each Web server. This might not seem like a problem early in the development cycle because the content on all servers is simply copied from a central server. However, the lax development style used when dealing with a single Web server won't work for a dozen Web servers. If a change is made to application code on one server and not copied to the others, applications can quickly become so far out of synchronization that they have to be completely recopied. The same holds true for configuration, especially in Perl. Installing a module or upgrading to a new version becomes a more difficult problem when it has to be performed on all servers simultaneously.

Another concern lies with aspects of a Web application that can't be split into symmetric parts. Session managers, for instance, need to keep a list of open sessions, usually in memory. If each user request within a session is sent to a different server, all servers must have access to an identical copy of the session information. This usually is implemented by keeping a single session server that every Web server reaches over a network connection. The same situation occurs with database servers and other system applications that can't be duplicated easily. Unfortunately, this means that not all parts of the Web application are being scaled at the same rate, so system application servers can easily get overwhelmed with Web application server requests.

Perl-Based Load Balancing

Load balancing also can be implemented solely at the application-engine level. Perl processing architectures, such as FastCGI and VelociGen, enable processing to be distributed over a group of machines. A single Web server then can be used to handle static requests, route application processing requests, and return the results to the client. Each set of application engines sits on an identically configured application server machine, which is devoted solely to responding to dynamic requests. Routing can be handled by a two-tiered application–the first program layer resides on the Web server and passes requests to the actual Web applications. It does this by consulting a routing table and choosing an application engine that is not currently processing another request.
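The routing layer described above can be sketched as a table of engines with a busy flag for each. The Python below is a minimal illustration with hypothetical names; it is not the FastCGI or VelociGen mechanism, and the run_on_engine callback stands in for whatever network call actually hands the request to an engine.

```python
class RoutingTable:
    """Track which application engines are currently processing a request."""

    def __init__(self, engines):
        self.busy = {engine: False for engine in engines}

    def acquire(self):
        """Reserve and return the first idle engine, or None if all are busy."""
        for engine, is_busy in self.busy.items():
            if not is_busy:
                self.busy[engine] = True
                return engine
        return None

    def release(self, engine):
        self.busy[engine] = False


def handle_request(table, request, run_on_engine):
    """First tier: choose an idle engine, pass the request on, free the engine."""
    engine = table.acquire()
    if engine is None:
        return "503 all engines busy"
    try:
        return run_on_engine(engine, request)
    finally:
        # Release the engine even if processing raised an error.
        table.release(engine)
```

Returning an error when every engine is busy is only one possible policy; a routing layer could equally queue the request until an engine is released.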

As with other load balancing solutions, each application server in a balanced set must have an identical configuration with identical data. New servers can be added by copying the configuration of existing servers, but any changes to the server configuration can cause the applications to get out of synchronization and to produce unwanted results. Perl-based load balancing also suffers from the other problems inherent in symmetric load balancing. These problems include increased stress on databases due to additional application engines accessing the same system resource. In addition, load balancing at the application-engine level requires a single Web server to be used as a gateway, which might become an arbitrary bottleneck. Adding more gateways increases the potential complexity of the system by requiring all copies of the routing layer application to share information about the disposition of application engines.

Perl-Based Clustering

With FastCGI, VelociGen, and other network-enabled Perl environments, it's also possible to cluster servers into functional groups. Each group can be configured to handle a designated portion of Web application traffic. Apache Web server configuration of FastCGI, for instance, enables individual applications to be assigned to directories, file types, or other URL masks. An additional layer of clustering can be implemented using a two-tiered approach that is similar to a load-balancing solution. The balancing application could check an incoming request against a list of servers that are equipped to handle that type of request and then route the request to an open application engine on one of those servers.
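The extra layer of clustering amounts to a lookup from request type to a group of servers, followed by the same search for an open engine. A Python sketch follows, assuming requests are classified by URL prefix; the function names and the shape of the cluster map are hypothetical.

```python
def servers_for(path, cluster_map, default_servers):
    """Return the server group for the longest URL prefix matching the path."""
    best_prefix, best_servers = "", default_servers
    for prefix, servers in cluster_map.items():
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_servers = prefix, servers
    return best_servers


def route(path, cluster_map, default_servers, busy):
    """Choose the first server in the matching group that is not busy."""
    for server in servers_for(path, cluster_map, default_servers):
        if server not in busy:
            return server
    return None  # every server equipped for this request type is occupied
```

A map such as {"/shop/": ["shop1", "shop2"], "/forum/": ["forum1"]} would send /shop/cart to shop1, or to shop2 while shop1 is busy, and anything unmatched to the default group.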

Some parts of a Web application can be clustered easily without any help from Perl. For instance, a database server that resides on the same machine as the Web server usually can be reimplemented on another machine and connected to over the network. Session servers, mail processors, and other system applications also can be moved to machines separate from the Web server with minimal changes to the Web application code. Clustering in this fashion can be a way to solve the problem of system applications that don't scale well. For instance, balancing a database server symmetrically over several machines is difficult because it requires each copy of the database to synchronize updates and inserts with other copies continuously. However, breaking a database into functional parts is considerably easier. Tables specific to each application can reside on different database servers, reducing the overall number of queries to each database as well as the number of application engines with cached connections open.

Unfortunately, clustering also can reach an inherent limit because Web applications only have a finite number of functional parts. After a point, it's no longer feasible to break an application into smaller components because the component functions and modules overlap. When that happens, it's usually necessary to implement a combination of load balancing and clustering techniques to increase capacity even further. Fortunately, this kind of combination is likely to increase the performance of a cluster more than either technique would individually. This happens because the limitations of each are overcome to some degree by the use of the other.

Perl-Based Synchronization

Aside from performing the work of application engines, Perl can help synchronize content over a balanced set of servers. Content is likely to be stored as text files, which makes a text-centric language such as Perl a good choice for copying files to a group of servers based on whether they have been changed. Perl also can be used to develop a staged application development environment consisting of a development server and a publishing tool. All changes are made and tested on the development server, and the publishing tool copies the resulting changes to the appropriate balanced or clustered servers as necessary. The publishing tool can use standard file copying tools such as scp more quickly than they can be used by hand, so the possibility for inconsistent results from different servers in a cluster is reduced significantly. Implementing a publishing tool also can reduce a developer's desire to implement "minor changes" on the production servers by making the update process simpler and more automatic.
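The heart of such a publishing tool is deciding which files have changed and then building one copy command per file and server. The Python sketch below shows that logic under a few assumptions: changes are detected by comparing SHA-256 digests against those recorded at the last publish, the host names and paths are hypothetical, and the remote directories already exist. It only builds the scp argument lists; running them is left to the caller.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file under root (by relative path) to a SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(current, published):
    """List files that are new or differ from the last published digests."""
    return sorted(rel for rel, digest in current.items() if published.get(rel) != digest)


def scp_commands(root, changed, hosts, remote_root):
    """Build one scp argument list per changed file per production host."""
    return [
        ["scp", f"{root}/{rel}", f"{host}:{remote_root}/{rel}"]
        for host in hosts
        for rel in changed
    ]
```

Because every server receives the same list of changed files in one pass, the servers cannot drift apart the way they can when files are copied by hand.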

Perl can be used to update the server configuration on multiple servers simultaneously or in controlled shifts. Software can be installed using predefined scripts, Perl modules can be added automatically, and server processes can be restarted automatically when necessary. In addition, each server can be removed from a load balancing rotation for the duration of the update. It then can be returned to duty before starting on the next server, thereby reducing the chance that normal loads will overload the servers still available. Designing and implementing such an administration tool in Perl can help define configuration procedures explicitly. Update processes can be codified in working code, and the results of each procedure can be recorded to a central log.
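The controlled-shift procedure can be written down as a short loop. In this Python sketch the update callback, the rotation set, and the log list are hypothetical stand-ins for the real installation script, the load balancer's server list, and the central log. Stopping at the first failure is one possible policy, chosen here so that a bad update does not spread to the remaining servers.

```python
def rolling_update(servers, in_rotation, update, log):
    """Update servers one at a time, keeping the rest in rotation.

    in_rotation is a set of server names that is modified in place;
    update is a callable that raises an exception on failure.
    """
    for server in servers:
        in_rotation.discard(server)
        log.append(f"{server}: removed from rotation")
        try:
            update(server)
        except Exception as exc:
            # Leave the failed server out of rotation and stop the run.
            log.append(f"{server}: update failed ({exc}); left out of rotation")
            break
        in_rotation.add(server)
        log.append(f"{server}: updated and returned to rotation")
    return log
```

At any moment no more than one server is out of rotation, which is what limits the extra load on the servers still available.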

Perl is an excellent choice for creating these administrative tools because it can use existing system tools to carry out the process. After all, Perl was originally developed as a system administration language. Rather than reinvent revision control, an administrative tool can use CVS functions to carry out updates. Instead of compiling Perl modules from the command line, an administrative tool can use the Comprehensive Perl Archive Network (CPAN) module to automate and log the process. Configuration hooks even can be written into servers such as Apache to provide a finer grain of control over the state of each Web server process.

DBI and Advanced Data Sets

Perl's DBI module provides a common interface to databases that use the Structured Query Language (SQL), with a database driver (DBD) module handling each type of database. However, DBI takes a literal approach to database access. It includes only those access methods that are common to SQL databases, without taking into account applications that need to access the database in a way that's awkward in standard SQL. For instance, applications that need to page through a large data set or access multiple databases won't get any inherent support from DBI. It's possible to write abstraction layers on top of DBI that accomplish these types of tasks, but a more satisfying solution would be to implement generic layers that suit a wide variety of these needs with the potential for greater efficiency.

The area of database access will probably gain more attention as the modules used for basic database interaction mature. After the basics are implemented, more effort can be devoted to solving real-world problems with generic interfaces. Two modules that hold great promise in this area are DBIx::Recordset and DBD::Multiplex, both of which try to present simplified interfaces to procedures that existing database servers make complex.

DBIx::Recordset

SQL databases can be very useful for sorting, aggregating, and searching through a large amount of data efficiently, but the SQL interface itself doesn't necessarily provide a robust way of dealing with the resulting data. Record sets, which encapsulate a set of responses to a specific class of database query, might provide a better way of interacting with database results than SQL normally provides. For instance, the results of a database search on a site such as Google usually span a number of Web pages. The traditional way of generating these pages from a SQL query would involve processing the query for each page, displaying only the results needed for the page, and discarding the rest of the results returned from the database. For search queries that return ten or twenty pages, this process can be very inefficient. However, the database server itself usually provides no way of saving the results of a query or partitioning those results into usable chunks.

DBIx::Recordset implements an abstraction layer between a Perl program and the DBI interface to treat a particular search less like a SQL query and more like an abstract record set. It does this for a number of reasons:

Composing a SQL query based on a large number of input or result fields can be tedious work to program.
SQL syntax tends to be more strict than most Web applications need, which makes it difficult to change applications without explicit testing.
The results of a SQL query might need to be returned over a series of requests or in other arbitrary groups based solely on order.

One method that DBIx::Recordset provides specifically for Web applications is PrevNextForm, which creates a search results list based on the contents of a data set. As new requests come in for additional records from the same query, PrevNextForm finds the next subset of results from the record set and displays it, along with controls for accessing the previous or next set. The current version of this function isn't implemented in a very efficient fashion. However, the basic idea of abstracting the function itself away from the underlying implementation enables the implementation to be optimized over time–using more efficient techniques as they become available–without requiring any changes in the Web applications that use the function.
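The general idea of handing out one subset of a record set together with previous and next controls can be sketched independently of any module. The Python below is not DBIx::Recordset's implementation or interface; the function name and the shape of the returned dictionary are hypothetical.

```python
def page_of(records, start, page_size):
    """Return one page of a record set plus the offsets for prev/next controls."""
    # Clamp the starting offset so that a stale link cannot run off the end.
    start = max(0, min(start, max(len(records) - 1, 0)))
    rows = records[start:start + page_size]
    prev_start = max(0, start - page_size) if start > 0 else None
    next_start = start + page_size if start + page_size < len(records) else None
    return {"rows": rows, "prev": prev_start, "next": next_start}
```

With 25 records and a page size of 10, the page starting at offset 20 holds the last five records, offers 10 as the previous offset, and has no next offset. A Web application built on such a function stays the same whether the records come from a fresh query or from a saved result.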

Eventually, an interface such as DBIx::Recordset could implement a generic form of result caching, which satisfies many uses for record sets with an efficient implementation. Results from common queries could be stored for use as master record sets, which could be parceled out based on the needs of each application request. These sets could be kept in memory, in a cache table created in the same database, or in a separate database specifically for record sets. Each case would require a cache control mechanism that would keep track of data updates and the age of each record set, but such a mechanism could use widely implemented caching techniques. In fact, the module might enable an implementation to be specified for each type of record set, which would provide the greatest flexibility for different performance needs.
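A cache control mechanism of the kind described needs only two pieces of bookkeeping: the age of each stored record set and the tables it depends on. The Python sketch below keeps record sets in memory; the class, its methods, and the idea of passing the table names explicitly are all assumptions made for illustration and are not part of DBIx::Recordset.

```python
import time


class RecordSetCache:
    """Keep master record sets in memory, expiring them by age or data update."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # query text -> (stored_at, tables, rows)

    def get(self, query, tables, run_query):
        """Return cached rows for the query, running it only when needed."""
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] <= self.max_age:
            return entry[2]
        rows = run_query(query)
        self._entries[query] = (now, frozenset(tables), rows)
        return rows

    def invalidate(self, table):
        """Drop every record set that depends on a table that was just updated."""
        stale = [q for q, (_at, tables, _rows) in self._entries.items() if table in tables]
        for query in stale:
            del self._entries[query]
```

The same interface could sit in front of a cache table or a separate database; only the storage behind the entries would change.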

DBD::Multiplex

Scaling a Web server architecture sometimes creates a bottleneck in accessing a database back end. The best solution to this would be to create a load-balanced database in the same way Web servers are balanced. Unfortunately, most database servers don't have the capability to handle their own load balancing or synchronization. Those that do require either an expensive addition to the software or complex configuration–with no guarantee that the new configuration will solve the initial performance problem.

A Perl-based solution to this problem is offered by the DBD::Multiplex module, which enables multiple databases to be accessed as a single data source. The module is in early development, but its potential uses make it promising. The idea behind DBD::Multiplex is that each call to the data source would be sent to one or more servers in a list of subordinate data sources. A SELECT statement, for instance, could be handled by a database server chosen randomly from a list to spread the processing load across multiple servers. An INSERT or UPDATE statement, on the other hand, would be sent to all servers in the list to keep them synchronized with each other. Between the two, any type of database could be balanced across servers without requiring its own synchronization or load balancing capability. Additionally, the Multiplex driver could be configured to resend failed statements to another server and warn administrators of the failure. Results from multiple servers could be compared and incorrect results discarded, as is done in a redundant array of independent disks (RAID).
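The read-one, write-all idea can be demonstrated with a small Python class over several SQLite connections. This is a sketch of the concept only, not the DBD::Multiplex interface: the class name, the failures list, and the rule of classifying a statement by its first word are assumptions made for illustration.

```python
import random
import sqlite3


class Multiplexer:
    """Present several identical databases as a single data source."""

    def __init__(self, connections, rng=None):
        self.connections = list(connections)
        self.rng = rng or random.Random()
        self.failures = []  # (connection index, error text) for administrators

    def execute(self, sql, params=()):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return self._read(sql, params)
        return self._write(sql, params)

    def _read(self, sql, params):
        # Try the servers in random order; move on to the next if one fails.
        order = list(range(len(self.connections)))
        self.rng.shuffle(order)
        for index in order:
            try:
                return self.connections[index].execute(sql, params).fetchall()
            except sqlite3.Error as exc:
                self.failures.append((index, str(exc)))
        raise RuntimeError("no server could answer the query")

    def _write(self, sql, params):
        # Send every change to all servers so that the copies stay identical.
        for conn in self.connections:
            conn.execute(sql, params)
            conn.commit()
```

The sketch makes no attempt to recover when a write succeeds on some servers and fails on others, which is the difficult part of keeping real copies synchronized.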

Initially, DBD::Multiplex will provide the capability to interact with symmetrically load-balanced database servers. Each server would have to start as a copy of all the others, both in server type and in the data stored. Eventually, the possibility exists that database servers could be clustered around a single DBD::Multiplex data source. Databases could be split into functional groups of tables that reside on different clusters of balanced servers. Routing between Web applications and the necessary servers would be handled by DBD::Multiplex. For instance, a SELECT statement involving the messages and users tables might be routed to a server that holds copies of those tables, while one involving the headlines table would be routed to another cluster. This type of clustering would reduce the size of each database and remove some of the need for synchronization between servers. Combined with application engine clustering, it also could reduce the number of cached database connections overall.
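Routing by table could look something like the following Python sketch. It is an illustration only: the cluster names are hypothetical, and the regular expression is a crude stand-in for a SQL parser that recognizes only the single table name following FROM, JOIN, INTO, or UPDATE.

```python
import re

# Crude pattern, not a SQL parser: one table name after each keyword.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN|INTO|UPDATE)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def tables_in(sql):
    """Return the lower-cased table names that the pattern finds in a statement."""
    return {name.lower() for name in TABLE_PATTERN.findall(sql)}


def cluster_for(sql, cluster_tables):
    """Pick the first cluster that holds every table the statement uses."""
    needed = tables_in(sql)
    for cluster, tables in cluster_tables.items():
        if needed and needed <= set(tables):
            return cluster
    raise LookupError(f"no single cluster holds tables {sorted(needed)}")
```

Given {"community": {"messages", "users"}, "news": {"headlines"}}, a query joining messages to users is routed to the community cluster and a query on headlines to the news cluster. A statement that spans both groups raises an error, which is the price of splitting the tables.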

In addition, different types of servers could be clustered into equivalent groups based on a match between the data they carry. For instance, a PostgreSQL server being installed could be listed in parallel with an Oracle server it's replacing. Queries would be sent to both servers, with the result from the Oracle database taking priority. Errors in the configuration of the PostgreSQL server would show up under continuous real-world use, but the Web application user would be shielded from such errors. Eventually the new database could be certified error-free in continuous use, and the older database could be removed from the access list or relegated to a fail-over backup. The progression from one database system to the next would be seamless from the point of view of Web application users.
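The migration scenario comes down to running each query against both servers, returning the established server's answer, and recording any disagreement. In the Python sketch below, primary and candidate are hypothetical callables that wrap the two database connections, and the mismatch list stands in for whatever log administrators would review.

```python
def shadow_query(primary, candidate, sql, params=(), mismatches=None):
    """Run a query on both servers and return only the primary's result.

    Errors and differing results from the candidate are recorded in
    mismatches and never reach the caller.
    """
    if mismatches is None:
        mismatches = []
    expected = primary(sql, params)
    try:
        actual = candidate(sql, params)
    except Exception as exc:
        mismatches.append((sql, f"candidate error: {exc}"))
    else:
        if actual != expected:
            mismatches.append((sql, "candidate result differs"))
    return expected
```

When the mismatch list stays empty under continuous use, the two callables can be swapped so that the new server takes priority and the old one is demoted to the shadow role.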

Summary

Even the most efficient Perl-based Web applications eventually need to scale to meet increased traffic demands. One approach, load balancing, involves duplicating each Web server exactly. Traffic can be routed to load-balanced servers either by an external solution, before requests are processed by the Web server, or by a Perl solution, after the Web server has received the requests. Another approach, clustering, involves breaking a Web site into functional components that can reside on different machines and then using a more sophisticated load balancer to route requests to the appropriate components. With either approach, it usually becomes necessary to manage server and Web application configurations using a development server and a publishing mechanism. Databases don't always scale well, however, so Perl solutions such as DBIx::Recordset and DBD::Multiplex will become more important as Web applications are required to reduce their impact on the database and potentially use multiple databases in parallel.