3/3/12                                                      No Title




     Building Scalable Websites with Perl
     by Perrin Harkins



     Who is doing it?
     First, let's establish some credit with any doubters in the audience. I shouldn't have to tell you this, but Perl
     runs some of the largest websites in the world. Take a look at some of the better-known examples:

     Yahoo.com uses Perl in nearly all of their properties, in particular the personalized My Yahoo service. On
     the whole, Yahoo serves three billion page views per day, and about 100 million unique users. Yahoo owns
     Overture, the largest sponsored search company. According to their posting on the Perl jobs list at
     http://jobs.perl.org/, they handle "more than 10 billion transactions per month!"

     Amazon.com, the company that pretty much defines e-commerce, uses Perl on their main site and partner
     sites. Amazon also operates the popular Internet Movie Database, IMDB.com, which is built in Perl.

     Ticketmaster.com, the largest on-line ticket retailer, is built almost entirely with Perl. So is its sister
     company, CitySearch.com, which operates the most widely-used city guide sites in the US.

     Nielsen NetRatings says that Yahoo, Amazon, and InterActiveCorp, which owns Ticketmaster Online and
     CitySearch, are all in the top 10 in terms of overall web traffic. We're talking about phenomenal numbers of
     users and page views here. By comparison, Slashdot.org, which people frequently point to as a high traffic
     site using Perl, is barely a drop in the bucket.



     How are they doing it?
     Okay, so your company probably doesn't get as much traffic as Yahoo. Still, you may be wondering, what
     is it that really large sites do that allows them to scale so big, and is it something you could apply to your
     own sites?

     Obviously, these are all very different applications. There is no single solution for scaling all of them. Even
     buying a lot of hardware isn't a magic bullet, since it just isn't feasible to buy enough computing power to
     prop up a slow application at these levels of traffic. However, what you discover when you talk to people
     who work at these sites is that there are a few common techniques that tend to get used by almost everyone
     in one form or another. These are fundamental software techniques that have been around for ages, not
     some kind of newly invented Internet magic. Feel free to refer to them as design patterns if it will raise your
     salary. Today we're going to talk about a couple of these and how they apply to web development
     problems.



     Things we won't be covering
file:///Users/perrinharkins/Conferences/scalable_talk.html                                                                1/7

     I should also mention what we're not going to talk about.

     We're not going to talk about mod_perl tuning: httpd.conf settings, reverse proxy configurations, increasing
     copy-on-write memory sharing, running the profiler... This stuff is very well-documented in the mod_perl
     books and the on-line documentation at http://perl.apache.org/. If you're serious about building a scalable
     site and you haven't read these resources yet, get on it!

     We're not going to talk about DBI tuning. Tim Bunce has detailed slides from his talks available on CPAN
     (http://search.cpan.org/~timb/), and there is more in the mod_perl documentation and books.

     We're not going to talk about hardware because, well, I'm not very interested in hardware. That's for
     cheaters. (However, I'm willing to cut the sites I mentioned above a little slack on this...)



     Caching
     Caching helps performance by reducing the amount of work that needs to be done, and helps scalability by
     reducing the load on shared resources like databases. All of the sites I mentioned above cache like mad
     wherever they can. Page caching, object caching, de-normalized database tables - all of these are variations
     on a theme. Even if your data is so volatile that it changes every 30 seconds, if it only takes 1 second to
     generate it you will still get to serve it from cache for the other 29.

     Whole Pages
     If you can possibly get away with it, cache entire HTML pages and serve them as static files. This is simply
     unbeatable from a performance standpoint. Web servers and operating systems have been tuned to serve
     static files with incredible efficiency. When I worked at eToys.com, we were caching all of the non-
     interactive pages (i.e. the ones that people just browsing the catalog would see) as static files, and serving
     those pages was about ten times as fast as generating the same page on the fly, even when all of the data
     needed to create the page was cached in our mod_perl servers.

     There are a few ways to make this happen. One of them is to simply write out all of the possible pages on
     your site on a regular basis. You can write a big batch job that generates all the files for your website,
     probably by reading a database and then pounding the data through templates. Sometimes people write
     elaborate versions of this, with dependency checking and make-like functionality. See the ttree program that
     comes with Template Toolkit for one take on it.
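
     A batch publisher of this kind can be sketched in a few lines. The version below is only an illustration:
     the %products hash stands in for the database, the [% name %] placeholder substitution stands in for a
     real template engine such as Template Toolkit, and publish_all and the output file names are made up.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical stand-in for rows read from a database.
my %products = (
    101 => { name => 'Wooden Train', price => '19.99' },
    102 => { name => 'Stuffed Bear', price => '12.50' },
);

# Stand-in for a real template; [% key %] is replaced by the row value.
my $template = '<html><body><h1>[% name %]</h1><p>$[% price %]</p></body></html>' . "\n";

sub render {
    my ($tmpl, $row) = @_;
    (my $html = $tmpl) =~ s{\[%\s*(\w+)\s*%\]}{ defined $row->{$1} ? $row->{$1} : '' }ge;
    return $html;
}

# Write one static file per row; returns the number of pages written.
sub publish_all {
    my ($out_dir, $rows) = @_;
    make_path($out_dir);
    my $count = 0;
    for my $id (sort keys %$rows) {
        my $final = File::Spec->catfile($out_dir, "product_$id.html");
        my $tmp   = "$final.tmp";
        open my $fh, '>', $tmp or die "Cannot write $tmp: $!";
        print {$fh} render($template, $rows->{$id});
        close $fh or die "Cannot close $tmp: $!";
        # rename is atomic on the same filesystem, so the web server
        # never serves a half-written page.
        rename $tmp, $final or die "Cannot rename $tmp: $!";
        $count++;
    }
    return $count;
}
```

     Writing to a temporary file and renaming it matters more than it looks: the batch job runs while the site
     is live, and a plain overwrite would expose partially written pages.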

     However, you can also do this for a site that was not built to be pre-published. Many tools exist for
     spidering websites to local copies, so all you have to do is point one at your dynamic site and dump it out as
     static files.
         wget --mirror --convert-links --html-extension --reject gif,jpg,png \
              --no-parent http://app-server/dynamic/pages/

     In reality, most sites would end up needing something more customized than this, but a simple tool like this
     can give you something to do benchmarks on at least.

     This kind of approach is only feasible if your site is small enough to write out the whole thing on a regular
     basis. If you have a site which is a front-end to a large database of some kind, you might have potentially
     millions of different pages to publish. There might be a few that get the vast majority of the hits though, and
     are thus worth caching. Rather than try to figure out which ones to pre-publish, you can use a generate-on-
     demand approach. This is what most people think of when they hear talk about caching web pages.

     The simplest way to do that is with a caching proxy server. If you've read the mod_perl documentation you
     should be familiar with the idea of a reverse proxy, sometimes called an HTTP accelerator. It's an HTTP
     proxy that sits in front of your server, passing through requests for dynamic pages. You can configure it to
     cache the pages and then tell it how long to keep them by setting the Expires and Cache-Control
     headers during page generation.

         ProxyRequests Off

         ProxyPass /dynamic/stuff http://app-server/
         ProxyPassReverse /dynamic/stuff http://app-server/
         CacheRoot "/mnt/proxy-cache"
         CacheSize 500000
         CacheGcInterval 12
         CacheMaxExpire 36
         CacheDefaultExpire 2

     These pages are not quite as fast as regular static ones -- mod_proxy checks the headers at the top of the file
     to make sure it hasn't expired before serving it. However, they are much faster than dynamic generation.
     Note that this will only work for pages which you can generate on the fly in a reasonable amount of time. If
     you have a page that takes two minutes to generate, you need to generate it before users ask for it. Of course
     you can still use this approach, and seed it with some artificial requests beforehand, which will basically
     give you a mix between the generate-on-demand and pre-generation approaches.
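
     Setting those headers from the page generator might look like the following sketch. It assumes a plain
     CGI-style script that prints its own headers; http_date and cache_headers are made-up helper names, and
     under mod_perl you would set the same two headers through the request object instead.

```perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date, which is always given in GMT.
# Built by hand so the day and month names never depend on the locale.
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAY[$wday], $mday, $MON[$mon], $year + 1900, $hour, $min, $sec);
}

# Header lines telling a cache it may keep the page for $ttl seconds.
# $now can be passed in explicitly, which makes the function testable.
sub cache_headers {
    my ($ttl, $now) = @_;
    $now = time unless defined $now;
    return 'Expires: ' . http_date($now + $ttl) . "\r\n"
         . "Cache-Control: max-age=$ttl\r\n";
}

# Usage in a CGI-style generator, caching the page for five minutes:
# print "Content-Type: text/html\r\n", cache_headers(300), "\r\n", $html;
```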

     One final variation worth mentioning is intercepting the 404 error. It works like this: you set up your
     program as the handler for 404 "NOT FOUND" errors on the site. When a page is requested that is not
     found on the file system, that triggers a 404 and sends the request over to you. You then generate the
     requested page, and write it out to the file system so that it will be there the next time someone comes
     looking for it.

     This is the approach that Vignette StoryServer uses for caching, or at least it did, back in the early days
     when it was spun off from cnet.com. It's easy to configure an Apache server to do this:

         ErrorDocument 404 /page/generator

     This will make Apache do an internal redirect to the program at /page/generator, passing information about
     the URL originally requested as environment variables. This program writes out the file, and then, if you're
     using mod_perl, you can just do an internal redirect to the newly generated page and let Apache handle it
     like any other file.
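
     A bare-bones page generator along these lines is sketched below. It assumes a CGI-style program rather
     than a mod_perl handler, and relies on Apache passing the originally requested path in the REDIRECT_URL
     environment variable. The build_page function is a hypothetical stand-in for the real page generation,
     and $DOC_ROOT is a made-up setting.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

our $DOC_ROOT = '/var/www/html';    # assumed document root

# Hypothetical stand-in for the real page generation code.
sub build_page {
    my ($path) = @_;
    return "<html><body>Generated page for $path</body></html>\n";
}

# Map a requested URL path to a file under the document root, refusing
# anything that is not a simple .html path or that tries to escape it.
sub target_file {
    my ($url_path) = @_;
    return undef unless defined $url_path;
    return undef unless $url_path =~ m{^(?:/[\w.-]+)+\.html$};
    return undef if index($url_path, '..') >= 0;
    return $DOC_ROOT . $url_path;
}

# Generate the page, store it where the next request will find it as a
# static file, and return the HTML (or undef if the path was rejected).
sub generate_and_store {
    my ($url_path) = @_;
    my $file = target_file($url_path) or return;
    my $html = build_page($url_path);
    make_path(dirname($file));
    my $tmp = "$file.$$.tmp";
    open my $fh, '>', $tmp or die "Cannot write $tmp: $!";
    print {$fh} $html;
    close $fh or die "Cannot close $tmp: $!";
    rename $tmp, $file or die "Cannot rename $tmp: $!";
    return $html;
}

# CGI entry point: only runs when Apache invoked us for a 404.
if (defined $ENV{REDIRECT_URL}) {
    my $html = generate_and_store($ENV{REDIRECT_URL});
    if (defined $html) {
        # Override the 404 status, otherwise the page goes out as an error.
        print "Status: 200 OK\r\nContent-Type: text/html\r\n\r\n", $html;
    }
    else {
        print "Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\nNot found\n";
    }
}
```

     Since this version has no internal redirect available, it simply prints the page it just wrote. Every
     later request for the same URL never reaches the program at all.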

     The upside is great performance, since the pages are served as normal static files. The downside of this is
     that you then have to manage expiring these pages yourself, probably by writing a cron job that will check
     for ones that are too old and delete them. You run the risk of serving a file a little after its expiration time if
     the cron job doesn't run frequently enough. In general, I think the caching proxy approach is easier to
     manage, but if you are using something other than mod_perl -- like FastCGI, which already separates the
     Perl interpreters from the web server -- there is not as much incentive to run a proxy.
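
     The expiration job itself can be very small. Here is a sketch, assuming the generated pages are the only
     .html files under the directory you point it at and that file modification time is a good enough measure
     of age. The expire_pages name and the paths in the crontab comment are made up.

```perl
use strict;
use warnings;
use File::Find qw(find);

# Delete .html files under $dir that were last modified more than
# $max_age seconds ago. Returns the list of files that were removed.
sub expire_pages {
    my ($dir, $max_age) = @_;
    my $cutoff = time - $max_age;
    my @removed;
    find(sub {
        return unless -f $_ && /\.html\z/;
        my $mtime = (stat _)[9];    # reuse the stat done by -f
        return unless defined $mtime && $mtime < $cutoff;
        push @removed, $File::Find::name if unlink $_;
    }, $dir);
    return @removed;
}

# Example crontab entry (hypothetical paths), run every five minutes:
# */5 * * * * /usr/local/bin/expire_pages.pl /var/www/html/cache 1800
if (@ARGV == 2) {
    my @gone = expire_pages(@ARGV);
    print "Removed ", scalar(@gone), " expired page(s)\n";
}
```

     The worst-case staleness is the maximum age plus the cron interval, so pick the interval with that in mind.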

     Chunks of HTML or data
     Many of you were probably thinking during that last part "That sounds great, but my web designers insisted
     on putting the current user's name on every page. I can't cache the whole thing." Obviously sites like
     Amazon or My Yahoo can't cache the whole page either. They can cache pieces of pages though, and
     reduce the page generation to little more than knitting the pieces together, like server-side includes. Yahoo
     uses this technique quite a bit, generating the pieces of content for the portal in advance, and building a
     custom template for each user based on their preferences that includes the appropriate pieces at request-time.

     By the way, you may be aware that PHP is being used at Yahoo now and assumed that this meant it was
     replacing Perl. That's not the case. PHP is mostly being used for this sort of include-template work,
     replacing some older in-house solutions that Yahoo used to use. The content generation that was done in
     Perl is still being done in Perl.

     The caching built into the Mason web development framework is a good example of caching pieces. It
     allows you to cache arbitrary content with a key and an expiration time and then retrieve it later.

         my $result = $m->cache->get($search_term);
         if (!defined($result)) {
             $result = run_search($search_term);
             $m->cache->set($search_term, $result, '30 min');
         }

     You can cache generated HTML, or you can cache data which you've fetched from a database or
     elsewhere. Caching the generated HTML gives better performance, because it allows you to skip more
     work when you get a cache hit (the HTML generation), but caching at the data level means you get to reuse
     the cached content if it shows up in multiple different layouts. That increases your chances of getting a
     cache hit. Rent.com, one of the top apartment listing services on the web, uses Mason's cache to store
     results on a commonly used search page. Since there is a fair amount of repetition in these searches, they are
     able to serve 55% of the search hits from cache instead of going to the database. That also frees up database
     resources for other things.

     I created a simple plugin module for Template Toolkit that adds partial-page caching, which is available on
     CPAN as Template::Plugin::Cache. It's only really useful if you have templates that do a lot of work,
     fetching data and the like inside the template itself, which is generally not the best way to use Template
     Toolkit. When using a model-view-controller style of development, you will typically be caching data and
     doing it before you get to the templates.

     If you want to add caching to your application, there are several good options on CPAN. For a local cache
     on a single machine, I would recommend Rob Mueller's Cache::FastMmap. BerkeleyDB is about the same
     speed if you use the OO interface and built-in locking, but you'd have to build the cache expiration code
     yourself. Both of these are several times as fast as the popular Cache::FileCache module and hundreds of
     times faster than any of the modules built on top of IPC::ShareLite.

         use Cache::FastMmap;

         our $Cache = Cache::FastMmap->new(
             cache_size  => '500m',
             expire_time => '30m',
         );

         $Cache->set($key, $value);
         my $cached = $Cache->get($key);

     My only real complaint about Cache::FastMmap is that it doesn't provide a way to set different expiration
     times for individual items. You could add this yourself in a wrapper around Cache::FastMmap, but at that
     point it loses its main advantage over BerkeleyDB, which is the built-in expiration and purging
     functionality.
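
     Such a wrapper is not much code. The sketch below stores each value together with its own expiry
     timestamp and treats stale entries as misses. It is written against any backend object that offers get,
     set, and remove methods, as Cache::FastMmap does. The ExpiringCache and HashBackend package names are
     made up, and the tiny hash-based backend exists only to keep the example self-contained.

```perl
use strict;
use warnings;

package HashBackend;    # stand-in for a real cache object

sub new    { return bless { data => {} }, shift }
sub get    { return $_[0]{data}{ $_[1] } }
sub set    { $_[0]{data}{ $_[1] } = $_[2]; return 1 }
sub remove { return delete $_[0]{data}{ $_[1] } }

package ExpiringCache;

sub new {
    my ($class, $backend) = @_;
    return bless { backend => $backend }, $class;
}

# Store $value so that it expires $ttl seconds from now.
sub set {
    my ($self, $key, $value, $ttl) = @_;
    return $self->{backend}->set($key, [ time() + $ttl, $value ]);
}

# Return the value, or undef if it is missing or past its expiry time.
sub get {
    my ($self, $key) = @_;
    my $entry = $self->{backend}->get($key) or return undef;
    my ($expires_at, $value) = @$entry;
    if (time() >= $expires_at) {
        $self->{backend}->remove($key);    # purge the stale entry
        return undef;
    }
    return $value;
}

package main;

my $cache = ExpiringCache->new(HashBackend->new);
$cache->set(search_results => [ 'apt 1', 'apt 2' ], 30 * 60);
my $hit = $cache->get('search_results');    # array ref until 30 minutes pass
```

     Note that a real backend's own global expiration still applies, so an item can never outlive that limit
     no matter what per-item time you give it.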


     For a cache that needs to be shared across a whole cluster of machines, you need something different.
     Memcached (http://www.danga.com/memcached/) is a cache server that you can access over the network. It
     keeps the cached items in RAM, but can be scaled for large amounts of data by running it on multiple
     servers. Requests are automatically hashed across the available servers, spreading the data set out across all
     of them. It uses some recent advances like the epoll system call in the Linux 2.6 kernel to offer impressive
     scalability. The livejournal.com website is currently using memcached.

         use Cache::Memcached;

         my $memd = Cache::Memcached->new({
             'servers' => [ "10.0.0.15:11211", "10.0.0.15:11212",
                            "10.0.0.17:11211", [ "10.0.0.17:11211", 3 ] ],
             'debug'   => 0,
             'compress_threshold' => 10_000,
         });

         $memd->set($key, $value, 5*60);    # expire after five minutes
         my $cached = $memd->get($key);

     If that sounds like more than you want to deal with, you can make something simple with MySQL. Because
     MySQL has an option to use a lightweight non-transactional table type, it is a good choice for this kind of
     application. Just create a simple table with key, value, and expiration time columns and use it the way you
     would use a hash. If you follow DBI best practices, you can get performance that beats most of the cache
     modules on CPAN except the ones I mentioned here.



     Job Queuing
     I could go on for hours about caching, but there are other important things to cover.

     Let's say you run a website that sells concert tickets. That means that at a specific, publicly-announced time,
     Madonna tickets will go on sale. That, in turn, means that a staggering number of people will all be waiting
     at 11am on Sunday morning with their fingers poised above the mouse button ready to click "buy" until
     they get a ticket. But wait, it gets worse! In order to give people who are trying to buy tickets by phone or in
     person a fair shot at the action, you are only allowed to put holds on a certain number of tickets at a time,
     meaning that only that number of people can be in the process of actually buying a ticket at once. Does this
     sound like a good way to ruin your weekend? This is the sort of thing that the ticketmaster.com site has to
     deal with routinely.

     How do you handle excessive demand for a limited resource? The same way you do it in real life: you
     make people line up for it. Queues are a common approach for preventing overloading and making efficient
     use of resources.

     [ queue diagram ]

     So, what have we accomplished with our queue? First of all, we have control of how many processes are
     handling requests in parallel, so we won't overwhelm our backend systems. Second, since it hardly takes
     any time at all to queue a request or check status, we are keeping our web server processes free to handle
     more users. The site will be responsive even when there are far more users on it sending in requests than we
     can actually handle at one time. Finally, we are providing frequently updated status information to users, so
     they won't leave or try to resubmit their requests.
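
     The core of such a queue can be sketched as below. This is a single-process, in-memory illustration only,
     not Ticketmaster's system: JobQueue, its method names, and the max_active limit (standing in for the
     number of ticket holds allowed at once) are all made up for the example.

```perl
use strict;
use warnings;

package JobQueue;

sub new {
    my ($class, %args) = @_;
    return bless {
        max_active => $args{max_active} || 1,
        waiting    => [],    # job ids in arrival order
        active     => {},    # job ids currently being processed
        next_id    => 1,
    }, $class;
}

# The cheap step done inside the web request: add a job, return its id.
sub enqueue {
    my ($self) = @_;
    my $id = $self->{next_id}++;
    push @{ $self->{waiting} }, $id;
    return $id;
}

# Status for the "please wait" page: 0 means the job is being processed,
# N > 0 means N-th in line, undef means unknown or already finished.
sub position {
    my ($self, $id) = @_;
    return 0 if $self->{active}{$id};
    my $waiting = $self->{waiting};
    for my $i (0 .. $#$waiting) {
        return $i + 1 if $waiting->[$i] == $id;
    }
    return undef;
}

# Called by a worker: take the next job only if a slot is free.
sub start_next {
    my ($self) = @_;
    return undef if scalar(keys %{ $self->{active} }) >= $self->{max_active};
    my $id = shift @{ $self->{waiting} };
    return undef unless defined $id;
    $self->{active}{$id} = 1;
    return $id;
}

# Called by a worker when the job is done, freeing its slot.
sub finish {
    my ($self, $id) = @_;
    delete $self->{active}{$id};
    return;
}

package main;
```

     The web processes only ever call enqueue and position, both of which return immediately. All the slow
     work happens behind start_next, and max_active caps how much of it runs at once.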

     Queues are also useful when you have long-running jobs. For example, suppose you're building a site that
     compares prices on hotel rooms by making price quote requests to a bunch of remote servers and comparing
     them. That could take some time, even if you send the requests in parallel.

     You can keep the browser from timing out by using the standard forking technique, where you fork off a
     process to do the work and return an "in progress" page. When the forked process finishes handling the
     request, it writes the results to a shared data location, like a database or session file. Meanwhile, the page
     reloads, and until the results are available it just keeps sending back the "in progress" page. Randal Schwartz
     has an article on-line that demonstrates this technique. It's located at
     http://www.stonehenge.com/merlyn/WebTechniques/col20.html.

     However, this doesn't completely solve the problem. Say these jobs take 15 seconds to complete. What
     happens if 1000 people come in and submit jobs during 15 seconds? You'll have 1000 new processes
     forked! A queue approach avoids this, by just dropping the requests onto the queue and letting the already-
     running job processors handle them at a fixed rate.
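
     A bit of arithmetic shows the trade you are making. The sketch below uses the 15-second job from the
     example and an assumed pool of 50 workers (the pool size is my own illustration, not a number from any
     real site) to estimate how long a burst takes to drain when jobs are processed in waves.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Seconds until the last of $jobs queued jobs finishes, given $workers
# parallel workers that each take $job_secs per job.
sub drain_time {
    my ($jobs, $workers, $job_secs) = @_;
    return ceil($jobs / $workers) * $job_secs;
}

# 1000 jobs, 50 workers, 15 seconds each: 20 waves of 15 seconds.
printf "%d seconds\n", drain_time(1000, 50, 15);    # prints "300 seconds"
```

     The unluckiest user waits five minutes instead of 15 seconds, but the machine never runs more than 50 job
     processes, and the status page keeps that user informed in the meantime.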

     Modules to Use
     Now that you know what queues are good for, where do you get one? The Ticketmaster code is closely tied
     to their backend systems, so it's not open source. There are some other options. One that you can grab from
     CPAN is Jason May's Spread::Queue. This is built on top of the Spread toolkit (http://spread.org/) for
     reliable multicast messaging. What Spread provides is a scalable way to send messages out across a cluster
     of machines and make sure they are received reliably and in order. It actually provides other things too, but
     this is the part that Spread::Queue is using.

     The system consists of three parts: a client library, a queue manager, and a worker library. The client library
     is called from your code when you want to add a request to the queue. That sends a request to the queue
     manager using Spread. You define your job processing code in a worker class. You can start as many
     worker processes as you like and they can be on any machine in the cluster. They will register themselves
     and begin accepting jobs.

     In the client process:

         use Spread::Queue::Sender;

         my $sender = Spread::Queue::Sender->new("myqueue");

         $sender->submit("myfunc", { name => "value" });
         my $response = $sender->receive;

     In the worker process:

         use Spread::Queue::Worker;

         my $worker = Spread::Queue::Worker->new("myqueue");
         $worker->callbacks(
             myfunc => \&myfunc,
         );
         # signal_handler is assumed to be defined elsewhere in the worker script
         $SIG{INT} = \&signal_handler;
         $worker->run;

         sub myfunc {
             my ($worker, $originator, $input) = @_;

             my $result = {
                 response => "I heard you!",
             };
             $worker->respond($originator, $result);
         }

     The Spread::Queue system looks very attractive, but there are a few things it could use. There doesn't seem
     to be a way to check where a particular job is in the queue, or even to ask if that job is done yet or not
     without blocking until it is done. Also, the queue is not stored in a durable way: it's just in the memory of
     the queue manager process, so if that process dies, the entire state of the queue is lost. Adding these features
     would make a good project for someone, and someone may be me if I need them before someone else does.



     Where to Learn More
     If some of these concepts are new to you, and you want to learn more about them, the good news is that
     there is lots of good technical writing on these subjects. The Perl Journal, including the "best of" collection
     that O'Reilly has been publishing, is a good resource, and so is the "Algorithms in Perl" book.

     The bad news is that some of the most interesting stuff is written for a Java audience. My advice is that if
     you want to learn how to do this scalable web development well, you can't be trapped in one community or
     one language -- you need to see what other people are doing. I like Martin Fowler's books, because he
     doesn't have an agenda to push and isn't trying to sell you on a particular tool or API. Similarly, the O'Reilly
     sites at http://oreillynet.com/, including http://onjava.com/, get some good stuff. The Java content is mostly
     open-source oriented so it's much less fluffy than most Java sites.



     Acknowledgements
     I'd like to thank Craig McLane and Adam Sussman of Ticketmaster, and Zack Steinkamp of Yahoo for
     being very generous with their time in answering my questions while I was working on this talk.




file:///Users/perrinharkins/Conferences/scalable_talk.html                                                               7/7

Contenu connexe

Tendances

Asynchronous Processing with Ruby on Rails (RailsConf 2008)
Asynchronous Processing with Ruby on Rails (RailsConf 2008)Asynchronous Processing with Ruby on Rails (RailsConf 2008)
Asynchronous Processing with Ruby on Rails (RailsConf 2008)Jonathan Dahl
 
ApacheCon 2014 - What's New in Apache httpd 2.4
ApacheCon 2014 - What's New in Apache httpd 2.4ApacheCon 2014 - What's New in Apache httpd 2.4
ApacheCon 2014 - What's New in Apache httpd 2.4Jim Jagielski
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Thijs Feryn
 
How to investigate and recover from a security breach in WordPress
How to investigate and recover from a security breach in WordPressHow to investigate and recover from a security breach in WordPress
How to investigate and recover from a security breach in WordPressOtto Kekäläinen
 
Scaling Twitter
Scaling TwitterScaling Twitter
Scaling TwitterBlaine
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Maarten Balliauw
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Thijs Feryn
 
Altitude San Francisco 2018: Programming the Edge
Altitude San Francisco 2018: Programming the EdgeAltitude San Francisco 2018: Programming the Edge
Altitude San Francisco 2018: Programming the EdgeFastly
 
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013Marcus Barczak
 
2015 ZendCon - Do you queue
2015 ZendCon - Do you queue2015 ZendCon - Do you queue
2015 ZendCon - Do you queueMike Willbanks
 
Ruby in the Browser - RubyConf 2011
Ruby in the Browser - RubyConf 2011Ruby in the Browser - RubyConf 2011
Ruby in the Browser - RubyConf 2011Ilya Grigorik
 
Pagespeed what, why, and how it works
Pagespeed   what, why, and how it worksPagespeed   what, why, and how it works
Pagespeed what, why, and how it worksIlya Grigorik
 
WordPress Speed & Performance from Pagely's CTO
WordPress Speed & Performance from Pagely's CTOWordPress Speed & Performance from Pagely's CTO
WordPress Speed & Performance from Pagely's CTOLizzie Kardon
 
Integrated Cache on Netscaler
Integrated Cache on NetscalerIntegrated Cache on Netscaler
Integrated Cache on NetscalerMark Hillick
 
A web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationA web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationJustin Dorfman
 
Improving WordPress performance (xdebug and profiling)
Improving WordPress performance (xdebug and profiling)Improving WordPress performance (xdebug and profiling)
Improving WordPress performance (xdebug and profiling)Otto Kekäläinen
 
Interactive web. O rly?
Interactive web. O rly?Interactive web. O rly?
Interactive web. O rly?timbc
 
HTTP Basic - PHP
HTTP Basic - PHPHTTP Basic - PHP
HTTP Basic - PHPSulaeman .
 

Tendances (20)

Asynchronous Processing with Ruby on Rails (RailsConf 2008)
Asynchronous Processing with Ruby on Rails (RailsConf 2008)Asynchronous Processing with Ruby on Rails (RailsConf 2008)
Asynchronous Processing with Ruby on Rails (RailsConf 2008)
 
ApacheCon 2014 - What's New in Apache httpd 2.4
ApacheCon 2014 - What's New in Apache httpd 2.4ApacheCon 2014 - What's New in Apache httpd 2.4
ApacheCon 2014 - What's New in Apache httpd 2.4
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018
 
How to investigate and recover from a security breach in WordPress
How to investigate and recover from a security breach in WordPressHow to investigate and recover from a security breach in WordPress
How to investigate and recover from a security breach in WordPress
 
Scaling Twitter
Scaling TwitterScaling Twitter
Scaling Twitter
 
Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...Sherlock Homepage - A detective story about running large web services - WebN...
Sherlock Homepage - A detective story about running large web services - WebN...
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018
 
Altitude San Francisco 2018: Programming the Edge
Altitude San Francisco 2018: Programming the EdgeAltitude San Francisco 2018: Programming the Edge
Altitude San Francisco 2018: Programming the Edge
 
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013
Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013
 
2015 ZendCon - Do you queue
2015 ZendCon - Do you queue2015 ZendCon - Do you queue
2015 ZendCon - Do you queue
 
Ruby in the Browser - RubyConf 2011
Ruby in the Browser - RubyConf 2011Ruby in the Browser - RubyConf 2011
Ruby in the Browser - RubyConf 2011
 
Pagespeed what, why, and how it works
Pagespeed   what, why, and how it worksPagespeed   what, why, and how it works
Pagespeed what, why, and how it works
 
WordPress Speed & Performance from Pagely's CTO
WordPress Speed & Performance from Pagely's CTOWordPress Speed & Performance from Pagely's CTO
WordPress Speed & Performance from Pagely's CTO
 
Integrated Cache on Netscaler
Integrated Cache on NetscalerIntegrated Cache on Netscaler
Integrated Cache on Netscaler
 
A web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationA web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentation
 
Os Furlong
Os FurlongOs Furlong
Os Furlong
 
Improving WordPress performance (xdebug and profiling)
Improving WordPress performance (xdebug and profiling)Improving WordPress performance (xdebug and profiling)
Improving WordPress performance (xdebug and profiling)
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
Interactive web. O rly?
Interactive web. O rly?Interactive web. O rly?
Interactive web. O rly?
 
HTTP Basic - PHP
HTTP Basic - PHPHTTP Basic - PHP
HTTP Basic - PHP
 

Scalable talk notes

  • 1. 3/3/12 No Title Building Scalable Websites with Perl by Perrin Harkins Who is doing it? First, let's establish some credit with any doubters in the audience. I shouldn't have to tell you this, but Perl runs some of the largest websites in the world. Take a look at some of the better-known examples: Yahoo.com uses Perl in nearly all of their properties, in particular the personalized My Yahoo service. On the whole, Yahoo serves three billion page views per day, and about 100 million unique users. Yahoo owns Overture, the largest sponsored search company. According to their posting on the Perl jobs list at http://jobs.perl.org/, they handle "more than 10 billion transactions per month!" Amazon.com, the company that pretty much defines e-commerce, uses Perl on their main site and partner sites. Amazon also operates the popular Internet Movie Database, IMDB.com, which is built in Perl. Ticketmaster.com, the largest on-line ticket retailer, is built almost entirely with Perl. So is it's sister company, CitySearch.com, which operates the most widely-used city guide sites in the US. Nielsen NetRatings says that Yahoo, Amazon, and InterActiveCorp, which owns Ticketmaster Online and CitySearch, are all in the top 10 in terms of overall web traffic. We're talking about phenomenal numbers of users and page views here. By comparison, Slashdot.org, which people frequently point to as a high traffic site using Perl, is barely a drop in the bucket. How are they doing it? Okay, so your company probably doesn't get as much traffic as Yahoo. Still, you may be wondering, what is it that really large sites do that allows them to scale so big, and is it something you could apply to your own sites? Obviously, these are all very different applications. There is no single solution for scaling all of them. Even buying a lot of hardware isn't a magic bullet, since it just isn't feasible to buy enough computing power to prop up a slow application at these levels of traffic. 
However, what you discover when you talk to people who work at these sites, is that there are a few common techniques that tend to get used by almost everyone in one form or another. These are fundamental software techniques that have been around for ages, not some kind of newly invented Internet magic. Feel free to refer to them as design patterns if it will raise your salary. Today we're going to talk about a couple of these and how they apply to web development problems. Things we won't be covering file:///Users/perrinharkins/Conferences/scalable_talk.html 1/7
  • 2. 3/3/12 No Title I should also mention what we're not going to talk about. We're not going to talk about mod_perl tuning: httpd.conf settings, reverse proxy configurations, increasing copy-on-write memory sharing, running the profiler... This stuff is very well-documented in the mod_perl books and the on-line documentation at http://perl.apache.org/. If you're serious about building a scalable site and you haven't read these resources yet, get on it! We're not going to talk about DBI tuning. Tim Bunce has detailed slides from his talks available on CPAN (http://search.cpan.org/~timb/), and there is more in the mod_perl documentation and books. We're not going to talk about hardware because, well, I'm not very interested in hardware. That's for cheaters. (However, I'm willing to cut the sites I mentioned above a little slack on this...) Caching Caching helps performance by reducing the amount of work that needs to be done, and helps scalability by reducing the load on shared resources like databases. All of the sites I mentioned above cache like mad wherever they can. Page caching, object caching, de-normalized database tables - all of these are variations on a theme. Even if your data is so volatile that it changes every 30 seconds, if it only takes 1 second to generate it you will still get to serve it from cache for the other 29. Whole Pages If you can possibly get away with it, cache entire HTML pages and serve them as static files. This is simply unbeatable from a performance standpoint. Web servers and operating systems have been tuned to serve static files with incredible efficiency. When I worked at eToys.com, we were caching all of the non- interactive pages (i.e. the ones that people just browsing the catalog would see) as static files, and serving those pages was about ten times as fast as generating the same page on the fly, even when all of the data needed to create the page was cached in our mod_perl servers. 
There are a few ways to make this happen. One of them is to simply write out all of the possible pages on your site on a regular basis. You can write a big batch job that generates all the files for your website, probably by reading a database and then pounding the data through templates. Sometimes people write elaborate versions of this, with dependency checking and make-like functionality. See the ttree program that comes with Template Toolkit for one take on it. However, you can also do this for a site that was not built to be pre-published. Many tools exist for spidering websites to local copies, so all you have to do is point one at your dynamic site and dump it out as static files. wget --mirror --convert-links --html-extension --reject gif,jpg,png --no-parent http://app-server/dynamic/pages/ In reality, most sites would end up needing something more customized than this, but a simple tool like this can give you something to do benchmarks on at least. This kind of approach is only feasible if your site is small enough to write out the whole thing on a regular basis. If you have a site which is a front-end to a large database of some kind, you might have potentially millions of different pages to publish. There might be a few that get the vast majority of the hits though, and file:///Users/perrinharkins/Conferences/scalable_talk.html 2/7
  • 3. 3/3/12 No Title are thus worth caching. Rather than try to figure out which ones to pre-publish, you can use a generate-on- demand approach. This is what most people think of when they hear talk about caching web pages. The simplest way to do that is with a caching proxy server. If you've read the mod_perl documentation you should be familiar with the idea of a reverse proxy, sometimes called an HTTP accelerator. It's an HTTP proxy that sits in front of your server, passing through requests for dynamic pages. You can configure it to cache the pages and then tell it how long to keep cache them by setting the Expires and Cache-Control headers during page generation. ProxyRequests Off ProxyPass /dynamic/stuff http://app-server/ ProxyPassReverse /dynamic/stuff http://app-server/ CacheRoot "/mnt/proxy-cache" CacheSize 500000 CacheGcInterval 12 CacheMaxExpire 36 CacheDefaultExpire 2 These pages are not quite as fast as regular static ones -- mod_proxy checks the headers at the top of the file to make sure it hasn't expired before serving it. However, they are much faster than dynamic generation. Note that this will only work for pages which you can generate on the fly in a reasonable amount of time. If you have a page that takes two minutes to generate, you need to generate it before users ask for it. Of course you can still use this approach, and seed it with some artificial requests beforehand, which will basically give you a mix between the generate-on-demand and pre-generation approaches. One final variation worth mentioning is intercepting the 404 error. It works like this: you set up your program as the handler for 404 "NOT FOUND" errors on the site. When a page is requested that is not found on the file system, that triggers a 404 and sends the request over to you. You then generate the requested page, and write it out to the file system so that it will be there the next time someone comes looking for it. 
This is the approach that Vignette StoryServer uses for caching, or at least it did, back in the early days when it was spun off from cnet.com. It's easy to configure an Apache server to do this: ErrorDocument 404 /page/generator This will make apache do an internal redirect to the program at /page/generator, passing information about the URL originally requested as environment variables. This program writes out the file, and then, if you're using mod_perl, you can just do an internal redirect to the newly generated page and let apache handle it like any other file. The upside is great performance, since the pages are served as normal static files. The downside of this is that you then have to manage expiring these pages yourself, probably by writing a cron job that will check for ones that are too old and delete them. You run the risk of serving a file a little after its expiration time if the cron doesn't do its job frequently enough. In general, I think the caching proxy approach is easier to manage, but if you are using something other than mod_perl -- like FastCGI, which already separates the Perl interpreters from the web server -- there is not as much incentive to run a proxy. Chunks of HTML or data Many of you were probably thinking during that last part "That sounds great, but my web designers insisted file:///Users/perrinharkins/Conferences/scalable_talk.html 3/7
  • 4. 3/3/12 No Title on putting the current user's name on every page. I can't cache the whole thing." Obviously sites like Amazon or My Yahoo can't cache the whole page either. They can cache pieces of pages though, and reduce the page generation to little more than knitting the pieces together, like server-side includes. Yahoo uses this technique quite a bit, generating the pieces of content for the portal in advanace, and building a custom template for each user based on their preferences that includes the appropriate pieces at request-time. By the way, you may be aware that PHP is being used at Yahoo now and assumed that this meant it was replacing Perl. That's not the case. PHP is mostly being used for this sort of include-template work, replacing some older in-house solutions that Yahoo used to use. The content generation that was done in Perl is still being done in Perl. The caching built into the Mason web development framework is a good example of caching pieces. It allows you to cache arbitrary content with a key and an expiration time and then retrieve it later. my $result = $m->cache->get($search_term); if (!defined($result)) { $result = run_search($search_term); $m->cache->set($search_term, $result, '30 min'); } You can cache generated HTML, or you can cache data which you've fetched from a database or elsewhere. Caching the generated HTML gives better performance, because it allows you to skip more work when you get a cache hit (the HTML generation), but caching at the data level means you get to reuse the cached content if it shows up in multiple different layouts. That increases your chances of getting a cache hit. Rent.com, one of the top apartment listing services on the web, uses Mason's cache to store results on a commonly used search page. Since there is a fair amount of repitition in these searches, they are able to serve 55% of the search hits from cache instead of going to the database. 
That also frees up database resources for other things. I created a simple plugin module for Template Toolkit that adds partial-page caching, which is available on CPAN as Template::Plugin::Cache. It's only really useful if you have templates that do a lot of work, fetching data and the like inside the template itself, which is generally not the best way to use Template Toolkit. When using a model-view-controller style of development, you will typically be caching data and doing it before you get to the templates. If you want to add caching to your application, there are several good options on CPAN. For a local cache on a single machine, I would recommend Rob Mueller's Cache::FastMmap. BerkeleyDB is about the same speed if you use the OO interface and built-in locking, but you'd have to build the cache expiration code yourself. Both of these are several times as fast as the popular Cache::FileCache module and hundreds of times faster than any of the modules built on top of IPC::ShareLite. our $Cache = Cache::FastMmap->new( cache_size => '500m', expire_time => '30m', ); $Cache->set($key, $value); my $value = $Cache->get($key); My only real complaint about Cache::FastMmap is that it doesn't provide a way to set different expiration times for individual items. You could add this yourself in a wrapper around Cache::FastMmap, but at that point it loses its main advantage over BerkeleyDB, which is the built-in expiration and purging functionality. file:///Users/perrinharkins/Conferences/scalable_talk.html 4/7
  • 5. 3/3/12 No Title For a cache that needs to be shared across a whole cluster of machines, you need something different. Memcached (http://www.danga.com/memcached/) is a cache server that you can access over the network. It keeps the cached items in RAM, but can be scaled for large amounts of data by running it on multiple servers. Requests are automatically hashed across the available servers, spreading the data set out across all of them. It uses some recent advances like the epoll system call in the Linux 2.6 kernel to offer impressive scalability. The livejournal.com website is currently using memcached. $memd = Cache::Memcached->new({ 'servers' => [ "10.0.0.15:11211", "10.0.0.15:11212", "10.0.0.17:11211", [ "10.0.0.17:11211", 3 ] ], 'debug' => 0, 'compress_threshold' => 10_000, }; $memd->set($key, $value, 5*60 ); my $value = $memd->get($key); If that sounds like more than you want to deal with, you can make something simple with MySQL. Because MySQL has an option to use a lightweight non-transactional table type, it is a good choice for this kind of application. Just create a simple table with key, value, and expiration time columns and use it the way you would use a hash. If you follow DBI best practices, you can get performance that beats most of the cache modules on CPAN except the ones I mentioned here. Job Queuing I could go on for hours about caching, but there are other important things to cover. Let's say you run a website that sells concert tickets. That means that at a specific, publicly-announced time, Madonna tickets will go on sale. That, in turn, means that a staggering number of people will all be waiting at 11am on Sunday morning with their fingers poised above the mouse button ready to click "buy" until they get a ticket. But wait, it gets worse! 
In order to give people who are trying to buy tickets by phone or in person a fair shot at the action, you are only allowed to put holds on a certain number of tickets at a time, meaning that only that number of people can be in the process of actually buying a ticket at once. Does this sound like a good way to ruin your weekend? This is the sort of thing that the ticketmaster.com site has to deal with routinely. How do you handle excessive demand for a limited resource? The same way you do it in real life: you make people line up for it. Queues are a common approach for preventing overloading and making efficient use of resources. [ queue diagram ] So, what have we accomplished with our queue? First of all, we have control of how many processes are handling requests in parallel, so we won't overwhelm our backend systems. Second, since it hardly takes any time at all to queue a request or or check status, we are keeping our web server processes free to handle more users. The site will be responsive even when there are far more users on it sending in requests than we can actually handle at one time. Finally, we are providing frequently updated status information to users, so they won't leave or try to resubmit their requests. Queues are also useful when you have long-running jobs. For example, suppose you're building a site that compares prices on hotel rooms by making price quote requests to a bunch of remote servers and comparing file:///Users/perrinharkins/Conferences/scalable_talk.html 5/7
them. That could take some time, even if you send the requests in parallel. You can keep the browser from timing out by using the standard forking technique, where you fork off a process to do the work and return an "in progress" page. When the forked process finishes handling the request, it writes the results to a shared data location, like a database or session file. Meanwhile, the page reloads, and until the results are available it just keeps sending back the "in progress" page. Randal Schwartz has an article on-line that demonstrates this technique. It's located at http://www.stonehenge.com/merlyn/WebTechniques/col20.html.

However, this doesn't completely solve the problem. Say these jobs take 15 seconds to complete. What happens if 1000 people come in and submit jobs during those 15 seconds? You'll have 1000 new processes forked! A queue approach avoids this by just dropping the requests onto the queue and letting the already-running job processors handle them at a fixed rate.

Modules to Use

Now that you know what queues are good for, where do you get one? The Ticketmaster code is closely tied to their backend systems, so it's not open source. There are some other options.

One that you can grab from CPAN is Jason May's Spread::Queue. This is built on top of the Spread toolkit (http://spread.org/) for reliable multicast messaging. What Spread provides is a scalable way to send messages out across a cluster of machines and make sure they are received reliably and in order. It actually provides other things too, but this is the part that Spread::Queue is using.

The system consists of three parts: a client library, a queue manager, and a worker library. The client library is called from your code when you want to add a request to the queue. That sends a request to the queue manager using Spread. You define your job processing code in a worker class.
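
To picture how a client, a queue manager, and a fixed set of workers fit together, here is a toy, single-process sketch. It is not Spread::Queue's API: submit_job, run_workers_once, and the in-memory @queue and %status are hypothetical names made up for illustration, and a real system would run the workers as separate processes and pass messages over the network. It only shows the core idea that submitting is cheap and immediate, while the amount of work done at once is capped by the number of workers.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the queue manager's state.
my @queue;                 # pending jobs, oldest first
my %status;                # job id => 'queued' or 'done'
my $next_id = 1;

# "Client" side: drop a request on the queue and return right away,
# so the web server process is free to handle the next user.
sub submit_job {
    my ($payload) = @_;
    my $id = $next_id++;
    push @queue, { id => $id, payload => $payload };
    $status{$id} = 'queued';
    return $id;
}

# "Worker" side: in one pass, at most $num_workers jobs are taken off
# the front of the queue, no matter how many are waiting.
sub run_workers_once {
    my ($num_workers) = @_;
    my @batch = splice(@queue, 0, $num_workers);
    $status{ $_->{id} } = 'done' for @batch;   # pretend the work happened
    return scalar @batch;                      # number of jobs handled
}

# 1000 requests arrive at once, but only 10 are ever in progress.
submit_job("price quote request $_") for 1 .. 1000;

my $ticks = 0;
$ticks++ while run_workers_once(10);

print "processed 1000 jobs in $ticks passes with 10 workers\n";
```

With the fork-per-request approach, those 1000 submissions would have meant 1000 processes at once. Here the backlog simply waits in the queue, and the %status hash is the kind of thing an "in progress" page could poll.
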
You can start as many worker processes as you like and they can be on any machine in the cluster. They will register themselves and begin accepting jobs.

In the client process:

    use Spread::Queue::Sender;

    my $sender = Spread::Queue::Sender->new("myqueue");
    $sender->submit("myfunc", { name => "value" });
    my $response = $sender->receive;

In the worker process:

    use Spread::Queue::Worker;

    my $worker = Spread::Queue::Worker->new("myqueue");

    # Map job names to the subroutines that handle them.
    $worker->callbacks(
        myfunc => \&myfunc,
    );

    # signal_handler is not shown here; it is assumed to be defined elsewhere.
    $SIG{INT} = \&signal_handler;
    $worker->run;

    sub myfunc {
        my ($worker, $originator, $input) = @_;
        my $result = {
            response => "I heard you!",
        };
        $worker->respond($originator, $result);
    }

The Spread::Queue system looks very attractive, but there are a few things it could use. There doesn't seem to be a way to check where a particular job is in the queue, or even to ask whether that job is done yet without blocking until it is done. Also, the queue is not stored in a durable way: it's just in the memory of the queue manager process, so if that process dies, the entire state of the queue is lost. Adding these features would make a good project for someone, and that someone may be me if I need them before someone else does.

Where to Learn More

If some of these concepts are new to you, and you want to learn more about them, the good news is that there is lots of good technical writing on these subjects. The Perl Journal, including the "best of" collection that O'Reilly has been publishing, is a good resource, and so is the "Algorithms in Perl" book.

The bad news is that some of the most interesting stuff is written for a Java audience. My advice is that if you want to learn how to do this scalable web development well, you can't be trapped in one community or one language -- you need to see what other people are doing. I like Martin Fowler's books, because he doesn't have an agenda to push and isn't trying to sell you on a particular tool or API. Similarly, the O'Reilly sites at http://oreillynet.com/, including http://onjava.com/, get some good stuff. The Java content is mostly open-source oriented so it's much less fluffy than most Java sites.

Acknowledgements

I'd like to thank Craig McLane and Adam Sussman of Ticketmaster, and Zack Steinkamp of Yahoo for being very generous with their time in answering my questions while I was working on this talk.