This is an advance chapter of Mastering Perl by brian d foy. It is incomplete and may contain merely notes or an outline for future work. It may also contain sections of brian's writing from other sources, although these are noted where possible.
This work is copyrighted under a contract between O'Reilly Media and brian d foy, and you cannot repost it or distribute it without permission.
Tony Hoare's famous quote goes "Premature optimization is the root of all evil", although it doesn't often come with its setup: "We should forget about small efficiencies, say about 97% of the time". That is, don't sweat the small stuff until you need to. In this chapter, I show how I can look into my Perl programs to see where the slow parts are. Before I start working to improve the performance of my program, I should check to see what the program is actually doing. Once I know where the slow parts are, I concentrate on those.
The term "benchmark" comes from surveyors. They create a physical mark in something to denote a known elevation and use that mark to determine other elevations. Those computed elevations can only be right if the original mark is right. Even if the original mark started off right, maybe it changes because it sinks into the ground, the ground moves because of an earthquake, or global warming redefines the ultimate benchmark we call "sea level". Benchmarks are comparisons, not absolutes.
For computers, a benchmark compares the performance of one system against another. They measure in many ways, including time to completion, resource use, network activity, or memory use. Several tools already exist for measuring the parts outside of Perl so I won't cover those here. I want to look inside Perl to see what I can find. I want to know if one bit of code is faster or uses less memory.
Measuring things and extracting numbers is easy, and it's often easy for us to believe the numbers that computers give us. This makes benchmarking dangerous. Unlike those surveyors, we can't stand on a hill and know if we are higher or lower than the next hill by just looking. We have to carefully consider not only the numbers that we get from benchmarks, but the method we use to generate the numbers.
Benchmarking isn't as popular as it used to be. The speed and storage of computers and the bandwidth of networks are not as limiting as they used to be so we don't feel like we have to work hard to conserve them. We also don't have to pay (as in money, literally) for CPU cycles (in most cases), so we don't care how many we actually use. At least, we don't care as much as programmers used to care. After all, you're using Perl, aren't you?
Any measurement comes with risk. If I don't understand what I'm measuring, what affects the measurement, or what the numbers actually mean, I can easily misinterpret the results. If I'm not careful about how I measure things, my numbers may be meaningless. I can let the computer do the benchmarking work for me, but I shouldn't let it do the thinking for me.
A Perl program doesn't run on its own. It depends on a perl interpreter, an operating system, and hardware. Each of those depend on other things. Even if I use the same machine, different perl interpreters, even of the same version of Perl, may give different results. I could have compiled them with different C compilers that have different levels of optimization, I could have included different features in one interpreter, and so on.
You probably don't have to imagine a situation where you develop on one platform but deploy on another. I get to visit many companies in my travels as a consultant with Stonehenge so I've been able to see a lot of different set ups. Often, teams develop on one system that only they use, then deploy the result to a busy server that has a different version of Perl, a different version of the operating system, and a completely different load profile. What was really speedy on a lightly used machine becomes unbearably slow when people start to use it. A good example of this is CGI programs versus mod_perl.
Any benchmark only applies to its situation. Extrapolating my results might not get me in trouble, but they aren't really valid either. The only way for me to really know what will happen in a particular situation is to test that situation. Along with my numbers, I have to report the details. It's not enough to say, for instance, that I'm writing this on a Powerbook G4 running Mac OS 10.4.4. I have to tell you the details of my perl interpreter, how I compiled it (that's just perl -V), and how I've tuned my operating system.
I can't measure something without interacting with it, and that interaction changes the situation. If I want to watch the
details of Perl's memory management, for instance, I can compile Perl with -DDEBUGGING_MSTATS, but
then it's not the same Perl interpreter. Although I can now watch the memory, I've probably slowed the entire program down (and I
verify that at the end of this chapter when I show perlbench). If I add code to time the program,
I have to execute that code, which means my program takes longer. In any case, I might have to use additional modules, which means
that Perl has the find, load, and compile more code.
To measure the time it takes my program to run, I could just look at the clock on the wall when I start the program, and look again when the program finishes. That's the simplest, and the most naîve too. This method might work in extreme circumstances. If I can reduce the run time of my program from an entire work-day to a couple of minutes, then I don't care that the wallclock method might be a bit inaccurate.
I don't have to really look at my watch though. I can do this in directly in my program if I like.
#!/usr/bin/perl
my $start = time;
#... the meat of my program
my $end = time;
print "The total time is ", $end - $start;
For a short running program, this method only tests a portion of the runtime. What about all that time Perl spent compiling the code? If I used a lot of modules, a significant part of the time the whole process takes might be in the parts before Perl even starts running the program. Jean-Louis Leroy wrote in an article for Perl.com[1] about slow start-up times in a Perl FTP program because Perl had to look through 23 different directories to find everything Net::FTP needed to load. The run time portion is still pretty speedy, but the start up time was relatively long. Remember that Perl has to compile the program every time I run it (forgetting about things like mod_perl for the moment). It doesn't have the nice bytecode-saving feature of Python or Ruby.
If I want to time the whole process, compile-time and run-time, I can create a wrapper around the program to do the wallclock timing. I could take this number and compare it to the run-time numbers to estimate the compilation times.
#!/usr/bin/perl
my $start = time;
system( "@ARGV" );
my $end = time;
printf "The whole time was %d seconds", $end - $start;
The wallclock method breaks down, though, because the operating systems can switch between tasks, or even run different tasks at the same time. I can't tell how much time the computer worked on my program by only looking at my watch. The situation is even worse when my program has to wait for resources that might be busy or for network latency. I can't really blame my program for those.
The time program (not the Perl built-in) that comes with most unix-like systems solves this by
reporting only the time that the operating system thinks about my program. Your particular shell may even have a built-in command
for it[2].
From the command line, I tell the time command what it should measure. It runs the command and
reports its results. It breaks down the run time down by the real time, the user time, and the system time. The real time is the
wallclock time. The user time is the time the computer used to execute my instructions, and the system time is the time the
computer has to work to do what I tell it.
When I time the sleep program (not the Perl built-in), the real time is the time I told it to
sleep, but since that's all that program does, the user and system times are miniscule. The output for your particular version of
time may be different.
$ time sleep 5
real 0m5.094s
user 0m0.002s
sys 0m0.011s
Behind-the-scenes, the time program just uses times function
from the standard C library, and that carries along accounting information (although we're fortunate that we don't have to pay for
clock cycles anymore). The times Perl built-in does the same thing. In list context, it returns
four times: the total user and system time, and the user and system time for the children of the process. I take the end times and
subtract the starting times to get the real times.
#!/usr/bin/perl
my @start = times;
#... the meat of my program
my @end = times;
my @diffs = map { $end[$_] - $start[$_] } 0 .. $#end;
print "The total time is @diffs";
I don't have to do those calculations myself, though, because the Benchmark module, which comes with Perl, already does it for me. Again, this approach only measures the run time.
#!/usr/bin/perl
my $start = Benchmark->new;
#... the meat of my program
my $end = Benchmark->new;
my $diff = timediff( $t1, $t0 );
print "My program took: " . timestr( $diff ) . "\n";
( $real, $child_user, $child_system ) = @$diff[0,3,4];
# I'm pretty sure this is POSIX format
printf STDERR "\nreal\t%.3f\nuser\t%.3f\nsys\t%.3f\n",
$real, $child_user, $child_system;
The output looks like the times output I showed previously, but this time it comes completely
from within my Perl program and just for the parts of the program inside of the calls to Benchmark-new>. Instead of timing the entire program, I can focus on the part I want to examine.
This is almost the same thing David Kulp did this to create the Perl Power Tools version of time. Take a benchmark, run the command of interest using system (so
those those are the children times), then take another benchmark once system returns. Since this
version of time is pure Perl, it runs anywhere that Perl runs.
#!/usr/bin/perl
use Benchmark;
$t0 = Benchmark->new;
$rc = system( @ARGV );
$t1 = Benchmark->new;
$diffs = timediff( $t1, $t0 );
printf STDERR "\nreal %.2f\nuser %.2f\nsys %.2f\n", @$diffs[0,3,4];
$rc &= 0xffff;
if ($rc == 0xff00) { exit 127; } else { exit ($rc >> 8); }
Benchmarks by themselves aren't very useful. I file them under the heading of "decision support". I might be able to decide that I need to change a program to improve a number, but the number itself doesn't tell me what to. Sure, I know how long it takes to run my program but it doesn't tell me if I can make it any faster (see more about that in Chapter RRR, "Title"). I need to compare one implementation to another.
I could compare entire programs to each other, but that's not that useful. If I'm trying to speed up a program, for instance, I'm going to change the parts that I think are slow. Most of the other parts will be the same, and the time to run all of those same parts end up in the total time. I really just want to compare the bits that are different. The times for the rest of the code skews the results, so I need to isolate the parts that I want to compare.
If I extract the different bits, I can create small programs with just those. Most of the time the program takes to run then applies only to the interesting bits. I'll talk more about that later, but as I go through this next section, remember that anything I do has some overhead and every measurement changes the situation a bit, so I should think about the numbers before I accept them. For now, I'll go back to the Benchmark module.
If I want to compare two small bits of code instead of entire programs, I can use some of the functions from Benchmark.pm. I can compare either by running a certain number of iterations and comparing the total
time, or the inverse of that, a total time and comparing the total number of iterations.
In the timethese function from Benchmark, I give it a number
of iterations as the first argument. The second argument is an anonymous hash where the keys are labels I give the snippets and the
hash values represent the code I want to compare, in this case as string values that Perl will eval. In this sample script, I want to compare the speed of opendir and
glob for getting a list of files.
#!/usr/bin/perl
use Benchmark;
my $iterations = 10_000;
timethese( $iterations, {
'Opendir' => 'opendir my( $dh ), "."; my @f = readdir( $dh )',
'Glob' => 'my @f = glob("*")',
}
);
The timethese function prints a nice report that shows me the three times I discussed
earlier.
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 6 wallclock secs ( 2.12 usr + 3.47 sys = 5.59 CPU) @ 1788.91/s (n=10000)
Opendir: 3 wallclock secs ( 0.85 usr + 1.70 sys = 2.55 CPU) @ 3921.57/s (n=10000)
These aren't "The Numbers" though. People try to get away with running the measurement once. Try it again. Then again. The results vary a little bit every time I run it; certainly some of this is round-off error.
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 6 wallclock secs ( 2.10 usr + 3.47 sys = 5.57 CPU) @ 1795.33/s (n=10000)
Opendir: 3 wallclock secs ( 0.86 usr + 1.70 sys = 2.56 CPU) @ 3906.25/s (n=10000)
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 7 wallclock secs ( 2.11 usr + 3.51 sys = 5.62 CPU) @ 1779.36/s (n=10000)
Opendir: 3 wallclock secs ( 0.87 usr + 1.71 sys = 2.58 CPU) @ 3875.97/s (n=10000)
$ perl dir-benchmark.pl
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 7 wallclock secs ( 2.11 usr + 3.47 sys = 5.58 CPU) @ 1792.11/s (n=10000)
Opendir: 3 wallclock secs ( 0.85 usr + 1.69 sys = 2.54 CPU) @ 3937.01/s (n=10000)
Benchmarking can be deceptive if I let the computer doing the thinking for me. The Benchmark module can spit out numbers all day long, but if I don't think about what I'm doing and what those numbers actually mean, they aren't useful. They may even lead me to believe something that isn't true.
Part of Stonehenge's Intermediate Perl course covers the Schwartzian Transform, which uses a cached sort-key to avoid duplicating work during a sort. The Schwartzian Transform should be faster, especially for more elements and more complicated sort-key computations.
In one of the course exercises, to prove to our students that the transform actually boosts performance, we ask them to sort a
bunch of filenames in order of their modification date. Looking up the modification time is an expensive operation, especially when
you have to do in N*log(N) times.
The answer we used to give in the course materials was not the best answer, though. It is short so it fits on one slide, but it makes things seem worse than they really are. The Schwartzian Transform comes out ahead, as it should, but I always thought it should be faster.
Our example used Benchmark's timethese to compare two methods to sort filenames by their modification age. The "Ordinary" sort
computes the file modification age, -M $a, every time it needs to make a comparison. The
"Schwartzian" method uses the Schwartzian Transform to compute the modification age once per file, and store it with the file.
use Benchmark qw{ timethese };
timethese( -2, {
Ordinary =>
q{ my @results = sort { -M $a <=> -M $b } glob "/bin/*"; },
Schwartzian =>
q{ map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, -M], glob "/bin/*"; },
});
This code has a number of problems. If I am going to compare two things, they need to be as alike as I can make them. Notice
that in the "Ordinary" case I assign to @results and in the "Schwartzian" case I use map() in a void context. They do different things: one sorts and stores, and one just sorts. To compare
them, they need to produce the same thing. In this case, they both need to store their result.
Also, I need to isolate the parts that are different and abstract the parts that are the same. In each code string I do a glob(), which I already know is an expensive operation. The glob()
taints the results because it adds to the time for the two sorts of, um, sorts.
During one class, while the students were doing their lab exercises, I did my own homework by rewriting our benchmark. I broke up the task in bits, and timed the different bits to see how they impact the overall task. I identified three major parts to benchmark: creating a list of files, sorting the files, and assigning the sorted list. I'm going to time each of those individually, and I am also going to time the bigger task. This seems like such a simple task, comparing two bits of code, but I can mess up in several ways if I'm not careful.
I also want to see how much the numbers improve from the example we have in the slides, so I use the original code strings too. I try a bunch of different snippets to see how each part of the task contribute to the final numbers. How much of it comes from the list assignment, or from the filename generation through glob? I build up a bunch of code strings from the various common parts.
First, I create some package variables. Benchmark
turns my code strings into subroutines, and I want those subroutines to find these variables. They have to be global (package)
variables. Although I know Benchmark puts these subroutines in the main:: package, I use L::*, which is short for Local.
The $L::glob variable is just the pattern I want glob to use, and I get that from @ARGV so I can run this program over different directories to see how the times changes with different
numbers of files. I specify it once and use it everywhere I use glob. That way, every code string gets the same list of files. I
expect the Schwartzian Transform to get better and better as the number of files increases.
I also want to run some code strings that don't use a glob, so I pre-glob the directory and store the list in @L::files. I think the glob is going to significantly affect the times, so I want to see the results
with and without it.
The $code anonymous hash has the code strings. I want to test the pieces as well as the whole
thing, so I start off with control strings to assign the list of files to a variable and to run a glob. Benchmark also runs an empty subroutine behind the scenes so it can adjust its
time for that overhead too. I expect the "assign" times to be insignificant and the glob times to be a big deal. At the outset, I
suspect the glob may be as much as a third of the time of the benchmarks, but that's just a guess.
The next set of code strings measure the sort. The "sort_names" string tries it in void context, and the "sort_names_assign" does the same thing but assigns its result to an array. I expect a measurable difference, and the difference to be the same as the time for the "assign" string.
Then I try the original code strings from our exercise example, and call that "ordinary_orig". That one uses a glob(), which I think inflates the time significantly. The "ordinary_mod" string uses the list of files
in @L::files, which is the same thing as the glob() without the
glob(). I expect these two to differ by the time of the "glob" code string.
The last set of strings compare three things. The "schwartz_orig" string is the one we started with. In "schwartz_orig_assign",
I fix that to assign to an array, just like I did with the other original code string. If we want to compare them, they have to do
the same thing. The final code string, "schwartz_mod", gets rid of the glob().
#!/usr/bin/perl
use strict;
use Benchmark;
$L::glob = $ARGV[0];
@L::files = glob $L::glob;
print "Testing with " . @L::files . " files\n";
my $transform = q|map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, -M ]|;
my $sort = q|sort { -M $a <=> -M $b }|;
my $code = {
assign => q| my @r = @L::files |,
'glob' => q| my @files = glob $L::glob |,
sort_names => q| sort { $a cmp $b } @L::files |,
sort_names_assign => q| my @r = sort { $a cmp $b } @L::files |,
sort_times_assign => q| my @r = $sort @L::files |,
ordinary_orig => qq| my \@r = $sort glob \$L::glob |,
ordinary_mod => qq| my \@r = $sort \@L::files |,
schwartz_orig => qq| $transform, glob \$L::glob |,
schwartz_orig_assign => qq| my \@r = $transform, glob \$L::glob |,
schwartz_mod => qq| my \@r = $transform, \@L::files |,
};
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
print "Timing for 2 CPU seconds...\n"; timethese( -2, $code );
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
my $iterations = 1_000;
print "\n", "-" x 73, "\n\n";
print "Timing for $iterations iterations\n";
timethese( $iterations, $code );
The Benchmark module provides the report, which I re-formatted to make it a bit easier to read (so some of the output is missing and some lines are shorter). The results are not surprising, although I like to show the students that they didn't waste an hour listening to me talk about how wonderful the transform is.
$ perl benchmark
Testing with 380 files Timing for 2 CPU seconds...
Benchmark: running assign, glob, ordinary_mod, ordinary_orig,
schwartz_mod, schwartz_orig, schwartz_orig_assign, sort_names,
sort_names_assign for at least 2 + CPU seconds...
assign: (2.03 usr + 0.00 sys = 2.03 CPU) (n= 6063)
glob: (0.81 usr + 1.27 sys = 2.08 CPU) (n= 372)
ordinary_mod: (0.46 usr + 1.70 sys = 2.16 CPU) (n= 80)
ordinary_orig: (0.51 usr + 1.64 sys = 2.15 CPU) (n= 66)
schwartz_mod: (1.54 usr + 0.51 sys = 2.05 CPU) (n= 271)
schwartz_orig: (1.06 usr + 1.03 sys = 2.09 CPU) (n= 174)
schwartz_orig_assign: (1.20 usr + 0.87 sys = 2.07 CPU) (n= 156)
sort_names: (2.09 usr + 0.01 sys = 2.10 CPU) (n=3595626)
sort_names_assign: (2.16 usr + 0.00 sys = 2.16 CPU) (n= 5698)
-------------------------------------------------------------------------
Timing for 1000 iterations Benchmark: timing 1000 iterations of assign,
glob, ordinary_mod, ordinary_orig, schwartz_mod, schwartz_orig,
schwartz_orig_assign, sort_names, sort_names_assign ...
assign: 1 secs ( 0.33 usr + 0.00 sys = 0.33 CPU)
glob: 6 secs ( 2.31 usr + 3.30 sys = 5.61 CPU)
ordinary_mod: 28 secs ( 5.57 usr + 21.49 sys = 27.06 CPU)
ordinary_orig: 34 secs ( 7.86 usr + 24.74 sys = 32.60 CPU)
schwartz_mod: 8 secs ( 5.12 usr + 2.47 sys = 7.59 CPU)
schwartz_orig: 12 secs ( 6.63 usr + 5.52 sys = 12.15 CPU)
schwartz_orig_assign: 14 secs ( 7.76 usr + 5.41 sys = 13.17 CPU)
sort_names: 0 secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
sort_names_assign: 0 secs ( 0.39 usr + 0.00 sys = 0.39 CPU)
The "sort_names" result stands out. It ran almost 2 million times a second. It also doesn't do anything since it is in a void
context. It runs really fast, and it runs just as fast no matter what I put in the sort() block. A
sort() in void context will always be the fastest. The difference between the sort() and the map() in void context is not as pronounced in
"schwartz_orig" and "schwartz_orig_assign" because it's only the last map that is in void context. Both still have the rightmost
map() and the sort() to compute before it can optimize for void
context. There is an approximately 10% difference in the number of extra iterations the "schwartz_orig" can go through, so the
missing assignment gave it an apparent but unwarranted boost in our original example.
I like to look at the second set of results for the comparisons, and use the wallclock seconds even though they are not as exact as the CPU seconds. The "glob" code string took about six seconds, and the "schwartz_orig_assign" code string took 14 seconds. If I subtract those extra six seconds from the 14, I get the wallclock time for "schwartz_mod", just like I expected. That's over a third of the time! The "ordinary_*" times drop six seconds too, but from 34 to 28 seconds, so the percent difference is not as alarming.
I try the same benchmark with more and more files. For the rest of the comparisons, I'll use the actual CPU time since the round-off error is a lot higher now.
Notice that the glob() still has a significant affect on the times, and that the original transform that was in the void context
is still shaving off about 10% off the real time. The quotient between the transform and the ordinary sort() is 73 / 20 = 3.6, which is a little bit higher than before. Now log( N ) = log( 873 ) = 6.8, so
although the transform still outperforms the ordinary sort(), it hasn't gotten that much better. The sort() performance can vary
based on its input, so this comparison to log( N ) doesn't really mean much. It isn't an order of magnitude different (well, at
least in powers of 10 it isn't), so that is something, I guess.
Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod, + schwartz_orig_assign...
glob: 14 secs ( 6.28 usr + 8.00 sys = 14.28 CPU)
ordinary_mod: 73 secs (14.25 usr + 57.05 sys = 71.30 CPU)
ordinary_orig: 93 secs (20.83 usr + 66.14 sys = 86.97 CPU)
schwartz_mod: 20 secs (14.06 usr + 5.52 sys = 19.58 CPU)
schwartz_orig: 32 secs (17.38 usr + 13.59 sys = 30.97 CPU)
schwartz_orig_assign: 34 secs (19.95 usr + 13.60 sys = 33.55 CPU)
Idle CPUs are wasted CPUs, but I think I'd rather have an idle CPU instead of one doing this benchmark. My disk was spinning quite a bit as I ran this benchmark. The quotient could be as bad a log(N) = log(3162) = 8.0, but with the real numbers, I got 603 / 136 = 4.4.
How is the transform scaling? The quotient with 873 files was 19.6. So, does 3612 / 873 come close to 136.2 / 19.6? For a
four-fold increase in files, the map took about 7 times longer. How about the ordinary sort(), with 603.8 / 71.3? It took 8.4 times
as long. Don't be fooled into thinking that the transform and the sort() are close: the sort() took 8.4 times as long as an already long time. It's paying compound interest!
Look at the huge penalty from the glob()! Now the glob() takes almost as much time as the transform. If we stuck with the original solution, students might think that the transform wasn't so hot.
Benchmark: timing 1000 iterations of glob, ordinary_mod, schwartz_mod, + schwartz_orig_assign...
glob: 148 secs ( 31.26 usr + 102.59 sys = 133.85 CPU)
ordinary_mod: 675 secs ( 86.64 usr + 517.19 sys = 603.83 CPU)
ordinary_orig: 825 secs (116.55 usr + 617.62 sys = 734.17 CPU)
schwartz_mod: 151 secs ( 68.88 usr + 67.32 sys = 136.20 CPU)
schwartz_orig: 297 secs ( 89.33 usr + 174.51 sys = 263.84 CPU)
schwartz_orig_assign: 294 secs ( 96.68 usr + 168.76 sys = 265.44 CPU)
When a programmer talks about benchmarking, she's probably talking about speed. After all, that's what the Benchmark Perl module measures, and what most articles on the subject discuss. Time is an easy thing to
measure, so it's understandable, though not necessarily right, that people measure what they can. Sometimes time is not the major
constraint, but something else, such as memory use, causes the problem.
The perldebguts documentation says:
There is a saying that to estimate memory usage of Perl,
assume a reasonable algorithm for memory allocation, mul-
tiply that estimate by 10, and while you still may miss
the mark, at least you won't be quite so astonished.
Perl trades memory for processing speed. Instead of doing a lot of computation, Perl does a lot of look up. Higher level languages handle memory management so the developer can think more about the task at hand than about getting more memory, releasing memory, or creating memory management bugs.[3]
This ease of use comes at an expense though. Since I don't control the memory, and Perl doesn't know what I plan to do ahead of time, Perl has to guess. When Perl needs more memory, it grabs a big chunk of it. If I need memory now, I'll probably need more later too, so Perl gets more than I need immediately. If I watch the memory use of my program carefully, I'll see it jump in big amounts, stay that way for a bit, then jump again. Perl doesn't give memory back to the operating system, either. It needed the memory before, so it might need it again. It tries to reuse the space it doesn't need anymore, but it's going to keep what it's got.
Also, Perl is built on top of C, but it doesn't have C's data types. Perl has scalars, array, hashes, and a couple of others. Perl doesn't expose the actual storage to me, so I don't have to think about it. Not only that, but Perl has to deal with context. Are those data strings, or numbers, or both? Where, in memory, are all of the elements of the array? What other variables does this thing reference?
That's the long way to say that a number in Perl is more than a number. It's really a movie star with a big entourage. It may be a 32-bit integer, but it's really 20 bytes. The Devel::Peek module lets me see what's going on by letting me inspect the variable's data structure to see how Perl stores it.
#!/usr/bin/perl
use Devel::Peek;
my $a;
print_message( "Before I do anything" );
Dump( $a );
print_message( "After I assign a string" );
$a = '123456789';
Dump( $a ),
print_message( "After I use it as a number" );
$b = $a + 1;
Dump( $a );
sub print_message
{
print STDERR "\n", "-" x 50,
"\n$_[0]\n", "-" x 50, "\n"
}
The output shows me what Perl is tracking at each point in the program. When I first create the variable, it doesn't have a
value. I can see the Perl created the scalar (in internals parlance, the SV, or "scalar value"), it has a reference count of 1, and
that it has some flags set. The SV doesn't have anything in it (that's the NULL(0x0)), but it has
an address, 0x1808248, because the scalar infrastructure is set up and ready to go when I'm ready
to give it a value.
When I assign a string to $a, it has more flags set and now has a PV, a "pointer value", which
really means it's just a string (or char * for you C people). Now the scalar value points to the
string data.
Once I use the scalar as a number, Perl has to convert the string to a number. Once it does that, it stores the number value
too, turning my scalar into an PVIV, meaning that it has a pointer value and an integer value.
Perl sets more flags to indicate that it's done the conversion and it has both values.
--------------------------------------------------
Before I do anything
--------------------------------------------------
SV = NULL(0x0) at 0x1808248
REFCNT = 1
FLAGS = (PADBUSY,PADMY)
--------------------------------------------------
After I assign a string
--------------------------------------------------
SV = PV(0x1800908) at 0x1808248
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x301c10 "123456789"\0
CUR = 9
LEN = 10
--------------------------------------------------
After I use it as a number
--------------------------------------------------
SV = PVIV(0x1800c20) at 0x1808248
REFCNT = 1
FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK)
IV = 123456789
PV = 0x301c10 "123456789"\0
CUR = 9
LEN = 10
Just from that I can see that Perl is doing a lot of work. Each Perl variable has some overhead even if it doesn't have a defined value.
The Devel::Size module can tell me how much my memory my variable takes up. I have to remember, though, that the actual memory is probably a little bit more since Perl has to align the low-level values at the appropriate byte boundaries. It can't just store a value starting anywhere it likes.
#!/usr/bin/perl
use Devel::Size qw(size);
my $a;
print_message( "Before I do anything" );
print "Size is ", size( \$a );
print_message( "After I assign a string" );
$a = '1';
print "Size is ", size( \$a );
print_message( "After I use it as a number" );
$b = $a + 1;
print "Size is ", size( \$a );
sub print_message { print "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n" }
I see that even before I do anything, my scalar $a takes up 12 bytes, at least. When I assign
it a string, the size of the scalar is larger, and by more than just the number of characters in the string. Perl at least tacks on
a null byte to terminate the string, and might have some extra space allocated in case the string gets bigger. When I use the
string as a number, Perl stores the numeric version too, so the scalar gets even larger. Every one of these things can use a bit
more memory than you expect.
--------------------------------------------------
Before I do anything
--------------------------------------------------
Size is 12
--------------------------------------------------
After I assign a string
--------------------------------------------------
Size is 26
--------------------------------------------------
After I use it as a number
--------------------------------------------------
Size is 31
What about references, which are also scalars? They only need to know where to find the value, and they don't store values themselves. They stay the same size even when the values change. The size of a reference doesn't change. I have to be careful with Devel::Size, though. If I give it a reference, it finds the size of the thing at which the reference points. That's a good thing, as I'll show when I try it with arrays or hashes. However, if I have a reference pointing at a reference, the size of that second reference is the size of the thing at which it points, which is just a reference.
#!/usr/bin/perl
use LWP::Simple;
use Devel::Size qw(size);
# watch out! This is 50+ MB big!
my $data = get( "http://www.cpan.org/src/stable.tar.gz" );
print "The size of the data is " , size( $data ), "\n";
my $ref = \$data;
print "The size of the reference is " , size( $ref ), "\n";
my $ref2 = \$ref;
print "The size of the second reference is " , size( $ref2 ), "\n";
The output shows that the second reference is just 16 bytes. It doesn't include all of the data stored in the ultimate scalar. I'll show in a moment why I need to know that, but I have to look at Perl's containers first.
The size of the data is 12829217
The size of the reference is 12829217
The size of the second reference is 16
The situation for Perl's containers is different. Arrays are collections of scalars, and hashes have scalar keys and scalar
values. Those scalars can be the normal variety that hold values or they can be references. The size module from Devel::Size tells us the size of the data structure. Remember, references may
point to big values, but they don't take up that much space themselves.
#!/usr/bin/perl
use Devel::Size qw(size);
my @array = ( 1 ) x 500;
print "The size of the array is ", size( \@array ), "\n";
I can see how much space the array takes up. The Devel::Size documentation is careful to note that this doesn't count the size of the things in the array, just the size of the array. Notice that the size of my 500 element array is much larger than 500 times the 16 bytes my individual scalars used.
The size of the array is 2052
That number isn't counting the contents though. The array takes up the same size no matter what the scalars hold.
#!/usr/bin/perl
use Devel::Size qw(size);
my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";
my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
I created a scalar with 500 characters, and the entire scalar including the overhead takes up 525 bytes. The array takes up the same space.
The size of the scalar is 525
The size of the array is 2052
Devel::Size has a fix for this. To get around this, I need to look at each of the scalars in the container and find their sizes. The reference values may point to other containers, which may have more references. My array might look like it's really small until I try to make a deep copy, store it, or anything else where I have to get all the data in one place, reference or not.
#!/usr/bin/perl
use Devel::Size qw(size total_size);
my $data = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";
print "The total size of the scalar is ", total_size( $data ), "\n";
print "\n";
my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";
print "The total size of the array is ", total_size( \@array ), "\n";
Using total_size, the scalar size stays the same, and the array size now includes all the
scalar sizes. The number, 264552, is 500 times 525, the aggregate size of the scalars added to 2052, the array size.
The size of the scalar is 525
The total size of the scalar is 525
The size of the array is 2052
The total size of the array is 264552
I have to remember what this number actually is, though. It's just the aggregate size of all the data to which the array eventually points. If I did this for all of my data structures, I do not get the program memory size because those structures might contain references to the same data.
The same code can perform differently on different perl binaries, which might differ in their
compilation options, compilers used, features included, and so on. For instance, threaded versions of Perl are slightly slower, as
are shared library versions. It's not necessarily bad to be slower if you want the other benefits, but you don't always get to
choose beforehand. For instance, the stock Perl on some linux distributions is compiled for threads. If you think you might get a
speed-up with a non-threaded interpreter, find out before you reconfigure your system!
To compare different perl interpreters, Gisle Aas wrote perlbench. Give it the paths of the
interpreters you want to test, and it runs several tests on them, giving a report. The perlbench distribution comes with the perlbench-run script that, given the locations of the perl interpreters I want to test, runs a series of
benchmarks on each of them. The script normalizes the numbers to the time for the first interpreter I specify.
perlbench-run /usr/local/bin/perl5*
The output first shows the details for each interpreter I'm testing, and assigns them letters that correspond in a column in the
table that it's about to output. Especially interesting are the ccflags information. In this run,
I'm using a perl-5.8.7 I compiled with -DDEBUGGING_MSTATS, for
instance. Also interesting in the compiler information. It looks like I've got a pretty old version of gcc. That might be a good or bad thing. Different versions, or even different compilers, do better or
worse jobs optimizing the code.
A) perl-5.6.1
version = 5.006001
path = /usr/local/bin/perl5.6.1
ccflags = -fno-strict-aliasing -I/usr/local/include
gccversion = 2.95.2 19991024 (release)
optimize = -O
usemymalloc = n
B) perl-5.8.0
version = 5.008
path = /usr/local/bin/perl5.8.0
ccflags = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -I/usr/local/include
gccversion = 2.95.2 19991024 (release)
optimize = -O
usemymalloc = n
C) perl-5.8.7
version = 5.008007
path = /usr/local/bin/perl5.8.7
ccflags = -DDEBUGGING_MSTATS
gccversion = 2.95.4 20020320 [FreeBSD]
optimize = -g
usemymalloc = y
D) perl-5.8.8
version = 5.008008
path = /usr/local/bin/perl5.8.8
ccflags = -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -I/usr/local/include -DDEBUGGING
gccversion = 2.95.4 20020320 [FreeBSD]
optimize = -O
usemymalloc = n
After perlbench-run reports the details of the interpreter, it runs a series of Perl scripts
with each of the interpreters. It measures the time to execute, much like Benchmark's timethese. Once it tries the script
with all of the intrepreters, it normalizes the results so that the first interpreter (that's the one labeled with "A") is 100.
Lower numbers in the other column mean that interpreter is slower for that test. Higher numbers (they can be above 100) mean that
interpreter is faster for that test.
I've cut out some of the output from these results, but this chart gives you the flavor of the comparisons. Interpreter C, the
one compiled with -DDEBUGGING_MSTATS is consistently slower than all of the other interpreters.
For the other tests, sometimes Perl 5.6.1 is faster and sometimes Perl 5.8.8 is faster. That's not a general observation since it
only applies to the ones I've compiled. Overall, it looks as if my Perl 5.6.1 is the fastest. That still doesn't mean I should
choose it, though, because for a slight penalty I get all the nice feature of Perl 5.8.
A B C D
--- --- --- ---
arith/mixed 100 85 73 79
arith/trig 100 87 82 81
array/copy 100 99 81 92
array/foreach 100 93 87 99
array/shift 100 100 94 91
array/sort-num 100 89 92 151
array/sort 100 95 80 94
call/0arg 100 107 79 91
call/1arg 100 92 69 78
call/wantarray 100 95 76 80
hash/copy 100 130 94 124
hash/each 100 119 90 110
hash/foreach-sort 100 103 78 102
loop/for-c 100 102 88 104
loop/for-range-const 100 101 94 106
loop/for-range 100 100 94 104
re/const 100 92 81 88
string/base64 100 86 67 72
string/htmlparser 100 91 75 74
string/tr 100 105 51 111
AVERAGE 100 97 80 91
Results saved in file:///home/brian/perlbench-0.93/benchres-002/index.html
If I have something special to test, I can add my own tests. Most of the infrastructure is already in place. The README from the perlbench distribution gives the basic format of a test
file. I create my test and put it in perlbench's benchmark directory. The distribution gives an example file:
# Name: My name goes here
# Require: 4
require 'benchlib.pl';
# YOUR SETUP CODE HERE
$a = 0;
&runtest(100, <<'ENDTEST');
# YOUR TESTING CODE HERE
ENDTEST
Benchmarking is a tricky subject. It involves a lot of numbers and requires a good understanding of what's actually going on. Not only do I have to look at my Perl program, but I should consider other factors, such as my choice in operating system, the Perl interpreter I'm using and how I compiled it, and anything else that interacts with my program. It's not all about speed either. I might want to compare the memory use of two approaches, or see which one takes up less bandwidth. Different situations have different constraints. No matter what I'm doing, I need to do my best to find out what's really going on before I make any conclusions about how to make it better.
The Benchmark module provides all of the details of its use. The module comes with Perl so you should already have it.
In "A Timely Start", Jean-Louis Leroy finds that his Perl program was slow because of the time it took to find the modules it needed to load: http://www.perl.com/lpt/a/2005/12/21/a_timely_start.html.
In "When Perl Isn't Quite Fast Enough", Perl developer Nick Clark talks about why programs, in general are slow, and which Perl design decisions can make Perl slow. The best part of his talk, which he originally gave at YAPC::EU 2002, is his solutions to speed up his programs. I heard his talk on PerlWhirl 2005, and he related most of his discussion to his work to make Perl's unicode handling faster. If you get a chance to see his talk, take it! I think you'll be entertained as well as educated.
I originally wrote the parts about benchmarking the Schwartzian Transform for Perlmonks in a node titled "Wasting time thinking about wasted time". I nicked it almost verbatim from my original post: http://www.perlmonks.org/index.pl?node=393128. I still use that post in Stonehenge's Perl classes to show that even "experts" can mess up benchmarks.
The second Perl article I ever wrote was "Benchmarking Perl" for The Perl Journal #11,in which I show some of the other functions in Benchmark. http://www.pair.com/comdog/Articles/benchmark.1_4.txt
The perlbench distribution isn't indexed in CPAN when I wrote this, but you can still find it
through CPAN Search, http://search.cpan.org. Check the README file for its documentation.