James O'Neill's Blog

September 6, 2008

Lies, Damn lies, and Statistics – a lesson in Benchmarking.

Filed under: Virtualization,Windows Server,Windows Server 2008 — jamesone111 @ 6:34 pm

Many years ago – before on-line meant "the internet" – I annoyed a journalist in an on-line discussion. I criticized the methodology his magazine used to test file servers: each machine copied a large file, which made it a test of the server’s cache effectiveness. As more machines, and hence more files, were added, performance rose to a peak; then the total size of the files being copied exceeded the cache size and it plummeted. This they explained as "ethernet collisions".

I mention this because there’s always a temptation to try to rip up a benchmark someone else has done (I certainly didn’t use very diplomatic language back then). Single-task tests can give you an idea of how well a machine will carry out a similar task. But what do file server tasks look like ? Realistic tests for file servers are hard. For virtualization it is close to impossible. Take a real-world question like "I have 100 machines; when I multiply their CPU speed by their average CPU loading, they average out at 200MHz. How many servers do I need ?" Obviously it’s more than 100 × 200MHz = 20GHz, divided by (cores × clock speed) … but how much more ? You need to answer questions like "What’s the overhead of running VMs ?" and "Would 5 servers running 20 VMs each have a bigger or smaller percentage overhead than 2 servers running 50 ?" Assuming you could work this out and come up with an "available CPU" figure, it doesn’t answer questions about peaks of load, i.e. "At any point in the last month, would the instantaneous sum of CPU load totaled over a set of machines have exceeded the available CPU on the virtualization server ?" And of course we haven’t mentioned disk and network I/O questions.
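To make the arithmetic concrete, here is a minimal Python sketch of that naive lower bound, plus the "did the summed load ever exceed capacity" check. The function names and the host specification (4 cores at 2,500MHz) are made up for illustration; the result ignores virtualization overhead, disk and network I/O, which is exactly why the real answer is bigger.

```python
import math

def min_hosts_ignoring_overhead(machines, avg_load_mhz, cores_per_host, core_clock_mhz):
    """Lower bound on host count from average CPU demand alone."""
    total_demand_mhz = machines * avg_load_mhz            # 100 * 200 MHz = 20,000 MHz
    host_capacity_mhz = cores_per_host * core_clock_mhz   # raw capacity of one host
    return math.ceil(total_demand_mhz / host_capacity_mhz)

def peak_exceeds_capacity(load_samples_mhz, capacity_mhz):
    """True if the summed load at any sample time exceeds capacity.

    load_samples_mhz holds one list per machine, all sampled at the same times.
    """
    return any(sum(at_time) > capacity_mhz for at_time in zip(*load_samples_mhz))

# Hypothetical host: 4 cores at 2,500 MHz (illustrative figures only).
hosts = min_hosts_ignoring_overhead(100, 200, 4, 2500)
print(hosts)
```

Two machines that each average well under capacity can still collide: `peak_exceeds_capacity([[100, 300], [100, 800]], 1000)` is true because the second sample sums to 1,100MHz. Averages alone never show that.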

If that wasn’t enough to make people want to give up on benchmarking, virtualization is a technology for putting a lot of small loads onto a single server, whereas benchmarks tend to load a system to the maximum and measure the throughput. Running benchmarks on virtualization is a bit like putting trucks on a race track: it might be fun and you will get a winner … but it doesn’t tell us anything much about the real world. Still, that doesn’t stop people doing it. And just as with motor racing, even when the winner is declared people will argue afterwards.

Below are some numbers from Network World, who benchmarked VMware and Hyper-V. They ran the test on the two virtualization platforms, using Windows and Suse Linux as the guest OS. Using VMs with four procs and VMs with one proc, they tested 1, 3 and 6 instances. I’m only showing the Windows-as-guest results, and I’ve multiplied the score per VM by the number of VMs (they just showed the SPECjbb2005 bop/sec scores per VM, but you can get the raw numbers from pages 3 and 4 of their results; link below).

| Configuration | One Instance | Three Instances | Six Instances |
|---|---|---|---|
| OS running natively on one CPU | 18,153 | n/a | n/a |
| OS running natively across four CPUs | 32,525 | n/a | n/a |
| Uni-proc VM on Hyper-V | 17,403 | 49,089 | 87,186 |
| Uni-proc VM on VMware ESX | 17,963 | 53,205 | 83,784 |
| 4-proc VM on Hyper-V | 31,037 | 101,022 | 87,528 |
| 4-proc VM on VMware ESX | 31,155 | 81,429 | 96,816 |

Let’s do some quick analysis of these numbers.

18,153 vs 32,525. On the bare OS, 1 proc gives 18K bop/s and 4 procs give 32K: four times the processors buys less than twice the throughput. So the application is not CPU bound – presumably it is I/O bound and sensitive to disk configuration. However, the "How we did it" page doesn’t say how the disks on each system were configured: for example, did they use the defaults (slow, dynamically expanding disks in Hyper-V) or do any optimization (faster but more space-consuming fixed disks in Hyper-V) ? Given that other benchmarks show Hyper-V can get better than 90% of the disk I/Os of the native OS even under huge loading, the very low number in the separate disk benchmark they produced makes me suspect they followed the default. But the basic rule of lab work is to record what it was you measured (a fixed, expanding or passthrough disk); otherwise your results lose their meaning.
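For anyone who wants to check the scaling claim, a quick Python sketch using the two native scores from the table (the variable names are mine):

```python
native_1cpu = 18_153   # bop/s, OS running natively on one CPU
native_4cpu = 32_525   # bop/s, OS running natively across four CPUs

speedup = native_4cpu / native_1cpu   # throughput ratio for 4x the processors
efficiency = speedup / 4              # fraction of ideal linear scaling achieved

print(f"speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")
```

A CPU-bound workload would be expected to land much closer to 4x; this one gets well under half of the ideal, which is what points the finger at something other than the processors.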

101,022. Another odd thing on disks: in the test with 3 quad-proc VMs, Hyper-V’s scores show each of the VMs getting 3.5% more throughput than the bare OS running across four CPUs (101,022 / 3 = 33,674 against 32,525). Hyper-V does not cache VHD files, although it can coalesce disk operations, which can occasionally throw tests. Personally I don’t trust tests where this affects the result (which it plainly does in this case).
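The 3.5% figure falls straight out of the table. A small Python sketch, with names of my own choosing:

```python
hyperv_3x_quad_total = 101_022   # bop/s summed over three 4-proc VMs on Hyper-V
native_4cpu = 32_525             # bop/s, bare OS across four CPUs

per_vm = hyperv_3x_quad_total / 3          # score of each individual VM
gain_pct = (per_vm / native_4cpu - 1) * 100

print(f"per VM: {per_vm:,.0f} bop/s, {gain_pct:.1f}% above native")
```

Three VMs sharing one machine each beating the same machine running the workload alone is the kind of result that should make you go back and look at the test, not the hypervisor.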

87,186 vs 83,784. The system had 16 cores, yet with the uni-proc tests they stopped adding VMs at 6. With a single uni-proc VM, VMware is a little faster; with 3 such VMs it’s faster still. Get to 6 and Hyper-V is winning. Does efficiency favour Hyper-V as load increases ? Who can tell ? Without tests at 12 or more VMs there’s no way of telling if this is an aberration or a trend.

Six instances. On uni-proc, VMware and Hyper-V both get nearly 200% more throughput going from 1 to 3 VMs (196% and 182% respectively) – that’s almost linear scalability – but the gain tails off to 57% and 78% going from 3 to 6. On 4-proc, Hyper-V’s performance actually goes down at 6 instances (by about 13%), and VMware’s only goes up by roughly 20% – this makes me think the system is I/O bound.
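Those percentages can be reproduced from the aggregate scores with a few lines of Python. This is just a sketch over the numbers in the table; `pct_change` is my own helper name.

```python
# Aggregate bop/s at 1, 3 and 6 instances, copied from the table above.
scores = {
    "Uni-proc Hyper-V": (17_403, 49_089, 87_186),
    "Uni-proc VMware ESX": (17_963, 53_205, 83_784),
    "4-proc Hyper-V": (31_037, 101_022, 87_528),
    "4-proc VMware ESX": (31_155, 81_429, 96_816),
}

def pct_change(before, after):
    """Percentage change in throughput going from `before` to `after`."""
    return (after - before) / before * 100

for name, (one, three, six) in scores.items():
    print(f"{name}: 1->3 {pct_change(one, three):+.0f}%, "
          f"3->6 {pct_change(three, six):+.0f}%")
```

For the 4-proc rows the 3-to-6 step comes out at about -13% for Hyper-V and +19% for VMware, in line with the roughly 20% quoted.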

So who is the winner ?

In the uni-proc test Hyper-V’s best score was 87,186 and VMware’s was 83,784. So Hyper-V is the winner !

In the quad-proc test Hyper-V’s best score was 101,022 and VMware’s was 96,816. Hyper-V is the winner again !

Now, I’ve already suggested this is a field where there are more questions than answers. I’m left with just one: since Microsoft clearly won, why is Network World’s article called "VMware edges out Microsoft in virtualization performance test" ?

Bonus link. This is NOT Simpson’s paradox, but if you’re ever stuck for something to talk to statisticians about it’s worth knowing.

This post originally appeared on my technet blog.
