At the recent PowerShell and DevOps Summit I met Joshua King and went to his session – Whip Your Scripts into Shape: Optimizing PowerShell for Speed – (an area where I had overestimated my knowledge), and it has made me think about some other issues. If you find this post interesting, it's a fair bet you'll enjoy watching Joshua's talk. There are a few things to say before looking at a performance optimization I added to my knowledge this week.
Something like
$SetA | where {$_ -notIn $setB}
is easy to understand, but if the sets are big enough it can need billions of comparisons. The work which gave rise to the hash tables piece cut that number from billions to under a million (and meant we could run the script multiple times per hour instead of once or twice a day, so we could test it properly for the first time), but it takes a lot more effort to understand how it works. One area from Joshua's talk where performance could be improved without adding complexity was reducing or eliminating the cost of using pipelines. Usually this doesn't matter – in fact the convenience of being able to construct a bespoke command by piping cmdlets together was compelling before the product was even named "PowerShell".
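As an aside, here is a minimal sketch of the hash-table idea mentioned above – the variable names are illustrative and this is one way to get the effect, not necessarily the exact code from that piece. A HashSet gives near-constant-time membership tests, so the set difference needs one pass over $SetA rather than one comparison for every pair of items:
# Build a HashSet for near-constant-time lookups (names here are illustrative)
$lookup = [System.Collections.Generic.HashSet[object]]::new([object[]]$setB)
# One pass over $SetA; each Contains() call avoids scanning all of $setB
$onlyInA = $SetA.where( {-not $lookup.Contains($_)} )
Consider these two scripts, which time how long it takes to increment a counter a million times.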
$i = 0 ; $j = 1..1000000 ;
$sw = [System.Diagnostics.Stopwatch]::StartNew() ;
$j | foreach {$i++ } ;
$sw.Stop() ; $sw.Elapsed.TotalMilliseconds
$i = 0 ; $j = 1..1000000 ;
$sw = [System.Diagnostics.Stopwatch]::StartNew() ;
foreach ($a in $j) {$i++ } ;
$sw.Stop() ; $sw.Elapsed.TotalMilliseconds
The only thing that differs is the foreach: is it the alias for ForEach-Object, or is it a foreach statement? The logic hasn't changed, and readability is pretty much the same; you might expect them to take roughly the same time to run … but they don't: on my machine, using the statement is about six times faster than piping to the cmdlet.
This is doing unrealistically simple work; replacing the two “ForEach” lines with
$j | where {$_ % 486331 -eq 0}
and
$j.where( {$_ % 486331 -eq 0} )
does something more significant for each item, and I find the pipeline version takes three times as long! The performance improvement remains even if the output of the .where() goes into a pipeline, as sketched below. I've written in the past that sometimes very long pipelines can be made easier to read by breaking them up (even though I dislike storing intermediate results), and it turns out we can also boost performance by doing that.
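For example – a minimal sketch, with the output text purely illustrative – the method syntax can do the heavy filtering first, so the per-item pipeline overhead applies only to the handful of matches rather than to all million values:
# Filter with the method syntax first; only the few survivors hit the pipeline
$j.where( {$_ % 486331 -eq 0} ) | ForEach-Object { "divisible: $_" }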
Recently I found another difference: I defined a function
Function CanDivide {
    Param ($Dividend)
    $Dividend % 486331 -eq 0
}
and repeated the previous test with the command as
$j.where( {CanDivide $_ } )
People will split roughly 50:50 into those who find the new version easier to understand, and those who say "I have to look somewhere else to see what 'CanDivide' does". But is it faster or slower, and by how much? It's worth verifying this for yourself, but my test said the function call makes the command slower by a factor of six or seven. If a function is small, and/or is only called from one place, and/or is called many times to complete a piece of work, then it may be better to 'flatten' the script. I'm in the "I don't want to look somewhere else" camp, so my bias is towards flattening code, but – like reducing the amount of piping – it might feel wrong to other people. It can make the difference between "fast enough" and "not fast enough" without major changes to the logic.
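If you want to check the numbers on your own machine, a sketch along the lines of the earlier stopwatch tests (assuming $j and CanDivide are defined as above) looks like this:
# Time the version which calls the function a million times
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$null = $j.where( {CanDivide $_} )
$sw.Stop() ; $sw.Elapsed.TotalMilliseconds
# Time the 'flattened' version: same logic, in-line in the script block
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$null = $j.where( {$_ % 486331 -eq 0} )
$sw.Stop() ; $sw.Elapsed.TotalMilliseconds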