James O'Neill's Blog

May 14, 2018

A couple of easy boosts for PowerShell performance.

Filed under: Powershell — jamesone111 @ 10:55 am

At the recent PowerShell and DevOps Summit I met Joshua King and went to his session – Whip Your Scripts into Shape: Optimizing PowerShell for Speed – (an area where I overestimated my knowledge) and it’s made me think about some other issues. If you find this post interesting it’s a fair bet you’ll enjoy watching Joshua’s talk. There are a few things to say before looking at a performance optimization which I added to my knowledge this week.

  • Because scripts can take longer to write than to run, we need to know when it is worth optimizing for speed. After all, if we cut the time from pressing return to the reappearance of the prompt from 1/2 second to 1/4, or even to 1/1000th of a second, our reaction time is such that we don’t do the next thing any sooner. On the other hand, if something takes 5 minutes to run (which might be the same command being called many times inside a script), giving minutes back is usable time.
  • Execution time varies with input – it often goes up with the square of the number of items being processed (typically when the operation is in the form “For every item, look at [some subset of] all items”). So you might process 1,000 rows of data in half a second … but then someone takes your code and complains that their data take 5 minutes to process, because they’re working with many more rows. Knowing if you should optimize here isn’t straightforward – most of the time it doesn’t matter, but when it matters at all, it matters a lot. You can discover whether performance tails off badly at 10,000 or 1,000,000 rows, but it isn’t easy to predict how often those sizes will actually occur, and therefore whether optimizing is time well spent. If the problem happens at scale, then you might run sub-tasks in parallel (especially if each runs on a different computer), or change the way of working – for example this piece on hash tables is about avoiding the “look at every item” problem.
  • No one writes code to be slow. But the fast way might require something which is longer and/or harder to understand. If we want to write scripts which are reusable we might prefer tidy-but-slower over fast-but-incomprehensible. (All other things being equal we’d love the elegance of something tidy and fast, but a lot of us aren’t going to let the pursuit of that prevent us going home). 
    Something like $SetA | where {$_ -notIn $setB} is easy to understand, but if the sets are big enough it might need billions of comparisons. The work which gave rise to the hash tables piece cut the number from billions to under a million (and meant that we could run the script multiple times per hour instead of once or twice in a day, so we could test it properly for the first time), but it takes a lot more effort to understand how it works.
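    The shape of the hash-table idea, sketched very roughly here (this isn’t the code from that piece, and it assumes $setA and $setB already hold the data), is to build a lookup table from one set once, so each membership test becomes a single hash lookup instead of a scan of the whole of the other set:
    # build a lookup from $setB once ...
    $lookup = @{}
    foreach ($item in $setB) {$lookup[$item] = $true}
    # ... then one cheap lookup per item of $setA, instead of re-scanning all of $setB each time
    $onlyInA = foreach ($item in $setA) {
        if (-not $lookup.ContainsKey($item)) {$item}
    }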
  • One area from Joshua’s talk where the performance could be improved without adding complexity was reducing or eliminating the hit from using Pipelines; usually this doesn’t matter – in fact the convenience of being able to construct a bespoke command by piping cmdlets together was compelling before it was named “PowerShell”.  Consider these two scripts which time how long it takes to increment a counter a million times.

    # pipeline version: pipe the array to the ForEach-Object cmdlet (via its foreach alias)
    $i  = 0 ; $j = 1..1000000 ;
    $sw = [System.Diagnostics.Stopwatch]::StartNew() ;
    $j | foreach {$i++ }  ;
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds

    # statement version: the same loop written with the foreach statement
    $i  = 0 ; $j = 1..1000000 ;
    $sw = [System.Diagnostics.Stopwatch]::StartNew() ;
    foreach ($a in $j) {$i++ }  ;
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds

    The only thing which is different is the foreach – in the first it is the alias for ForEach-Object, in the second it is the foreach statement. The logic hasn’t changed, and readability is pretty much the same; you might expect them to take roughly the same time to run … but they don’t: on my machine, using the statement is about 6 times faster than piping to the cmdlet.
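
    (As an aside, if you don’t want to set up a Stopwatch, the built-in Measure-Command cmdlet will time a script block for you and is a quick way to repeat the comparison – your numbers will differ from mine, but the gap should be obvious.)

    # the same comparison timed with Measure-Command, which returns a TimeSpan
    $j = 1..1000000
    (Measure-Command { $i = 0 ; $j | foreach {$i++} }).TotalMilliseconds
    (Measure-Command { $i = 0 ; foreach ($a in $j) {$i++} }).TotalMilliseconds
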
    This is doing unrealistically simple work; replacing the two “ForEach” lines with

    $j | where {$_ % 486331 -eq 0}
    and
    $j.where(  {$_ % 486331 -eq 0} )

    does something more significant for each item, and I find the pipeline version takes 3 times as long! The performance improvement remains even if the output of the .where() goes into a pipeline. I’ve written in the past that sometimes very long pipelines can be made easier to read by breaking them up (even though I have a dislike of storing intermediate results), and it turns out we can also boost performance by doing that.
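
    Pulling those fragments together into something you can paste and run looks roughly like this (only two numbers in the range divide exactly by 486331, so the output stays short; expect the exact ratio to vary between machines and PowerShell versions):

    $j  = 1..1000000

    # pipeline version: pipe to the Where-Object cmdlet (via its where alias)
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    $j | where {$_ % 486331 -eq 0}
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds

    # method version: call the .where() method on the array
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    $j.where( {$_ % 486331 -eq 0} )
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds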

    Recently I found another change: if I define a function

    function CanDivide {
        param ($Dividend)
        # outputs $true when $Dividend divides exactly by 486331
        $Dividend % 486331 -eq 0
    }
    and repeat the previous test with the command as
    $j.where( {CanDivide $_ } )

    People will separate roughly 50:50 into those who find the new version easier to understand, and those who say “I have to look somewhere else to see what ‘can divide’ does”. But is it faster or slower, and by how much? It’s worth verifying this for yourself, but my test said the function call makes the command slower by a factor of 6 or 7. If a function is small, and/or is only called from one place, and/or is called many times to complete a piece of work, then it may be better to ‘flatten’ the script. I’m in the “I don’t want to look somewhere else” camp so my bias is towards flattening code, but – like reducing the amount of piping – it might feel wrong for other people. It can make the difference between “fast enough” and “not fast enough” without major changes to the logic.
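
    If you want to verify it yourself, a comparison along these lines will do it (with CanDivide defined as above; the absolute numbers matter less than the ratio between the two):

    $j  = 1..1000000

    # calling the function once for every item
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    $j.where( {CanDivide $_} )
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds

    # 'flattened' version: the function body moved inline
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    $j.where( {$_ % 486331 -eq 0} )
    $sw.Stop() ; $sw.Elapsed.TotalMilliseconds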
