James O'Neill's Blog

April 9, 2011

Pattern recognition–the human and PowerShell kinds

Filed under: Uncategorized — jamesone111 @ 8:40 pm

Recently BBC’s Top Gear has been promoting the idea that a particular type of obnoxious driver has been replacing the BMWs they traditionally bought with Audis. Chatting to a friend who is a long-term Audi customer, and whose household features ‘his’ and ‘hers’ Audis, we came to the conclusion that once you think there is a pattern, you recognise it and your awareness increases – even if in reality it is no more prevalent. I think the same thing happens in IT in general and scripting in particular. It happened to me recently, when my understanding of regular expressions in PowerShell took a big step forward, and now I’m finding all manner of places where it helps.

I use a handful of basic regular expressions for things like removing a trailing \ character from the end of a string with something like:
$Path = $Path -replace "\\$" , ""
Many people use -replace to swap text without realising it handles regular expressions. In this case “\” is the escape character in regular expressions, so to match “\” itself it has to be escaped as “\\”. The $ character means “end of line”, so this fragment just says ‘Replace “\” at the end of $Path, if you find one, with nothing, and store the result back in $Path’. PowerShell’s -split operator also uses regular expressions. This can be a trap: if you try to split using “.” it means “any character” and you get a result you didn’t expect:
"This.that" -split "." returns 10 empty strings (every character counts as a delimiter, and the -split operator discards delimiters); to match a “.” it must be escaped as “\.”. But it’s also a benefit: if you want to split sentences apart you can make “.” and any spaces round it the delimiter, which saves the need to trim afterwards. The -match operator uses regular expressions too. I worry when I see it used in examples for new users, who may use something which parses unexpectedly as a regular expression.
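The points above can be tried in one go. This is a minimal sketch with made-up sample strings, not code from the original post; single quotes are used so PowerShell itself leaves the `$` alone, and `[regex]::Escape()` is shown as one way to stop -match treating “.” as a wildcard.

```powershell
# Strip one trailing backslash, if present.
$path    = 'C:\Users\James\'                 # example value (assumption)
$trimmed = $path -replace '\\$', ''

# An unescaped "." matches every character, so only empty strings come back.
$unescaped = 'This.that' -split '.'

# Escaped, the dot is a literal delimiter.
$escaped = 'This.that' -split '\.'

# The dot plus any whitespace around it is the delimiter: no trimming needed.
$sentences = 'One thing. Another thing .A third' -split '\s*\.\s*'

# -match is a regex too: "." as a wildcard gives a false positive here,
# while [regex]::Escape() forces a literal comparison.
$loose  = 'file_txt' -match 'file.txt'
$strict = 'file_txt' -match ([regex]::Escape('file.txt'))
```

Here $unescaped holds 10 empty strings, $escaped holds ‘This’ and ‘that’, and $loose is True where $strict is False.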

I thought that I knew regular expressions until, thanks to an article by Tome Tanasovski, I found I had missed a big bit of the picture, which meant my understanding was wrong. I thought that a match meant the equivalent of running a highlighter pen over part of the text and that -replace meant “take something out and put something else back”. Both are usually true, but not always. Tome also did a presentation for the PowerShell user group – there’s a link to the recording on Richard’s blog – I’d recommend watching it and pausing every so often to try things out.
Tome showed look-aheads and look-behinds. These say “It’s a match if it is followed by something”, or “preceded by something” (or not). This adds a whole new dimension…
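To make the idea concrete, here is a small sketch using a made-up sample string (not taken from Tome’s article). It shows a look-ahead, a look-behind and a negative look-ahead; in each case the text inside the lookaround has to be there (or not be there) but never becomes part of the match.

```powershell
$text = 'price: 100 USD, 200 EUR'    # made-up sample

# Look-ahead: digits count only if ' USD' follows them.
$null = $text -match '\d+(?= USD)'
$usd  = $Matches[0]

# Look-behind: digits count only if ', ' comes before them.
$null       = $text -match '(?<=, )\d+'
$afterComma = $Matches[0]

# Negative look-ahead: digits NOT followed by another digit or by ' USD'.
$null   = $text -match '\d+(?!\d| USD)'
$notUsd = $Matches[0]

# Because ' USD' is never part of the match, -replace leaves it in place.
$masked = $text -replace '\d+(?= USD)', 'N'
```

$usd comes out as 100, the other two matches as 200, and $masked as ‘price: N USD, 200 EUR’.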

A couple of days later I hit a snag with PowerShell’s Split-Path cmdlet. If the path is on a remote machine it might use a drive letter which doesn’t exist on the local machine, and in that situation Split-Path throws an error. But I can use the -split operator with a regular expression. I want to say “Find a \ followed by some characters that aren’t \ and the end of the string”. Let’s test this:
PS C:\Users\James\Documents\windowsPowershell> $pwd -split "\\[^\\]+$"
C:\Users\James\Documents

As in my first example ‘\\’ is an escaped ‘\’ character and ‘$’ means “end of line”; ‘[^\\]’ says “anything which is not the ‘\’ character” and ‘+’ means “at least once”. So this translates as “Look for a ‘\’ followed by at least 1 non-‘\’ followed by end of line”. It’s mostly right but it doesn’t work (yet).
I copied my command prompt so you can see that ‘windowsPowershell’ is part of my working directory, but that bit got lost; or to be more precise it was matched in the expression, so -split treated it as the delimiter and returned only the text on either side of it.
I want to say “Find ONLY a ‘\’. The one you want is followed by some characters that aren’t ‘\’ and the end of the string, but they don’t form part of the delimiter.” The syntax for “is followed by” is “(?=   )” so I can wrap that around the [^\\]+$ part and test that (the ‘+?’ below is the lazy form of ‘+’; anchored by ‘$’ it gives the same result here):
PS C:\Users\James\Documents\windowsPowershell> $pwd -split "\\(?=[^\\]+?$)"
C:\Users\James\Documents
windowsPowershell

Regular expressions can turn into a write-only language – easy to build up but pretty hard to pull apart. At the risk of making things worse, not everyone knows that PowerShell has a “multiple = operator” (multiple assignment); if you write $a , $b = 1,2 it will assign 1 to $a and 2 to $b. Since the output of the split operation is 2 items we can try this:
PS C:\Users\James\Documents\windowsPowershell> $Parent,$leaf = $pwd -split "\\(?=[^\\]+?$)"
PS C:\Users\James\Documents\windowsPowershell> $Parent
C:\Users\James\Documents
PS C:\Users\James\Documents\windowsPowershell> $leaf
windowsPowershell
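The same thing works on a literal string, which is the case that prompted all this: a path whose drive need not exist locally. This is a sketch with a hypothetical path; it puts the “consuming” pattern and the look-ahead pattern side by side.

```powershell
# A path on a drive this machine need not have (hypothetical example).
$remotePath = 'X:\Builds\Nightly\output.log'

# Consuming version: the last segment is part of the delimiter and vanishes,
# leaving the parent and an empty string.
$consumed = $remotePath -split '\\[^\\]+$'

# Look-ahead version: only the final backslash is the delimiter,
# so both halves survive and can be assigned in one statement.
$parent, $leaf = $remotePath -split '\\(?=[^\\]+$)'
```

$parent ends up as ‘X:\Builds\Nightly’ and $leaf as ‘output.log’. Nothing here touches the file system, so no drive X: is needed.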

The “cost” of using regular expressions is that the term used to do the split is something akin to a magical incantation. The benefit is that the code is a lot more streamlined than using the string object’s .LastIndexOf() and .Substring() methods, its .Length property and some arithmetic to get to the same result. I’d contend that even allowing for the “incantation” the regex way makes it easier to see that $pwd is being split into 2 parts.
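For comparison, this is roughly what the string-method route looks like. It is a sketch using a literal stand-in for $pwd, and it assumes the path contains at least one backslash (LastIndexOf returns -1 otherwise and Substring would throw).

```powershell
$path = 'C:\Users\James\Documents\windowsPowershell'   # literal stand-in for $pwd

$cut    = $path.LastIndexOf('\')                        # position of the final backslash
$parent = $path.Substring(0, $cut)                      # everything before it
$leaf   = $path.Substring($cut + 1,
                          $path.Length - $cut - 1)      # everything after it
```

It gets to the same two values, but the reader has to follow the index arithmetic to see that.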
Good stuff so far, but Tome had another trick:  the match that selects nothing and the replace that removes nothing.  That made me stop and redefine my understanding.  Here’s the use case:

Ages ago I wrote about using PowerShell to query the Windows [Vista] Desktop Index – it works just as well with Windows 7. The zillion or so fields used in these queries have names like “System.Title”, “System.Photo.Orientation” and “System.Image.Dimensions” – I’d type the bare field name like “title” by mistake or waste time discovering whether “HorizontalSize” belonged to System.Photo or System.Image.
It would be better to enable my Get-IndexedFile function to put in the right prefix: but could it be done reasonably efficiently and elegantly?
Here lookarounds come into their own. They let me write “If you can find a spot which is immediately after a space, and immediately before the word ‘Dimensions’ OR the word ‘HorizontalSize’ OR…” and so on for all the Image Fields “AND that word is followed by any spaces and a ‘=’ sign  THEN put ‘System.image.’ at the spot you found”.  With just the first two fieldnames the operation looks like this
-replace "(?<=\s) (?=(Dimensions|HorizontalSize)\s*=)" , "system.image."
                 ^
I have put an extra space in for the spot that will be matched – the ^ is pointing this out, it isn’t part of the code.
“(?<=  )” is the wrapper for the “look behind” operation (replacing the ‘=’ with ‘!’ negates the expression) so “(?<=\s)” says “behind this spot you find a space” and the second half is a “look ahead” which says “in front of this spot you find ‘Dimensions’ or ‘HorizontalSize’ then zero or more spaces (‘\s*’) followed by ‘=’ ”. A match with an expression like this is like an I-beam cursor between characters, rather than a highlight over some of them: so the -replace operator has nothing to remove but it still inserts ‘system.image.’ at that point. So let’s put that to the test.

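PS> " horizontalsize = 1024"  -replace "(?<=\s)(?=(Dimensions|HorizontalSize)\s*=)",
                                       "system.image."
 system.image.horizontalsize = 1024
The test string starts with a space because the look-behind needs a whitespace character in front of the field name; without one nothing is inserted.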
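Extending this to “all the Image fields” need not mean typing one long pattern by hand. The sketch below builds the alternation from a list; the list is only an illustrative subset, the query text is made up, and the `(?<=^|\s)` variant is my own variation (not something from the post) that also accepts a field name at the very start of the string.

```powershell
# A zero-width match in miniature: nothing is removed, '-' is inserted before 'b'.
$demo = 'abc' -replace '(?=b)', '-'

# Illustrative subset of the System.Image field names; the real list is longer.
$imageFields = 'Dimensions', 'HorizontalSize', 'VerticalSize'

# Same shape as the hand-written pattern, with the alternation joined from the list.
$pattern = '(?<=\s)(?=(' + ($imageFields -join '|') + ')\s*=)'

$where = 'title = x and horizontalsize = 1024 and verticalsize = 768'   # made-up query text
$fixed = $where -replace $pattern, 'System.Image.'

# Variation: let the look-behind accept start-of-string as well as whitespace.
$anchored = 'horizontalsize = 1024' -replace '(?<=^|\s)(?=(Dimensions|HorizontalSize)\s*=)', 'System.Image.'
```

$fixed has the prefix in front of both size fields and leaves ‘title’ alone, and $anchored is ‘System.Image.horizontalsize = 1024’ even though the input has no leading space. The -replace operator ignores case by default, which is why the lower-case field names still match.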

It works! This whole exercise of writing a Get-IndexedFile function – which I will share in due course – ended up as a worked example in using regex to support good function design. I’ve got another post in draft at the moment about my ideas on good function design, so I’ll post that and then come back to looking at all the different ways I made use of regular expressions in this one function.
