James O'Neill's Blog

June 30, 2012

Using the Windows index to search from PowerShell:Part Two – Helping with user input

Filed under: Uncategorized — jamesone111 @ 10:46 am

Note: this was originally written for the Hey,Scripting guy blog where it appeared as the 26 June 2012 episode. The code is available for download . I have some more index related posts coming up so I wanted to make sure everything was in one place


In part one I developed a working PowerShell function to query the Windows index. It outputs data rows which isn’t the ideal behaviour and I’ll address that in part three; in this part I’ll address another drawback: search terms passed as parameters to the function must be "SQL-Ready". I think that makes for a bad user experience so I’m going to look at the half dozen bits of logic I added to allow my function to process input which is a little more human. Regular expressions are the way to recognize text which must be changed, and I’ll pay particular attention to those as I know I lot of people find them daunting.

Replace * with %

SQL statements use % for wildcard, but selecting files at the command prompt traditionally uses *. It’s a simple matter to replace – but for the need to "escape" the* character, replacing * with % would be as simple as a –replace statement gets:
$Filter = $Filter -replace "\*","%"
For some reason I’m never sure if the camera maker is Canon or Cannon so I’d rather search for Can*… or rather Can%, and that replace operation will turn "CameraManufacturer=Can*" into "CameraManufacturer=Can%". It’s worth noting that –replace is just as happy to process an array of strings in $filter as it is to process one.

Searching for a term across all fields uses "CONTAINS (*,’Stingray’)", and if the -replace operation changes* to % inside a CONTAINS() the result is no longer a valid SQL statement. So the regular expression needs to be a little more sophisticated, using a "negative look behind"
$Filter = $Filter -replace " "(?<!\(\s*)\*","%"

In order to filter out cases like CONTAINS(*… , the new regular expression qualifies "Match on *",with a look behind – "(?<!\(\s*)" – which says "if it isn’t immediately preceded by an opening bracket and any spaces". In regular expression syntax (?= x) says "look ahead for x" and (?<= x) says "Look behind for x" (?!= x) is “look ahead for anything EXCEPT x” and (?<!x) is “look behind for anything EXCEPT x” these will see a lot of use in this function. Here (?<! ) is being used, open bracket needs to be escaped so is written as \( and \s* means 0 or more spaces.

Convert "orphan" search terms into ‘contains’ conditions.

A term that needs to be wrapped as a "CONTAINS" search can be identified by the absence of quote marks, = , < or > signs or the LIKE, CONTAINS or FREETEXT search predicates. When these are present the search term is left alone, otherwise it goes into CONTAINS, like this.
$filter = ($filter | ForEach-Object {
    if  ($_ -match "'|=|<|>|like|contains|freetext") 
          
{$_}
    else   {"Contains(*,'$_')"}
})

Put quotes in if the user omits them.

The next thing I check for is omitted quote marks. I said I wanted to be able to use Can*, and we’ve seen it changed to Can% but the search term needs to be transformed into "CameraManufacturer=’Can%’ ". Here is a –replace operation to do that.
$Filter = $Filter -replace "\s*(=|<|>|like)\s*([^'\d][^\s']*)$",' $1 ''$2'' '
This is a more complex regular expression which takes a few moments to understand

Regular expression

Meaning

Application

\s*(=|<|>|like)\s*
([^’\d][^\s’]*)$

Any spaces (or none)

 

\s*(=|<|>|like)\s*
([^’\d][^s’]*)$

= or < or > or "Like"

CameraManufacturer=Can%

\s*(=|<|>|like)\s*
([^’\d][^\s’]*)$

Anything which is NOT a ‘ character
or a digit

CameraManufacturer=Can%

\s*(=|<|>|like)\s*
([^’\d][^\s’]*)$

Any number of non-quote,
non-space characters (or none)

CameraManufacturer=Can%

\s*(=|<|>|like)\s*
([^’\d][^\s’]*)$

End of line

\s*(=|<|>|like)\s*
([^’\d][^\s’]*)$

Capture the enclosed sections
as matches

$Matches[0]= "=Can%"
$Matches[1]= "="
$Matches[2]= "Can%"

‘ $1 ”$2” ‘0

Replace Matches[0] ("=Can%")
with an expression which uses the
two submatches "=" and "can%".

= ‘Can%’

Note that the expression which is being inserted uses $1 and $2 to mean matches[1] and[2] – if this is wrapped in double quote marks PowerShell will try to evaluate these terms before they get to the regex handler, so the replacement string must be wrapped in single quotes. But the desired replacement text contains single quote marks, so they need to be doubled up.

Replace ‘=’ with ‘like’ for Wildcards

So far, =Can* has become =’Can%’, which is good, but SQL needs "LIKE" instead of "=" to evaluate a wildcard. So the next operation converts "CameraManufacturer = ‘Can%’ "into "CameraManufacturer LIKE ‘Can%’ ":
$Filter = $Filter -replace "\s*=\s*(?='.+%'\s*$)" ," LIKE "

Regular expression

Meaning

Application

\s*=\s*(?=’.+%’\s*$)

= sign surrounded by any spaces

CameraManufacturer = ‘Can%’

\s*=\s*(?=.+%’\s*$)

A quote character

CameraManufacturer = Can%’

\s*=\s*(?=’.+%’\s*$)

Any characters (at least one)

CameraManufacturer = ‘Can%’

\s*=\s*(?=’.+%’\s*$)

% character followed by ‘

CameraManufacturer = ‘Can%’

\s*=\s*(?=’.+%’\s*$)

Any spaces (or none)
followed by end of line

\s*=\s*(?=‘.+%’\s*$)

Look ahead for the enclosed expression but don’t include it in the match

$Matches[0] = "="
(but only if ‘Can%’ is present)

Provide Aliases

The steps above reconstruct "WHERE" terms to build syntactically correct SQL, but what if I get confused and enter “CameraMaker” instead of “CameraManufacturer” or “Keyword” instead of “Keywords” ? I need Aliases – and they should work anywhere in the SQL statement – not just in the "WHERE" clause but in "ORDER BY" as well.
I defined a hash table (a.k.a. a "dictionary", or an "associative array") near the top of the script to act as a single place to store the aliases with their associated full canonical names, like this:
$PropertyAliases = @{
    Width       = "System.Image.HorizontalSize";
    Height      = "System.Image.VerticalSize";
    Name        = "System.FileName";
    Extension   = "System.FileExtension";
    Keyword     = "System.Keywords";
    CameraMaker = "System.Photo.CameraManufacturer"
}
Later in the script, once the SQL statement is built, a loop runs through the aliases replacing each with its canonical name:
$PropertyAliases.Keys | ForEach-Object {
    $SQL= $SQL -replace "(?<=\s)$($_)(?=\s*(=|>|<|,|Like))",$PropertyAliases[$_]
}
A hash table has .Keys and .Values properties which return what is on the left and right of the equals signs respectively. $hashTable.keyName or $hashtable[keyName] will return the value, so $_ will start by taking the value "width", and its replacement will be $PropertyAliases["width"] which is "System.Image.HorizontalSize", on the next pass through the loop, "height" is replaced and so on. To ensure it matches on a field name and not text being searched for, the regular expression stipulates the name must be preceded by a space and followed by "="or "like" and so on.

Regular expression

Meaning

Application

(?<=\s)Width(?=\s*(=|>|<|,|Like))

The literal text "Width"

Width > 1024

(?<=\s)Width(?=\s*(=|>|<|,|Like))

A Space

(?<=\s)Width(?=\s*(=|>|<|,|Like))

Look behind for the enclosed expression
but don’t include it in the match

$Matches[0] = "Width"
(but only if a leading space is present)

(?<=\s)Width(?=\s*(=|>|<|,|Like))

any spaces (or none)

(?<=\s)Width(?=\s*(=|>|<|,|Like))

The literal text "Like", or any of the characters comma, equals, greater than or less than

Width > 1024

(?<=\s)Width(?=\s*(=|>|<|,|Like))

Look ahead for the enclosed expression
but don’t include it in the match

$Matches[0] = "Width"
(but only if " >" is present)

If the prefix is omitted put the correct one in.

This builds on the ideas we’ve seen already. I want the list of fields and prefixes to be easy to maintain, so just after I define my aliases I define a list of field types
$FieldTypes = "System","Photo","Image","Music","Media","RecordedTv","Search"
For each type I define two variables, a prefix and a fieldslist : the names must be FieldtypePREFIX and FieldTypeFIELDS – the reason for this will become clear shortly but here is what they look like
$SystemPrefix = "System."
$SystemFields = "ItemName|ItemUrl"
$PhotoPrefix  = "System.Photo."
$PhotoFields  = "cameramodel|cameramanufacturer|orientation"
In practice the field lists are much longer – system contains 25 fieldnames not just the two shown here. The lists are written with "|" between the names so they become a regular expression meaning "ItemName or ItemUrl Or …". The following code runs after aliases have been processed
foreach ($type in $FieldTypes) {
   $fields = (get-variable "$($type)Fields").value
   $prefix = (get-variable "$($type)Prefix").value 
   $sql    = $sql -replace "(?<=\s)(?=($Fields)\s*(=|>|<|,|Like))" , $Prefix
}
I can save repeating code by using Get-Variable in a loop to get $systemFields, $photoFields and so on, and if I want to add one more field, or a whole type I only need to change the variable declarations at the start of the script. The regular expression in the replace works like this:

Regular expression

Meaning

Application

(?<=\s)(?=(cameramanufacturer|
orientation)\s*(=|>|<|,|Like))"

Look behind for a space
but don’t include it in the match

 

(?<=\s)(?=(cameramanufacturer|
orientation
)\s*(=|>|<|,|Like))"

The literal text "orientation" or "cameramanufacturer"

CameraManufacturer LIKE ‘Can%’

(?<=\s)(?=(cameramanufacturer|
orientation)\s*(=|>|<|,|Like))"

any spaces (or none)

 

(?<=\s)(?=(cameramanufacturer|
orientation)\s*(=|>|<|,|Like))"

The literal text "Like", or any of the characters comma, equals, greater than or less than

CameraManufacturer LIKE ‘Can%’

(?<=\s)(?=(cameramanufacturer|
orientation)\s*(=|>|<|,|Like))"

Look ahead for the enclosed expression
but don’t include it in the match

$match[0] is the point between the leading space and "CameraManufacturer LIKE" but doesn’t include either.

We get the effect of an "insert" operator by using ‑replace with a regular expression that finds a place in the text but doesn’t select any of it.
This part of the function allows "CameraManufacturer LIKE ‘Can%’" to become "System.Photo CameraManufacturer LIKE ‘Can%’ " in a WHERE clause.
I also wanted "CameraManufacturer" in an ORDER BY clause to become "System.Photo CameraManufacturer". Very sharp-eyed readers may have noticed that I look for a Comma after the fieldname as well as <,>,=, and LIKE. I modified the code which appeared in part one so that when an ORDER BY clause is inserted it is followed by a trailing comma like this:
if ($orderby) { $sql += " ORDER BY " + ($OrderBy -join " , " ) + ","}

the new version will work with this regular expression but the extra comma will cause a SQL error and so it must be removed later.
When I introduced the SQL I said the SELECT statement looks like this:

SELECT System.ItemName, System.ItemUrl,      System.FileExtension, System.FileName, System.FileAttributes, System.FileOwner, 
       System.ItemType, System.ItemTypeText , System.KindText,     System.Kind,     System.MIMEType,       System.Size

Building this clause from the field lists simplifies code maintenance, and as a bonus anything declared in the field lists will be retrieved by the query as well as accepted as input by its short name. The SELECT clause is prepared like this:
if ($First) 
     {$SQL = "SELECT TOP $First "}
else {$SQL = "SELECT "}
foreach ($type in $FieldTypes)
     {$SQL +=((get-variable "$($type)Fields").value -replace "\|",", " ) + ", "}

This replaces the "|" with a comma and puts a comma after each set of fields. This means there is a comma between the last field and the FROM – which allows the regular expression to recognise field names, but it will break the SQL , so it is removed after the prefixes have been inserted (just like the one for ORDER BY).
This might seem inefficient, but when I checked the time it took to run the function and get the results but not output them it was typically about 0.05 seconds (50ms) on my laptop – it takes more time to output the results.
Combining all the bits in this part with the bits in part one turns my 36 line function into about a 60 line one as follows

Function Get-IndexedItem{
Param ( [Alias("Where","Include")][String[]]$Filter ,
        [Alias("Sort")][String[]]$OrderBy,
        [Alias("Top")][String[]]$First,
        [String]$Path,
        [Switch]$Recurse 
      )
$PropertyAliases = @{Width ="System.Image.HorizontalSize"; 
                    Height = "System.Image.VerticalSize"}
$FieldTypes      = "System","Photo"
$PhotoPrefix     = "System.Photo."
$PhotoFields     = "cameramodel|cameramanufacturer|orientation"
$SystemPrefix    = "System."
$SystemFields    = "ItemName|ItemUrl|FileExtension|FileName"
if ($First) 
     {$SQL = "SELECT TOP $First "}
else {$SQL = "SELECT "}
foreach ($type in $FieldTypes)
     {$SQL +=((get-variable "$($type)Fields").value -replace "\|",", ")+", " }
if ($Path -match "\\\\([^\\]+)\\.")
     {$SQL += " FROM $($matches[1]).SYSTEMINDEX WHERE "}
else {$SQL += " FROM SYSTEMINDEX WHERE "}
if ($Filter)
     {$Filter = $Filter -replace "\*","%"
      $Filter = $Filter -replace"\s*(=|<|>|like)\s*([^'\d][^\s']*)$",' $1 ''$2'' '
      $Filter = $Filter -replace "\s*=\s*(?='.+%'\s*$)" ," LIKE "
      $Filter = ($Filter | ForEach-Object {
          if ($_ -match "'|=|<|>|like|contains|freetext")
               {$_}
          else {"Contains(*,'$_')"}
      })
      $SQL += $Filter -join " AND "
    }
if ($Path)
    {if ($Path -notmatch "\w{4}:") {$Path = "file:" + $Path}
     $Path = $Path -replace "\\","/"
     if ($SQL -notmatch "WHERE\s$") {$SQL += " AND " }
     if ($Recurse) 
          {$SQL += " SCOPE = '$Path' "}
     else {$SQL += " DIRECTORY = '$Path' "}
}
if ($SQL -match "WHERE\s*$")
     { Write-warning "You need to specify either a path , or a filter." ; return }
if ($OrderBy) { $SQL += " ORDER BY " + ($OrderBy -join " , " ) + ","}
$PropertyAliases.Keys | ForEach-Object 
     { $SQL= $SQL -replace"(?<=\s)$($_)(?=\s*(=|>|<|,|Like))", $PropertyAliases.$_ }
foreach ($type in $FieldTypes)
   
{$fields = (get-variable "$($type)Fields").value
     $prefix = (get-variable "$($type)Prefix").value
     $SQL    = $SQL -replace "(?<=\s)(?=($Fields)\s*(=|>|<|,|Like))" , $Prefix
    }
$SQL = $SQL -replace "\s*,\s*FROM\s+" , " FROM "
$SQL = $SQL -replace "\s*,\s*$" , ""
$Provider="Provider=Search.CollatorDSO;"+ "Extended Properties=’Application=Windows’;"
$Adapter = new-object system.data.oledb.oleDBDataadapter -argument $SQL, $Provider
$DS     = new-object system.data.dataset
if ($Adapter.Fill($DS)) { $DS.Tables[0] }
}

In part 3 I’ll finish the function by turning my attention to output

Advertisements

Blog at WordPress.com.

%d bloggers like this: