James O'Neill's Blog

August 7, 2011

Prayer-based parsing: PowerShell regular expressions again.

Filed under: Powershell — jamesone111 @ 9:52 am

 

Recently I saw a quote on Scott Hanselman’s blog

You’ve got a problem, and you’ve decided to use regular expressions to solve it.
Ok, now you’ve got two problems…”

A telling of the history of this Quote provides some other good quotes,; some restate “Using X as a solution adds to the problems”  another expanded a point I heard Thomas Lee make at one of his PowerShell camps. In PowerShell you can pipe a folder object, or a process object between steps and get its name from a .name property. Without this model: as one of quotes puts it:

The decades-old Unix “pipe” model is just plain dumb, because it forces you to think of everything as serializable text, even things that are fundamentally not text, or that are not sensibly serializable

(DOS copied the Unix pipe model, and it’s still in Windows -away from PowerShell).
Thomas had a great term for de-serializing text: “Prayer based parsing”. Every time you extract the bits you want from the text, you need to pray that it (still) works with your parsing rules. Some of the more arcane switches for command-line tools (on whatever OS) are to control their output – with one eye on simplifying the job of parsing it. 

Some tasks –like  “Screen scraping” – leave no alternative but to parse text. ( “A text only “pipe” is just automated screen scraping” is another way to describe the problem PowerShell’s pipeline solves.) Regular expressions are “big hammer” for advanced parsing (some of the “quotes” article gets into whether in Perl they are the most suitable hammer , or the most obvious hammer), and they’re something which PowerShell handles deftly with the –match and –replace operators.

I recently wanted to take a list of companies offering training which I had found the web, and transform it into a more database like format. In my screen scraped list each company had an entry which looked like this

Contoso Ltd S-1234
123 Station Road
Aberdeen
AB12 3DE
info@Contoso.com
http://www.contoso.com
Ph: (44) 123-456789
Fax: (44) 123-456790

Sign Up for a Online Course

approx: 0.8km / 0.5mi from London, GB, United Kingdom

I copied the list, and opened a new tab the PowerShell ISE and created a small snippet of PowerShell
$list = @"

"@

The @” … “@  defines a multi-line “here-string” – so I can paste the data in between the quotes hit the Run Script button and ta-da! my list is in a PowerShell variable. Studying the format, all the data I want is on consecutive lines – with a double line space after them. The –replace operator can remove single line breaks, leaving additional ones, to group the text I want on one line.
$list -replace "\r\n(?=\w)", ", "
\r and \n indicate the return and newline characters. So -replace "\r\n", ", " would replace all line breaks with with a comma and a space,
the (?=  ) construction specifies a “look ahead” says “match ONLY if what you have found is followed by…” and I want the replacement to happen only if the return/newline is followed by a ‘word’ character (\w) – so a line break followed by another line break won’t be replaced.

Now I can split my list – which is still a single giant string into  into multiple strings. Where I had two line breaks I know have one line break, a comma and a space so I can use that as the expression for the –split operator. 
($list -replace "\r\n(?=\w)", ", ") -split "\r\n, "

So far so good, but this will give be distances and the call to take a course. I want to discard any lines which don’t contain a company’s ID number

($list -replace "\r\n(?=\w)", ", ") -split "\r\n, " |
    foreach { if ($_ -match "\ss-\d+\s*,") {'"' + $_ + '"'}}

This looks at each line and if it contains a space, ‘S’, ‘–’, at least one digit, any number of spaces (including zero) and then a comma, it returns the line wrapped in double quotes. So now my text is one line per company looking like this. 

“Contoso Ltd S-1234, 123 Station Road, Aberdeen, AB12 3DE, info@Contoso.com, http://www.contoso.com, Ph: (44) 123-456789 , Fax: (44) 123-456790”

My next step is to replace the space before the ID number with ‘ “,” ’ 

(($list -replace "`r`n(?=\w)", ", ") -split "`r`n, " |
      foreach { if ($_ -match "\ss-\d+\s*,") {'"' + $_ + '"'}}
) -replace "\s+(?=S-\d+\s*,)", '","'

"\s+(?=S-\d+\s*,)" says “match on one or more spaces ONLY IF  followed by S, –, at least one digit, any spaces (or none) and then a comma.”  Wherever that Occurs I insert “,” breaking my line into two quoted strings separated by a comma – just like a CSV file.  The next step is to make it three quoted strings by replacing the coma and spaces after the ID number with the same ‘ “,” ’  combination

(($list -replace "`r`n(?=\w)", ", ") -split "`r`n, " |
      foreach { if ($_ -match "\ss-\d+\s*,") {'"' + $_ + '"'}}
) -replace "\s+(?=S-\d+\s*,)", '","'  -replace "(?<=S-\d+),\s*",'","'

"(?<=S-\d+),\s*" uses  (?<=   ) which is the look-behind construction, saying  “match on comma and any spaces (or none) ONLY IF it is preceded by ‘S’, ‘–’ and at least one digit”.  Now my line looks like this

“Contoso Ltd”,”S-1234”,”123 Station Road, Aberdeen, AB12 3DE, info@Contoso.com, http://www.contoso.com, Ph: (44) 123-456789 , Fax: (44) 123-456790″

The next task is to put the ” , ” combination before an email address:

(($list -replace "`r`n(?=\w)", ", ") -split "`r`n, " |
foreach { if ($_ -match "\ss-\d+\s*,") {'"' + $_ + '"'}}
) -replace "\s+(?=S-\d+\s*,)", '","' -replace "(?<=S-\d+),\s*",'","' `
–replace ",\s* (?=\w+@\w+\.\w+)" ,'","'

",\s* (?=\w+@\w+\.\w+)" says “match on comma and any spaces ONLY IF followed by at least one character , an @ sign, at least one character , a dot, and at least one character.”
Then I can use  ",\s*(?=http)" which matches on comma and any spaces ONLY IF followed by http and finally "\s*,\s* Ph:\*" and "\s*,\s*Fax:\s* " match on the ph and fax tags and surrounding spaces to put “,” before each of those.

Instead of using a different –replace operation for each step the terms  can be condensed into a single expression with | for “or” between each part.  You can see that building the expression up bit by bit is a lot easier than writing it in one go.

(($list -replace "`r`n(?=\w)", ", ") -split "`r`n, " |
      foreach { if ($_ -match "\ss-\d+\s*,") {'"' + $_ + '"'}}
) -replace "\s+(?=S-\d+\s*,)|(?<=S-\d+),\s*|, (?=\w+@\w+\.\w+)|,\s* (?=http)|\s*,\s*Ph:\s*|\s*,\s*Fax:\s*" , '","'

This turns my text into

“Contoso  Ltd”,”S-1234″,”123 Station Road, Aberdeen, AB12 3DE”, “info@Contoso.com”,”http://www.contoso.com&#8221;,”(44) 123-456789″,”(44) 123-456790″

There is potentially more-cleaning up I could do – for example identifying lines without email or URL sections and inserting a blank “”, in their place. But this gives me all I need; by writing a header row to a file and adding and my text to it I would get a ready made CSV file for Excel. I could have turned it into a bulk database import or used PowerShell’s has a ConvertFrom-Csv cmdlet to turn each row into a object or any number of other things. I don’t think I would have tried typing that line in one go: but PowerShell induced the habit of building up these long lines a little at a time meant it only took a couple of minutes to do. 

Advertisements

1 Comment

  1. […] Aren’t objects great?!  If the SMTP headers were emitted as objects we wouldn’t need to do this ‘Prayer-based Parsing’ James O’Neill has recently posted on. […]

    Pingback by Regex Toolkit, Prayer-Based Parsing, Bad Examples | Chris Warwick – The Blog Edition — August 18, 2011 @ 10:41 pm


RSS feed for comments on this post.

Create a free website or blog at WordPress.com.

%d bloggers like this: