James O'Neill's Blog

June 9, 2014

Screen scraping for pleasure or profit (with PowerShell’s Invoke-RestMethod)

Filed under: Powershell — jamesone111 @ 3:28 pm

Twenty odd years ago I wrote some Excel Macros to get data into Excel from one of the worlds less friendly materials management systems. It was far easier to work with the data in Excel,  and the Macros were master-pieces of send-keys and prayer-based parsing and it was the first time I heard the term Screen Scrape . (When I searched for “Prayer-based parsing” – it took me to another page on this blog where you might develop the argument that in Unix and Linux, the pipe we take for granted is little more than an automated screen scrape).

The technology changes (and gets easier) but the same things push us to screen scrape. Data would be more useful in from other than the one in which it is made available. There is a benefit if we can shift it from one form to the other. There are times when taking the data might break usage terms, or have other legal or ethical reasons why you shouldn’t do it. You need to work out those questions for yourself, here I’m just going to talk about one technique which I’ve found myself using several times recently.  

imageA lot of the data we want today is delivered in web pages. This, of itself, can be helpful, because well formed HTML can often be treated as XML data so with something like PowerShell the parsing is (near enough) free. Instead of having to pull the text completely apart there might be a small amount of preliminary tidying up followed and then putting [XML] in front of something conjures up a navigable hierarchy of data. The data behind my post on Formula one statistics was gathered mainly using this method.  (And the odd regular expression for pulling a needle-like scrap of information from a Haystack of HTML)

But when you’re lucky, the you don’t even need to parse the data out of the format it is displayed. Recently someone sent me a link a to a story about Smart phone market share. It’s presented as a map but it’s not a great way to see how share has gone or down at different times, or compare one place against another at the same time. These days when I see something like this I think “how do they get the data for the map”. The easy way to find out out is to Press [F12] in internet explorer, turn on the network monitor (click the router Icon and then the  “play” button) and reload the page. The hope is a tell-tale sign of data being requested so and processed by the browser ready for display. Often this data will jump out because of the way it is formatted. And circled in the network trace is some JSON format data. JSON or XML data is a real gift for PowerShell….

Once upon a time if you wanted to get data from a web server you had to write a big chuck of code. Then Microsoft provided a System.Net.WebClient object which would do fetching and carrying but left the parsing to you. In recent versions of PowerShell there are two cmdlets, Invoke-WebRequest is basically a wrapper for this. It will do some crunching of the HTML page so you can work with the document. PowerShell also has ConvertFrom-JSON, so you can send the content into that and don’t have to write your own JSON parser. But it’s even easier than that. Invoke-RestMethod will get the page and if it can parse what comes back, it does: so you don’t need a separate convert step. So I can do get the data back and quickly explore it like this:

> $j = Invoke-RestMethod -Uri http://www.kantarworldpanel.com/php/comtech_data.php
>$j
years                                                                                                        
-----                                                                                                        
{@{year=2012; months=System.Object[]}, @{year=2013; months=System.Object[]}, @{year=2014; months=System.Obj...

> $j.years[0]
year                                                    months                                               
----                                                    ------                                               
2012                                                    {@{month=0; cities=System.Object[]}, @{month=1; cit...

> $j.years[0].months[0]
month                                                   cities                                               
-----                                                   ------                                               
0                                                       {@{name=USA; lat=40.0615504; lng=-98.51893029999997...

> $j.years[0].months[0].cities[0]
name                        lat                         lng                         platforms                
----                        ---                         ---                         ---------                
USA                         40.0615504                  -98.51893029999997          {@{name=Android; share=...

> $j.years[0].months[0].cities[0].platforms[0]
name                                                    share                                                
----                                                    -----                                                
Android                                                 43 

From there it’s a small step to make something which sends the data to the clipboard ready formatted for Excel

$j = Invoke-RestMethod -Uri "http://www.kantarworldpanel.com/php/comtech_data.php"
$(   foreach ($year  in $J.years)  {
        foreach ($month in $year.months) {
            foreach ($city in $month.cities) {
                foreach ($platform in $city.platforms) {
                    ('{0} , {1} , "{2}" , "{3}", {4}' -f $year.year ,
                   
  ($month.month + 1) , $city.name , $platform.name , $platform.share) }}}}) | clip

Web page to Excel-ready in less time than it takes to describe it.  The other big win with these two cmdlets is that they understand how to keep track of sessions (which technically means Invoke-RestMethod  can also work with things which aren’t really RESTful)  Invoke-WebRequest understands forms on the page. I can logon to a photography site I use like this
if (-not $cred) { $Global:cred  = Get-Credential -Message "Enter logon details For the site"}
$login                          = Invoke-WebRequest "$SiteRoot/login.asp" –SessionVariable SV
$login.Forms[0].Fields.email    = $cred.UserName
$login.Forms[0].Fields.password = $cred.GetNetworkCredential().Password
$homePage                       = Invoke-WebRequest ("$SiteRoot/" + $login.Forms[0].action) -WebSession $SV -Body $login -Method Post
if ($homePage.RawContent        -match "href='/portfolio/(\w+)'")
    {$SiteUser                    = $Matches[1]}

Provided my subsequent calls to Invoke-WebRequest and Invoke-RestMethod specify the same session variable in their –WebSession parameter the site treats me as logged in and gives me my  data (which mostly is ready prepared JSON with the occasional needs to be parsed out of the HTML) rather than public data. With this I can get straight to the data I want with only the occasional need to resort to regular expressions. Once you have done one or two of these you start seeing a pattern for how you can either pick up XML or JSON data, or how you can Isolate links from a page of HTML which contain the data you want, and then isolate the table which has the data

For example

$SeasonPage  =  Invoke-WebRequest -Uri  "http://www.formula1.com/results/season/2014/"
$racePages   =  $SeasonPage.links | where  href -Match "/results/season/2014/\d+/$"
Will give me a list in $racepages of all the results pages for the 2014 F1 seasons  – I can loop though the data. Setting $R to the different URLS.

$racePage        = Invoke-WebRequest -Uri $R
$contentmainDiv  = ([xml]($racePage.ParsedHtml.getElementById("contentmain").outerhtml -replace " "," ")).div
$racedate        = [datetime]($contentmainDiv.div[0].div.span -replace "^.*-","")

$raceName        = $contentmainDiv.div[0].div.h2

Sometimes finding the HTML section to convert to XML is easy – here it needed a little tweak to remove non breaking spaces because the [XML] type converter doesn’t like them, but once I have $contentMainDiv I can start pulling race data out  – and 20 years on Excel is still the tool of choice, and how I get data into Excel will have to wait for another day.

Advertisements

Blog at WordPress.com.

%d bloggers like this: