Screen scraping with Excel

Excel is a great tool for organizing data, but data entry is boring. I had never really given those two facts much thought until a client came to me looking to aggregate quite a bit of stock data into Excel. He had been using Excel for years, manually entering the information he needed.

Generally, when I need to scrape web pages for data, I write up a script in some "P" language: PHP, Python, Perl. This, however, is not very friendly for the non-programmers among us.

This leaves running the scraper as an Excel macro written in VBA. I honestly wasn't sure if this was feasible at first, but after a little research and testing it actually works really well. Most people are familiar with formulas in Excel (e.g. =sum(...), =cos(...), =average(...), etc.), but less commonly known is the ability to extend Excel via a real programming language: Visual Basic for Applications, a.k.a. VBA. These programs are called "macros". Macro-enabled Excel files are saved as *.xlsm as opposed to the standard *.xlsx or *.xls. For security reasons, each time a macro-enabled Excel file is opened, macros must be enabled, usually via a security warning as seen below.

[Screenshot: Excel macro security warning]

While I cannot give out the file or code, the basics are pretty simple once you know what to do. First, enable the Developer tab in the ribbon (right-click the ribbon, choose Customize the Ribbon, and check the Developer tab). Then set up your layout. I added a button and some status cells to Sheet 1; the bottom of Sheet 1 contained the stock quotes, and Sheet 2 contained the list of stock symbols to be scraped. Finally, right-click the tab of the sheet you want to work on and click "View Code". In this developer interface, you can write VBA code attached to various events.

I used the click event on the button. Since the button is named "loadQuotes", I added the subroutine:

Private Sub loadQuotes_Click()
End Sub

to the Sheet 1 Events.
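
To give a sense of how the pieces below fit together, here is a minimal sketch of what that handler might look like. The Sheet 2 layout and the ScrapeQuote helper are assumptions for illustration, not the client's actual code.

Private Sub loadQuotes_Click()
  Dim symbols As Range, sc As Range
  ' Sheet 2 holds one stock symbol per row in column A (assumed layout)
  Set symbols = Worksheets("Sheet 2").Range("A:A")
  For Each sc In symbols.Cells
    If IsEmpty(sc) Then Exit For
    ScrapeQuote sc.Value ' hypothetical helper: fetch and parse one symbol
  Next
End Sub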

The site we were scraping was very poorly formatted. It was not well suited to parsing out the data with only regular expressions (regex), though I did use regex on the contents of specific DOM elements. I decided to use a DOM processor to break the document down into something more easily searchable. Initially I tried to use an Internet Explorer object to navigate to, load, and parse the document. It did work, but it was unreliable; it had a tendency to crash after 150 or so pages.

Using IE

Set doc = CreateObject("InternetExplorer.Application")
If doc Is Nothing Then
  MsgBox "For some reason I wasn't able to use Internet Explorer object"
  Exit Sub
End If
doc.Visible = 0
url = "http://somestockquotewebsite.com/" & symbol
doc.navigate url
' Wait until the page has finished loading (readyState 4 = complete)
Do Until doc.readyState = 4
  DoEvents
Loop
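
One thing worth adding when driving IE this way: release the browser instance when you are done with each page. I can't say for certain this was the cause of the crashes, but leaking browser instances across hundreds of pages is a likely suspect.

' Release the browser when finished with it
doc.Quit
Set doc = Nothing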

I ended up using a combination of Microsoft XML HTTP (msxml2.xmlhttp) and HTML File (htmlFile). The former retrieves the page over HTTP; the latter parses the HTML into a searchable DOM. I suspect htmlFile is using Internet Explorer on some level, but it did not seem to have the same issues. In addition to being more reliable, this method turned out to be a little bit faster.

Using MS XML

Set msxml = CreateObject("msxml2.xmlhttp")
If msxml Is Nothing Then
  MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object"
  Exit Sub
End If
Set doc = CreateObject("htmlFile")
If doc Is Nothing Then
  MsgBox "For some reason I wasn't able to make a HTMLDocument object"
  Exit Sub
End If
url = "http://somestockquotewebsite.com/" & symbol
' Synchronous request: send does not return until the response is complete
msxml.Open "GET", url, False
msxml.send
While msxml.readyState <> 4
  DoEvents
Wend
'MsgBox msxml.responseText
'Exit Sub
' Hand the raw HTML to the htmlFile object, which parses it into a searchable DOM
doc.body.innerHTML = msxml.responseText

Now that I had an HTML document, I could search through it for the specific element I needed with regular JavaScript-style DOM syntax. In my case, I was looking for a table with the attribute width="640". It was the only one on the entire page.

Finding data using DOM

Set tables = doc.getElementsByTagName("table")
' Find the one table on the page with width="640"
For Each tb In tables
  If tb.getAttribute("width") = 640 Then
    'parse through the table looking for the cells I needed
  End If
Next

After finding the table I needed, I looked through each of its child cells with the same technique. Luckily for me, the label for each piece of data was in the same cell as the data itself. I was able to run a simple string match on the innerText of each cell and pull the data out with a regular expression.
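
As a rough sketch of that inner loop (the element names here are assumptions, not the client's code):

' Walk the cells of the matched table; td.innerText carries both label and value
Set tds = tb.getElementsByTagName("td")
For Each td In tds
  If InStr(td.innerText, "Trades Today") > 0 Then
    ' pull the number out with the regex shown below
  End If
Next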

Regular Expressions

Set numbersRegex = CreateObject("vbscript.regexp")
With numbersRegex
  .IgnoreCase = True
  .MultiLine = False
  .Pattern = "[\d,\.]+"
  .Global = True
End With

' Helper patterns used below: one to collapse newlines, one to strip commas
Set newlineRegex = CreateObject("vbscript.regexp")
newlineRegex.Pattern = "[\r\n]+"
newlineRegex.Global = True
Set commaRegex = CreateObject("vbscript.regexp")
commaRegex.Pattern = ","
commaRegex.Global = True

' Strip new lines out of the innerText (used as opposed to the innerHtml to strip out html tags)
contents = newlineRegex.Replace(td.innerText, " ")
' Find the number of trades today
If InStr(contents, "Trades Today") > 0 Then
  Debug.Print contents
  Set tradeMatches = numbersRegex.Execute(contents)
  'Debug.Print tradeMatches.Count
  If tradeMatches.Count > 0 Then
    tradesToday = commaRegex.Replace(tradeMatches.Item(0), "")
    Debug.Print "~ " & tradesToday
  End If
End If
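
Once the commas are stripped, the matched string converts cleanly to a number for the worksheet. A one-line sketch (quoteRow is a hypothetical row index, and US-style number formatting on the source site is assumed):

' Write the parsed trade count into the quotes area (cell address is illustrative)
Cells(quoteRow, "C").Value = CDbl(tradesToday)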

A couple of other techniques I found useful for this particular project were reading, writing, searching, and formatting cells:

Reading Cells

Set s = Worksheets("Sheet 1").Range("B:B")
symTotal = 0
For Each sc In s.Cells
  If IsEmpty(sc) Then Exit For
  ' Count symbol rows, skipping the header and any row with a zero in column C
  If sc.Row <> 1 And Worksheets("Sheet 1").Cells(sc.Row, "C").Value <> 0 Then
    symTotal = symTotal + 1
  End If
Next

Writing Cells

Cells(1, 2).Value = "0 of 0"
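
The "0 of 0" text is one of the status cells; inside the scraping loop it can be refreshed so the user can watch progress (the counter names here are assumptions):

' Update the status cell and let Excel repaint it
Cells(1, 2).Value = done & " of " & symTotal
DoEvents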

Searching Cells

Set tickerCell = tr.Find(What:=symbol, LookIn:=xlValues, _
  LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
  MatchCase:=False, SearchFormat:=False)
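
One note on Range.Find: it returns Nothing when there is no match, so guard the result before using it (tr here is whatever Range holds the ticker symbols):

If tickerCell Is Nothing Then
  Debug.Print symbol & " not found"
Else
  Debug.Print symbol & " found at row " & tickerCell.Row
End If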

Formatting Cells

With s
  ' Reset the range: no fill, black text
  .Interior.ColorIndex = xlNone
  .Font.Color = RGB(0, 0, 0)
End With
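
The inverse is handy too; for example, a row whose scrape failed could be flagged visually (badRow and the colors are just an illustration):

' Highlight a problem row in red
With Worksheets("Sheet 1").Rows(badRow)
  .Interior.Color = RGB(255, 200, 200)
  .Font.Color = RGB(192, 0, 0)
End With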

As with anything, the devil is in the details; however, this should be enough to get started writing code to screen scrape in Excel.

