Screen scraping with Excel
Excel is a great tool for organizing data, but data entry is boring. I had never thought much about either of those facts until a client came to me looking to aggregate quite a bit of stock data into Excel. He had been using Excel for years, manually entering the information he needed.
Generally, when I need to scrape web pages for data, I write a script in some "P" language: PHP, Python, Perl. That, however, is not very friendly for the non-programmers among us.
That leaves running the scraper as Excel macros written in VBA. I honestly wasn't sure this was feasible at first, but after a little research and testing it actually works really well. Most people are familiar with formulas in Excel (e.g. =SUM(...), =COS(...), =AVERAGE(...), etc.), but less commonly known is the ability to extend Excel with a real programming language: Visual Basic for Applications, a.k.a. VBA. These extensions are called "macros". Macro-enabled Excel files are saved as *.xlsm, as opposed to the standard *.xlsx or *.xls. For security reasons, each time a macro-enabled Excel file is opened, macros must be enabled, usually via a security warning as seen below.
While I cannot give out the file or code, the basics are pretty simple once you know what to do. First, enable the Developer tab in the ribbon (right-click the ribbon, choose Customize the Ribbon, and check the Developer tab). Then set up your layout. I added a button and some status cells to Sheet 1; the bottom of Sheet 1 contained the stock quotes. Sheet 2 contained the list of stock symbols to be scraped. Finally, right-click the tab of the sheet you want to work on and click "View Code". In this developer interface, you can write VBA code attached to various events.
I used the click event on the button. With the button named "loadQuotes", I added the subroutine:
Private Sub loadQuotes_Click()
to the Sheet 1 Events.
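Tying it together, the click handler ends up as a loop over the symbol list. Here is a minimal sketch; the column layout and the scrapeSymbol helper are my assumptions for illustration, not the client's actual code:

```vb
' Runs when the "loadQuotes" button on Sheet 1 is clicked.
Private Sub loadQuotes_Click()
    Dim symbolCell As Range
    Dim symbol As String

    ' Assumed layout: Sheet 2 holds one symbol per row in column A
    For Each symbolCell In Worksheets("Sheet 2").Range("A:A").Cells
        If IsEmpty(symbolCell) Then Exit For
        symbol = symbolCell.Value
        ' scrapeSymbol is a hypothetical helper that fetches and
        ' parses the quote page for a single symbol
        scrapeSymbol symbol
    Next
End Sub
```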
The site we were scraping was very poorly formatted and not well suited to parsing out the data with regular expressions (regex) alone, though I did use regex on the contents of specific DOM elements. I decided to use a DOM processor to break the document down into something more easily searchable. Initially I tried using an Internet Explorer object to navigate to, load, and parse the document. It worked, but was unreliable; it had a tendency to crash after 150 or so pages.
Set doc = CreateObject("InternetExplorer.Application")
If doc Is Nothing Then
    MsgBox "For some reason I wasn't able to make an Internet Explorer object"
    Exit Sub
End If
doc.Visible = 0
url = "http://somestockquotewebsite.com/" & symbol
doc.navigate url
Do Until doc.readyState = 4
    DoEvents
Loop
I ended up using a combination of Microsoft XML HTTP (msxml2.xmlhttp) and HTML File (htmlFile). The former retrieves the page over HTTP; the latter parses the HTML into a searchable DOM. I suspect htmlFile is using Internet Explorer on some level, but it did not seem to have the same issues. In addition to being more reliable, this method turned out to be a little faster.
Using MS XML
Set msxml = CreateObject("msxml2.xmlhttp")
If msxml Is Nothing Then
    MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object"
    Exit Sub
End If
Set doc = CreateObject("htmlFile")
If doc Is Nothing Then
    MsgBox "For some reason I wasn't able to make a HTMLDocument object"
    Exit Sub
End If
url = "http://somestockquotewebsite.com/" & symbol
msxml.Open "GET", url, False
msxml.send
While msxml.readyState <> 4
    DoEvents
Wend
doc.body.innerHTML = msxml.responseText
Finding data using DOM
Set tables = doc.getElementsByTagName("table")
For Each tb In tables
    If tb.getAttribute("width") = 640 Then
        ' parse through the rows looking for the ones I needed
    End If
Next
After finding the table I needed, I looked through each of its child cells with the same technique. Luckily for me, the label for each piece of data was in the same cell as the data itself, so I was able to run a simple string match on the innerText of each cell and pull the value out with a regular expression.
Set numbersRegex = CreateObject("vbscript.regexp")
With numbersRegex
    .IgnoreCase = True
    .MultiLine = False
    .Pattern = "[\d,\.]+"
    .Global = True
End With

' Strip new lines out of the innerText (used as opposed to the innerHTML to strip out html tags)
contents = newlineRegex.Replace(td.innerText, " ")

' Find the number of trades today
If InStr(contents, "Trades Today") > 0 Then
    Set tradeMatches = numbersRegex.Execute(contents)
    If tradeMatches.Count > 0 Then
        tradesToday = commaRegex.Replace(tradeMatches.Item(0), "")
        Debug.Print "~ " & tradesToday
    End If
End If
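The newlineRegex and commaRegex objects used above are built the same way as numbersRegex. A sketch of how they might be set up; the exact patterns are my assumptions, but the intent (collapse line breaks, strip thousands separators) follows from how they are used:

```vb
' Hypothetical setup for the two helper regex objects used above
Set newlineRegex = CreateObject("vbscript.regexp")
With newlineRegex
    .Pattern = "[\r\n]+"   ' one or more line-break characters
    .Global = True
End With

Set commaRegex = CreateObject("vbscript.regexp")
With commaRegex
    .Pattern = ","         ' thousands separators inside numbers
    .Global = True
End With
```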
A couple of other operations I found useful for this particular project were reading, writing, searching, and formatting cells:
Set s = Worksheets("Sheet 1").Range("B:B")
symTotal = 0
For Each sc In s.Cells
    If IsEmpty(sc) Then Exit For
    If sc.Row <> 1 And Worksheets("Sheet 1").Cells(sc.Row, "C").Value <> 0 Then
        symTotal = symTotal + 1
    End If
Next

Cells(1, 2).Value = "0 of 0"

Set tickerCell = tr.Find(What:=symbol, LookIn:=xlValues, _
    LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
    MatchCase:=False)

With tickerCell
    .Interior.ColorIndex = xlNone
    .Font.Color = RGB(0, 0, 0)
End With
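The status cells mentioned earlier can be kept current as each symbol finishes. A minimal sketch, assuming the status lives in cell B1 and a symDone counter is incremented per scraped symbol (both are my assumptions):

```vb
' Update the "n of m" status cell after each symbol is scraped
symDone = symDone + 1
Worksheets("Sheet 1").Cells(1, 2).Value = symDone & " of " & symTotal
DoEvents ' let Excel repaint so the progress is visible mid-run
```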
As with anything, the devil is in the details; however, this should be enough to get started writing screen-scraping code in Excel.