How to Screen Scrape Web Data for Input into Diver

by | Dec 22, 2016 | General BI

Reading Time: 6 minutes

Diver accepts input from many sources, but did you know that you can screen scrape information from the Web to use as input? For example, do you want to integrate daily exchange rate data, weather data, or transportation schedules into a model or cBase?

Tweet: How to screen scrape web data for input into Diver

Paul Duggan, Software Development Manager at Dynamic Business Informatics Ltd. (DBI), recently presented a method to accomplish just this at our annual distributors’ conference. With a few steps, you can pull data from the web. In short, you can use a Google sheet to import the web page data, download the shared Google sheet to a local .csv file, and use the .csv file as input to Diver. Sound simple? Let’s give it a try!

Prerequisites

To create and share Google spreadsheets, you’ll need a Google account. If you don’t have one, sign up for this free account (or check if you already have one) here: https://support.google.com/accounts/answer/27441?hl=en

To use this method to download the data from a web page URL, the data must be contained in the page source in HTML table or list format. You can use your browser to view the HTML page source. (Check your browser documentation for how to view the HTML source. Some browsers have a menu option; other browsers use right mouse clicks or control key combinations.) With the HTML source open, check for <table> or <list> tags. If you find the data that you want in a table or list, you are good to go. If not, this method will not work to retrieve the data.

For example, compare the source of these two pages. You should see that the Seattle bus schedule is enclosed in <table> tags. However, the BART schedule is populated with scripting, so the HTML table is not hardcoded. Therefore, the BART schedule cannot be retrieved with this method.

To transfer the Google sheet to a local file in .csv format, you’ll need a utility, such as curl or GNU Wget, to transfer files over HTTPS. If you have another favorite file transfer utility, feel free to use it. Check these sites to get more information and download one of the utilities:

Now that we have the prerequisites covered, here are the steps:

Step 1: Identify the URL of the data you want to import

For this example, we’ll use the URL for Seattle SoundTransit weekday schedule for Route 510: https://www.soundtransit.org/schedules/st-express-bus/510/weekday

The weekday tab of the bus schedule displays in table format on the web page.

And the schedule data is included in a <table> in the HTML source, as you can see in this code snippet:

<table class=”timeTable” id=”timeTable”><thead><tr class=”headers”><th class=”headerData”>Route</th>

<th class=”headerData”>4th &amp; Jackson</th>

<th class=”headerData”>4th &amp; Pike</th>

<th class=”headerData”>I-5 &amp; NE 45th*</th>

<th class=”headerData”>I-5 &amp; NE 145th*</th>

<th class=”headerData”>Mountlake Terrace Bay 7</th> …

Step 2: Create a Google sheet

We’ll create a blank Google sheet to import the bus schedule into. Log into your Google account, navigate to spreadsheets, and create a new, blank spreadsheet: https://www.google.com/sheets/about/

Step 3: Import the data into the Google sheet

To import the bus schedule, we’ll use a Google sheet function, IMPORTHTML. Put your cursor in spreadsheet cell A1, and enter the function with the following parameters:

=IMPORTHTML(“<url>“, “<keyword>“, <num>)

Where:

  • <url> is the URL including the protocol “https://“ for the data you want to import (from Step 1). Enclose the URL in quotation marks.
  • <keyword> is either “list” or “table” from the HTML source structure.
  • <num> is the number of the occurrence, starting at one (1), of the list or table in the HTML source.

For example, to import the first table in the Seattle bus schedule HTML source, you would enter the following function:
=IMPORTHTML(“https://www.soundtransit.org/schedules/st-express-bus/510”, “table”, 1)

After a brief time to load, your spreadsheet should populate with the data from the URL.

Step 4: Create a shareable link for the Google sheet

Now, you want to share the spreadsheet so that you have the spreadsheet ID and you can download it. Share the spreadsheet, and copy the link with the sheet ID.

Step 5: Use GNU Wget or curl to export the Google sheet data to local file

With the URL of the shared Google sheet, we can use a command line tool, curl or Wget, to download the data as a .csv file. Both curl and Wget have lots of optional parameters you can play around with, but the important parameters to specify are the source location of the shared Google sheet and a target location and file type for the downloaded file.

For curl:
Here’s the command syntax:
C:\curl\src>curl –output <output> https://docs.google.com/spreadsheets/d/<sheetID>/export?gid=X&format=csv

Where:

  • –output <output> gives the drive and file name of the output file
  • <sheetID> is the Google sheet ID copied from the sharing URL link
  • gid=X specifies which tab, starting with zero (0), of the Google sheet has the data
  • format=csv specifies that the output file is in .csv format

Here’s the curl command to download our Seattle bus schedule example:
C:\curl\src>curl –output c:\test\output.curl.csv “https://docs.google.com/spreadsheets/d/1zSpPlowxuFr2mf_7nYpq3HN4OH-Pc_8b8liNoNThPY0/export?gid=0&format=csv”

For Wget:
If you prefer Wget, here’s the command syntax:
C:\wget>wget –output-document=<output> https://docs.google.com/spreadsheets/d/<sheetID>/export?gid=X&format=csv

Where:

  • –output-document=<output> gives the drive and file name of the output file
  • <sheetID> is the Google sheet ID copied from the sharing URL link
  • gid=X specifies which tab, starting with zero (0), of the Google sheet has the data
  • format=csv specifies that the output file is in .csv format

Here’s the Wget command to download our Seattle bus schedule example:
C:\wget>wget –output-document=c:\test\output.wget.csv https://docs.google.com/spreadsheets/d/1zSpPlowxuFr2mf_7nYpq3HN4OH-Pc_7b8liNoNThPY0/export?gid=0&format=csv

Whichever file transfer tool you use, you now have the web page data downloaded to a local .csv file. You can open the .csv file to check that the data is there.

Step 6: Use the local file as the file in input to a model or cBase build

Now that you have the web page data in a local .csv file, you can use it as Diver input. You can move the .csv file to the /client-files/ or other directory to use as the file in to your Integrator script. Define the dimensions, run the script, and you have a model with the web data as input!

Jean Collins

You may also like