Using Power Query to Extract Non-Tabular Data from Web Pages

Share on facebook
Share on twitter
Share on linkedin
Share on telegram
Share on whatsapp

Contents

Introduction

Ivan: I recently moved to Australia with my family. While looking for a place to live, I had to find a good suburb and school pair. I found interesting facts in https://www.myschool.edu.au/. But nevertheless, there is no option to compare schools. It only provides some statistics on Australian schools, one page at a time. Information presented on a typical school profile page (p. Not. Burwood Public School) it is as shown below.

img_5b333087a95a4-4304072

Means that, in the worst case, we can copy / paste data manually into Excel and perform our own analysis. Of course, this can only be done for a limited number of schools. Even for twenty schools it would take a little time. But nevertheless, today, when we have tools like Power Query (Get & Transform) in excel, we can collect data from websites efficiently and easily "upgradeable". Everything I describe and share in this post is created for non-commercial educational purposes only.

Obtaining data from a school

Take Burwood School for example. Copy browser URL

https://www.myschool.edu.au/SchoolProfile/Index/104736/BurwoodPublicSchool/41369.

And check the data of this URL in Excel.

img_5b2230a733a87-1237330

Note: I usually do all the queries in Excel. If required, I copy them / stuck in Power BI.

Power Query automatically detects all the tables available on the web page and presents options so that we can choose the one that interests us. When we need more than one table we need to create another query.

For this publication, let's say we are interested in the parameter "Source language other than English". But in this case, Power Query detects only one table on the school page, Not the data I want!

img_5b33345906a55-6562785

The data I want is just a value in a chart and not in a table (n. ° 1 then) Y, Thus, cannot be used for comparison purposes.

img_5b33352257942-8845362

Then, the required data is not in tabular format. You can only recover tables of web pages using Get data from the web. This means that we have to find a different way to get this data from a web page.. Fortunately, there is a way to do this and i share it with you in this post.

The method

And Google Chrome, we can To inspect elements of the web page by right-clicking and then selecting Inspect (n. ° 1 then).

img_5b333864336d5-1125391

In the HTML code below, we can see that “97%” (n. ° 1 then) is a text between the labels “” (n. ° 2 then) which in turn belongs to the main tag “” (n. ° 3 then ), and all this is part of "<div id = ”nonEnglishSpeakingStudents” ” (# 4 then). Probably, this is what we need.

img_5b3338fe920fa-8988053

This may seem scary, but these tags will help us find the data we want in the HTML text of the web page. Then, we must first get that text. Returning to Power Query, “From the web”, right click on the url (n. ° 1 then) and then click Edit (n. ° 2 then).

img_5b3339545828e-2670689

Power Query automatically wraps the content received from the web in the Web.Page function as shown below.

img_5b333990e588a-9786590

But what we need is just the text. So we have to change the settings “Open file as” a “Text file” in the Source pass. Click the gear to edit the source step.

img_5b3339dcb0471-1265728

When we use “Text file”, Power Query loads the html rows from the web page into a column.

img_5b333a01ea0df-8619053

We can filter the column to find the line that contains the text "nonEnglishSpeakingStudents".

img_5b333a3130ec6-4490417

The previous step leaves a single row in the table. All other rows have been removed. By selecting the only row left after filtering, we can see the “97%” required as part of saved text.

img_5b333a6544350-3301356

Now, we can easily extract this text using the standard Power Query function “Text between delimiters”.

img_5b333aa881a66-8194883

All we have to do is specify the start and end delimiters, as it's shown in the following. Very easy, no need to program your own functions, just use the user interface to help.

img_5b333ad21eeae-1984917

This adds a new column with the value we are looking for.

img_5b333b1d10c80-6195782

Get data for another school

In theory, if we change the URL with another school ID, it should work the same way, because the HTML text will have a similar structure on the other school's profile page.
Click on “Advanced editor” and look for the URL with the identification of the school in the language code 'M'.

img_5b333b5b9f9b8-1393976

To apply our query to another school's profile page, we need to change just the url.
Let's take, for instance, Concord Public School. I changed the query and now the query returned 69 (as it's shown in the following), which is the same as on the school page, it works!

img_5b333b8102cd7-2558912

Convert URL to parameter

Now it works, I will convert the url to a parameter instead of encoding it as a string. Better to create a parameter to make it more flexible. Home Manage parameters New parameter

img_5b333bf80ffc9-9295333

Fill in the required fields (Parameter name = URL, Text, Any value)

I use another school here (Meadowbank Public School https://www.myschool.edu.au/school/41257) to verify that the solution works regardless of the selected school.

img_5b333c2286445-8444819

Now that i have URL as parameterr, needed change Power Query code to point to parameter instead of hard-coded string.

For it, go to query source step settings. And select Parameter from the drop down list.

img_5b333c57364ad-5835002

Create a function from a query

The reason we started this task is to compare schools. So we need links for all school profile pages.

The most efficient way to do this in Power Query is to create a function from an existing query. In Power Query Editor, right-click the query and select “Create function”.

img_5b333cfd46dd5-3530866

Give your role a name.

img_5b333d2124f68-1261930

The function replicates the M code used in the initial query, but add an option to call itself with a parameter url. Note that the original query is still there, plus a new function.

img_5b333d4c10d41-7382756

Get URLs of multiple schools

The current version of the website allows you to search for schools, for instance, by suburb name or zip code.

img_5b333d8542517-5320757

While we search, in the browser we can see a URL that contains search parameters –
https://www.myschool.edu.au/school-search?FormPosted=True&SchoolSearchQuery=ryde&SchoolSector=G&SchoolType=P&State=NSW

Use the following parameters:
SchoolSearchQuery = ryde – for instance, suburbs around Ryde
SchoolSector = G – government
SchoolType = P – Primary
State = NSW
Then, if required, said string can be generated with a formula in Power Query.
In the same way as described above for 'Obtain data from a school', we can consult this web page for search results and extract useful information from the text.

For instance, we can get all the information shown on the search results page along with the URLs of the schools we require as parameters, as it's shown in the following.

img_5b333e5005d5d-5957985

Having the URL of each school, we can expand this table with information from each school profile page, for instance, with “% non-English speaking students” as it's shown in the following, calling the function as an aggregate column.

img_5b333e7ae5be5-3079163

That's. The information needed to compare schools is available now.

You can apply the same technique to any website where the data is published in non-tabular format.

Practice exercises

Now that you know how to query non-tabular data from the web, try to get the currency exchange rates on the following pages:

https://www.xe.com/currencytables/?from=AUD&date=2018-06-19

http://www.floatrates.com/daily/AUD.xml

https://www.exchange-rates.org/converter/AUD/EUR/1

https://www.x-rates.com/table/?from=AUD&amount=1

An easy way to get data on Australian schools

Although the intention of this post is to demonstrate what is possible with Power Query, in case you are curious to know more about Australian schools, there is an easier option to retrieve data from schools.
In the footer of https://www.myschool.edu.au website you can find a link to http://www.acara.edu.au/contact-us/acara-data-access where all data is available in Excel format.

img_5b33405c3ebc9-7944837

About the Author

Ivan Bondarenko (aka Ivan Bond) is a specialist in reports based on Excel and Power BI, vba developer, open source author SAP Business Objects Automation Tool Y Excel-based solution to schedule Excel file update (also known as Power Refresh).

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.