Introduction
Ivan: I recently moved to Australia with my family. While looking for a place to live, I had to find a good suburb and school pair. I found interesting facts in https://www.myschool.edu.au/. But nevertheless, there is no option to compare schools. It only provides some statistics on Australian schools, one page at a time. Information presented on a typical school profile page (p. Not. Burwood Public School) it is as shown below.
Means that, in the worst case, we can copy / paste data manually into Excel and perform our own analysis. Of course, this can only be done for a limited number of schools. Even for twenty schools it would take a little time. But nevertheless, today, when we have tools like Power Query (Get & Transform) in excel, we can collect data from websites efficiently and easily "upgradeable". Everything I describe and share in this post is created for non-commercial educational purposes only.
Obtaining data from a school
Take Burwood School for example. Copy browser URL
https://www.myschool.edu.au/SchoolProfile/Index/104736/BurwoodPublicSchool/41369.
And check the data of this URL in Excel.
Note: I usually do all the queries in Excel. If required, I copy them / stuck in Power BI.
Power Query automatically detects all the tables available on the web page and presents options so that we can choose the one that interests us. When we need more than one table we need to create another query.
For this publication, let's say we are interested in the parameter "Source language other than English". But in this case, Power Query detects only one table on the school page, Not the data I want!
The data I want is just a value in a chart and not in a table (n. ° 1 then) Y, Thus, cannot be used for comparison purposes.
Then, the required data is not in tabular format. You can only recover tables of web pages using Get data from the web. This means that we have to find a different way to get this data from a web page.. Fortunately, there is a way to do this and i share it with you in this post.
The method
And Google Chrome, we can To inspect elements of the web page by right-clicking and then selecting Inspect (n. ° 1 then).
In the HTML code below, we can see that “97%” (n. ° 1 then) is a text between the labels “” (n. ° 2 then) which in turn belongs to the main tag “” (n. ° 3 then ), and all this is part of "<div id = ”nonEnglishSpeakingStudents” ” (# 4 then). Probably, this is what we need.
This may seem scary, but these tags will help us find the data we want in the HTML text of the web page. Then, we must first get that text. Returning to Power Query, “From the web”, right click on the url (n. ° 1 then) and then click Edit (n. ° 2 then).
Power Query automatically wraps the content received from the web in the Web.Page function as shown below.
But what we need is just the text. So we have to change the settings “Open file as” a “Text file” in the Source pass. Click the gear to edit the source step.
When we use “Text file”, Power Query loads the html rows from the web page into a column.
We can filter the column to find the line that contains the text "nonEnglishSpeakingStudents".
The previous step leaves a single row in the table. All other rows have been removed. By selecting the only row left after filtering, we can see the “97%” required as part of saved text.
Now, we can easily extract this text using the standard Power Query function “Text between delimiters”.
All we have to do is specify the start and end delimiters, as it's shown in the following. Very easy, no need to program your own functions, just use the user interface to help.
This adds a new column with the value we are looking for.
Get data for another school
In theory, if we change the URL with another school ID, it should work the same way, because the HTML text will have a similar structure on the other school's profile page.
Click on “Advanced editor” and look for the URL with the identification of the school in the language code 'M'.
To apply our query to another school's profile page, we need to change just the url.
Let's take, for instance, Concord Public School. I changed the query and now the query returned 69 (as it's shown in the following), which is the same as on the school page, it works!
Convert URL to parameter
Now it works, I will convert the url to a parameter instead of encoding it as a string. Better to create a parameter to make it more flexible. Home Manage parameters New parameter
Fill in the required fields (Parameter name = URL, Text, Any value)
I use another school here (Meadowbank Public School https://www.myschool.edu.au/school/41257) to verify that the solution works regardless of the selected school.
Now that i have URL as parameterr, needed change Power Query code to point to parameter instead of hard-coded string.
For it, go to query source step settings. And select Parameter from the drop down list.
Create a function from a query
The reason we started this task is to compare schools. So we need links for all school profile pages.
The most efficient way to do this in Power Query is to create a function from an existing query. In Power Query Editor, right-click the query and select “Create function”.
Give your role a name.
The function replicates the M code used in the initial query, but add an option to call itself with a parameter url. Note that the original query is still there, plus a new function.
Get URLs of multiple schools
The current version of the website allows you to search for schools, for instance, by suburb name or zip code.
While we search, in the browser we can see a URL that contains search parameters –
https://www.myschool.edu.au/school-search?FormPosted=True&SchoolSearchQuery=ryde&SchoolSector=G&SchoolType=P&State=NSW
Use the following parameters:
SchoolSearchQuery = ryde – for instance, suburbs around Ryde
SchoolSector = G – government
SchoolType = P – Primary
State = NSW
Then, if required, said string can be generated with a formula in Power Query.
In the same way as described above for 'Obtain data from a school', we can consult this web page for search results and extract useful information from the text.
For instance, we can get all the information shown on the search results page along with the URLs of the schools we require as parameters, as it's shown in the following.
Having the URL of each school, we can expand this table with information from each school profile page, for instance, with “% non-English speaking students” as it's shown in the following, calling the function as an aggregate column.
That's. The information needed to compare schools is available now.
You can apply the same technique to any website where the data is published in non-tabular format.
Practice exercises
Now that you know how to query non-tabular data from the web, try to get the currency exchange rates on the following pages:
https://www.xe.com/currencytables/?from=AUD&date=2018-06-19
http://www.floatrates.com/daily/AUD.xml
https://www.exchange-rates.org/converter/AUD/EUR/1
https://www.x-rates.com/table/?from=AUD&amount=1
An easy way to get data on Australian schools
Although the intention of this post is to demonstrate what is possible with Power Query, in case you are curious to know more about Australian schools, there is an easier option to retrieve data from schools.
In the footer of https://www.myschool.edu.au website you can find a link to http://www.acara.edu.au/contact-us/acara-data-access where all data is available in Excel format.
About the Author
Ivan Bondarenko (aka Ivan Bond) is a specialist in reports based on Excel and Power BI, vba developer, open source author SAP Business Objects Automation Tool Y Excel-based solution to schedule Excel file update (also known as Power Refresh).