In 2018 I wrote an article showing how to extract data from websites using Power Query in Excel. For simple websites, this process is very simple. For more complex websites, the process is not always so easy. Microsoft is working to improve the experience “from the web” and is developing new capabilities. New Power Query features will always be delivered as part of Power BI before being available in Excel. In this article, I will cover a new Power Query feature that is only available in Power BI and that makes the process of extracting data from complex web pages much easier. The new function is named “new web table inference”.
Complex data extraction from web pages
Extracting tabular data from a web page previously only worked fine if there was an underlying HTML table presenting the data on the screen. The function “new web table from inference” is a step change improvement in the way Power Query extracts data. In summary, Power Query looks at the final rendered web page on the screen, then parse the code behind (php, css, js, etc.) to find out how to extract the correct data. This feature is still being developed and, Thus, will not work for all web pages, but it's already pretty good.
Before you can use this function, you must enable it in the preview functions settings. Go to File Options Options & Settings Preview Features and then turn it on as shown below.
An example of the new web table from inference
In this article, I will show you how to use this new preview function to extract data from websites using the same example I used in my previous article. The data I want to extract is the number of views of the posts by topic in the PowerPivot Forum that I own and operate. When I searched for my posts on the website, I found 1.067 coincidences. As you can see in the image below, these publications are delivered to me 10 publications per website in 107 websites. For this demonstration, I will extract a table containing the forum name, the topic and the number of visits (# 2 then) from the first page.
Of course, it is possible to create a function and extract the 1.067 articles, but i have covered it in other posts.
Before activating a new web table from inference
If I try to extract the list of articles from the previous forum without using the new preview function, this is what Power Query presents to me.
Notice how Power Query cannot identify the tabular structure of the page and, However, show a single table (n. ° 1 up) which is just some technical HTML data (n. ° 2 up). This standard approach will not solve the problem and, Thus, I will need to use the new preview function.
Enter the new web table from the inference preview function
With the new function, there is 2 options to extract the table I am looking for.
As you can see in the image below, now there is 7 suggested new tables that could be imported. This is the new feature: infer the tables by analyzing the page. Now, if one of these tables contains everything you need, you can simply start with the suggested table and make any necessary changes from there. This will work fine as long as one of the suggested tables contains a superset of everything you need, without missing anything. If at least one column of data is missing, you will need to use the other approach and create your own table.
As it turns out in my demo example, could have used the Table 1 anterior, but instead I will use the second approach.
Using the option “From the web for example”
This option allows you to create your own table showing Power Query what to expect. This feature has been available for a while, but it's continually improving. Even if you used it before and it didn't work for you, worth taking another look.
To use the manually created table, click on “Add table using examples” in the lower left corner (# 1 then).
After clicking, you will be presented with a spreadsheet style layout where you can write examples of what you want to see in the final table (see lower half of image below). Please note that Intellisense assists you and presents a list of possible values as you begin to enter data; just select the one you want from the list of options presented.
In the picture above, I am extracting the name of the forum (# 1 up) in the column 1. To do this, I need to write the exact name of the forum, then choose the correct representation I want from the list. Try to avoid writing them in their entirety: any misspelling or typo will prevent Power Query from completing the task, so it is better to choose from the list.
Once this is done, I did the same with the topic (n. ° 2) and the views (n. ° 3) as you can see below.
It is common that you need to enter a second row of sample data so Power Query can be sure of the pattern of the web data. You must choose the next consecutive entry on the web page, otherwise the process will not correctly extract all the data from the page. The image below shows that I have selected the values for the first few 2 columns in the second row: Power BI populated the value of the third column and also the rest of the table (the others 8 rows). Once the table is auto populated with the correct values, you can be sure that Power Query has correctly identified the pattern.
Now you can simply click “To accept” and load the data.
I really like this feature and I think it is a great example of how Microsoft is improving the end user experience by using artificial intelligence techniques under the hood.. I think you can look forward to many more features like this in the future.