Web Scraping con Selenium Python


Introduction

Machine learning powers today's technological wonders, such as driverless cars, space flight, and image and voice recognition. However, a data science professional needs a large volume of data to build a robust and reliable machine learning model for such business problems.


Data mining, or data collection, is one of the earliest steps in the data science life cycle. Depending on business requirements, you may have to collect data from sources such as servers, logs, databases, APIs, SAP, web pages, or online repositories.

Web scraping tools like Selenium can scrape a large volume of data, such as text and images, in a relatively short time.

Table of Contents

  1. What is web scraping?
  2. Why web scraping?
  3. What is the use of web scraping?
  4. What is Selenium?
    1. Settings and tools
  5. Implementing image scraping using Selenium Python
  6. Headless Chrome browser
  7. Putting it all together
  8. Final notes

What is Web Scraping?

Web scraping, also called “crawling” or “spidering”, is the technique of automatically collecting data from an online source, generally a website. Although web scraping is an easy way to obtain a large volume of data in a relatively short time, it adds stress to the server where the source is hosted.

This is also one of the main reasons why many websites do not allow you to scrape everything on their site. However, as long as scraping does not disrupt the primary function of the online source, it is generally acceptable.
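Whether a site allows scraping is usually spelled out in its robots.txt file, so it is worth checking before you scrape. As a quick sketch (the rules below are invented for illustration, not taken from any real site), Python's standard library can evaluate a path against such rules:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    # Parse a robots.txt body and check whether the path may be fetched
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

rules = """User-agent: *
Disallow: /search
Allow: /images
"""

print(is_allowed(rules, "my-scraper", "/search"))  # False
print(is_allowed(rules, "my-scraper", "/images"))  # True
```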

Why Web Scraping?

There is a large volume of data on the web that people can use to meet business needs. Therefore, some tool or technique is needed to collect this information from the web, and that's where the concept of web scraping comes into play.

What is the use of Web Scraping?

Web scraping can help us extract huge amounts of data about customers, products, people, stock markets, and more.

Data collected from web portals, such as e-commerce sites, job portals, or social media channels, can be used to understand customers' buying patterns, employee attrition behavior, customer sentiment, and so on.

The most popular libraries and frameworks used for web scraping in Python are BeautifulSoup, Scrapy, and Selenium.

In this post, we will talk about web scraping using Selenium in Python. And as the cherry on top, we'll see how to collect images from the web that you can use to build training data for your deep learning project.

What is Selenium?

Selenium is an open-source, web-based automation tool. Selenium is primarily used for testing in industry, but it can also be used to scrape the web. We will use the Chrome browser, but you can try it with any browser; it is almost the same.


Now let's see how to use Selenium for Web Scraping.

Settings and tools

  1. Installation:
    • Install selenium using pip
      pip install selenium
  2. Download the Chrome driver:
    To download web drivers, you can select any of the following methods:
    1. You can directly download the Chrome driver from the following link:
      https://chromedriver.chromium.org/downloads
    2. Or you can download it directly using the following line of code:
      driver = webdriver.Chrome(ChromeDriverManager().install())

You can find the complete Selenium documentation here. The documentation is self-explanatory, so be sure to read it to take full advantage of Selenium with Python.

The following methods will help us find elements on a web page (each of these methods returns a list):

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector
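Each of these methods takes a selector string. For instance, the class-based XPath pattern used later in this post can be assembled from plain strings; the `class_xpath` helper below is my own illustration, not part of Selenium:

```python
def class_xpath(tag: str, cls: str) -> str:
    # Build an XPath matching any <tag> whose class attribute contains cls,
    # the same shape of selector that find_elements_by_xpath expects
    return f"//{tag}[contains(@class,'{cls}')]"

print(class_xpath("img", "Q4LuWd"))  # //img[contains(@class,'Q4LuWd')]
```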

Now, let's write some Python code to extract images from the web.

Implementing Image Scraping Using Selenium Python

Step 1: Import libraries

import os
import selenium
from selenium import webdriver
import time
from PIL import Image
import io
import requests
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException

Step 2: Install the driver

#Install Driver
driver = webdriver.Chrome(ChromeDriverManager().install())

Step 3: Specify the search URL

#Specify Search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur:fc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"

driver.get(search_url.format(q='Car'))

I've used this specific URL so you don't get into trouble for using copyrighted or licensed images. Otherwise, you can also use https://google.com as a search URL.

Then we search for 'Car' using our search URL. Paste the link into the driver.get("Your link here") function and run the cell. This will open a new browser window for that link.
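The {q} in the search URL is a placeholder filled in by Python's str.format. A shortened version of the URL (extra query parameters trimmed for readability) shows the substitution:

```python
# Simplified version of the templated search URL used above
search_url = "https://www.google.com/search?q={q}&tbm=isch"

# str.format replaces the {q} placeholder with the search term
print(search_url.format(q='Car'))  # https://www.google.com/search?q=Car&tbm=isch
```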

Step 4: Scroll to the bottom of the page

#Scroll to the end of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)#sleep_between_interactions

These lines of code scroll to the bottom of the page. We then wait 5 seconds so that we don't run into problems trying to read elements from the page that have not yet loaded.
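A single scroll is enough here, but pages that lazy-load content keep growing as you scroll. One common pattern, sketched below as my own illustration rather than code from this post, is to scroll repeatedly until the page height stops changing:

```python
import time

def scroll_until_stable(driver, pause=2, max_rounds=10):
    # Scroll repeatedly; stop once document.body.scrollHeight stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```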

Step 5: Locate the images to be scraped from the page

#Locate the images to be scraped from the current page
imgResults = driver.find_elements_by_xpath("//img[contains(@class,'Q4LuWd')]")
totalResults = len(imgResults)

Now we will look for all the image links present on that particular page. We will create a list to save those links. To do that, go to the browser window, right-click on the page and select 'Inspect element', or enable the developer tools with Ctrl + Shift + I.

Now identify an attribute such as class or id that is common to all these images.

In our case, class="Q4LuWd" is common to all of them.

Step 6: Extract the corresponding link of each image

As we can see, the images displayed on the page are still thumbnails, not the original images. So, to download each image, we need to click on each thumbnail and extract the relevant information for that image.

#Click on each image to extract its corresponding link to download

img_urls = set()
for i in range(0, len(imgResults)):
    img = imgResults[i]
    try:
        img.click()
        time.sleep(2)
        actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
        for actual_image in actual_images:
            if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                img_urls.add(actual_image.get_attribute('src'))
    except (ElementClickInterceptedException, ElementNotInteractableException) as err:
        print(err)

In the above code snippet, we carry out the following tasks:

  • Iterate over each thumbnail and click on it.
  • Put our browser to sleep for 2 seconds (:P).
  • Look for the unique HTML tag corresponding to that image to locate it on the page.
  • We still get more than one result for a particular image, but we are only interested in the link to download that image.
  • So we iterate through each result for that image, extract its 'src' attribute, and check whether 'https' is present in the 'src', since web links regularly start with 'https'.
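The set-and-'https' filtering in that loop can be seen in isolation with plain strings (the sample values below are invented; note this sketch uses startswith, a slightly stricter test than the 'https'-substring check in the loop):

```python
def collect_image_urls(srcs):
    # Keep only real https links; the set silently absorbs duplicates
    img_urls = set()
    for src in srcs:
        if src and src.startswith('https'):
            img_urls.add(src)
    return img_urls

candidates = [
    'https://example.com/a.jpg',
    'data:image/png;base64,iVBO...',  # inline thumbnail data, skipped
    None,                             # missing src attribute, skipped
    'https://example.com/a.jpg',      # duplicate, dropped by the set
]

print(collect_image_urls(candidates))  # {'https://example.com/a.jpg'}
```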

Step 7: Download and save each image to the destination directory

os.chdir('C:/Qurantine/Blog/WebScrapping/Dataset1')
baseDir = os.getcwd()
for i, url in enumerate(img_urls):
    file_name = f"{i}.jpg"
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        continue

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')

        file_path = os.path.join(baseDir, file_name)

        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")

And with that, you have extracted the images for your project 😀

Note: Once you've written the correct code, the browser window is not essential; you can collect data without one using what is called a headless browser. To do so, replace the earlier driver setup with the following code.

Headless Chrome Browser

#Headless chrome browser
from selenium import webdriver
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)

In this case, the browser runs in the background without opening a visible window, which is very useful when deploying a solution in production.

Let's put all this code into functions to make it more organized, and implement the same idea to download 100 images for each category (for example, cars, horses).

And this time we will write our code using the headless Chrome idea.

Putting it all together:

Step 1: Import all the necessary libraries

import os
import selenium
from selenium import webdriver
import time
from PIL import Image
import io
import requests
from webdriver_manager.chrome import ChromeDriverManager

os.chdir('C:/Qurantine/Blog/WebScrapping')

Step 2: Install the Chrome driver

#Install driver
opts = webdriver.ChromeOptions()
opts.headless = True

driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)

In this step, we install the Chrome driver and use a headless browser to scrape the web.

Step 3: Specify the search URL

search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur:fc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Car'))

I have used this specific URL to extract royalty-free images.

Step 4: Write a function to scroll to the bottom of the page

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)#sleep_between_interactions

This code snippet will scroll down the page.

Step 5: Write a function to get the URL of each image

#no license issues

def getImageUrls(name,totalImgs,driver):
    
    search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur:fc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
    driver.get(search_url.format(q=name))
    img_urls = set()
    img_count = 0
    results_start = 0  
    
    while img_count < totalImgs:  # Extract actual images now
        
        scroll_to_end(driver)
        
        thumbnail_results = driver.find_elements_by_xpath("//img[contains(@class,'Q4LuWd')]")
        totalResults=len(thumbnail_results)
        print(f"Found: {totalResults} search results. Extracting links from {results_start}:{totalResults}")
        
        for img in thumbnail_results[results_start:totalResults]:
            
            img.click()
            time.sleep(2)
            actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                    img_urls.add(actual_image.get_attribute('src'))
            
            img_count=len(img_urls)
            
            if img_count >= totalImgs:
                print(f"Found: {img_count} image links")
                break
            else:
                print("Found:", img_count, "looking for more image links ...")                
                load_more_button = driver.find_element_by_css_selector(".mye4qd")
                driver.execute_script("document.querySelector('.mye4qd').click();")
                results_start = len(thumbnail_results)
    return img_urls

This function returns a set of image URLs for the given category (for example, cars, horses, etc.).

Step 6: Write a function to download each image

def downloadImages(folder_path, file_name, url):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')

        file_path = os.path.join(folder_path, file_name)

        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")

This code snippet downloads the image from each URL.

Step 7: Write a function to save each image to the destination directory

def saveInDestFolder(searchNames, destDir, totalImgs, driver):
    for name in list(searchNames):
        path = os.path.join(destDir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
        print('Current Path', path)
        totalLinks = getImageUrls(name, totalImgs, driver)
        print('totalLinks', totalLinks)

        if not totalLinks:
            print('images not found for :', name)
            continue
        else:
            for i, link in enumerate(totalLinks):
                file_name = f"{i}.jpg"
                downloadImages(path, file_name, link)

searchNames = ['Car', 'horses']
destDir = './Dataset2/'
totalImgs = 5

saveInDestFolder(searchNames, destDir, totalImgs, driver)

This code snippet saves each image to the destination directory.

Final notes

I've tried my best to explain web scraping using Selenium with Python in the simplest way possible. Feel free to comment with your queries; I will be more than happy to answer them.

You can clone my GitHub repository to download all the code and data. Click here!

About the Author


Praveen Kumar Anwla

I have worked as a data scientist with product-based companies and Big 4 audit firms for almost 5 years. I have been working on various NLP, machine learning, and state-of-the-art deep learning frameworks to solve business problems. Please feel free to check out my personal blog, where I cover topics from machine learning and artificial intelligence to chatbots and visualization tools (Tableau, QlikView, etc.), and various cloud platforms like Azure, IBM and AWS.
