A Complete Tutorial on Web Scraping Using Python: Full Codes Included

Step-by-step tutorial to use Beautiful Soup and Selenium for Web Scraping

Need to gather data from websites quickly and at scale? Web scraping is the answer. Read this article to learn how to use Python for web scraping.

Ask anyone what the biggest and best source of information (or misinformation) in the entire world is; the answer would most likely be “the internet,” and quite rightfully so. The world wide web is a vast repository of information where you can find anything and everything you desire, as long as you know the right methods. This is where web scraping comes into play.

Web scraping is the method by which an individual can gather, analyze, and dissect raw data from websites – regardless of their type and location. Interestingly, the zealous and ambitious community of Python developers took on the responsibility of creating tools and libraries to help web scrapers with their tasks, and they succeeded.

Nowadays, Python is one of the best tools for scraping the web. This article is intended as a comprehensive tutorial for beginners looking for effective web scraping techniques using Python.

Take Help From Professionals

First things first: if you are willing and have the means to hire an organization for your web scraping purposes, you will not have to worry about developing the web scraping system from scratch. All you need is a web scraping API to integrate with your system, and the rest will be taken care of by the company selling you the API.

Furthermore, these organizations are adept at troubleshooting your existing attempts at web scraping and fixing any issues on your behalf. They can also teach you the proper methods and provide hands-on training if required. Thus, for immediate results, taking expert advice is the way to go.
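
Most commercial scraping APIs follow the same basic pattern: you send the target URL to the provider’s endpoint, authenticated with your key, and receive the rendered HTML back. The sketch below illustrates the idea; the endpoint and parameter names are made up for illustration, so substitute the ones from your provider’s documentation.
import requests

# Hypothetical endpoint and key names, for illustration only
API_ENDPOINT = 'https://api.example-scraper.com/v1/scrape'
API_KEY = 'YOUR-API-KEY'

response = requests.get(
    API_ENDPOINT,
    params={'api_key': API_KEY, 'url': 'https://www.imdb.com/chart/top/'},
)
print(response.status_code) # 200 means the provider fetched the page successfully
print(response.text[:500]) # The returned HTML, ready for parsing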

Full Web Scraping Tutorial Using Python

To establish a proper, efficient web scraper, you have to go step by step. First, install Python on your system. Next, inspect the target website to find the elements you want to scrape. For data extraction, multiple Python libraries can be used, with Beautiful Soup being the best of the lot; for content rendered with JavaScript, the Selenium automation tool handles the rendering. Finally, all the data can be saved into a CSV or JSON file. The following sections deal with each of these steps in detail.

Install Python

Many computer systems come with Python pre-installed these days. Check if your machine already has it installed by running the following command.
python3 --version
If Python is installed, this command prints the installed version number.

Install the Necessary Libraries and Tools

The required tools and libraries are not many, although there are plenty of options to choose from. For selecting specific data from websites, we prefer the Python library called Beautiful Soup. To install this package, run the following command.
pip3 install beautifulsoup4
The same procedure is followed to install Selenium, an automation tool we will use to render dynamically loaded content. The exact command is given below.
pip3 install selenium
Make sure you have Google Chrome and its matching ChromeDriver installed on your system, as Selenium requires both for scraping.
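
A quick way to confirm that both libraries installed correctly is to import them and print their version strings; this check is not part of the scraper itself.
import bs4
import selenium

# If either import fails, re-run the corresponding pip3 command above
print(bs4.__version__)
print(selenium.__version__)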

Inspect the Page

As a lot of coding is involved, let us work with a concrete example: scraping the top ten movies from IMDb’s famous Top 250 chart. The initial course of action is to obtain the titles. Once done, the relevant data for each movie is extracted from the website.

Right-click on the landing page and select “Inspect Element.” A developer tools panel will open, where you need to find the <table> tag. If you have difficulty finding it, press CTRL+F within the panel, enter the tag name, and let the browser search it out for you. The CSS selector we will rely on at this stage is table tbody tr td.titleColumn a, which matches the anchor inside each row’s “titleColumn” cell.

All the titles can be extracted with this selector by pulling out each anchor’s innerText. A single line of JavaScript, run in the browser console, displays the first extracted title. The code is given below.

document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText

Extract the Statically Loaded Content Using Beautiful Soup

Content on the internet is either statically or dynamically loaded. Static content is reasonably straightforward to extract because it is delivered with the initial HTML and does not depend on JavaScript execution. The following lines of code are an example of using Beautiful Soup to extract static content from IMDb.
import requests
from bs4 import BeautifulSoup
 
page = requests.get('https://www.imdb.com/chart/top/') # Calling the HTML through request
soup = BeautifulSoup(page.content, 'html.parser') # Parsing the content using the library
 
links = soup.select("table tbody tr td.titleColumn a") # Selecting all the titled anchors
first10 = links[:10] # Holding onto the first 10 anchors only
for anchor in first10:
    print(anchor.text) # Displaying the innerText of each anchor
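
Note that some sites, IMDb included, may reject requests that arrive without a browser-like User-Agent header. If the request above comes back empty or with an error status, try passing explicit headers; the header value below is just one illustrative example of a common browser string.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'} # Any common browser UA string works
page = requests.get('https://www.imdb.com/chart/top/', headers=headers)
page.raise_for_status() # Raises an exception if the site refused the request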

Extract the Dynamically Loaded Content Using Selenium

Now let us look at how dynamically loaded content is scraped from a website – IMDb in this case.

Go to the Editorial Lists section. If you inspect the page, you will find an element whose data-testid attribute is set to firstListCardGroup-editorial. However, if you search for it in the page source, you will not find it anywhere. That is because the website loads this entire section dynamically in the browser.

The following lines of code scrape the editorial lists and combine the output with the previously scraped statically loaded content. Several additional imports are included for browser automation and explicit waits.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
option = webdriver.ChromeOptions()
# These arguments are for running headless Chrome, e.g. under Windows Subsystem for Linux (WSL)
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
driver = webdriver.Chrome(service=Service('YOUR-PATH-TO-CHROMEDRIVER'), options=option)

driver.get('https://www.imdb.com/chart/top/') # Loading the page in the headless browser
soup = BeautifulSoup(driver.page_source, 'html.parser') # Parsing the rendered page source
 
totalScrapedInfo = [] # This list contains and saves all the information we scrape
links = soup.select("table tbody tr td.titleColumn a") # Selecting all the titled anchors
first10 = links[:10] # Holding onto the first 10 anchors only
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href']) # Accessing each movie's page; href is site-relative
    infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0] # Finding the first element with class 'ipc-inline-list'
    informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']") # Finding all elements with role='presentation' inside it
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    } # Saving all the scraped information in a dictionary
    WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']"))) # Waiting up to 5 seconds for the element with data-testid='firstListCardGroup-editorial' to appear
    listElements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial'] .listName") # Extracting the editorial list elements
    listNames = [] # Creating an empty list and then appending only the elements texts
    for el in listElements:
        listNames.append(el.text)
    scrapedInfo['editorial-list'] = listNames # Adding the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo) # Appending the dictionary to the totalScrapedInformation list
    
print(totalScrapedInfo) # Displaying the list with all the information we scraped
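
Once the loop finishes, it is good practice to close the browser session so the headless Chrome process does not linger in the background.
driver.quit() # Shutting down the headless browser and freeing its resources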

Save the Scraped Content

The final step is to save all the scraped data in JSON or CSV format so that it is readable and directly usable. Python’s built-in csv and json modules are used in this final code section, which is given below.
import csv
import json

# Writing the JSON file
with open('movies.json', mode='w', encoding='utf-8') as file:
    json.dump(totalScrapedInfo, file)

# Writing the CSV file; newline='' prevents blank rows on Windows
with open('movies.csv', mode='w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['title', 'year', 'duration', 'editorial-list']) # Header row matching the dictionary keys
    for movie in totalScrapedInfo:
        writer.writerow(movie.values())
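
To confirm the export worked, you can load the JSON file back and inspect it:
with open('movies.json', encoding='utf-8') as file:
    movies = json.load(file)
print(len(movies)) # Should print 10, one entry per scraped movie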

We believe you will now be able to scrape almost any form of data from the web. Make sure you stay within legal and ethical bounds: do not scrape sites that should remain off-limits to the public, such as a national defense website or a hospital system holding confidential patient information, and respect each site’s terms of service and robots.txt file; a quick programmatic check is shown below.
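
Python’s standard library can check a site’s robots.txt for you before you scrape. A minimal sketch using urllib.robotparser:
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('https://www.imdb.com/robots.txt')
parser.read()
# can_fetch returns True only if the site's robots.txt allows this user agent to fetch the URL
print(parser.can_fetch('*', 'https://www.imdb.com/chart/top/'))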

Final Thoughts

The internet has revolutionized the way we do business in this day and age. Gone are the days of manually combing through tons of content to retrieve the desired information. Nowadays, with the right skill set, almost everything is accessible with the utmost ease.

With the advent of libraries dedicated to web scraping, programming languages like Python have become more capable than ever. Thus, web scraping is significantly more accessible than it used to be. Just write the Python code, set the correct parameters, and you are good to go!

Tharindu

Hey!! I'm Tharindu. I'm from Sri Lanka. I'm a part-time freelancer, and this is my blog where I write about everything I think might be useful to readers. If you read a tutorial here and want to hire me, contact me here.
