Disappointed by the lack of easy access to data on essential websites? Web scraping is the answer. Read this article to learn how to use Python for web scraping.
Ask anyone what the biggest and best source of information (or misinformation) in the entire world is; the answer would likely be “the internet,” and rightfully so. The world wide web is a vast repository of information where you can find anything and everything you desire, as long as you know the right methods. This is where web scraping comes into play.
Web scraping is the method by which an individual can gather, analyze, and dissect raw data from websites – regardless of their type and location. Interestingly, the zealous and ambitious community of Python developers took it upon themselves to create various tools and libraries to help web scrapers with their tasks, and they succeeded.
Nowadays, Python is one of the best tools for scraping the web. This article is intended as a comprehensive tutorial for beginners looking to learn practical web scraping techniques with Python.
Take Help From Professionals
First things first: if you are willing and able to hire an organization for your web scraping purposes, you will not have to worry about developing a web scraping system from scratch. All you need is a web scraping API to integrate with your system, and the rest will be taken care of by the company selling you the API.
Furthermore, these organizations are adept at troubleshooting your existing attempts at web scraping and fixing any issues on your behalf. They will also teach you the proper methods and provide hands-on training if required. Thus, for immediate results, taking expert advice is the way to go. A rough idea of what such an integration looks like is sketched below.
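For illustration only, here is a minimal sketch of calling such a scraping API from Python with the requests library. The endpoint, parameters, and key below are hypothetical placeholders, not any real provider's API; consult your vendor's documentation for the actual interface.

import requests

# Hypothetical scraping API endpoint and parameters (placeholders, not a real provider):
API_ENDPOINT = 'https://api.example-scraper.com/v1/scrape'
response = requests.get(API_ENDPOINT, params={
    'api_key': 'YOUR-API-KEY',                  # key issued by the provider
    'url': 'https://www.imdb.com/chart/top/',   # the page you want scraped
    'render_js': 'true',                        # many providers can render JavaScript for you
})
print(response.json())                          # typically the page content or structured data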
Full Web Scraping Tutorial Using Python
To build a proper, efficient web scraper, you have to go step by step. First, you need to install Python on your system. After that, you need to inspect the target website so you know exactly what to scrape. For data extraction after successful access, multiple Python libraries can be used, with Beautiful Soup being the best of the lot. JavaScript rendering takes time, but once it is done, all the data can be saved into a CSV or JSON file. In simple terms, the following sections comprehensively deal with using Python for web scraping purposes.
Install Python
First, check whether Python is already installed by running the following command in a terminal.

python3 --version

This will display the installed version – if Python is installed in the first place. If not, download and install it from python.org before proceeding.
Install the Necessary Libraries and Tools
Beautiful Soup, the parsing library used in this tutorial, is installed through pip, Python's package installer.

pip3 install beautifulsoup4

The same procedure is followed for installing Selenium, the browser automation tool used to render dynamically loaded content. The exact command is given below.
pip3 install selenium

Make sure you have Google Chrome and its driver (chromedriver) integrated into your system, as Selenium requires them for scraping.
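As a quick sanity check – a minimal sketch, assuming chromedriver is installed and reachable on your PATH – you can confirm that both libraries import correctly and that Selenium can drive Chrome:

import bs4
from selenium import webdriver

print(bs4.__version__)                     # Confirms beautifulsoup4 is importable

option = webdriver.ChromeOptions()
option.add_argument('--headless')          # Run Chrome without opening a window
driver = webdriver.Chrome(options=option)  # Assumes chromedriver is on your PATH
driver.get('https://www.imdb.com/')
print(driver.title)                        # Prints the page title if everything is wired up
driver.quit()                              # Always close the browser when done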
Inspect the Page
As a lot of coding is involved, let us assume that we are scraping the top ten movies from IMDb's famous Top 250 list. The initial course of action is to obtain the titles. Once that is done, the relevant data for each movie is extracted from the website.
Right-click on the landing page and select “Inspect Element.” A separate interface will open up where you need to find the <table> tag. If you have difficulty finding it, press CTRL+F, enter the text, and let the browser search it out for you. The notable selector at this stage of the process is table tbody tr td.titleColumn a, which targets the anchors inside the table cells carrying the class “titleColumn”.
All the titles can be extracted with this selector by pulling out each anchor’s innerText. A single line of JavaScript code, run in the browser console, displays the first extracted title. The code is given below.
document.querySelectorAll("table tbody tr td.titleColumn a")[0].innerText
Extract the Statically Loaded Content Using Beautiful Soup
The following script fetches the chart page, parses it with Beautiful Soup, and prints the first ten titles.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.imdb.com/chart/top/')  # Fetching the HTML with a GET request
soup = BeautifulSoup(page.content, 'html.parser')       # Parsing the content using the library
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all the title anchors
first10 = links[:10]                                    # Holding onto the first 10 anchors only
for anchor in first10:
    print(anchor.text)                                  # Displaying the innerText of each anchor
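If the plain request is blocked or returns unexpected markup, sending a browser-like User-Agent header often helps. The header value below is illustrative; whether it is needed at all depends on the site's policies.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # Illustrative browser-like header
page = requests.get('https://www.imdb.com/chart/top/', headers=headers)
page.raise_for_status()  # Fail fast on HTTP errors instead of parsing an error page
soup = BeautifulSoup(page.content, 'html.parser')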
Extract the Dynamically Loaded Content Using Selenium
Now let us look at how dynamically loaded content is scraped from a website – IMDb in this case.
Go to the Editorial Lists section of a movie's page. If you inspect the page, you will encounter an element with its data-testid attribute set to firstListCardGroup-editorial. However, if you try to locate it in the page source, you will not find it anywhere. That is because the website loads this entire section dynamically in the browser.
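You can verify this for yourself with a quick check – a minimal sketch, using The Shawshank Redemption's page as an example – by fetching the raw HTML with requests and searching for the marker:

import requests

html = requests.get('https://www.imdb.com/title/tt0111161/').text  # Raw HTML, no JavaScript executed
print('firstListCardGroup-editorial' in html)  # Expected to be False: the section is injected by JavaScript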
The following lines of code are used to scrape the editorial list and combine the output with the previously scraped statically loaded content. Several more imports are included, mainly Selenium helpers for waiting until the dynamic elements appear.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
# These options are for Windows Subsystem for Linux (WSL).
option.add_argument('--headless')
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
# Replace YOUR-PATH-TO-CHROMEDRIVER with your chromedriver location
driver = webdriver.Chrome('YOUR-PATH-TO-CHROMEDRIVER', options=option)
driver.get('https://www.imdb.com/chart/top/')  # Loading the chart page in the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Parsing the rendered content using the library
totalScrapedInfo = []  # This list contains and saves all the information we scrape
links = soup.select("table tbody tr td.titleColumn a")  # Selecting all the title anchors
first10 = links[:10]  # Holding onto the first 10 anchors only
for anchor in first10:
    driver.get('https://www.imdb.com' + anchor['href'])  # Accessing the movie's page
    infolist = driver.find_elements_by_css_selector('.ipc-inline-list')[0]  # Finding the first element with class 'ipc-inline-list'
    informations = infolist.find_elements_by_css_selector("[role='presentation']")  # Finding all elements with role='presentation' inside it
    scrapedInfo = {
        "title": anchor.text,
        "year": informations[0].text,
        "duration": informations[2].text,
    }  # Saving all the scraped information in a dictionary
    WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "[data-testid='firstListCardGroup-editorial']"))
    )  # Waiting up to 5 seconds for the element with data-testid set to firstListCardGroup-editorial
    listElements = driver.find_elements_by_css_selector("[data-testid='firstListCardGroup-editorial'] .listName")  # Extracting the editorial list elements
    listNames = []  # Creating an empty list and appending only the elements' texts
    for el in listElements:
        listNames.append(el.text)
    scrapedInfo['editorial-list'] = listNames  # Adding the editorial list names to our scrapedInfo dictionary
    totalScrapedInfo.append(scrapedInfo)  # Appending the dictionary to the totalScrapedInfo list
print(totalScrapedInfo)  # Displaying the list with all the information we scraped
driver.quit()  # Closing the browser once scraping is complete
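Note that Selenium 4 removed the find_elements_by_* helper methods used above. If pip installed a recent Selenium version for you, the equivalent calls go through the By class already imported in the script; a sketch of the substitution:

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the element lookups used above:
infolist = driver.find_elements(By.CSS_SELECTOR, '.ipc-inline-list')[0]
informations = infolist.find_elements(By.CSS_SELECTOR, "[role='presentation']")

Similarly, recent Selenium versions expect the chromedriver path to be wrapped in a Service object rather than being passed as the first positional argument to webdriver.Chrome.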
Save the Scraped Content
Finally, the scraped data can be written to a JSON file and a CSV file.

import csv
import json

with open('movies.json', mode='w', encoding='utf-8') as file:
    file.write(json.dumps(totalScrapedInfo))  # Saving the whole list as JSON

with open('movies.csv', mode='w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    for movie in totalScrapedInfo:
        writer.writerow(movie.values())  # Writing each movie's values as one CSV row

We believe you will now be able to scrape almost any form of data from the web. Make sure you do not commit any illegal acts by attempting to scrape information from sites that should stay off-limits to regular civilians. Fitting examples are a national defense website or a hospital website containing confidential information on patients.
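As a quick sanity check – a minimal sketch assuming the files above were written successfully – you can load the JSON back and inspect the first record:

import json

with open('movies.json', encoding='utf-8') as file:
    movies = json.load(file)
print(movies[0])  # The first movie's title, year, duration, and editorial lists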
Final Thoughts
The internet has revolutionized the way we conduct business in this day and age. Gone are the days of manually combing through tons of content to retrieve the desired information. Nowadays, with the right skill set, almost everything is accessible with the utmost ease.
With the advent of new libraries dedicated to web scraping purposes, programming languages like Python are benefiting greatly. Thus, web scraping is significantly more accessible than it ever used to be. Just write the Python code, set the correct parameters, and you are good to go!