Python Web Scraping: A Beginner’s Guide
Delve into the realm of web scraping. Harness the power of Python, using our guide, to extract valuable data from websites and make informed business decisions.
Web scraping is a technique that allows you to extract and collect data from websites automatically. You can scrape data from product reviews, social media posts, contact information, and other web content.
You should use web scraping when you wish to access large amounts of data from the internet quickly. Use the extracted data for market research, lead generation, sentiment analysis, price monitoring, data intelligence, and machine learning model training. A big advantage of gathering scraped data is that you can easily store it in a spreadsheet or database for later analysis.
Web scraping has several benefits. First, it’s quite fast. You can download huge amounts of information from multiple websites quickly.
Second, web scraping is cost-effective. A simple scraper can perform numerous tasks that could have otherwise required an organization to hire extra staff.
Third, web scraping has a high level of flexibility. You can easily modify a script that collects data on a particular site to perform other scraping tasks.
Use an API or web scraping tool such as ParseHub or Octoparse to collect information from the internet. Alternatively, if you want complete control over the scraping process, consider creating your own script or bot from scratch using a popular programming language like Python.
This tutorial will provide a step-by-step guide on how to create a web scraping bot using the Python programming language.
- Find a website URL
- Inspect the HTML structure
- Set up the coding environment
- Understand and install Python web scraping libraries
- Create a project folder and file
- Import libraries
- Add a URL and perform a fetch request
- Extract data from the HTML file
- Save scraped data to a file
- Review final code
1. Find a website URL
Before starting your web scraping journey, you must have a target website where you wish to scrape or download data.
Be careful when selecting a website because many sites don’t permit scraping bots; these sites can get you into trouble or cause your software to malfunction. Some websites also use JavaScript, which may be a problem for your scraping software if not handled correctly.
Read through the terms and conditions of your target website to know how to proceed.
We’ll use the ParseHub URL for this web scraping project.
--CODE language-markup line-numbers--
https://parsehub.com/sandbox/showtimes
2. Inspect the HTML structure
Once you’ve identified a target website, the next step is to inspect and evaluate the HTML structure, which determines how a website appears.
Browsers provide developer tools you can use to inspect how websites work. We’ll use Google Chrome’s built-in tools for this tutorial.
Open ParseHub's website in your browser. The web page should look like this.
The web page contains a list of movies. Each item on the page has a movie title, image, and time. We’ll be scraping the movie name and showtime.
Let’s use developer tools to inspect this web page. If you’re on Chrome, click on the three dots at the top-right corner. Then, scroll down to More Tools > Developer Tools.
In the Developer Tools page, navigate to the Elements tab.
When you go through the HTML content on the developer tools page, you’ll notice that all elements have an ID or class name. Some of the class names include image and title. Take note of these class names; we’ll need to specify them in our Python code later.
3. Set up the coding environment
Now that we have a clear picture of the web URL and HTML structure, let’s set up our development environment.
Download and install Python from the official website. We’ll use the latest version (Python 3.10.8) for this tutorial.
You also need to install a code editor, such as Visual Studio Code. A code editor helps create, modify, and save program files easily. Furthermore, a code editor can highlight any errors that arise in your code, which boosts your productivity in the long run.
Refer to the official docs on how to install Python and Visual Studio Code if you experience any installation challenges.
4. Understand and install Python web scraping libraries
To create a web scraping script, we need to import the following libraries into our application.
- Requests. This Python package allows us to send HTTP requests and work with the responses.
- BeautifulSoup4. This library allows us to extract information from HTML and XML files. We’ll use the beautifulsoup4 library to pull the movie title and showtime from ParseHub’s showtimes page.
- Pandas. This Python library is used for data manipulation and analysis. In this tutorial, we’ll use pandas to store our data in a CSV file.
To install the requests, pandas, and beautifulsoup4 libraries, launch your terminal and execute the following command.
--CODE language-markup line-numbers--
pip install requests pandas beautifulsoup4
Your terminal should look like this once the packages finish installing.
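Optionally, you can verify that the three packages installed correctly by importing them in a quick Python script; this sketch assumes the pip command above completed without errors.

```python
# Optional sanity check: confirm each package imports cleanly.
import requests
import bs4
import pandas

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("pandas", pandas.__version__)
```

If any import fails with a ModuleNotFoundError, re-run the pip command and make sure it targets the same Python installation you use to run your scripts.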
5. Create a project folder and file
On your desktop, create a new folder and give it a name. In this tutorial, we’ll name it “web-scraper.”
We’ll store all of our project’s files in this folder. Open the folder in your code editor.
Next, create a new file in the folder and name it “scraper.py.” This file will contain our code for the scraping bot.
6. Import libraries
We need to import the beautifulsoup4, requests, and pandas libraries in the scraper.py file. Add the following lines to your Python file.
--CODE language-markup line-numbers--
import pandas as pd # The pandas library for data analysis and manipulation.
import requests # Requests for making network connections.
from bs4 import BeautifulSoup # For extracting data from HTML and XML docs.
7. Add a URL and perform a fetch request
You need to add a web URL (the website where you’ll scrape data) and perform a fetch request. We’ll do so using the following code.
--CODE language-markup line-numbers--
web_url = "https://parsehub.com/sandbox/showtimes" # Target website
fetched_page = requests.get(web_url) # Fetching the page
Before going further, let’s test if the application is working. In this case, we’ll print out the retrieved web page with the following line.
--CODE language-markup line-numbers--
print(fetched_page.text)
Open your terminal and run the following command to start the Python script.
--CODE language-markup line-numbers--
python scraper.py
You should see the following output if your application is working as expected.
--CODE language-markup line-numbers--
<div>
<div class="movielist">
<div class="header">Movie Showtimes (1 - 10 of 80 movies)</div>
<br>
<ul style="list-style-type: none;">
<li class="borderbox">
<div style="display: flex; justify-content: flex-start" class="movie">
<span css="image">
<img alt="The Shawshank Redemption" src="https://upload.wikimedia.org/wikipedia/en/8/81/ShawshankRedemptionMoviePoster.jpg">
</span>
<span>
<a class="title" href="/sandbox/moviedetails?movie=The Shawshank Redemption">The Shawshank Redemption</a>
<div class="theatres">
<div class="showtimes">
<span class="theatre">IMAX 3D: </span>
<span class="borderbox showtime imax first">06:00 PM</span>
</div>
<div class="showtimes">
<span class="theatre">Regular: </span>
<span class="borderbox showtime regular first">06:00 PM</span>
<span class="borderbox showtime regular first">06:15 PM</span>
<span class="borderbox showtime regular first">09:30 PM</span>
<span class="borderbox showtime regular first">09:45 PM</span>
</div>
</div>
</span>
</div>
</li>
<li class="borderbox">
<div style="display: flex; justify-content: flex-start" class="movie">
<span class="image">
<img alt="Schindler's List" src="https://upload.wikimedia.org/wikipedia/en/3/38/Schindler%27s_List_movie.jpg">
</span>
<span>
<a class="title" href="/sandbox/moviedetails?movie=Schindler's List">Schindler's List</a>
<div class="theatres">
<div class="showtimes">
<span class="theatre">IMAX 3D: </span>
<span class="borderbox showtime imax other">06:15 PM</span>
</div>
<div class="showtimes">
<span class="theatre">Regular: </span>
<span class="borderbox showtime regular other">06:15 PM</span>
<span class="borderbox showtime regular other">06:30 PM</span>
<span class="borderbox showtime regular other">09:45 PM</span>
<span class="borderbox showtime regular other">10:00 PM</span>
</div>
</div>
</span>
</div>
Since the application works, you can add a # before the print(fetched_page.text) statement to prevent it from running in future executions. In Python, a # marks a comment, and the interpreter ignores comments when running your code.
8. Extract data from the HTML file
We’ll use the beautifulsoup library to pull data from the retrieved HTML page.
Let’s parse the HTML page using the beautifulsoup object.
--CODE language-markup line-numbers--
beautifulsoup = BeautifulSoup(fetched_page.text, "html.parser")
In the above code, we’re using Python’s built-in HTML parser (html.parser) alongside beautifulsoup to parse the fetched text.
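To see what the parser does in isolation, here’s a minimal, self-contained sketch that parses a small hypothetical snippet (modeled on the site’s markup, not fetched from it) the same way:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical snippet mimicking the site's markup.
html = '<a class="title" href="/sandbox/moviedetails?movie=Example">Example Movie</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find('a', 'title')   # First <a> tag with the class "title"
print(link.string)               # Example Movie
print(link.get('href'))          # /sandbox/moviedetails?movie=Example
```

Once the page is parsed into a soup object, methods like find and find_all let you query elements by tag name and class, which is exactly what we do next.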
To extract specific values, we need to know how they were defined in the HTML code, including their CSS selectors. In this web scraping tutorial, we’re extracting the movie title, image URL, and showtime.
Here is how these attributes are defined in the HTML.
Movie title and image URL.
--CODE language-markup line-numbers--
<a class="title" href="/sandbox/moviedetails?movie=The Shawshank Redemption">The Shawshank Redemption</a>
Showtime.
--CODE language-markup line-numbers--
<span class="borderbox showtime imax first">06:00 PM</span>
We’ll use for loops to extract our target values (movie title, image URL, and showtime).
The movie title and image URL can be found in the <a> HTML tags, which have the class name title. Here’s the loop for extracting the movie title and image URL.
--CODE language-markup line-numbers--
for movie in beautifulsoup.find_all('a', 'title'):
    print(movie.string)      # Movie name
    print(movie.get('href')) # Image url
The IMAX showtime for each movie can be found in a <span> tag with the class name imax. Here’s the for loop for extracting the showtime.
--CODE language-markup line-numbers--
for showtime in beautifulsoup.find_all('span', 'imax'):
    print(showtime.string)
You can use the find method (e.g., beautifulsoup.find) instead of find_all if you only wish to extract a single matching element.
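The difference is that find returns only the first matching element, while find_all returns every match. A quick sketch on a hypothetical pair of showtime spans:

```python
from bs4 import BeautifulSoup

# Two hypothetical showtime spans for illustration.
html = ('<span class="borderbox showtime imax first">06:00 PM</span>'
        '<span class="borderbox showtime imax first">06:15 PM</span>')
soup = BeautifulSoup(html, "html.parser")

# find returns only the first match; find_all returns a list of all matches.
print(soup.find('span', 'imax').string)                   # 06:00 PM
print([s.string for s in soup.find_all('span', 'imax')])  # ['06:00 PM', '06:15 PM']
```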
When you run the above code, you should see the output below.
--CODE language-markup line-numbers--
The Shawshank Redemption
/sandbox/moviedetails?movie=The Shawshank Redemption
Schindler's List
/sandbox/moviedetails?movie=Schindler's List
The Godfather
/sandbox/moviedetails?movie=The Godfather
The Godfather: Part II
/sandbox/moviedetails?movie=The Godfather: Part II
The Dark Knight
/sandbox/moviedetails?movie=The Dark Knight
Pulp Fiction
/sandbox/moviedetails?movie=Pulp Fiction
The Good, the Bad and the Ugly
/sandbox/moviedetails?movie=The Good, the Bad and the Ugly
12 Angry Men
/sandbox/moviedetails?movie=12 Angry Men
The Lord of the Rings: The Return of the King
/sandbox/moviedetails?movie=The Lord of the Rings: The Return of the King
Fight Club
/sandbox/moviedetails?movie=Fight Club
06:00 PM
06:15 PM
06:30 PM
06:45 PM
07:00 PM
07:15 PM
07:30 PM
07:45 PM
08:00 PM
08:15 PM
At this point, we’re just printing the values we’ve scraped from the website—which isn’t helpful.
In the next section, we’ll learn how to save these values to a file using the pandas library.
9. Save scraped data to a file
To save data to a file, you first need to create lists that will store the scraped data temporarily.
Add the following lists to your Python code just before the for loops.
--CODE language-markup line-numbers--
titles = []
urls = []
time = []
Next, we need to modify the for loops to ensure each scraped value is stored in a specific list rather than printed. We implement this functionality as follows.
--CODE language-markup line-numbers--
for movie in beautifulsoup.find_all('a', 'title'):
    titles.append(movie.string)
    urls.append(movie.get('href'))
    # print(movie.string)
    # print(movie.get('href'))

for showtime in beautifulsoup.find_all('span', 'imax'):
    time.append(showtime.string)
    # print(showtime.string)
We need to define a dictionary to store all of our lists. This is done as follows.
--CODE language-markup line-numbers--
raw_data = {
    'movie_title': titles,
    'show_time': time,
    'image_url': urls
}
Now that we have the scraped data in lists and inside a dictionary, let’s add it to a dataframe using the pandas library. A dataframe allows us to view our data in columns, just like in a spreadsheet.
We add the raw_data dictionary and specify column names, as shown below.
--CODE language-markup line-numbers--
dataframe = pd.DataFrame(raw_data, columns=['movie_title', 'show_time', 'image_url'])
Print the dataframe and run the script to see how the data looks on your terminal.
Here’s how the output should look.
We can save the data to a CSV file using the following line.
--CODE language-markup line-numbers--
dataframe.to_csv('raw_data.csv', index=False)
If you add the above line and run your code, a CSV file containing your data will be generated and saved in your project folder.
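If you want to preview the CSV output without writing a file, to_csv also accepts any writable buffer. Here’s a small sketch using a hypothetical two-row dataframe:

```python
import io

import pandas as pd

# Hypothetical sample rows to illustrate the CSV layout.
raw_data = {
    'movie_title': ['The Shawshank Redemption', "Schindler's List"],
    'show_time': ['06:00 PM', '06:15 PM'],
    'image_url': ['/sandbox/moviedetails?movie=The Shawshank Redemption',
                  "/sandbox/moviedetails?movie=Schindler's List"]
}
dataframe = pd.DataFrame(raw_data)

buffer = io.StringIO()
dataframe.to_csv(buffer, index=False)  # Same call as to_csv('raw_data.csv', ...)
print(buffer.getvalue())
```

The first line of the output is the header row (movie_title,show_time,image_url), followed by one comma-separated line per movie.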
10. Review final code
Here’s the entire code for the simple scraping bot.
--CODE language-markup line-numbers--
import pandas as pd
import requests
from bs4 import BeautifulSoup

web_url = "https://parsehub.com/sandbox/showtimes"
fetched_page = requests.get(web_url)
# print(fetched_page.text)

beautifulsoup = BeautifulSoup(fetched_page.text, "html.parser")

titles = []
urls = []
time = []

for movie in beautifulsoup.find_all('a', 'title'):
    titles.append(movie.string)
    urls.append(movie.get('href'))

for showtime in beautifulsoup.find_all('span', 'imax'):
    time.append(showtime.string)

raw_data = {
    'movie_title': titles,
    'show_time': time,
    'image_url': urls
}

dataframe = pd.DataFrame(raw_data, columns=['movie_title', 'show_time', 'image_url'])
dataframe.to_csv('raw_data.csv', index=False)
print(dataframe)
Get help from a web scraping expert
In this tutorial, we learned how to create a simple scraping bot using Python. The bot can fetch movie details and store them in a CSV file. Python has numerous libraries, including requests, beautifulsoup, selenium, scrapy, and pandas, that make it easy to develop scraping software.
Scraping data from different web applications can save significant time and effort. You can use the scraped data for data science, machine learning, lead generation, price monitoring, research, and several other activities.
If you wish to engage in web scraping but lack adequate time or skills, you can access the help you need on Upwork. Get started by meeting and hiring web scraping experts today.
If you’re a professional looking for work, Upwork provides a platform where you can sell your services and work with hundreds of clients.
Upwork is not affiliated with and does not sponsor or endorse any of the tools or services discussed in this article. These tools and services are provided only as potential options, and each reader and company should take the time needed to adequately analyze and determine the tools or services that would best fit their specific needs and situation.