TDM 20200: Project 02 — 2024

Motivation: Web scraping is the process of extracting content from websites, and it typically goes hand-in-hand with parsing or processing the data. Scraping data from websites has always been a popular topic in The Data Mine. We will use the website books.toscrape.com to practice scraping skills.

Context: In the previous project, we gently introduced XML and XPath to parse an XML document. In this project, we introduce web scraping and learn some basic scraping skills using BeautifulSoup.

Scope: Python, web scraping, BeautifulSoup

Learning Objectives
  • Understand webpage structures

  • Use BeautifulSoup to scrape data from web pages

Readings and Resources

  • Make sure to read about and use the template found here, and read the important information about project submissions here.

  • This link will provide you with more information about the Python requests library.

  • This link will provide you with more information about BeautifulSoup.

Questions

Question 1 (2 points)

  1. Please use BeautifulSoup to get and display the HTML source code of the website https://books.toscrape.com.

  2. Review the website’s HTML source code. What is the title for that webpage?

You may refer to the following code to import the libraries. Modify it to fit your own code.

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com"  # define url (the site named in part 1)
response = requests.get(url)  # download the page
soup = BeautifulSoup(response.content, 'html.parser')  # parse the HTML
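Once a page has been parsed, displaying its source and reading its title takes only a couple of calls. The following is a minimal sketch, assuming the bs4 package is installed. To stay self-contained it parses a small, hypothetical HTML string (sample_html is made up for illustration and is not the real site's source); in the project, the HTML would come from response.content instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a downloaded page (NOT the real site's HTML).
sample_html = """
<html>
  <head><title>  Sample Book Shop  </title></head>
  <body><h1>Welcome</h1></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parsed document as an indented string, handy for display.
pretty_source = soup.prettify()
print(pretty_source)

# soup.title is the <title> tag; get_text(strip=True) drops surrounding whitespace.
page_title = soup.title.get_text(strip=True)
print(page_title)
```

Real pages often pad the title text with newlines and spaces, which is why the sketch strips whitespace before printing.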

Question 2 (2 points)

  1. Please use the BeautifulSoup library to get and display all category names from the homepage of the website.

  • Review the page source code and find the categories in the sidebar, under a "ul" tag whose class includes "nav-list". The BeautifulSoup "select" method is useful for getting the category links, like this:

soup.select('.nav-list li a')
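The "select" call above returns a list of tag objects, so the names still have to be pulled out of each tag. Here is a minimal sketch of that step, assuming bs4 is installed. The sample_html string is a hypothetical, shortened imitation of a sidebar and is only there so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sidebar markup, shortened for illustration (not copied from the site).
sample_html = """
<div class="side_categories">
  <ul class="nav nav-list">
    <li><a href="catalogue/category/books_1/index.html"> Books </a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">
              Travel
            </a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html">
              Mystery
            </a></li>
      </ul>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# ".nav-list li a" is a CSS selector: every <a> inside an <li> inside class "nav-list".
category_tags = soup.select(".nav-list li a")

# get_text(strip=True) removes the newlines and spaces around each name.
category_names = [tag.get_text(strip=True) for tag in category_tags]
print(category_names)
```

Note that the selector also matches the top-level "Books" entry, so whether to keep or drop it is a separate decision.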

Question 3 (2 points)

  1. Now, instead of only getting the names of the categories, get all of the category links from the homepage as well.

    • Review the homepage source code and explore where the category links are located. You may use "find" to get the "div" section (note the trailing underscore in class_, since class is a reserved word in Python):

    soup.find('div', class_='side_categories')
    • Under the "div" section, the links can be found in the "a" tags. You may use "find_all" to get all category links:

      find_all('a')
    • You may refer to the following code to exclude "Books" from the category list, since it is not itself a category:

    # Assume "link" holds a category link
    link.get('href').startswith("catalogue/category/books/")
    • The output will look like:

    Category: Travel,link: catalogue/category/books/travel_2/index.html
    Category: Mystery,link: catalogue/category/books/mystery_3/index.html
    ....
  2. Update the code from Question 3a to get only the link for the "Romance" category.

  • The output will look like:

romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html
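The steps in this question can be sketched end to end as follows. This is only one possible approach, assuming bs4 is installed. The sample_html string is a hypothetical, shortened sidebar, and the names base_url and category_links are chosen here for illustration. It also uses urljoin from the standard library to turn a relative link into a full URL, which is an alternative to joining strings by hand.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"

# Hypothetical sidebar markup, shortened for illustration (not copied from the site).
sample_html = """
<div class="side_categories">
  <ul class="nav nav-list">
    <li><a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">Travel</a></li>
        <li><a href="catalogue/category/books/romance_8/index.html">Romance</a></li>
      </ul>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
sidebar = soup.find("div", class_="side_categories")

# Map category name -> relative link, skipping the top-level "Books" entry,
# whose href does not start with "catalogue/category/books/".
category_links = {}
for link in sidebar.find_all("a"):
    href = link.get("href", "")
    if href.startswith("catalogue/category/books/"):
        category_links[link.get_text(strip=True)] = href

for name, href in category_links.items():
    print(f"Category: {name},link: {href}")

# Build the absolute URL for one category from its relative link.
romance_url = urljoin(base_url, category_links["Romance"])
print("romance_url is", romance_url)
```

Storing the links in a dictionary keyed by name makes the single-category lookup a one-liner; a plain loop with an if test on the name would work just as well.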

Question 4 (2 points)

  1. Use the "Romance" link https://books.toscrape.com/catalogue/category/books/romance_8/index.html from Question 3b to get the webpage source code for the Romance category web page.

  2. Display all book titles on the first page of the Romance category.
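When listing titles, it helps to check in the page source whether the visible link text is the full title or a shortened one. The sketch below assumes bs4 is installed and assumes each book sits in an "article" tag with class "product_pod", with the full title stored in the link's "title" attribute. Verify this against the actual page source before relying on it. The sample_html string and the book names in it are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup with made-up titles (not copied from the site).
sample_html = """
<section>
  <article class="product_pod">
    <h3><a href="first_1/index.html"
           title="A Hypothetical Romance With A Very Long Title">A Hypothetical Romance ...</a></h3>
  </article>
  <article class="product_pod">
    <h3><a href="second_2/index.html"
           title="Another Made-Up Book">Another Made-Up Book</a></h3>
  </article>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

book_titles = []
for anchor in soup.select("article.product_pod h3 a"):
    # Prefer the "title" attribute; fall back to the visible text if it is missing.
    book_titles.append(anchor.get("title") or anchor.get_text(strip=True))

for title in book_titles:
    print(title)
```

Reading the attribute instead of the text avoids titles that end in "..." when the page truncates long names for display.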

Question 5 (2 points)

If you look at this page:

you can see, in the lower-right-hand corner, that the link to the second page is:

Now temporarily forget that you know this fact! We want you to try to find this page-2 link in the Romance book page.

  1. Starting with http://books.toscrape.com/catalogue/category/books/romance_8/index.html, please find the page 2 link from the Romance category web page using BeautifulSoup.

    The following is some sample code, for your reference.

    # Remove the last part (index.html) from the base url
    url = "http://books.toscrape.com/catalogue/category/books/romance_8/index.html"
    url = url.rsplit('/', 1)[0]
    # Assume the next-page hyperlink you found is "category/page-2.html"; keep only the last part
    # (the variable is named next_href because "next" is a Python built-in function)
    next_href = "category/page-2.html"
    next_link = next_href.split("/")[-1]
    # Combine
    next_url = url + "/" + next_link
  2. List the titles of all of the books from the second page of the "Romance" category.
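As an alternative to splitting and rejoining the URL by hand, the standard library's urljoin can resolve a relative link against the current page's address. The sketch below assumes bs4 is installed and assumes the pager marks the next-page link with an "li" tag of class "next". Check the real page source to confirm. The sample_html string is a hypothetical pager fragment.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "http://books.toscrape.com/catalogue/category/books/romance_8/index.html"

# Hypothetical pager markup for illustration (not copied from the site).
sample_html = """
<ul class="pager">
  <li class="current">Page 1 of 2</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first match, or None when there is no next page.
next_anchor = soup.select_one("li.next a")

if next_anchor is not None:
    # urljoin replaces the last path segment (index.html) with the relative href.
    next_url = urljoin(page_url, next_anchor["href"])
else:
    next_url = None

print(next_url)
```

Checking for None matters on the last page of a category, where there is no next link to follow.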

Project 02 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project02.ipynb

  • Submit files through Gradescope

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.