Intro to Web Scraping With Beautiful Soup

In the previous post we learned about Jane and how she spent hours going through thousands of job listings, ending up with only a handful that she could apply to. We also learned that by automating this process with the help of a web scraper, she could extract the information that she is looking for easily while avoiding several hours of manual effort. In this post, let’s understand how we can perform web scraping with the help of the Python library, Beautiful Soup, so that we can help Jane in the quest to find her ideal job.

Step 1: Install Beautiful Soup

Open the terminal and navigate into the directory where you are creating this project (I’m calling mine intro-to-web-scraping). Once you are in the desired directory, activate the virtual environment for this project (create one first if you haven’t already) and then run the following command:

pip install beautifulsoup4
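
To confirm that the installation worked, a quick check like the one below can help. It is only a sketch and assumes it runs inside the same virtual environment where the package was installed; the variable name installed_version is made up for illustration.

```python
import bs4

# The package installs under the module name bs4 and exposes its version string
installed_version = bs4.__version__
print(installed_version)
```

If this raises ModuleNotFoundError, the virtual environment is probably not activated.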

Step 2: Import Beautiful Soup

Create a Python file inside your project directory and import the Beautiful Soup library by writing the following line of code:

from bs4 import BeautifulSoup

Before we proceed to step 3, let’s create a variable to hold the HTML that we want to scrape:

BASIC_HTML = '''
<HTML>
  <head></head>
  <body>
    <h1>Intro to Web Scraping With Beautiful Soup</h1>
    <p>Let's learn how!</p>
    <p class="steps">Follow the following steps:</p>
    <ul>
      <li>Install Beautiful Soup</li>
      <li>Import Beautiful Soup</li>
      <li>Instantiate Beautiful Soup</li>
      <li>Extract Data Using Beautiful Soup</li>
    </ul>
  </body>
</HTML>
'''

Here, we have created a string variable called BASIC_HTML which contains the HTML that we will parse, search and format. This simple HTML contains a head and a body which in turn contains a header, two paragraphs and an unordered list with four list items. In the next few steps, let’s give this HTML over to Beautiful Soup for parsing.

Step 3: Instantiate Beautiful Soup

bs = BeautifulSoup(BASIC_HTML, 'html.parser')

In the above line of code, we are creating a Beautiful Soup instance called bs by passing the following parameters:

  1. The document that we want it to parse
  2. A string representing the type of parser that we want it to use.

While Beautiful Soup can parse both HTML and XML, we have passed the string 'html.parser' so that Beautiful Soup uses the HTML parser that ships with Python’s standard library to parse this document.
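
As a small illustration of what the parser does for us, the sketch below feeds it a shortened document (SAMPLE_HTML is a stand-in for BASIC_HTML, not part of the original code) that also uses an upper-case HTML tag. With 'html.parser', tag names are converted to lower case, so we can search for them in lower case.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for BASIC_HTML, with the same upper-case <HTML> tag
SAMPLE_HTML = '<HTML><body><h1>Hello</h1></body></HTML>'

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The parser lower-cases tag names, so 'html' matches <HTML>
root_name = soup.find('html').name
heading_text = soup.find('h1').string

print(root_name)     # html
print(heading_text)  # Hello
```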

Step 4: Extract Data Using Beautiful Soup

Now that we have our Beautiful Soup instance, let’s use it to extract the header tag from the HTML using the following code:

h1_tag = bs.find('h1')
print(h1_tag)

The find method searches through the document and returns the first occurrence of the desired element, even if there are multiple elements with the same tag. Since our HTML only has a single element with the h1 tag, it returns that.

Output:

<h1>Intro to Web Scraping With Beautiful Soup</h1>
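
It is worth knowing how find behaves at the edges. The sketch below uses a made-up snippet (SAMPLE_HTML is an assumption, not our BASIC_HTML) with two h1 tags and no h2 tag, to show that find returns the first match and returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two <h1> tags and no <h2> tag
SAMPLE_HTML = '<body><h1>First heading</h1><h1>Second heading</h1></body>'
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find stops at the first matching element
first_h1 = soup.find('h1')
print(first_h1)  # <h1>First heading</h1>

# find returns None when there is no match, so guard before using the result
missing = soup.find('h2')
print(missing)  # None

title = missing.string if missing is not None else 'no h2 found'
print(title)  # no h2 found
```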

If we only want to extract the contents of the tag, we can do it like so:

print(h1_tag.string)

Output:

Intro to Web Scraping With Beautiful Soup
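
One caveat: the string attribute works here because our h1 tag contains nothing but text. The sketch below, using a made-up paragraph with a nested b tag, suggests what happens otherwise and shows get_text as a possible alternative.

```python
from bs4 import BeautifulSoup

# Made-up paragraph whose content is a mix of text and a nested tag
soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
paragraph = soup.find('p')

# .string is None because the tag has more than one child
string_value = paragraph.string
print(string_value)  # None

# get_text() joins the text of the tag and all of its descendants
text_value = paragraph.get_text()
print(text_value)  # Hello world
```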

Let’s refactor this code by moving it into a function:

def find_title():
  h1_tag = bs.find('h1')
  print(h1_tag.string)

Next, let’s write a function to obtain all the list items in our HTML.

def find_list_items():
  list_items = bs.find_all('li')
  print(list_items)

In the above piece of code, we are using Beautiful Soup’s find_all method to find all the list items by passing it the string li. We didn’t use the find method to find the list items because it only returns a single item.

We get the following output upon calling the find_list_items function:

[<li>Install Beautiful Soup</li>, <li>Import Beautiful Soup</li>, <li>Instantiate Beautiful Soup</li>, <li>Extract Data Using Beautiful Soup</li>]
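
The value returned by find_all behaves like a regular Python list. The sketch below, on a made-up three-item list (SAMPLE_HTML is an assumption for illustration), shows a few things we can do with it, including the optional limit argument.

```python
from bs4 import BeautifulSoup

# Made-up list with three items
SAMPLE_HTML = '<ul><li>One</li><li>Two</li><li>Three</li></ul>'
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The result supports len(), indexing and iteration, like a list
items = soup.find_all('li')
item_count = len(items)
print(item_count)  # 3

# limit stops the search after the given number of matches
first_two = soup.find_all('li', limit=2)
print(first_two)  # [<li>One</li>, <li>Two</li>]

# When nothing matches, we get an empty list instead of None
no_tables = soup.find_all('table')
print(no_tables)  # []
```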

Let’s modify our find_list_items function using list comprehension to convert our list of items with tags into a list of items with only the content inside those tags.

def find_list_items():
  list_items = bs.find_all('li')
  list_contents = [item.string for item in list_items] # line added
  print(list_contents)

Output:

['Install Beautiful Soup', 'Import Beautiful Soup', 'Instantiate Beautiful Soup', 'Extract Data Using Beautiful Soup']
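
If we want to reuse the extracted values later instead of only printing them, one option is to return the list from the function. The sketch below is just one way to do it: the function name get_list_items and its soup parameter are my own choices for illustration, and it uses get_text(strip=True) so that stray whitespace around an item is removed.

```python
from bs4 import BeautifulSoup


def get_list_items(soup):
    """Return the text of every <li> element in the parsed document."""
    return [item.get_text(strip=True) for item in soup.find_all('li')]


# Made-up snippet; the second item has extra whitespace around its text
SAMPLE_HTML = '<ul><li>Install Beautiful Soup</li><li> Import Beautiful Soup </li></ul>'
steps = get_list_items(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(steps)  # ['Install Beautiful Soup', 'Import Beautiful Soup']
```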

Now that we have extracted the title and list items from our HTML, let’s move to the next step. We can see that our HTML document has two paragraph tags. Let’s try to access the one which has a CSS class called steps associated with it.

def find_steps_paragraph():
  paragraph = bs.find('p', {'class': 'steps'})
  print(paragraph.string)

If we want to narrow down to a specific HTML tag, we can do so easily by passing additional information about its attributes as an argument to the find method.

Hence, in the line paragraph = bs.find('p', {'class': 'steps'}) we have passed the CSS class name steps as a dictionary to the find method to narrow down to the specific paragraph we are looking for.

Once we run the function, we get the following output:

Follow the following steps:
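
The dictionary is not the only way to search by CSS class. As a sketch, the snippet below runs the same search in two other ways that Beautiful Soup offers: the class_ keyword argument and a CSS selector passed to select_one. SAMPLE_HTML is a shortened stand-in for our BASIC_HTML.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for BASIC_HTML containing only the two paragraphs
SAMPLE_HTML = """
<p>Let's learn how!</p>
<p class="steps">Follow the following steps:</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# 1. Attribute dictionary, as in find_steps_paragraph
by_dict = soup.find('p', {'class': 'steps'})

# 2. class_ keyword (with a trailing underscore, since class is reserved in Python)
by_keyword = soup.find('p', class_='steps')

# 3. CSS selector
by_selector = soup.select_one('p.steps')

print(by_dict.string)      # Follow the following steps:
print(by_keyword.string)   # Follow the following steps:
print(by_selector.string)  # Follow the following steps:
```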

We have extracted all but one element from our HTML, i.e., the paragraph without a CSS class. Let’s create a function to extract it!

def find_other_paragraph():
  paragraphs = bs.find_all('p')
  other_paragraph = [p for p in paragraphs if 'steps' not in p.attrs.get('class')]
  print(other_paragraph[0].string)

In the first line inside our find_other_paragraph function, we are trying to find all the paragraphs in our HTML document using Beautiful Soup’s find_all method. In the next line, we are using Python’s list comprehension to extract the paragraph tag which does not have a class associated with it, by comparing our CSS class name steps with each paragraph’s class attribute. We do so by calling the get method with the key class to get the name of the CSS class for each paragraph. Finally, since list comprehension returns a list of items and in our case, the returned list only has a single item, we access the first element of this list and extract its content.

Let’s run this function to see the output:

TypeError: argument of type 'NoneType' is not iterable

Before we start asking the code why it won’t work, we should see where we went wrong.

Probably, our paragraphs list is empty and using list comprehension on it caused this error. To verify this, let’s add a print statement in our function to check the contents of the paragraphs list.

def find_other_paragraph():
  paragraphs = bs.find_all('p')
  print(paragraphs) # new line added
  other_paragraph = [p for p in paragraphs if 'steps' not in p.attrs.get('class')]
  print(other_paragraph[0].string)

Our paragraphs list gets printed:

[<p>Let's learn how!</p>, <p class="steps">Follow the following steps:</p>]

and is followed by the same error message as above:

TypeError: argument of type 'NoneType' is not iterable

So we are getting all the paragraphs. The problem happens in this line of our function.

other_paragraph = [p for p in paragraphs if 'steps' not in p.attrs.get('class')]

Let’s take a closer look at the condition expression inside our list comprehension:

if 'steps' not in p.attrs.get('class')

The get method returns None if it can’t find the key class in the paragraph’s attribute dictionary. Since the paragraph that we are looking for does not have a class attribute, p.attrs.get('class') returns None for it. And since we are essentially checking if 'steps' not in None, we get a TypeError, because NoneType is not iterable.
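
To see this for ourselves, we can inspect the attribute dictionaries directly. The sketch below uses a shortened stand-in for our HTML (SAMPLE_HTML, plain and steps are names made up for this example) and catches the error so that the script keeps running.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for BASIC_HTML containing only the two paragraphs
SAMPLE_HTML = """
<p>Let's learn how!</p>
<p class="steps">Follow the following steps:</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
plain, steps = soup.find_all('p')

# attrs is a dictionary; class values are stored as a list of class names
print(plain.attrs)  # {}
print(steps.attrs)  # {'class': ['steps']}

# A missing key makes get return None
missing_class = plain.attrs.get('class')
print(missing_class)  # None

# A membership test against None raises TypeError
raised_type_error = False
try:
    'steps' not in missing_class
except TypeError:
    raised_type_error = True
print(raised_type_error)  # True
```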

Fortunately, the get method accepts an optional second argument: a default value to return instead of None when the key lookup fails. We can make it return an empty list if it can’t find the key class inside this dictionary. This way, the check 'steps' not in [] evaluates to True for the paragraph without a class, so that paragraph is kept.

def find_other_paragraph():
  paragraphs = bs.find_all('p')
  other_paragraph = [p for p in paragraphs if 'steps' not in p.attrs.get('class', [])] # line modified
  print(other_paragraph[0].string)

When we run this function, we get our desired output:

Let's learn how!
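
As an aside, find and find_all also accept a function in place of a tag name, which lets Beautiful Soup do the filtering for us. The sketch below is an alternative under that approach, not a replacement for the function above; is_classless_paragraph is a helper name I made up, and SAMPLE_HTML is a shortened stand-in for our HTML.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for BASIC_HTML containing only the two paragraphs
SAMPLE_HTML = """
<p>Let's learn how!</p>
<p class="steps">Follow the following steps:</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')


def is_classless_paragraph(tag):
    """Match <p> tags that have no class attribute at all."""
    return tag.name == 'p' and not tag.has_attr('class')


# find calls the function on each tag and returns the first one it accepts
other_paragraph = soup.find(is_classless_paragraph)
print(other_paragraph.string)  # Let's learn how!
```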

Now that we have learned the basics of web scraping with Python’s Beautiful Soup library, let’s extend this knowledge into helping Jane find her perfect job!


If you would like to read more posts about web scraping with Python, you can find them here.

Thank you for reading!

License

Copyright 2021-present Vasudha Jha.

Released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Vasudha Jha
MS in Computer Science Student

An engineer, artist and writer, all at the same time, I suppose.
