Web Scraping using Beautifulsoup

Data is the key to any Machine Learning/Data Science project. There can be many sources of data, depending on who is implementing the project. For instance, an organization that wants to study its customers' behaviour already has the customer data with it. But many times you don't possess the data, or you need more data to train your model. One of the most frequently used methods to acquire this data is Web Scraping.

In Web Scraping, you traverse the hierarchy of the source code of a page and scrape the relevant data. In Python, you can achieve this task using the Beautifulsoup module. If you go through the docs of Beautifulsoup, you may feel overwhelmed by the wide range of methods it provides. But in my experience, you need to know only a handful of methods in most cases.

In this post, we will discuss these methods and how to look for the data on a page. We will also scrape data from the Election Commission of India website to see Beautifulsoup in action. I have also recorded a video giving an overview of the content covered in this post. You can watch it here.

Note: I will be using Jupyter Notebook for running the Python code.

 

Identify the Data:

Before learning how to use Beautifulsoup to scrape data from a website, you should learn how to identify the relevant data on a page/website. Go through the points below.

  • Most of the time, the data you want to scrape is spread across multiple pages. In such a case, copy the URL of the first page and identify its parameters. Now visit a few other pages and observe which parameters change. Create a list of all the possible values those parameters can take. These parameters' values will be used in your code.
  • Now open the page that contains the data in a web browser (I prefer Google Chrome). Press Ctrl + Shift + I. It will open the mark-up text of the page in the inspect window.
  • Now comes the most important task: identifying which tag contains the data we require. In the inspect window, in the top left corner, click on the cursor icon. Now when you click on any item on the webpage, its corresponding mark-up will be highlighted in the inspect window.
  • Using the same procedure, identify the tags of your data. In most cases the data is repetitive; that is to say, one tag contains a single record and the same tag hierarchy is repeated to display multiple records. This tag can be anything: for example, a div, a table or any other container element.

We will follow the same points in the following sections when we scrape the data from the Election Commission of India website.

 

Getting started with Beautifulsoup:

I use the Anaconda platform for my Data Science/ML projects. It comes with various pre-installed libraries/packages. If Beautifulsoup is not already installed, open the Anaconda Prompt and run the command given below.

conda install -c anaconda beautifulsoup4

If you don't use Anaconda and want to write code in a plain Python environment, you can install Beautifulsoup using pip (the Python package manager) by running the command given below.

pip install beautifulsoup4

Before discussing the methods of Beautifulsoup, let us first look at the first few lines of code, which remain more or less the same in every project.

from bs4 import BeautifulSoup
import requests

url = "<webpage_url>"  # replace with the URL of the page you want to scrape

response = requests.get(
                         url,
                         params = {'parameter':'value'}
           )

soup = BeautifulSoup(response.content, "html.parser")

In the above code, we did the following.

  • Import the Beautifulsoup and requests packages.
  • Make a request to the webpage using the requests library; any query-string parameters go in the params dictionary (see the short illustration below).
  • Parse the document using BeautifulSoup (if no parser is named explicitly, Beautifulsoup picks the best one installed, so naming "html.parser" keeps the behaviour consistent).
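
As a quick illustration of the params argument (using a hypothetical example.com URL), requests encodes the dictionary into the query string for you:

import requests

# requests appends the params dictionary to the URL as a query string
response = requests.get("https://example.com/results", params={"ac": 1})
print(response.url)   # https://example.com/results?ac=1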

Methods of Beautifulsoup:

Beautifulsoup provides an extensive set of methods to traverse the hierarchy of any webpage. You can go through them on its documentation page. Here we are going to discuss only the two methods that are used most often and can get your work done.

Method 1:

find_all(name, attrs, recursive, string, limit, **kwargs)

As the name suggests, this method finds all the tags that satisfy the filter values provided as arguments. If you run this method on the whole HTML document, it will traverse the whole page and identify the tags matching the provided filters.

As you can see in the definition of this method, there are multiple parameters that can be used to filter tags in any element. We will discuss them one by one.

  • name: By passing the name argument, the find_all() method will find all tags whose name matches the value provided. 

    For ex: soup.find_all("p") will find all the paragraph ('p') tags in the soup element.

    You can also provide a class name as the second argument to find tags of a particular name with that class.

    For ex: soup.find_all("p", "bold") will find all the paragraph ('p') tags with class 'bold' in the soup element.

  • attrs: This argument comes in handy when you want to filter tags based on their attributes. Put the attributes into a dictionary and pass the dictionary to find_all() as the 'attrs' argument. 

    For ex: soup.find_all(attrs={"name": "email"}) or soup.find_all(attrs={"class": "bold"})

  • recursive: When you call the find_all() method on an element, Beautifulsoup examines all the descendants of that element; that is to say, it considers direct children, children's children and so on. To instruct find_all() to consider only direct children, pass recursive=False as an argument. 

    For ex: soup.find_all('p', recursive=False)

  • string: If you want Beautifulsoup to search for strings instead of tags, pass the string argument with the value you are looking for. You can even use a regular expression (after import re) to filter strings. 

    For ex: soup.find_all(string="scrape") or soup.find_all(string=re.compile("scrape"))

  • limit: The limit argument works just like the LIMIT clause in an SQL query: it caps the number of results returned by Beautifulsoup. 

    For ex: soup.find_all('p', limit=10)

  • keyword arguments (**kwargs): You can also filter tags by passing an attribute value directly as a keyword argument. For example, you can pass an 'id' value, a 'class' value (using class_ as the argument name, because 'class' is a reserved word in Python) or an 'href' value. You can even use regular expressions as the values of these attributes. 

    For ex: soup.find_all(id='one') or soup.find_all(class_='one') or soup.find_all(href=re.compile('one'))

I might have missed something, but the filters explained above are more than enough to make full use of the find_all() method. The short demo below ties them together.
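
Here is a minimal, self-contained sketch exercising these filters. The HTML snippet, tag names and attribute values are made up purely for illustration:

import re
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to exercise the filters discussed above.
html = """
<div id='one'>
  <p class='bold'>First</p>
  <p>Second</p>
  <a href='/one/page'>Link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("p"))                      # all <p> tags
print(soup.find_all("p", "bold"))              # <p> tags with class 'bold'
print(soup.find_all(attrs={"class": "bold"}))  # filter by attribute dictionary
print(soup.find_all("p", limit=1))             # at most one result
print(soup.find_all(id="one"))                 # keyword-argument filter
print(soup.find_all(href=re.compile("one")))   # regex on an attribute value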

Method 2:

find(name, attrs, recursive, string, **kwargs)

The find() method works the same as the find_all() method. The only difference is that find_all() scans the whole document and can return more than one result, if available, while find() returns only one result.

You might think we could achieve the same thing using find_all() with limit=1. But as noted above, find_all() scans the whole document (or the element it has been called on), while find() stops at the first occurrence.

One more thing: if find_all() can't find anything, it returns an empty list, while if find() can't find anything, it returns None.

For ex: we know there is only one <body> tag in an HTML document, so it would be a waste of time to scan the entire document with find_all(); soup.find('body') is the better choice. The short demo below illustrates these differences.
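
A quick sketch of the difference, assuming a soup object parsed from a full webpage as in the starter code:

body = soup.find("body")            # the single <body> tag, or None if absent
print(soup.find_all("nosuchtag"))   # [] -- find_all() returns an empty list
print(soup.find("nosuchtag"))       # None -- find() returns None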

Tag Chaining:

Apart from these two methods, navigation using tag names also comes in handy; it is shorthand for repeatedly calling the find() method. 

For ex: soup.head.title

You will see this in use when we scrape the ECI website.

 

Scraping the ECI Website:

Now we will scrape the Election Commission of India website using the procedure and methods described above. We will extract the results of the Indian General Elections held in 2019; more precisely, the vote count of every candidate in each of the 543 constituencies.

Now we will follow the steps pointed out in the 'Identify the Data' section.

The election result data is spread across multiple pages. So, first of all, we will identify which parameters of the URL change and all the possible values they can take.

The URL of the first page is https://results.eci.gov.in/pc/en/constituencywise/ConstituencywiseU011.htm?ac=1

If you visit this page, you will find two drop-downs, for selecting the state and the constituency respectively. We need to observe which parameters of the URL change after selecting different combinations of state and constituency using these drop-downs.

[Image: URL inspection for web scraping]

After trying a few combinations, you can observe that one parameter and one section of the URL change, as shown in the picture below. These values are the state code and the constituency code, corresponding to the state and constituency drop-downs respectively.

[Image: URL parameter comparison for web scraping]

All the possible values taken by these parameters live in the HTML code of these drop-downs (which I discovered through the inspect-element window) and can themselves be extracted with Beautifulsoup. But as you are still learning to scrape, I have done this for you: I have stored these values in a dictionary, which we will use later to scrape the relevant data. A sketch of its structure is shown below.
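
The dictionary itself is not reproduced here; a minimal sketch of its structure, with a hypothetical name (states) and illustrative entries rather than the full list, might look like this:

# Hypothetical sketch: maps each state/UT code (as used in the ECI URLs)
# to the list of its constituency numbers (the 'ac' values).
states = {
    "U01": [1],                  # illustrative union territory with one seat
    "S01": list(range(1, 26)),   # illustrative state with 25 seats
    # ... one entry per state/union territory ...
}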


Now we will identify the tags for the data we want to scrape. To identify these tags, press Ctrl + Shift + I to open the inspect window, click on the icon in the top left corner to enable the inspect-element cursor, and then follow the points below.

  • Before identifying tags, observe the hierarchy of our data. It will be table -> row -> column.
  • First of all, click on the boundary of the table; the corresponding table tag will be highlighted in the inspect window. In our code, we will therefore use the find_all() method to collect all the table tags on the webpage. This method returns a list of all table tags.

[Image: Webpage source code inspection]

  • Now we will identify which item in the above list contains our data. Here we have to use trial and error. Just print some items and you will find that the 11th item is the required table tag.
  • Now pretty-print this 11th item. You will find that our data follows the tbody -> tr -> td tag hierarchy. See the code after this list to better understand how to use the find_all() method and tag navigation to scrape the required data.
  • The procedure above covers just one page. Now we need to loop over all the pages that contain our data, which we do by changing the parameters of the URL. To achieve this, we will use the dictionary (containing the parameter values) created earlier. The complete multi-page code is included in the sketch below.
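
The original code block is not reproduced here; below is a minimal sketch of the approach described above, reusing the states dictionary sketched earlier. The table index (11th item, i.e. index 10) and the URL pattern follow the text, but treat them as assumptions, since the live site may have changed:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://results.eci.gov.in/pc/en/constituencywise/Constituencywise{state}{ac}.htm"

def scrape_constituency(state_code, ac_no):
    # Scrape the candidate-wise vote counts for one constituency page.
    url = BASE_URL.format(state=state_code, ac=ac_no)
    response = requests.get(url, params={"ac": ac_no})
    soup = BeautifulSoup(response.content, "html.parser")

    tables = soup.find_all("table")   # list of all table tags on the page
    result_table = tables[10]         # 11th table held the data at the time of writing

    rows = []
    for tr in result_table.tbody.find_all("tr"):   # tag chaining into tbody
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:                                  # skip header/empty rows
            rows.append(cells)
    return rows

# Loop over every state and constituency using the parameter values
# collected earlier (the states dictionary sketched above).
all_results = {}
for state_code, constituencies in states.items():
    for ac_no in constituencies:
        all_results[(state_code, ac_no)] = scrape_constituency(state_code, ac_no)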

With this, you have come across the different features of Beautifulsoup and how to use them to scrape a website. I have also shared the approach that I generally use when scraping. You can see the complete code on my GitHub repository.
