Photo by Andrey Tikhonovskiy on Unsplash

Using Python for Web Scraping

Benny Lee


This article provides an easy way to web scrape data from a website using the Beautiful Soup package in Python.

One of my university assignments involved developing analytics dashboards for a non-profit hospital. As part of the project, I needed to obtain the pricing data for medications that are available on the Pharmaceutical Benefits Scheme (PBS) website. There are several ways to obtain these data, and one option is to web scrape them from the website.

Figure 1 — The pricing data from the PBS website

Step 1 — Import the relevant packages

We first import the following packages:

Figure 2 — Import relevant packages

The requests package is used to make an HTTP call and obtain the response in HTML format.

The Beautiful Soup package is used to parse the HTML and extract the relevant information.

The csv package is used to generate the output file, and the pandas package is used to create a DataFrame to hold the relevant data.
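For reference, a minimal sketch of the imports shown in Figure 2 would look something like this (assuming requests, beautifulsoup4 and pandas are installed):

```python
import csv

import requests
import pandas as pd
from bs4 import BeautifulSoup
```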

Step 2 — Retrieve the HTML

The following code was used to retrieve the website data:

Figure 3 — Retrieve HTML page

I used the requests package to make an HTTP GET call to the defined URL. I then used the Beautiful Soup package to parse the response from the HTTP call and store it in a variable.
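A rough sketch of this step is below; the URL is a placeholder rather than the actual PBS pricing page, and the variable names are my own:

```python
# Placeholder URL; replace with the actual PBS pricing page you want to scrape
url = "https://www.pbs.gov.au/..."

# Make the GET call and fail fast if the request was unsuccessful
response = requests.get(url)
response.raise_for_status()

# Parse the returned HTML into a navigable BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
```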

Step 3 — Extracting the data

Beautiful Soup provides easy and powerful methods to navigate and extract the data from an HTML document.

Figure 4 — Extract relevant data

In the code above, I first used the find_all() method to find the relevant table that contains the data I need, and then extracted the values I need. I repeated these steps to get all the values in the table and saved them into a list.
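A minimal sketch of this extraction step is shown below; picking the first table on the page and reading every cell as text are assumptions, since the real selectors depend on the PBS page markup:

```python
rows = []

# Assume the pricing data sits in the first table on the page
table = soup.find_all("table")[0]

# Walk each table row, skipping the header, and collect the cell text
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)
```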

Most of the time when doing web scraping, you need to understand the elements in the HTML file, so a command that prints out the structure in a well-formed format is pretty handy:

Figure 5 — Print out the HTML response
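If you are following along, Beautiful Soup's prettify() method does exactly this, printing the parsed document with indentation:

```python
# Print the parsed HTML with indentation so the element structure is easy to read
print(soup.prettify())
```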

Step 4 — Saving the data into a CSV file

Once I have saved all the data into a list, the rest is pretty simple. I converted the list into a pandas DataFrame and then saved the DataFrame into a CSV file as shown below:

Figure 6 — Save data into a CSV file
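A sketch of this step, assuming the scraped values are in the rows list from the earlier snippet and using a placeholder file name:

```python
# Build a DataFrame from the scraped rows and write it out;
# "pbs_pricing.csv" is a placeholder file name
df = pd.DataFrame(rows)
df.to_csv("pbs_pricing.csv", index=False)
```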

Step 5 — Validation

Finally, it is always a good idea to validate the saved CSV data by loading it back into Python and seeing what the data looks like:

Figure 7 — Validation by loading the CSV file
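A quick way to do this check, again using the placeholder file name from the previous snippet:

```python
# Read the CSV back in and inspect the first few rows
check = pd.read_csv("pbs_pricing.csv")
print(check.head())
```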

Python makes web scraping very easy and fun to do. It has so many libraries that can be used for scraping, retrieving, parsing and extracting data. You can refer to here for the complete source code for more details.

Before you start, I would highly recommend checking out this website for the ethics related to web scraping: https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01

Happy web scraping :)
