Web Scraping With BeautifulSoup

When I was working on my master's project, one of the obstacles I faced was scraping data from multiple websites. At first I considered scraping manually, but after hours of tedious copy-pasting from websites into an Excel spreadsheet, with plenty of typos and mis-entries along the way, I finally gave up.

So I did some research online on how to automate this web-scraping process. This led me to BeautifulSoup. Its documentation describes it like this:

BeautifulSoup: Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

What this basically says is that BeautifulSoup lets us scrape data by working through a page's HTML or XML structure. If you look closely, different pages of the same static website are usually built from the same structure. This consistency is really helpful when we want to automate the web-scraping process.
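To make the "same structure" idea concrete, here is a minimal sketch. The two HTML strings are made up for illustration and have nothing to do with any real site; the point is only that one function can handle every page that shares a layout.

```python
from bs4 import BeautifulSoup

# Two made-up pages that share one structure: a title in <h1>, items in <li>.
page_one = "<html><body><h1>Page 1</h1><ul><li>Alpha</li><li>Beta</li></ul></body></html>"
page_two = "<html><body><h1>Page 2</h1><ul><li>Gamma</li><li>Delta</li></ul></body></html>"


def extract_items(html):
    """Return the page title and the text of every list item."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    items = [li.get_text(strip=True) for li in soup.find_all("li")]
    return title, items


# The same function works on both pages because their structure is identical.
for page in (page_one, page_two):
    print(extract_items(page))
```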

So let's look at an example using a property transaction website in Malaysia: www.brickz.com. Here, I want to scrape the property details for all the high-rise residential projects in Kuala Lumpur over the past 10 years.

Kuala Lumpur high-rise residential transactions for the past 10 years.

So my information of interest will be:

  • Project Name
  • The Street/Area
  • Tenure/Type
  • Median Price Psf
  • Median Price
  • Filed Transaction

To extract the data, we first need to understand the structure of the HTML. For this website, let's use the Developer Tools in Google Chrome, which can be opened by right-clicking on the page and choosing Inspect. Our main interest is the table structure that holds the data, which is the tbody element. The table body is highlighted in blue once the tbody tag is selected.

tbody table

Now let's go through each unique tag that contains the information we want.

For the name of the property, the unique tag in each row is <span itemprop="name">, inside the <a> tag.

For the street name of the property, the tag is similar to the one used for the property name. We will discuss how to extract this second tag later.

For the tenure type, the value is inside <span class="ptd_list_item_title">. The residential type (condominium, apartment or flat) is contained in the same kind of tag as the tenure type.

For the Median Price Psf, the data is inside <span class="ptd_list_item ptd currency">.

For the number of transactions, the data is inside an <a> tag with class="button".
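Here is a rough sketch of how those tags could be pulled out of a single row. The HTML below is a hypothetical row that I assembled from the tags described above, with made-up values; the real page's nesting and attributes may differ, so treat the selectors as a starting point rather than the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical row built from the tags described above (values are made up).
sample_html = """
<table>
  <tbody>
    <tr>
      <td>
        <a href="#"><span itemprop="name">Sample Residence</span></a>
        <span itemprop="name">Jalan Contoh</span>
      </td>
      <td>
        <span class="ptd_list_item_title">Freehold</span>
        <span class="ptd_list_item_title">Condominium</span>
      </td>
      <td><span class="ptd_list_item ptd currency">RM 500</span></td>
      <td><a class="button" href="#">12</a></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
row = soup.find("tbody").find("tr")

# Name and street share the same kind of tag, so collect both and index them.
name_tags = row.find_all("span", attrs={"itemprop": "name"})
name = name_tags[0].get_text(strip=True)
street = name_tags[1].get_text(strip=True)

# Tenure and residential type also share a tag.
title_tags = row.find_all("span", class_="ptd_list_item_title")
tenure = title_tags[0].get_text(strip=True)
residential_type = title_tags[1].get_text(strip=True)

# class_ matches a single class token, so "ptd_list_item" does not match
# the "ptd_list_item_title" spans.
median_psf = row.find("span", class_="ptd_list_item").get_text(strip=True)
transactions = row.find("a", class_="button").get_text(strip=True)

print(name, street, tenure, residential_type, median_psf, transactions)
```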

Our main goal here is to extract high-rise residential data from January 2008 to May 2018. Using the website’s interface, we filter the dataset based on this condition.

Filtered dataset for high rise residential in Kuala Lumpur.

There are 1045 projects spread over 105 pages in total, so each page lists up to 10 projects. Now let's look at the URL to see how the pages are arranged.

URL for page 1

As we can see, the numeric value underlined in red identifies the page. Using this information, we can use a for loop to extract information from each page.
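As a small sketch of that idea, the page URLs can be generated in a loop. The URL template below is a placeholder I made up (the real path and filter parameters have to be copied from the browser's address bar); only the page count comes from the numbers above.

```python
import math

# Hypothetical URL template: replace it with the real filtered URL,
# keeping {page} where the page number appears.
BASE_URL = "https://www.example.com/transactions/page/{page}"

# 1045 projects at 10 per page, rounded up, gives 105 pages.
TOTAL_PAGES = math.ceil(1045 / 10)

page_urls = [BASE_URL.format(page=page) for page in range(1, TOTAL_PAGES + 1)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```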

Now that we have identified all the tags and the page URL structure, let's write our web-scraping program. We will import BeautifulSoup from bs4 and get from requests.

Importing BeautifulSoup and Get

Here is how I wrote my web-scraping code in Python.

The Web Scraping code

So here are the details:

  • pages is used in a for loop over all 105 pages in the response URL.
  • names, areas, tenures, type_res, median_Price and transaction are empty lists used as containers for each category of data.
  • html_soup holds the HTML parsed by BeautifulSoup.
  • residential_container is used to find all the table rows on each page.
  • row is an individual row on a page.
  • For fields that share the same tag, we use bracket indexing [] to extract the subsequent values (second and above). This applies to the street and type_res fields.
  • Each value extracted from a row is appended to the list container for its field.
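The steps in the list above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact code from the screenshot: the URL template, the sample HTML and the fake_fetch helper are all hypothetical, added so the sketch runs offline. Variable names follow the list above, with median_Price lower-cased to median_price.

```python
from bs4 import BeautifulSoup

# Hypothetical URL template; the real one comes from the filtered page URL.
URL_TEMPLATE = "https://www.example.com/transactions/page/{page}"

# Made-up page with two rows, built from the tags described earlier.
SAMPLE_PAGE = """
<table><tbody>
<tr>
  <td><a href="#"><span itemprop="name">Project {page}A</span></a>
      <span itemprop="name">Street {page}A</span></td>
  <td><span class="ptd_list_item_title">Freehold</span>
      <span class="ptd_list_item_title">Condominium</span></td>
  <td><span class="ptd_list_item ptd currency">RM 500</span></td>
  <td><a class="button" href="#">12</a></td>
</tr>
<tr>
  <td><a href="#"><span itemprop="name">Project {page}B</span></a>
      <span itemprop="name">Street {page}B</span></td>
  <td><span class="ptd_list_item_title">Leasehold</span>
      <span class="ptd_list_item_title">Apartment</span></td>
  <td><span class="ptd_list_item ptd currency">RM 300</span></td>
  <td><a class="button" href="#">7</a></td>
</tr>
</tbody></table>
"""


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs without a network."""
    page = url.rsplit("/", 1)[-1]
    return SAMPLE_PAGE.format(page=page)


def scrape(pages, fetch):
    """Loop over pages and collect one list per field."""
    names, areas, tenures, type_res, median_price, transaction = [], [], [], [], [], []
    for page in pages:
        html_soup = BeautifulSoup(fetch(URL_TEMPLATE.format(page=page)), "html.parser")
        residential_container = html_soup.find("tbody").find_all("tr")
        for row in residential_container:
            # Fields sharing a tag are separated by index: [0] first, [1] second.
            name_tags = row.find_all("span", attrs={"itemprop": "name"})
            title_tags = row.find_all("span", class_="ptd_list_item_title")
            names.append(name_tags[0].get_text(strip=True))
            areas.append(name_tags[1].get_text(strip=True))
            tenures.append(title_tags[0].get_text(strip=True))
            type_res.append(title_tags[1].get_text(strip=True))
            median_price.append(row.find("span", class_="ptd_list_item").get_text(strip=True))
            transaction.append(row.find("a", class_="button").get_text(strip=True))
    return {
        "names": names, "areas": areas, "tenures": tenures,
        "type_res": type_res, "median_price": median_price, "transaction": transaction,
    }


result = scrape(range(1, 4), fake_fetch)
print(len(result["names"]), result["names"][0], result["type_res"][1])
```

To run this against a live site, fake_fetch could be swapped for a function built on requests, for example lambda url: get(url).text after from requests import get, and range(1, 4) for range(1, 106) to cover all 105 pages.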

Now let's transform the dataset into a dataframe.

Combining all the extracted fields into a dataframe
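A minimal sketch of this step with pandas, assuming each field was collected into its own list of equal length. The values below are made up and simply stand in for the scraped lists; the column labels are my choice, based on the fields listed earlier.

```python
import pandas as pd

# Made-up values standing in for the lists filled by the scraping loop.
names = ["Sample Residence", "Contoh Heights"]
areas = ["Jalan Contoh", "Jalan Sampel"]
tenures = ["Freehold", "Leasehold"]
type_res = ["Condominium", "Apartment"]
median_price = ["RM 500", "RM 300"]
transaction = ["12", "7"]

# Each list becomes one column; all lists must have the same length.
df = pd.DataFrame({
    "Project Name": names,
    "Street/Area": areas,
    "Tenure": tenures,
    "Type": type_res,
    "Median Price Psf": median_price,
    "Filed Transaction": transaction,
})

print(df.shape)
print(df.head())
```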

The result.

Using BeautifulSoup really saved me a lot of time. On top of that, the package is really easy to understand and use. It was so simple that I needed no more than 20 lines of code for this task. I will cover how I use Selenium to handle JavaScript-populated websites in my next post. Thank you for reading. :)
