Machine Learning Ensemble Models for High Rise Residential in Kuala Lumpur

Muhammad Izzuddin
16 min read · Jul 31, 2018


The magnificent view of Kuala Lumpur

I guess one of the biggest dilemmas faced by a house buyer or seller is determining the fair price of a house. This sparked me to develop machine learning models that can help determine the price of a house with reasonable accuracy.

I developed 3 ensemble models: Bagging (Random Forest), Boosting (XGBoost) and a Stack Ensemble using multiple base learners. My main focus was high-rise property (Service Residence, Condominium, Apartment and Flat) in Kuala Lumpur from 2008 to 2018.

This post covers how I acquired the dataset, cleaned the data, conducted Exploratory Data Analysis, performed feature engineering and data transformation, and finally developed the models. All of this was done using Python. I will cover in depth how I automated my web scraping task in a later post.

1. Data Acquisition

I extracted my first dataset from www.brickz.com using the BeautifulSoup package: around 1,043 high-rise property names in Kuala Lumpur, covering January 2008 to March 2018. The variables extracted in this first-stage web scraping process consist of:

  • Property Name
  • Tenure (Freehold/Leasehold)
  • Median Price Psf
  • Median Price
  • Number of Transactions (between 2008–2018)
Sample of the extracted information
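
A minimal sketch of what this first-stage scrape could look like with requests and BeautifulSoup (the URL pattern and CSS selectors below are placeholders for illustration, not the exact ones on the site; the full scraper is covered in the later post):

import requests
from bs4 import BeautifulSoup

# Placeholder listing URL and selectors; the real page structure differs
url = "https://www.brickz.my/transactions/residential/kuala-lumpur/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

records = []
for row in soup.select("tr.property-row"):  # one row per property
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append({
        "Name": cells[0],
        "Tenure": cells[1],
        "Median Price Psf": cells[2],
        "Median Price": cells[3],
        "No of Transaction": cells[4],
    })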

I used the property names from this first-stage web scraping as the reference for the next scraping activities. Next, from the same website, I further extracted each property’s features, such as:

  • SPA Date (Date of the transaction)
  • Address
  • Building Type (Service Residence, Condominium, Apartment, Flat)
  • Rooms (Number of rooms)
  • Lot Size ( floor size of the property)
  • Price

Around 39,200 transactions were extracted for the 1,043 properties.

Sample of extracted information for a single property.

I obtained structural information for each property by scraping multiple property review websites (ordered by Google page ranking):

  • www.propsocial.com
  • www.propwall.com
  • www.durianproperty.com
  • www.landbar.com

The web scraping was done in the sequence of the Google page ranking, meaning the scraping starts with the highest-ranking property review website, www.propsocial.com. Any information that is not available from this first website is then scraped from www.propwall.com, and so on. Since these property review websites are populated by JavaScript, I used a combination of the Selenium and BeautifulSoup packages (I will cover these in depth in a later post).
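
A minimal sketch of the Selenium + BeautifulSoup combination (the URL and the CSS class names are placeholders; the detailed version is left for the later post):

from selenium import webdriver
from bs4 import BeautifulSoup

# Let Selenium render the JavaScript-populated page, then hand the HTML to BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.propsocial.my/property/some-property-name")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# The ".detail-item", ".label" and ".value" selectors are assumptions;
# each review site needs its own selectors
details = {item.select_one(".label").get_text(strip=True):
           item.select_one(".value").get_text(strip=True)
           for item in soup.select(".detail-item")}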

The information acquired from these websites consists of:

  • Developer Name (The name of the property developer)
  • Completion Date (The year of completion)
  • Maintenance Fee (Maintenance fee charged per square foot)
  • Number of Blocks (How many blocks a particular property has)
  • Total Storeys
  • Total Units (Total units available for a particular property)
Information extracted from www.propsocial.com

Any remaining missing values were manually extracted from multiple forums and social media pages. Although the authenticity and validity of information from these sources are questionable, I chose to extract them as I think it is better than leaving the values blank.

Example of extracted information from forum (www.lowyat.com)

I manually scraped the information from the forums based on the leftover missing values. The reason for not automating this step is that it is not worth the time to write code for just a single value of information, so manual scraping is more effective.

Next, I used the Google Maps API to get the longitude and latitude values for each property.

Sample of Coordinate extraction using Google Map API
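
A minimal sketch of the coordinate lookup, assuming the googlemaps Python client and that the property table is already in a DataFrame df (the original run may have used a different client or raw HTTP requests):

import googlemaps

gm = googlemaps.Client(key="YOUR_API_KEY")

def get_coordinates(property_name):
    # Geocode the property name within Kuala Lumpur and return (lat, lng)
    result = gm.geocode(property_name + ", Kuala Lumpur, Malaysia")
    if not result:
        return (None, None)
    location = result[0]["geometry"]["location"]
    return (location["lat"], location["lng"])

latlng = df["Name"].map(get_coordinates)  # in practice, cache one call per unique name
df["Latitude"] = latlng.str[0]
df["Longitude"] = latlng.str[1]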

And finally, for the macroeconomic factor, I chose to use the OPR (Overnight Policy Rate). The OPR is used to study the effect of mortgage interest rates on housing prices.

OPR level from www.bnm.com

All the datasets from multiple sources were then integrated into one final dataset. I made a simple star schema to show the integration of the data from multiple sources into the final dataset.

The star schema of the data integration
The Final Dataset in csv form
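
A minimal sketch of the integration step, assuming each source has already been loaded into its own DataFrame (the file names here are placeholders) and that the property name is the join key, as in the star schema above:

import pandas as pd

transactions = pd.read_csv("brickz_transactions.csv")  # fact table: one row per transaction
structural = pd.read_csv("property_details.csv")       # per-property structural attributes
coordinates = pd.read_csv("coordinates.csv")           # latitude/longitude per property
opr = pd.read_csv("opr.csv")                           # OPR level, one row per SPA date

# Left-join the dimension tables onto the transaction fact table
df = (transactions
      .merge(structural, on="Name", how="left")
      .merge(coordinates, on="Name", how="left")
      .merge(opr, on="SPADate", how="left"))
df.to_csv("final_dataset.csv", index=False)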

The final variables consist of:

  • Name (Name of the property)
  • SPADate (Sale and Purchase Agreement date)
  • Address (Address of the property)
  • Street (Neighbourhood)
  • Tenure (Freehold or Leasehold)
  • Completion Date (Year the property was built)
  • Type (Flat, Apartment, Condominium or Service Residence)
  • Developer Name (Name of the company that built the property)
  • Number of Blocks (Number of blocks inside the compound of the property)
  • Maintenance Fee (Maintenance fee per square foot of a residential)
  • Total Storeys (Total number of storeys the property has)
  • Total Units (Total number of residential units inside the property)
  • Median Price Psf (Median price per square foot of a residential)
  • No of Transaction (Total number of transactions for a property from 2008–2018)
  • Latitude
  • Longitude
  • Size (Lot size of a residential)
  • Number of Rooms (Number of rooms in a residential)
  • OPR (Overnight Policy Rate level)
  • Price (Price of the residential)

2. Data Cleaning

Data cleaning is the activity of removing data that is inaccurate, incomplete or duplicated. The data cleaning process is important to prevent false conclusions being derived from the final model. I used the pandas package quite extensively for the cleaning process.

2.1 Duplicate data

Duplicate data may consist of the same property name but different attributes. To prevent any misleading duplicates (for example, the same property name but actually a different type of property), I selected subset=["Name","Type"] in the duplicated function to assess the duplicate results based on these criteria.

My initial assumption was right. Several properties with the same name and type but with different other features turned out to be the same property, and these records had a total of just 1 transaction throughout the 10-year period. So I dropped all of these duplicated values.
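
A minimal sketch of the duplicate check and removal on the property-level table (one row per property), keeping the record with the larger transaction count; the exact rule used may differ:

# Inspect potential duplicates on property Name and Type
dupes = properties[properties.duplicated(subset=["Name", "Type"], keep=False)]
print(dupes.sort_values("Name"))

# Keep the record with the most transactions for each Name/Type pair
properties = (properties.sort_values("No of Transaction", ascending=False)
                        .drop_duplicates(subset=["Name", "Type"], keep="first"))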

2.2 Missing Value

  • Any missing numerical values for ‘Maintenance’, ‘Block’, ‘Total Unit’, ‘Level Number’ and ‘Year of Completion’ are filled using median value imputation of the affected feature.
  • The imputation method uses statistical inference that substitutes a missing value based on other similar available information. The first-stage imputation groups the properties by the “Street” feature, followed by the “Type” feature. The missing value is then filled with the group’s median value of the affected feature.
  • The reasoning behind this choice is that houses of the same type in the same neighbourhood tend to share similar characteristics. Any remaining missing values are filled in a second-stage imputation by grouping the properties by “Type” only, followed by filling with the median value of the affected feature.
# Filling up all the missing values for the Maintenance column
df["Maintenance"] = pd.to_numeric(df["Maintenance"])
# First stage: impute with the median within the same Street and Type group
df["Maintenance"].fillna(df.groupby(["Street","Type"])["Maintenance"].transform("median"),inplace=True)
# Second stage: impute any remaining gaps with the median within the same Type
df["Maintenance"].fillna(df.groupby("Type")["Maintenance"].transform("median"),inplace=True)
df["Maintenance"] = df.Maintenance.round(2)

2.3 Outliers

An outlier is an extreme observation that is distant from other observations. I used the describe() and median() functions to get a gist of the outliers in the dataset.

The Descriptive Statistic Table

From the Descriptive Statistic Table, Lot Size and Number of Rooms possess extreme values on both the min and max ends. It is impossible for a house in Malaysia to have a size of 11 sqft or to have 118 rooms. Let's investigate any property that is below 100 sqft, above 20,000 sqft, or has more than 10 rooms using pandas.

Properties with a size lower than 100 sqft, a size above 10,000 sqft, or more than 10 rooms

These extreme values do not make sense. For example, it is impossible for a house of 1,238 sqft to have 118 rooms. Another example is an extremely large unit selling at a suspiciously low price, as shown by Pelangi Indah (11,000 sqft sold at just RM180,000). Instead of removing these outlier values, we will replace the outliers by grouping by name and then applying an average imputation method.
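
A minimal sketch of the outlier inspection and replacement (the thresholds follow the text above; the Size and Rooms column names are assumed):

import numpy as np

# Inspect implausible observations on size and room count
mask = (df["Size"] < 100) | (df["Size"] > 20000) | (df["Rooms"] > 10)
print(df.loc[mask, ["Name", "Size", "Rooms", "Price"]])

# Replace the implausible values with NaN, then impute with the mean within the same property
df.loc[(df["Size"] < 100) | (df["Size"] > 20000), "Size"] = np.nan
df.loc[df["Rooms"] > 10, "Rooms"] = np.nan
for col in ["Size", "Rooms"]:
    df[col].fillna(df.groupby("Name")[col].transform("mean"), inplace=True)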

2.4 Data Filtering, Substitution and Sorting

  • Remove string-based values from any numerical features and convert them from object into a numerical type (int or float). This applies to Year of Completion, Total Storeys, Number of Blocks, Number of Units, Price and OPR.
  • Under the ‘Maintenance’ variable, a small portion of the values contains the string “/month”. Any string value is stripped off and the column is converted into a numerical type. The “/month” values were converted into a fee per square foot by dividing the value by the size of the property (see the sketch after this list).
Certain values under the maintenance fee were quoted for the whole unit per month. We remove any string value and divide the figure by the size to get the per-square-foot maintenance fee
  • Substitute any ordinal category (category with ranking) data with integers. For the ‘Type’ feature, Flat = 1, Apartment = 2, Condominium/Service Residence = 3. For ‘Tenure’, Leasehold = 0 and Freehold = 1. The increasing number represents higher quality of the feature.
  • The ‘Developer’ feature was ranked based on The Edge Malaysia Property Excellence Awards Top 30 Property Developers. Due to the difficulty of ranking the developers, every developer that made the Edge list is given a rank of 1, while developers that did not make the list are given a value of 0.
  • ‘Median Price Psf’ is omitted before the final modelling because it is highly correlated with the target variable, ‘Price’. ‘Median Price Psf’ is used only for visualization and EDA purposes.
  • Several features were found to be inconsistent. For example, ‘No of Transaction’, which represents the total transaction volume for a particular property from 2008 to 2018 acquired from Brickz.com, is not consistent with the total transaction volume reported by another similar property transaction portal (www.propertyadvisor.com). This variable is also excluded from the final modelling.
  • The dataset is sorted in ascending date order.
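
A minimal sketch of the substitution steps above (the column names and string patterns are assumptions based on the descriptions, not the exact ones in the raw data):

# Convert the "/month" maintenance quotes into a per-square-foot fee
monthly = df["Maintenance"].astype(str).str.contains("/month", na=False)
df["Maintenance"] = pd.to_numeric(
    df["Maintenance"].astype(str).str.replace(r"[^\d.]", "", regex=True), errors="coerce")
df.loc[monthly, "Maintenance"] = df.loc[monthly, "Maintenance"] / df.loc[monthly, "Size"]

# Encode the ordinal categories as integers
df["Type"] = df["Type"].map({"Flat": 1, "Apartment": 2,
                             "Condominium": 3, "Service Residence": 3})
df["Tenure"] = df["Tenure"].map({"Leasehold": 0, "Freehold": 1})

# Developer reputation: 1 if on The Edge Top 30 list, otherwise 0
edge_top30 = ["Developer A", "Developer B"]  # placeholder names for the Edge list
df["Developer"] = df["Developer"].isin(edge_top30).astype(int)

# Sort by transaction date, oldest first
df = df.sort_values("SPADate").reset_index(drop=True)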

The final dataset after the pre-processing activities consists of 39,768 rows and 28 columns.

2.5 Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model, resulting in improved model accuracy on unseen data. Feature engineering is also used to aid visualization and data exploration.

  • For the time series, new features ‘SPAMonth’ and ‘SPAYear’ are generated from the ‘Date’ variable, as most machine learning models cannot handle date-type data directly. ‘SPAMonth’ is used to observe any seasonality effect, while ‘SPAYear’ is used to assess the trend effect. The ‘Date’ variable is dropped before the final modelling process.
  • From the ‘Maintenance’ feature, a new feature ‘MaintenanceRng’ is generated, which represents the maintenance fee range. The ranges are 0–0.10, 0.10–0.15, 0.15–0.20, 0.20–0.25, 0.25–0.30, 0.30–0.35, 0.35–0.40, 0.40–0.45, 0.45–0.50 and 0.50–2.0. ‘MaintenanceRng’ is used for data visualization purposes.
  • The ‘Level Number’ feature is generated from the Address feature. This variable represents the storey level of the transacted unit. For example, in “A1–10–12, Vista Komanwel A”, A1 represents the block, 10 the level number and 12 the designated unit of the house. I used pandas string manipulation to extract the level number from the address (a minimal sketch follows this list).
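
A minimal sketch of the three engineered features (the address pattern assumes the "Block-Level-Unit" format shown in the example):

# Extract year and month from the SPA date
df["SPADate"] = pd.to_datetime(df["SPADate"])
df["SPAYear"] = df["SPADate"].dt.year
df["SPAMonth"] = df["SPADate"].dt.month

# Bin the maintenance fee into ranges for visualization
bins = [0, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 2.0]
df["MaintenanceRng"] = pd.cut(df["Maintenance"], bins=bins)

# Extract the storey level from addresses such as "A1-10-12, Vista Komanwel A"
df["Level Number"] = pd.to_numeric(
    df["Address"].str.extract(r"-(\d+)-")[0], errors="coerce")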

3. Exploratory Data Analysis

The purpose of exploratory data analysis (EDA) is to find meaningful patterns or insights, often with the assistance of visualization methods. EDA is also used to investigate the correlation between dependent and independent variables, or amongst independent variables themselves.

The main interest in this section is to find the factors that influence the dependent variable (Price) by investigating the relationship between the independent variables (Year of Completion, Type, Developer, Maintenance, Block, Total Storeys, Level Number, Total Units, Tenure, Latitude, Longitude, Zone, Rooms, SPAYear, SPAMonth and OPR) and the target variable (Price).

3.1 Descriptive Analysis

The dataset consists of 4 types of high-rise residential: flat (203 properties), apartment (161), condominium (447) and service residence (140). A countplot is used to get the total count of each type of property along with the total transactions for each type of high-rise residential throughout the study period.

Maintenance Fee Histogram

Most of the properties have a maintenance fee below RM0.25 per sqft.

Most of the transactions are for residentials that have 3 rooms.

3.2 Price Trend Analysis

Now let's check the price trend for all types of property from 2008–2018.

Line Chart of property price from 2008–2018
Annual Price Bar Chart 2008–2018
  • The prices of condominiums and service residences showed tremendous growth of almost 100% from 2008 to 2013 and then started to stagnate from 2016 to 2018.
  • Prices for flats and apartments started to pick up traction after 2012. Despite the late explosion of price growth for flats, it can be observed that flats still showed consistent annual growth up to recent years compared to the other three types of property.
  • The surge in prices for apartments and flats after 2012 was due to the prices of condominiums and service residences already being at their peak, prompting house buyers to find cheaper alternatives in apartments and flats.
  • Apartments and service residences showed a slight decline in median price in 2018, but it is too early to conclude that prices are declining since this study only covers the first quarter of 2018.

3.3 Geographic Analysis

The gmaps package is used to visualize the geographical distribution (Latitude & Longitude), which consists of 951 locations. The total area of coverage is approximately 243 km². It can be observed that luxury residentials such as condominiums and service residences tend to populate hot Central Business District areas such as Mont Kiara and KLCC, while cheaper high-rise residentials such as flats tend to populate areas near the boundary of Kuala Lumpur.

Note: green dots = flats, blue dots = apartments, red dots = condos & service residences. I still do not know how to add legends through gmaps.

Geographical distribution of property type.
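
A minimal sketch of the gmaps symbol layer behind the dot map (colours follow the note above; the type codes follow the ordinal encoding from the cleaning section):

import gmaps
gmaps.configure(api_key="YOUR_API_KEY")

fig = gmaps.figure()
colours = {1: "green", 2: "blue", 3: "red"}  # Flat, Apartment, Condo/Service Residence
for type_code, colour in colours.items():
    locations = df.loc[df["Type"] == type_code, ["Latitude", "Longitude"]].drop_duplicates()
    fig.add_layer(gmaps.symbol_layer(locations, fill_color=colour,
                                     stroke_color=colour, scale=2))
fig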

3.4 Heatmap Analysis

A heatmap is used to assess the intensity of price per square foot, using the same gmaps package.

Heatmap analysis

From the heatmap, it can be observed that the highest prices per square foot are concentrated in the KLCC, Mont Kiara and Bangsar areas.
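
A minimal sketch of the price-intensity heatmap, weighting each location by its median price per square foot:

# Weight each property location by its median price per square foot
locations = df[["Latitude", "Longitude"]]
heat_layer = gmaps.heatmap_layer(locations, weights=df["Median Price Psf"])
heat_layer.point_radius = 10  # tune the smoothing radius to taste

heat_fig = gmaps.figure()
heat_fig.add_layer(heat_layer)
heat_fig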

Boxplot of price across multiple neighbourhoods.

3.5 Year of Completion

Based on the histogram, most of the residentials were built after 1990, with most transactions made up of recently built residentials.

Histogram of Year of Completion

Recently completed residentials tend to command a higher price per square foot. From both the Year of Completion histogram and the boxplot, house buyers tend to buy recently completed residentials and are more tolerant toward higher prices due to the additional facilities and up-to-date designs offered by newly built residentials.

3.6 Macroeconomic Analysis (OPR)

The Overnight Policy Rate (OPR) is the interest rate at which a depository institution lends immediately available funds (balances held at the central bank) to another depository institution overnight. The OPR influences the mortgage interest rates imposed by financial institutions.

Price and OPR

House prices surged after BNM (Bank Negara Malaysia, the central bank) reduced the OPR from 3.5% to 2.0% around 2010. BNM raised the rate to 3.0% in early 2010, but the median price started to fall below RM300,000 after mid-2011. The median price can be observed rising steadily as BNM maintained the OPR between about 3.0% and 3.2% from 2012 to 2018. Based on this visualization, the OPR appears to have a negative correlation with housing prices.

3.7 Correlation Analysis

Pairplot between variables
Heatmap between variables
  • Pearson correlation is also used to examine the correlation between the structural variables (a sketch of the plots follows this list). In the context of housing attributes, several features showed strong correlations with each other. For example, “Total Units” and “Maintenance Fee” showed strong positive correlations with “Number of Blocks”.
  • This is because a higher number of blocks can accommodate a higher total number of units for a particular residential and provide more facilities. The same holds for the correlation between ‘Level Number’ and ‘Total Storeys’.
  • The OPR showed a very small correlation with the price, which contradicts our initial hypothesis.
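
A minimal sketch of the pairplot and correlation heatmap above, using seaborn on a subset of the numerical columns (column names assumed):

import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ["Price", "Maintenance", "Block", "Total Unit", "Total Storeys",
                "Level Number", "Rooms", "Size", "OPR"]  # subset chosen for readability
sns.pairplot(df[numeric_cols].dropna())

plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(method="pearson"), annot=True, cmap="coolwarm")
plt.show()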

3.8 Data Transformation

‘Price’ variable before transformation

The Price histogram is not normally distributed and is highly skewed toward the right tail. Skewness is undesirable for the final modelling, as many machine learning algorithms assume the data follows a particular shape (distribution), namely the normal distribution. The Q-Q plot further indicates how much our Price distribution deviates from the normal distribution.

‘Price’ variable after transformation

A log transformation is used to bring the Price closer to a normal distribution. In addition, using a log transformation helps to minimize the impact of outliers.
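
A minimal sketch of the log transformation and the before/after checks (np.log1p is used here; plain np.log works too since prices are strictly positive):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Log-transform the skewed target
df["LogPrice"] = np.log1p(df["Price"])

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df["LogPrice"], bins=50)         # transformed distribution
stats.probplot(df["LogPrice"], plot=axes[1])  # Q-Q plot against the normal distribution
plt.show()

# Remember to invert the transformation when reporting predicted prices:
# predicted_price = np.expm1(predicted_log_price)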

4. Ensemble Models

Ensemble learning is a technique that combines multiple learners to increase the predictive power of the model. For this study, 3 ensemble models are developed: Bagging (Random Forest), Boosting (Extreme Gradient Boosting) and Stacked Generalization (a combination of several different algorithms).

  • The final dataset after all the pre-processing activities discussed in the previous chapter consists of 39,748 rows and 17 columns. The final variables are Price, Year of Completion, Developer Reputation, Maintenance Fee, Number of Blocks, Total Storeys, Total Units, Tenure, Longitude, Latitude, Type, Size, Rooms, Level Number, SPA Year, SPA Month and OPR.
  • We sort the dataset by transaction date, with the oldest transactions at the top. The first 80% of this data becomes the training set and the remaining 20% of the most recent transactions becomes the test set (a minimal sketch follows this list).
  • The target variable, Price, is removed from the test set. At the end of the modelling activities, we compare the predictions for the test set with the actual values.
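
A minimal sketch of the time-ordered 80/20 split (the feature column names are assumed from the final variable list; LogPrice is the transformed target from the previous section):

features = ["Year of Completion", "Developer", "Maintenance", "Block", "Total Storeys",
            "Total Unit", "Tenure", "Longitude", "Latitude", "Type", "Size",
            "Rooms", "Level Number", "SPAYear", "SPAMonth", "OPR"]  # names assumed
target = "LogPrice"

df = df.sort_values("SPADate").reset_index(drop=True)
split = int(len(df) * 0.8)  # oldest 80% for training, most recent 20% for testing
X_train, y_train = df[features].iloc[:split], df[target].iloc[:split]
X_test, y_test = df[features].iloc[split:], df[target].iloc[split:]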

4.1 Bagging (Random Forest)

Random Forest gives predictions based on majority voting (for classification) or averaging (for regression). Each tree (learner) in a random forest is built independently of the others. The advantages of Random Forest are that it requires minimal preparation of the dataset, reduces overfitting, and can generate a separate holdout set (the out-of-bag samples) that can be used as a validation set.

Random forest hyperparameter setting
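
A minimal sketch of the Random Forest regressor (the hyperparameter values here are placeholders, not the ones shown in the setting above):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,   # placeholder hyperparameters
                           max_features="sqrt",
                           oob_score=True,     # out-of-bag samples act as a built-in holdout set
                           n_jobs=-1,
                           random_state=42)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)
rf_pred = rf.predict(X_test)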

4.2 Boosting (XGBoost)

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. A boosting algorithm trains on the data iteratively and updates the instance weights according to the difference between predicted and actual values. How the weights are updated depends on the algorithm being used: misclassified instances get their weight increased, while correctly classified instances get their weight reduced. This leads the subsequent learners to give more attention to the misclassified instances.
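
A minimal sketch of the XGBoost regressor with placeholder hyperparameters:

from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=1000,   # placeholder hyperparameters
                   learning_rate=0.05,
                   max_depth=6,
                   subsample=0.8,
                   colsample_bytree=0.8,
                   random_state=42)
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)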

4.3 Stack Ensemble

Unlike the previous 2 methods, which use multiple models of the same kind to improve accuracy, Stacked Generalization is an ensemble method that uses different kinds of models. What is unique about a stack ensemble is its ability to train a new model by combining previously trained models on the same dataset. The initial level of learners is trained separately and their results are combined using another learner. The result of the stack ensemble can also be achieved simply by averaging the initial learners' predictions or by using a weighted sum.

Hyperparameter for Stack Ensemble
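
A minimal sketch of a stack ensemble using mlxtend's StackingRegressor (one possible implementation; the base learners and the meta-learner here are placeholders, not necessarily the combination shown in the hyperparameter setting above):

from mlxtend.regressor import StackingRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

base_learners = [RandomForestRegressor(n_estimators=300, random_state=42),
                 XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
                 Lasso(alpha=0.001)]
stack = StackingRegressor(regressors=base_learners,
                          meta_regressor=LinearRegression())  # combines the base predictions
stack.fit(X_train.values, y_train.values)
stack_pred = stack.predict(X_test.values)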

5. Model Performance

5.1 Accuracy

We use the coefficient of determination (R²) and the Root Mean Squared Log Error (RMSLE) to determine the prediction accuracy of all our ensemble models.

Performance Comparison between Ensemble Models
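
A minimal sketch of the two metrics, assuming the models were trained on the log-transformed price (predictions are converted back to ringgit before computing RMSLE):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_log_error

def evaluate(y_true_log, y_pred_log, name):
    r2 = r2_score(y_true_log, y_pred_log)
    # Convert back to ringgit prices, then compute RMSLE
    rmsle = np.sqrt(mean_squared_log_error(np.expm1(y_true_log), np.expm1(y_pred_log)))
    print(f"{name}: R2 = {r2:.4f}, RMSLE = {rmsle:.4f}")

evaluate(y_test, rf_pred, "Random Forest")
evaluate(y_test, xgb_pred, "XGBoost")
evaluate(y_test, stack_pred, "Stack Ensemble")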

5.2 Feature Importance

Feature importance indicates how valuable each feature was in the construction of the ensemble model. I could not find any library to generate feature importances for the stack ensemble, so for the time being, let's focus on Random Forest and XGBoost.

Feature Importance for Random Forest
Feature Importance for XGBoost

Surprisingly, Random Forest and XGBoost come up with different feature importances.
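
A minimal sketch of how the two importance plots above can be generated:

import pandas as pd
import matplotlib.pyplot as plt

for name, model in [("Random Forest", rf), ("XGBoost", xgb)]:
    importances = pd.Series(model.feature_importances_, index=features).sort_values()
    importances.plot(kind="barh", title=f"Feature Importance: {name}", figsize=(8, 6))
    plt.show()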

Now let's generate our prediction results.

Generating the prediction results
The result in csv form

Here is the link for the prediction result in csv form

6. Conclusion

Overall, I managed to achieve a high-accuracy model without much hyperparameter tuning. I guess maybe I did the data preparation properly, or the models were robust, or I was purely lucky.

Here are several questions that arose from doing this project:

  1. Why do Random Forest and XGBoost produce different feature importances? Am I doing it wrong, or is it because of the characteristics of the models?
  2. Is there any other way to get the feature importances for a stack ensemble?
  3. Is there a better way to interpret the ensemble models? I understand one way is to draw a single tree and observe the splitting decisions, yet I am still not very good at interpreting tree-split diagrams.
  4. How should I deal with the time series aspect? I know that including time-wise elements such as Date in ensemble tree models is not a good approach since these models cannot extrapolate. I did consider linear regression and ARIMA, but I am having a hard time finding references online since most examples cover one or a few time series; mine involves thousands of time series (even a single property can have thousands of transactions for different unit sizes).
  5. Is it OK to treat ordinal data like numerical data, given that I am aware of the order of importance? For example, I converted flat = 1, apartment = 2, condo = 3 (the higher the number, the more expensive the residential). But an ordinal value like condo = 3 does not necessarily mean it is 3 times better than a flat. So is this assumption correct, or is it better to just use dummy variables instead?

The source code for this project can be found here: https://github.com/izzuddin8803/KL-High-Rise-Price-Modelling/blob/master/House%20pirce%20Prediction%20Model.ipynb

Thank you for reading!!!

And feel free to comment :).
