Project Description

Millions of students leave their hometowns, either internationally or domestically, to continue their education. In most cases, they rent apartments near campus during the academic year. Sometimes, however, students leave temporarily for an exchange program or an internship, and they may then consider subleasing their apartments. Determining a sublease price is not easy: students have an incentive to raise the rate, but a high rate may attract few renters.

In this project, we use machine learning methods to help students in the Chicago area determine a reasonable price at which to sublease their apartments.

 

How to Do It

Our task is divided into two parts: first, we estimate the rent of the apartment; second, based on that estimate, we predict the sublease price. The inputs are the attributes of the apartment being subleased, along with other factors such as the sublease period. A simple illustration is shown in the following figure.

 

Data Set

We collect rental data from Craigslist. We wrote a web spider to find and organize the data on the Craigslist website; the program can retrieve the house/apartment rental data for a particular area. The attributes we use are listed in the following table. Since Craigslist posts are not always complete, we use '?' to denote missing attributes, and we handle them with the same method used in PS2. In total, we gathered about 4,100 listings from Craigslist for the project.

 

| Attribute  | Data Type | Attribute Explanation                        |
|------------|-----------|----------------------------------------------|
| Price      | Float     | The listed rental price                      |
| Bedroom    | Float     | The number of bedrooms                       |
| Bathroom   | Float     | The number of bathrooms                      |
| Area       | Float     | The area of the listed house/apartment       |
| House Type | Nominal   | 0→Apartment, 1→Condo, 2→House                |
| Cat        | Nominal   | 0→No cat, 1→Cat OK                           |
| Dog        | Nominal   | 0→No dog, 1→Dog OK                           |
| Parking    | Nominal   | 0→No parking, 1→Street parking, 2→Garage     |
| Dishwasher | Nominal   | 0→No dishwasher, 1→With dishwasher           |
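
As a rough sketch of how one scraped listing could be turned into a record with the attributes in the table above, the Python snippet below maps '?' to a missing value and then fills the gaps. The fill rule (column mean for numeric attributes, most common value for nominal ones) is an assumption made for illustration; the text only says the PS2 method is used. The names `parse_row` and `impute`, the column names, and the three sample rows are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Column order and kinds follow the attribute table; the names are illustrative.
NUMERIC = ["price", "bedroom", "bathroom", "area"]
NOMINAL = ["house_type", "cat", "dog", "parking", "dishwasher"]
COLUMNS = NUMERIC + NOMINAL

def parse_row(raw):
    """Convert a list of strings into a dict; '?' becomes None."""
    row = {}
    for name, value in zip(COLUMNS, raw):
        if value == "?":
            row[name] = None
        elif name in NUMERIC:
            row[name] = float(value)
        else:
            row[name] = int(value)
    return row

def impute(rows):
    """Fill None with the column mean (numeric) or mode (nominal).

    This fill rule is an assumption, not necessarily the PS2 method.
    """
    filled = [dict(r) for r in rows]
    for name in COLUMNS:
        known = [r[name] for r in rows if r[name] is not None]
        if not known:
            continue  # no observed value to fill from
        if name in NUMERIC:
            fill = mean(known)
        else:
            fill = Counter(known).most_common(1)[0][0]
        for r in filled:
            if r[name] is None:
                r[name] = fill
    return filled

# Made-up listings, only to exercise the functions.
raw_rows = [
    ["1200", "2", "1", "900", "0", "1", "0", "1", "1"],
    ["1500", "2", "?", "1100", "1", "1", "?", "2", "1"],
    ["800", "1", "1", "?", "0", "0", "0", "1", "0"],
]
rows = impute([parse_row(r) for r in raw_rows])
```

Keeping the numeric and nominal columns in separate lists makes it easy to swap in a different fill rule for either kind.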

 

We also collect sublease data from WildcatPad (http://www.wildcatpad.com/) and the BBS of Northwestern (http://bbs.nwucssa.org/). The advertisements provide the attributes listed in Table 1 except the original rent (most people post only the sublease price), so we sent surveys to the students who posted the advertisements and asked for the original rent. So far, more than half have responded. In addition to the attributes above, we consider further attributes such as move-in and move-out dates, utilities, and the number of roommates.

 

Method and Result

(1) Rental Price Prediction:

 

-- Linear Regression

We first tried the intuitive linear regression model and obtained a formula of the form:

$$
\begin{aligned}
\text{price} = {} & 296.60 \cdot \text{bedroom} + 694.75 \cdot \text{bathroom} + 0.03 \cdot \text{area} \\
& + 384.22 \cdot \text{cat} + 138.42 \cdot \text{dog} + 212.27 \cdot \text{housetype} \\
& + 44.30 \cdot \text{parking} + 373.46 \cdot \text{dishwasher} + 76.09
\end{aligned}
$$

The root relative squared error was 60.9023%, which is far too high for this to be a good model.
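
To make the fitted formula concrete, the sketch below evaluates it for one listing and shows how a root relative squared error can be computed. The coefficients are copied from the formula above; the error function uses the standard definition (error relative to always predicting the mean of the actual values), which we assume matches the reported metric. The function names and the sample listing are hypothetical.

```python
from math import sqrt

# Coefficients copied from the fitted linear regression formula.
COEFFS = {
    "bedroom": 296.60, "bathroom": 694.75, "area": 0.03,
    "cat": 384.22, "dog": 138.42, "housetype": 212.27,
    "parking": 44.30, "dishwasher": 373.46,
}
INTERCEPT = 76.09

def predict_price(listing):
    """Apply the linear formula to a dict of attribute values."""
    return INTERCEPT + sum(COEFFS[k] * listing[k] for k in COEFFS)

def root_relative_squared_error(actual, predicted):
    """Standard RRSE: 1.0 means no better than predicting the mean."""
    avg = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - avg) ** 2 for a in actual)
    return sqrt(num / den)

# Made-up listing: 2 bed, 1 bath, 900 sq ft apartment, cat OK,
# street parking, with dishwasher.
example = {
    "bedroom": 2, "bathroom": 1, "area": 900, "cat": 1,
    "dog": 0, "housetype": 0, "parking": 1, "dishwasher": 1,
}
price = predict_price(example)
```

Under this definition, a value of 60.9023% means the model's error is about 61% of the error made by always predicting the mean price.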

 

-- Regression Tree

Since linear regression gives a very large error, we next tried a regression tree. The principle is to first divide the data into small clusters and then apply linear regression within each cluster. With this method, the relative squared error dropped to around 47%.
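
The idea of splitting the data and fitting a separate linear model in each part can be illustrated with a deliberately tiny sketch: one feature, one split, and an ordinary least-squares line in each of the two leaves. This is not the tool or configuration used in the project, only a minimal illustration of the principle; all function names and the toy numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant x: fall back to the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def sse(xs, ys, line):
    """Sum of squared errors of a fitted line on the given points."""
    a, b = line
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def fit_one_split(xs, ys, min_leaf=2):
    """Try every threshold; keep the split whose two leaf lines
    give the lowest total squared error. Returns None if no
    threshold leaves at least min_leaf points on each side."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [(x, y) for x, y in zip(xs, ys) if x < t]
        right = [(x, y) for x, y in zip(xs, ys) if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        lx, ly = zip(*left)
        rx, ry = zip(*right)
        l_line, r_line = fit_line(lx, ly), fit_line(rx, ry)
        err = sse(lx, ly, l_line) + sse(rx, ry, r_line)
        if best is None or err < best[0]:
            best = (err, t, l_line, r_line)
    return best

def predict(model, x):
    """Route x to its leaf and apply that leaf's line."""
    _, t, l_line, r_line = model
    a, b = l_line if x < t else r_line
    return a * x + b

# Toy data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 20, 30, 100, 110, 120]
model = fit_one_split(xs, ys)
```

A single straight line cannot follow the jump in this toy data, while two leaf lines fit it exactly; a full tree repeats the same split search recursively over all attributes.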

 

-- Divide and Classify

One way to increase accuracy is to divide the instances into groups and convert the regression problem into a classification problem, so that we can utilize decision trees to separate the instances better. This is a legitimate simplification: for a sublease, a price range is enough to serve as a guideline. To increase accuracy further, we train the model to predict the rent per room (the price divided by the number of bedrooms). To accommodate studios, which have no separate bedroom, we assume a studio has 0.75 bedrooms, since a studio can be regarded as a smaller version of a one-bedroom apartment. We then round the price per room to a multiple of 200 dollars, which might be an audacious assumption. However, because we are ultimately interested in the sublease price rather than the full rent, a 200-dollar variation in the prediction matters less, especially for apartments with many rooms. Under all of these assumptions, we trained a Random Tree model and achieved an accuracy of 80.33321%. The size of the tree was 595. The result is shown in the following figure. Considering the complexity of the problem, we think the model works quite well.
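
The conversion from a price to a class label can be sketched as follows. The 0.75-bedroom rule for studios and the 200-dollar class width come from the text; rounding to the nearest multiple (rather than down) and encoding a studio as 0 bedrooms in the raw data are assumptions, and the function name is hypothetical.

```python
STUDIO_BEDROOMS = 0.75  # from the text: a studio counts as 0.75 bedroom
BUCKET = 200            # class width in dollars, from the text

def rent_class(price, bedrooms):
    """Return the rent per room rounded to the nearest multiple of 200.

    Assumes a studio is encoded as bedrooms == 0 and that price is
    positive; ties round up.
    """
    rooms = STUDIO_BEDROOMS if bedrooms == 0 else bedrooms
    per_room = price / rooms
    return int(per_room / BUCKET + 0.5) * BUCKET

two_bed = rent_class(1500, 2)  # 750 per room
studio = rent_class(900, 0)    # 900 / 0.75 = 1200 per room
```

The resulting label is what a classifier would be trained to predict; multiplying it back by the number of rooms gives a price range for the whole apartment.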

 

-- Discover Hidden Attribute

An alternative approach to increasing accuracy is to estimate hidden attributes from the price per square foot. In addition to the listed attributes, which can be collected from Craigslist, other important attributes such as decoration and the surrounding environment also affect the rent. To compensate for these hidden attributes, we divide the instances into four groups (poor, fair, good and extravagant) based on price per square foot, which is closely related to them. The price per square foot ranges from 0.5 to 7 and roughly follows a Gaussian distribution, which is consistent with the real market. For each group, we apply random forests and linear regression separately, as shown in the following figures. One interesting observation is that, for both methods, the errors of the poor and extravagant groups are greater than those of the fair and good groups. This is because data in the poor and extravagant groups lie in the tails of the Gaussian distribution, so less data is available for them and the uncertainty is larger.
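
A minimal sketch of the grouping step is given below. The text does not state where the boundaries between the four groups lie, so the cut points used here (mean minus one standard deviation, mean, mean plus one standard deviation) are purely an assumption chosen so that the outer groups fall in the tails; the function names are hypothetical as well.

```python
from statistics import mean, stdev

LABELS = ["poor", "fair", "good", "extravagant"]

def group_cutoffs(ppsf_values):
    """Hypothetical cut points: mean - 1 std, mean, mean + 1 std."""
    m, s = mean(ppsf_values), stdev(ppsf_values)
    return [m - s, m, m + s]

def quality_group(price, area, cutoffs):
    """Assign a listing to a group by its price per square foot."""
    ppsf = price / area
    for label, cut in zip(LABELS, cutoffs):
        if ppsf < cut:
            return label
    return LABELS[-1]  # above every cut point

# Fixed cut points, only for demonstration.
demo_cutoffs = [1.0, 2.0, 3.0]
group = quality_group(1500, 1000, demo_cutoffs)  # 1.5 per sq ft
```

With cut points placed around the mean like this, the poor and extravagant groups receive fewer listings than the two middle groups, which matches the observation about the tails.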

 

(2) Sublease Price Prediction:

 

We sent surveys to those who posted subleases on WildcatPad and the BBS of Northwestern. To increase accuracy, the responses are divided into groups by number of bedrooms, and a linear regression model is fitted to each group; for studios, we assume the number of bedrooms is 1. The relative absolute error is around 24% across the groups. We also find that the sublease duration has less impact on the sublease rate than the original rent does.
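
The grouping step and the error measure can be sketched as follows. The relative absolute error uses the standard definition (total absolute error divided by that of always predicting the mean), which we assume is the metric reported; the record fields `studio`, `bedrooms`, `rent` and `sublease`, the function names, and the sample responses are hypothetical.

```python
from collections import defaultdict

def relative_absolute_error(actual, predicted):
    """Standard RAE: 1.0 means no better than predicting the mean."""
    avg = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(a - avg) for a in actual)
    return num / den

def group_by_bedrooms(responses):
    """Group survey responses by bedroom count; a studio counts as 1."""
    groups = defaultdict(list)
    for r in responses:
        beds = 1 if r["studio"] else r["bedrooms"]
        groups[beds].append(r)
    return dict(groups)

# Made-up survey responses, only to exercise the functions.
responses = [
    {"studio": True, "bedrooms": 0, "rent": 900, "sublease": 700},
    {"studio": False, "bedrooms": 1, "rent": 1100, "sublease": 900},
    {"studio": False, "bedrooms": 2, "rent": 1600, "sublease": 1300},
]
groups = group_by_bedrooms(responses)
```

A separate regression of the sublease price on the original rent and the other attributes would then be fitted to each entry of `groups`.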

 

Implementation

To better illustrate the results of our project, the sublease price estimator described above is implemented on this website. Feel free to play with it here.