In this module, we focused on using regression to predict a continuous value (house prices) from features of the house (square feet of living space, number of bedrooms, ...). We also built an iPython notebook for predicting house prices, using data from King County, USA, the region where the city of Seattle is located.
In this assignment, we are going to build a more accurate regression model for predicting house prices by including more features of the house. In the process, we will also become more familiar with how the Python language can be used for data exploration, data transformations, and machine learning. These techniques will be key to building intelligent applications.
Follow the rest of the instructions on this page to complete your program. When you are done, instead of uploading your code, you will answer a series of quiz questions (see the quiz after this reading) to document your completion of this assignment. The instructions will indicate what data to collect for answering the quiz.
Learning outcomes
- Execute programs with the iPython notebook
- Load and transform real, tabular data
- Compute summaries and statistics of the data
- Build a regression model using features of the data
Resources you will need
You will need to install the software tools or use the free Amazon EC2 Machine. Instructions for both options are provided in the reading for Module 1.
Now you are ready to get started!
What you will do
We are going to do three tasks in this assignment. There are three results you need to gather along the way to enter into the quiz after this reading.
1. Selection and summary statistics: In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.
2. Filtering data: One of the key features we used in our model was the number of square feet of living space in the house. For this part, we are going to use the idea of filtering data.
- In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in the LogicalFile...
- Using such a filter, first select the houses that have sqft_living higher than 2000 sqft but no larger than 4000 sqft.
- What fraction of all the houses have sqft_living in this range? Save this result to answer the quiz at the end.
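The selection and filtering logic in tasks 1 and 2 can be sketched in plain Python on a tiny made-up table. This is only an illustration of the logic, not the notebook's SFrame code: the rows, the zip code labels "A" and "B", and the helper names `mean_price_by_zipcode` and `fraction_in_range` are all assumptions invented for this example. In the notebook, the same conditions would be applied as logical filters on the SFrame columns.

```python
# Toy rows standing in for the house sales table; every value here is made up.
sales = [
    {"zipcode": "A", "price": 300000.0, "sqft_living": 1500},
    {"zipcode": "A", "price": 500000.0, "sqft_living": 2500},
    {"zipcode": "B", "price": 900000.0, "sqft_living": 3800},
    {"zipcode": "B", "price": 1100000.0, "sqft_living": 4200},
    {"zipcode": "B", "price": 1000000.0, "sqft_living": 4000},
]


def mean_price_by_zipcode(rows):
    """Return a dict mapping each zip code to its average sale price."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["zipcode"], (0.0, 0))
        totals[row["zipcode"]] = (total + row["price"], count + 1)
    return {z: total / count for z, (total, count) in totals.items()}


def fraction_in_range(rows, low, high):
    """Fraction of rows with low < sqft_living <= high."""
    selected = [row for row in rows if low < row["sqft_living"] <= high]
    return len(selected) / len(rows)


# Task 1: find the zip code with the highest average price,
# keep only its rows, and average their prices.
means = mean_price_by_zipcode(sales)
top_zipcode = max(means, key=means.get)
top_rows = [row for row in sales if row["zipcode"] == top_zipcode]
top_mean = sum(row["price"] for row in top_rows) / len(top_rows)

# Task 2: "higher than 2000 but no larger than 4000" means 2000 < x <= 4000.
fraction = fraction_in_range(sales, 2000, 4000)

print(top_zipcode, top_mean, fraction)
```

Note that the upper bound is inclusive and the lower bound is exclusive; swapping either comparison changes which houses are counted and therefore the fraction.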
3. Building a regression model with several more features: In the sample notebook, we built two regression models to predict house prices, one using just 'sqft_living' and the other using a few more features; we called this set []
Now, going back to the original dataset, you will build a model using the following features:
Note that using copy and paste from this webpage to the iPython notebook sometimes does not work perfectly in some operating systems, especially on Windows. For example, the quotes defining strings may not paste correctly. Please check carefully if you use copy and paste.
- Compute the RMSE (root mean squared error) on the test_data for the model using just my_features, and for the one using advanced_features.
Note 1: both models must be trained on the original sales dataset, not the filtered one.
Note 2: when doing the train-test split, make sure you use seed=0, so you get the same training and test sets, and thus the same results, as we do.
Note 3: in the module we discussed residual sum of squares (RSS) as an error metric for regression, but GraphLab Create uses root mean squared error (RMSE). These are two common measures of error for regression, and RMSE is simply the square root of the mean RSS:
$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{RSS}}{N}}$$
where N is the number of data points. RMSE can be more intuitive than RSS, since its units are the same as those of the target column in the data (in our case, dollars), and it doesn't grow with the number of data points, as the RSS does.
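As a small worked example of the relationship between RSS and RMSE, the following plain-Python sketch computes both on made-up prices and predictions. The numbers and the helper names `rss` and `rmse` are assumptions for illustration only; they are not part of GraphLab Create.

```python
import math


def rss(actual, predicted):
    """Residual sum of squares: the sum of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: the square root of RSS divided by N."""
    return math.sqrt(rss(actual, predicted) / len(actual))


# Made-up prices in dollars; every prediction is off by exactly $10,000.
actual = [300000.0, 450000.0, 520000.0, 610000.0]
predicted = [310000.0, 440000.0, 530000.0, 600000.0]

rss_small = rss(actual, predicted)
rmse_small = rmse(actual, predicted)

# Repeat the same data twice: RSS doubles, RMSE stays the same.
rss_large = rss(actual * 2, predicted * 2)
rmse_large = rmse(actual * 2, predicted * 2)

print(rss_small, rmse_small)
print(rss_large, rmse_large)
```

Because every error in this toy data is $10,000, the RMSE comes out in dollars at that same scale regardless of how many points there are, while the RSS is in squared dollars and grows as points are added.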
Important note: when answering the question below using GraphLab Create, when you call the linear_regression.create() function, make sure you use the para