House Price Prediction using Machine Learning in Python

Welcome back! In today’s tutorial, we are going to walk through the full data science and machine learning workflow: we will analyze the dataset, preprocess the data, engineer custom features, train and evaluate models, and even cover advanced concepts like hyperparameter tuning. This tutorial is perfect for beginners in data science and machine learning, as well as anyone interested in these fields.

Prerequisites:
Before we get started, make sure you have a working IPython notebook environment. This can be Jupyter notebook, Jupyter Lab, or IPython notebooks in VS Code or PyCharm. In this tutorial, we will be using IPython notebooks for their flexibility and ease of use. Additionally, you will need to install the following libraries: NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn. These libraries are the foundation of data science and machine learning, providing essential functionality for analysis, visualization, and modeling.
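If any of these libraries are missing, they can usually be installed with pip (assuming a standard Python setup; these are the usual PyPI package names):

pip install numpy pandas matplotlib seaborn scikit-learn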

The Dataset:
For this project, we will be using the California Housing Prices dataset from Kaggle. Each row describes one housing district, with features such as the coordinates, the median house age, total rooms, total bedrooms, population, households, median income, and the categorical ocean_proximity. Our goal is to predict the median house value based on these features. To follow along, you can download the dataset from the Kaggle link.

Now let’s jump right in!

Exploratory Data Analysis

First, let’s load the dataset into our notebook and take a look at the data. We use Pandas to read the CSV file and store it in a DataFrame.

import pandas as pd

# Load the CSV file into a DataFrame
data = pd.read_csv('housing.csv')

# Print the data to get a glimpse
data

Next, let’s explore the data a bit further. We can check for missing values and column data types with the info function, and print summary statistics with describe.

# Check for missing values
data.info()

# Print summary statistics
data.describe()

We notice that there are some missing values in the dataset, specifically in the ‘total_bedrooms’ feature (around 200 of the roughly 20,600 rows). Since so few rows are affected, we can simply drop them using the dropna function.

# Drop rows with missing values
data.dropna(inplace=True)

# Check again for missing values
data.info()
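Dropping rows is the simplest approach. As an alternative, one could instead impute the missing values, for example with the column median via scikit-learn’s SimpleImputer. A quick sketch of that option (not what we do in this tutorial):

from sklearn.impute import SimpleImputer

# Alternative: fill missing 'total_bedrooms' values with the column median
imputer = SimpleImputer(strategy='median')
data[['total_bedrooms']] = imputer.fit_transform(data[['total_bedrooms']])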

Data Visualization

Now that we have a clean dataset, let’s visualize the data to gain insights and understand the relationships between different features. We can start by plotting histograms to visualize the distribution of each feature.

import matplotlib.pyplot as plt

# Plot histograms for each feature
data.hist(figsize=(15, 8))
plt.show()

We can also create a correlation heatmap to visualize the correlation between features. This will help us identify which features are more relevant for predicting the median house value.

import seaborn as sns

# Create a correlation heatmap (numeric_only=True skips the
# categorical 'ocean_proximity' column, which corr cannot handle)
plt.figure(figsize=(15, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

From the histogram plots, we can observe that several features, such as total_rooms, total_bedrooms, population, and households, are heavily right-skewed. To address this, we can apply a logarithmic transformation to bring their distributions closer to normal.

import numpy as np

# Apply log(1 + x) to the right-skewed features
# (log1p is used so that zero values are handled safely)
data['total_rooms'] = np.log1p(data['total_rooms'])
data['total_bedrooms'] = np.log1p(data['total_bedrooms'])
data['population'] = np.log1p(data['population'])
data['households'] = np.log1p(data['households'])
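To verify that the transformation helped, we can re-plot the histograms of just the transformed columns; their distributions should now look much closer to a bell curve:

# Re-plot the transformed features to confirm the skew is reduced
data[['total_rooms', 'total_bedrooms', 'population', 'households']].hist(figsize=(15, 8))
plt.show()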

Now, let’s take a look at the correlation heatmap again to see if we observe any changes.

# Create a correlation heatmap after transformation
plt.figure(figsize=(15, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

Feature Engineering

In this step, we can engineer new features that might help predict the target variable. For example, we can compute the bedroom ratio (the number of bedrooms per room) and rooms per household (the average number of rooms per household). Note that since we already log-transformed these columns, the ratios are taken on the log-scaled values, which still works as a relative measure.

# Engineer new features
data['bedroom_ratio'] = data['total_bedrooms'] / data['total_rooms']
data['rooms_per_household'] = data['total_rooms'] / data['households']

Again, let’s take a look at the correlation heatmap to see if these new features are correlated with the target variable.

# Create a correlation heatmap after feature engineering
plt.figure(figsize=(15, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
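Rather than eyeballing the full heatmap, we can also rank the features directly by their correlation with the target:

# Rank features by correlation with the target variable
correlations = data.corr(numeric_only=True)['median_house_value']
print(correlations.sort_values(ascending=False))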

Model Training and Evaluation

Now that we have prepared our dataset and engineered new features, it’s time to train and evaluate our models. We will start with a simple linear regression model and then move on to a more powerful random forest regressor.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target, and one-hot encode the categorical
# 'ocean_proximity' column so every feature is numeric
X = data.drop('median_house_value', axis=1)
X = pd.get_dummies(X, columns=['ocean_proximity'])
y = data['median_house_value']

# Split the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a simple linear regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)

# Evaluate the linear regression model (score returns R^2)
linear_reg_score = linear_reg.score(X_test_scaled, y_test)
print(f'Linear regression R^2: {linear_reg_score:.4f}')

# Train a random forest regressor with default parameters
# (random_state is fixed so results are reproducible)
random_forest = RandomForestRegressor(random_state=42)
random_forest.fit(X_train_scaled, y_train)

# Evaluate the random forest regressor
random_forest_score = random_forest.score(X_test_scaled, y_test)
print(f'Random forest R^2: {random_forest_score:.4f}')

After training and evaluating the models, we can compare their performance. In our case, the random forest regressor clearly outperformed the linear regression model, reaching an R² score of 81.43% on the test set (the score method of scikit-learn regressors returns R²). This suggests that the random forest is the better model for predicting house prices here.
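Since R² alone can be hard to interpret, it can also help to compare the two models by root mean squared error (RMSE), which is expressed in the same units as the house prices. Here is a minimal sketch reusing the models trained above:

import numpy as np
from sklearn.metrics import mean_squared_error

# Compare both models by RMSE (same units as the target)
for name, model in [('Linear regression', linear_reg), ('Random forest', random_forest)]:
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
    print(f'{name} RMSE: {rmse:,.0f}')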

Hyperparameter Tuning

To further improve the performance of the random forest regressor, we can tune its hyperparameters using grid search with cross-validation. We define a parameter grid over the number of estimators, the maximum number of features considered per split, and the minimum number of samples required to split a node. Note that the grid below contains 3 × 4 × 4 = 48 combinations, each fitted 5 times under cross-validation, so this step can take a while.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': [2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 8]
}

# Perform grid search with 5-fold cross-validation
# (n_jobs=-1 runs the fits in parallel on all CPU cores)
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid,
                           cv=5, scoring='neg_mean_squared_error',
                           return_train_score=True, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Get the best estimator and evaluate its performance
best_estimator = grid_search.best_estimator_
best_estimator_score = best_estimator.score(X_test_scaled, y_test)

After the grid search finishes, grid_search.best_estimator_ gives us the random forest refit on the full training set with the best hyperparameter combination. We can then measure its score on the test data to check whether it improved over the default random forest regressor.
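To see which hyperparameter combination won, we can print grid_search.best_params_ and compare the tuned model’s test score against the default random forest (variable names as defined above):

# Inspect the winning hyperparameters and compare test R^2 scores
print('Best parameters:', grid_search.best_params_)
print(f'Default random forest R^2: {random_forest_score:.4f}')
print(f'Tuned random forest R^2:   {best_estimator_score:.4f}')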

Conclusion

In this tutorial, we went through the entire process of data science and machine learning for house price prediction. We started by exploring the dataset, performing data preprocessing, and engineering new features. Then, we trained and evaluated models such as linear regression and random forest regressor. Finally, we performed hyperparameter tuning using grid search with cross-validation to improve the model’s performance. By the end, we were able to find the best estimator and measure its accuracy on the test data.

I hope you enjoyed this comprehensive tutorial and gained valuable knowledge about the data science and machine learning process. If you have any questions or feedback, feel free to leave a comment. Don’t forget to subscribe to our channel and visit Techal for more exciting tech content. Thank you for reading!
