How is AI being used in real estate?

One of the key applications is predictive analytics, which leverages historical data and algorithms to forecast trends and make accurate predictions. Real estate professionals can use AI-powered tools to analyze market conditions, evaluate property values, and identify investment opportunities with greater precision.

Will AI replace realtors?

Real estate will always need agents. While AI is certainly making waves in the real estate industry, agents are far from being replaced. Real estate agents continue to be indispensable players within this industry.

How will Generative AI change real estate?

Text-to-image generative AI can elevate property photos and make a property that's not quite there yet ready for listing and attractive to buyers, without the agent's investment in staging or finishing.

With AI-powered insights, we help clients predict CRE opportunities before they occur. JLL uses AI and machine learning to analyze over 25 trillion internal and external data points, driving better outcomes for capital investments, risk management and portfolio strategies.

What are the risks of AI in real estate?

AI systems can be used to generate fake data that directly contradicts reliable data, such as property data, neighborhood data, and climate data. But that's not the only problem that AI could bring to the real estate sector.

What is the benefit of AI in real estate?

AI can determine the current state of a building and suggest measures to improve it. It can also help you predict the property market by exploring huge datasets to estimate which direction the market will take, and it can provide information on local, national, and international scales.

How can AI help me buy a house?

Artificial intelligence is increasingly used to assist in the homebuying process. Virtual assistants can answer questions, mine listings for specific characteristics, connect homebuyers with local real estate agents, and a whole lot more.

How can AI help with property management?

Improved Operational Efficiency: AI can automate many routine property management tasks, such as rent collection, maintenance requests, and tenant communication. This automation can significantly reduce the administrative burden on property managers, allowing them to focus on more strategic aspects of their roles.

How are AI and ML changing the real estate sector?

Integrating AI within the real estate industry allows organizations to optimize various aspects of their business. By implementing automated workflows, firms can effectively monitor and manage progress to ensure compliance with regulations.

US Real Estate – Harnessing the Power of AI (2024)

This is the continuation of our recent use-case series dedicated to real estate (RE) monitoring, trend analysis and forecasting. In this series, the focus is on US house prices, using supervised machine learning (ML) and artificial intelligence (AI) algorithms available in Python, the language with the largest variety of libraries on the subject (Scikit-learn, TensorFlow, PyTorch, Keras, Spark MLlib, etc.). Our objective is to incorporate these algorithms into the real estate decision-making process in a supporting role. Recall that decision-making is a critical part of a typical real estate property valuation, which aims at quantifying the market value of a property according to its qualitative characteristics. Because visualization is a prominent feature of this kind of problem, ML/AI ETL pipelines are commonly used to support RE decision analysis. Within the context of testing and validation strategies, it is important to examine the training errors and limitations of ML/AI that stem from its inherent pattern-recognizing nature.

US Real Estate – Harnessing the Power of AI (1)

ML/AI is defined as follows: a program learns from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with E. ML is a part of AI. ML algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. ML is an important subset of data science. Through the use of statistical methods, data science algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights subsequently drive decision making within applications and businesses, ideally impacting key BI/fintech metrics.

Bottom Line: ML is the science of getting computers to learn, without being explicitly programmed.

Contents:

Housing Crunch

The August 2022 NAHB measure of homebuilder confidence fell below 50 for the first time since May 2020. Housing starts for July dropped 9.6%, more than expected (although permits dropped less than forecast). And most recently the NAR reported that July existing home sales fell 5.9%, more than anticipated.
The August 2022 housing data did much to confirm a slowdown sought by the Federal Reserve. Along with what may have been peak inflation last week, cooler housing data is another piece in the puzzle as the FOMC tightens conditions.

“Existing home sales have now fallen for 6 months in a row, and are 26% lower than the January 2022 peak,” Pantheon Macro Economist Ian Shepherdson said. “But the bottom is still some way off, given the degree to which demand has been crushed by rising rates; the required monthly mortgage payment for a new purchaser of an existing single-family home is no longer rising, but it was still up by 51% year-over-year in July 2022.”
“Home sales likely have further to fall,” Odeta Kushi, deputy chief economist at First American Financial, tweeted. “Mortgage applications so far in August 2022 point to another decline in existing-home sales. This month’s number of 4.81 million puts us at about 2014 sales level.”
“Fed officials pay particularly close attention to the housing market and are monitoring how higher mortgage rates are impacting home sales and housing prices in order to gauge how tighter monetary policy is affecting the broader economy,” Wells Fargo economists wrote.

This post provides an optimized solution to the problem of unclear RE market changes by giving brokerages and clients access to an ML-backed RE solution that draws upon different housing data sources that are kept up to date.

Content

The paper is divided into the following sections: Business Case (see above), supervised ML Methodology, IDE and learning Prerequisites, ETL Python Workflow & Pipeline, multi-scale RE Use Cases using comprehensive open-source housing datasets (US states and beyond), and Conclusions. Sections contain related links listed in References. Due to the scale of the case studies, the entire ML project is split into several Jupyter notebooks: EDA and data cleaning, preprocessing and feature engineering, and model tuning and insights. Each input dataset is limited in scope, both in terms of the time frame captured and the location. Each training model is also specific to houses in a city or county and may not be as accurate when applied to data from another US state, where house prices may be affected by different factors. The aim of specific training models is not to give a perfect prediction, but to act as a guideline to inform RE decisions. In reality, house prices may be difficult to predict as they are also affected by buyers’ psychology, the economic climate, and other factors not included in the dataset.

Methodology

We consider the supervised ML techniques (see charts below) when we are given a (training) dataset and already know what our correct output should look like, providing the idea that there is an intrinsic relationship between the input and output data. In this study, house price prediction is regarded as a regression problem, meaning that we are trying to map input variables or features (the size of houses, area, etc.) to a continuous function (house price).

US Real Estate – Harnessing the Power of AI (2)

The supervised ML algorithm consists of the following steps:

Create labeled data (label is the true answer for a given input, the house price $ is the label)
Perform model training, testing and cross-validation
Deploy trained models
Evaluate and tune deployed models
Avoid creating high bias/variance

Model training and evaluation is performed using chosen metrics and objectives. For example, the loss metric is a sum of squares between observed and predicted house prices.

US Real Estate – Harnessing the Power of AI (3)

US Real Estate – Harnessing the Power of AI (4)

The above three-step ML methodology is a way to use regression algorithms to derive predictive insights from housing data and make repeated RE decisions. Qualities of good data (output of EDA): it has coverage, is clean, is complete.

The broader your data’s coverage, the more robust your training model will be. Dirty data can make ML hard in terms of goodness-of-fit. Incomplete data can limit performance.

Here is the list of 10 popular ML regression algorithms:

Conventionally, the Exploratory Data Analysis (EDA) of the dataframe df is carried out using histograms df.plot(kind='hist') and pairplots sns.pairplot().

The Feature Engineering (FE) phase consists of the following steps: Log Transform np.log() or Square Root Transform np.sqrt(), Feature Importance analysis coef_.ravel(), and Feature Scaling using StandardScaler() (most common option), RobustScaler() (not widely used option), and MinMaxScaler() (least robust to outliers).
The typical regression algorithm is linear/polynomial regression with/without regularization (Lasso, Ridge, etc.) and/or Hyper-Parameter Optimization (HPO).
The Model Evaluation phase may (optionally) include the following comparisons: Ridge vs Lasso and Normal vs Polynomial.
The cross-validation metrics utilities can be used to compute some useful statistics of the prediction performance. The statistics computed include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), and median absolute percent error (MDAPE).
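As a minimal sketch of how these statistics relate to each other, the snippet below computes them with NumPy for a handful of made-up prices. The helper name regression_metrics and the numbers are illustrative assumptions, not part of any dataset used in this post; the percent errors assume that no observed price is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return common error statistics for a regression model (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ape = np.abs(err / y_true)  # absolute percent error; assumes no zero prices
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(ape) * 100),     # in percent
        "MDAPE": float(np.median(ape) * 100),  # in percent
    }

# Illustrative observed and predicted house prices in USD
observed = [200_000, 400_000, 250_000, 500_000]
predicted = [210_000, 380_000, 250_000, 450_000]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:,.2f}")
```

MSE and RMSE penalize large misses more strongly than MAE, while MAPE and MDAPE express the error relative to the price level, which is often easier to communicate in an RE setting.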

Prerequisites

We begin by setting up the Python-based IDE using the latest version of Anaconda, which contains the Jupyter Notebook, coupled (optionally) with Datapane. The latter allows you to share an HTML link in which you can lay out your analysis as a report. When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folder). No configuration is necessary if you place your notebooks in your home folder or subfolders. Otherwise, you need to choose a Jupyter Notebook App start-up folder which will contain all the notebooks. Read more here.

Check ML learning prerequisites here.

Workflow

The general workflow to create the model will be as follows:

Data handling (loading, cleaning, editing or preprocessing)

Exploratory Data Analysis (EDA)/Feature Engineering (FE)

We use Feature Engineering to deal with missing values, outliers, and categorical features

Model training & hyperparameter tuning

We use various ML models and train/test them on train/test data, viz. after tuning all the hyperparameters, test the model on test data

Model testing, QC diagnostics, evaluation and final deployment

Apply predictions, result interpretation, visualization and export.

Below is the more detailed sequence of steps:

Import Libraries and Loading Dataset

Example: use Python, opendatasets to load the data from the Kaggle platform, pandas to read and manipulate the data, seaborn, matplotlib, plotly and geopandas for data point visualizations, and sklearn for data preprocessing and training algorithms.

EDA & Data Visualization/Overview

Use a variety of data visualization tools to analyze the tabular data and to discover which data cleaning procedures are needed (e.g. looking for missing values and outliers, removing unnecessary values or columns and duplicate values, and fixing errors that may be human-made mistakes during recording).

Feature Engineering & Selection to improve a model’s predictive performance

Use feature selection techniques such as Feature Importance (using ML algorithms such as Lasso and Random Forest), Correlation Matrix with Heatmap, or Univariate Selection. For example, we may choose the Heatmap correlation matrix technique to select features with correlations higher than zero.

Data preparation/preprocessing using feature scaling, encoding, and imputing

For example, the function preprocess_data(data) consists of remove_duplicates(), check_missing(), resolve_missing(), and change_types(); it takes in raw data and converts it into data that is ready for making predictions. Here are the steps to be done:

Identify the input and target column(s) for training the model.

Identify numeric and categorical input columns.

Impute (fill) missing values in numeric columns.

Scale values in numeric columns to a (0,1) range.

Encode categorical data into one-hot vectors.

Split the dataset into training and validation sets.
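The steps above can be sketched with scikit-learn's ColumnTransformer. The tiny dataframe and its column names (area_sqft, rooms, zone, price) are made up for illustration, and this is one possible arrangement rather than the preprocess_data() function mentioned above. Note one deliberate design choice: the split is done before fitting the imputer, scaler and encoder, so that their statistics come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny made-up dataset; the column names are hypothetical
df = pd.DataFrame({
    "area_sqft": [850.0, 1200.0, np.nan, 2000.0, 1500.0, 950.0],
    "rooms": [2, 3, 3, 5, 4, 2],
    "zone": ["urban", "suburb", "urban", "rural", "suburb", "urban"],
    "price": [150_000, 230_000, 210_000, 320_000, 280_000, 160_000],
})

# Identify the input and target columns
target_col = "price"
input_cols = [c for c in df.columns if c != target_col]

# Identify numeric and categorical input columns
numeric_cols = df[input_cols].select_dtypes(include=np.number).columns.tolist()
categorical_cols = df[input_cols].select_dtypes(exclude=np.number).columns.tolist()

# Impute and scale numeric columns to (0, 1); one-hot encode categorical columns
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Split into training and validation sets, then fit on the training rows only
X_train, X_val, y_train, y_val = train_test_split(
    df[input_cols], df[target_col], test_size=0.33, random_state=42)
X_train_prepared = preprocessor.fit_transform(X_train)
X_val_prepared = preprocessor.transform(X_val)

print(X_train_prepared.shape, X_val_prepared.shape)
```

Categories that appear only in the validation rows are encoded as all zeros because of handle_unknown="ignore", which keeps the transform from failing on unseen values.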

Robust model training and hyperparameter tuning

For example, we may decide to train scikit-learn models such as Random Forest, Gradient Boosting and Extra Trees, as well as LightGBM and CatBoost, on the data.

The predictions from the model can be evaluated using a loss function like the Root Mean Squared Error (RMSE).

We can use the trained model to generate predictions for the training, testing and validation inputs by calculating the R-square in each case. The final score can be the model score and the training/testing accuracy.
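A minimal sketch of such an evaluation is given below. It assumes synthetic data from make_regression and a GradientBoostingRegressor chosen purely for illustration; the 60/20/20 split is also an assumption, not a setting used later in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a housing dataset
X, y = make_regression(n_samples=600, n_features=6, noise=15.0, random_state=0)

# 60% training, 20% validation, 20% testing
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

scores = {}
for name, X_part, y_part in [("train", X_train, y_train),
                             ("validation", X_val, y_val),
                             ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    scores[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
        "r2": float(r2_score(y_part, pred)),
    }
    print(f"{name:>10}: RMSE={scores[name]['rmse']:.2f}  R2={scores[name]['r2']:.3f}")
```

A training score that is much better than the validation and test scores is a hint that the model is overfitting.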

Case 1: US

Let’s set the working directory YOURPATH

import os
os.chdir('YOURPATH')
os.getcwd()

and import the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

Let’s read the Kaggle dataset

houseDF = pd.read_csv('USA_Housing.csv')

and check the file content

houseDF.shape

(5000, 7)

houseDF.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'], dtype='object')

houseDF.dtypes

Avg. Area Income                float64
Avg. Area House Age             float64
Avg. Area Number of Rooms       float64
Avg. Area Number of Bedrooms    float64
Area Population                 float64
Price                           float64
Address                          object
dtype: object

The info is

houseDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

and the first 5 rows are

houseDF.head(5)

US Real Estate – Harnessing the Power of AI (5)

while the input data descriptive statistics is

US Real Estate – Harnessing the Power of AI (6)

The input data pairplot is

fig=sns.pairplot(houseDF)
fig.savefig("pairplot.png")

US Real Estate – Harnessing the Power of AI (7)

and the correlation heatmap is

# numeric_only=True skips the text column Address (needed in pandas 2.0+)
corr_plot = sns.heatmap(houseDF.corr(numeric_only=True), annot=True)
fig = corr_plot.get_figure()
fig.savefig("corrplot.png")

US Real Estate – Harnessing the Power of AI (8)

Let’s separate features and target variables

X = houseDF[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]

Y = houseDF['Price']

Let’s split the data into the train and test subsets as 70:30%, respectively,

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test= train_test_split(X,Y,test_size=0.30, random_state=1)

Let’s apply the LinearRegression() to the training data

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,Y_train)

Let’s make predictions

predictions = lm.predict(X_test)

and plot the result

plt.scatter(Y_test,predictions)
plt.title('LinearRegression()')
plt.xlabel('Observed Test Data')
plt.ylabel('Predicted Test Data')
plt.savefig('testlinreg.jpg')

US Real Estate – Harnessing the Power of AI (9)

Let’s compare it with the xgboost algorithm

import xgboost as xg
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
reg = xg.XGBRegressor(objective='reg:squarederror',
                      n_estimators=1000, seed=123)

reg.fit(X_train,Y_train)
predictions = reg.predict(X_test)

US Real Estate – Harnessing the Power of AI (10)

We can see that LinearRegression() yields more accurate predictions than XGBRegressor() on this dataset. The same considerations apply to other sklearn algorithms (SVR, TweedieRegressor, RandomForestRegressor, etc.).
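To back such a visual comparison with numbers, one option is to compute the test RMSE and R-square for each model. The sketch below uses synthetic, nearly linear data from make_regression (an assumption, so that the example runs without USA_Housing.csv) and RandomForestRegressor as a stand-in for the second model. It is not a reproduction of the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a linear input/output relationship plus noise
X, Y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(Y_test, predictions))),
        "r2": float(r2_score(Y_test, predictions)),
    }
    print(f"{name}: RMSE={results[name]['rmse']:.2f}, R2={results[name]['r2']:.3f}")
```

When the underlying relationship is close to linear, a plain linear model can outperform a more flexible tree ensemble on unseen data, which is consistent with the observation above.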

Case 2: CA

Let’s look at the median house prices for California districts derived from the 1990 census. This is the dataset used in the second chapter of Aurélien Géron’s recent book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’. The ultimate goal of end-to-end ML is to build an RE prediction engine capable of minimizing the error rate, measured by RMSE (Root Mean Square Error), MAE (Mean Absolute Error) or any other metric of interest.

Let’s set the working directory YOURPATH

import os

os.chdir('YOURPATH')
os.getcwd()

and import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Let’s read the data

housing_data = pd.read_csv("housing.csv")
housing_data

US Real Estate – Harnessing the Power of AI (11)

representing 20640 rows × 10 columns.

The data info is

housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Let’s plot the ocean proximity bar chart

housing_data["ocean_proximity"].value_counts().plot(kind="barh")

US Real Estate – Harnessing the Power of AI (12)

We can see that the “ISLAND” count is negligible compared to that of “<1H OCEAN”.

The descriptive statistics of input data is

housing_data.describe()

US Real Estate – Harnessing the Power of AI (13)

Let’s plot the histogram of median income

housing_data["median_income"].hist()

US Real Estate – Harnessing the Power of AI (14)

Let’s introduce 5 categories of median income

housing_data["income_cat"] = pd.cut(housing_data["median_income"],
                                    bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                                    labels=[1, 2, 3, 4, 5])
housing_data["income_cat"].value_counts()

3    7236
2    6581
4    3639
5    2362
1     822
Name: income_cat, dtype: int64

and plot histograms of these categories

housing_data["income_cat"].hist()

US Real Estate – Harnessing the Power of AI (15)

Let’s introduce the target variable median_house_value and the model features

y = housing_data["median_house_value"]
X = housing_data.drop("median_house_value", axis=1)
X

US Real Estate – Harnessing the Power of AI (16)

with 20640 rows × 10 columns.

Let’s split the data into 67% for training and 33% for testing

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Let’s select StratifiedShuffleSplit that provides train/test indices to split data in train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2,random_state=42)
for train_index, test_index in split.split(housing_data, housing_data["income_cat"]):
    strat_train_set = housing_data.loc[train_index]
    strat_test_set = housing_data.loc[test_index]

Let’s check strat_test_set value count in terms of income_cat as a fraction

strat_test_set["income_cat"].value_counts() / len(strat_test_set)

3    0.350533
2    0.318798
4    0.176357
5    0.114341
1    0.039971
Name: income_cat, dtype: float64

We can see only 4% of strat_test_set belongs to income_cat=1 as compared to 35% of strat_test_set that belongs to income_cat=3.
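One way to check that stratification preserves the class proportions is to compare the stratified test set against the full dataset and against a purely random split. The sketch below is an illustration under stated assumptions: it draws a synthetic log-normal median_income (so it runs without housing.csv) and reuses the same income bins as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for median_income (illustrative, not the census data)
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5]).astype(int)

# Stratified split on income_cat
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["income_cat"]))
strat_test = df.iloc[test_idx]

# Purely random split for comparison
_, random_test = train_test_split(df, test_size=0.2, random_state=42)

def proportions(frame):
    """Fraction of rows in each income category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(df),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions almost exactly, whereas the random column is free to drift, which matters most for small categories such as income_cat=1.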

Let’s plot the histograms of training data

US Real Estate – Harnessing the Power of AI (17)

Let’s plot the geo-location map population and housing median age vs median house value

# housing is assumed to be a working copy of the stratified training set
# (the assignment is not shown in the original notebook excerpt)
housing = strat_train_set.copy()

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.savefig('camappopulationhouseprice.png')

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["housing_median_age"], label="housing_median_age", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.savefig('camaphouseagehouseprice.png')

US Real Estate – Harnessing the Power of AI (18)

US Real Estate – Harnessing the Power of AI (19)

Let’s look at the housing correlation matrix

housing.corr(numeric_only=True)  # numeric_only skips ocean_proximity (needed in pandas 2.0+)

US Real Estate – Harnessing the Power of AI (20)

and plot the corresponding annotated heatmap

import seaborn as sns
corr = housing.corr(numeric_only=True)  # skip non-numeric columns
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns_plot = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,
                       square=True, linewidths=0.5, cbar_kws={"shrink": .5})
fig = sns_plot.get_figure()
fig.savefig("cacorrheatmap.png")

US Real Estate – Harnessing the Power of AI (21)

We can see that median_income is the most dominant factor that affects median_house_value.

Let’s check rows for missing values

sample_incomplete_rows= housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

US Real Estate – Harnessing the Power of AI (22)

while dropping the rows with missing total_bedrooms values leaves this sample empty

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity

Let’s fill NaN with median values

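median = housing['total_bedrooms'].median()
# assign the result back instead of calling fillna(inplace=True) on a column slice
sample_incomplete_rows['total_bedrooms'] = sample_incomplete_rows['total_bedrooms'].fillna(median)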
sample_incomplete_rows

US Real Estate – Harnessing the Power of AI (23)

Let’s apply the SimpleImputer method with strategy='median'

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
housing_num = housing.select_dtypes(include=(np.number))
housing_num

US Real Estate – Harnessing the Power of AI (24)

imputer.fit(housing_num)

SimpleImputer(strategy='median')

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns = housing_num.columns,index=housing_num.index)
housing_tr

US Real Estate – Harnessing the Power of AI (25)

Recall that

imputer.strategy

'median'

Let’s encode categorical variables to convert non-numerical data into numerical data to create inferences

housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

US Real Estate – Harnessing the Power of AI (26)

Let’s apply OrdinalEncoder to this variable

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder= OrdinalEncoder()
housing_cat_encoded=ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[1.], [4.], [1.], [4.], [0.], [3.], [0.], [0.], [0.], [0.]])

Let’s apply OneHotEncoder to housing_cat

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; use sparse=False on older versions
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

array([[0., 1., 0., 0., 0.], [0., 0., 0., 0., 1.], [0., 1., 0., 0., 0.], ..., [1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]])

Let’s define the feature_engineering function

def feature_engineering(data):
    data['bedrooms_per_household'] = data['total_bedrooms']/data['households']
    data['population_per_households'] = data['population']/data['households']
    data['rooms_per_households'] = data['total_rooms']/data['households']

    return data

and apply this function to the housing data

housing_feature_engineered = feature_engineering(housing_num)
housing_feature_engineered

US Real Estate – Harnessing the Power of AI (27)

Let’s scale our data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

housing_scaled = scaler.fit_transform(housing_feature_engineered)
housing_scaled

array([[-0.94135046, 1.34743822, 0.02756357, ..., 0.05896205, 0.00622264, 0.01739526], [ 1.17178212, -1.19243966, -1.72201763, ..., 0.02830837, -0.04081077, 0.56925554], [ 0.26758118, -0.1259716 , 1.22045984, ..., -0.1286475 , -0.07537122, -0.01802432], ..., [-1.5707942 , 1.31001828, 1.53856552, ..., -0.26257303, -0.03743619, -0.5092404 ], [-1.56080303, 1.2492109 , -1.1653327 , ..., 0.11548226, -0.05915604, 0.32814891], [-1.28105026, 2.02567448, -0.13148926, ..., 0.05505203, 0.00657083, 0.01407228]])

Let’s create the ML input data

ml_input_data = np.hstack([housing_cat_1hot, housing_scaled])
ml_input_data

array([[ 0. , 1. , 0. , ..., 0.05896205, 0.00622264, 0.01739526], [ 0. , 0. , 0. , ..., 0.02830837, -0.04081077, 0.56925554], [ 0. , 1. , 0. , ..., -0.1286475 , -0.07537122, -0.01802432], ..., [ 1. , 0. , 0. , ..., -0.26257303, -0.03743619, -0.5092404 ], [ 1. , 0. , 0. , ..., 0.11548226, -0.05915604, 0.32814891], [ 0. , 1. , 0. , ..., 0.05505203, 0.00657083, 0.01407228]])

Let’s define the entire ETL pipeline to be applied to the housing data

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

def data_transformations(data):

    ### Separate Labels if they Exist ###
    if "median_house_value" in data.columns:
        labels = data["median_house_value"]
        data = data.drop("median_house_value", axis=1)
    else:
        labels = None

    ### Feature Engineering ###
    feature_engineered_data = feature_engineering(data)
    features = list(feature_engineered_data.columns)  # Creating a list of our features for future use

    ### Imputing Data ###
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy="median")
    housing_num = feature_engineered_data.select_dtypes(include=[np.number])
    imputed = imputer.fit_transform(housing_num)

    ### Encoding Categorical Data ###
    housing_cat = feature_engineered_data.select_dtypes(exclude=[np.number])
    from sklearn.preprocessing import OneHotEncoder
    cat_encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; use sparse=False on older versions
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
    features = features + cat_encoder.categories_[0].tolist()
    features.remove("ocean_proximity")  # We're encoding this variable, so we don't need it in our list anymore

    ### Scaling Numerical Data ###
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    housing_scaled = scaler.fit_transform(imputed)

    ### Concatenating all Data ###
    output = np.hstack([housing_scaled, housing_cat_1hot])

    return output, labels, features

cat_encoder.categories_[0].tolist()

['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

Let’s prepare the training data for model selection and training
train_data, train_labels, features = data_transformations(strat_train_set)
train_data

array([[-0.94135046, 1.34743822, 0.02756357, ..., 0. , 0. , 0. ], [ 1.17178212, -1.19243966, -1.72201763, ..., 0. , 0. , 1. ], [ 0.26758118, -0.1259716 , 1.22045984, ..., 0. , 0. , 0. ], ..., [-1.5707942 , 1.31001828, 1.53856552, ..., 0. , 0. , 0. ], [-1.56080303, 1.2492109 , -1.1653327 , ..., 0. , 0. , 0. ], [-1.28105026, 2.02567448, -0.13148926, ..., 0. , 0. , 0. ]])

Let’s apply the same transformations to the test data

test_data, test_labels, features = data_transformations(strat_test_set)
test_data

array([[ 0.57507019, -0.69657252, 0.0329564 , ..., 0. , 0. , 0. ], [-0.43480141, -0.33466769, -0.36298077, ..., 0. , 0. , 0. ], [ 0.54522177, -0.63547171, 0.58726843, ..., 0. , 0. , 0. ], ..., [-0.08656982, -0.54617051, 1.14158047, ..., 0. , 0. , 0. ], [ 0.81385757, -0.92687559, 0.11214383, ..., 0. , 0. , 0. ], [ 0.49049967, -0.66367208, 0.58726843, ..., 0. , 0. , 0. ]])

We have got the train labels

train_labels

12655     72100.0
15502    279600.0
2908      82700.0
14053    112500.0
20496    238300.0
           ...
15174    268500.0
12661     90400.0
19263    140400.0
19140    258100.0
19773     62700.0
Name: median_house_value, Length: 16512, dtype: float64

and the features

features

['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'bedrooms_per_household', 'population_per_households', 'rooms_per_households', '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

Following Case 1 (see above), let’s apply the Linear Regression

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_data,train_labels)

LinearRegression()

Let’s compare original and predicted values

original_values = test_labels[:5]
predicted_values = lin_reg.predict(test_data[:5])
comparison_dataframe = pd.DataFrame(data={"Original Values": original_values, "Predicted Values": predicted_values})

comparison_dataframe["Differences"] = comparison_dataframe["Original Values"] - comparison_dataframe["Predicted Values"]

comparison_dataframe

US Real Estate – Harnessing the Power of AI (28)

Let’s check the RMSE metric (the square root of the MSE) on these five test samples

from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(original_values,predicted_values)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

78489.87096668077

Let’s check the MAE metric

from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(original_values, predicted_values)
lin_mae

71328.53325778323

Let’s apply the Decision Tree algorithm

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(train_data,train_labels)

DecisionTreeRegressor(random_state=42)

train_predictions = tree_reg.predict(train_data)
tree_mse = mean_squared_error(train_labels, train_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

Let’s compute the cross-validation score

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, train_data, train_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [70819.83674558 70585.09139446 69861.50467212 73083.46385442 66246.62162221 74093.76616605 77298.21284135 70265.05374821 70413.46481703 72693.02785945]
Mean: 71536.00437208822
Standard deviation: 2802.723447985299
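A training RMSE of 0.0 next to a cross-validated RMSE of about 71,500 is the typical signature of an overfitted tree: an unconstrained decision tree can memorize the training rows, so only the cross-validation score says something about unseen data. The sketch below reproduces this effect on synthetic data from make_regression (an assumption, so that it runs without the housing files).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared housing features
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=42)

# Fit on all rows and score on the same rows: the tree memorizes them
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = float(np.sqrt(mean_squared_error(y, tree.predict(X))))

# 10-fold cross-validation: each fold is scored on rows the tree has not seen
neg_mse = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                          scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-neg_mse)

print("Training RMSE:", train_rmse)
print("CV RMSE mean:", cv_rmse.mean(), "std:", cv_rmse.std())
```

Note that scikit-learn reports the negated MSE so that larger is always better, which is why the sign is flipped before taking the square root.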

Let’s apply the Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(train_data, train_labels)

RandomForestRegressor(random_state=42)

train_predictions = forest_reg.predict(train_data)
forest_mse = mean_squared_error(train_labels, train_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

18797.81343373367

Let’s select the corresponding cross_val_score

from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, train_data, train_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [51667.47890087 49581.77674843 46845.77133522 52127.48739086 48082.89639917 51050.84681689 53027.94987383 50218.59780997 48609.03966622 54669.97457167]
Mean: 50588.18195131385
Standard deviation: 2273.9929947683154

Let’s try 12 (3×4) combinations of hyperparameters and then 6 (2×3) combinations with bootstrap set to False using GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = [
# try 12 (3×4) combinations of hyperparameters
{‘n_estimators’: [3, 10, 30], ‘max_features’: [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{‘bootstrap’: [False], ‘n_estimators’: [3, 10], ‘max_features’: [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)

Let’s train across 5 folds, for a total of (12+6)*5=90 rounds of training

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(train_data, train_labels)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'bootstrap': [False], 'max_features': [2, 3, 4], 'n_estimators': [3, 10]}], return_train_score=True, scoring='neg_mean_squared_error')
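
The count of 90 training rounds can be checked without fitting anything. This is a small sketch using scikit-learn's ParameterGrid on the same grid; the variable names n_candidates and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# Each dict expands to the Cartesian product of its value lists
n_candidates = len(ParameterGrid(param_grid))  # 12 + 6 candidates
n_fits = n_candidates * 5                      # one fit per candidate per fold

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With the default refit=True, GridSearchCV also refits the best candidate once on the whole training set after the cross-validation fits.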

Let’s see the best estimator

grid_search.best_estimator_

RandomForestRegressor(max_features=6, n_estimators=30, random_state=42)

The results of grid search cv are as follows

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

64441.33583774864 {'max_features': 2, 'n_estimators': 3}
55010.78729315784 {'max_features': 2, 'n_estimators': 10}
52756.90743676946 {'max_features': 2, 'n_estimators': 30}
60419.95105027927 {'max_features': 4, 'n_estimators': 3}
52548.760723492225 {'max_features': 4, 'n_estimators': 10}
50475.03023921768 {'max_features': 4, 'n_estimators': 30}
58658.87553276854 {'max_features': 6, 'n_estimators': 3}
51688.259845013825 {'max_features': 6, 'n_estimators': 10}
49602.83903888296 {'max_features': 6, 'n_estimators': 30}
57764.545176887186 {'max_features': 8, 'n_estimators': 3}
51906.606161086886 {'max_features': 8, 'n_estimators': 10}
49851.77165193962 {'max_features': 8, 'n_estimators': 30}
63137.43571927858 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54419.40582754731 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58195.29390064867 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52168.74519952844 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59520.17602710436 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51828.25647287002 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

The corresponding dataframe is

pd.DataFrame(grid_search.cv_results_)

The resulting dataframe has 18 rows × 23 columns.

Let’s compare it to RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(train_data, train_labels)

RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE8220>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001669BCE0640>}, random_state=42, scoring='neg_mean_squared_error')

The results are as follows

cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

48881.00597871309 {'max_features': 7, 'n_estimators': 180}
51634.61963021687 {'max_features': 5, 'n_estimators': 15}
50312.55245794906 {'max_features': 3, 'n_estimators': 72}
50952.54821857023 {'max_features': 5, 'n_estimators': 21}
49063.34454115586 {'max_features': 7, 'n_estimators': 122}
50317.63324666772 {'max_features': 3, 'n_estimators': 75}
50173.504527094505 {'max_features': 3, 'n_estimators': 88}
49248.29804214526 {'max_features': 5, 'n_estimators': 100}
50054.94886918995 {'max_features': 3, 'n_estimators': 150}
64847.94779269648 {'max_features': 5, 'n_estimators': 2}
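
Note that max_features never exceeds 7 in these results. scipy.stats.randint treats high as exclusive, so randint(low=1, high=8) draws only from 1 to 7, whereas the grid search above also tried 8. A quick sketch to confirm this (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint

# Same distribution as used for max_features in the randomized search
dist = randint(low=1, high=8)

# Draw many samples and inspect the range actually covered
samples = dist.rvs(size=10_000, random_state=42)

print("Smallest draw:", samples.min())
print("Largest draw:", samples.max())
```

To include 8 in the search, high would have to be 9.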

Let’s look at the feature importances

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([8.46978272e-02, 7.69983975e-02, 4.08715796e-02, 1.67325719e-02, 1.71418340e-02, 1.73518185e-02, 1.56303531e-02, 3.39824215e-01, 2.30528104e-02, 1.04033701e-01, 8.64983594e-02, 1.29273143e-02, 1.54663950e-01, 7.22217547e-05, 3.62205279e-03, 5.88099358e-03])

The corresponding list is as follows

feature_importance_list = list(zip(features, feature_importances.tolist()))
feature_importance_list

[('longitude', 0.0846978271965227), ('latitude', 0.07699839747855737), ('housing_median_age', 0.040871579612884096), ('total_rooms', 0.016732571900462085), ('total_bedrooms', 0.01714183399184058), ('population', 0.0173518184721046), ('households', 0.015630353131298083), ('median_income', 0.3398242154869636), ('bedrooms_per_household', 0.023052810363875926), ('population_per_households', 0.10403370064780083), ('rooms_per_households', 0.08649835942626646), ('<1H OCEAN', 0.012927314349565632), ('INLAND', 0.15466394981681342), ('ISLAND', 7.222175467748088e-05), ('NEAR BAY', 0.003622052794433035), ('NEAR OCEAN', 0.005880993575933963)]

We can plot this list as a horizontal bar chart with 16 bars, one per feature

plt.barh(y=features, width=feature_importances.tolist())
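
The importances are easier to read once ranked. The sketch below copies the values from the list above, rounded to four decimals, into a pandas Series; the names importances, ranked and cumulative are illustrative.

```python
import pandas as pd

# Importances copied from the list above, rounded to 4 decimals
importances = pd.Series({
    "longitude": 0.0847, "latitude": 0.0770, "housing_median_age": 0.0409,
    "total_rooms": 0.0167, "total_bedrooms": 0.0171, "population": 0.0174,
    "households": 0.0156, "median_income": 0.3398,
    "bedrooms_per_household": 0.0231, "population_per_households": 0.1040,
    "rooms_per_households": 0.0865, "<1H OCEAN": 0.0129, "INLAND": 0.1547,
    "ISLAND": 0.0001, "NEAR BAY": 0.0036, "NEAR OCEAN": 0.0059,
})

# Rank features and accumulate their share of the total importance
ranked = importances.sort_values(ascending=False)
cumulative = ranked.cumsum()

print(pd.DataFrame({"importance": ranked, "cumulative": cumulative}).head(6))
```

Passing ranked[::-1] to plt.barh would place the most important feature at the top of the chart.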


The final model RMSE is given by

final_model = grid_search.best_estimator_

final_predictions = final_model.predict(test_data)

final_mse = mean_squared_error(test_labels, final_predictions)
final_rmse = np.sqrt(final_mse)

final_rmse

63301.179203602675

This can be modified further using various feature selection methods.

Thus, median_income is the most important feature. The best cross-validation result is achieved using RandomForestRegressor + RandomizedSearchCV. The default

RandomForestRegressor(random_state=42) yields a training RMSE of 18797.8 and a 10-fold cross-validation RMSE of 50588.2 +/- 2274.0,

whereas the lowest cross-validated RMSE from the randomized search is 48881.0,

with 'max_features': 7, 'n_estimators': 180.

Case 3: IA

For this case study, the primary objective was to create and assess advanced ML/AI models to accurately predict house prices based on the Ames dataset. It was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

The data set includes around 3000 records of house sales in Ames, Iowa between 2006 and 2010 and contains 79 explanatory variables detailing various aspects of residential homes such as square footage, number of rooms and sale year. The data is split into a training set, which is used to create the model, and a test set, which is used to test model performance.

Results can provide insights on the pricing of real estate assets just by plugging in the house characteristics and letting the model return a price. In addition, the ML/AI output can provide information on which features of a new house are more valuable for potential house buyers. Source code: GitHub.

The general ETL Python workflow to create the model is as follows:

Data preprocessing
Exploratory data analysis/Feature Engineering
Model training & hyperparameter tuning
Model diagnostics & evaluation
Result interpretation

Let’s set the working directory YOURPATH

import os
os.chdir('YOURPATH')
os.getcwd()

Let’s import libraries and download train/test Ames datasets
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import KFold
from IPython.display import HTML, display
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 20

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Let’s get the dimensions of the train and test data
print("Training data set dimension : {}".format(train.shape))
print("Testing data set dimension : {}".format(test.shape))

Training data set dimension : (2051, 81)
Testing data set dimension : (879, 80)

Let’s look at the continuous features
numerical_cols = [col for col in train.columns if train.dtypes[col] != 'object']
numerical_cols.remove('SalePrice')
numerical_cols.remove('Id')
print("*"*122)
print("Continuous features")
print("*"*122)
print(numerical_cols)
print("*"*122)
print("count of continuous features:",len(numerical_cols))
print("*"*122)

Continuous features
***************************************************************************
['PID', 'MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold']
***************************************************************************
count of continuous features: 37

Let’s look at the categorical features
categorical_cols = [col for col in train.columns if train.dtypes[col] == 'object']
print("*"*122)
print("categorical features")
print("*"*122)
print(categorical_cols)
print("*"*122)
print("count of categorical features:",len(categorical_cols))
print("*"*122)

categorical features
***************************************************************************
['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC', 'Central Air', 'Electrical', 'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence', 'Misc Feature', 'Sale Type']
***************************************************************************
count of categorical features: 42

and check unique column values below

print('unique column values')
train.apply(lambda x: len(x.unique())).sort_values(ascending=False).head(10)

unique column values

Id               2051
PID              2051
Lot Area         1476
Gr Liv Area      1053
Bsmt Unf SF       968
1st Flr SF        915
Total Bsmt SF     893
SalePrice         828
BsmtFin SF 1      822
Garage Area       515
dtype: int64

Let’s check the sorted cardinality train values

cardinality = train[categorical_cols].apply(lambda x: len(x.unique()))
cardinality.sort_values(ascending=False).head(30)

Neighborhood      28
Exterior 2nd      15
Exterior 1st      15
Sale Type          9
Condition 1        9
House Style        8
Functional         8
Condition 2        8
Garage Type        7
BsmtFin Type 2     7
BsmtFin Type 1     7
MS Zoning          7
Bsmt Qual          6
Roof Matl          6
Misc Feature       6
Garage Cond        6
Garage Qual        6
Foundation         6
Fireplace Qu       6
Bsmt Cond          6
Roof Style         6
Heating            5
Fence              5
Pool QC            5
Electrical         5
Bldg Type          5
Bsmt Exposure      5
Exter Cond         5
Mas Vnr Type       5
Lot Config         5
dtype: int64

and the cardinality test values

cardinality = test[categorical_cols].apply(lambda x: len(x.unique()))
cardinality.sort_values(ascending=False).head(40)

Neighborhood      26
Exterior 2nd      16
Exterior 1st      13
Sale Type         10
Condition 1        9
House Style        8
Garage Type        7
BsmtFin Type 2     7
BsmtFin Type 1     7
Garage Cond        6
Fireplace Qu       6
Functional         6
Foundation         6
Mas Vnr Type       6
MS Zoning          6
Roof Matl          6
Roof Style         6
Bsmt Qual          6
Kitchen Qual       5
Exter Cond         5
Fence              5
Garage Qual        5
Bsmt Exposure      5
Lot Config         5
Bldg Type          5
Electrical         5
Misc Feature       4
Garage Finish      4
Lot Shape          4
Land Contour       4
Exter Qual         4
Heating QC         4
Heating            4
Bsmt Cond          4
Condition 2        4
Land Slope         3
Alley              3
Paved Drive        3
Pool QC            3
Utilities          2
dtype: int64

Let’s check good and bad train+test column lists

good_label_cols = [col for col in categorical_cols if set(test[col]).issubset(set(train[col]))]
len(good_label_cols)

bad_label_cols = list(set(categorical_cols)-set(good_label_cols))
bad_label_cols

['Sale Type', 'Exterior 1st', 'Heating', 'Roof Matl', 'Electrical', 'Exterior 2nd', 'Mas Vnr Type', 'Kitchen Qual']
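
A column is "bad" here when the test set contains a category that never appears in the training set, which matters because an encoder fitted on the training data alone has no code for it. Here is a toy sketch of the same subset check; the two small frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames (illustrative values, not the Ames data)
train_demo = pd.DataFrame({"Heating": ["GasA", "GasW", "GasA"],
                           "Street": ["Pave", "Grvl", "Pave"]})
test_demo = pd.DataFrame({"Heating": ["GasA", "Floor"],
                          "Street": ["Pave", "Pave"]})

cols = ["Heating", "Street"]

# "Good" columns: every test category was also seen in training
good = [c for c in cols if set(test_demo[c]).issubset(set(train_demo[c]))]
bad = sorted(set(cols) - set(good))

# Categories that appear only in the test set, per bad column
unseen = {c: sorted(set(test_demo[c]) - set(train_demo[c])) for c in bad}

print("good:", good)
print("bad:", bad)
print("unseen:", unseen)
```

Encoding train and test together, as the preprocessing further down does after concatenating them, is one way to sidestep the problem.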

Let’s plot the count of missing values in the training data column features

cols_with_missing = train.isnull().sum()
cols_with_missing = cols_with_missing[cols_with_missing>0]
cols_with_missing.sort_values(inplace=True)
fig, ax = plt.subplots(figsize=(7,6))
width = 0.70 # the width of the bars
ind = np.arange(len(cols_with_missing)) # the x locations for the groups
ax.barh(ind, cols_with_missing, width, color="blue")
ax.set_yticks(ind+width/2)
ax.set_yticklabels(cols_with_missing.index, minor=False)
plt.xlabel('Count')
plt.ylabel('Features')
plt.savefig("amesfeaturesmissingvalues.png")

Let’s count the percentage of missing values in training data
print('Percentage of missing values in each columns')

total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data_tr = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])
missing_data_tr[missing_data_tr.Total>=1]

Percentage of missing values in each columns
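
The ratio isnull().sum() / isnull().count() is simply the mean of the boolean mask, so the same table can be built a little more compactly. Below is a sketch on a toy frame; the frame df and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values, not the Ames data)
df = pd.DataFrame({
    "Alley": [np.nan, "Pave", np.nan, np.nan],
    "Lot Area": [8450, 9600, np.nan, 11250],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
})

total = df.isnull().sum()
# Mean of a boolean mask is the fraction of True values, i.e. the missing share
percent = (df.isnull().mean() * 100).round(2)

missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
missing = missing[missing.Total >= 1].sort_values("Total", ascending=False)

print(missing)
```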


Similarly, we plot the count of missing values in the test data column features

cols_with_missing = test.isnull().sum()
cols_with_missing = cols_with_missing[cols_with_missing>0]
cols_with_missing.sort_values(inplace=True)
fig, ax = plt.subplots(figsize=(7,6))
width = 0.70 # the width of the bars
ind = np.arange(len(cols_with_missing)) # the x locations for the groups
ax.barh(ind, cols_with_missing, width, color="blue")
ax.set_yticks(ind+width/2)
ax.set_yticklabels(cols_with_missing.index, minor=False)
plt.xlabel('Count')
plt.ylabel('Features')

and the percentage of missing values in test data columns

print('Percentage of missing values in each columns')

total = test.isnull().sum().sort_values(ascending=False)
percent = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data_te = pd.concat([total, round(percent*100,2)], axis=1, keys=['Total', 'Percent'])
missing_data_te[missing_data_te.Total>=1]

Percentage of missing values in each columns

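Let’s prepare the data for ML. The code below refers to the data frames as train_data and test_data and uses column names without spaces (e.g. GrLivArea rather than Gr Liv Area), so it assumes a copy of the Ames data with that naming.

Separate features and target variable SalePrice
X_train = train_data.drop(['SalePrice'], axis=1)
y = train_data.SalePrice

and concatenate train and test data
X = pd.concat([X_train, test_data], axis=0)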

Let’s apply fillna and SimpleImputer to deal with missing values

from sklearn.impute import SimpleImputer

# Categorical features where a missing value means the feature is absent
group_1 = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType',
    'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
]
X[group_1] = X[group_1].fillna("None")

# Numerical features where a missing value is filled with zero
group_2 = [
    'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
    'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea'
]

X[group_2] = X[group_2].fillna(0)

# Categorical features filled with their most frequent value
group_3a = [
    'Functional', 'MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st',
    'Exterior2nd', 'SaleType', 'Utilities'
]

imputer = SimpleImputer(strategy='most_frequent')
X[group_3a] = pd.DataFrame(imputer.fit_transform(X[group_3a]), index=X.index)

X.LotFrontage = X.LotFrontage.fillna(X.LotFrontage.mean())
X.GarageYrBlt = X.GarageYrBlt.fillna(X.YearBuilt)

Let’s check that there are no remaining missing values

sum(X.isnull().sum())

Let’s drop outliers in GrLivArea and SalePrice (based on Ames EDA)

outlier_index = train_data[(train_data.GrLivArea > 4000)
& (train_data.SalePrice < 200000)].index
X.drop(outlier_index, axis=0, inplace=True)
y.drop(outlier_index, axis=0, inplace=True)

Let’s apply label encoding to the categorical columns

from sklearn.preprocessing import LabelEncoder

label_encoding_cols = [
    "Alley", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
    "BsmtQual", "ExterCond", "ExterQual", "FireplaceQu", "Functional",
    "GarageCond", "GarageQual", "HeatingQC", "KitchenQual", "LandSlope",
    "LotShape", "PavedDrive", "PoolQC", "Street", "Utilities"
]

label_encoder = LabelEncoder()

for col in label_encoding_cols:
    X[col] = label_encoder.fit_transform(X[col])

Let’s transform numerical variables to categorical variables
to_factor_cols = ['YrSold', 'MoSold', 'MSSubClass']

for col in to_factor_cols:
    X[col] = X[col].apply(str)

Let’s apply feature scaling using RobustScaler
from sklearn.preprocessing import RobustScaler
numerical_cols = list(X.select_dtypes(exclude=['object']).columns)
scaler = RobustScaler()
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

followed by one-hot encoding
X = pd.get_dummies(X, drop_first=True)
print("X.shape:", X.shape)

X.shape: (2917, 237)
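
With its default settings, RobustScaler centers each column on its median and divides by the interquartile range, so a single extreme value barely affects how the other rows are scaled. The sketch below checks this by hand on a toy column; the numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with an obvious outlier (illustrative values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent: (x - median) / (75th percentile - 25th percentile)
median = np.median(values)
iqr = np.percentile(values, 75) - np.percentile(values, 25)
manual = (values - median) / iqr

print(scaled.ravel())
print(manual.ravel())
```

The four ordinary values land between -1 and 0.5, while the outlier stays far away instead of compressing the rest of the column.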

Let’s define the train and test columns

ntest = len(test_data)
X_train = X.iloc[:-ntest, :]
X_test = X.iloc[-ntest:, :]
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)

X_train.shape: (1458, 237)
X_test.shape: (1459, 237)

Let’s perform modeling
from sklearn.model_selection import KFold, cross_val_score

n_folds = 5

def getRMSLE(model):
    """
    Return the average RMSLE over all folds of training data.
    """
    # Set KFold to shuffle data before the split
    kf = KFold(n_folds, shuffle=True, random_state=42)

    # Get RMSLE score (this is the RMSE of y, so it equals the RMSLE only if y holds log-transformed prices)
    rmse = np.sqrt(-cross_val_score(
        model, X_train, y, scoring="neg_mean_squared_error", cv=kf))
    return rmse.mean()
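
getRMSLE scores with neg_mean_squared_error, so the number it returns is a root mean squared logarithmic error only when the target has already been log-transformed. The sketch below illustrates the equivalence on made-up prices, using log1p because that is the transform scikit-learn's mean_squared_log_error applies.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up sale prices and predictions (illustrative values)
y_true = np.array([100000.0, 150000.0, 200000.0])
y_pred = np.array([110000.0, 140000.0, 210000.0])

# RMSLE computed directly on the raw prices
rmsle_direct = np.sqrt(mean_squared_log_error(y_true, y_pred))

# Plain RMSE computed on log1p-transformed prices
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(rmsle_direct, rmse_on_logs)
```

Scores around 0.11, as reported below, are on the scale one would expect for log prices rather than dollars.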

Let’s apply regularized regressions
from sklearn.linear_model import Ridge, Lasso

lambda_list = list(np.linspace(20, 25, 101))

rmsle_ridge = [getRMSLE(Ridge(alpha=lambda_)) for lambda_ in lambda_list]
rmsle_ridge = pd.Series(rmsle_ridge, index=lambda_list)

rmsle_ridge.plot(title="RMSLE by lambda")
plt.xlabel("Lambda")
plt.ylabel("RMSLE")
plt.savefig("amesridgelambdarmsle.png")
print("Best lambda:", rmsle_ridge.idxmin())
print("RMSLE:", rmsle_ridge.min())

Best lambda: 22.9
RMSLE: 0.11409306668450883

ridge = Ridge(alpha=22.9)

The Lasso regression is tuned in the same way
lambda_list = list(np.linspace(0.0006, 0.0007, 11))
rmsle_lasso = [
    getRMSLE(Lasso(alpha=lambda_, max_iter=100000)) for lambda_ in lambda_list
]
rmsle_lasso = pd.Series(rmsle_lasso, index=lambda_list)

rmsle_lasso.plot(title="RMSLE by lambda")
plt.xlabel("Lambda")
plt.ylabel("RMSLE")
plt.savefig("ameslassolambdarmsle.png")
print("Best lambda:", rmsle_lasso.idxmin())
print("RMSLE:", rmsle_lasso.min())

Best lambda: 0.00065
RMSLE: 0.11335701578061286

lasso = Lasso(alpha=0.00065, max_iter=100000)

Let’s apply the XGBoost algorithm
from xgboost import XGBRegressor

xgb = XGBRegressor(learning_rate=0.05,
n_estimators=2100,
max_depth=2,
min_child_weight=2,
gamma=0,
subsample=0.65,
colsample_bytree=0.46,
nthread=-1,
scale_pos_weight=1,
reg_alpha=0.464,
reg_lambda=0.8571,
silent=1,
random_state=7,
n_jobs=2)
getRMSLE(xgb)

0.11606096335909163

Let’s apply the LightGBM algorithm
from lightgbm import LGBMRegressor
lgb = LGBMRegressor(objective='regression',
learning_rate=0.05,
n_estimators=730,
num_leaves=8,
min_data_in_leaf=4,
max_depth=3,
max_bin=55,
bagging_fraction=0.78,
bagging_freq=5,
feature_fraction=0.24,
feature_fraction_seed=9,
bagging_seed=9,
min_sum_hessian_in_leaf=11)
getRMSLE(lgb)

0.11579673967953394

Let’s design the averaging model

from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

class AveragingModel(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Create clone models
        self.models_ = [clone(x) for x in self.models]
        # Train cloned models
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Get predictions from trained clone models
        predictions = np.column_stack(
            [model.predict(X) for model in self.models_])
        # Return average predictions
        return np.mean(predictions, axis=1)

avg_model = AveragingModel(models=(ridge, lasso, xgb, lgb))
getRMSLE(avg_model)

0.1106991374718241
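
Unweighted averaging of regressors is also available in scikit-learn as VotingRegressor, which could stand in for the hand-written class. This sketch is an alternative rather than the code used above: it runs on synthetic data with only two linear models, and the names X_demo, y_demo, ridge_demo and lasso_demo are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (assumption, not the Ames features)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge_demo = Ridge(alpha=1.0)
lasso_demo = Lasso(alpha=0.1, max_iter=100000)

# VotingRegressor fits clones of the estimators and averages their predictions
voter = VotingRegressor([("ridge", ridge_demo), ("lasso", lasso_demo)])
voter.fit(X_demo, y_demo)
voted = voter.predict(X_demo)

# Manual average of the two models fitted separately, for comparison
manual_avg = np.mean(
    [ridge_demo.fit(X_demo, y_demo).predict(X_demo),
     lasso_demo.fit(X_demo, y_demo).predict(X_demo)],
    axis=0)

print(np.allclose(voted, manual_avg))
```

Averaging tends to help most when the base models make different kinds of errors, which may explain why the blend of two linear and two boosted models scores slightly better than any single model here.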

Let’s compare the X-plots


We can see that both XGBoost and LightGBM methods result in relatively similar X-plots and corresponding RMSLEs.

Case 4: MA

Let’s visualize ML model performance using Scikit-Plot evaluation metrics. The public dataset that we’ll use is the Boston housing price dataset. It has information about various houses in Boston and the prices at which they were sold. We’ll also split it into train and test sets with train_size=0.8. Note that load_boston is available in the scikit-learn 1.0.2 used here but was removed in version 1.2. Let’s import libraries and load the data:

import scikitplot as skplt

import sklearn
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

%matplotlib inline

Scikit Plot Version : 0.3.7
Scikit Learn Version : 1.0.2
Python Version : 3.9.12 (main, Apr 4 2022, 05:22:27)

boston = load_boston()
X_boston, Y_boston = boston.data, boston.target

print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)

print("Boston Dataset Features : ", boston.feature_names)

Boston Dataset Size : (506, 13) (506,)
Boston Dataset Features : ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston,
                                                                                train_size=0.8,
                                                                                random_state=1)

print("Boston Train/Test Sizes : ",X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)

Boston Train/Test Sizes : (404, 13) (102, 13) (404,) (102,)

Let’s plot the cross-validation performance of ML models on the Boston dataset. Scikit-plot provides a method named plot_learning_curve() as a part of the estimators module, which accepts an estimator, X, Y, cross-validation info, and a scoring metric for plotting cross-validation performance on the dataset.

skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston Linear Regression Learning Curve ");
plt.savefig("bostonlinreglearncurve.png")

skplt.estimators.plot_learning_curve(RandomForestRegressor(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston RandomForestRegressor Learning Curve ");
plt.savefig("bostonrandomforestlearncurve.png")

from xgboost import XGBRegressor
skplt.estimators.plot_learning_curve(XGBRegressor(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston XGBRegressor Learning Curve ");
plt.savefig("bostonxgboostlearncurve.png")

from lightgbm import LGBMRegressor
skplt.estimators.plot_learning_curve(LGBMRegressor(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston LGBMRegressor Learning Curve ");
plt.savefig("bostonlgbmlearncurve.png")

from sklearn.linear_model import Ridge, Lasso
skplt.estimators.plot_learning_curve(Ridge(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston Ridge Regression Learning Curve ");
plt.savefig("bostonridgereglearncurve.png")

skplt.estimators.plot_learning_curve(Lasso(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston Lasso Regression Learning Curve ");
plt.savefig("bostonlassoreglearncurve.png")

from sklearn import linear_model
reg = linear_model.BayesianRidge()
skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston BayesianRidge Regression Learning Curve ");
plt.savefig("bostonBayesianRidgereglearncurve.png")

from sklearn.linear_model import TweedieRegressor
reg = TweedieRegressor(power=1, alpha=0.5, link='log')
skplt.estimators.plot_learning_curve(reg, X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston TweedieRegressor Learning Curve ");
plt.savefig("bostontweediereglearncurve.png")

It is clear that RandomForestRegressor, XGBRegressor, and LGBMRegressor yield the best training and cross-validation scores for training examples > 420 compared to other ML algorithms.
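
The numbers behind such learning curves can also be computed with scikit-learn's own learning_curve function, which is handy when exact scores are needed rather than a plot. Here is a minimal sketch on synthetic data shaped like the Boston set (506 rows, 13 features); the data and the choice of five training sizes are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve

# Synthetic stand-in with the same shape as the Boston data (assumption)
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=15.0, random_state=1)

# 7 shuffled folds, mirroring cv=7, shuffle=True in the plots above
cv = KFold(n_splits=7, shuffle=True, random_state=1)

train_sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5))

# One row per training size: mean R^2 on the training part and on the held-out fold
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={n:4d}  train R2={tr:.3f}  cv R2={te:.3f}")
```

With 7 folds, at most six sevenths of the 506 rows are available for training, which is why the curves end a little above 420 examples.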

Key Takeaways

We predict/estimate US house prices in order to allocate a valuation expert over a period of time.
We need a fast AI to address rapidly increasing populations and the number of dwelling houses in the country.
We use a region-dependent pre-trained ML model to predict prices of new houses.
We import key Python libraries (pandas, scikit-learn, etc.) and download public-domain housing datasets from Kaggle or GitHub.
We gather and clean, edit, scale and transform data so it can be used for model training and test predictions. Specifically, we identify the target variable (SalePrice), impute missing values, perform label encoding, standardization, splitting and (optional) balancing of training and testing datasets. For example, we can look at scatter plots to detect outliers to be dropped.
The input data consists of a home’s features, including its eventual selling price and various descriptive features such as location, remodeling, age, size, type of sale (single family, commercial, etc).
These features will be analyzed in determining a home’s value and what the shopper is most likely to buy.
Feature engineering can determine which model features are most important; one feature may stand out, or there may be several. For example, a larger living or basement area is linked to a higher house price.
We perform model training using different linear and non-linear regression algorithms (Ridge, Lasso, Random Forest, Decision Tree, SVM, XGBoost, etc.).
The model performance is evaluated using a user-defined loss function (RMSE, MSE, OHMSE, etc.).
The pre-trained model is then used to generate predictions for both training and validation inputs.
Cross-validation of different ML algorithms has proven to be a suitable method to find an acceptable best fitting algorithm for the given set of features.
It appears that location and square feet area play an important role in deciding the price of a property. This is helpful information for sellers and buyers.
Results provide a primer on advanced ML real estate techniques as well as several best practices for ML readiness.

Conclusions

Housing prices are an important reflection of the US real estate market, and housing price ranges are of great interest to both buyers and sellers. Real estate is the world’s largest asset class, worth $277 trillion, three times the total value of all publicly traded companies. ML/AI applications have accompanied the sector’s growth.

One of the most popular AI applications in the industry is intelligent investing. This application helps answer questions like:

Which house should I buy or build to maximize my return?
Where or when should I do so?
What is its optimum rent or sale price?

In this blog post, we have reviewed how ML leverages the power of housing data to tackle these important questions. We have also explored the pros and cons of ML algorithms and how optimizing various steps of actual Python workflows can help improve their performance.