Overview: Here we explore the price per sqft (ppsqft) metric and which variables may affect it. This dataset does not contain geographical data for each sample; instead, this study explores how much the house dimensions (age, beds, baths, garage spaces, stories, lot sqft) explain the variability in the ppsqft metric.
This dataset contains real estate listings from Utah in 2024. It comprises 4,440 entries and 14 columns. Attributes contained within the dataset include type, description, year built, number of bedrooms and bathrooms, garage spaces, lot size, square footage, stories, listing price, status, and when the property was last sold. This data was ethically mined from Realtor.com using an API provided by Apify.
Source: The dataset is from Kaggle: https://www.kaggle.com/datasets/kanchana1990/real-estate-data-utah-2024
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.float_format', lambda x: '%.2f' %x)
data = pd.read_csv("real_estate_utah.csv") #load data
print(data.columns) #print columns
Index(['type', 'text', 'year_built', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'lastSoldOn', 'listPrice', 'status'], dtype='object')
print(data.shape) #print shape of data set
(4440, 14)
print(data.head(n=5)) #print first 5 rows
            type                                               text  \
0  single_family  Escape to tranquility with this off-grid, unfi...
1  single_family  Beautiful home in the desirable Oak Hills and ...
2  single_family  Welcome to your new home, nestled in the heart...
3  single_family  Investment Opportunity. House needs some work ...
4           land  Deer Springs Ranch is an 8000 Ac Ranch in an H...

   year_built  beds  baths  baths_full  baths_half  garage   lot_sqft    sqft  \
0     2020.00  1.00   1.00        1.00        1.00    2.00   71438.00  696.00
1     1968.00  4.00   3.00        2.00        1.00    2.00   56628.00 3700.00
2     1985.00  4.00   3.00        3.00        1.00    1.00   10019.00 3528.00
3     1936.00  4.00   2.00        2.00        1.00    2.00   12632.00 2097.00
4     2003.00  4.00   0.00        2.00        1.00    2.00  872071.00 2400.00

   stories  lastSoldOn  listPrice    status
0     2.00  2018-05-31   90000.00  for_sale
1     2.00  2018-05-31  799000.00  for_sale
2     2.00  2018-05-31  389900.00  for_sale
3     2.00  2018-04-16  300000.00  for_sale
4     2.00  2018-05-31   70000.00  for_sale
print(data.dtypes) #print data types
type                  object
text                  object
year_built           float64
beds                 float64
baths                float64
baths_full           float64
baths_half           float64
garage               float64
lot_sqft             float64
sqft                 float64
stories              float64
lastSoldOn            object
listPrice            float64
status                object
dtype: object
print(data.isna().sum()) #print number of null values found in each column
type          0
text          0
year_built    0
beds          0
baths         0
baths_full    0
baths_half    0
garage        0
lot_sqft      0
sqft          0
stories       0
lastSoldOn    0
listPrice     0
status        0
dtype: int64
print(data.describe().T)
             count      mean         std     min       25%       50%       75%           max
year_built 4440.00   1997.94       23.61 1860.00   1997.00   2003.00   2007.00       2026.00
beds       4440.00      3.89        1.27    1.00      3.00      4.00      4.00         19.00
baths      4440.00      2.45        1.79    0.00      2.00      3.00      3.00         45.00
baths_full 4440.00      2.24        1.17    1.00      2.00      2.00      3.00         45.00
baths_half 4440.00      1.02        0.19    1.00      1.00      1.00      1.00          6.00
garage     4440.00      2.33        1.02    0.00      2.00      2.00      2.00         20.00
lot_sqft   4440.00 552523.95 11344714.29  436.00   9583.00  13939.00  24394.00  600953760.00
sqft       4440.00   2712.32     1553.68    0.00   1842.00   2400.00   3132.00      20905.00
stories    4440.00      2.00        0.63    1.00      2.00      2.00      2.00          4.00
listPrice  4440.00 796604.38  1731703.12    0.00 353805.00 528995.00 754900.00   48000000.00
data.drop(columns=['text'], axis=1, inplace=True) #text not needed for this analysis
print(data.columns) #confirm changes
Index(['type', 'year_built', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'lastSoldOn', 'listPrice', 'status'], dtype='object')
Removed 'text' column as it is not needed for this analysis
#change 'lastSoldOn' column to date data type
data['lastSoldOn'] = pd.to_datetime(data['lastSoldOn'])
#change year_built to int
data['year_built'] = data['year_built'].astype('int')
print(data.dtypes) #confirm changes
type object year_built int64 beds float64 baths float64 baths_full float64 baths_half float64 garage float64 lot_sqft float64 sqft float64 stories float64 lastSoldOn datetime64[ns] listPrice float64 status object dtype: object
Changed 'lastSoldOn' to 'date' data type.
Changed 'year_built' to 'int' data type
As this is not a time-series analysis, the year feature in this dataset becomes a discrete variable that is of little use in statistical tests and modeling without context for what the year means. 'age' will be calculated as current_year - 'year_built', which gives context to each sample's 'year_built' feature w.r.t. the entire dataset and is a continuous variable that can bring more insight into the dataset.
data['age'] = pd.to_datetime('today').year - data['year_built'] #calculate age
data.drop(columns=['year_built'], axis=1, inplace=True) #drop 'year_built' column
'lastSoldNumYears' will be created as (currentDate - lastSoldOn)/365, making the column's units years. This new feature is continuous and will be of more use in statistical tests and modeling, bringing more insight into the dataset.
#calculate lastSoldNumYears
data['lastSoldNumYears'] = (pd.Timestamp('today').normalize() - data['lastSoldOn'].dt.normalize()).dt.days / 365 #vectorized: whole-day difference converted to years
data.drop(columns=['lastSoldOn'], axis=1, inplace=True)
#confirm changes
print(data.columns)
Index(['type', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'status', 'age', 'lastSoldNumYears'], dtype='object')
Price per square foot is an industry-standard calculation for comparing different properties and will be used in the analysis portion of this notebook. Samples with a sqft of 0.0 will evaluate to infinity, so they will be manually filtered out of analyses, along with samples with a listPrice of 0.0.
data['pricePerSqft'] = data['listPrice'] / data['sqft'] #create feature
print(data.columns) #confirm changes
Index(['type', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'status', 'age', 'lastSoldNumYears', 'pricePerSqft'], dtype='object')
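As a quick illustration of why that filter is needed (a hedged aside with hypothetical values, not part of the pipeline), pandas follows IEEE float semantics, so a nonzero price divided by 0.0 sqft yields inf rather than raising an error:
demo = pd.DataFrame({'listPrice': [100000.0, 100000.0], 'sqft': [1000.0, 0.0]}) #hypothetical rows
print(demo['listPrice'] / demo['sqft']) #row 0 -> 100.00, row 1 -> inf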
The three bath columns appear to be parts of a whole. Combining them will simplify calculation and won't detract from the dataset.
data['totalBaths'] = data['baths'] + data['baths_full'] + (data['baths_half']*0.5) #combine features
data.drop(columns=['baths','baths_full','baths_half'], axis=1, inplace=True) #drop columns
print(data.columns) #confirm changes
print(data.head(n=1)) #print example
Index(['type', 'beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice',
       'status', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'],
      dtype='object')
            type  beds  garage  lot_sqft   sqft  stories  listPrice    status  \
0  single_family  1.00    2.00  71438.00 696.00     2.00   90000.00  for_sale

   age  lastSoldNumYears  pricePerSqft  totalBaths
0    4              6.45        129.31        2.50
#type column analysis
print("Original Values:\n",data.type.value_counts())
#combine like values
townhome_values = ['townhomes', 'townhouse']
condo_values = ['condos', 'condo', 'condo_townhome_rowhome_coop', 'condo_townhome']
data.loc[data['type'].isin(townhome_values), "type"] = "townhome"
data.loc[data['type'].isin(condo_values), 'type'] = "condo"
print("\nConfirm Changes:\n",data.type.value_counts()) #confirm changes
Original Values:
single_family                  2883
land                            801
townhomes                       344
mobile                          206
condos                          156
townhouse                        14
other                            12
farm                              9
condo_townhome_rowhome_coop       8
condo_townhome                    6
condo                             1
Name: type, dtype: int64

Confirm Changes:
single_family    2883
land              801
townhome          358
mobile            206
condo             171
other              12
farm                9
Name: type, dtype: int64
print(data.status.value_counts())
for_sale          4185
ready_to_build     255
Name: status, dtype: int64
Find_Outliers returns the outlier samples and the samples within the lower and upper IQR bounds.
def Find_Outliers(df, column, houseType):
    #df: input dataframe
    #column: column within df to be checked using the IQR method
    #houseType: filter data based on house type ("NULL" applies no filter)
    #return: a list containing 3 dataframes (filtered by house type if specified):
    #1: samples below the lower bound
    #2: samples above the upper bound
    #3: df with outliers removed
    if houseType != "NULL":
        df = df.loc[df['type'] == houseType].copy()
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    lowOutliers_df = df[df[column] < lower_bound].copy()
    highOutliers_df = df[df[column] > upper_bound].copy()
    no_outliers_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()
    return [lowOutliers_df, highOutliers_df, no_outliers_df]
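The return value is positional; a quick usage sketch (an illustrative aside, here on the full dataset with no type filter):
low_df, high_df, kept_df = Find_Outliers(data, 'listPrice', "NULL") #unpack [low, high, no_outliers]
print(low_df.shape, high_df.shape, kept_df.shape) #sanity check: the three parts sum to len(data)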
data_ppsqft = data.loc[(data["sqft"] > 0.0) & (data['listPrice'] > 0.0)].copy()
print(data_ppsqft.shape)
(4421, 12)
type_list = data['type'].unique().tolist()
print(type_list)
['single_family', 'land', 'mobile', 'condo', 'townhome', 'other', 'farm']
desc = data_ppsqft['pricePerSqft'].describe()
desc = pd.DataFrame(desc).T
print(desc)
               count   mean    std  min    25%    50%    75%     max
pricePerSqft 4421.00 272.36 380.22 0.62 167.23 224.68 294.12 8312.50
plt.figure(figsize=(5,5))
plt.boxplot(data_ppsqft["pricePerSqft"], showmeans=True)
plt.title("Boxplot of Price Per Sqft")
plt.show()
ppsqft_skewness = data_ppsqft.pricePerSqft.skew()
print("Price Per Sqft skewness: ",ppsqft_skewness)
if(ppsqft_skewness > 0):
print("Data is skewed to the right")
elif(ppsqft_skewness == 0):
print("Data is not skewed")
else:
print("Data is skewed to the left")
Price Per Sqft skewness:  12.471674400066984
Data is skewed to the right
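Heavy right skew like this is often reduced with a log transform; a minimal aside (not applied in the analysis below), just to show the effect on the skewness statistic:
ppsqft_log = np.log1p(data_ppsqft['pricePerSqft']) #log(1+x) is safe for values near zero
print("log1p skewness: ", ppsqft_log.skew())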
#calculate bins for the 'ppsqft' feature using Scott's Rule
mean = data_ppsqft.pricePerSqft.mean()
median = data_ppsqft.pricePerSqft.median()
bins = np.histogram_bin_edges(data_ppsqft['pricePerSqft'], bins='scott')
plt.figure(figsize=(10,5))
sns.histplot(x=data_ppsqft['pricePerSqft'], bins=bins, color='darkblue', kde=True) #histplot draws on the current figure (displot creates its own, leaving an empty figure behind)
plt.axvline(mean, color='r', label=f'mean: {mean:.2f}')
plt.axvline(median, color='g', label=f'median: {median:.2f}')
plt.title(f"Price Per Sqft Distribution: {len(bins)-1} bins")
plt.legend(loc='best')
plt.show()
H0: Sample is from a normal distribution (p > alpha)
H1: Sample is not from a normal distribution (p < alpha)
This Shapiro-Wilk test checks whether the Price Per Sqft data follows a normal distribution, i.e., whether ppsqft is centered around the mean and thus similar for all listed homes within the dataset. If it is not normally distributed, we will continue to analyze which factors affect this metric.
from scipy.stats import shapiro
alpha = 0.05
result, p = shapiro(data_ppsqft['pricePerSqft'])
print("W: ",result)
print("P-Value: ",p)
if(p < alpha):
print("Price Per Sqft is not a normal distribution.\nWe will continue to explore what may have an effect on price per sqft")
else:
print("Price Per Sqft is a normal distribution.")
W:  0.30437177419662476
P-Value:  0.0
Price Per Sqft is not a normal distribution.
We will continue to explore what may have an effect on price per sqft
Different house types are, on average, marketed at different prices (e.g. townhomes tend to be more affordable than single-family homes). House type is used as a filter in these analyses to see if it factors into the PPSqft metric.
plt.figure(figsize=(20,20))
for i, houseType in enumerate(type_list, 1):
plt.subplot(3,3,i)
plt.boxplot(data_ppsqft["pricePerSqft"].loc[data_ppsqft['type'] == houseType], showmeans=True)
plt.title(f"PPSqft of {houseType}")
plt.suptitle("Price Per Sqft Boxplots of Each House Type")
#plt.tight_layout()
plt.show()
plt.figure(figsize=(8,8))
sns.boxplot(x="type", y="pricePerSqft", data=data_ppsqft)
plt.title("PPSqft by type Boxplots ")
plt.xticks(rotation=45)
plt.show()
desc_df = pd.DataFrame()
for houseType in type_list:
desc = data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].describe()
desc.name = houseType
desc = pd.DataFrame(desc).T
#print(desc)
desc_df = pd.concat([desc_df, desc], ignore_index=False)
desc_df['type'] = desc_df.index
desc_df.index = range(desc_df.shape[0])
desc_df = desc_df[['type', 'count','mean', 'std', 'min', '25%', '50%', '75%', 'max']]
print(desc_df)
            type   count   mean    std    min    25%    50%     75%     max
0  single_family 2864.00 282.30 249.03  24.99 196.13 241.22  307.99 6458.33
1           land  801.00 254.51 733.18   0.62  52.08 101.88  160.42 8312.50
2         mobile  206.00 159.32 138.43  10.42  74.22 118.37  200.75 1260.23
3          condo  171.00 332.63 193.46  29.09 240.27 272.96  352.26 1811.91
4       townhome  358.00 255.92 133.24  33.37 203.21 236.21  278.76 1463.19
5          other   12.00 223.35  33.82 192.50 200.90 211.30  239.48  301.87
6           farm    9.00 858.72 983.25  49.58 149.96 250.00 1645.83 2743.95
plt.figure(figsize=(20,20))
for i, houseType in enumerate(type_list, 1):
plt.subplot(3,3,i)
sns.histplot(data_ppsqft["pricePerSqft"].loc[data_ppsqft['type'] == houseType], kde=True)
plt.title(houseType)
plt.suptitle("Price Per Sqft distribution per House Type")
plt.show()
alpha = 0.05
for houseType in type_list:
print("House Type: " ,houseType)
print("Skewness: ", data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].skew())
print("Mean: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].mean())
print("Median: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].median())
print("Standard Deviation: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].std())
result, p = shapiro(data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType])
print("W: ",result)
print("P-Value: ",p)
if(p < alpha):
print("Price Per Sqft is not a normal distribution.\n")
else:
print("Price Per Sqft is a normal distribution.\n")
House Type:  single_family
Skewness:  13.72700763466026
Mean:  282.3002800121268
Median:  241.2213752174358
Standard Deviation:  249.0291842312236
W:  0.34130775928497314
P-Value:  0.0
Price Per Sqft is not a normal distribution.

House Type:  land
Skewness:  7.653097072801116
Mean:  254.51303058676655
Median:  101.875
Standard Deviation:  733.1844005075772
W:  0.27775460481643677
P-Value:  0.0
Price Per Sqft is not a normal distribution.

House Type:  mobile
Skewness:  3.7998513042956175
Mean:  159.3191862275339
Median:  118.37121212121212
Standard Deviation:  138.43446177888248
W:  0.6849346160888672
P-Value:  2.1042841473192538e-19
Price Per Sqft is not a normal distribution.

House Type:  condo
Skewness:  4.043862961040698
Mean:  332.63160051628023
Median:  272.96360485268633
Standard Deviation:  193.4618840451925
W:  0.6271494626998901
P-Value:  3.909373706656839e-19
Price Per Sqft is not a normal distribution.

House Type:  townhome
Skewness:  5.713055585803318
Mean:  255.91988972793735
Median:  236.21170320239574
Standard Deviation:  133.24036284914519
W:  0.5161730051040649
P-Value:  3.923861080671942e-30
Price Per Sqft is not a normal distribution.

House Type:  other
Skewness:  1.3439531402177216
Mean:  223.3542540415976
Median:  211.29519587961244
Standard Deviation:  33.817914525663184
W:  0.8448854088783264
P-Value:  0.03176218271255493
Price Per Sqft is not a normal distribution.

House Type:  farm
Skewness:  1.088982975813337
Mean:  858.7218212681898
Median:  250.0
Standard Deviation:  983.2467194494586
W:  0.806327760219574
P-Value:  0.024111604318022728
Price Per Sqft is not a normal distribution.
This output shows price per sqft is not similar among homes classified as the same type. Each house type group has a non-normal, heavily right-skewed distribution with a large standard deviation, indicating much variability within groups.
PPSqft within the house type groups is not normally distributed, meaning an ANOVA cannot be performed to analyze whether the groups differ. Kruskal-Wallis is used instead to test whether the median PPSqft for each house type differs.
H0: Samples all have the same central tendency and therefore there is no difference between groups (p > 0.05)
H1: At least one of the samples does not have the same central tendency and therefore there is a difference between groups (p < 0.05)
from scipy.stats import kruskal
singleFam_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'single_family')[2]
land_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'land')[2]
mobile_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'mobile')[2]
condo_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'condo')[2]
townhome_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'townhome')[2]
other_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'other')[2]
farm_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'farm')[2]
result = kruskal(singleFam_df['pricePerSqft'], land_df['pricePerSqft'], mobile_df['pricePerSqft'],\
condo_df['pricePerSqft'], townhome_df['pricePerSqft'], other_df['pricePerSqft'],\
farm_df['pricePerSqft'])
print(result)
KruskalResult(statistic=1552.24788362764, pvalue=0.0)
This result shows at least one of the samples does not have the same central tendency. Combined with the previous section, this indicates house types differ in terms of PPSqft, and PPSqft differs within each house type group as well.
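As a possible follow-up (not part of the original analysis), pairwise Mann-Whitney U tests with a Bonferroni correction could indicate which house-type pairs differ; a minimal sketch using the per-type dataframes built above:
from itertools import combinations
from scipy.stats import mannwhitneyu
groups = {'single_family': singleFam_df, 'land': land_df, 'mobile': mobile_df,
          'condo': condo_df, 'townhome': townhome_df, 'other': other_df, 'farm': farm_df}
pairs = list(combinations(groups, 2))
alpha_adj = 0.05 / len(pairs) #Bonferroni-adjusted significance level
for a, b in pairs:
    stat, p = mannwhitneyu(groups[a]['pricePerSqft'], groups[b]['pricePerSqft'])
    print(f"{a} vs {b}: p={p:.2e}", "(different)" if p < alpha_adj else "")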
'age' is used as a preliminary feature to explore relationships between price per sqft and other features in the dataset. It is known that homes depreciate as they age, so price per sqft should show a downward trend as home age increases, i.e., the ppsqft ratio should be lower for older homes.
plt.figure(figsize=(10,5))
plt.scatter(x=data_ppsqft['age'], y=data_ppsqft['pricePerSqft'])
plt.xlabel("Home Age")
plt.ylabel("Price Per Sqft")
plt.show()
This graph doesn't reveal a discernible relationship between ppsqft and home age.
This repeats the exploration above, with outliers removed (via Find_Outliers) to lower noise in the data.
df_list = Find_Outliers(data_ppsqft, "pricePerSqft", "NULL") #[low, high, no_outliers]
#print(df_list[2].info())
print("Data Skewness: ",df_list[2]['pricePerSqft'].skew())
plt.figure(figsize=(10,5))
plt.scatter(x=df_list[2]["age"], y=df_list[2]["pricePerSqft"])
plt.show()
Data Skewness: 0.00047584806276346154
Additionally, with outliers removed there is no discernible relationship between ppsqft and home age.
Different house types are, on average, marketed at different prices (e.g. townhomes tend to be more affordable than single-family homes). This section filters to single family homes to see if a discernible relationship exists between home age and PPSqft once noise from other house types is removed. Single family is used as it is the most represented house type in the dataset.
df_list = Find_Outliers(data_ppsqft, "pricePerSqft", "single_family")
print("Data Skewness: ",df_list[2]['pricePerSqft'].skew())
plt.figure(figsize=(10,5))
plt.scatter(x=df_list[2]["age"], y=df_list[2]["pricePerSqft"])
plt.show()
Data Skewness: 0.5552148446021512
Using the most represented house type in the dataset, single family, and with outliers removed, there is still no discernible relationship between ppsqft and home age.
Home age and type don't appear to have a discernible relationship to PPSqft. In this section a pairwise correlation will be computed across all features using the Spearman rank correlation. Spearman is used because this dataset does not pass all assumptions of the Pearson correlation, and Spearman is a nonparametric test of correlation. Spearman measures the strength and direction of association between two variables. The coefficient ranges from -1 to +1: +1 indicates a perfect positive association, -1 a perfect negative association, and values close to 0 a weak or absent association. This pairwise correlation will show the measure of association PPSqft has with the other features in the dataset.
corr_matrix = data_ppsqft.corr(method='spearman', numeric_only=True) #restrict to numeric columns; 'type' and 'status' are excluded
plt.figure(figsize=(14,8))
sns.heatmap(corr_matrix, annot=True, cmap='Spectral', fmt='.2f')
plt.show()
This pairwise correlation doesn't indicate a strong association between PPSqft and the other features, apart from 'listPrice', which is a component of the metric. However, correlation does appear among the other variables, which are the independent variables in this study. We will continue with factor analysis to see if combining these features into unobserved variables can explain the variability of the ppsqft metric.
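To attach a p-value to individual coefficients (an optional check, not in the original analysis), scipy.stats.spearmanr can be applied feature by feature; a minimal sketch:
from scipy.stats import spearmanr
numeric_cols = ['beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'age', 'lastSoldNumYears', 'totalBaths']
for col in numeric_cols:
    rho, p = spearmanr(data_ppsqft[col], data_ppsqft['pricePerSqft']) #rank correlation and its significance
    print(f"{col}: rho={rho:.2f}, p={p:.3g}")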
data_ppsqft_FA = Find_Outliers(data_ppsqft, "pricePerSqft", "NULL")[2]
print(data_ppsqft_FA.shape)
(4157, 12)
data_ppsqft_FA.drop('status', axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'], dtype='object')
'sqft' and 'listPrice' are the components of 'pricePerSqft', so they are dropped.
data_ppsqft_FA.drop(columns=['sqft','listPrice'], axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'stories', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'], dtype='object')
The 'type' feature is a categorical variable; it needs to be converted to a quantitative variable for this factor analysis. Label encoding has been chosen to understand how house type overall may explain the variability within the dataset. Label encoding this column also makes the factor results easier to interpret.
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
old_head = pd.DataFrame()
print(data_ppsqft_FA['type'].unique()) #print unique house type categories
old_head = data_ppsqft_FA.head()
data_ppsqft_FA['type'] = label_encoder.fit_transform(data_ppsqft_FA['type']) #encode house type labels
#data_ppsqft_FA = pd.get_dummies(data_ppsqft_FA, drop_first=True, dtype='int')
print(data_ppsqft_FA['type'].unique())
print(old_head)
print(data_ppsqft_FA.head())
['single_family' 'land' 'mobile' 'condo' 'townhome' 'other' 'farm']
[5 2 3 0 6 4 1]
            type  beds  garage   lot_sqft  stories  age  lastSoldNumYears  \
0  single_family  1.00    2.00   71438.00     2.00    4              6.45
1  single_family  4.00    2.00   56628.00     2.00   56              6.45
2  single_family  4.00    1.00   10019.00     2.00   39              6.45
3  single_family  4.00    2.00   12632.00     2.00   88              6.57
4           land  4.00    2.00  872071.00     2.00   21              6.45

   pricePerSqft  totalBaths
0        129.31        2.50
1        215.95        5.50
2        110.52        6.50
3        143.06        4.50
4         29.17        2.50

   type  beds  garage   lot_sqft  stories  age  lastSoldNumYears  pricePerSqft  \
0     5  1.00    2.00   71438.00     2.00    4              6.45        129.31
1     5  4.00    2.00   56628.00     2.00   56              6.45        215.95
2     5  4.00    1.00   10019.00     2.00   39              6.45        110.52
3     5  4.00    2.00   12632.00     2.00   88              6.57        143.06
4     2  4.00    2.00  872071.00     2.00   21              6.45         29.17

   totalBaths
0        2.50
1        5.50
2        6.50
3        4.50
4        2.50
single_family = 5, land = 2, mobile = 3, condo = 0, townhome = 6, other = 4, farm = 1
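The same mapping can be recovered programmatically from the fitted encoder, rather than reading it off the printed output by hand (a small aside):
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(mapping) #expected: condo=0, farm=1, land=2, mobile=3, other=4, single_family=5, townhome=6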
An assumption in factor analysis is that there should not be homoscedasticity between variables. We will check this using two methods.
import statsmodels.api as sm
X = data_ppsqft_FA.drop(columns=['pricePerSqft'], axis=1)
y = data_ppsqft_FA['pricePerSqft']
print(X.shape)
print(y.shape)
model = sm.OLS(y, X).fit()
residuals = model.resid
fitted = model.fittedvalues
plt.scatter(fitted, residuals)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()
(4157, 8)
(4157,)
The data points aren't scattered around zero with a constant width. This appears to show heteroscedasticity between variables, the opposite of homoscedasticity.
H0: Homoscedasticity is present (p > alpha)
H1: Homoscedasticity is not present. Heteroscedasticity exists (p < alpha)
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
alpha = 0.05
names = ['Lagrange Multiplier Statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(residuals, X)
results = lzip(names,test)
for item in results:
print(item)
if(results[1][1] < alpha):
print("Null Hypothesis (H0) is rejected")
else:
print("Failed to reject Null Hypothesis (H0)")
('Lagrange Multiplier Statistic', 1509.3690996398877)
('p-value', 0.0)
('f-value', 295.659243588019)
('f p-value', 0.0)
Null Hypothesis (H0) is rejected
The Lagrange multiplier statistic for the test is 1509.369 and the corresponding p-value is p < 0.001, which is cause to reject the null hypothesis (H0), meaning homoscedasticity is not present. The assumption of no homoscedasticity between variables has been passed.
An assumption in factor analysis is that there should not be perfect multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
#create dataframe to hold VIFs
vif_data = pd.DataFrame()
vif_data['feature'] = data_ppsqft_FA.columns
#calculate VIF of each feature
vif_data["VIF"] = [variance_inflation_factor(data_ppsqft_FA, i) for i in range(len(data_ppsqft_FA.columns))]
print(vif_data)
            feature   VIF
0              type 12.80
1              beds 15.39
2            garage  8.32
3          lot_sqft  1.04
4           stories 11.85
5               age  2.33
6  lastSoldNumYears  4.25
7      pricePerSqft  7.56
8        totalBaths  8.91
The 'pricePerSqft' feature is the variable being analyzed. With a VIF of 7.56, which doesn't exceed 10, we will proceed with factor analysis and consider the multicollinearity assumption passed.
PPSqft is the dependent variable; we are looking to see what factors (independent variables) affect the PPSqft metric. Factor analysis is performed separately for independent and dependent variables, so 'pricePerSqft' is dropped before fitting.
data_ppsqft_FA.drop('pricePerSqft', axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'stories', 'age', 'lastSoldNumYears', 'totalBaths'], dtype='object')
Bartlett's test of sphericity checks whether the observed variables intercorrelate at all by testing the observed correlation matrix against the identity matrix. If the test is found statistically insignificant (p > alpha), factor analysis should not be employed. This is a method to check the factorability of the dataset.
H0: Variables are orthogonal; not correlated (p > alpha)
H1: Variables are not orthogonal; diverges significantly from the identity matrix (p < alpha)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
alpha = 0.05
chi_square, p_value = calculate_bartlett_sphericity(data_ppsqft_FA)
print(f"chi-square value: {chi_square}")
print(f"p-value: {p_value:.3f}")
if(p_value < alpha):
print("Test has found statistical significance. Reject null hypothesis (H0)")
else:
print("Test has found statistical insignificance. Fail to reject null hypothesis (H0)")
chi-square value: 4082.708012018086
p-value: 0.000
Test has found statistical significance. Reject null hypothesis (H0)
The dataset's correlation matrix is not an identity matrix, indicating there is some covariance in the data. This is good for factor analysis, which is meant to find associations among the independent variables. If the variables were orthogonal (no covariance), factor analysis would bear no fruit.
The Kaiser-Meyer-Olkin (KMO) test is a measure of how suited the data is for factor analysis. It measures sampling adequacy for each variable in the model and for the complete model. The statistic (KMO value) is a measure of the proportion of variance among variables that might be common variance; the higher this proportion, the more suited the data is for factor analysis. KMO values range from 0 to 1, and a value less than 0.6 is considered inadequate.
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(data_ppsqft_FA)
print(f"KMO Value: {kmo_model}")
KMO Value: 0.6359704793254995
A KMO value of 0.636 > 0.6 is considered adequate for factor analysis
This is an analytical approach to determining the number of factors based on selecting factors that explain a more significant proportion of variance. The eigenvalue is a good criterion for this: factors with an eigenvalue greater than 1 will be selected (the Kaiser criterion).
from factor_analyzer import FactorAnalyzer
#perform factor analysis
fa = FactorAnalyzer()
fa.fit(data_ppsqft_FA)
#check Eigen values
ev, v = fa.get_eigenvalues()
print("Eigen Values: ")
print(ev)
Eigen Values: [2.26749245 1.19225011 1.04566932 0.93762314 0.82004539 0.73414455 0.63154082 0.37123422]
We also use a scree plot as a visual aid in determining the number of factors; the cutoff is where the curve makes an elbow. We will pick the factors whose eigenvalues are above 1.
plt.scatter(range(1, data_ppsqft_FA.shape[1]+1), ev)
plt.plot(range(1, data_ppsqft_FA.shape[1]+1), ev)
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigen Value")
plt.grid()
plt.show()
num_factors = len([x for x in ev if x >= 1.0])
print("The number of factors chosen using the Kaiser Criterion: ", num_factors)
The number of factors chosen using the Kaiser Criterion: 3
After using the Kaiser criterion and the scree plot to help determine the number of factors, 3 has been chosen as the starting number of factors.
Varimax rotation (orthogonal) has been chosen because it produces factors with a small number of large loadings and a large number of small loadings, which simplifies the interpretation of factors.
Loadings indicate how much a factor explains a variable. The loading score ranges from -1 to 1. Values close to -1 or 1 indicate the factor has an influence on these variables. Values close to 0 indicate the factor has a lower influence on the variable.
fa = FactorAnalyzer(n_factors=num_factors, rotation="varimax")
fa.fit(data_ppsqft_FA)
print(pd.DataFrame(fa.loadings_, index=data_ppsqft_FA.columns))
                     0     1     2
type              0.23  0.71  0.08
beds              0.99 -0.13  0.09
garage            0.35  0.18 -0.17
lot_sqft          0.00 -0.20 -0.03
stories           0.40  0.12  0.01
age              -0.20  0.01  0.80
lastSoldNumYears  0.09  0.08  0.14
totalBaths        0.56  0.46 -0.15
print(pd.DataFrame(fa.get_factor_variance(), index=['Variance', 'Proportional Var', 'Cumulative Var']))
                    0    1    2
Variance         1.67 0.82 0.72
Proportional Var 0.21 0.10 0.09
Cumulative Var   0.21 0.31 0.40
Variance: Variance explained by each factor.
Proportional Var: Variance explained by a factor out of the total variance.
Cumulative Var: Cumulative sum of proportional variances of each factor.
These 3 factors together are able to explain 40% of the total variance.
Communality is the proportion of each variable's variance that can be explained by the factors. It shows the relative contributions each variable has on each of the factors. These values have been multiplied by 100 to show them as a percentage.
print(pd.DataFrame(fa.get_communalities()*100, index=data_ppsqft_FA.columns, columns=['Communalities']))
                  Communalities
type                      55.57
beds                      99.86
garage                    18.49
lot_sqft                   4.27
stories                   17.33
age                       67.96
lastSoldNumYears           3.30
totalBaths                55.08
Factor 0 explains 21% of the total variance and consists mainly of the variables 'beds' (.99) and 'totalBaths' (.56).
Factor 1 explains 10% of the total variance and consists mainly of the variables 'type' (.71) and 'totalBaths' (.46).
Factor 2 explains 9% of the total variance and consists mainly of the variable 'age' (.80).
Total beds and baths appear to be the strongest variables in the dataset for explaining price per sqft (Factor 0).
Altogether these factors explain 40% of the total variance, which is lower than the 75% explanatory power we'd like for price prediction in real estate.
Although the heavily weighted variables' variances in this analysis are explained by these factors with communalities greater than 50% (type: 55.57%, beds: 99.86%, age: 67.96%, totalBaths: 55.08%), the factors account for only 40% of the variability in the response variable, price per sqft. This means additional data points for each sample, such as longitude/latitude, county, and city, are required to gain a better understanding of what affects the ppsqft metric. It also shows that data points outside of house dimensions (type, beds, garage spaces, stories, age, total baths, lot sqft) may have a larger effect on price per sqft, as the house dimensions only account for 40% of the variability in the metric.