Overview: Here we explore the price per sqft (ppsqft) metric and which variables may affect it. This dataset does not contain geographical data for each sample; instead, this study explores how much the house dimensions (age, beds, baths, garage spaces, stories, lot sqft) explain the variability in the ppsqft metric.
This dataset contains real estate listings from Utah in 2024. It comprises 4,440 entries and 14 columns. Attributes contained within the dataset include type, description, year built, number of bedrooms and bathrooms, garage spaces, lot size, square footage, stories, listing price, status, and when the property was last sold. This data was ethically mined from Realtor.com using an API provided by Apify.
Source: The dataset is from Kaggle: https://www.kaggle.com/datasets/kanchana1990/real-estate-data-utah-2024
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.float_format', lambda x: '%.2f' %x)
data = pd.read_csv("real_estate_utah.csv") #load data
print(data.columns) #print columns
Index(['type', 'text', 'year_built', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'lastSoldOn', 'listPrice', 'status'], dtype='object')
print(data.shape) #print shape of data set
(4440, 14)
print(data.head(n=5)) #print first 5 rows
            type                                               text  \
0  single_family  Escape to tranquility with this off-grid, unfi...
1  single_family  Beautiful home in the desirable Oak Hills and ...
2  single_family  Welcome to your new home, nestled in the heart...
3  single_family  Investment Opportunity. House needs some work ...
4           land  Deer Springs Ranch is an 8000 Ac Ranch in an H...

   year_built  beds  baths  baths_full  baths_half  garage   lot_sqft    sqft  \
0     2020.00  1.00   1.00        1.00        1.00    2.00   71438.00  696.00
1     1968.00  4.00   3.00        2.00        1.00    2.00   56628.00 3700.00
2     1985.00  4.00   3.00        3.00        1.00    1.00   10019.00 3528.00
3     1936.00  4.00   2.00        2.00        1.00    2.00   12632.00 2097.00
4     2003.00  4.00   0.00        2.00        1.00    2.00  872071.00 2400.00

   stories  lastSoldOn  listPrice    status
0     2.00  2018-05-31   90000.00  for_sale
1     2.00  2018-05-31  799000.00  for_sale
2     2.00  2018-05-31  389900.00  for_sale
3     2.00  2018-04-16  300000.00  for_sale
4     2.00  2018-05-31   70000.00  for_sale
print(data.dtypes) #print data types
type                  object
text                  object
year_built           float64
beds                 float64
baths                float64
baths_full           float64
baths_half           float64
garage               float64
lot_sqft             float64
sqft                 float64
stories              float64
lastSoldOn            object
listPrice            float64
status                object
dtype: object
print(data.isna().sum()) #print number of null values found in each column
type          0
text          0
year_built    0
beds          0
baths         0
baths_full    0
baths_half    0
garage        0
lot_sqft      0
sqft          0
stories       0
lastSoldOn    0
listPrice     0
status        0
dtype: int64
print(data.describe().T)
             count      mean         std     min       25%       50%       75%           max
year_built 4440.00   1997.94       23.61 1860.00   1997.00   2003.00   2007.00       2026.00
beds       4440.00      3.89        1.27    1.00      3.00      4.00      4.00         19.00
baths      4440.00      2.45        1.79    0.00      2.00      3.00      3.00         45.00
baths_full 4440.00      2.24        1.17    1.00      2.00      2.00      3.00         45.00
baths_half 4440.00      1.02        0.19    1.00      1.00      1.00      1.00          6.00
garage     4440.00      2.33        1.02    0.00      2.00      2.00      2.00         20.00
lot_sqft   4440.00 552523.95 11344714.29  436.00   9583.00  13939.00  24394.00  600953760.00
sqft       4440.00   2712.32     1553.68    0.00   1842.00   2400.00   3132.00      20905.00
stories    4440.00      2.00        0.63    1.00      2.00      2.00      2.00          4.00
listPrice  4440.00 796604.38  1731703.12    0.00 353805.00 528995.00 754900.00   48000000.00
data.drop(columns=['text'], axis=1, inplace=True) #text not needed for this analysis
print(data.columns) #confirm changes
Index(['type', 'year_built', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'lastSoldOn', 'listPrice', 'status'], dtype='object')
Removed 'text' column as it is not needed for this analysis
#change 'lastSoldOn' column to date data type
data['lastSoldOn'] = pd.to_datetime(data['lastSoldOn'])
#change year_built to int
data['year_built'] = data['year_built'].astype('int')
print(data.dtypes) #confirm changes
type object year_built int64 beds float64 baths float64 baths_full float64 baths_half float64 garage float64 lot_sqft float64 sqft float64 stories float64 lastSoldOn datetime64[ns] listPrice float64 status object dtype: object
Changed 'lastSoldOn' to 'date' data type.
Changed 'year_built' to 'int' data type
As this is not a time-series analysis, the year feature in this dataset becomes a discrete variable that is of little use in statistical tests and modeling without context for what the year means. 'age' will be calculated as current_year - 'year_built', which gives context to each sample's 'year_built' feature w.r.t. the entire dataset and is a continuous variable that can bring more insight into the dataset.
data['age'] = pd.to_datetime('today').year - data['year_built'] #calculate age
data.drop(columns=['year_built'], axis=1, inplace=True) #drop 'year_built' column
'lastSoldNumYears' will be created as (currentDate - lastSoldOn)/365, making the column's units years. This new feature is continuous and will be of more use in statistical tests and modeling, bringing more insight into the dataset.
#calculate lastSoldNumYears
data['lastSoldNumYears'] = (pd.Timestamp('today').normalize() - data['lastSoldOn'].dt.normalize()).dt.days / 365 #vectorized: whole-day difference converted to years
data.drop(columns=['lastSoldOn'], axis=1, inplace=True)
#confirm changes
print(data.columns)
Index(['type', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'status', 'age', 'lastSoldNumYears'], dtype='object')
Price per square foot is an industry-standard calculation for comparing different properties and will be used in the analysis portion of this notebook. Samples with a sqft of 0.0 will evaluate to infinity, so they will be manually filtered out of analyses, along with samples with a listPrice of 0.0.
data['pricePerSqft'] = data['listPrice'] / data['sqft'] #create feature
print(data.columns) #confirm changes
Index(['type', 'beds', 'baths', 'baths_full', 'baths_half', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'status', 'age', 'lastSoldNumYears', 'pricePerSqft'], dtype='object')
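As a quick illustration of why that filter is needed (a hedged aside with hypothetical values, not part of the pipeline), pandas follows IEEE float semantics, so a nonzero price divided by 0.0 sqft yields inf rather than raising an error:
demo = pd.DataFrame({'listPrice': [100000.0, 100000.0], 'sqft': [1000.0, 0.0]}) #hypothetical rows
print(demo['listPrice'] / demo['sqft']) #row 0 -> 100.00, row 1 -> inf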
The three bath columns appear to be parts of a whole. Combining them will simplify calculation and won't detract from the dataset.
data['totalBaths'] = data['baths'] + data['baths_full'] + (data['baths_half']*0.5) #combine features
data.drop(columns=['baths','baths_full','baths_half'], axis=1, inplace=True) #drop columns
print(data.columns) #confirm changes
print(data.head(n=1)) #print example
Index(['type', 'beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice',
       'status', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'],
      dtype='object')
            type  beds  garage  lot_sqft   sqft  stories  listPrice    status  \
0  single_family  1.00    2.00  71438.00 696.00     2.00   90000.00  for_sale

   age  lastSoldNumYears  pricePerSqft  totalBaths
0    4              6.45        129.31        2.50
#type column analysis
print("Original Values:\n",data.type.value_counts())
#combine like values
townhome_values = ['townhomes', 'townhouse']
condo_values = ['condos', 'condo', 'condo_townhome_rowhome_coop', 'condo_townhome']
data.loc[data['type'].isin(townhome_values), "type"] = "townhome"
data.loc[data['type'].isin(condo_values), 'type'] = "condo"
print("\nConfirm Changes:\n",data.type.value_counts()) #confirm changes
Original Values:
single_family                  2883
land                            801
townhomes                       344
mobile                          206
condos                          156
townhouse                        14
other                            12
farm                              9
condo_townhome_rowhome_coop       8
condo_townhome                    6
condo                             1
Name: type, dtype: int64

Confirm Changes:
single_family    2883
land              801
townhome          358
mobile            206
condo             171
other              12
farm                9
Name: type, dtype: int64
print(data.status.value_counts())
for_sale          4185
ready_to_build     255
Name: status, dtype: int64
Find_Outliers returns the outlier samples and the samples within the lower and upper IQR bounds.
def Find_Outliers(df, column, houseType):
    #df: input dataframe
    #column: column within df to be checked using the IQR method
    #houseType: filter data based on house type ("NULL" applies no filter)
    #return: a list containing 3 dataframes (filtered by house type if specified):
    #1: samples below the lower bound
    #2: samples above the upper bound
    #3: df with outliers removed
    if houseType != "NULL":
        df = df.loc[df['type'] == houseType].copy()
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    lowOutliers_df = df[df[column] < lower_bound].copy()
    highOutliers_df = df[df[column] > upper_bound].copy()
    no_outliers_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()
    return [lowOutliers_df, highOutliers_df, no_outliers_df]
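The return value is positional; a quick usage sketch (an illustrative aside, here on the full dataset with no type filter):
low_df, high_df, kept_df = Find_Outliers(data, 'listPrice', "NULL") #unpack [low, high, no_outliers]
print(low_df.shape, high_df.shape, kept_df.shape) #sanity check: the three parts sum to len(data)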
data_ppsqft = data.loc[(data["sqft"] > 0.0) & (data['listPrice'] > 0.0)].copy()
print(data_ppsqft.shape)
(4421, 12)
type_list = data['type'].unique().tolist()
print(type_list)
['single_family', 'land', 'mobile', 'condo', 'townhome', 'other', 'farm']
desc = data_ppsqft['pricePerSqft'].describe()
desc = pd.DataFrame(desc).T
print(desc)
               count   mean    std  min    25%    50%    75%     max
pricePerSqft 4421.00 272.36 380.22 0.62 167.23 224.68 294.12 8312.50
plt.figure(figsize=(5,5))
plt.boxplot(data_ppsqft["pricePerSqft"], showmeans=True)
plt.title("Boxplot of Price Per Sqft")
plt.show()
ppsqft_skewness = data_ppsqft.pricePerSqft.skew()
print("Price Per Sqft skewness: ",ppsqft_skewness)
if(ppsqft_skewness > 0):
print("Data is skewed to the right")
elif(ppsqft_skewness == 0):
print("Data is not skewed")
else:
print("Data is skewed to the left")
Price Per Sqft skewness:  12.471674400066984
Data is skewed to the right
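Heavy right skew like this is often reduced with a log transform; a minimal aside (not applied in the analysis below), just to show the effect on the skewness statistic:
ppsqft_log = np.log1p(data_ppsqft['pricePerSqft']) #log(1+x) is safe for values near zero
print("log1p skewness: ", ppsqft_log.skew())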
#calculate bins for the 'ppsqft' feature using Scott's Rule
mean = data_ppsqft.pricePerSqft.mean()
median = data_ppsqft.pricePerSqft.median()
bins = np.histogram_bin_edges(data_ppsqft['pricePerSqft'], bins='scott')
plt.figure(figsize=(10,5))
sns.histplot(x=data_ppsqft['pricePerSqft'], bins=bins, color='darkblue', kde=True) #histplot draws on the current figure (displot creates its own, leaving an empty figure behind)
plt.axvline(mean, color='r', label=f'mean: {mean:.2f}')
plt.axvline(median, color='g', label=f'median: {median:.2f}')
plt.title(f"Price Per Sqft Distribution: {len(bins)-1} bins")
plt.legend(loc='best')
plt.show()
H0: Sample is from a normal distribution (p > alpha)
H1: Sample is not from a normal distribution (p < alpha)
This Shapiro-Wilk test checks whether the Price Per Sqft data follows a normal distribution, i.e., whether ppsqft is centered around the mean and thus similar for all listed homes within the dataset. If it is not normally distributed, we will continue to analyze which factors affect this metric.
from scipy.stats import shapiro
alpha = 0.05
result, p = shapiro(data_ppsqft['pricePerSqft'])
print("W: ",result)
print("P-Value: ",p)
if(p < alpha):
print("Price Per Sqft is not a normal distribution.\nWe will continue to explore what may have an effect on price per sqft")
else:
print("Price Per Sqft is a normal distribution.")
W:  0.30437177419662476
P-Value:  0.0
Price Per Sqft is not a normal distribution.
We will continue to explore what may have an effect on price per sqft
Different house types are, on average, marketed at different prices (e.g. townhomes tend to be more affordable than single-family homes). House type is used as a filter in these analyses to see if it factors into the PPSqft metric.
plt.figure(figsize=(20,20))
for i, houseType in enumerate(type_list, 1):
plt.subplot(3,3,i)
plt.boxplot(data_ppsqft["pricePerSqft"].loc[data_ppsqft['type'] == houseType], showmeans=True)
plt.title(f"PPSqft of {houseType}")
plt.suptitle("Price Per Sqft Boxplots of Each House Type")
#plt.tight_layout()
plt.show()
plt.figure(figsize=(8,8))
sns.boxplot(x="type", y="pricePerSqft", data=data_ppsqft)
plt.title("PPSqft by type Boxplots ")
plt.xticks(rotation=45)
plt.show()
desc_df = pd.DataFrame()
for houseType in type_list:
desc = data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].describe()
desc.name = houseType
desc = pd.DataFrame(desc).T
#print(desc)
desc_df = pd.concat([desc_df, desc], ignore_index=False)
desc_df['type'] = desc_df.index
desc_df.index = range(desc_df.shape[0])
desc_df = desc_df[['type', 'count','mean', 'std', 'min', '25%', '50%', '75%', 'max']]
print(desc_df)
            type   count   mean    std    min    25%    50%     75%     max
0  single_family 2864.00 282.30 249.03  24.99 196.13 241.22  307.99 6458.33
1           land  801.00 254.51 733.18   0.62  52.08 101.88  160.42 8312.50
2         mobile  206.00 159.32 138.43  10.42  74.22 118.37  200.75 1260.23
3          condo  171.00 332.63 193.46  29.09 240.27 272.96  352.26 1811.91
4       townhome  358.00 255.92 133.24  33.37 203.21 236.21  278.76 1463.19
5          other   12.00 223.35  33.82 192.50 200.90 211.30  239.48  301.87
6           farm    9.00 858.72 983.25  49.58 149.96 250.00 1645.83 2743.95
plt.figure(figsize=(20,20))
for i, houseType in enumerate(type_list, 1):
plt.subplot(3,3,i)
sns.histplot(data_ppsqft["pricePerSqft"].loc[data_ppsqft['type'] == houseType], kde=True)
plt.title(houseType)
plt.suptitle("Price Per Sqft distribution per House Type")
plt.show()
alpha = 0.05
for houseType in type_list:
print("House Type: " ,houseType)
print("Skewness: ", data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].skew())
print("Mean: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].mean())
print("Median: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].median())
print("Standard Deviation: ",data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType].std())
result, p = shapiro(data_ppsqft['pricePerSqft'].loc[data_ppsqft['type'] == houseType])
print("W: ",result)
print("P-Value: ",p)
if(p < alpha):
print("Price Per Sqft is not a normal distribution.\n")
else:
print("Price Per Sqft is a normal distribution.\n")
House Type:  single_family
Skewness:  13.72700763466026
Mean:  282.3002800121268
Median:  241.2213752174358
Standard Deviation:  249.0291842312236
W:  0.34130775928497314
P-Value:  0.0
Price Per Sqft is not a normal distribution.

House Type:  land
Skewness:  7.653097072801116
Mean:  254.51303058676655
Median:  101.875
Standard Deviation:  733.1844005075772
W:  0.27775460481643677
P-Value:  0.0
Price Per Sqft is not a normal distribution.

House Type:  mobile
Skewness:  3.7998513042956175
Mean:  159.3191862275339
Median:  118.37121212121212
Standard Deviation:  138.43446177888248
W:  0.6849346160888672
P-Value:  2.1042841473192538e-19
Price Per Sqft is not a normal distribution.

House Type:  condo
Skewness:  4.043862961040698
Mean:  332.63160051628023
Median:  272.96360485268633
Standard Deviation:  193.4618840451925
W:  0.6271494626998901
P-Value:  3.909373706656839e-19
Price Per Sqft is not a normal distribution.

House Type:  townhome
Skewness:  5.713055585803318
Mean:  255.91988972793735
Median:  236.21170320239574
Standard Deviation:  133.24036284914519
W:  0.5161730051040649
P-Value:  3.923861080671942e-30
Price Per Sqft is not a normal distribution.

House Type:  other
Skewness:  1.3439531402177216
Mean:  223.3542540415976
Median:  211.29519587961244
Standard Deviation:  33.817914525663184
W:  0.8448854088783264
P-Value:  0.03176218271255493
Price Per Sqft is not a normal distribution.

House Type:  farm
Skewness:  1.088982975813337
Mean:  858.7218212681898
Median:  250.0
Standard Deviation:  983.2467194494586
W:  0.806327760219574
P-Value:  0.024111604318022728
Price Per Sqft is not a normal distribution.
This output shows price per sqft is not similar among homes classified as the same type. Each house type group has a non-normal, heavily right-skewed distribution with a large standard deviation, indicating much variability within groups.
PPSqft within the house type groups is not normally distributed, meaning an ANOVA cannot be performed to analyze whether the groups differ. Kruskal-Wallis is used instead to test whether the median PPSqft for each house type differs.
H0: Samples all have the same central tendency and therefore there is no difference between groups (p > 0.05)
H1: At least one of the samples does not have the same central tendency and therefore there is a difference between groups (p < 0.05)
from scipy.stats import kruskal
singleFam_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'single_family')[2]
land_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'land')[2]
mobile_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'mobile')[2]
condo_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'condo')[2]
townhome_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'townhome')[2]
other_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'other')[2]
farm_df = Find_Outliers(data_ppsqft, "pricePerSqft", 'farm')[2]
result = kruskal(singleFam_df['pricePerSqft'], land_df['pricePerSqft'], mobile_df['pricePerSqft'],\
condo_df['pricePerSqft'], townhome_df['pricePerSqft'], other_df['pricePerSqft'],\
farm_df['pricePerSqft'])
print(result)
KruskalResult(statistic=1552.24788362764, pvalue=0.0)
This result shows at least one of the samples does not have the same central tendency. Combined with the previous section, this indicates house types differ in terms of PPSqft, and PPSqft differs within each house type group as well.
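As a possible follow-up (not part of the original analysis), pairwise Mann-Whitney U tests with a Bonferroni correction could indicate which house-type pairs differ; a minimal sketch using the per-type dataframes built above:
from itertools import combinations
from scipy.stats import mannwhitneyu
groups = {'single_family': singleFam_df, 'land': land_df, 'mobile': mobile_df,
          'condo': condo_df, 'townhome': townhome_df, 'other': other_df, 'farm': farm_df}
pairs = list(combinations(groups, 2))
alpha_adj = 0.05 / len(pairs) #Bonferroni-adjusted significance level
for a, b in pairs:
    stat, p = mannwhitneyu(groups[a]['pricePerSqft'], groups[b]['pricePerSqft'])
    print(f"{a} vs {b}: p={p:.2e}", "(different)" if p < alpha_adj else "")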
'age' is used as a preliminary feature to explore relationships between price per sqft and other features in the dataset. It is known that homes depreciate as they age, so price per sqft should show a downward trend as home age increases, i.e., the ppsqft ratio should be lower for older homes.
plt.figure(figsize=(10,5))
plt.scatter(x=data_ppsqft['age'], y=data_ppsqft['pricePerSqft'])
plt.xlabel("Home Age")
plt.ylabel("Price Per Sqft")
plt.show()
This graph doesn't reveal a discernible relationship between ppsqft and home age.
This repeats the exploration above, with outliers removed (via Find_Outliers) to lower noise in the data.
df_list = Find_Outliers(data_ppsqft, "pricePerSqft", "NULL") #[low, high, no_outliers]
#print(df_list[2].info())
print("Data Skewness: ",df_list[2]['pricePerSqft'].skew())
plt.figure(figsize=(10,5))
plt.scatter(x=df_list[2]["age"], y=df_list[2]["pricePerSqft"])
plt.show()
Data Skewness: 0.00047584806276346154
Additionally, with outliers removed there is no discernible relationship between ppsqft and home age.
Different house types are, on average, marketed at different prices (e.g. townhomes tend to be more affordable than single-family homes). This section filters to single family homes to see if a discernible relationship exists between home age and PPSqft once noise from other house types is removed. Single family is used as it is the most represented house type in the dataset.
df_list = Find_Outliers(data_ppsqft, "pricePerSqft", "single_family")
print("Data Skewness: ",df_list[2]['pricePerSqft'].skew())
plt.figure(figsize=(10,5))
plt.scatter(x=df_list[2]["age"], y=df_list[2]["pricePerSqft"])
plt.show()
Data Skewness: 0.5552148446021512
Using the most represented house type in the dataset, single family, and with outliers removed, there is still no discernible relationship between ppsqft and home age.
Home age and type don't appear to have a discernible relationship to PPSqft. In this section a pairwise correlation will be computed across all features using the Spearman rank correlation. Spearman is used because this dataset does not pass all assumptions of the Pearson correlation, and Spearman is a nonparametric test of correlation. Spearman measures the strength and direction of association between two variables. The coefficient ranges from -1 to +1: +1 indicates a perfect positive association, -1 a perfect negative association, and values close to 0 a weak or absent association. This pairwise correlation will show the measure of association PPSqft has with the other features in the dataset.
corr_matrix = data_ppsqft.corr(method='spearman', numeric_only=True) #restrict to numeric columns; 'type' and 'status' are excluded
plt.figure(figsize=(14,8))
sns.heatmap(corr_matrix, annot=True, cmap='Spectral', fmt='.2f')
plt.show()
This pairwise correlation doesn't indicate a strong association between PPSqft and the other features, apart from 'listPrice', which is a component of the metric. However, correlation does appear among the other variables, which are the independent variables in this study. We will continue with factor analysis to see if combining these features into unobserved variables can explain the variability of the ppsqft metric.
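To attach a p-value to individual coefficients (an optional check, not in the original analysis), scipy.stats.spearmanr can be applied feature by feature; a minimal sketch:
from scipy.stats import spearmanr
numeric_cols = ['beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'age', 'lastSoldNumYears', 'totalBaths']
for col in numeric_cols:
    rho, p = spearmanr(data_ppsqft[col], data_ppsqft['pricePerSqft']) #rank correlation and its significance
    print(f"{col}: rho={rho:.2f}, p={p:.3g}")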
data_ppsqft_FA = Find_Outliers(data_ppsqft, "pricePerSqft", "NULL")[2]
print(data_ppsqft_FA.shape)
(4157, 12)
data_ppsqft_FA.drop('status', axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'sqft', 'stories', 'listPrice', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'], dtype='object')
'sqft' and 'listPrice' are the components of 'pricePerSqft', so they are dropped.
data_ppsqft_FA.drop(columns=['sqft','listPrice'], axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'stories', 'age', 'lastSoldNumYears', 'pricePerSqft', 'totalBaths'], dtype='object')
The 'type' feature is a categorical variable; it needs to be converted to a quantitative variable for this factor analysis. Label encoding has been chosen to understand how house type overall may explain the variability within the dataset. Label encoding this column also makes the factor results easier to interpret.
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
old_head = pd.DataFrame()
print(data_ppsqft_FA['type'].unique()) #print unique house type categories
old_head = data_ppsqft_FA.head()
data_ppsqft_FA['type'] = label_encoder.fit_transform(data_ppsqft_FA['type']) #encode house type labels
#data_ppsqft_FA = pd.get_dummies(data_ppsqft_FA, drop_first=True, dtype='int')
print(data_ppsqft_FA['type'].unique())
print(old_head)
print(data_ppsqft_FA.head())
['single_family' 'land' 'mobile' 'condo' 'townhome' 'other' 'farm']
[5 2 3 0 6 4 1]
            type  beds  garage   lot_sqft  stories  age  lastSoldNumYears  \
0  single_family  1.00    2.00   71438.00     2.00    4              6.45
1  single_family  4.00    2.00   56628.00     2.00   56              6.45
2  single_family  4.00    1.00   10019.00     2.00   39              6.45
3  single_family  4.00    2.00   12632.00     2.00   88              6.57
4           land  4.00    2.00  872071.00     2.00   21              6.45

   pricePerSqft  totalBaths
0        129.31        2.50
1        215.95        5.50
2        110.52        6.50
3        143.06        4.50
4         29.17        2.50

   type  beds  garage   lot_sqft  stories  age  lastSoldNumYears  pricePerSqft  \
0     5  1.00    2.00   71438.00     2.00    4              6.45        129.31
1     5  4.00    2.00   56628.00     2.00   56              6.45        215.95
2     5  4.00    1.00   10019.00     2.00   39              6.45        110.52
3     5  4.00    2.00   12632.00     2.00   88              6.57        143.06
4     2  4.00    2.00  872071.00     2.00   21              6.45         29.17

   totalBaths
0        2.50
1        5.50
2        6.50
3        4.50
4        2.50
single_family = 5, land = 2, mobile = 3, condo = 0, townhome = 6, other = 4, farm = 1
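The same mapping can be recovered programmatically from the fitted encoder, rather than reading it off the printed output by hand (a small aside):
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(mapping) #expected: condo=0, farm=1, land=2, mobile=3, other=4, single_family=5, townhome=6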
An assumption in factor analysis is that there should not be homoscedasticity between variables. We will check this using two methods.
import statsmodels.api as sm
X = data_ppsqft_FA.drop(columns=['pricePerSqft'], axis=1)
y = data_ppsqft_FA['pricePerSqft']
print(X.shape)
print(y.shape)
model = sm.OLS(y, X).fit()
residuals = model.resid
fitted = model.fittedvalues
plt.scatter(fitted, residuals)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()
(4157, 8)
(4157,)
The data points aren't scattered around zero with a constant width. This appears to show heteroscedasticity between variables, the opposite of homoscedasticity.
H0: Homoscedasticity is present (p > alpha)
H1: Homoscedasticity is not present. Heteroscedasticity exists (p < alpha)
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
alpha = 0.05
names = ['Lagrange Multiplier Statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(residuals, X)
results = lzip(names,test)
for item in results:
print(item)
if(results[1][1] < alpha):
print("Null Hypothesis (H0) is rejected")
else:
print("Failed to reject Null Hypothesis (H0)")
('Lagrange Multiplier Statistic', 1509.3690996398877)
('p-value', 0.0)
('f-value', 295.659243588019)
('f p-value', 0.0)
Null Hypothesis (H0) is rejected
The Lagrange multiplier statistic for the test is 1509.369 and the corresponding p-value is p < 0.001, which is cause to reject the null hypothesis (H0), meaning homoscedasticity is not present. The assumption of no homoscedasticity between variables has been passed.
An assumption in factor analysis is that there should not be perfect multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
#create dataframe to hold VIFs
vif_data = pd.DataFrame()
vif_data['feature'] = data_ppsqft_FA.columns
#calculate VIF of each feature
vif_data["VIF"] = [variance_inflation_factor(data_ppsqft_FA, i) for i in range(len(data_ppsqft_FA.columns))]
print(vif_data)
            feature   VIF
0              type 12.80
1              beds 15.39
2            garage  8.32
3          lot_sqft  1.04
4           stories 11.85
5               age  2.33
6  lastSoldNumYears  4.25
7      pricePerSqft  7.56
8        totalBaths  8.91
The 'pricePerSqft' feature is the variable being analyzed. With a VIF of 7.56, which doesn't exceed 10, we will proceed with factor analysis and consider the multicollinearity assumption passed.
PPSqft is the dependent variable; we are looking to see what factors (independent variables) affect the PPSqft metric. Factor analysis is performed separately for independent and dependent variables, so 'pricePerSqft' is dropped before fitting.
data_ppsqft_FA.drop('pricePerSqft', axis=1, inplace=True)
print(data_ppsqft_FA.columns)
Index(['type', 'beds', 'garage', 'lot_sqft', 'stories', 'age', 'lastSoldNumYears', 'totalBaths'], dtype='object')
Bartlett's test of sphericity checks whether the observed variables intercorrelate at all by testing the observed correlation matrix against the identity matrix. If the test is found statistically insignificant (p > alpha), factor analysis should not be employed. This is a method to check the factorability of the dataset.
H0: Variables are orthogonal; not correlated (p > alpha)
H1: Variables are not orthogonal; diverges significantly from the identity matrix (p < alpha)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
alpha = 0.05
chi_square, p_value = calculate_bartlett_sphericity(data_ppsqft_FA)
print(f"chi-square value: {chi_square}")
print(f"p-value: {p_value:.3f}")
if(p_value < alpha):
print("Test has found statistical significance. Reject null hypothesis (H0)")
else:
print("Test has found statistical insignificance. Fail to reject null hypothesis (H0)")
chi-square value: 4082.708012018086
p-value: 0.000
Test has found statistical significance. Reject null hypothesis (H0)
The dataset's correlation matrix is not an identity matrix, indicating there is some covariance in the data. This is good for factor analysis, which is meant to find associations among the independent variables. If the variables were orthogonal (no covariance), factor analysis would bear no fruit.
The Kaiser-Meyer-Olkin (KMO) test is a measure of how suited the data is for factor analysis. It measures sampling adequacy for each variable in the model and for the complete model. The statistic (KMO value) is a measure of the proportion of variance among variables that might be common variance; the higher this proportion, the more suited the data is for factor analysis. KMO values range from 0 to 1, and a value less than 0.6 is considered inadequate.
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(data_ppsqft_FA)
print(f"KMO Value: {kmo_model}")
KMO Value: 0.6359704793254995
A KMO value of 0.636 > 0.6 is considered adequate for factor analysis
This is an analytical approach to determining the number of factors based on selecting factors that explain a more significant proportion of variance. The eigenvalue is a good criterion for this: factors with an eigenvalue greater than 1 will be selected (the Kaiser criterion).
from factor_analyzer import FactorAnalyzer
#perform factor analysis
fa = FactorAnalyzer()
fa.fit(data_ppsqft_FA)
#check Eigen values
ev, v = fa.get_eigenvalues()
print("Eigen Values: ")
print(ev)
Eigen Values: [2.26749245 1.19225011 1.04566932 0.93762314 0.82004539 0.73414455 0.63154082 0.37123422]
We also use a scree plot as a visual aid in determining the number of factors; the cutoff is where the curve makes an elbow. We will pick the factors whose eigenvalues are above 1.
plt.scatter(range(1, data_ppsqft_FA.shape[1]+1), ev)
plt.plot(range(1, data_ppsqft_FA.shape[1]+1), ev)
plt.title("Scree Plot")
plt.xlabel("Factors")
plt.ylabel("Eigen Value")
plt.grid()
plt.show()
num_factors = len([x for x in ev if x >= 1.0])
print("The number of factors chosen using the Kaiser Criterion: ", num_factors)
The number of factors chosen using the Kaiser Criterion: 3
After using the Kaiser criterion and the scree plot to help determine the number of factors, 3 has been chosen as the starting number of factors.
Varimax rotation (orthogonal) has been chosen because it produces factors with a small number of large loadings and a large number of small loadings, which simplifies the interpretation of factors.
Loadings indicate how much a factor explains a variable. The loading score ranges from -1 to 1. Values close to -1 or 1 indicate the factor has an influence on these variables. Values close to 0 indicate the factor has a lower influence on the variable.
fa = FactorAnalyzer(n_factors=num_factors, rotation="varimax")
fa.fit(data_ppsqft_FA)
print(pd.DataFrame(fa.loadings_, index=data_ppsqft_FA.columns))
                     0     1     2
type              0.23  0.71  0.08
beds              0.99 -0.13  0.09
garage            0.35  0.18 -0.17
lot_sqft          0.00 -0.20 -0.03
stories           0.40  0.12  0.01
age              -0.20  0.01  0.80
lastSoldNumYears  0.09  0.08  0.14
totalBaths        0.56  0.46 -0.15
print(pd.DataFrame(fa.get_factor_variance(), index=['Variance', 'Proportional Var', 'Cumulative Var']))
                    0    1    2
Variance         1.67 0.82 0.72
Proportional Var 0.21 0.10 0.09
Cumulative Var   0.21 0.31 0.40
Variance: Variance explained by each factor.
Proportional Var: Variance explained by a factor out of the total variance.
Cumulative Var: Cumulative sum of proportional variances of each factor.
These 3 factors together are able to explain 40% of the total variance.
Communality is the proportion of each variable's variance that can be explained by the factors. It shows the relative contributions each variable has on each of the factors. These values have been multiplied by 100 to show them as a percentage.
print(pd.DataFrame(fa.get_communalities()*100, index=data_ppsqft_FA.columns, columns=['Communalities']))
                  Communalities
type                      55.57
beds                      99.86
garage                    18.49
lot_sqft                   4.27
stories                   17.33
age                       67.96
lastSoldNumYears           3.30
totalBaths                55.08
Factor 0 explains 21% of the total variance and consists mainly of the variables 'beds' (.99) and 'totalBaths' (.56).
Factor 1 explains 10% of the total variance and consists mainly of the variables 'type' (.71) and 'totalBaths' (.46).
Factor 2 explains 9% of the total variance and consists mainly of the variable 'age' (.80).
Total beds and baths appear to be the strongest variables in the dataset for explaining price per sqft (Factor 0).
Altogether these factors explain 40% of the total variance, which is lower than the 75% explanatory power we'd like for price prediction in real estate.
Although the heavily weighted variables' variances in this analysis are explained by these factors with communalities greater than 50% (type: 55.57%, beds: 99.86%, age: 67.96%, totalBaths: 55.08%), the factors account for only 40% of the variability in the response variable, price per sqft. This means additional data points for each sample, such as longitude/latitude, county, and city, are required to gain a better understanding of what affects the ppsqft metric. It also shows that data points outside of house dimensions (type, beds, garage spaces, stories, age, total baths, lot sqft) may have a larger effect on price per sqft, as the house dimensions only account for 40% of the variability in the metric.