The comparable sales approach is a key method in real
estate valuation, yet the adjustments made during this process are often
regarded as more of an art than a science. This subjectivity can pose a
significant challenge, particularly when justifying these adjustments to an
audience without a technical background. Therefore, a clear and straightforward
data-driven model is essential to promote fairness and understanding.
This blog post introduces a two-pass regression methodology for developing a robust linear regression model to value single-family homes in Master Planned Unit Developments (MPUDs). Using a dataset of 1,929 sales from 2024 across four towns, we demonstrate how this approach enhances model accuracy and reliability.

In the first pass, we build an initial model and calculate Sales Ratios (Predicted Price / Sale Price) to identify and remove outliers: unusual sales that distort results. In the second pass, we refine the model on the cleaned dataset, producing precise, interpretable coefficients for adjustments such as $144 per square foot of living area or $1,545 per month for sale timing. By removing just 2.75% of sales (53 outliers), we increased the model's explanatory power from 66.1% to 84.9% and reduced prediction errors by 39%, ensuring trustworthy valuations.

This methodology is simple to implement, easy to explain, and empowers professionals to deliver defensible adjustments with confidence.
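For readers who want to experiment, the two-pass loop can be sketched in a few lines of Python. This is a minimal illustration, not the model from the post: the 0.75–1.25 Sales Ratio cutoffs and the function name are my own assumptions, since the post does not state the exact outlier rule it used.

```python
import numpy as np

def two_pass_ols(X, y, lo=0.75, hi=1.25):
    """Two-pass OLS: fit, flag Sales Ratio outliers, refit on the clean set.

    Sales Ratio = Predicted Price / Sale Price, as defined in the post.
    The [lo, hi] cutoffs are illustrative assumptions, not the post's rule.
    """
    Xd = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta1, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # first pass (diagnostic only)
    ratios = (Xd @ beta1) / y                       # Sales Ratio for each sale
    keep = (ratios >= lo) & (ratios <= hi)          # drop distorting sales
    beta2, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)  # second pass
    return beta1, beta2, keep
```

On a synthetic market with a handful of deeply discounted (e.g., non-arm's-length) sales, the second pass recovers the true per-square-foot coefficient closely even when the first pass is distorted.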
The regression output is derived from an Ordinary Least Squares
(OLS) model, with Sale Price as the dependent variable. This analysis utilizes
2024 sales data from 1,929 sales of single-family homes across four Master
Planned Unit Developments (MPUDs) located in four adjacent towns. The valuation
date is January 1, 2025.
The independent variables include MONTHS SINCE, which accounts for time adjustments (for example, January is assigned a value of 12, December a value of 1, and so on). The towns are represented as dummy variables: TOWN-1, TOWN-2, and TOWN-3, with TOWN-4 serving as the reference. Additionally, standard quantitative variables include LAND SF, BLDG AGE, LIVING SF, OTHER SF, BATHS, and STORIES. Below is an analysis of the model's efficiency and key metrics.
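As an aside, the dummy coding described above is straightforward to set up in pandas. The frame below is a sketch with invented values; only the column naming follows the post.

```python
import pandas as pd

# Invented sample rows; TOWN-4 is the reference category from the post.
sales = pd.DataFrame({
    "TOWN": ["TOWN-1", "TOWN-2", "TOWN-4", "TOWN-3"],
    "MONTHS_SINCE": [12, 6, 1, 3],
    "LIVING_SF": [1850, 2100, 1972, 2400],
})

# One indicator column per town, then drop the reference so that
# a TOWN-4 sale is encoded as all zeros in the town columns.
X = pd.get_dummies(sales, columns=["TOWN"])
X = X.drop(columns=["TOWN_TOWN-4"])
```

With this encoding, each town coefficient measures the price difference relative to TOWN-4, exactly as interpreted later in the post.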
Model Efficiency and Interpretation
Adjusted R-squared: The Adjusted R-squared is 0.659467, meaning the model explains about 66% of the variation in sale price, which is an excellent start for our purpose.
Significance: The F-statistic of 374.37 and
its corresponding p-value of 0.0000 show that the model as a whole is
highly statistically significant.
MONTHS SINCE (time adjustment): The coefficient is $583.61 per month but insignificant (p = 0.5380, t = 0.6159), suggesting that the market was flat in 2024.
TOWN Variables: The dummy-coded TOWN-1, TOWN-2, and TOWN-3 variables are all highly significant (p-values of 0.0000). This confirms that there are statistically significant price differences between the MPUDs in the four towns.
Coefficients: The OTHER SF (non-living area) coefficient is $206.90 per square foot, while LIVING SF is $140.32 per square foot. Without a specific variable for
premium features like "golf course lot," the regression model is
likely attributing the premium value of these properties to the most correlated
variable it has—the non-living area. Homes on a golf course often feature
larger and more elaborate lanais, patios, and outdoor living spaces, which are
all categorized under non-living areas. The model is effectively saying that a
larger non-living area is a strong indicator of a premium location or amenity,
and it assigns a higher value to that variable to account for the missing
information.
BATHS: The coefficient for BATHS is $45,768.66, meaning that, on average, each additional bathroom in a home is associated with an increase in sale price of approximately $45,769, holding all other variables constant. This coefficient reflects the significant value buyers place on the number of bathrooms in a home.
STORIES: The coefficient for STORIES is -$54,586.03, indicating that, on average, a two-story home sells for approximately $54,586 less than a single-story home, all else being equal. This is a common finding in many Sunbelt retirement housing markets, where single-story homes are preferred for their convenience and accessibility. The negative coefficient reflects this market preference.
The Second Regression Pass
Analysis of the Second Pass
The removal of outliers has had a dramatic and
positive impact on the model.
- Improved Efficiency: The Adjusted R-squared jumped from 0.659 to 0.848, meaning the model now explains almost 85% of the variation in sale prices. This is a substantial improvement and indicates a strong fit. The Standard Error also decreased significantly, from 132,803 to 80,403, showing that the average prediction error is much lower.
- Significance: The F-statistic is now 1,051.34, and the model as a whole remains highly significant (p-value of 0.0000). The coefficients for all variables, including "MONTHS SINCE", are now statistically significant, with p-values far below the 0.05 threshold.
- Outlier Impact: Removing 53 sales (2.75%) eliminated noise, revealing the MONTHS SINCE trend and refining coefficients.
- Coefficient Changes: Most coefficients are stable but refined:
- TOWN Dummy Variables: The coefficients for the dummy variables directly show the price difference relative to the reference category, TOWN-4. Here's how to interpret the coefficients from the second-pass regression:
  - TOWN-1 Coefficient: $66,368.66. On average, a home in TOWN-1 sells for approximately $66,369 more than an identical home in the reference town, TOWN-4.
  - TOWN-2 Coefficient: $175,751.86. A home in TOWN-2 sells for about $175,752 more than an identical home in TOWN-4.
  - TOWN-3 Coefficient: $73,435.09. A home in TOWN-3 sells for roughly $73,435 more than an identical home in TOWN-4.
By simply looking at the coefficients, we can
see the premium or discount for each town compared to the chosen baseline,
TOWN-4. This dummy setup is a very clear and effective way to present the
location variable's impact on sale price.
- The "MONTHS SINCE" variable has become significant (p = 0.00804) after removing the outliers. This is a crucial finding. The coefficient of $1,544.56 indicates that the market was appreciating by approximately $1,545 per month in 2024. The presence of outliers in the first pass was likely masking this subtle but real market trend. After removing the outliers, the model reveals the actual underlying pattern of price appreciation.
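To put the time coefficient to work, a comparable's sale price can be brought forward to the valuation date. A hedged sketch, assuming the linear per-month trend applies additively (the function name is mine):

```python
MONTHLY_ADJ = 1_544.56  # second-pass MONTHS SINCE coefficient from the post

def time_adjust(sale_price: float, months_since: int) -> float:
    """Adjust a comparable sale to the January 1, 2025 valuation date.

    months_since follows the post's coding: December 2024 = 1, January 2024 = 12.
    Assumes appreciation is linear and additive, per the regression model.
    """
    return sale_price + MONTHLY_ADJ * months_since
```

For example, a comp that sold for $500,000 in January 2024 (12 months before the valuation date) would be adjusted to roughly $518,535.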
- LAND SF increased ($6.82 to $10.56), suggesting outliers had masked land value.
- BLDG AGE became more negative (-$2,654.81 to -$3,422.43), indicating stronger depreciation.
- LIVING SF and OTHER SF are stable ($140.32 to $144.40 and $206.90 to $197.33, respectively), with OTHER SF still higher.
- BATHS and STORIES decreased slightly in magnitude, reflecting cleaner data.
This two-pass methodology—running an initial
regression, identifying and removing outliers, and then running a final
regression—is a robust, defensible, and statistically sound process. The final
model, built on the cleaned data, has a much higher R-squared, lower error, and
coefficients that are more reliable and easier to interpret. The model now
accurately reflects a market that was appreciating throughout the year.
Valuation Grid for Subjects Using Regression Coefficients
The table below estimates the value of four subject properties,
one in each town (TOWN-1, TOWN-2, TOWN-3, TOWN-4), using the second-pass
regression model’s coefficients. Each property has identical attributes: LAND
SF = 25,700, BLDG AGE = 21 years, LIVING SF = 1,972, OTHER SF = 1,478, BATHS =
2.00, STORIES = 1.00, valued as of January 1, 2025 (MONTHS SINCE = 0). The grid
shows how coefficients contribute to the predicted price, enabling valuation
professionals to explain and justify comparable sales adjustments to a non-technical
audience.
The estimated value for each subject
property was calculated by summing the Intercept and the product of each
variable's Coefficient and the corresponding subject Attribute.
The process is as follows:
1. Start with the Intercept from the regression model.
2. Add the value for each of the subject's attributes by multiplying its attribute value by the coefficient for that variable.
3. For the TOWN variable, add only the coefficient for the subject's specific town. The reference town (TOWN-4) has no coefficient and contributes a value of 0.
4. Set the MONTHS SINCE variable to 0, as the valuation date is January 1, 2025.
Here's an example of how the calculation was performed for the subject property in TOWN-1:

Calculation for TOWN-1

Estimated Value = Intercept + Town Adj + Time Adj + Land SF + Bldg Age + Living SF + Other SF + Baths + Stories
Estimated Value = -116,742.96 + 66,368.66 + (0) + (25,700 × 10.56) + (21 × -3,422.43) + (1,972 × 144.40) + (1,478 × 197.33) + (2 × 39,626.86) + (1 × -49,733.96)
Estimated Value = -116,742.96 + 66,368.66 + 271,432.00 - 71,871.03 + 284,724.80 + 291,617.74 + 79,253.72 - 49,733.96
Estimated Value = $755,049
The same process was used for the other towns; the only difference is the town-specific adjustment coefficient.
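The grid arithmetic is easy to replicate in code. The sketch below uses the rounded coefficients printed above, so it lands within about $30 of the $755,049 figure, whose line items were computed with full-precision coefficients.

```python
# Second-pass coefficients, rounded as reported in the post.
coef = {
    "TOWN-1": 66_368.66, "MONTHS SINCE": 1_544.56,
    "LAND SF": 10.56, "BLDG AGE": -3_422.43,
    "LIVING SF": 144.40, "OTHER SF": 197.33,
    "BATHS": 39_626.86, "STORIES": -49_733.96,
}
INTERCEPT = -116_742.96

# The TOWN-1 subject from the valuation grid; TOWN-1 = 1 is the dummy flag.
subject = {"TOWN-1": 1, "MONTHS SINCE": 0, "LAND SF": 25_700,
           "BLDG AGE": 21, "LIVING SF": 1_972, "OTHER SF": 1_478,
           "BATHS": 2.00, "STORIES": 1.00}

value = INTERCEPT + sum(coef[k] * v for k, v in subject.items())
```

For the other towns, swap the town dummy key; a TOWN-4 subject simply omits any town coefficient, since TOWN-4 is the reference.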
Sales Ratio Analysis
The final sales ratios (SALES RATIO-2) are a
vast improvement and confirm that removing the outliers was the right move.
This analysis provides a solid, data-backed foundation for valuation
professionals.
The comparison of the sales ratio statistics
powerfully demonstrates the positive impact of removing the outliers. Every
metric shows a healthier, more reliable dataset and a superior model.
Mean & Median: The mean and median for the final model are both very close to 1, which is the ideal outcome: it indicates the model is accurately predicting sale prices on average, without any systemic bias to over- or under-predict. The initial median of 1.0151 was slightly skewed by the outliers.
Standard Deviation & Variance: The reduction in standard deviation from 0.1839 to 0.1377 and in sample variance from 0.0338 to 0.0190 is a key indicator of improved model precision: the predicted prices are much closer to the actual sale prices, and the model's predictions are far more consistent.
Skewness & Kurtosis:
This is where the most dramatic improvement is seen.
- Skewness: The initial skewness of 2.8471 shows a long right tail, driven by sales where the predicted price far exceeded the sale price (the reported range of 3.4499 against a minimum ratio of 0.0840 implies a maximum ratio near 3.53). The final skewness of 0.0864 is very close to zero, indicating the distribution is now nearly symmetrical, as we would expect of a normal distribution.
- Kurtosis: The initial kurtosis of 29.1570 indicates a sharply "peaked" distribution with very heavy tails, a classic sign of significant outliers. The final kurtosis of -0.2004 is near zero, confirming that extreme values are now rare and the tails are consistent with a normal distribution.
Range: The range of the sales ratios was
drastically reduced from 3.4499 to 0.8263, which shows that the most
egregious errors in the initial model have been successfully eliminated.
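The ratio statistics quoted above can be recomputed directly from the raw ratios. The sketch below uses the standard moment-based estimators; spreadsheet functions such as Excel's SKEW and KURT apply small-sample bias corrections, so results on the post's data could differ slightly.

```python
import numpy as np

def ratio_stats(ratios):
    """Summary statistics for a vector of Sales Ratios (Predicted / Sale)."""
    r = np.asarray(ratios, dtype=float)
    m, s = r.mean(), r.std(ddof=1)
    z = (r - m) / s
    return {
        "mean": float(m),
        "median": float(np.median(r)),
        "std": float(s),
        "skew": float((z ** 3).mean()),            # ~0 for a symmetric distribution
        "kurtosis": float((z ** 4).mean() - 3.0),  # excess kurtosis; ~0 if normal
        "range": float(r.max() - r.min()),
    }
```

On ratios drawn from a normal distribution, skew and excess kurtosis both come out near zero, which is the pattern the second-pass figures (0.0864 and -0.2004) display.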
This final model is the result of a rigorous
and responsible data analysis process. This approach is perfect for valuation professionals
because it's transparent, easy to explain, and produces a highly credible model
for justifying valuation adjustments.
Why it's Wise to Keep All Variables until Outliers are Removed in a Two-Pass Regression
In a two-pass regression model, it's unwise to
remove an independent variable after the first pass until outliers—unusual
sales that distort results—are removed. Outliers, such as non-arm's-length
transactions or data errors, can mask a variable's true significance by adding
noise. For example, in our 2024 dataset of 1,929 home sales, the MONTHS SINCE
variable, which adjusts for sale timing, appeared insignificant in the first
pass (p = 0.5380, coefficient = $583.61). However, after removing 53 outliers
(2.75%) using Sales Ratios, MONTHS SINCE became significant (p = 0.00804,
coefficient = $1,544.56) in the second pass, unearthing a meaningful price
trend of $1,545 per month, critical for accurate adjustments. By removing
MONTHS SINCE prematurely, we would have missed this trend, reducing the model's
reliability.
Here's a detailed explanation:
1. Masking True Relationships
Outliers are data points that don't fit the
overall pattern of the rest of the data. They can have a disproportionately
large influence on the regression line, pulling it in a direction that
minimizes their error, causing the model to incorrectly see a variable as
insignificant, even if it has a tangible impact on the dependent variable. In
our case, the "MONTHS SINCE" variable initially appeared
insignificant because the extreme values of the outliers were masking the
genuine, though subtle, trend of market appreciation.
2. Avoiding a Biased Model
If an insignificant variable is removed before
addressing outliers, we might be throwing away a variable that is actually
important. The variable's insignificance may be a symptom of the outlier
problem, not a reflection of its genuine relationship with the sale price. By
keeping all variables in the first pass, we create a more complete picture of
the initial data, which helps in correctly identifying which sales are the true
outliers.
3. The Outlier's Impact
Outliers can completely change the slope of
the regression line. For instance, a few sales with exceptionally high prices
could make the model "think" that prices were rising rapidly at the
end of the year, even if most sales were flat. Conversely, a few sales with
unusually low prices could make the model "think" the market was flat
or declining when it was actually appreciating. It's the presence of these
influential data points that distorts the initial results.
4. Correcting the Distortion
The purpose of the two-pass approach is to
correct for this distortion. The first pass is a diagnostic tool, not a final
model. Its primary job is to generate the residuals needed to find the
outliers. The second pass, run on a cleaned dataset, provides the accurate,
unbiased results. By keeping all variables in the first pass, we ensure the
outlier identification process is as accurate as possible.
By keeping all variables in the first pass, we
ensure outliers are identified and removed, allowing the second pass to reveal
genuine relationships and produce robust, defensible coefficients for
comparable sales analysis.
A Call to LLM Model Trainers
In traditional regression analysis, a common practice is to
immediately discard variables that appear statistically insignificant, such as
those with high p-values. This "first-pass" culling of variables is
often seen as a way to simplify the model and improve its efficiency. However,
as our analysis has shown, this conventional wisdom can be misleading and is
not advisable.
This finding is a direct challenge to a conventional practice that
could be baked into how Large Language Models (LLMs) are trained to perform
data analysis. We encourage LLM model trainers to adopt this two-pass
regression methodology, or at the very least, to embed a warning in their
models. When asked to perform a regression analysis on raw, real-world data,
the model should not simply discard variables based on a single pass. It should
be trained to recognize the potential for outliers to distort initial results.
LLMs should warn users: "Removing variables with high
p-values or low coefficients before outlier removal may overlook significant
predictors, as outliers can mask true effects. Consider a two-pass approach to
ensure robust results." This adjustment will enhance the reliability of
regression-based advice, particularly for applications like real estate
valuation, where interpretable models are critical. By adopting or flagging
this methodology, LLMs can empower users to build more accurate, defensible
models, avoiding the pitfalls of conventional practices.
Conclusion
The two-pass regression methodology is a powerful and practical
tool for any valuation professional. By running an initial regression and then
meticulously cleaning the data to remove outliers, we have demonstrated a
rigorous, defensible process. The resulting model—with its high R-squared, low
standard error, and, most importantly, highly significant coefficients—is not
just a better predictor of value; it's a testament to the integrity of the
analysis.
By first identifying and removing outliers—53 sales (2.75%) in our
2024 dataset of 1,929 homes—we eliminated noise that obscured key patterns,
such as a $1,545 monthly price increase. The second pass, using the cleaned 1,876
sales, produced a model explaining 84.9% of price variation, with prediction
errors reduced by 39% to $80,403. This process yielded significant, intuitive
coefficients, like $197 per square foot for outdoor areas (reflecting premium
golf course lots) and $175,752 for TOWN-2 homes compared to TOWN-4, enabling
precise adjustments. Sales Ratios, averaging 1.0098 with a standard deviation
of 0.1377, confirmed the model's accuracy.
This methodology ensures reliable, defensible valuations that valuation
professionals can confidently present to non-technical board members, balancing
precision with simplicity. This approach can transform raw sales data into a
practical tool for fair and transparent comparable sales analysis.
Disclaimer: The two-pass regression model discussed in this blog post may yield different results based on the specific dataset and circumstances of each valuation task. Professionals are encouraged to consider the unique characteristics of each case and exercise discretion in applying this methodology. While the results demonstrated in this blog post showcase the potential benefits of the two-pass regression approach, it is essential to conduct thorough analyses and exercise caution before relying solely on this method for valuation purposes.