Saturday, August 9, 2025

Beyond Gut Feeling: Using Regression to Build a Defensible Comps Adjustments Matrix

The comparable sales approach is a key method in real estate valuation, yet the adjustments made during this process are often regarded as more of an art than a science. This subjectivity can pose a significant challenge, particularly when justifying these adjustments to an audience without a technical background. Therefore, a clear and straightforward data-driven model is essential to promote fairness and understanding.

This blog post introduces a two-pass regression methodology to develop a robust linear regression model for valuing single-family homes in Master Planned Unit Developments (MPUDs). Using a dataset of 1,929 sales from 2024 across four towns, we demonstrate how this approach enhances model accuracy and reliability.

In the first pass, we build an initial model and calculate Sales Ratios (Predicted Price / Sale Price) to identify and remove outliers—unusual sales that distort results. In the second pass, we refine the model with the cleaned dataset, producing precise, interpretable coefficients for adjustments like $144 per square foot of living area or $1,545 per month for sale timing. By removing just 2.75% of sales (53 outliers), we increased the model's explanatory power from 66.1% to 84.9% and reduced prediction errors by 39%, ensuring trustworthy valuations.

This methodology is simple to implement, easy to explain, and empowers professionals to deliver defensible adjustments with confidence.
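For readers who want to follow along in code, here is a minimal sketch of the two-pass workflow using pandas and statsmodels. The file name, the column names, and the outlier cutoff (a band of two standard deviations around the mean sales ratio) are illustrative assumptions; the post does not specify the exact trimming rule used on this dataset.

```python
# Two-pass OLS sketch (illustrative): fit, flag outliers via sales ratios, refit.
# File name, column names, and the trimming rule are assumptions, not the exact
# choices used for the dataset described in this post.
import pandas as pd
import statsmodels.api as sm

sales = pd.read_csv("mpud_sales_2024.csv")   # hypothetical input file

X_cols = ["MONTHS SINCE", "TOWN-1", "TOWN-2", "TOWN-3",
          "LAND SF", "BLDG AGE", "LIVING SF", "OTHER SF", "BATHS", "STORIES"]
X = sm.add_constant(sales[X_cols])
y = sales["SALE PRICE"]

# Pass 1: diagnostic model on all sales
pass1 = sm.OLS(y, X).fit()
sales["SALES RATIO-1"] = pass1.predict(X) / y     # predicted / actual

# Flag outliers; a +/- 2 standard deviation band around the mean ratio is one
# common choice -- an assumption here, not the post's stated rule.
ratio = sales["SALES RATIO-1"]
keep = (ratio - ratio.mean()).abs() <= 2 * ratio.std()

# Pass 2: final model on the cleaned data
clean = sales[keep]
pass2 = sm.OLS(clean["SALE PRICE"], sm.add_constant(clean[X_cols])).fit()

print(f"Dropped {int((~keep).sum())} sales ({(~keep).mean():.2%}); "
      f"adjusted R-squared {pass1.rsquared_adj:.3f} -> {pass2.rsquared_adj:.3f}")
```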


The regression output is derived from an Ordinary Least Squares (OLS) model, with Sale Price as the dependent variable. This analysis utilizes 2024 sales data from 1,929 sales of single-family homes across four Master Planned Unit Developments (MPUDs) located in four adjacent towns. The valuation date is January 1, 2025.

The independent variables include MONTHS SINCE, which accounts for time adjustments by counting the months between the sale and the valuation date (a January 2024 sale is assigned a value of 12, a December 2024 sale a value of 1, and so on).

The towns are represented as dummy variables: TOWN-1, TOWN-2, and TOWN-3, with TOWN-4 serving as the reference. Additionally, standard quantitative variables include LAND SF, BLDG AGE, LIVING SF, OTHER SF, BATHS, and STORIES. Below is an analysis of the model's efficiency and key metrics.
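As a concrete illustration of this coding scheme, the snippet below shows one way the MONTHS SINCE variable and the town dummies could be built with pandas. The raw column names (SALE DATE, TOWN) and the town labels are assumptions made for the example.

```python
# Illustrative variable construction; the SALE DATE and TOWN column names and
# the town labels are assumptions about the source data.
import pandas as pd

sales = pd.read_csv("mpud_sales_2024.csv", parse_dates=["SALE DATE"])

# MONTHS SINCE: months between the sale and the January 1, 2025 valuation date.
# For 2024 sales this is simply 13 minus the sale month (Jan -> 12, Dec -> 1).
sales["MONTHS SINCE"] = 13 - sales["SALE DATE"].dt.month

# Town dummies with TOWN-4 as the reference category (its column is dropped,
# so its effect is absorbed into the intercept).
town_dummies = pd.get_dummies(sales["TOWN"], dtype=int).drop(columns=["TOWN-4"])
sales = pd.concat([sales, town_dummies], axis=1)
```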

Model Efficiency and Interpretation

Adjusted R-squared: The Adjusted R-squared is 0.659467, meaning the model explains about 66% of the variation in sale price, which is an excellent start for our purpose.

Significance: The F-statistic of 374.37 and its corresponding p-value of 0.0000 show that the model as a whole is highly statistically significant.

MONTHS SINCE (the time adjustment) has a coefficient of $583.61 per month but is statistically insignificant (p = 0.5380, t = 0.6159), suggesting, at this stage, that the market was flat in 2024.

TOWN Variables: The dummy-coded TOWN-1, TOWN-2, and TOWN-3 variables are all highly significant (p-values of 0.0000). This confirms that there are statistically significant price differences between the MPUDs in the four towns.

Coefficients: OTHER SF (non-living area) has a coefficient of $206.90 per square foot, while LIVING SF has a coefficient of $140.32 per square foot. Without a specific variable for premium features like a "golf course lot," the regression model is likely attributing the premium value of these properties to the most correlated variable it has: the non-living area. Homes on a golf course often feature larger and more elaborate lanais, patios, and outdoor living spaces, all of which are categorized as non-living area. The model is effectively saying that a larger non-living area is a strong indicator of a premium location or amenity, and it assigns a higher value to that variable to account for the missing information.

BATHS: The coefficient for BATHS is $45,768.66, which means that, on average, each additional bathroom in a home is associated with an increase in sale price of approximately $45,769, holding all other variables constant. This coefficient reflects the significant value that buyers place on the number of bathrooms in a home.

STORIES: The coefficient for STORIES is -$54,586.03, indicating that, on average, a two-story home sells for approximately $54,586 less than a single-story home, all else being equal. This is a common finding in many retirement housing markets in the Sunbelt, as single-story homes are often preferred for their convenience and accessibility. The negative coefficient reflects this market preference.
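The figures quoted above can be read straight off the fitted model object. Continuing the hypothetical pass1 fit from the earlier sketch, a compact way to tabulate the coefficients, t-statistics, and p-values is:

```python
# Summarize the first-pass coefficients, t-stats, and p-values in one table.
# Continues the illustrative pass1 fit from the two-pass sketch above.
import pandas as pd

coef_table = pd.DataFrame({
    "coefficient": pass1.params,
    "t-stat":      pass1.tvalues,
    "p-value":     pass1.pvalues,
}).round(4)
print(coef_table)
```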

The Second Regression Pass

Analysis of the Second Pass

The removal of outliers has had a dramatic and positive impact on the model.

o   Improved Efficiency: The Adjusted R-squared jumped from 0.659 to 0.848, meaning the model now explains almost 85% of the variation in sale prices. This is a substantial improvement and indicates a strong fit. The Standard Error also decreased significantly, from 132,803 to 80,403, showing that the average prediction error is much lower.

o   Significance: The F-statistic is now 1,051.34, and the model as a whole remains highly significant (p-value of 0.0000). The coefficients for all variables—including "MONTHS SINCE"—are now statistically significant with p-values far below the 0.05 threshold.

o   Outlier Impact: Removing 53 sales (2.75%) eliminated noise, revealing the MONTHS SINCE trend and refining coefficients.

o   Coefficient Changes: Most coefficients are stable but refined:

o   TOWN Dummy Variables: The coefficients for the dummy variables directly show the price difference relative to the reference category, TOWN-4. Here's how to interpret the coefficients from the second-pass regression:

  • TOWN-1 Coefficient: $66,368.66, which means that, on average, a home in TOWN-1 sells for approximately $66,369 more than an identical home in the reference town, TOWN-4.
  • TOWN-2 Coefficient: $175,751.86. A home in TOWN-2 sells for about $175,752 more than an identical home in TOWN-4.
  • TOWN-3 Coefficient: $73,435.09. A home in TOWN-3 sells for roughly $73,435 more than an identical home in TOWN-4.

By simply looking at the coefficients, we can see the premium or discount for each town compared to the chosen baseline, TOWN-4. This dummy setup is a very clear and effective way to present the location variable's impact on sale price.

o  The "MONTHS SINCE" variable has become significant (p = 0.00804) after removing the outliers. This is a crucial finding. The coefficient of $1,544.56 indicates that the market was appreciating by approximately $1,545 per month in 2024. The presence of outliers in the first pass was likely masking this subtle but real market trend. After removing the outliers, the model reveals the actual underlying pattern of price appreciation.

o  LAND SF increased ($6.82 to $10.56), suggesting outliers masked land value.

o  BLDG AGE became more negative (-$2,654.81 to -$3,422.43), indicating more substantial depreciation.

o  LIVING SF and OTHER SF are stable ($140.32 to $144.40, $206.90 to $197.33), with OTHER SF still higher.

o  BATHS and STORIES slightly decreased in magnitude, reflecting cleaner data.

This two-pass methodology—running an initial regression, identifying and removing outliers, and then running a final regression—is a robust, defensible, and statistically sound process. The final model, built on the cleaned data, has a much higher R-squared, lower error, and coefficients that are more reliable and easier to interpret. The model now accurately reflects a market that was appreciating throughout the year.
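Continuing the hypothetical pass1 and pass2 fits from the earlier sketch, the short comparison below puts the headline fit statistics of the two passes side by side; the standard error shown is the root mean squared residual, which corresponds to the regression's standard error.

```python
# Compare the headline fit statistics of the two passes (continuing the
# illustrative pass1 and pass2 fits from the two-pass sketch above).
import numpy as np
import pandas as pd

fit_stats = pd.DataFrame({
    "pass 1 (all sales)":        [pass1.rsquared_adj, np.sqrt(pass1.mse_resid), pass1.fvalue],
    "pass 2 (outliers removed)": [pass2.rsquared_adj, np.sqrt(pass2.mse_resid), pass2.fvalue],
}, index=["adjusted R-squared", "standard error", "F-statistic"])
print(fit_stats.round(3))
```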

Valuation Grid for Subjects Using Regression Coefficients

The table below estimates the value of four subject properties, one in each town (TOWN-1, TOWN-2, TOWN-3, TOWN-4), using the second-pass regression model’s coefficients. Each property has identical attributes: LAND SF = 25,700, BLDG AGE = 21 years, LIVING SF = 1,972, OTHER SF = 1,478, BATHS = 2.00, STORIES = 1.00, valued as of January 1, 2025 (MONTHS SINCE = 0). The grid shows how coefficients contribute to the predicted price, enabling valuation professionals to explain and justify comparable sales adjustments to a non-technical audience.

The estimated value for each subject property was calculated by summing the Intercept and the product of each variable's Coefficient and the corresponding subject Attribute. The process is as follows:

1.   Start with the Intercept from the regression model.

2.   Add the contribution of each of the subject's attributes by multiplying its attribute value by the coefficient for that variable.

3.   For the TOWN variable, add only the coefficient for the subject's specific town. The reference town (TOWN-4) has no coefficient; its adjustment is simply 0.

4.   Set the MONTHS SINCE variable to 0, since the valuation date is January 1, 2025.

Here's an example of how the calculation was performed for the subject property in TOWN-1:

Calculation for Town-1

Estimated Value = Intercept + Town Adj + Time Adj + Land SF + Bldg Age + Living SF + Other SF + Baths + Stories

Estimated Value = −116,742.96 + 66,368.66 + (0) + (25,700 × 10.56) + (21 × −3,422.43) + (1,972 × 144.40) + (1,478 × 197.33) + (2 × 39,626.86) + (1 × −49,733.96)

Estimated Value = −116,742.96 + 66,368.66 + 271,432.00 − 71,871.03 + 284,724.80 + 291,617.74 + 79,253.72 − 49,733.96

Estimated Value = $755,049

The same process was used for the other towns, with the only difference being the town-specific adjustment coefficient.
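The grid arithmetic above is easy to reproduce programmatically. The sketch below uses the rounded second-pass coefficients quoted in the post; because of that rounding, the total it prints lands within roughly $30 of the published $755,049.

```python
# Reproduce the TOWN-1 valuation grid from the rounded second-pass coefficients
# quoted in the post. Coefficient rounding puts the result within roughly $30
# of the published $755,049.
coefficients = {
    "Intercept":    -116742.96,
    "TOWN-1":         66368.66,
    "MONTHS SINCE":    1544.56,
    "LAND SF":           10.56,
    "BLDG AGE":       -3422.43,
    "LIVING SF":        144.40,
    "OTHER SF":         197.33,
    "BATHS":          39626.86,
    "STORIES":       -49733.96,
}

subject = {                # subject attributes from the valuation grid
    "TOWN-1": 1,           # dummy = 1 for the subject's town, 0 otherwise
    "MONTHS SINCE": 0,     # valued as of January 1, 2025
    "LAND SF": 25700,
    "BLDG AGE": 21,
    "LIVING SF": 1972,
    "OTHER SF": 1478,
    "BATHS": 2.00,
    "STORIES": 1.00,
}

estimate = coefficients["Intercept"] + sum(
    coefficients[name] * value for name, value in subject.items()
)
print(f"Estimated value, TOWN-1 subject: ${estimate:,.0f}")
```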

Sales Ratio Analysis

The final sales ratios (SALES RATIO-2) are a vast improvement and confirm that removing the outliers was the right move. This analysis provides a solid, data-backed foundation for valuation professionals.

The comparison of the sales ratio statistics powerfully demonstrates the positive impact of removing the outliers. Every metric shows a healthier, more reliable dataset and a superior model.

Mean & Median: The mean and median for the final model are both very close to 1, which is the ideal outcome: it indicates the model is accurately predicting sale prices on average, with no systematic bias toward over- or under-prediction. The initial median of 1.0151 was pulled slightly above 1 by the outliers.

Standard Deviation & Variance: The reduction in standard deviation from 0.1839 to 0.1377 and sample variance from 0.0338 to 0.0190 is a key indicator of improved model precision, meaning the predicted prices are much closer to the actual sale prices, and the model's predictions are far more consistent.

Skewness & Kurtosis: This is where the most dramatic improvement is seen.

o  Skewness: The initial skewness of 2.8471 shows a long right tail, driven by sales where the predicted price far exceeded the actual sale price (with a minimum ratio of 0.0840 and a range of 3.4499, the ratios ran as high as roughly 3.5). The final skewness of 0.0864 is very close to zero, indicating the distribution of ratios is now nearly symmetrical.

o  Kurtosis: The initial excess kurtosis of 29.1570 indicates a sharply peaked distribution with very heavy tails, a classic signature of outliers. The final excess kurtosis of -0.2004 is near zero, confirming that the extreme values are gone and the tails now look like those of a normal distribution.

Range: The range of the sales ratios was drastically reduced from 3.4499 to 0.8263, which shows that the most egregious errors in the initial model have been successfully eliminated.
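The ratio statistics discussed above can be recomputed directly from the second-pass predictions. The sketch below continues the hypothetical pass2 fit and cleaned data frame from the earlier sketch; pandas' sample skewness and excess kurtosis correspond to the definitions quoted in this section.

```python
# Descriptive statistics for the second-pass sales ratios (predicted / actual).
# Continues the illustrative pass2 fit, clean frame, and X_cols list from the
# two-pass sketch near the top of the post.
import pandas as pd
import statsmodels.api as sm

X2 = sm.add_constant(clean[X_cols])
ratio2 = pass2.predict(X2) / clean["SALE PRICE"]   # SALES RATIO-2

stats = pd.Series({
    "mean":            ratio2.mean(),
    "median":          ratio2.median(),
    "std deviation":   ratio2.std(),
    "sample variance": ratio2.var(),
    "skewness":        ratio2.skew(),
    "kurtosis":        ratio2.kurt(),   # excess kurtosis, as in the post
    "range":           ratio2.max() - ratio2.min(),
})
print(stats.round(4))
```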


This final model is the result of a rigorous and responsible data analysis process. This approach is perfect for valuation professionals because it's transparent, easy to explain, and produces a highly credible model for justifying valuation adjustments.

Why it's Wise to Keep All Variables until Outliers are Removed in a Two-Pass Regression

In a two-pass regression model, it is unwise to drop an independent variable on the basis of the first pass before outliers, the unusual sales that distort results, have been removed. Outliers, such as non-arm's-length transactions or data errors, can mask a variable's true significance by adding noise. For example, in our 2024 dataset of 1,929 home sales, the MONTHS SINCE variable, which adjusts for sale timing, appeared insignificant in the first pass (p = 0.5380, coefficient = $583.61). However, after removing 53 outliers (2.75%) using Sales Ratios, MONTHS SINCE became significant (p = 0.00804, coefficient = $1,544.56) in the second pass, revealing a meaningful price trend of $1,545 per month that is critical for accurate adjustments. Had we removed MONTHS SINCE prematurely, we would have missed this trend and reduced the model's reliability.

Here's a detailed explanation:

1. Masking True Relationships

Outliers are data points that do not fit the overall pattern of the rest of the data. They can have a disproportionately large influence on the regression line, pulling it in a direction that minimizes their error and causing the model to incorrectly treat a variable as insignificant even when it has a tangible impact on the dependent variable. In our case, the "MONTHS SINCE" variable initially appeared insignificant because the extreme values of the outliers were masking the genuine, though subtle, trend of market appreciation.

2. Avoiding a Biased Model

If an insignificant variable is removed before addressing outliers, we might be throwing away a variable that is actually important. The variable's insignificance may be a symptom of the outlier problem, not a reflection of its genuine relationship with the sale price. By keeping all variables in the first pass, we create a more complete picture of the initial data, which helps in correctly identifying which sales are the true outliers.

3. The Outlier's Impact

Outliers can completely change the slope of the regression line. For instance, a few sales with exceptionally high prices could make the model "think" that prices were rising rapidly at the end of the year, even if most sales were flat. Conversely, a few sales with unusually low prices could make the model "think" the market was flat or declining when it was actually appreciating. It's the presence of these influential data points that distorts the initial results.

4. Correcting the Distortion

The purpose of the two-pass approach is to correct for this distortion. The first pass is a diagnostic tool, not a final model. Its primary job is to generate the residuals needed to find the outliers. The second pass, run on a cleaned dataset, provides the accurate, unbiased results. By keeping all variables in the first pass, we ensure the outlier identification process is as accurate as possible.

By keeping all variables in the first pass, we ensure outliers are identified and removed, allowing the second pass to reveal genuine relationships and produce robust, defensible coefficients for comparable sales analysis.
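To make the point concrete, the short check below (again continuing the hypothetical pass1 and pass2 fits from the earlier sketch) lines up each variable's p-value from the two passes, which is exactly where a masked variable such as MONTHS SINCE shows itself.

```python
# Side-by-side p-values from the two passes (continuing the illustrative pass1
# and pass2 fits). A variable masked by outliers in pass 1, like MONTHS SINCE
# here, only reveals its significance in pass 2.
import pandas as pd

pvalue_check = pd.DataFrame({
    "p-value, pass 1": pass1.pvalues,
    "p-value, pass 2": pass2.pvalues,
}).round(4)
print(pvalue_check)
```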

A Call to LLM Model Trainers

In traditional regression analysis, a common practice is to immediately discard variables that appear statistically insignificant, such as those with high p-values. This "first-pass" culling of variables is often seen as a way to simplify the model and improve its efficiency. However, as our analysis has shown, this conventional wisdom can be misleading and is not advisable.

This finding is a direct challenge to a conventional practice that could be baked into how Large Language Models (LLMs) are trained to perform data analysis. We encourage LLM model trainers to adopt this two-pass regression methodology, or at the very least, to embed a warning in their models. When asked to perform a regression analysis on raw, real-world data, the model should not simply discard variables based on a single pass. It should be trained to recognize the potential for outliers to distort initial results.

LLMs should warn users: "Removing variables with high p-values or low coefficients before outlier removal may overlook significant predictors, as outliers can mask true effects. Consider a two-pass approach to ensure robust results." This adjustment will enhance the reliability of regression-based advice, particularly for applications like real estate valuation, where interpretable models are critical. By adopting or flagging this methodology, LLMs can empower users to build more accurate, defensible models, avoiding the pitfalls of conventional practices.

Conclusion

The two-pass regression methodology is a powerful and practical tool for any valuation professional. By running an initial regression and then meticulously cleaning the data to remove outliers, we have demonstrated a rigorous, defensible process. The resulting model—with its high R-squared, low standard error, and, most importantly, highly significant coefficients—is not just a better predictor of value; it's a testament to the integrity of the analysis.

By first identifying and removing outliers—53 sales (2.75%) in our 2024 dataset of 1,929 homes—we eliminated noise that obscured key patterns, such as a $1,545 monthly price increase. The second pass, using the cleaned 1,876 sales, produced a model explaining 84.9% of price variation, with prediction errors reduced by 39% to $80,403. This process yielded significant, intuitive coefficients, like $197 per square foot for outdoor areas (reflecting premium golf course lots) and $175,752 for TOWN-2 homes compared to TOWN-4, enabling precise adjustments. Sales Ratios, averaging 1.0098 with a standard deviation of 0.1377, confirmed the model's accuracy.

This methodology ensures reliable, defensible valuations that valuation professionals can confidently present to non-technical board members, balancing precision with simplicity. This approach can transform raw sales data into a practical tool for fair and transparent comparable sales analysis.

Disclaimer: The two-step regression model discussed in this blog post may yield different results based on the specific dataset and circumstances of each valuation task. Professionals are encouraged to consider the unique characteristics of each case and exercise discretion in applying this methodology. While the results demonstrated in this blog post showcase the potential benefits of the two-pass regression approach, it is essential to conduct thorough analyses and exercise caution before relying solely on this method for valuation purposes.
