Thursday, May 30, 2024

The Art of Encoding Categorical Variables in an Automated Valuation Model (AVM)

Target Audience: New Graduates/Analysts

Dummy Coding, Effects Coding, and One-Hot Coding

Dummy coding avoids perfect multicollinearity in a regression model, a problem also known as the dummy variable trap. The trap occurs when one dummy variable can be perfectly predicted from the others. For example, with four towns (A, B, C, and D) and dummy variables for Town A, Town B, and Town C, a house with all three set to 0 must be in Town D, so a fourth dummy variable for Town D adds no new information. Including it alongside an intercept makes it impossible to estimate the model coefficients uniquely. Dummy coding handles this by using one fewer dummy variable than the total number of categories, with the omitted category serving as the baseline or reference.

Dummy coding and effects coding both represent categorical data in a regression model with one fewer coded variable than the number of categories, but they differ in how they code the baseline or reference group. In dummy coding, the baseline group is coded as 0 on every variable. In effects coding, the baseline group is coded as -1 on every variable.

How Dummy Coding and Effects Coding Differ

Dummy Coding

In dummy coding, if you have five towns (let's say A, B, C, D, and E), you'd create four dummy variables (Town B, Town C, Town D, and Town E) with Town A as the baseline.

·        A house in Town A would be coded as (0,0,0,0).

·        A house in Town B would be coded as (1,0,0,0).

·        A house in Town C would be coded as (0,1,0,0).

·        A house in Town D would be coded as (0,0,1,0).

·        A house in Town E would be coded as (0,0,0,1).

The coefficient for Town B would represent the difference in the mean of the dependent variable between Town B and Town A. The intercept would represent the mean of the dependent variable for the baseline group, Town A.
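The interpretation above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using made-up sale prices; the names prices_by_town and dummy_row are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical sale prices (in $1,000s), invented for illustration only.
prices_by_town = {
    "A": [200, 220, 210],
    "B": [300, 310, 320],
    "C": [250, 260, 240],
    "D": [400, 410, 390],
    "E": [150, 160, 170],
}
towns = list(prices_by_town)   # ["A", "B", "C", "D", "E"]
non_baseline = towns[1:]       # Town A is the baseline

def dummy_row(town):
    """Return [intercept, B, C, D, E]; Town A is coded as all zeros."""
    return [1.0] + [1.0 if town == t else 0.0 for t in non_baseline]

# One design-matrix row and one response value per sale.
X = np.array([dummy_row(t) for t, ps in prices_by_town.items() for _ in ps])
y = np.array([p for ps in prices_by_town.values() for p in ps], dtype=float)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_b = coef[0], coef[1]

mean_a = np.mean(prices_by_town["A"])
mean_b = np.mean(prices_by_town["B"])
print("intercept:", intercept, "mean of Town A:", mean_a)
print("Town B coefficient:", coef_b, "mean B - mean A:", mean_b - mean_a)
```

Up to floating-point error, the intercept matches the Town A mean and the Town B coefficient matches the gap between the Town B and Town A means.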

Effects Coding

In effects coding, with the same five towns, you would also create four coded variables. However, the coding for the baseline group differs.

·        A house in Town A would be coded as (−1,−1,−1,−1).

·        A house in Town B would be coded as (1,0,0,0).

·        A house in Town C would be coded as (0,1,0,0).

·        A house in Town D would be coded as (0,0,1,0).

·        A house in Town E would be coded as (0,0,0,1).

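In this case, the coefficient for Town B would represent the difference between the mean of Town B and the grand mean, defined here as the unweighted average of the five town means. The intercept would represent that grand mean. When every town has the same number of sales, the grand mean equals the overall mean of the dependent variable for all towns combined. Town A has no coefficient of its own; its deviation from the grand mean is the negative of the sum of the other four coefficients.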
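A small numerical check of the effects-coding interpretation follows. It is a sketch under stated assumptions: NumPy is available, the prices are invented, and the towns deliberately have unequal sample sizes so that the mean of the town means can be distinguished from the pooled mean of all sales. The names prices_by_town and effects_row are hypothetical.

```python
import numpy as np

# Hypothetical sale prices (in $1,000s) with unequal group sizes.
prices_by_town = {
    "A": [200, 220],
    "B": [300, 310, 320],
    "C": [250],
    "D": [400, 410, 390, 400],
    "E": [150, 170],
}
towns = list(prices_by_town)
baseline = "A"
non_baseline = [t for t in towns if t != baseline]

def effects_row(town):
    """Return [intercept, B, C, D, E]; the baseline town is coded as all -1s."""
    if town == baseline:
        return [1.0] + [-1.0] * len(non_baseline)
    return [1.0] + [1.0 if town == t else 0.0 for t in non_baseline]

X = np.array([effects_row(t) for t, ps in prices_by_town.items() for _ in ps])
y = np.array([p for ps in prices_by_town.values() for p in ps], dtype=float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_b = coef[0], coef[1]

town_means = {t: np.mean(ps) for t, ps in prices_by_town.items()}
mean_of_means = np.mean(list(town_means.values()))  # unweighted grand mean
pooled_mean = y.mean()                              # mean over all sales
effect_a = -coef[1:].sum()                          # implied Town A deviation

print("intercept:", intercept, "mean of town means:", mean_of_means)
print("pooled mean of all sales:", pooled_mean)
print("Town B coefficient:", coef_b)
print("implied Town A effect:", effect_a)
```

With these unbalanced data, the intercept tracks the mean of the town means rather than the pooled mean, which is why the balanced-design caveat matters when reading effects-coded output.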

In both dummy coding and effects coding, the reference or baseline group is not directly included in the regression equation as a separate variable.

The reason is to avoid perfect multicollinearity, the "dummy variable trap." If you were to include a variable for all five towns along with an intercept, the fifth variable would be perfectly predictable from the other four. For example, if a house is not in towns A, B, C, or D, it must be in town E. This perfect linear relationship between the variables makes it impossible for the regression model to uniquely estimate the coefficients for each variable.
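One way to see the trap is to check the rank of the design matrix. The sketch below assumes NumPy and a made-up list of town labels; it compares an intercept plus all five town columns against an intercept plus four.

```python
import numpy as np

towns = ["A", "B", "C", "D", "E"]
# Hypothetical town label for each of eight sales.
sample = ["A", "B", "C", "D", "E", "A", "C", "E"]

# One 0/1 column per town.
one_hot = np.array([[1.0 if s == t else 0.0 for t in towns] for s in sample])
intercept = np.ones((len(sample), 1))

full = np.hstack([intercept, one_hot])            # intercept + 5 town columns
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 4 (Town A dropped)

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full:", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

The full matrix has more columns than its rank because the five town columns sum to the intercept column, so no unique set of coefficients exists. The reduced matrix has full column rank.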

How the Baseline is Accounted For

Even though the baseline group isn't explicitly in the equation, its information is implicitly included through the intercept and the interpretation of the other coefficients.

Note: "Effects Coding" and "Effect Coding" are often used interchangeably, and both are generally understood to refer to the same statistical method. However, "Effects Coding" is the more common and technically precise term used in academic literature and statistical software documentation.

The name "Effects Coding" is derived from the fact that it measures the "effect" of each group relative to the overall grand mean of all groups. In a balanced design (equal sample sizes in all groups), the intercept represents this grand mean. The coefficients for the dummy variables represent the deviation of each group's mean from the overall grand mean.

Key Takeaway

·        Effects Coding: This is the most widely accepted and formal name.

·        Effect Coding: While technically less common, it is also frequently used and understood in the same context.

One-hot encoding, on the other hand, creates a binary variable for each category.

To use one-hot encoding for four towns (say A, B, C, and D) in a regression model, you create four separate binary variables, one for each town. Each observation has a 1 in the column corresponding to its town and a 0 in the other three town columns.

This approach introduces perfect multicollinearity if all of the one-hot encoded variables are included in the model along with an intercept, but that can be handled in a few ways:

·        Dropping one of the one-hot encoded variables. This effectively makes the model identical to dummy coding, with the dropped variable serving as the reference category.

·        Omitting the intercept from the model. If you use all four one-hot encoded variables but remove the intercept, the model can still be estimated. In this case, each coefficient represents the average house price for that specific town. This approach is less common in standard regression analysis but is a valid alternative.
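The no-intercept option can be illustrated with a minimal sketch, assuming NumPy and invented prices for four towns, with the town indicators as the only predictors.

```python
import numpy as np

# Hypothetical sale prices (in $1,000s), invented for illustration only.
prices_by_town = {
    "A": [200, 220, 210],
    "B": [300, 310, 320],
    "C": [250, 260, 240],
    "D": [400, 410, 390],
}
towns = list(prices_by_town)
labels = [t for t, ps in prices_by_town.items() for _ in ps]

# All four one-hot columns, and no intercept column.
X = np.array([[1.0 if label == t else 0.0 for t in towns] for label in labels])
y = np.array([p for ps in prices_by_town.values() for p in ps], dtype=float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
town_means = np.array([np.mean(prices_by_town[t]) for t in towns])

for town, c, m in zip(towns, coef, town_means):
    print(f"Town {town}: coefficient {c:.2f}, sample mean {m:.2f}")
```

Each coefficient lines up with the sample mean price of its town, which holds only because no other predictors are in the model.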

With town indicators as the only predictors, omitting the intercept makes each coefficient directly interpretable as the average price in that town. If you also include other variables, such as square footage, each town coefficient instead acts as that town's own intercept: the predicted price for a house in that town when the other variables equal zero. The gaps between town coefficients still measure differences between towns, holding the other variables constant, but the coefficients themselves are less intuitive than differences from a reference category.

Therefore, most regression practitioners prefer to drop one dummy variable and include an intercept.
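The two workable setups describe the same fitted model and differ only in how the coefficients are labeled. The sketch below assumes NumPy and uses invented sales with a square-footage predictor; all data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sales: (town, square feet, price in $1,000s).
sales = [
    ("A", 1000, 205), ("A", 1500, 262), ("A", 2000, 298),
    ("B", 1200, 330), ("B", 1800, 395), ("B", 2200, 451),
    ("C", 900, 240), ("C", 1600, 318), ("C", 2100, 362),
    ("D", 1100, 410), ("D", 1700, 480), ("D", 2400, 545),
]
towns = ["A", "B", "C", "D"]
one_hot = np.array([[1.0 if town == t else 0.0 for t in towns]
                    for town, _, _ in sales])
sqft = np.array([[s] for _, s, _ in sales], dtype=float)
y = np.array([p for _, _, p in sales], dtype=float)
ones = np.ones((len(sales), 1))

# Setup 1: all four town columns, no intercept.
X_no_intercept = np.hstack([one_hot, sqft])
# Setup 2: intercept plus three town columns (Town A is the reference).
X_reference = np.hstack([ones, one_hot[:, 1:], sqft])

coef_no_intercept, *_ = np.linalg.lstsq(X_no_intercept, y, rcond=None)
coef_reference, *_ = np.linalg.lstsq(X_reference, y, rcond=None)

pred_no_intercept = X_no_intercept @ coef_no_intercept
pred_reference = X_reference @ coef_reference

print("same predictions:", np.allclose(pred_no_intercept, pred_reference))
print("Town A, no-intercept setup:", coef_no_intercept[0])
print("intercept, reference setup:", coef_reference[0])
print("Town B, no-intercept setup:", coef_no_intercept[1])
print("intercept + Town B, reference setup:", coef_reference[0] + coef_reference[1])
```

In the reference setup, the Town B coefficient is read directly as the price gap between Town B and Town A at the same square footage. In the no-intercept setup, that gap has to be computed by subtracting one town coefficient from another.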
