Misalkan kita mengamati data dan ingin mencocokkan model regresi untuk . Sayangnya, kadang-kadang diukur dengan kesalahan yang rata-rata bukan nol.
Biarkan menunjukkan apakah diukur dengan kesalahan rata-rata nol klasik atau kesalahan bukan-rata. Kami ingin memperkirakan . Sayangnya, Z umumnya tidak diamati, dan \ mathbf {E} [Y \, | \, X, Z = \ text {bias}] \ neq \ mathbf {E} [Y \, | \, X] . Jika kita cocok dengan regresi Y pada X , kita akan mendapatkan prediksi yang bias.
Misalkan kita secara umum tidak dapat mengamati , tetapi memiliki akses ke model untuk (karena kita secara manual mempelajari pada set pelatihan kecil dan menyesuaikan model klasifikasi dengan sebagai variabel target) . Apakah sesuai dengan regresi pada menggunakan karena bobot regresi menghasilkan estimasi yang tidak bias dari (atau, gagal itu, estimasi yang kurang bias dari yang akan kita dapatkan tanpa menggunakan bobot)? Apakah metode ini digunakan dalam praktik, dan apakah ada nama?
Klarifikasi: tujuannya adalah untuk mencocokkan model yang meminimalkan kesalahan kuadrat rata-rata pada data yang tidak terlihat (data uji) di mana . Prediktor optimal untuk tujuan itu adalah , jadi itulah fungsi yang kami coba perkirakan. Metode untuk memecahkan masalah ini harus diberi peringkat dalam hal seberapa baik mereka mencapai tujuan itu.
Contoh kecil dalam R dengan df$y_is_unbiased
memainkan peran dan df$y_observed
memainkan peran :
library(ggplot2)
library(randomForest)
set.seed(12345)
get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))
## Value of Y if measured correctly
df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon
## Value of Y if measured incorrectly
df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)
## Y is equally likely to be measured correctly or incorrectly
df$y_is_unbiased<- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)
return(df)
}
## True coefficients
constant <- 5
beta <- c(1, 5)
df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))
ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))
## For facet_wrap title
df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)
## Notice that Pr[Y | Z = biased] differs from Pr[Y | Z = unbiased]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)
## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))
## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))
## Also get biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))
## Now image that we "rate" subset of the data (manually check/research whether y was measured with or without bias)
n_rated <- 1000
df_rated <- df[1:n_rated, ]
## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)
model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)
## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)
## Use the model to get Pr[Y is unbiased | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])
## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))
Dalam contoh ini, model adalah hutan acak dengan . Jika model ini sangat akurat, itu akan menghasilkan bobot 1,0 di mana tidak bias, 0,0 di mana bias, dan regresi tertimbang jelas akan tidak bias. Apa yang terjadi ketika model untuk memiliki presisi pengujian dan penarikan yang tidak sempurna (akurasi <100%)? Apakah regresi berbobot dijamin kurang bias dibandingkan dengan regresi tertimbang pada ?formula=y_is_unbiased ~ y_observed + x1 + x2
Contoh yang sedikit lebih rumit di mana bervariasi dengan (sebagai lawan dari contoh sederhana yang saya posting di atas, di mana ):
library(ggplot2)
library(randomForest)
set.seed(12345)
logistic <- function(x) {
return(1 / (1 + exp(-x)))
}
pr_y_is_unbiased <- function(x1, x2) {
## This function returns Pr[ Z = unbiased | X ]
return(logistic(x1 + 2*x2))
}
get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))
## Value of Y if measured correctly
df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon
## Value of Y if measured incorrectly
df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)
## Note: in this example, Pr[ Z = biased | X ] varies with X
## In the first (simpler) example I posted, Pr[ Z = biased | X ] = 1/2 was constant with respect to X
df$y_is_unbiased <- runif(n_obs) < pr_y_is_unbiased(df$x1, df$x2)
df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)
return(df)
}
## True coefficients
constant <- 5
beta <- c(1, 5)
df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))
ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))
## For facet_wrap title
df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)
## Notice that Pr[Y | Z = biased] differs from Pr[Y | Z = unbiased]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)
## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))
## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))
## Also get biased estimates when using y_observed
## Note: the constant is biased downward _and_ the coefficient on x2 is biased upward!
summary(lm(y_observed ~ x1 + x2, data=df))
## Now image that we "rate" subset of the data (manually check/research whether y was measured with or without bias)
n_rated <- 1000
df_rated <- df[1:n_rated, ]
## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)
model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)
## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)
## Use the model to get Pr[Y is unbiased | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])
## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased? If not, is it _less_ biased than the unweighted model?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))
## What happens if we use pr_unbiased as a feature (aka predictor) in the regression, rather than a weight?
## In this case the weighted regression seems to do better, but neither is perfect
## Note: copied from shabbychef's answer
summary(lm(formula = y_observed ~ x1 + x2 + I(1 - pr_unbiased), data = df_unrated))
Dalam contoh ini, regresi tertimbang pada terlihat kurang bias dibandingkan dengan regresi tanpa bobot. Apakah itu benar secara umum? Saya juga mencoba saran shabbychef (lihat jawaban di bawah) pada contoh ini, dan tampaknya lebih buruk daripada regresi berbobot.
Bagi mereka yang lebih suka Python ke R, inilah simulasi kedua dengan Python:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
def logistic(x):
return 1 / (1 + np.exp(-x))
def pr_y_is_unbiased(x1, x2):
# This function returns Pr[ Z = unbiased | X ]
return logistic(x1 + 2*x2)
def get_df(n_obs, constant, beta, sd_epsilon, mismeasurement):
df = pd.DataFrame({
'x1': np.random.normal(size=n_obs),
'x2': np.random.normal(size=n_obs),
'epsilon': np.random.normal(size=n_obs, scale=sd_epsilon),
})
df['y_unbiased'] = constant + np.dot(np.array(df[['x1', 'x2']]), beta) + df['epsilon']
# Note: df['y_biased'].mean() will differ from df['y_unbiased'].mean() if the mismeasurements have a nonzero mean
df['y_biased'] = df['y_unbiased'] + np.random.choice(mismeasurement, size=n_obs)
df['y_is_unbiased'] = np.random.uniform(size=n_obs) < pr_y_is_unbiased(df['x1'], df['x2'])
df['y_observed'] = df.apply(lambda row: row['y_unbiased'] if row['y_is_unbiased'] else row['y_biased'], axis=1)
return df
constant = 5
beta = np.array([1, 5])
print(f'true coefficients:\n constant = {constant}, beta = {beta}')
n_obs = 2000
# Note: the mean of the possible mismeasurements is nonzero (this is the source of the bias)
df = get_df(n_obs=n_obs, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=[-10.0, 5.0])
lr = LinearRegression()
lr.fit(X=df[['x1', 'x2']], y=df['y_observed'])
print(f'estimates from unweighted regression of Y on X ({df.shape[0]} obs):\n constant = {lr.intercept_}, beta = {lr.coef_}')
# Note: pretend that we only observe y_is_unbiased on a "rated" subset of the data
n_rated = n_obs // 2
df_rated = df.iloc[:n_rated].copy()
df_unrated = df.iloc[n_rated:].copy()
rf = RandomForestClassifier(n_estimators=500, max_features=2, oob_score=True)
rf_predictors = ['y_observed', 'x1', 'x2']
rf.fit(X=df_rated[rf_predictors], y=df_rated['y_is_unbiased'])
print(f'random forest classifier OOB accuracy (for predicting whether Y is unbiased): {rf.oob_score_}')
df_unrated['pr_y_is_unbiased'] = rf.predict_proba(df_unrated[rf_predictors])[:, 1]
lr.fit(X=df_unrated[['x1', 'x2']], y=df_unrated['y_observed'], sample_weight=df_unrated['pr_y_is_unbiased'])
print(f'estimates from weighted regression of Y on X ({df_unrated.shape[0]} obs):\n constant = {lr.intercept_}, beta = {lr.coef_}')