Type of Document: Dissertation
Author: Sharma, Dinesh R.
URN: etd-06292006-153249
Title: Logistic Regression, Measures of Explained Variation and the Base Rate Problem
Degree: Doctor of Philosophy
Department: Statistics, Department of
Advisory Committee
Advisor Name, Title
Daniel L. McGee, Sr., Committee Chair
Eric Chicken, Committee Member
Myra Hurt, Committee Member
Xu-Feng Niu, Committee Member
Keywords
- Logistic Regression
- Explained Variation
- Base Rate
- Base Rate Problem
- Coefficient of Determinant
- R^2 Statistics
- Latent Variable
Date of Defense: 2006-06-26
Availability: unrestricted
Abstract
One of the desirable properties of the coefficient of determination (R^2 measure) is that its values for different models should be comparable whether the models differ in one or more predictors, or in the dependent
variable, or whether the models are specified as being different for different subsets of a dataset. This allows
researchers to compare adequacy of models across subgroups of the population or models with different but
related dependent variables. However, the various analogs of the R^2 measure used for logistic regression
analysis are highly sensitive to the base rate (proportion of successes in the sample) and thus do not possess
this property. An R^2 measure that is sensitive to the base rate is not suitable for comparing the same or
different models across different datasets, different subsets of a dataset, or different but related dependent
variables. We evaluated 14 R^2 measures that have been suggested, or might be useful, for measuring the explained
variation in logistic regression models against three criteria: 1) intuitively reasonable interpretability;
2) numerical consistency with the rho^2 of the underlying model; and 3) base rate sensitivity. We carried out
a Monte Carlo simulation study to examine the numerical consistency and the base rate dependency of the various
R^2 measures for logistic regression analysis. We found all of the parametric R^2 measures to be
substantially sensitive to the base rate. The magnitude of the base rate sensitivity of these measures tends to
be further influenced by the rho^2 of the underlying model.
None of the measures considered in our study performs equally well on all three evaluation
criteria. While R^2_L stands out for its intuitively reasonable interpretability as a measure of
explained variation as well as its independence from the base rate, it appears to severely underestimate the
underlying rho^2. We found R^2_CS to be numerically most consistent with the underlying rho^2, with
R^2_N its nearest competitor. In addition, the base rate sensitivity of these two measures appears to be very
close to that of R^2_L, the most base rate invariant parametric R^2 measure. Therefore, we suggest
using R^2_CS and R^2_N for logistic regression modeling, especially when it is reasonable to believe that an
underlying latent variable exists. However, when the latent variable does not exist, comparability with the
underlying rho^2 is not an issue and R^2_L might be a better choice than the other R^2 measures.
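The abstract names R^2_L, R^2_CS and R^2_N without defining them. As a rough illustration only, the sketch below assumes the commonly used definitions: R^2_L as the likelihood-ratio (McFadden-type) index, R^2_CS as the Cox and Snell measure, and R^2_N as the Nagelkerke rescaling of R^2_CS. These definitions, and all function names, are assumptions made for this example and are not taken from the dissertation itself. The last helper shows one way the base rate enters: under these definitions, the largest value R^2_CS can reach depends only on the proportion of successes.

```python
import math


def null_log_likelihood(n, base_rate):
    """Log-likelihood of the intercept-only model for n binary outcomes
    with the given proportion of successes (0 < base_rate < 1)."""
    p = base_rate
    return n * (p * math.log(p) + (1 - p) * math.log(1 - p))


def r2_l(ll_model, ll_null):
    """Assumed definition: R^2_L = 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null


def r2_cs(ll_model, ll_null, n):
    """Assumed definition: R^2_CS = 1 - exp(2 * (LL_null - LL_model) / n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)


def r2_n(ll_model, ll_null, n):
    """Assumed definition: R^2_N = R^2_CS divided by its maximum attainable value."""
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)
    return r2_cs(ll_model, ll_null, n) / max_cs


def r2_cs_upper_bound(base_rate):
    """Value of R^2_CS for a perfectly fitting model (LL_model = 0).
    Under the assumed definition it depends only on the base rate."""
    return r2_cs(0.0, null_log_likelihood(1, base_rate), 1)


if __name__ == "__main__":
    # Hypothetical inputs: 200 observations, 30% successes, and a made-up
    # fitted log-likelihood of -100.0 (not a result from the dissertation).
    n, base_rate, ll_model = 200, 0.3, -100.0
    ll_null = null_log_likelihood(n, base_rate)
    print("R^2_L :", r2_l(ll_model, ll_null))
    print("R^2_CS:", r2_cs(ll_model, ll_null, n))
    print("R^2_N :", r2_n(ll_model, ll_null, n))
    for p in (0.1, 0.3, 0.5):
        print("base rate", p, "-> max R^2_CS", r2_cs_upper_bound(p))
```

Under these assumed definitions, the ceiling on R^2_CS is 0.75 at a base rate of 0.5 and shrinks as the base rate moves toward 0 or 1, which is one simple way to see why such measures can be hard to compare across samples with different base rates.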
Files
Filename: dissertation_drs.pdf
Size: 959.42 Kb
Approximate Download Time (Hours:Minutes:Seconds)
28.8 Modem: 00:04:26
56K Modem: 00:02:17
ISDN (64 Kb): 00:01:59
ISDN (128 Kb): 00:00:59
Higher-speed Access: 00:00:05