Recent developments in multilevel modeling have made it possible to model the relationships between item properties and examinee properties within the multilevel and structural equation modeling framework. This study investigated the performance of the multilevel two-parameter logistic item response model (2PL IRM) for estimating item difficulty and discrimination parameters and for equating a test across grade levels in the presence of differential item functioning (DIF), using both real and simulated data. The real data came from a statewide data set designed for vertical scaling, collected for the Florida Comprehensive Achievement Test in 2001 and covering three adjacent grade levels. In addition, simulated data comparable to large-scale assessment data from two grade levels were analyzed so that the number of DIF items could be controlled as a condition.
The performance of 2PL IRMs that modeled DIF and included an examinee-level variable was compared with that of traditional IRT for developing a vertical scale. The 2PL IRM without any DIF parameter produced the same item difficulty and discrimination parameter estimates, and the same scale scores, as traditional IRT. Including an examinee-level variable (grade level) in the 2PL IRM produced a better vertical scale than 2PL IRT. Modeling nonuniform DIF for some of the anchor items, in addition to the examinee-level variable, resulted in the same scale as the previous model; however, modeling uniform DIF for some of the anchor items distorted the vertical scale.
A small simulation study was designed to investigate the effects of DIF items on vertical equating when some of the anchor items exhibited uniform DIF, nonuniform DIF, or both. Distortion of the scale increased as the number of nonuniform DIF items in the anchor set increased. When items in the anchor set exhibited both types of DIF at the same time, the scale distortion was larger than with either type alone. There was one conflicting result: when only uniform DIF items were present, increasing their number in the anchor set decreased the scale distortion. However, this could have been the result of random error due to the limited simulation size.
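The distinction between uniform and nonuniform DIF can be illustrated with a minimal sketch. It assumes the common 2PL parameterization P(correct) = 1 / (1 + exp(-a(theta - b))), treats uniform DIF as a shift in difficulty b and nonuniform DIF as a shift in discrimination a for a focal group, and uses hypothetical names (`p_correct`, `group_gap`, `dif_b`, `dif_a`) and parameter values. It is not the multilevel model or the estimation procedure used in the study.

```python
import math


def p_correct(theta, a, b, focal=False, dif_b=0.0, dif_a=0.0):
    """2PL probability of a correct response.

    Assumed parameterization: logit = a * (theta - b).
    For the focal group, dif_b shifts difficulty (uniform DIF)
    and dif_a shifts discrimination (nonuniform DIF).
    """
    if focal:
        a = a + dif_a
        b = b + dif_b
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def group_gap(theta, a, b, dif_b=0.0, dif_a=0.0):
    """Reference-group probability minus focal-group probability at one ability level."""
    reference = p_correct(theta, a, b)
    focal = p_correct(theta, a, b, focal=True, dif_b=dif_b, dif_a=dif_a)
    return reference - focal


if __name__ == "__main__":
    # Hypothetical item: a = 1.2, b = 0.0
    for theta in (-2.0, 0.0, 2.0):
        uniform = group_gap(theta, 1.2, 0.0, dif_b=0.5)
        nonuniform = group_gap(theta, 1.2, 0.0, dif_a=0.6)
        print(f"theta={theta:+.1f}  uniform gap={uniform:+.3f}  nonuniform gap={nonuniform:+.3f}")
```

Under these assumptions, the uniform-DIF gap keeps the same sign at every ability level, whereas the nonuniform-DIF gap changes sign at theta = b, so the item favors different groups at different points on the scale.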
The multilevel IRM had one drawback when applied to large-scale assessment data: the computation time needed to complete the calibration process was far beyond what is practical for a comprehensive state testing program. However, the multilevel IRM potentially provides more flexibility for investigating the dimensions that affect examinee success. Directions for future research and limitations are also discussed.