In our spring 2025 administration we ran a Bookmark standard setting on a 42-item CAT reading test, and the provisional B1 cut score drifted by 6 raw points relative to 2023, even though the anchor item parameters looked stable. Are others seeing cohort effects, and how are you handling small-sample equating in programs with fewer than 300 test-takers? I’m debating increasing anchor density and reporting the cut with conditional SEM bands before we lock the standard. Curious what has worked for you.
I’d run SEA to check population invariance, with a bootstrapped NEAT equating. Do the TIFs around B1 align between 2023 and 2025?
We had a similar 6-point drift on a 42-item CAT. What helped was a constrained concurrent calibration where we fixed the linking slope (b1 = 1) and set the intercept (b0) from the 2023 anchor-only mean/SD, then re-ran Bookmark on that locked scale. Drift dropped to about 1 point. If you try this, add a few anchors clustered around B1 only if they’re operational across years; otherwise you’ll just re-center to noise. Are your panelists bookmarking the live CAT pool or a static pool snapshot?
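To make the fixed-slope idea concrete, here is a minimal sketch of the linking arithmetic only, not the concurrent calibration itself. It assumes b1 is the linking slope and b0 the intercept, and that with the slope held at 1 the intercept reduces to the difference in anchor difficulty means (the SD would only matter if the slope were estimated). The function names and anchor values are made up for illustration.

```python
from statistics import mean

def fixed_slope_link(b_anchor_2023, b_anchor_2025, slope=1.0):
    """Return (slope, intercept) that place 2025 values on the 2023 scale.

    Assumption: the slope is held fixed (1.0 by default), so the intercept
    is just the shift that matches the anchor difficulty means.
    """
    if len(b_anchor_2023) != len(b_anchor_2025):
        raise ValueError("anchor lists must pair up item by item")
    intercept = mean(b_anchor_2023) - slope * mean(b_anchor_2025)
    return slope, intercept

def rescale(values, slope, intercept):
    """Apply the linear transformation to difficulties or thetas."""
    return [slope * v + intercept for v in values]

# Hypothetical anchor difficulties for the same items in each year.
b_2023 = [-0.5, 0.0, 0.5]
b_2025 = [-0.7, -0.2, 0.3]

slope, intercept = fixed_slope_link(b_2023, b_2025)
b_2025_on_2023_scale = rescale(b_2025, slope, intercept)
```

Re-running Bookmark would then use item locations expressed on the 2023 scale, so any remaining movement in the cut is not coming from a re-estimated slope.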
Quick check: run a posterior predictive CAT sim on the 42-item pool using the 2023 scale but the 2025 theta mix, and see what raw cut you get at B1. If the expected cut is close to the observed 6-raw-point shift, the drift is probably a cohort or pool effect; if not, revisit the Bookmark placement or the link. For samples under 300, we’ve had better stability with Bayesian Stocking–Lord (priors on a and b from 2023) and adding 2–3 high-a anchors right around B1, @jgarcia204.
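A rough sketch of that quick check, with heavy simplifications: this is a plain Monte Carlo run, not a full posterior predictive CAT simulation. It assumes a 2PL model, treats item parameters as fixed (no posterior draws), and has every simulee answer all 42 items instead of an adaptive subset. The pool parameters, the theta cut, and the cohort means and SDs are all invented placeholders.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc_raw(items, theta):
    """Expected raw score at theta (test characteristic curve)."""
    return sum(p_correct(theta, a, b) for a, b in items)

def simulated_raw_cut(items, theta_cut, theta_mean, theta_sd, n=2000, seed=0):
    """Raw cut implied by a theta cut for a simulated cohort.

    Draws n thetas from a normal cohort distribution, simulates responses,
    and returns the lowest raw score among the top n_pass simulees, where
    n_pass is the number of simulees with theta at or above theta_cut.
    Returns None if nobody reaches the theta cut.
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(theta_mean, theta_sd) for _ in range(n)]
    raws = [sum(rng.random() < p_correct(t, a, b) for a, b in items)
            for t in thetas]
    n_pass = sum(t >= theta_cut for t in thetas)
    if n_pass == 0:
        return None
    return sorted(raws, reverse=True)[n_pass - 1]

# Hypothetical 42-item pool on the 2023 scale: (a, b) pairs.
pool = [(1.0 + 0.01 * i, -2.0 + 0.1 * i) for i in range(42)]
theta_cut = 0.0  # placeholder theta location for B1

cut_2023_mix = simulated_raw_cut(pool, theta_cut, theta_mean=0.0, theta_sd=1.0)
cut_2025_mix = simulated_raw_cut(pool, theta_cut, theta_mean=0.3, theta_sd=1.0)
print("TCC raw score at cut:", round(tcc_raw(pool, theta_cut), 2))
print("raw cut, 2023 mix:", cut_2023_mix, "| 2025 mix:", cut_2025_mix)
```

Comparing the two simulated cuts against the observed shift gives a first read on how much of the drift the cohort mix alone could produce. A real check would swap in the operational item selection rule and draws from the 2023 parameter posteriors.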