Educational measurement practice (item bank development, form assembly, scoring of constructed-response answers, etc.) involves the development and processing of an enormous amount of text. This requires large numbers of people to write, read, evaluate, classify, edit, score, and analyze the text. Not only is this process time-consuming and resource-intensive, but it is also subjective and prone to error. Subject-matter experts must define the construct of the test through some formalized process. Based on the construct, items are written, reviewed, edited, and classified. Beyond the individual items, item banks must also be evaluated to identify content overlap, cuing, or other content features that will lead to dependence among items and reduce construct representation. Newly written items approved for pretesting must then be administered to a representative sample of test takers before their statistical quality can be determined. If the items involve constructed-response answers, they must be scored by trained human raters. Finally, item writing must be conducted on a continuous basis due to security issues, and construct definition must be reevaluated on a regular basis due to changes in practice or standards.
Natural language processing (NLP) can be used to reduce the above-mentioned costs in time, money, and labor. NLP is a collection of methods for indexing, classifying, summarizing, generating, and interpreting texts. Initially, educational measurement made use of these methods in the development of automated essay-scoring engines. Recently, however, NLP methods have been applied to nearly every aspect of test development and psychometrics: item difficulty modeling, text analysis to improve scoring in computerized adaptive testing and multistage testing, search for pairs of mutually excluded items (item enemies), item generation, and item bank referencing.
This report introduces a heuristic for computing semantic similarity between two single-topic texts. The heuristic was tested on 10 datasets prepared by a test developer. Each dataset consisted of 10 Logical Reasoning passages from the Law School Admission Test (LSAT), where passages P1 and P2 were judged by the test developer to be similar, and the other 8 passages were judged to be dissimilar from P1 and P2. For each dataset, the heuristic was used to compute the semantic similarity between P1 and each of the other passages, and its results agreed with the test developer's judgment on 8 of the 10 datasets.
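The evaluation protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report's heuristic is not reproduced here, so a simple bag-of-words cosine similarity stands in for it, and the agreement criterion (P2 must score strictly higher than every other passage when compared with P1) is an assumed reading of "agreement with the test developer." The function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Stand-in similarity (NOT the report's heuristic): cosine of word-count vectors."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agrees_with_developer(passages, similarity=cosine_similarity):
    """Assumed criterion: passages[0] is P1, passages[1] is P2, and the
    dataset counts as an agreement only if P2 scores strictly higher
    against P1 than every remaining passage does."""
    p1, p2, rest = passages[0], passages[1], passages[2:]
    p2_score = similarity(p1, p2)
    return all(p2_score > similarity(p1, other) for other in rest)


def agreement_count(datasets, similarity=cosine_similarity):
    """Number of datasets on which the similarity measure agrees with the developer."""
    return sum(agrees_with_developer(d, similarity) for d in datasets)
```

Any other similarity function with the same two-argument signature can be passed as `similarity`, so the same harness could be used to compare alternative measures on the same datasets.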
The heuristic has several potential applications for the LSAT: (1) semantic-based search for possible enemies in an item pool; (2) Internet search for illegally reproduced cloned items; and (3) improvement of estimates of item difficulty through the addition of semantic features (e.g., semantic similarity between a passage and its key, or between a key and its distractors).
Request the Full Report
To request the full report, please email Linda Reustle at lreustle@LSAC.org.