How to Measure Test Validity and Reliability


Invalid or unreliable methods of assessment can reduce the chances of reaching predetermined academic or curricular goals. Poorly written assessments can even be detrimental to the overall success of a program. It is essential that exam designers use every available resource — specifically data analysis and psychometrics — to ensure the validity of their assessment outcomes.

What Makes a Reliable Assessment?

For an exam or an assessment to be considered reliable, it must exhibit consistent results. Deviations from data patterns and anomalous results or responses could be a sign that specific items on the exam are misleading or unreliable. Here are three types of reliability, according to The Graide Network, that can help determine whether an assessment produces consistent results:

  • Test-Retest Reliability measures “the replicability of results.”
    • Example: A student who takes the same test twice, but at different times, should have similar results each time.
  • Alternate Form Reliability measures “how test scores compare across two similar assessments given in a short time frame.”
    • Example: A student who takes two different versions of the same test should produce similar results each time.
  • Internal Consistency Reliability measures “how the actual content of an assessment works together to evaluate understanding of a concept.”
    • Example: A student who is asked multiple questions that measure the same thing should perform similarly on each of those questions.
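
One common way to put a number on test-retest or alternate form reliability is to correlate the two sets of scores. The sketch below is illustrative only: it assumes a Pearson correlation is an acceptable reliability estimate for the exam in question, and the function name and the student scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five students on two sittings
first_sitting = [72, 85, 90, 64, 78]
second_sitting = [70, 88, 91, 60, 80]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 suggests students were ranked similarly on both occasions. The same function can be applied to scores from two alternate forms of a test.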

Using these three types of reliability measures can help teachers and administrators ensure that their assessments are as consistent and accurate as possible. 

How Can You Improve Test Validity?

The validity of an assessment refers to how accurately or effectively it measures what it was designed to measure, notes the University of Northern Iowa Office of Academic Assessment. If test designers or instructors don’t consider all aspects of assessment creation — beyond the content — the validity of their exams may be compromised. For example, a political science test with exam items composed using complex wording or phrasing could unintentionally shift to an assessment of reading comprehension. Similarly, an art history exam that slips into a pattern of asking questions about the historical period in question without referencing art or artistic movements may not be accurately measuring course objectives. Inadvertent errors such as these can have a devastating effect on the validity of an examination. A case study from The Journal of Competency-Based Education suggests following these best-practice design principles to help preserve exam validity:

  1. Establish the test purpose.

This is the first, and perhaps most important, step in designing an exam. When building an exam, it is important to consider the intended use for the assessment scores. Is the exam supposed to measure content mastery or predict success? “Taking time at the beginning to establish a clear purpose, helps to ensure that goals and priorities are more effectively met.”

  2. Perform a job/task analysis (JTA).

This essential step in exam creation is conducted to accurately determine what job-related attributes an individual should possess before entering a profession. “The JTA contributes to assessment validity by ensuring that the critical aspects of the field become the domains of content that the assessment measures.”

  3. Create the item pool.

Once the intended focus of the exam, as well as the specific knowledge and skills it should assess, has been determined, it’s time to start generating exam items or questions. “Typically, a panel of subject matter experts (SMEs) is assembled to write a set of assessment items. The panel is assigned to write items according to the content areas and cognitive levels specified in the test blueprint.”

  4. Review the exam items.

Once the exam questions have been created, they are reviewed by a team of experts to ensure there are no design flaws. Exam items are checked for grammatical errors, technical flaws, accuracy, and correct keying. “Additionally, items are reviewed for sensitivity and language in order to be appropriate for a diverse student population.”

  5. Conduct the item analysis.

This essential stage of exam-building involves using data and statistical methods, such as psychometrics, to check the validity of an assessment. “If an item is too easy, too difficult, failing to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item analysis will reveal it.”

Improve Exams Overall by Using Data and Psychometrics

One of the most effective ways to improve the quality of an assessment is through the use of psychometrics. ExamSoft defines psychometrics: “Literally meaning mental measurement or analysis, psychometrics are essential statistical measures that provide exam writers and administrators with an industry-standard set of data to validate exam reliability, consistency, and quality.” Here are the psychometrics endorsed by the assessment community for evaluating exam quality:

  • Item Difficulty Index (p-value): Determines the overall difficulty of an exam item, expressed as the proportion of examinees who answered it correctly.
  • Upper Difficulty Index (Upper 27%): Determines how difficult exam items were for the top scorers on a test.
  • Lower Difficulty Index (Lower 27%): Determines how difficult exam items were for the lowest scorers on a test.
  • Discrimination Index: Compares how the upper and lower 27% of examinees performed on an item, showing how well the item separates high scorers from low scorers.
  • Point-Biserial Correlation Coefficient: Measures the correlation between an examinee’s answer on a specific item and their performance on the overall exam.
  • Kuder-Richardson Formula 20 (KR-20): Rates the internal consistency of the overall exam, based on the difficulty of each exam item and the spread of total scores.
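
The statistics listed above can be sketched in a few lines of code. This is a minimal illustration, not ExamSoft's implementation: it assumes every item is scored 0 (incorrect) or 1 (correct), uses the population variance of total scores, computes an uncorrected point-biserial (the item is included in the total), and breaks ties at the 27% cut-off by input order. The function names and the response data are hypothetical.

```python
from math import sqrt

def item_statistics(responses, group_fraction=0.27):
    """Per-item statistics from a 0/1 response matrix (rows = examinees)."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]

    # Rank examinees by total score, highest first, then take the top and
    # bottom groups. Ties at the cut-off are broken by input order.
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    size = max(1, round(n * group_fraction))
    upper, lower = ranked[:size], ranked[-size:]

    mean_total = sum(totals) / n
    sd_total = sqrt(sum((t - mean_total) ** 2 for t in totals) / n)

    stats = []
    for j in range(k):
        item = [row[j] for row in responses]
        n_correct = sum(item)
        p = n_correct / n                                    # item difficulty
        p_upper = sum(responses[i][j] for i in upper) / size
        p_lower = sum(responses[i][j] for i in lower) / size
        if 0 < n_correct < n and sd_total > 0:
            # Mean total score of examinees who answered this item correctly
            mean_correct = sum(t for t, x in zip(totals, item) if x) / n_correct
            r_pb = (mean_correct - mean_total) / sd_total * sqrt(p / (1 - p))
        else:
            r_pb = None  # undefined when everyone (or no one) is correct
        stats.append({
            "difficulty": p,
            "upper_difficulty": p_upper,
            "lower_difficulty": p_lower,
            "discrimination": p_upper - p_lower,
            "point_biserial": r_pb,
        })
    return stats

def kr20(responses):
    """Kuder-Richardson Formula 20 for a 0/1 response matrix."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if k < 2 or var_total == 0:
        raise ValueError("need at least two items and varying total scores")
    # Sum of p * (1 - p) across items
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses: 8 examinees (rows) by 4 items (columns)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

stats = item_statistics(responses)
reliability = kr20(responses)
for number, s in enumerate(stats, start=1):
    print(number, s)
print("KR-20:", round(reliability, 3))
```

With only eight examinees, the upper and lower groups contain two people each, so these numbers are far too unstable for real decisions; the data is there only to show the calculations.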

It is essential to note that psychometric data points are not intended to stand alone as indicators of exam validity. These statistics should be used together for context and in conjunction with the program’s goals for holistic insight into the exam and its questions. When used properly, psychometric data points can help administrators and test designers improve their assessments in the following ways:

  • Identify questions that may be too difficult.
  • Identify questions that may not be difficult enough.
  • Avoid instances of more than one correct answer choice.
  • Eliminate exam items that measure the wrong learning outcomes.
  • Increase reliability (Test-Retest, Alternate Form, and Internal Consistency) across the board.
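
As a rough illustration of how item statistics might feed this kind of review, the sketch below flags items whose difficulty or discrimination falls outside chosen limits. The cut-off values, the function name, and the item data are all assumptions made for the example, not thresholds taken from the sources cited here; each program would set its own limits in line with its goals.

```python
def flag_items(item_stats, too_easy=0.90, too_hard=0.30, weak_discrimination=0.20):
    """Return {item_id: [reasons]} for items that may need review.

    The default thresholds are illustrative assumptions, not standards.
    """
    flagged = {}
    for item_id, s in item_stats.items():
        reasons = []
        if s["difficulty"] > too_easy:
            reasons.append("may not be difficult enough")
        if s["difficulty"] < too_hard:
            reasons.append("may be too difficult")
        if s["discrimination"] < 0:
            # Low scorers outperformed high scorers on this item, which can
            # point to a wrong key or more than one defensible answer.
            reasons.append("negative discrimination: check key and wording")
        elif s["discrimination"] < weak_discrimination:
            reasons.append("weak discrimination")
        if reasons:
            flagged[item_id] = reasons
    return flagged

# Hypothetical item statistics for a four-question exam
item_stats = {
    "Q1": {"difficulty": 0.95, "discrimination": 0.10},
    "Q2": {"difficulty": 0.62, "discrimination": 0.45},
    "Q3": {"difficulty": 0.22, "discrimination": 0.30},
    "Q4": {"difficulty": 0.55, "discrimination": -0.15},
}

flagged = flag_items(item_stats)
for item_id, reasons in flagged.items():
    print(item_id, "->", "; ".join(reasons))
```

A flag is only a prompt for human review. Whether an item is revised, re-keyed, or removed still depends on the learning outcome it was written to measure.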

Using the Most Reliable Assessments Has Benefits for Everyone

Ensuring that exams are both valid and reliable is the most important job of test designers. Psychometric data can make the difference between a flawed examination that requires review and an assessment that provides an accurate picture of whether students have mastered course content and are ready to perform in their careers.


The Graide Network: Importance of Validity and Reliability in Classroom Assessments

The University of Northern Iowa: Exploring Reliability in Academic Assessment

The Journal of Competency-Based Education: Improving the Validity of Objective Assessment in Higher Education: Steps for Building a Best-in-Class Competency-Based Assessment Program

ExamSoft: Exam Quality Through the Use of Psychometric Analysis