Validity and Reliability in Educational Assessment: A Comprehensive Guide
Students’ assessment, a major component of the learning cycle and curriculum, serves many functions: it suggests areas for improvement during training, supports the selection of students based on performance, aids the evaluation of the program, and has predictive utility. To make sound decisions about a student’s learning and competency, the assessment must account for both measurable and nonmeasurable components. Two important attributes defining students’ assessment are reliability and validity. This article aims to provide a comprehensive understanding of these concepts, their interrelation, and their significance in educational assessment.
The Role of Assessment in Education
Assessment plays a crucial role in the learning process. It can be of learning (summative assessment), for learning (formative assessment), or conducted without any external supervision (internal assessment). Van der Vleuten and Schuwirth defined assessment “as any formal or purported action to obtain information about the competence and performance of a student.” Further, assessment can be either criterion-referenced, comparing the competence of students against some fixed criteria, or norm-referenced, comparing the performance of students with each other. Besides aiding learning through the provision of feedback, assessment has a reverse side too: an improperly designed assessment can distort learning.
Understanding Reliability
Conventionally, the reliability of an assessment tool has been described as “reproducibility,” “getting the same scores/marks under the same conditions,” “precision of the measurement,” or “the consistency with which a test measures what it is supposed to assess.” Reliability is measurable: it is the extent to which a measurement tool gives consistent results. A reliable measurement does not have to be right, just consistent. With educational tests, we say that test scores are reliable when they are consistent from one test administration to the next.
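One common way to quantify this consistency is test-retest reliability: correlate scores from two administrations of the same test to the same students. The sketch below uses hypothetical scores and a hand-rolled Pearson correlation; the student data are illustrative, not from the article.

```python
# Test-retest reliability estimated as the Pearson correlation between
# two administrations of the same test. Scores are hypothetical.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

first_sitting = [62, 75, 58, 81, 69, 90, 55]
second_sitting = [60, 78, 55, 83, 71, 88, 57]

r = pearson(first_sitting, second_sitting)
print(f"Test-retest reliability coefficient: {r:.2f}")
```

A coefficient near 1.0 indicates that students are ranked consistently across the two sittings, which is exactly the "same scores under the same conditions" notion described above.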
Factors Affecting Reliability
The major factor affecting reliability is content or domain specificity. An assessment cannot be reliable if it is based on a limited sampling of content, if a large amount of content is compressed into a single sample, or if it rests on a single test. Moreover, a score derived from solving one problem cannot be extrapolated to a second one. For example, assessment scores based on a single long case or a viva on a single patient cannot be expected to generalize to another problem. If, at the end of a professional year, subject knowledge is assessed by a single multiple-choice question (MCQ) test of 10 items, can it measure a student’s knowledge of the whole subject? Such assessments may be valid but not reliable.
Improving Reliability
For any assessment to be reliable, it is important to represent the entire content with adequate sampling. Reliability can also be increased by increasing the testing time, separating the content into multiple tests rather than a single test, and selecting a battery of tests to assess the same competency. Additional strategies include:
- Developing better rubrics in advance, with well-defined scoring categories and clear differences between them.
- Testing rubrics and calculating an interrater reliability coefficient (Interrater reliability = number of agreements/number of possible agreements).
- Conducting norming sessions to help raters use rubrics more consistently.
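The interrater agreement formula given above can be computed directly. The sketch below uses hypothetical rubric categories and ratings from two raters; only the agreements-over-possible-agreements formula comes from the text.

```python
# Percent agreement between two raters, per the formula in the text:
# interrater reliability = number of agreements / number of possible agreements.
# Rubric categories and ratings below are hypothetical.

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters assigned the same category."""
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

rater_a = ["excellent", "good", "fair", "good", "poor", "good"]
rater_b = ["excellent", "good", "good", "good", "poor", "fair"]

print(f"Interrater agreement: {percent_agreement(rater_a, rater_b):.2f}")  # 4/6 -> 0.67
```

Norming sessions aim to push this number upward: raters score the same sample work, compare results, and discuss disagreements until they apply the rubric the same way.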
Delving into Validity
Validity, another important characteristic of good assessment, is usually defined as measuring what the assessment intends to measure. Validity is a unitary concept, and evidence for it can be drawn from many sources, such as content-related, construct-related, and empirical evidence; therefore, the validity of an assessment cannot be represented by a single coefficient the way reliability can. It is the extent to which a measurement tool measures what it is supposed to. More specifically, it refers to the extent to which inferences made from an assessment tool are appropriate, meaningful, and useful. Validity is often thought of as having different forms. Perhaps the most relevant to assessment is content validity: the extent to which the content of the assessment instrument matches the student learning outcomes (SLOs).
Content Validity
For example, if performing the test for urinary blood and proteins is part of 1st-year undergraduate medical training but is not assessed in the skills examination, content-related validity is at stake.
Contemporary Concept of Validity
Of late, we have moved on from the historical concepts of the validity and reliability of assessment tools. For educational purposes, we are no longer interested in the reliability of an assessment tool in itself; the more important aspect is how we use the tool to make the results reliable. Similarly, the contemporary concept of validity focuses on the interpretation we make of assessment data, not on the validity of the assessment tools. It is therefore often said that no assessment tool or method is inherently invalid; what matters is the inference we draw from the assessment made using that tool. For example, MCQs will measure factual knowledge if they are designed to check factual knowledge; however, if a case-based scenario or a management plan for a disease is built into such MCQs, they will assess the problem-solving abilities of the students. Conversely, results will not be valid if, in a theory examination, the steps for eliciting the knee-jerk reflex are asked and the student is then certified as having the skill to perform the knee-jerk reflex.
The Interplay Between Validity and Reliability
The reliability of any assessment, whatever statistical method is used to measure it, should always be interpreted with the validity of the assessment in mind. The contemporary concept treats validity as a unitary concept deduced from various kinds of empirical evidence, including content, criterion, construct, and reliability evidence. Reliability values have no meaning when validity is poor. The utility model also states that a perfect assessment is not possible and that a deficiency in one attribute can be compensated by giving more weight to another, depending on the context and purpose of the assessment. For example, in high-stakes examinations, an assessment with high reliability will be of more value, while for an in-class test given on multiple occasions, educational impact will be the more important criterion. The multiplicative nature of this model also ensures that if any one variable is 0, the overall utility of the assessment automatically becomes 0.
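The weighted, multiplicative character of this utility model can be sketched in a few lines. The attribute names, values, and weights below are hypothetical illustrations, not figures from the article; only the structure (a product of weighted attributes that collapses to zero if any attribute is zero) comes from the text.

```python
# A sketch of a multiplicative utility model for assessment:
# utility = product of attribute_value ** weight over all attributes.
# Attribute values and weights here are hypothetical.

def utility(attributes, weights):
    """Weighted multiplicative utility; zero in any attribute yields zero overall."""
    u = 1.0
    for name, value in attributes.items():
        u *= value ** weights.get(name, 1.0)
    return u

# Hypothetical weighting for a high-stakes exam: reliability and validity
# weighted heavily, educational impact and cost weighted lightly.
high_stakes_weights = {"reliability": 2.0, "validity": 2.0,
                       "educational_impact": 0.5, "acceptability": 1.0,
                       "cost_effectiveness": 0.5}

exam = {"reliability": 0.9, "validity": 0.8, "educational_impact": 0.6,
        "acceptability": 0.7, "cost_effectiveness": 0.5}

print(f"Utility: {utility(exam, high_stakes_weights):.3f}")

# The multiplicative structure: if any attribute drops to zero,
# the overall utility collapses to zero.
exam["validity"] = 0.0
print(f"Utility with zero validity: {utility(exam, high_stakes_weights):.3f}")
```

Changing the weights models the text's point that the balance between attributes depends on the purpose of the assessment: a formative in-class quiz would weight educational impact up and reliability down.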
As stated above, for years we have treated the validity and reliability of assessments as separate measures. With contemporary views and the utility model, an interrelationship has been established, yet we still consider validity and reliability as independently existing entities. However, consider this: if a measure consistently measures the wrong construct, it can be said to be reliable while its validity is at stake; but in practice, if a measure is not measuring what it is intended to measure, can we rely on its measurements? Not at all! Although reliability is apparently present despite the validity issues, we cannot rely on the assessment results to make valid and reliable inferences.
Validity and reliability go hand-in-hand, as demonstrated by a quantifiable link: it has been documented that the maximum attainable validity is approximately the square root of the reliability. For example, if the reliability coefficient of a test is 0.79, the validity coefficient cannot be larger than about 0.89, the square root of 0.79. As stated above, validity is now considered a unitary concept, and reliability evidence is an important part of validity; as such, validity and reliability are interrelated. It has also been documented in the literature that validity and reliability involve a trade-off: the stronger the basis for reliability, the weaker the basis for validity, and vice versa. Still, the two are considered separate concepts; reliability evidence is held to be necessary for the validity of an assessment, but what about the contribution of validity to the reliability of an assessment?
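The square-root bound from the text is a one-liner to check numerically:

```python
import math

# Upper bound on the validity coefficient given a reliability coefficient,
# per the relationship stated in the text: max validity = sqrt(reliability).

def max_validity(reliability):
    """Maximum attainable validity coefficient for a given reliability."""
    return math.sqrt(reliability)

print(f"Reliability 0.79 -> max validity {max_validity(0.79):.2f}")  # -> 0.89
```

The bound makes the dependence concrete: a test with reliability 0.25 can never achieve a validity coefficient above 0.5, no matter how well its content matches the intended construct.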
A Unified Concept
[Figure: Conceptual representation of validity and reliability for students’ assessment]
Let us discuss this further with a simple example. Suppose that, during a pre-examination meeting, all the examiners decide that they will not give more than 80% marks, or less than 50% marks, to any student in the practical examination. If the assessment is then carried out with all valid methods and the results are consistent, can we still rely on such results? As in the example quoted above, if after asking a student to write the steps of the knee-jerk reflex in a written test we certify him as able to elicit the knee-jerk reflex, are we making the right inference? Not at all! To broaden the notion: for students’ assessment, any assessment that is not valid cannot be reliable, and any assessment that is not reliable cannot be valid. This also implies that validity and reliability in students’ assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make “accurate inferences from any assessment with full confidence,” our assessment should be both reliable and valid.
The Bathroom Scale Analogy
Take the widely used example of the bathroom scale. If you repeatedly step on the scale, you get the same reading. You could say the weight measurements are consistent, or reliable, because the scale shows the same weight each time you step on it.
Now, suppose this same bathroom scale is off by 5 pounds. Because the scale is reliable, you still get consistent weight measurements every time you weigh yourself, but the measurements are not accurate because they are off by 5 pounds! In this case, although the recorded weights are reliable, they are not valid measures of how much you weigh.
Conversely, if the scale were calibrated just right, you’d get a weight measurement that is both reliable and valid each time. In order to be valid, a score must be reliable; however, just because a score is reliable does not mean it is valid. Picture a dartboard-style target whose innermost circle represents your true weight. The first target has weight recordings that vary wildly each time we step on our bathroom scale; it is clear the weight measurements are not reliable, and thus not valid. The second target shows reliable but not valid weight measurements that might come from that sneaky scale that is off by 5 pounds. Only the properly calibrated scale will give us weight measurements that are both precise and accurate.
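The biased-scale scenario can be simulated in a few lines. The true weight, bias, and noise values below are made up for illustration; the point is only that a small spread in readings (reliability) coexists with a large constant error (invalidity).

```python
import random

# Simulating the bathroom-scale analogy: a consistent (reliable) scale
# that is miscalibrated (not valid). All numbers are illustrative.

random.seed(0)
TRUE_WEIGHT = 150.0  # pounds

def biased_scale(bias=5.0, noise=0.1):
    """Reliable but miscalibrated: tiny random noise, constant 5-pound bias."""
    return TRUE_WEIGHT + bias + random.uniform(-noise, noise)

readings = [biased_scale() for _ in range(5)]
spread = max(readings) - min(readings)
mean_error = sum(readings) / len(readings) - TRUE_WEIGHT

print(f"Readings vary by only {spread:.2f} lb (reliable),")
print(f"but are off by about {mean_error:.2f} lb on average (not valid).")
```

Swapping the parameters reverses the picture: a large `noise` with zero `bias` gives the unreliable first target, while zero bias and tiny noise gives the well-calibrated scale.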
Practical Considerations in Assessment Design
Reliability and validity are important concepts in assessment; however, the demands for reliability and validity in SLO assessment are not usually as rigorous as in research. While you should take steps to improve the reliability and validity of your assessment, you should not become paralyzed in drawing conclusions from your results, continuously redeveloping your assessment instruments instead of using the results to improve student learning. Measured student learning throughout the program should be relatively stable, should not depend on who conducts the assessment, and should rest on instruments that are parallel in form.

