VALIDITY AND RELIABILITY

SPECIFICATION:

  • Reliability across all methods of investigation. Ways of assessing reliability: test-retest and inter-observer; improving reliability.

  • Types of validity across all methods of investigation: face validity, concurrent validity, ecological validity and temporal validity. Assessment of validity. Improving validity.

  • Reliability and validity underpin everything that we do as psychologists: without them, research would be worthless

  • The principles of validity and reliability are fundamental cornerstones of the scientific method.

  • Together, they are at the core of what is accepted as scientific proof, by scientist and philosopher alike. By following a few basic principles, any experimental design will stand up to rigorous questioning and scepticism.

WHAT IS VALIDITY?

Valid (real, true, genuine, honest).

Validity is one of those concepts that can really tie you up in knots.  The more you think about it the more difficult it can become.  If you know that it simply refers to whether a study measures what it claims to measure you are just about there.

If you read on you will find definitions of some of the other types of validity that are often discussed, but don’t worry if you find it a bit confusing.  Psychologists and textbooks don’t always agree about this concept, and they often use different terms and definitions.

In a nutshell, validity is the extent to which a piece of research actually investigates what the researcher says it does: whether tests measure what they are supposed to measure.

 Validity encompasses the entire experimental concept and establishes whether the results obtained meet all of the requirements of the scientific research method.

WHAT IS EXTERNAL VALIDITY?

Can the results be applied outside of the experiment? Are they valid?

External validity is one of the most difficult of the validity types to achieve, yet it is at the foundation of every good experimental design. Many scientific disciplines, especially the social sciences, face a long battle to prove that their findings represent the wider population in real-world situations. The main criterion of external validity is generalisation: whether results obtained from a small sample group, often in laboratory surroundings, can be extended to make predictions about the entire population. In 1966, Campbell and Stanley proposed the commonly accepted definition of external validity:

“External validity asks the question of generalizability: To what populations, settings, treatment variables and measurement variables can this effect be generalized?”

External validity is usually split into two distinct types, population validity and ecological validity, and they are both essential elements in judging the strength of an experimental design.

In a nutshell… can the findings be generalised beyond the context of the research situation?

  • Population: Can we generalise from the sample to the population as a whole, or to other population groups? Think: does the sample reflect different ages, sexual orientations, disabilities, classes, types of family, religions, cultures, ethnicities, genders, levels of intelligence, etc.?

  • Mundane Realism (Coolican only said it once): Can the results that we’ve obtained in a laboratory setting really tell us how people will behave in real life? Think back to memory experiments, most of which were carried out in laboratories, or to Milgram’s experiment in the labs of Yale University. Would people really behave this way in real life?

  • Tests (IQ/personality measures): If we use the Eysenck Personality Questionnaire (EPQ) and measure a person as very extrovert and slightly neurotic, can we be sure that they are really like this in real life or in social situations?  Similarly, when we measure IQ, is the test we are using telling us anything real about the person?

  • Historical Bias/zeitgeist: Can experiments carried out 40 or 50 years ago, such as Asch’s and Milgram’s, still tell us anything about people today?  It has been mentioned how, for example, conformity changes over time.  Wars, for example, tend to bring populations together and make us more conformist, as was measured following the Falklands Conflict of 1982. Does Mary Ainsworth’s research have relevance for families today? It was done over forty years ago. What is different about families in the seventies?

  • Ecological Validity: The extent to which results can be generalised outside the research setting.

  • In most cases, research psychology has very high population validity, because researchers meticulously select random groups and use large sample sizes, allowing meaningful statistical analysis. However, the artificial nature of research psychology means that ecological validity is usually low.

Assessing and improving external validity

  • Clearly it is useful for a psychologist to have some idea of whether or not tests and or research are valid.  There are a number of ways this can be done:

  • Replication: Replicate research on another population; are the results similar? Recall that the Strange Situation produced different results in Germany and Japan.

  • Meta-analysis: Data can be collected from lots of different studies in different parts of the world to see if the results are similar.  For example, the results of Mary Ainsworth’s Strange Situation in different countries showed a similar pattern, and Bouchard & McGue compared findings for IQ tests between MZ twins and found similar levels of correlation between them all. With meta-analysis you analyse all the research in one area and compare it; the results should be similar, e.g., meta-analysis of family studies and schizophrenia.

  • Concurrent validity: if we are measuring IQ we could compare the scores obtained to school tests in maths and English, or we could compare the results of personality tests with assessments by a person’s friends and family.

  • Predictive: a test should be able to predict later performance, behaviour or personality.  So again, a high score on an IQ test should be able to predict later success at school etc.  In school you sit YELLIS and ALIS tests which are used by teachers as predictors of your future performance.

INTERNAL VALIDITY

Are the procedures valid inside of the experiment?

Psychologists design experiments and research. They must ensure that what they are testing is valid and true. So many things can go wrong with experimental design and make the results invalid. Are techniques used to collect data in tests, questionnaires, interviews and observations measuring what is claimed? We need to find out if our research is sound. Do our tests measure what they claim to measure? Can we trust any effect that has been found to be the result of manipulating our independent variable and not from another unwanted variable?

  • Questionnaires that don’t test what they say they do. For example, does an IQ test measure intelligence or education? If the latter, then IQ tests are invalid.

  • Social desirability bias, experimenter bias, and demand characteristics make participants’ results invalid.

  • Testing participants in different conditions at different times of day. Time of day affects performance.

  • Testing participants in different conditions in different settings (e.g., a hotter room, or a more attractive or comfortable room). Again, this can affect performance, making results invalid.

  • Not randomly allocating participants to conditions

  • Not counterbalancing

Just to leave you with an example of how difficult measuring internal validity can be:

In the experiment where researchers compared a computer program for teaching Greek against traditional methods, there are a number of threats to internal validity.

  • The group with computers feels special, so they try harder (the Hawthorne effect).

  • The group without computers becomes jealous, and tries harder to prove that they should have been given the chance to use the shiny new technology.

  • Alternatively, the group without computers is demoralized and their performance suffers.

  • Parents of the children in the computerless group feel that their children are missing out, and complain that all children should be given the opportunity.

  • The children talk outside school and compare notes, muddying the water.

  • The teachers feel sorry for the children without the program and attempt to compensate, helping the children more than normal.

I am not trying to depress you with these complications, only to illustrate how complex internal validity can be. In fact, perfect internal validity is an unattainable ideal, but any research design must strive towards that perfection. For those of you wondering whether you picked the right course, don’t worry. Designing experiments with good internal validity is a matter of experience, and becomes much easier over time.

For the scientists who think that social sciences are soft – think again!

Assessing and Improving Internal Validity

Content validity

Does the content of a test cover everything in the area of interest? A more rigorous check: experts in the field systematically examine the tool’s components and compare them with set standards. They have to agree that the content is appropriate.

Concurrent Validity

New measure test scores are correlated with those from an established valid test, like comparing a new intelligence test with an existing IQ test.  A high positive correlation between scores on the new and old tests would let us declare the new test valid. What would we do if the correlation were low?
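The check described above is just a correlation between two sets of scores. The sketch below computes a Pearson correlation coefficient by hand; the scores are invented purely for illustration, but the formula itself is the standard one.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores: the same ten people sit an established IQ test
# and the new test we are trying to validate.
established = [98, 104, 110, 92, 121, 87, 105, 115, 99, 108]
new_test    = [95, 107, 112, 90, 118, 85, 103, 117, 101, 106]

r = pearson_r(established, new_test)
print(round(r, 2))  # a value close to +1 would support concurrent validity
```

With these made-up figures the correlation comes out very high; a weak correlation would suggest the new test is measuring something other than the established one.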

Predictive Validity

  • Can an intelligence test at age 3 predict academic performance at 21?

  • Can a diagnosis of a certain mental illness predict recovery?

Temporal Validity (do findings change with the zeitgeist?)

Do our findings endure over time or are they era-dependent?

Face validity

Refers to the extent to which a measure appears on the surface to measure what it is supposed to measure. If we are measuring depression, then a question on whether the person likes cheese would be irrelevant.   Face validity (sometimes called surface validity) is probably the most commonly discussed type of validity.

Criterion validity

A way of assessing validity by comparing the results with another measure.  For example, we could compare the results of an IQ test with school results.   If the other measure is taken at roughly the same time we call this concurrent validity.  If the other measure is compared at a much later time we call this predictive validity.

Construct validity

A way of assessing validity by investigating if the measure really is measuring the theoretical construct it is supposed to be.  For example, many theories of intelligence see intelligence as comprising a number of different skills, and therefore to have construct validity an IQ test would have to test these different skills.

 Other methods for improving internal validity.

  • Standardised instructions and standardised procedures to ensure conditions are the same for all participants in the study.

  • Control extraneous variables.

  • Reduce demand characteristics: participants guessing the aim of the experiment and acting in a non-genuine way. Solution: placebo and single blind (if applicable), and do not reveal the aim or hypothesis, by using presumptive or prior general consent.

  • Try to reduce the Hawthorne effect: a term referring to the tendency of some people to work harder and perform better when they are participants in an experiment. Individuals may change their behaviour due to the attention they are receiving from researchers rather than because of any manipulation of independent variables; they are therefore not acting in a genuine way.

  • Try to reduce social desirability bias: participants wishing to be seen in a positive, desirable light and therefore not acting in a genuine/valid way. Solution: slot box or anonymity.

  • Investigator effects: these result from the effects of a researcher’s behaviour and characteristics on an investigation; e.g., they may be attractive, of a different class, or intimidating, so participants may not act in a genuine/valid way. Try to match the investigator if possible on class, ethnic origin and age, and train them not to act in an intimidating way.

  • Investigator bias: Investigators unintentionally influencing the participant’s behaviour by suggesting which way they want the results to turn out. Solution: double and single blind.

  • Observer bias: Observer using own subjective view. Solution: Operationalise definition

  • Individual differences – solution: Matched pairs

  • Constant errors. Errors that affect all participants. For instance, an experiment to determine which Spaghetti bolognaise is best, mine or my Mum’s. In the first condition all participants have been out drinking the night before and are sick. They rate my Spaghetti Bolognaise badly. It looks like my Mother’s is better when in fact the result is due to a constant error.

  • Random errors: A random error would be if only one student had been out drinking the night before. Usually, if there are enough participants, random errors do not drastically affect the results.

  • Participant bias: Researcher chooses which participants go in which condition. Solution: randomly allocate to conditions.

  • Order effects for participants in repeated measures experimental designs: They either get bored or become expert in the second condition. Solution: counterbalancing

WHAT IS RELIABILITY?

Does a driving test measure your competence to drive on the road or is it a measure of your ability to pass the driving test? Would you be able to pass it again in six months’ time? Would you do better? Is it a reliable and valid test?

 The idea behind reliability is that any significant results must be more than a one-off finding and be repeatable.  Other researchers must be able to perform exactly the same experiment, under the same conditions and generate the same results. This will reinforce the findings and ensure that the wider scientific community will accept the hypothesis. Without this replication of statistically significant results, the experiment and research have not fulfilled all of the requirements of testability.

Reliability is essential for research to be accepted as scientific. For example, if you are performing an experiment where time is a factor, you will be using some type of stopwatch. Generally, it is reasonable to assume that the instruments are reliable and will keep true and accurate time. Psychologists take measurements many times, to minimise the chances of error and maintain validity and reliability.

Any experiment that uses human judgment is always going to come under question.

In a nutshell ……

  • If a measurement is not reliable, then research cannot be valid or ‘true’.

  • We need to be able to measure or observe something time after time and produce the same or similar results

  • For example, you want to measure intelligence in a 16-year-old boy, so you give him an IQ test. If that same boy sits the test on several occasions and the results don’t change each time, then that test has reliability.

  • If you test the same boy several months later and his score remains consistent, you can say the test is reliable. Incidentally, it might still lack validity. The second score might just be measuring what a person has learned since taking the first test.

There are two types of reliability:

Internal reliability

Internal reliability is to do with the reliability inside the experiment or research; in other words, how reliable/consistent the data collection, analysis and interpretation are.  Would an independent researcher, on reanalysing the data, come to the same conclusion? A researcher should make similar observations or carry out interviews in the same way on more than one occasion. Researchers should also collect data in the same way (for example, the same time frame given to complete a task).

  •  If marking an interview or questionnaire then make sure researchers have the same mark scheme.

  • If analysing or interpreting behaviour make sure behaviours are operationalised.

 Improving Internal Reliability

Split-half

Compares a participant’s performance on two halves of a test or questionnaire – there should be a close correlation between scores on both halves of the test. Questions in both halves should be of equal quality for good internal reliability.  
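A minimal sketch of the split-half procedure, using invented questionnaire data: odd- and even-numbered items are totalled separately for each participant, the two half-scores are correlated, and the standard Spearman-Brown formula (not named in the text above, but the usual correction) estimates the reliability of the full-length test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: each row is one participant's scores on the
# eight items of a questionnaire (invented for illustration).
responses = [
    [4, 5, 4, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 1, 2],
    [3, 3, 4, 3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 1, 2, 2],
    [4, 4, 3, 4, 4, 3, 4, 4],
]

# Split each participant's items into odd- and even-numbered halves
# and total each half.
odd_totals  = [sum(row[0::2]) for row in responses]
even_totals = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_totals, even_totals)

# Spearman-Brown correction: estimate the reliability of the
# full-length test from the half-test correlation.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

A high half-test correlation indicates good internal reliability; if the two halves barely correlate, the items are probably not measuring the same thing.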

Improving Researcher Reliability/ Inter-rater reliability

  • This refers to the consistency of a researcher’s behaviour. If observing behaviour, ensure you have inter-rater reliability: observers have to agree on what they see and carry out the same procedure.

There should be a high positive correlation between the scores of different observers. Consistency between different researchers working on the same study is very important for reliability.
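One standard way to quantify agreement between two observers (not named in the text above) is Cohen’s kappa, which corrects raw percentage agreement for the agreement expected by chance. A sketch with invented observation codes:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two observers on categorical
    codes, corrected for the agreement expected by chance alone."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Hypothetical observation record: two observers code the same twelve
# intervals of playground behaviour using operationalised categories.
obs_1 = ["play", "aggression", "play", "idle", "play", "aggression",
         "idle", "play", "play", "idle", "aggression", "play"]
obs_2 = ["play", "aggression", "play", "idle", "play", "play",
         "idle", "play", "play", "idle", "aggression", "play"]

print(round(cohens_kappa(obs_1, obs_2), 2))  # close to +1 = strong agreement
```

Kappa of 1 means perfect agreement and 0 means no better than chance, so a low value would tell the researchers to tighten their operationalised definitions before continuing.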

 External reliability

Extent to which independent researchers can reproduce a study and obtain results similar to those obtained in the original study. Would an independent researcher, on replicating the study, come to the same conclusions?

This measures consistency from one occasion to another – the same result should be found on different days, in different labs, observations or interviews, by different researchers 

Improving external reliability

  • Test retest

  • Participants take the same test on different occasions – a high correlation between test scores indicates the test has good external reliability.

  • Timing is crucial. Why?

  • Replication

  • Repeat experiment or research to see if you get the same results

  • Replication measures the extent to which results are consistent across repetitions of the whole study; consistency within a test itself (i.e., questionnaire items or interview questions all measuring the same thing) is a matter of internal reliability.

 Overall: Good research should:

  •  Increase reliability by standardising instructions

  • Operationalise variables

  • Standardise data collection by well-trained observers

  • Take more than one measurement per participant so an average can be calculated

  • Use pilot studies to check everything works and to improve procedures and materials

  • Check data recording and interpretation

  • Thoroughly train researchers in the use of materials and procedures prior to our study taking place
