Validity and reliability

Validity

To answer your problem you use a method, collect (and possibly analyze) data, and draw conclusions from the data. It is important that you only draw conclusions that are valid, i.e. that are supported by the way you have done your work and by the data you have collected. A critical reader can otherwise argue that there are validity problems in your work and that your results are therefore wrong.

There are many types of validity problems that can occur, but the most important ones are:

Construct Validity

This concerns the interpretation of theoretical constructs. Suppose you have used the word “efficient” when describing an algorithm. Is your interpretation of efficient the same as other people’s? Most likely not, since it is a very broad and vague term.

To reduce problems with construct validity you should, as far as possible, use well-known and widely used terms from the field of study, or explain in detail what you mean by a term. Instead of “efficient” you can say “efficient in terms of execution time and memory usage”.

Internal Validity

This concerns whether the results and conclusions follow from the collected data.

One example is unknown variables that affect the results. Outdoor lighting conditions can affect the accuracy of an app for automatic license plate detection on cars, and a garbage collector in Java that is not controlled can affect the measured execution time of an algorithm.
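The garbage collector example can be illustrated with a minimal sketch. The text names Java, but the sketch uses Python's standard gc module as an analogous case, which is an assumption made only for illustration. The helper name timed_without_gc is hypothetical. Note that gc.disable() only pauses Python's cyclic collector; reference counting still runs, so this reduces one source of variation rather than removing all of them.

```python
import gc
import time

def timed_without_gc(func, *args):
    """Time one call to func with Python's cyclic garbage collector paused.

    Hypothetical helper for illustration; not a complete benchmarking tool.
    """
    was_enabled = gc.isenabled()
    gc.collect()   # collect now, so leftover garbage does not affect the run
    gc.disable()   # pause automatic cyclic collection during the measurement
    try:
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
    finally:
        if was_enabled:
            gc.enable()  # restore the previous setting
    return result, elapsed

result, elapsed = timed_without_gc(sum, range(1000))
print(f"result={result}, elapsed={elapsed:.6f} s")
```

Whether pausing the collector is appropriate depends on the study: if the algorithm will run with garbage collection in practice, it may be more honest to leave it on and report the variation across repeated runs.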

Another example is bias, where the person conducting a study affects the result in some way. Consider a study of employees at a company. The person responsible for the study can be “colored” by their opinions of the company, which can affect how the data is interpreted and what conclusions are drawn. It can sometimes be very difficult to be completely objective.

External Validity

This concerns whether generalizing the results is justified. This is a rather common problem in studies. If we only include students in a study, can we then say that the results and findings apply to professional developers as well? It will be hard to convince a critical reader if we try to make that claim.

Another example is if we evaluate the performance of a framework for mobile development, but only test it on the Android operating system. Then we cannot claim that the results are general and also apply to the iOS and Windows Phone operating systems.

Reliability

Reliability concerns whether others would get the same results as you if they replicated your work, i.e. used the same method, collected data in the same way, and drew conclusions from it as you did.

Reliability problems can occur if you use the wrong method for data collection. One example is if you only take notes instead of recording an interview. The result then depends on your memory, which is probably not very reliable when it comes to remembering all the details.

Another example is if you want to measure the execution time of a new algorithm. If you start your timer too early you might include the time it takes to read a data file or print logging text to the screen, which introduces error in your measurements. You should measure only the execution time of the actual algorithm.
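A minimal sketch of this idea in Python follows. The functions load_data and algorithm are hypothetical stand-ins for reading a data file and for the algorithm under study. The point is only where the timer starts and stops: the data is prepared before the timer starts, and logging happens after it stops.

```python
import time

def load_data(n):
    # Hypothetical stand-in for reading a data file.
    return list(range(n, 0, -1))

def algorithm(data):
    # Hypothetical stand-in for the algorithm under study.
    return sorted(data)

def time_algorithm(data, repeats=5):
    """Return per-run wall-clock times (seconds) for the algorithm only."""
    timings = []
    for _ in range(repeats):
        run_input = list(data)        # fresh copy, made outside the timed region
        start = time.perf_counter()   # start immediately before the call
        algorithm(run_input)
        elapsed = time.perf_counter() - start  # stop before any logging
        timings.append(elapsed)
    return timings

data = load_data(10_000)              # setup is not timed
timings = time_algorithm(data)
print(f"fastest of {len(timings)} runs: {min(timings):.6f} s")  # logging is not timed
```

Repeating the measurement and reporting how the runs were summarized (here, the fastest run) makes it easier for others to replicate the result.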

If you measure something manually, error can be introduced if you are unsure how to handle the measurement instrument, or by your reaction time when stopping a clock.

If you are doing a demanding and exhausting study on participants, the results will most likely differ depending on how hungry or tired the participants are. If you don’t account for this, you introduce error in your data.

To reduce problems with reliability you should use conventional methods of data collection for your type of problem. What tools, methods and techniques have others working on similar problems used?