Theory

Repetition: Relevance

Trivial problems lack scientific value. If an answer is already well established, further study adds little.

Key criteria for relevance:

  • Active research: The topic should be part of an ongoing scholarly conversation.
  • Target audience: There must be a clearly defined Computer Science community for whom the results matter.
  • Knowledge gap: Prior work should reveal unanswered questions or limitations that the current study addresses.

Describe with precision

A question like “Which database is the best?” is ill-posed because it is underspecified and context-free. Its breadth guarantees multiple, incompatible answers.

To make it scientifically or technically meaningful, the problem must be constrained along explicit dimensions:

  • Database class: relational, document, key–value, graph, time-series, etc.
  • Evaluation criteria: performance, consistency guarantees, scalability, cost, ease of use, operational complexity, ecosystem maturity.
  • Use case / scenario: read/write ratios, latency sensitivity, data volume, fault tolerance requirements.
  • Target audience: application developers, data engineers, researchers, system administrators, enterprises vs individuals.
  • Programming interface: supported languages, API quality, ORM/tooling compatibility.
  • Deployment context: on-premises, cloud-managed, edge, single-node vs distributed.
  • Operational concerns: backup/restore, replication, upgrades, monitoring, disaster recovery.

Exclusion principles:

  • Prefer problem limitation (narrowing the domain) over methodological limitation (restricting how the problem is studied).
  • Avoid exclusions that silently bias the outcome.

From a risk management perspective, the goal is to reduce unknowns by making assumptions explicit, narrowing ambiguity, and clearly defining boundaries. Clarity transforms a vague opinion question into an answerable, comparable, and defensible problem statement.

Problematic example

"React is the best framework for all web applications."

Why it lacks scientific credibility:

  • It is overly broad and absolute – “all web applications” covers every possible scenario, user type, and performance requirement.
  • It lacks specified criteria – what does “best” mean? Performance, developer productivity, security, user experience?
  • It ignores context or constraints, such as application type, server environment, or target audience.
  • It is based on opinion or anecdote rather than empirical observation or comparative experimentation.

A credible formulation would narrow the framework, context, metrics for “best,” and the methodology for comparison.

A better example

"Which type of graph-database offers security guarantees (encryption, access control, auditing) and performance for a cloud-based application with high concurrency and sensitive user data?"

Credibility

Science advances by incrementally approximating the truth rather than claiming direct access to it. Human perception and cognition act as filters: we observe selectively, interpret subjectively, and reason under bias. As a result, our understanding of reality is necessarily incomplete.

Two common credibility threats follow:

  • Unobserved factors: What is not measured or visible may be as influential as what is observed, yet it is often ignored.
  • Overinterpretation: Observed effects are frequently given more explanatory weight than warranted, while alternative explanations remain underexplored.

Credibility in science therefore depends on acknowledging uncertainty, making assumptions explicit, and designing methods that actively counter bias rather than pretending it does not exist.

Selective Attention

Selective attention illustrates a core limitation of human perception. In the Selective Attention Test (the “Invisible Gorilla”), observers focusing on a counting task often fail to notice an unexpected but salient event. The point is not inattentiveness as a flaw, but that attention is necessarily selective. What we attend to determines what we believe exists.

The same insight underlies The Invisible Gorilla: we systematically miss information that falls outside our expectations or task framing. This has direct implications for scientific credibility.

Selective attention test (first 40 seconds)

The Invisible Gorilla: https://www.theinvisiblegorilla.com/

Bias

There is a huge number of documented biases; some of the most relevant include:

  • Survivorship bias: conclusions drawn only from cases that remain visible, ignoring failures or missing data. https://www.deanyeong.com/article/survivorship-bias
  • Publication bias: results with statistically significant outcomes are more likely to be published, skewing the literature. https://www.youtube.com/watch?v=42QuXLucH3Q&t=4s&ab_channel=Veritasium (12 min)
  • Confirmation bias: a tendency to seek, interpret, and emphasize evidence that supports existing beliefs.
  • Recall bias: memories are selective and reconstructive, not reliable records of past events.
  • Sample selection bias: systematic under- or over-representation of parts of the population.
  • Observation bias: behavior changes because subjects know they are being observed.

These biases are not exceptions; they are default human tendencies.

This is why methodology is central to scientific credibility. Methodology provides structured, transparent, and repeatable procedures designed to counteract bias:

  • predefined hypotheses and analysis plans,
  • controlled data collection,
  • explicit inclusion and exclusion criteria,
  • replication and robustness checks.

Crucially, this is a collective process. Credibility is built through peer critique, iteration, and convergence over time. Individual studies are fallible; the scientific method is designed so that, across iterations, errors are exposed and understanding improves.

Empirical Evidence

Theories are simplified models of reality. They do not derive their legitimacy from internal coherence alone, but from how well they are grounded in empirical evidence. Empirical work connects abstract ideas to observable phenomena.

Common sources of empirical evidence include:

  • Observations: systematic recording of phenomena as they occur, often used to identify patterns or generate hypotheses.
  • Experiments: controlled interventions designed to test causal relationships by isolating variables.
  • Simulations: computational or mathematical models used to explore system behavior when direct experimentation is impractical or impossible (see the sketch after this list).
  • Interviews and surveys: structured or semi-structured instruments for capturing human experience, perception, and behavior.
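
To make the simulation bullet concrete, here is a minimal sketch in Python of simulation as an empirical method. The scenario and all parameters (a single-server queue, the arrival and service rates) are hypothetical, chosen only to illustrate estimating system behavior that would be impractical to measure directly.

    # Minimal discrete-event sketch of an M/M/1 queue (hypothetical rates).
    # We estimate the mean waiting time empirically over many simulated jobs.
    import random

    def simulate_queue(n_jobs, arrival_rate, service_rate, seed=42):
        """Return the mean waiting time over n_jobs in an M/M/1 queue."""
        rng = random.Random(seed)
        clock = 0.0            # arrival time of the current job
        server_free_at = 0.0   # time at which the server becomes idle
        total_wait = 0.0
        for _ in range(n_jobs):
            clock += rng.expovariate(arrival_rate)   # next arrival
            start = max(clock, server_free_at)       # wait if server is busy
            total_wait += start - clock
            server_free_at = start + rng.expovariate(service_rate)
        return total_wait / n_jobs

    # Example: jobs arrive ~8/s, the server handles ~10/s on average.
    print(f"mean wait: {simulate_queue(100_000, 8.0, 10.0):.3f} s")

For these rates the simulated mean converges on the analytically known value (0.4 s), which is exactly the kind of check that lends a simulation empirical credibility.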

Empirical approaches can be:

  • Quantitative: focusing on measurement, numerical data, and statistical inference.
  • Qualitative: focusing on meaning, context, processes, and interpretation.

Credible research aligns its theoretical claims with appropriate empirical methods, recognizing that different questions require different forms of evidence and that no single method provides a complete view of reality.

Reliability

Reliability refers to the extent to which a measurement or assessment yields the same result when repeated under identical conditions. A reliable instrument minimizes random error; without reliability, validity cannot be established.

Intra-rater reliability - The consistency of measurements made by a single evaluator across multiple occasions.

Key aspects:

  • Same agent, same criteria, same conditions
  • Sensitive to fatigue, learning effects, and subjective judgment
  • Improved through clear guidelines, standardized procedures, and calibration

Inter-rater reliability - The consistency of measurements made by multiple evaluators assessing the same phenomenon.

Key aspects:

  • Requires at least two independent raters; more raters give more robust estimates
  • Disagreement reveals ambiguity in definitions or procedures
  • Improved through shared guidelines, training, and well-defined scoring rubrics

High inter-rater reliability indicates that the measurement protocol is sufficiently precise to be applied independently.

Example

Scenario: Evaluating code readability or maintainability across student projects.

Inter-rater: Three reviewers independently assess the same code using the same rubric; the similarity of their judgements is then measured (see the sketch below).
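
A minimal sketch of how that similarity could be quantified, using Fleiss' kappa; the 1-3 readability rubric and the scores are purely illustrative assumptions:

    # Inter-rater reliability via Fleiss' kappa: how much do three
    # reviewers agree beyond what chance alone would produce?
    from collections import Counter

    def fleiss_kappa(ratings):
        """ratings[i] holds every rater's category for subject i
        (same number of raters per subject)."""
        n_subjects = len(ratings)
        n_raters = len(ratings[0])
        categories = sorted({c for row in ratings for c in row})
        counts = [Counter(row) for row in ratings]  # n_ij per subject

        # Observed agreement per subject, averaged over subjects
        p_bar = sum(
            (sum(c[j] ** 2 for j in categories) - n_raters)
            / (n_raters * (n_raters - 1))
            for c in counts
        ) / n_subjects

        # Expected agreement from the overall category proportions
        p_e = sum(
            (sum(c[j] for c in counts) / (n_subjects * n_raters)) ** 2
            for j in categories
        )
        return (p_bar - p_e) / (1 - p_e)

    # Readability scores (1=poor, 2=fair, 3=good) from three reviewers
    # for five student projects -- illustrative numbers only.
    scores = [[3, 3, 2], [1, 1, 1], [2, 3, 2], [1, 2, 1], [3, 3, 3]]
    print(f"Fleiss' kappa: {fleiss_kappa(scores):.2f}")

Kappa near 1 indicates near-perfect agreement, near 0 indicates agreement no better than chance; low values signal that the rubric itself is ambiguous.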

Validity

Validity concerns whether a study actually measures what it claims to measure and whether the conclusions drawn are justified. While reliability is about consistency, validity is about correctness of interpretation.

Also see: https://youtu.be/lYX-QsBm0nw (20 min)

Construct validity

The degree to which an instrument truly measures the theoretical construct it is intended to represent.

Key questions:

  • Are we measuring the right concept, or a proxy that only partially overlaps?
  • Is the construct clearly defined and theoretically grounded?
  • Can the construct be meaningfully operationalized?

Low construct validity occurs when measurements capture convenience rather than the intended phenomenon.

Internal validity

The extent to which causal claims are justified within the study.

Key concerns:

  • Are observed effects attributable to the manipulated variables?
  • Have confounding variables been controlled or randomized?
  • Is the experimental setup logically and temporally sound?

Threats include confounders, selection effects, maturation, and instrumentation changes.

External validity

The degree to which results can be generalized beyond the study context.

Closely tied to:

  • Sampling: how representative the sample is of the target population
  • Context: whether results hold across environments, time, and conditions

High internal validity does not guarantee high external validity. Strong causal control may reduce realism, while broad sampling may reduce experimental precision.

Validity requires explicit alignment between research questions, theoretical constructs, methods, and claims. Without this alignment, even precise and repeatable results risk being irrelevant or misleading.

Power

Statistical power is the probability that a study will detect a true effect if one actually exists. In other words, it measures the ability to avoid false negatives (Type II errors): power = 1 − β, where β is the probability of a Type II error.

Power is primarily determined by:

  • Effect size: larger effects are easier to detect.
  • Sample size: larger samples reduce random error.
  • Variance/noise: lower variability increases sensitivity.

Low-powered studies risk concluding that “there is no effect” when the study is simply incapable of detecting it. This undermines credibility and contributes to non-replicable findings.

Adequate power aligns experimental design with the expected magnitude of effects and the strength of conclusions being drawn.
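
A minimal sketch of estimating power by simulation, assuming normally distributed outcomes, a true standardized effect of 0.5, and a two-sided test at alpha = 0.05 (all illustrative choices):

    # Monte Carlo power estimate: simulate a two-group experiment many
    # times and count how often a two-sided z-test detects the effect.
    import random
    import statistics

    def estimated_power(n_per_group, effect_size, n_trials=5_000, seed=0):
        rng = random.Random(seed)
        z_crit = 1.96  # two-sided alpha = 0.05
        hits = 0
        for _ in range(n_trials):
            a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
            se = ((statistics.variance(a) + statistics.variance(b))
                  / n_per_group) ** 0.5     # std. error of the difference
            z = (statistics.mean(b) - statistics.mean(a)) / se
            hits += abs(z) > z_crit
        return hits / n_trials

    # Power climbs with sample size for the same effect (d = 0.5):
    for n in (10, 30, 64, 100):
        print(f"n = {n:3d} per group -> power ~ {estimated_power(n, 0.5):.2f}")

At d = 0.5, roughly 64 participants per group are needed to reach the conventional 80% power; with only 10 per group the study will usually miss a real effect.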

Sampling

Sampling refers to the process of selecting a subset of units from a population to make inferences about that population. Proper sampling ensures that the sample accurately reflects the target population, which underpins the credibility of empirical findings.

  • Population: the entire set of units of interest (e.g., all users of a platform, all packets traversing a network).
  • Sample: the subset actually observed or measured.

A good sample should preserve key characteristics of the population so that inferences are valid.
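
A minimal sketch contrasting two basic approaches, with a hypothetical population of platform users in which a small "pro" subgroup matters for the inference:

    # Simple random vs. stratified sampling (hypothetical population).
    import random

    rng = random.Random(1)

    # 900 "free" users and 100 "pro" users.
    population = ([("free", i) for i in range(900)]
                  + [("pro", i) for i in range(100)])

    # Simple random sampling: every unit has equal selection probability,
    # so the small subgroup can be over- or under-represented by chance.
    srs = rng.sample(population, 50)
    print("random sample, pro users:", sum(1 for g, _ in srs if g == "pro"))

    # Stratified sampling: sample each stratum in proportion to its size,
    # guaranteeing the sample preserves the population's composition.
    strata = {"free": [u for u in population if u[0] == "free"],
              "pro":  [u for u in population if u[0] == "pro"]}
    stratified = []
    for name, units in strata.items():
        k = round(50 * len(units) / len(population))  # proportional allocation
        stratified.extend(rng.sample(units, k))
    print("stratified sample, pro users:",
          sum(1 for g, _ in stratified if g == "pro"))

Stratification trades a little simplicity for a guarantee that key characteristics of the population survive into the sample.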

Sampling Methods

https://www.scribbr.com/methodology/sampling-methods/

For the individual sampling methods, see the corresponding subchapters.