(c) Petri Lankoski, v0.5 (an almost complete draft).

This section gives a brief overview of quantitative approaches for design research. As statistics and quantitative research is a huge topic, this section can only scratch the surface. Field, Miles and Field (2012) *Discovering statistics using R* is a comprehensive and approachable introduction to statistics.

Quantitative approaches, as the name suggests, require operationalizing the research question in a form that can be answered by measuring or counting some aspects of the studied phenomenon and making statements about it by building statistical models. A very simple model, for example, is the mean.

In quantitative research data is used to explain and understand something. That, naturally, means that there is something that needs explaining.

In game design, research with quantitative data is used to improve a design: for example, one can have multiple different designs and test which performs best by letting people play the different versions, collecting data about how the versions perform, and selecting the best one. Here, the designers’ goal typically is not to explain why some versions are better than others, but to pick the most suitable version using quantitative data. In design research, by contrast, the focus is on explaining why the versions perform differently.

In quantitative research, researchers start by generating theories from existing knowledge (e.g., from previous data or research results). Based on theory, a testable hypothesis is formulated. As an example, we can take Laramée’s theory that the first-person point of view is optimal for a game because players become their characters, and test it. Based on the theory, we can form a hypothesis that players identify more with their player character, or have a higher sense of presence, if they are never “shown or told anything that the character has not experienced directly”, compared to cases where the player’s information is not tied to that one point of view. We need to collect data to test the hypothesis, and based on the analysis of the collected data, we can revise the theory and hypothesis. If the data support the hypothesis, we can conclude that the evidence supports the hypothesis and, hence, the theory.

# What to measure

In order to conduct a quantitative analysis, we need to figure out what to measure in order to answer our research question, and how to measure it. For example, if we are interested in which level design choices contribute to the difficulty of game levels, we can measure difficulty in multiple different ways:

- *counting* how many times the player fails while completing a level
- *timing* how long it took to complete the level (we get back to this later)
- asking players to evaluate the difficulty using *a scale* (e.g., 1: very easy, 2, 3, 4, 5, 6, 7: very hard).

Difficulty here is an *outcome variable* (or a *dependent variable*), as we are interested in being able to explain the difficulty in terms of other factors. These other factors are called *predictors* (or independent variables). The predictors are assumed (in many cases) not to depend on the values of the other measured variables. In the level difficulty case, predictors can be, for example, the number of enemies or the number of jumps in the level.

As seen above, variables can be of different kinds. *Categorical variables* represent categories. *Binomial variables* can take only two values, such as yes and no. *Nominal variables* are used when different things can be put in the same category based on their sameness in some respect. Gender, for example, is a nominal variable (with possible categories ‘woman’, ‘man’, ‘other’). A feature of a nominal variable is that only equality is defined. *Ordinal variables* can be used with categories that can be ordered (the winner, the second place, the third place). In the example above, difficulty with “1: very easy, 2, 3, 4, 5, 6, 7: very hard” is an ordinal scale: we know that ‘1’ is smaller than ‘2’ or ‘5’ and so on.

*Continuous variables* can take any value on the scale. *Interval variables* are continuous variables where the differences between points are the same (that is, the distance between x_{n} and x_{n+1} is equal to the distance between x_{n+1} and x_{n+2}). *Discrete variables* can take only specific values. Counts are an example of discrete variables, as counts can take only zero or positive integer values. Many statistical tests assume that the variables are interval variables and can have issues with count data if all measurements are near zero. On the other hand, scales such as the difficulty scale above (1: very easy, 2, 3, 4, 5, 6, 7: very hard) can be handled as interval variables in many cases.
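In R, these variable types map onto familiar data types; a small illustration with made-up values (all variable names here are hypothetical):

```r
# Nominal: an unordered factor; only equality is meaningful
gender <- factor(c("woman", "man", "other", "woman"))

# Ordinal: an ordered factor; the levels have a defined order
difficulty <- factor(c(1, 3, 7, 5), levels = 1:7, ordered = TRUE)

# Discrete count: non-negative integers
deaths <- c(0L, 2L, 5L, 1L)

# Continuous: plain numeric values
completion_time <- c(12.4, 30.1, 55.0, 21.7)

difficulty[1] < difficulty[2]  # order comparisons are defined for ordered factors
```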

The accuracy of measurements is important in research. However, measuring something accurately is not straightforward. For example, a meter stick can give slightly different results depending on the temperature (the instrument expands and contracts with temperature), and self-reported measures are influenced by external conditions (e.g., self-evaluation can be very different when the respondent is stressed vs. relaxed for reasons that are unrelated to the study).

When measuring something, the measurements can capture what is intended to various degrees.

- *Validity* is about whether the instrument actually measures what it is intended to measure. In the above example of measuring difficulty, using completion time to measure difficulty has issues: the levels can be of different lengths, and when people get better at playing a game, they are likely to complete levels faster. For these reasons, we have grounds to conclude that completion time does not measure difficulty except in a very limited set of cases.
- *Reliability* refers to how consistently an instrument measures something in different conditions and contexts. Reliability can be estimated by measuring the same thing multiple times. The results should be similar if the measured thing is not expected to vary over time (test–retest reliability). Using multiple measurements for the same thing is a typical way to increase reliability and to enable estimating measurement error. Self-report scales typically use at least three questions to measure the same thing.
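Internal consistency of a multi-question scale is commonly summarized with Cronbach’s alpha. A minimal base-R sketch, using a made-up matrix of three questions answered by six respondents (the helper `cronbach_alpha` is not a standard function):

```r
cronbach_alpha <- function(items) {
  # items: numeric matrix, one row per respondent, one column per question
  k <- ncol(items)
  item_vars <- apply(items, 2, var)     # variance of each question
  total_var <- var(rowSums(items))      # variance of the summed scale
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

items <- cbind(q1 = c(4, 5, 3, 4, 2, 5),
               q2 = c(4, 4, 3, 5, 2, 5),
               q3 = c(5, 5, 2, 4, 3, 4))
cronbach_alpha(items)  # values near 1 indicate high internal consistency
```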

Measurement error: random noise…

# Data collection

All statistical methods assume random sampling from the target population. Random sampling means that from the selected group (population) a smaller group of participants is selected randomly. Also, if the sampled people do not participate, the reason should be random. That means that the actual study does not influence the choice of whether to participate; for example, a study looking at problematic gaming might be avoided by the very people who have a gaming problem. In terms of games, especially when one is studying experimental games, it might not be possible to know what the population to be sampled is. The advantage of random sampling is that the results generalize to the sampled population.

When it is not possible, for some reason, to do random sampling, some form of nonprobability sampling is used. This is the case in most game design research studies. Nonprobability sampling is simple, fast, and cheap (compared to random sampling), as the researcher gathers a group that is available for the study; however, the study is more vulnerable to bias in the sample, and one cannot claim that the results generalize.

Data can be collected using different means and from different sources:

- *Observations:* When gathering data using observations, researchers count events, mark events’ locations or times, or measure completion times. Which events are in focus depends on the research question. Events can be things such as deaths, kills, level completion times, and errors in using controls. See Observations for more details.
- *Surveys:*
- *Data archives:* Researchers collect data, and in some cases they make their data publicly available. These data archives, however, are collected for some specific use, and it might be challenging to use them for something else.
- *Simulations:* When the goal is to understand how a game system behaves, data can be collected by simulating parts of the system using random input. Simulations can be handy for understanding the probabilities and strategies of card games where probabilities and strategies are path-dependent (the events that have already happened influence the probabilities of coming events). For example, Blackjack and Poker have been analyzed using simulations (see Epstein, 2013).
- *Old studies:* Old studies on some topic can be a source of data. In those cases, researchers systematically gather the results of studies on a certain topic.
- *Questionnaires:* In game design research, a study often needs data about the play experience, and questionnaires can be used to gather such data. In order to get reliable data, it is best to use existing scales that have been shown to work, when possible. Different kinds of scales are available for assessing play experience: for example, PENS (Player Experience of Need Satisfaction scale) has a subscale for measuring the sense of presence (REF), and the NASA TLX scale measures workload (REF).
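As a small illustration of the simulation approach, the probability of being dealt a natural 21 in Blackjack can be estimated from random deals in base R (a sketch with a single deck and simplified card values):

```r
set.seed(1)
# One deck: card values 2-10, with J/Q/K counted as 10 and A as 11
deck <- rep(c(2:10, 10, 10, 10, 11), 4)

# Estimate P(two dealt cards sum to 21, i.e., a "natural") by simulation
hits <- replicate(100000, sum(sample(deck, 2)) == 21)
mean(hits)  # close to the exact value 64/663 = 0.0483
```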

# Research designs

## Correlational designs, experimental and quasi-experimental designs

*Correlational designs* can answer questions about how different variables relate to each other, but they are not adequate for making causal claims. With correlational designs, researchers gather data about naturally occurring variables without directly interfering with the variables.

*Experimental designs* are aimed at examining causal connections. Experimental designs are based on hypothesis testing and random assignment of subjects. In addition, manipulation of a condition is essential. A paradigmatic example from medicine is evaluating whether a drug has an effect. One group, the treatment group, gets the drug to be tested, and another group, the control group, gets a placebo. People are randomly assigned to the groups. Administering the drug is called an intervention. After the intervention, the groups are compared statistically to see how the treatment group compares to the placebo group.

*Quasi-experimental designs* are similar to experimental designs, except that there is no random assignment to the groups. Quasi-experiments should typically be used only when true experiments are not possible, because the lack of random assignment raises concerns about the validity of the results.

## Within subjects and between subjects designs

In *between subjects designs*, some participants encounter one experimental condition and some another (and so on). For example, half of the participants play a low challenge level and half play a high challenge level for the study.

In *within subjects designs*, all participants encounter all experimental conditions. Within subjects designs are *repeated measures designs*. To get back to the above example, all participants play both the high challenge level and the low challenge level. However, the order in which participants encounter the experimental conditions is prone to influence the results (e.g., via fatigue or outside factors). To minimize carry-over effects from the previous condition, the order is manipulated so that, in the example case, half would play the high challenge level first and half the low challenge level first. This is called *counterbalancing*.

Counterbalancing two conditions is easy, but with more experimental conditions counterbalancing becomes harder. With three conditions, we already need six different orders (ABC, ACB, BAC, BCA, CAB and CBA) in which the experimental conditions are met to minimize the effect of the order. With four or more conditions, complete counterbalancing is usually not practically possible. Instead, a *balanced Latin square* ordering can be used. With six conditions, a balanced Latin square ordering of the conditions A, B, C, D, E, and F is as follows:

| Subject | Condition order |
| --- | --- |
| p1 | A B F C E D |
| p2 | B C A D F E |
| p3 | C D B E A F |
| p4 | D E C F B A |
| p5 | E F D A C B |
| p6 | F A E B D C |
| ... | ... |

With five conditions, an odd number, a single balanced Latin square does not cover all the cases, and two Latin squares are needed to order the conditions A, B, C, D, E:

| Subject | Condition order |
| --- | --- |
| p1 | A B E C D |
| p2 | B C A D E |
| p3 | C D B E A |
| p4 | D E C A B |
| p5 | E A D B C |
| p6 | D C E B A |
| p7 | E D A C B |
| p8 | A E B D C |
| p9 | B A C E D |
| p10 | C B D A E |
| ... | ... |
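Orderings like the six-condition square above can be generated programmatically. The following base-R sketch uses the standard construction (first row 1, 2, n, 3, n−1, 4, …; each subsequent row shifts every condition by one) and works for an even number of conditions:

```r
balanced_latin_square <- function(n) {
  # First row: 1, 2, n, 3, n-1, 4, ...
  first <- numeric(n)
  first[1] <- 1; first[2] <- 2
  lo <- 3; hi <- n
  for (j in seq(3, n)) {
    if (j %% 2 == 1) { first[j] <- hi; hi <- hi - 1 }
    else             { first[j] <- lo; lo <- lo + 1 }
  }
  # Each subsequent row shifts every condition by one (mod n)
  t(sapply(0:(n - 1), function(i) ((first - 1 + i) %% n) + 1))
}

m <- balanced_latin_square(6)
matrix(LETTERS[m], nrow = 6)  # first row: A B F C E D, as in the table above
```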

## Hypothesis testing and explorative studies

In hypothesis testing, we have a set of hypotheses that is evaluated. The aim is to gather evidence that supports the hypotheses. Hypotheses are statements about variables and those statements are derived from existing theories.

In explorative studies, we do not have hypotheses; instead, we look at the data in order to find patterns. Explorative studies are used to build a preliminary understanding of a topic. Based on the results, it is possible to build a working theory and hypotheses for a hypothesis testing study. Statistical evidence of relations and differences established in explorative studies is weak: when multiple statistical models are fitted to the data, the chance that some of the results are due to random occurrences increases.

# Data analysis

The basis of the analysis is forming a statistical model based on the data. The model can be a simple description of the data, such as the mean and standard deviation. These models describe tendencies of the sample. To go beyond the sample, an assumption about the distribution of the data is needed. For example, if we calculate the mean of a sample and a 95% confidence interval for it, the confidence interval states that if the same study were repeated many times with different samples, 95% of the computed confidence intervals would contain the population mean.

The formula for the confidence interval for a mean is:

mean ± 1.96 * standard error of the mean.

The value 1.96 comes from the fact that 95% of the area of the normal distribution is within 1.96 standard deviations of the mean. The calculation of the standard error of the mean depends on the distribution. Notably, with large samples, the sampling distribution of the mean is approximately normal for many underlying distributions (the central limit theorem).
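As a concrete illustration, a normal-approximation 95% confidence interval for a mean can be computed in base R (the completion times below are made up):

```r
x <- c(32, 45, 38, 51, 40, 36, 44, 48, 35, 41)  # hypothetical completion times (s)

m  <- mean(x)
se <- sd(x) / sqrt(length(x))   # standard error of the mean
ci <- m + c(-1.96, 1.96) * se   # normal-approximation 95% CI

t.test(x)$conf.int              # t-based CI; wider for small samples like this
```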

More often, researchers are interested in whether two (or more) variables are different or related. The *t-test* is an example of a method that can be used to test whether two means are different, the *equivalence test* evaluates whether two means are practically the same, and the *Pearson correlation* is an example of a method used to estimate how two variables are related (see below).

Collected game data, like much real data in other domains, are often far from ideal data that fit nicely the expectations of analysis methods.

The examples below illustrate how the analyses are conducted in R. Refer to the Yau (2017) tutorials for R basics.

## Statistical significance

Statistical tests are based on the idea that by rejecting the *null hypothesis*, the alternative hypothesis is supported. This means that if we want to test whether two means are different, we first assume that they are the same. This is the null hypothesis. Then, with data, we see if we can reject the null hypothesis and thus gain support for the alternative hypothesis, the hypothesis that we want to investigate.

“The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.” (Dahiru, 2008.)

The p-value is used as a criterion to reject the null hypothesis. Note that the criterion, for example p < 0.05, is arbitrary, and in many cases it is not enough to just look at whether the result is statistically significant; one should also look at the effect sizes (more about effect sizes below).

Very often researchers are interested in whether there is a difference between groups, and the null hypothesis then is that there is no difference. For example, the t-test and ANOVA are meant to test the difference between group means, and the null hypothesis is that there is no difference. However, it is also possible to test a hypothesis of equivalence. Then the null hypothesis is that there is a difference between the groups at least as large as a chosen bound; that bound needs to be defined (more about this below in Equivalence testing).

## Sample size, statistical power, and effect size

Statistical power is defined as the probability of rejecting a false null hypothesis. Power mainly depends on the sample size and the effect size. The effect size is a measure of the strength of an effect, such as the d family of effect sizes used to compare differences between groups (e.g., in the t-test), or the r family used to measure association (r, or R², as a measure of the strength of a correlation, where r=1 means that the variable A perfectly explains the variable B and r=0 means there is no relation between the variables).

The sample size required to detect a difference can be calculated a priori using power analysis.

For example, the required sample size for detecting a small difference (d=0.2) between two means can be calculated in R as follows:

```
library(pwr)
pwr.t.test(d=0.2, power=0.8, sig.level=0.05, type="two.sample",
           alternative="two.sided")
```

The sample size required to reach 80% power with a 5% alpha level and d of 0.2 is 394 in each group. It is conventional in many areas to use power = .80, but other values are also used depending on the context.
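The same calculation can also be done without extra packages using base R’s `power.t.test`; with sd = 1, `delta` corresponds to Cohen’s d:

```r
res <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")
res  # n is about 393.4 per group, which rounds up to 394
```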

Let us say we have groups A and B, and A has the higher mean. An effect of d=0.2 means that the distributions of the two groups overlap by 92% and that 58% of group A’s scores are above the mean of group B. With an effect of d=0.8, the overlap of the two groups is 69% and 79% of group A’s scores are above the mean of group B. If d=0, the overlap would be 100%.
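Assuming normal distributions with equal variances, these overlap figures follow directly from the normal distribution function: the overlapping proportion is 2·Φ(−d/2), and the proportion of group A’s scores above group B’s mean is Φ(d). In R:

```r
overlap    <- function(d) 2 * pnorm(-d / 2)  # overlapping proportion of the two distributions
above_mean <- function(d) pnorm(d)           # share of group A's scores above group B's mean

round(overlap(0.2), 2)     # 0.92
round(above_mean(0.2), 2)  # 0.58
round(overlap(0.8), 2)     # 0.69
round(above_mean(0.8), 2)  # 0.79
```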

Cohen (1988) suggested effect sizes for power analysis.

| Effect size measure | Small | Medium | Large | Used with |
| --- | --- | --- | --- | --- |
| d | 0.20 | 0.50 | 0.80 | t-test |
| f | 0.10 | 0.25 | 0.40 | ANOVA |
| w | 0.10 | 0.30 | 0.50 | Chi-squared test |
| r | 0.10 | 0.30 | 0.50 | Pearson's correlation |

## Comparing two means: Welch’s t-test

*Welch’s t-test* is used to compare whether two means are different. The null hypothesis is that the two means are the same.

Assumptions: normally distributed continuous interval variables that are independent. Unlike Student’s t-test, Welch’s t-test does not require equal variances. The t-test is rather robust to normality violations.

Note that a statistically non-significant difference does not mean that one can argue that the means are equal. For that, one needs to test equivalence. Equivalence testing is explained below.

T-test results are often given in the form t(degrees of freedom) = t-value; p = p-value. However, different publications may ask for the results to be given in a different format. The t-test provides a t-value, which is a measure of the difference in relation to the variation in the sample. A t-value near zero means that the means are not different or that the variation in the sample is big.

Degrees of freedom is defined as “the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data” (Anon, n.d.). Degrees of freedom is basically the sample size minus the number of values used to calculate the model; it tells how many observations are available to estimate variability. For example, the t-test (for comparing the means of two groups) fixes two values, the two group means. The rest of the observations (n_{1} + n_{2} − 2) are available to calculate variability: hence, df = n_{1} + n_{2} − 2.

Assumptions:

- The variable is a continuous interval variable
- The two populations sampled have normal distributions
- The groups are independent

### Example

In this example, we are interested in how a verbose HUD and a minimal HUD influence the players’ sense of presence and performance in a death-match game. The sense of presence (or immersion) is evaluated using the PENS instrument (ref) (with 7-point Likert scale questions) that the players fill in after they have completed a game scenario. The performance is tracked by the number of kills. One group of participants played with the verbose HUD and the other with the minimal HUD.

The presence scores were computed as the mean of nine questions meant to measure the sense of presence. Likert scales using 5 or more levels have been shown to perform similarly to interval variables (ref).

Let us assume that the effect size we are looking for is medium (see above) which is d=0.5.

```
> library(pwr)
> pwr.t.test(d=0.5, power=0.8, sig.level=0.05, type="two.sample",
             alternative="two.sided")
```

Power analysis reveals that we need 64 participants in each group to reach 80% power and 0.05 alpha level.

We have 60 participants in each group; however, the data has two missing values in one of the groups, so we have 60 participants in one group and 58 in the other. This set-up is a little underpowered.(*)

The data looks as follows:

```
> head(df.hud)
  HUD   presence kills ...
1  No  0.3333333    36 ...
2 Yes -0.3333333    34 ...
3  No  0.8888889    42 ...
4 Yes -1.0000000    59 ...
5  No  0.7777778    51 ...
6 Yes -0.2222222    50 ...
...
```

The data is in the so-called long format, where the HUD variable tells to which group each presence measurement belongs.

The boxplots below give an overview of the data.

The above figure can be produced as follows:

```
library(ggplot2)
library(gridExtra)
grid.arrange(
  ggplot(df.hud, aes(HUD, presence)) + geom_boxplot(),
  ggplot(df.hud, aes(HUD, kills)) + geom_boxplot()
)
```

Based on the boxplots, we assume that the data is sufficiently normal. The following R code performs t-tests.

```
> t.test(subset(df.hud, HUD=="No")$presence, subset(df.hud, HUD=="Yes")$presence)
> t.test(subset(df.hud, HUD=="No")$kills, subset(df.hud, HUD=="Yes")$kills)
```

The t-test, t(112.42) = 7.874; p < 0.001, indicates that there is a significant difference in presence scores between the verbose HUD and minimal HUD groups. This means that those who played with the minimal HUD reported overall higher presence scores than those who played with the verbose HUD. When comparing kills, there is no significant difference between the groups, t(117.75) = 0.119; p = 0.905.

(P values lower than 0.001 should be reported as p < 0.001, as a p value cannot be zero. Higher p values are reported as p = 0.xxx.)
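Alongside the test statistic and p value, it is good practice to report an effect size. Cohen’s d with a pooled standard deviation can be computed in base R; the helper below is a sketch (not part of the `t.test` output), shown here on two made-up score vectors:

```r
cohens_d <- function(a, b) {
  n1 <- length(a); n2 <- length(b)
  # Pooled standard deviation of the two groups
  sp <- sqrt(((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2))
  (mean(a) - mean(b)) / sp
}

d <- cohens_d(c(5, 6, 7, 5, 6), c(4, 5, 5, 4, 6))  # hypothetical group scores
d
```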

## Comparing two ordinal variables: Wilcoxon rank-sum test

The null hypothesis of the Wilcoxon rank-sum test is that the distributions of both populations are the same.

Assumptions:

- Ordinal variable
- Independent observations

### Example

```
> wilcox.test(subset(df.hud, HUD=="No")$kills, subset(df.hud, HUD=="Yes")$kills)
```

We cannot reject the null hypothesis (W = 451.5, p = 0.9882), so there is no statistically significant difference between the groups.

## Two sample equivalence test

With a two sample *equivalence test*, it is possible to test whether two means are equivalent. For that, one needs to define what equivalent means.

### Example

Let us return to the above HUD comparison example. The t-test results indicated that there is no significant difference in kills between the verbose HUD and minimal HUD groups, but are those groups equivalent?

We can determine the required sample size using power analysis. We define equivalence as d within [−0.4, 0.4].

```
library(TOSTER)
powerTOSTtwo(alpha=0.05, statistical_power=0.8,
             low_eqbound_d=-0.4, high_eqbound_d=0.4)
```

This gives required sample as 216 in total or 108 in each group.

In R, equivalence tests can be conducted using, for example, the *TOSTtwo* function from the TOSTER package. As *TOSTtwo* requires means and standard deviations as parameters, we use a wrapper that takes the variables as vectors and calculates the means, standard deviations, and ns.

```
tost.two = function(v1, v2, epsilon1, epsilon2 = -epsilon1) {
  m1 = mean(v1, na.rm = TRUE)
  sd1 = sd(v1, na.rm = TRUE)
  m2 = mean(v2, na.rm = TRUE)
  sd2 = sd(v2, na.rm = TRUE)
  n1 = length(na.omit(v1))
  n2 = length(na.omit(v2))
  res = TOSTtwo(m1=m1, m2=m2, sd1=sd1, sd2=sd2, n1=n1, n2=n2,
                low_eqbound_d=epsilon2, high_eqbound_d=epsilon1)
  return(res)
}
```

(Note that sd2 is calculated from v2; passing sd1 for both groups would be a bug.)

After this, we can pass vectors from data frames as parameters to tost.two to conduct the analyses with the following call.

```
> summary(tost.two(subset(df.hud, HUD=="No")$kills,
                   subset(df.hud, HUD=="Yes")$kills, 0.4))
```

Here we provide the two groups to be compared, just like in the t-test example above. We then define that the groups are equivalent if the difference of the means is between d=−0.4 and d=0.4.

The test (p1=0.0538; p2=0.074) is not significant, so we cannot conclude that the means are equivalent.

## Correlation

Correlation is used to determine the strength of a relationship between two variables. The figure below shows four correlations of different strengths. A positive correlation means that when the value of the variable V1 increases, the value of the variable V2 also increases. A negative correlation means that when the value of the variable V1 increases, the value of the variable V2 decreases.

All the correlation tests below assume independent observations.

**Pearson correlation:** Assumptions: normally distributed continuous interval variables. The Pearson correlation is rather robust to normality violations.

**Kendall's tau rank correlation:** ordinal variables

**Spearman’s rho** is used to calculate a correlation between ranks of two variables.
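The three coefficients can be compared on the same toy data with base R’s `cor` and `cor.test`:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)  # mostly increasing with x, adjacent pairs swapped

cor(x, y, method = "pearson")
cor(x, y, method = "kendall")   # based on concordant vs. discordant pairs
cor(x, y, method = "spearman")  # Pearson correlation of the ranks

cor.test(x, y, method = "kendall")  # adds a p value for H0: tau = 0
```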

### Example

We can assume that players’ experience of playing shooter games influences their performance. One way to estimate whether this is the case is to test the correlation between the number of kills and players’ prior experience of playing shooter games. We use the Kendall correlation here, as experience is an ordinal variable (with levels “Less than once a month”, “A few times every month”, “A few times every week”) and kills (which is a count) can be treated as ordinal.

However, determining the sample size is not as straightforward for Kendall’s tau as it is for Pearson’s r (for Pearson’s r, we could get it with the command pwr.r.test(r=0.3, power=0.80, sig.level=0.05, alternative="two.sided"), which gives a minimum sample size of 85).

```
cor.test(formula = ~ as.numeric(Experience) + kills, data=df.hud,
         method="kendall")
```

There is a statistically significant correlation (tau=0.321; p=0.002) between experience and kills. The positive correlation indicates that when experience increases, the number of kills the player makes also increases.

## Comparing more than two means: One-way ANOVA

One-way ANOVA is used to compare three or more means. The null hypothesis is that all groups are sampled from populations having the same mean value. A significant result means that the null hypothesis can be rejected and at least one of the compared groups has a significantly different mean. One can test the group-by-group differences with post hoc comparisons.

Assumptions of one-way ANOVA:

- The response variable is normally distributed (although ANOVA is rather robust to non-normality; cf., e.g., Schmider et al., 2010)
- Equal variance across the groups
- The groups compared are independent

### Example

In this example, we compare different targeting systems to manual targeting in a fictive game. We use the number of kills as a proxy to compare the targeting systems. The data looks as follows:

```
  kills targeting
1    16         A
2    19         A
3    18         A
4    13         A
5    16         A
6    13         A
...
```

The boxplots below illustrate the data by targeting system.

The sample size required to reach 80% power, assuming medium-sized effects with four groups, can be calculated with a power analysis:

```
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.8)
```

The required n in each group is 45.

As ANOVA assumes equal variances and normally distributed variables in each group, we visually inspect from the boxplots above that the variances are roughly equal and that each group is roughly normally distributed. In addition, we can check whether the variances are unequal using Levene’s test (in the car package) and test normality with shapiro.test, as follows:

```
library(car)
leveneTest(kills~targeting, data = df.targeting)
```

Levene’s test comes out non-significant (p=.2838), so there is no reason to assume unequal variances between the groups. We could test normality using the Shapiro–Wilk test, but instead we just inspect the box plots or histograms, because ANOVA has been shown to be rather robust to normality violations (e.g., Schmider et al., 2010).

We can get the ANOVA results as follows:

```
> fit = aov(kills~targeting, data = df.targeting)
> summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)
targeting    3   4005  1334.9    83.8
...
```

The results tell that at least one group mean of kills is significantly different from some other. After this, we could do pair-wise comparisons using t-tests. Instead, we use the multcomp package to calculate the group-wise comparisons.

```
> library(multcomp)
> summary(glht(fit, linfct = mcp(targeting = "Tukey")))
Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
B - A == 0   1.0000     0.4609   2.170    0.133
C - A == 0  -5.7133     0.4609 -12.397
...
```

From the results, we see that the differences between A and B and between A and D are not statistically significant. The other differences are (p<0.001). The calculated p-values are adjusted for multiple comparisons.

## Comparing more than two ranks: Kruskal–Wallis one-way analysis of variance

The Kruskal–Wallis test is a non-parametric, rank-based alternative to one-way ANOVA for comparing three or more groups. The null hypothesis is that all groups come from the same distribution.

```
> kruskal.test(kills~targeting, data = df.targeting)

	Kruskal-Wallis rank sum test

data:  kills by targeting
Kruskal-Wallis chi-squared = 168.13, df = 3, p-value < 2.2e-16
```

As the p value is less than 0.05, we can conclude that the targeting populations are not identical. To get more details, we can do pair-wise testing to see which populations are different. The parameter dist="Chisquare" corrects for ties in the rank comparison.

```
> library(PMCMR)
> posthoc.kruskal.nemenyi.test(x=df.targeting$kills, g=df.targeting$targeting,
                               dist="Chisquare")
  A       B       C
B 0.3169  -       -
C < 2e-16 < 2e-16 -
D 0.4730  0.0074  1.2e-15
```

The pair-wise comparison leads to the same results as the ANOVA: the differences between A and B and between A and D are not statistically significant. The Nemenyi test corrects for multiple comparisons.
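If the PMCMR package is not available, base R’s `pairwise.wilcox.test` performs a similar pair-wise comparison with p-value adjustment (Holm by default). A sketch on simulated stand-in data (not the df.targeting data used above):

```r
set.seed(42)
# Simulated stand-in for four targeting groups, 20 players each;
# group C is clearly worse than the others
sim <- data.frame(
  kills = c(rnorm(20, 16, 2), rnorm(20, 17, 2), rnorm(20, 10, 2), rnorm(20, 17, 2)),
  targeting = rep(c("A", "B", "C", "D"), each = 20)
)

res <- pairwise.wilcox.test(sim$kills, sim$targeting, p.adjust.method = "holm")
res$p.value  # matrix of adjusted pairwise p values
```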

## Longitudinal data

XXX

## Within subjects designs

XXX

# References

Anon, n.d., Degrees of Freedom. *Statistics and Probability Dictionary*. Available at: http://stattrek.com/statistics/dictionary.aspx?definition=Degrees%20of%20freedom.

Dahiru, T. 2008. P-value, a true test of statistical significance? A cautionary note. *Ann Ib Postgrad Med,* 6(1), pp.21–26.

Epstein, R.A. 2013. *Gambling and Statistical Logic,* 2nd ed. Academic Press.

Field, A., Miles, J. and Field, Z. 2012. *Discovering statistics using R*. Sage.

Schmider, E. et al. 2010. Is It Really Robust? Reinvestigating the Robustness of ANOVA Against Violations of the Normal Distribution Assumption. *Methodology*, 6(4). DOI=10.1027/1614-2241/a000016.

Yau, C. 2017. *R Introduction*. Available at: http://www.r-tutor.com/r-introduction.