Testing and Evaluation
F27ID Introduction to Interactive Design
2021-2022
## Overview

* Goals and benefits of **Testing and Evaluating** Designs
* Different types of **user-based** testing
* **Qualitative and Quantitative** Data
* Details on experimental research in HCI
* Questions and Discussion
## Revision Question

Usability and User Experience (UX) are the same:

* a) True
* b) False
## Answer

Answer: b) False
## Revision Question

User Experience (UX) focuses on:

* a) Satisfaction, Enjoyment, Pleasure, Fun, Value
* b) Effectiveness, Efficiency, Learnability, Error Prevention, Memorability
## Answer

Answer: a) Satisfaction, Enjoyment, Pleasure, Fun, Value
## **Why** is Testing and Evaluation **Important**?
> Take a moment to think about what testing and evaluation is!
### Why is Testing and Evaluation **Important**?

* Testing a prototype / developed design is **very important**
* Testing and evaluation simply **confirms** that the product **works** as it is **supposed to**, or shows that it needs refinement
* Testing assesses the viability of a **design or idea**
* Discovers **defects/bugs/problems** before delivery to the client
* Helps assure the **quality** of the design/product
## The Goals of **Testing**

#### Discover errors and areas of improvement in:

* **Performance** -- How much time, and how many steps, are required for people to complete basic tasks? (For example, find something to buy, create a new account, and order the item.)
* **Accuracy** -- How many mistakes did people make? (And were they fatal, or recoverable with the right information?)
* **Recall** -- How much does the person remember afterwards, or after periods of non-use?
* **Emotional response** -- How does the person feel about the tasks completed? Is the person confident, stressed? Would the user recommend this system to a friend?
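As a concrete illustration, here is a minimal Python sketch of how performance and accuracy measures could be logged during a test session. The `TaskLog` class and its fields are hypothetical, for illustration only, not part of any particular testing tool.

```python
import time

class TaskLog:
    """Hypothetical helper for logging performance and accuracy
    measures while a participant works through one task."""

    def __init__(self, task_name):
        self.task_name = task_name
        self.steps = 0
        self.errors = []          # (description, fatal) pairs
        self.start = time.monotonic()

    def step(self):
        self.steps += 1

    def error(self, description, fatal=False):
        self.errors.append((description, fatal))

    def finish(self):
        elapsed = time.monotonic() - self.start
        return {
            "task": self.task_name,
            "seconds": round(elapsed, 1),
            "steps": self.steps,
            "errors": len(self.errors),
            "fatal": any(fatal for _, fatal in self.errors),
        }

# Example: one participant ordering an item
log = TaskLog("order item")
log.step(); log.step()
log.error("clicked wrong menu", fatal=False)
print(log.finish())
```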
## Goals of **Evaluation**

* Assess **extent** of system functionality
* Assess **effect** of interface on user
* Identify specific **problems**
* Designers need to **check** whether their ideas are really what users **need/want**, and whether the final product works as expected.
* To do that, we need some form of methods, or more specifically, **empirical methods** for HCI
#### Remember

## What is Usability?
## Usability

* Usability refers to the **extent** to which a product **can be used** by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use (ISO 9241-11)
* Usability **measures** the quality of a **user's experience** when interacting with a product or system
  - Ease of learning
  - Efficiency of use
  - Memorability
  - Error frequency and severity
  - Subjective satisfaction
## User-based **evaluation**

* Considered to yield the most **reliable** and valid estimate of an **application's usability**
* In a typical user-based evaluation, test subjects are asked to perform a set of tasks with the technology.
* Depending on the primary focus of the evaluator, the users' success at completing the tasks and their speed of performance may be recorded.
* A large sample of users would be good, but 3 (**Lewis**) or 5 (**Nielsen**) are often enough to uncover the majority of problems.
* **Nielsen**: "once it is found that two or three people are totally confused by the home page, little is gained by watching more people suffer through the same flawed design"
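Nielsen's 3-5 user guideline comes from the Nielsen and Landauer (1993) model: the proportion of problems found by n users is 1 - (1 - L)^n, where L is the probability that a single user exposes a given problem (about 0.31 on average across their projects). A quick sketch of the curve:

```python
# Proportion of usability problems found by n test users,
# following Nielsen & Landauer's model: found(n) = 1 - (1 - L)**n.
# L ~= 0.31 was the average across the projects they studied.
L = 0.31
for n in (1, 3, 5, 10, 15):
    found = 1 - (1 - L) ** n
    print(f"{n:2d} users -> {found:.0%} of problems found")
```

With L = 0.31, three users already expose roughly two thirds of the problems and five users around 85%, which is why additional participants mostly "suffer through the same flawed design".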
## Certain **Categories** of evaluation

#### **User-based** evaluations: where a sample of the intended users try to use the application

* **Lab studies** (experimental research, usability testing), field studies
* **Observation Methods** (think aloud, video analysis)
* **Questionnaires**, Interviews, Focus Groups
* Some are Quantitative, others are Qualitative
### Method **Selection**

The evaluation **approach influences** the methods used, and in turn, how data is collected, analyzed and presented

* E.g., **Field studies typically**:
  - Involve observation and interviews.
  - Do not involve controlled tests in a laboratory.
  - Produce qualitative data.
* E.g., **Usability testing typically**:
  - Involves users performing set tasks, plus interviews and questionnaires.
  - Is conducted in the laboratory or in a natural setting.
  - Has the primary goal of testing how usable the interface is with the intended populations.
## Revision Question

For user-based evaluations, what is a good enough number of test users to uncover the majority of problems?

* a) 1-2
* b) 3-5
* c) 50-99
* d) 100+
## Answer

Answer: b) 3-5

---

* A large sample of users would be good, but 3 (**Lewis**) or 5 (**Nielsen**) are often **enough** to uncover the majority of problems.
* **Nielsen**: "once it is found that two or three people are totally confused by the home page, little is gained by watching more people suffer through the same flawed design"
### Should you use **Quantitative** or **Qualitative** approaches?
### **Quantitative** or **Qualitative**?

* Data is either Qualitative or Quantitative
  * e.g. interview-based data is Qualitative
  * Questionnaires can be Quantitative (ratings-based) or Qualitative (short answer questions)
  * Other forms of Quantitative data include time taken to do a task, and performance
* Selection **depends** on:
  * Overall research or evaluation goal
  * Ethical and practical issues
  * Resources, cost, logistics
  * Availability of participants and sample size
  * Skills, background and philosophy of the team
* Quantitative approaches tend to be more rigid (structured) than qualitative ones
* Hence an evaluation proposal is imperative
## Reliability, Validity and Scope

* Can the study be replicated: **reliability**
* Are you measuring what you expect: **validity**
* Are there any unexpected effects: **bias**
* Can the findings be generalised: **scope**
## Questionnaire Design

* Interview vs Questionnaire?
* Should the session be videotaped?
## New, adapt or reuse?

* **Established questionnaires** will give more reliable and repeatable results than ad-hoc questionnaires.
* Some questionnaires for assessing the perceived usability of an interactive system:
  * Questionnaire for **User Interface Satisfaction** (QUIS) (1988)
  * **Computer System Usability** Questionnaire (CSUQ) (1995)
  * **System Usability Scale** (SUS) (1996)
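SUS is a good example of why reuse pays off: its scoring procedure is fixed and well documented (Brooke, 1996). A minimal Python implementation of the standard scoring:

```python
def sus_score(responses):
    """Standard SUS scoring (Brooke, 1996).

    responses: list of 10 ratings, each 1-5, in questionnaire order.
    Odd-numbered items are positively worded (contribution = rating - 1),
    even-numbered items are negatively worded (contribution = 5 - rating).
    Returns a score on the 0-100 scale.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Example participant who is fairly positive about the system
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2]))  # -> 82.5
```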
## Observations / Video Coding

* E.g. Child-Robot Interaction Evaluation Setup
* Video Analysis
  * E.g., measuring engagement through calculating the duration of
    * child eye-gaze facing the robot,
    * facial and verbal expressions,
    * gestures
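A minimal sketch of how coded video could be turned into a duration-based engagement measure. The `(start, end)` interval format is an assumption for illustration, not the output of any specific video-coding tool:

```python
# Coded intervals (in seconds) during which the child's eye-gaze
# faced the robot, as marked by a human coder watching the video.
gaze_at_robot = [(2.0, 7.5), (12.0, 19.0), (25.5, 31.0)]
session_length = 60.0  # total length of the coded session, seconds

# Engagement measure: total and proportional gaze duration
total_gaze = sum(end - start for start, end in gaze_at_robot)
print(f"Gaze at robot: {total_gaze:.1f}s "
      f"({total_gaze / session_length:.0%} of session)")
```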
## Interviews

* Analyst questions the user on a one-to-one basis, usually based on **prepared questions**
* Informal, subjective and relatively cheap

---

* Advantages
  * can be varied to suit context
  * issues can be explored more fully
  * can elicit user views and identify unanticipated problems
* Disadvantages
  * very subjective
  * time consuming
### Interview **Example**

* **Goal:** To understand behaviours and roles for social robots in Education
* **Users:** School Teachers
* **Method:** Interview
* E.g. interview questions include:
  * How do you think a robot can contribute towards efficient language learning?
  * How do you want a robot to show different gestures during a one-to-one interaction?
  * How do you want a robot to display a personality suited to a child?
  * How do you want a robot to react to children's emotions?
  * What kind of role should a robot play to improve learning?
  * How do you want a robot to store a child's memory?
## Field Trials

Useful for **interacting with users** or **getting ideas for later versions**

---

* **Advantages**
  - natural environments (the evaluator can view the system as part of the total environment)
  - context retained (though observation may alter it)
  - longitudinal studies possible
* **Disadvantages**
  - distractions/noise
* **Appropriate**
  - where **context is crucial**, and for longitudinal studies
### Field Trial **Example**

* To check the applicability of **Interactive Robots** as Social Partners and Peer Tutors for **Children** in a school
* **Steps**
  * Deploy a robot in one of the rooms inside the school, say a science laboratory
  * The robot can gesture and speak English using a vocabulary of 300 sentences, and recognize 50 words
  * The idea is to analyse the applicability of robots by observing overall child-robot interaction
## Focus Groups

* **Discussion-based group interview**
  - Originated during the Second World War, where it was used to test the effectiveness of propaganda
  - Often used today for market research
* Features
  - Comprises people with a particular set of characteristics
  - Moderated (often by the researcher)
  - Tends to be relatively informal
  - Centres **around open questions designed to generate dynamic discussion**
  - (participants may also be required to complete a questionnaire)
## Focus Group **Example**
* Learning about **children's views** on social robots in education
* Children discuss themes in groups of three or four on the use of social robots in education
* Themes include:
  * Robots as peers
  * Robots as tutors
### Another Focus Group Example
## Analyzing **Qualitative Data** from the **Focus Group**

* Content Analysis
  * In content analysis, **qualitative remarks** from the participants are **coded** into predefined **categories** via inferences made by independent human coders
* Set up a coding scheme
  * Operationalize the variables
  * Clearly define all categories
  * Identify the unit for coding
* Normally: 2 independent coders, and a check is made for inter-coder agreement (i.e. a check for bias), as in the sketch below
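A common inter-coder agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A small self-contained Python sketch; the category labels are made up to match the focus-group themes above:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same units.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from
    each coder's marginal category frequencies.
    """
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Example: two coders assigning focus-group remarks to categories
a = ["peer", "tutor", "peer", "peer", "tutor", "other"]
b = ["peer", "tutor", "tutor", "peer", "tutor", "other"]
print(round(cohens_kappa(a, b), 2))  # -> 0.74
```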
## Think Aloud

* User is observed performing a task
* User is asked to describe what they are doing and why, what they think is happening, etc.

---

* **Advantages**
  - simplicity
  - requires little expertise
  - can provide useful insights
  - can show how the system is actually used
* **Disadvantages**
  - subjective
  - selective
  - the act of describing may alter task performance
### Experimental Research

* A test of the effect of a **single variable** by changing it while keeping all other variables the same
* A controlled experiment generally **compares the results** obtained from an experimental sample against a control sample
* General terms
  * Participant (subject)
  * Independent variables (test conditions)
  * Dependent variables (what you measure)
## IVs and DVs

* **Independent Variables (IV)**: what you as the researcher vary or manipulate
  * Type of interface/app
  * Type of feedback
  * Type of menu
* **Dependent Variables (DV)**: what you measure
  * Time
  * Performance (accuracy, errors)
  * Subjective ratings
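To make the IV/DV distinction concrete, here is a small sketch comparing a DV (task completion time) across two levels of an IV (interface type) with an independent-samples t-test, as would suit a between-groups design. The numbers are invented for illustration:

```python
# IV: interface type (A vs B); DV: task completion time in seconds.
# Each list holds the times of one between-groups condition.
from scipy import stats

times_interface_a = [41.2, 38.5, 45.0, 40.1, 43.3]
times_interface_b = [35.0, 33.8, 37.2, 36.5, 32.9]

# Independent-samples t-test: is the difference in mean times
# between the two interfaces larger than chance would explain?
t, p = stats.ttest_ind(times_interface_a, times_interface_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```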
## **Example**: between subjects and within subjects

* A **sports drink company** wants to test two types of drinks (D1 and D2) and their effect on racers
* If all participants try both drinks: **within**
* If half of the participants try one drink only: **between**
* We can also have a control group (D0: water or no drink)
* We can also have a mixed design (the length of the race could be a second condition and be between subjects, while type of drink is within)
## Experimental design

* **Within**-group design
  - each subject performs the experiment under each condition
  - transfer of learning is possible
  - less costly and less likely to suffer from user variation
* **Between**-groups design
  - each subject performs under only one condition
  - no transfer of learning
  - more users required
  - variation can bias results
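A minimal sketch of how participants might be assigned under each design, using the drinks example above. In the within-subjects case the condition order is counterbalanced to spread out transfer-of-learning effects; the participant IDs are made up for illustration:

```python
import itertools
import random

participants = [f"P{i}" for i in range(1, 7)]
drinks = ["D1", "D2"]

# Within-subjects: everyone tries both drinks; alternate the order
# across participants (counterbalancing) so learning/fatigue effects
# do not systematically favour one drink.
orders = itertools.cycle(itertools.permutations(drinks))
within = {p: list(next(orders)) for p in participants}

# Between-groups: randomly assign each participant to one drink only.
shuffled = random.sample(participants, len(participants))
between = {p: drinks[i % 2] for i, p in enumerate(shuffled)}

print("within: ", within)
print("between:", between)
```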
## **Confound** Variables

* A variable that provides an alternative explanation for the results that we see
* Can cloud the effect of the experimental conditions
* So in the drinks example, a confound variable could be:
  * temperature at the time of the race
  * previous experience
  * age of racers
#### Evaluation Method (what method is used; summarize the **setup**)

* **Procedure:** the sequence of steps from welcoming the participant to the participant leaving the experiment room; in other words, you provide enough detail on the process of data collection to allow another person to repeat your research
* **Materials:** here you describe the equipment/instruments used for data collection (computers, microphones, screens, tangible objects, etc.)
* **Setup:** where was the participant seated, how far from the screen, etc.
* **Measurements:** details on the data collection instruments (what is being measured and how)
* **Participants:** here you need to identify the units you studied (what types of users). When the research units are humans, they are most often referred to as "participants."
## Summary

* Understand the core concepts for **Testing** and **Evaluation**
* **Different** evaluation and test methods
* Examples
## Recommended Reading
Interaction Design: Beyond Human-Computer Interaction
Chapters 14-16
## To do this week ...

* Read over the lectures
* **Review** the revision questions
* Work through labs/tutorial practicals
* Experiment (get into good habits)