Usability Testing Best Practices: An Interview with Rolf Molich

Rolf Molich and the Foundations of Modern Usability

When most people think of usability pioneers, Rolf Molich’s name rarely comes to mind, yet his influence is woven through nearly every standard practice in interface design. Beginning his career in 1983, Molich set out to answer a simple question: how can designers discover and correct human–computer interaction problems before users encounter them? His early research pushed beyond the prevailing focus on engineering precision and introduced a human‑centric lens that remains central to usability today.

In 1990, Molich teamed with Jakob Nielsen to publish a landmark paper that outlined the “heuristic evaluation” process. The idea was straightforward yet revolutionary: bring a handful of experienced designers together to walk through an interface, flagging violations of a concise set of usability principles. The result was a cost‑effective, rapid assessment that could uncover problems early in the design cycle. The method quickly gained traction, becoming a staple in both academic curricula and industry training programs.
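
To make the mechanics concrete, here is a minimal sketch of how findings from such a review might be tallied across reviewers. It assumes Python, and the screens, heuristics, and flagged problems are invented for illustration; they are not drawn from the 1990 paper.

    # Hypothetical heuristic-evaluation tally: each reviewer independently
    # flags (screen, heuristic violated) pairs; problems flagged by several
    # reviewers are usually prioritized first.
    from collections import Counter

    reviewer_findings = [
        {("login", "error prevention"), ("search", "user control")},
        {("login", "error prevention"), ("settings", "consistency")},
        {("search", "user control"), ("login", "error prevention")},
    ]

    tally = Counter(problem for findings in reviewer_findings for problem in findings)
    for (screen, heuristic), votes in tally.most_common():
        print(f"{screen}: {heuristic} (flagged by {votes} of {len(reviewer_findings)} reviewers)")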

For years, heuristic evaluation was heralded as a silver bullet. Teams could gather a few experts, review wireframes or prototypes, and generate a list of actionable recommendations - all without the need for costly user testing. The simplicity of the approach - a few expert reviewers, short timeframes, and a clear set of guidelines - made it attractive to product managers eager to save time and money.

However, Molich has not been content to let his early success define his legacy. In recent interviews, he has expressed growing skepticism about the universal value of heuristic evaluations. He points to studies where untrained or inexperienced reviewers produced a flood of false positives, undermining the credibility of the method. Molich argues that while expert insight can be powerful, the assumptions underlying a “one‑size‑fits‑all” inspection model are increasingly out of step with today’s complex, data‑driven design workflows.

Partly in response to these concerns, Molich turned his attention to a different research avenue: Comparative Usability Evaluation, or CUE. The CUE series was conceived as the first systematic effort to have multiple independent evaluation teams assess the same interface, each applying its preferred methods. By comparing results, the project sought to illuminate which practices consistently surface high‑impact issues and which are more idiosyncratic.

Two notable iterations of the study are CUE‑2, which examined Microsoft Hotmail with nine separate teams, and CUE‑4, which focused on the Flash‑based iHotelier reservation system with eighteen evaluators. These projects produced a wealth of data that challenged many long‑held assumptions about consistency and reliability in usability testing. The full reports, including detailed findings and methodological notes, are publicly available. UIE (http://www.UIE.com) offers workshops that emphasize evidence‑based evaluation, while Molich’s own writings include practical checklists and decision trees. These tools help teams weigh the cost and benefit of expert reviews against other testing methods, ensuring that each evaluation step adds measurable value.

Comparative Usability Evaluation (CUE): A Global Experiment

The CUE series was conceived as a scientific experiment in methodological pluralism. Instead of asking a single team to evaluate an interface, CUE invited multiple independent groups to conduct the same evaluation using their own established practices. The goal was to see how much overlap there was in the problems identified and which methods surfaced the most significant issues.

CUE‑2, the second iteration of the project, focused on Microsoft Hotmail. Nine teams - each comprising usability experts, designers, and researchers - were given a three‑week window to recruit participants, design tasks, and carry out the evaluation. The teams were left largely free to use whatever methods they deemed most appropriate, whether that meant think‑aloud protocols, eye tracking, or purely expert inspections.

The results were striking. Across the nine teams, 310 distinct usability problems were documented. Yet only six of those problems were mentioned by more than half the teams, and the most frequently reported issue was noted by seven teams. The remainder of the findings were largely unique to each group, reflecting the subjective nature of usability assessment and the variety of lenses through which teams view a design.
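
The overlap figures above boil down to a simple counting exercise. A minimal sketch of that kind of analysis, written in Python with invented problem IDs rather than the actual CUE‑2 data, might look like this:

    # Hypothetical cross-team overlap analysis: count how many teams
    # reported each problem, then check how many problems were reported
    # by more than half of the teams.
    from collections import Counter

    team_reports = [
        {"P1", "P2", "P7"},       # team A
        {"P1", "P3"},             # team B
        {"P2", "P4", "P5", "P6"}, # team C
    ]

    counts = Counter(p for report in team_reports for p in report)
    distinct = len(counts)
    majority = sum(1 for c in counts.values() if c > len(team_reports) / 2)

    print(f"{distinct} distinct problems; {majority} reported by more than half the teams")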

One of the most illuminating aspects of CUE‑2 was the variation in task design. Each team created its own set of realistic tasks for the Hotmail interface, often tailored to their perception of user intent. This freedom led to divergent focus areas, with some teams uncovering navigation issues while others highlighted error handling or search functionality. The outcome suggested that without a shared task framework, it is difficult to achieve a consistent set of findings.

CUE‑4 extended the experiment to a more complex, Flash‑based hotel reservation system from iHotelier. Eighteen evaluators applied both expert inspections and usability testing across a range of scenarios. Although the data set for CUE‑4 was still being analyzed at the time of the interview, preliminary insights echoed the CUE‑2 pattern: a high volume of unique findings and a lack of consensus on the most critical issues.

The broader implications of the CUE studies point to a need for clearer methodological standards in usability evaluation. While the diversity of techniques is valuable, the field must also address the reproducibility of findings and the trade‑offs between depth and breadth. The full CUE reports, including methodological appendices and raw data, are publicly available.
