Summary: The book The Effortless Experience posits that the Customer Effort Score is a good predictor of customer loyalty. This part of the review addresses the shortcomings of the research execution. The description of the survey execution leaves many unanswered questions, and any of these issues would seriously compromise the validity of the research data. By their own admission, the researchers do not know how to write good survey questions, but the issues go far beyond that.
~ ~ ~
2. Survey Questionnaire Design and Execution Issues
Let’s move beyond the weak research model. The actual survey instrument used has weaknesses and the survey process has holes — as much as we can tell from the Swiss-cheese presentation of the methodology. (I love Swiss cheese, but I like holes in my cheese, not in the presentation of research streams.)
The authors never present us with the survey questionnaire or the exact wording of most of the questions in the survey. This “holding their cards close to their vest” should give you cause to pause, but we do get some glimpses of the methodology, and they cast doubt upon the validity of the data captured and thus any findings derived from those data.
Survey Administration Issues. In the very brief methodology section shown in the earlier screen shot, the authors tell us they surveyed “over 97,000 customers — all of whom had a recent service interaction over the web or through calling a contact center and were able to remember the details clearly…” (emphasis added). This short description raises a whole host of problematic questions.
- How recent is “recent”? Were the service interactions in the past day, past week, past month or past year? Clearly, recall bias is in play here, perhaps to a large extent. For some respondents the time lag between the service event and the data capture may have been great, and the time lag certainly varied across respondents, which makes the data less reliable as measures of customer perception. The longer the time gap between the service event and the survey, the more measurement error is introduced into the data set.
- How did they direct the respondent to think about a specific service interaction? This is what survey designers call “setting the respondent’s mental frame.” Did the instructions say to report on the most recent interaction or a recent one that stood out? I’m guessing the latter given what they wrote. Think for a minute (please!). What service interaction would you remember most clearly? That’s right — an extreme interaction and probably a negative extreme since that one sticks in the mind most. If so, then the service interactions measured were not a random sample but a biased sample. This would have serious implications for the findings.
- What incentive got 97,000 people to respond? That number is meant to impress, and it does! Oddly, it’s so big that it raises questions. By their own statement, respondents had to “endur[e] the battery of questions,” which raises still more questions about how they got that volume of responses. I suspect some incentive, either a raffle or payment as part of a panel, was offered. Surveyors know that incentives will lead some people to click on anything just to get the prize. Of course, this would mean the validity of the data is weakened.
- How does this study relate to the previous one? In the 2010 HBR article, they say they surveyed 75,000 people over a three-year period. The book reports on a study with 97,176 respondents. Was the latter a new study or did they just survey 22,000 more people? If the latter, then how did the survey questionnaire change over the roughly six-year study period? Professional surveyors know that significant changes to a questionnaire (wording, sequencing, length, etc.) require starting over with the data sets.
- What was the response rate? The response rate is important to evaluate the extent of non-response bias — people choosing not to respond holding different views from those who did — which could result in the data sample being a biased subset of the invitation sample.
- How do they know the respondents “remembered the details clearly”? Did they include so-called “test questions” to help ensure the respondents weren’t just making up their responses? The authors make no mention of using these techniques to ensure validity, or of rejecting any responses during data cleansing, if they did data cleansing at all. The authors proudly trot out all the research they conducted. Data cleansing is part of being a good researcher and should be disclosed as part of the research methodology. My assumption is this: if the respondent clicked on the radio buttons on the survey screen, the data were used, no questions asked. Did they “clearly remember” the details or were they just checking radio buttons? Just because a respondent enters a response on a survey doesn’t mean the data accurately reflect the respondent’s real views.
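The mental-frame concern in the list above can be made concrete with a small sketch. Everything in it is hypothetical: the ratings are invented for illustration, and the rule that ties between equally extreme interactions go to the negative one is simply my assumption that bad experiences stick in the mind most. It compares asking each customer about the most recent interaction with asking about the one that stood out.

```python
from statistics import mean

# Hypothetical rating histories on a 1-to-5 scale (5 = best), most recent last.
# These numbers are invented for illustration; they are not from the book.
histories = [
    [4, 4, 1, 4],
    [5, 4, 4],
    [3, 1, 4, 4, 5],
    [4, 2, 4],
]

def most_recent(history):
    """The interaction a 'most recent' instruction would elicit."""
    return history[-1]

def most_memorable(history, midpoint=3):
    """The interaction farthest from the scale midpoint.

    Assumption: when two interactions are equally extreme, the negative
    one is the one the respondent recalls.
    """
    return max(history, key=lambda rating: (abs(rating - midpoint), -rating))

recent_mean = mean(most_recent(h) for h in histories)
memorable_mean = mean(most_memorable(h) for h in histories)

print("mean of most recent interactions:", recent_mean)
print("mean of most memorable interactions:", memorable_mean)
```

With these made-up histories the "most recent" frame averages 4.25 while the "stood out" frame averages 2.25. The customers are the same; only the instruction changed. That is why the exact wording of the instructions matters so much.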
All of these questions are left unaddressed. We aren’t privy to the administration procedures, email invitation, the introductory screen, the instructions, or the survey questionnaire.
I know that most people reading this are not trained researchers, but I hope you can see how these shortcomings should make the critical thinker question whether findings from these data are legitimate.
Customer Effort Score (CES) as a Loyalty Measure on Transactional Surveys. The authors claim that CES complements Net Promoter Score® as a summary customer metric. They say NPS is better for a high-level view of the relationship on annual surveys, whereas CES is better at assessing loyalty effects at a transactional level. Yet the administration approach for their research was not a transactional survey.
Survey Questionnaire Validity, Particularly the CES Question. A “valid” question is one that actually measures what it was supposed to measure. Copious opportunities exist to create invalidity. We should all have serious doubts about the ability of this research organization to write a good, valid questionnaire. That’s not my assessment. That’s the authors’ assessment. If you read no other chapter in the book, read Chapter 6 — and really read it, thinking about what they say.
Consider the wording of the central question in their research.
The exact wording of CES [Customer Effort Score] has evolved since we first released in a 2010 Harvard Business Review article. (page 157; emphasis added)
For that research stream, CES 1.0 — as they now call it — was worded:
How much effort did you personally have to put forth to handle your request?
The question was posed on a 1-to-5 scale where 1 was low effort and 5 was high effort. They changed the wording for at least some of the research described in the book (more on that later) to:
The company made it easy for me to handle my issue.
The CES 2.0 (sic) question is now posed on a 1-to-7 Agreement scale. That’s more than an “evolution;” that’s a wholesale change. Why the change? (pages 157-158)
First we found that CES could be prone to false negatives and positives simply because of the inverted scale it uses.
Some customers misinterpreted the CES question as asking not how difficult the resolution experience was, but how hard they personally tried to resolve the issue on their own.
The word “effort” can also prove hard to translate for companies that serve non-English-speaking customers.
Finally, there was the challenge presented by a lack of “priming” before the CES question. In other words, after enduring the battery of questions that all ask, in some way, shape, or form how the customer likes or dislikes certain elements of the service experience…, it can throw some customers off to suddenly ask them about effort when everything that [had] been asked up to that point pertained more to satisfaction. (emphasis added)
Let me give them some well-deserved kudos here. They recognized that the original question was invalid and they made changes. A mature researcher learns from — and admits — his mistakes. A round of applause, please.
But… how could they not have known that CES 1.0 was bogus? My jaw dropped when I first read the contorted syntax of the question and saw the inverted scale. When I present it to people in my survey workshops, the typical reaction is, “What does that mean?” I guarantee that if they had done any pilot testing — presenting the questionnaire to respondents prior to release — the problems with that question would have become known. Pilot testing is standard practice for professional survey designers.
Since the questionnaire is not shared with the reader, the huge mistake in the wording of their most important question gives me grave reservations about the valid wording of all the questions.
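To see why the move from CES 1.0 to CES 2.0 is more than an “evolution,” here is a minimal sketch that puts both scales on a common 0-to-1 “ease” footing. The function name and the rescaling approach are my own illustration, not anything the authors describe. Rescaling only lines up the direction and range of the numbers; it does not make two differently worded questions equivalent.

```python
def to_unit_ease(score, low, high, higher_is_easier):
    """Map a scale response onto 0.0 (hardest) to 1.0 (easiest).

    Hypothetical helper for illustration only.
    """
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-to-{high} scale")
    fraction = (score - low) / (high - low)
    # On an inverted scale a high number means MORE effort, so flip it.
    return fraction if higher_is_easier else 1.0 - fraction

# CES 1.0: 1-to-5 scale, 5 = high effort (inverted).
# CES 2.0: 1-to-7 agreement scale, 7 = strongly agree it was easy.
top_v1 = to_unit_ease(5, 1, 5, higher_is_easier=False)
top_v2 = to_unit_ease(7, 1, 7, higher_is_easier=True)
four_v1 = to_unit_ease(4, 1, 5, higher_is_easier=False)
four_v2 = to_unit_ease(4, 1, 7, higher_is_easier=True)

print("top box, CES 1.0:", top_v1, "| top box, CES 2.0:", top_v2)
print("a '4' on CES 1.0:", four_v1, "| a '4' on CES 2.0:", four_v2)
```

The top score on the old scale is the worst possible outcome, while the top score on the new scale is the best. A “4” means a fairly hard experience on one scale and a neutral one on the other. Data from the two versions cannot simply be pooled or compared at face value.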
CES 2.0 – Where Does It Fit Into the Research Stream? What changes did they make in this later effort to help ensure questionnaire validity? They tell us:
This new question, CES v.2.0, is a variant of the original question, but one we found produces much more reliable results, is less prone to misinterpretation, is less arduous to translate into multiple languages, and provides and requires less priming to get customers to accurately respond. (page 158)
But what were the reliability measures? How do they know customers “accurately responded”? What testing did they do? Given that they screwed up in v1.0 so badly, it’s a very legitimate question to ask, especially given that this is the central question in their research.
Or is it?
Running this question with a panel of thousands of customers produced some fascinating and powerful results. (page 159; emphasis added)
Whoa! I thought the research survey had 97,000 respondents. Here they talk about “thousands of customers.”
- So was this new question part of the survey with 97,000 respondents? It doesn’t appear so, given the above comment. Are all the previous statements in the book about the predictive qualities of CES based on the old, fatally flawed wording? I have no choice but to assume that’s the case.
The authors present a scramble of research activities (and this relates more to my previous section of this extended review), but they never really lay out their research program. Is this lack of precise description intentional or just the result of sloppy writing? When reading this book, I felt like I was watching a magician’s act, looking for the sleight of hand. It was on my second reading of this chapter that I recognized the above distinction. I doubt many other readers have noticed it. Yet, it’s a pretty darn important distinction for the validity of all their findings.
CES 1.0 – Trying to Have it Both Ways. I also have to partially retract my previous compliment about their research maturity in recognizing the wording problems in CES 1.0. On page 157, in talking about the need to reword CES, they state:
In our cross-industry studies we found that the answer to this question [CES 1.0] provided a strong measure of the impact of an individual customer service interaction on loyalty.
In their blog they use similar language:
[CES] proves to be an extremely strong predictor of future customer loyalty
This shows their serious immaturity as researchers. They say on the one hand that the question wording was seriously flawed, but on the other hand say CES 1.0 “provided a strong measure … on loyalty.” If the question is flawed, the data from it are invalid and all findings using that question are meaningless! This isn’t some weird quirk of researchers; it’s common sense!
They want kudos for “improving” the question, but they still want the findings to be accepted. You might wonder how such bad research got published in the Harvard Business Review. That’s because HBR does not peer review its articles, so their research has never gone through a peer review process.
Question Wording for the Loyalty Questions. In addition to the customer effort question, the other critical questions were the loyalty questions, but how were those questions phrased? We don’t know because they don’t tell us.
Did they say, “Based upon the service interaction you’ve been describing, how likely…” or was that qualifying opening clause omitted? If it was omitted, then the loyalty questions are not measuring the impact on loyalty of “merely one service interaction” (page 159) but, instead, the customer’s composite view of the experience with that company. The question wording makes a world of difference here. Ambiguous wording leads to data invalidity.
Even with that qualifying phrase, past product and service experiences will create a halo effect for the loyalty questions. How does the importance of the level of effort in a service interaction stack up against other factors, such as product experiences, that drive intended loyalty? We don’t know. Those measures were excluded from their model.
Various Questionnaire Design Concerns. Notice the earlier description of the “priming” issue. They’re describing what we call “routine” coupled with “fatigue.” If their respondents are “enduring a battery of questions” such that a change in scale anchors confuses them, then the questionnaire needs serious rework. If it was so arduous, how did they get 97,000 people to complete it?
We do know that the loyalty questions were posed at the end of the survey when respondent fatigue had likely set in from “the battery of questions.” The respondents had also been “primed” to think about all the bad things that happened in the service interaction, a sequencing effect that’s the equivalent of leading questions. How valid are those loyalty data? How different would the loyalty responses have been if they had been posed at the beginning of the survey?