“What’s the objective of your survey program?” is the first issue I suggest people consider when they create a survey program. In fact, developing a Statement of Research Objectives is the first exercise in my Survey Design Workshops. The step seems innocuous: the goal is to capture information from customers, employees, or some other stakeholder group to create or improve products or services. Or is it? Is there some other goal or agenda — stated or unstated — that trumps that logical goal?
That real agenda may manifest itself in the survey questionnaire design. Disraeli said, “There are three types of lies: lies, damn lies, and statistics,” and survey questionnaire design affords us the opportunity to lie through the statistics a survey generates — or to be unwittingly misled by decisions made during the questionnaire design process.
Let’s look at an example. Below is an image of the Ritz Carlton customer satisfaction survey — an event or transactional feedback survey — that you may have received if you stayed with them in the early part of this century. The survey was professionally developed for them. (The survey shown has been abridged; some questions have been omitted. Also, the formatting is very close to, but not exactly, that of the original survey. I took some minor liberties in translating from a paper form to a web page for readability’s sake.)
Ritz Carlton is the gold standard of customer service, and they are well known for their efforts to identify and correct any customer problems — though I have personally been disappointed in their service recovery efforts. One key purpose of transaction-driven surveys — surveys conducted after the conclusion of some event — is to identify specific customers in need of a service recovery act. A second purpose is to find aspects of the operation in need of improvement. Consider how well this survey serves those purposes. In a follow-up article, we’ll consider other flaws in the wording of the survey questions.
First, let’s look at the survey as a complaint identifier. Putting aside the issues of the scale design, the questions at the end of the survey capture how happy you were with the service recovery attempt. But what if you had a problem and you did NOT report it? Sure, you could use the Comments field, but no explicit interest is shown in your unreported problem. No suggestion is made that you should state the nature of the unreported problem so they could make amends in some way.
Next, let’s examine how well the survey instrument design captures those service attributes that had room for improvement. Notice the scale design that Ritz uses. The anchor for the highest point is Very Satisfied and the next highest point is Somewhat Satisfied, with mirror image on the lower end of the scale.
Consider your last stay at any hotel — or the Ritz if your budget has made you so inclined. Were your minimal expectations met for front-desk check-in, room cleanliness, etc.? My guess is that your expectations were probably met or close to met, unless you had one of those disastrous experiences. If your expectations were just met, then you would probably consider yourself “just satisfied.”
So, what point on the scale would you check? You were more than Somewhat Satisfied — after all, they did nothing wrong — therefore it’s very possible you’d choose the highest response option, Very Satisfied, despite the fact that you were really only satisfied, an option not on the scale. To summarize, if your expectations were essentially just met, the choices described by the anchors may well lead you to use the highest response option.
In conference keynotes I draw a two-sided arrow with a midpoint mark and ask people where on that spectrum they would place their feelings if their expectations were just met for some product or service. Almost universally, they place themselves in the center at the midpoint marker or just barely above it. In other words, most people view satisfaction — the point where expectations were just met — as a midpoint or neutral position. This is a positive position, but not an extremely positive position.
What if your expectations were greatly exceeded from an absolutely wonderful experience along one or more of these service attributes? What response option would you choose? Very Satisfied is the only real option. So consider this: customers who were just satisfied would likely choose the same option as those who were ecstatic. (By the way, the arguments described here for the high end of the scale apply equally to the low end of the scale.)
Here’s the issue: this scale lacks good dispersal properties. Put in Six Sigma and Pareto Analysis terminology, it does NOT “separate the critical few from the trivial many.” (For you engineering types, the signal-to-noise ratio is very low.) The scale design — either intentionally or unwittingly — drives respondents to the highest response option. Further, it’s a truncated scale, since those with extreme feelings are lumped in with those with only moderately strong feelings. We learn very little about the range of feelings customers actually have.
Is Ritz Carlton the only company that uses this practice? Of course not. After a recent purchase at Staples, I took their transactional survey. Its error was even more extreme: the top anchor was Extremely Satisfied, and the next was Somewhat Satisfied. I was satisfied with the help I got from the store associates; they met my expectations. Which option should I choose? I was not ecstatic, but I was pleased.
A well-designed scale differentiates respondents along the spectrum of feelings described in the anchors. Learning how to choose good anchors is vital to getting actionable data. How actionable would you expect the data from these questions to be? I’ve never seen their data, but I would bet they get 90% to 95% of scores in the top two response options — so-called “top box” scores. I would not be surprised if they got 98% top box scores. If I were the Ritz, I would interpret any score other than a Very Satisfied as a call to action. (Maybe they do that.) I would also bet that any hotel — a Motel 6 or a Red Roof Inn — would get 90%+ top box scores using this scale. The dispersal properties of this scale are just that poor.
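To make the top-box idea concrete, here is a minimal sketch of a top-box calculation in Python. The ratings are entirely made up — skewed the way I’d expect this scale to skew — and “top box” here means the top two options on a 1-to-5 scale.

```python
def top_box_score(ratings, top_options=(4, 5)):
    """Percent of respondents whose rating falls in the top response options."""
    if not ratings:
        return 0.0
    hits = sum(1 for r in ratings if r in top_options)
    return 100.0 * hits / len(ratings)

# A hypothetical distribution skewed the way the article predicts:
# nearly everyone lands in the top two boxes, so the score tells us little.
ratings = [5] * 70 + [4] * 25 + [3] * 3 + [2] * 1 + [1] * 1
print(top_box_score(ratings))  # 95.0
```

A score like this looks flattering in a management report, but it cannot separate the ecstatic from the merely satisfied — which is exactly the dispersal problem described above.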
A simple change would improve it. Make the highest option Extremely Satisfied and the next highest Satisfied. Or use a different scale. Have the midpoint be Expectations Just Met, which is still a positive statement, and the highest point be Greatly Exceeds Expectations. I have used that scale and found a dispersion of results that lends itself to actionable interpretation.
If you’re a cynic, then you might be asking what “damn lie” is really behind this questionnaire scale design. Here’s a theory: Public Relations or Inside Relations. Perhaps the “other goal” of the survey was to develop a set of statistics to show how much customers love Ritz Carlton. Or perhaps the goal is for one level of management to get kudos from senior management.
This questionnaire scale design issue is one reason why comparative benchmarking efforts within an industry are so fundamentally flawed. You may be familiar with these benchmarking databases, which collect data from companies and then share the average results with all who participate. Self-reported customer satisfaction survey scores are typically among the data points — that is, the data submitted are not audited. Yet if some companies use scale designs like the one shown here, how valid is the benchmark for a company whose scale truly differentiates? Your 50% top box score may reflect higher levels of customer satisfaction than the other company’s 90% top box score. Self-reported data for comparative benchmarking databases with no standard practice for data collection are suspect — to say the least.
The Disraeli quote at the opening is also attributed to Mark Twain. (Isn’t every good quote from Twain, Disraeli, or Churchill?) If Twain had lived in the days of customer satisfaction surveys, he would have augmented Disraeli’s quote thusly: “There are four types of lies: lies, damn lies, statistics, and survey statistics.”
Choice of the survey question types used in a questionnaire is a critical design decision. The question type determines the type of data generated, which in turn determines the type of analysis you can do with the survey data collected. No one best survey question type exists. The appropriate question type is the one that best generates valid, reliable data to answer your research question.
Design of interval scales for surveys is a vital part of survey questionnaire design. How many points on the scale, odd number or even number, presenting the scale from high to low versus low to high, endpoint anchoring or fully anchoring each scale point are all design issues. Most important is the choice of anchors, which are those terms that describe the dimension of measurement. Importantly, a scale designed for American English audiences must be localized for other variations of the mother tongue.
We practice scale design in my survey workshops, and in a recent workshop one attendee decided to create a localized scale for measuring relevancy — in this case, for Texas. (My apologies in advance for offending anyone’s sensitivities.)
I leave it to you to add in the proper Texan accent!
A Minnesotan colleague has submitted these for the quality of service dimension:
A Massachusetts friend added these:
For non-Bostonians, “Stahted” translates to “Started”.
Have one to submit? Contact us!
The choice of a survey scale also affects the setting of performance goals, since scale choice and question wording shape the way people respond. And (artificially) high scores are not necessarily good — particularly if your goal is to use the survey results for continuous improvement projects, which require Pareto Analysis.
Recently, I heard a colleague present the results of a research survey. One area of the survey addressed the importance of various factors in determining marketplace actions. At this juncture he said, “Everything looks important.” It was apparently true. For each of the three factors questioned, over 50% of the respondents used the top box score of 5 (on a 1 to 5 scale), indicating Extremely Important. More than 80% of the respondents gave a 4 or 5 rating.
So what did we learn from this? Not much about the criticality of different factors, but we can learn an important lesson about survey instrument design practices.
Most any customer research is likely to touch on the topic of measuring the importance or value of some set of factors. Perhaps the focus is the importance of aspects of services or products to the overall business relationship. Perhaps it’s the importance of alternative future actions the company could take. Our real interest lies in the importance of each factor relative to the other factors. This is fundamental Pareto Analysis: separate the critical few from the trivial many. If everything is important, then where do you focus your energies?
The challenge lies in how the question is asked. Most survey designers will follow the practice shown above — asking respondents to evaluate each factor individually on an interval rating scale. Few factors will get low- or mid-level ratings. For example, suppose you were surveyed about the importance of various aspects of a recent flight you took. What wouldn’t get a very high rating? Flight schedules? Price? Security processes? Speed of check-in? Handling of checked baggage? Seat comfort? On-time arrival? Flight attendant actions? Meal quality? (Well, okay… airlines seldom serve meals…) Perhaps the movie selection would get low importance scores. Yet one or two of those items are truly more important to you — as you would discover if forced to perform an explicit trade-off analysis.
So what’s the solution? Six research approaches can be used to garner importance from respondents. Here’s a quick review of each approach.
Interval Rating Scales. This is the more typical approach just described. Respondents are presented with a rating scale and asked to rate the importance of items. (An interval rating scale requires a consistent unit of measurement — the cognitive distance between adjacent pairs of points must be equal intervals.) Other questions on the survey likely will use an interval scale, so it’s natural to apply the scale to importance questions as well. We’ve seen what this leads to: unactionable data. Everything is important — so nothing is important.
Forced Rankings. An alternative is to ask the respondent to rank order a set of items. For example, six factors might be presented and the respondent is asked to place a “1” next to the most important factor, a “2” next to the second most important factor, and so on. Sounds great. We force the respondent to think of the factors as a set and rank them. But there are shortcomings to this approach.
First, we’d be tempted to take an average of the ranking scores respondents assigned to each factor. However, the question did not generate interval data; it generated ordinal data. That is, even spacing does not exist between each pair of ranked items. (For example, some respondents may consider the items ranked 1 and 2 as almost equal, but the item ranked 3 is a distant third.) You can take an average, but it would be statistically improper and could be misleading. Your analysis is really limited to cumulative frequency distributions for each factor. For example, “60% of respondents ranked Item A as first or second in importance.”
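That cumulative-frequency analysis can be sketched as follows, assuming each response is stored as an item-to-rank mapping (the factor names and data here are hypothetical, purely for illustration).

```python
def ranked_in_top_k(responses, item, k=2):
    """Percent of respondents who ranked `item` in their top k (1 = most important)."""
    hits = sum(1 for r in responses if r.get(item, float("inf")) <= k)
    return 100.0 * hits / len(responses)

# Five hypothetical respondents, each ranking the same three factors
responses = [
    {"Price": 1, "Quality": 2, "Speed": 3},
    {"Price": 2, "Quality": 1, "Speed": 3},
    {"Price": 1, "Quality": 3, "Speed": 2},
    {"Price": 3, "Quality": 1, "Speed": 2},
    {"Price": 3, "Quality": 1, "Speed": 2},
]
print(ranked_in_top_k(responses, "Price"))    # 60.0
print(ranked_in_top_k(responses, "Quality"))  # 80.0
```

Notice the analysis never averages the ranks themselves — it only reports how often an item lands in the top k, which is the statistically defensible summary for ordinal data.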
Second, respondents make mistakes completing ranking questions — assigning the same rank twice or skipping a rank. On a web-form survey, you could provide error feedback and not let the respondent move through the survey until the question is completed correctly. You’ll see various designs for forced-rank questions, and all are likely to annoy some respondents as they complete the question; many will simply quit the survey. Alternatively, you could “correct” the errors yourself — which is all you can do on a hardcopy survey. But what are the correct answers? You would be introducing error into the survey data, because those responses have become yours, not the respondents’.
Third, you can only ask the respondent to rank order a limited set of items. Ranking even six items is asking a lot of the respondent.
Multiple-Response, Multiple-Choice Questions. One viable solution is to seemingly step back in the sophistication of the question structure by using a multiple-response, multiple-choice question format. Ask the respondent to select from a list of items the 2 (or maybe 3) most important items to them. Your analysis will then be the percent of respondents who selected each item in their top 2. The respondent burden — the amount of work you’re asking the respondent to do — is much less, and the results are still meaningful. Many web survey tools will allow you to limit — or force — the number of items checked to a specific number.
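The tally for this format is simple. Here’s a minimal sketch, assuming each respondent’s picks are stored as a set; the airline-style factor names and the picks themselves are hypothetical.

```python
from collections import Counter

def selection_percentages(selections):
    """Percent of respondents who included each item in their picks.

    `selections` holds one set of chosen items per respondent
    (e.g. "check the 2 most important factors").
    """
    counts = Counter(item for chosen in selections for item in chosen)
    return {item: 100.0 * c / len(selections) for item, c in counts.items()}

picks = [
    {"Price", "Schedule"},
    {"Price", "Seat comfort"},
    {"Schedule", "On-time arrival"},
    {"Price", "On-time arrival"},
]
pct = selection_percentages(picks)
print(pct["Price"], pct["Schedule"])  # 75.0 50.0
```

Because each respondent contributes to two (or three) items, the percentages across items will sum to more than 100% — that’s expected, and the ordering of items is what matters for Pareto purposes.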
The number of items you ask people to check is driven in good part by the number of items they have to choose from. Two or three is a reasonable number, but if one item is something you know everyone is likely to select, you might want to ask for an additional choice. For example, price would likely be a top-three choice for everyone among factors that affect a purchase decision. True, you could leave price off the list, but people would write it in — or think you don’t know what you’re doing.
To enhance your data, you can also pose a follow-up question asking the respondent which of the three choices they just checked would be their number one choice. Then, you could pose a second follow-up question about the remaining two choices. Some web-form survey software will perform a threaded branch in which the selections from the first question are carried forward to the subsequent questions. In essence, you’re asking respondents to rank order their top three choices — without presenting a set of instructions that would likely confuse many people.
Fixed-Sum Questions. A question format that combines the interval scale with the forced ranking is the fixed-sum or fixed-allocation question format. Here, you present the respondent with the same set of items and ask them to allocate 100 points across the items based on their relative importance. The respondent has to make a trade-off. More points for one selection means fewer points for another. The depth of thinking required means this question format is high in respondent burden, but there’s real value in that thinking.
A key decision is how many items to present. Four, five or ten are optimal since they divide evenly into 100. Web survey tools will allow you to set the total to a different number. For example, if you decide on 7 items, then have the items total to 70. The tools should provide a running total to the respondent and can force the question to be completed correctly. Otherwise some data cleansing will be necessary, but that is likely worth the effort for the very rich data the format can generate.
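As a sketch of that data cleansing step, suppose hardcopy responses are keyed in as item-to-points mappings (the factor names and numbers are hypothetical): responses that fail the sum check are dropped before averaging the allocations.

```python
def clean_and_average(allocations, total=100):
    """Drop responses whose points don't sum to `total`, then average per item."""
    valid = [a for a in allocations if sum(a.values()) == total]
    if not valid:
        return {}
    items = {k for a in valid for k in a}
    return {i: sum(a.get(i, 0) for a in valid) / len(valid) for i in items}

allocations = [
    {"Price": 50, "Quality": 30, "Speed": 20},
    {"Price": 40, "Quality": 40, "Speed": 20},
    {"Price": 60, "Quality": 20, "Speed": 15},  # sums to 95 -> dropped
]
avg = clean_and_average(allocations)
print(avg["Price"], avg["Quality"], avg["Speed"])  # 45.0 35.0 20.0
```

Whether to drop or rescale an off-total response is a judgment call; dropping is shown here because rescaling presumes you know which item the respondent miscounted.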
Correlation and Regression Analysis. One way to measure importance is to not ask it at all! Instead, importance can be derived statistically from the data set. Consider the scenario where you have questions measuring the satisfaction with various aspects of a product or service and you want to know how important each is to overall satisfaction. Include a summary question measuring overall satisfaction, which you probably would anyway, and skip any questions about importance. Using correlation or regression analysis, you can determine which items align most closely to overall satisfaction. But be sure your manager can understand and will accept the findings.
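The derivation can be seen in miniature with a hand-rolled Pearson correlation and made-up 1-to-5 ratings (the attribute names are hypothetical): the attribute whose scores track overall satisfaction most closely is inferred to be the most important — with no importance question asked.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-to-5 satisfaction ratings from six respondents
overall = [5, 4, 2, 5, 3, 1]   # overall satisfaction
checkin = [5, 4, 2, 5, 3, 1]   # tracks overall exactly
minibar = [3, 5, 4, 2, 5, 3]   # little relationship to overall

derived = {
    "check-in": pearson(checkin, overall),
    "minibar": pearson(minibar, overall),
}
ranked = sorted(derived, key=derived.get, reverse=True)
print(ranked[0])  # check-in
```

Correlation only establishes association, not causation, and collinear attributes can muddy the picture — which is one reason a manager may push back on derived importance, as noted above.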
Conjoint Analysis. A final method involves a more complex statistical technique, conjoint analysis. It is particularly useful in constructing product offerings where various features are combined into bundles and the researcher wants to know how important each feature is in driving purchase decisions. Conjoint analysis requires a special research survey in which the respondent is presented with pairs of factors and asked the relative importance of one over the other. Subsequently, the respondent is presented with two versions of the product, each with a different set of features based upon the previous responses, and asked which version they prefer. The findings show the relative importance of each feature, and the tool allows posing many “what if” scenarios.
In conclusion, measuring importance doesn’t have to lead to frustration with useless data. Consider these alternatives when constructing the research program and the survey instrument, and you can generate useful, meaningful data for business decisions. My fallback is the multiple-choice checklist approach: simple for the respondent, yet it provides meaningful information for management.