MaxDiff Analyzer

MaxDiff Analyzer Help

This tool is for analyzing results of MaxDiff (Maximum Difference Scaling) questionnaires (also known as best/worst scaling) as well as other general data (like Likert Scale data). It can:

  • Display average values for the sample, or by segment
  • Conduct simulations, projecting "market choices"
  • Optimize portfolios of items to "reach" respondents, via TURF analysis

There are three different types of projects available within the MaxDiff Analyzer:

  • MaxDiff
  • Anchored MaxDiff
  • General

Use the "MaxDiff" project type when you are analyzing standard MaxDiff scores. This project type will rescale the results and enable other special features that only work with MaxDiff scores.

Use the "Anchored MaxDiff" project type when you are analyzing Anchored MaxDiff scores. With Anchored scaling, the Anchor utility (must be the last utility in the file) usually represents the utility threshold (boundary) between important/not important, preferred over status quo/not preferred over status quo, or buy/no buy items.

The "General" project type will work with most other data that are to be analyzed in the simulator or TURF analyzer. Likert scale data is one example of such data that could be analyzed with a "General" project.

Sawtooth Software has two main ways to compute individual-level (HB) scores for MaxDiff experiments:

  • MaxDiff/Web software (Analysis + MaxDiff Scores)
  • Using CBC/HB software (analyzing a .CHO file produced by the MaxDiff Experiment Designer software)

In either case, when the scores are computed, a .csv file is created containing case ids, scores for items in your experiment, and the item labels.

Browse to the .csv file containing the scores so that the MaxDiff Analyzer can import them for analysis.

You can find more information about the file formats here. (This will be helpful if you have computed scores using a different method or if you would like to customize the contents of the file before you upload it.)

If you used the CBC/HB method to generate scores, the "reference item" (the last item in your study) is not explicitly represented in the .csv file (though it was indeed part of your experiment). You are asked to supply its label (since that label isn't available in the .csv file). During import, the MaxDiff Analyzer treats that final item as the reference item, with a raw score of zero.

So that the Analyzer can convert the raw scores to probabilities true to respondents' choices within the context of the questionnaire, you need to indicate how many items were shown per MaxDiff set. For example, it is typical to show 5 items at a time in each MaxDiff question. If you showed 5 items per set, indicate a "5."

The "Scores" tab reports the average scores for the items across respondents. Also, the 95% confidence interval for each is displayed.

Rescaled Scores: These are "Probabilities of Choice" (described below) that have been rescaled to sum to 100 for each respondent. These data reflect a ratio-quality scale, allowing one to conclude (for example) that an item with a score of 10 is twice as important/preferred as an item with a score of 5.

Probability of Choice: These are probabilities (ranging from 0 to 100) that reflect the likelihood that an item would be selected as "best" among a representative set of items in the MaxDiff questionnaire. For example, if you showed 5 items per set, the Probability of Choice for an item is the average likelihood that respondents would select this item as "best" when compared to 4 other items of average importance/preference (among those included in the questionnaire). These data reflect a ratio-quality scale.

Raw Scores: These are the scores directly resulting from the HB estimation and are logit-scaled (an interval-quality scale). These scores are zero-centered within each respondent, so their average is zero. Interval-quality scales do not allow us to make ratio-quality judgments, such as saying that an item with a raw score of 2 is twice as important/preferred as an item with a score of 1.
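To make the relationship between these scales concrete, here is a minimal sketch (in Python, with hypothetical function and variable names; it is not the Analyzer's own code) that converts one respondent's raw scores into Probabilities of Choice and Rescaled Scores, assuming 5 items were shown per set:

import numpy as np

def probability_of_choice(raw_scores, items_per_set):
    # Likelihood (0-100) of picking each item as "best" against
    # items_per_set - 1 other items of average (zero) utility.
    exp_u = np.exp(raw_scores)
    return 100 * exp_u / (exp_u + items_per_set - 1)

def rescaled_scores(raw_scores, items_per_set):
    # Probabilities of Choice rescaled to sum to 100 for the respondent.
    p = probability_of_choice(raw_scores, items_per_set)
    return 100 * p / p.sum()

# One respondent's zero-centered raw (logit) scores for 6 items,
# from a questionnaire that showed 5 items per set.
raw = np.array([1.2, 0.4, 0.0, -0.3, -0.5, -0.8])
print(probability_of_choice(raw, 5))
print(rescaled_scores(raw, 5))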

The 95% confidence interval provides an indication of how much certainty we have regarding our estimate of the item's score. The interpretation is this: if we were to repeat the experiment many, many times (drawing a new random sample in each case), the population's true mean would fall within the computed confidence interval in 95% of the experiments. In other words, we are 95% confident that the true mean for the population falls within the 95% confidence interval (again, assuming unbiased, random samples). The 95% confidence interval is computed by taking the item's mean, plus or minus 1.96 times its standard error. The standard error for each score is computed by dividing its standard deviation by the square root of the sample size.
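As a small illustrative sketch (again hypothetical, not the Analyzer's code), the interval for one item can be computed from its column of respondent-level scores as follows:

import numpy as np

def confidence_interval_95(scores):
    # scores: one item's scores, one value per respondent
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))   # standard error
    return mean - 1.96 * se, mean + 1.96 * se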

The "Scores" tab reports the average scores for the items across respondents. Also, the 95% confidence interval for each is displayed.

With Anchored scaling, the Anchor utility usually represents the utility threshold (boundary) between important/not important, preferred over status quo/not preferred over status quo, or buy/no buy items.

Zero-Anchored Interval Scale: These scores set the Anchor item equal to zero and the range of scores equal to 100 for each respondent. Negative scores are below the anchor (i.e. the important/not important threshold) and positive scores are above the anchor. This scaling method has the advantage that each respondent has equal weighting toward population means, making it more proper to compare individuals or groups of respondents on the item scores than for the other scales. But, interval-quality scales do not allow us to make ratio-quality judgments, such as saying that an item with a raw score of 2 is twice as important/preferred as an item with a score of 1.

Probability Scale (Anchor=100): These are ratio scaled scores for items where the Anchor item is equal to 100 for each respondent. All scores are positive, where those below 100 are below the anchor threshold of utility (i.e. the important/not important threshold). These data reflect a ratio-quality scale, allowing one to conclude (for example) that an item with a score of 10 is twice as important/preferred as an item with a score of 5. The ratio differences are directly tied to the probabilities of choice as reflected in the context of the MaxDiff questionnaire. One disadvantage of this scale is that each respondent does not receive equal weighting when computing population means. Respondents who believe all items fall below the threshold have a maximum score of 100, whereas respondents who believe some items exceed the Anchor utility have a maximum score of the number of items shown in each set * 100.

Probability of Choice vs. Anchor (Anchor=50): These are probabilities (ranging from 0 to 100) that reflect the likelihood that an item would be selected as "best" compared to the Anchor item. These data reflect a ratio-quality scale. A score of 50 indicates the item is equal in utility to the Anchor item. A score of 90 means the item has a 90% likelihood of being selected instead of the Anchor item. A score of 10 means the item has a 10% likelihood of being selected instead of the Anchor item. If the anchor is a buy/no buy threshold, then the score represents the likelihood of purchase. The main disadvantage of this scale is that the ratio scaling is only consistent with the probabilities of choice expressed by respondents in the questionnaire in the case of 2 items shown per set (method of paired comparisons). When using this scale for MaxDiff questionnaires that have shown more than 2 items per set, the ratio differences between items will be somewhat more accentuated than justified by the original choice data. A secondary disadvantage is that each respondent does not receive equal weighting when computing respondent means. Some respondents have a wider range of scores than others.

Raw Scores: These are the scores directly resulting from the HB estimation and are logit-scaled (an interval-quality scale). The Anchor item receives a score of zero, and the other items are scaled with respect to the Anchor item. Items preferred to the Anchor threshold are positive. Items not preferred to the Anchor threshold are negative. The Raw interval-quality scales do not allow us to make ratio-quality judgments, such as saying that an item with a raw score of 2 is twice as important/preferred as an item with a score of 1. The Raw Scores also have the disadvantage that some respondents may be weighted significantly more than others in calculating the population means (their scores have much larger range than other respondents).
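For readers who want to reproduce the anchored scalings outside the Analyzer, the sketch below shows two of them, the Zero-Anchored Interval Scale and the Probability of Choice vs. Anchor; the function names are hypothetical and the Analyzer's own computations may differ in detail:

import numpy as np

def zero_anchored_interval(raw_scores):
    # The anchor already has a raw score of 0; stretch the respondent's
    # scores so that the full range (max - min) equals 100.
    return raw_scores * 100.0 / (raw_scores.max() - raw_scores.min())

def prob_choice_vs_anchor(raw_scores):
    # Likelihood (0-100) that an item is chosen instead of the anchor
    # (anchor utility = 0, so e^0 = 1); a score of 50 means equal utility.
    exp_u = np.exp(raw_scores)
    return 100 * exp_u / (exp_u + 1)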


Testing for Differences between Scores

The confidence interval is sometimes used as a way to test whether an item is "significantly different" from another item. The easy (but not technically correct) "eyeball" method is to observe whether the 95% confidence intervals overlap for the two items. If two items reflect a statistically significant difference using this "eyeball" test, then they will also pass the more rigorous tests described below.

A second, more technically correct way to test whether two item mean scores are "significantly different" is to divide the difference in the scores by the pooled standard error of the two items, where the pooled standard error is equal to sqrt(SEa^2 + SEb^2), SEa being the standard error of the first item and SEb the standard error of the second item. If the resulting T-value is greater than 1.96 in absolute magnitude (assuming large sample), we are 95% confident that the mean for one item is different from the other. However, this test assumes independent samples, when really we're dealing with matched samples, and could use an even more sensitive test.

An even more sensitive and technically correct way is to use the matched samples T-test. The scores may be exported and opened in Excel. A new column is defined by taking the difference between the two scores (for each respondent). Next, the standard deviation of the values in that new column is taken, and that standard deviation is divided by the square root of the sample size, resulting in the standard error of the difference between scores. The matched samples T-value is the mean difference in scores divided by this standard error.
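Both tests can also be computed directly from exported scores. The sketch below (hypothetical function names) mirrors the calculations described above; scores_a and scores_b hold the two items' scores for the same respondents, in the same order:

import numpy as np

def pooled_t(scores_a, scores_b):
    # Independent-samples style test using the pooled standard error.
    n = len(scores_a)
    se_a = scores_a.std(ddof=1) / np.sqrt(n)
    se_b = scores_b.std(ddof=1) / np.sqrt(n)
    return (scores_a.mean() - scores_b.mean()) / np.sqrt(se_a**2 + se_b**2)

def matched_t(scores_a, scores_b):
    # Matched-samples test based on within-respondent differences.
    diff = scores_a - scores_b
    se_diff = diff.std(ddof=1) / np.sqrt(len(diff))
    return diff.mean() / se_diff

A T-value greater than 1.96 in absolute magnitude indicates a difference significant at the 95% confidence level (assuming large sample).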

You may upload a file that contains variables for use in segmenting respondents and weighting. All segmentation and weighting variables must be in the same file, formatted as .csv. The case ID should be the first column, followed by the other variables. The leading row must contain variable labels, which are read into your project.
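For example, a segmentation/weighting file might look like the following (the variable names and values here are purely hypothetical):

CaseID,Region,Gender,Weight
101,1,2,0.85
102,3,1,1.20
103,2,2,0.95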

After you have uploaded segmentation and weighting variables, you can edit their labels, including labels for categories, under the Variables tab. Click the Edit icon on the appropriate row within the Variables grid to edit any labels.

The Simulator function conducts simulations similar to conjoint-style market simulations. You select which items are to be made available to respondents (as if they were in competition with one another within a marketplace). The percent of respondents projected to "choose" each item as "best in market" is computed, according to either the first choice or the Share of Preference (logit) rules.

First Choice Rule: Each respondent "casts a vote" for the item that has the highest score within the items included in the simulation set. In the case of a tie (a rare occurrence), tied items share (divide) the vote. The interpretation is simple: "Considering only these x items, what percent of the respondents think each is best?"

Share of Preference (Logit) Rule: Respondents are allowed to split their votes across the items included in the simulation set. The probability that an item is selected is equal to the antilog of the item's raw score divided by the summation of the antilogs of the raw scores for all items in the set.

You will generally find the projected "shares of choice" a bit more extreme for the First Choice than the Share of Preference (logit) rule. You will also find that the logit rule's estimates are more precise (smaller standard errors), because more information is gleaned from each respondent. But, the First Choice rule is easier to describe to others.
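The sketch below (hypothetical names, not the Analyzer's code) shows how the two rules could be computed from a respondents-by-items matrix of raw scores:

import numpy as np

def simulate_shares(raw_scores, items_in_set, rule="share_of_preference"):
    # raw_scores: respondents x items matrix of raw (logit) HB scores
    # items_in_set: column indices of the items placed "in the market"
    scores = raw_scores[:, items_in_set]
    if rule == "first_choice":
        # Each respondent votes for his/her highest-scoring item;
        # ties share (divide) the vote.
        best = scores.max(axis=1, keepdims=True)
        votes = (scores == best).astype(float)
        votes /= votes.sum(axis=1, keepdims=True)
    else:
        # Share of Preference (logit): antilog of each score divided by
        # the sum of antilogs across the items in the simulated set.
        exp_u = np.exp(scores)
        votes = exp_u / exp_u.sum(axis=1, keepdims=True)
    return 100 * votes.mean(axis=0)   # percent "share of choice" per item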

The standard errors can be used to compute a 95% confidence interval for the estimated share of choice, by taking the estimated share of choice +/- 1.96 times the standard error.

TURF stands for "total unduplicated reach and frequency." It is an optimization approach for finding a subset of items that "reach" the maximum number of respondents possible. For example, the classic problem is one of choosing which flavors of ice cream to stock in the freezer at a grocery store. The grocer may decide that he/she has limited space and can only include up to 8 flavors of ice cream (out of 30 possible flavors). The grocer wants to maximize the chance that shoppers will find a flavor in the freezer that they like enough to buy (the "Threshold" criterion). For example, if a flavor achieves a score of either "4" or "5" on a 5-point Likert scale ("top two box"), one might decide to count the respondent as "reached." The problem isn't as simple as including the eight most preferred flavors on average across the sample: that approach would overlook niche flavors that appeal to segments of the population (and that can increase total reach).

For the ice cream example outlined above, the TURF procedure examines all possible subsets of 8 flavors of ice cream (out of 30 total flavors), and for each set counts how many respondents are "reached." The top sets of 8 flavors that maximize "reach" are listed in the output with the percent of respondents reached shown next to each.

One challenge with TURF is that many solutions typically yield essentially equal reach. However, this could be viewed as an opportunity rather than a problem. You can bring other information to bear on the decision (such as expert opinion) to help decide which set is best to solve the business problem. For example, if the grocer knew that one particular flavor (that appears in many of the top sets) tends to spoil more quickly than others, such solutions would be avoided in favor of other similar-reach solutions.

The MaxDiff Analyzer provides three different options for assessing "reach" in TURF:

First Choice: A respondent is counted as "reached" if the subset of items contains his/her top item (the item with the highest raw score or value). This option also reports a "Frequency." If a respondent has multiple top items (multiple items with the same highest raw score or value) then each top item will count as a partial "reach" (the "reach" value will be 1 / n where n is the number of top items). The "Frequency" is the number of top items in the set.

Threshold: The analyst supplies a value, indicating a threshold above which a respondent is counted as "reached." If the Probability of Choice for any item in the set exceeds the supplied threshold, the respondent is considered "reached." This option also reports a "Frequency." The "Frequency" is the number of items in the set that exceed the supplied threshold. If two sets have equal reach, the set with higher frequency should be preferred.
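As an illustration of the Threshold option, a sketch like the following (hypothetical names) computes reach and frequency for one candidate portfolio from a respondents-by-items matrix of Probabilities of Choice:

import numpy as np

def threshold_reach(prob_of_choice, portfolio, threshold):
    # prob_of_choice: respondents x items matrix on the 0-100 scale
    # portfolio: column indices of the items in the candidate set
    exceeds = prob_of_choice[:, portfolio] > threshold
    reach = 100 * exceeds.any(axis=1).mean()   # percent of respondents reached
    frequency = exceeds.sum(axis=1).mean()     # average items exceeding the threshold
    return reach, frequency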

Weighted by Probability (Standard MaxDiff Scores):

We compute reach as the probability that the respondent will choose at least one of the items in the portfolio instead of other items of average utility.

To illustrate our approach, we begin with the formula for computing the likelihood of selecting an item from a set of items (of average utility) equal in size to that shown in the MaxDiff questionnaire. We zero-center the raw HB scores such that the average item has a score of 0. Since e^0 is equal to 1, the likelihood of selecting item i from a set involving a - 1 other items of average desirability (where a is the number of items shown per set) is:

P_i = e^(U_i) / (e^(U_i) + a - 1)

It is easy to expand this equation to include more items. For example, the likelihood that a portfolio containing items i, j, and k reaches the respondent is:

P_ijk = (e^(U_i) + e^(U_j) + e^(U_k)) / (e^(U_i) + e^(U_j) + e^(U_k) + a - 1)
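A short sketch of this reach calculation (hypothetical names; raw_scores is a respondents-by-items matrix of zero-centered raw scores, and a is the number of items shown per set):

import numpy as np

def weighted_reach_standard(raw_scores, portfolio, a):
    # Probability that each respondent picks at least one portfolio item
    # rather than a - 1 items of average (zero) utility, averaged across
    # respondents and expressed as a percent.
    exp_u = np.exp(raw_scores[:, portfolio]).sum(axis=1)
    return 100 * (exp_u / (exp_u + a - 1)).mean()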

Weighted by Probability (Anchored MaxDiff Scores):

We compute reach as the probability that the respondent will choose at least one of the items in the portfolio when given the option to pick among those items versus the anchor.

To illustrate, let's consider a portfolio with a single item i. The likelihood that item i will be chosen instead of the anchor (where the anchor has a score of 0, and e^0 is equal to 1) is:

P_i = e^(U_i) / (e^(U_i) + 1)

It is easy to expand this equation to include more items. For example, the likelihood that a portfolio containing items i, j, and k would be chosen instead of the anchor is:

P_ijk = (e^(U_i) + e^(U_j) + e^(U_k)) / (e^(U_i) + e^(U_j) + e^(U_k) + 1)
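The corresponding sketch for the anchored case differs only in the denominator, where the anchor (with e^0 = 1) replaces the a - 1 items of average utility:

import numpy as np

def weighted_reach_anchored(raw_scores, portfolio):
    # Probability that each respondent picks at least one portfolio item
    # instead of the anchor (anchor utility = 0, so e^0 = 1).
    exp_u = np.exp(raw_scores[:, portfolio]).sum(axis=1)
    return 100 * (exp_u / (exp_u + 1)).mean()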

With an exhaustive search, the number of portfolios that the TURF procedure must evaluate can become extremely large. Currently, we have placed a limitation of 5,000,000 portfolios to evaluate exhaustively in any one TURF run. If you specify a run that exceeds this limit, the software will automatically switch to a stepwise algorithm (explained below). The formula to determine how many portfolios the exhaustive TURF procedure must process is as follows.

m! / (n! (m - n)!)

where:

m is the total number of items in your study
n is the size of the portfolio to optimize

For example, searching for an optimal portfolio of 5 items from 50 leads to 2,118,760 possible portfolios to evaluate, which the Analyzer will do exhaustively. But, searching for 6 items out of 50 leads to 15,890,700 possible portfolios, which exceeds the limit and will therefore be solved using the stepwise algorithm.
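These counts are ordinary binomial coefficients and can be verified directly, for example:

from math import comb

print(comb(50, 5))   # 2,118,760 -> searched exhaustively
print(comb(50, 6))   # 15,890,700 -> exceeds 5,000,000, so the stepwise algorithm is used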

One of the challenges with TURF is dealing with truly large spaces of potential portfolio combinations. For example, choosing the optimal 12 flavors out of 70 involves examining over 10 trillion possible 12-flavor combinations (if using an exhaustive algorithm which examines each possible one). Such problems are hard to solve within reasonable time limits if using exhaustive search. Researchers often rely on heuristic search algorithms that can find nearly-optimal solutions in a fraction of the time required to exhaustively search for the globally optimal solution.

For large TURF problems (where there are more than 5,000,000 possible sets to search if solving exhaustively) the analyzer will use a Stepwise algorithm. Stepwise TURF searches for the optimal portfolio in incremental steps, taking the best result from the previous step forward to the next step. For example, searching for 12 flavors out of 70 could be done in three steps:

  • Step 1: Find the best 4 flavors out of 70 using exhaustive search (916,895 portfolios to exhaustively search)
  • Step 2: Force the best 4 flavors from step 1 into the search; use exhaustive search to find the next best 4 flavors to add to those previous 4 (720,720 portfolios to examine)
  • Step 3: Force the 8 flavors from steps 1 and 2 into the search; use exhaustive search to find the next best 4 flavors to add to those previous 8 (557,845 portfolios to examine)

Breaking the problem into three steps of 4 flavors each leads to 2,195,460 portfolios to search (rather than 10 trillion), which takes less than 1 minute. However, we are not assured of finding the optimal solution. And, if we are interested in reviewing the top dozen portfolios, there isn't a guarantee that we've done a very good job at finding any of the global top dozen solutions. Early decisions in the stepwise procedure corner us into a fraction of the search space, and prohibit us from looking elsewhere for better solutions. So, we don't stop searching just yet.

To further improve upon the stepwise procedure, we take a few more seconds to examine hundreds of potential item swaps that could lead to better reach. We take the top several dozen portfolios found in the stepwise routine, and we try swapping non-included flavors for included flavors one at a time, looking for new portfolio definitions that increase the reach. It turns out that these last few seconds spent in swapping can improve the results quite dramatically.
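A rough, simplified sketch of the stepwise-plus-swaps idea follows. It is only meant to convey the logic; the Analyzer's actual implementation differs (for example, it carries the top several dozen portfolios forward from the stepwise routine and reports many near-optimal solutions, not just one):

from itertools import combinations

def stepwise_turf(reach_fn, n_items, portfolio_size, block=4):
    # Greedy stepwise search: grow the portfolio 'block' items at a time,
    # exhaustively searching only the items added in the current step.
    chosen = []
    while len(chosen) < portfolio_size:
        step = min(block, portfolio_size - len(chosen))
        remaining = [i for i in range(n_items) if i not in chosen]
        best = max(combinations(remaining, step),
                   key=lambda combo: reach_fn(chosen + list(combo)))
        chosen += list(best)

    # Swap improvement: try replacing each included item with each excluded
    # item, keeping any swap that increases reach, until no swap helps.
    improved = True
    while improved:
        improved = False
        for out in [i for i in range(n_items) if i not in chosen]:
            for pos in range(len(chosen)):
                trial = chosen[:pos] + [out] + chosen[pos + 1:]
                if reach_fn(trial) > reach_fn(chosen):
                    chosen, improved = trial, True
                    break
            if improved:
                break
    return chosen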

If you want to prohibit certain items from occurring within a portfolio, add any prohibited combinations within the text box. Specify one prohibited combination per line, with items separated by spaces, tabs, commas, or semicolons.

For example, if items 5 and 6 cannot occur together, add the following line to the dialog:

5 6

If items 9 and 10 also are prohibited, then specify these two rows:

5 6
9 10

More than two items may be entered per row. For example if items 11, 12, and 13 are not to occur within the same portfolio, specify:

11 12 13

Only portfolios containing ALL three of these items would be prohibited. A portfolio containing two of these items (e.g. 11 and 12) is allowed.

To exclude single items from occurring within a portfolio, specify one item per line. For example, if items 11, 12, and 13 cannot occur in any way in the portfolio, specify:

11
12
13

There is virtually no limit to the number of prohibitions you can supply.
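For illustration, here is a small sketch (hypothetical function names) of how such prohibition lines could be parsed and applied when screening candidate portfolios:

def parse_prohibitions(text):
    # Each line lists one prohibited combination; items may be separated
    # by spaces, tabs, commas, or semicolons.
    prohibitions = []
    for line in text.splitlines():
        items = line.replace(",", " ").replace(";", " ").split()
        if items:
            prohibitions.append(set(int(i) for i in items))
    return prohibitions

def is_allowed(portfolio, prohibitions):
    # A portfolio is prohibited only if it contains ALL the items
    # of some prohibited combination.
    return not any(p.issubset(portfolio) for p in prohibitions)

rules = parse_prohibitions("5 6\n9 10\n11 12 13")
print(is_allowed({5, 7, 9, 11, 12}, rules))   # True: no prohibition fully contained
print(is_allowed({5, 6, 20}, rules))          # False: contains both 5 and 6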