Idea clustering: the wisdom of crowds, or reversion to mediocrity?

August 15, 2016

Simon Elliston Ball

Senior Product Manager at AWS

Simon is a data scientist, with a background in product management, and has worked for a range of data technology companies on both sides of the fence including vendors like Hortonworks and Red Gate Software and numerous heavy data users in retail, hedge funds, and web. He focuses on big data analytics, machine learning, and how these can be used to make better products.

When you’re looking for new ideas, look beyond yourself. Collaboration and teamwork are great ways to generate them. A popular approach is to brainstorm ideas, then aggregate them. Write a thought, whatever comes into your head, on a Post-It note. Now, stop. Let’s put them all together on the whiteboard. If you’ve got the same idea as someone else, cluster them together. Bingo: the complete aggregated set of ideas from the whole team, with a range of inputs, experience and multiple perspectives.

But people like to agree; they like to fit into groups. As soon as you switch from individual idea generation to group discussion, social dynamics kick in. People want to agree that idea A really is the same as idea A+1/2. Clustering ideas has the effect of yielding the lowest common denominator, not the simplest form that expresses them and exposes them in their beautiful complexity.

A lot of data science use cases come down to a similar form of clustering. This is especially true in unsupervised learning, where you don’t have a nice set of known outcomes. A common algorithm is k-means, which identifies a preset number of clusters by assigning each data point to the nearest of a set of cluster centres, then moving each centre to the mean of the points assigned to it. The goal of such a process is to reduce huge numbers of data points and summarise them. You throw away the huge amount of detail in the original data to try and create a more workable summary set of points.
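To make the mechanics concrete, here is a minimal sketch of k-means in plain Python. It is an illustration under stated assumptions rather than a production implementation: the function name `kmeans`, the sample points, and the naive choice of the first `k` points as starting centres are all invented for this example.

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means sketch: returns (centres, labels) for a list of coordinate tuples."""
    # Assumption: naively start from the first k points. Real libraries
    # use smarter initialisation schemes.
    centres = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centre where it is
                centres[c] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return centres, labels

# Hypothetical data: two loose groups of 2-D points, interleaved.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
centres, labels = kmeans(points, k=2)
print(labels)   # which centre each point was summarised by
print(centres)  # six points reduced to two
```

Note what comes out the other end: six points go in, two centres come out, and everything that distinguished the points within a group is gone from the summary.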

This is exactly what we’re trying to do with our idea clustering. However, once a complex idea has been compressed into a few quick words on a Post-It, we’ve already lost information. When we cluster that idea with another that is close enough, just as k-means does with data points, we lose more. This can be a good way of getting a summary, but it always runs up against another problem familiar to the data scientist: reversion to the mean.

Losing key insights

Every time we simplify the idea, cluster the simplifications, and throw away that oh-so-very-important difference that might well have been the subtle but key insight behind a new product, we increase our bias. We risk simplifying not to a core differentiating feature that makes a killer product, but to the mediocrity of me-too. Reversion to the mediocre mean.

As a product manager, I found these idea clustering sessions almost invariably underwhelming. The output would just be a whole bunch of the obvious. Maybe a list of categories to care about, but not much else. The value came in the subtlety: in the outliers, the genuinely different, but also in the rarely captured subtle differences between ideas that got piled together. It’s inside the clusters that the key insights that take a product from good to great were buried. So as any data scientist would tell you, the cluster model is great, but always make sure you look at the residuals. There might just be insight hiding there.
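For the data side of that advice, one simple reading of "look at the residuals" is to measure how far each point sits from the centre that is supposed to summarise it, then inspect the worst-summarised points first. The sketch below assumes a clustering has already been done; the `residuals` helper, the points, the centres and the labels are all made up for illustration.

```python
import math

def residuals(points, centres, labels):
    """Distance from each point to the centre of the cluster it was assigned to."""
    return [math.dist(p, centres[label]) for p, label in zip(points, labels)]

# Hypothetical clustering result: five points summarised by two centres.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (3.0, 4.0)]
centres = [(1.0, 1.0), (4.5, 4.8)]
labels = [0, 0, 0, 1, 1]

res = residuals(points, centres, labels)
# Rank points by how badly their centre represents them, worst first.
ranked = sorted(zip(res, points), reverse=True)
worst_residual, worst_point = ranked[0]
print(worst_point, round(worst_residual, 2))
```

Here the point at (3.0, 4.0) was filed under the second cluster, but it sits a long way from that cluster's centre. That is the Post-It that got piled onto a stack it did not really belong to, and the one worth reading again.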
