In our last few blog posts, we’ve been describing how marketers are turning to social media research to uncover deep insights about their consumers. As with any research, the quality of the insights depends on the quality – not necessarily quantity – of the underlying data.
Let’s say you want to benchmark your brand against competitors and against itself so you can track brand health over time. Your initial findings show dramatic spikes in buzz for your brand, suggesting certain promotions are working. Months later you want to invest more in this research and mine the data for qualitative insights, but wait, where are the messages? You realize that 50% of the data you were basing decisions on was SPAM generated by bots, 3% of the messages were duplicates of each other, and 10% had absolutely nothing to do with your brand. Lots of social media data for the analysis? Yes. Relevant, trustworthy, reliable findings? No.
All the data in the world is meaningless unless you know it is clean, reliable and accurate. You need the reassurance that what you’re collecting won’t skew your findings and result in poor decision-making. To help, we put together a short guide of what to look for when evaluating your social data options. Social media’s big advantage is that it represents authentic consumer expressions. However, there are a number of data issues to watch out for. Let me take you through some pitfalls and how to work around them.
How do I know what true, quality social media data looks like?
To be a serious marketing player in social media research, you need to ensure you’re getting the best combination of reach and accuracy. Reach means pulling in every piece of content available. Accuracy means balancing that breadth with relevancy. Here are the critical questions you should ask a company:
• How is the data collected? Is it all machine-based or is there a human element?
• How flexible is the data-mining technology?
• What type of SPAM detection filters are employed?
• Does the data come from third-party sources?
• How are new sites detected? How quickly are they brought online?
• How flexible are keyword searches and search filters?
Data Collection
Most data-mining applications work the same way – they send out spiders that crawl the online world, pulling back everything they can get, but they often can’t get to everything. That’s why you need a human component to identify the sites that don’t get captured by web crawlers. Many popular sites also frequently change the way companies can collect their data, which can impact data integrity. To avoid this, you need a team on hand for real-time manual adjustments so there’s no interruption in data collection.
The flexibility of the data-mining technology also comes into play. Many platforms employ a one-size-fits-all approach: when the tool starts crawling a particular board, it won’t collect all the messages, just the ones it sees first until it hits its message threshold. For these instances, there should be a specialized team that can reconfigure the tool to capture all the messages.
Be wary of companies that rely on third-party sources for data; research indicates that 80% of players in the space use third-party commodity data providers. Third parties aren’t always reliable; they can drop sources without notice, which will skew your findings. They also apply a ‘lazy man’s’ approach to data collection: they pull in as much volume as they can using collection services, without considering the relevance or reach of a source, and possibly missing highly relevant, industry-specific sources that collection services don’t capture. The signal-to-noise ratio doesn’t look good here.
Data Hygiene
Effective SPAM detection is critical. All collected data should pass through a machine-learning algorithm that has been fed thousands of messages that qualify as SPAM; it learns what different types of SPAM messages look like and weeds them out of the dataset. This detection is effective for blogs, boards and groups, but Twitter is a whole other beast. Twitter SPAM detection should be rule-based. Rule-based detection effectively weeds out messages from bots by checking a user’s profile for certain characteristics, such as no followers or a handle with no associated name.
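To make the rule-based idea concrete, here’s a minimal sketch of the two profile checks mentioned above – no followers, and a handle with no associated name. The dictionary field names ("followers", "name") are illustrative placeholders, not Twitter’s actual API:

```python
def looks_like_bot(profile):
    """Apply simple rules to a Twitter-style profile dict.
    Field names ("followers", "name") are illustrative, not Twitter's API."""
    # Rule 1: the account has no followers
    if profile.get("followers", 0) == 0:
        return True
    # Rule 2: the handle has no associated name
    if not profile.get("name", "").strip():
        return True
    return False

# Filter a batch of collected messages down to non-bot authors:
messages = [
    {"text": "Great deal on lattes!!!", "author": {"followers": 0, "name": "x9_bot"}},
    {"text": "Loved my vanilla latte", "author": {"followers": 231, "name": "Jane D."}},
]
clean = [m for m in messages if not looks_like_bot(m["author"])]
# clean keeps only the message whose author passes both rules
```

A real system would layer many more rules (and feed the results back into the machine-learning filter), but the shape is the same: cheap profile checks that run before any content analysis.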
Companies that offer flexible keyword (classifier) tools and filtering options provide an extra layer of accuracy and relevancy protection. For example, research indicates that in healthcare, 92% of cleaned messages are still irrelevant – so you need this flexibility. Not all SPAM can always be removed, and believe it or not, not all messages that contain your product or brand name are relevant, especially if the name includes a common term (e.g., analysis on the Snickers candy bar). Tools that use Boolean logic allow you to get extremely specific about what you pull in. An added bonus is the availability of proximity operators: you can tell the tool that you want “x” to appear within a certain number of words of “z.” For example, if you want information on vanilla lattes, you tell the tool that “vanilla” has to appear within 3 words of “latte,” which accounts for phrasings such as “I had a latte; it was vanilla” or “that vanilla spice latte was so delicious!” And tools that allow you to apply segment filters give you even more relevance: segment filters restrict your search to sources known to focus on specific topics or comprised of certain target demographics. See figure below:

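For illustration, the vanilla-latte proximity match could be sketched in a few lines of Python. This is a simplified stand-in for the NEAR-style operators these tools provide, not any vendor’s actual implementation:

```python
import re

def near(text, term_a, term_b, max_gap=3):
    """True if term_a appears within max_gap words of term_b (either order)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

# Both of the example phrases put "vanilla" within 3 words of "latte":
near("that vanilla spice latte was so delicious!", "vanilla", "latte")  # True
near("I had a latte; it was vanilla", "vanilla", "latte")               # True
```

A plain Boolean AND would also match a message where the two words are paragraphs apart; the word-distance check is what trades a little reach for a lot of relevance.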
Finally, another bonus is analyst input. If lots of people work with the social media data on a daily basis, it’s easier to spot when a site isn’t being crawled properly or the data isn’t entirely clean. With this constant feedback, the crawlers and machine-learning SPAM detection can be updated quickly.
How has the quality of data your company worked with affected your results?
