In our last few blog posts, we’ve been describing how marketers are turning to social media research to uncover deep insights about their consumers. As with any research, the quality of the insights depends on the quality – not necessarily quantity – of the underlying data.
Let’s say you want to benchmark your brand against competitors and against its own past performance so you can track brand health over time. Your initial findings show dramatic spikes in buzz for your brand, suggesting certain promotions are working. Months later you want to invest more in this research and analyze the data for qualitative insights, but wait, where are the messages? You realize that 50% of the data you were basing decisions on was SPAM generated by bots, 3% of the messages were duplicates of each other and 10% had absolutely nothing to do with your brand. Lots of social media data for the analysis? Yes. Relevant, trustworthy, reliable findings? No.
All the data in the world is meaningless unless you know it is clean, reliable and accurate. You need reassurance that what you’re collecting won’t skew your findings and lead to poor decision-making. To help, we put together a short guide to what to look for when evaluating your social data options. Social media’s big advantage is that it represents authentic consumer expressions. However, there are a number of data issues to watch out for. Let me take you through some pitfalls and how to work around them.
How do I know what true, quality social media data looks like?
To be a serious marketing player in social media research, you need to ensure you’re getting the best combination of reach and accuracy. Reach means pulling in every single piece of content possibly available. Accuracy means balancing the breadth with relevancy. Here are the critical questions you should ask a company:
• How is the data collected? Is it all machine-based or is there a human element?
• How flexible is the data-mining technology?
• What type of SPAM detection filters are employed?
• Does the data come from third-party sources?
• How are new sites detected? How quickly are they brought online?
• How flexible are keyword searches and search filters?
Data Collection
Most data-mining applications work the same way: they send out spiders that crawl the online world and pull back everything they can get, but they often can’t get to everything. That’s why you need a human component to identify the sites that web crawlers don’t capture. Many popular sites also frequently change the way companies can collect their data, which can affect data integrity. To avoid this, you need a team on hand to make manual adjustments in real time so there’s no interruption in data collection.
The flexibility of the data-mining technology also comes into play. Many platforms employ a one-size-fits-all approach: when they start crawling a particular board, they don’t collect all the messages, just the ones they see first, until they hit their message threshold. For these instances, there should be a specialized team that can reconfigure the tool to capture all the messages.
Be wary of companies that rely on third-party sources for data; research indicates that 80% of players in the space use third-party commodity data providers. Third parties aren’t always reliable; they can drop sources without notice, which will skew your findings. They also apply a ‘lazy man’s’ approach to data collection: they pull in as much volume as they can using collection services without considering the relevance or reach of a source, and they may miss highly relevant, industry-specific sources that collection services don’t capture. The signal-to-noise ratio doesn’t look good here.
Data Hygiene
Effective SPAM detection is critical. All data collection should use a machine-learning algorithm. A machine-learning system is fed thousands of messages that qualify as SPAM; it then learns what different types of SPAM messages look like and weeds them out of the dataset. This detection is effective for blogs, boards and groups, but Twitter is a whole other beast. Twitter SPAM detection should be rule-based. Rule-based detection weeds out messages from bots by checking a user’s profile for certain characteristics, such as no followers or a handle with no associated name.
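To make the rule-based idea concrete, here is a minimal Python sketch. It uses only the two example rules mentioned above (no followers, or a handle with no associated name). The `Profile` class, its field names and the sample messages are hypothetical, made up for illustration; a real filter would likely combine more rules and tune them against real data.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    # Hypothetical profile record; field names are assumptions for this sketch.
    handle: str
    display_name: str  # empty string when the handle has no associated name
    followers: int


def looks_like_bot(profile: Profile) -> bool:
    """Flag a profile using the two example rules from the text."""
    no_followers = profile.followers == 0
    no_name = not profile.display_name.strip()
    return no_followers or no_name


# Made-up sample data: (author profile, message text)
messages = [
    (Profile("coffee_fan", "Dana R.", 212), "That vanilla spice latte was so delicious!"),
    (Profile("x9k2_promo", "", 0), "WIN A FREE PHONE click here"),
]

# Keep only messages whose author does not trip a rule.
kept = [text for profile, text in messages if not looks_like_bot(profile)]
```

Treating either rule as sufficient is a deliberately aggressive choice for the sketch; requiring both rules to match would flag fewer genuine new users at the cost of letting more bots through.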
Companies that offer flexible keyword (classifier) tools and filtering options provide an extra layer of protection for accuracy and relevancy. For example, research indicates that in healthcare, 92% of cleaned messages are still irrelevant, so you need this flexibility. Not all SPAM can be removed and, believe it or not, not every message that contains your product or brand name is relevant, especially if the name includes a common term (e.g., an analysis of the Snickers candy bar).
Tools that use Boolean logic allow you to get extremely specific about what you want to pull in. An added bonus is the ability to use proximity operators, which let you tell the tool that “x” must appear within a certain number of words of “z.” For example, if you want information on vanilla lattes, you tell the tool that “vanilla” has to appear within 3 words of “latte” to account for phrases such as “I had a latte; it was vanilla,” or “that vanilla spice latte was so delicious!” Tools that allow you to apply segment filters give you even more relevance: segment filters let you search only sources that are known to focus on specific topics or that are made up of certain target demographics. See figure below:
Finally, another bonus is analyst input. When lots of people work with the social media data on a daily basis, it’s easier to spot sites that aren’t being crawled properly or data that isn’t entirely clean. With this constant feedback, the crawlers and machine-learning SPAM detection can be updated quickly.
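To illustrate the proximity operators described earlier, here is a small Python sketch of a “within N words” check, using the vanilla latte example. The function name `within_n_words` and its parameters are hypothetical and do not reflect any particular tool’s syntax. The sketch matches exact words only, so “lattes” would not match “latte”; a real tool would probably handle plurals and other variants.

```python
import re


def within_n_words(text: str, term_a: str, term_b: str, max_distance: int) -> bool:
    """Return True if term_a and term_b occur within max_distance words of each other."""
    # Lowercase and split into simple word tokens, dropping punctuation.
    words = re.findall(r"[a-z0-9']+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # Order does not matter: either term may come first.
    return any(abs(i - j) <= max_distance for i in positions_a for j in positions_b)


samples = [
    "I had a latte; it was vanilla",
    "that vanilla spice latte was so delicious!",
    "Vanilla ice cream is great, but I would rather order a plain latte",
]

# Keep only the messages where "vanilla" appears within 3 words of "latte".
relevant = [s for s in samples if within_n_words(s, "vanilla", "latte", 3)]
```

Here the first two samples pass the check and the third is filtered out, because its two terms are far apart and refer to different things.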
How has the quality of the data your company works with affected your results?