In the past year, we have seen countless headlines about how artificial intelligence (AI) will transform business. AI promises to provide insight into data and customers at a level of individualization never seen before. In response, many companies are scrambling to capture and store as much data as possible – but in doing so they might be increasing their exposure to data breaches, privacy violations, and hacks.
Unfortunately, by taking a standard “machine learning only” approach to AI, we may not get far out of the starting blocks toward an AI solution that understands data with high fidelity. Many people assume that storing and analyzing large amounts of information (“big data”) through machine learning is the only way to take advantage of AI. But machine learning approaches can actually be ineffective at understanding the meaning of text or the interests of individuals with any real specificity. Any company serious about AI needs to develop a solution that is both more targeted and more secure. I believe the way forward lies in integrating small data analysis into a big data approach.
Here are a few reasons to consider small data:
Big data techniques can be expensive and ineffective at high levels of specificity: Just as satellite imagery gives only a broad view of a physical lake, today’s big data approaches give only a broad view of a data lake. When statistical methods of AI are applied to a big data environment, the output is usually very generalized and lacks fidelity. For example, a statistical model looking at data about sports fans may see a pattern that groups people into categories such as “baseball enthusiast” or “football enthusiast.” These broad categories lose sight of the fact that some users are actually pitching enthusiasts, statistics junkies, or part-time umpires. Knowledge of these narrower topics would be extremely useful to advertisers of niche products, yet today’s big data platforms are very limited in identifying and exposing these higher-fidelity interest categories, because processing and storage become increasingly expensive and complex as large amounts of data are analyzed at higher levels of specificity.
Integrating small data analysis is the key to making AI meaningful: Small data simply refers to a small quantity of data available to train models. It is often defined as the amount of information that can be processed by a single computer, but it can be much smaller than that – a spreadsheet, a document, an article, or even a single social media post. “Small data” can even be found within large data sets. Instead of applying statistical collaborative filtering across a group of people to infer broad, hit-or-miss interests, an approach that applies semantic or symbolic techniques to small data can look at an individual and understand exactly what they are interested in, at any level of specificity. In our baseball example, a small data approach would analyze the meaning and context of a person’s blog or social media post and pick up the nuance between someone who likes statistics and someone who is interested in pitching techniques (see the sketch after this list).
Small data approaches increase explainability and reduce potential for bias: One of the criticisms of AI is that it operates in a “black box,” where it can be difficult to determine the reasoning behind a specific output. Numerous organizations – including the National Institute of Standards and Technology (NIST) – have called for a more balanced and thoughtful approach to developing AI solutions, to ensure they are trustworthy and explainable. AI outputs based on small data are inherently easier for humans to interpret. AI systems that analyze and categorize users based on large data sets also risk introducing biases over time – a problem that can be mitigated by integrating analysis of small data, which can serve as a self-correction against bias.
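To make the contrast concrete, here is a minimal Python sketch of the two approaches described above. Everything in it is hypothetical – the interaction counts, the keyword taxonomy, and the function names are illustrative stand-ins, and simple keyword matching stands in for richer semantic or symbolic analysis. The point is only that the aggregated view can say little more than “baseball enthusiast,” while a single post is enough to surface a niche interest such as baseball statistics.

```python
# Illustrative sketch only: hypothetical data and a hand-built taxonomy,
# not a production recommendation system.

# --- Big data style: statistical grouping over aggregated interaction counts ---
# Each user has already been reduced to coarse per-sport counts, so the best
# we can recover is a broad label such as "baseball enthusiast".
user_interactions = {
    "user_a": {"baseball": 42, "football": 3},
    "user_b": {"baseball": 5, "football": 37},
}

def broad_category(counts):
    # Pick the sport with the most interactions; any nuance below the sport
    # level has already been aggregated away.
    sport = max(counts, key=counts.get)
    return f"{sport} enthusiast"

# --- Small data style: symbolic analysis of a single post ---
# A small keyword taxonomy stands in for a richer semantic model.
NICHE_TOPICS = {
    "pitching": ["pitching", "fastball", "curveball", "bullpen"],
    "baseball statistics": ["OPS", "WAR", "sabermetrics", "batting average"],
    "umpiring": ["umpire", "strike zone", "called strike"],
}

def niche_interests(post_text):
    # Return every niche topic whose keywords appear in this one post.
    text = post_text.lower()
    return [topic for topic, keywords in NICHE_TOPICS.items()
            if any(kw.lower() in text for kw in keywords)]

if __name__ == "__main__":
    print(broad_category(user_interactions["user_a"]))  # baseball enthusiast
    post = "Comparing OPS and WAR is the best part of sabermetrics for me."
    print(niche_interests(post))                         # ['baseball statistics']
```

The keyword rules here are also what makes the output explainable: each label can be traced back to the exact words in the post that triggered it, rather than to an opaque statistical score.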
AI has huge potential to augment our human intelligence and make us more productive. The importance and power of small data for AI are still only beginning to be understood, but the idea will gain momentum as businesses and consumers increasingly expect greater relevance and security from AI. Even Eric Schmidt, former CEO of Google, recently tweeted, “AI may usher in the era of ‘small data’ – smarter systems can learn with less to train on.”
The current model of statistical analysis of big data is ‘good enough’ for now, but it is not sustainable. For AI to be truly relevant, efficient, and safe, big data must be balanced with robust small data processing.
Editor's Note: This blog was originally published on August 27, 2018. It was updated on September 4, 2024.