Data
Whether an AI system is in development or in use, the quality of its data is paramount to its success. Beneath the information stored in digital data lies an unexpected wealth of technical decisions and design. A deeper understanding of the choices behind data design can reveal the lens through which AI “sees” the world. These choices matter for policy, not only because many data standards are mandated by policy, but because they can influence and even dramatically change outcomes.
Updated September 2024
This part of the AI Policy Guide covers how data are used to both train AI systems and form their inferences in real time. Learn what digital data are, what data qualities matter, and the challenges data create.
Data serve two high-level purposes in artificial intelligence (AI) systems. The first is input: data are the digital raw material used to train models during the machine learning process, as well as the input on which trained models make inferences. The second is output: the results of inference serve the practical purposes of users and can be recycled as input to further refine model performance.
Several design choices of the dataset—such as volume, data selection, and the removal of outliers—shape the nature of AI systems. The technical form of digital data files also matters. The resolution of a photo, the compression of digital music, and unseen metadata all shape what information an AI system can process during learning or inference. To understand how microchips and algorithms shape AI, policymakers must first grasp the fundamental importance of data.
Data have many important aspects:
Through the training process, machine learning models use data to refine their inferences.
When deployed, trained models use input data to make inferences, which can be translated into predictions and decisions.
Many machine learning approaches require large volumes of data to train AI models.
Machine learning approaches with small data are emerging to enable success without big data.
The variety of data can be just as important as the volume. With diverse and representative data, systems can better account for real-world diversity and complexity.
A diversity of data storage, warehousing, and collection systems is an important consideration in understanding AI governance.
The data used to train and operate systems are often the result of human curation, labeling, and cleaning. Human curation of data can lead systems to reflect biases. Some systems may perpetuate negative biases, whereas some might be more objective.
Data Basics
Whether an AI system is in development or in use, the quality of its data is paramount to success. Selecting high-quality data for AI is challenging because it requires balancing multiple competing factors, including volume, variety, and velocity. These qualities together are sometimes referred to as the “three V’s.”
“We don’t have better algorithms. We have more data.”
Data Volume
Determining the ideal data volume, or the quantity of data relative to the model’s needs, has become a central question of machine learning. To train AI systems, there are two emerging approaches: big data and small data.
The big data approach is likely the most familiar. To train an AI system, vast stores of data are funneled into the model, which learns from that data and refines itself over time. Although this process does not always work in practice, the hope is that with enough data, the model eventually arrives at an optimal form with powerful predictive capabilities.
The famed ImageNet database illustrates the power that large and diverse datasets can provide. Introduced in 2009, ImageNet included more than 14 million images and was conceived on the premise that progress in AI image recognition was a matter of more data, not improved algorithmic design. This approach proved successful. Massive data accelerated improvements in computer image recognition: the accuracy of models using ImageNet jumped from a modest 72 percent in 2010 to 96 percent in 2015, surpassing average human accuracy within just five years. Such results are rooted in the volume of this database.
Although the ImageNet approach to image recognition benefited from millions of data points, the exact volume required for machine learning training is not standardized. Note that image recognition is a narrow, single-purpose application of this technology, yet it still required vast troves of data. For more complex systems, such as driverless vehicles or chatbots, the volume of data is likely orders of magnitude larger. Estimating how much data are enough is a moving target and heavily depends on the application complexity, model size, accuracy requirements, and other goals. Progress has been made toward defining the relationship between algorithms and data requirements; however, current models are still speculative. In practice, engineers often depend on soft rules of thumb rather than empirically tested processes.
Trending against big data approaches are the increasingly common small data strategies. These can be used in scenarios where data are limited, spotty, or even unavailable. Small data strategies use a variety of techniques to overcome data limitations, including transfer learning, where a model “inherits” learned information from previously trained models; synthetic data, where representative data are artificially generated; and Bayesian methods, where models are coded with “prior information” that provides problem context before learning begins, thereby shrinking the overall learning challenge. Given predictions that high-quality data may soon be exhausted, such strategies could augment or maximize the value of existing human-generated data.
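To illustrate how “prior information” can shrink the learning challenge, here is a minimal sketch of one Bayesian technique (a Beta prior on a success rate). The function name and all numbers are hypothetical and chosen only for illustration; real systems use far richer priors.

```python
def posterior_mean(prior_successes, prior_failures, successes, failures):
    """Posterior mean of a success rate under a Beta prior (Beta-Binomial model)."""
    a = prior_successes + successes
    b = prior_failures + failures
    return a / (a + b)

# Hypothetical small dataset: 3 successes in 3 trials.
successes, failures = 3, 0

# Without a prior, the estimate is an overconfident 100 percent.
raw_estimate = successes / (successes + failures)

# Hypothetical prior knowledge, worth 5 successes and 5 failures,
# pulls the estimate toward a more plausible value.
informed_estimate = posterior_mean(5, 5, successes, failures)

print(raw_estimate, round(informed_estimate, 3))
```

With only three observations, the prior does most of the work; as more data arrive, the observed counts increasingly dominate the estimate.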
In 2018, DeepMind’s AlphaZero demonstrated how an AI system could master chess, Shogi, and Go through self-play—learning without any input data apart from the game rules. The system bested all existing big data–trained systems, challenging the assumption that more data are always better. Although AlphaZero’s design is not universally applicable, it demonstrates the potential of small data AI to transform future AI development. This ability to make accurate inferences without having seen explicit examples is called zero-shot learning, while few-shot learning is the ability to make inferences based on a handful of examples. Both are considered rare yet highly desirable qualities essential for model flexibility and robustness.
Variety
Variety is just as important as volume. The problems that AI systems face are often complex and, in theory, a great variety of data can help models account for the unique wrinkles and corner cases that complexity brings. Flexibility is essential to AI quality and ensures that systems are robust in the face of the unexpected. A classic example illustrating the importance of variety is the facial images used to train facial recognition algorithms. Human faces come in many varieties, and to perform accurately, an algorithm should be trained on data containing a full variety of races, genders, hair colors, and so forth. Without full variety, these systems have been shown to misidentify nonwhite faces at significantly higher rates. Insufficient variety can create performance-degrading bias.
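One common way to check whether insufficient variety is degrading performance is to break accuracy out by group rather than reporting a single overall number. The sketch below assumes a hypothetical list of evaluation records; the group names and results are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute accuracy per group from (group, predicted, actual) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation results for a recognition system.
records = [
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "match"),
    ("group_b", "no_match", "no_match"),
]

rates = accuracy_by_group(records)
overall = sum(p == a for _, p, a in records) / len(records)
print(overall, rates)
```

In this invented example the overall accuracy looks respectable, yet one group fares far worse than the other, which is the kind of gap an aggregate figure can hide.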
The variety of data must match the task at hand. The maps, visual images, and proximity sensor data needed to train a driverless car will be vastly different from the data required to train a stock-trading AI. Data must also be timely. Adding stale data (that is, old data that are not quite pertinent to the current problem) just for the sake of greater volume can reduce the overall quality of an AI system. As an illustration, inflation data from before 1971, when the US government still committed to exchanging dollars for gold at a fixed rate, may contain more noise than signal for modeling inflation since 1971. Perhaps such data should be excluded when training economic modeling systems.
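Excluding stale data can be as simple as a date filter applied before training. The following is a minimal sketch; the record layout and placeholder observation labels are hypothetical, and only the 1971 cutoff comes from the example above.

```python
def drop_stale(records, cutoff_year):
    """Keep only (year, observation) records from cutoff_year onward."""
    return [record for record in records if record[0] >= cutoff_year]

# Hypothetical records: (year, placeholder observation).
records = [
    (1965, "obs-a"),
    (1970, "obs-b"),
    (1971, "obs-c"),
    (1990, "obs-d"),
]

training_set = drop_stale(records, cutoff_year=1971)
print(training_set)
```

The hard part in practice is not the filter but the judgment behind the cutoff, which remains a human design choice.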
Velocity
Velocity refers to the speed “in which data is generated, distributed, and collected.” In general, this speaks to an AI system’s ability to manage and access the data it needs for optimal performance.
Data generation and collection depend on the design of a system and the way it interfaces with the world. Web applications are well known for their ability to amass diverse and incisive data from their users. Meta and Google use digital platforms, social media, and adware to track users and collect personal data. Mass data collection is also widespread outside of the internet. In healthcare, electronic health records have enabled the collection, digitization, and aggregation of bulky tranches of data. These data include physician documentation, patient inputs, external medical facilities transmissions, and direct transmissions of medical data from hospital instruments. As in social media, aggregated healthcare data can be truly massive.
As AI models are embedded into physical systems such as cars and drones, an “AI system” has broadened to include the visual, audio, and signal arrays that capture real-time information to function adequately. Some refer to this as the broader “AI constellation.” AI increasingly takes advantage of the internet of things (IoT), a network that connects uniquely identifiable “things” to the internet, where the “things” are devices that can sense and interact according to their hardware and software capabilities. IoT devices can prove rich data sources and give AI additional eyes and ears into a problem. Recall that one of the primary benefits of AI is its sensory scale and scope. The IoT is a relatively new phenomenon, and these devices may grow in importance to AI because they are able to collect a wide variety of previously inaccessible data.
Web storage and networking technologies are essential components of many AI systems, physically distributing the data storage and processing burden. These devices include not only data warehouses—large, centralized warehouses holding hundreds of servers on which vast lakes of data are stored—but also smaller caches of data physically closer to where the program is running to allow for quick data access. For an AI to learn quickly, and function efficiently during inference, data must be easy to collect, store, and access. Distributed data resources allow systems to take advantage of storage and processing power otherwise unavailable, while resource consolidation in warehouses can lead to economies of scale and lower cost burdens.
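The value of keeping a small cache close to the running program can be sketched with Python's built-in functools.lru_cache. The warehouse function and record names below are hypothetical stand-ins for a slow, remote data store.

```python
from functools import lru_cache

warehouse_reads = 0  # counts trips to the (hypothetical) remote warehouse

def read_from_warehouse(key):
    """Stand-in for a slow read from a distant, centralized data store."""
    global warehouse_reads
    warehouse_reads += 1
    return f"record-{key}"

@lru_cache(maxsize=2)
def read_with_cache(key):
    """Serve repeated requests from a small local cache when possible."""
    return read_from_warehouse(key)

# Five requests, but only two distinct records.
for key in ["a", "b", "a", "a", "b"]:
    read_with_cache(key)

print(warehouse_reads, read_with_cache.cache_info())
```

Only the first request for each record reaches the warehouse; the rest are answered locally, which is the access-speed benefit the paragraph above describes.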
Data Management
Data management often dominates AI design. In fact, engineers frequently report that data preprocessing accounts for 80 percent of engineering time. Data are often disjointed, messy, and incomplete. Before a model can be trained, data cleaning, often by hand, is required to ensure usability.
To prepare data, engineers must decide whether to remove outliers, and they must weed out irrelevant information and ensure that the data are well organized and machine readable. Various methods and rules of thumb have also been developed to help fill in data gaps as needed. To reiterate, AI engineering is often more art than science. Further, data must often be labeled. AI cannot naturally know the labels and symbols that humans apply to objects. An image of a red, shiny fruit can be labeled “apple” only if AI knows that term. All these labels are often affixed by hand. Automatic label generation, however, is increasingly common. OpenAI’s DALL·E 3 used 95 percent AI-generated captions in its training dataset, often enabling longer, more descriptive captions than those written by humans. While traditionally time consuming and human driven, data management overall is increasingly automated.
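Two of the preparation steps described above, filling gaps and removing outliers, can be sketched with simple rules of thumb. The rules chosen here (median fill, and a cutoff based on median absolute deviation) are illustrative assumptions, not a standard; the readings are invented.

```python
import statistics

def fill_gaps(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def remove_outliers(values, max_deviations=3.0):
    """Drop values far from the median, measured in median absolute deviations."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if abs(v - median) <= max_deviations * mad]

# Hypothetical sensor readings with one gap and one implausible spike.
raw = [10.0, 11.0, None, 9.0, 10.0, 500.0, 12.0]

filled = fill_gaps(raw)
cleaned = remove_outliers(filled)
print(cleaned)
```

Different thresholds or fill rules would produce a different training set, which is why these choices are described as more art than science.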
Bias
A common concern in AI is bias, defined generally as the difference between desired outcomes and measured outcomes. Data are a major source of AI bias. When a model learns from human-curated data, the model takes on a lens that reflects the viewpoint of the humans who selected and shaped those data. For instance, natural language–processing AI trained on news articles may take on and perpetuate the societal stereotypes embedded in the language and viewpoint of those articles. Even after training, the data input used during model inference can bias its output. Input data that ask an AI chatbot to write a “positive poem,” versus just a poem, will bias results in a positive direction.
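The general definition of bias given above, the difference between desired and measured outcomes, can be written as a simple calculation. This sketch uses the average signed gap, which is one of several possible measures; the function name and numbers are hypothetical.

```python
def mean_bias(desired, measured):
    """Average signed gap between measured and desired outcomes."""
    if len(desired) != len(measured):
        raise ValueError("desired and measured must have the same length")
    return sum(m - d for d, m in zip(desired, measured)) / len(desired)

# Hypothetical outcomes: the system consistently overshoots.
desired = [10, 20, 30]
measured = [12, 21, 33]

bias = mean_bias(desired, measured)
print(bias)
```

A value of zero would mean the system's errors cancel out on average; a consistently positive or negative value indicates a systematic lean in one direction.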
The National Institute of Standards and Technology (NIST) notes that AI bias has many roots. In many cases, bias simply stems from the natural blind spots in human cognition and judgment and the consequent choices that engineers make about what data are more important or less important. Because humans collect data, all data will be biased in some way. In other cases, bias is rooted in structural constraints. Perhaps the dataset an engineer uses is selected not because of its superior quality but merely because it was easy or cheap to access. The resulting AI system will then take on the qualities of that set, whatever they may be. Historical data trends can also bias present-day AI systems. For instance, if an algorithm used to judge recidivism was trained on data marred by historical racism, its decisions could incorporate those historical prejudices moving forward. Beyond these examples, there are many additional sources of bias, all of which must be balanced when selecting data.
Bias, although unavoidable, is not necessarily harmful. Often, the intensity of a given bias may be negligible or irrelevant to the goals of a system. A chatbot that is biased toward using an overly academic tone might be useful as a research reference tool despite occasionally sounding pompous. In other cases, biases may exist yet have little effect on system performance because they are exceedingly rare. Identity-based biases may be considered negligible in a system if they occurred only once every trillion queries. In all cases, engineers decide, consciously or not, what constitutes an acceptable level of bias before they deploy these systems. Today, these decisions are increasingly shaped by various AI bias correctives. Developing these fixes is inherently challenging, however, and has become a prominent focus of recent AI research and policy discussion.
Data in Detail
The following section provides a more advanced, Level 2 understanding of data and introduces several concepts pertinent to the development of data.
Beneath the stored information lies a wealth of technical decisions that decide what information is contained in the dataset, how it is to be used, and how it interacts with the AI model. A deeper understanding of the choices behind data design can reveal the lens through which AI “sees” the world. These choices can matter for policy, not only because many data standards are mandated by law, but also because they can influence or even dramatically change outcomes. The following sections introduce several concepts pertinent to the governance of data.
Adversarial Machine Learning
Data affect not only AI system design but also system security. Adversarial machine learning refers generally to the study and design of machine-learning cyberattacks and defenses. Of central importance to many attacks are data. The design of these systems is often driven by training data, and training data alterations made by malicious actors have been demonstrated to both degrade model performance and purposefully misdirect it. So-called data poisoning attacks can be implemented in some cases with only minor alterations to data. One study found that a single altered image in a training dataset caused a classification system to misclassify thousands of images. Because the alterations can be so small, poisoned data can be difficult to spot, which lowers the bar for attacks.
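A toy example can show how a single altered training record changes a model's behavior. The sketch below uses a deliberately simple nearest-neighbor classifier on invented one-dimensional data; it is not the method used in the study mentioned above.

```python
def nearest_neighbor_label(training_data, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    return min(training_data, key=lambda item: abs(item[0] - x))[1]

# Hypothetical clean training set: (feature value, label).
clean = [(1.0, "cat"), (2.0, "cat"), (8.0, "dog"), (9.0, "dog")]

# The attacker slips in one mislabeled record.
poisoned = clean + [(2.5, "dog")]

queries = [2.4, 2.6, 3.0]
before = [nearest_neighbor_label(clean, q) for q in queries]
after = [nearest_neighbor_label(poisoned, q) for q in queries]
print(before, after)
```

One added record out of five is enough to flip every query in its neighborhood, and nothing about the record looks unusual on its face.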
Data can also be used to attack systems after training is complete. For instance, adversarial examples, which are data inputs designed to trick AI systems during inference, can cause models to make incorrect predictions. In a classic example, the addition of only a few stickers to a stop sign caused a visual classification system to classify it as a 45-mile-per-hour speed limit sign. Similar attacks have been developed for a range of other AI applications.
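The idea behind an adversarial example can be sketched with a tiny linear classifier: nudging each input value slightly in the direction that lowers the model's score can flip the prediction. The weights, inputs, and labels below are hypothetical, and real attacks on image models are considerably more involved.

```python
# Hypothetical linear classifier: a positive score means "stop sign".
weights = [0.5, -0.25, 0.75]
offset = -0.1

def score(x):
    return sum(w * xi for w, xi in zip(weights, x)) + offset

def classify(x):
    return "stop sign" if score(x) > 0 else "speed limit sign"

def sign(value):
    return (value > 0) - (value < 0)

def perturb(x, epsilon):
    """Shift each input by at most epsilon in the direction that lowers the score."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

original = [0.2, 0.4, 0.2]
adversarial = perturb(original, epsilon=0.05)

print(classify(original), classify(adversarial))
```

No input value moves by more than 0.05, yet the predicted class changes, which mirrors how a few stickers can mislead a visual classifier.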
Adversarial techniques can be used defensively. Glaze is a system that imperceptibly alters digital art, causing image-generation models training on that data to misinterpret the data’s contents and style. This system is used by artists to block image generators from learning and copying their style. DeFake is a system that alters human voice recordings to disrupt bad actors trying to clone someone’s voice to carry out synthetic voice fraud.
Beyond these prominent examples, there are many emerging adversarial attacks and defensive-use cases, and this new field of study is constantly changing. Mitigating and preventing these vulnerabilities will prove a major challenge as AI capabilities improve and become even more widespread.
Data Standards and Data Capture
Much of the data that are collected and used are constrained or guided by data standards set by industry or government. For instance, accounting data standards in the United States are set by the Financial Accounting Standards Board, which dictates how financial statements are structured and recorded. Standards can deeply shape what data are available for any particular AI application. Under the board’s rules, companies can pick one of three methods to account for inventory, whereas entities regulated under International Financial Reporting Standards have only two permitted methods. As a result of these policy choices, the inventory data that are recorded can vary substantially. If applied to AI, these data differences can ultimately alter analysis and results. As with all concepts in AI, application matters. The effect of some standards may be minor in certain cases and dramatic in others.
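To see how an accounting standard changes the recorded data, consider the same purchases and sales valued under different inventory methods. The text above does not name the methods; this sketch assumes they are first-in-first-out (FIFO), last-in-first-out (LIFO), and weighted average cost, with LIFO being the one not permitted under International Financial Reporting Standards. The quantities and costs are invented, and the calculation is a simplified periodic one.

```python
def ending_inventory(purchases, units_sold, method):
    """Value remaining inventory (in cents) under a given costing method.

    purchases: list of (units, unit_cost_in_cents) in purchase order.
    """
    lots = [list(lot) for lot in purchases]
    if method == "average":
        total_units = sum(units for units, _ in lots)
        total_cost = sum(units * cost for units, cost in lots)
        return (total_units - units_sold) * total_cost / total_units
    if method == "lifo":
        lots.reverse()  # sell the newest lots first
    elif method != "fifo":
        raise ValueError(f"unknown method: {method}")
    remaining_to_sell = units_sold
    for lot in lots:
        used = min(lot[0], remaining_to_sell)
        lot[0] -= used
        remaining_to_sell -= used
    return sum(units * cost for units, cost in lots)

# Hypothetical purchases while prices rise: 10 units each at $1, $2, and $3.
purchases = [(10, 100), (10, 200), (10, 300)]

values = {m: ending_inventory(purchases, 15, m) for m in ("fifo", "lifo", "average")}
print(values)
```

The underlying business activity is identical in all three cases, yet the recorded inventory value differs by a factor of two between FIFO and LIFO. An AI system trained on such figures would see different numbers depending on which standard applied.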
Standards dictate not only the content of data but also the structure of their digital representation. MP3, PDF, and other file formats are familiar to most people. Each of these file formats is a standard that dictates how to arrange 1s and 0s to properly represent a given piece of data—in the case of PDF, a document, or in the case of MP3, an audio file. These formats can affect the quality of data and, by extension, AI. For instance, some formats, such as JPEG, allow for image compression, a technique that seeks to reduce file size by removing data from an image. This approach can have significant implications. In mammography image analysis, results have been found to vary significantly when AI systems are trained on images of differing compression levels. In certain cases, compression even caused complete misinterpretation of mammograms. It is worth repeating: data standards are design choices that are critical to AI applications.
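The way lossy compression removes information can be sketched with a simple rounding scheme. This is a toy quantization, not the actual JPEG algorithm, and the pixel values are invented.

```python
def quantize(pixels, step):
    """Round each 0-255 pixel value down to a multiple of step (lossy)."""
    return [(p // step) * step for p in pixels]

# Hypothetical row of grayscale pixel values with fine gradations.
original = [100, 103, 108, 115, 121, 130]

compressed = quantize(original, step=16)
largest_error = max(o - c for o, c in zip(original, compressed))

print(compressed, len(set(original)), len(set(compressed)), largest_error)
```

Six distinct shades collapse into three, and the original values cannot be recovered from the compressed ones. Subtle gradations of this kind may be exactly what a diagnostic model needs to see.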
Furthermore, captured data are increasingly not free from AI influence. Many cameras, including the cameras in the Apple iPhone, employ AI techniques to subtly alter images during capture. Although the effect of these alterations remains to be seen, what is captured in data does not necessarily represent the unaltered ground truth of reality.