AI Policy Guide

Data

Whether an AI system is in development or in use, the quality of its data is paramount to its success. Beneath the information stored in digital data lies an unexpected wealth of technical decisions and design. A deeper understanding of the choices behind data design can reveal the lens through which AI “sees” the world. These choices matter for policy, not only because many data standards are mandated by policy, but because they can influence and even dramatically change outcomes.

This part of the AI Policy Guide covers how data are used to both train AI systems and form their inferences in real time. Learn what digital data are, what data qualities matter, and the challenges data create.

Data serve two high-level purposes in artificial intelligence systems. The first is input: data are the digital raw material used to train models during the machine learning process, as well as the input on which trained models make inferences. The second is output: the results of inference serve the practical purposes of users and can also be recycled as input to further refine model performance.

Several design choices of the dataset—such as volume, data selection, and the removal of outliers—shape the nature of AI systems. The technical form of digital data files also matters. The resolution of a photo, the compression of digital music, and unseen metadata all shape what information an AI system can process during learning or inference. To understand how microchips and algorithms shape AI, policymakers must first grasp the fundamental importance of data. 

Data have many important aspects: 

  • Through the training process, machine learning models use data to refine their inferences. 
  • When deployed, trained models use input data to make inferences, which can be translated into predictions and decisions. 
  • Many machine learning approaches require large volumes of data to train AI models. 
  • Machine learning approaches with small data are emerging to enable success without big data. 
  • The variety of data can be just as important as the volume. With diverse and representative data, systems can better account for real-world diversity and complexity. 
  • A diversity of data storage, warehousing, and collection systems is an important consideration in understanding AI systems and their governance. 
  • The data used to train and operate systems are the result of human curation, labeling, and cleaning. 
  • Human curation of data can lead systems to reflect biases. Some systems may perpetuate negative biases, whereas some might be more objective. 

Data Basics

Whether an AI system is in development or in use, the quality of its data is paramount to success. Selecting high-quality data for AI is challenging because it requires balancing multiple competing factors, including volume, variety, and velocity. Together, these qualities are sometimes referred to as the three Vs.

“We don’t have better algorithms. We have more data.” 

Peter Norvig, Director of Research at Google

 
Data Volume

Determining the ideal data volume, or the quantity of data relative to the model’s needs, has become a central question of machine learning. To train AI systems, there are two emerging approaches: big data and small data.

The big-data approach is likely the most familiar. To train an AI system, vast stores of data are funneled into the model, which learns from that data and refines itself over time. Although this process does not always work in practice, the hope is that with enough data the model eventually arrives at an optimal form with powerful predictive capabilities. 

The famed ImageNet database illustrates the power that large and diverse datasets can provide. Introduced in 2009, ImageNet included more than 14 million images and was conceived on the premise that progress in AI image recognition was a matter of more data, not improved algorithmic design. This approach proved successful. Massive data accelerated improvements in computer image recognition: the accuracy of models using ImageNet jumped from a modest 72 percent in 2010 to 96 percent in 2015, exceeding average human accuracy in just five years. These results are rooted in the volume of this database.

Although the ImageNet approach to image recognition benefited from millions of data points, the exact volume required for machine learning training is not standardized. Note that image recognition is a narrow, single-purpose application of this technology, yet it still required vast troves of data. For more complex systems, such as driverless vehicles, the volume of data is likely orders of magnitude larger. Estimating how much data is enough is a moving target and heavily depends on the application complexity, model size, accuracy requirements, and other goals. Progress has been made toward defining the relationship between algorithms and data requirements; however, current models are still speculative. In practice, engineers often depend on soft rules of thumb rather than empirically tested processes. Today’s AI engineering is more an art than a science. 

Trending against big-data approaches are the increasingly common small-data strategies. These can be used in scenarios where data are limited, spotty, or even unavailable. Small-data strategies use a variety of techniques to overcome data limitations, including transfer learning, where a model “inherits” learned information from previously trained models; artificial data, where representative yet fake data are synthetically created; and Bayesian methods, where models are coded with prior information that provides problem context before learning begins, thereby shrinking the overall learning challenge.
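To make the Bayesian idea concrete, the following is a minimal sketch rather than a description of any particular system. It assumes a hypothetical yes/no prediction problem with only three observations and encodes prior context as a Beta prior; the function name `posterior_mean` and all of the numbers are illustrative assumptions.

```python
def posterior_mean(prior_successes, prior_failures, successes, failures):
    """Mean of the Beta posterior after observing yes/no data.

    The prior is Beta(prior_successes, prior_failures). Both values are
    pseudo-counts chosen by the engineer before any data are seen.
    """
    alpha = prior_successes + successes
    beta = prior_failures + failures
    return alpha / (alpha + beta)


# Hypothetical small dataset: only three observations, all successes.
observed_successes, observed_failures = 3, 0

# With a flat Beta(1, 1) prior, three data points pull the estimate far toward 1.
flat_estimate = posterior_mean(1, 1, observed_successes, observed_failures)

# With an informative prior worth 20 pseudo-observations centered on 0.5,
# the same three data points shift the estimate only slightly.
informed_estimate = posterior_mean(10, 10, observed_successes, observed_failures)

print(round(flat_estimate, 3))      # 0.8
print(round(informed_estimate, 3))  # 0.565
```

The prior acts like data the model already "has," which is why a well-chosen prior can reduce how much new data is needed, and why a poorly chosen one can mislead.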

Privacy in the AI Age

A classic debate in modern AI is the tradeoff between big-data AI innovation and personal privacy. Some have argued that small-data strategies offer a potential future for AI that preserves both innovation and privacy. Others contend that small data might allow democracies to compete against unrestricted authoritarian data collection practices without sacrificing democratic principles. The ultimate impact and viability of these strategies, however, remain to be seen.

In 2018, DeepMind’s AlphaZero demonstrated how an AI system could master Chess, Shogi, and Go through self-play—learning without any input data apart from the game rules. The system bested all existing big-data-trained systems, challenging the assumption that more data is always better. Although AlphaZero’s design is not universally applicable, it demonstrates the potential of small-data AI to transform future AI development. 

 

Variety 

Variety is just as important as volume. The problems that AI systems face are often complex, and in theory, a great variety of data can help models account for the unique wrinkles and corner cases that complexity brings. Flexibility is essential to AI quality and ensures that systems are robust in the face of the unexpected. A classic example of the importance of variety is the set of facial images used to train facial recognition algorithms. Human faces come in many varieties, and to perform accurately, an algorithm should be trained on data containing a full variety of races, genders, hair colors, and so forth. Without full variety, these systems have been shown to misidentify nonwhite faces at significantly higher rates.
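One way to check whether a lack of variety is hurting a system is to break its error rate down by group instead of reporting a single overall figure. The sketch below assumes a hypothetical evaluation log of (group, true answer, predicted answer) records; the group names and results are invented for illustration and are not measurements from any real system.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return {group: fraction of wrong predictions}.

    Each record is a (group, truth, prediction) tuple.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, prediction in records:
        totals[group] += 1
        if truth != prediction:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation results (invented for illustration).
results = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_a", "match", "no_match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "no_match"),
]

rates = error_rate_by_group(results)
print(rates)  # {'group_a': 0.25, 'group_b': 0.5}
```

In this toy log the overall error rate is 37.5 percent, which hides the fact that one group is misidentified twice as often as the other.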

The maps, visual images, and proximity sensor data needed to train a driverless car will be vastly different from the data required to train a stock-trading AI. Data must also be timely. Adding stale data (that is, old data that are no longer pertinent to the current problem) just for the sake of greater volume can reduce the overall quality of an AI system. As an illustration, inflation data from before 1971, when the US government still maintained a fixed dollar price for gold, may contain more noise than signal for a model of inflation since 1971. Perhaps such data should be excluded when training economic modeling systems.
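Excluding stale data is often a simple filtering step in practice. The sketch below assumes a hypothetical time series of (date, value) pairs and a cutoff chosen by the engineer; the values are placeholders, not real inflation statistics, and the function name `drop_stale` is an assumption made for this example.

```python
from datetime import date


def drop_stale(observations, cutoff):
    """Keep only (day, value) observations dated on or after the cutoff."""
    return [(day, value) for day, value in observations if day >= cutoff]


# Placeholder values for illustration only, not real statistics.
series = [
    (date(1965, 1, 1), 1.0),
    (date(1970, 1, 1), 2.0),
    (date(1975, 1, 1), 3.0),
    (date(1980, 1, 1), 4.0),
]

recent = drop_stale(series, cutoff=date(1971, 1, 1))
print(len(series), "->", len(recent))  # 4 -> 2
```

The code is trivial; the hard part is the judgment about where the cutoff belongs, which is a design choice that trades volume for relevance.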

 

Velocity 

Velocity refers to the speed “in which data is generated, distributed, and collected.” In general, this speaks to an AI system’s ability to manage the data it needs for optimal performance.

Data generation and collection depend on the design of a system and the way it interfaces with the world. Web applications are well known for their ability to amass diverse and incisive data from their users. Companies such as Facebook and Google use digital platforms, social media, and adware to track users and collect personal data. Mass data collection is also widespread outside of the internet. In healthcare, electronic health records have enabled the collection, digitization, and aggregation of bulky tranches of data. These data include physician documentation, patient inputs, transmissions from external medical facilities, and direct transmissions of medical data from hospital instruments. As in social media, aggregated healthcare data can be truly massive.

As AI models are embedded into physical systems such as cars and drones, the notion of an AI system has broadened to include the visual, audio, and signal arrays that capture the real-time information these systems need to function adequately. Some refer to this as the broader AI constellation. AI increasingly takes advantage of the internet of things (IoT), a network that connects uniquely identifiable “things” to the internet, where the “things” are devices that can sense and interact according to their hardware and software capabilities. IoT devices can prove to be rich data sources and give AI additional eyes and ears into a problem. Recall that one of the primary benefits of AI is its sensory scale and scope. IoT is a relatively new phenomenon, and these devices may grow in importance to AI as they are able to collect a wide variety of previously inaccessible data.

Success can be contingent on distribution, a function of a web of storage and networking technologies, which are important components of many AI systems. These technologies include not only data warehouses (large, centralized facilities holding hundreds of servers on which vast lakes of data are stored) but also smaller caches of data physically closer to where the program is running to allow for quick data access. For an AI to learn quickly, and function efficiently during inference, data must be easy to collect, store, and access.

AI Infrastructure Drives Policy

The infrastructure that makes AI possible is not immune to unintended social effects. Data warehouses have been noted for their constant hum, which disrupts both wildlife and local residents. Warehousing also creates electronic waste and, when powered by fossil fuels, yields a high carbon footprint. With large data and computation demands, AI systems may create many externalities in the communities that support their infrastructural operation.

Data Management 

Data management dominates AI design. In fact, engineers frequently estimate that data preprocessing accounts for 80 percent of engineering time. The reason is that data are often disjointed, messy, and incomplete. Before a model can be trained, the data must be cleaned, often by hand, to ensure that they are usable. To prepare data, engineers must often decide whether to remove outliers, weed out irrelevant information, add labels, and ensure that the data are well organized. Various methods and rules of thumb have also been developed to help fill in data gaps as needed. Further, data must often be labeled. AI cannot naturally know the labels and symbols that humans apply to objects. An image of a red, shiny fruit can be labeled “apple” only if an AI knows that term. All these labels must be affixed by hand.
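The cleaning decisions described above can be sketched in a few lines. The example assumes a hypothetical column of ages with one missing entry and one implausible value; filling gaps with the median and dropping values outside a plausible range are common rules of thumb, not the only options, and the function name `clean` and the chosen range are assumptions made for this illustration.

```python
from statistics import median


def clean(values, low, high):
    """Fill missing entries with the median, then drop values outside [low, high]."""
    present = [v for v in values if v is not None]
    fill = median(present)  # the median is less sensitive to outliers than the mean
    filled = [fill if v is None else v for v in values]
    return [v for v in filled if low <= v <= high]


# Hypothetical raw data: None marks a missing entry, 430 is a likely typo.
raw_ages = [34, None, 29, 41, 430, 38]

cleaned = clean(raw_ages, low=0, high=120)
print(cleaned)  # [34, 38, 29, 41, 38]
```

Each choice here is a human judgment: a different fill rule or a different plausible range would give the model a slightly different picture of the world.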

AI Labor beyond Engineering Talent

Note that data management isn’t always performed by an engineer. Often companies will outsource this task to lower-skilled workers. This is often done through crowdsourcing services like Amazon’s Mechanical Turk.

Bias 

A common concern in AI is bias, defined generally as the difference between desired outcomes and measured outcomes. Data are a major source of AI bias. When a model learns from human-curated data, the model takes on a lens that reflects the viewpoint of the humans who selected and shaped those data. For instance, language-processing AI trained on news articles may take on and perpetuate the societal stereotypes embedded in the language and viewpoint of those articles. Even after training, the data input used during model inference can bias its output. Input data that ask an AI chatbot to write a “positive poem,” versus just a poem, will bias results in a positive direction. 

The National Institute of Standards and Technology notes that AI bias has many roots. In many cases, bias simply stems from the natural blind spots in human cognition and judgment and the consequent choices that engineers make about what data are more or less important. As humans collect data, all data will be biased in some way. In other cases, bias is rooted in structural constraints. Perhaps the dataset an engineer uses is selected merely because it was easy or cheap to access, not because of its superior quality. The resulting AI system will then take on the qualities of that set, whatever they may be. Historical data trends can also bias present-day AI systems. For instance, if an algorithm used to judge recidivism was trained on data marred by historical racism, its decisions could incorporate those historical prejudices moving forward. Beyond these examples, there are many additional sources of bias, all of which must be balanced when selecting data.

Bias, although unavoidable, is not necessarily harmful. Often, the intensity of a given bias may be negligible or irrelevant to the goals of a system. A chatbot that is biased toward using an overly academic tone might be useful as a research reference tool despite occasionally sounding pompous. In other cases, biases may exist yet have little effect on system performance because they are exceedingly rare. Identity-based biases may be considered negligible in a system if they occur only once every trillion queries. In all cases, engineers decide, consciously or not, what constitutes an acceptable level of bias before they deploy these systems. Today, these decisions are increasingly shaped by various AI bias correctives. Developing these fixes is inherently challenging, however, and has become a prominent focus of recent AI research and policy discussion.

Just want the basics? Stop here.

That covers the basics of data!

You can continue reading for a more advanced understanding, or skip to the next topic in the guide.

Data in Detail

The following section provides a more advanced, Level 2 understanding of data, and introduces several concepts pertinent to the governance of data. 

Beneath the stored information lies a wealth of technical decisions that decide what information is contained in the dataset, how it is to be used, and how it interacts with the AI model. A deeper understanding of the choices behind data design can reveal the lens through which AI “sees” the world. These choices can matter for policy, not only because many data standards are mandated by law, but also because they can influence or even dramatically change outcomes. 

 
Adversarial Machine Learning 

Data affect not only AI system design, but also system security. Adversarial machine learning refers generally to the study and design of machine learning cyberattacks and defenses. Of central importance to many attacks are data. The design of these systems is often driven by training data, and training data alterations made by malicious actors have been demonstrated to both degrade model performance and purposefully misdirect it. So-called data poisoning attacks can be implemented in some cases with only minor alterations to data. One study found that a single altered image in a training dataset caused a classification system to misclassify thousands of images. Because the alterations can be so small, poisoned data can be difficult to spot, lowering the bar for attacks.
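A toy example can show how a single altered training record changes many downstream decisions. The sketch below is not the method from the study mentioned above. It assumes a deliberately simple one-nearest-neighbor classifier on made-up one-dimensional data, and it poisons the training set by flipping the label of one record.

```python
def nearest_label(training, x):
    """Classify x with the label of the closest training point (1-nearest-neighbor)."""
    _, label = min(training, key=lambda pair: abs(pair[0] - x))
    return label


# Made-up training data: (feature value, label).
clean_training = [(1.0, "cat"), (2.0, "cat"), (8.0, "dog"), (9.0, "dog")]

# The attacker alters a single record by flipping its label.
poisoned_training = [(1.0, "cat"), (2.0, "dog"), (8.0, "dog"), (9.0, "dog")]

test_inputs = [1.6, 2.0, 3.0, 4.0]
before = [nearest_label(clean_training, x) for x in test_inputs]
after = [nearest_label(poisoned_training, x) for x in test_inputs]

print(before)  # ['cat', 'cat', 'cat', 'cat']
print(after)   # ['dog', 'dog', 'dog', 'dog']
```

Only one of four training records changed, yet every test input near it is now misclassified. Real attacks on image models are far more sophisticated, but the leverage of a small change is the same.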

Traditional Vulnerabilities Persist

All algorithms are vulnerable to traditional cyberattacks. AI systems are no different. While data poisoning attacks are often pointed to as the marquee AI vulnerability, policymakers must not forget that more traditional exploits remain. Data-based attacks represent an added layer of insecurity unique to AI systems on top of the range of older but still relevant cyber vulnerabilities. Also, recall that AI algorithms depend on other systems. An attacker who cannot gain access to an AI or its training data can still attack the system by compromising connected devices.

Data Standards and Data Capture 

Much of the data that are collected and used are constrained or guided by data standards set by industry or government. For instance, accounting data standards in the United States are set by the Financial Accounting Standards Board, which dictates how financial statements are structured and recorded. Standards can deeply shape what data are available for any particular AI application. Under the board’s rules, companies can pick one of three methods to account for inventory, whereas entities regulated under International Financial Reporting Standards have only two permitted methods. As a result of these policy choices, the inventory data that are recorded can vary substantially. If applied to AI, these data differences can ultimately alter analysis and results. As with all concepts in AI, application matters. The effect of some standards may be minor in certain cases and dramatic in others. 

Data Standards Are Destiny

In many cases, and across industries, governments and government-commissioned industry regulators control standards. Standardization is deeply entwined with policy. Most standards were decided without considering the needs of artificial intelligence, yet their effect on AI could be great. Each field and potential AI application may benefit from regulators looking at how their designs and choices might affect these systems.

Standards dictate not only the content of data, but also the structure of their digital representation. File formats such as MP3 and PDF will be familiar. Each of these formats is a standard that dictates how to arrange 1s and 0s to properly represent a given piece of data: a document in the case of PDF, or an audio file in the case of MP3. These formats can affect the quality of data and, by extension, AI. For instance, some formats such as JPEG allow for image compression, a technique that seeks to reduce the file size by removing data from an image. This approach can have significant implications. In mammography image analysis, results have been found to vary significantly when AI systems are trained on images of differing compression levels. In certain cases, compression even caused complete misinterpretation of mammograms. It is worth repeating: data standards are design choices that are critical to AI applications.
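To see what "removing data" means in practice, consider the simplified sketch below. It is not the JPEG algorithm, which is considerably more involved. It uses plain quantization of pixel brightness as a stand-in for lossy compression, and the pixel values are invented for illustration.

```python
def quantize(pixels, step):
    """Toy lossy compression: snap each 0-255 pixel value down to a multiple of step."""
    return [(p // step) * step for p in pixels]


# Invented row of grayscale pixel values with a subtle brightness gradient.
row = [118, 120, 123, 127, 131, 200]

compressed = quantize(row, step=16)
distinct_before = len(set(row))
distinct_after = len(set(compressed))

print(compressed)                       # [112, 112, 112, 112, 128, 192]
print(distinct_before, distinct_after)  # 6 3
```

The subtle gradient in the first four pixels disappears entirely after quantization. If a model relied on that kind of fine detail, for example faint structures in a medical image, the compressed file would no longer contain the information it needs.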

Furthermore, note that, increasingly, captured data are not necessarily free from AI influence. Many cameras, including the cameras in the Apple iPhone, employ AI techniques to subtly alter images during capture. Although the effect of these alterations remains to be seen, what is captured in data does not necessarily represent the unaltered ground truth of reality.

Next up: Microchips

Microchips are the hardware that powers AI systems, and policymakers must understand both how they are made and how they are used in order to craft effective rules.

About the Author

Matthew Mittelsteadt is a technologist and research fellow at the Mercatus Center whose work focuses on artificial intelligence policy. Prior to joining Mercatus, Matthew was a fellow at the Institute of Security, Policy, and Law where he researched AI judicial policy and AI arms control verification mechanisms. Matthew holds an MS in Cybersecurity from New York University, an MPA from Syracuse University, and a BA in both Economics and Russian Studies from St. Olaf College.
