Image source: The Washington Post
Deepesh Chaudhari is an advisor at Omidyar Network, a former Senior Fellow at the Foundation for American Innovation, and the founder of the ML Model Attribution Challenge. This year, the Mercatus Center awarded Deepesh a grant to support his work on AI model attribution. Below he discusses the significance and successes of this work, and the importance of continued investment in AI model attribution and forensics efforts.
For anyone interested in this work, you can connect with or donate to the AI Threats Working Group here!
Matt Mittelsteadt
Active measures, particularly in their covert form, threaten our trust in democratic systems. These clandestine operations, often orchestrated by state actors, rely on subtle and deceptive tactics such as disinformation campaigns, cyber-attacks, and the manipulation of social media narratives. Such measures are especially hazardous in times of emergency or overwhelming crisis, such as a health pandemic, when pervasive fear and uncertainty make us more susceptible to manipulation.
Current research tells us that misinformation campaigns significantly escalated the death toll and economic losses of the COVID-19 outbreak. Alarmingly, bots are now believed to account for about a quarter of the misinformation on digital platforms, and distinguishing genuine human communications from bot-generated content is becoming increasingly difficult. This blurring of human-generated and bot-generated misinformation underscores how hard it will be to discern credible information sources during an emergency such as a nuclear incident.
The ability to trace AI-generated content back to its parent model is crucial to thwarting AI-propelled misinformation. Identifying the source model can help predict and counteract subsequent outputs, paving the way for a more informed and secure digital environment. What is needed is a collective global initiative that mirrors the approach of the Nuclear Forensics International Technical Working Group, an organization of laboratory scientists, law enforcement agents, and regulatory experts that traces unauthorized nuclear materials to their origins.
President Biden's recent executive orders have spurred commitments from government agencies and top AI labs to implement safety measures, including watermarking AI-generated content. Nonetheless, these commitments do not explicitly cover the watermarking of text, leaving potential gaps in robust attribution methods. This weakness highlights the pressing need to reinforce independent ML attribution research and to independently validate these companies' attribution capabilities. MITRE and the AI Threats Working Group launched the ML Model Attribution Challenge last year to address these socio-technical challenges. Participants aimed to decode the unique fingerprints that fine-tuned models leave in their text outputs and to identify the base models employed in influence operations, an approach akin to nuclear forensics.
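To make the idea concrete, here is a minimal sketch of what text-based attribution can look like: a stylometric classifier trained on labeled generations from candidate models. The model names, sample texts, and choice of scikit-learn are illustrative assumptions, not details of the actual Challenge.

```python
# Minimal sketch of text-based model attribution. All training data below
# is hypothetical placeholder text, not data from the Challenge itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled samples: (generated_text, source_model) pairs.
samples = [
    ("The quarterly outlook remains uncertain, analysts say.", "model_a"),
    ("In summary, the evidence suggests a measured response.", "model_a"),
    ("BREAKING: officials confirm widespread outages tonight.", "model_b"),
    ("Sources say the situation is developing rapidly.", "model_b"),
]
texts, labels = zip(*samples)

# Character n-grams capture low-level stylistic "fingerprints" (punctuation
# habits, tokenization quirks) that often persist through fine-tuning.
attributor = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
attributor.fit(texts, labels)

# Attribute a new piece of suspect text to its most likely source model.
suspect = "BREAKING: sources confirm the outage is developing rapidly."
print(attributor.predict([suspect])[0])
```

Real attribution systems are far more sophisticated, but the shape of the problem is the same: supervised classification over text alone, with the classifier standing in for the forensic analyst.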
The ML Model Attribution Challenge underscored the necessity for increased international collaboration between technical researchers and the broader society. In the battle against AI-driven influence campaigns, model attribution might be our most formidable weapon, sending a stern message to those orchestrating bot attacks: “We have traced this attack back to you. Stop immediately!”
The journey to success is fraught with obstacles, however. Determining the intent behind a model's deployment is difficult, and the widespread availability and accessibility of AI technology, in contrast to nuclear technology, make the task more arduous still.
Following the success of the ML Model Attribution Challenge, we facilitated researchers' participation in the International Atomic Energy Agency’s Coordinated Research Project (CRP) titled "Effective Public Emergency Communication in a Misinformation Environment." This program focuses on developing strategies to counteract misinformation that could be propagated via widely accessible digital platforms in the event of a nuclear emergency. This forward-thinking initiative has united leading academic institutions, AI labs, and governments to proactively address emerging AI challenges in crisis communication.
The researchers involved in the CRP are developing technologies to ensure accountability for harmful AI systems, similar to how nuclear forensic investigations trace materials back to their source after an incident. These capabilities would enable legal action against entities that cause significant harm through their use of AI and deter similar misuse, even in jurisdictions where penalties are difficult to enforce.
Refining ML model attribution poses significant challenges of its own. First, AI firms must make a unified effort to deploy these measures. Second, the attack models against watermarks must be formally defined. Third, robust watermarks that leverage unique characteristics such as model weights are crucial, both to withstand those attacks and to shield firms from unjustified allegations of deceptive activity.
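As one illustration, here is a minimal sketch of statistical watermark detection in the style of the "green list" scheme published by Kirchenbauer et al. (2023). The secret key, the toy token IDs, and the idea of deriving the key from model weights are assumptions for illustration; no lab's actual scheme is shown.

```python
# Minimal sketch of "green list" watermark detection (after Kirchenbauer
# et al., 2023). At generation time, the model softly favors a pseudorandom
# "green" subset of the vocabulary at each step; a detector who knows the
# key can test for a statistical excess of green tokens.
import hashlib
import math

GAMMA = 0.5  # expected green-list hit rate for unwatermarked text
# Placeholder key; per the suggestion above, a real scheme might derive
# this from the model's weights so each model carries a distinct signature.
SECRET_KEY = b"hypothetical-key-derived-from-model-weights"

def is_green(prev_token: int, token: int) -> bool:
    # Pseudorandomly assign `token` to the green list, seeded by the
    # preceding token and the secret key, mirroring generation-time biasing.
    digest = hashlib.sha256(
        SECRET_KEY + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    ).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def detect(tokens: list[int]) -> float:
    # Z-score of the observed green count against the null hypothesis that
    # unwatermarked text hits the green list at rate GAMMA.
    n = len(tokens) - 1
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# Toy token IDs; over a long text, a z-score of roughly 4 or more would be
# strong evidence of watermarking, while unwatermarked text stays near 0.
print(round(detect([17, 4203, 9, 88, 302, 17, 950, 12, 6, 41]), 2))
```

This sketch also shows why the second challenge, formally defined attack models, matters: anyone who can paraphrase the text or learn the key can weaken the statistical signal, so a scheme's threat model must be specified up front.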
The pending National Defense Authorization Act (NDAA) recognizes the importance of this endeavor by establishing a prize competition focused on detecting and watermarking generative AI. While a praiseworthy step, the initiative has gaps: the current competition format places little emphasis on source model attribution, and it stresses textual recognition more than an understanding of models' inherent semantics.
To enhance this framework, we recommend the following actions be taken:
Broaden the scope of DARPA's SemaFor (Semantic Forensics) program to encompass LLM text data.
Encourage AI research institutions to develop and roll out watermarking strategies that leverage model weights.
Deepen AI firms' commitments to the White House by extending their prior pledges to cover the watermarking of text.
Set up an international tabletop exercise, mirroring the ML Model Attribution Challenge, where research teams:
harness provided text data to build distinct ML model attribution classifiers, and
evaluate whether the results align with companies' internal watermarking schemes (a scoring sketch follows this list).
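To suggest what that evaluation step could look like, here is a hypothetical scoring harness in which each team submits a predicted source model per text and each company supplies its internal watermark verdict. All text IDs, model names, and verdicts are invented for illustration.

```python
# Hypothetical scoring for the tabletop exercise: how often does independent
# attribution agree with companies' internal watermark checks?
team_predictions = {
    "text_01": "model_a",
    "text_02": "model_b",
    "text_03": "model_a",
    "text_04": "model_c",
}
watermark_verdicts = {
    "text_01": "model_a",
    "text_02": "model_a",  # disagreement worth investigating
    "text_03": "model_a",
    "text_04": "model_c",
}

agreements = sum(
    team_predictions[t] == watermark_verdicts[t] for t in team_predictions
)
rate = agreements / len(team_predictions)
print(f"attribution/watermark agreement: {rate:.0%}")  # 75% in this toy case

# Disagreements are the interesting output: each one flags either a weak
# classifier, a weak watermark, or text from a model outside the exercise.
for t in team_predictions:
    if team_predictions[t] != watermark_verdicts[t]:
        print(f"{t}: team said {team_predictions[t]}, "
              f"watermark said {watermark_verdicts[t]}")
```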
Embracing this format would energize independent researchers and AI labs alike, driving them to hone their attribution skills and promoting shared responsibility.
Building an ML model attribution infrastructure is a monumental task that demands a collaborative, civilization-wide approach, ideally championed by an international entity like the IAEA. By emphasizing model attribution in the NDAA's proposed watermarking competition and aligning it with the IAEA’s Coordinated Research Project, we can take the foundational steps toward this transformative infrastructure.