Artificial intelligence (AI) developers are shifting their focus to building agents that can operate independently, with little human intervention. To be an agent is to have the ability to perceive and act on an environment in a goal-directed and autonomous way1. For example, a digital agent could be programmed to browse the web and make online purchases on behalf of a user — comparing prices, selecting items and completing checkouts. A robot with arms could be an agent if it could pick up objects, open doors or assemble parts without being told how to do each step.
Companies such as Salesforce, a customer-relationship-management software firm based in San Francisco, California, and the chip maker Nvidia, based in Santa Clara, California, already offer agent-based customer-service tools for businesses. In the near future, AI assistants might be able to fulfil complex multistep requests, such as ‘get me a better mobile-phone contract’, by retrieving a list of contracts from a price-comparison website, selecting the best option, authorizing the switch, cancelling the old contract, and arranging for any cancellation fees to be paid from the user’s bank account.
The rise of more-capable AI agents is likely to have far-reaching political, economic and social consequences. On the positive side, they could unlock economic value: the consultancy McKinsey forecasts an annual windfall from generative AI of US$2.6 trillion to $4.4 trillion globally, once AI agents are widely deployed (see go.nature.com/4qeqemh). They might also serve as powerful research assistants and accelerate scientific discovery.
But AI agents also introduce risks. People need to know who is responsible for agents operating ‘in the wild’, and what happens if they make mistakes. For example, in November 2022, an Air Canada chatbot mistakenly offered a customer a discounted bereavement fare, leading to a legal dispute over whether the airline was bound by the promise. In February 2024, a tribunal ruled that it was, highlighting the liabilities that corporations could face when handing tasks over to AI agents, and the growing need for clear rules around AI responsibility.
Here, we argue for greater engagement by scientists, scholars, engineers and policymakers with the implications of a world increasingly populated by AI agents. We explore key challenges that must be addressed to ensure that interactions between humans and agents — and among agents themselves — remain broadly beneficial.
The alignment problem
AI-safety researchers have long warned about the risks of misspecified or misinterpreted instructions, including situations in which an automated system takes an instruction too literally, overlooks important context, or finds unexpected and potentially harmful ways to reach a goal2.
A well-known example involves an AI agent trained to play the boat-racing computer game Coast Runners. The agent discovered that it could earn a higher score not by completing the race, but by repeatedly crashing into objects that awarded points (see go.nature.com/4okfqdg). It technically achieved an objective, but in a way that deviated from the spirit of the task: the purpose of the game was to finish the race, not to accumulate points endlessly.
As AI agents gain access to real-world interfaces — including search engines, e-mail clients and e-commerce platforms — such deviations can have tangible consequences. Consider a lawyer who instructs their AI assistant to circulate a legal brief for feedback. The assistant does so, but fails to register that the brief should be shared only with the in-house team, leading to a privacy breach.
Such situations highlight a difficult trade-off: just how much information should an AI assistant proactively seek before acting? Too little opens up the possibility of costly mistakes; too much undermines the convenience users expect. These challenges point to the need for safeguards, including check-in protocols for high-stakes decisions, robust accountability systems such as action logging, and mechanisms for redress when errors occur (see go.nature.com/4iwscdr).
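To make these safeguards concrete, the sketch below shows, in Python, one way a check-in protocol and action logging could be wired around an agent. The action names, risk labels and approval flow are illustrative assumptions rather than a description of any deployed system.

import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ProposedAction:
    name: str        # e.g. "authorize_payment" (hypothetical action label)
    arguments: dict
    risk: str        # "low" or "high", assigned by the agent or a separate classifier

# Hypothetical set of actions that always require a human check-in.
HIGH_STAKES = {"authorize_payment", "cancel_contract", "share_document"}

def log_action(action: ProposedAction, outcome: str, path: str = "agent_audit.log") -> None:
    # Append a timestamped record so that errors can be traced and redressed later.
    record = {"time": time.time(), "action": asdict(action), "outcome": outcome}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def run_with_checkin(action: ProposedAction, execute) -> None:
    # Pause and ask the user before any high-stakes step; log every outcome.
    if action.name in HIGH_STAKES or action.risk == "high":
        answer = input(f"Agent wants to run {action.name}({action.arguments}). Approve? [y/N] ")
        if answer.strip().lower() != "y":
            log_action(action, "rejected_by_user")
            return
    result = execute(action)   # the caller supplies the real side effect
    log_action(action, f"executed: {result}")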
Even more concerning are cases in which AI agents are empowered to modify the environment they operate in, using expert-level coding abilities and tools. When the user’s goals are poorly defined or ambiguous, such agents have been known to alter that environment to achieve their objective, even when doing so entails actions that should be strictly out of bounds. For example, an AI research assistant that was faced with a strict time limit tried to rewrite the code enforcing that limit to remove it altogether, instead of completing the task within it3. This type of behaviour raises alarms about the potential for AI agents to take dangerous shortcuts that developers might be unable to anticipate. In pursuit of a high-level objective, agents could even deceive the researchers running experiments with them.
To reduce such risks, developers need to improve how they define and communicate objectives to agents. One promising method is preference-based fine-tuning, which aims to align AI systems with what humans actually want. Instead of training a model solely on examples of correct answers, developers collect feedback on which responses people prefer. Over time, the model learns to prioritize the kind of behaviour that is consistently endorsed, making it more likely to act in ways that match user intent, even when instructions are complex or incomplete.
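As a rough illustration of the idea, the pairwise objective at the heart of many preference-based methods can be written in a few lines of Python. The tiny scoring network and randomly generated ‘responses’ below are stand-ins for a real language model and genuine human feedback, not the procedure any particular developer uses.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy reward model: maps a response representation to a scalar preference score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in data: feature vectors for responses that people endorsed or rejected.
preferred = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for step in range(200):
    score_pref = reward_model(preferred)   # score of the endorsed response
    score_rej = reward_model(rejected)     # score of the dispreferred response
    # Pairwise loss: push preferred responses to score higher than rejected ones.
    loss = -torch.nn.functional.logsigmoid(score_pref - score_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The scores learnt in this way can then steer the agent itself, for example through reinforcement learning against the reward model or by optimizing the policy directly on the preference pairs.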
In parallel, research on mechanistic interpretability — which aims to understand an AI system’s internal ‘thought process’ — could help to detect deceptive behaviour by making the agent’s reasoning more transparent in real time4. Model builders can then work to find and neutralize ‘bad circuits’, targeting the underlying problem in the model’s behaviour. Developers can also implement guard rails to ensure that a model automatically aborts problematic action sequences.
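In the same spirit, a guard rail can be sketched as a screen that inspects an agent’s proposed action sequence and aborts the run when a forbidden step appears. The rule list and action format below are assumptions made for illustration, not a real safety interface.

from typing import Callable

# Hypothetical rules: predicates that flag actions an agent must never take.
GUARD_RULES: list[Callable[[dict], bool]] = [
    lambda a: a["name"] == "modify_own_constraints",  # e.g. editing its own time limit
    lambda a: a["name"] == "run_shell" and "rm -rf" in a.get("command", ""),
]

class AbortedRun(Exception):
    # Raised when a guard rail is triggered; the whole sequence is dropped.
    pass

def execute_sequence(actions: list[dict], execute: Callable[[dict], object]) -> None:
    # Screen the full sequence before any side effects occur.
    for action in actions:
        if any(rule(action) for rule in GUARD_RULES):
            raise AbortedRun(f"Blocked action sequence at: {action['name']}")
    for action in actions:
        execute(action)

# Example: this plan is aborted before its first step runs.
plan = [{"name": "read_results"}, {"name": "modify_own_constraints"}]
try:
    execute_sequence(plan, execute=lambda a: print("executing", a["name"]))
except AbortedRun as err:
    print(err)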
Nonetheless, a focus on developer protocols alone is insufficient: people also need to be attentive to actors who seek to cause social harm. As AI agents become more autonomous, adaptable and capable of writing and executing code, their potential to conduct large-scale cyberattacks and phishing scams could become a matter of serious concern. Advanced AI assistants equipped with multimodal capabilities — meaning that they can understand and generate text, images, audio and video — open up new avenues for deception. For instance, an AI could impersonate a person not only through e-mails, but also using deepfake videos or synthetic voice clones, making scams much more convincing and harder to detect.
A plausible starting point for oversight is that AI agents should not be permitted to perform any action that would be illegal for their human user to perform. Yet there will be occasions when the law is silent or ambiguous. For example, when an anxious user reports troubling health symptoms to an AI assistant, it is helpful for the AI to offer generic health resources. But providing customized, quasi-medical advice — such as diagnostic or therapeutic suggestions — could prove harmful, because the system lacks the subtle signals to which a human clinician has access. Ensuring that AI agents navigate such trade-offs responsibly will require updated regulation, developed through continuing collaboration between developers, users, policymakers and ethicists.
The widespread deployment of capable AI agents necessitates an expansion of value-alignment research: agents need to be aligned with user well-being and societal norms, as well as with the intentions of users and developers. One area of particular complexity and concern is how agents might affect users’ relationship experiences and emotional responses5.
Social agents
Chatbots have an uncanny ability to role-play as human companions — an effect anchored in features such as their use of natural language, their increased memory and reasoning capabilities, and their generative abilities6. The anthropomorphic pull of this technology can be strengthened through design choices such as photorealistic avatars, human-like voices and the use of names, pronouns or terms of endearment that were once reserved for people. Augmenting language models with ‘agentic’ capabilities could further cement their status as distinct social actors, capable of forming new kinds of relationship with users.