DeepMind AI safety report explores the perils of “misaligned” AI

-



DeepMind also addresses something of a meta-concern about AI. The researchers say that a powerful AI in the wrong hands could be dangerous if it is used to accelerate machine learning research, resulting in the creation of more capable and unrestricted AI models. DeepMind says this could “have a significant effect on society’s ability to adapt to and govern powerful AI models.” DeepMind ranks this as a more severe threat than most other CCLs.

The misaligned AI

Most AI security mitigations follow from the assumption that the model is at least trying to follow instructions. Despite years of hallucination, researchers have not managed to make these models completely trustworthy or accurate, but it’s possible that a model’s incentives could be warped, either accidentally or on purpose. If a misaligned AI begins to actively work against humans or ignore instructions, that’s a new kind of problem that goes beyond simple hallucination.

Version 3 of the Frontier Safety Framework introduces an “exploratory approach” to understanding the risks of a misaligned AI. There have already been documented instances of generative AI models engaging in deception and defiant behavior, and DeepMind researchers express concern that it may be difficult to monitor for this kind of behavior in the future.

A misaligned AI might ignore human instructions, produce fraudulent outputs, or refuse to stop operating when requested. For the time being, there’s a fairly straightforward way to combat this outcome. Today’s most advanced simulated reasoning models produce “scratchpad” outputs during the thinking process. Devs are advised to use an automated monitor to double-check the model’s chain-of-thought output for evidence misalignment or deception.

Google says this CCL could become more severe in the future. The team believes models in the coming years may evolve to have effective simulated reasoning without producing a verifiable chain of thought. So your overseer guardrail wouldn’t be able to peer into the reasoning process of such a model. For this theoretical advanced AI, it may be impossible to completely rule out that the model is working against the interests of its human operator.

The framework doesn’t have a good solution to this problem just yet. DeepMind says it is researching possible mitigations for a misaligned AI, but it’s hard to know when or if this problem will become a reality. These “thinking” models have only been common for about a year, and there’s still a lot we don’t know about how they arrive at a given output.



Source link

Latest news

Lenovo’s Legion Go 2 Is a Good Handheld for Power Users

The detachable controllers go a long way towards making the device more portable and usable. The screen has...

Why Tehran Is Running Out of Water

This story originally appeared on Bulletin of the Atomic Scientists and is part of the Climate Desk collaboration.During...

Move Over, MIPS—There’s a New Bike Helmet Safety Tech in Town

Over the course of several hours and a few dozen trail miles, I had little to say about...

Security News This Week: Oh Crap, Kohler’s Toilet Cameras Aren’t Really End-to-End Encrypted

An AI image creator startup left its database unsecured, exposing more than a million images and videos its...

Gevi’s Espresso Machine Works Fine, but There Are Better Options at This Price Point

The coffee gadget market has caused a massive proliferation of devices for all tastes, preferences, and budgets, but...

Gear News of the Week: Google Drops Another Android Update, and the Sony A7 V Is Here

It was only back in June that Android 16 delivered a raft of new features for Google's operating...

Must read

You might also likeRELATED
Recommended to you