Humanity details its AI safety strategy

Date:

Anthropic details safety strategies to try to maintain the popular AI model, Claude.

At the heart of this initiative is the human Safeguard team. Those who aren’t your average technical support group have a mix of policy experts, data scientists, engineers and threat analysts and know how bad actors think.

However, the approach to human security is more like a castle with multiple layers of defense than a single wall. It all starts with creating the right rules and ends with hunting new threats in the wild.

The first is the usage policy. This is basically a rulebook for how you shouldn’t use Claude. It provides clear guidance on using Claude responsibly in sensitive areas such as finance and healthcare, such as election integrity and child safety.

To shape these rules, teams use a unified harm framework. This helps us to think through potential negative effects, up to physical and psychological economic and social harms. This is not a formal grading system, but a structured way to weigh risks when making decisions. We also invite external experts to test your policy vulnerability. These experts in areas such as terrorism and child safety try to “break” Claude and ask difficult questions to see where their weaknesses lie.

I saw this working during the 2024 US election. After working with the Institute for strategic dialogue, the Human Perception Claude may provide old voting information. So they added a banner pointing users towards the turbo boat. This is a reliable source of latest nonpartisan election information.

Tell Claude after getting wrong

The Human Conservation Team works closely with the developers who train Claude to build safety from the start. This means that Claude decides what kind of things it should and should not do and embed those values in the model itself.

They also work with experts to do this right. By partnering with Crisis Support Leader, Thruline, for example, they taught Claude how to handle sensitive conversations about mental health and self-harm, rather than simply refusing to speak. This careful training is why Claude declines requests to support illegal activities, write malicious code, or create scams.

You can pace it with three important ratings before a new version of Claude is released.

  1. Safety rating: These tests will see if Claude sticks to the rules, even in tricky, long conversations.
  1. Risk assessment: In areas with extremely high cyber threats and biological risks, teams conduct specialized testing with the support of government and industry partners.
  1. Bias rating: This is all fairness. They see if Claude provides reliable and accurate answers for everyone, and tests political bias and distorted responses based on gender, race, etc.

This intense test will help teams check if their training is stuck and tell them if additional protection needs to be built before launch.

A cycle of how human protection teams can build effective AI safety protection throughout the Claude model lifecycle.
(Credit: Humanity)

Human sleepless AI safety strategy

When Claude enters the world, the combination of automated systems and human reviewers keeps an eye on trouble. The main tool here is a special set of Claude models called “classifiers,” trained to cause specific policy violations in real time.

If the classifier discovers a problem, it can trigger a different action. It may move Claude’s response away from producing something harmful like spam. For repeat offenders, the team may issue warnings or even shut down their accounts.

The team is also looking at the big picture. Use privacy-friendly tools to find trends in how Claude is used, and employ techniques such as hierarchical summaries to find massive misuses such as coordinated impact campaigns. They are constantly looking for new threats, digging into data, and monitoring forums where bad actors may hang out.

But Anthropic says he knows that ensuring AI is not a job they can do on their own. They are actively working with researchers, policymakers and the public to build the best possible safeguards.

(Lead image by Nick Honding)

reference: Suvianna Grecu, AI for AI change: no rules, AI risks risking “trust crisis”

Want to learn more about AI and big data from industry leaders? Check out the AI & Big Data Expo in Amsterdam, California and London. The comprehensive event will be held in collaboration with other major events, including the Intelligent Automation Conference, Blockx, Digital Transformation Week, and Cyber Security & Cloud Expo.

Check out other upcoming Enterprise Technology events and webinars with TechForge here.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Share post:

Subscribe

spot_imgspot_img

Popular

More like this
Related

How to watch, updates for couples

"Love Island" breakout star speaks in season 7 castSerena...

What are the winning numbers for Powerball’s $750 million jackpot?

The chances of winning Powerball and Mega Millions are...

Federal Reserve Governor Lisa Cook fired by President Trump

Michael S. Derby |ReutersTrump calls on Fed Governor...