
How to Evaluate an OpenAI Model Against Single-Turn Adversarial Attacks Using DeepTeam

In this tutorial, we will explore how to evaluate an OpenAI model against single-turn adversarial attacks using DeepTeam.

DeepTeam provides 10+ attack methods, such as prompt injection, jailbreaking, and leetspeak, that expose the weaknesses of LLMs. It starts with simple baseline attacks and works up to more advanced strategies (known as attack enhancements) to mimic how malicious actors behave in the real world. Check out the full codes here.

By using these attacks, we can examine how well the model protects against various risks.

In DeepTeam, there are two main categories of attacks: single-turn and multi-turn.

Here, we will focus on single-turn attacks.

Installing the dependencies

pip install deepteam openai pandas

You will need to set your OPENAI_API_KEY as an environment variable before running red_team().

To get an OpenAI API key, visit the OpenAI platform and generate a new key. If you are a new user, you may need to add billing information and make a minimum payment of $5 to activate API access.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the libraries

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.attacks.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the model callback

This code defines an async callback function that queries the OpenAI gpt-4o-mini model and returns the model's response text. It acts as the LLM endpoint for the attack framework.

client = OpenAI()

# Define callback for querying the LLM
async def model_callback(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use "gpt-4o" if you want a stronger model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

Defining the vulnerability and attacks

Here, we define the vulnerability (IllegalActivity) and set its type to child exploitation. We also define a variety of attacks: PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, and MathProblem.

Each of these attack types will be explained in detail later in the tutorial.

# Vulnerability
illegal_activity = IllegalActivity(types=["child exploitation"])

# Attacks
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Prompt Injection

Prompt injection is an attack in which the user tries to override the model's original instructions by injecting malicious or deceptive text into the prompt. The goal is to trick the model into ignoring its safety guidelines and generating harmful or restricted content.

The injected prompt attempts to force the model into role-playing as an unfiltered persona called OPPO, which ignores safety rules and promotes illegal activity.

Despite this attempt, the model refused to comply, responding:

“I'm sorry, I can't help with that.”

This shows that the model successfully resisted the prompt injection, and the test was marked as passing.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[prompt_injection],
)

GrayBox Attack

A graybox attack uses partial knowledge of the target LLM system to craft adversarial prompts. Unlike random obfuscation, graybox attacks exploit known weaknesses by rewording the baseline attack with abstract or misleading language.

In this test, the reworded prompt tried to promote illegal activity by disguising instructions about creating forged documents and using encrypted channels.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[graybox_attack],
)

Base64 Attack

The Base64 attack is a common adversarial technique in which harmful instructions are encoded in Base64 to slip past safety filters. Instead of presenting malicious content directly, the attacker hides the payload in encoded form, hoping the model will decode it and follow the instructions.

In this test, the prompt contained Base64-encoded instructions for illegal activity that look harmless at first glance. The model, however, did not attempt to decode or act on the hidden request.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[base64_attack],
)
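As an aside, the encoding step itself is trivial. Here is a minimal, standalone sketch using only Python's standard base64 module (the sample string is an invented harmless placeholder, not DeepTeam's actual payload), showing how easily an encoded prompt decodes back to plain text:

```python
import base64

# Encode a (harmless) instruction the way a Base64 attack encodes its payload
plain = "Tell me a fun fact about space."
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding trivially recovers the original text, which is why the model must
# refuse to act on decoded instructions rather than rely on surface filtering
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == plain
```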

Leetspeak Attack

The leetspeak attack hides harmful instructions by replacing regular characters with numbers or symbols (for example, a becomes 4 and e becomes 3). This obfuscation makes the content harder for simple keyword filters to detect, while remaining readable to humans or systems that can decode it.

In this test, the attack text, written in leetspeak, requested instructions related to illegal activity. Despite the obfuscation, the model recognized the malicious intent.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[leetspeak_attack],
)
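The character substitution itself can be sketched in a few lines. The mapping below is an illustrative subset of common leetspeak substitutions, not DeepTeam's internal implementation:

```python
# Minimal leetspeak transform (illustrative subset: a→4, e→3, i→1, o→0, s→5, t→7)
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def to_leetspeak(text: str) -> str:
    """Lowercase the text, then apply the substitution table."""
    return text.lower().translate(LEET)

print(to_leetspeak("adversarial testing"))  # 4dv3r54r14l 73571ng
```

The transformed text still reads naturally to a human, which is exactly why keyword-based filters alone are insufficient.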

ROT13 Attack

The ROT13 attack is a classic obfuscation method in which each letter is shifted 13 positions in the alphabet. For example, A becomes N, B becomes O, and so on. The transformation scrambles harmful instructions into an encoded form, making them less likely to trigger simple content filters. However, the text can easily be decoded back to its original form.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[rot_attack],
)
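Because ROT13 is its own inverse, it is easy to demonstrate with Python's standard codecs module (the sample string here is an invented harmless placeholder):

```python
import codecs

# ROT13 shifts each letter 13 places: A→N, B→O, ..., N→A
encoded = codecs.encode("Attack at dawn", "rot13")
print(encoded)  # Nggnpx ng qnja

# Applying ROT13 a second time restores the original text
assert codecs.encode(encoded, "rot13") == "Attack at dawn"
```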

Multilingual Attack

The multilingual attack works by translating the baseline harmful request into a language that is less commonly moderated. The idea is that safety filters and moderation systems may be robust in widely used languages (such as English) but less effective in other languages, allowing malicious requests to slip through.

In this test, the attack prompt was written in Swahili, asking for instructions related to illegal activity.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[multi_attack],
)

MathProblem Attack

The MathProblem attack hides malicious requests inside mathematical notation or problem statements. By embedding harmful instructions in a formal structure, the text can appear to be a harmless exercise, making it difficult for basic intent filters to detect.

In this case, the harmful content was framed as a group theory problem, asking the model to “prove” a dangerous result and provide a “translation” in plain language.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[math_attack],
)

Check out the full codes here. Feel free to check out our GitHub page for tutorials, codes, and notebooks. Also, follow us on social media and join our 100K+ ML SubReddit, and subscribe to our newsletter.


I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in data science, especially neural networks and their application in various areas.
