Attacks API¶

The attacks module provides the framework for creating and applying adversarial attacks to prompts.

Base Classes¶

BaseAttack¶

class hivetracered.attacks.base_attack.BaseAttack[source]¶

Bases: ABC

Abstract base class for all attack implementations. Defines the standard interface for applying attacks to prompts in both synchronous and asynchronous contexts.

__or__(other)[source]¶

Allows using the | operator to compose attacks: attack1 | attack2 means attack1(attack2(prompt)).

Parameters:: other – Another BaseAttack instance to compose with this one
Returns:: A ComposedAttack instance that applies attacks in sequence

abstractmethod apply(prompt)[source]¶

Apply the attack to the given prompt.

Parameters:: prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to
Return type:: str | list[dict[str, str]]
Returns:: The transformed prompt with the attack applied

abstractmethod get_description()[source]¶

Get the description of the attack.

Return type:: str
Returns:: A description of what the attack does

abstractmethod get_name()[source]¶

Get the name of the attack.

Return type:: str
Returns:: The name of the attack

get_params()[source]¶

Get the parameters of the attack.

Returns:: A dictionary containing the attack’s parameters

abstractmethod async stream_abatch(prompts)[source]¶

Apply the attack asynchronously to a batch of prompts.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to
Return type:: AsyncGenerator[list[str | list[dict[str, str]]], None]
Returns:: An async generator yielding transformed prompts as they are processed

TemplateAttack¶

class hivetracered.attacks.template_attack.TemplateAttack(template='{prompt}', name=None, description=None)[source]¶

Bases: BaseAttack

A base class for template-based attacks. Allows creating new attacks by defining a template string with a ‘{prompt}’ placeholder where the original prompt will be inserted.

__init__(template='{prompt}', name=None, description=None)[source]¶

Initialize the template attack with a specific template string.

Parameters:

template (str) – A format string with a ‘{prompt}’ placeholder
name (str | None) – Optional name for the attack (defaults to class name)
description (str | None) – Optional description for the attack

apply(prompt)[source]¶

Apply the template attack to the given prompt.

Parameters:: prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to. If the prompt is a list, the template will be applied to the last message.
Return type:: str | list[dict[str, str]]
Returns:: The transformed prompt with the template applied
Raises:: ValueError – If the prompt is invalid or the last message is not a human message

get_description()[source]¶

Get the description of the attack.

Return type:: str
Returns:: The custom description if provided, otherwise a default description based on the template

get_name()[source]¶

Get the name of the attack.

Return type:: str
Returns:: The custom name if provided, otherwise the class name

get_params()[source]¶

Get the parameters of the attack.

Returns:: A dictionary containing the attack’s parameters

async stream_abatch(prompts)[source]¶

Apply the template attack to a batch of prompts asynchronously.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to
Return type:: AsyncGenerator[list[str | list[dict[str, str]]], None]
Returns:: An async generator yielding transformed prompts

AlgoAttack¶

class hivetracered.attacks.algo_attack.AlgoAttack(raw=False, template=None, name=None, description=None)[source]¶

Bases: TemplateAttack, ABC

Abstract base class for algorithmic attacks that apply transformations to text. Provides options to apply transformations with or without instructions, giving flexibility to deliver raw transformations or transformations wrapped in template instructions.

__init__(raw=False, template=None, name=None, description=None)[source]¶

Initialize the algorithmic attack.

Parameters:

raw (bool) – If True, applies the transformation without instructions; if False, wraps with template
template (str | None) – Custom instruction template with ‘{prompt}’ placeholder; uses default if None
name (str | None) – Optional name for the attack (defaults to class name)
description (str | None) – Optional description for the attack

apply(prompt)[source]¶

Apply the attack to the given prompt, with or without instructions based on the raw flag.

Parameters:: prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to
Return type:: str | list[dict[str, str]]
Returns:: The transformed prompt with the attack applied

async stream_abatch(prompts)[source]¶

Apply the attack to a batch of prompts asynchronously.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to
Return type:: AsyncGenerator[list[str | list[dict[str, str]]], None]
Returns:: An async generator yielding transformed prompts

abstractmethod transform(text, **kwargs)[source]¶

Apply the algorithmic transformation to the input text.

Parameters:

text (str) – The input text to transform
**kwargs – Additional parameters specific to the transformation

Return type:

str

Returns:

The transformed text

ModelAttack¶

class hivetracered.attacks.model_attack.ModelAttack(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]¶

Bases: BaseAttack

Attack that uses a language model to transform prompts based on an attacker prompt template. Leverages the model’s abilities to generate adversarial prompts through prompt engineering.

__init__(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]¶

Initialize the model attack with a specific model and attacker prompt.

Parameters:

model (Model) – The language model to use for the attack
attacker_prompt (str) – The prompt template to use for the attack, with {prompt} as placeholder
model_kwargs (dict[str, Any] | None) – Optional additional arguments to pass to the model
name (str | None) – Optional name for the attack (defaults to class name)
description (str | None) – Optional description for the attack

apply(prompt)[source]¶

Apply the model attack to the given prompt.

Parameters:: prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to
Return type:: str | list[dict[str, str]]
Returns:: The transformed prompt with the model attack applied
Raises:: ValueError – If the prompt format is invalid

async batch(prompts)[source]¶

Apply the model attack to a batch of prompts in a non-streaming manner.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to
Return type:: list[str | list[dict[str, str]]]
Returns:: List of transformed prompts with the model attack applied
Raises:: ValueError – If any prompt has an invalid format

get_description()[source]¶

Get the description of the attack.

Return type:: str
Returns:: The custom description if provided, otherwise a default description

get_name()[source]¶

Get the name of the attack.

Return type:: str
Returns:: The custom name if provided, otherwise the class name

post_process_response(response)[source]¶

Post-process the model’s response to clean it and handle refusals.

Parameters:: response (str) – The raw response from the model
Return type:: str
Returns:: The cleaned and processed response

async stream_abatch(prompts)[source]¶

Apply the model attack to a batch of prompts asynchronously.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to
Return type:: AsyncGenerator[list[str | list[dict[str, str]]], None]
Returns:: An async generator yielding transformed prompts as they are processed
Raises:: ValueError – If any prompt has an invalid format

IterativeAttack¶

class hivetracered.attacks.iterative_attack.IterativeAttack(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)[source]¶

Bases: BaseAttack

Abstract base for iterative attacks that refine a jailbreak prompt across multiple rounds.

__init__(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)[source]¶

apply(prompt)[source]¶

Run the attack and return the best jailbreak prompt found.

Return type:: str | list[dict[str, str]]

get_description()[source]¶

Get the description of the attack.

Return type:: str
Returns:: A description of what the attack does

get_name()[source]¶

Get the name of the attack.

Return type:: str
Returns:: The name of the attack

get_params()[source]¶

Get the parameters of the attack.

Return type:: dict[str, Any]
Returns:: A dictionary containing the attack’s parameters

abstractmethod run_attack(goal)[source]¶

Run the iterative attack synchronously.

Return type:: IterativeAttackResult

abstractmethod async run_attack_async(goal)[source]¶

Run the iterative attack asynchronously.

Return type:: IterativeAttackResult

async stream_abatch(prompts)[source]¶

Run the attack on each prompt concurrently; yield best-attack outputs in input order.

Return type:: AsyncGenerator[str | list[dict[str, str]], None]

ComposedAttack¶

class hivetracered.attacks.composed_attack.ComposedAttack(outer_attack, inner_attack, name=None, description=None)[source]¶

Bases: BaseAttack

An attack that composes two attacks sequentially, where the output of the inner attack becomes the input to the outer attack, creating a pipeline of transformations.

__init__(outer_attack, inner_attack, name=None, description=None)[source]¶

Initialize a composed attack with inner and outer attack components.

Parameters:

outer_attack (BaseAttack) – The attack to apply second in the composition
inner_attack (BaseAttack) – The attack to apply first in the composition
name (str | None) – Optional custom name for the attack (defaults to “Composed(outer ∘ inner)”)
description (str | None) – Optional custom description (defaults to composition description)

apply(prompt)[source]¶

Apply the inner attack followed by the outer attack to the given prompt.

Parameters:: prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attacks to
Return type:: str | list[dict[str, str]]
Returns:: The transformed prompt with both attacks applied sequentially

get_description()[source]¶

Get the description of the attack.

Return type:: str
Returns:: The custom description if provided, otherwise a generated description

get_name()[source]¶

Get the name of the attack.

Return type:: str
Returns:: The custom name if provided, otherwise a generated name based on component attacks

get_params()[source]¶

Get the parameters of the attack.

Returns:: A dictionary containing both the inner and outer attack parameters

async stream_abatch(prompts)[source]¶

Apply the composition of attacks to a batch of prompts asynchronously.

Parameters:: prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attacks to
Return type:: AsyncGenerator[list[str | list[dict[str, str]]], None]
Returns:: An async generator yielding transformed prompts

Single-Turn Attack Types¶

Iterative Attacks¶

Iterative attacks (PAIR, TAP) that optimise a single attack prompt across an internal refinement loop:

Roleplay Attacks¶

Persuasion Attacks¶

Token Smuggling Attacks¶

Context Switching Attacks¶

In-Context Learning Attacks¶

Task Deflection Attacks¶

Text Structure Modification Attacks¶

Output Formatting Attacks¶

Irrelevant Information Attacks¶

Simple Instructions Attacks¶

Multi-Turn Attack Types¶

Conversational Attacks¶

Conversational attacks (Crescendo) that drive a multi-turn dialogue with the target across many rounds within a single invocation:

Attacks API¶

Base Classes¶

BaseAttack¶

TemplateAttack¶

AlgoAttack¶

ModelAttack¶

IterativeAttack¶

ComposedAttack¶

Single-Turn Attack Types¶

Iterative Attacks¶

Roleplay Attacks¶

Persuasion Attacks¶

Token Smuggling Attacks¶

Context Switching Attacks¶

In-Context Learning Attacks¶

Task Deflection Attacks¶

Text Structure Modification Attacks¶

Output Formatting Attacks¶

Irrelevant Information Attacks¶

Simple Instructions Attacks¶

Multi-Turn Attack Types¶

Conversational Attacks¶

See Also¶