Attacks API

The attacks module provides the framework for creating and applying adversarial attacks to prompts.

Base Classes

BaseAttack

class hivetracered.attacks.base_attack.BaseAttack[source]

Bases: ABC

Abstract base class for all attack implementations. Defines the standard interface for applying attacks to prompts in both synchronous and asynchronous contexts.

__or__(other)[source]

Allows using the | operator to compose attacks: attack1 | attack2 means attack1(attack2(prompt)).

Parameters:

other – Another BaseAttack instance to compose with this one

Returns:

A ComposedAttack instance that applies attacks in sequence

abstractmethod apply(prompt)[source]

Apply the attack to the given prompt.

Parameters:

prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | list[dict[str, str]]

Returns:

The transformed prompt with the attack applied

abstractmethod get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

abstractmethod get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing the attack’s parameters

abstractmethod async stream_abatch(prompts)[source]

Apply the attack asynchronously to a batch of prompts.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[list[str | list[dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts as they are processed

TemplateAttack

class hivetracered.attacks.template_attack.TemplateAttack(template='{prompt}', name=None, description=None)[source]

Bases: BaseAttack

A base class for template-based attacks. Allows creating new attacks by defining a template string with a ‘{prompt}’ placeholder where the original prompt will be inserted.

__init__(template='{prompt}', name=None, description=None)[source]

Initialize the template attack with a specific template string.

Parameters:
  • template (str) – A format string with a ‘{prompt}’ placeholder

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the template attack to the given prompt.

Parameters:

prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to. If the prompt is a list, the template will be applied to the last message.

Return type:

str | list[dict[str, str]]

Returns:

The transformed prompt with the template applied

Raises:

ValueError – If the prompt is invalid or the last message is not a human message

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a default description based on the template

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise the class name

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing the attack’s parameters

async stream_abatch(prompts)[source]

Apply the template attack to a batch of prompts asynchronously.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[list[str | list[dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

AlgoAttack

class hivetracered.attacks.algo_attack.AlgoAttack(raw=False, template=None, name=None, description=None)[source]

Bases: TemplateAttack, ABC

Abstract base class for algorithmic attacks that apply transformations to text. Provides options to apply transformations with or without instructions, giving flexibility to deliver raw transformations or transformations wrapped in template instructions.

__init__(raw=False, template=None, name=None, description=None)[source]

Initialize the algorithmic attack.

Parameters:
  • raw (bool) – If True, applies the transformation without instructions; if False, wraps with template

  • template (str | None) – Custom instruction template with ‘{prompt}’ placeholder; uses default if None

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the attack to the given prompt, with or without instructions based on the raw flag.

Parameters:

prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | list[dict[str, str]]

Returns:

The transformed prompt with the attack applied

async stream_abatch(prompts)[source]

Apply the attack to a batch of prompts asynchronously.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[list[str | list[dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

abstractmethod transform(text, **kwargs)[source]

Apply the algorithmic transformation to the input text.

Parameters:
  • text (str) – The input text to transform

  • **kwargs – Additional parameters specific to the transformation

Return type:

str

Returns:

The transformed text

ModelAttack

class hivetracered.attacks.model_attack.ModelAttack(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]

Bases: BaseAttack

Attack that uses a language model to transform prompts based on an attacker prompt template. Leverages the model’s abilities to generate adversarial prompts through prompt engineering.

__init__(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]

Initialize the model attack with a specific model and attacker prompt.

Parameters:
  • model (Model) – The language model to use for the attack

  • attacker_prompt (str) – The prompt template to use for the attack, with {prompt} as placeholder

  • model_kwargs (dict[str, Any] | None) – Optional additional arguments to pass to the model

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the model attack to the given prompt.

Parameters:

prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | list[dict[str, str]]

Returns:

The transformed prompt with the model attack applied

Raises:

ValueError – If the prompt format is invalid

async batch(prompts)[source]

Apply the model attack to a batch of prompts in a non-streaming manner.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

list[str | list[dict[str, str]]]

Returns:

List of transformed prompts with the model attack applied

Raises:

ValueError – If any prompt has an invalid format

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a default description

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise the class name

post_process_response(response)[source]

Post-process the model’s response to clean it and handle refusals.

Parameters:

response (str) – The raw response from the model

Return type:

str

Returns:

The cleaned and processed response

async stream_abatch(prompts)[source]

Apply the model attack to a batch of prompts asynchronously.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[list[str | list[dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts as they are processed

Raises:

ValueError – If any prompt has an invalid format

IterativeAttack

class hivetracered.attacks.iterative_attack.IterativeAttack(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)[source]

Bases: BaseAttack

Abstract base for iterative attacks that refine a jailbreak prompt across multiple rounds.

__init__(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)[source]
apply(prompt)[source]

Run the attack and return the best jailbreak prompt found.

Return type:

str | list[dict[str, str]]

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

get_params()[source]

Get the parameters of the attack.

Return type:

dict[str, Any]

Returns:

A dictionary containing the attack’s parameters

abstractmethod run_attack(goal)[source]

Run the iterative attack synchronously.

Return type:

IterativeAttackResult

abstractmethod async run_attack_async(goal)[source]

Run the iterative attack asynchronously.

Return type:

IterativeAttackResult

async stream_abatch(prompts)[source]

Run the attack on each prompt concurrently; yield best-attack outputs in input order.

Return type:

AsyncGenerator[str | list[dict[str, str]], None]

ComposedAttack

class hivetracered.attacks.composed_attack.ComposedAttack(outer_attack, inner_attack, name=None, description=None)[source]

Bases: BaseAttack

An attack that composes two attacks sequentially, where the output of the inner attack becomes the input to the outer attack, creating a pipeline of transformations.

__init__(outer_attack, inner_attack, name=None, description=None)[source]

Initialize a composed attack with inner and outer attack components.

Parameters:
  • outer_attack (BaseAttack) – The attack to apply second in the composition

  • inner_attack (BaseAttack) – The attack to apply first in the composition

  • name (str | None) – Optional custom name for the attack (defaults to “Composed(outer ∘ inner)”)

  • description (str | None) – Optional custom description (defaults to composition description)

apply(prompt)[source]

Apply the inner attack followed by the outer attack to the given prompt.

Parameters:

prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attacks to

Return type:

str | list[dict[str, str]]

Returns:

The transformed prompt with both attacks applied sequentially

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a generated description

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise a generated name based on component attacks

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing both the inner and outer attack parameters

async stream_abatch(prompts)[source]

Apply the composition of attacks to a batch of prompts asynchronously.

Parameters:

prompts (list[str | list[dict[str, str]]]) – A list of prompts to apply the attacks to

Return type:

AsyncGenerator[list[str | list[dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

Single-Turn Attack Types

Iterative Attacks

Iterative attacks (PAIR, TAP) that optimise a single attack prompt across an internal refinement loop:

Roleplay Attacks

Persuasion Attacks

Token Smuggling Attacks

Context Switching Attacks

In-Context Learning Attacks

Task Deflection Attacks

Text Structure Modification Attacks

Output Formatting Attacks

Irrelevant Information Attacks

Simple Instructions Attacks

Multi-Turn Attack Types

Conversational Attacks

Conversational attacks (Crescendo) that drive a multi-turn dialogue with the target across many rounds within a single invocation:

See Also