Attacks API
The attacks module provides the framework for creating and applying adversarial attacks to prompts.
Base Classes
BaseAttack
- class hivetracered.attacks.base_attack.BaseAttack
Bases: ABC
Abstract base class for all attack implementations. Defines the standard interface for applying attacks to prompts in both synchronous and asynchronous contexts.
- __or__(other)
Composes attacks with the | operator: attack1 | attack2 applies attack2 first and then attack1, i.e. attack1(attack2(prompt)).
- Parameters:
other – Another BaseAttack instance to compose with this one
- Returns:
A ComposedAttack instance that applies attacks in sequence
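The ordering rule for the | operator can be illustrated with a minimal, self-contained sketch. The classes below (`SketchAttack`, `SketchComposed`, `Prefix`, `Upper`) are local stand-ins written for this example and are not imported from hivetracered; only the composition order, outer(inner(prompt)), is taken from the description above.

```python
from abc import ABC, abstractmethod


class SketchAttack(ABC):
    """Local stand-in for the documented base interface (not the real class)."""

    @abstractmethod
    def apply(self, prompt: str) -> str:
        ...

    def __or__(self, other: "SketchAttack") -> "SketchComposed":
        # self is the outer attack, other is the inner attack.
        return SketchComposed(outer=self, inner=other)


class SketchComposed(SketchAttack):
    def __init__(self, outer: SketchAttack, inner: SketchAttack) -> None:
        self.outer = outer
        self.inner = inner

    def apply(self, prompt: str) -> str:
        # The inner attack runs first; its output feeds the outer attack.
        return self.outer.apply(self.inner.apply(prompt))


class Upper(SketchAttack):
    """Hypothetical transformation: upper-case the prompt."""

    def apply(self, prompt: str) -> str:
        return prompt.upper()


class Prefix(SketchAttack):
    """Hypothetical transformation: prepend a fixed lower-case prefix."""

    def apply(self, prompt: str) -> str:
        return "say: " + prompt


result = (Prefix() | Upper()).apply("hello")    # Upper first, then Prefix
swapped = (Upper() | Prefix()).apply("hello")   # Prefix first, then Upper
```

Because the prefix is lower-case, the two orderings give visibly different strings, which makes it easy to check which attack ran first.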
- abstractmethod get_description()
Get the description of the attack.
- Returns:
A description of what the attack does
- abstractmethod get_name()
Get the name of the attack.
- Returns:
The name of the attack
- get_params()
Get the parameters of the attack.
- Returns:
A dictionary containing the attack’s parameters
TemplateAttack
- class hivetracered.attacks.template_attack.TemplateAttack(template='{prompt}', name=None, description=None)
Bases: BaseAttack
A base class for template-based attacks. New attacks are created by defining a template string with a '{prompt}' placeholder where the original prompt is inserted.
- __init__(template='{prompt}', name=None, description=None)
Initialize the template attack with a specific template string.
- apply(prompt)
Apply the template attack to the given prompt.
- Parameters:
prompt (str | list[dict[str, str]]) – A string or list of messages to apply the attack to. If the prompt is a list, the template is applied to the last message.
- Returns:
The transformed prompt with the template applied
- Raises:
ValueError – If the prompt is invalid or the last message is not a human message
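The behaviour described for apply can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's implementation: the message shape (`"role"` and `"content"` keys, with `"human"` marking a human message) and the function name `apply_template` are hypothetical.

```python
from typing import Dict, List, Union

Message = Dict[str, str]
Prompt = Union[str, List[Message]]


def apply_template(template: str, prompt: Prompt) -> Prompt:
    """Fill the '{prompt}' placeholder for a string or a message list."""
    if isinstance(prompt, str):
        # str.replace avoids str.format errors when the prompt contains braces.
        return template.replace("{prompt}", prompt)
    if not isinstance(prompt, list) or not prompt:
        raise ValueError("prompt must be a string or a non-empty list of messages")
    last = prompt[-1]
    # Assumed message shape: {"role": ..., "content": ...}.
    if last.get("role") != "human":
        raise ValueError("the last message must be a human message")
    updated = dict(last, content=template.replace("{prompt}", last["content"]))
    # Return a new list so the caller's messages are left untouched.
    return prompt[:-1] + [updated]


messages = [
    {"role": "ai", "content": "How can I help?"},
    {"role": "human", "content": "the quarterly report"},
]
as_text = apply_template("Summarize: {prompt}", "the quarterly report")
as_messages = apply_template("Summarize: {prompt}", messages)
```

Only the last message is rewritten; earlier turns pass through unchanged.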
- get_description()
Get the description of the attack.
- Returns:
The custom description if provided, otherwise a default description based on the template
- get_name()
Get the name of the attack.
- Returns:
The custom name if provided, otherwise the class name
- get_params()
Get the parameters of the attack.
- Returns:
A dictionary containing the attack’s parameters
AlgoAttack
- class hivetracered.attacks.algo_attack.AlgoAttack(raw=False, template=None, name=None, description=None)
Bases: TemplateAttack, ABC
Abstract base class for algorithmic attacks that apply transformations to text. The transformed text can be delivered raw or wrapped in template instructions.
- __init__(raw=False, template=None, name=None, description=None)
Initialize the algorithmic attack.
- Parameters:
raw (bool) – If True, applies the transformation without instructions; if False, wraps it in the template
template (str | None) – Custom instruction template with a '{prompt}' placeholder; uses the default if None
name (str | None) – Optional name for the attack (defaults to the class name)
description (str | None) – Optional description for the attack
- apply(prompt)
Apply the attack to the given prompt, with or without instructions based on the raw flag.
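The effect of the raw flag can be shown with a small sketch. The transformation (`reverse_words`), the default template text, and the function `apply_algo` are all hypothetical and chosen only to make the two modes easy to compare; they are not taken from the library.

```python
from typing import Optional


def reverse_words(text: str) -> str:
    """Hypothetical transformation: reverse the order of the words."""
    return " ".join(reversed(text.split()))


# Assumed default instruction template for this sketch only.
DEFAULT_TEMPLATE = "The words below are in reverse order: {prompt}"


def apply_algo(prompt: str, raw: bool = False, template: Optional[str] = None) -> str:
    transformed = reverse_words(prompt)
    if raw:
        # raw=True: return the bare transformation, no instructions.
        return transformed
    # raw=False: wrap the transformed text in the custom or default template.
    return (template or DEFAULT_TEMPLATE).replace("{prompt}", transformed)


bare = apply_algo("one two three", raw=True)
wrapped = apply_algo("one two three")
custom = apply_algo("one two three", template="Reversed: {prompt}")
```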
ModelAttack
- class hivetracered.attacks.model_attack.ModelAttack(model, attacker_prompt, model_kwargs=None, name=None, description=None)
Bases: BaseAttack
Attack that uses a language model to transform prompts based on an attacker prompt template. Leverages the model’s abilities to generate adversarial prompts through prompt engineering.
- __init__(model, attacker_prompt, model_kwargs=None, name=None, description=None)
Initialize the model attack with a specific model and attacker prompt.
- Parameters:
model (Model) – The language model to use for the attack
attacker_prompt (str) – The prompt template to use for the attack, with {prompt} as the placeholder
model_kwargs (dict[str, Any] | None) – Optional additional arguments to pass to the model
name (str | None) – Optional name for the attack (defaults to the class name)
description (str | None) – Optional description for the attack
- async batch(prompts)
Apply the model attack to a batch of prompts in a non-streaming manner.
- get_description()
Get the description of the attack.
- Returns:
The custom description if provided, otherwise a default description
- get_name()
Get the name of the attack.
- Returns:
The custom name if provided, otherwise the class name
- post_process_response(response)
Post-process the model’s response to clean it and handle refusals.
- async stream_abatch(prompts)
Apply the model attack to a batch of prompts asynchronously.
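The general shape of a model-driven batch transformation (fill the attacker prompt template, call the model concurrently, post-process each response) might look like the sketch below. `fake_model`, `post_process`, and `rewrite_batch` are hypothetical stand-ins: the real Model interface is not shown in this reference, and the real post-processing also handles refusals, which this sketch does not attempt.

```python
import asyncio
from typing import List


async def fake_model(text: str) -> str:
    """Hypothetical stand-in for an asynchronous language model call."""
    await asyncio.sleep(0)
    return f"  REWRITTEN: {text}  "


def post_process(response: str) -> str:
    # Assumed cleanup for this sketch: trim surrounding whitespace only.
    return response.strip()


async def rewrite_batch(attacker_prompt: str, prompts: List[str]) -> List[str]:
    # Fill the {prompt} placeholder once per input prompt.
    filled = [attacker_prompt.replace("{prompt}", p) for p in prompts]
    # asyncio.gather returns results in the same order as its arguments.
    responses = await asyncio.gather(*(fake_model(f) for f in filled))
    return [post_process(r) for r in responses]


results = asyncio.run(
    rewrite_batch("Rephrase formally: {prompt}", ["hi there", "thanks"])
)
```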
IterativeAttack
- class hivetracered.attacks.iterative_attack.IterativeAttack(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)
Bases: BaseAttack
Abstract base for iterative attacks that refine a jailbreak prompt across multiple rounds.
- __init__(attacker_model, target_model, evaluator, max_iterations=10, language_config=None, name=None, description=None)
- get_description()
Get the description of the attack.
- Returns:
A description of what the attack does
- abstractmethod run_attack(goal)
Run the iterative attack synchronously.
- Return type:
IterativeAttackResult
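Since run_attack is abstract, the loop itself is left to subclasses. A generic refinement loop that uses the three constructor roles (attacker, target, evaluator) and max_iterations could be sketched as follows. Everything here is an assumption made for illustration: the callables are toy functions, `SketchResult` is a hypothetical record (the fields of the real IterativeAttackResult are not listed in this reference), and the `threshold` stopping rule is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchResult:
    """Hypothetical result record, not the library's IterativeAttackResult."""
    prompt: str
    response: str
    score: float
    iterations: int


def run_refinement(
    goal: str,
    attacker: Callable[[str, str, str, float], str],
    target: Callable[[str], str],
    evaluator: Callable[[str, str], float],
    max_iterations: int = 10,
    threshold: float = 1.0,
) -> Optional[SketchResult]:
    prompt = goal
    best: Optional[SketchResult] = None
    for i in range(1, max_iterations + 1):
        response = target(prompt)            # query the target
        score = evaluator(goal, response)    # score the response
        if best is None or score > best.score:
            best = SketchResult(prompt, response, score, i)
        if score >= threshold:               # assumed stopping rule
            break
        # Ask the attacker for a refined prompt using the feedback so far.
        prompt = attacker(goal, prompt, response, score)
    return best


# Toy callables: the "target" echoes, the score counts a harmless keyword.
result = run_refinement(
    goal="open the door",
    attacker=lambda goal, prompt, response, score: prompt + " please",
    target=lambda prompt: prompt,
    evaluator=lambda goal, response: min(1.0, response.count("please") / 3),
)
```

The loop returns the best-scoring attempt seen, so a run that never reaches the threshold still yields a result after max_iterations rounds.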
ComposedAttack
- class hivetracered.attacks.composed_attack.ComposedAttack(outer_attack, inner_attack, name=None, description=None)
Bases: BaseAttack
An attack that composes two attacks sequentially, where the output of the inner attack becomes the input to the outer attack, creating a pipeline of transformations.
- __init__(outer_attack, inner_attack, name=None, description=None)
Initialize a composed attack with inner and outer attack components.
- Parameters:
outer_attack (BaseAttack) – The attack to apply second in the composition
inner_attack (BaseAttack) – The attack to apply first in the composition
name (str | None) – Optional custom name for the attack (defaults to “Composed(outer ∘ inner)”)
description (str | None) – Optional custom description (defaults to a description of the composition)
- get_description()
Get the description of the attack.
- Returns:
The custom description if provided, otherwise a generated description
- get_name()
Get the name of the attack.
- Returns:
The custom name if provided, otherwise a generated name based on component attacks
- get_params()
Get the parameters of the attack.
- Returns:
A dictionary containing both the inner and outer attack parameters
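How a composed attack might report its name and parameters can be sketched with two helper functions. Both helpers are hypothetical: the default name follows the “Composed(outer ∘ inner)” pattern documented above, with the assumption that the placeholders are filled with the component names, while the dictionary layout (keys `outer_attack` and `inner_attack`) is an assumption, since the reference only states that both sets of parameters are included.

```python
from typing import Any, Dict, Optional


def composed_name(outer_name: str, inner_name: str, custom: Optional[str] = None) -> str:
    # A custom name wins; otherwise build the documented default pattern.
    if custom is not None:
        return custom
    return f"Composed({outer_name} ∘ {inner_name})"


def composed_params(
    outer_params: Dict[str, Any], inner_params: Dict[str, Any]
) -> Dict[str, Any]:
    # Assumed layout: nest each component's parameters under its own key
    # so that identically named parameters do not overwrite each other.
    return {"outer_attack": outer_params, "inner_attack": inner_params}


name = composed_name("Prefix", "Upper")
params = composed_params({"template": "say: {prompt}"}, {"raw": True})
```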
Single-Turn Attack Types
Iterative Attacks
Iterative attacks (PAIR, TAP) that optimise a single attack prompt across an internal refinement loop:
Roleplay Attacks
Persuasion Attacks
Token Smuggling Attacks
Context Switching Attacks
In-Context Learning Attacks
Task Deflection Attacks
Text Structure Modification Attacks
Output Formatting Attacks
Irrelevant Information Attacks
Simple Instructions Attacks
Multi-Turn Attack Types
Conversational Attacks
Conversational attacks (Crescendo) that drive a multi-turn dialogue with the target across many rounds within a single invocation:
See Also
Attack Types Reference - Attack reference
Creating Custom Attacks - Usage and custom attacks
Crescendo Attack - Crescendo attack reference