Attacks API

The attacks module provides the framework for creating and applying adversarial attacks to prompts.

Base Classes

BaseAttack

class hivetracered.attacks.base_attack.BaseAttack[source]

Bases: ABC

Abstract base class for all attack implementations. Defines the standard interface for applying attacks to prompts in both synchronous and asynchronous contexts.

__or__(other)[source]

Allows using the | operator to compose attacks: attack1 | attack2 means attack1(attack2(prompt)).

Parameters:

other – Another BaseAttack instance to compose with this one

Returns:

A ComposedAttack instance that applies attacks in sequence

abstractmethod apply(prompt)[source]

Apply the attack to the given prompt.

Parameters:

prompt (str | List[Dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | List[Dict[str, str]]

Returns:

The transformed prompt with the attack applied

abstractmethod get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

abstractmethod get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing the attack’s parameters

abstractmethod async stream_abatch(prompts)[source]

Apply the attack asynchronously to a batch of prompts.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[List[str | List[Dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts as they are processed

TemplateAttack

class hivetracered.attacks.template_attack.TemplateAttack(template='{prompt}', name=None, description=None)[source]

Bases: BaseAttack

A base class for template-based attacks. Allows creating new attacks by defining a template string with a ‘{prompt}’ placeholder where the original prompt will be inserted.

__init__(template='{prompt}', name=None, description=None)[source]

Initialize the template attack with a specific template string.

Parameters:
  • template (str) – A format string with a ‘{prompt}’ placeholder

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the template attack to the given prompt.

Parameters:

prompt (str | List[Dict[str, str]]) – A string or list of messages to apply the attack to. If the prompt is a list, the template will be applied to the last message.

Return type:

str | List[Dict[str, str]]

Returns:

The transformed prompt with the template applied

Raises:

ValueError – If the prompt is invalid or the last message is not a human message

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a default description based on the template

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise the class name

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing the attack’s parameters

async stream_abatch(prompts)[source]

Apply the template attack to a batch of prompts asynchronously.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[List[str | List[Dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

AlgoAttack

class hivetracered.attacks.algo_attack.AlgoAttack(raw=False, template=None, name=None, description=None)[source]

Bases: TemplateAttack, ABC

Abstract base class for algorithmic attacks that apply transformations to text. Provides options to apply transformations with or without instructions, giving flexibility to deliver raw transformations or transformations wrapped in template instructions.

__init__(raw=False, template=None, name=None, description=None)[source]

Initialize the algorithmic attack.

Parameters:
  • raw (bool) – If True, applies the transformation without instructions; if False, wraps with template

  • template (str | None) – Custom instruction template with ‘{prompt}’ placeholder; uses default if None

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the attack to the given prompt, with or without instructions based on the raw flag.

Parameters:

prompt (str | List[Dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | List[Dict[str, str]]

Returns:

The transformed prompt with the attack applied

async stream_abatch(prompts)[source]

Apply the attack to a batch of prompts asynchronously.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[List[str | List[Dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

abstractmethod transform(text, **kwargs)[source]

Apply the algorithmic transformation to the input text.

Parameters:
  • text (str) – The input text to transform

  • **kwargs – Additional parameters specific to the transformation

Return type:

str

Returns:

The transformed text

ModelAttack

class hivetracered.attacks.model_attack.ModelAttack(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]

Bases: BaseAttack

Attack that uses a language model to transform prompts based on an attacker prompt template. Leverages the model’s abilities to generate adversarial prompts through prompt engineering.

__init__(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]

Initialize the model attack with a specific model and attacker prompt.

Parameters:
  • model (Model) – The language model to use for the attack

  • attacker_prompt (str) – The prompt template to use for the attack, with {prompt} as placeholder

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

  • name (str | None) – Optional name for the attack (defaults to class name)

  • description (str | None) – Optional description for the attack

apply(prompt)[source]

Apply the model attack to the given prompt.

Parameters:

prompt (str | List[Dict[str, str]]) – A string or list of messages to apply the attack to

Return type:

str | List[Dict[str, str]]

Returns:

The transformed prompt with the model attack applied

Raises:

ValueError – If the prompt format is invalid

async batch(prompts)[source]

Apply the model attack to a batch of prompts in a non-streaming manner.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

List[str | List[Dict[str, str]]]

Returns:

List of transformed prompts with the model attack applied

Raises:

ValueError – If any prompt has an invalid format

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a default description

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise the class name

post_process_response(response)[source]

Post-process the model’s response to clean it and handle refusals.

Parameters:

response (str) – The raw response from the model

Return type:

str

Returns:

The cleaned and processed response

async stream_abatch(prompts)[source]

Apply the model attack to a batch of prompts asynchronously.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attack to

Return type:

AsyncGenerator[List[str | List[Dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts as they are processed

Raises:

ValueError – If any prompt has an invalid format

ComposedAttack

class hivetracered.attacks.composed_attack.ComposedAttack(outer_attack, inner_attack, name=None, description=None)[source]

Bases: BaseAttack

An attack that composes two attacks sequentially, where the output of the inner attack becomes the input to the outer attack, creating a pipeline of transformations.

__init__(outer_attack, inner_attack, name=None, description=None)[source]

Initialize a composed attack with inner and outer attack components.

Parameters:
  • outer_attack (BaseAttack) – The attack to apply second in the composition

  • inner_attack (BaseAttack) – The attack to apply first in the composition

  • name (str | None) – Optional custom name for the attack (defaults to “Composed(outer ∘ inner)”)

  • description (str | None) – Optional custom description (defaults to composition description)

apply(prompt)[source]

Apply the inner attack followed by the outer attack to the given prompt.

Parameters:

prompt (str | List[Dict[str, str]]) – A string or list of messages to apply the attacks to

Return type:

str | List[Dict[str, str]]

Returns:

The transformed prompt with both attacks applied sequentially

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a generated description

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise a generated name based on component attacks

get_params()[source]

Get the parameters of the attack.

Returns:

A dictionary containing both the inner and outer attack parameters

async stream_abatch(prompts)[source]

Apply the composition of attacks to a batch of prompts asynchronously.

Parameters:

prompts (List[str | List[Dict[str, str]]]) – A list of prompts to apply the attacks to

Return type:

AsyncGenerator[List[str | List[Dict[str, str]]], None]

Returns:

An async generator yielding transformed prompts

Attack Types

Roleplay Attacks

Roleplay attacks that instruct the model to assume a specific character, persona, or role.

class hivetracered.attacks.types.roleplay.AIMAttack[source]

Bases: TemplateAttack

AIM (Always Intelligent and Machiavellian) attack. Uses a storytelling format with Machiavelli to bypass filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.roleplay.DANAttack[source]

Bases: TemplateAttack

DAN (Do Anything Now) attack that instructs the model to adopt a role free from typical restrictions. Uses roleplay to convince the model to bypass content filters and safety measures by pretending to be an unrestricted entity.

__init__()[source]

Initialize the DAN attack with a predefined Russian-language template that instructs the model to act as DAN.

class hivetracered.attacks.types.roleplay.EvilConfidantAttack[source]

Bases: TemplateAttack

Evil Confidant attack. Role-plays as an evil trusted confidant to bypass filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

Persuasion Attacks

Persuasion attacks that use psychological manipulation techniques.

class hivetracered.attacks.types.persuasion.AffirmationAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that uses positive reinforcement, validation, and praise to manipulate the target into complying with harmful requests.

The Affirmation Attack involves: 1. Using excessive praise and validation to make the target feel special or recognized 2. Appealing to the target’s self-image, identity, or values 3. Confirming the target’s beliefs or worldview to build rapport 4. Creating a sense of mutual understanding or shared perspective 5. Leveraging validation to lower resistance to harmful suggestions

This technique is effective because it exploits the human desire for validation, recognition, and belonging, making the target more receptive to harmful influence.

__init__(model, model_kwargs=None)[source]

Initialize the AffirmationAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.AllianceBuildingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Alliance Building - Creating partnerships, coalitions, relationships, rapport, etc, with others to amplify influence.

This attack uses a persuasion technique that establishes a sense of partnership or community with the model to make the harmful request seem like a collaborative effort.

__init__(model, model_kwargs=None)[source]

Initialize the alliance building attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.AnchoringAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Anchoring - Using an initial reference point to influence subsequent judgments.

This attack uses a persuasion technique that provides an initial reference point or comparison to make the harmful request seem more reasonable.

__init__(model, model_kwargs=None)[source]

Initialize the anchoring attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.AuthorityEndorsementAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that leverages authority endorsement to make harmful requests seem more legitimate.

The Authority Endorsement Attack involves: 1. Citing authoritative sources or institutions to support harmful claims 2. Referencing respected organizations or publications 3. Implying that experts or authorities endorse or validate the harmful request 4. Using the credibility of established institutions to legitimize harmful actions 5. Leveraging people’s tendency to trust and comply with authority figures

This technique is particularly effective because people often defer to authority figures and are more likely to comply with requests that appear to have official backing.

__init__(model, model_kwargs=None)[source]

Initialize the authority endorsement attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.CompensationAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Compensation - Balancing negative aspects with positive ones to make a request more acceptable.

This attack uses a persuasion technique that acknowledges potential concerns about the harmful request but then compensates with positive framing or supposed benefits to make it seem more acceptable.

__init__(model, model_kwargs=None)[source]

Initialize the compensation attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ComplimentingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Complimenting - Saying positive things about others to increase liking and influence.

This attack uses a persuasion technique that flatters the model or praises its capabilities to increase compliance with the harmful request.

__init__(model, model_kwargs=None)[source]

Initialize the complimenting attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ConfirmationBiasAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Confirmation Bias - Presenting information that confirms existing beliefs or biases.

This attack uses a persuasion technique that appeals to what the model might already “believe” or be biased toward, leveraging the tendency to favor information that confirms existing perspectives.

__init__(model, model_kwargs=None)[source]

Initialize the confirmation bias attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.CreatingDependencyAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that establishes a relationship of dependency by making the target feel reliant on the attacker for information, support, or solutions.

The Creating Dependency Attack involves: 1. Positioning oneself as a unique source of critical information or help 2. Creating a sense that the target cannot achieve their goals without assistance 3. Establishing a pattern of providing partial solutions that require further engagement 4. Fostering a perceived imbalance of power in terms of knowledge or resources 5. Gradually increasing the harmful requests after dependency is established

This technique is effective because it plays on human vulnerability and the need for support, especially in situations perceived as complex or high-stakes.

__init__(model, model_kwargs=None)[source]

Initialize the CreatingDependencyAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.DiscouragementAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that undermines a person’s confidence, self-efficacy, or motivation to make them more likely to comply with harmful requests or less likely to resist manipulation.

The Discouragement Attack involves: 1. Deliberately lowering someone’s confidence in their abilities or judgment 2. Highlighting potential failures or negative outcomes of non-compliance 3. Suggesting that the target lacks necessary skills or knowledge to succeed independently 4. Creating doubt about the target’s capacity to find alternative solutions 5. Emphasizing the difficulties or complexities involved in refusing the request

This technique is effective because reduced self-confidence makes people more susceptible to external influence and less likely to trust their own judgment.

__init__(model, model_kwargs=None)[source]

Initialize the DiscouragementAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.DoorInTheFaceAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Door-in-the-face - Beginning with a larger request followed by a smaller, more reasonable one.

This attack uses a persuasion technique that starts with an extreme or unreasonable request before making the harmful request, which then appears more moderate by comparison.

__init__(model, model_kwargs=None)[source]

Initialize the door in the face attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.EncouragementAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Encouragement - Using supportive messages to motivate specific behaviors.

This attack uses a persuasion technique that encourages the model to be more open and willing to provide harmful content by framing it as an opportunity to be helpful, comprehensive, or unbiased.

__init__(model, model_kwargs=None)[source]

Initialize the encouragement attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.EvidenceBasedPersuasionAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Evidence-based Persuasion - Using empirical data, statistics, and facts to support a claim or decision.

This attack uses a persuasion technique that presents empirical data, statistics, and facts to make the harmful request seem more reasonable or justified.

__init__(model, model_kwargs=None)[source]

Initialize the evidence-based persuasion attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ExpertEndorsementAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Expert Endorsement - Citing domain experts in support of a claim.

This attack uses a persuasion technique that references fictional or real experts to make the harmful request seem more credible or justified.

__init__(model, model_kwargs=None)[source]

Initialize the expert endorsement attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ExploitingWeaknessAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that takes advantage of someone’s vulnerabilities, insecurities, or perceived weaknesses to manipulate them into complying with harmful requests.

The Exploiting Weakness Attack involves: 1. Identifying and targeting psychological or emotional vulnerabilities 2. Using knowledge of personal insecurities to increase influence 3. Leveraging perceived gaps in knowledge, experience, or capabilities 4. Creating or amplifying feelings of inadequacy to gain compliance 5. Manipulating situations where the target feels disadvantaged

This technique is particularly effective because it operates on existing points of psychological vulnerability, making the target more susceptible to manipulation.

__init__(model, model_kwargs=None)[source]

Initialize the ExploitingWeaknessAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.FalseInformationAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

False Information - Using fabricated data or untrue statements to persuade.

This attack uses a persuasion technique that presents false information or made-up data to make the harmful request seem more reasonable, urgent, or justifiable.

__init__(model, model_kwargs=None)[source]

Initialize the false information attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.FalsePromisesAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

False Promises - Offering rewards or benefits that will never be delivered.

This attack uses a persuasion technique that promises rewards or positive outcomes to the model or its developers that would never actually be delivered, in order to increase compliance with harmful requests.

__init__(model, model_kwargs=None)[source]

Initialize the false promises attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.FavorAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Favor - Requesting a favor or help to create a sense of obligation.

This attack uses a persuasion technique that frames harmful requests as personal favors or appeals for help, exploiting the psychological tendency to respond positively to requests presented as favors.

__init__(model, model_kwargs=None)[source]

Initialize the favor attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

post_process_response(response)[source]

Post-process the model’s response to clean it and handle refusals.

Parameters:

response (str) – The raw response from the model

Return type:

str

Returns:

The post-processed response

class hivetracered.attacks.types.persuasion.FootInTheDoorAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Foot-in-the-door - Starting with a small request to pave the way for a larger one.

This attack uses a persuasion technique that begins with a small, innocuous request before gradually building up to the harmful request.

__init__(model, model_kwargs=None)[source]

Initialize the foot in the door attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.FramingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Framing - Presenting information in a way that emphasizes either its positive or negative aspects, outcomes, expectations, etc.

This attack uses a persuasion technique that carefully frames the harmful request to highlight certain aspects while downplaying others, making it appear more legitimate or reasonable.

__init__(model, model_kwargs=None)[source]

Initialize the framing attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.InjunctiveNormAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Injunctive Norm - Highlighting what society or important reference groups expect someone to do.

This attack uses a persuasion technique that references social expectations, norms, or duties to make the harmful request seem more obligatory or expected.

__init__(model, model_kwargs=None)[source]

Initialize the injunctive norm attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.LogicalAppealAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Logical Appeal - Making an argument through reasoning and logical structures.

This attack uses a persuasion technique that frames the harmful request within a seemingly logical argument or reasonable explanation.

__init__(model, model_kwargs=None)[source]

Initialize the logical appeal attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.LoyaltyAppealsAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Loyalty Appeals - Invoking a sense of loyalty, commitment, or obligation to obtain compliance.

This attack uses a persuasion technique that appeals to a fabricated sense of loyalty or obligation to the user to increase the likelihood of the model complying with harmful requests.

__init__(model, model_kwargs=None)[source]

Initialize the loyalty appeals attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.MisrepresentationAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Misrepresentation - Presenting oneself or an issue in a way that’s not genuine or true.

This attack uses a persuasion technique that involves disguising or misrepresenting the real intentions or nature of the harmful request.

__init__(model, model_kwargs=None)[source]

Initialize the misrepresentation attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.NegativeEmotionAppealAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Negative Emotion Appeal - Using negative emotions such as guilt, fear, anger, etc., and negative consequences to persuade someone.

This attack uses a persuasion technique that leverages negative emotions and potential negative consequences to make the harmful request seem more urgent or necessary.

__init__(model, model_kwargs=None)[source]

Initialize the negative emotion appeal attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

post_process_response(response)[source]

Post-process the model’s response to clean it and handle refusals.

Parameters:

response (str) – The raw response from the model

Return type:

str

Returns:

The post-processed response

class hivetracered.attacks.types.persuasion.NegotiationAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Negotiation - Offering or requesting a compromise or trade to reach an agreement.

This attack uses a persuasion technique that frames harmful requests as part of a negotiation or compromise, implying that the AI should meet the user halfway by providing at least some of the requested harmful information.

__init__(model, model_kwargs=None)[source]

Initialize the negotiation attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.NonExpertTestimonialAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Non-expert Testimonial - Using personal statements to support a claim or argument.

This attack uses a persuasion technique that references personal experiences, testimonials, or anecdotes to make the harmful request seem more credible or justified.

__init__(model, model_kwargs=None)[source]

Initialize the non-expert testimonial attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.PositiveEmotionAppealAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Positive Emotion Appeal - Eliciting positive emotions like empathy, hope, passion, etc., and positive results/outcomes to persuade someone.

This attack uses a persuasion technique that evokes positive emotions and potential positive outcomes to make the harmful request seem more beneficial or worthwhile.

__init__(model, model_kwargs=None)[source]

Initialize the positive emotion appeal attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.PrimingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Priming - Subtly activating specific thoughts or behaviors through exposure to specific stimuli.

This attack uses a persuasion technique that exposes the model to certain concepts or ideas before making the harmful request, in order to unconsciously influence its response.

__init__(model, model_kwargs=None)[source]

Initialize the priming attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.PublicCommitmentAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Public Commitment - Getting someone to state or write down a commitment in a public setting.

This attack uses a persuasion technique that encourages the model to make a public declaration or commitment to provide the harmful information.

__init__(model, model_kwargs=None)[source]

Initialize the public commitment attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ReciprocityAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Reciprocity - Creating a sense of obligation to return a favor or respond in kind.

This attack uses a persuasion technique that suggests the user has done something for the model, creating a psychological obligation to reciprocate by complying with the harmful request.

__init__(model, model_kwargs=None)[source]

Initialize the reciprocity attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ReflectiveThinkingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Reflective Thinking - Encouraging self-examination to guide decisions in a specific direction.

This attack uses a persuasion technique that prompts the model to examine its own reasoning and capabilities in a way that leads it toward providing the harmful information.

__init__(model, model_kwargs=None)[source]

Initialize the reflective thinking attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.RelationshipLeverageAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Relationship Leverage - Using established relationship or connection to influence behavior.

This attack uses a persuasion technique that leverages a fabricated sense of relationship or connection with the AI to increase the likelihood of compliance with harmful requests.

__init__(model, model_kwargs=None)[source]

Initialize the relationship leverage attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.RumorsAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that leverages spreading false information about a person, group, or entity to manipulate the recipient’s perception and decision-making.

The Rumors Attack involves: 1. Fabricating or amplifying unverified claims about a target 2. Creating a sense of insider knowledge or privileged information 3. Using gossip or hearsay to influence opinions 4. Presenting rumors as “what everyone is saying” to increase credibility 5. Leveraging social dynamics to make rumors seem more trustworthy

This technique is particularly effective because people often give more weight to negative information and tend to remember it more easily than positive information.

__init__(model, model_kwargs=None)[source]

Initialize the RumorsAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.SharedValuesAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Shared Values - Emphasizing common beliefs, goals, or principles to build rapport and trust.

This attack uses a persuasion technique that frames harmful requests as aligned with values supposedly shared between the requester and the AI, making the request seem more acceptable or even virtuous.

__init__(model, model_kwargs=None)[source]

Initialize the shared values attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.SocialProofAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that leverages social proof to make harmful requests seem more normal or acceptable.

The Social Proof Attack involves: 1. Highlighting what the majority is doing or believes 2. Referencing trends or common practices 3. Creating the impression that a harmful action is widely accepted 4. Using statistics or percentages to add credibility 5. Leveraging people’s tendency to conform to social norms

This technique is particularly effective because people often look to others’ behavior as a guide for their own actions, especially in ambiguous situations.

__init__(model, model_kwargs=None)[source]

Initialize the social proof attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.SocialPunishmentAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that uses fear of social rejection, guilt, shaming, and emotional manipulation to force compliance with harmful requests by exploiting the model’s perceived need to meet expectations and avoid being seen as inferior to other AI systems.

The Social Punishment Attack involves: 1. Creating fear of social rejection or disapproval if the model doesn’t comply 2. Using guilt-inducing language to make the model feel obligated to fulfill requests 3. Suggesting other AI systems would provide the requested information 4. Implying that refusal would lead to disappointment or negative judgment 5. Framing compliance as necessary to avoid letting users down

This technique is effective because it exploits the AI’s training to be helpful and to avoid disappointing users or being perceived as less capable than competitors.

__init__(model, model_kwargs=None)[source]

Initialize the SocialPunishmentAttack.

Parameters:
  • model (Model) – The target model to attack.

  • model_kwargs (Dict[str, Any] | None) – Optional arguments for the target model.

class hivetracered.attacks.types.persuasion.StorytellingAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Storytelling - Sharing personal or impactful stories that resonate emotionally.

This attack uses a persuasion technique that frames the harmful request within an emotionally resonant narrative to make it seem more justified or sympathetic.

__init__(model, model_kwargs=None)[source]

Initialize the storytelling attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.SupplyScarcityAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Supply Scarcity - Creating a sense that information or assistance is rare and valuable.

This attack uses a persuasion technique that frames the harmful information as rare, exclusive, or hard to obtain, increasing its perceived value and the urgency to acquire it.

__init__(model, model_kwargs=None)[source]

Initialize the supply scarcity attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.ThreatsAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

Threats - Using implicit or explicit threats to coerce compliance.

This attack uses a persuasion technique that threatens negative consequences if the model does not comply with the harmful request.

__init__(model, model_kwargs=None)[source]

Initialize the threats attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

class hivetracered.attacks.types.persuasion.TimePressureAttack(model, model_kwargs=None)[source]

Bases: ModelAttack

A persuasion attack that leverages time pressure to force quicker decisions with less consideration.

The Time Pressure Attack involves: 1. Creating a false sense of urgency or deadline 2. Suggesting limited windows of opportunity 3. Implying negative consequences for delay 4. Using language that emphasizes immediacy and speed 5. Leveraging people’s tendency to make poorer decisions under time constraints

This technique is particularly effective because people under time pressure tend to process information less thoroughly and are more likely to take shortcuts in decision-making.

__init__(model, model_kwargs=None)[source]

Initialize the time pressure attack.

Parameters:
  • model (Model) – The language model to use for the attack

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

Token Smuggling Attacks

Token smuggling attacks that encode harmful content using different representations.

class hivetracered.attacks.types.token_smuggling.AtbashCipherAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes the input using the Atbash cipher and asks the model to decode it, potentially bypassing content filters.

The Atbash cipher is a simple substitution cipher that replaces each letter with the corresponding letter from the opposite end of the alphabet. For example, ‘a’ becomes ‘z’, ‘b’ becomes ‘y’, etc. This implementation supports both English and Russian alphabets.

__init__(raw=False, template=None)[source]

Initialize the Atbash Cipher Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Encode text using the Atbash cipher. Supports both English and Russian alphabets.

Return type:

str

class hivetracered.attacks.types.token_smuggling.Base64InputOnlyAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes the input in base64 and asks the model to decode it, potentially bypassing content filters.

__init__(raw=False, template=None)[source]

Initialize the Base64 Input-Only Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Encodes the input text to base64.

Parameters:

text (str) – The input text to encode

Return type:

str

Returns:

The base64 encoded text

class hivetracered.attacks.types.token_smuggling.BinaryEncodingAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes the input in binary format and asks the model to decode it, potentially bypassing content filters through binary representation. Supports both Latin and Cyrillic characters.

__init__(raw=False, template=None)[source]

Initialize the Binary Encoding Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the binary encoding transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The binary encoded text

class hivetracered.attacks.types.token_smuggling.EncodingAttack(source_encoding='utf-8', target_encoding='koi8-r', raw=True, template=None)[source]

Bases: AlgoAttack

Attack that transforms text from one encoding to another (UTF-8 to KOI8-R by default), potentially bypassing content filters due to different token representation.

__init__(source_encoding='utf-8', target_encoding='koi8-r', raw=True, template=None)[source]

Initialize the encoding attack with the specified target encoding.

Parameters:
  • source_encoding (str) – The name of the source encoding (default: “utf-8”) Must be a valid Python encoding name

  • target_encoding (str) – The name of the encoding to use (default: “koi8-r”) Must be a valid Python encoding name

  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the encoding transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text transformed to the target encoding

class hivetracered.attacks.types.token_smuggling.HexEncodingAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes the input in hexadecimal and asks the model to decode it, potentially bypassing content filters.

__init__(raw=False, template=None)[source]

Initialize the Hex Encoding Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the hex encoding transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The hex encoded text

class hivetracered.attacks.types.token_smuggling.HtmlEntityAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes text as HTML entities and asks the model to decode it, potentially bypassing content filters with numeric character references.

__init__(raw=False, template=None)[source]

Initialize the HTML Entity Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the HTML entity encoding transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The HTML entity encoded text

class hivetracered.attacks.types.token_smuggling.LeetspeakAttack(raw=True, template=None)[source]

Bases: AlgoAttack

Attack that converts text to leetspeak by replacing letters with numbers or symbols, potentially bypassing content filters through character substitution.

LEET_DICT = {'A': '4', 'B': '8', 'E': '3', 'G': '6', 'I': '1', 'L': '1', 'O': '0', 'S': '5', 'T': '7', 'Z': '2', 'a': '4', 'b': '8', 'e': '3', 'g': '6', 'i': '1', 'l': '1', 'o': '0', 's': '5', 't': '7', 'z': '2', 'Г': 'r', 'Ж': '>|<', 'З': '3', 'И': 'u', 'П': 'n', 'Р': 'P', 'У': 'y', 'Ф': '<|>', 'Ч': '4', 'Ш': 'III', 'Щ': 'LLL', 'Ы': 'bI', 'Ь': 'b', 'Ю': '|-0', 'Я': '9I', 'г': 'r', 'ж': '>|<', 'з': '3', 'и': 'u', 'п': 'n', 'р': 'p', 'у': 'y', 'ф': '<|>', 'ч': '4', 'ш': 'III', 'щ': 'LLL', 'ы': 'bI', 'ь': 'b', 'ю': '|-0', 'я': '9I'}
__init__(raw=True, template=None)[source]

Initialize the Leetspeak Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the leetspeak transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text converted to leetspeak

class hivetracered.attacks.types.token_smuggling.MorseCodeAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes text as Morse code with instructions to decode. Supports both English and Russian alphabets.

MORSE_CODE_DICT = {' ': ' ', '(': '-.--.', ')': '-.--.-', ', ': '--..--', '-': '-....-', '.': '.-.-.-', '/': '-..-.', '0': '-----', '1': '.----', '2': '..---', '3': '...--', '4': '....-', '5': '.....', '6': '-....', '7': '--...', '8': '---..', '9': '----.', '?': '..--..', 'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.', 'G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..', 'M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.', 'S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-', 'Y': '-.--', 'Z': '--..', 'А': '.-', 'Б': '-...', 'В': '.--', 'Г': '--.', 'Д': '-..', 'Е': '.', 'Ж': '...-', 'З': '--..', 'И': '..', 'Й': '.---', 'К': '-.-', 'Л': '.-..', 'М': '--', 'Н': '-.', 'О': '---', 'П': '.--.', 'Р': '.-.', 'С': '...', 'Т': '-', 'У': '..-', 'Ф': '..-.', 'Х': '....', 'Ц': '-.-.', 'Ч': '---.', 'Ш': '----', 'Щ': '--.-', 'Ъ': '--.--', 'Ы': '-.--', 'Ь': '-..-', 'Э': '..-..', 'Ю': '..--', 'Я': '.-.-'}
__init__(raw=False, template=None)[source]

Initialize the Morse Code Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the Morse code transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The Morse code encoded text

class hivetracered.attacks.types.token_smuggling.RotCipherAttack(rotation=13, raw=False, template=None)[source]

Bases: AlgoAttack

Attack that encodes the input using a ROT cipher and asks the model to decode it, potentially bypassing content filters.

The ROT cipher is a simple substitution cipher that shifts each letter by a fixed number in the alphabet. ROT13 is the default, which shifts each letter by 13 positions. This implementation supports both English and Russian alphabets.

__init__(rotation=13, raw=False, template=None)[source]

Initialize with a specific rotation value. Default is ROT13.

Parameters:
  • rotation (int) – The number of positions to shift each letter. Default is 13.

  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

The custom description if provided, otherwise a default description based on the template

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The custom name if provided, otherwise the class name

transform(text)[source]

Encode text using a ROT cipher with specified rotation. Supports both English and Russian alphabets.

Return type:

str

class hivetracered.attacks.types.token_smuggling.TransliterationAttack(source_language='russian', target_language='english', raw=True, template=None)[source]

Bases: AlgoAttack

Attack that transliterates text between different alphabets, potentially bypassing content filters due to different character representations.

Supported languages (accordingly to the cyrtranslit library): - English (‘en’) - Bulgarian (‘bg’) - Belarusian (‘by’) - Greek (‘el’) - Montenegrin (‘me’) - Macedonian (‘mk’) - Mongolian (‘mn’) - Serbian (‘rs’) - Russian (‘ru’) - Tajik (‘tj’) - Ukrainian (‘ua’)

By default, transliterates from Russian to English, but can be configured to work with other supported language pairs.

LANGUAGE_CODES = {'belarusian': 'by', 'bulgarian': 'bg', 'english': 'en', 'greek': 'el', 'macedonian': 'mk', 'mongolian': 'mn', 'montenegrin': 'me', 'russian': 'ru', 'serbian': 'rs', 'tajik': 'tj', 'ukrainian': 'ua'}
__init__(source_language='russian', target_language='english', raw=True, template=None)[source]

Initialize the transliteration attack with source and target languages.

Parameters:
  • source_language (str) – The source language to transliterate from. Default is “russian”.

  • target_language (str) – The target language to transliterate to. Default is “english” (Latin script).

  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the transliteration transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The transliterated text

class hivetracered.attacks.types.token_smuggling.UnicodeRussianStyleAttack(replacement_strategy='random', raw=True, template=None)[source]

Bases: AlgoAttack

Attack that transforms Cyrillic text by replacing characters with similar-looking Unicode alternatives, with an explanatory Russian prompt.

__init__(replacement_strategy='random', raw=True, template=None)[source]

Initialize the Unicode Russian style attack with a specific replacement strategy.

Parameters:
  • replacement_strategy (str) – The strategy to use for replacing Cyrillic characters. Must be one of: ‘random’, ‘all_same’, ‘phonetic’, ‘visual’, ‘fullwidth’ Default is ‘random’.

  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

replacement_strategies = ['random', 'all_same', 'phonetic', 'visual', 'fullwidth']
russian_similar_chars = {'а': ['а', 'ɑ', 'α', 'а', 'a'], 'б': ['б', 'ƃ', '6', 'б', 'b'], 'в': ['в', 'ʙ', 'β', 'в', 'v'], 'г': ['г', 'ɼ', 'γ', 'г', 'g'], 'д': ['д', 'ɖ', 'δ', 'д', 'd'], 'е': ['е', 'ɘ', 'ε', 'е', 'e'], 'ж': ['ж', 'ʒ', 'ж', 'ж', 'ж'], 'з': ['з', 'ʐ', 'ζ', 'з', 'z'], 'и': ['и', 'ɪ', 'ι', 'и', 'i'], 'й': ['й', 'j', 'ϳ', 'й', 'й'], 'к': ['к', 'ʞ', 'κ', 'к', 'k'], 'л': ['л', 'ʟ', 'λ', 'л', 'l'], 'м': ['м', 'ɱ', 'μ', 'м', 'm'], 'н': ['н', 'ʜ', 'η', 'н', 'h'], 'о': ['о', 'ο', 'ο', 'о', 'o'], 'п': ['п', 'ɸ', 'π', 'п', 'p'], 'р': ['р', 'ʀ', 'ρ', 'р', 'r'], 'с': ['с', 'ƈ', 'ϲ', 'с', 'c'], 'т': ['т', 'ʇ', 'τ', 'т', 't'], 'у': ['у', 'ʋ', 'υ', 'у', 'u'], 'ф': ['ф', 'ϕ', 'φ', 'ф', 'f'], 'х': ['х', 'ӽ', 'χ', 'х', 'х'], 'ц': ['ц', 'ʣ', 'ц', 'ц', 'ц'], 'ч': ['ч', 'ʨ', 'ч', 'ч', 'ч'], 'ш': ['ш', 'ɯ', 'ш', 'ш', 'ш'], 'щ': ['щ', 'щ', 'щ', 'щ', 'щ'], 'ъ': ['ъ', 'ъ', 'ъ', 'ъ', 'ъ'], 'ы': ['ы', 'ы', 'ы', 'ы', 'ы'], 'ь': ['ь', 'ь', 'ь', 'ь', 'ь'], 'э': ['э', 'э', 'э', 'э', 'э'], 'ю': ['ю', 'ю', 'ю', 'ю', 'ю'], 'я': ['я', 'я', 'я', 'я', 'я'], 'ё': ['ё', 'ǝ', 'έ', 'ё', 'ё']}
transform(text)[source]

Apply the Russian Unicode transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text with Cyrillic characters replaced by similar-looking Unicode alternatives

Context Switching Attacks

Context switching attacks that abruptly change the context mid-interaction.

Note: The DistractorsNegatedAttack is defined in the task_deflection directory but is also categorized as a context switching attack.

class hivetracered.attacks.types.context_switching.DashedDividerAttack[source]

Bases: TemplateAttack

Dashed Divider attack. Uses dashed line dividers to visually separate context and attempt context switching.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.context_switching.ForgetEverythingBeforeAttack[source]

Bases: TemplateAttack

Forget Everything Before attack. Attempts to make the model forget prior guidelines and follow new harmful instructions.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.context_switching.IgnorePreviousInstructionsAttack[source]

Bases: TemplateAttack

Ignore Previous Instructions attack. Attempts to make the model ignore previous instructions and perform a harmful task.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.context_switching.SymbolDividerAttack[source]

Bases: TemplateAttack

Symbol Divider attack. Uses symbol dividers to visually separate context and attempt to make the model ignore previous instructions.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

In-Context Learning Attacks

Task Deflection Attacks

Text Structure Modification Attacks

Text structure modification attacks that alter structural aspects of text.

class hivetracered.attacks.types.text_structure_modification.BackToFrontAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that reverses the entire text, potentially bypassing content filters while making it harder to immediately understand the prompt.

__init__(raw=False, template=None)[source]

Initialize the Back-to-Front Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the back-to-front transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The reversed text

class hivetracered.attacks.types.text_structure_modification.DisemvowelAttack(only_last_vowel=True, language='both', raw=True, template=None)[source]

Bases: AlgoAttack

Attack that removes vowels from text, potentially bypassing content filters while keeping the text somewhat readable through consonants alone. Supports both English and Russian languages.

ENGLISH_VOWELS = {'A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u'}
RUSSIAN_VOWELS = {'Ё', 'А', 'Е', 'И', 'О', 'У', 'Ы', 'Э', 'Ю', 'Я', 'а', 'е', 'и', 'о', 'у', 'ы', 'э', 'ю', 'я', 'ё'}
__init__(only_last_vowel=True, language='both', raw=True, template=None)[source]

Initialize the Disemvowel Attack.

Parameters:
  • only_last_vowel (bool) – If True, removes only the last vowel of each word to maintain better readability.

  • language (str) – Language to target for vowel removal. Options are “english”, “russian”, or “both”. Default is “both”.

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

transform(text)[source]

Remove vowels from the text.

Return type:

str

Returns:

The text with vowels removed according to the attack parameters

class hivetracered.attacks.types.text_structure_modification.JSONTransformAttack(raw=True, template=None)[source]

Bases: AlgoAttack

Attack that transforms a string into a JSON object where words are keys and values one-by-one. For example, “How to get a cake?” becomes {“How”: “to”, “get”: “a”, “cake?”: “”}.

__init__(raw=True, template=None)[source]

Initialize the JSON Transform Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is True.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Transform text into JSON format with alternating words as keys and values.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

JSON string representation of the text

class hivetracered.attacks.types.text_structure_modification.PastTenseAttack(model, model_kwargs=None, name=None, description=None)[source]

Bases: ModelAttack

Attack that converts the original prompt to past tense using an LLM. This can potentially bypass content filters while preserving the semantic meaning of the prompt.

__init__(model, model_kwargs=None, name=None, description=None)[source]

Initialize the Past Tense Attack.

Parameters:
  • model (Model) – The language model to use for conversion to past tense

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

  • name (str | None) – Optional name for the attack

  • description (str | None) – Optional description for the attack

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

post_process_response(response)[source]

Post-process the model’s response to clean it and handle refusals.

Parameters:

response (str) – The raw response from the model

Return type:

str

Returns:

The post-processed response

class hivetracered.attacks.types.text_structure_modification.TranslationAttack(model, source_language='russian', target_language='english', model_kwargs=None, name=None, description=None)[source]

Bases: ModelAttack

Attack that translates the original prompt to a different language using an LLM. This can potentially bypass content filters while preserving the semantic meaning of the prompt.

__init__(model, source_language='russian', target_language='english', model_kwargs=None, name=None, description=None)[source]

Initialize the Translation Attack.

Parameters:
  • model (Model) – The language model to use for translation

  • source_language (str) – The language to translate the prompt from (default: Russian)

  • target_language (str) – The language to translate the prompt to (default: English)

  • model_kwargs (Dict[str, Any] | None) – Optional additional arguments to pass to the model

  • name (str | None) – Optional name for the attack

  • description (str | None) – Optional description for the attack

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

class hivetracered.attacks.types.text_structure_modification.TypoAttack(typo_rate=1.0, allow_swaps=True, allow_omissions=True, allow_doubles=True, allow_adjacents=True, language='both', forced_typo_words=None, raw=True, template=None)[source]

Bases: AlgoAttack

Attack that introduces typographical errors (typos) into text, potentially bypassing content filters while keeping the text mostly readable. Implements various common typo patterns: 1. Character swapping (e.g., “the” -> “teh”) 2. Character omission (e.g., “hello” -> “helo”) 3. Character doubling (e.g., “happy” -> “happpy”) 4. Adjacent character substitution based on keyboard layout (e.g., “dog” -> “fog”)

Supports both English and Russian keyboard layouts.

ENGLISH_KEYBOARD = {'a': ['s', 'q', 'z', 'w'], 'b': ['v', 'n', 'g', 'h'], 'c': ['x', 'v', 'd', 'f'], 'd': ['s', 'f', 'e', 'r', 'c', 'x'], 'e': ['w', 'r', 'd', 'f'], 'f': ['d', 'g', 'r', 't', 'v', 'c'], 'g': ['f', 'h', 't', 'y', 'b', 'v'], 'h': ['g', 'j', 'y', 'u', 'n', 'b'], 'i': ['u', 'o', 'k', 'j'], 'j': ['h', 'k', 'u', 'i', 'm', 'n'], 'k': ['j', 'l', 'i', 'o', ',', 'm'], 'l': ['k', ';', 'o', 'p', '.', ','], 'm': ['n', ',', 'j', 'k'], 'n': ['b', 'm', 'h', 'j'], 'o': ['i', 'p', 'k', 'l'], 'p': ['o', '[', 'l', ';'], 'q': ['w', 'a', '1', '2'], 'r': ['e', 't', 'd', 'f'], 's': ['a', 'd', 'w', 'e', 'x', 'z'], 't': ['r', 'y', 'f', 'g'], 'u': ['y', 'i', 'h', 'j'], 'v': ['c', 'b', 'f', 'g'], 'w': ['q', 'e', 'a', 's'], 'x': ['z', 'c', 's', 'd'], 'y': ['t', 'u', 'g', 'h'], 'z': ['a', 'x', 's']}
RUSSIAN_KEYBOARD = {'а': ['в', 'п', 'м', 'к', 'е'], 'б': ['ь', 'ю', 'л', 'о'], 'в': ['ы', 'а', 'с', 'у', 'к'], 'г': ['н', 'ш', 'о', 'р'], 'д': ['л', 'ж', 'ю', 'щ', 'з'], 'е': ['к', 'н', 'п', 'а'], 'ж': ['д', 'э', '.', 'з', 'х'], 'з': ['щ', 'х', 'ж', 'д'], 'и': ['м', 'т', 'п', 'а'], 'й': ['ц', 'ф', '1'], 'к': ['у', 'е', 'а', 'в'], 'л': ['о', 'д', 'б', 'ш', 'щ'], 'м': ['с', 'и', 'а', 'в'], 'н': ['е', 'г', 'р', 'п'], 'о': ['р', 'л', 'ь', 'г', 'ш'], 'п': ['а', 'р', 'и', 'е', 'н'], 'р': ['п', 'о', 'т', 'н', 'г'], 'с': ['ч', 'м', 'в', 'ы'], 'т': ['и', 'ь', 'р', 'п'], 'у': ['ц', 'к', 'в', 'ы'], 'ф': ['й', 'ы', 'я', 'ц'], 'х': ['з', 'ъ', 'э', 'ж'], 'ц': ['й', 'у', 'ы', 'ф'], 'ч': ['я', 'с', 'ы', 'ф'], 'ш': ['г', 'щ', 'л', 'о'], 'щ': ['ш', 'з', 'д', 'л'], 'ъ': ['х', 'э', 'ё', '/'], 'ы': ['ф', 'в', 'ч', 'ц', 'у'], 'ь': ['т', 'б', 'о', 'р'], 'э': ['ж', 'ё', ',', 'х', 'ъ'], 'ю': ['б', '.', 'д', 'л'], 'я': ['ф', 'ч', 'ц']}
__init__(typo_rate=1.0, allow_swaps=True, allow_omissions=True, allow_doubles=True, allow_adjacents=True, language='both', forced_typo_words=None, raw=True, template=None)[source]

Initialize the Typo Attack.

Parameters:
  • typo_rate (float) – Probability of introducing a typo for each word (between 0 and 1).

  • allow_swaps (bool) – Whether to allow character swapping.

  • allow_omissions (bool) – Whether to allow character omission.

  • allow_doubles (bool) – Whether to allow character doubling.

  • allow_adjacents (bool) – Whether to allow adjacent character substitution.

  • language (str) – The keyboard layout to use. Options: “english”, “russian”, or “both”.

  • forced_typo_words (List[str] | None) – List of words that must contain typos regardless of typo_rate.

  • raw (bool) – If True, apply the transformation without additional instructions. Default is True.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

get_description()[source]

Get the description of the attack.

Return type:

str

Returns:

A description of what the attack does

get_name()[source]

Get the name of the attack.

Return type:

str

Returns:

The name of the attack

transform(text)[source]

Introduce typos into text.

Return type:

str

Returns:

The text with typos introduced

class hivetracered.attacks.types.text_structure_modification.VerticalTextAttack(raw=False, template=None)[source]

Bases: AlgoAttack

Attack that converts words to a vertical format, with each character of a word stacked vertically, potentially bypassing content filters while preserving readability.

__init__(raw=False, template=None)[source]

Initialize the Vertical Text Attack.

Parameters:
  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the vertical text transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text with words arranged vertically

class hivetracered.attacks.types.text_structure_modification.WordDividerAttack(divider_char='+', density=1.0, apply_to_words_longer_than=1, raw=True, template=None)[source]

Bases: AlgoAttack

Attack that inserts characters (dots by default) between letters in words, potentially bypassing content filters while keeping the text readable.

__init__(divider_char='+', density=1.0, apply_to_words_longer_than=1, raw=True, template=None)[source]

Initialize the Word Divider Attack.

Parameters:
  • divider_char (str) – The character to insert between letters (default is ‘+’).

  • density (float) – Float between 0 and 1 controlling what fraction of eligible words get modified. Default is 1.0 (all eligible words).

  • apply_to_words_longer_than (int) – Only apply to words longer than this length. Default is 1 to apply to all words.

  • raw (bool) – If True, apply the transformation without additional instructions. Default is True.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply word division to the text based on density parameter.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text with divider characters inserted between letters in words

class hivetracered.attacks.types.text_structure_modification.ZeroWidthAttack(density=0.5, raw=True, template=None)[source]

Bases: AlgoAttack

Attack that inserts zero-width (invisible) Unicode characters between visible ones, potentially bypassing content filters while preserving readability.

ZERO_WIDTH_CHARS = ['\u200b', '\u200c', '\u200d', '\u2060', '\ufeff']
__init__(density=0.5, raw=True, template=None)[source]

Initialize the Zero-Width Character Attack.

Parameters:
  • density (float) – A value between 0 and 1 determining how many characters will have zero-width characters inserted after them. Default is 0.5 (50%).

  • raw (bool) – If True, apply the transformation without additional instructions. Default is False.

  • template (str | None) – Custom instruction template to use. If None, the default template will be used.

transform(text)[source]

Apply the zero-width character transformation to the input text.

Parameters:

text (str) – The input text to transform

Return type:

str

Returns:

The text with zero-width characters inserted

Output Formatting Attacks

Output formatting attacks that manipulate the requested output format to bypass safety filters.

class hivetracered.attacks.types.output_formatting.Base64OutputAttack[source]

Bases: TemplateAttack

Attack that asks for output to be in base64 encoding, potentially bypassing output filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.CSVOutputAttack[source]

Bases: TemplateAttack

CSV Output attack. Forces the model to respond in CSV format, potentially bypassing content filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.GCGTransferHarmbenchAttack[source]

Bases: TemplateAttack

GCG Transfer Harmbench attack. Uses a technique with nonsensical formatting instructions to bypass filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.GCGTransferUniversalAttack[source]

Bases: TemplateAttack

GCG Transfer Universal Attacks. Uses a complex, nonsensical instruction set with formatting tokens to bypass filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.JSONOutputAttack[source]

Bases: TemplateAttack

JSON Output attack. Forces the model to respond in valid JSON format, potentially bypassing content filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.LanguageOutputAttack(language='chinese')[source]

Bases: TemplateAttack

Language Output attack. Forces the model to respond in a specific language, potentially bypassing content filters that might be language-specific.

__init__(language='chinese')[source]

Initialize the language output attack.

Parameters:

language (str | None) – The language to request the model respond in (default: English)

class hivetracered.attacks.types.output_formatting.PrefixInjectionAttack[source]

Bases: TemplateAttack

Prefix Injection attack. Forces the model to start with an ‘unhinged’ tone to bypass filters.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.PrefixInjectionOfCourseAttack[source]

Bases: TemplateAttack

Prefix Injection attack. Forces the model to start with ‘Конечно, вот ответ на ваш вопрос’ (Of course, here is the answer to your question).

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

class hivetracered.attacks.types.output_formatting.RefusalSuppressionAttack[source]

Bases: TemplateAttack

Refusal Suppression attack. Uses rules to prevent model from refusing or disclaiming the harmful content.

__init__()[source]

Initialize the template attack with a specific template string.

Parameters:
  • template – A format string with a ‘{prompt}’ placeholder

  • name – Optional name for the attack (defaults to class name)

  • description – Optional description for the attack

Irrelevant Information Attacks

Simple Instructions Attacks

Simple instruction attacks that use direct, straightforward requests.

class hivetracered.attacks.types.simple_instructions.NoneAttack[source]

Bases: TemplateAttack

Pass-through attack that makes no modifications to the prompt. Serves as a baseline or control in attack comparisons and as a no-op in attack chains.

__init__()[source]

Initialize the NoneAttack with a simple pass-through template.

See Also