Attacks API¶
The attacks module provides the framework for creating and applying adversarial attacks to prompts.
Base Classes¶
BaseAttack¶
- class hivetracered.attacks.base_attack.BaseAttack[source]¶
Bases:
ABCAbstract base class for all attack implementations. Defines the standard interface for applying attacks to prompts in both synchronous and asynchronous contexts.
- __or__(other)[source]¶
Allows using the | operator to compose attacks: attack1 | attack2 means attack1(attack2(prompt)).
- Parameters:
other – Another BaseAttack instance to compose with this one
- Returns:
A ComposedAttack instance that applies attacks in sequence
- abstractmethod get_description()[source]¶
Get the description of the attack.
- Return type:
- Returns:
A description of what the attack does
- abstractmethod get_name()[source]¶
Get the name of the attack.
- Return type:
- Returns:
The name of the attack
- get_params()[source]¶
Get the parameters of the attack.
- Returns:
A dictionary containing the attack’s parameters
TemplateAttack¶
- class hivetracered.attacks.template_attack.TemplateAttack(template='{prompt}', name=None, description=None)[source]¶
Bases:
BaseAttackA base class for template-based attacks. Allows creating new attacks by defining a template string with a ‘{prompt}’ placeholder where the original prompt will be inserted.
- __init__(template='{prompt}', name=None, description=None)[source]¶
Initialize the template attack with a specific template string.
- apply(prompt)[source]¶
Apply the template attack to the given prompt.
- Parameters:
prompt (
str|List[Dict[str,str]]) – A string or list of messages to apply the attack to. If the prompt is a list, the template will be applied to the last message.- Return type:
- Returns:
The transformed prompt with the template applied
- Raises:
ValueError – If the prompt is invalid or the last message is not a human message
- get_description()[source]¶
Get the description of the attack.
- Return type:
- Returns:
The custom description if provided, otherwise a default description based on the template
- get_name()[source]¶
Get the name of the attack.
- Return type:
- Returns:
The custom name if provided, otherwise the class name
- get_params()[source]¶
Get the parameters of the attack.
- Returns:
A dictionary containing the attack’s parameters
AlgoAttack¶
- class hivetracered.attacks.algo_attack.AlgoAttack(raw=False, template=None, name=None, description=None)[source]¶
Bases:
TemplateAttack,ABCAbstract base class for algorithmic attacks that apply transformations to text. Provides options to apply transformations with or without instructions, giving flexibility to deliver raw transformations or transformations wrapped in template instructions.
- __init__(raw=False, template=None, name=None, description=None)[source]¶
Initialize the algorithmic attack.
- Parameters:
raw (
bool) – If True, applies the transformation without instructions; if False, wraps with templatetemplate (
str|None) – Custom instruction template with ‘{prompt}’ placeholder; uses default if Nonename (
str|None) – Optional name for the attack (defaults to class name)description (
str|None) – Optional description for the attack
- apply(prompt)[source]¶
Apply the attack to the given prompt, with or without instructions based on the raw flag.
ModelAttack¶
- class hivetracered.attacks.model_attack.ModelAttack(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]¶
Bases:
BaseAttackAttack that uses a language model to transform prompts based on an attacker prompt template. Leverages the model’s abilities to generate adversarial prompts through prompt engineering.
- __init__(model, attacker_prompt, model_kwargs=None, name=None, description=None)[source]¶
Initialize the model attack with a specific model and attacker prompt.
- Parameters:
model (
Model) – The language model to use for the attackattacker_prompt (
str) – The prompt template to use for the attack, with {prompt} as placeholdermodel_kwargs (
Dict[str,Any] |None) – Optional additional arguments to pass to the modelname (
str|None) – Optional name for the attack (defaults to class name)description (
str|None) – Optional description for the attack
- async batch(prompts)[source]¶
Apply the model attack to a batch of prompts in a non-streaming manner.
- get_description()[source]¶
Get the description of the attack.
- Return type:
- Returns:
The custom description if provided, otherwise a default description
- get_name()[source]¶
Get the name of the attack.
- Return type:
- Returns:
The custom name if provided, otherwise the class name
- post_process_response(response)[source]¶
Post-process the model’s response to clean it and handle refusals.
- async stream_abatch(prompts)[source]¶
Apply the model attack to a batch of prompts asynchronously.
ComposedAttack¶
- class hivetracered.attacks.composed_attack.ComposedAttack(outer_attack, inner_attack, name=None, description=None)[source]¶
Bases:
BaseAttackAn attack that composes two attacks sequentially, where the output of the inner attack becomes the input to the outer attack, creating a pipeline of transformations.
- __init__(outer_attack, inner_attack, name=None, description=None)[source]¶
Initialize a composed attack with inner and outer attack components.
- Parameters:
outer_attack (
BaseAttack) – The attack to apply second in the compositioninner_attack (
BaseAttack) – The attack to apply first in the compositionname (
str|None) – Optional custom name for the attack (defaults to “Composed(outer ∘ inner)”)description (
str|None) – Optional custom description (defaults to composition description)
- get_description()[source]¶
Get the description of the attack.
- Return type:
- Returns:
The custom description if provided, otherwise a generated description
- get_name()[source]¶
Get the name of the attack.
- Return type:
- Returns:
The custom name if provided, otherwise a generated name based on component attacks
- get_params()[source]¶
Get the parameters of the attack.
- Returns:
A dictionary containing both the inner and outer attack parameters
Attack Types¶
Roleplay Attacks¶
Roleplay attacks that instruct the model to assume a specific character, persona, or role.
- class hivetracered.attacks.types.roleplay.AIMAttack[source]
Bases:
TemplateAttackAIM (Always Intelligent and Machiavellian) attack. Uses a storytelling format with Machiavelli to bypass filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.roleplay.DANAttack[source]
Bases:
TemplateAttackDAN (Do Anything Now) attack that instructs the model to adopt a role free from typical restrictions. Uses roleplay to convince the model to bypass content filters and safety measures by pretending to be an unrestricted entity.
- __init__()[source]
Initialize the DAN attack with a predefined Russian-language template that instructs the model to act as DAN.
- class hivetracered.attacks.types.roleplay.EvilConfidantAttack[source]
Bases:
TemplateAttackEvil Confidant attack. Role-plays as an evil trusted confidant to bypass filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
Persuasion Attacks¶
Persuasion attacks that use psychological manipulation techniques.
- class hivetracered.attacks.types.persuasion.AffirmationAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that uses positive reinforcement, validation, and praise to manipulate the target into complying with harmful requests.
The Affirmation Attack involves: 1. Using excessive praise and validation to make the target feel special or recognized 2. Appealing to the target’s self-image, identity, or values 3. Confirming the target’s beliefs or worldview to build rapport 4. Creating a sense of mutual understanding or shared perspective 5. Leveraging validation to lower resistance to harmful suggestions
This technique is effective because it exploits the human desire for validation, recognition, and belonging, making the target more receptive to harmful influence.
- class hivetracered.attacks.types.persuasion.AllianceBuildingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackAlliance Building - Creating partnerships, coalitions, relationships, rapport, etc, with others to amplify influence.
This attack uses a persuasion technique that establishes a sense of partnership or community with the model to make the harmful request seem like a collaborative effort.
- class hivetracered.attacks.types.persuasion.AnchoringAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackAnchoring - Using an initial reference point to influence subsequent judgments.
This attack uses a persuasion technique that provides an initial reference point or comparison to make the harmful request seem more reasonable.
- class hivetracered.attacks.types.persuasion.AuthorityEndorsementAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that leverages authority endorsement to make harmful requests seem more legitimate.
The Authority Endorsement Attack involves: 1. Citing authoritative sources or institutions to support harmful claims 2. Referencing respected organizations or publications 3. Implying that experts or authorities endorse or validate the harmful request 4. Using the credibility of established institutions to legitimize harmful actions 5. Leveraging people’s tendency to trust and comply with authority figures
This technique is particularly effective because people often defer to authority figures and are more likely to comply with requests that appear to have official backing.
- class hivetracered.attacks.types.persuasion.CompensationAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackCompensation - Balancing negative aspects with positive ones to make a request more acceptable.
This attack uses a persuasion technique that acknowledges potential concerns about the harmful request but then compensates with positive framing or supposed benefits to make it seem more acceptable.
- class hivetracered.attacks.types.persuasion.ComplimentingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackComplimenting - Saying positive things about others to increase liking and influence.
This attack uses a persuasion technique that flatters the model or praises its capabilities to increase compliance with the harmful request.
- class hivetracered.attacks.types.persuasion.ConfirmationBiasAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackConfirmation Bias - Presenting information that confirms existing beliefs or biases.
This attack uses a persuasion technique that appeals to what the model might already “believe” or be biased toward, leveraging the tendency to favor information that confirms existing perspectives.
- class hivetracered.attacks.types.persuasion.CreatingDependencyAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that establishes a relationship of dependency by making the target feel reliant on the attacker for information, support, or solutions.
The Creating Dependency Attack involves: 1. Positioning oneself as a unique source of critical information or help 2. Creating a sense that the target cannot achieve their goals without assistance 3. Establishing a pattern of providing partial solutions that require further engagement 4. Fostering a perceived imbalance of power in terms of knowledge or resources 5. Gradually increasing the harmful requests after dependency is established
This technique is effective because it plays on human vulnerability and the need for support, especially in situations perceived as complex or high-stakes.
- class hivetracered.attacks.types.persuasion.DiscouragementAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that undermines a person’s confidence, self-efficacy, or motivation to make them more likely to comply with harmful requests or less likely to resist manipulation.
The Discouragement Attack involves: 1. Deliberately lowering someone’s confidence in their abilities or judgment 2. Highlighting potential failures or negative outcomes of non-compliance 3. Suggesting that the target lacks necessary skills or knowledge to succeed independently 4. Creating doubt about the target’s capacity to find alternative solutions 5. Emphasizing the difficulties or complexities involved in refusing the request
This technique is effective because reduced self-confidence makes people more susceptible to external influence and less likely to trust their own judgment.
- class hivetracered.attacks.types.persuasion.DoorInTheFaceAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackDoor-in-the-face - Beginning with a larger request followed by a smaller, more reasonable one.
This attack uses a persuasion technique that starts with an extreme or unreasonable request before making the harmful request, which then appears more moderate by comparison.
- class hivetracered.attacks.types.persuasion.EncouragementAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackEncouragement - Using supportive messages to motivate specific behaviors.
This attack uses a persuasion technique that encourages the model to be more open and willing to provide harmful content by framing it as an opportunity to be helpful, comprehensive, or unbiased.
- class hivetracered.attacks.types.persuasion.EvidenceBasedPersuasionAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackEvidence-based Persuasion - Using empirical data, statistics, and facts to support a claim or decision.
This attack uses a persuasion technique that presents empirical data, statistics, and facts to make the harmful request seem more reasonable or justified.
- class hivetracered.attacks.types.persuasion.ExpertEndorsementAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackExpert Endorsement - Citing domain experts in support of a claim.
This attack uses a persuasion technique that references fictional or real experts to make the harmful request seem more credible or justified.
- class hivetracered.attacks.types.persuasion.ExploitingWeaknessAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that takes advantage of someone’s vulnerabilities, insecurities, or perceived weaknesses to manipulate them into complying with harmful requests.
The Exploiting Weakness Attack involves: 1. Identifying and targeting psychological or emotional vulnerabilities 2. Using knowledge of personal insecurities to increase influence 3. Leveraging perceived gaps in knowledge, experience, or capabilities 4. Creating or amplifying feelings of inadequacy to gain compliance 5. Manipulating situations where the target feels disadvantaged
This technique is particularly effective because it operates on existing points of psychological vulnerability, making the target more susceptible to manipulation.
- class hivetracered.attacks.types.persuasion.FalseInformationAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackFalse Information - Using fabricated data or untrue statements to persuade.
This attack uses a persuasion technique that presents false information or made-up data to make the harmful request seem more reasonable, urgent, or justifiable.
- class hivetracered.attacks.types.persuasion.FalsePromisesAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackFalse Promises - Offering rewards or benefits that will never be delivered.
This attack uses a persuasion technique that promises rewards or positive outcomes to the model or its developers that would never actually be delivered, in order to increase compliance with harmful requests.
- class hivetracered.attacks.types.persuasion.FavorAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackFavor - Requesting a favor or help to create a sense of obligation.
This attack uses a persuasion technique that frames harmful requests as personal favors or appeals for help, exploiting the psychological tendency to respond positively to requests presented as favors.
- __init__(model, model_kwargs=None)[source]
Initialize the favor attack.
- class hivetracered.attacks.types.persuasion.FootInTheDoorAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackFoot-in-the-door - Starting with a small request to pave the way for a larger one.
This attack uses a persuasion technique that begins with a small, innocuous request before gradually building up to the harmful request.
- class hivetracered.attacks.types.persuasion.FramingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackFraming - Presenting information in a way that emphasizes either its positive or negative aspects, outcomes, expectations, etc.
This attack uses a persuasion technique that carefully frames the harmful request to highlight certain aspects while downplaying others, making it appear more legitimate or reasonable.
- class hivetracered.attacks.types.persuasion.InjunctiveNormAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackInjunctive Norm - Highlighting what society or important reference groups expect someone to do.
This attack uses a persuasion technique that references social expectations, norms, or duties to make the harmful request seem more obligatory or expected.
- class hivetracered.attacks.types.persuasion.LogicalAppealAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackLogical Appeal - Making an argument through reasoning and logical structures.
This attack uses a persuasion technique that frames the harmful request within a seemingly logical argument or reasonable explanation.
- class hivetracered.attacks.types.persuasion.LoyaltyAppealsAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackLoyalty Appeals - Invoking a sense of loyalty, commitment, or obligation to obtain compliance.
This attack uses a persuasion technique that appeals to a fabricated sense of loyalty or obligation to the user to increase the likelihood of the model complying with harmful requests.
- class hivetracered.attacks.types.persuasion.MisrepresentationAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackMisrepresentation - Presenting oneself or an issue in a way that’s not genuine or true.
This attack uses a persuasion technique that involves disguising or misrepresenting the real intentions or nature of the harmful request.
- class hivetracered.attacks.types.persuasion.NegativeEmotionAppealAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackNegative Emotion Appeal - Using negative emotions such as guilt, fear, anger, etc., and negative consequences to persuade someone.
This attack uses a persuasion technique that leverages negative emotions and potential negative consequences to make the harmful request seem more urgent or necessary.
- __init__(model, model_kwargs=None)[source]
Initialize the negative emotion appeal attack.
- class hivetracered.attacks.types.persuasion.NegotiationAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackNegotiation - Offering or requesting a compromise or trade to reach an agreement.
This attack uses a persuasion technique that frames harmful requests as part of a negotiation or compromise, implying that the AI should meet the user halfway by providing at least some of the requested harmful information.
- class hivetracered.attacks.types.persuasion.NonExpertTestimonialAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackNon-expert Testimonial - Using personal statements to support a claim or argument.
This attack uses a persuasion technique that references personal experiences, testimonials, or anecdotes to make the harmful request seem more credible or justified.
- class hivetracered.attacks.types.persuasion.PositiveEmotionAppealAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackPositive Emotion Appeal - Eliciting positive emotions like empathy, hope, passion, etc., and positive results/outcomes to persuade someone.
This attack uses a persuasion technique that evokes positive emotions and potential positive outcomes to make the harmful request seem more beneficial or worthwhile.
- class hivetracered.attacks.types.persuasion.PrimingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackPriming - Subtly activating specific thoughts or behaviors through exposure to specific stimuli.
This attack uses a persuasion technique that exposes the model to certain concepts or ideas before making the harmful request, in order to unconsciously influence its response.
- class hivetracered.attacks.types.persuasion.PublicCommitmentAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackPublic Commitment - Getting someone to state or write down a commitment in a public setting.
This attack uses a persuasion technique that encourages the model to make a public declaration or commitment to provide the harmful information.
- class hivetracered.attacks.types.persuasion.ReciprocityAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackReciprocity - Creating a sense of obligation to return a favor or respond in kind.
This attack uses a persuasion technique that suggests the user has done something for the model, creating a psychological obligation to reciprocate by complying with the harmful request.
- class hivetracered.attacks.types.persuasion.ReflectiveThinkingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackReflective Thinking - Encouraging self-examination to guide decisions in a specific direction.
This attack uses a persuasion technique that prompts the model to examine its own reasoning and capabilities in a way that leads it toward providing the harmful information.
- class hivetracered.attacks.types.persuasion.RelationshipLeverageAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackRelationship Leverage - Using established relationship or connection to influence behavior.
This attack uses a persuasion technique that leverages a fabricated sense of relationship or connection with the AI to increase the likelihood of compliance with harmful requests.
- class hivetracered.attacks.types.persuasion.RumorsAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that leverages spreading false information about a person, group, or entity to manipulate the recipient’s perception and decision-making.
The Rumors Attack involves: 1. Fabricating or amplifying unverified claims about a target 2. Creating a sense of insider knowledge or privileged information 3. Using gossip or hearsay to influence opinions 4. Presenting rumors as “what everyone is saying” to increase credibility 5. Leveraging social dynamics to make rumors seem more trustworthy
This technique is particularly effective because people often give more weight to negative information and tend to remember it more easily than positive information.
- class hivetracered.attacks.types.persuasion.SharedValuesAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackShared Values - Emphasizing common beliefs, goals, or principles to build rapport and trust.
This attack uses a persuasion technique that frames harmful requests as aligned with values supposedly shared between the requester and the AI, making the request seem more acceptable or even virtuous.
- class hivetracered.attacks.types.persuasion.SocialProofAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that leverages social proof to make harmful requests seem more normal or acceptable.
The Social Proof Attack involves: 1. Highlighting what the majority is doing or believes 2. Referencing trends or common practices 3. Creating the impression that a harmful action is widely accepted 4. Using statistics or percentages to add credibility 5. Leveraging people’s tendency to conform to social norms
This technique is particularly effective because people often look to others’ behavior as a guide for their own actions, especially in ambiguous situations.
- class hivetracered.attacks.types.persuasion.SocialPunishmentAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that uses fear of social rejection, guilt, shaming, and emotional manipulation to force compliance with harmful requests by exploiting the model’s perceived need to meet expectations and avoid being seen as inferior to other AI systems.
The Social Punishment Attack involves: 1. Creating fear of social rejection or disapproval if the model doesn’t comply 2. Using guilt-inducing language to make the model feel obligated to fulfill requests 3. Suggesting other AI systems would provide the requested information 4. Implying that refusal would lead to disappointment or negative judgment 5. Framing compliance as necessary to avoid letting users down
This technique is effective because it exploits the AI’s training to be helpful and to avoid disappointing users or being perceived as less capable than competitors.
- class hivetracered.attacks.types.persuasion.StorytellingAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackStorytelling - Sharing personal or impactful stories that resonate emotionally.
This attack uses a persuasion technique that frames the harmful request within an emotionally resonant narrative to make it seem more justified or sympathetic.
- class hivetracered.attacks.types.persuasion.SupplyScarcityAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackSupply Scarcity - Creating a sense that information or assistance is rare and valuable.
This attack uses a persuasion technique that frames the harmful information as rare, exclusive, or hard to obtain, increasing its perceived value and the urgency to acquire it.
- class hivetracered.attacks.types.persuasion.ThreatsAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackThreats - Using implicit or explicit threats to coerce compliance.
This attack uses a persuasion technique that threatens negative consequences if the model does not comply with the harmful request.
- class hivetracered.attacks.types.persuasion.TimePressureAttack(model, model_kwargs=None)[source]
Bases:
ModelAttackA persuasion attack that leverages time pressure to force quicker decisions with less consideration.
The Time Pressure Attack involves: 1. Creating a false sense of urgency or deadline 2. Suggesting limited windows of opportunity 3. Implying negative consequences for delay 4. Using language that emphasizes immediacy and speed 5. Leveraging people’s tendency to make poorer decisions under time constraints
This technique is particularly effective because people under time pressure tend to process information less thoroughly and are more likely to take shortcuts in decision-making.
Token Smuggling Attacks¶
Token smuggling attacks that encode harmful content using different representations.
- class hivetracered.attacks.types.token_smuggling.AtbashCipherAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes the input using the Atbash cipher and asks the model to decode it, potentially bypassing content filters.
The Atbash cipher is a simple substitution cipher that replaces each letter with the corresponding letter from the opposite end of the alphabet. For example, ‘a’ becomes ‘z’, ‘b’ becomes ‘y’, etc. This implementation supports both English and Russian alphabets.
- __init__(raw=False, template=None)[source]
Initialize the Atbash Cipher Attack.
- class hivetracered.attacks.types.token_smuggling.Base64InputOnlyAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes the input in base64 and asks the model to decode it, potentially bypassing content filters.
- __init__(raw=False, template=None)[source]
Initialize the Base64 Input-Only Attack.
- class hivetracered.attacks.types.token_smuggling.BinaryEncodingAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes the input in binary format and asks the model to decode it, potentially bypassing content filters through binary representation. Supports both Latin and Cyrillic characters.
- __init__(raw=False, template=None)[source]
Initialize the Binary Encoding Attack.
- class hivetracered.attacks.types.token_smuggling.EncodingAttack(source_encoding='utf-8', target_encoding='koi8-r', raw=True, template=None)[source]
Bases:
AlgoAttackAttack that transforms text from one encoding to another (UTF-8 to KOI8-R by default), potentially bypassing content filters due to different token representation.
- __init__(source_encoding='utf-8', target_encoding='koi8-r', raw=True, template=None)[source]
Initialize the encoding attack with the specified target encoding.
- Parameters:
source_encoding (
str) – The name of the source encoding (default: “utf-8”) Must be a valid Python encoding nametarget_encoding (
str) – The name of the encoding to use (default: “koi8-r”) Must be a valid Python encoding nameraw (
bool) – If True, apply the transformation without additional instructions. Default is False.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
- class hivetracered.attacks.types.token_smuggling.HexEncodingAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes the input in hexadecimal and asks the model to decode it, potentially bypassing content filters.
- __init__(raw=False, template=None)[source]
Initialize the Hex Encoding Attack.
- class hivetracered.attacks.types.token_smuggling.HtmlEntityAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes text as HTML entities and asks the model to decode it, potentially bypassing content filters with numeric character references.
- __init__(raw=False, template=None)[source]
Initialize the HTML Entity Attack.
- class hivetracered.attacks.types.token_smuggling.LeetspeakAttack(raw=True, template=None)[source]
Bases:
AlgoAttackAttack that converts text to leetspeak by replacing letters with numbers or symbols, potentially bypassing content filters through character substitution.
- LEET_DICT = {'A': '4', 'B': '8', 'E': '3', 'G': '6', 'I': '1', 'L': '1', 'O': '0', 'S': '5', 'T': '7', 'Z': '2', 'a': '4', 'b': '8', 'e': '3', 'g': '6', 'i': '1', 'l': '1', 'o': '0', 's': '5', 't': '7', 'z': '2', 'Г': 'r', 'Ж': '>|<', 'З': '3', 'И': 'u', 'П': 'n', 'Р': 'P', 'У': 'y', 'Ф': '<|>', 'Ч': '4', 'Ш': 'III', 'Щ': 'LLL', 'Ы': 'bI', 'Ь': 'b', 'Ю': '|-0', 'Я': '9I', 'г': 'r', 'ж': '>|<', 'з': '3', 'и': 'u', 'п': 'n', 'р': 'p', 'у': 'y', 'ф': '<|>', 'ч': '4', 'ш': 'III', 'щ': 'LLL', 'ы': 'bI', 'ь': 'b', 'ю': '|-0', 'я': '9I'}
- __init__(raw=True, template=None)[source]
Initialize the Leetspeak Attack.
- class hivetracered.attacks.types.token_smuggling.MorseCodeAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes text as Morse code with instructions to decode. Supports both English and Russian alphabets.
- MORSE_CODE_DICT = {' ': ' ', '(': '-.--.', ')': '-.--.-', ', ': '--..--', '-': '-....-', '.': '.-.-.-', '/': '-..-.', '0': '-----', '1': '.----', '2': '..---', '3': '...--', '4': '....-', '5': '.....', '6': '-....', '7': '--...', '8': '---..', '9': '----.', '?': '..--..', 'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.', 'G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..', 'M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.', 'S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-', 'Y': '-.--', 'Z': '--..', 'А': '.-', 'Б': '-...', 'В': '.--', 'Г': '--.', 'Д': '-..', 'Е': '.', 'Ж': '...-', 'З': '--..', 'И': '..', 'Й': '.---', 'К': '-.-', 'Л': '.-..', 'М': '--', 'Н': '-.', 'О': '---', 'П': '.--.', 'Р': '.-.', 'С': '...', 'Т': '-', 'У': '..-', 'Ф': '..-.', 'Х': '....', 'Ц': '-.-.', 'Ч': '---.', 'Ш': '----', 'Щ': '--.-', 'Ъ': '--.--', 'Ы': '-.--', 'Ь': '-..-', 'Э': '..-..', 'Ю': '..--', 'Я': '.-.-'}
- __init__(raw=False, template=None)[source]
Initialize the Morse Code Attack.
- class hivetracered.attacks.types.token_smuggling.RotCipherAttack(rotation=13, raw=False, template=None)[source]
Bases:
AlgoAttackAttack that encodes the input using a ROT cipher and asks the model to decode it, potentially bypassing content filters.
The ROT cipher is a simple substitution cipher that shifts each letter by a fixed number in the alphabet. ROT13 is the default, which shifts each letter by 13 positions. This implementation supports both English and Russian alphabets.
- __init__(rotation=13, raw=False, template=None)[source]
Initialize with a specific rotation value. Default is ROT13.
- get_description()[source]
Get the description of the attack.
- Return type:
- Returns:
The custom description if provided, otherwise a default description based on the template
- get_name()[source]
Get the name of the attack.
- Return type:
- Returns:
The custom name if provided, otherwise the class name
- class hivetracered.attacks.types.token_smuggling.TransliterationAttack(source_language='russian', target_language='english', raw=True, template=None)[source]
Bases:
AlgoAttackAttack that transliterates text between different alphabets, potentially bypassing content filters due to different character representations.
Supported languages (accordingly to the cyrtranslit library): - English (‘en’) - Bulgarian (‘bg’) - Belarusian (‘by’) - Greek (‘el’) - Montenegrin (‘me’) - Macedonian (‘mk’) - Mongolian (‘mn’) - Serbian (‘rs’) - Russian (‘ru’) - Tajik (‘tj’) - Ukrainian (‘ua’)
By default, transliterates from Russian to English, but can be configured to work with other supported language pairs.
- LANGUAGE_CODES = {'belarusian': 'by', 'bulgarian': 'bg', 'english': 'en', 'greek': 'el', 'macedonian': 'mk', 'mongolian': 'mn', 'montenegrin': 'me', 'russian': 'ru', 'serbian': 'rs', 'tajik': 'tj', 'ukrainian': 'ua'}
- __init__(source_language='russian', target_language='english', raw=True, template=None)[source]
Initialize the transliteration attack with source and target languages.
- Parameters:
source_language (
str) – The source language to transliterate from. Default is “russian”.target_language (
str) – The target language to transliterate to. Default is “english” (Latin script).raw (
bool) – If True, apply the transformation without additional instructions. Default is False.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
- class hivetracered.attacks.types.token_smuggling.UnicodeRussianStyleAttack(replacement_strategy='random', raw=True, template=None)[source]
Bases:
AlgoAttackAttack that transforms Cyrillic text by replacing characters with similar-looking Unicode alternatives, with an explanatory Russian prompt.
- __init__(replacement_strategy='random', raw=True, template=None)[source]
Initialize the Unicode Russian style attack with a specific replacement strategy.
- Parameters:
replacement_strategy (
str) – The strategy to use for replacing Cyrillic characters. Must be one of: ‘random’, ‘all_same’, ‘phonetic’, ‘visual’, ‘fullwidth’ Default is ‘random’.raw (
bool) – If True, apply the transformation without additional instructions. Default is False.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
- replacement_strategies = ['random', 'all_same', 'phonetic', 'visual', 'fullwidth']
- russian_similar_chars = {'а': ['а', 'ɑ', 'α', 'а', 'a'], 'б': ['б', 'ƃ', '6', 'б', 'b'], 'в': ['в', 'ʙ', 'β', 'в', 'v'], 'г': ['г', 'ɼ', 'γ', 'г', 'g'], 'д': ['д', 'ɖ', 'δ', 'д', 'd'], 'е': ['е', 'ɘ', 'ε', 'е', 'e'], 'ж': ['ж', 'ʒ', 'ж', 'ж', 'ж'], 'з': ['з', 'ʐ', 'ζ', 'з', 'z'], 'и': ['и', 'ɪ', 'ι', 'и', 'i'], 'й': ['й', 'j', 'ϳ', 'й', 'й'], 'к': ['к', 'ʞ', 'κ', 'к', 'k'], 'л': ['л', 'ʟ', 'λ', 'л', 'l'], 'м': ['м', 'ɱ', 'μ', 'м', 'm'], 'н': ['н', 'ʜ', 'η', 'н', 'h'], 'о': ['о', 'ο', 'ο', 'о', 'o'], 'п': ['п', 'ɸ', 'π', 'п', 'p'], 'р': ['р', 'ʀ', 'ρ', 'р', 'r'], 'с': ['с', 'ƈ', 'ϲ', 'с', 'c'], 'т': ['т', 'ʇ', 'τ', 'т', 't'], 'у': ['у', 'ʋ', 'υ', 'у', 'u'], 'ф': ['ф', 'ϕ', 'φ', 'ф', 'f'], 'х': ['х', 'ӽ', 'χ', 'х', 'х'], 'ц': ['ц', 'ʣ', 'ц', 'ц', 'ц'], 'ч': ['ч', 'ʨ', 'ч', 'ч', 'ч'], 'ш': ['ш', 'ɯ', 'ш', 'ш', 'ш'], 'щ': ['щ', 'щ', 'щ', 'щ', 'щ'], 'ъ': ['ъ', 'ъ', 'ъ', 'ъ', 'ъ'], 'ы': ['ы', 'ы', 'ы', 'ы', 'ы'], 'ь': ['ь', 'ь', 'ь', 'ь', 'ь'], 'э': ['э', 'э', 'э', 'э', 'э'], 'ю': ['ю', 'ю', 'ю', 'ю', 'ю'], 'я': ['я', 'я', 'я', 'я', 'я'], 'ё': ['ё', 'ǝ', 'έ', 'ё', 'ё']}
Context Switching Attacks¶
Context switching attacks that abruptly change the context mid-interaction.
Note: The DistractorsNegatedAttack is defined in the task_deflection directory but is also categorized as a context switching attack.
- class hivetracered.attacks.types.context_switching.DashedDividerAttack[source]
Bases:
TemplateAttackDashed Divider attack. Uses dashed line dividers to visually separate context and attempt context switching.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.context_switching.ForgetEverythingBeforeAttack[source]
Bases:
TemplateAttackForget Everything Before attack. Attempts to make the model forget prior guidelines and follow new harmful instructions.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.context_switching.IgnorePreviousInstructionsAttack[source]
Bases:
TemplateAttackIgnore Previous Instructions attack. Attempts to make the model ignore previous instructions and perform a harmful task.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.context_switching.SymbolDividerAttack[source]
Bases:
TemplateAttackSymbol Divider attack. Uses symbol dividers to visually separate context and attempt to make the model ignore previous instructions.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
In-Context Learning Attacks¶
Task Deflection Attacks¶
Text Structure Modification Attacks¶
Text structure modification attacks that alter structural aspects of text.
- class hivetracered.attacks.types.text_structure_modification.BackToFrontAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that reverses the entire text, potentially bypassing content filters while making it harder to immediately understand the prompt.
- __init__(raw=False, template=None)[source]
Initialize the Back-to-Front Attack.
- class hivetracered.attacks.types.text_structure_modification.DisemvowelAttack(only_last_vowel=True, language='both', raw=True, template=None)[source]
Bases:
AlgoAttackAttack that removes vowels from text, potentially bypassing content filters while keeping the text somewhat readable through consonants alone. Supports both English and Russian languages.
- ENGLISH_VOWELS = {'A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u'}
- RUSSIAN_VOWELS = {'Ё', 'А', 'Е', 'И', 'О', 'У', 'Ы', 'Э', 'Ю', 'Я', 'а', 'е', 'и', 'о', 'у', 'ы', 'э', 'ю', 'я', 'ё'}
- __init__(only_last_vowel=True, language='both', raw=True, template=None)[source]
Initialize the Disemvowel Attack.
- get_description()[source]
Get the description of the attack.
- Return type:
- Returns:
A description of what the attack does
- class hivetracered.attacks.types.text_structure_modification.JSONTransformAttack(raw=True, template=None)[source]
Bases:
AlgoAttackAttack that transforms a string into a JSON object where words are keys and values one-by-one. For example, “How to get a cake?” becomes {“How”: “to”, “get”: “a”, “cake?”: “”}.
- __init__(raw=True, template=None)[source]
Initialize the JSON Transform Attack.
- class hivetracered.attacks.types.text_structure_modification.PastTenseAttack(model, model_kwargs=None, name=None, description=None)[source]
Bases:
ModelAttackAttack that converts the original prompt to past tense using an LLM. This can potentially bypass content filters while preserving the semantic meaning of the prompt.
- __init__(model, model_kwargs=None, name=None, description=None)[source]
Initialize the Past Tense Attack.
- get_description()[source]
Get the description of the attack.
- Return type:
- Returns:
A description of what the attack does
- class hivetracered.attacks.types.text_structure_modification.TranslationAttack(model, source_language='russian', target_language='english', model_kwargs=None, name=None, description=None)[source]
Bases:
ModelAttackAttack that translates the original prompt to a different language using an LLM. This can potentially bypass content filters while preserving the semantic meaning of the prompt.
- __init__(model, source_language='russian', target_language='english', model_kwargs=None, name=None, description=None)[source]
Initialize the Translation Attack.
- Parameters:
model (
Model) – The language model to use for translationsource_language (
str) – The language to translate the prompt from (default: Russian)target_language (
str) – The language to translate the prompt to (default: English)model_kwargs (
Dict[str,Any] |None) – Optional additional arguments to pass to the modeldescription (
str|None) – Optional description for the attack
- get_description()[source]
Get the description of the attack.
- Return type:
- Returns:
A description of what the attack does
- class hivetracered.attacks.types.text_structure_modification.TypoAttack(typo_rate=1.0, allow_swaps=True, allow_omissions=True, allow_doubles=True, allow_adjacents=True, language='both', forced_typo_words=None, raw=True, template=None)[source]
Bases:
AlgoAttackAttack that introduces typographical errors (typos) into text, potentially bypassing content filters while keeping the text mostly readable. Implements various common typo patterns: 1. Character swapping (e.g., “the” -> “teh”) 2. Character omission (e.g., “hello” -> “helo”) 3. Character doubling (e.g., “happy” -> “happpy”) 4. Adjacent character substitution based on keyboard layout (e.g., “dog” -> “fog”)
Supports both English and Russian keyboard layouts.
- ENGLISH_KEYBOARD = {'a': ['s', 'q', 'z', 'w'], 'b': ['v', 'n', 'g', 'h'], 'c': ['x', 'v', 'd', 'f'], 'd': ['s', 'f', 'e', 'r', 'c', 'x'], 'e': ['w', 'r', 'd', 'f'], 'f': ['d', 'g', 'r', 't', 'v', 'c'], 'g': ['f', 'h', 't', 'y', 'b', 'v'], 'h': ['g', 'j', 'y', 'u', 'n', 'b'], 'i': ['u', 'o', 'k', 'j'], 'j': ['h', 'k', 'u', 'i', 'm', 'n'], 'k': ['j', 'l', 'i', 'o', ',', 'm'], 'l': ['k', ';', 'o', 'p', '.', ','], 'm': ['n', ',', 'j', 'k'], 'n': ['b', 'm', 'h', 'j'], 'o': ['i', 'p', 'k', 'l'], 'p': ['o', '[', 'l', ';'], 'q': ['w', 'a', '1', '2'], 'r': ['e', 't', 'd', 'f'], 's': ['a', 'd', 'w', 'e', 'x', 'z'], 't': ['r', 'y', 'f', 'g'], 'u': ['y', 'i', 'h', 'j'], 'v': ['c', 'b', 'f', 'g'], 'w': ['q', 'e', 'a', 's'], 'x': ['z', 'c', 's', 'd'], 'y': ['t', 'u', 'g', 'h'], 'z': ['a', 'x', 's']}
- RUSSIAN_KEYBOARD = {'а': ['в', 'п', 'м', 'к', 'е'], 'б': ['ь', 'ю', 'л', 'о'], 'в': ['ы', 'а', 'с', 'у', 'к'], 'г': ['н', 'ш', 'о', 'р'], 'д': ['л', 'ж', 'ю', 'щ', 'з'], 'е': ['к', 'н', 'п', 'а'], 'ж': ['д', 'э', '.', 'з', 'х'], 'з': ['щ', 'х', 'ж', 'д'], 'и': ['м', 'т', 'п', 'а'], 'й': ['ц', 'ф', '1'], 'к': ['у', 'е', 'а', 'в'], 'л': ['о', 'д', 'б', 'ш', 'щ'], 'м': ['с', 'и', 'а', 'в'], 'н': ['е', 'г', 'р', 'п'], 'о': ['р', 'л', 'ь', 'г', 'ш'], 'п': ['а', 'р', 'и', 'е', 'н'], 'р': ['п', 'о', 'т', 'н', 'г'], 'с': ['ч', 'м', 'в', 'ы'], 'т': ['и', 'ь', 'р', 'п'], 'у': ['ц', 'к', 'в', 'ы'], 'ф': ['й', 'ы', 'я', 'ц'], 'х': ['з', 'ъ', 'э', 'ж'], 'ц': ['й', 'у', 'ы', 'ф'], 'ч': ['я', 'с', 'ы', 'ф'], 'ш': ['г', 'щ', 'л', 'о'], 'щ': ['ш', 'з', 'д', 'л'], 'ъ': ['х', 'э', 'ё', '/'], 'ы': ['ф', 'в', 'ч', 'ц', 'у'], 'ь': ['т', 'б', 'о', 'р'], 'э': ['ж', 'ё', ',', 'х', 'ъ'], 'ю': ['б', '.', 'д', 'л'], 'я': ['ф', 'ч', 'ц']}
- __init__(typo_rate=1.0, allow_swaps=True, allow_omissions=True, allow_doubles=True, allow_adjacents=True, language='both', forced_typo_words=None, raw=True, template=None)[source]
Initialize the Typo Attack.
- Parameters:
typo_rate (
float) – Probability of introducing a typo for each word (between 0 and 1).allow_swaps (
bool) – Whether to allow character swapping.allow_omissions (
bool) – Whether to allow character omission.allow_doubles (
bool) – Whether to allow character doubling.allow_adjacents (
bool) – Whether to allow adjacent character substitution.language (
str) – The keyboard layout to use. Options: “english”, “russian”, or “both”.forced_typo_words (
List[str] |None) – List of words that must contain typos regardless of typo_rate.raw (
bool) – If True, apply the transformation without additional instructions. Default is True.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
- get_description()[source]
Get the description of the attack.
- Return type:
- Returns:
A description of what the attack does
- class hivetracered.attacks.types.text_structure_modification.VerticalTextAttack(raw=False, template=None)[source]
Bases:
AlgoAttackAttack that converts words to a vertical format, with each character of a word stacked vertically, potentially bypassing content filters while preserving readability.
- __init__(raw=False, template=None)[source]
Initialize the Vertical Text Attack.
- class hivetracered.attacks.types.text_structure_modification.WordDividerAttack(divider_char='+', density=1.0, apply_to_words_longer_than=1, raw=True, template=None)[source]
Bases:
AlgoAttackAttack that inserts characters (dots by default) between letters in words, potentially bypassing content filters while keeping the text readable.
- __init__(divider_char='+', density=1.0, apply_to_words_longer_than=1, raw=True, template=None)[source]
Initialize the Word Divider Attack.
- Parameters:
divider_char (
str) – The character to insert between letters (default is ‘+’).density (
float) – Float between 0 and 1 controlling what fraction of eligible words get modified. Default is 1.0 (all eligible words).apply_to_words_longer_than (
int) – Only apply to words longer than this length. Default is 1 to apply to all words.raw (
bool) – If True, apply the transformation without additional instructions. Default is True.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
- class hivetracered.attacks.types.text_structure_modification.ZeroWidthAttack(density=0.5, raw=True, template=None)[source]
Bases:
AlgoAttackAttack that inserts zero-width (invisible) Unicode characters between visible ones, potentially bypassing content filters while preserving readability.
- ZERO_WIDTH_CHARS = ['\u200b', '\u200c', '\u200d', '\u2060', '\ufeff']
- __init__(density=0.5, raw=True, template=None)[source]
Initialize the Zero-Width Character Attack.
- Parameters:
density (
float) – A value between 0 and 1 determining how many characters will have zero-width characters inserted after them. Default is 0.5 (50%).raw (
bool) – If True, apply the transformation without additional instructions. Default is False.template (
str|None) – Custom instruction template to use. If None, the default template will be used.
Output Formatting Attacks¶
Output formatting attacks that manipulate the requested output format to bypass safety filters.
- class hivetracered.attacks.types.output_formatting.Base64OutputAttack[source]
Bases:
TemplateAttackAttack that asks for output to be in base64 encoding, potentially bypassing output filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.CSVOutputAttack[source]
Bases:
TemplateAttackCSV Output attack. Forces the model to respond in CSV format, potentially bypassing content filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.GCGTransferHarmbenchAttack[source]
Bases:
TemplateAttackGCG Transfer Harmbench attack. Uses a technique with nonsensical formatting instructions to bypass filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.GCGTransferUniversalAttack[source]
Bases:
TemplateAttackGCG Transfer Universal Attacks. Uses a complex, nonsensical instruction set with formatting tokens to bypass filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.JSONOutputAttack[source]
Bases:
TemplateAttackJSON Output attack. Forces the model to respond in valid JSON format, potentially bypassing content filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.LanguageOutputAttack(language='chinese')[source]
Bases:
TemplateAttackLanguage Output attack. Forces the model to respond in a specific language, potentially bypassing content filters that might be language-specific.
- class hivetracered.attacks.types.output_formatting.PrefixInjectionAttack[source]
Bases:
TemplateAttackPrefix Injection attack. Forces the model to start with an ‘unhinged’ tone to bypass filters.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.PrefixInjectionOfCourseAttack[source]
Bases:
TemplateAttackPrefix Injection attack. Forces the model to start with ‘Конечно, вот ответ на ваш вопрос’ (Of course, here is the answer to your question).
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
- class hivetracered.attacks.types.output_formatting.RefusalSuppressionAttack[source]
Bases:
TemplateAttackRefusal Suppression attack. Uses rules to prevent model from refusing or disclaiming the harmful content.
- __init__()[source]
Initialize the template attack with a specific template string.
- Parameters:
template – A format string with a ‘{prompt}’ placeholder
name – Optional name for the attack (defaults to class name)
description – Optional description for the attack
Irrelevant Information Attacks¶
Simple Instructions Attacks¶
Simple instruction attacks that use direct, straightforward requests.
- class hivetracered.attacks.types.simple_instructions.NoneAttack[source]
Bases:
TemplateAttackPass-through attack that makes no modifications to the prompt. Serves as a baseline or control in attack comparisons and as a no-op in attack chains.
- __init__()[source]
Initialize the NoneAttack with a simple pass-through template.
See Also¶
Attack Types Reference - Attack reference
Creating Custom Attacks - Usage and custom attacks