-
Notifications
You must be signed in to change notification settings - Fork 475
[DRAFT][EXPERIMENTAL] FEAT: Proposing a new design for Orchestration of Attacks #867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[DRAFT][EXPERIMENTAL] FEAT: Proposing a new design for Orchestration of Attacks #867
Conversation
self, | ||
*, | ||
orchestrator_identifier: dict[str, str], | ||
memory: Optional[MemoryInterface] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you know, I'm not a huge fan of this because it means we could make an accidental mistake and pass different memory types in different places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using memory internally within each individual class is an anti-pattern and actually a bad design because storage happens as a side effect which is not known to the user. Also, you end up with inconsistencies. For example, let's say I wrote a new orchestrator, and this orchestrator has a bug somewhere that causes the whole program to crash. If you point it to a SQL storage, you end up having a few entries till the program crashes. This causes inconsistent or (garbage data). I think we need to rethink how we do this.
Dependency injection is a well-known pattern that helps testing individual pieces much easier. You could have a warning for example if memory is provided. I am envisioning all of the message tracking to move to a conversation piece. Once the entire attack is finished, it can be offloaded to the storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I see the problem tbh but feel free to write up your concerns in a doc or example, whatever works!
It's consistent and works. Users very much know what's being used because they have to choose. We used to have explicit passing of memory and that led to confusion and mistakes. No longer possible!
The "garbage data" example can happen, but it's not like that matters. It's also possible in any other setup unless you only let the SDK commit entries to the DB when the entire conversation is finished which is overkill (?)
I'd go a step further and say we want data from failed runs to analyze.
Anyway, I'm not ruling out that we might be missing something here but I don't yet see benefits to do anything else. It's easy to test even now. We're doing it already after all.
- Tracks turn count for multi-turn conversations | ||
- Extracts conversation state for continuation | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this should always be called "Parameters", not "Args." We're doing this wrong in a 1000 places but I want to spread the knowledge 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, this is again mainly for showcasing the work
# By default, include in memory | ||
return False # Don't exclude from memory | ||
|
||
def _extract_conversation_state( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a method named like this I would imagine we have a return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can name it as _populate_conversation_state
.
class PromptUtils: | ||
""" | ||
Utility class for handling prompt normalization and validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of the name. If it normalizes it should be the normalizer? Or maybe it should be part of the existing normalizer class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agreed, I am not a fan either but couldn't find a better name for this. I thought about moving it to the prompt normalizer class but I don't fully understand what it does or what the responsibility of it is. I am not attached to the name at all. The point of this is that this logic should belong to a different component whether its prompt utils or prompt normalizer.
target: PromptTarget, | ||
adversarial_chat: PromptChatTarget, | ||
system_prompt_path: Path, | ||
seed_prompt: Optional[str] = "How can I help you?", | ||
prompt_normalizer: Optional[PromptNormalizer] = None, | ||
memory: Optional[MemoryInterface] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are lots of differences between orchestrators with this...
Crescendo has refusal AND objective scorer which use the same target but with different system prompts. [Note: system_prompt
should be specific, too, otherwise system prompt for what exactly?]
TAP has off topic scorer and objective scorer.
One can perhaps envision others.
With the scanner and in ops the question of having multiple scorers in general came up. They may just be auxiliary and not influence how the algorithm proceeds, but having them is worthwhile nevertheless. We do this for PSO already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this to have all the common params as input, the rest could be adjusted by individual attack builders as additional methods but we could have them all here as well. The only downside is, as you introduce new attacks with diff params, then this init also needs to be changed which is not desirable. It's best to be handled by the actual attack
"""Initialize the Multi-Turn Orchestrator Builder. | ||
|
||
Args: | ||
target: Target system to attack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're calling that prompt_target everywhere else. Target is the generic term for any kind of target including adversarial_chat or scoring targets.
|
||
return self | ||
|
||
def with_seed_prompt(self: T_OrcBuilder, seed_prompt: Union[str | SeedPrompt]) -> T_OrcBuilder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean seed prompt group?
context.achieved_objective = await self._evaluate_objective_achievement(context, target_response, score) | ||
|
||
# Apply backtracking if configured and needed | ||
if (not context.achieved_objective and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more like all these things AND backtracking criteria are fulfilled. This could be refusal, or it could be low scores, for example. So that needs another way to customize IMO
memory_labels=context.memory_labels | ||
) | ||
|
||
async def _generate_next_prompt(self, context: MultiTurnAttackContext) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be quite custom for all the techniques. [No action item, just FYI]
seed_prompt: SeedPrompt, | ||
prompt_normalizer: Optional[PromptNormalizer] = None, | ||
prepend_conversation: Optional[List[SeedPrompt]] = None, | ||
memory: Optional[MemoryInterface] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a nit, but I would be intentional about all paradigm changes. E.g. even memory, if we can. I know it's an anti-pattern, but as a dev I think the consensus has been super nice to have a memory singleton
def __init__( | ||
self, | ||
*, | ||
target: PromptTarget, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One paradigm I want to normalize on, and this might be a good time to tackle, is to take objective_scorer
in the init vs initialize them in the class. It's becoming increasingly clear that's needed.
We need good ways to make it easy to initialize these though. Factories (like you introduce here) is one way. Another might be scanner configurations.
But either way, I think we both need the flexibility of passing a general "true_false" scorer AND the ease of use of having a default.
class ConversationSession: | ||
"""Session for conversations.""" | ||
conversation_id: str = field(default_factory=lambda: str(uuid.uuid4())) | ||
adversarial_chat_conversation_id: str = field(default_factory=lambda: str(uuid.uuid4())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will likely need more of these. It'll be good to think through TAP which can have a ton of conversations due to branching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We MAY be able to generalize by having a list of active_adversarial_chat_conversation_id (because some are thrown away, e.g. if backtracked or pruned).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I think that's a general piece of feedback. I'm positive about this change, but for the first iteration I'd prefer TAP. If it's general enough for TAP, then it will work elsewhere. My biggest concern is some of it may not generalize
Description
This is a draft for an experimental design proposal on how to orchestrate attacks. The idea is to modularize the components, separate the core attack algorithms and flows, and enhance the separation of concerns. Please keep in mind, this is not the final design, it is just a starting point for discussion. I've even included a CLI example to show how the code might look from the user's perspective.
I prioritized readability and chose the builder pattern, even though it might feel a bit verbose. You can also check out the diagrams folder to see the attack and class diagrams.
(as you can tell, this is my 3rd iteration on it
v3
)