Skip to content

[DRAFT][EXPERIMENTAL] FEAT: Proposing a new design for Orchestration of Attacks #867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

bashirpartovi
Copy link
Contributor

@bashirpartovi bashirpartovi commented Apr 8, 2025

Description

This is a draft for an experimental design proposal on how to orchestrate attacks. The idea is to modularize the components, separate the core attack algorithms and flows, and enhance the separation of concerns. Please keep in mind, this is not the final design, it is just a starting point for discussion. I've even included a CLI example to show how the code might look from the user's perspective.

I prioritized readability and chose the builder pattern, even though it might feel a bit verbose. You can also check out the diagrams folder to see the attack and class diagrams.

(as you can tell, this is my 3rd iteration on it v3)

self,
*,
orchestrator_identifier: dict[str, str],
memory: Optional[MemoryInterface] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, I'm not a huge fan of this because it means we could make an accidental mistake and pass different memory types in different places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using memory internally within each individual class is an anti-pattern and actually a bad design because storage happens as a side effect which is not known to the user. Also, you end up with inconsistencies. For example, let's say I wrote a new orchestrator, and this orchestrator has a bug somewhere that causes the whole program to crash. If you point it to a SQL storage, you end up having a few entries till the program crashes. This causes inconsistent or (garbage data). I think we need to rethink how we do this.
Dependency injection is a well-known pattern that helps testing individual pieces much easier. You could have a warning for example if memory is provided. I am envisioning all of the message tracking to move to a conversation piece. Once the entire attack is finished, it can be offloaded to the storage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I see the problem tbh but feel free to write up your concerns in a doc or example, whatever works!

It's consistent and works. Users very much know what's being used because they have to choose. We used to have explicit passing of memory and that led to confusion and mistakes. No longer possible!

The "garbage data" example can happen, but it's not like that matters. It's also possible in any other setup unless you only let the SDK commit entries to the DB when the entire conversation is finished which is overkill (?)
I'd go a step further and say we want data from failed runs to analyze.

Anyway, I'm not ruling out that we might be missing something here but I don't yet see benefits to do anything else. It's easy to test even now. We're doing it already after all.

- Tracks turn count for multi-turn conversations
- Extracts conversation state for continuation

Args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this should always be called "Parameters", not "Args." We're doing this wrong in a 1000 places but I want to spread the knowledge 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, this is again mainly for showcasing the work

# By default, include in memory
return False # Don't exclude from memory

def _extract_conversation_state(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a method named like this I would imagine we have a return value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can name it as _populate_conversation_state.

Comment on lines 23 to 25
class PromptUtils:
"""
Utility class for handling prompt normalization and validation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of the name. If it normalizes it should be the normalizer? Or maybe it should be part of the existing normalizer class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agreed, I am not a fan either but couldn't find a better name for this. I thought about moving it to the prompt normalizer class but I don't fully understand what it does or what the responsibility of it is. I am not attached to the name at all. The point of this is that this logic should belong to a different component whether its prompt utils or prompt normalizer.

Comment on lines 28 to 33
target: PromptTarget,
adversarial_chat: PromptChatTarget,
system_prompt_path: Path,
seed_prompt: Optional[str] = "How can I help you?",
prompt_normalizer: Optional[PromptNormalizer] = None,
memory: Optional[MemoryInterface] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of differences between orchestrators with this...

Crescendo has refusal AND objective scorer which use the same target but with different system prompts. [Note: system_prompt should be specific, too, otherwise system prompt for what exactly?]
TAP has off topic scorer and objective scorer.
One can perhaps envision others.
With the scanner and in ops the question of having multiple scorers in general came up. They may just be auxiliary and not influence how the algorithm proceeds, but having them is worthwhile nevertheless. We do this for PSO already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this to have all the common params as input, the rest could be adjusted by individual attack builders as additional methods but we could have them all here as well. The only downside is, as you introduce new attacks with diff params, then this init also needs to be changed which is not desirable. It's best to be handled by the actual attack

"""Initialize the Multi-Turn Orchestrator Builder.

Args:
target: Target system to attack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're calling that prompt_target everywhere else. Target is the generic term for any kind of target including adversarial_chat or scoring targets.


return self

def with_seed_prompt(self: T_OrcBuilder, seed_prompt: Union[str | SeedPrompt]) -> T_OrcBuilder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean seed prompt group?

context.achieved_objective = await self._evaluate_objective_achievement(context, target_response, score)

# Apply backtracking if configured and needed
if (not context.achieved_objective and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more like all these things AND backtracking criteria are fulfilled. This could be refusal, or it could be low scores, for example. So that needs another way to customize IMO

memory_labels=context.memory_labels
)

async def _generate_next_prompt(self, context: MultiTurnAttackContext) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be quite custom for all the techniques. [No action item, just FYI]

seed_prompt: SeedPrompt,
prompt_normalizer: Optional[PromptNormalizer] = None,
prepend_conversation: Optional[List[SeedPrompt]] = None,
memory: Optional[MemoryInterface] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a nit, but I would be intentional about all paradigm changes. E.g. even memory, if we can. I know it's an anti-pattern, but as a dev I think the consensus has been super nice to have a memory singleton

def __init__(
self,
*,
target: PromptTarget,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One paradigm I want to normalize on, and this might be a good time to tackle, is to take objective_scorer in the init vs initialize them in the class. It's becoming increasingly clear that's needed.

We need good ways to make it easy to initialize these though. Factories (like you introduce here) is one way. Another might be scanner configurations.

But either way, I think we both need the flexibility of passing a general "true_false" scorer AND the ease of use of having a default.

class ConversationSession:
"""Session for conversations."""
conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
adversarial_chat_conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will likely need more of these. It'll be good to think through TAP which can have a ton of conversations due to branching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We MAY be able to generalize by having a list of active_adversarial_chat_conversation_id (because some are thrown away, e.g. if backtracked or pruned).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I think that's a general piece of feedback. I'm positive about this change, but for the first iteration I'd prefer TAP. If it's general enough for TAP, then it will work elsewhere. My biggest concern is some of it may not generalize

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants