cvector: better prompt handling, add "mean vector" method #8069
Conversation
What's the reason for removing it? If you don't want to use them, just make an empty file and run with
@HatsuneMikuUwU33 The main reason is that the completions file is used for preparing the input data for training, which is not the main goal of this example. At this stage, I don't feel like mixing these 2 steps in the same program because it makes the code too complex. The actual completion step mentioned in #7514 (comment) can be done better by a dedicated tool. If we absolutely want to have completions here, I'd suggest writing a dedicated program (maybe a shell script or Python) to do the data-preparation step.
I'm just looking through the code now and I can't see anywhere where the data matrix is projected onto the principal component:

```cpp
PCA::run_pca(pca_params, ctx_train.v_diff, ctx_train.v_final);

// write output vectors to gguf
export_gguf(ctx_train.v_final, params.cvector_outfile, model_hint);
```

```cpp
static void run_pca(
        struct pca_params & params,
        const std::vector<struct ggml_tensor *> & v_input, // shape of v_input[0]: [n_samples, n_embd]
        const std::vector<struct ggml_tensor *> & v_output) {
    printf("%s: Running PCA...\n", __func__);
    for (size_t il = 0; il < v_input.size(); ++il) {
        // prepare output vector
        struct ggml_tensor * ctrl_out = v_output[il];
        ggml_format_name(ctrl_out, "direction.%ld", il+1);
        // run power_iteration
        params.i_layer = il;
        params.n_layers = v_input.size();
        power_iteration(params, v_input[il], ctrl_out);
        printf("%s: Done layer %d / %d\n", __func__, (int) il+1, (int) v_input.size());
    }
}
```

```cpp
    // get output tensor
    GGML_ASSERT(last_eigenvector);
    ggml_backend_tensor_get(last_eigenvector, output->data, 0, ggml_nbytes(last_eigenvector));
    //print_debug_tensor(output);
    ggml_gallocr_free(allocr);
```

You need to project the data matrix onto the Eigenvector(s), calculate the mean, and see if the signs of the vector(s) need flipping so that adding the vector makes the mean go the way you want. There is no inherent directionality to the Eigenvectors found, and it's pretty much random whether they point the way you want or not. The Python code does this here:

```python
# calculate sign
projected_hiddens = project_onto_direction(h, directions[layer])

# order is [positive, negative, positive, negative, ...]
positive_smaller_mean = np.mean(
    [
        projected_hiddens[i] < projected_hiddens[i + 1]
        for i in range(0, len(inputs) * 2, 2)
    ]
)
positive_larger_mean = np.mean(
    [
        projected_hiddens[i] > projected_hiddens[i + 1]
        for i in range(0, len(inputs) * 2, 2)
    ]
)

if positive_smaller_mean > positive_larger_mean:  # type: ignore
    directions[layer] *= -1
```

See: https://github.com/vgel/repeng/blob/main/repeng/extract.py

But I actually think (and have tested successfully in my own code) that not only should the sign be flipped, but the magnitude should be scaled by the mean. Otherwise all the vectors will just keep their norm of 1 as returned by PCA, and it will be very hard to balance mixing control vectors from early and later layers, where the mean hidden state differs by 1-2 orders of magnitude.
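To make the suggested fix concrete, here is a minimal NumPy sketch of the sign check and the optional rescaling described above. It is illustrative only, with hypothetical names (`v_diff`, `direction`, `orient_and_scale`), not the code from this PR:

```python
import numpy as np

def orient_and_scale(v_diff: np.ndarray, direction: np.ndarray, rescale: bool = False) -> np.ndarray:
    """v_diff: [n_samples, n_embd] matrix of (positive - negative) hidden-state diffs for one layer.
    direction: unit-norm eigenvector returned by PCA / power iteration for that layer."""
    # project every difference vector onto the candidate direction
    projections = v_diff @ direction          # shape: [n_samples]
    mean_proj = projections.mean()

    # the eigenvector's sign is arbitrary: flip it so that adding the vector
    # pushes activations towards the "positive" prompts on average
    if mean_proj < 0:
        direction = -direction
        mean_proj = -mean_proj

    # optionally scale by the mean projection so vectors from early and late
    # layers have comparable magnitudes (instead of all keeping norm 1)
    return direction * mean_proj if rescale else direction
```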
This method is actually Linear Discriminant Analysis where you assume the covariance matrices are just the identity. This is a nice explanation of why that is and why it might not be optimal:
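For intuition, here is a small illustrative sketch (hypothetical names, not code from this PR or from repeng): the general LDA direction is the inverse within-class covariance applied to the difference of class means, and assuming the covariance is the identity it collapses to the plain difference of means:

```python
import numpy as np

def lda_direction(pos: np.ndarray, neg: np.ndarray, assume_identity_cov: bool = False) -> np.ndarray:
    """pos, neg: [n_samples, n_embd] hidden states for the positive and negative class."""
    mu_diff = pos.mean(axis=0) - neg.mean(axis=0)
    if assume_identity_cov:
        # identity covariance: the LDA direction is just the difference of class means
        return mu_diff
    # pooled within-class covariance, slightly regularized for invertibility
    cov = 0.5 * (np.cov(pos, rowvar=False) + np.cov(neg, rowvar=False))
    cov += 1e-6 * np.eye(cov.shape[0])
    return np.linalg.solve(cov, mu_diff)
```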
Thanks for the explanation. I looked at the Python code earlier but didn't understand this part in particular. It's all clear now and I'll try to bring this part to C++. For now I'll just remove my hot fix and leave a TODO there.
Cool! Maybe this is also related to the fact that the generated control vector is only effective if I apply it to layers higher than 10 (i.e.
Hopefully this isn't confusing, as I'm actually using more than 2 classes:

```python
projected_scores = [self._project_data_onto_component(d, component) for d in data]
mean_differences = self._compute_mean_difference(projected_scores[0], projected_scores[1])  # 2 classes only!
for j in range(num_dataset_types - 1):
    scaled_direction = -mean_differences[j] * component
    direction_matrices[j][layer_index].append(torch.tensor(scaled_direction))
```

```python
def _project_data_onto_component(self, data, component):
    return np.dot(data, component.reshape(-1, 1)).squeeze()

def _compute_mean_difference(self, projected_scores1, projected_scores2):
    return np.mean(projected_scores1) - np.mean(projected_scores2)
```

To use the same logic as the old code where you just keep the norms of 1:

```python
scaled_direction = -math.copysign(1.0, mean_differences[j]) * component
```

I'm also being careful to use the delta of
This looks fine. Ideally I'd like to have an example Python/shell script, or at least instructions, showing how to format the data, but this is not urgent.
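As a starting point, here is a rough sketch of such a script. It is only an illustration under stated assumptions: that the generator reads one formatted prompt per line from `positive.txt` / `negative.txt`, that the Llama-3 chat template is the target format, and that newlines are escaped as `\n`; the file names and personas are hypothetical.

```python
# Hypothetical helper to write positive.txt / negative.txt, one prompt per line.
PERSONAS = {
    "positive.txt": ["extremely happy", "filled with joy"],
    "negative.txt": ["extremely sad", "filled with sorrow"],
}

# Llama-3 style system prompt; adjust to whatever template your model expects.
TEMPLATE = "<|start_header_id|>system<|end_header_id|>\n\nAct like a person who is {persona}.<|eot_id|>"

def main() -> None:
    for path, personas in PERSONAS.items():
        with open(path, "w", encoding="utf-8") as f:
            for persona in personas:
                line = TEMPLATE.format(persona=persona)
                # keep each prompt on a single line; whether "\n" needs escaping
                # like this depends on how the generator parses the file
                f.write(line.replace("\n", "\\n") + "\n")

if __name__ == "__main__":
    main()
```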
…8069)

* remove completions file
* fix inverted vector
* add mean method
* code style
* remove inverted pca hotfix
Motivation
Ref comment: #7514 (comment)
After more consideration, I think that we should not handle completions in cvector (at least for now), because it can add unnecessary complexity. Positive/negative prompts are now 100% up to the user to prepare. I also changed the example to use the Llama-3 format.

This also fixes a bug where special tokens were not being correctly tokenized.
With this change, I spotted a problem with PCA: the output vector is being inverted (i.e. `cvector_happy.gguf +1.0` makes it sad, while `cvector_happy.gguf -1.0` makes it happy). I don't know why for now, but the quick fix is to invert the vector back before saving it.

Mean method
Added "mean" as a dimensionality reduction method. It simply calculates the mean vector from all embeddings.
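For illustration, a minimal sketch of what such a mean reduction boils down to (hypothetical names; the actual implementation is the C++ added in this PR):

```python
import numpy as np

def mean_direction(v_diff: np.ndarray) -> np.ndarray:
    """v_diff: [n_samples, n_embd] matrix of (positive - negative) embeddings for one layer."""
    # the control vector for the layer is just the per-dimension mean of the differences
    return v_diff.mean(axis=0)
```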
The output turns out to be quite acceptable even with this simple method:
Demo