
Conversation

@edison240121
Contributor

No description provided.

std::memcpy(cum_vec.data(),
cum_tensor.data_ptr<int>(),
cum_tensor.numel() * sizeof(int));
;
Collaborator

What are you doing in lines 661-673?
Why not compute std::vector<int32_t> q_cu_seq_lens from std::vector<int32_t> q_seq_lens directly?

Contributor Author

Deleted.
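
For reference, the reviewer's suggestion of deriving q_cu_seq_lens directly from q_seq_lens, without the torch tensor and std::memcpy round trip, boils down to a single std::partial_sum call. A minimal sketch (variable names follow the snippets in this PR; the values are made up):

#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-request query lengths; in the real code these come
// from the forward input.
std::vector<int32_t> q_seq_lens = {3, 1, 5};

// Inclusive prefix sums computed directly on the host vector.
std::vector<int32_t> q_cu_seq_lens(q_seq_lens.size());
std::partial_sum(q_seq_lens.begin(), q_seq_lens.end(), q_cu_seq_lens.begin());
// q_cu_seq_lens == {3, 4, 9}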


SET_ARG(stop_token_ids, std::unordered_set<int32_t>({1}));
});
} // namespace xllm
Collaborator

why change this line?

Contributor Author

Not changed.

#include <torch/torch.h>

#include <boost/algorithm/string.hpp>
#include <string>
Collaborator

remove:

#include <gflags/gflags.h>
#include <boost/algorithm/string.hpp>

Contributor Author

Deleted.

}

} // namespace layer
} // namespace xllm
Collaborator

why change this line?

Contributor Author

Restored to the state before modification.

void initialize_quantization_parameters(
atb_speed::deepseekV2::DecoderLayerParam& param);

void initialize_kimi_k2_parameters(
Collaborator

Why do we need to initialize the kimi_k2 parameters?

Contributor Author

Deleted.

std::memcpy(cum_vec.data(),
cum_tensor.data_ptr<int>(),
cum_tensor.numel() * sizeof(int));
;
Collaborator

The trailing ; is redundant.

Contributor Author

Deleted.

raw_forward_input.q_max_seq_len = state_.q_max_seq_len;
raw_forward_input.seq_lens = std::move(state_.seq_lens);
raw_forward_input.q_seq_lens = std::move(state_.q_seq_lens);
torch::Tensor q_seq_len_tensor =
Collaborator

This torch op may not be faster than std::partial_sum.

Contributor Author

It has been changed to use std::partial_sum.


params.kv_seq_lens = safe_to(kv_seq_lens, device, true);
params.q_seq_lens = safe_to(q_seq_lens, device, true);
params.q_cu_seq_lens = safe_to(q_cu_seq_lens, device, true);
Collaborator

Why do we need both the cumulative q_cu_seq_lens and the non-cumulative q_seq_lens params?

Contributor Author

DeepSeek V3.2 requires the sparse flash attention operator, and both q_seq_lens and q_cu_seq_lens are inputs to that operator.

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
Collaborator

Must add #pragma once.

Contributor Author

Added.

pb_forward_input->q_seq_lens().end());
// aprint<int32_t>(q_seq_lens, "q_seq_lens", global_rank_);
std::vector<int32_t> q_cu_seq_lens(q_seq_lens.size());
std::partial_sum(q_seq_lens.begin(), q_seq_lens.end(), q_cu_seq_lens.begin());
Collaborator

nit: please confirm the size of q_cu_seq_lens. On MLU and GPU the first item is 0, so q_cu_seq_lens.size() == batch_size + 1. Here on NPU, q_cu_seq_lens.size() == batch_size.

Contributor Author

The documentation for the sparse flash attention operator requires that the shapes of q_seq_lens and q_cu_seq_lens be equal.
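
To make the size question concrete, the two conventions discussed above differ only in whether a leading 0 is prepended. A small sketch (values and the npu_style/gpu_style names are illustrative only):

#include <cstdint>
#include <numeric>
#include <vector>

std::vector<int32_t> q_seq_lens = {3, 1, 5};  // hypothetical batch of 3

// Convention used here (NPU path): inclusive prefix sums,
// q_cu_seq_lens.size() == batch_size.
std::vector<int32_t> npu_style(q_seq_lens.size());
std::partial_sum(q_seq_lens.begin(), q_seq_lens.end(), npu_style.begin());
// npu_style == {3, 4, 9}

// Convention on MLU/GPU as described by the reviewer: a leading 0,
// so the size is batch_size + 1.
std::vector<int32_t> gpu_style(q_seq_lens.size() + 1, 0);
std::partial_sum(q_seq_lens.begin(), q_seq_lens.end(), gpu_style.begin() + 1);
// gpu_style == {0, 3, 4, 9}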

Collaborator

@yq33victor left a comment


Merging the code to speed up the DeepSeek V3.2 testing process.

