[pull] master from ggml-org:master #201

Status: Open. Wants to merge 40 commits into base: master from ggml-org:master.
Changes from all commits (40 commits):
c33fe8b
vocab : prevent heap overflow when vocab is too small (#14145)
ggerganov Jun 13, 2025
09cf2c7
cmake : Improve build-info.cpp generation (#14156)
ckastner Jun 13, 2025
c61285e
SYCL: Bump oneMath commit (#14152)
EwanC Jun 13, 2025
0889eba
sycl: Adding additional cpy dbg print output (#14034)
ShanoToni Jun 13, 2025
ffad043
server : fix SWA condition for full context reprocess (#14163)
ggerganov Jun 13, 2025
d714dad
pooling : make cls_b and cls_out_b optional (#14165)
huydt84 Jun 13, 2025
cc8d081
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
ckastner Jun 13, 2025
b7cc774
readme : remove survey link (#14168)
ggerganov Jun 13, 2025
60c6663
batch : rework llama_batch_allocr (#14153)
ggerganov Jun 13, 2025
26ff368
docs : Update multimodal.md (#14122)
ddpasa Jun 13, 2025
80709b7
batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
ggerganov Jun 13, 2025
3cfbbdb
Merge commit from fork
GuyGoldenberg Jun 13, 2025
40643ed
sycl: fix docker image (#14144)
sgeor255 Jun 13, 2025
fb85a28
vocab : fix build (#14175)
ggerganov Jun 13, 2025
2e42be4
compare-llama-bench: add option to plot (#14169)
am17an Jun 14, 2025
3cb203c
llama-chat : Do not throw when tool parsing fails (#14012)
p1-0tr Jun 14, 2025
00ba772
docs : remove WIP since PR has been merged (#13912)
pepijndevos Jun 15, 2025
b9912ac
batch : auto-gen positions + verify multi-sequence input (#14177)
ggerganov Jun 15, 2025
c311ac6
cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggerganov Jun 15, 2025
9ae4143
model : add dots.llm1 architecture support (#14044) (#14118)
Noeda Jun 15, 2025
5fce5f9
kv-cache : fix use-after-move of defrag info (#14189)
ggerganov Jun 15, 2025
2c2caa4
HIP: Replace usage of depricated preprocessor macro __AMDGCN_WAVEFRON…
IMbackK Jun 15, 2025
e54b394
CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196)
IMbackK Jun 15, 2025
30e5b01
quantize : change int to unsigned int for KV overrides (#14197)
EAddario Jun 15, 2025
cd355ed
server : When listening on a unix domain socket don't print http:// a…
ericcurtin Jun 15, 2025
d7da8dc
model : Add support for Arcee AI's upcoming AFM model (#14185)
bartowski1182 Jun 15, 2025
3555b30
ggml-cpu : rework weak alias on apple targets (#14146)
xctan Jun 16, 2025
c89c2d1
vulkan: mutex around vkQueueSubmit (#14127)
jeffbolznv Jun 16, 2025
4ad2436
gguf-py : allow key override when adding value to GGUFWriter (#14194)
huydt84 Jun 16, 2025
0bf49eb
convert : remove arcee change in convert_hf_to_gguf_update.py (#14207)
bartowski1182 Jun 16, 2025
3ba0d84
ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206)
chaxu01 Jun 16, 2025
d3e64b9
llama : rework embeddings logic (#14208)
ggerganov Jun 16, 2025
7d6d91b
HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202)
IMbackK Jun 16, 2025
ad590be
model : add NeoBERT (#14164)
huydt84 Jun 16, 2025
0dbcabd
cmake: clean up external project logic for vulkan-shaders-gen (#14179)
bandoti Jun 16, 2025
6adc3c3
llama : add thread safety test (#14035)
slaren Jun 16, 2025
89fea80
server : fix incorrect usage of llama_get_embeddings() (#14225)
ggerganov Jun 16, 2025
e434e69
common : suggest --jinja when autodetection fails (#14222)
CISC Jun 16, 2025
fe9d60e
musa: fix build warning (unused variable) (#14231)
yeahdongcn Jun 17, 2025
860a9e4
ggml-cpu : remove the weak alias trick (#14221)
xctan Jun 17, 2025
30 changes: 17 additions & 13 deletions .devops/intel.Dockerfile
@@ -49,19 +49,23 @@ COPY --from=build /app/full /app

WORKDIR /app

RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete

RUN apt-get update && \
apt-get install -y \
git \
python3 \
python3-pip \
python3-venv && \
python3 -m venv /opt/venv && \
. /opt/venv/bin/activate && \
pip install --upgrade pip setuptools wheel && \
pip install -r requirements.txt && \
apt autoremove -y && \
apt clean -y && \
rm -rf /tmp/* /var/tmp/* && \
find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete && \
find /var/cache -type f -delete

ENV PATH="/opt/venv/bin:$PATH"

ENTRYPOINT ["/app/tools.sh"]

3 changes: 2 additions & 1 deletion .github/workflows/build.yml
@@ -693,7 +693,7 @@ jobs:
- build: 'openblas-x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'vulkan-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
defines: '-DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'llvm-arm64-opencl-adreno'
@@ -778,6 +778,7 @@ jobs:
cmake -S . -B build ${{ matrix.defines }} `
-DCURL_LIBRARY="$env:CURL_PATH/lib/libcurl.dll.a" -DCURL_INCLUDE_DIR="$env:CURL_PATH/include"
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS}
cp $env:CURL_PATH/bin/libcurl-*.dll build/bin/Release

- name: Add libopenblas.dll
id: add_libopenblas_dll
14 changes: 10 additions & 4 deletions CMakeLists.txt
@@ -89,6 +89,14 @@ option(LLAMA_LLGUIDANCE "llama-common: include LLGuidance library for structured
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info.cmake)
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/common.cmake)

if (NOT DEFINED LLAMA_BUILD_NUMBER)
set(LLAMA_BUILD_NUMBER ${BUILD_NUMBER})
endif()
if (NOT DEFINED LLAMA_BUILD_COMMIT)
set(LLAMA_BUILD_COMMIT ${BUILD_COMMIT})
endif()
set(LLAMA_INSTALL_VERSION 0.0.${BUILD_NUMBER})

# override ggml options
set(GGML_ALL_WARNINGS ${LLAMA_ALL_WARNINGS})
set(GGML_FATAL_WARNINGS ${LLAMA_FATAL_WARNINGS})
@@ -155,6 +163,8 @@ if (LLAMA_USE_SYSTEM_GGML)
endif()

if (NOT TARGET ggml AND NOT LLAMA_USE_SYSTEM_GGML)
set(GGML_BUILD_NUMBER ${LLAMA_BUILD_NUMBER})
set(GGML_BUILD_COMMIT ${LLAMA_BUILD_COMMIT})
add_subdirectory(ggml)
# ... otherwise assume ggml is added by a parent CMakeLists.txt
endif()
@@ -204,10 +214,6 @@ endif()
include(GNUInstallDirs)
include(CMakePackageConfigHelpers)

set(LLAMA_BUILD_NUMBER ${BUILD_NUMBER})
set(LLAMA_BUILD_COMMIT ${BUILD_COMMIT})
set(LLAMA_INSTALL_VERSION 0.0.${BUILD_NUMBER})

set(LLAMA_INCLUDE_INSTALL_DIR ${CMAKE_INSTALL_INCLUDEDIR} CACHE PATH "Location of header files")
set(LLAMA_LIB_INSTALL_DIR ${CMAKE_INSTALL_LIBDIR} CACHE PATH "Location of library files")
set(LLAMA_BIN_INSTALL_DIR ${CMAKE_INSTALL_BINDIR} CACHE PATH "Location of binary files")
1 change: 0 additions & 1 deletion README.md
@@ -18,7 +18,6 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
## Hot topics

- 🔥 Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
- **GGML developer experience survey (organized and reviewed by NVIDIA):** [link](https://forms.gle/Gasw3cRgyhNEnrwK9)
- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141](https://github.com/ggml-org/llama.cpp/pull/13141)), `libllava` will be deprecated
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
2 changes: 1 addition & 1 deletion ci/run.sh
@@ -39,7 +39,7 @@ sd=`dirname $0`
cd $sd/../
SRC=`pwd`

CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=OFF"
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON"

if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON -DGGML_METAL_USE_BF16=ON"
24 changes: 7 additions & 17 deletions common/CMakeLists.txt
@@ -23,31 +23,21 @@ if(EXISTS "${PROJECT_SOURCE_DIR}/.git")
endif()

if(EXISTS "${GIT_DIR}/index")
set(GIT_INDEX "${GIT_DIR}/index")
# For build-info.cpp below
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${GIT_DIR}/index")
else()
message(WARNING "Git index not found in git repository.")
set(GIT_INDEX "")
endif()
else()
message(WARNING "Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.")
set(GIT_INDEX "")
endif()

# Add a custom command to rebuild build-info.cpp when .git/index changes
add_custom_command(
OUTPUT "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp"
COMMENT "Generating build details from Git"
COMMAND ${CMAKE_COMMAND} -DMSVC=${MSVC} -DCMAKE_C_COMPILER_VERSION=${CMAKE_C_COMPILER_VERSION}
-DCMAKE_C_COMPILER_ID=${CMAKE_C_COMPILER_ID} -DCMAKE_VS_PLATFORM_NAME=${CMAKE_VS_PLATFORM_NAME}
-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME} -DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}
-P "${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info-gen-cpp.cmake"
WORKING_DIRECTORY "${PROJECT_SOURCE_DIR}"
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in" ${GIT_INDEX}
VERBATIM
)
set(TEMPLATE_FILE "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in")
set(OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/build-info.cpp")
configure_file(${TEMPLATE_FILE} ${OUTPUT_FILE})

set(TARGET build_info)
add_library(${TARGET} OBJECT build-info.cpp)
add_library(${TARGET} OBJECT ${OUTPUT_FILE})
if (BUILD_SHARED_LIBS)
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
endif()
9 changes: 3 additions & 6 deletions common/arg.cpp
@@ -988,10 +988,6 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}

if (params.reranking && params.embedding) {
throw std::invalid_argument("error: either --embedding or --reranking can be specified, but not both");
}

if (!params.chat_template.empty() && !common_chat_verify_template(params.chat_template, params.use_jinja)) {
throw std::runtime_error(string_format(
"error: the supplied chat template is not supported: %s%s\n",
@@ -2747,9 +2743,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_EMBEDDINGS"));
add_opt(common_arg(
{"--reranking", "--rerank"},
string_format("enable reranking endpoint on server (default: %s)", params.reranking ? "enabled" : "disabled"),
string_format("enable reranking endpoint on server (default: %s)", "disabled"),
[](common_params & params) {
params.reranking = true;
params.embedding = true;
params.pooling_type = LLAMA_POOLING_TYPE_RANK;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_RERANKING"));
add_opt(common_arg(
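Note on the hunk above: --reranking no longer sets a separate params.reranking flag; it simply enables embeddings and selects RANK pooling. A minimal sketch of what that implies at context creation, using llama.h calls that also appear later in this diff (llama_context_default_params is assumed from llama.h; this is illustrative, not the server's actual code path):

    // Minimal sketch, assuming llama.h: what --rerank now amounts to when the context is built.
    #include "llama.h"

    static llama_context * init_rerank_context(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.embeddings   = true;                      // reranking rides on the embeddings path
        cparams.pooling_type = LLAMA_POOLING_TYPE_RANK;   // one pooled relevance score per sequence

        llama_context * lctx = llama_init_from_model(model, cparams);

        // callers no longer consult a params.reranking flag; they ask the context instead
        if (lctx != nullptr && llama_pooling_type(lctx) == LLAMA_POOLING_TYPE_RANK) {
            // a /rerank endpoint can be served with this context
        }
        return lctx;
    }

This is exactly the detection the reworked common_init_from_params() check further down in this diff relies on.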
4 changes: 2 additions & 2 deletions common/build-info.cpp.in
@@ -1,4 +1,4 @@
int LLAMA_BUILD_NUMBER = @BUILD_NUMBER@;
char const *LLAMA_COMMIT = "@BUILD_COMMIT@";
int LLAMA_BUILD_NUMBER = @LLAMA_BUILD_NUMBER@;
char const *LLAMA_COMMIT = "@LLAMA_BUILD_COMMIT@";
char const *LLAMA_COMPILER = "@BUILD_COMPILER@";
char const *LLAMA_BUILD_TARGET = "@BUILD_TARGET@";
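For reference, the template above expands into a small translation unit defining four globals, with the number and commit now taken from LLAMA_BUILD_NUMBER / LLAMA_BUILD_COMMIT. A hedged usage sketch (the extern declarations mirror the template; the main() is assumed, not code from this PR, and only links when the generated build-info object is linked in):

    #include <cstdio>

    extern int          LLAMA_BUILD_NUMBER;
    extern char const * LLAMA_COMMIT;
    extern char const * LLAMA_COMPILER;
    extern char const * LLAMA_BUILD_TARGET;

    int main() {
        std::printf("build: %d (%s) with %s for %s\n",
                    LLAMA_BUILD_NUMBER, LLAMA_COMMIT, LLAMA_COMPILER, LLAMA_BUILD_TARGET);
        return 0;
    }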
5 changes: 5 additions & 0 deletions common/chat-parser.cpp
@@ -49,6 +49,7 @@ bool common_chat_msg_parser::add_tool_call(const std::string & name, const std::

// LOG_DBG("Tool call arguments:\n\traw: %s\n\tresult: %s\n", arguments.c_str(), tool_call.arguments.c_str());
result_.tool_calls.emplace_back(tool_call);

return true;
}
bool common_chat_msg_parser::add_tool_call(const json & tool_call) {
@@ -378,3 +379,7 @@ std::optional<common_chat_msg_parser::consume_json_result> common_chat_msg_parse
/* .is_partial = */ found_healing_marker,
};
}

void common_chat_msg_parser::clear_tools() {
result_.tool_calls.clear();
}
2 changes: 2 additions & 0 deletions common/chat-parser.h
Original file line number Diff line number Diff line change
Expand Up @@ -115,4 +115,6 @@ class common_chat_msg_parser {
const std::vector<std::vector<std::string>> & args_paths = {},
const std::vector<std::vector<std::string>> & content_paths = {}
);

void clear_tools();
};
6 changes: 4 additions & 2 deletions common/chat.cpp
@@ -1838,7 +1838,7 @@ static common_chat_params common_chat_templates_apply_legacy(
if (res < 0) {
// if the custom "tmpl" is not supported, we throw an error
// this is a bit redundant (for good), since we're not sure if user validated the custom template with llama_chat_verify_template()
throw std::runtime_error("this custom template is not supported");
throw std::runtime_error("this custom template is not supported, try using --jinja");
}

// if it turns out that our buffer is too small, we resize it
@@ -1921,7 +1921,9 @@ common_chat_msg common_chat_parse(const std::string & input, bool is_partial, co
} catch (const common_chat_msg_partial_exception & ex) {
LOG_DBG("Partial parse: %s\n", ex.what());
if (!is_partial) {
throw std::runtime_error(ex.what());
builder.clear_tools();
builder.move_to(0);
common_chat_parse_content_only(builder);
}
}
auto msg = builder.result();
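Behavioral note on the hunk above: for a complete (non-partial) message whose tool-call section is malformed, the parser now clears the half-parsed tool calls and re-parses the whole input as plain content instead of throwing. A hedged sketch of what a caller can now expect; the header name, the default-constructed common_chat_syntax, and the example input are assumptions, not code from this PR:

    #include <cassert>
    #include <string>
    #include "chat.h"   // assumed: the common chat header from this PR

    int main() {
        common_chat_syntax syntax;   // hypothetical defaults standing in for a real tool-call format
        const std::string input = "<tool_call>{ this is not valid json";

        // previously a complete message like this threw std::runtime_error;
        // now the partial tool calls are cleared and the text is kept as content
        common_chat_msg msg = common_chat_parse(input, /*is_partial=*/false, syntax);

        assert(msg.tool_calls.empty());
        return 0;
    }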
24 changes: 0 additions & 24 deletions common/cmake/build-info-gen-cpp.cmake

This file was deleted.

78 changes: 41 additions & 37 deletions common/common.cpp
@@ -767,6 +767,9 @@ bool fs_validate_filename(const std::string & filename) {
return true;
}

#include <iostream>


// returns true if successful, false otherwise
bool fs_create_directory_with_parents(const std::string & path) {
#ifdef _WIN32
@@ -784,9 +787,16 @@ bool fs_create_directory_with_parents(const std::string & path) {
// process path from front to back, procedurally creating directories
while ((pos_slash = path.find('\\', pos_slash)) != std::string::npos) {
const std::wstring subpath = wpath.substr(0, pos_slash);
const wchar_t * test = subpath.c_str();

const bool success = CreateDirectoryW(test, NULL);
pos_slash += 1;

// skip the drive letter, in some systems it can return an access denied error
if (subpath.length() == 2 && subpath[1] == ':') {
continue;
}

const bool success = CreateDirectoryW(subpath.c_str(), NULL);

if (!success) {
const DWORD error = GetLastError();

Expand All @@ -800,8 +810,6 @@ bool fs_create_directory_with_parents(const std::string & path) {
return false;
}
}

pos_slash += 1;
}

return true;
@@ -897,34 +905,6 @@ struct common_init_result common_init_from_params(common_params & params) {

const llama_vocab * vocab = llama_model_get_vocab(model);

if (params.reranking) {
bool ok = true;

if (llama_vocab_bos(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have a BOS token, reranking will not work\n", __func__);
ok = false;
}

bool has_eos = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;
bool has_sep = llama_vocab_sep(vocab) != LLAMA_TOKEN_NULL;

if (!has_eos && !has_sep) {
LOG_WRN("%s: warning: vocab does not have an EOS token or SEP token, reranking will not work\n", __func__);
ok = false;
} else if (!has_eos) {
LOG_WRN("%s: warning: vocab does not have an EOS token, using SEP token as fallback\n", __func__);
} else if (!has_sep) {
LOG_WRN("%s: warning: vocab does not have a SEP token, reranking will not work\n", __func__);
ok = false;
}

if (!ok) {
llama_model_free(model);

return iparams;
}
}

auto cparams = common_context_params_to_llama(params);

llama_context * lctx = llama_init_from_model(model, cparams);
@@ -966,6 +946,35 @@ struct common_init_result common_init_from_params(common_params & params) {
}
}

if (llama_pooling_type(lctx) == LLAMA_POOLING_TYPE_RANK) {
bool ok = true;

if (llama_vocab_bos(vocab) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: vocab does not have a BOS token, reranking will not work\n", __func__);
ok = false;
}

bool has_eos = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;
bool has_sep = llama_vocab_sep(vocab) != LLAMA_TOKEN_NULL;

if (!has_eos && !has_sep) {
LOG_WRN("%s: warning: vocab does not have an EOS token or SEP token, reranking will not work\n", __func__);
ok = false;
} else if (!has_eos) {
LOG_WRN("%s: warning: vocab does not have an EOS token, using SEP token as fallback\n", __func__);
} else if (!has_sep) {
LOG_WRN("%s: warning: vocab does not have a SEP token, reranking will not work\n", __func__);
ok = false;
}

if (!ok) {
llama_free(lctx);
llama_model_free(model);

return iparams;
}
}

// load and optionally apply lora adapters
for (auto & la : params.lora_adapters) {
llama_adapter_lora_ptr lora;
@@ -1143,11 +1152,6 @@ struct llama_context_params common_context_params_to_llama(const common_params &
cparams.op_offload = !params.no_op_offload;
cparams.swa_full = params.swa_full;

if (params.reranking) {
cparams.embeddings = true;
cparams.pooling_type = LLAMA_POOLING_TYPE_RANK;
}

cparams.type_k = params.cache_type_k;
cparams.type_v = params.cache_type_v;

1 change: 0 additions & 1 deletion common/common.h
@@ -355,7 +355,6 @@ struct common_params {
int32_t embd_normalize = 2; // normalisation for embeddings (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
std::string embd_out = ""; // empty = default, "array" = [[],[]...], "json" = openai style, "json+" = same "json" + cosine similarity matrix
std::string embd_sep = "\n"; // separator of embeddings
bool reranking = false; // enable reranking support on server

// server params
int32_t port = 8080; // server listens on this network port