Rebase 20240414 #3
Open · wants to merge 569 commits into master

Commits (569)

4ab99d8
clip : rename lerp function to avoid conflict (#6894)
danbev Apr 25, 2024
5154372
ggml : fix redefinition of vaddvq_f32 for 32-bit ARM (#6906)
ggerganov Apr 25, 2024
0ead1f1
llama : check that all the tensor data is in the model file (#6885)
slaren Apr 25, 2024
3fe0596
readme : update model list (#6908)
BarfingLemurs Apr 25, 2024
853d06f
ci : tmp disable slow tests
ggerganov Apr 25, 2024
d6e1d44
llama : synchronize before get/set session data (#6911)
slaren Apr 25, 2024
fa0b4ad
cmake : remove obsolete ANDROID check
ggerganov Apr 25, 2024
dba497e
cmake : restore LLAMA_LLAMAFILE_DEFAULT
ggerganov Apr 25, 2024
46e12c4
llava : add support for moondream vision language model (#6899)
vikhyat Apr 25, 2024
5790c8d
bench: server add stop word for PHI-2 (#6916)
phymbert Apr 26, 2024
7d641c2
ci: fix concurrency for pull_request_target (#6917)
phymbert Apr 26, 2024
d4a9afc
ci: server: fix python installation (#6918)
phymbert Apr 26, 2024
83b72cb
Merge pull request from GHSA-p5mv-gjc5-mwqv
ggerganov Apr 26, 2024
9e4e077
ci: server: fix python installation (#6922)
phymbert Apr 26, 2024
7f5ff55
server: stop generation at `n_ctx_train` if `n_predict` is not set (#…
phymbert Apr 26, 2024
bbe3c6e
ci: server: fix python installation (#6925)
phymbert Apr 26, 2024
4b1c3c9
llamafile : use 64-bit integers in sgemm (#6928)
jart Apr 26, 2024
e2764cd
gguf : fix mismatch between alloc and free functions (#6929)
slaren Apr 26, 2024
017e699
add basic tensor data validation function (#6884)
slaren Apr 26, 2024
0c4d489
quantize: add imatrix and dataset metadata in GGUF (#6658)
phymbert Apr 26, 2024
928e0b7
Reset schedule earlier to allow overlap with ggml graph computation o…
agray3 Apr 26, 2024
b736833
ci: server: tests python env on github container ubuntu latest / fix …
phymbert Apr 27, 2024
4dba7e8
Replace "alternative" boolean operator in conditional compilation dir…
mgroeber9110 Apr 27, 2024
6e472f5
flake.lock: Update
github-actions[bot] Apr 28, 2024
ce023f6
add device version in device list (#6959)
arthw Apr 28, 2024
7bb36cc
gguf : enforce that tensor names are unique (#6905)
ngxson Apr 28, 2024
e00b4a8
Fix more int overflow during quant (PPL/CUDA). (#6563)
dranger003 Apr 28, 2024
c4f708a
llama : fix typo LAMMAFILE -> LLAMAFILE (#6974)
JohannesGaessler Apr 29, 2024
ca7f29f
ci : add building in MSYS2 environments (Windows) (#6967)
przemoc Apr 29, 2024
577277f
make : change GNU make default CXX from g++ to c++ (#6966)
przemoc Apr 29, 2024
3055a41
convert : fix conversion of some BERT embedding models (#6937)
christianazinn Apr 29, 2024
3f16747
sampling : use std::random_device{}() for default random seed (#6962)
dwrensha Apr 29, 2024
f4ab2a4
llama : fix BPE pre-tokenization (#6920)
ggerganov Apr 29, 2024
24affa7
readme : update hot topics
ggerganov Apr 29, 2024
ffe6665
llava-cli : multiple images (#6969)
cpumaxx Apr 29, 2024
544f1f1
ggml : fix __MSC_VER -> _MSC_VER (#6977)
ggerganov Apr 29, 2024
d2c898f
ci : tmp disable gguf-split (#6983)
ggerganov Apr 29, 2024
b8a7a5a
build(cmake): simplify instructions (`cmake -B build && cmake --build…
ochafik Apr 29, 2024
5539e6f
main : fix typo in comment in main.cpp (#6985)
danbev Apr 29, 2024
b8c1476
Extending grammar integration tests (#6644)
HanClinto Apr 29, 2024
8843a98
Improve usability of --model-url & related flags (#6930)
ochafik Apr 29, 2024
952d03d
convert : use utf8 encoding (#7000)
ggerganov Apr 30, 2024
9c67c27
ggml : add Flash Attention (#5021)
ggerganov Apr 30, 2024
a68a1e7
metal : log more info on error (#6987)
bakkot Apr 30, 2024
77e15be
metal : remove deprecated error code (#7008)
ggerganov Apr 30, 2024
f364eb6
switch to using localizedDescription (#7010)
bakkot Apr 30, 2024
a8f9b07
perplexity: more statistics, added documentation (#6936)
JohannesGaessler Apr 30, 2024
c4ec9c0
ci : exempt confirmed bugs from being tagged as stale (#7014)
slaren May 1, 2024
1613ef8
CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (#7019)
JohannesGaessler May 1, 2024
3ea0d36
Server: add tests for batch size, different seeds (#6950)
JohannesGaessler May 1, 2024
8d608a8
main : fix off by one error for context shift (#6921)
l3utterfly May 1, 2024
b0d943d
Update LOG_IMPL and LOG_TEE_IMPL (#7029)
a-downing May 1, 2024
6ecf318
chore: fix typo in llama.cpp (#7032)
alwqx May 2, 2024
60325fa
Remove .attention from skipped tensors to match more accurately (#7051)
bartowski1182 May 2, 2024
433def2
llama : rename ctx to user_data in progress_callback (#7045)
danbev May 3, 2024
a2ac89d
convert.py : add python logging instead of print() (#6511)
mofosyne May 3, 2024
92139b9
tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
ggerganov May 4, 2024
03fb8a0
If first token generated from the server is the stop word the server …
maor-ps May 4, 2024
fcd84a0
Fix Linux /sys cpu path to guess number of cores (#7064)
viric May 4, 2024
cf768b7
Tidy Android Instructions README.md (#7016)
Jeximo May 4, 2024
8425001
gguf-split: add --no-tensor-first-split (#7072)
ngxson May 4, 2024
6fbd432
py : logging and flake8 suppression refactoring (#7081)
mofosyne May 5, 2024
889bdd7
command-r : add BPE pre-tokenization (#7063)
dranger003 May 5, 2024
ca36326
readme : add note that LLaMA 3 is not supported with convert.py (#7065)
lyledean1 May 5, 2024
8f8acc8
Disable benchmark on forked repo (#7034)
CISC May 5, 2024
628b299
Adding support for the --numa argument for llama-bench. (#7080)
kunnis May 5, 2024
bcdee0d
minor : fix trailing whitespace
ggerganov May 6, 2024
b3a995b
flake.lock: Update (#7079)
ggerganov May 6, 2024
858f6b7
Add an option to build without CUDA VMM (#7067)
WilliamTambellini May 6, 2024
947d3ad
ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)
ggerganov May 7, 2024
04976db
docs: fix typos (#7124)
omahs May 7, 2024
3af34c1
main : update log text (EOS to EOG) (#7104)
RhinoDevel May 7, 2024
53d6c52
readme : update hot topics
ggerganov May 7, 2024
260b7c6
server : update readme with undocumented options (#7013)
K-Mistele May 7, 2024
b6aa670
Fix OLMo HF to GGUF conversion (#6910)
nopperl May 7, 2024
af0a5b6
server: fix incorrectly reported token probabilities (#7125)
JohannesGaessler May 7, 2024
48b2f9c
Fixed save_imatrix to match old behaviour for MoE (#7099)
jukofyork May 8, 2024
c780e75
Further tidy on Android instructions README.md (#7077)
Jeximo May 8, 2024
c0e6fbf
metal : fix unused warning
ggerganov May 8, 2024
3855416
ggml : introduce bfloat16 support (#6412)
jart May 8, 2024
acdce3c
compare-llama-bench.py: add missing basicConfig (#7138)
mofosyne May 8, 2024
7e0b6a7
py : also print the normalizers
ggerganov May 8, 2024
4cd621c
convert : add BPE pre-tokenization for DBRX (#7132)
dranger003 May 8, 2024
1fd9c17
clean up json_value & server_log (#7142)
ngxson May 8, 2024
229ffff
llama : add BPE pre-tokenization for Qwen2 (#7114)
jklj077 May 8, 2024
ad211ed
convert.py : --vocab-only generates false but valid params (#7027)
20kdc May 8, 2024
911b390
server : add_special option for tokenize endpoint (#7059)
JohanAR May 8, 2024
465263d
sgemm : AVX Q4_0 and Q8_0 (#6891)
netrunnereve May 8, 2024
83330d8
main : add --conversation / -cnv flag (#7108)
May 8, 2024
26458af
metal : use `vm_allocate` instead of `posix_memalign` on macOS (#7078)
giladgd May 8, 2024
bd1871f
server : add themes + favicon (#6848)
jboero May 8, 2024
9da243b
Revert "llava : add support for moondream vision language model (#6899)"
ggerganov May 8, 2024
c12452c
JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143)
JohannesGaessler May 8, 2024
bc4bba3
Introduction of CUDA Graphs to LLama.cpp (#6766)
agray3 May 8, 2024
f98eb31
convert-hf : save memory with lazy evaluation (#7075)
compilade May 8, 2024
4426e29
cmake : fix typo (#7151)
cebtenzzre May 8, 2024
07cd41d
TypoFix (#7162)
AhmedZeer May 9, 2024
4734524
opencl : alignment size converted from bits to bytes (#7090)
albertjin May 9, 2024
2284216
gguf-py : add special token modification capability (#7166)
CISC May 9, 2024
fd9f92b
llama : update llama_timings.n_p_eval setting (#7160)
danbev May 9, 2024
f31ec12
Add warning if token is invalid (#7173)
Galunid May 9, 2024
a743d76
CUDA: generalize FP16 fattn vec kernel (#7061)
JohannesGaessler May 9, 2024
43248e5
llama3 custom regex split (#6965)
jaime-m-p May 9, 2024
0961d86
readme : add app (#6371)
l3utterfly May 9, 2024
d46dbc7
readme : add scheduled server workflow status badge
ggerganov May 9, 2024
befddd0
Vulkan Bugfixes and Improvements (#7084)
0cc4m May 9, 2024
eaf4bd8
eval-callback : fix conversion to float (#7184)
slaren May 9, 2024
8c570c9
Minor arithmetic improvement to mmvq wrapper kernel (#7172)
OuadiElfarouki May 10, 2024
d11afd6
llava : fix moondream support (#7163)
abetlen May 10, 2024
f89fe27
Main+: optionally allow special tokens from user in interactive mode …
hanishkvc May 10, 2024
4e38809
Fix memory bug in grammar parser (#7194)
jart May 10, 2024
25c6e82
llama : use n_vocab to differentiate between mistral 7B and llama3 8B…
slaren May 10, 2024
8c66024
convert : print "ignore_merges" field
ggerganov May 10, 2024
18e4376
metal : fix flash attention kernel requirements (#7169)
ggerganov May 10, 2024
e849648
llama-bench : add pp+tg test type (#7199)
slaren May 10, 2024
9cb317f
ggml : full ALiBi support (#7192)
ggerganov May 11, 2024
b83cc3f
llama : add Jina Embeddings architecture (#6826)
JoanFM May 11, 2024
5ae3426
server: fix reported top tokens for temperature 0 (#7203)
JohannesGaessler May 11, 2024
f99e1e4
llama : lookup word in vocab before doing BPE merges (#7193)
tonyfettes May 11, 2024
9886313
server : free llama_batch on exit (#7212)
stevegrubb May 11, 2024
3292733
convert : skip unaccessible HF repos (#7210)
CrispStrobe May 11, 2024
ef0d5e3
build: fix and ignore msvc warnings (ggml/805)
iboB Apr 25, 2024
f5ef34e
feat: implemented sigmoid function (ggml/806)
justcho5 May 1, 2024
fae9d23
sync : ggml
ggerganov May 11, 2024
5a41992
convert-hf : support bfloat16 conversion (#7158)
compilade May 11, 2024
72c177c
fix system prompt handling (#7153)
ngxson May 11, 2024
fed0108
Scripting & documenting debugging one test without anything else in t…
josh-ramer May 11, 2024
325756d
ggml : resolve merge (ggml/0)
ggerganov May 11, 2024
6aeff24
metal : fix indent (ggml/0)
ggerganov May 11, 2024
1622ac0
sync : ggml
ggerganov May 11, 2024
7bd4ffb
metal : fix warnings (skipme) (#0)
ggerganov May 11, 2024
b228aba
remove convert-lora-to-ggml.py (#7204)
slaren May 12, 2024
6f1b636
cmake : fix version cmp (#7227)
ggerganov May 12, 2024
dc685be
CUDA: add FP32 FlashAttention vector kernel (#7188)
JohannesGaessler May 12, 2024
0d5cef7
[SYCL] update CI with oneapi 2024.1 (#7235)
arthw May 13, 2024
cbf7589
[SYCL] Add oneapi runtime dll files to win release package (#7241)
arthw May 13, 2024
e586ee4
change default temperature of OAI compat API from 0 to 1 (#7226)
Kartoffelsaft May 13, 2024
b1f8af1
convert.py: Outfile default name change and additional metadata suppo…
mofosyne May 13, 2024
9aa6724
llama : rename jina tokenizers to v2 (#7249)
May 13, 2024
948f4ec
[SYCL] rm wait() (#7233)
arthw May 13, 2024
1c570d8
perplexity: add BF16 vs. FP16 results (#7150)
JohannesGaessler May 13, 2024
30e7033
llava-cli: fix base64 prompt (#7248)
Adriankhl May 13, 2024
614d3b9
llama : less KV padding when FA is off (#7257)
ggerganov May 13, 2024
ee52225
convert-hf : support direct Q8_0 conversion (#7234)
compilade May 13, 2024
757f952
add detection of Xeon PHI: Knights Corner.
julialongtin Mar 12, 2024
78291d9
handle the case that we have no glibc on the PHI.
julialongtin Mar 12, 2024
b9e2f2a
instead of checking on glibc, check on SYS_getcpu
julialongtin Mar 12, 2024
f7f174e
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
8f6e535
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
25095ca
try to implement one intrinsic
julialongtin Mar 13, 2024
c08ddb8
use right type, and define GGML_F32_VEC_ZERO.
julialongtin Mar 13, 2024
59ce785
import intrinsics.
julialongtin Mar 13, 2024
2458643
implement F32 dot products.
julialongtin Mar 16, 2024
f940c96
Update ggml.c
julialongtin Mar 16, 2024
6e1b77a
Update ggml.c
julialongtin Mar 16, 2024
926b0e8
Update ggml.c
julialongtin Mar 16, 2024
6f699fc
merge from upstream
julialongtin Mar 17, 2024
72e2b13
add a benchmark / test binary.
julialongtin Mar 17, 2024
9ba28ea
Update ggml-phi-knc.c
julialongtin Mar 17, 2024
580a347
remove intrinsics import, and use upConv to save 12 bytes of memory t…
julialongtin Mar 20, 2024
97c6983
use the same header as ggml.c, and remove some warnings.
julialongtin Mar 20, 2024
56be29f
formatting changes.
julialongtin Mar 20, 2024
c605e95
spacing changes.
julialongtin Mar 21, 2024
16cbe5d
be more specific about the length of our list of run amounts.
julialongtin Mar 21, 2024
0e6c910
begin work on targeting dot_q5_K_q8_K.
julialongtin Mar 23, 2024
96dce97
import stdint.h for sizeSt.
julialongtin Mar 23, 2024
7080280
import stdio.h for size_t.
julialongtin Mar 23, 2024
2c5daab
pull in ggml specific types.
julialongtin Mar 23, 2024
b794e48
tell ggml-common.h to export what we want.
julialongtin Mar 23, 2024
d5f39c3
force to compile.
julialongtin Mar 23, 2024
2ed3066
allow using code from ggml-phi-knc-dot_q5_K_q8_K.c
julialongtin Mar 23, 2024
feed51c
attempt to speed up float clearing.
julialongtin Mar 23, 2024
ea858ee
first fixes.
julialongtin Mar 23, 2024
b92e064
formatting improvement.
julialongtin Mar 23, 2024
405b5fa
promote aux16 into a vector.
julialongtin Mar 23, 2024
fb0fb9f
promote aux16 into a vector.
julialongtin Mar 23, 2024
484c4ab
promote aux16 into a vector. (part three)
julialongtin Mar 23, 2024
9f92f97
fix typo.
julialongtin Mar 23, 2024
d5a27eb
copy right block.
julialongtin Mar 23, 2024
e227717
add missing variable.
julialongtin Mar 23, 2024
3994d81
try to use vectorized zeroing function.
julialongtin Mar 23, 2024
588a0b1
expand mask, and align memory.
julialongtin Mar 23, 2024
2dc7991
use better memory save operator.
julialongtin Mar 23, 2024
df33835
use quotes properly.
julialongtin Mar 23, 2024
bff7b69
promote aux16 to a vector.
julialongtin Mar 23, 2024
a9cc0e7
add missing address of operators.
julialongtin Mar 23, 2024
1446a72
promote aux32 to a vector.
julialongtin Mar 23, 2024
b22e3e0
add I32 vector memory clearing.
julialongtin Mar 23, 2024
e72539b
attempt our first FMA.
julialongtin Mar 23, 2024
6d4535e
use proper mov operator, and pass addresses.
julialongtin Mar 23, 2024
7e3eb5c
perform 16 operations at a time.
julialongtin Mar 24, 2024
b5c1135
better comments, and fix some small errors.
julialongtin Mar 24, 2024
babe051
spacing changes, eliminate dead references to k1 or zero, and use the…
julialongtin Mar 24, 2024
a95c7b0
fix our reference to src in the second place, and use a more accurate…
julialongtin Mar 24, 2024
185d4b8
promote aux8 into a vector.
julialongtin Mar 24, 2024
d351d99
loosen alignment requirements for zeros, add missing function, and pr…
julialongtin Mar 24, 2024
5a60242
separate filling aux16 from consuming aux16 by making it an array of …
julialongtin Mar 24, 2024
e66a97f
fix vector sizes.
julialongtin Mar 25, 2024
efcd202
massively rewrite assembly routines.
julialongtin Apr 2, 2024
021ae03
minor changes.
julialongtin Apr 2, 2024
aa33f28
formatting.
julialongtin Apr 2, 2024
481f174
indent headers consistently.
julialongtin Apr 3, 2024
e544a3f
formatting changes.
julialongtin Apr 3, 2024
10f0637
add Makefile rule for generation .s file, for manual inspection.
julialongtin Apr 3, 2024
7214391
whoops. missing tab.
julialongtin Apr 3, 2024
feb8bcc
use GGML_F32_EPR, and remove some dead code.
julialongtin Apr 3, 2024
039685d
reformat, and label what these files are.
julialongtin Apr 3, 2024
a33c82b
replace tabs with spaces.
julialongtin Apr 3, 2024
934f869
further optimizations. 0.99 tokens per second.
julialongtin Apr 22, 2024
5b2023b
fix some small errors.
julialongtin Apr 22, 2024
d27cd93
fix an offset error, and get rid of tabs.
julialongtin Apr 22, 2024
2cfc15b
comment and spacing fixes.
julialongtin Apr 24, 2024
93d0a0a
use or, instead of and. bug fix?
julialongtin Apr 24, 2024
1ba6534
spacing and capitalization changes.
julialongtin Apr 25, 2024
e108564
spacing and capitalization changes. Fix the register list of GGML_5bi…
julialongtin Apr 26, 2024
b33cd8d
minor spacing and comment changes.
julialongtin May 9, 2024
de44c66
add batch fp16<->fp32 conversion functions.
julialongtin May 9, 2024
30e8b37
remove a warning.
julialongtin May 9, 2024
bf674be
fix typo
julialongtin May 9, 2024
201566c
use different restrict syntax, to make g++ happy.
julialongtin May 9, 2024
7efdcf5
broadcast a single int8, instead of 4 of them.
julialongtin May 10, 2024
0261b3b
Use a vectorized assembly function to handle remaining chunks less th…
julialongtin May 10, 2024
14638be
use vbroadcastss in place of vbroadcast32x4.
julialongtin May 10, 2024
cb96a48
perform better prefetches, and invert the test of our clear flag for …
julialongtin May 10, 2024
156b9b6
remove useless prefetches.
julialongtin May 10, 2024
1d74ddb
spacing and comment changes.
julialongtin May 10, 2024
50800b9
move sub earlier, and move the compare of iterations to outside, and …
julialongtin May 10, 2024
6bd8dcb
fix loop.
julialongtin May 10, 2024
f1af881
use values inside of the loop as soon as we have them.
julialongtin May 10, 2024
fc828b4
correct a comment, and use jz when comparing to zero.
julialongtin May 10, 2024
7819247
comment clarification.
julialongtin May 10, 2024
cfe47d0
change from handling three iterations per loop to four.
julialongtin May 11, 2024
e8087c5
subtract the correct amount.
julialongtin May 11, 2024
41a9ed0
look at the right final memory location.
julialongtin May 11, 2024
9372048
add missing jump.
julialongtin May 11, 2024
6ed6f2f
spacing changes.
julialongtin May 11, 2024
5615c86
spacing changes.
julialongtin May 11, 2024
8e854f4
introduce r10 and r11, for vloadunpackhd.
julialongtin May 11, 2024
25fc1d6
rename label 1 to 3.
julialongtin May 11, 2024
9396061
rename some labels.
julialongtin May 11, 2024
5e7d7ab
relabel some other labels.
julialongtin May 11, 2024
5c76364
fill and increment r12 and r13.
julialongtin May 11, 2024
259da93
add missing vector.
julialongtin May 11, 2024
d526074
make the offset of q4 available.
julialongtin May 11, 2024
f0d4f51
minor comment fixes.
julialongtin May 11, 2024
92bd588
load from identical addresses for low and high side.
julialongtin May 11, 2024
464a74f
make offset available in a register.
julialongtin May 11, 2024
f7b062f
do 2 rounds of 4, instead of 4 rounds of 2. and properly offset unall…
julialongtin May 11, 2024
b90a41d
spacing changes.
julialongtin May 12, 2024
18cb539
fix a missing endif.
julialongtin May 13, 2024
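
Note: the long run of julialongtin commits above (the ggml-phi-knc files) hand-vectorizes dot-product kernels for the first-generation Xeon Phi (Knights Corner), first for F32 and then for dot_q5_K_q8_K, working 16 floats at a time with fused multiply-accumulates and a vectorized tail routine for chunks shorter than 16. As a point of reference only, here is a minimal portable C++ sketch of the F32 dot product those commits accelerate; this illustrates the operation and the 16-wide accumulator structure the commit messages describe, and is not code from this PR.

#include <stddef.h>

// Portable reference for the F32 dot product that ggml-phi-knc
// implements with KNC vector instructions. The 16-element
// accumulator mirrors the "perform 16 operations at a time"
// structure described in the commit messages.
static float dot_f32(const float *x, const float *y, size_t n) {
    float acc[16] = {0.0f};            // the KNC version zeroes this with a vector store
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {     // main loop: 16 multiply-accumulates per pass
        for (int j = 0; j < 16; ++j) {
            acc[j] += x[i + j] * y[i + j];
        }
    }
    float sum = 0.0f;
    for (int j = 0; j < 16; ++j) {     // horizontal reduction of the accumulator
        sum += acc[j];
    }
    for (; i < n; ++i) {               // scalar tail here; the PR instead uses a
        sum += x[i] * y[i];            // vectorized routine for the remainder
    }
    return sum;
}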

Files changed

1 change: 1 addition & 0 deletions .clang-tidy
@@ -12,6 +12,7 @@ Checks: >
     -readability-implicit-bool-conversion,
     -readability-magic-numbers,
     -readability-uppercase-literal-suffix,
+    -readability-simplify-boolean-expr,
     clang-analyzer-*,
     -clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,
     performance-*,
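
For context on the line added above: readability-simplify-boolean-expr is the clang-tidy check that flags boolean expressions and branches that can be reduced to a simpler form, so disabling it stops reports on patterns like the following hypothetical C++ illustration (not code from this PR):

// clang-tidy's readability-simplify-boolean-expr would flag this branch...
bool is_positive(int x) {
    if (x > 0) {
        return true;
    } else {
        return false;
    }
}
// ...and suggest collapsing it to: return x > 0;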
8 changes: 5 additions & 3 deletions .devops/full-cuda.Dockerfile
@@ -12,7 +12,7 @@ FROM ${BASE_CUDA_DEV_CONTAINER} as build
 ARG CUDA_DOCKER_ARCH=all

 RUN apt-get update && \
-    apt-get install -y build-essential python3 python3-pip git
+    apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev

 COPY requirements.txt requirements.txt
 COPY requirements requirements
@@ -26,8 +26,10 @@ COPY . .

 # Set nvcc architecture
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-# Enable cuBLAS
-ENV LLAMA_CUBLAS=1
+# Enable CUDA
+ENV LLAMA_CUDA=1
+# Enable cURL
+ENV LLAMA_CURL=1

 RUN make

5 changes: 5 additions & 0 deletions .devops/full-rocm.Dockerfile
@@ -40,6 +40,11 @@ ENV LLAMA_HIPBLAS=1
 ENV CC=/opt/rocm/llvm/bin/clang
 ENV CXX=/opt/rocm/llvm/bin/clang++

+# Enable cURL
+ENV LLAMA_CURL=1
+RUN apt-get update && \
+    apt-get install -y libcurl4-openssl-dev
+
 RUN make

 ENTRYPOINT ["/app/.devops/tools.sh"]
5 changes: 4 additions & 1 deletion .devops/full.Dockerfile
@@ -3,7 +3,7 @@ ARG UBUNTU_VERSION=22.04
 FROM ubuntu:$UBUNTU_VERSION as build

 RUN apt-get update && \
-    apt-get install -y build-essential python3 python3-pip git
+    apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev

 COPY requirements.txt requirements.txt
 COPY requirements requirements
@@ -15,6 +15,9 @@ WORKDIR /app

 COPY . .

+ENV LLAMA_CURL=1
+
+
 RUN make

 ENV LC_ALL=C.utf8
2 changes: 1 addition & 1 deletion .devops/llama-cpp-clblast.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

.devops/llama-cpp-cublas.srpm.spec → .devops/llama-cpp-cuda.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

@@ -12,7 +12,7 @@
 # 4. OpenCL/CLBLAST support simply requires the ICD loader and basic opencl libraries.
 #    It is up to the user to install the correct vendor-specific support.

-Name: llama.cpp-cublas
+Name: llama.cpp-cuda
 Version: %( date "+%%Y%%m%%d" )
 Release: 1%{?dist}
 Summary: CPU Inference of LLaMA model in pure C/C++ (no CUDA/OpenCL)
@@ -32,24 +32,24 @@ CPU inference for Meta's Lllama2 models using default options.
 %setup -n llama.cpp-master

 %build
-make -j LLAMA_CUBLAS=1
+make -j LLAMA_CUDA=1

 %install
 mkdir -p %{buildroot}%{_bindir}/
-cp -p main %{buildroot}%{_bindir}/llamacppcublas
-cp -p server %{buildroot}%{_bindir}/llamacppcublasserver
-cp -p simple %{buildroot}%{_bindir}/llamacppcublassimple
+cp -p main %{buildroot}%{_bindir}/llamacppcuda
+cp -p server %{buildroot}%{_bindir}/llamacppcudaserver
+cp -p simple %{buildroot}%{_bindir}/llamacppcudasimple

 mkdir -p %{buildroot}/usr/lib/systemd/system
-%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamacublas.service
+%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamacuda.service
 [Unit]
 Description=Llama.cpp server, CPU only (no GPU support in this build).
 After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.target

 [Service]
 Type=simple
 EnvironmentFile=/etc/sysconfig/llama
-ExecStart=/usr/bin/llamacppcublasserver $LLAMA_ARGS
+ExecStart=/usr/bin/llamacppcudaserver $LLAMA_ARGS
 ExecReload=/bin/kill -s HUP $MAINPID
 Restart=never

@@ -67,10 +67,10 @@ rm -rf %{buildroot}
 rm -rf %{_builddir}/*

 %files
-%{_bindir}/llamacppcublas
-%{_bindir}/llamacppcublasserver
-%{_bindir}/llamacppcublassimple
-/usr/lib/systemd/system/llamacublas.service
+%{_bindir}/llamacppcuda
+%{_bindir}/llamacppcudaserver
+%{_bindir}/llamacppcudasimple
+/usr/lib/systemd/system/llamacuda.service
 %config /etc/sysconfig/llama

 %pre
2 changes: 1 addition & 1 deletion .devops/llama-cpp.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

4 changes: 2 additions & 2 deletions .devops/main-cuda.Dockerfile
@@ -20,8 +20,8 @@ COPY . .

 # Set nvcc architecture
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-# Enable cuBLAS
-ENV LLAMA_CUBLAS=1
+# Enable CUDA
+ENV LLAMA_CUDA=1

 RUN make

8 changes: 3 additions & 5 deletions .devops/main-intel.Dockerfile
@@ -10,14 +10,12 @@ WORKDIR /app

 COPY . .

-RUN mkdir build && \
-    cd build && \
-    if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
+RUN if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
         echo "LLAMA_SYCL_F16 is set" && \
         export OPT_SYCL_F16="-DLLAMA_SYCL_F16=ON"; \
     fi && \
-    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ${OPT_SYCL_F16} && \
-    cmake --build . --config Release --target main
+    cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ${OPT_SYCL_F16} && \
+    cmake --build build --config Release --target main

 FROM intel/oneapi-basekit:$ONEAPI_VERSION as runtime

6 changes: 2 additions & 4 deletions .devops/main-vulkan.Dockerfile
@@ -14,10 +14,8 @@ RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key
 # Build it
 WORKDIR /app
 COPY . .
-RUN mkdir build && \
-    cd build && \
-    cmake .. -DLLAMA_VULKAN=1 && \
-    cmake --build . --config Release --target main
+RUN cmake -B build -DLLAMA_VULKAN=1 && \
+    cmake --build build --config Release --target main

 # Clean up
 WORKDIR /
48 changes: 35 additions & 13 deletions .devops/nix/package.nix
@@ -4,13 +4,14 @@
   config,
   stdenv,
   mkShell,
+  runCommand,
   cmake,
   ninja,
   pkg-config,
   git,
   python3,
   mpi,
-  openblas, # TODO: Use the generic `blas` so users could switch between alternative implementations
+  blas,
   cudaPackages,
   darwin,
   rocmPackages,
@@ -23,7 +24,7 @@
     useOpenCL
     useRocm
     useVulkan
-  ],
+  ] && blas.meta.available,
   useCuda ? config.cudaSupport,
   useMetalKit ? stdenv.isAarch64 && stdenv.isDarwin && !useOpenCL,
   useMpi ? false, # Increases the runtime closure size by ~700M
@@ -35,7 +36,8 @@
   # It's necessary to consistently use backendStdenv when building with CUDA support,
   # otherwise we get libstdc++ errors downstream.
   effectiveStdenv ? if useCuda then cudaPackages.backendStdenv else stdenv,
-  enableStatic ? effectiveStdenv.hostPlatform.isStatic
+  enableStatic ? effectiveStdenv.hostPlatform.isStatic,
+  precompileMetalShaders ? false
 }@inputs:

 let
@@ -65,10 +67,15 @@ let
     strings.optionalString (suffices != [ ])
       ", accelerated with ${strings.concatStringsSep ", " suffices}";

+  executableSuffix = effectiveStdenv.hostPlatform.extensions.executable;
+
   # TODO: package the Python in this repository in a Nix-like way.
   # It'd be nice to migrate to buildPythonPackage, as well as ensure this repo
   # is PEP 517-compatible, and ensure the correct .dist-info is generated.
   # https://peps.python.org/pep-0517/
+  #
+  # TODO: Package up each Python script or service appropriately, by making
+  # them into "entrypoints"
   llama-python = python3.withPackages (
     ps: [
       ps.numpy
@@ -87,6 +94,11 @@
     ]
   );

+  xcrunHost = runCommand "xcrunHost" {} ''
+    mkdir -p $out/bin
+    ln -s /usr/bin/xcrun $out/bin
+  '';
+
   # apple_sdk is supposed to choose sane defaults, no need to handle isAarch64
   # separately
   darwinBuildInputs =
@@ -150,13 +162,18 @@ effectiveStdenv.mkDerivation (
     postPatch = ''
       substituteInPlace ./ggml-metal.m \
         --replace '[bundle pathForResource:@"ggml-metal" ofType:@"metal"];' "@\"$out/bin/ggml-metal.metal\";"
-
-      # TODO: Package up each Python script or service appropriately.
-      # If we were to migrate to buildPythonPackage and prepare the `pyproject.toml`,
-      # we could make those *.py into setuptools' entrypoints
-      substituteInPlace ./*.py --replace "/usr/bin/env python" "${llama-python}/bin/python"
+      substituteInPlace ./ggml-metal.m \
+        --replace '[bundle pathForResource:@"default" ofType:@"metallib"];' "@\"$out/bin/default.metallib\";"
     '';

+    # With PR#6015 https://github.com/ggerganov/llama.cpp/pull/6015,
+    # `default.metallib` may be compiled with Metal compiler from XCode
+    # and we need to escape sandbox on MacOS to access Metal compiler.
+    # `xcrun` is used find the path of the Metal compiler, which is varible
+    # and not on $PATH
+    # see https://github.com/ggerganov/llama.cpp/pull/6118 for discussion
+    __noChroot = effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders;
+
     nativeBuildInputs =
       [
         cmake
@@ -173,6 +190,8 @@
       ]
       ++ optionals (effectiveStdenv.hostPlatform.isGnu && enableStatic) [
         glibc.static
+      ] ++ optionals (effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders) [
+        xcrunHost
       ];

     buildInputs =
@@ -181,6 +200,7 @@
       ++ optionals useMpi [ mpi ]
       ++ optionals useOpenCL [ clblast ]
       ++ optionals useRocm rocmBuildInputs
+      ++ optionals useBlas [ blas ]
      ++ optionals useVulkan vulkanBuildInputs;

     cmakeFlags =
@@ -191,7 +211,7 @@
         (cmakeBool "CMAKE_SKIP_BUILD_RPATH" true)
         (cmakeBool "LLAMA_BLAS" useBlas)
         (cmakeBool "LLAMA_CLBLAST" useOpenCL)
-        (cmakeBool "LLAMA_CUBLAS" useCuda)
+        (cmakeBool "LLAMA_CUDA" useCuda)
         (cmakeBool "LLAMA_HIPBLAS" useRocm)
         (cmakeBool "LLAMA_METAL" useMetalKit)
         (cmakeBool "LLAMA_MPI" useMpi)
@@ -216,14 +236,16 @@
         # Should likely use `rocmPackages.clr.gpuTargets`.
         "-DAMDGPU_TARGETS=gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"
       ]
-      ++ optionals useMetalKit [ (lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1") ]
-      ++ optionals useBlas [ (lib.cmakeFeature "LLAMA_BLAS_VENDOR" "OpenBLAS") ];
+      ++ optionals useMetalKit [
+        (lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1")
+        (cmakeBool "LLAMA_METAL_EMBED_LIBRARY" (!precompileMetalShaders))
+      ];

     # TODO(SomeoneSerge): It's better to add proper install targets at the CMake level,
     # if they haven't been added yet.
     postInstall = ''
-      mv $out/bin/main $out/bin/llama
-      mv $out/bin/server $out/bin/llama-server
+      mv $out/bin/main${executableSuffix} $out/bin/llama${executableSuffix}
+      mv $out/bin/server${executableSuffix} $out/bin/llama-server${executableSuffix}
       mkdir -p $out/include
       cp $src/llama.h $out/include/
     '';
11 changes: 8 additions & 3 deletions .devops/server-cuda.Dockerfile
@@ -12,21 +12,26 @@ FROM ${BASE_CUDA_DEV_CONTAINER} as build
 ARG CUDA_DOCKER_ARCH=all

 RUN apt-get update && \
-    apt-get install -y build-essential git
+    apt-get install -y build-essential git libcurl4-openssl-dev

 WORKDIR /app

 COPY . .

 # Set nvcc architecture
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-# Enable cuBLAS
-ENV LLAMA_CUBLAS=1
+# Enable CUDA
+ENV LLAMA_CUDA=1
+# Enable cURL
+ENV LLAMA_CURL=1

 RUN make

 FROM ${BASE_CUDA_RUN_CONTAINER} as runtime

+RUN apt-get update && \
+    apt-get install -y libcurl4-openssl-dev
+
 COPY --from=build /app/server /server

 ENTRYPOINT [ "/server" ]
13 changes: 7 additions & 6 deletions .devops/server-intel.Dockerfile
@@ -4,23 +4,24 @@ FROM intel/oneapi-basekit:$ONEAPI_VERSION as build

 ARG LLAMA_SYCL_F16=OFF
 RUN apt-get update && \
-    apt-get install -y git
+    apt-get install -y git libcurl4-openssl-dev

 WORKDIR /app

 COPY . .

-RUN mkdir build && \
-    cd build && \
-    if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
+RUN if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
         echo "LLAMA_SYCL_F16 is set" && \
         export OPT_SYCL_F16="-DLLAMA_SYCL_F16=ON"; \
     fi && \
-    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ${OPT_SYCL_F16} && \
-    cmake --build . --config Release --target server
+    cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON ${OPT_SYCL_F16} && \
+    cmake --build build --config Release --target server

 FROM intel/oneapi-basekit:$ONEAPI_VERSION as runtime

+RUN apt-get update && \
+    apt-get install -y libcurl4-openssl-dev
+
 COPY --from=build /app/build/bin/server /server

 ENV LC_ALL=C.utf8
5 changes: 5 additions & 0 deletions .devops/server-rocm.Dockerfile
@@ -40,6 +40,11 @@ ENV LLAMA_HIPBLAS=1
 ENV CC=/opt/rocm/llvm/bin/clang
 ENV CXX=/opt/rocm/llvm/bin/clang++

+# Enable cURL
+ENV LLAMA_CURL=1
+RUN apt-get update && \
+    apt-get install -y libcurl4-openssl-dev
+
 RUN make

 ENTRYPOINT [ "/app/server" ]