Description
see #27
- Gemma docs: https://ai.google.dev/gemma/docs?hl=en
- Gemma on Kaggle: https://www.kaggle.com/models/google/gemma
- Gemma on Vertex AI Model Garden: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?_ga=2.34476193.-1036776313.1707424880&hl=en
- Gemma technical report (PDF): https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
- Announcement blog post: https://blog.google/technology/developers/gemma-open-models/
- gemma-7b on Hugging Face: https://huggingface.co/google/gemma-7b
- NVIDIA L4 product brief (PDF): https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/l4/PB-11316-001_v01.pdf
Pull and rebuild the latest llama.cpp (see the previous article on running Llama 70B - #7):
abetlen/llama-cpp-python#1207
ggml-org/llama.cpp@580111d
Both models are working perfectly: 7B (a ~32 GB model that needs 64 GB of RAM on CPU, or an RTX A6000/RTX 5000 Ada GPU) and 2B (on a MacBook M1 Max with 32 GB unified RAM).
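The memory figures above follow a simple back-of-the-envelope rule: resident weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is an assumption, not an official formula):

```python
def model_ram_gib(n_params: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough RAM estimate in GiB: weights * dtype size, with ~20% headroom
    for KV cache and activations (overhead factor is an assumption)."""
    return n_params * bytes_per_param * overhead / 2**30

if __name__ == "__main__":
    # fp32 weights (4 bytes/param): a 7B-parameter model lands near the ~32 GB figure above
    print(f"7B fp32: {model_ram_gib(7e9, 4):.1f} GiB")
    print(f"2B fp32: {model_ram_gib(2e9, 4):.1f} GiB")
```

Quantized GGUF files shrink accordingly (e.g. 4-bit weights are ~1/8 the fp32 size), which is why the 2B model fits comfortably in 32 GB of unified RAM.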
```
obrien@mbp7 llama.cpp % ./main -m models/gemma-2b.gguf -p "Describe how gold is made in collapsing stars" -t 24 -n 1000 -e --color
Log start
main: build = 2234 (973053d8)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1708573311
llama_model_loader: loaded meta data with 19 key-value pairs and 164 tensors from models/gemma-2b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
...
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 9561.31 MiB, ( 9561.38 / 21845.34)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: Metal buffer size = 9561.30 MiB
llm_load_tensors: CPU buffer size = 2001.00 MiB
.............................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
llama_print_timings: load time = 10956.18 ms
llama_print_timings: sample time = 650.20 ms / 1000 runs ( 0.65 ms per token, 1537.98 tokens per second)
llama_print_timings: prompt eval time = 55.43 ms / 9 tokens ( 6.16 ms per token, 162.36 tokens per second)
llama_print_timings: eval time = 32141.38 ms / 999 runs ( 32.17 ms per token, 31.08 tokens per second)
llama_print_timings: total time = 33773.63 ms / 1008 tokens
ggml_metal_free: deallocating
```
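The throughput figures in the `llama_print_timings` lines can be reproduced from the raw counts: tokens per second is just token count divided by elapsed seconds. A quick sanity check (pure arithmetic, no llama.cpp dependency):

```python
# Reproduce the tokens-per-second figures from the timing log above.
def tok_per_sec(ms_total: float, n_tokens: int) -> float:
    """Tokens per second given a total elapsed time in milliseconds."""
    return n_tokens / (ms_total / 1000.0)

# eval: 999 tokens in 32141.38 ms -> ~31.08 tok/s, matching the log
print(round(tok_per_sec(32141.38, 999), 2))
# prompt eval: 9 tokens in 55.43 ms -> ~162.4 tok/s (matches the log to rounding)
print(round(tok_per_sec(55.43, 9), 1))
```

So on the M1 Max, generation runs at roughly 31 tokens/second for the 2B model with all 19 layers offloaded to Metal.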