
Google Gemma 7B/2B OSS models are available on Hugging Face as of 2024-02-21 #13

Opened by @obriensystems

see #27
https://ai.google.dev/gemma/docs?hl=en
https://www.kaggle.com/models/google/gemma

Gemma on Vertex AI Model Garden:
https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?hl=en

https://obrienlabs.medium.com/google-gemma-7b-and-2b-llm-models-are-now-available-to-developers-as-oss-on-hugging-face-737f65688f0d

https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
https://blog.google/technology/developers/gemma-open-models/

https://huggingface.co/google/gemma-7b
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/l4/PB-11316-001_v01.pdf

Pull and rebuild the latest llama.cpp (see the previous article on running Llama 70B - #7); a minimal build sketch follows the references below.

abetlen/llama-cpp-python#1207
ggml-org/llama.cpp@580111d
[screenshot: 2024-02-21 22:49:34]
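
A minimal sketch of the pull-and-rebuild step, assuming a fresh or existing local clone and the standard Makefile build (the clean step is optional and illustrative):

git clone https://github.com/ggml-org/llama.cpp.git   # skip if already cloned
cd llama.cpp
git pull                                              # picks up the Gemma support commit referenced above
make clean && make                                    # on Apple Silicon this builds ./main with Metal enabled by default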

Both sizes run: 7B (a 32 GB model that needs 64 GB of RAM on a CPU, or an RTX A6000/RTX 5000 Ada class GPU) and 2B (working perfectly on a MacBook M1 Max with 32 GB unified RAM).
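
For reference, one way to fetch the 2B weights used below, assuming the gemma-2b.gguf file Google published in the google/gemma-2b Hugging Face repo and the huggingface-cli tool from huggingface_hub (the models/ target directory is illustrative):

huggingface-cli login                                                        # the Gemma license must be accepted on Hugging Face first
huggingface-cli download google/gemma-2b gemma-2b.gguf --local-dir models

In the run below, the ./main flags are: -m model path, -p prompt, -t CPU threads, -n maximum tokens to generate, -e process escape sequences in the prompt, --color colorized output.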

obrien@mbp7 llama.cpp % ./main -m models/gemma-2b.gguf -p "Describe how gold is made in collapsing stars" -t 24 -n 1000 -e --color 
Log start
main: build = 2234 (973053d8)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed  = 1708573311
llama_model_loader: loaded meta data with 19 key-value pairs and 164 tensors from models/gemma-2b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
...
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  9561.31 MiB, ( 9561.38 / 21845.34)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors:      Metal buffer size =  9561.30 MiB
llm_load_tensors:        CPU buffer size =  2001.00 MiB
.............................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil


llama_print_timings:        load time =   10956.18 ms
llama_print_timings:      sample time =     650.20 ms /  1000 runs   (    0.65 ms per token,  1537.98 tokens per second)
llama_print_timings: prompt eval time =      55.43 ms /     9 tokens (    6.16 ms per token,   162.36 tokens per second)
llama_print_timings:        eval time =   32141.38 ms /   999 runs   (   32.17 ms per token,    31.08 tokens per second)
llama_print_timings:       total time =   33773.63 ms /  1008 tokens
ggml_metal_free: deallocating
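
Working the timings through: generation produced 999 tokens in 32.14 s, about 31.1 tokens/s, while prompt ingestion ran 9 tokens in 55.4 ms, about 162 tokens/s; end to end, that is 1008 tokens in 33.77 s.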

https://cloud.google.com/blog/products/ai-machine-learning/performance-deepdive-of-gemma-on-google-cloud
