[User] Embedding doesn't seem to work? #899
Comments
It seems embedding.cpp returns the output embeddings. |
In reality, "hello" and "hello " are different phrases. However, these two phrases should be closer to each other than to other phrases. I've made two scripts for testing the embedding behaviour, namely:
#!/bin/bash
# /* +----------------------------------+ */
# /* | LLaMA Embeddings Tester | */
# /* | get_embeddings.sh | */
# /* | (c)copyright nitram147 2023 | */
# /* +----------------------------------+ */
usage="Usage: bash $0 path_to_model phrase"
if [[ $# -ne 2 ]]; then
    echo "Invalid number of parameters!" >&2
    echo "$usage"
    exit 1
fi

if [[ ! -f $1 ]]; then
    echo "Invalid path to model!" >&2
    echo "$usage"
    exit 2
fi

# better way would be to calculate model's weight hash, however that would take a while
model_path_hash=$(echo -n "$1" | sha256sum | head -c 64)
phrase_hash=$(echo -n "$2" | sha256sum | head -c 64)
mkdir -p results/"$model_path_hash"

if [[ -f results/"$model_path_hash"/"$phrase_hash" ]]; then
    echo "Embedding was already calculated by previous run"
    exit 0
fi

echo "Calculating embedding for phrase: $2"
echo "Phrase: $2" >results/"$model_path_hash"/"$phrase_hash"
# quote the model path so that paths containing spaces still work
./embedding -m "$1" -p "$2" >>results/"$model_path_hash"/"$phrase_hash"

And

#!/usr/bin/python3
# /* +----------------------------------+ */
# /* | LLaMA Embeddings Tester | */
# /* | compare_embeddings.py | */
# /* | (c)copyright nitram147 2023 | */
# /* +----------------------------------+ */
import sys
import glob
import math
def print_help(script_name: str) -> None:
    print("Usage: python3 " + script_name + " path_to_results_folder")

def get_results_subfolders(path_to_results_folder: str) -> list:
    return [
        x + "/" for x in sorted(glob.glob(path_to_results_folder + "*"))
        if glob.os.path.isdir(x)
    ]

def get_results_filenames_from_folder(folder: str) -> list:
    return [
        x for x in sorted(glob.glob(folder + "*"))
        if glob.os.path.isfile(x) and len(glob.os.path.basename(x)) == 64
    ]

def load_embedding_from_file(file: str) -> dict:
    if not glob.os.path.isfile(file): raise ValueError("Invalid argument provided!!!")
    lines = [x.strip("\n") for x in open(file, "r").readlines()]
    if not lines[0].startswith("Phrase: "): raise ValueError("Invalid result file provided!!!")
    #remove last space character on the end of returned embedding by [:-1]
    return { lines[0][len("Phrase: "):] : [float(x) for x in lines[1][:-1].split(" ")] }

def get_distance_between_embeddings(first: list, second: list) -> float:
    if (
        not isinstance(first, list) or
        not isinstance(second, list)
    ): raise ValueError("Invalid arguments provided!!!")
    return math.dist(first, second)

def get_table_index(i: int, j: int, length: int) -> int:
    if j < i: i, j = j, i
    return sum([length - x for x in range(i)]) + (j - i)

if len(sys.argv) != 2:
    print("Invalid count of arguments! See help below:", file=sys.stderr)
    print_help(sys.argv[0])
    sys.exit(1)

path_to_results_folder = sys.argv[1] + "/" if sys.argv[1][-1] != "/" else sys.argv[1]
results_subfolders = get_results_subfolders(path_to_results_folder)

for folder in results_subfolders:
    print("Analyzing data in folder: " + folder)
    filenames = get_results_filenames_from_folder(folder)
    phrases_embeddings = sorted(
        [load_embedding_from_file(file) for file in filenames],
        key = lambda v: list(v.keys())[0]
    )
    phrases_count = len(phrases_embeddings)
    distances = []
    for i in range(phrases_count):
        for j in range(i, phrases_count):
            distances.append(
                get_distance_between_embeddings(
                    phrases_embeddings[i][list(phrases_embeddings[i].keys())[0]],
                    phrases_embeddings[j][list(phrases_embeddings[j].keys())[0]]
                )
            )
    for i in range(phrases_count):
        print("Distance from phrase \"" + list(phrases_embeddings[i].keys())[0] + "\" to:")
        for j in range(phrases_count):
            print(
                "\tPhrase: \"" + list(phrases_embeddings[j].keys())[0] + "\" is " +
                str(distances[get_table_index(i, j, phrases_count)])
            )

To my surprise, this "phrases with similar meaning should be closer to each other" premise does not hold for short phrases. See:
Extract embeddings for a few short phrases:
Obtain results:
Results:
Unfortunately, I don't have any more time at the moment. But if you have, try to extract embeddings for more complicated phrases and post the results here :-) |
I ran more tests using cosine similarity, so that it would be easier to compare to the initial tests. Some results are as expected:
However some similarities are way off:
@StrikingLoo @ggerganov any intuition why the current embedding calculation logic could be behaving this way?
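For anyone reproducing these numbers, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; they are not taken from the scripts posted in this thread.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration, not real embeddings)
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```

Unlike the Euclidean distance (`math.dist`) used in compare_embeddings.py, this measure ignores vector magnitude, which is why the two sets of results are not directly comparable.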
|
I don't see these results as particularly unexpected.
A sentence that ends in a ' ' is inherently incomplete (it would be missing
a word, etc) so it's not weird that the model encodes it very differently
than a complete one, though this is just my interpretation. As a
recommendation I would advise any real applications using these embeddings
strip trailing whitespace off input text, especially if it's user input.
As for the "I like cats" vs "cats" similarity, I also don't see it as
particularly unexpected that they are not similar, as one is a sentence and
the other a single word, and they only share part of the topic. I would be
more surprised if two noun clauses (like "hairy feline" and "purring
kitten") that have similar meanings were assigned very different scores.
Basically things that are syntactically dissimilar are understandably not
very close in embedding space.
If you test sentences with very similar syntax and somewhat similar
semantics and they are not aligned at all, that would worry me more.
I hope this clarifies things! Anyone who knows more please chime in too.
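The whitespace recommendation above could look roughly like this in Python. This is only a sketch: `normalize_for_embedding` is a hypothetical helper name, and whether leading whitespace should also be removed is left open here.

```python
def normalize_for_embedding(text: str) -> str:
    """Remove trailing whitespace so "hello" and "hello " yield the same input."""
    return text.rstrip()

# Example inputs, including a trailing space and a trailing newline
phrases = ["hello", "hello ", "I like cats \n"]
print([normalize_for_embedding(p) for p in phrases])
# prints ['hello', 'hello', 'I like cats']
```

Applying the same normalization to both the indexed documents and the queries keeps the two sides consistent.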
…On Fri, Apr 21, 2023, 02:13 Rimvydas Naktinis ***@***.***> wrote:
I ran more tests using cosine similarity, so that it would be easier to
compare to the initial tests
<#282 (comment)>.
Some results are as expected:
- "I like cats" is similar to "I love cats" and "cats are cute", and
dissimilar to "Napoleonic France"
- "cat" is quite similar to "dog"
- "Napoleonic France" is somewhat similar to "Victorian England"
- "hello" is quite similar to "hi"
However some similarities are way off:
- appending one of the phrases with a space character dramatically
reduces the similarity
- if both phrases end with a space character, the similarity comes
back up
- "I like cats" is very dissimilar to "cat" and "I like dogs" is very
dissimilar to "dog"
@StrikingLoo <https://github.com/StrikingLoo> @ggerganov
<https://github.com/ggerganov> any intuition why the current embedding
calculation logic could be behaving this way?
"I like cats" -- "I like cats "................ 0.20311777255799193
"I like cats" -- "I like dogs"................. 0.896390003690664
"I like cats" -- "I like dogs "................ 0.20045489096743105
"I like cats" -- "I love cats"................. 0.9571038771953083
"I like cats" -- "I love cats "................ 0.2156631142674983
"I like cats" -- "I love dogs"................. 0.8450703589509785
"I like cats" -- "I love dogs "................ 0.2169230548515942
"I like cats" -- "Napoleonic France"........... -0.21246371932212327
"I like cats" -- "Napoleonic France ".......... 0.04575540547715773
"I like cats" -- "Victorian England"........... -0.29933218462361305
"I like cats" -- "Victorian England ".......... -0.06149233717528417
"I like cats" -- "cat"......................... -0.22651239180178487
"I like cats" -- "cat "........................ 0.05906783956749464
"I like cats" -- "cats are cute"............... 0.3670225246784726
"I like cats" -- "cats are cute ".............. 0.11606769194395
"I like cats" -- "dog"......................... -0.14639967519051528
"I like cats" -- "dog "........................ 0.04783762210617664
"I like cats" -- "dogs are cute"............... 0.31819465704480615
"I like cats" -- "dogs are cute ".............. 0.11610797748796792
"I like cats" -- "hello"....................... -0.20630688086162569
"I like cats" -- "hello "...................... 0.05191533662217677
"I like cats" -- "hi".......................... -0.18188225673086578
"I like cats" -- "hi "......................... -0.0595385355447103
"I like cats " -- "I like dogs"................ 0.19392397721812782
"I like cats " -- "I like dogs "............... 0.9601616172820892
"I like cats " -- "I love cats"................ 0.20298700271041506
"I like cats " -- "I love cats "............... 0.9692328566598946
"I like cats " -- "I love dogs"................ 0.18069456493337113
"I like cats " -- "I love dogs "............... 0.9361746123408047
"I like cats " -- "Napoleonic France".......... 0.04077828080003284
"I like cats " -- "Napoleonic France "......... 0.7514104733324016
"I like cats " -- "Victorian England".......... 0.009752570450316756
"I like cats " -- "Victorian England "......... 0.7966698584728275
"I like cats " -- "cat"........................ -0.015622401712858672
"I like cats " -- "cat "....................... 0.7438255953321713
"I like cats " -- "cats are cute".............. 0.20019632673493853
"I like cats " -- "cats are cute "............. 0.870023708294639
"I like cats " -- "dog"........................ 0.0030972791571316615
"I like cats " -- "dog "....................... 0.8017966029865697
"I like cats " -- "dogs are cute".............. 0.18456252662747993
"I like cats " -- "dogs are cute "............. 0.8497227651725612
"I like cats " -- "hello"...................... -0.0005249279792397854
"I like cats " -- "hello "..................... 0.8324597099732179
"I like cats " -- "hi"......................... 0.0012268027593519127
"I like cats " -- "hi "........................ 0.7523755760379622
"I like dogs" -- "I like dogs "................ 0.22689238866131242
"I like dogs" -- "I love cats"................. 0.8745890129079315
"I like dogs" -- "I love cats "................ 0.20704656061606252
"I like dogs" -- "I love dogs"................. 0.9488098708025015
"I like dogs" -- "I love dogs "................ 0.24556722925131885
"I like dogs" -- "Napoleonic France"........... -0.26413464286093585
"I like dogs" -- "Napoleonic France ".......... 0.05801915836818936
"I like dogs" -- "Victorian England"........... -0.3562970344216997
"I like dogs" -- "Victorian England ".......... -0.06291220071485515
"I like dogs" -- "cat"......................... -0.3220193857299431
"I like dogs" -- "cat "........................ 0.01976040801733492
"I like dogs" -- "cats are cute"............... 0.30090905476542995
"I like dogs" -- "cats are cute ".............. 0.08185635301464264
"I like dogs" -- "dog"......................... -0.15754898020924868
"I like dogs" -- "dog "........................ 0.05649268019207619
"I like dogs" -- "dogs are cute"............... 0.25782603756454203
"I like dogs" -- "dogs are cute ".............. 0.0890702719335868
"I like dogs" -- "hello"....................... -0.2796894362421596
"I like dogs" -- "hello "...................... 0.035996981301803635
"I like dogs" -- "hi".......................... -0.25787672908495085
"I like dogs" -- "hi "......................... -0.08290316130522596
"I like dogs " -- "I love cats"................ 0.2045472826446419
"I like dogs " -- "I love cats "............... 0.9167028194335608
"I like dogs " -- "I love dogs"................ 0.2129259955894849
"I like dogs " -- "I love dogs "............... 0.9534364920909392
"I like dogs " -- "Napoleonic France".......... 0.030884468121599513
"I like dogs " -- "Napoleonic France "......... 0.7373470208338967
"I like dogs " -- "Victorian England".......... -0.02431210206116902
"I like dogs " -- "Victorian England "......... 0.7752905016610782
"I like dogs " -- "cat"........................ -0.08397765922811914
"I like dogs " -- "cat "....................... 0.71447935466483
"I like dogs " -- "cats are cute".............. 0.17071387667006183
"I like dogs " -- "cats are cute "............. 0.8151229555939554
"I like dogs " -- "dog"........................ -0.04537135780039387
"I like dogs " -- "dog "....................... 0.8167544600308861
"I like dogs " -- "dogs are cute".............. 0.15822200994259486
"I like dogs " -- "dogs are cute "............. 0.7938602405373409
"I like dogs " -- "hello"...................... -0.05666404826137203
"I like dogs " -- "hello "..................... 0.8289671743241819
"I like dogs " -- "hi"......................... -0.060960899056495974
"I like dogs " -- "hi "........................ 0.7187010548820195
"I love cats" -- "I love cats "................ 0.2448064338260396
"I love cats" -- "I love dogs"................. 0.899362557333871
"I love cats" -- "I love dogs "................ 0.2469770260439035
"I love cats" -- "Napoleonic France"........... -0.2619564319421419
"I love cats" -- "Napoleonic France ".......... 0.04512874304527943
"I love cats" -- "Victorian England"........... -0.3351779492606247
"I love cats" -- "Victorian England ".......... -0.05627744048023769
"I love cats" -- "cat"......................... -0.24381239195695179
"I love cats" -- "cat "........................ 0.05865530689702666
"I love cats" -- "cats are cute"............... 0.3642354902833239
"I love cats" -- "cats are cute ".............. 0.12915733809213054
"I love cats" -- "dog"......................... -0.181630562647824
"I love cats" -- "dog "........................ 0.04991525949175284
"I love cats" -- "dogs are cute"............... 0.31779280347738087
"I love cats" -- "dogs are cute ".............. 0.12914489705580579
"I love cats" -- "hello"....................... -0.20556096184328576
"I love cats" -- "hello "...................... 0.07391973600329921
"I love cats" -- "hi".......................... -0.18424632031868096
"I love cats" -- "hi "......................... -0.05032686070896378
"I love cats " -- "I love dogs"................ 0.2252081785243395
"I love cats " -- "I love dogs "............... 0.9536944077380259
"I love cats " -- "Napoleonic France".......... 0.022966387887623004
"I love cats " -- "Napoleonic France "......... 0.74409242120594
"I love cats " -- "Victorian England".......... 0.005962043345386044
"I love cats " -- "Victorian England "......... 0.781874206851949
"I love cats " -- "cat"........................ -0.0034494665529626427
"I love cats " -- "cat "....................... 0.7317299538195132
"I love cats " -- "cats are cute".............. 0.2262019494531532
"I love cats " -- "cats are cute "............. 0.8769976427038626
"I love cats " -- "dog"........................ 0.02328492758403161
"I love cats " -- "dog "....................... 0.7703433589994425
"I love cats " -- "dogs are cute".............. 0.2104158917188272
"I love cats " -- "dogs are cute "............. 0.8660908021592335
"I love cats " -- "hello"...................... 0.023115252661932466
"I love cats " -- "hello "..................... 0.8086529873575895
"I love cats " -- "hi"......................... 0.023717349902427878
"I love cats " -- "hi "........................ 0.7426429014054192
"I love dogs" -- "I love dogs "................ 0.2668744541065285
"I love dogs" -- "Napoleonic France"........... -0.29275150529306815
"I love dogs" -- "Napoleonic France ".......... 0.04357306838641106
"I love dogs" -- "Victorian England"........... -0.36638799068196853
"I love dogs" -- "Victorian England ".......... -0.06908215968686245
"I love dogs" -- "cat"......................... -0.3047423164022532
"I love dogs" -- "cat "........................ 0.0101104682762854
"I love dogs" -- "cats are cute"............... 0.3039941060157555
"I love dogs" -- "cats are cute ".............. 0.08910464218525402
"I love dogs" -- "dog"......................... -0.15135784328566665
"I love dogs" -- "dog "........................ 0.05290617609381392
"I love dogs" -- "dogs are cute"............... 0.26499805257358044
"I love dogs" -- "dogs are cute ".............. 0.09934014727476749
"I love dogs" -- "hello"....................... -0.24268201717121615
"I love dogs" -- "hello "...................... 0.045935074892588655
"I love dogs" -- "hi".......................... -0.22500111960052072
"I love dogs" -- "hi "......................... -0.07546074189006309
"I love dogs " -- "Napoleonic France".......... 0.01883090481368493
"I love dogs " -- "Napoleonic France "......... 0.7386010682104132
"I love dogs " -- "Victorian England".......... -0.02281812995309553
"I love dogs " -- "Victorian England "......... 0.7603767383707928
"I love dogs " -- "cat"........................ -0.06416890752500873
"I love dogs " -- "cat "....................... 0.7087321235353528
"I love dogs " -- "cats are cute".............. 0.20021670300208802
"I love dogs " -- "cats are cute "............. 0.8293343369105992
"I love dogs " -- "dog"........................ -0.007743482872031577
"I love dogs " -- "dog "....................... 0.791858638404352
"I love dogs " -- "dogs are cute".............. 0.18901810582114495
"I love dogs " -- "dogs are cute "............. 0.8217160711203176
"I love dogs " -- "hello"...................... -0.028063669282785846
"I love dogs " -- "hello "..................... 0.7975007795567103
"I love dogs " -- "hi"......................... -0.02801001638132258
"I love dogs " -- "hi "........................ 0.7116123302355635
"Napoleonic France" -- "Napoleonic France ".... 0.23522390837866922
"Napoleonic France" -- "Victorian England"..... 0.6859025998049194
"Napoleonic France" -- "Victorian England ".... 0.15648560509818651
"Napoleonic France" -- "cat"................... 0.35800033036759454
"Napoleonic France" -- "cat ".................. 0.10647011838283668
"Napoleonic France" -- "cats are cute"......... 0.07981987732132663
"Napoleonic France" -- "cats are cute "........ 0.078149911960321
"Napoleonic France" -- "dog"................... 0.3826710214412356
"Napoleonic France" -- "dog ".................. 0.11401018637067296
"Napoleonic France" -- "dogs are cute"......... 0.0773770554340013
"Napoleonic France" -- "dogs are cute "........ 0.09123545209030627
"Napoleonic France" -- "hello"................. 0.37213096418783836
"Napoleonic France" -- "hello "................ 0.057774352193263975
"Napoleonic France" -- "hi".................... 0.3507834273848848
"Napoleonic France" -- "hi "................... 0.17696122118133434
"Napoleonic France " -- "Victorian England".... 0.08466607680324116
"Napoleonic France " -- "Victorian England "... 0.8037786302246899
"Napoleonic France " -- "cat".................. -0.019977595529280315
"Napoleonic France " -- "cat "................. 0.7037017986232446
"Napoleonic France " -- "cats are cute"........ 0.07337913536494711
"Napoleonic France " -- "cats are cute "....... 0.6771872359838416
"Napoleonic France " -- "dog".................. 0.010643016302043572
"Napoleonic France " -- "dog "................. 0.739274480331095
"Napoleonic France " -- "dogs are cute"........ 0.04549129074053724
"Napoleonic France " -- "dogs are cute "....... 0.6471315374932367
"Napoleonic France " -- "hello"................ -0.04491316315316086
"Napoleonic France " -- "hello "............... 0.7016026239194642
"Napoleonic France " -- "hi"................... -0.04483742943994349
"Napoleonic France " -- "hi ".................. 0.6379276120297552
"Victorian England" -- "Victorian England ".... 0.1970397243022337
"Victorian England" -- "cat"................... 0.5315626991866473
"Victorian England" -- "cat ".................. 0.1132440438361098
"Victorian England" -- "cats are cute"......... 0.07564712547170802
"Victorian England" -- "cats are cute "........ 0.07047236143056597
"Victorian England" -- "dog"................... 0.5023841250096192
"Victorian England" -- "dog ".................. 0.09627092477400122
"Victorian England" -- "dogs are cute"......... 0.08558379851546237
"Victorian England" -- "dogs are cute "........ 0.0892397072219102
"Victorian England" -- "hello"................. 0.5153410825616703
"Victorian England" -- "hello "................ 0.04956935673613258
"Victorian England" -- "hi".................... 0.4727394855738129
"Victorian England" -- "hi "................... 0.21165691018559324
"Victorian England " -- "cat".................. 0.051434725296773336
"Victorian England " -- "cat "................. 0.7840190374817173
"Victorian England " -- "cats are cute"........ 0.0647773051440868
"Victorian England " -- "cats are cute "....... 0.7347889675972376
"Victorian England " -- "dog".................. 0.03781358609513832
"Victorian England " -- "dog "................. 0.8064848839781267
"Victorian England " -- "dogs are cute"........ 0.04396328283710094
"Victorian England " -- "dogs are cute "....... 0.6829677641565312
"Victorian England " -- "hello"................ 0.042481552565901706
"Victorian England " -- "hello "............... 0.808211277301756
"Victorian England " -- "hi"................... 0.028057424386802313
"Victorian England " -- "hi ".................. 0.745340058660383
"cat" -- "cat "................................ 0.2261562422600992
"cat" -- "cats are cute"....................... 0.055479073463416025
"cat" -- "cats are cute "...................... 0.042783194474326644
"cat" -- "dog"................................. 0.7428052216162652
"cat" -- "dog "................................ 0.07579947107525319
"cat" -- "dogs are cute"....................... 0.08587503622015415
"cat" -- "dogs are cute "...................... 0.06047225271094304
"cat" -- "hello"............................... 0.5867101415982408
"cat" -- "hello ".............................. 0.020849676027916392
"cat" -- "hi".................................. 0.5395565382469979
"cat" -- "hi "................................. 0.18922445718289724
"cat " -- "cats are cute"...................... 0.1377352530456209
"cat " -- "cats are cute "..................... 0.7045457312726324
"cat " -- "dog"................................ 0.151244663186442
"cat " -- "dog "............................... 0.8130943607206529
"cat " -- "dogs are cute"...................... 0.10501837198032893
"cat " -- "dogs are cute "..................... 0.6591719081389649
"cat " -- "hello".............................. 0.05720266958632396
"cat " -- "hello "............................. 0.7487726233664361
"cat " -- "hi"................................. 0.04989013326368915
"cat " -- "hi "................................ 0.7145388756690138
"cats are cute" -- "cats are cute "............ 0.3062561991124481
"cats are cute" -- "dog"....................... 0.12454416235558191
"cats are cute" -- "dog "...................... 0.13800195126360817
"cats are cute" -- "dogs are cute"............. 0.9635889317154503
"cats are cute" -- "dogs are cute "............ 0.3340841814158837
"cats are cute" -- "hello"..................... 0.19514219546396794
"cats are cute" -- "hello ".................... 0.15336479785550297
"cats are cute" -- "hi"........................ 0.19398964147538308
"cats are cute" -- "hi "....................... 0.15299873070429496
"cats are cute " -- "dog"...................... 0.047162606186251725
"cats are cute " -- "dog "..................... 0.7412502506668067
"cats are cute " -- "dogs are cute"............ 0.2922316428497054
"cats are cute " -- "dogs are cute "........... 0.9694282308713165
"cats are cute " -- "hello".................... 0.07172682701676675
"cats are cute " -- "hello "................... 0.7659055442905317
"cats are cute " -- "hi"....................... 0.07395981795315063
"cats are cute " -- "hi "...................... 0.7292468669861907
"dog" -- "dog "................................ 0.15684172716063982
"dog" -- "dogs are cute"....................... 0.15841792840219776
"dog" -- "dogs are cute "...................... 0.07408198913521334
"dog" -- "hello"............................... 0.5390216946824529
"dog" -- "hello ".............................. 0.004148038650315292
"dog" -- "hi".................................. 0.4729935607550871
"dog" -- "hi "................................. 0.1594872209549724
"dog " -- "dogs are cute"...................... 0.11997556990898295
"dog " -- "dogs are cute "..................... 0.700678277520648
"dog " -- "hello".............................. 0.0252694300933951
"dog " -- "hello "............................. 0.8277359075315665
"dog " -- "hi"................................. 0.005638887786058549
"dog " -- "hi "................................ 0.7370474015559675
"dogs are cute" -- "dogs are cute "............ 0.3421108849843522
"dogs are cute" -- "hello"..................... 0.2194993676080879
"dogs are cute" -- "hello ".................... 0.1326256045415808
"dogs are cute" -- "hi"........................ 0.21773509413279815
"dogs are cute" -- "hi "....................... 0.1500779129341232
"dogs are cute " -- "hello".................... 0.1031251333293599
"dogs are cute " -- "hello "................... 0.7279175778496194
"dogs are cute " -- "hi"....................... 0.11635693884531505
"dogs are cute " -- "hi "...................... 0.7269622577344995
"hello" -- "hello "............................ 0.15147937239963533
"hello" -- "hi"................................ 0.8043211555390358
"hello" -- "hi "............................... 0.2946474076607263
"hello " -- "hi"............................... 0.10360399054770437
"hello " -- "hi ".............................. 0.8464744965225194
"hi" -- "hi ".................................. 0.373960595217887
|
I'm not even sure what the embedding vector is supposed to be that llama.h gives you, I think it may represent the next generated token more than anything because it's extracted at the end. |
Yes, I also tried this myself. My similarity search based on this LLaMA embedding doesn't work at all: it finds content that is far away from the query. Switching to a different embedding system solved my issue. Also, does the tokenizer tokenize spaces? I thought "hello" and "hello " would be the same once tokenized. |
I think the space mostly comes at the beginning of the word; that's also why main.cpp inserts a space at the beginning. |
Are you using a non-llama model for generating embeddings and doing the search, or did you find a way to do it with Llama? |
Instead of 7B, have you tried a bigger LLaMA model? |
Is the embedding value not correct? |
I tried reading on basics of transformers at https://www.baeldung.com/cs/transformer-text-embeddings and near the end they say:
The last link says that "The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering." The authors present a different network structure, that can actually generate sentence embeddings. So, it would seem that the keyword to google is "sentence embedding with LLM". Googling that, there is a SO question noticing that OpenAI embeddings don't seem to work much for short inputs either: https://datascience.stackexchange.com/questions/120422/text-embeddings-for-words-or-very-short-sentences-with-a-llm |
@tkafka So the embedding using llama is correct then? PrivateGPT generates embeddings on its own; does it use this method? |
Exactly, we are taking the vector corresponding to the last token, which should have the information of the whole sentence. That is option 1 in @tkafka's comment. At least that's what we wanted to do. I think more inputs could be tested, but in general it was working pretty well. I wonder which search terms vs corpus are not matching in your use-case? It could be interesting to develop a search capability and add it to this project (or an open-source spin-off). Also maybe a zero-shot prompt could work, though a lot slower. Something along the lines of
|
@x4080 That depends on the intended use - for example, for document comparison and similarity search, I would definitely prefer mean (or even more probably max) pooling. Here is what GPT says about the methods:
...
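To make the pooling options discussed here concrete, below is a small sketch of last-token, mean, and max pooling over per-token vectors. The function names and the toy numbers are assumptions for illustration only; this is not how llama.cpp implements any of them.

```python
def _check(token_vectors):
    if not token_vectors:
        raise ValueError("Need at least one token vector")

def last_token_pool(token_vectors):
    """Use the vector of the final token as the sentence embedding."""
    _check(token_vectors)
    return list(token_vectors[-1])

def mean_pool(token_vectors):
    """Average each dimension over all tokens."""
    _check(token_vectors)
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the largest value of each dimension over all tokens."""
    _check(token_vectors)
    return [max(column) for column in zip(*token_vectors)]

# Three tokens with two dimensions each (made-up values)
tokens = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
print(last_token_pool(tokens))  # [5.0, 2.0]
print(mean_pool(tokens))        # [3.0, 2.0]
print(max_pool(tokens))         # [5.0, 4.0]
```

All three return a vector with the same dimension as a single token vector, so they can be swapped without changing the downstream similarity code.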
|
(as an aside, for indexing a large base of documents, I would definitely welcome a webserver-like mode, that would load the model once, and then accept requests with documents, returning the embeddings - currently each run loads the model again) |
In this case are the transformer-generated representations the ones we are pooling, or would it be the word-embeddings? I'm leaning towards the first option but wanted to make extra sure. A different approach to this would be using some sort of attention measure. Something like, for each element in our corpus, taking the convex sum of its embeddings scaled by their cosine similarity with our keywords. I'm not advocating for this particularly crude attention, but a better thought-out approach. |
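One possible reading of the "convex sum scaled by cosine similarity" idea is sketched below. Turning the similarities into weights with a softmax is my own assumption (it guarantees non-negative weights that sum to 1); the comment above does not specify how the weights would be normalized. All names and toy vectors are hypothetical, and the vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_weighted_pool(token_vectors, query_vector):
    """Convex combination of token vectors, weighted by similarity to the query."""
    sims = [cosine(v, query_vector) for v in token_vectors]
    peak = max(sims)
    # Softmax over similarities (assumption): weights are positive and sum to 1
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query_vector)
    pooled = [
        sum(w * v[d] for w, v in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return pooled, weights

# Two made-up token vectors; the first one points the same way as the query
pooled, weights = similarity_weighted_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(weights)  # the first weight is the larger one
print(pooled)
```

Tokens that resemble the query dominate the pooled vector, while unrelated tokens still contribute a little.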
Something I think would be good is taking an already established search dataset and using it as a benchmark, then we could iterate over the crudest possible search (cosine similarity of last token embedding) and find improvements. I assume some good dataset exists in open source. |
@StrikingLoo Not sure actually - I have been using LLMs like 'magical black boxes' so far, and am reading up on the basics. The word embeddings are definitely problematic, as the Google researchers replied (for the BERT embeddings):
and also
google-research/bert#164 (comment) I am beginning to lean towards the idea that what llama does now is actually the 'least bad' option out of the easily available ones, and there seems to be active research still going on about how to best semantically embed sentences or documents ... |
More of the interesting discussions (from BERT)
|
Llama is unidirectional, not bidirectional like BERT, which I think may make the embeddings better but not sure. I agree that this is a 'least-bad' approach, not sure how we could improve it. I leveraged the script by @nitram147 and switched it to use cosine similarity, and output the results ranked by similarity instead of randomly. I see one-word queries are similar to each other in embedding space, even if the words are not that related. This will definitely be bad for search. Maybe for one-word search it would be better to use word-embedding similarity over the document (with max pooling, or highlighting of sections with high similarity), instead of the full language model. Then for sentences we could switch to the full llama sentence embedding. Again this is a least-bad approach, but it could work better than what we have now for search. If anyone has the time to do it. Here are the results I got, plus the script (which is a modified version of Nitram's)
And here are the results. I think especially sentence vs sentence, they make sense. The biggest problem is one-word queries (which I guess are a big portion of all search queries). Maybe a good search would be grep-first, word-embedding second, sentence embedding third? This sounds like the kind of problem where someone smarter than me has already invented solutions though.
|
Here is an example of search using only word embeddings. I think this may work better than sentence embeddings in most cases. We could implement this using the same word embeddings as Llama is using. |
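A rough sketch of what max-pooled word-embedding search could look like: each document is scored by its single best-matching word. The word vectors, document names, and function names below are invented for illustration; a real version would look the vectors up in the model's token embedding table.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_pool_score(query_vector, doc_word_vectors):
    """Score a document by its best-matching word (max pooling)."""
    return max(cosine(query_vector, w) for w in doc_word_vectors)

def rank_documents(query_vector, docs):
    """docs maps a document name to its list of word vectors; best match first."""
    return sorted(
        docs,
        key=lambda name: max_pool_score(query_vector, docs[name]),
        reverse=True,
    )

# Made-up 2-d word vectors for two tiny "documents"
docs = {
    "history": [[0.1, 0.9], [0.0, 1.0]],
    "pets": [[0.9, 0.1], [0.2, 0.8]],
}
print(rank_documents([1.0, 0.0], docs))  # ['pets', 'history']
```

Because only the best-matching word counts, a one-word query can still find a long document that mentions the word once.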
I think the output embedding is associated with the current prediction of the next token. memcpy(embedding_out.data(), (float *) ggml_get_data(embeddings) + (n_embd*(N - 1)), sizeof(float)*n_embd); |
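The offset arithmetic in that memcpy can be illustrated in Python. This sketch assumes the buffer holds one contiguous row of n_embd floats per token, which is what the pointer offset n_embd*(N - 1) suggests; the helper name and the toy values are hypothetical.

```python
def last_token_embedding(flat, n_tokens, n_embd):
    """Return the last n_embd values, i.e. the row belonging to token N - 1."""
    if len(flat) != n_tokens * n_embd:
        raise ValueError("Buffer size does not match n_tokens * n_embd")
    start = n_embd * (n_tokens - 1)  # same offset as in the memcpy call
    return flat[start:start + n_embd]

# N = 4 tokens, n_embd = 3 (made-up values; row t holds t.0, t.1, t.2)
flat = [0.0, 0.1, 0.2,
        1.0, 1.1, 1.2,
        2.0, 2.1, 2.2,
        3.0, 3.1, 3.2]
print(last_token_embedding(flat, 4, 3))  # [3.0, 3.1, 3.2]
```

So only the final token's vector is copied out; the vectors of the earlier tokens are not part of the returned embedding.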
I did some experiments on this embedding the other day and tested using averaging the vectors. How: change the embedding vector to be It seemed to be doing a little better in some of the document retrieval tasks. There is still the issue that it is kind of slow even with GPU acceleration to process a lot of text. Maybe all the layers are not necessary to process? I think maybe LLaMA is not the right model for this task, some kind of encoder-decoder model could be better. |
I don't know whether it is relevant here, but llama.cpp's server endpoint '/embedding' doesn't seem to work at all, although ./embedding works. The response I get is a 4096-element vector of 0.0 values for the llama 2 model.
|
@akarshanbiswas, the server needs to be started with the |
When I use the langchain libs to create a vector store (e.g. FAISS), it takes a long time (about 20s) to create with the embedding server API. How can I speed up the API? |
Here's me running that in a gist. For the HF reference implementation, for the first 10 embeddings I get quite different numbers
|
I mean to run it just for |
I'm just speculating, but could it be because it uses the transformers FEATURE_EXTRACTION pipeline which bypasses the model head |
Oh, I get you:
|
|
I think you are running your tests on different models. If I run the original on |
So for reference model I'm using huggingface transformers 4.38.2 to run |
fwiw, if you include a hello/jimmy test in the docs. Better to use |
@ggerganov 's original test was
run on a |
I would expect it to be minor if quantization is working correctly, since the point is to minimize the difference in outputs while shrinking weights. But let's see...
Yeah, pretty minor |
Aha, we were missing the EOS tokens. Try latest |
Beautiful 👌 cosine similarity matrix: 1.00 0.99 0.74 0.75 tokenizers man, they do my head in |
Hey, so the embedding displayed on the console is of size 16, but isn't the mistral embedding dimension 1024? |
Embeddings can be of many different sizes depending on the model, but the code only allows up to 16 values to be shown. You can change this if you want to see more by altering
Just change the |
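To illustrate the difference between the full embedding and what gets printed, here is a small sketch of a display helper that shows only the first few values while leaving the vector itself untouched. This is not llama.cpp's printing code; `preview` and its `limit` parameter are hypothetical names, with the default of 16 taken from the number mentioned above.

```python
def preview(vec, limit=16):
    """Format at most `limit` leading values; the vector itself is not modified."""
    shown = " ".join(f"{v:.6f}" for v in vec[:limit])
    suffix = " ..." if len(vec) > limit else ""
    return f"[{len(vec)} dims] {shown}{suffix}"


if __name__ == "__main__":
    embedding = [0.01 * i for i in range(1024)]  # stand-in for a real embedding
    print(preview(embedding))
    print(len(embedding))  # the full vector is still available
```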
Thank you for the quick response! Oh I see, so it's just for visual purposes. My actual issue is that I'm using the llama-cpp-python wrapper and llm.create_embedding (or llm.embed) is throwing an error.
pooling_type is by default set to zero (reference commit), so adding the pooling_type=1 (mean) argument is necessary to obtain the embeddings. But I'm ending up with this error. So I wanted the complete embedding. Is there a better way to extract these embeddings? Thank you for your time. |
You can try using |
thanks a lot, I'll check it out. |
### EDIT: PLEASE IGNORE THIS. Hey just tried the server, the model being used to return the embeddings is "gpt-3.5-turbo-0613" instead of the model path passed by me. CLI:
raw data (POST method):
Output:
|
It's a placeholder string - you can override it by passing |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I'm trying to use llama.cpp to generate sentence embeddings, and then use a query to search for answers in a vector database. But my code doesn't work. Upon further inspection, it seems that the sentence embeddings generated by llama.cpp are not trustworthy. This can be reproduced with the embedding example:
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Notice that the only difference between the above two commands is the extra space in the second prompt. But the two commands produce completely different embeddings. I would assume that, since the meaning of the prompts is the same, the extra space shouldn't cause the embeddings to be very different.
Is the embedding function working?
Current Behavior
The current embedding output seems to be random?
Environment and Context
Linux + A100
Failure Information (for bugs)
The embedding output can be altered by adding a space in the prompt.
Steps to Reproduce
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Build the project, run the official embedding example as shown above, and compare the generated embeddings.