
Remove code that overrides max_seq_length #54

Merged
merged 1 commit into xlang-ai:main on Jul 2, 2023

Conversation

michael-quinlan
Contributor

@michael-quinlan commented Jun 26, 2023

max_seq_length was being hardcoded to 512 rather than being loaded from the config or computed by the code in the is None block. As a result, the following if could never be reached:

if max_seq_length is None:
    if hasattr(self.auto_model, "config") and hasattr(self.auto_model.config, "max_position_embeddings") and hasattr(self.tokenizer, "model_max_length"):
        max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

The default code loads the correct sequence length of 512, so this hardcoding is not required. I verified this by running the inference examples, noting that the print on line 252, print('max_seq_length ', max_seq_length), shows 512.
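For context, here is a minimal, hypothetical sketch of the constructor logic once the override is removed. The class name and the auto_model/tokenizer parameters are placeholders rather than the actual code; the point is simply that the is None fallback can now run and derive the limit from the model config and tokenizer:

class TransformerSketch:
    def __init__(self, auto_model, tokenizer, max_seq_length=None):
        self.auto_model = auto_model
        self.tokenizer = tokenizer
        # max_seq_length = 512  # the removed override; it shadowed the fallback below
        if max_seq_length is None:
            if (hasattr(self.auto_model, "config")
                    and hasattr(self.auto_model.config, "max_position_embeddings")
                    and hasattr(self.tokenizer, "model_max_length")):
                # derive the limit from the model config and the tokenizer
                max_seq_length = min(self.auto_model.config.max_position_embeddings,
                                     self.tokenizer.model_max_length)
        self.max_seq_length = max_seq_length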

I also verified that the similarities produced by running the following code from the README are identical to those from the release version on PyPI:

from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

# load the model as in the README
model = INSTRUCTOR('hkunlp/instructor-large')

sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a, embeddings_b)

which outputs:

array([[0.81227076, 0.7351362 ],
       [0.6770725 , 0.81411076]], dtype=float32)

Although this change probably won't affect anyone running the default version of the code (since the 512 override matches what is loaded), I think it's worth fixing now, either by merging this PR or by the authors making the fix and verifying it end-to-end.

@hongjin-su hongjin-su merged commit 30b5e5c into xlang-ai:main Jul 2, 2023
@Ethan-Chen-plus

@michael-quinlan Hi!
Could we change the max_length when we fine-tune on our own data with long text?
