-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add morphfitting as a build step #150
Conversation
conceptnet5/builders/morphology.py
Outdated
@@ -0,0 +1,57 @@ | |||
import click |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Click is never used. The import statements should be sorted (here and ex. in sparse_matrix_builder.py
).
conceptnet5/builders/morphology.py
Outdated
|
||
# Strip a possible trailing underscore, which would particularly show | ||
# up in the way we segment ATOMIC_SPACE_LANGUAGES (Vietnamese) | ||
slug = ''.join(chunks).strip('_') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this variable called slug
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you mention it, that only makes sense in terms of my Mystery Hunt code. I'll rename it.
@@ -130,7 +130,7 @@ def get_uri_language(uri): | |||
""" | |||
if uri.startswith('/a/'): | |||
return get_uri_language(parse_possible_compound_uri('a', uri)[0]) | |||
elif uri.startswith('/c/'): | |||
elif uri.startswith('/c/') or uri.startswith('/x/'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_vectors
have a couple of places where we check if terms start with '/c'. Should these be changed as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me what test_vectors
is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I meant conceptnet5/tests/small-build/test_vectors.py
.
@@ -57,6 +57,10 @@ def build_from_conceptnet_table(filename, orig_index=(), self_loops=True): | |||
index1 = labels.add(replace_numbers(concept1)) | |||
index2 = labels.add(replace_numbers(concept2)) | |||
value = float(value_str) | |||
|
|||
if dataset == '/d/morphology': | |||
value /= 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment explaining where 10 comes from would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I look at it, this was an after-the-fact hack for the fact that I'd given the morphology edges too high a weight and didn't want to rebuild them. Might as well fix it now that I'm going to rebuild them.
The weight should be 0.01, and as always this is pretty arbitrary. The goal is for subword vectors to not affect term vectors much, unless we know very little about the term; meanwhile, because subwords will only have these low-weight links, terms will affect subwords a lot.
Also adds
conceptnet5.builders.cli
. We've accidentally already merged things that depend on that.