Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add morphfitting as a build step #150

Merged
merged 18 commits into from
Dec 13, 2017
Merged

Add morphfitting as a build step #150

merged 18 commits into from
Dec 13, 2017

Conversation

rspeer
Copy link
Member

@rspeer rspeer commented Dec 12, 2017

Also adds conceptnet5.builders.cli. We've accidentally already merged things that depend on that.

@@ -0,0 +1,57 @@
import click
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Click is never used. The import statements should be sorted (here and ex. in sparse_matrix_builder.py).


# Strip a possible trailing underscore, which would particularly show
# up in the way we segment ATOMIC_SPACE_LANGUAGES (Vietnamese)
slug = ''.join(chunks).strip('_')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this variable called slug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention it, that only makes sense in terms of my Mystery Hunt code. I'll rename it.

@@ -130,7 +130,7 @@ def get_uri_language(uri):
"""
if uri.startswith('/a/'):
return get_uri_language(parse_possible_compound_uri('a', uri)[0])
elif uri.startswith('/c/'):
elif uri.startswith('/c/') or uri.startswith('/x/'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_vectors have a couple of places where we check if terms start with '/c'. Should these be changed as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me what test_vectors is?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I meant conceptnet5/tests/small-build/test_vectors.py.

@@ -57,6 +57,10 @@ def build_from_conceptnet_table(filename, orig_index=(), self_loops=True):
index1 = labels.add(replace_numbers(concept1))
index2 = labels.add(replace_numbers(concept2))
value = float(value_str)

if dataset == '/d/morphology':
value /= 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment explaining where 10 comes from would be great.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look at it, this was an after-the-fact hack for the fact that I'd given the morphology edges too high a weight and didn't want to rebuild them. Might as well fix it now that I'm going to rebuild them.

The weight should be 0.01, and as always this is pretty arbitrary. The goal is for subword vectors to not affect term vectors much, unless we know very little about the term; meanwhile, because subwords will only have these low-weight links, terms will affect subwords a lot.

@jlowryduda jlowryduda merged commit d4c2e96 into master Dec 13, 2017
@jlowryduda jlowryduda deleted the morphfitting branch December 13, 2017 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants