Merge pull request #150 from commonsense/morphfitting
Add morphfitting as a build step
jlowryduda authored Dec 13, 2017
2 parents acce22a + d25ab9c commit d4c2e96
Showing 21 changed files with 5,369 additions and 132 deletions.
136 changes: 56 additions & 80 deletions DATA-CREDITS.txt → DATA-CREDITS.md
@@ -5,29 +5,25 @@ see LICENSE.txt.

## Data license

The complete data in ConceptNet is available under the Creative Commons
Attribution-ShareAlike 4.0 license [CC-By-SA].
The complete data in ConceptNet is available under the [Creative Commons
Attribution-ShareAlike 4.0 license][CC-By-SA].

Additionally, because we track the provenance of the data, you may extract and
use a subset of its data under the Creative Commons Attribution 4.0 license
[CC-By].
See [Sharing][] for more information.

See [Sharing] for more information.

[CC-By-SA] http://creativecommons.org/licenses/by-sa/4.0/
[CC-By] http://creativecommons.org/licenses/by/4.0/
[Sharing] https://github.com/commonsense/conceptnet5/wiki/Copying-and-sharing-ConceptNet
[CC-By-SA]: http://creativecommons.org/licenses/by-sa/4.0/
[CC-By]: http://creativecommons.org/licenses/by/4.0/
[Sharing]: https://github.com/commonsense/conceptnet5/wiki/Copying-and-sharing-ConceptNet

To give credit to ConceptNet, we suggest this text:

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet5.media.mit.edu.
the Creative Commons Attribution-ShareAlike license (CC-By-SA 4.0) from
http://conceptnet.io.

The included data was created by contributors to Commonsense Computing
projects, contributors to Wikimedia projects, Games with a Purpose,
Princeton University's WordNet, DBPedia, OpenCyc, and Umbel.
Princeton University's WordNet, DBPedia, and Cycorp's OpenCyc.


## Credits and acknowledgements
@@ -58,7 +54,8 @@ Significant amounts of data were imported from:
* Wikipedia and Wiktionary, collaborative projects of the Wikimedia Foundation
* Luis von Ahn's "Games with a Purpose"
* DBPedia
* Umbel, a project of Structured Dynamics LLC
* OpenCyc
* JMDict, by Jim Breen

Here is a short, incomplete list of people who have made significant
contributions to the development of ConceptNet as a data resource, roughly in
@@ -75,79 +72,25 @@ order of appearance:

## Licenses for included resources

### Commonsense Computing

The Commonsense Computing project originated at the MIT Media Lab and expanded
worldwide. Tens of thousands of contributors have taken some time to teach
facts to computers. Their pseudonyms can be found in the "sources" list found
in ConceptNet's raw data and in its API.


### Games with a Purpose

Data collected from Verbosity, one of the CMU "Games with a Purpose", is used
and released under ConceptNet's license, by permission from Luis von Ahn and
Harshit Surana.

Verbosity players are anonymous, so in the "sources" list, data from Verbosity
is simply credited to the pseudonym "verbosity".


### UMBEL

UMBEL is available under a Creative Commons Attribution license. Here are
UMBEL's license terms, adapted from [Umbel]:

UMBEL and its documentation are the joint creative works of Structured Dynamics
LLC [SD] and Ontotext AD [Ontotext], which grant free use rights thereto, only
limited by the attribution terms described in the Creative Commons 3.0
Attribution License [CC-By-3]. The copyrights to UMBEL and its documentation
remain the sole rights of Structured Dynamics LLC and Ontotext AD.

The UMBEL Reference Concept Ontology is based on a faithful but reduced subset
extraction of concepts and relationships from the OpenCyc version of the Cyc
knowledge base. As such, UMBEL is a lightweight reflection of these sources,
but not nearly as capable nor complete.

Use and relations based on UMBEL may therefore not be an accurate
representation of what might be obtained in working directly with the source
Cyc or OpenCyc knowledge bases.

[Umbel] http://umbel.org/resources/about/
[SD] http://structureddynamics.com/
[Ontotext] http://www.ontotext.com/index.html
[CC-By-3] http://creativecommons.org/licenses/by/3.0/


### Wikimedia projects

ConceptNet uses data directly from Wiktionary, the free dictionary [wiktionary].
It also uses data from Wikipedia, the free encyclopedia [wikipedia] via DBPedia
[dbpedia].
ConceptNet uses data directly from [Wiktionary, the free dictionary][wiktionary].
It also uses data from [Wikipedia, the free encyclopedia][wikipedia] via
[DBPedia][dbpedia].

Wiktionary and Wikipedia are collaborative projects, authored by their
respective online communities. They are currently released under the Creative
Commons Attribution-ShareAlike license [CC-By-SA-3].
respective online communities. They are currently released under the [Creative
Commons Attribution-ShareAlike license][CC-By-SA-3].

Wikimedia encourages giving attribution by providing links to the hosted pages
that the data came from, and DBPedia asks for the same thing in turn. In the
raw data and the Web API, the sources of Wikimedia contributions can be found
as URLs following the token `/s/web`.

For example, an assertion attributed to `/s/web/de.wiktionary.org/wiki/Sprache/`
uses information extracted from the page that can be seen on the Web at
http://de.wiktionary.org/wiki/Sprache. Its list of individual contributors can
be seen at: http://de.wiktionary.org/wiki/Sprache?action=history
that the data came from, and DBPedia asks for the same thing in turn. The
ConceptNet relation `/r/ExternalURL` provides links between terms in ConceptNet
and the external pages or RDF resources that they incorporate information from.

Information from DBPedia is credited in a way that is designed to encourage
interoperability with DBPedia. ConceptNet nodes that use information from
DBPedia are linked to their DBPedia nodes in RDF N-Triples format. These links
can be found in the `data/sw_map` directory.

[wiktionary] http://wiktionary.org/
[wikipedia] http://wikipedia.org/
[dbpedia] http://dbpedia.org/
[CC-By-SA-3] http://creativecommons.org/licenses/by-sa/3.0/
[wiktionary]: http://wiktionary.org/
[wikipedia]: http://wikipedia.org/
[dbpedia]: http://dbpedia.org/
[CC-By-SA-3]: http://creativecommons.org/licenses/by-sa/3.0/


### WordNet
@@ -183,3 +126,36 @@ publicity pertaining to distribution of the software and/or database. Title to
copyright in this software, database and any associated documentation shall at
all times remain with Princeton University and LICENSEE agrees to preserve
same.


### Commonsense Computing

The Commonsense Computing project originated at the MIT Media Lab and expanded
worldwide. Tens of thousands of contributors have taken some time to teach
facts to computers. Their pseudonyms can be found in the "sources" list found
attached to each statement, in ConceptNet's raw data and in its API.


### Games with a Purpose

Data collected from anonymous players of Verbosity, one of the CMU "Games with
a Purpose", is used and released under ConceptNet's license, by permission from
Luis von Ahn and Harshit Surana.


## Multilingual dictionaries

We import data from [CEDict][] and [JMDict][], both of which are available
under the [Creative Commons Attribution-ShareAlike license][CC-By-SA-3].

[CEDict]: https://cc-cedict.org/wiki/
[JMDict]: http://www.edrdg.org/jmdict/j_jmdict.html


### OpenCyc

The OWL data we use from [OpenCyc][opencyc-license] is made available by Cycorp
under a [Creative Commons Attribution 3.0 license][CC-By-3].

[opencyc-license]: http://www.cyc.com/documentation/opencyc-license/
[CC-By-3]: http://creativecommons.org/licenses/by/3.0/
2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@ http://conceptnet.io for more information and a browsable Web interface.

Further documentation is available on the Wiki: https://github.com/commonsense/conceptnet5/wiki

Licensing and attribution appear in LICENSE.txt and DATA-CREDITS.txt.
Licensing and attribution appear in LICENSE.txt and DATA-CREDITS.md.


## Discussion groups
65 changes: 63 additions & 2 deletions Snakefile
@@ -1,4 +1,6 @@
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
from conceptnet5.languages import COMMON_LANGUAGES, ATOMIC_SPACE_LANGUAGES

import os
HTTP = HTTPRemoteProvider()

@@ -18,6 +20,10 @@ USE_PRECOMPUTED = not os.environ.get("CONCEPTNET_REBUILD_PRECOMPUTED")
# can be used as precomputed files later? (Requires ConceptNet S3 credentials.)
UPLOAD = False

# If USE_MORPHOLOGY is true, we will build and learn from sub-words derived
# from Morfessor.
USE_MORPHOLOGY = False

# How many pieces to split edge files into. (Works best when it's a power of
# 2 that's 64 or less.)
N_PIECES = 16
@@ -105,6 +111,8 @@ CORE_DATASET_NAMES += ["emoji/{}".format(lang) for lang in EMOJI_LANGUAGES]


DATASET_NAMES = CORE_DATASET_NAMES + ["dbpedia/dbpedia_en"]
if USE_MORPHOLOGY:
    DATASET_NAMES += ["morphology/subwords-{}".format(lang) for lang in COMMON_LANGUAGES]
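With the flag enabled, the comprehension above contributes one dataset name per language. A quick sketch of the expansion, using a hypothetical two-language list (the real `COMMON_LANGUAGES` is imported from `conceptnet5.languages`):

```python
# Hypothetical illustration of the DATASET_NAMES expansion above;
# the real COMMON_LANGUAGES list comes from conceptnet5.languages.
languages = ["en", "fr"]
extra_datasets = ["morphology/subwords-{}".format(lang) for lang in languages]
print(extra_datasets)  # ['morphology/subwords-en', 'morphology/subwords-fr']
```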


rule all:
@@ -368,7 +376,7 @@ rule combine_assertions:
    output:
        DATA + "/assertions/assertions.msgpack"
    shell:
        "python3 -m conceptnet5.builders.combine_assertions {input} {output}"
        "cn5-build combine {input} {output}"


# Putting data in PostgreSQL
@@ -472,6 +480,28 @@ rule concepts_right:
"cut -f 4 {input} > {output}"


rule concept_counts:
input:
DATA + "/stats/concepts_left.txt",
DATA + "/stats/concepts_right.txt"
output:
DATA + "/stats/concept_counts.txt"
shell:
"cat {input} | grep '^/c/' | cut -d '/' -f 1,2,3,4 "
"| LC_ALL=C sort | LC_ALL=C uniq -c > {output}"


rule core_concept_counts:
input:
DATA + "/stats/core_concepts_left.txt",
DATA + "/stats/core_concepts_right.txt"
output:
DATA + "/stats/core_concept_counts.txt"
shell:
"cat {input} | grep '^/c/' | cut -d '/' -f 1,2,3,4 "
"| LC_ALL=C sort | LC_ALL=C uniq -c > {output}"


rule language_stats:
    input:
        DATA + "/stats/concepts_left.txt",
@@ -517,7 +547,7 @@ rule reduce_assoc:
    output:
        DATA + "/assoc/reduced.csv"
    shell:
        "python3 -m conceptnet5.builders.reduce_assoc {input} {output}"
        "cn5-build reduce_assoc {input} {output}"


# Building the vector space
@@ -659,6 +689,37 @@ rule export_english_text:
"cn5-vectors export_text -l en {input} {output}"


# Morphology
# ==========

rule prepare_vocab:
    input:
        DATA + "/stats/core_concept_counts.txt"
    output:
        DATA + "/morph/vocab/{language}.txt"
    shell:
        "cn5-build prepare_morphology {wildcards.language} {input} {output}"

rule morfessor_segmentation:
    input:
        DATA + "/morph/vocab/{language}.txt"
    output:
        DATA + "/morph/segments/{language}.txt"
    run:
        if wildcards.language in ATOMIC_SPACE_LANGUAGES:
            shell("morfessor-train {input} -S {output} --traindata-list --nosplit-re '[^_].'")
        else:
            shell("morfessor-train {input} -S {output} -f '_' --traindata-list")

rule subwords:
    input:
        DATA + "/morph/segments/{language}.txt",
    output:
        DATA + "/edges/morphology/subwords-{language}.msgpack"
    shell:
        "cn5-build subwords {wildcards.language} {input} {output}"


# Evaluation
# ==========

Expand Down
51 changes: 51 additions & 0 deletions conceptnet5/builders/cli.py
@@ -0,0 +1,51 @@
import click
from .combine_assertions import combine_assertions
from .reduce_assoc import reduce_assoc
from .morphology import prepare_vocab_for_morphology, subwords_to_edges


@click.group()
def cli():
    pass


@cli.command(name='combine')
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def run_combine(input, output):
    """
    Combine edges that have the same relation, start, and end into
    higher-level assertions that add their weights and sources.

    `input` is a tab-separated CSV file to be grouped into assertions.
    `output` is the combined assertions, as a Msgpack stream.
    """
    combine_assertions(input, output)


@cli.command(name='reduce_assoc')
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def run_reduce_assoc(input, output):
    """
    Take in a file of tab-separated simple associations, and remove
    low-frequency terms and associations that are judged unlikely to be
    useful by various filters.
    """
    reduce_assoc(input, output)


@cli.command('prepare_morphology')
@click.argument('language')
@click.argument('input', type=click.File('r'))
@click.argument('output', type=click.File('w'))
def run_prepare_morphology(language, input, output):
    """
    Convert a file of concept counts for `language` into a vocabulary
    list that Morfessor can train on.
    """
    prepare_vocab_for_morphology(language, input, output)


@cli.command('subwords')
@click.argument('language')
@click.argument('input', type=click.File('r'))
@click.argument('output', type=click.File('wb'))
def run_subwords(language, input, output):
    """
    Convert a file of Morfessor segmentations for `language` into
    ConceptNet edges, as a Msgpack stream.
    """
    subwords_to_edges(language, input, output)
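These subcommands replace the separate `python3 -m conceptnet5.builders.*` invocations, as the Snakefile changes above show. For a quick smoke test without running a build, click's test runner can drive the group directly; a sketch with placeholder file names:

```python
# Sketch: driving the cn5-build command group through click's test
# runner. The file names are placeholders, not files in the repository.
from click.testing import CliRunner

from conceptnet5.builders.cli import cli

runner = CliRunner()
# Equivalent to the shell command:
#   cn5-build combine sorted_edges.csv assertions.msgpack
result = runner.invoke(cli, ["combine", "sorted_edges.csv", "assertions.msgpack"])
print(result.exit_code, result.output)
```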
24 changes: 5 additions & 19 deletions conceptnet5/builders/combine_assertions.py
@@ -94,11 +94,12 @@ def make_assertion(line_group):
    )


def combine_assertions(input_filename, output_file):
def combine_assertions(input_filename, output_filename):
    """
    Take in a tab-separated, sorted "CSV" file, indicated by
    `input_filename`, that should be grouped together into assertions.
    Output a msgpack stream of assertions to `output_file`.
    Output a msgpack stream of assertions to the file indicated by
    `output_filename`.

    The input file should be made from multiple sources of assertions by
    concatenating and sorting them.
@@ -113,8 +114,8 @@ def group_func(line):
"Group lines by their URI (their first column)."
return line.split('\t', 1)[0]

out = MsgpackStreamWriter(output_file)
out_bad = MsgpackStreamWriter(output_file + '.reject')
out = MsgpackStreamWriter(output_filename)
out_bad = MsgpackStreamWriter(output_filename + '.reject')

with open(input_filename, encoding='utf-8') as stream:
for key, line_group in itertools.groupby(stream, group_func):
@@ -129,18 +130,3 @@ def group_func(line):

    out.close()
    out_bad.close()


@click.command()
# tab-separated CSV file to be grouped into assertions
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
# msgpack stream of assertions
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def cli(input, output):
    combine_assertions(input, output)

if __name__ == '__main__':
    # This is the main command-line entry point, used in steps of building
    # ConceptNet that need to combine edges into assertions. See data/Makefile
    # for more context.
    cli()
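The block removed above was the old per-module CLI entry point, now superseded by `conceptnet5/builders/cli.py`. What survives in `combine_assertions` is the group-by-URI loop: because `itertools.groupby` merges only adjacent items, the grouping is correct only on pre-sorted input, which is why the docstring insists that the sources be concatenated and sorted first. A standalone sketch of the pattern, with a hypothetical input file:

```python
# Standalone sketch of the groupby-over-sorted-lines pattern used by
# combine_assertions; "sorted_edges.csv" is a hypothetical file name.
import itertools

def uri_of(line):
    # The URI is the first tab-separated column of each line.
    return line.split('\t', 1)[0]

with open("sorted_edges.csv", encoding="utf-8") as stream:
    for uri, group in itertools.groupby(stream, uri_of):
        lines = list(group)
        # All lines sharing this URI arrive together -- but only because
        # the file is sorted; groupby never merges non-adjacent runs.
        print(uri, len(lines))
```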