Merge pull request #150 from commonsense/morphfitting
Add morphfitting as a build step
jlowryduda authored Dec 13, 2017
2 parents acce22a + d25ab9c commit d4c2e96
Showing 21 changed files with 5,369 additions and 132 deletions.
136 changes: 56 additions & 80 deletions DATA-CREDITS.txt → DATA-CREDITS.md
@@ -5,29 +5,25 @@ see LICENSE.txt.

## Data license

The complete data in ConceptNet is available under the Creative Commons
Attribution-ShareAlike 4.0 license [CC-By-SA].
The complete data in ConceptNet is available under the [Creative Commons
Attribution-ShareAlike 4.0 license][CC-By-SA].

Additionally, because we track the provenance of the data, you may extract and
use a subset of its data under the Creative Commons Attribution 4.0 license
[CC-By].
See [Sharing][] for more information.

See [Sharing] for more information.

[CC-By-SA] http://creativecommons.org/licenses/by-sa/4.0/
[CC-By] http://creativecommons.org/licenses/by/4.0/
[Sharing] https://github.com/commonsense/conceptnet5/wiki/Copying-and-sharing-ConceptNet
[CC-By-SA]: http://creativecommons.org/licenses/by-sa/4.0/
[CC-By]: http://creativecommons.org/licenses/by/4.0/
[Sharing]: https://github.com/commonsense/conceptnet5/wiki/Copying-and-sharing-ConceptNet

To give credit to ConceptNet, we suggest this text:

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet5.media.mit.edu.
the Creative Commons Attribution-ShareAlike license (CC-By-SA 4.0) from
http://conceptnet.io.

The included data was created by contributors to Commonsense Computing
projects, contributors to Wikimedia projects, Games with a Purpose,
Princeton University's WordNet, DBPedia, OpenCyc, and Umbel.
Princeton University's WordNet, DBPedia, and Cycorp's OpenCyc.


## Credits and acknowledgements
@@ -58,7 +54,8 @@ Significant amounts of data were imported from:
* Wikipedia and Wiktionary, collaborative projects of the Wikimedia Foundation
* Luis von Ahn's "Games with a Purpose"
* DBPedia
* Umbel, a project of Structured Dynamics LLC
* OpenCyc
* JMDict, by Jim Breen

Here is a short, incomplete list of people who have made significant
contributions to the development of ConceptNet as a data resource, roughly in
@@ -75,79 +72,25 @@ order of appearance:

## Licenses for included resources

### Commonsense Computing

The Commonsense Computing project originated at the MIT Media Lab and expanded
worldwide. Tens of thousands of contributors have taken some time to teach
facts to computers. Their pseudonyms can be found in the "sources" list found
in ConceptNet's raw data and in its API.


### Games with a Purpose

Data collected from Verbosity, one of the CMU "Games with a Purpose", is used
and released under ConceptNet's license, by permission from Luis von Ahn and
Harshit Surana.

Verbosity players are anonymous, so in the "sources" list, data from Verbosity
is simply credited to the pseudonym "verbosity".


### UMBEL

UMBEL is available under a Creative Commons Attribution license. Here are
UMBEL's license terms, adapted from [Umbel]:

UMBEL and its documentation are the joint creative works of Structured Dynamics
LLC [SD] and Ontotext AD [Ontotext], which grant free use rights thereto, only
limited by the attribution terms described in the Creative Commons 3.0
Attribution License [CC-By-3]. The copyrights to UMBEL and its documentation
remain the sole rights of Structured Dynamics LLC and Ontotext AD.

The UMBEL Reference Concept Ontology is based on a faithful but reduced subset
extraction of concepts and relationships from the OpenCyc version of the Cyc
knowledge base. As such, UMBEL is a lightweight reflection of these sources,
but not nearly as capable nor complete.

Use and relations based on UMBEL may therefore not be an accurate
representation of what might be obtained in working directly with the source
Cyc or OpenCyc knowledge bases.

[Umbel] http://umbel.org/resources/about/
[SD] http://structureddynamics.com/
[Ontotext] http://www.ontotext.com/index.html
[CC-By-3] http://creativecommons.org/licenses/by/3.0/


### Wikimedia projects

ConceptNet uses data directly from Wiktionary, the free dictionary [wiktionary].
It also uses data from Wikipedia, the free encyclopedia [wikipedia] via DBPedia
[dbpedia].
ConceptNet uses data directly from [Wiktionary, the free dictionary][wiktionary].
It also uses data from [Wikipedia, the free encyclopedia][wikipedia] via
[DBPedia][dbpedia].

Wiktionary and Wikipedia are collaborative projects, authored by their
respective online communities. They are currently released under the Creative
Commons Attribution-ShareAlike license [CC-By-SA-3].
respective online communities. They are currently released under the [Creative
Commons Attribution-ShareAlike license][CC-By-SA-3].

Wikimedia encourages giving attribution by providing links to the hosted pages
that the data came from, and DBPedia asks for the same thing in turn. In the
raw data and the Web API, the sources of Wikimedia contributions can be found
as URLs following the token `/s/web`.

For example, an assertion attributed to `/s/web/de.wiktionary.org/wiki/Sprache/`
uses information extracted from the page that can be seen on the Web at
http://de.wiktionary.org/wiki/Sprache. Its list of individual contributors can
be seen at: http://de.wiktionary.org/wiki/Sprache?action=history
that the data came from, and DBPedia asks for the same thing in turn. The
ConceptNet relation `/r/ExternalURL` provides links between terms in ConceptNet
and the external pages or RDF resources that they incorporate information from.

Information from DBPedia is credited in a way that is designed to encourage
interoperability with DBPedia. ConceptNet nodes that use information from
DBPedia are linked to their DBPedia nodes in RDF N-Triples format. These links
can be found in the `data/sw_map` directory.

[wiktionary] http://wiktionary.org/
[wikipedia] http://wikipedia.org/
[dbpedia] http://dbpedia.org/
[CC-By-SA-3] http://creativecommons.org/licenses/by-sa/3.0/
[wiktionary]: http://wiktionary.org/
[wikipedia]: http://wikipedia.org/
[dbpedia]: http://dbpedia.org/
[CC-By-SA-3]: http://creativecommons.org/licenses/by-sa/3.0/


### WordNet
@@ -183,3 +126,36 @@ publicity pertaining to distribution of the software and/or database. Title to
copyright in this software, database and any associated documentation shall at
all times remain with Princeton University and LICENSEE agrees to preserve
same.


### Commonsense Computing

The Commonsense Computing project originated at the MIT Media Lab and expanded
worldwide. Tens of thousands of contributors have taken some time to teach
facts to computers. Their pseudonyms can be found in the "sources" list found
attached to each statement, in ConceptNet's raw data and in its API.


### Games with a Purpose

Data collected from anonymous players of Verbosity, one of the CMU "Games with
a Purpose", is used and released under ConceptNet's license, by permission from
Luis von Ahn and Harshit Surana.


## Multilingual dictionaries

We import data from [CEDict][] and [JMDict][], both of which are available
under the [Creative Commons Attribution-ShareAlike license][CC-By-SA-3].

[CEDict]: https://cc-cedict.org/wiki/
[JMDict]: http://www.edrdg.org/jmdict/j_jmdict.html


### OpenCyc

The OWL data we use from [OpenCyc][opencyc-license] is made available by Cycorp
under a [Creative Commons Attribution 3.0 license][CC-By-3].

[opencyc-license]: http://www.cyc.com/documentation/opencyc-license/
[CC-By-3]: http://creativecommons.org/licenses/by/3.0/
2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@ http://conceptnet.io for more information and a browsable Web interface.

Further documentation is available on the Wiki: https://github.com/commonsense/conceptnet5/wiki

Licensing and attribution appear in LICENSE.txt and DATA-CREDITS.txt.
Licensing and attribution appear in LICENSE.txt and DATA-CREDITS.md.


## Discussion groups
65 changes: 63 additions & 2 deletions Snakefile
@@ -1,4 +1,6 @@
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
from conceptnet5.languages import COMMON_LANGUAGES, ATOMIC_SPACE_LANGUAGES

import os
HTTP = HTTPRemoteProvider()

@@ -18,6 +20,10 @@ USE_PRECOMPUTED = not os.environ.get("CONCEPTNET_REBUILD_PRECOMPUTED")
# can be used as precomputed files later? (Requires ConceptNet S3 credentials.)
UPLOAD = False

# If USE_MORPHOLOGY is true, we will build and learn from sub-words derived
# from Morfessor.
USE_MORPHOLOGY = False

# How many pieces to split edge files into. (Works best when it's a power of
# 2 that's 64 or less.)
N_PIECES = 16
@@ -105,6 +111,8 @@ CORE_DATASET_NAMES += ["emoji/{}".format(lang) for lang in EMOJI_LANGUAGES]


DATASET_NAMES = CORE_DATASET_NAMES + ["dbpedia/dbpedia_en"]
if USE_MORPHOLOGY:
    DATASET_NAMES += ["morphology/subwords-{}".format(lang) for lang in COMMON_LANGUAGES]
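With the flag enabled, the comprehension above contributes one dataset name per language. A quick sketch of the expansion, using a hypothetical two-language list (the real `COMMON_LANGUAGES` is imported from `conceptnet5.languages`):

```python
# Hypothetical illustration of the DATASET_NAMES expansion above;
# the real COMMON_LANGUAGES list comes from conceptnet5.languages.
languages = ["en", "fr"]
extra_datasets = ["morphology/subwords-{}".format(lang) for lang in languages]
print(extra_datasets)  # ['morphology/subwords-en', 'morphology/subwords-fr']
```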


rule all:
@@ -368,7 +376,7 @@ rule combine_assertions:
    output:
        DATA + "/assertions/assertions.msgpack"
    shell:
        "python3 -m conceptnet5.builders.combine_assertions {input} {output}"
        "cn5-build combine {input} {output}"


# Putting data in PostgreSQL
@@ -472,6 +480,28 @@ rule concepts_right:
"cut -f 4 {input} > {output}"


rule concept_counts:
input:
DATA + "/stats/concepts_left.txt",
DATA + "/stats/concepts_right.txt"
output:
DATA + "/stats/concept_counts.txt"
shell:
"cat {input} | grep '^/c/' | cut -d '/' -f 1,2,3,4 "
"| LC_ALL=C sort | LC_ALL=C uniq -c > {output}"


rule core_concept_counts:
input:
DATA + "/stats/core_concepts_left.txt",
DATA + "/stats/core_concepts_right.txt"
output:
DATA + "/stats/core_concept_counts.txt"
shell:
"cat {input} | grep '^/c/' | cut -d '/' -f 1,2,3,4 "
"| LC_ALL=C sort | LC_ALL=C uniq -c > {output}"


rule language_stats:
    input:
        DATA + "/stats/concepts_left.txt",
@@ -517,7 +547,7 @@ rule reduce_assoc:
    output:
        DATA + "/assoc/reduced.csv"
    shell:
        "python3 -m conceptnet5.builders.reduce_assoc {input} {output}"
        "cn5-build reduce_assoc {input} {output}"


# Building the vector space
@@ -659,6 +689,37 @@ rule export_english_text:
"cn5-vectors export_text -l en {input} {output}"


# Morphology
# ==========

rule prepare_vocab:
    input:
        DATA + "/stats/core_concept_counts.txt"
    output:
        DATA + "/morph/vocab/{language}.txt"
    shell:
        "cn5-build prepare_morphology {wildcards.language} {input} {output}"

rule morfessor_segmentation:
    input:
        DATA + "/morph/vocab/{language}.txt"
    output:
        DATA + "/morph/segments/{language}.txt"
    run:
        if wildcards.language in ATOMIC_SPACE_LANGUAGES:
            shell("morfessor-train {input} -S {output} --traindata-list --nosplit-re '[^_].'")
        else:
            shell("morfessor-train {input} -S {output} -f '_' --traindata-list")

rule subwords:
    input:
        DATA + "/morph/segments/{language}.txt",
    output:
        DATA + "/edges/morphology/subwords-{language}.msgpack"
    shell:
        "cn5-build subwords {wildcards.language} {input} {output}"


# Evaluation
# ==========

Expand Down
51 changes: 51 additions & 0 deletions conceptnet5/builders/cli.py
@@ -0,0 +1,51 @@
import click
from .combine_assertions import combine_assertions
from .reduce_assoc import reduce_assoc
from .morphology import prepare_vocab_for_morphology, subwords_to_edges


@click.group()
def cli():
    pass


@cli.command(name='combine')
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def run_combine(input, output):
    """
    Combine edges that have the same relation, start, and end into
    higher-level assertions that add their weights and sources.

    `input` is a tab-separated CSV file to be grouped into assertions.
    `output` is the combined assertions, as a Msgpack stream.
    """
    combine_assertions(input, output)


@cli.command(name='reduce_assoc')
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def run_reduce_assoc(input, output):
    """
    Take in a file of tab-separated simple associations, and remove
    low-frequency terms and associations that are judged unlikely to be
    useful by various filters.
    """
    reduce_assoc(input, output)


@cli.command('prepare_morphology')
@click.argument('language')
@click.argument('input', type=click.File('r'))
@click.argument('output', type=click.File('w'))
def run_prepare_morphology(language, input, output):
    """
    Convert a file of concept counts for `language` into a vocabulary
    list that Morfessor can train on.
    """
    prepare_vocab_for_morphology(language, input, output)


@cli.command('subwords')
@click.argument('language')
@click.argument('input', type=click.File('r'))
@click.argument('output', type=click.File('wb'))
def run_subwords(language, input, output):
    """
    Convert a file of Morfessor segmentations for `language` into
    ConceptNet edges, as a Msgpack stream.
    """
    subwords_to_edges(language, input, output)
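These subcommands replace the separate `python3 -m conceptnet5.builders.*` invocations, as the Snakefile changes above show. For a quick smoke test without running a build, click's test runner can drive the group directly; a sketch with placeholder file names:

```python
# Sketch: driving the cn5-build command group through click's test
# runner. The file names are placeholders, not files in the repository.
from click.testing import CliRunner

from conceptnet5.builders.cli import cli

runner = CliRunner()
# Equivalent to the shell command:
#   cn5-build combine sorted_edges.csv assertions.msgpack
result = runner.invoke(cli, ["combine", "sorted_edges.csv", "assertions.msgpack"])
print(result.exit_code, result.output)
```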
24 changes: 5 additions & 19 deletions conceptnet5/builders/combine_assertions.py
@@ -94,11 +94,12 @@ def make_assertion(line_group):
    )


def combine_assertions(input_filename, output_file):
def combine_assertions(input_filename, output_filename):
    """
    Take in a tab-separated, sorted "CSV" file, indicated by
    `input_filename`, that should be grouped together into assertions.
    Output a msgpack stream of assertions to `output_file`.
    Output a msgpack stream of assertions to the file indicated by
    `output_filename`.

    The input file should be made from multiple sources of assertions by
    concatenating and sorting them.
@@ -113,8 +114,8 @@ def group_func(line):
"Group lines by their URI (their first column)."
return line.split('\t', 1)[0]

out = MsgpackStreamWriter(output_file)
out_bad = MsgpackStreamWriter(output_file + '.reject')
out = MsgpackStreamWriter(output_filename)
out_bad = MsgpackStreamWriter(output_filename + '.reject')

with open(input_filename, encoding='utf-8') as stream:
for key, line_group in itertools.groupby(stream, group_func):
@@ -129,18 +130,3 @@ def group_func(line):

    out.close()
    out_bad.close()


@click.command()
# tab-separated CSV file to be grouped into assertions
@click.argument('input', type=click.Path(readable=True, dir_okay=False))
# msgpack stream of assertions
@click.argument('output', type=click.Path(writable=True, dir_okay=False))
def cli(input, output):
    combine_assertions(input, output)

if __name__ == '__main__':
    # This is the main command-line entry point, used in steps of building
    # ConceptNet that need to combine edges into assertions. See data/Makefile
    # for more context.
    cli()
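The block removed above was the old per-module CLI entry point, now superseded by `conceptnet5/builders/cli.py`. What survives in `combine_assertions` is the group-by-URI loop: because `itertools.groupby` merges only adjacent items, the grouping is correct only on pre-sorted input, which is why the docstring insists that the sources be concatenated and sorted first. A standalone sketch of the pattern, with a hypothetical input file:

```python
# Standalone sketch of the groupby-over-sorted-lines pattern used by
# combine_assertions; "sorted_edges.csv" is a hypothetical file name.
import itertools

def uri_of(line):
    # The URI is the first tab-separated column of each line.
    return line.split('\t', 1)[0]

with open("sorted_edges.csv", encoding="utf-8") as stream:
    for uri, group in itertools.groupby(stream, uri_of):
        lines = list(group)
        # All lines sharing this URI arrive together -- but only because
        # the file is sorted; groupby never merges non-adjacent runs.
        print(uri, len(lines))
```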