FAQ
Here are answers to some frequently-asked questions, updated for ConceptNet 5.5.
ConceptNet is a knowledge graph of things people know and computers should know, expressed in various natural languages. See the main page for more details.
ConceptNet is a resource. You can use it as part of making an AI that understands the meanings of words people use.
ConceptNet is not a chatbot. Some chatbot systems have used ConceptNet as a resource, but this is not a primary use case that ConceptNet is designed for.
You can browse the knowledge graph at http://www.conceptnet.io/.
We recommend starting with the Web API. If you need a greater flow of information than the Web API provides, then consider downloading the data.
One way to take advantage of all the information in ConceptNet, as well as information that can be learned from large corpora of text, is to use the ConceptNet Numberbatch word embeddings. These can be used as a more accurate replacement for word2vec or GloVe vectors.
When used together with some extra code in `conceptnet5.vectors`, ConceptNet Numberbatch provides the best word embeddings in the world in multiple languages, as tested at SemEval 2017.
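The Numberbatch vectors are distributed in the standard word2vec text format, which is simple to parse even without a library. Here's a minimal, self-contained sketch of loading such a file and comparing terms by cosine similarity -- the file layout is real, but the terms and vector values below are made up for illustration (real Numberbatch vectors have 300 dimensions):

```python
import io
import math

# Toy data in the word2vec text format: a header line of
# "<word count> <dimensions>", then one term per line followed by its
# vector components. These values are invented for illustration.
SAMPLE = """\
3 4
/c/en/example 0.1 0.2 -0.3 0.4
/c/en/sample 0.1 0.3 -0.2 0.4
/c/en/unrelated -0.5 0.0 0.6 -0.1
"""

def load_vectors(stream):
    """Parse a word2vec-format text stream into a dict of term -> vector."""
    stream.readline()  # skip the "<count> <dims>" header line
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = load_vectors(io.StringIO(SAMPLE))
print(cosine(vectors['/c/en/example'], vectors['/c/en/sample']))
```

In practice, you'd more likely point an existing loader such as gensim's `KeyedVectors.load_word2vec_format` at the downloaded Numberbatch file.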
We went to some effort to make the API responses look nice in a Web browser. The JSON gets formatted and highlighted, and values that are references to other URLs you can look up become links, so you can just explore by following these links.
Try clicking the link below and you'll be using the ConceptNet API:
http://api.conceptnet.io/c/en/example
Of course you don't have to be a Web browser. If you have `curl` (a small command-line HTTP utility) on your computer, try running this at the command line:

```
curl http://api.conceptnet.io/c/en/example
```
Or in Python, using the `requests` library:

```python
import requests
requests.get('http://api.conceptnet.io/c/en/example').json()
```
There are more things you can do that won't be quite so obvious just from looking at the responses, so once you've explored a little, go read the API documentation.
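For instance, each response carries its statements in an `edges` list, where every edge has `start`, `end`, and `rel` objects plus a human-readable `surfaceText` (which can be null) and a `weight`. Here's a small sketch of pulling readable text out of a response -- the response dict below is hand-abbreviated and its values are invented for illustration; a real one would come from `requests.get(...).json()`:

```python
# A hand-abbreviated API response with made-up values; a real one would
# come from requests.get('http://api.conceptnet.io/c/en/example').json().
response = {
    '@id': '/c/en/example',
    'edges': [
        {'rel': {'label': 'IsA'},
         'start': {'label': 'an example'},
         'end': {'label': 'a thing'},
         'surfaceText': '[[an example]] is [[a thing]]',
         'weight': 1.0},
    ],
}

for edge in response['edges']:
    # Fall back to the start/rel/end labels when an edge has no surface text.
    text = edge.get('surfaceText') or '{} {} {}'.format(
        edge['start']['label'], edge['rel']['label'], edge['end']['label'])
    print(text, edge['weight'])
```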
There are more pages of results. The default page size is 20 -- this speeds up the responses and keeps them a manageable size.
When the API results are paginated, the response will end with a section that looks like this:
```json
"view": {
  "@id": "/c/en/example?offset=0&limit=20",
  "@type": "PartialCollectionView",
  "comment": "There are more results. Follow the 'nextPage' link for more.",
  "firstPage": "/c/en/example?offset=0&limit=20",
  "nextPage": "/c/en/example?offset=20&limit=20",
  "paginatedProperty": "edges"
}
```
As the comment states, "nextPage" contains a link to the next page of results. If you're viewing the API response in a Web browser, you can click the link to see more results.
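The paths in the `view` section are relative to the API root, so to fetch the next page programmatically you prepend `http://api.conceptnet.io`. A minimal sketch (in practice you'd loop, fetching each `nextPage` until it's absent):

```python
# The "view" section from a paginated response, as shown above.
view = {
    '@id': '/c/en/example?offset=0&limit=20',
    '@type': 'PartialCollectionView',
    'firstPage': '/c/en/example?offset=0&limit=20',
    'nextPage': '/c/en/example?offset=20&limit=20',
    'paginatedProperty': 'edges',
}

API_ROOT = 'http://api.conceptnet.io'

def next_page_url(response_view):
    """Return the absolute URL of the next page, or None on the last page."""
    next_path = response_view.get('nextPage')
    return API_ROOT + next_path if next_path else None

print(next_page_url(view))
# -> http://api.conceptnet.io/c/en/example?offset=20&limit=20
```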
We try to send you the formatted HTML only when it looks like you're using a Web browser, but maybe we're wrong, and maybe you just want the plain JSON anyway. Add `?format=json` to the URL that you query. For example:
http://api.conceptnet.io/c/en/example?format=json
Try going to that URL in Firefox, which has its own built-in JSON formatter. It won't give you a way to follow the links, but other than that, it's pretty nice.
The API responses are in JSON-LD, a linked data format that on the surface is just reasonable-looking JSON, and under the hood preserves some of the good parts of RDF and the Semantic Web.
ConceptNet and WordNet make for an interesting comparison, as the projects have similar goals, and by now they both make use of multilingual linked data.
ConceptNet contains more kinds of relationships than WordNet. ConceptNet's vocabulary is larger and interconnected in many more ways. In exchange, it's somewhat messier than WordNet.
ConceptNet does only the bare minimum to distinguish word senses so far -- in the built graph of ConceptNet 5.5, word senses are only distinguished by their part of speech (similar to sense2vec). WordNet has a large number of senses for every word, though some of them are difficult to distinguish in practice.
WordNet is too sparse for some applications. You can't build word vectors from WordNet alone. You can't compare nouns to verbs in WordNet, because they are mostly unconnected vocabularies.
ConceptNet does not assume that words fall into "synsets", sets of synonyms that are completely interchangeable. Synonymy in ConceptNet is a relation like any other. If you've worked with WordNet, you may have been frustrated by the implications of the synset assumption on real text, where words are not marked with specific senses, and where the word "He" cannot usually be replaced synonymously with "atomic number 2".
In ConceptNet, we incorporate as much of WordNet as we can while undoing the synset assumption, and we give it a high weight, because the information in WordNet is valuable and usually quite accurate.
ConceptNet is linked open data, and that makes it fundamentally a different thing than a proprietary knowledge base.
Google's Knowledge Graph is a brand name on top of the structured knowledge that it takes to run the Google search engine, Google Assistant, and probably other applications. It provides those sidebars of facts you get when you search for things on Google, and it provides answers to questions that you ask the Google Assistant. It seems to focus largely on things you can buy and things you can look up on Wikipedia. (In ConceptNet, we focus more on the general meanings of all words, whether they be nouns, verbs, adjectives, or adverbs, and less on named entities.)
I assume it's a very well-designed knowledge representation for a search engine. And there is only one search engine that it can power.
Google makes press releases about how they're advancing the state of knowledge representation, but fundamentally, the Google Knowledge Graph advances the ability to interact with Google products on Google's terms.
Unlike the typical corporate knowledge base, ConceptNet has remained true to its crowdsourcing roots. While it's a project developed at Luminoso, it is open for anyone to use under a Creative Commons license. This is the fair thing to do, given how much of it depends on public contributions and linked data, but it's also part of Luminoso's ideals. When we let you see and use our state-of-the-art knowledge representation first-hand, it promotes understanding of why Luminoso's products are a better approach to NLP.
The Microsoft Concept Graph is an odd variation on the proprietary knowledge base: they actually let you download some of the data, but under terms of use that imply you can't do anything with it. It's for "academic use only", with "no derivative works" allowed, which makes me wonder what one is supposed to do with it academically.
The Microsoft Concept Graph is a taxonomy of English nouns, connected with the "IsA" relation, with some automatic word sense disambiguation. Its data comes from machine reading of a Web search index. It resembles an automatically-generated version of OpenCyc.
DBPedia is very much focused on named entities. It's messier than ConceptNet. Its vocabulary consists only of titles of Wikipedia articles.
DBPedia contains information that can be used for answering specific questions, such as "Where is the birthplace of John Adams?" or "What countries have a population of over 10 million?". It particularly knows a lot about locations, movies, and music albums. You could use DBPedia to solve Six Degrees of Kevin Bacon.
ConceptNet imports a small amount of DBPedia, and also contains external links to DBPedia and Wikidata.
DBnary is a counterpart to DBPedia that's actually quite compatible with ConceptNet. Like ConceptNet, it focuses on word definitions rather than named entities, and it gets them from parsing Wiktionary.
Right now we use our own Wiktionary parser, which covers fewer Wiktionary sites than DBnary does but extracts more detail from each entry. We would gladly use DBnary instead, if DBnary starts extracting information such as links from definitions.
Cyc is an ontology built on a predicate logic representation called CycL. CycL can enable very precise reasoning in a way that machine learning over ConceptNet doesn't. However, Cyc is intolerant of errors, and adding information to Cyc is a difficult task.
OpenCyc provides a hierarchy of types of things, with English names, some of which are automatically generated. It seems to be intended as a preview of the full Cyc system, which is not open.
ConceptNet includes a subset of OpenCyc, consisting of the IsA statements that can be reasonably represented in natural language.
Approximately 28 million edges.
No. Its representation is words and phrases of natural language, and relations between them. Natural language can be vague, illogical, and incredibly useful.
The data that ConceptNet is built from spans a lot of different languages, with a long tail of marginally-represented languages. 10 languages have core support, 77 languages have moderate support, and 304 languages are supported in total. See Languages for a complete list.
ConceptNet will always be incomplete. We use machine-learning techniques, including word embeddings, to learn generalizable things from ConceptNet despite the incompleteness of the knowledge it contains.
There will probably always be isolated mistakes or falsehoods in ConceptNet. Our data sources and our processes are not perfect. Machine learning can be relatively robust against errors, as long as the errors are not systematic.
If you've identified a systematic source of errors in ConceptNet, that is more important. It would probably improve ConceptNet to get rid of it. In that case, please go to the 'Issues' tab and describe it in an issue report.
See the table on the Relations page of this wiki.
The weights are made-up numbers that are programmed into the reader modules that import various sources of knowledge. They represent a rough heuristic of which statements you should trust more than others.
During the golden age of crowdsourcing (the decade of the 2000s), ConceptNet accepted direct contributions of knowledge. This was a great start, but now the opportunities for improving ConceptNet have changed, and we are content to leave crowdsourcing to the organizations that are really good at it, like the Wikimedia Foundation.
If you contribute to Wiktionary and follow their guidelines, the information you contribute will eventually be represented in ConceptNet.
What I mean is, can I make my own version of ConceptNet that includes information that I need in my domain?
Well, you can reproduce ConceptNet's build process using Docker and change the code to import a new source of data. This may or may not accomplish what you want.
What ConceptNet is designed for is representing general knowledge. Making a useful domain-specific semantic model is a rather different process, in our experience. The software we built on top of ConceptNet to make this possible eventually became our company, Luminoso. Luminoso provides software as a service that creates domain-specific semantic models, which make use of ConceptNet so they can start out knowing what words mean and just have to learn what's different in your domain.
We've tried a lot of databases. Currently, ConceptNet uses PostgreSQL.
Probably one of the following reasons:
- It isn't as efficient as PostgreSQL
- It doesn't actually work as advertised
- It is no longer maintained
- It doesn't provide a good workflow for importing a medium-sized graph such as ConceptNet
- It takes more than a day to import a medium-sized graph such as ConceptNet
- It inflates the size of the data it stores by a factor of more than 10
- It assumes every user has access to and wants to use a distributed computing cluster
- It doesn't run well inside a container
- It's not free software
- It has a restriction on it that would prevent people from reusing ConceptNet, such as the GPL or "academic use only"
If you think you know of a database that doesn't fail any of these criteria, I'd still be interested to hear about it.
ConceptNet fits on a hard disk, so no. It's enough data for many purposes. But text is small.
If you have textual knowledge that actually requires distributed computation, you work at a company that does Web search.
You're asking about a visualization like this, right?
Notice that that graph is a few thousand times smaller than ConceptNet, and it's already an incomprehensible rainbow-colored hairball. I'm not convinced any existing technology can put all of ConceptNet in one meaningful image, although there may be an approach that involves spreading it out into local clusters using t-SNE.
It will almost certainly involve custom code -- ConceptNet makes off-the-shelf graph visualizers collapse under the insoluble problem of laying out its edges. I'm interested in making such a visualization, but the result has to be informative, not just a hairball.
No. SPARQL is computationally infeasible. Similar projects that use SPARQL have unacceptable latency and go down whenever anyone starts using them in earnest.
The way to query ConceptNet is using a rather straightforward REST API, described on the API page. If you need to make a form of query that this API doesn't support, open an issue and we'll look into supporting it.
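For example, the `/query` endpoint (described on the API page) lets you filter edges by fields such as `start`, `end`, `rel`, and `node`. Here's a sketch of constructing such a query URL -- the helper function is our own, not part of any library:

```python
from urllib.parse import urlencode

API_ROOT = 'http://api.conceptnet.io'

def build_query_url(**params):
    """Build a URL for the API's /query endpoint from keyword filters,
    e.g. start, end, rel, node (see the API documentation for the full
    parameter list). Sorting the items keeps the URL deterministic."""
    return API_ROOT + '/query?' + urlencode(sorted(params.items()), safe='/')

url = build_query_url(start='/c/en/example', rel='/r/IsA')
print(url)
# -> http://api.conceptnet.io/query?rel=/r/IsA&start=/c/en/example
```

Fetching the result would then look like `requests.get(url).json()`, the same as the earlier examples.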
Blame science reporting for doing what it usually does. There's a nugget of truth in there surrounded by a big wad of meaningless AI hype. It's true that ConceptNet 4 could compete with 4-year-olds on a particular question-answering task -- and ConceptNet 5 performs much better on a similar task. This is cool. It doesn't mean that anyone's about to make robot children.
Here's the background: A much older version of ConceptNet, ConceptNet 4, was evaluated on some intelligence tests involving question-answering and sentence comprehension. The researchers who performed these tests compared ConceptNet's performance to a 4-year-old child.
We found the comparison odd but flattering. 4-year-old children are incredible beings. They have desires, goals, and imagination, and they can communicate them in their spoken language with a level of competence that second-language learners have to put tremendous effort into achieving. No real AI system can come close to emulating the range of things a child can do.
When it comes to the narrower task of answering questions, though, it's believable that ConceptNet 4 compared to a 4-year-old. We're always interested in measurably improving the general intelligence contained in ConceptNet. Excitingly, we now have a question-answering task in which ConceptNet 5 compares to a 17-year-old: that of answering SAT-style analogy questions.
But there is much more to be done. The Story Cloze Test is a test of story understanding that any human can score close to 100% on in their native language. Natural language AI systems, including ConceptNet, have not yet surpassed 65% on this test.