@lindajiawenli
Contributor

Ran "npm install --save neo4j-driver", hence the changes to package-lock.json and package.json.
Added code to starting-neo4j.js, but it doesn't do anything yet.

DO NOT MERGE YET: the code does not actually do anything.

@lindajiawenli
Contributor Author

lindajiawenli commented Feb 13, 2023

addNode( params )

This function takes in params, which are of the form <see src/neo4j/graph-data.js>. If a node with the same grounding id ("database name" + ":" + "database id") does not yet exist, it creates one "Gene" node with the specified parameters.

If it does exist already, do nothing

Possible edit: instead of doing nothing if it already exists, add the factoidId/UUID to the node in the factoidId field (since we can have node properties that are arrays)
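
A minimal sketch of the create-if-missing behaviour described above, assuming a Cypher MERGE keyed on the grounding id. The "Gene" label and the property names come from this thread; buildAddNodeQuery and the injected session argument are hypothetical, and the query has not been run against a database here.

```javascript
// Sketch only. buildAddNodeQuery is a hypothetical helper; `session` is
// assumed to be a neo4j-driver session (anything exposing run(text, params)).
function buildAddNodeQuery(params) {
  const text = [
    // MERGE matches an existing node with this id, or creates one.
    'MERGE (n:Gene {id: $id})',
    // ON CREATE SET only fires for a new node, so an existing node is left as-is.
    'ON CREATE SET n.factoidId = $factoidId, n.name = $name,',
    '  n.type = $type, n.dbId = $dbId, n.dbName = $dbName'
  ].join('\n');
  return { text, params };
}

function addNode(session, params) {
  const query = buildAddNodeQuery(params);
  return session.run(query.text, query.params);
}
```

Keeping the query builder separate from the call to session.run makes the query text easy to inspect without a running database.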

@lindajiawenli
Contributor Author

lindajiawenli commented Feb 13, 2023

addEdge( params )

This function takes in params, which are of the form <see src/neo4j/graph-data.js>. If an edge with the same grounding id does not yet exist, it creates one edge with the specified parameters. The grounding id is currently the UUID of that specific interaction (NOT the document UUID).

If it does exist already, do nothing (this is highly unlikely to happen if the grounding id is a UUID).
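
A comparable sketch for the edge case, again assuming MERGE keyed on the interaction UUID. The parameter names id1, id2 and id3 are the ones listed later in this thread; the INTERACTION relationship type, the id1-to-id2 direction, and the helper names are assumptions made for illustration.

```javascript
// Sketch only. INTERACTION and buildAddEdgeQuery are hypothetical names;
// `session` is assumed to be a neo4j-driver session.
function buildAddEdgeQuery(params) {
  const text = [
    // Both endpoint nodes must already exist; if either MATCH finds nothing,
    // the query creates nothing.
    'MATCH (a:Gene {id: $id1})',
    'MATCH (b:Gene {id: $id2})',
    // Create the relationship only if no relationship with this id exists yet.
    'MERGE (a)-[r:INTERACTION {id: $id3}]->(b)',
    'ON CREATE SET r.type = $type, r.doi = $doi, r.pmid = $pmid,',
    '  r.documentId = $documentId, r.title = $title'
  ].join('\n');
  return { text, params };
}

function addEdge(session, params) {
  const query = buildAddEdgeQuery(params);
  return session.run(query.text, query.params);
}
```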

@maxkfranz
Member

Nice start.

Let’s say I’m a person who is using the functions that you’re creating. How would I know what the params are? Would I have to read the query strings every time I want to use these functions? Could things be made simpler or more explicit upfront? What parameters are needed exactly in each case?

@lindajiawenli
Contributor Author

What parameters are needed exactly in each case?

For a node, that would be (in the form of an object with these name-value pairs):

  • id: 'element.association.dbPrefix' + ':' + 'element.association.id' (ex. 'ncbigene:5597')
  • factoidId: 'element.id' (ex. '598f8bef-f858-4dd0-b1c6-5168a8ae5349')
  • name: 'element.name' (ex. 'MAPK6')
  • type: 'element.type' (ex. 'protein')
  • dbId: 'element.association.id' (ex. '5597')
  • dbName: 'element.association.dbName' (ex. 'NCBI Gene')

For a relationship/edge, that would be (again in the form of an object):

  • id1: 'element.association.dbPrefix'+ ':' + 'element.association.id', (ex. 'ncbigene:5597')
  • id2: 'element.association.dbPrefix'+ ':' + 'element.association.id', (ex. 'ncbigene:207')
  • id3: 'element.id', (ex. 01ef22cc-2a8e-46d4-9060-6bf1c273869b. NOT the document id, but the interaction id)
  • type: 'element.type', (ex. 'phosphorylation')
  • doi: 'value.citation.doi', (ex. '10.1126/sciadv.abi6439')
  • pmid: 'value.citation.pmid' (ex. '34767444')
  • documentId: 'value.id', (ex. a896d611-affe-4b45-a5e1-9bc560ffceab. This is the document id)
  • title: 'value.citation.title' (ex. 'MAPK6-AKT signaling promotes tumor growth and resistance to mTOR kinase blockade')

Ideally, all the fields will come from the Document API, so the user won't have to do anything to create the node/edge themselves: we'll take the info from the Biofactoid forms they fill out and create it automatically. (I think this is what we said in a meeting ~2 weeks ago? I could be wrong, though.) Right now, I just have all these parameters hard-coded.
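
To illustrate the node mapping listed above, here is a small sketch that derives the params object from an element. The field paths (element.association.dbPrefix and so on) are the ones given in this comment; the function name toNodeParams and the plain-object element shape are assumptions, not the Document API itself.

```javascript
// Sketch only. toNodeParams is a hypothetical helper; `element` is assumed to
// be a plain object with the fields named in the list above.
function toNodeParams(element) {
  const association = element.association;
  return {
    // Grounding id: database prefix + ':' + database-local id.
    id: `${association.dbPrefix}:${association.id}`,
    factoidId: element.id,
    name: element.name,
    type: element.type,
    dbId: association.id,
    dbName: association.dbName
  };
}

// Usage with the MAPK6 values quoted in this thread.
const mapk6 = toNodeParams({
  id: '598f8bef-f858-4dd0-b1c6-5168a8ae5349',
  name: 'MAPK6',
  type: 'protein',
  association: { dbPrefix: 'ncbigene', id: '5597', dbName: 'NCBI Gene' }
});
// mapk6.id is 'ncbigene:5597'
```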

Let me know what you guys think! @jvwong @maxkfranz

{
id1: 'ncbigene:5597', id2: 'ncbigene:207', id3: '01ef22cc-2a8e-46d4-9060-6bf1c273869b',
type: 'phosphorylation', doi: '10.1126/sciadv.abi6439', pmid: '34767444',
documentId: 'a896d611-affe-4b45-a5e1-9bc560ffceab',
Member

Beware DOI and PMID are not guaranteed to exist.

Contributor Author

Right, in those cases Neo4j would be fine receiving 'null' or an empty string (I haven't figured out what the Document API does in those cases yet, but we can work around it).

It would complicate the use case where the user wants to search for a pathway (rather than a gene), unless people search by title, I think.

Member

Still useful to store, but you'll have to be careful of null cases.

Member

Note that articleTitle could also be null.
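
One way to handle the null cases raised here is to normalise optional citation fields before they reach the query, so that a missing value, undefined, and an empty string are all treated as null. This is a sketch under assumptions: orNull and citationParams are hypothetical names, and the citation shape follows the value.citation.* paths mentioned earlier in the thread. In Cypher, setting a property to null leaves the property absent rather than storing a null value.

```javascript
// Sketch only. orNull and citationParams are hypothetical helpers.
function orNull(value) {
  if (value === undefined || value === null) return null;
  // Treat blank strings as missing as well.
  if (typeof value === 'string' && value.trim() === '') return null;
  return value;
}

function citationParams(citation) {
  const c = citation || {};
  return {
    doi: orNull(c.doi),
    pmid: orNull(c.pmid),
    articleTitle: orNull(c.title)
  };
}
```

Any later query that filters on doi, pmid, or articleTitle would then need to account for edges where the property is missing.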

// MAPK6 data
{
id: 'ncbigene:5597', factoidId: '598f8bef-f858-4dd0-b1c6-5168a8ae5349', name: 'MAPK6',
type: 'protein', dbId: '5597', dbName: 'NCBI Gene'
Member

dbId is redundant I suppose

Contributor Author

same with dbName? I can get rid of both of them

Member

dbName should be the normalised namespace, like ncbi, if we store it at all. We could remove dbId and dbName completely if we're confident we're not going to do queries on dbName, e.g. give me all proteins that are grounded to ncbi rather than uniprot.
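
As a rough illustration of why dbId can be considered redundant: the prefix and the local id can both be recovered from the grounding id, and a namespace query can be phrased against the id prefix. The helper name, the constant name, and the "Gene" label in the query are assumptions; STARTS WITH is a standard Cypher string operator.

```javascript
// Sketch only. parseGroundingId and NODES_IN_NAMESPACE are hypothetical names.
function parseGroundingId(id) {
  // Split on the first ':' only, since a local id may itself contain ':'.
  const i = id.indexOf(':');
  if (i <= 0 || i === id.length - 1) {
    throw new Error(`Not a prefix:id string: ${id}`);
  }
  return { dbPrefix: id.slice(0, i), dbId: id.slice(i + 1) };
}

// A namespace query without a dbName property, e.g. with
// params { prefix: 'ncbigene:' }.
const NODES_IN_NAMESPACE =
  'MATCH (n:Gene) WHERE n.id STARTS WITH $prefix RETURN n';
```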

Contributor Author

[Image: Screen Shot 2023-02-14 at 11 27 33 AM]

I got "NCBI Gene" from the JSON file way back when I was just getting the values straight from those files, so I think it's pretty consistent.

I think I will keep those two fields for now, but I can easily remove them when we're certain they're not necessary.

Member

In striving for consistency in syntax, in Factoid I'm going by:

  • NCBI Gene
    • dbPrefix: ncbigene
    • dbName: NCBI Gene
    • Local id pattern (regex): ^\d+$
    • Compact identifier: ncbigene:^\d+$
  • ChEBI
    • dbPrefix: CHEBI
    • dbName: ChEBI
    • Local id pattern (regex): ^CHEBI:\d+$
    • Compact identifier: CHEBI:^\d+$
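
The conventions listed above could be captured in a small lookup table so that local ids are checked against their namespace's pattern before use. The prefixes, names, and regular expressions are copied from the list; the NAMESPACES table and isValidLocalId function are hypothetical.

```javascript
// Sketch only. NAMESPACES and isValidLocalId are hypothetical names; the
// entries mirror the dbPrefix / dbName / local id pattern list above.
const NAMESPACES = new Map([
  ['ncbigene', { dbName: 'NCBI Gene', localIdPattern: /^\d+$/ }],
  ['CHEBI', { dbName: 'ChEBI', localIdPattern: /^CHEBI:\d+$/ }]
]);

function isValidLocalId(dbPrefix, localId) {
  const namespace = NAMESPACES.get(dbPrefix);
  // Unknown prefixes are rejected rather than guessed at.
  if (namespace === undefined) return false;
  return namespace.localIdPattern.test(String(localId));
}
```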

@maxkfranz
Member

maxkfranz commented Feb 14, 2023

OK, so addNode() might look something like this:

function addNode(id, type, factoidId (?), dbName (?), dbId (?), name)

Node notes:

  • The factoidId would have to be stored as an array in Neo4j. There are potentially multiple Biofactoid IDs for each Neo4j node.
  • It might be simplest to forgo factoidId for now, making the signature function addNode(id, type, dbName, dbId, name). If we really want to store the Biofactoid node UUIDs somewhere, we could do it in the edges as something like sourceFactoidId and targetFactoidId. @jvwong
  • Future: The name may not be consistent across documents. A user might use name A for somedb:123 and another user might use name B for the same somedb:123. It might make sense to use the 'official' name when hooking things up to the Document API.
  • See above re. possible removal of dbName and dbId. Then we'd just have function addNode(id, type, name)
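
If factoidId were kept as an array property after all, the first node note could be sketched roughly as follows. This is an untested assumption about how the query might look, not something from the PR: buildUpsertNodeQuery is a hypothetical name, and the query relies on Cypher's list concatenation with + and the IN operator.

```javascript
// Sketch only. buildUpsertNodeQuery is a hypothetical helper; the query has
// not been run against a database here.
function buildUpsertNodeQuery(id, name, factoidId) {
  const text = [
    'MERGE (n:Gene {id: $id})',
    // New node: start the array with the first Biofactoid UUID.
    'ON CREATE SET n.name = $name, n.factoidId = [$factoidId]',
    // Existing node: append the UUID only if it is not already present.
    'ON MATCH SET n.factoidId = CASE',
    '  WHEN $factoidId IN coalesce(n.factoidId, []) THEN n.factoidId',
    '  ELSE coalesce(n.factoidId, []) + $factoidId',
    'END'
  ].join('\n');
  return { text, params: { id, name, factoidId } };
}
```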

And addEdge() might look like:

function addEdge(id, type, factoidId, sourceId, targetId, doi, pmid, documentId, articleTitle)

Edge notes:

  • It might make sense to store a factoidId field in edges, even if it's redundant with id. In future, the DB may contain data that's not just from Biofactoid (i.e. id = someUUID, factoidId = null).
  • sourceId and targetId are dbname:123 strings rather than UUIDs.
  • id is the interaction UUID.
  • articleTitle is a bit redundant, since you could get that info elsewhere from the IDs. We shouldn't be doing text search in Neo4j. Maybe Jeff has a use case in mind for getting the article title quickly, e.g. a PC Apps-like query where you click on an edge and see the article title it's associated with as a link? @jvwong

General notes:

  • Functions should have the parameters up front rather than in an object, unless you're going to extensively document the object's format in Javadoc/JSDoc-like comments. An options object for the parameters can also make sense in cases where lots of fields are optional, but that's not really the case here.
  • It's best if commonly-named parameters have similar ordering across the functions, like id and type.
  • IDs should be sanitised in the implementation. For instance, you could have a DB name of 'NCBI' or 'ncbi'. They should always be forced to lower case. Same with UUIDs.
  • In future, you'll have higher level functions like addDocument(document), addInteraction(interaction), and addEntity(entity). Those functions will use the Document API and your lower level functions that you're working on now.
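
The sanitisation note above might look something like this in the implementation. The function name sanitizeId is hypothetical, and trimming whitespace is an extra assumption beyond the lower-casing that the note asks for.

```javascript
// Sketch only. sanitizeId is a hypothetical helper.
function sanitizeId(id) {
  if (typeof id !== 'string') {
    throw new TypeError('id must be a string');
  }
  // Force lower case so 'NCBIGene:5597' and 'ncbigene:5597' refer to one node.
  return id.trim().toLowerCase();
}
```

Applying the same function to both dbname:123 strings and UUIDs keeps lookups consistent regardless of how the caller cased the value.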

@maxkfranz
Member

In short, I'd suggest for now:

  • function addNode(id, type, name)
  • function addEdge(id, type, factoidId, sourceId, targetId, doi, pmid, factoidDocumentId, articleTitle)

@jvwong
Member

jvwong commented Feb 14, 2023

In short, I'd suggest for now:

  • function addNode(id, type, name)

For nodes, type could vary. For example, I could have a node for TP53 (dbPrefix: ncbigene; id: 7157) with type protein but also RNA (as well as other types), depending on what the user specified. So an even simpler approach is to ignore type altogether. This way, I ask for TP53, you give me everything, regardless of the type.

I don't think this is a problem with type chemical.

@maxkfranz
Member

For nodes, type could vary...

This is another example of something that, if needed, may be better placed in interaction data (e.g. sourceType and targetType).

Let's ignore node type for now.

You could make the same argument for an edge's factoidId being redundant with factoidDocumentId: With one, you could find the other.

factoidDocumentId might be better as a more general datasourceId or xref. Then you could give it a value like factoid:some-doc-uuid for Biofactoid edges, and you could use a value like pc:some-other-id if the interaction came from PC (in future).

Then we'd have:

  • function addNode(id, name)
  • function addEdge(id, type, sourceId, targetId, xref, doi, pmid, articleTitle) -- I'd put xref before the following parameters, since doi, pmid, articleTitle can each be null, whereas xref should always be defined. Best to put mandatory things first.
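
A sketch of how these two signatures could be documented and wired up, with the mandatory parameters first and the nullable ones defaulting to null. The parameter lists follow the bullets above; createGraphApi, makeXref, the "Gene" label, the INTERACTION relationship type, and the injected session (assumed to be a neo4j-driver session exposing run(text, params)) are all assumptions.

```javascript
// Sketch only. createGraphApi and makeXref are hypothetical names.

/** Build an xref such as 'factoid:some-doc-uuid' from a source and an id. */
function makeXref(source, id) {
  return `${source}:${id}`;
}

function createGraphApi(session) {
  /**
   * @param {string} id   Grounding id, e.g. 'ncbigene:5597'
   * @param {string} name Display name, e.g. 'MAPK6'
   */
  function addNode(id, name) {
    const text = 'MERGE (n:Gene {id: $id}) ON CREATE SET n.name = $name';
    return session.run(text, { id, name });
  }

  /**
   * @param {string} id        Interaction UUID
   * @param {string} type      Interaction type, e.g. 'phosphorylation'
   * @param {string} sourceId  Grounding id of the source node
   * @param {string} targetId  Grounding id of the target node
   * @param {string} xref      Data source reference, e.g. 'factoid:<doc uuid>'
   * @param {?string} doi          May be null
   * @param {?string} pmid         May be null
   * @param {?string} articleTitle May be null
   */
  function addEdge(id, type, sourceId, targetId, xref,
    doi = null, pmid = null, articleTitle = null) {
    const text = [
      'MATCH (a:Gene {id: $sourceId})',
      'MATCH (b:Gene {id: $targetId})',
      'MERGE (a)-[r:INTERACTION {id: $id}]->(b)',
      'ON CREATE SET r.type = $type, r.xref = $xref, r.doi = $doi,',
      '  r.pmid = $pmid, r.articleTitle = $articleTitle'
    ].join('\n');
    const params = { id, type, sourceId, targetId, xref, doi, pmid, articleTitle };
    return session.run(text, params);
  }

  return { addNode, addEdge };
}
```

Passing the session into a factory keeps the proposed addNode(id, name) and addEdge(...) signatures unchanged while still letting a test substitute a fake session.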

lindajiawenli merged commit 980c29a into unstable Feb 16, 2023
jvwong deleted the neo4j-beginning branch August 30, 2023 15:15