Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #377: Implemented support for adding the tokens to parsed nodes. #388

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

faubulous
Copy link

I extended the factory methods namedNode, blankNode, literalNode and variable to also accept an optional token as additional parameter and pass it to the created Term instance.

Then I changed the Parser to provide the tokens as an additional argument when parsing a file.

The default exported DataFactory does, however, not pass the token to the factory methods and thus creates nodes without additional added tokens as a default behaviour. This means that the current behaviour is not changed and all existing tests pass.

Using the exported factory methods for one can create a custom DataFactory which can be passed to the parser as an argument using the existing API. This way the exported tokens can either be a) passed as they are or b) modified and stored with the nodes as neded.

@jeswr jeswr self-requested a review July 5, 2024 02:58
@RubenVerborgh RubenVerborgh self-requested a review July 5, 2024 11:27
Copy link
Member

@RubenVerborgh RubenVerborgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @faubulous, off to a good start.

Gist of my comments:

  • Let's make the second argument a context object { token } for extensibility.
  • Let's expose this factory as-is, with the second optional argument.
  • The base Term constructor should do the work of assigning context / token. (Preferably context actually; Term should not look inside context.)
  • Let's not use explicit undefined; only null or ''.

Open questions:

  • Reading mechanism for context/token?

namedNode: iri => namedNode(iri),
blankNode: name => blankNode(name),
variable: name => variable(name),
literal: (name, datatype) => literal(name, datatype),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy leaving these as they are actually. It doesn't hurt to have the second optional argument, and that way, we can expose it to other parsers.

Copy link
Author

@faubulous faubulous Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand you correctly, you want to pass the default constructor functions to the DataFactory as they are. This would mean, however, that the default behavior of the DataFactory changes to always creating Terms with an initialized { token } context. Which in turn will break some tests and is probably not needed in standard situations.

I was hesitant to put a condition into the Parser to check wether to pass the token/context argument to the factory or not because this would have an impact on the performance. The solution to simply ignore the second argument in the DataFactory seemed like a good compromise and allows the API users to define their own behavior regarding the context variable of a node, such as offsetting positions or adding additional data if needed.

So should I revert this to the default functions and have the context always initialized with a token?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default behavior of the DataFactory changes to always creating Terms with an initialized { token } context.

That's fine; objects can always have optional fields.

Which in turn will break some tests

Those tests might be too strict; we can loosen them.

I was hesitant to put a condition into the Parser to check wether to pass the token/context argument to the factory

Indeed; let's definitely not.

The solution to simply ignore the second argument in the DataFactory

Aah okay. This actually shows me that we need 2 DataFactory classes then. One that discards the context, and one that keeps it. Let's make both first-class citizens.

Comment on lines 67 to 72
constructor(id, token) {
super(id);

this.token = token;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
constructor(id, token) {
super(id);
this.token = token;
}
constructor(id, token = null) {
super(id);
this.token = token;
}

I'm nitting here, but I want things to be explicit. No undefined.

However, even better would be if the Term constructor simply took token; then we don't need any of this.

And let's not make it token, let's make it context, with let { token } = context.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK. So will the default be context = { token = null } or context = null?

I am personally hesitant to initialize token with an empty string as this is a different datataype as the Token object and it might indicate that one should expect a non-empty string when it is initalized which is misleading.

But I'll do as you wish. Just let me know.

Copy link
Author

@faubulous faubulous Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Saw the answer in the comment below. Will go with DEFAULT_CONTEXT.

@@ -92,7 +105,7 @@ export class Literal extends Term {

// ### The datatype IRI of this literal
get datatype() {
return new NamedNode(this.datatypeString);
return new NamedNode(this.datatypeString, this.token);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to pass the token along here. I don't think we should.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@@ -273,7 +291,8 @@ export function termToId(term, nested) {
// ## Quad constructor
export class Quad extends Term {
constructor(subject, predicate, object, graph) {
super('');
super(undefined);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never. Either null or '', but never explicitly undefined.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, no problem.

However, I'm just curious to learn why. Am I right that you want the variable to be initialized with something so that it is visible on an API level? I have to admit that I am not that deep into JavaScript to know all the consequences. It was my reasoning that when passing undefined the variable might not occupy any memory as interpreters might just delete it and thus having the API behave as before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my reasoning that when passing undefined the variable might not occupy any memory

That's not how JavaScript engines work today; the will still explicitly assign the field.

However, I'm just curious to learn why.

Mainly for explicitness. undefined is what happens when we don't pass anything, forget stuff, uninitialized, etc. null signifies a choice. Choices are good.

src/N3Parser.js Outdated
@@ -194,7 +194,7 @@ export default class N3Parser {
case '[':
// Start a new quad with a new blank node as subject
this._saveContext('blank', this._graph,
this._subject = this._blankNode(), null, null);
this._subject = this._blankNode(undefined, token), null, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not undefined

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should I initialize it then? Please advise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null, or allow a constructor that makes the first argument optional, like this._blankNode({ token }). Then we can do a string test on the first argument.

src/N3Parser.js Outdated
@@ -206,7 +206,7 @@ export default class N3Parser {
if (!this._n3Mode)
return this._error('Unexpected graph', token);
this._saveContext('formula', this._graph,
this._graph = this._blankNode(), null, null);
this._graph = this._blankNode(undefined, token), null, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem (for all below)

@RubenVerborgh
Copy link
Member

The more I think about it, it should be:

constructor(whatever, context = DEFAULT_CONTEXT)
and
this.context = context;

with DEFAULT_CONTEXT a frozen object { token: null } that we recycle.

@faubulous
Copy link
Author

Thank you for your quick review @RubenVerborgh! I implemented most of your suggested changes except exporting the DataFactory as is because this will change the default behavior of the API and break some unit tests. Please advise.

@bergos
Copy link
Member

bergos commented Jul 6, 2024

I only had a brief look as I am on vacation and traveling, so forgive me if I missed something.

It looks like you are adding arguments to the factory methods, which are defined in the RDF/JS spec. That may cause problems if someone else comes up with the same idea but with different properties. If we want to move forward in that direction, we must extend the specification to avoid any conflicts. It would be good if the spec would cover that topic, but it may take a bit longer and it's understandable if one want to move forward faster.

I would recommend another approach that doesn't require extending the spec. Just pass a flag to the constructor of the parser, and if it's set, set the properties after creating the Term instance with the factory like this:

const term = factory.namedNode(iri)
term.source = { url, line }

Also, please have a look at the additional properties page in the wiki of the spec. I requested creating that page with source maps or DOM element linking in mind. It would be great if we could define a reusable standard structure for it. I have something like this in mind with all properties optional:

term.source = {
  url, // file or http URL
  line, // line number
  position, // start position of the token
  text, // text of the token
  element // DOM element for RDFa
}

I'm back in the week of 22nd of July. It would be great if we could keep the discussion open till the end of that week.

@RubenVerborgh
Copy link
Member

RubenVerborgh commented Jul 7, 2024

@bergos

That may cause problems if someone else comes up with the same idea but with different properties.

Indeed, and thanks for looking at this.

That's the main reason why I am proposing an object as an optional second argument, so { token } rather than directly token or even a string. This should give us the extensibility we're looking for.

Just pass a flag to the constructor of the parser, and if it's set, set the properties after creating the Term instance

I could also agree with this; just not a big fan of external classes managing the internal state of another class. Plus, some factories could create frozen objects etc… I think the first approach, with an optional object, is actually the safest all things considered.

Also, please have a look at the additional properties page in the wiki of the spec. I requested creating that page with source maps or DOM element linking in mind. It would be great if we could define a reusable standard structure for it.

+1 to this

I'm back in the week of 22nd of July. It would be great if we could keep the discussion open till the end of that week.

We'll wait for you, mate. Would be good to have an RDF/JS issue open to track this then. Cheers!

@RubenVerborgh RubenVerborgh marked this pull request as draft July 7, 2024 12:20
@RubenVerborgh RubenVerborgh requested a review from bergos July 7, 2024 12:20
@bergos
Copy link
Member

bergos commented Jul 23, 2024

@RubenVerborgh

That's the main reason why I am proposing an object as an optional second argument, so { token } rather than directly token or even a string. This should give us the extensibility we're looking for.

I agree with that. If we extend the interface, we will almost certainly use an additional object argument.

I dug a bit into the code. The debugger showed me that the token object has the following properties: type, value, prefix, line, start, end. That makes sense for the parser as an internal object, but as an additional property to the term object, I prefer something more self-describing for that context.

My proposal:

  • token -> source or origin
  • type -> textType or tokenType
  • value -> text or token
  • prefix -> keep
  • line -> keep
  • start -> keep
  • end -> keep

@RubenVerborgh
Copy link
Member

RubenVerborgh commented Jul 24, 2024

Thanks @bergos!

Based on your comment, may I suggest to @faubulous:

  • source
  • tokenType
  • token
  • prefix
  • line
  • start
  • end

We should still retain backward compatibility with those who expect value/type to be there.

@faubulous
Copy link
Author

Dear @RubenVerborgh, could you please summarise what changes you expect me to make? From the comments I gather that I need to refactor the token argument into an object of conforming to this interface:

{
  source,
  tokenType,
  token,
  prefix,
  line,
  start,
  end,
}

Is this correct?

@RubenVerborgh
Copy link
Member

@faubulous That's the one!

@bergos
Copy link
Member

bergos commented Aug 23, 2024

@faubulous @RubenVerborgh I expected source to be a property with an object value that contains all other properties like this:

{
  source: {
    tokenType,
    token,
    prefix,
    line,
    start,
    end
  }
}

@RubenVerborgh
Copy link
Member

@bergos Doesn't fully make sense to me; some of those are not source attributes. Perhaps:

{
  tokenType,
  token,
  prefix,
  source: {
    line,
    start,
    end,
  },
}

@bergos
Copy link
Member

bergos commented Aug 24, 2024

@RubenVerborgh All of them are source attributes, as a different serialization may generate different tokens, and prefixes are also defined as part of the serialization.

Maybe a short overview of how I think additional attributes could be used in the future makes it easier to understand my view:

Terms and Quads have a lifecycle. They are generated by a source, they can be shown somewhere and bound to UI components, and they can be written to a target, which can be another intermediate data model object or the final serialization.

An example UI application could be a Turtle to Turtle mapper with a Quad table component that lists the complete content of the source document. A prefix table component can be used to define the prefixes of the target document. On a click on Quad of the table component, the source and target tokens are highlighted. Behind the scenes, the parser attaches a source property to all Term objects. The Quad table component attaches a binding property to all Term objects. That property refers to the DOM cells in the table. The serializer attaches a target property. It contains the same kind of properties as the parser generates.

Using only the Term and Quad objects for this use case may not be optimal. Additional data structures like maps are required for performance reasons. But that may vary based on the use case. The important point here is: how could additional properties be used? My goal would be to get some kind of standard for the properties. It should be possible to use them with different libraries. Therefore, we should have a look at the whole lifecycle of data model objects.

@RubenVerborgh
Copy link
Member

I'll leave the decision to others.

@RubenVerborgh RubenVerborgh requested review from RubenVerborgh and removed request for RubenVerborgh August 24, 2024 10:41
@faubulous
Copy link
Author

faubulous commented Aug 27, 2024

@bergos I've implemented most of the requested changes. However, I want to make sure that I got it right. The context object should be structured like this:

{
  source: {
    tokenType, // token.type
    token, // token.value
    prefix, // token.prefix
    line, // token.line
    start, // token.start
    end // token.end
  }
}

@RubenVerborgh
Copy link
Member

I'm very unsure now why we have an object with a single property wrapping an object with all the other properties.
We're now creating twice the objects, and object creation is expensive. I think this will cost us in performance, so wouldn't merge before we have assessed that.

@faubulous
Copy link
Author

Agree. I would personally prefer the original proposal of: { source: token }

Where the token object is provided as-is in a wrapped context. We could also rename token to source, but I would not like to map the token to another format as this is expensive. At least we would spare ourselves the transformation step for the original token object.

However, this also requires the Parser to provide { source: token } to the namedNode factory method every time. This is a performance overhead even if we do not pass the object along to the term in the default DataFactory. This is why I originally only provided the token object without modification as this is the fastest way to do it. Then the custom DataFactory can take care of the transformation and the associated performance overhead.. the default factory would simply ignore the parameter.

With this approach, every application that needs another information / structure can provide a custom DataFactory to implement all required transformations or mappings like this:

new Parser({
    factory: {
       namedNode: (iri, token) => namedNode(iri, { source: { ... }, target: { ... } }),
       ....
    }
  });

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants