
Listing with chunked transfer #145

Merged: 3 commits merged into w3c:main from listing-chunked on Mar 29, 2021
Conversation

@farshidtz (Member) commented Mar 23, 2021

This proposal simplifies the listing API by suggesting streaming instead of pagination.

Most HTTP servers and clients support this internally. Moreover, this makes the discovery spec much simpler as there is no need to specify query arguments for ranges, sorting, and headers. The proposal #130 was getting out of hand with several substandard requirements.

Chunked transfer encoding can be mapped to CoAP: https://tools.ietf.org/html/draft-castellani-core-http-coap-mapping-00#appendix-A.2

Please also refer to the original discussion in #16.



@mmccool (Contributor) commented Mar 23, 2021

So to be clear, this is an ALTERNATIVE to #130, so if we merge this we would NOT merge PR130, right?

@farshidtz (Member, Author)

So to be clear, this is an ALTERNATIVE to #130, so if we merge this we would NOT merge PR130, right?

Yes

@mmccool (Contributor) commented Mar 23, 2021

At any rate, I am generally in favor of this approach, since it (a) builds on current specifications, (b) simplifies our own specification, and (c) can be extended to CoAP.

Edit: Never mind the following, I see it is dealt with. We should probably be saying "HTTP/1.1" or "HTTP/2" as appropriate in our specification. I know that "for reasons" we have to support HTTP/1.1 in some cases, but I think to make others in the community happy we should also support HTTP/2, and I understand chunking is handled differently there.

For reference, I have recently been reading the Solid Protocol document. The use cases for Solid (private data stores) are aligned with what we are doing, and they make some useful recommendations that we should try to align with. They are a CG, not a WG, so we can't cite their documents normatively, but we can cite them informatively and make similar protocol choices.



<p>
<span class="rfc2119-assertion" id="tdd-reg-list-method">
Contributor:

Assertions should ideally be self-contained. I would add "from the retrieveTDs property". BTW, "retrieveTDs" is a verb (phrase). Properties should ideally be nouns, and actions verbs. This is a bikeshed issue, but maybe something like "allTDs" makes more sense. Likewise, it seems query actions like "searchJSONPath" are properties... that doesn't necessarily make sense either.

Contributor:

BTW, we don't have to fix this in this PR, but maybe we should create another issue to clean this up. We should probably create an actual TM also.

Member:

Note this point is being discussed in #133

index.html (outdated):
Memory-constrained applications which require the full list
should consider processing the received data incrementally.

<p class="ednote" title="HTTP/2 chunking">
Contributor:

OK, I see you do include this. Good. Yeah, testing is going to be a pain. However, we probably DO need to include support for HTTP/2.

Member (Author):

My implementation supports both HTTP/1.1 and HTTP/2 (built-in Go HTTP library). I just need to prepare SSL certificates (generally required for HTTP/2) and then simply test it with a modern browser and curl.
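A minimal Go sketch of such a setup (the handler body and certificate file names are placeholders, not the actual directory implementation):

package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/td", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`[]`)) // placeholder: a real directory would stream its TDs here
	})
	// Go's net/http negotiates HTTP/2 automatically over TLS (via ALPN),
	// so the same handler serves HTTP/1.1 chunked responses and HTTP/2
	// data frames without extra code. cert.pem/key.pem are placeholders.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil))
}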

@farshidtz (Member, Author) commented Mar 26, 2021

I finished testing this in the following ways:

  • HTTP/2 proxy in front of the directory. This was as simple as adding an http2 flag to the respective Nginx listener. The proxy was able to translate HTTP/1.1 chunked encoding to an HTTP/2 stream. HTTP/2 was about twice as fast as HTTP/1.1 when tested over the internet.
  • Adding TLS certificates on the directory. This enabled serving HTTP/2 clients directly with an HTTP/2 stream. A benchmark showed slightly better performance compared to HTTP/1.1 over TLS when tested in a local network.

I will go ahead with taking the assertion out of the ednote.

@relu91 (Member) left a comment

I see the reasons why we are leaning towards the chunked approach instead of pagination, and I think, given the current status, it is the right way forward. However, I want to underline that chunked encoding does not really solve the same problem that pagination does. As far as I know, this mechanism is good from the client's point of view but it is still quite a burden for the server side. As we all know, the server is required to maintain a persistent connection where it sends the collection of all TDs, which in worst-case scenarios could take minutes. On the other hand, pagination allows better resource management (both client- and server-side) by employing short-lived connections and "flow" control (the client can skip pages or stop early).

Not sure if we should mention those comments in the spec. I think the discussion about pagination should now move to the TD spec, where we can define a way to correctly describe it. We'll gain the ability to use it also outside the discovery use case. I'll open an issue there if there isn't one already.

@wiresio (Member) commented Mar 24, 2021

Since there are a lot of open issues around this, let's try to sum up and see what we have on the table.

From my perspective, the following is what we currently have in the API:

UPLOAD
  • Exactly one entire TD:
    createTD -> PUT application/td+json TO /td/{id}
    createTD -> POST application/td+json TO /td
    updateTD -> PUT application/td+json TO /td/{id}
  • Partial TD:
    updatePartialTD -> PATCH application/merge-patch+json TO /td/{id}

DOWNLOAD
  • Exactly one entire TD:
    retrieveTD -> GET application/td+json FROM /td/{id}
  • One or multiple entire TDs (each with the options: pagination in body, header-based pagination, chunking / streaming):
    retrieveTDs -> GET application/json? FROM /td
    searchJSONPath -> GET application/json? FROM /search/jsonpath?query={query}
    searchXPath -> GET application/json? FROM /search/xpath?query={query}
    searchSPARQL -> GET application/json? FROM /search/sparql?query={query}
    searchSPARQL -> POST query TO /search/sparql and retrieve application/json?
  • Partial TD:
    retrieveTD -> GET application/json? FROM /td/{id} (chunking / streaming)
    searchJSONPath -> GET application/json? FROM /search/jsonpath?query={query} (all three options)
    searchXPath -> GET application/json? FROM /search/xpath?query={query} (all three options)
    searchSPARQL -> GET application/json? FROM /search/sparql?query={query} (all three options)
    searchSPARQL -> POST query TO /search/sparql and retrieve application/json? (all three options)

DELETE
  • deleteTD -> DELETE /td/{id}

Despite the fact that we have several blank spots, especially when it comes to uploading TDs, and several open questions about the content type to be returned, the main question in this PR circles around how to treat large responses (either large sets of TDs or large TDs) in the DOWNLOAD part.

Options, not applicable to all download parts, are:

  • Pagination in body
  • Header based pagination
  • Chunking / streaming

Header-based pagination has several limitations and challenges, as pointed out by @farshidtz, such that we can consider it a candidate to rule out.

Chunking / streaming, originally proposed by @zolkis, looks like a good fit when it comes to serving constrained clients with (a) TD(s). However, such constrained clients might not want to use SPARQL queries or even try to retrieve all TDs stored in the TDD. So why have chunking / streaming as the response style for resources that will only be used by clients with sufficient computation and IO power? For such "big" clients, I'd like to have an alternative to pure chunking / streaming, which is pagination in body as proposed by @farshidtz in #16 (comment) - plus some slight extensions for which I'm preparing a proposal to be presented later on.

To sum up: Let's look over the table together and pick the best options for each entry. I think we could have both Chunking / streaming as well as Pagination in body.

@relu91 (Member) commented Mar 24, 2021

For such "big" clients, I'd like to have an alternative to pure Chunking / streaming which is Pagination in body as proposed @farshidtz by in #16 (comment) - plus some slight extensions for which I'm preparing a proposal to be presented later on.

For query-based functions (or should I say actions? 🤣) we also have the option to use their own native pagination support. Both SPARQL and JSONPath support it. I think this is also missing in your table, right?

@farshidtz (Member, Author)

I see the reasons why we are leaning towards the chunked approach instead of pagination, and I think, given the current status, it is the right way forward. However, I want to underline that chunked encoding does not really solve the same problem that pagination does. As far as I know, this mechanism is good from the client's point of view but it is still quite a burden for the server side. As we all know, the server is required to maintain a persistent connection where it sends the collection of all TDs, which in worst-case scenarios could take minutes. On the other hand, pagination allows better resource management (both client- and server-side) by employing short-lived connections and "flow" control (the client can skip pages or stop early).

Not sure if we should mention those comments in the spec. I think the discussion about pagination should now move to the TD spec, where we can define a way to correctly describe it. We'll gain the ability to use it also outside the discovery use case. I'll open an issue there if there isn't one already.

I don't agree that chunked transfer is a burden on the server side. Considering that the size of each TD is variable (putting a size limit with a profile just erases the problem), chunking provides a way for the server to allocate a predictable amount of memory to serve each client. Even if we do pagination, it needs to be mixed with chunking to solve this issue. On the other hand, a chunked response results in a single connection instead of several opened and closed connections, which are much more expensive for the server, the proxies, and the network. MQTT, CoAP Observe, SSE, and WebSockets all provide high performance because of their single long-lived connections.

I agree with the nice ability to have flow control and its benefits, but we already do have that on the search API:

/search/jsonpath?query=$[0:10]

@farshidtz (Member, Author)

Just to add a bit of background: chunked transfer encoding is applicable to self-describing devices, retrieval of one or multiple TDs from the directory, as well as submission of TDs to the directory (POST/PUT/PATCH incrementally). This solves many of our open issues, though this PR only suggests its use for listing.

Chunked transfer encoding is widely used and built into most client libraries. Try this experimental endpoint in your browser, curl or any other HTTP client. Most probably, the only difference you'll notice is the different response header.
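For a sense of how little server code this needs, here is a minimal Go sketch (the handler, port, and sample data are made up, not the referenced endpoint). Under HTTP/1.1, Go's net/http switches to chunked transfer encoding automatically when no Content-Length is set and the handler flushes mid-response:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// listTDs streams a JSON array of TDs one element at a time.
func listTDs(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	flusher, canFlush := w.(http.Flusher)

	w.Write([]byte("["))
	enc := json.NewEncoder(w)
	for i, td := range sampleTDs() {
		if i > 0 {
			w.Write([]byte(","))
		}
		enc.Encode(td) // one TD at a time: memory per client stays bounded
		if canFlush {
			flusher.Flush() // hand the bytes written so far to the client as a chunk
		}
	}
	w.Write([]byte("]"))
}

// sampleTDs stands in for the directory's storage backend.
func sampleTDs() []map[string]any {
	return []map[string]any{
		{"id": "urn:example:td-1", "title": "Lamp"},
		{"id": "urn:example:td-2", "title": "Sensor"},
	}
}

func main() {
	http.HandleFunc("/td", listTDs)
	log.Fatal(http.ListenAndServe(":8080", nil))
}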

@relu91 (Member) commented Mar 24, 2021

I don't agree that chunked transfer is a burden on the server side. Considering that the size of each TD is variable (putting a size limit with a profile just erases the problem), chunking provides a way for the server to allocate a predictable amount of memory to serve each client.

It is more about computational time than memory allocation. As I said, with very big collections of TDs you could transfer gigabytes; even if the server is smart enough to bound its memory use with a small buffer, it still needs to send the WHOLE collection, right? It could take minutes during which your memory and CPU are allocated to the transfer.

MQTT, CoAP observer, SSE, and WebSockets all provide high performance because of the single long-lived connections.

Yeah because the assumption is that you're sending a small amount of data each time. I think it's another use case.

If we want to take other solutions as an example - I mean, just pick a random Web API and it will have pagination support 🤣. I think this is not by chance but a design choice that in the long run allows better scalability of the web service.

Just to add a bit of background: chunked transfer encoding is applicable to self-describing devices, retrieval of one or multiple TDs from the directory, as well as submission of TDs to the directory (POST/PUT/PATCH incrementally). This solves many of our open issues, though this PR only suggests its use for listing.

As I said in the previous post, I totally agree about the usefulness of chunked encoding! What I don't really like is that in the spec it feels as if it solves the same problem as pagination.

@benfrancis (Member)

Question: Will chunked transfer encoding mean that the client can only parse the JSON list once the entire list has been downloaded?

@farshidtz (Member, Author)

I don't agree that chunked transfer is a burden on the server side. Considering that the size of each TD is variable (putting a size limit with a profile just erases the problem), chunking provides a way for the server to allocate a predictable amount of memory to serve each client.

It is more about computational time than memory allocation. As I said, with very big collections of TDs you could transfer gigabytes; even if the server is smart enough to bound its memory use with a small buffer, it still needs to send the WHOLE collection, right? It could take minutes during which your memory and CPU are allocated to the transfer.

I think the use case for retrieveTDs is to query the whole collection. A partial retrieval may be needed by clients interested in specific TDs or in, e.g., the IDs and titles of all TDs. In that case, they really need to use the search API. If they wanna retrieve all TDs, pagination will require more computation than chunking.

I'd also like to point out that HTTP has Range Requests (compatible with chunked transfer encoding) to provide flow control, but that goes toward the header-based approach and is really useful for downloading files, continuing from where a transfer left off, or asking for missing pieces.

If we want to take other solutions as an example - I mean, just pick a random Web API and it will have pagination support 🤣. I think this is not by chance but a design choice that in the long run allows better scalability of the web service.

This is true. I was originally totally in favour of pagination, as you see in the pagination issue. But JSON-LD issues led us to the complicated header-based approach.

@wiresio (Member) commented Mar 24, 2021

Chunking / streaming vs. pagination: Is it really one or the other? I think both make sense.

Question: Will chunked transfer encoding mean that the client can only parse the JSON list once the entire list has been downloaded?

That would also be my concern. What happens if I just want page 1 and page 100? Wait for all the other data in between?

@farshidtz (Member, Author) commented Mar 24, 2021

Question: Will chunked transfer encoding mean that the client can only parse the JSON list once the entire list has been downloaded?

It will be challenging for a developer to decode chunks manually before the object is complete. But first, as @mmccool pointed out once, memory-constrained devices have no business querying an entire TD set; they should instead query what they need. Regardless, there are ways to parse JSON from streams and extract objects. I looked into Go, Python, Node.js, and Java and found ways to parse JSON streams using built-in or common libraries.
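In Go, for instance, the standard library's json.Decoder can consume the array element by element as the chunks arrive (a sketch; the URL is a placeholder):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8080/td") // placeholder directory URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)      // reads the body incrementally
	if _, err := dec.Token(); err != nil { // consume the opening '[' of the array
		log.Fatal(err)
	}
	for dec.More() { // true while more array elements remain in the stream
		var td map[string]any
		if err := dec.Decode(&td); err != nil {
			log.Fatal(err)
		}
		fmt.Println("got TD:", td["id"]) // each TD is usable before the download ends
	}
}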

Would be also my concern. What happens if I just want page 1 and page 100? Wait for all the other data in between?

Please see my answer above. This is a corner case, but still possible:

/search/jsonpath?query=$[0:10] // page 1
/search/jsonpath?query=$[990:1000] // page 100

@relu91 (Member) commented Mar 24, 2021

Chunking / streaming vs. pagination: Is it really one or the other? I think both make sense.

I would say so, but I agree with the intended direction to have chunking now (cause it's almost free) and to think in the TD spec about how to properly model pagination. Again, pagination would be beneficial for every affordance that could return a long list of elements. So it's a common problem.

I think the use case for retrieveTDs is to query the whole collection. A partial retrieval may be needed by clients interested in specific TDs or in, e.g., the IDs and titles of all TDs. In that case, they really need to use the search API. If they wanna retrieve all TDs, pagination will require more computation than chunking.

Do you mean that it will require too many connections? Since their main goal is to have the complete list, right? Kinda agree here, even if the use case that I have in mind for retrieveTDs is UI listing. In the end, in that use case you'll never really read the whole collection, just one page at a time. Do you think that a UI should use a query (e.g., no filter at all) for this use case?
So, just to understand: when you speak about retrieveTDs you have in mind a "bulk download" use case, right? For example, if I have to replicate the collection somewhere else. If yes, now I understand why you're advocating completely forgetting pagination and using chunked encoding. Not sure if this is described in the spec; it might help developers choose the right affordance (i.e., query vs retrieve). I would add it if missing.

I'd also like to point out that HTTP has Range Requests (compatible with chunked transfer encoding) to provide flow control, but that goes toward the header-based approach and is really useful for downloading files, continuing from where a transfer left off, or asking for missing pieces.

This follows the hypothesis above that you are thinking about "bulk downloads", am I right?

@relu91 (Member) commented Mar 24, 2021

Regardless, there are ways to parse JSON from streams and extract objects. I looked into Go, Python, Node.js, Java and found ways to parse JSON streams using built-in or common libraries.

Yes, it's true. It is also possible to obtain the stream from an HTTP client, like in node/fetch, where you can read the response directly using a stream interface.

@farshidtz (Member, Author) commented Mar 24, 2021

Chunking / streaming vs. pagination: Is it really one or the other? I think both make sense.

I would say so, but I agree with the intended direction to have chunking now (cause it's almost free) and to think in the TD spec about how to properly model pagination. Again, pagination would be beneficial for every affordance that could return a long list of elements. So it's a common problem.

I think the use case for retrieveTDs is to query the whole collection. A partial retrieval may be needed by clients interested in specific TDs or in, e.g., the IDs and titles of all TDs. In that case, they really need to use the search API. If they wanna retrieve all TDs, pagination will require more computation than chunking.

Do you mean that it will require too many connections? Since their main goal is to have the complete list, right? Kinda agree here, even if the use case that I have in mind for retrieveTDs is UI listing. In the end, in that use case you'll never really read the whole collection, just one page at a time. Do you think that a UI should use a query (e.g., no filter at all) for this use case?
So, just to understand: when you speak about retrieveTDs you have in mind a "bulk download" use case, right? For example, if I have to replicate the collection somewhere else. If yes, now I understand why you're advocating completely forgetting pagination and using chunked encoding. Not sure if this is described in the spec; it might help developers choose the right affordance (i.e., query vs retrieve). I would add it if missing.

I'd also like to point out that HTTP has Range Requests (compatible with chunked transfer encoding) to provide flow control, but that goes toward the header-based approach and is really useful for downloading files, continuing from where a transfer left off, or asking for missing pieces.

This follows the hypothesis above that you are thinking about "bulk downloads", am I right?

Yes, the use case I have in mind is bulk download and transfer to a different environment. We really need to define the application use cases. Not referring to you, but in general: instead of saying we have a use case which needs pagination, we should say exactly what the concrete use case is and consider solutions in the spec.

In case of listing in a UI:

  • Let's say we have 1000 TDs and the UI intends to show all attributes. This is going to be around 2-3 MB on average, and the best way IMO is to query all of them and let the browser or the application cache the result. The user will be able to navigate through the pages (of size e.g. 10) with no extra waiting time and would even wanna be able to filter on the client side and quickly view a subset. Overall, this gives a better user experience.
  • Let's now consider the same but with 1M TDs. Would the user wanna navigate through thousands of pages? The user will most probably wanna navigate through a subset after filtering. Since the 1M set is too large to query, cache, and filter locally, the UI is better off calling the search API and getting what the user needs.
  • If the UI doesn't intend to have/show all attributes, it should call the search API and get what it needs.

@farshidtz (Member, Author)

@wiresio thanks for summarizing. Though I think the discussion goes beyond this PR.

This PR only specifies that the collection (array of TDs) is retrieved by making a GET request on /td. It gives recommendations on how to chunk. It doesn't discuss the search API (searches have their own way of paginating). It also doesn't prevent a server from accepting query parameters to return a subset and paginate.

A server may even support content negotiation and return JSON Lines #93 (one TD per line) to accommodate other requirements.

@benfrancis (Member) commented Mar 24, 2021

@wiresio wrote:

Chunking / streaming vs. pagination: Is it really one or the other? I think both make sense.

I agree, because they are solving different problems:

  1. Chunking/streaming solves the problem where you want to download a very large resource (either a very large TD or a very large collection of TDs) but it needs breaking into chunks, measured in bytes, in order to prevent HTTP timeouts. You know you want the whole resource so you don't mind waiting for the whole thing to download before parsing it. An example would be exporting the entire collection of Thing Descriptions in a WoT directory for a smart city to store in a database and analyse.
  2. Pagination helps with the case where you want a set of results to a query (which could be getting all TDs or a subset of TDs in the directory) but you have no way of knowing in advance how large that set of results might be. If it's very large then you may want the server to provide only a few results before asking for some more. An example would be a mobile app (acting as a WoT consumer) querying a smart city database (acting as a WoT directory) looking for smart street lamps. The consumer has no way of knowing whether there might be 3 results or 3,000. If there are more than 10 then it might want to get the first 10, then decide whether to ask for more depending on whether the user scrolls the list or decides to change their search query to narrow it further. The consumer doesn't want the server to just keep sending all 3,000 results until it tells it to stop.

@farshidtz wrote:

This PR only specifies that the collection (array of TDs) is retrieved by making a GET request on /td. It gives recommendations on how to chunk. It doesn't discuss the search API (searches have their own way of paginating).

As discussed in #133 (comment) I don't really see these as two separate APIs. A cleaner REST API design (which doesn't use verbs in property or resource names) would be a single /things resource which represents the collection of thing descriptions in a directory, which can be filtered in different ways. Either by id (e.g. /things/{id}) or using some kind of query (e.g. /things/?jq={jq}).

The design of this kind of API really comes down to personal taste, which is why I hope that the URL structure is not fixed in the specification but left down to the developer implementing the directory, with a specification which instead describes the set of operations that the API should support (see #144).

Even if we define chunking as a way of dealing with large amounts of data, I don't think that means we can avoid solving the pagination problem. As @relu91 says this may also be needed for other types of operations in Thing Descriptions, like readpastevents, being discussed in w3c/wot-thing-description#892.

FWIW what I liked about both the header-based pagination proposal and this proposal is that they keep the payload of the /things resource very clean (just an array of Thing Descriptions). This means that in implementations like a smart home hub which are unlikely to have enough devices in their directory to require either pagination or chunking, they can just always return the whole array and hopefully still be compliant with the spec. Unfortunately we have other use cases where some form of pagination is going to be necessary.

@farshidtz (Member, Author) commented Mar 25, 2021

  1. Pagination helps with the case where you want a set of results to a query (which could be getting all TDs or a subset of TDs in the directory) but you have no way of knowing in advance how large that set of results might be. If it's very large then you may want the server to provide only a few results before asking for some more. An example would be a mobile app (acting as a WoT consumer) querying a smart city database (acting as a WoT directory) looking for smart street lamps. The consumer has no way of knowing whether there might be 3 results or 3,000. If there are more than 10 then it might want to get the first 10, then decide whether to ask for more depending on whether the user scrolls the list or decides to change their search query to narrow it further. The consumer doesn't want the server to just keep sending all 3,000 results until it tells it to stop.

I think when it comes to TDs, we need to consider bytes. The size of the first 3 TDs could be larger than the remaining 2997. A client should make a HEAD request and check the Content-Length, before making a potentially large GET request.
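A sketch of that check in Go (the URL and size budget are placeholders); note that ContentLength is -1 when the server declares no length, e.g. when it would answer with a chunked response:

package main

import (
	"log"
	"net/http"
)

func main() {
	resp, err := http.Head("http://localhost:8080/td") // placeholder directory URL
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	const maxBytes = 10 << 20 // arbitrary 10 MB budget for this client
	switch {
	case resp.ContentLength < 0:
		log.Println("no Content-Length declared (e.g. chunked response); size unknown")
	case resp.ContentLength > maxBytes:
		log.Println("collection too large for this client; use the search API instead")
	default:
		log.Println("small enough; proceed with GET /td")
	}
}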

I keep mentioning this: resource-based pagination is still possible on the search API.

Moreover, a server hosting huge amounts of data should have access control preventing everyday consumers from using the listing API. That's why we have a separate security scope for this:

"scopes": "readAll"

As discussed in #133 (comment) I don't really see these as two separate APIs.

This PR lays the foundation and doesn't prevent future additions. Resource-based pagination can still be added to the same API, and it should still use chunking to break up large TDs.

@benfrancis Let's discuss the other comments in respective issues, linked by you.

@wiresio (Member) commented Mar 25, 2021

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding:
HTTP/2 doesn't support HTTP 1.1's chunked transfer encoding mechanism, as it provides its own, more efficient, mechanisms for data streaming.

@farshidtz (Member, Author)

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding:
HTTP/2 doesn't support HTTP 1.1's chunked transfer encoding mechanism, as it provides its own, more efficient, mechanisms for data streaming.

wot-discovery/index.html

Lines 1020 to 1024 in f22b432

<p class="ednote" title="HTTP/2 chunking">
This should be tested before turned into an assertion:<br>
Chunked transfer encoding is not supported in HTTP/2.
HTTP/2 servers SHOULD respond the data incrementally using frames.
</p>

@wiresio (Member) commented Mar 26, 2021

Thanks for explaining @farshidtz!
But doesn't that just show that we are heavily relying on the underlying L7 protocols when voting for chunking/streaming? And instead of doing so, wouldn't it be better to define a payload format for pagination to be transported over any protocol, regardless of the specifics it offers?
Just as some food for thought: the TDD API is defined via a Thing Description, which in theory means that it is independent of the underlying protocol. So why use specifics of HTTP/1.1 that are available in some form in HTTP/2, maybe in CoAP v??, and probably (not) in MQTT, AMQP, or whatever other protocol the implementor prefers?

@farshidtz (Member, Author)

Thanks for explaining @farshidtz!
But doesn't that just show that we are heavily relying on the underlying L7 protocols when voting for chunking/streaming? And instead of doing so, wouldn't it be better to define a payload format for pagination to be transported over any protocol, regardless of the specifics it offers?
Just as some food for thought: the TDD API is defined via a Thing Description, which in theory means that it is independent of the underlying protocol. So why use specifics of HTTP/1.1 that are available in some form in HTTP/2, maybe in CoAP v??, and probably (not) in MQTT, AMQP, or whatever other protocol the implementor prefers?

I agree that we are relying on the capabilities available in HTTP/1.1 and HTTP/2 to "recommend" transfer in chunks of bytes, but this is not a bad thing. If done right, this is the most efficient way of transferring a large set in HTTP.

The payload format is a different topic, and I think that is your main concern. IMO, an array of resources is simple and portable across protocols. If we rely on a custom pagination envelope with a subset of resources of variable sizes and a nextLink, how is that useful in MQTT? And how do we guarantee its usefulness in future protocols? The current proposal (which defines the bare minimum and does not prevent extensions) can be easily mapped to the pub/sub pattern by recommending that the items in the array of resources be published as individual messages. By keeping it simple and not re-inventing the wheel, we allow ourselves to specify efficient mechanisms for various protocols without writing massive specifications and conditional requirements.

@wiresio (Member) commented Mar 26, 2021

What about proceeding as follows?

@farshidtz (Member, Author)

What about proceeding as follows?

  • Define a payload format inspired by: #16 (comment)
  • Allow retrieving TDs via one or both options: chunking/streaming and/or pagination/GET
  • Formulate in the spec: MUST implement either C/s or p/G and MAY implement both
  • Align search with retrieval

Could you please explain your use case? Why do you require the specification of the above, and what pieces of the current proposal prevent you from realizing your use case?

Please keep in mind the following:

  • That payload format was dropped in favor of a suitable way to represent linked data collections. @AndreaCimminoArriaga and his colleagues who co-authored the TD ontology can explain better. But it led to LDP Paging, which we can still add later. The current proposal neither prevents nor overlaps with that.
  • One listing method MUST be mandatory for all servers, and additional methods may be added. Otherwise, the clients will have to implement all of them.
  • Aligning search with retrieval is a different topic. Even if we do so, we should avoid allowing double pagination, since search queries already provide their own way of setting ranges.
  • HTTP allows server-driven content negotiation with the Accept header (see here or here). The payload set in this proposal doesn't have to be the only supported payload.

@wiresio (Member) commented Mar 26, 2021

IMHO use cases for c/s do not differ from those for p/g.

@wiresio (Member) commented Mar 29, 2021

So to be clear, this is an ALTERNATIVE to #130, so if we merge this we would NOT merge PR130, right?

Yes

What if we include both as "informative"? I think we'll not find a general, protocol-independent, and well-accepted / well-known solution for the listing interface. I would also like to preserve all the good findings from @farshidtz.

@farshidtz (Member, Author) commented Mar 29, 2021

IMHO use cases for c/s do not differ from those for p/g.

Then why do we need both? Please see my questions in #145 (comment)

So to be clear, this is an ALTERNATIVE to #130, so if we merge this we would NOT merge PR130, right?

Yes

What if we include both as "informative"? I think we'll not find a general, protocol-independent, and well-accepted / well-known solution for the listing interface. I would also like to preserve all the good findings from @farshidtz.

I think we are looking for something that works best with HTTP and, at the same time, is simple enough that it can be ported to other protocols following the respective efficient and protocol-specific mechanisms. The suggested streaming is already standardized in HTTP (we don't specify how to do it), and recommending it is technically no different from having it as informative.

I think the first step is to agree on a "default" payload. This proposal and #130 both mandate an array of TDs. That is the most protocol-agnostic payload format. Alternative payload formats can be supported. If we agree on the default payload, we can build on top of this proposal to add an alternative pagination method.

If the pagination on search/filtering (/search/jsonpath?query=$[0:10], or /things?jsonpath=$[0:10] according to #133) is not good enough (hard to optimize?) or does not satisfy the requirements (no next links?), we can of course add the useful parts of #130, along with the following:

  • Servers that don't want to return the entire set (why not?) with the recommended chunking redirect clients to the first page.
  • Clients that don't want to retrieve the entire set query the first page.
  • The first page will include a subset and provide a next link in a header.
  • The last page will not have any next link.
  • All pages include an ETag value in a header to convey the collection state to clients.

The above is based on https://www.w3.org/TR/ldp-paging/#ldpp-ex-paging-303 (the useful parts of it)
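To illustrate the headers involved (a hypothetical sketch, not part of this PR: the page parameter, page size, and ETag value are all made up):

package main

import (
	"fmt"
	"log"
	"net/http"
	"strconv"
)

const pageSize = 10

// collectionETag stands in for a real identifier of the collection state,
// letting clients detect changes between page requests.
const collectionETag = `"v42"`

func listPage(w http.ResponseWriter, r *http.Request) {
	page, _ := strconv.Atoi(r.URL.Query().Get("page")) // hypothetical paging parameter
	tds := loadPage(page)                              // stand-in for the storage backend

	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("ETag", collectionETag) // same collection state on every page
	if len(tds) == pageSize {
		// a full page suggests more may remain: advertise the next page
		w.Header().Set("Link", fmt.Sprintf(`</td?page=%d>; rel="next"`, page+1))
	} // the last (partial or empty) page carries no next link
	// ... write tds as a plain JSON array, keeping the default payload format ...
}

func loadPage(page int) []map[string]any {
	return nil // stub: a real directory would slice its stored collection here
}

func main() {
	http.HandleFunc("/td", listPage)
	log.Fatal(http.ListenAndServe(":8080", nil))
}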

Note: this addition has nothing to do with this PR. As mentioned a few times, the current proposal does not prevent the addition of pagination.

@benfrancis (Member)

@farshidtz wrote:

This proposal and #130 both mandate an array of TDs. That is the most protocol-agnostic payload format.

I agree it would be great to keep this as the default payload format, since it is so simple.

the current proposal does not prevent the addition of pagination.

Trying to design a standard pagination format (either in the payload or in HTTP headers) has demonstrated how hard this kind of operation is to describe in a Thing Description. But I agree that JSONPath can provide basic pagination features, and hopefully implementation-specific additions like next links and ETags in HTTP headers aren't too hard to describe in a way that all WoT consumers could understand.

@farshidtz (Member, Author)

From the call on 2021.03.29:
Will merge this PR and add a new PR to add pagination based on the header-based proposal in #130 with the necessary features (e.g. nextLink and ETag headers). A statement will be added mentioning server-driven content negotiation with the Accept header to request different payload formats.

@mmccool merged commit e3ca84b into w3c:main on Mar 29, 2021
@farshidtz mentioned this pull request on Apr 13, 2021
@wiresio mentioned this pull request on Apr 15, 2021
@farshidtz deleted the listing-chunked branch on Mar 7, 2022