
Integrating NDB into gcloud.datastore? #557

Closed
jgeewax opened this issue Jan 16, 2015 · 43 comments
Assignees
Labels
api: datastore Issues related to the Datastore API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. type: question Request for information or clarification. Not an issue.

Comments

@jgeewax
Contributor

jgeewax commented Jan 16, 2015

Had a few discussions with @GoogleCloudPlatform/cloud-datastore (particularly @pcostell) so I wanted to summarize the things we covered. This issue is to clarify the goals for Datastore support in gcloud-python, and discuss and decide on a course of action that everyone likes for how to reconcile what we have today with the other libraries out there.


Current state of the world

Currently, I see two styles that people want to use when interacting with Datastore:

  1. a "simple", more key-value-based style (similar to Amazon's SimpleDB) where you're CRUDding dictionaries and then adding some querying
  2. a more advanced ORM style where you create models and have some sort of schema defined in a Python file.

For the former, gcloud.datastore has had the goal of covering this use case. For the latter, ndb is the latest (supported) way of doing this -- with others potentially existing, but ndb seems to be the clear leader.
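To make the contrast concrete, here is a toy in-memory sketch of the two styles. The names (`put`, `get`, `Model`, `save`) are hypothetical stand-ins for illustration only, not the real gcloud.datastore or ndb APIs:

```python
# A toy in-memory backend so both styles are runnable in isolation.
store = {}

# Style 1: key-value CRUD on plain dictionaries, plus ad-hoc querying.
def put(kind, key_id, data):
    store[(kind, key_id)] = dict(data)

def get(kind, key_id):
    return store.get((kind, key_id))

put('Book', 1, {'title': 'Moby-Dick', 'author': 'Melville'})

# Style 2: an ORM-like layer where the schema lives in a Python class.
class Model(object):
    kind = None

    def __init__(self, key_id, **fields):
        self.key_id = key_id
        self.fields = fields

    def save(self):
        # The ORM layer ultimately funnels into the same low-level write.
        put(self.kind, self.key_id, self.fields)

class Book(Model):
    kind = 'Book'

Book(2, title='Dune', author='Herbert').save()
```

Both styles end up doing the same writes; the difference is whether the schema lives in your head (style 1) or in a class definition (style 2).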

We also have a unique situation where our code currently might have trouble running in App Engine -- whereas ndb can't run outside of App Engine. The layout sort of looks like this:

[diagram: current state]

Looking forward

If our goals are ....

  1. We agree that both styles of interacting with Datastore matter (and both should exist)
  2. gcloud.datastore and ndb are our choices for each style respectively
  3. gcloud.datastore and ndb should both run in App Engine and non-App Engine runtimes
  4. gcloud.datastore is where all the recommended Python stuff to talk to Datastore should live (it is the "official source of truth")
  5. People who want to write new Python libraries for Datastore can rely on code that exists in gcloud.datastore (and set gcloud as a Python dependency)

... then I'd like to suggest that we....

  1. Port ndb over as gcloud.datastore.ndb (bringing with it datastore_rpc and datastore_query)
  2. Rewrite gcloud.datastore to run on top of datastore_query
  3. Rename gcloud.datastore to be a peer with ndb (using "simple" in this diagram, not set on that at all though).

which makes things look like this:

[diagram: future state]

What do you guys think?

/cc @GoogleCloudPlatform/cloud-datastore @dhermes @tseaver @silvolu @proppy

@jgeewax jgeewax added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. type: question Request for information or clarification. Not an issue. api: datastore Issues related to the Datastore API. labels Jan 16, 2015
@jgeewax jgeewax self-assigned this Jan 16, 2015
@tseaver
Contributor

tseaver commented Jan 16, 2015

Do we know of issues using gcloud on GAE?

In the "new world order", I'm not sure what datastore_query.py and datastore_rpc.py are (I assume datastore_pb.py is really our current gcloud.datastore.datastore_v1_pb2.py). I think gcloud.datastore.simple might just be the APIs exposed currently from gcloud.datastore itself (re-exported from the underlying api, batch, connection, entity, key, query, and transaction modules).

ISTM that gcloud.datastore.ndb would sit "above" gcloud.datastore.simple in the stack, although it might be drawn as a "half layer" (to indicate that users had direct access to gcloud.datastore.simple). Also, I would think that the outer layer would just be "Anywhere" (we'd like GAE users to be able to reuse gcloud.datastore, right?)

@jgeewax
Contributor Author

jgeewax commented Jan 16, 2015

Do we know of issues using gcloud on GAE?

I don't have confirmation either way. We certainly didn't design it with the Python App Engine sandboxing in mind... Might be worth a separate issue to look into that?

In the "new world order", I'm not sure what datastore_query.py and datastore_rpc.py are

I think gcloud.datastore.simple might just be the APIs exposed currently from gcloud.datastore itself

Yes -- See item 3 in the list of suggestions ("Rename gcloud.datastore to be a peer with ndb (using "simple" in this diagram, not set on that at all though).")

ISTM that gcloud.datastore.ndb would sit "above" gcloud.datastore.simple in the stack

This was actually a point of contention -- because I don't think gcloud.datastore.simple (what we have as gcloud.datastore today) should support everything that's absolutely needed by a library as complex as NDB. I have a hunch that something like NDB would need access to protos or something lower level (like datastore_query).

I also suspect it will be more work to port NDB on top of gcloud.datastore.simple than it would to pull over datastore_rpc and datastore_query and have it continue living on top of those.

@pcostell
Contributor

gcloud should work as-is in GAE right now, but its performance would be significantly worse than ndb's. Implementing on top of something like datastore_query and datastore_rpc would allow it to use either datastore_v1 or datastore_v3.

JJ was able to convince me that NDB should not be implemented on top of gcloud.datastore.simple. However, I do think that the simple implementation should be pretty minimal with most of the heft in datastore_query and datastore_rpc.

Importantly though, I think that datastore_rpc and datastore_query will need some work before they become part of gcloud. I'm almost done getting them to support both Cloud Datastore and datastore_v3. However I think they'll need some serious API cleanup before being part of the gcloud library.

Personally (I don't think we really talked about this), I would like to see them result in a nice API for library developers or for customers who want more control than ndb or simple provide. I think right now there is some overlap between the existing implementation in gcloud and datastore_rpc and datastore_query. It may be worthwhile to see how we can best merge the API of the first with more of the functionality of the second.

@dhermes
Contributor

dhermes commented Jan 16, 2015

RE: gcloud working on App Engine. It does: https://github.com/dhermes/test-gcloud-on-gae

As Patrick mentions, ndb saves take about 1/5 the amount of time the gcloud saves do because the RPCs happen directly in GAE instead of via the Cloud Datastore API.

@dhermes
Contributor

dhermes commented Jan 16, 2015

@jgeewax and @pcostell know, but for those who don't I've also taken a stab at converting ndb to a project that builds on Travis, uses tox, etc.:
https://github.com/dhermes/ndb-git

@jgeewax There is a lot going on in ndb besides just an ORM. I think the scope of it justifies it being its own library. It can always have gcloud as a dependency.

ISTM (given the observed timing difference on GAE) that we should use datastore_rpc and datastore_query under the covers instead of using the _datastore_v1_pb2.py module generated from the proto. @pcostell I'm happy to help in that effort if possible.

@jgeewax
Contributor Author

jgeewax commented Jan 17, 2015

There is a lot going on in ndb besides just an ORM. I think the scope of it justifies being it's own library.

Can we look at what those things are? The reason I ask is that we have to find a happy balance between having one entry point for Google Cloud stuff in Python (pip install gcloud) and dividing things apart for us developers (issues, code maintenance, documentation, and everything else I'm sure I'm forgetting).

Would this be a good fit for a git submodule? Where as far as our users are concerned, they still type pip install gcloud and from gcloud.datastore import ndb, but the code, issues, and everything else would live in a separate GitHub repository?

@dhermes
Contributor

dhermes commented Jan 17, 2015

I recommend taking a peek at ndb's source. Every piece of code is built around the theme of async, and as a result, the code is quite complex.

The primary modules doing non-ORM type things are:

These are great features of ndb.


Just to get a sense of the size, we have about 2500 lines of non-test / non-generated code:

$ cd gcloud/datastore/
$ ls *py | egrep -v test | egrep -v pb2 | xargs wc -l
  223 api.py
  333 batch.py
  449 connection.py
  106 entity.py
  321 helpers.py
   12 _implicit_environ.py
  140 __init__.py
  362 key.py
  493 query.py
  160 transaction.py
 2599 total

On the other hand, ndb has nearly 12,000 lines:

$ cd ${GOOGLE_CLOUD_SDK}/platform/google_appengine/google/appengine/ext/ndb/
$ wc -l *py
   459 blobstore.py
  1277 context.py
    65 django_middleware.py
   306 eventloop.py
    64 google_imports.py
    17 __init__.py
   805 key.py
   335 metadata.py
  3935 model.py
   432 msgprop.py
   242 polymodel.py
   194 prospective_search.py
  2042 query.py
   451 stats.py
  1149 tasklets.py
   220 utils.py
 11993 total

My gcloud SDK environment at the time was:

$ gcloud -v
Google Cloud SDK 0.9.43

...
app-engine-python 1.9.17
...

@dhermes
Contributor

dhermes commented Jan 17, 2015

RE: Using a git submodule, @tseaver has expressed a distaste for it. We can ship ndb in our package on PyPI without a submodule by using custom build logic, so that may not really be an issue.

@eric-optimizely

Will pre/post hooks be supported when using the proposed gcloud.datastore.ndb module? If so, will it be possible to support them without necessarily having to import/access the modules in which those Datastore Models are defined?

@dhermes
Contributor

dhermes commented Mar 13, 2015

Can you elaborate or provide some snippets?

@eric-optimizely

Sure, here's a really simplified and contrived example of where hooks might be used that are currently difficult to replicate when making write-requests through the Datastore API:

class Book(ndb.Model):
  title = ndb.StringProperty()
  author = ndb.StringProperty()

  def _pre_put_hook(self):
    # If this is a new entity, use a sharded counter to track the total number of books.
    if self.key is None:
      sharded_counter.incr('sum_of_books')

Another, possibly more important, concern is the use of Computed Properties since they are stored/indexed and used in queries.

class Region(ndb.Model):
  zip_codes = ndb.StringProperty(repeated=True)
  len_zip_codes = ndb.ComputedProperty(lambda self: len(self.zip_codes))
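For readers unfamiliar with these two features, their semantics can be emulated in a few lines of plain, self-contained Python. This is an illustration of the behavior being discussed, not ndb's actual implementation -- the `Model` base class and `_computed` mapping here are hypothetical:

```python
# Toy emulation of ndb-style pre-put hooks and computed properties.
class Model(object):
    _computed = {}  # name -> function(self) for derived values

    def __init__(self, **fields):
        self.key = None  # no key until the entity is "written"
        self.__dict__.update(fields)

    def _pre_put_hook(self):
        pass  # subclasses override, as with ndb's hook of the same name

    def put(self):
        self._pre_put_hook()  # runs client-side, before the write
        for name, fn in self._computed.items():
            # derived values are materialized at write time, so they can
            # be stored and indexed like any other property
            setattr(self, name, fn(self))
        self.key = ('Model', id(self))  # pretend a key was allocated

class Region(Model):
    _computed = {'len_zip_codes': lambda self: len(self.zip_codes)}

r = Region(zip_codes=['94041', '10011'])
r.put()
assert r.len_zip_codes == 2
```

The key point eric-optimizely raises follows directly from this sketch: both the hook and the computed value only run in a process that has the class definitions imported, so callers without the models never execute them.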

@dhermes
Contributor

dhermes commented Mar 17, 2015

@eric-optimizely Sorry for the poor question on my part. I am aware how hooks work in ndb.

I was curious how you saw that working with the datastore package as it currently exists.

As for datastore.ndb, it will just be ndb in its entirety (but supported off of App Engine).

@eric-optimizely

Ok, thanks for clarifying. I was hoping for some sort of magic :) but, I suspect that support for hooks/computed properties is going to be complicated unless there's some other mechanism that's storing/caching them within the Datastore API itself.

@dhermes
Contributor

dhermes commented Mar 18, 2015

Magic within datastore or within datastore.ndb?

We could add support for hooks here but I'm not sure if you want them in datastore or datastore.ndb. Any features of ndb (hooks, computed properties, etc.) will still be available once the port is complete.

RE: storing/caching within the Datastore API, that's not necessary for a hook or a computed property. A computed property just uses local data to create some derived property, while a hook just does pre- and post-processing on data sent/received.

@eric-optimizely

My assumption is that you'd need to import the Models in order for those hooks/properties to be generated and executed. If that's a correct assumption, then the problem for us would be that not all codebases which access our Datastore via the API have a copy of the Model definitions. The Models are defined by the web application in one repo, and other ancillary systems that access the data live in other repos. This gets more complicated if the Models are defined in Python but other callers are written in other languages.

The storing/caching mechanism I mentioned could allow all callers of the Datastore API to remain agnostic about the implementation details and maintain consistency on write ops.

@dhermes
Contributor

dhermes commented Mar 18, 2015

Ahhhh I finally get it :) Sorry for being so slow on the uptake!

I don't think that's a doable feature, but I like it. It essentially would require another service or just a custom backend. In either case, you'd have HTTP overhead that could really hurt large applications.

Best bet would be to use the same models (even if you could duplicate the behavior in another language, keeping the code in sync would be a very dangerous proposition, since it would be so easy to slip up).

@eric-optimizely

Yeah, I'm not sure if it would be possible to serialize the functionality of hooks/computed properties into protobufs in such a way that they could be retrieved by the gcloud library (language agnostic) and applied to the caller. I could imagine the (Python) code looking something like this:

# Implicit
query = datastore.query.Query('Book', use_model_prototype=True)

# More explicit
model_prototype = datastore.Prototype(key=datastore.Key('BookPrototype'))
query = datastore.query.Query('Book', prototype=model_prototype)

@dhermes
Contributor

dhermes commented Mar 18, 2015

For record-keeping. Scary ndb memory bugs:

https://code.google.com/p/googleappengine/issues/detail?id=9610
https://code.google.com/p/googleappengine/issues/detail?id=11647 (essentially the child of 9610)
GoogleCloudPlatform/endpoints-proto-datastore#122 (issue that alerted me to 9610)

/cc @jgeewax @tseaver

@squee1945

Async/Tasklets was the first thing I looked for when I cracked open gcloud-python. As a long-time GAE user, I've grown to love tasklets because they allow a developer to write performant async code while still allowing for proper separation of concerns / encapsulation. Meaning: to do RPC work in parallel, the developer doesn't need to jam a whole bunch of otherwise unrelated work into a single function.

That said, the NDB implementation of tasklets is definitely a big undertaking and has challenging issues like the ones that @dhermes points out above (https://code.google.com/p/googleappengine/issues/detail?id=9610).

I see that each of gcloud.datastore, gcloud.storage, and gcloud.pubsub (to a certain extent) at least has the notion of batching, but it appears that this batching has been implemented separately in each package, meaning, e.g., that I can't batch a storage request together with a datastore request (please correct me if I'm wrong).

I don't yet know enough about the API backend for the Google APIs, but does it support a more global sense of batching? E.g., can gcloud have a higher-level notion of batching that will work over all (most?) of the sub-APIs?

"Higher-level" batching isn't as powerful as full-on tasklets, but at least it would allow the performance gains, i.e., across disparate APIs.
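The tasklet pattern described above -- separate functions for separate concerns, with the underlying RPCs overlapping in time -- maps closely onto coroutine-based async. A sketch in asyncio terms (stand-in coroutines, not ndb tasklets; the `sleep` calls simulate RPC latency):

```python
import asyncio

async def fetch_user(uid):
    await asyncio.sleep(0.01)  # stands in for one Datastore RPC
    return {'id': uid, 'name': 'u%d' % uid}

async def fetch_orders(uid):
    await asyncio.sleep(0.01)  # an independent RPC, in its own function
    return [{'user': uid, 'total': 42}]

async def handler(uid):
    # The two concerns stay encapsulated in their own coroutines,
    # yet the simulated RPCs run concurrently rather than back-to-back.
    user, orders = await asyncio.gather(fetch_user(uid), fetch_orders(uid))
    return user['name'], len(orders)

print(asyncio.run(handler(7)))  # -> ('u7', 1)
```

This is the separation-of-concerns win: neither fetch function knows about the other, and only the caller decides to overlap them.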

@dhermes
Contributor

dhermes commented Apr 6, 2015

@squee1945 JJ opened #796 for discussion of library-wide batching.

As for getting async into gcloud, we'd like to start with ndb and take it from there. If you've ever had a chance to look at the source of ndb (it seems like you have, based on your comments), it uses some serious Python magic.

@squee1945

@dhermes yes, it's just wow. Didn't Guido use that area as a proving ground for the Tulip stuff in Py3?

Another approach (as opposed to full-on tasklets) would be async methods and futures. But I suppose that may drag along the eventloop pump, etc.

@dhermes
Contributor

dhermes commented Jun 6, 2015

Do we have a timeline (UPDATE: for v1beta3)? Is there any way I or @tseaver can help speed things along?

@dhermes
Contributor

dhermes commented Jul 16, 2015

@pcostell I noticed https://www.googleapis.com/discovery/v1/apis/datastore/v1beta3/rest is serving. Does this mean it is ready?

@pcostell
Contributor

Almost :-). It's not quite usable yet.

@theacodes
Contributor

Bump, v1beta3 is out. Is this still something we want to do?

@pcostell
Contributor

pcostell commented May 4, 2016

I don't think so. I'd like to recommend gcloud-python as the preferred way to use Cloud Datastore with ndb as an ORM on top of that if you'd prefer an ORM-like experience. I'd like ndb to be available as a separate install for users that want to use it.

@pcostell pcostell closed this as completed May 4, 2016
@theacodes
Contributor

That's fine. But we do need a plan for making that happen.


@aatreya

aatreya commented May 9, 2016

I'd like to recommend gcloud-python as the preferred way to use Cloud Datastore with ndb as an ORM on top of that if you'd prefer an ORM-like experience. I'd like ndb to be available as a separate install for users that want to use it.

What's the current best practice for folks who like NDB but want to build on the flexible environment? (There's the compat runtime but that sounds like a temporary/intermediate solution.)

@dhermes
Contributor

dhermes commented May 9, 2016

I think that's a question for @pcostell?

@pcostell
Contributor

pcostell commented May 9, 2016

If you're using a flexible environment and would like to use ndb, I'd recommend using the compat runtime, which will get you the entire App Engine SDK with serving fastpaths.

In the future, this hopefully won't be necessary and you can just include ndb manually, but we're not at that point yet.
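For reference, the compat runtime mentioned above was selected via app.yaml in the flexible environment. A hedged sketch of a minimal configuration from that era (field names from memory; verify against the current App Engine documentation before relying on them):

```yaml
runtime: python-compat
vm: true
api_version: 1

handlers:
- url: /.*
  script: main.app
```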

@douglascorrea

douglascorrea commented Jun 13, 2016

Are there any updates on this? Can I use ndb on the flexible environment, or should I still go with python-compat? cc @pcostell

@posita

posita commented Jul 3, 2016

Just to be clear, there's currently no Python 3-compatible way to access ndb, correct? python-compat appears to support Python 2.7 only.

@pcostell wrote:

I'd like to recommend gcloud-python as the preferred way to use Cloud Datastore with ndb as an ORM on top of that if you'd prefer an ORM-like experience. I'd like ndb to be available as a separate install for users that want to use it.

Since this issue has been closed, are there any other issues tracking the implementation(s) of these preferences? Are they on a roadmap anywhere?

@aatreya

aatreya commented Aug 15, 2016

@pcostell wrote:

If you're using a flexible environment and would like to use ndb, I'd recommend using the compat runtime, which will get you the entire App Engine SDK with serving fastpaths.

In the future, this hopefully won't be necessary and you can just include ndb manually, but we're not at that point yet.

Is anyone actually working on this? Do you expect it'll be available in the next couple months or is this a longer-term undertaking?

(We'd like to use Python 3 in the flexible environment but want to be able to take advantage of caching, transactions, and other features of NDB.)

@corydolphin

@aatreya did you make any progress or exploration here? I hit the same roadblock and challenge.

@mattwarrenrnp

I'd like to use ndb on a Python 3.5 project in the flexible environment. Would love to see some progress on making this possible.

@dhermes
Contributor

dhermes commented Nov 10, 2016

@jonparrott Do you know of any updates?

@theacodes
Contributor

@dhermes, not really. Maybe @pcostell has an update?

@pcostell
Contributor

We are working on it (ndb outside of App Engine), but right now there are a lot of incompatibilities, so it is likely not as usable yet.

You can try out the existing state by following the instructions in the demo: https://github.com/GoogleCloudPlatform/datastore-ndb-python/tree/master/demo

However, there are a lot of gotchas (one of the more substantial ones is that it isn't running on gRPC, and its RPCs are only run synchronously). If you try it out, please file any bugs on the ndb GitHub tracker.

Here is a bug to track this issue, rather than keeping this issue open in google-cloud-python: GoogleCloudPlatform/datastore-ndb-python#272

@magoarcano

Any news about this? I haven't moved my projects to App Engine Flexible only because I don't want to migrate my code from ndb to the Cloud Datastore API.

@jgeewax
Contributor Author

jgeewax commented Jul 24, 2017

@magoarcano : See @pcostell's note from above, specifically https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/master/demo/task_list.py which is a demo app using NDB that runs inside GCE. Pay special attention to the configuration settings that turn things off.

Also, as Patrick says, the performance of this will be sub-par due to synchronous requests and no caching.

It's important to note that this is an incredibly complex undertaking because the runtime environments are significantly different, and NDB was built on the premise that it would only ever run in App Engine (which is an "all or nothing" service with Datastore, Memcache, Task Queues, etc. all available via RPC calls). In Flex (or GCE, or AWS, etc.) things are very different, so a majority of those assumptions don't hold anymore.

This means that even though a port of NDB might work, there will be a huge number of "Oh, we didn't realize NDB made that assumption!" moments, so we're being extra cautious about what we release to the world. We don't want to hand out code that works only in the right circumstances -- it should work everywhere.

parthea pushed a commit that referenced this issue Oct 21, 2023
Source-Link: https://togithub.com/googleapis/synthtool/commit/92006bb3cdc84677aa93c7f5235424ec2b157146
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:2e247c7bf5154df7f98cce087a20ca7605e236340c7d6d1a14447e5c06791bd6