Possible memory leak with test data #300
Thanks for the detailed writeup. I'll try to reproduce this and pin down the leak.
This is profiling info for starting up the server with a db containing an arbitrary relation created with this tutd command:
From what I understand so far, most of the time and memory is spent in ProjectM36.DatabaseContext.hashBytes while calculating the Merkle hash.
Thanks for the report! I've been meaning to play with BangPatterns. I suspect that we have a lot of unnecessary laziness in the Atom definitions. I'll start there and see how the profiles change.
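For illustration, here is a minimal sketch of the idea, using a toy `StrictAtom` type rather than Project:M36's actual Atom definition: strict constructor fields and bang patterns force values as they are built, so large tuple loads don't accumulate thunks behind each field.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.Text as T

-- Lazy fields: each constructor may hold an unevaluated thunk chain.
data LazyAtom = LazyIntAtom Integer
              | LazyTextAtom T.Text

-- Strict fields: the payload is evaluated to WHNF when the
-- constructor is applied, avoiding per-field thunks.
data StrictAtom = IntAtom !Integer
                | TextAtom !T.Text

-- A bang pattern on the accumulator prevents a thunk chain in a fold.
sumAtoms :: [StrictAtom] -> Integer
sumAtoms = go 0
  where
    go !acc (IntAtom n : rest) = go (acc + n) rest
    go !acc (_ : rest)         = go acc rest
    go !acc []                 = acc

main :: IO ()
main = print (sumAtoms [IntAtom 1, TextAtom (T.pack "a"), IntAtom 2])
```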
Just for the record, I think that the original bug opened here is resolved by f1fe611, which fixes an issue whereby the disconnected transaction carried a long tuple list forward with it. We should still investigate how to reduce memory usage further.
Currently, Project:M36 clients validate the transaction graph via the Merkle hashes at connection time. How would you feel about offering the option for clients to skip the validation for trusted Project:M36 servers?
That would be nice, but doesn't the server calculate the hashes on every transaction, or is that not the case? Also, what would be the risk of skipping it if it's not calculated on every transaction?
Indeed, it's a fair question. The goal of the Merkle hashing is the same as in other blockchain technologies: a verifiable trail. Many databases don't have such a feature, so it's natural to ask whether it's necessary. I have struggled with the requirements myself, so I am open to making Merkle hashing optional, with the caveat that doing so would close the door to features such as:
Basically, as in a blockchain, these are features that allow a third party to place less trust in the database server. But I can certainly understand that someone would be willing to trade the Merkle validation for some performance improvement.
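To make the "verifiable trail" idea concrete, here is a minimal sketch of Merkle-style chain validation, assuming the cryptonite package; Project:M36's real hashing (e.g. ProjectM36.DatabaseContext.hashBytes) covers far more state than this.

```haskell
import Crypto.Hash (Digest, SHA256 (..), hashWith)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC

-- Each transaction's hash commits to its payload and its parent's
-- hash, so tampering with any ancestor invalidates every descendant.
merkleStep :: Digest SHA256 -> ByteString -> Digest SHA256
merkleStep parentHash payload =
  hashWith SHA256 (BC.pack (show parentHash) <> payload)

-- A client can recompute the chain from a known genesis hash and
-- compare the result against the server-reported tip.
validateChain :: Digest SHA256 -> [ByteString] -> Digest SHA256 -> Bool
validateChain genesis payloads tip =
  foldl merkleStep genesis payloads == tip
```

Skipping validation would mean trusting the server-reported tip without recomputing the chain, which is exactly the trade-off being discussed.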
I get what you're saying, but there are a few things that come to mind regarding this:
In a nutshell, what I'm saying is that hashing alone is not enough to secure a public-access database, and if it's not supposed to be public, then the clients are from the same organization, which means the clients have to trust the server anyway (it's not like the clients can take over or something, unless we somehow make that a possibility). It might help me get a better perspective if you provide an example of the scenario you have in mind in which Merkle hashing is a positive. I'm specifically interested in how the network would be configured server/client-wise (i.e., is there a central server, or does each client act as a server with access to the db files?).
I think that this is an interesting debate. In "Out of the Tarpit", the authors make a strong case that the database and application server should be one entity. This removes hurdles such as the type impedance mismatch between database and app server. With Haskell server-side scripting, this unification should be possible today, and I wish to make further progress in making databases public with security improvements, as you suggested.

You are correct that in today's architectural model, databases are centralized software, completely managed and trusted as a singular source of truth. However, that often makes them a single point of failure too, even with high-availability features like replication and failover; for example, if the database is overwhelmed with requests. With this in mind, Project:M36 has been making strides towards a distributed system more akin to git, whereby a database can contain a subset of another database or selectively choose to replicate branches, etc. Therefore, no singular database is necessarily authoritative, so we need some tools to determine what state "somewhat trustworthy" remote databases are in; for example, to determine common ancestors in the transaction stream.

I think it would still be fine to offer the option to disable hashing, both on server startup and on client-side connection.
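As a sketch of the "common ancestors" problem mentioned above, here is one way to compute them over a simplified parent graph; the types are hypothetical stand-ins, not Project:M36's transaction graph API.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type TransactionId = Int

-- Maps each transaction to its parents (merges have several).
type ParentGraph = M.Map TransactionId [TransactionId]

-- Collect a transaction and all of its ancestors by walking
-- parent links, guarding against already-visited nodes.
ancestors :: ParentGraph -> TransactionId -> S.Set TransactionId
ancestors g = go S.empty
  where
    go seen tid
      | tid `S.member` seen = seen
      | otherwise =
          foldl go (S.insert tid seen) (M.findWithDefault [] tid g)

-- Transactions reachable from both heads; a merge between two
-- semi-trusted databases could use these as candidate merge bases.
commonAncestors :: ParentGraph -> TransactionId -> TransactionId
                -> S.Set TransactionId
commonAncestors g a b = ancestors g a `S.intersection` ancestors g b
```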
In 54c8221, I fixed an unnecessary bytestring strictness when writing the database to disk. In 0b008ba, I added BangPatterns to the most common Atoms, which reduces data structure indirection and thunks. For the 200,000-tuple arbitrary relation, this cuts memory usage by 10%. There is probably other low-hanging fruit to find.
Given your additional notes, I think these are required to implement something like git:
I think if we manage to get this to work in a smooth, high-performance way, the results should be ground-breaking! I know these are easier said than done, but we can try! Let me know what you think; this is too interesting!
I import a relation with 17 attributes and ~200,000 tuples, so I think a disable option would be very helpful for trying project-m36.
Indeed! I am trying to execute an incremental plan towards this, where I think the actual design and architecture aspects are most critical, so I have pushed the security components that you mentioned to the final phase. It should be easier to test the merging/pulling systems in a semi-trusted environment. Keep in mind that some of the elements you mentioned above are actually features of a centralized GitHub/GitLab-like architecture rather than git alone. For example, naked git does not support pull/merge requests. I think such features would be nice eventually, but I would like to experiment with the decentralized aspects first.
There is likely some low-hanging fruit in further defragmenting the heap. Also, we plan to implement finite relations' tuple stores using streamly (currently a linked list) to take advantage of more parallelism, which could let many operations over disk-stored tuples run in constant memory.
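As a rough sketch of why streaming helps, assuming streamly-core's Streamly.Data.Stream API and a stand-in tuple type (not Project:M36's RelationTuple): a fold over a stream consumes each tuple and discards it, instead of materializing the whole linked list.

```haskell
import qualified Streamly.Data.Fold as Fold
import qualified Streamly.Data.Stream as Stream

type Tuple = [Int] -- stand-in for a relation tuple

-- Counting tuples runs in constant memory: each tuple is consumed
-- by the fold and immediately becomes garbage.
countTuples :: Monad m => Stream.Stream m Tuple -> m Int
countTuples = Stream.fold Fold.length

main :: IO ()
main = do
  n <- countTuples (Stream.fromList [[i, i * 2] | i <- [1 .. 200000]])
  print n
```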
I use the code below to add 200,000 tuples to a test relvar. After execution, project-m36-server uses about 1GB of RAM. Once I quit the server and start it again, it uses about 2.4GB of RAM. (Sorry about the code wall, I moved everything relevant to the same file for ease of testing.)
Test repo is here
To reproduce, start ghci in the DB module, load it and run:
After that, if you check, you'll see that project-m36-server is using about 1GB of RAM. If you close the server and start it again, it will be using about 2.4GB.