Connections still not closing (branch-0.11) #131

Open
seddonm1 opened this issue May 7, 2016 · 9 comments

seddonm1 commented May 7, 2016

Hi,
I have just built branch-0.11 with fix #118 applied.

It still appears to be holding connections open. Here is my connection configuration:
MongodbConfigBuilder(Map(Host -> List("servername:27017"), Database -> "myDB", Collection -> "myCollection", ConnectionsTime -> "120000", ConnectionsPerHost -> "10")).build

executed using:
sqlContext.fromMongoDB
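
Putting it together, the whole read path looks roughly like this (a sketch; it assumes the usual spark-mongodb imports, an existing sqlContext, and the placeholder names from above):

import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._

// Placeholder host/database/collection names, as above.
val readConfig = MongodbConfigBuilder(Map(
  Host -> List("servername:27017"),
  Database -> "myDB",
  Collection -> "myCollection",
  ConnectionsTime -> "120000",
  ConnectionsPerHost -> "10")).build

val df = sqlContext.fromMongoDB(readConfig)
df.count() // any action triggers the Mongo reads (and opens connections)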

If I watch the open files of the Spark Streaming app, I can see more and more connections being opened:
lsof -p 14169 | grep servername:27017

Is there something else that needs to be configured to allow the scheduler to release these connections?

wuciawe (Contributor) commented May 7, 2016

I haven't used it in Spark Streaming yet. Could you try it in a normal Spark app and check whether it closes the connections there? #118 fixes the problem of the ActorSystem getting stuck on exit of the application. Going by your description, maybe you can have a look at this file. It will reuse an old connection if that connection is not busy.
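
The reuse logic I mean looks roughly like this (illustrative only, with hypothetical names; not the actual code in that file):

// Illustrative only: hypothetical names, not the actual pool implementation.
// A client is reused when one for the same hosts exists and is not busy.
case class PooledClient(hosts: List[String], var busy: Boolean = false)

class SimplePool {
  private var clients = List.empty[PooledClient]

  def acquire(hosts: List[String]): PooledClient = synchronized {
    clients.find(c => c.hosts == hosts && !c.busy) match {
      case Some(idle) => idle.busy = true; idle      // reuse an idle client
      case None =>
        val fresh = PooledClient(hosts, busy = true) // otherwise open a new one
        clients ::= fresh
        fresh
    }
  }

  def freeConnection(c: PooledClient): Unit = synchronized { c.busy = false }
}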

seddonm1 (Author) commented May 7, 2016

Hi,
I can test without streaming but streaming is the main use case.

After just over 3 hours of one-minute Spark Streaming batches (186 batches), I can see 1092 connections to Mongo (~5 per batch). Eventually it hits the Ubuntu open-file limit and stops being able to open connections.

Interestingly, it has not caused problems with WRITING to Mongo using saveToMongodb, which ran at the same interval for many days:
MongodbConfigBuilder(Map(Host -> List("servername:27017"), Database -> "myDB", Collection ->"myCollection", SamplingRatio -> 1.0, WriteConcern -> "normal", SplitSize -> 8, SplitKey -> "_id", IdAsObjectId -> "false")).build
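
That config is executed each batch roughly as follows (a sketch; df stands for the batch's DataFrame, and the saveToMongodb implicit comes from import com.stratio.datasource.mongodb._):

val writeConfig = MongodbConfigBuilder(Map(
  Host -> List("servername:27017"),
  Database -> "myDB",
  Collection -> "myCollection",
  SamplingRatio -> 1.0,
  WriteConcern -> "normal",
  SplitSize -> 8,
  SplitKey -> "_id",
  IdAsObjectId -> "false")).build

df.saveToMongodb(writeConfig) // one write per streaming batch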

Perhaps the problem can be diagnosed by comparing how connections are handled.

wuciawe (Contributor) commented May 7, 2016

Have you checked that file yet? I think this could be caused by connections being kept in the BUSY state, which in turn may be caused by not calling the freeConnection() method on every generated writer. After a quick search, I found a suspicious place in com.stratio.datasource.mongodb.MongodbRelation:

def insert(data: DataFrame, overwrite: Boolean): Unit = {
  if (overwrite) {
    // this writer is created, used once, and never freed
    new MongodbSimpleWriter(config).dropCollection
  }

  data.saveToMongodb(config)
}

The newly created MongodbSimpleWriter never calls freeConnection(); this may be a leak, and maybe a bug. In your case, the leak may be in some other place.
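
A sketch of the kind of fix I mean, assuming the writer exposes the freeConnection() method mentioned above:

def insert(data: DataFrame, overwrite: Boolean): Unit = {
  if (overwrite) {
    val writer = new MongodbSimpleWriter(config)
    try writer.dropCollection
    finally writer.freeConnection() // release the client back to the pool
  }

  data.saveToMongodb(config)
}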

@pfcoperez is this a connection leak?

seddonm1 (Author) commented May 7, 2016

Thanks @wuciawe.

I think I will have to make do with some hacking with mongoexport and mongoimport, and avoid the spark-mongodb package until this type of issue is resolved.

pfcoperez (Contributor) commented May 8, 2016

@seddonm1 @wuciawe The currently implemented connection pool has a problem related to the fact that there are no destructors in Java or Scala. If there were, we could guarantee that a connection extracted from the pool would be closed, using an RAII-like pattern.

The main problem is that the connection pool provides client instances which have to be freed explicitly by calling one of the MongoClientFactory#setFreeConnection... methods.

Some Spark hooks are used to call these methods automatically after Spark tasks finish. However, there are other use cases, as described in your conversation, for which the client is not being freed.

The current approach would work just right if MongoClients were closed after every possible use. But that is quite a hazardous assumption, given that there are too many points where explicit resource deallocation would be needed.

We've decided to follow a new approach: to imitate the way other well-known JVM resource pools work. That is, the client code passes the pool a task to perform and lets the pool assign it to a connection. Hence, the pool, instead of its client code, is responsible for resource deallocation, which concentrates deallocation into a single point of responsibility.
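
Something along these lines (an illustrative sketch with hypothetical names, generic in the client type; not the final implementation):

import java.util.concurrent.ConcurrentLinkedQueue

// The pool owns the whole connection lifecycle: callers hand it a task
// to run and can never forget to free the client themselves.
class TaskPool[C](factory: () => C) {
  private val idle = new ConcurrentLinkedQueue[C]()

  def withClient[T](task: C => T): T = {
    val client = Option(idle.poll()).getOrElse(factory())
    try task(client)           // run the caller's task on a pooled client
    finally idle.offer(client) // the pool, not the caller, returns it
  }
}

// Usage sketch: pool.withClient(client => ...query Mongo with client...)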

We've already started changing the pool implementation, but it will take a while. In the meantime, we'll remove the connection pool, thus removing all the issues you are finding concerning connection leaks.

I hope this helps.

seddonm1 (Author) commented

Thanks @pfcoperez. I'm glad you acknowledge the issue and have a plan for its resolution.

Good luck.

darroyocazorla (Member) commented

Hi,
The new approach @pfcoperez is referring to will be implemented for 0.12.X ASAP. In the meantime, we have added a PR (#133) to solve the issue.

jaminglam commented

Hi,
I am using branch 0.11 through the Python API, like the following:
df = sqlContext.read.format("com.stratio.datasource.mongodb").options(host="serverhost:27017", database="db_name", collection="collection_name").load()
It still does not close its connections even after my Spark jobs are done. How can I force the connections to close through the Python API?

darroyocazorla (Member) commented May 12, 2016

Hi @jaminglam
#133 has already been merged.
