
Possible Bug? Query Latency Across Multiple Geographic Regions #3668

Closed
savearray2 opened this issue Jul 14, 2019 · 6 comments
Labels
area/performance (Performance related issues) · kind/question (Something requiring a response) · status/accepted (We accept to investigate/work on it)

Comments

savearray2 commented Jul 14, 2019

  • What version of Dgraph are you using?
    Latest Docker version of Dgraph (v1.0.16/0590ee95)

  • What is the hardware spec (RAM, OS)?
    5x Amazon r5.2xlarge instances.

  • Steps to reproduce the issue (command/config used to run Dgraph).
    Launch the Docker images and create a cluster using replicas=5. Load the test data from the introduction part of the Tour. (A sketch of the launch commands is included below.)

  • Expected behaviour and actual result.
    I understand that write latency will be higher across multiple geographic locations/instances, since the Raft cluster has to reach the leader for proposals; reads, however, should be fast.

I currently have two zones set up: one in us-west-2 (3 servers) and one in ap-northeast-1 (2 servers), with a round-trip time of around 100 ms between them.
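
For reference, a minimal sketch of how such a cluster can be brought up from the Docker image (hostnames and the --lru_mb value are illustrative placeholders, not the exact commands used here):

# One Zero node with a replication factor of 5:
docker run -d -p 5080:5080 -p 6080:6080 dgraph/dgraph:v1.0.16 \
  dgraph zero --my=zero1.us-west-2.internal:5080 --replicas 5

# On each of the five instances (3x us-west-2, 2x ap-northeast-1);
# in v1.0.x the alpha process is started with "dgraph server":
docker run -d -p 7080:7080 -p 8080:8080 -p 9080:9080 dgraph/dgraph:v1.0.16 \
  dgraph server --my=$(hostname -i):7080 --zero=zero1.us-west-2.internal:5080 --lru_mb 2048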

curl "http://$(hostname -i):8080/query?debug=true" -XPOST -d $'{
  everyone(func: anyofterms(name, "Michael Amit")) {
    name
    friend {
      name@ru:ko:en
      friend { expand(_all_) { expand(_all_) } }
    }
  }
}'

When running the above query in us-west-2 (the location of the current leaders), I get the following:
{"server_latency":{"parsing_ns":16491,"processing_ns":3474530,"encoding_ns":1006297}
A total time of ~4.5 ms.
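(For reference, that figure is just the sum of the three reported fields, converted from nanoseconds to milliseconds:)

echo "(16491 + 3474530 + 1006297) / 1000000" | bc -l   # ≈ 4.497 ms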

When I run the query in ap-northeast-1, I get the following:
{"server_latency":{"parsing_ns":15140,"processing_ns":3886714,"encoding_ns":101721380}
Approximately 106.6 ms.

Is there a reason why the encoding_ns time is so much higher in the ap-northeast-1 region? According to the documentation this should be the time to encode the response into JSON, but I'm not sure why that would be dependent on any server other than the local one where the query is running.

Thanks all 🙂

campoy (Contributor) commented Jul 16, 2019

Thanks for the report; this does look odd.
The encoding latency should not increase in your setup.

@gitlw, could you have a look at this?

campoy added the area/performance, kind/question, and status/accepted labels Jul 16, 2019
savearray2 (Author)

Hello All,

I can provide more details if you have difficulty replicating the issue. It happens for me every time I recreate the cluster from scratch.

Thanks 😶

martinmr (Contributor)

Hey, I thought I had already replied to this. I think I wrote an answer but I forgot to submit it 🤣

This is not a bug but I'll try to explain what's happening.

  • When querying from the zone where the leader is, you get a fast response. This is obviously expected.
  • When querying from the other zone, you see a ~100 ms delay. Note that this is roughly the same delay you get from a ping across the zones. In this case, your query has to go to the other zone because the leader is there. In Dgraph, replicas are not used to answer queries; they replicate the data so that if the leader goes down, another election is held and one of the former replicas becomes the leader. All subsequent queries then go to this new leader.
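
A quick way to check where the group leaders currently live is Zero's /state endpoint (assuming Zero's HTTP port is the default 6080):

curl -s localhost:6080/state
# In each group, the member whose entry shows "leader": true is the node
# queries get routed through.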

Dgraph's current design supports replication, but it is not really geared towards geographic replication, since all operations (including reads) still go through the Raft process.

Can you try something? If you don't need strict read consistency, you could try best-effort queries (https://docs.dgraph.io/clients/#create-a-transaction), which do not go through Raft consensus. However, I am not sure whether those queries still need to go through the leader; if they don't, best-effort queries could serve as a temporary workaround.
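
As a rough sketch of what a best-effort read could look like over HTTP (assuming a release that exposes it via a be=true query parameter; the official clients expose the same thing as a best-effort flag on read-only transactions):

curl "http://$(hostname -i):8080/query?be=true&debug=true" -XPOST -d $'{
  everyone(func: anyofterms(name, "Michael Amit")) { name }
}'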

Longer term, we are working on cluster replication. Once that feature is out, you'll be able to run multiple clusters in different zones that operate independently of each other but eventually receive the same data. At that point, your queries in the ap-northeast zone won't have to talk to any alpha in another zone.

campoy (Contributor) commented Jul 19, 2019

Thanks for the explanation, @martinmr

I do wonder why this network latency appears as part of the encoding time.
Shouldn't it be somewhere else?

martinmr (Contributor)

Most likely a bug: the time spent receiving the response is probably being counted as part of the encoding time. I'll open a separate issue and look into it.

campoy (Contributor) commented Jul 19, 2019

OK, then we can close this issue, since some extra network latency is expected in this setup.

campoy closed this as completed Jul 19, 2019