
Possible Bug? Query Latency Across Multiple Geographic Regions #3668

Closed
savearray2 opened this issue Jul 14, 2019 · 6 comments
Labels
area/performance (Performance related issues) · kind/question (Something requiring a response) · status/accepted (We accept to investigate/work on it)

Comments

savearray2 commented Jul 14, 2019

  • What version of Dgraph are you using?
    Latest Docker version of Dgraph (v1.0.16/0590ee95)

  • What is the hardware spec (RAM, OS)?
    5x Amazon r5.2xlarge instances.

  • Steps to reproduce the issue (command/config used to run Dgraph).
    Launch the Docker images and create a cluster using replicas=5. Load the test data from the introduction part of the Tour. (A sketch of the launch commands is included below.)

  • Expected behaviour and actual result.
    I understand that write latency will be higher across multiple geographic locations/instances, since the Raft cluster has to reach the leader for proposals; reads, however, should be fast.

I currently have two zones set up: one in us-west-2 (3 servers) and one in ap-northeast-1 (2 servers), with a round-trip time of around 100 ms between them.
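
For reference, a minimal sketch of how such a cluster can be brought up from the Docker image (hostnames and the --lru_mb value are illustrative placeholders, not the exact commands used here):

# One Zero node with a replication factor of 5:
docker run -d -p 5080:5080 -p 6080:6080 dgraph/dgraph:v1.0.16 \
  dgraph zero --my=zero1.us-west-2.internal:5080 --replicas 5

# On each of the five instances (3x us-west-2, 2x ap-northeast-1);
# in v1.0.x the alpha process is started with "dgraph server":
docker run -d -p 7080:7080 -p 8080:8080 -p 9080:9080 dgraph/dgraph:v1.0.16 \
  dgraph server --my=$(hostname -i):7080 --zero=zero1.us-west-2.internal:5080 --lru_mb 2048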

curl "http://$(hostname -i):8080/query?debug=true" -XPOST -d $'{
  everyone(func: anyofterms(name, "Michael Amit")) {
    name
    friend {
      name@ru:ko:en
      friend { expand(_all_) { expand(_all_) } }
    }
  }
}'

When running the above query in us-west-2 (the location of the current leaders), I get the following:
{"server_latency":{"parsing_ns":16491,"processing_ns":3474530,"encoding_ns":1006297}
A total time of ~4.5 ms.
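(For reference, that figure is just the sum of the three reported fields, converted from nanoseconds to milliseconds:)

echo "(16491 + 3474530 + 1006297) / 1000000" | bc -l   # ≈ 4.497 ms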

When I run the query in ap-northeast-1, I get the following:
{"server_latency":{"parsing_ns":15140,"processing_ns":3886714,"encoding_ns":101721380}
Approximately 106.6 ms.

Is there a reason why the encoding_ns time is so much higher in the ap-northeast-1 region? According to the documentation this should be the time to encode the response into JSON, but I'm not sure why that would be dependent on any server other than the local one where the query is running.

Thanks all 🙂

campoy (Contributor) commented Jul 16, 2019

Thanks for the report; this does look odd.
The encoding latency should not increase in your setup.

@gitlw, could you have a look at this?

campoy added the area/performance, kind/question, and status/accepted labels Jul 16, 2019
savearray2 (Author)

Hello All,

I can provide more details if you have difficulty replicating the issue. It happens for me every time I recreate the cluster from scratch.

Thanks 😶

martinmr (Contributor)

Hey, I thought I had already replied to this. I think I wrote an answer but I forgot to submit it 🤣

This is not a bug but I'll try to explain what's happening.

  • When querying from the zone where the leader is, you get a fast response. This is obviously expected.
  • When querying from the other zone, you see a ~100 ms delay. Note that this is roughly the same delay you get from a ping across the zones. In this case, your query has to go to the other zone because the leader is there. In Dgraph, replicas are not used to answer queries; they replicate the data so that if the leader goes down, another election is held and one of the former replicas becomes the leader. All subsequent queries then go to this new leader.
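
A quick way to check where the group leaders currently live is Zero's /state endpoint (assuming Zero's HTTP port is the default 6080):

curl -s localhost:6080/state
# In each group, the member whose entry shows "leader": true is the node
# queries get routed through.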

Dgraph's current design supports replication, but it is not really geared towards geographic replication, since all operations (including reads) still go through the Raft process.

Can you try something? If you don't need strict read consistency, you could try best-effort queries (https://docs.dgraph.io/clients/#create-a-transaction), which do not go through Raft consensus. However, I am not sure whether those queries still need to go through the leader; if they don't, best-effort queries could serve as a temporary workaround.
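
As a rough sketch of what a best-effort read could look like over HTTP (assuming a release that exposes it via a be=true query parameter; the official clients expose the same thing as a best-effort flag on read-only transactions):

curl "http://$(hostname -i):8080/query?be=true&debug=true" -XPOST -d $'{
  everyone(func: anyofterms(name, "Michael Amit")) { name }
}'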

Longer term, we are working on cluster replication. Once that feature is out, you'll be able to run multiple clusters in different zones that operate independently of each other but eventually receive the same data. At that point, your queries in the ap-northeast zone won't have to talk to any alpha in another zone.

campoy (Contributor) commented Jul 19, 2019

Thanks for the explanation, @martinmr

I do wonder why this network latency appears as part of the encoding time.
Shouldn't it be somewhere else?

martinmr (Contributor)

Most likely a bug: the time spent receiving the response is probably being counted as part of the encoding time. I'll open a separate issue and look into it.

campoy (Contributor) commented Jul 19, 2019

OK, then we can close this issue, since some extra network latency is expected in this setup.

campoy closed this as completed Jul 19, 2019