Support Output MySQL format #6791

Closed · Tracked by #7876
doki23 opened this issue Jul 25, 2022 · 8 comments
Labels: A-query Area: databend query

Comments

doki23 (Contributor) commented Jul 25, 2022

Summary
Execution time of the query select * from customer limit 1000000 (customer is generated by tpc-h dbgen with a 100GB scale factor) is on the order of 10 seconds, while the same query takes less than 1 second in MySQL.

PS: tested on my own MacBook with a cluster of 3 query nodes.

pprof: (profiling screenshot attached, 2022-07-25 12:39:51)

BohuTANG (Member) commented Jul 25, 2022

Could you try a single databend-query node and test again? It is not useful to deploy 3 nodes on a single machine (they become IO- and CPU-bound).
If you are using the local fs backend, there are some known performance issues.

doki23 (Contributor, Author) commented Jul 25, 2022

Yes, I used the local fs as the storage engine. I tested with 1 query node before and it was only a little better. I'll test it again later.

BohuTANG (Member) commented

Thank you.
Databend is not designed or optimized for local FS; we only support it for testing. There are some issues (e.g. tokio-rs/tokio#3664) that slow down its IO. Databend is optimized for cloud object storage, such as S3-compatible services and Azure Blob storage.

doki23 (Contributor, Author) commented Jul 25, 2022

Yes, I actually did not realize this. But the problem does not seem to be the storage engine; the performance bottleneck is in DFQueryResultWriter, i.e. between the MySQL client and the query server. Or am I misunderstanding?

BohuTANG (Member) commented Jul 26, 2022

Hmm, could you post the MySQL client query status, like:

select * from numbers(1000000) limit 1000000;

[result snippet]

1000000 rows in set (0.27 sec)            -- total cost: query execution plus sending the result to the client
Read 1000000 rows, 7.63 MiB in 0.009 sec  -- time for the query server to execute the query

DFQueryResultWriter in the MySQL handler is not a streaming writer; for a query with limit 1000000, most of the cost is sending the result to the client.
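
A minimal sketch of the difference, using std::io::Write as a stand-in for the client connection (not the actual msql-srv/Databend API):

```rust
use std::io::{BufWriter, Write};

// Non-streaming: materialize the whole payload before anything is sent;
// the client sees nothing until every row has been serialized.
fn send_buffered<W: Write>(sink: &mut W, rows: &[String]) -> std::io::Result<()> {
    let mut payload = Vec::new();
    for row in rows {
        payload.extend_from_slice(row.as_bytes());
        payload.push(b'\n');
    }
    sink.write_all(&payload)
}

// Streaming: write each row as it is produced, so serialization overlaps
// with the network transfer; BufWriter keeps per-row syscalls cheap.
fn send_streaming<W: Write>(sink: &mut W, rows: &[String]) -> std::io::Result<()> {
    let mut out = BufWriter::new(sink);
    for row in rows {
        out.write_all(row.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()
}
```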

doki23 (Contributor, Author) commented Jul 26, 2022

Hmm, could you post the MySQL client query status

1000000 rows in set (11.30 sec)
Read 1000000 rows, 155.81 MiB in 1.199 sec., 833.69 thousand rows/sec., 129.89 MiB/sec.

Table customer contains many string columns, and StringColumn::get(&self, index: usize) costs a lot of time. I think it's because converting [u8] to a Vec requires many copy operations, and DFQueryResultWriter is not cache friendly.
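
For illustration, a minimal sketch of why a per-cell copying getter is expensive (hypothetical column layout, not the actual Databend StringColumn):

```rust
// Hypothetical layout: one contiguous byte buffer plus row offsets,
// similar in spirit to Arrow-style string columns.
struct StringColumn {
    data: Vec<u8>,
    offsets: Vec<usize>, // offsets.len() == number of rows + 1
}

impl StringColumn {
    // Copying getter: allocates and copies a Vec for every row,
    // which adds up quickly when the result writer touches every cell.
    fn get_owned(&self, index: usize) -> Vec<u8> {
        self.data[self.offsets[index]..self.offsets[index + 1]].to_vec()
    }

    // Borrowing getter: returns a slice into the existing buffer,
    // no allocation or copy per row.
    fn get_ref(&self, index: usize) -> &[u8] {
        &self.data[self.offsets[index]..self.offsets[index + 1]]
    }
}

fn main() {
    let col = StringColumn { data: b"abcdef".to_vec(), offsets: vec![0, 3, 6] };
    assert_eq!(col.get_owned(0), b"abc".to_vec()); // copies
    assert_eq!(col.get_ref(1), &b"def"[..]);       // borrows
}
```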

sundy-li (Member) commented

I think it's because converting [u8] to a Vec requires many copy operations, and DFQueryResultWriter is not cache friendly.

That's right. We can refactor DFQueryResultWriter into an output MySQL format like the Tsv/Csv/Parquet ones (https://github.com/datafuselabs/databend/blob/136196aba2127b6e62535d1336357d92b6e3a9eb/common/formats/src/output_format_parquet.rs).
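
A rough sketch of that direction, with assumed trait and type names (not the actual common/formats interface): serialize a whole block in one pass, borrowing values instead of allocating per cell:

```rust
// Stand-in for a data block: columns of UTF-8 values. Illustrative only.
struct Block {
    columns: Vec<Vec<String>>,
    num_rows: usize,
}

// Hypothetical trait in the spirit of the existing Tsv/Csv/Parquet formats.
trait OutputFormat {
    /// Serialize a whole block into bytes in a single pass.
    fn serialize_block(&mut self, block: &Block) -> Vec<u8>;
}

struct MySQLTextOutputFormat;

impl OutputFormat for MySQLTextOutputFormat {
    fn serialize_block(&mut self, block: &Block) -> Vec<u8> {
        let mut out = Vec::new();
        for row in 0..block.num_rows {
            for col in &block.columns {
                let v = col[row].as_bytes();
                // Simplified length-encoded string (values shorter than 251
                // bytes); real MySQL packets also need headers/sequence ids.
                out.push(v.len() as u8);
                out.extend_from_slice(v);
            }
        }
        out
    }
}

fn main() {
    let block = Block {
        columns: vec![vec!["alice".to_string()], vec!["42".to_string()]],
        num_rows: 1,
    };
    let mut fmt = MySQLTextOutputFormat;
    assert_eq!(fmt.serialize_block(&block), b"\x05alice\x0242");
}
```

Batching per block like this would also let the writer reuse one output buffer per block instead of allocating for every cell.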

@sundy-li sundy-li changed the title Poor performance of StringColumn::get and RowWriter::end_row in DFQueryResultWriter Support Output MySQL format Jul 26, 2022
@sundy-li sundy-li added the A-query Area: databend query label Jul 26, 2022
youngsofun (Member) commented

not needed in practice
