
*: parse the data source directly into data and skip the KV encoder #145

Merged
@kennytm merged 17 commits into master from kennytm/parse-faster on Apr 3, 2019

Conversation

@kennytm (Collaborator) commented Mar 14, 2019

What problem does this PR solve?

In our testing with a 4.1 TB workload, we found that parsing SQL takes almost half of the time needed to encode a row. Since we have already used a parser to extract each row, parsing it again wastes computing resources. Additionally, for CSV we had to perform the convoluted and unnecessary pipeline Parse CSV → Reconstruct SQL → Parse SQL.

What is changed and how it works?

We change the Lightning parsers to produce an array of types.Datum directly, for both CSV and SQL. We also remove the KvEncoder abstraction layer (it only accepts SQL statements) and call (*table.Table).AddRecord directly to convert the []types.Datum into KV pairs.

This cuts the encoding time in half in our experiments.
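For illustration, the new encoding path looks roughly like the sketch below. This is not the merged code: `kvPair` and `drainBuffer` are hypothetical stand-ins for the mock-session machinery in lightning/kv/session.go, and the TiDB signatures follow the circa-2019 API, which has changed in later versions.

```go
import (
	"github.com/pingcap/tidb/sessionctx"
	"github.com/pingcap/tidb/table"
	"github.com/pingcap/tidb/types"
)

// kvPair is a hypothetical stand-in for one encoded key-value mutation.
type kvPair struct {
	Key []byte
	Val []byte
}

// encodeRow sketches the new path: the parsed []types.Datum goes straight
// into (*table.Table).AddRecord. The row/index encoding performed by
// AddRecord lands in the session's in-memory transaction buffer (Lightning
// mocks the session, so nothing is actually written to TiKV); drainBuffer
// is a hypothetical accessor that empties that buffer.
func encodeRow(tbl table.Table, se sessionctx.Context, row []types.Datum,
	drainBuffer func() []kvPair) ([]kvPair, error) {
	if _, err := tbl.AddRecord(se, row); err != nil {
		return nil, err
	}
	return drainBuffer(), nil
}
```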

Check List

Tests

  • Unit test
  • Integration test

Side effects

Related changes

@lonng added the status/WIP (Work in progress), status/DNM (Do not merge, test is failing or blocked by another PR), priority/important, and type/enhancement (Performance improvement or refactoring) labels on Mar 15, 2019
@kennytm force-pushed the kennytm/parse-faster branch 2 times, most recently from f81aead to 44db439, on March 15, 2019
@kennytm removed the status/WIP (Work in progress) label on Mar 15, 2019
@kennytm changed the title from "[WIP] *: parse the data source directly into data and skip the KV encoder" to "*: parse the data source directly into data and skip the KV encoder" on Mar 15, 2019
@kennytm (Collaborator, Author) commented Mar 15, 2019

@lonng Some metrics have been temporarily removed; we need to decide whether to tweak the metrics or the process. The old process:

  • Read a block (64 KB) of SQL values → Reconstruct INSERT statement → Execute INSERT statement to get KV pairs → Deliver block

New process:

  • Read a single row and copy to datum array → Encode to get KV pairs of this row → Buffer until 64 KB → Deliver buffer
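In sketch form, the new per-row loop could look like this (hypothetical names, not the actual restore code; it reuses the `kvPair` and `types.Datum` stand-ins from the sketch in the PR description):

```go
const minDeliverBytes = 64 * 1024 // deliver once ~64 KB is buffered

func deliverLoop(rows <-chan []types.Datum, out chan<- []kvPair,
	encode func([]types.Datum) ([]kvPair, error)) error {
	var buf []kvPair
	size := 0
	for row := range rows {
		pairs, err := encode(row) // per-row AddRecord-based encoding
		if err != nil {
			return err
		}
		for _, p := range pairs {
			size += len(p.Key) + len(p.Val)
		}
		buf = append(buf, pairs...)
		if size >= minDeliverBytes {
			out <- buf // deliver the accumulated batch
			buf, size = nil, 0
		}
	}
	if len(buf) > 0 {
		out <- buf // flush the tail
	}
	return nil
}
```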

The following metrics are affected by this change and may need to be repurposed:

  • ChunkParserReadRowSecondsHistogram
  • BlockReadSecondsHistogram (concept of "block" no longer applies)
  • BlockReadBytesHistogram
  • BlockEncodeSecondsHistogram

@lonng (Contributor) commented Mar 16, 2019

@kennytm I think we could add some new metrics covering:

  • Encode to get KV pairs duration → data part + index part
  • Data KV size
  • Index KV size
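For instance, with client_golang the data/index split could be declared as a histogram vec along these lines (a sketch; metric names and buckets are illustrative, not necessarily what was merged):

```go
import "github.com/prometheus/client_golang/prometheus"

// One histogram per delivery, partitioned by a "kind" label.
var blockDeliverBytesHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "lightning",
		Name:      "block_deliver_bytes",
		Help:      "number of bytes delivered in one batch",
		Buckets:   prometheus.ExponentialBuckets(512, 2, 10),
	},
	[]string{"kind"}, // kind=data or kind=index
)

// Usage, after prometheus.MustRegister(blockDeliverBytesHistogram):
//   blockDeliverBytesHistogram.WithLabelValues("data").Observe(float64(dataSize))
```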

@kennytm force-pushed the kennytm/parse-faster branch 2 times, most recently from b83efb5 to 5624ac0, on March 19, 2019
@kennytm added the status/PTAL (This PR is ready for review) and status/WIP (Work in progress) labels and removed the status/DNM (Do not merge, test is failing or blocked by another PR) and status/PTAL labels on Mar 20, 2019
This skips the more complex pingcap/parser and speeds up parsing by 50%.

We have also refactored the KV delivery mechanism to use channels
directly, and revamped the metrics:

- Moved the engine-related metrics into their own `engines` counter. The
  `tables` counter is now exclusively about tables.
- Removed `block_read_seconds`, `block_read_bytes`, and `block_encode_seconds`
  since the concept of "block" no longer applies. They are replaced by
  equivalents named `row_***`.
- Removed `chunk_parser_read_row_seconds` since it overlapped with
  `row_read_seconds`.
- Changed `block_deliver_bytes` into a histogram vec, with kind=index
  or kind=data. Introduced `block_deliver_kv_pairs`.
Only kill Lightning if the whole chunk has been imported exactly. The chunk
checkpoint may be recorded before the chunk is fully written, which would
hit the failpoint more than 5 times.
This helps debug some mysterious cancellations where the log is suppressed.

Added IsReallyContextCanceledError() for code logic that depends on the
error type.
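A minimal sketch of what such a helper could look like, assuming github.com/pingcap/errors for unwrapping (the merged version may differ):

```go
import (
	"context"

	"github.com/pingcap/errors"
)

// IsReallyContextCanceledError unwraps the annotated error chain and
// compares against context.Canceled, so that errors whose message merely
// contains "context canceled" are not misclassified.
func IsReallyContextCanceledError(err error) bool {
	return errors.Cause(err) == context.Canceled
}
```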
@kennytm added the status/PTAL (This PR is ready for review) label and removed the status/WIP (Work in progress) label on Mar 23, 2019
Review threads:

  • lightning/kv/session.go (outdated)
  • lightning/kv/sql2kv.go (outdated)
  • lightning/metric/metric.go (outdated)
  • lightning/mydump/csv_parser.go
  • lightning/restore/restore.go (outdated)
@kennytm (Collaborator, Author) commented Apr 2, 2019

/run-all-tests

@IANTHEREAL (Collaborator) left a comment:

I feel that Lightning lacks unit tests.

_, err = chunkStmt.ExecContext(
c, tableName, engineID,
value.Key.Path, value.Key.Offset, value.Columns, value.ShouldIncludeRowID,
value.Key.Path, value.Key.Offset, colPerm,
@IANTHEREAL (Collaborator) commented:

As far as I know, we only call InsertEngineCheckpoints after calling populateChunks. If so, the columns column is always empty.

@kennytm (Collaborator, Author) replied:

Removed this, and the checksum too (it was always 0 as well).

func (cr *chunkRestore) saveCheckpoint(t *TableRestore, engineID int32, rc *RestoreController) {
rc.saveCpCh <- saveCp{
tableName: t.tableName,
merger: &RebaseCheckpointMerger{
@IANTHEREAL (Collaborator) commented Apr 2, 2019:

Is it enough to only save it in L529? It seems AllocBase wouldn't change.

@kennytm (Collaborator, Author) replied:

Unfortunately no: AllocBase needs to be larger than every value of _tidb_rowid (or of the integer primary key), and that maximum cannot be determined until we've read all the data.

Added a comment for this.
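In other words, the checkpoint's AllocBase is a running maximum that can grow with every chunk, so it must be merged on each save rather than written once. A toy sketch of that max-merge, with hypothetical types standing in for the actual RebaseCheckpointMerger:

```go
// AllocBase must end up larger than every _tidb_rowid (or integer
// primary key) value in the data; that maximum is only known after
// the last row of the last chunk has been read.
type tableCheckpoint struct {
	AllocBase int64
}

// mergeRebase keeps AllocBase at the maximum row ID seen so far.
func (cp *tableCheckpoint) mergeRebase(lastRowID int64) {
	if lastRowID > cp.AllocBase {
		cp.AllocBase = lastRowID
	}
}
```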

{"t2", "CREATE TABLE `t2` (`c1` varchar(30000) NOT NULL)", "failed to ExecDDLSQL `mockdb`.`t2`:.*"},
{"t3", "CREATE TABLE `t3-a` (`c1-a` varchar(5) NOT NULL)", ""},
{"t1", "CREATE TABLE `t1` (`c1` varchar(5) NOT NULL)"},
// {"t2", "CREATE TABLE `t2` (`c1` varchar(30000) NOT NULL)"}, // no longer able to create this kind of table.
@IANTHEREAL (Collaborator) commented:

Is this case now meaningless? Or should we add some error cases?

@kennytm (Collaborator, Author) replied:

In this PR we no longer parse the CREATE TABLE DDL; instead we directly unmarshal the JSON result from TiDB (calling tables.TableFromMeta). So yes, this case becomes meaningless: either you cannot create a VARCHAR(30000) table in TiDB at all, or you can, and TiDB produces proper JSON which Lightning accepts without error.

Anyway, I added a separate unit test to ensure that a malformed table info produces an error.
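A sketch of this new path (import paths and signatures follow circa-2019 TiDB and may have moved since; `tableFromJSON` is an illustrative name):

```go
import (
	"encoding/json"

	"github.com/pingcap/parser/model"
	"github.com/pingcap/tidb/meta/autoid"
	"github.com/pingcap/tidb/table"
	"github.com/pingcap/tidb/table/tables"
)

// tableFromJSON builds a table.Table from the table-info JSON fetched from
// TiDB, instead of re-parsing the CREATE TABLE DDL. Malformed table info
// surfaces as an error, which the new unit test covers.
func tableFromJSON(raw []byte, alloc autoid.Allocator) (table.Table, error) {
	tblInfo := &model.TableInfo{}
	if err := json.Unmarshal(raw, tblInfo); err != nil {
		return nil, err
	}
	return tables.TableFromMeta(alloc, tblInfo)
}
```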

Review thread: tests/checkpoint_chunks/run.sh (outdated)
@@ -19,11 +19,12 @@ import (
"strings"

. "github.com/pingcap/check"
"github.com/pingcap/errors"
@IANTHEREAL (Collaborator) commented:

We aren't adding more tests for this, only keeping the old cases.

@kennytm (Collaborator, Author) commented Apr 2, 2019:

The old unit tests didn't compile since the interface changed. There are no new or deleted tests in this file.

@IANTHEREAL (Collaborator):

Rest LGTM

@kennytm (Collaborator, Author) commented Apr 2, 2019

/run-all-tests

@lonng (Contributor) commented Apr 3, 2019

Rest LGTM

@kennytm added the status/LGT1 (One reviewer already commented LGTM) label and removed the status/PTAL (This PR is ready for review) label on Apr 3, 2019
@kennytm (Collaborator, Author) commented Apr 3, 2019

@GregoryIan PTAL again

@IANTHEREAL (Collaborator):

LGTM

@IANTHEREAL added the status/LGT2 (Two reviewers already commented LGTM, ready for merge) label and removed the status/LGT1 (One reviewer already commented LGTM) label on Apr 3, 2019
@kennytm (Collaborator, Author) commented Apr 3, 2019

/run-all-tests

@kennytm merged commit 7bae12e into master on Apr 3, 2019
kennytm added a commit that referenced this pull request on Apr 4, 2019:
* tests: fix a test failure due to conflict between #145 and #158

* restore: apply the row count limit to failpoint KillIfImportedChunk too