
Add references to IMPORT INTO in relevant docs #5451


Merged: 1 commit merged into master from import-into on Nov 6, 2019

Conversation

lnhsingh (Contributor)

Closes #5180.

@lnhsingh lnhsingh requested a review from jseldess September 18, 2019 20:21
@cockroach-teamcity (Member)

This change is Reviewable

@jseldess jseldess (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess and @lhirata)


v19.2/insert.md, line 12 at r1 (raw file):

## Performance best practices

- <span class="version-tag">New in v19.2:</span> To bulk-insert data into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

I don't think we should equate "inserting multiple rows" with "bulk inserting". If you're inserting multiple rows at once, this is still more performant than issuing separate inserts. If you're inserting in bulk, `IMPORT INTO` is better. Can you chat with Roland or the devs on where we should draw the threshold, or how to know which to use?


v19.2/insert.md, line 147 at r1 (raw file):

### Insert multiple rows into an existing table

<span class="version-tag">New in v19.2:</span> To bulk-insert data into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

See above.


v19.2/performance-best-practices-overview.md, line 35 at r1 (raw file):

### Use `IMPORT INTO` instead of `INSERT` for bulk inserts into existing tables

<span class="version-tag">New in v19.2:</span> To bulk-insert data into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

I suspect this is the right guidance here, since we're talking "bulk", but I also suspect we'll still want to guide people when to use multi-row inserts. Same question as above, really.


v19.2/performance-tuning.md, line 913 at r1 (raw file):

{{site.data.alerts.callout_info}}
<span class="version-tag">New in v19.2:</span> To import bulk data from one or more CSV files into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

It's hard to see how this fits with what comes after. Do you want this to be a tip that `IMPORT INTO` is the most performant option for bulk-inserting into an existing table when you can format the data as CSVs?


v19.2/sql-faqs.md, line 12 at r1 (raw file):

- To bulk-insert data into a brand new table, use the [`IMPORT`](import.html) statement.
- <span class="version-tag">New in v19.2:</span> To bulk-insert data into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

Depending on what you find, you might still want to mention multi-row inserts as well.

@lnhsingh lnhsingh (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/insert.md, line 12 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I don't think we should equate "inserting multiple rows" with "bulk inserting". If you're inserting multiple rows at once, this is still more performant than issuing separate inserts. If you're inserting in bulk, `IMPORT INTO` is better. Can you chat with Roland or the devs on where we should draw the threshold, or how to know which to use?

When I talked to @rolandcrosby, he mentioned that it's kind of a grey area and maybe we could lay out that new in 19.2, you can import from CSV into an existing table and let the user decide what to do. Do you think specifying here (and where you commented elsewhere) "To insert bulk data from a CSV file into an existing table, use IMPORT INTO" differentiates itself enough from "inserting multiple rows"?
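For reference, a minimal sketch of the statement under discussion, assuming a hypothetical `users` table and a placeholder storage URL:

```sql
-- Bulk-insert CSV data into an existing table (v19.2 IMPORT INTO syntax).
-- The table name, column list, and URL are illustrative placeholders.
IMPORT INTO users (id, name, city)
    CSV DATA ('gs://acme-bucket/users.csv');
```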

@jseldess (Contributor)


v19.2/insert.md, line 12 at r1 (raw file):

Previously, lhirata wrote…

When I talked to @rolandcrosby, he mentioned that it's kind of a grey area and maybe we could lay out that new in 19.2, you can import from CSV into an existing table and let the user decide what to do. Do you think specifying here (and where you commented elsewhere) "To insert bulk data from a CSV file into an existing table, use IMPORT INTO" differentiates itself enough from "inserting multiple rows"?

Can you look into this more? I'd like us to understand when to tell people to use multi-row inserts vs. `IMPORT INTO`. I suspect multi-row inserts will be for smaller batches, where you get better performance from a single batch (`insert into xxx values (a, 1), (b, 2), (c, 3);`) than from separate inserts (`insert into xxx values (a, 1);`, `insert into xxx values (b, 2);`, `insert into xxx values (c, 3);`). You should ask some engineers, like David Taylor and probably Ben Darnell.
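To make the comparison concrete, a sketch of the two patterns described above, using the hypothetical table `xxx` from the comment (values quoted so the statements parse):

```sql
-- One batched statement: a single round trip and a single write batch.
INSERT INTO xxx VALUES ('a', 1), ('b', 2), ('c', 3);

-- Three separate statements: three round trips, generally slower.
INSERT INTO xxx VALUES ('a', 1);
INSERT INTO xxx VALUES ('b', 2);
INSERT INTO xxx VALUES ('c', 3);
```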

@lnhsingh lnhsingh (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/import-into.md, line 200 at r1 (raw file):

- The table will be taken offline during the incremental import.

Known issues that will be addressed in upcoming alpha releases:

@dt, are these known issues resolved?

@dt dt (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/import-into.md, line 200 at r1 (raw file):

Previously, lhirata wrote…

@dt, are these known issues resolved?

Yes. One additional known limitation: the GC TTL on the table must allow enough time for the import to complete. If it doesn't, and the import hits an issue that causes it to fail after it has run longer than the TTL, there is a risk of the table becoming permanently unavailable: at that point, the data needed to roll back to the pre-import state may have been GC'ed, and the table may be irrecoverable and need to be restored from backup.

@rolandcrosby Best practice is to have regular backups anyway, but given that IMPORT INTO is a new feature, with a correspondingly elevated potential for bugs, plus the known table-loss edge case above, do we need to add any special advice?
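One way to apply the mitigation implied here is to raise the table's GC TTL before starting a long import; a sketch, with a placeholder table name and TTL value:

```sql
-- Keep MVCC history for 24 hours (86400 seconds) so that a failed
-- IMPORT INTO can still roll back to the pre-import state.
ALTER TABLE users CONFIGURE ZONE USING gc.ttlseconds = 86400;
```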

@dt dt (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/import-into.md, line 200 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Yes. One additional known limitation: the GC TTL on the table must allow enough time for the import to complete. If it doesn't, and the import hits an issue that causes it to fail after it has run longer than the TTL, there is a risk of the table becoming permanently unavailable: at that point, the data needed to roll back to the pre-import state may have been GC'ed, and the table may be irrecoverable and need to be restored from backup.

@rolandcrosby Best practice is to have regular backups anyway, but given that IMPORT INTO is a new feature, with a correspondingly elevated potential for bugs, plus the known table-loss edge case above, do we need to add any special advice?

I'd also move "The table will be taken offline during the incremental import" to the top, and/or give it a call-out box, as it is a potentially production-disrupting detail.

@lnhsingh lnhsingh (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/import-into.md, line 200 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'd also move "The table will be taken offline during the incremental import" to the top, and/or give it a call-out box, as it is a potentially production-disrupting detail.

Removed the alpha-release-related limitations and added a new limitation callout.

@lnhsingh (Contributor, Author)

Note: SQL diagram also needs to be added. The SVG needs to be generated once cockroachdb/cockroach#41214 is merged.

@lnhsingh (Contributor, Author)

Might want to also add to Known Limitations that you can't use `IMPORT INTO` on interleaved tables.

@jseldess jseldess (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v19.2/import-into.md, line 200 at r1 (raw file):

Previously, lhirata wrote…

Removed the alpha-release-related limitations and added a new limitation callout.

Updated this list. @dt, does it look accurate and complete?


v19.2/insert.md, line 12 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Can you look into this more? I'd like us to understand when to tell people to use multi-row inserts vs. `IMPORT INTO`. I suspect multi-row inserts will be for smaller batches, where you get better performance from a single batch (`insert into xxx values (a, 1), (b, 2), (c, 3);`) than from separate inserts (`insert into xxx values (a, 1);`, `insert into xxx values (b, 2);`, `insert into xxx values (c, 3);`). You should ask some engineers, like David Taylor and probably Ben Darnell.

After reading the warning at the top of the incremental import page about the feature not being ready for production use, I've changed my mind here. I've reverted to what we had and added a note about the new feature, but with a warning.


v19.2/insert.md, line 147 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

See above.

Reverted and added note, as mentioned above.


v19.2/performance-best-practices-overview.md, line 35 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I suspect this is the right guidance here, since we're talking "bulk", but I also suspect we'll still want to guide people when to use multi-row inserts. Same question as above, really.

Reverted and added note, as mentioned above.


v19.2/performance-tuning.md, line 913 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

It's hard to see how this fits with what comes after. Do you want this to be a tip that `IMPORT INTO` is the most performant option for bulk-inserting into an existing table when you can format the data as CSVs?

Removed and added qualified note at end of section.


v19.2/performance-tuning-insecure.md, line 763 at r3 (raw file):

{{site.data.alerts.callout_info}}
<span class="version-tag">New in v19.2:</span> To import bulk data from one or more CSV files into an existing table, use the [`IMPORT INTO`](import-into.html) statement.

Removed and added qualified note at end of section.


v19.2/sql-faqs.md, line 12 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Depending on what you find, you might still want to mention multi-row inserts as well.

Reworked to still emphasize multi-row inserts but with a note about import into.

@jseldess (Contributor)

@dt and @rolandcrosby, PTAL one more time.

@jseldess (Contributor)

jseldess commented Nov 4, 2019

@dt, @rolandcrosby, I'd really like to wrap this up soon. Can one of you please check the latest changes?

- `IMPORT INTO` **cannot**:
    - Be used within a [transaction](transactions.html).
    - Replace a row.
    - Add a row that conflicts with a `UNIQUE` [index](indexes.html).
    - Be used on [indexed](indexes.html) or [interleaved](interleave-in-parent.html) tables.
- `IMPORT` can sometimes fail with a "context canceled" error, or can restart itself many times without ever finishing. If this is happening, it is likely due to a high amount of disk contention. This can be mitigated by setting the `kv.bulk_io_write.max_rate` [cluster setting](cluster-settings.html) to a value below your max disk write speed. For example, to set it to 10MB/s, execute the statement sketched after this list.
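The example statement referenced in the last bullet appears to have been dropped from this excerpt; a sketch of what it would look like (the actual docs may format it differently):

```sql
-- Throttle bulk IO writes to 10MB/s to reduce disk contention during IMPORT.
SET CLUSTER SETTING kv.bulk_io_write.max_rate = '10MB';
```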
@dt dt (Member) left a comment

We removed the no-indexes limitation, though the "replace a row" limitation above includes unique secondary indexes. The limitation is "no imported row may conflict with an existing row (in the primary or any unique secondary index) or the import will be rejected."
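To make that rule concrete, a hypothetical schema (names are illustrative): an `IMPORT INTO` this table is rejected if any imported row duplicates an existing `id` (primary key) or an existing `email` (unique secondary index).

```sql
-- Conflicts on the primary key (id) or the unique secondary index (email)
-- cause an IMPORT INTO of overlapping rows to be rejected.
CREATE TABLE accounts (
    id INT PRIMARY KEY,
    email STRING UNIQUE,
    name STRING
);
```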

@@ -184,23 +182,18 @@ Google Cloud:
## Known limitations
@dt dt (Member) left a comment

probably should add "the table's configured GC TTL must be longer than the IMPORT will take to complete"

@@ -125,6 +125,10 @@ This example uses the `users` table from our open-source, fictional peer-to-peer

The [`SHOW RANGE ... FOR ROW`](show-range-for-row.html) statement shows information about a [range](architecture/overview.html#glossary) for a particular row of data. This information is useful for verifying how SQL data maps to underlying ranges, and where the replicas for a range are located.

### Incremental import into an existing table

The [`IMPORT INTO`](import-into.html) statement incrementally imports CSV data into an existing table.
@dt dt (Member) left a comment

I don't think I understand what "incrementally" means in this context.

@rolandcrosby rolandcrosby left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, and @rolandcrosby)


v19.2/import-into.md, line 14 at r5 (raw file):

{{site.data.alerts.callout_danger}}
**This is an [experimental feature](experimental-features.html) and should not be used in production due to correctness issues.** See [Known limitations](#known-limitations) below for more information.

We are not labeling IMPORT INTO as experimental for 19.2. The alpha releases were experimental and not ready for production use, but the version in the 19.2 final release is not experimental. There are some known limitations, which may justify calling it a beta feature for this release, but I'm not even sure that that's necessary.


v19.2/insert.md, line 147 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Reverted and added note, as mentioned above.

The known limitations don't preclude production use anymore. I would remove the second sentence.


v19.2/performance-best-practices-overview.md, line 35 at r1 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Reverted and added note, as mentioned above.

Same note as above re: known limitations.


v19.2/performance-tuning.md, line 969 at r5 (raw file):

{{site.data.alerts.callout_info}}
<span class="version-tag">New in v19.2:</span> You can also use the [`IMPORT INTO`](import-into.html) statement to bulk-insert CSV data into an existing table. However, note that this is an experimental feature and should not be used in production due to [known limitations](import-into.html#known-limitations).

Same note as above re: known limitations.


v19.2/performance-tuning-insecure.md, line 821 at r5 (raw file):

{{site.data.alerts.callout_info}}
<span class="version-tag">New in v19.2:</span> You can also use the [`IMPORT INTO`](import-into.html) statement to bulk-insert CSV data into an existing table. However, note that this is an experimental feature and should not be used in production due to [known limitations](import-into.html#known-limitations).

Same note as above re: known limitations.


v19.2/sql-faqs.md, line 13 at r5 (raw file):

    {{site.data.alerts.callout_info}}
    <span class="version-tag">New in v19.2:</span> You can also use the [`IMPORT INTO`](import-into.html) statement to bulk-insert CSV data into an existing table. However, note that this is an experimental feature and should not be used in production due to [known limitations](import-into.html#known-limitations).

Same note as above re: known limitations.

@jseldess jseldess (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lhirata, and @rolandcrosby)


v19.2/experimental-features.md, line 130 at r5 (raw file):

Previously, dt (David Taylor) wrote…

I don't think I understand what "incrementally" means in this context.

Removing since this feature isn't actually experimental.


v19.2/import-into.md, line 14 at r5 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

We are not labeling IMPORT INTO as experimental for 19.2. The alpha releases were experimental and not ready for production use, but the version in the 19.2 final release is not experimental. There are some known limitations, which may justify calling it a beta feature for this release, but I'm not even sure that that's necessary.

OK, removed this and refactored other aspects of this PR as well.


v19.2/import-into.md, line 191 at r5 (raw file):

Previously, dt (David Taylor) wrote…

We removed the no-indexes limitation, though the "replace a row" limitation above includes unique secondary indexes. The limitation is "no imported row may conflict with an existing row (in the primary or any unique secondary index) or the import will be rejected."

Revised. PTAL.


v19.2/insert.md, line 147 at r1 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

The known limitations don't preclude production use anymore. I would remove the second sentence.

Done.


v19.2/performance-best-practices-overview.md, line 35 at r1 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

Same note as above re: known limitations.

Done.


v19.2/performance-tuning.md, line 969 at r5 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

Same note as above re: known limitations.

Done.


v19.2/performance-tuning-insecure.md, line 821 at r5 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

Same note as above re: known limitations.

Done.


v19.2/sql-faqs.md, line 13 at r5 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

Same note as above re: known limitations.

Done.

@jseldess jseldess merged commit d621681 into master Nov 6, 2019
@jseldess jseldess deleted the import-into branch November 6, 2019 20:01