[Feature]: Refactor bulkinsert #28521

Closed
11 of 12 tasks
bigsheeper opened this issue Nov 17, 2023 · 4 comments

@bigsheeper
Contributor

bigsheeper commented Nov 17, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Refactor bulkinsert to accomplish the following objectives:

  1. Simplify and redesign the execution pathway for bulkinsert.
  2. Enable the processing of multiple input files.
  3. Support quota and limits for bulkinsert.
  4. Enhance the states and errors.
  5. Facilitate task retry and implement request-level timeouts.

Workloads/steps to refactor bulkinsert:

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@bigsheeper bigsheeper added the kind/feature Issues related to feature request from users label Nov 17, 2023
@bigsheeper bigsheeper self-assigned this Nov 17, 2023
sre-ci-robot pushed a commit that referenced this issue Dec 4, 2023
Define the new rpc and metadata for ImportV2.

see also: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@bigsheeper
Contributor Author

Issues to be resolved:

  1. Potential proliferation of numerous small binlogs under PartitionKey mode.
  2. Task scheduling at the collection level instead of a random distribution.

sre-ci-robot pushed a commit that referenced this issue Jan 5, 2024
This PR defines the new import reader interfaces and implements a binlog
reader for import.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 5, 2024
This PR implements a new json reader for import.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 7, 2024
This PR implements a Parquet reader for import.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 8, 2024
This PR implements a new numpy reader for import.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 31, 2024
issue: #28521 #29732

Includes:
1. List a collection's import jobs
2. Create a new import job
3. Get the progress of an import job

Fixes:
1. Mixed-up order of dbName and collectionName #29728
2. Keep the trace log the same as v1
3. Support traceID
4. Azure precheck: a blob name cannot end with / #29703

---------

Signed-off-by: PowderLi <min.li@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 31, 2024
This PR introduces new importv2 roles for datanode:
1. Executor: executes tasks. An import task is divided into the
following steps: read data -> hash data -> sync data.
2. Manager: manages all the tasks.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
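
The read data -> hash data -> sync data steps named in the commit above can be illustrated with a minimal Python sketch. Milvus itself is written in Go, and none of the names below (`read_data`, `hash_data`, `sync_data`, `execute_import_task`) come from the Milvus codebase. The sketch assumes integer primary keys, treats each "file" as a plain list of rows, and uses a dictionary in place of real storage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


def read_data(files: Iterable[List[Row]]) -> Iterator[Row]:
    """Step 1: yield rows from every input 'file' (hypothetical: plain lists)."""
    for rows in files:
        yield from rows


def hash_data(rows: Iterable[Row], num_shards: int, key: str = "id") -> Dict[int, List[Row]]:
    """Step 2: bucket rows by key modulo the shard count (assumes integer keys)."""
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        buckets[row[key] % num_shards].append(row)
    return dict(buckets)


def sync_data(buckets: Dict[int, List[Row]], storage: Dict[int, List[Row]]) -> int:
    """Step 3: append each bucket to its shard in 'storage' and count the rows written."""
    written = 0
    for shard, rows in buckets.items():
        storage.setdefault(shard, []).extend(rows)
        written += len(rows)
    return written


def execute_import_task(files: Iterable[List[Row]], num_shards: int,
                        storage: Dict[int, List[Row]]) -> int:
    """Run the three steps in order and return the number of rows written."""
    return sync_data(hash_data(read_data(files), num_shards), storage)
```

Keeping each step as a separate function mirrors the staged description in the commit message; how the real executor batches or parallelizes these steps is not stated in this thread.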
PowderLi added a commit to PowderLi/milvus that referenced this issue Feb 1, 2024
issue: milvus-io#28521 milvus-io#29732

Includes:
1. List a collection's import jobs
2. Create a new import job
3. Get the progress of an import job

Fixes:
1. Mixed-up order of dbName and collectionName milvus-io#29728
2. Keep the trace log the same as v1
3. Support traceID

---------

Signed-off-by: PowderLi <min.li@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 1, 2024
This PR introduces new management roles for importv2:
1. ImportMeta: manages all the import tasks;
2. ImportScheduler: processes tasks and updates their states;
3. ImportChecker: checks whether all tasks have completed and triggers
the relevant follow-up operations.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
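
As a rough illustration of how the three roles in the commit above could divide the work, here is a minimal Python sketch. Only the role names (`ImportMeta`, `ImportScheduler`, `ImportChecker`) come from the commit message; the task states, the method names, and the one-step-per-tick scheduling are assumptions made for this example and are not taken from the Milvus source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class TaskState(Enum):
    # Hypothetical states; the real state set is not given in this thread.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class ImportTask:
    task_id: int
    job_id: int
    state: TaskState = TaskState.PENDING


class ImportMeta:
    """Holds every import task, keyed by task ID."""

    def __init__(self) -> None:
        self._tasks: Dict[int, ImportTask] = {}

    def add(self, task: ImportTask) -> None:
        self._tasks[task.task_id] = task

    def by_job(self, job_id: int) -> List[ImportTask]:
        return [t for t in self._tasks.values() if t.job_id == job_id]

    def by_state(self, state: TaskState) -> List[ImportTask]:
        return [t for t in self._tasks.values() if t.state == state]


class ImportScheduler:
    """Processes tasks and modifies their states, one step per tick."""

    def __init__(self, meta: ImportMeta) -> None:
        self.meta = meta

    def tick(self) -> None:
        # Snapshot both lists first so a task advances only one state per tick.
        running = self.meta.by_state(TaskState.IN_PROGRESS)
        pending = self.meta.by_state(TaskState.PENDING)
        for task in running:
            task.state = TaskState.COMPLETED
        for task in pending:
            task.state = TaskState.IN_PROGRESS


class ImportChecker:
    """Reports whether every task of a job has completed."""

    def __init__(self, meta: ImportMeta) -> None:
        self.meta = meta

    def job_done(self, job_id: int) -> bool:
        tasks = self.meta.by_job(job_id)
        return bool(tasks) and all(t.state is TaskState.COMPLETED for t in tasks)
```

In this sketch a job with two pending tasks is reported as done after two scheduler ticks, and an unknown job is never reported as done.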
MrPresent-Han pushed a commit to MrPresent-Han/milvus that referenced this issue Mar 3, 2024
This PR introduces new management roles for importv2:
1. ImportMeta: manages all the import tasks;
2. ImportScheduler: processes tasks and updates their states;
3. ImportChecker: checks whether all tasks have completed and triggers
the relevant follow-up operations.

issue: milvus-io#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 5, 2024
Revise the RESTful bulk insert API from version 1 to version 2.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
PowderLi added a commit to PowderLi/milvus that referenced this issue Mar 8, 2024
issue: milvus-io#28521 milvus-io#29732

Includes:
1. List a collection's import jobs
2. Create a new import job
3. Get the progress of an import job

Fixes:
1. Mixed-up order of dbName and collectionName milvus-io#29728
2. Keep the trace log the same as v1
3. Support traceID

---------

Signed-off-by: PowderLi <min.li@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 10, 2024
…1046)

Replace the current import API v1 implementation with the v2
implementation.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 24, 2024
Return more fields in the import progress response, including importedRows
and totalRows. Additionally, ensure compatibility with the old import
progress response by retaining the create timestamp and row count fields.

issue: #31448
#31237
#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
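
To make the relationship between the two new fields concrete, here is a small Python sketch of how a client-side summary could be derived from per-file counts. The field names `importedRows` and `totalRows` come from the commit message above; the `progress` key, the function name, and the per-file input shape are assumptions for illustration only.

```python
from typing import Dict, Iterable, Tuple


def import_progress(file_stats: Iterable[Tuple[int, int]]) -> Dict[str, int]:
    """Aggregate (imported_rows, total_rows) pairs, one pair per input file."""
    stats = list(file_stats)
    imported = sum(done for done, _ in stats)
    total = sum(rows for _, rows in stats)
    # Integer percentage, rounded down; reported as 0 when nothing is expected yet.
    percent = 0 if total == 0 else imported * 100 // total
    return {"importedRows": imported, "totalRows": total, "progress": percent}
```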
bigsheeper added a commit to bigsheeper/milvus that referenced this issue Mar 25, 2024
)

Return more fields in the import progress response, including importedRows
and totalRows. Additionally, ensure compatibility with the old import
progress response by retaining the create timestamp and row count fields.

issue: milvus-io#31448
milvus-io#31237
milvus-io#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 25, 2024
)

Return more fields in the import progress response, including importedRows
and totalRows. Additionally, ensure compatibility with the old import
progress response by retaining the create timestamp and row count fields.

issue: #31448
#31237
#28521

pr: #31539

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Mar 25, 2024
…31497) (#31542)

The max number of import files per request should not exceed 1024 by
default (configurable).
The import file size allowed for importing should not exceed 16GB by
default (configurable).

issue: #28521

pr: #31497

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
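
The two default limits described above can be expressed as a simple pre-check. This Python sketch is illustrative only: the constant and function names are hypothetical, the limit is assumed to apply to each file individually, and 16GB is assumed to mean 16 GiB. The commit message does not settle either point.

```python
from typing import Mapping

# Defaults taken from the commit message; both are described there as configurable.
MAX_IMPORT_FILES = 1024
MAX_IMPORT_FILE_SIZE_BYTES = 16 * 1024**3  # assumption: "16GB" read as 16 GiB


class ImportLimitError(ValueError):
    """Raised when an import request exceeds a configured limit."""


def validate_import_request(
    file_sizes: Mapping[str, int],
    max_files: int = MAX_IMPORT_FILES,
    max_file_size: int = MAX_IMPORT_FILE_SIZE_BYTES,
) -> None:
    """Reject requests with too many files or with any single oversized file."""
    if len(file_sizes) > max_files:
        raise ImportLimitError(
            f"{len(file_sizes)} files in request, limit is {max_files}"
        )
    for path, size in file_sizes.items():
        if size > max_file_size:
            raise ImportLimitError(
                f"{path} is {size} bytes, limit is {max_file_size}"
            )
```

Passing the limits as parameters with defaults keeps them overridable, in line with the "configurable" wording above.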
sre-ci-robot pushed a commit that referenced this issue Mar 25, 2024
To reduce the overhead caused by listing the S3 objects, add an
interface to importutil.Reader to retrieve file sizes.

issue: #31532,
#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
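
The idea of letting a reader report its own size, so callers do not need a separate object listing, can be sketched in Python as follows. This is not the `importutil.Reader` interface from Milvus (which is Go); the class names, the `size` method, and the in-memory backing are assumptions chosen to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class Reader(ABC):
    """Hypothetical reader interface with a size accessor."""

    @abstractmethod
    def read(self) -> bytes:
        """Return the full contents of the underlying file."""

    @abstractmethod
    def size(self) -> int:
        """Return the file size in bytes without a separate listing call."""


class InMemoryReader(Reader):
    """Stand-in for a storage-backed reader; holds the bytes directly."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read(self) -> bytes:
        return self._data

    def size(self) -> int:
        return len(self._data)


def total_size(readers: Iterable[Reader]) -> int:
    """Sum file sizes by asking each reader, with no bucket listing involved."""
    return sum(r.size() for r in readers)
```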
sre-ci-robot pushed a commit that referenced this issue Mar 26, 2024
#31594)

To reduce the overhead caused by listing the S3 objects, add an
interface to importutil.Reader to retrieve file sizes.

issue: #31532,
#28521

pr: #31533

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 1, 2024
…ome logic (#31629)

Feature Introduced:
1. Ensure ImportV2 waits for the index to be built

Enhancements Introduced:
1. Utilization of local time for timeout ts instead of allocating ts
from rootcoord.
2. Enhanced input file length check for binlog import.
3. Removal of duplicated manager in datanode.
4. Renaming of executor to scheduler in datanode.
5. Utilization of a thread pool in the scheduler in datanode.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
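
Two of the enhancements above, computing the timeout from local time and running tasks on a thread pool, can be sketched in Python. The function names and the `now` parameter (added so the logic can be checked deterministically) are assumptions for this example; the actual datanode scheduler is written in Go and is not shown in this thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def timeout_deadline(timeout_seconds: float, now: Optional[float] = None) -> float:
    """Compute a deadline from the local clock, with no remote timestamp allocation."""
    start = time.time() if now is None else now
    return start + timeout_seconds


def is_timed_out(deadline: float, now: Optional[float] = None) -> bool:
    """Return True once the local clock has reached the deadline."""
    current = time.time() if now is None else now
    return current >= deadline


def run_tasks(tasks: Sequence[Callable[[], T]], max_workers: int = 4) -> List[T]:
    """Run the callables on a bounded thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: task(), tasks))
```

A bounded pool caps how many import tasks run at once on a node, which is one plausible reason for the change; the commit message itself does not give the motivation.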
sre-ci-robot pushed a commit that referenced this issue Apr 1, 2024
…ome logic (#31629) (#31733)

Feature Introduced:
1. Ensure ImportV2 waits for the index to be built

Enhancements Introduced:
1. Utilization of local time for timeout ts instead of allocating ts
from rootcoord.
2. Enhanced input file length check for binlog import.
3. Removal of duplicated manager in datanode.
4. Renaming of executor to scheduler in datanode.
5. Utilization of a thread pool in the scheduler in datanode.

issue: #28521

pr: #31629

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 8, 2024
…#31937)

Use an individual buffer size parameter for imports and set buffer size
to 64MB.

issue: #28521

pr: #31833

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 8, 2024
Use an individual buffer size parameter for imports and set buffer size
to 64MB.

issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 11, 2024
issue: #28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
bigsheeper added a commit to bigsheeper/milvus that referenced this issue Apr 11, 2024
issue: milvus-io#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 11, 2024
issue: #28521

pr: #32112

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 15, 2024
During binlog import, even if the primary key's autoID is set to true,
the primary key from the binlog should be used instead of being
reassigned.

issue: #31943,
#28521

pr: #32118

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
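
The rule described above, that binlog imports keep their existing primary keys even when autoID is enabled, can be illustrated with a short Python sketch. The function name, its parameters, and the use of an iterator as an ID allocator are hypothetical and not taken from the Milvus source.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]


def assign_primary_keys(
    rows: List[Row],
    auto_id: bool,
    from_binlog: bool,
    id_allocator: Iterator[int],
    pk_field: str = "id",
) -> List[Row]:
    """Return copies of the rows with primary keys resolved.

    Binlog rows always keep the key they already carry. Other rows receive a
    freshly allocated key only when auto_id is enabled.
    """
    if from_binlog or not auto_id:
        for row in rows:
            if pk_field not in row:
                raise ValueError(f"row is missing primary key field {pk_field!r}")
        return [dict(row) for row in rows]
    return [{**row, pk_field: next(id_allocator)} for row in rows]
```

For example, with `auto_id=True` a binlog row carrying `id` 7 keeps that value, while a row from a non-binlog source is given the next value from the allocator.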
sre-ci-robot pushed a commit that referenced this issue Apr 16, 2024
During binlog import, even if the primary key's autoID is set to true,
the primary key from the binlog should be used instead of being
reassigned.

issue: #31943,
#28521

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Apr 18, 2024
)

During binlog import, even if the primary key's autoID is set to true,
the primary key from the binlog should be used instead of being
reassigned.

issue: milvus-io#31943,
milvus-io#28521

pr: milvus-io#32118

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@bigsheeper
Contributor Author

done
