
[Bug]: Bulk insert fails when a nullable/default_value field does not exist in the inserted JSON file #39036

binbinlv opened this issue Jan 7, 2025 · 2 comments
binbinlv commented Jan 7, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-latest/2.5-latest
- Deployment mode(standalone or cluster):both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): master/2.5 latest
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Bulk insert fails when a nullable/default_value field does not exist in the JSON file:

<Bulk insert state
    - taskID          : 455120973545634129,
    - state           : Failed,
    - row_count       : 0,
    - infos           : {'failed_reason': "value of field 'int_scalar' is missed: importing data failed", 'progress_percent': '0'},
    - id_ranges       : [],
    - create_ts       : 2025-01-07 12:49:00
>
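For context, the import file in this case follows Milvus's row-based JSON layout. A minimal sketch of a file that triggers the failure (the exact field names here are illustrative, loosely following the test's naming; the `{"rows": [...]}` wrapper is an assumption, as some versions also accept a bare list): the nullable `int_scalar` key is simply absent from every row.

```python
import json

# Hypothetical rows for a row-based JSON import file: the nullable
# field "int_scalar" is intentionally omitted from every row.
rows = [
    {"float_scalar": float(i), "varchar_scalar": f"row-{i}"}
    for i in range(3)
]

# Assumed row-based layout with a top-level "rows" key.
payload = {"rows": rows}
text = json.dumps(payload, indent=2)

# No row carries the nullable field; this is what the importer rejects
# with "value of field 'int_scalar' is missed".
assert all("int_scalar" not in r for r in json.loads(text)["rows"])
```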

Expected Behavior

Bulk insert succeeds: a nullable field missing from the file should be filled with null, and a field with a default_value should fall back to that default.
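The expected fill semantics can be sketched as a small pass over each imported row. This is a hypothetical helper, not Milvus's actual importer; `schema` here is a plain list of `(name, nullable, default)` tuples rather than a pymilvus schema object, and "no default" is represented as `None` for simplicity.

```python
# Hypothetical sketch of the expected null/default fill semantics.
schema = [
    ("int_scalar", True, None),    # nullable, no default -> filled with None
    ("float_scalar", False, 1.5),  # has a default_value  -> filled with 1.5
]

def fill_missing(row, schema):
    """Fill fields absent from a row per nullable/default_value rules."""
    out = dict(row)
    for name, nullable, default in schema:
        if name not in out:
            if default is not None:
                out[name] = default      # default_value wins when present
            elif nullable:
                out[name] = None         # nullable fields become null
            else:
                # mirrors the observed failure for non-nullable fields
                raise ValueError(f"value of field '{name}' is missed")
    return out

filled = fill_missing({}, schema)  # neither field present in the row
assert filled == {"int_scalar": None, "float_scalar": 1.5}
```

Under these semantics the repro below should import successfully rather than fail.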

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L3)
    @pytest.mark.parametrize("auto_id", [True])
    @pytest.mark.parametrize("dim", [128])  # 128
    @pytest.mark.parametrize("entities", [2000])
    @pytest.mark.parametrize("enable_dynamic_field", [True])
    @pytest.mark.parametrize("enable_partition_key", [True, False])
    # @pytest.mark.parametrize("nullable", [True, False])
    @pytest.mark.parametrize("nullable", [True])
    def test_bulk_insert_all_field_with_new_json_format(self, auto_id, dim, entities, enable_dynamic_field,
                                                        enable_partition_key, nullable):
        """
        collection schema 1: [pk, int64, float64, string float_vector]
        data file: vectors.npy and uid.npy,
        Steps:
        1. create collection
        2. import data
        3. verify
        """
        if enable_partition_key is True and nullable is True:
            pytest.skip("partition key field not support nullable")
        float_vec_field_dim = dim
        binary_vec_field_dim = ((dim+random.randint(-16, 32)) // 8) * 8
        bf16_vec_field_dim = dim+random.randint(-16, 32)
        fp16_vec_field_dim = dim+random.randint(-16, 32)
        fields = [
            cf.gen_int64_field(name=df.pk_field, is_primary=True, auto_id=auto_id),
            cf.gen_int64_field(name=df.int_field, nullable=nullable),
            cf.gen_float_field(name=df.float_field, nullable=nullable),
            cf.gen_string_field(name=df.string_field, is_partition_key=enable_partition_key, nullable=nullable),
            cf.gen_string_field(name=df.text_field, enable_analyzer=True, enable_match=True, nullable=nullable),
            cf.gen_json_field(name=df.json_field, nullable=nullable),
            cf.gen_array_field(name=df.array_int_field, element_type=DataType.INT64, nullable=nullable),
            cf.gen_array_field(name=df.array_float_field, element_type=DataType.FLOAT, nullable=nullable),
            cf.gen_array_field(name=df.array_string_field, element_type=DataType.VARCHAR, max_length=100, nullable=nullable),
            cf.gen_array_field(name=df.array_bool_field, element_type=DataType.BOOL, nullable=nullable),
            cf.gen_float_vec_field(name=df.float_vec_field, dim=float_vec_field_dim),
            cf.gen_binary_vec_field(name=df.binary_vec_field, dim=binary_vec_field_dim),
            cf.gen_bfloat16_vec_field(name=df.bf16_vec_field, dim=bf16_vec_field_dim),
            cf.gen_float16_vec_field(name=df.fp16_vec_field, dim=fp16_vec_field_dim)
        ]
        data_fields = [f.name for f in fields if not f.to_dict().get("auto_id", False)]
        data_fields.remove(df.int_field)
        self._connect()
        c_name = cf.gen_unique_str("bulk_insert")
        schema = cf.gen_collection_schema(fields=fields, auto_id=auto_id, enable_dynamic_field=enable_dynamic_field)

        files = prepare_bulk_insert_new_json_files(
            minio_endpoint=self.minio_endpoint,
            bucket_name=self.bucket_name,
            rows=entities,
            dim=dim,
            data_fields=data_fields,
            enable_dynamic_field=enable_dynamic_field,
            force=True,
            schema=schema
        )
        self.collection_wrap.init_collection(c_name, schema=schema)

        # import data
        t0 = time.time()
        task_id, _ = self.utility_wrap.do_bulk_insert(
            collection_name=c_name, files=files
        )
        log.info(f"bulk insert task ids:{task_id}")
        success, states = self.utility_wrap.wait_for_bulk_insert_tasks_completed(
            task_ids=[task_id], timeout=300
        )
        tt = time.time() - t0
        log.info(f"bulk insert state:{success} in {tt} with states:{states}")
        assert success

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22master-latest-gefzw.%2A%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D&panes=%7B%22q6c%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22master-latest-gefzw.%2A%5C%22%7D%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D

Anything else?

collection name: bulk_insert_jJBCSdkj

[2025/01/07 04:49:06.127 +00:00] [INFO] [datacoord/services.go:1788] ["GetImportProgress done"] [traceID=091fa76357147dc863d4564600fb2166] [jobID=455120973545634129] [resp="status:{} state:Failed reason:\"value of field 'int_scalar' is missed: importing data failed\" collection_name:\"bulk_insert_jJBCSdkj\" start_time:\"2025-01-07T04:49:00Z\""]
@binbinlv binbinlv added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 7, 2025
@binbinlv binbinlv added this to the 2.5.3 milestone Jan 7, 2025
binbinlv commented Jan 7, 2025

/assign @smellthemoon

@binbinlv binbinlv added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 7, 2025
@yanliang567 yanliang567 removed their assignment Jan 7, 2025
sre-ci-robot pushed a commit that referenced this issue Jan 9, 2025
…exist (#39063)

#39036

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jan 9, 2025
…exist(#39063) (#39111)

pr: #39063 
issue: #39036

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>