[rank 0] [WARNING] filtering table whose nb sentences and nb sonar vectors are aligned, keeping 2 rows out of11490 #25

Description

@hasanyazarr
!CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation  \
  --predictor base_lcm \
  --model_card /content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model_card.yaml \
  --launcher standalone \
  --dataset.parquet_path /content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet \
  --dataset.source_column prompt_sentences_sonar_emb \
  --dataset.source_text_column prompt_sentences \
  --dataset.target_column answer_sentences_sonar_emb \
  --dataset.target_text_column prompt_sentences \
  --tasks lcm_generation \
  --task_args '{"max_gen_len": 200}' \
  --data_loading.batch_size 16  --generator_batch_size 16 \
  --dump_dir /content/drive/MyDrive/LCM/output_results_lcm

When I run this command, I get the warning '[rank 0] [WARNING] filtering table whose nb sentences and nb sonar vectors are aligned, keeping 2 rows out of11490'. I have confirmed that my data loading works correctly and loads 11490 rows. The full output is below:

[2025-01-23 07:35:57,150] [rank 0] [INFO] submitted single job for lcm_generation_base_lcm_a591ec0874_2025-01-23_07-35-57: DEBUG_138828927617776
[2025-01-23 07:35:57,150] [rank 0] [INFO] Logs at: /content/executor_logs/lcm_generation_base_lcm_a591ec0874_2025-01-23_07-35-56/DEBUG_138828927617776_0_log.err
[2025-01-23 07:35:57,152] [rank 0] [WARNING] Logging is written both to stderr/stdout and to /content/executor_logs/lcm_generation_base_lcm_a591ec0874_2025-01-23_07-35-56/DEBUG_138828927617776_0_log.out/err. But call to print will only appear in the console.
[2025-01-23 07:35:57,157] [rank 0] [INFO] Writing configs and metadata to /content/drive/MyDrive/LCM/output_results_lcm/metadata.jsonl
[2025-01-23 07:35:57,163] [rank 0] [INFO] Evals version 0.1.0.dev0 (/content/large_concept_model/lcm/evaluation)
[2025-01-23 07:35:57,163] [rank 0] [INFO] Config: {'timestamp': '2025_01_23_07_35_57', 'command': '/content/large_concept_model/lcm/evaluation/main.py --predictor base_lcm --model_card /content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model_card.yaml --launcher standalone --dataset.parquet_path /content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet --dataset.source_column prompt_sentences_sonar_emb --dataset.source_text_column prompt_sentences --dataset.target_column answer_sentences_sonar_emb --dataset.target_text_column prompt_sentences --tasks lcm_generation --task_args '{"max_gen_len": 200}' --data_loading.batch_size 4096 --generator_batch_size 4096 --dump_dir /content/drive/MyDrive/LCM/output_results_lcm ''', 'git_info': {'git_repo': '/content/large_concept_model/lcm', 'commit': 'd6402232cb7195530904d565cfe7c66d70c2b2a3', 'branch': 'main', 'user': 'root'}, 'config': {'name': 'lcm_generation', 'task_name': 'lcm_generation', 'dump_dir': '/content/drive/MyDrive/LCM/output_results_lcm', 'predictor': 'base_lcm', 'params': {'dataset': {'columns': None, 'source_text_column': 'prompt_sentences', 'target_text_column': 'prompt_sentences', 'source_prefix_text': None, 'source_suffix_text': None, 'target_prefix_text': None, 'target_suffix_text': None, 'source_sequences': None, 'target_sequences': None, 'silent_freeze': True, 'name': None, 'parquet_path': '/content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet', 'weight': 1.0, 'limit': None, 'source_column': 'prompt_sentences_sonar_emb', 'target_column': 'answer_sentences_sonar_emb', 'source_quality_column': None, 'source_quality_range': None, 'partition_filters': None, 'filters': None, 'filesystem_expr': None, 'filesystem': None, 'split_to_row_groups': None, 'nb_parallel_fragments': None, 'sharding_in_memory': False}, 'max_gen_len': 200, 'max_gen_len_ratio': None, 'max_prompt_len': 2048, 'eos_config': None}, 'data_loading': {'multiple_dataset_chaining': 'concat', 'batch_size': 4096, 'order_by_length': True, 'max_tokens': None, 'len_to_wrap_long_seq': None, 'packing': False, 'wrap_before_affixing': False, 'max_sentence_len_in_doc': None, 'min_sentence_len_in_doc': None, 'max_sentence_len_in_target_doc': None, 'min_sentence_len_in_target_doc': None, 'min_length_of_sequences': 1, 'min_length_of_sequences_after_batching': 1, 'min_length_of_target_sequences': 1, 'min_length_of_target_sequences_after_batching': 1, 'output_format': <ParquetBatchFormat.torch: 2>, 'shuffle': False, 'drop_null': True, 'seed': 123, 'nb_epochs': 1, 'min_batch_size': 1, 'nb_prefetch': 3.0, 'num_parallel_calls': 1.5, 'use_threads': False, 'ignore_checkpointed_pipeline': False, 'even_sharding': False, 'max_iteration_steps': None, 'sharding_in_memory': True, 'rank': 0, 'world_size': 1, 'max_samples': None}, 'dataset': {'columns': None, 'source_text_column': 'prompt_sentences', 'target_text_column': 'prompt_sentences', 'source_prefix_text': None, 'source_suffix_text': None, 'target_prefix_text': None, 'target_suffix_text': None, 'source_sequences': None, 'target_sequences': None, 'silent_freeze': True, 'name': None, 'parquet_path': '/content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet', 'weight': 1.0, 'limit': None, 'source_column': 'prompt_sentences_sonar_emb', 'target_column': 'answer_sentences_sonar_emb', 'source_quality_column': None, 'source_quality_range': None, 'partition_filters': None, 'filters': None, 'filesystem_expr': None, 'filesystem': None, 'split_to_row_groups': None, 'nb_parallel_fragments': None, 
'sharding_in_memory': False}, 'dtype': 'torch.float32', 'predictor_config': {'max_seq_len': 200, 'min_seq_len': 1, 'eos_threshold': 0.9, 'sample_latent_variable': True, 'stop_on_repetition_cosine_threshold': None, 'include_eos_token': False, 'trim_hypotheses': False, 'seed': 42, 'lcm_temperature': 1.0, 'model_card': '/content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model_card.yaml', 'decoder_config': {'tokenizer': 'text_sonar_basic_decoder', 'decoder': 'text_sonar_basic_decoder', 'lang': 'eng_Latn', 'max_tokens_in_sentence': 256, 'temperature': 1.0}, 'encoder_config': {'tokenizer': 'text_sonar_basic_encoder', 'encoder': 'text_sonar_basic_encoder', 'lang': 'eng_Latn'}, 'generator_batch_size': 4096}, 'seed': 42, 'confidence_level': None, 'disable_cache': False, 'temperature': 0.0, 'top_k': 0, 'top_p': 0, 'metric_log_dir': '/content/drive/MyDrive/LCM/output_results_lcm', 'tb_log_dir': None, 'no_resume': False, 'metrics_to_report': None, 'show_progress': False, 'log_raw_results': True, 'log_only_text': False, 'requirements': {'nodes': 1, 'mem_gb': None, 'tasks_per_node': 1, 'gpus_per_node': 1, 'cpus_per_task': 4, 'timeout_min': 150, 'constraint': None, 'max_num_timeout': 10}, 'nshards': None, 'os_environs': None}, 'task_configs': {'dataset': ParquetDatasetConfig(columns=None, source_text_column='prompt_sentences', target_text_column='prompt_sentences', source_prefix_text=None, source_suffix_text=None, target_prefix_text=None, target_suffix_text=None, source_sequences=None, target_sequences=None, silent_freeze=True, name=None, parquet_path='/content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet', weight=1.0, limit=None, source_column='prompt_sentences_sonar_emb', target_column='answer_sentences_sonar_emb', source_quality_column=None, source_quality_range=None, partition_filters=None, filters=None, filesystem_expr=None, filesystem=None, split_to_row_groups=None, nb_parallel_fragments=None, sharding_in_memory=False), 'max_gen_len': 200, 'max_gen_len_ratio': None, 'max_prompt_len': 2048, 'eos_config': None}}
[2025-01-23 07:35:57,238] [rank 0] [INFO] Running task lcm_generation on cuda:0
[2025-01-23 07:35:57,242] [rank 0] [INFO] Setting 'cuda:0' as the default device of the process.
[2025-01-23 07:35:57,426] [rank 0] [INFO] Card loaded: {'source': 'inproc', 'checkpoint': 'file:///content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model.pt', 'model_arch': 'base_lcm_1_6B', 'model_family': 'base_lcm', 'name': 'on_the_fly_lcm'}
[2025-01-23 07:36:00,871] [rank 0] [INFO] Building sonar_normalizer = dummy_sonar_normalizer
[2025-01-23 07:36:00,872] [rank 0] [INFO] Using LCMFrontend with embeddings scaler = 1.0
[2025-01-23 07:36:00,873] [rank 0] [INFO] Initializing frontend embeddings (special and positional) ~ N(0, 0.006)
[2025-01-23 07:36:03,788] [rank 0] [WARNING] eos_threshold is set to 0.9, but eos_vec is not provided
[2025-01-23 07:36:03,789] [rank 0] [INFO] Using the cached checkpoint of text_sonar_basic_decoder. Set force to True to download again.
[2025-01-23 07:36:15,290] [rank 0] [INFO] Using the cached tokenizer of text_sonar_basic_decoder. Set force to True to download again.
[2025-01-23 07:36:15,676] [rank 0] [INFO] Predictor loaded: LCMPredictor
[2025-01-23 07:36:15,677] [rank 0] [INFO] Using rank=0 among world_size=1 to build self._pipeline
[2025-01-23 07:36:15,878] [rank 0] [INFO] Following columns will be loaded: ['answer_sentences_sonar_emb', 'prompt_sentences', 'prompt_sentences_sonar_emb', 'split']
0% 0/1 [00:18<?, ?it/s]
100% 1/1 [00:00<00:00, 5269.23it/s]
[2025-01-23 07:36:15,905] [rank 0] [INFO] Bucketing will require at least: 664882 of tokens (source + target)
[2025-01-23 07:36:15,905] [rank 0] [INFO] Dataset stats: {'min_number_of_fragment': 1, 'mean_fragment_length': 11490.0, 'mean_fragment_number_of_tokens': 443255.0}
[2025-01-23 07:36:15,905] [rank 0] [INFO] Dataset Config: ParquetDatasetConfig(columns=['answer_sentences_sonar_emb', 'prompt_sentences', 'prompt_sentences_sonar_emb', 'split'], source_text_column='prompt_sentences', target_text_column='prompt_sentences', source_prefix_text=None, source_suffix_text=None, target_prefix_text=None, target_suffix_text=None, source_sequences=None, target_sequences=None, silent_freeze=True, name=None, parquet_path='/content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet', weight=1.0, limit=None, source_column='prompt_sentences_sonar_emb', target_column='answer_sentences_sonar_emb', source_quality_column=None, source_quality_range=None, partition_filters=None, filters=None, filesystem_expr=None, filesystem=None, split_to_row_groups=True, nb_parallel_fragments=1, sharding_in_memory=False)
[2025-01-23 07:36:15,906] [rank 0] [INFO] Using Loading Config: EvaluationDataLoadingConfig(multiple_dataset_chaining='concat', batch_size=4096, order_by_length=True, max_tokens=None, len_to_wrap_long_seq=None, packing=False, wrap_before_affixing=False, max_sentence_len_in_doc=None, min_sentence_len_in_doc=None, max_sentence_len_in_target_doc=None, min_sentence_len_in_target_doc=None, min_length_of_sequences=1, min_length_of_sequences_after_batching=1, min_length_of_target_sequences=1, min_length_of_target_sequences_after_batching=1, output_format=<ParquetBatchFormat.torch: 2>, shuffle=False, drop_null=True, seed=123, nb_epochs=1, min_batch_size=1, nb_prefetch=3.0, num_parallel_calls=1.5, use_threads=False, ignore_checkpointed_pipeline=False, even_sharding=False, max_iteration_steps=None, sharding_in_memory=True, rank=0, world_size=1, max_samples=None)
[2025-01-23 07:36:15,906] [rank 0] [INFO] Activating sharding_in_memory
[2025-01-23 07:36:15,909] [rank 0] [INFO] /content/drive/MyDrive/LCM/eval_data : full number of files 1
[2025-01-23 07:36:15,909] [rank 0] [INFO] /content/drive/MyDrive/LCM/eval_data : starting split in row groups
[2025-01-23 07:36:26,397] [rank 0] [WARNING] filtering table whose nb sentences and nb sonar vectors are aligned, keeping 2 rows out of11490
0% 0/1 [00:29<?, ?it/s]/content/large_concept_model/lcm/datasets/parquet_utils.py:162: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
return torch.from_numpy(arr.to_numpy(zero_copy_only=True))
[2025-01-23 07:40:20,849] [rank 0] [INFO] Using default tokenizer.
[2025-01-23 07:40:21,314] [rank 0] [INFO] Using default tokenizer.
[2025-01-23 07:40:21,755] [rank 0] [INFO] Writing raw results to /content/drive/MyDrive/LCM/output_results_lcm/raw_results/lcm_generation/lcm_generation_0 ( *.json | *.pt)
[2025-01-23 07:40:21,808] [rank 0] [INFO] written cache for lcm_generation_base_lcm_a591ec0874_2025-01-23_07-40-21:0
[2025-01-23 07:40:21,810] [rank 0] [INFO] lcm_generation_base_lcm_a591ec0874_2025-01-23_07-40-21 done after full execution
[2025-01-23 07:40:21,810] [rank 0] [INFO] Writing metric results to /content/drive/MyDrive/LCM/output_results_lcm/results/lcm_generation.json
[2025-01-23 07:40:21,817] [rank 0] [INFO] All evaluation results: rouge2: 0.002397 | rougel: 0.010584 | rougelsum: 0.013138
100% 1/1 [04:24<00:00, 264.83s/it]2025-01-23 07:40:22.180669: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-01-23 07:40:22.197087: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-23 07:40:22.215684: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-23 07:40:22.221060: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-23 07:40:22.235472: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-23 07:40:23.335161: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2025-01-23 07:40:24,285] [rank 0] [INFO] Writing Tensorboard logs to /content/drive/MyDrive/LCM/output_results_lcm/tb
[2025-01-23 07:40:24,292] [rank 0] [INFO] Writing metric logs to /content/drive/MyDrive/LCM/output_results_lcm/metrics.eval.jsonl
[2025-01-23 07:40:24,296] [rank 0] [INFO] Tasks ['lcm_generation'] took 267.35 seconds (including scheduling).
100% 1/1 [04:27<00:00, 267.31s/it]
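For reference, here is a rough sanity check (not from the LCM codebase, just a pyarrow sketch) that I would run against the parquet file to see how many rows actually have matching counts. It assumes the sentence and embedding columns are stored as per-row lists; the column names are the ones from my command, including the target_text_column / target_column pair exactly as I passed them above.

import pyarrow.parquet as pq

# Sketch only: checks, per row, whether the number of sentences matches the
# number of SONAR vectors, which is what the warning appears to filter on.
# Assumes both columns are list-typed (one list per row).
table = pq.read_table(
    "/content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet"
)

pairs = [
    # source_text_column vs. source_column
    ("prompt_sentences", "prompt_sentences_sonar_emb"),
    # target_text_column vs. target_column, as passed in the command above
    ("prompt_sentences", "answer_sentences_sonar_emb"),
]

for text_col, emb_col in pairs:
    texts = table.column(text_col).to_pylist()
    embs = table.column(emb_col).to_pylist()
    aligned = sum(
        t is not None and e is not None and len(t) == len(e)
        for t, e in zip(texts, embs)
    )
    print(f"{text_col} vs {emb_col}: {aligned}/{len(texts)} rows aligned")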
