Skip to content

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented Jul 5, 2024

Description

Let decompress cmd script in package handle both decompression and ir extraction

Validation performed

Launched the package, verified that compression, decompression, ir extraction and search still work from command line


def generate_container_start_cmd(
container_name: str, container_mounts: List[CLPDockerMounts], container_image: str
container_name: str, container_mounts: List[Optional[DockerMount]], container_image: str
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a type fix

from strenum import KebabCaseStrEnum

# CONSTANTS
DECOMPRESSION_COMMAND = "x"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about using a enum similar to JobType, but will need to manually assign a value instead of using auto.

@haiqi96
Copy link
Contributor Author

haiqi96 commented Jul 6, 2024

A high level concern is that what would be good terms to distinguish, (1) a general job name launched by decompress.py script" (2) a job that decompress a file, and (3) a job that extracts an IR.

I am currently using "x" as the decompression command following CLP and CLO, but internally still use decompression as job type, which is inconsistent.

Should we call both job as decompression job, and internally distinguish them as "extraction" and "ir_extraction" command?

@junhaoliao
Copy link
Member

How do we specify target-uncompressed-size?
I tried ./decompress.sh i --target-uncompressed-size 10240 --orig-file-id daf326b3-ab77-42ec-9fcf-056b541f948a 0 and also manually inserted a msgpack record of

{
	"orig_file_id": "f6fa2faf-d686-4d54-b086-b68c3da04405",
	"msg_ix": 1,
	"target_uncompressed_size": 10240
}

which both yielded

[2024-07-11 06:28:05,643: INFO/ForkPoolWorker-7] job_orchestration.executor.query.extract_ir_task.extract_ir[78f14806-52e2-4e87-81d8-3f00fc2ac0d2]: Started IR extraction task for job 3
[2024-07-11 06:28:05,648: ERROR/ForkPoolWorker-7] Task job_orchestration.executor.query.extract_ir_task.extract_ir[78f14806-52e2-4e87-81d8-3f00fc2ac0d2] raised unexpected: TypeError('sequence item 8: expected str instance, int found')
Traceback (most recent call last):
  File "/opt/clp/lib/python3/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/clp/lib/python3/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/clp/lib/python3/site-packages/job_orchestration/executor/query/extract_ir_task.py", line 106, in extract_ir
    return run_query_task(
  File "/opt/clp/lib/python3/site-packages/job_orchestration/executor/query/utils.py", line 62, in run_query_task
    logger.info(f'Running: {" ".join(task_command)}')
TypeError: sequence item 8: expected str instance, int found

@haiqi96 haiqi96 marked this pull request as ready for review July 11, 2024 14:48
haiqi96 and others added 2 commits July 11, 2024 10:49
Co-authored-by: Junhao Liao <junhao@junhao.ca>
@haiqi96
Copy link
Contributor Author

haiqi96 commented Jul 11, 2024

How do we specify target-uncompressed-size? I tried ./decompress.sh i --target-uncompressed-size 10240 --orig-file-id daf326b3-ab77-42ec-9fcf-056b541f948a 0 and also manually inserted a msgpack record of

{
	"orig_file_id": "f6fa2faf-d686-4d54-b086-b68c3da04405",
	"msg_ix": 1,
	"target_uncompressed_size": 10240
}

which both yielded

[2024-07-11 06:28:05,643: INFO/ForkPoolWorker-7] job_orchestration.executor.query.extract_ir_task.extract_ir[78f14806-52e2-4e87-81d8-3f00fc2ac0d2]: Started IR extraction task for job 3
[2024-07-11 06:28:05,648: ERROR/ForkPoolWorker-7] Task job_orchestration.executor.query.extract_ir_task.extract_ir[78f14806-52e2-4e87-81d8-3f00fc2ac0d2] raised unexpected: TypeError('sequence item 8: expected str instance, int found')
Traceback (most recent call last):
  File "/opt/clp/lib/python3/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/clp/lib/python3/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/clp/lib/python3/site-packages/job_orchestration/executor/query/extract_ir_task.py", line 106, in extract_ir
    return run_query_task(
  File "/opt/clp/lib/python3/site-packages/job_orchestration/executor/query/utils.py", line 62, in run_query_task
    logger.info(f'Running: {" ".join(task_command)}')
TypeError: sequence item 8: expected str instance, int found

Sorry, it looks like I forgot to do a type conversion.

I included it in https://github.com/y-scope/clp/pull/472/files#diff-c3c708ca5b9cee2be7634fcb5966f6fc622b85ec0ad8575491540e676414cbbaL48. if you need this change urgently, I can also make a separate PR for it,

logger.error(f"job {job_id} finished with unexpected status: {job_status}")


async def do_extract(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need this method, right? (Since it's purely wrapping another async method.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I tried to remove it but then it seems that I have to turn the entire call trees to async, including main. Maybe my knowledge about async is not enough to make a neat change quickly.

Given that this code is copied from search.py, can we turn this refactor into a separate PR and apply to both decompress.py and search.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work (in handle_extract_ir_cmd)?

        return asyncio.run(
            run_function_in_process(
                submit_and_monitor_ir_extraction_job_in_db,
                clp_config.database,
                orig_file_id,
                parsed_args.msg_ix,
                parsed_args.target_uncompressed_size,
            )
        )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right,, this one just works. don't remeber what I wrote but not as neat as this one.

from strenum import KebabCaseStrEnum

# CONSTANTS
DECOMPRESSION_COMMAND = "x"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

  • EXTRACT_FILE_CMD and EXTRACT_IR_CMD
  • handle_extract_file_cmd and handle_extract_ir_cmd


group = ir_extraction_parser.add_mutually_exclusive_group(required=True)
group.add_argument("--orig-file-id", type=str, help="Original file's ID.")
group.add_argument("--path", type=str, help="Original file's path.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about --orig-file-path? Makes it more consistent with the other arg.

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

clp-package: Add support for extracting files as IR streams through the CLI.

@haiqi96 haiqi96 changed the title Support ir extraction in decompression script clp-package: Add support for extracting files as IR streams through the CLI. Jul 19, 2024
@haiqi96 haiqi96 merged commit 1d71b6f into y-scope:main Jul 19, 2024
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
@haiqi96 haiqi96 deleted the ir_extraction_script branch December 6, 2024 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants