WIP: Pydarshan file_based sorting #954

Open: wants to merge 8 commits into base: main

@Yanlilyu Yanlilyu commented Aug 9, 2023

Description:

  • This branch adds a file-based routine, file_stats.py, to the PyDarshan CLI tools to help users find the most I/O-intensive files.
  • It also adds test_file_stats.py to the PyDarshan tests to cover file_stats.py.
  • The tool combines the data from multiple log files into a single DataFrame, groups it by “id”, sorts it in descending order by the column the user specifies, and keeps the first n records (n is number_of_rows from user input). It returns a DataFrame with the n most I/O-intensive files.
  • User input consists of log_path, module, order_by_colname, and number_of_rows. The command-line arguments are named arguments.
  • log_path should be a list of files or a shell glob.
  • The defaults for module, order_by_colname, and number_of_rows are “POSIX”, “POSIX_BYTES_READ”, and 10, respectively; the tool uses them when the user omits these arguments.
  • The tool checks whether the module is in the list of modules. If it is not, the tool prints an error and exits immediately.
  • order_by_colname should be “{mod}_BYTES_READ” or “{mod}_BYTES_WRITTEN”.
  • The tool also checks that the order_by_colname the user inputs is consistent with the module. For example, if the module is POSIX and order_by_colname is STDIO_BYTES_READ, the tool reports the error “Column name should be ‘{mod}_BYTES_READ’ or ‘{mod}_BYTES_WRITTEN’”.
  • Example usage:
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker*.darshan -m STDIO -o STDIO_BYTES_READ -n 5
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker*.darshan
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker_1.darshan darshan_logs/nonmpi_workflow/worker_3.darshan -m STDIO -o STDIO_BYTES_READ -n 5
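
For reviewers, the validate/combine/group/sort/filter steps described above can be sketched roughly as follows. This is a minimal illustration using only pandas, not the code in file_stats.py: the names `validate`, `top_files`, and `KNOWN_MODULES` are hypothetical, the per-log DataFrames are hand-built stand-ins for records loaded from Darshan logs, and summing counters per “id” is an assumption (the description only says the data is grouped by “id”).

```python
import pandas as pd

# Hypothetical module list for illustration only.
KNOWN_MODULES = ["POSIX", "STDIO"]


def validate(module, order_by_colname, known_modules=KNOWN_MODULES):
    """Reject unknown modules and sort columns that do not match the module."""
    if module not in known_modules:
        raise ValueError(f"Unknown module: {module}")
    allowed = [f"{module}_BYTES_READ", f"{module}_BYTES_WRITTEN"]
    if order_by_colname not in allowed:
        raise ValueError(
            f"Column name should be '{module}_BYTES_READ' "
            f"or '{module}_BYTES_WRITTEN'"
        )


def top_files(per_log_frames, module="POSIX",
              order_by_colname="POSIX_BYTES_READ", number_of_rows=10):
    """Return the number_of_rows files with the largest order_by_colname."""
    validate(module, order_by_colname)
    # Combine the records from every log into one DataFrame.
    combined = pd.concat(per_log_frames, ignore_index=True)
    # Assumption: counters for the same file "id" are summed across logs.
    grouped = combined.groupby("id", as_index=False).sum(numeric_only=True)
    # Sort in descending order and keep the first number_of_rows records.
    ranked = grouped.sort_values(order_by_colname, ascending=False)
    return ranked.head(number_of_rows).reset_index(drop=True)


# Stand-in data: two "logs" that both touch the file with id 2.
log_a = pd.DataFrame({"id": [1, 2],
                      "STDIO_BYTES_READ": [100, 50],
                      "STDIO_BYTES_WRITTEN": [0, 10]})
log_b = pd.DataFrame({"id": [2, 3],
                      "STDIO_BYTES_READ": [500, 20],
                      "STDIO_BYTES_WRITTEN": [5, 0]})

result = top_files([log_a, log_b], module="STDIO",
                   order_by_colname="STDIO_BYTES_READ", number_of_rows=2)
print(result)
```

With this stand-in data, the file with id 2 ranks first because its reads from both logs are added together before sorting. Passing module="POSIX" with order_by_colname="STDIO_BYTES_READ" raises the column-name error instead.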

@Yanlilyu Yanlilyu changed the title Pydarshan file based sorting WIP: Pydarshan file based sorting Aug 10, 2023
@Yanlilyu Yanlilyu changed the title WIP: Pydarshan file based sorting WIP: Pydarshan file_based sorting Aug 10, 2023