utils
The utils library houses modules for simplifying the experimental process.
There are a few basic file io functions available:
| Function | Description |
| --- | --- |
| read_file(file_name) | Read the contents of a text file into an array of strings. |
| write_file(file_name, contents) | Write a string (or alternatively an array of strings) to a text file. |
| load_CSV(filename, delimiter = ',') | Load a delimiter-separated-value file into a 2d array of strings. Note: The delimiter argument is optional. |
| save_CSV(data, filename, delimiter = ',') | Save a 2d array of items as a delimiter-separated-value file. Note: The delimiter argument is optional, and the data items will be converted to strings. |
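To make the delimiter handling and the string conversion concrete, here is a minimal standard-library sketch of the behaviour described for load_CSV and save_CSV. The helper names `save_rows` and `load_rows` are hypothetical stand-ins, not the krrt implementation, and splitting on the raw delimiter without any quoting rules is an assumption.

```python
import os
import tempfile

def save_rows(data, filename, delimiter=','):
    # Hypothetical stand-in for save_CSV: every item is converted to a string.
    with open(filename, 'w') as f:
        for row in data:
            f.write(delimiter.join(str(item) for item in row) + '\n')

def load_rows(filename, delimiter=','):
    # Hypothetical stand-in for load_CSV: returns a 2d list of strings.
    with open(filename) as f:
        return [line.rstrip('\n').split(delimiter) for line in f]

# Round trip through a temporary tab-separated file.
path = os.path.join(tempfile.mkdtemp(), 'example.tsv')
save_rows([['name', 'runtime'], ['run1', 3.02]], path, delimiter='\t')
rows = load_rows(path, delimiter='\t')
```

Note that the float written out comes back as a string, so numeric columns need an explicit conversion after loading.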
Additionally, the following function can be used to obtain a list of files in a directory (useful when running experiments with a benchmark set of examples):
- get_file_list(dir_name, forbidden_list = None, match_list = None): Returns a list of files in the given directory subject to constraints.
  - dir_name: The path of the directory to locate files in.
  - forbidden_list: List of strings that, when matched to a filename, cause the file to be ignored. e.g. ['.svn', 'extra-directory', '.o', ...]
  - match_list: List of strings that the found files should have in their name. e.g. ['.foo', 'problem', ...]
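The two constraints can be illustrated on a plain list of names. This is only a sketch: `filter_names` is a hypothetical helper, and it assumes substring matching and that every entry of match_list must appear in a name (the description above does not say whether one match or all matches are required).

```python
def filter_names(names, forbidden_list=None, match_list=None):
    # Hypothetical helper; assumes plain substring matching on each name.
    kept = []
    for name in names:
        # Any forbidden substring causes the name to be ignored.
        if forbidden_list and any(bad in name for bad in forbidden_list):
            continue
        # Assumption: every match string must be present in the name.
        if match_list and not all(good in name for good in match_list):
            continue
        kept.append(name)
    return kept

names = ['data1.csv', 'data2.csv', 'notes.o', 'readme.txt']
csv_only = filter_names(names, match_list=['.csv'])
no_extras = filter_names(names, forbidden_list=['readme.txt', '.o'])
```

Under these assumptions both calls select the same two csv files, which is why either argument can be used to skip unwanted files.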
Say you have a directory foo/ with the following files: data1.csv, data2.csv, data3.csv, and readme.txt. Suppose you want to read in each of the comma-separated files, write them back out as tab-separated values, and display the first 4 lines of readme.txt. The following code would achieve this:
from krrt.utils import read_file, load_CSV, save_CSV, get_file_list

#--- Load and print the first 4 lines of the readme.txt
readme_lines = read_file('foo/readme.txt')
print(readme_lines[:4])

#--- Locate all of the csv files
file_list = get_file_list('foo', forbidden_list = ['readme.txt'])
# Note: We could have used match_list=['.csv'] instead

#-- Iterate over each file
for file_name in file_list:
    #- Load the file as comma separated data
    data = load_CSV(file_name)
    #- Replace the .csv extension with .tsv
    new_file_name = file_name[:-4] + '.tsv'
    #- Write the file as tab separated data
    save_CSV(data, new_file_name, delimiter = "\t")

There is one main function used to simplify the setup of experimental evaluation: run_experiment. The function has a number of arguments, most of which are optional.
- base_directory: The base directory that the experiments should be run from. (default: ".")
- base_command: The base command to be executed. This argument is mandatory.
- single_arguments: A dictionary where the key is the name of an argument list (which is not included in the command), and the value is a list of arguments that should be used. For example if one (and only one) of flagA, flagB, and flagC should be included as a command-line option, then the key/value pair 'flags': ['flagA', 'flagB', 'flagC'] should be in the single_arguments dictionary. (default: None)
- parameters: A dictionary where each key is the name of a command-line option, and the value is a list of command-line values to use for that option. For example, if the software being tested has -input <filename> as a command-line option, then the dictionary would have an entry with the key '-input' and a value being a list of input files. (default: None)
- time_limit: The number of seconds the software should be permitted to run. (default: 15)
- memory_limit: The number of megabytes the software should be limited to. (default: -1 (i.e. unlimited))
- results_dir: Directory to store the output of each program execution. (default: "results")
- progress_file: The file that should contain text indicating the progress of the experiment as a percentage. If None is passed in, standard output is used. (default: "/dev/null")
- processors: The number of cores to be used simultaneously. (default: 1)
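To see how single_arguments and parameters might combine into concrete command lines, here is a small sketch. It assumes that every combination (the cross product) of one choice per argument list and one value per parameter is run; the helper `expand_commands`, the sorted ordering, and the way each command string is assembled are all hypothetical and not taken from run_experiment itself.

```python
from itertools import product

def expand_commands(base_command, single_arguments=None, parameters=None):
    # Hypothetical helper: one choice per argument list, one value per parameter.
    single_arguments = single_arguments or {}
    parameters = parameters or {}
    arg_groups = [single_arguments[name] for name in sorted(single_arguments)]
    param_keys = sorted(parameters)
    param_values = [parameters[key] for key in param_keys]
    commands = []
    for args in product(*arg_groups):
        for values in product(*param_values):
            parts = [base_command]
            # The dictionary key is never emitted; an empty choice is skipped.
            parts += [arg for arg in args if arg]
            for key, value in zip(param_keys, values):
                parts += [key, str(value)]
            commands.append(' '.join(parts))
    return commands

commands = expand_commands(
    './command do_stuff',
    single_arguments={'light_switch': ['-on', '-off'], 'flytype': ['-superfly', '']},
    parameters={'-parameter_1': [5, 25]},
)
```

With two argument lists of two choices each and one parameter with two values, this sketch produces 2 x 2 x 2 = 8 commands, which is a useful sanity check on how quickly an experiment grows.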
The run_experiment function returns a ResultSet object, which captures the information needed to filter results by parameter or argument. It has the following functionality / attributes:
| Attribute / Method | Description |
| --- | --- |
| res_set.size | The number of results contained. |
| res_set.get_ids() | Returns a list of keys that can be used to select specific results. |
| res_set[id] | Returns a Result object associated with id. |
| res_set.add_result(res) | Adds a result object res to the ResultSet object. |
| res_set.filter_parameter(param, value) | Returns a ResultSet with only the results that match the param / value pair specified. |
| res_set.filter_argument(arg, value) | Returns a ResultSet with only the results that match the argument / value pair specified. |
| res_set.filter(func) | Returns a ResultSet with only the results that pass a user-defined function, func. |
Note: The parameter and argument filter functions are just syntactic sugar for the generic filter function.
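The note above can be illustrated with a toy container in which both specialised filters simply delegate to the generic one. `MiniResult` and `MiniResultSet` are hypothetical stand-ins written for this sketch; they mirror the documented interface but are not the krrt classes.

```python
class MiniResult:
    # Hypothetical stand-in for a Result with only the fields needed here.
    def __init__(self, id, single_args, parameters, timed_out=False):
        self.id = id
        self.single_args = single_args
        self.parameters = parameters
        self.timed_out = timed_out

class MiniResultSet:
    # Hypothetical stand-in for ResultSet, keyed by result id.
    def __init__(self):
        self.results = {}

    @property
    def size(self):
        return len(self.results)

    def get_ids(self):
        return sorted(self.results)

    def __getitem__(self, id):
        return self.results[id]

    def add_result(self, res):
        self.results[res.id] = res

    def filter(self, func):
        # Generic filter: keep the results for which func returns True.
        subset = MiniResultSet()
        for res in self.results.values():
            if func(res):
                subset.add_result(res)
        return subset

    def filter_parameter(self, param, value):
        # Sugar: a generic filter with a fixed predicate on parameters.
        return self.filter(lambda res: res.parameters.get(param) == value)

    def filter_argument(self, arg, value):
        # Sugar: a generic filter with a fixed predicate on single_args.
        return self.filter(lambda res: res.single_args.get(arg) == value)

res_set = MiniResultSet()
res_set.add_result(MiniResult(0, {'flytype': '-superfly'}, {'-parameter_1': 5}))
res_set.add_result(MiniResult(1, {'flytype': ''}, {'-parameter_1': 5}))
res_set.add_result(MiniResult(2, {'flytype': ''}, {'-parameter_1': 25}))
five = res_set.filter_parameter('-parameter_1', 5)
```

Because each filter returns another set, calls can be chained, for example filtering by a parameter and then by an argument.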
The Result object contains information corresponding to a single run of your experiment. Specifically it has the following attributes:
| Attribute | Description |
| --- | --- |
| result.id | The id of the run (typically a number). |
| result.command | The full command executed. |
| result.output_file | The absolute path to the output captured from the command. |
| result.single_args | A dictionary mapping argument names to the value for this run. |
| result.parameters | A dictionary mapping parameter names to their setting for this run. |
| result.runtime | The runtime for this command to complete. |
| result.timed_out | A boolean value indicating whether or not this command timed out. |
from krrt.utils import run_experiment

# Run your program with different parameters, command-line arguments, etc
results = run_experiment(
    base_directory = '/path/to/command/',
    base_command = './command do_stuff',
    single_arguments = {
        'light_switch': ['-on', '-off'],
        'args': ['-arg1', '-arg2', '-arg3'],
        'flytype': ['-superfly', '']
    },
    parameters = {
        '-parameter_1': [5, 25, 100],
        '-parameter_2': [5, 25, 100],
        '-parameter_3': [.1, .25, .35]
    },
    time_limit = 900, # 15 minute time limit (900 seconds)
    memory_limit = 1000, # 1 gig memory limit (1000 megs)
    results_dir = "results",
    progress_file = None, # Print the progress to stdout
    processors = 8 # You've got 8 cores, right?
)

# (for whatever reason) Find all of the runs that had -superfly as an argument
superfly_results = results.filter_argument('flytype', '-superfly')

# Partition the results that didn't timeout into lists depending on -parameter_1
# Note: filter takes only the predicate function
good_results = results.filter(lambda result: not result.timed_out)
p1_results = {}
for result in good_results:
    p1_results.setdefault(result.parameters['-parameter_1'], []).append(result)

# p1_results is now a dict with the keys '5', '25', and '100' and a list of
# results corresponding to those values for -parameter_1
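A natural next step after partitioning is to summarise each group, for example by mean runtime. The sketch below uses a hypothetical `Run` namedtuple carrying only the parameters, runtime, and timed_out attributes, so that it runs without the library; the runtime values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for Result with just the attributes used below.
Run = namedtuple('Run', ['parameters', 'runtime', 'timed_out'])

runs = [
    Run({'-parameter_1': 5}, 1.0, False),
    Run({'-parameter_1': 5}, 3.0, False),
    Run({'-parameter_1': 25}, 10.0, False),
    Run({'-parameter_1': 25}, 900.0, True),  # timed out, so it is excluded
]

# Group the runtimes of the completed runs by their -parameter_1 setting.
groups = {}
for run in runs:
    if not run.timed_out:
        groups.setdefault(run.parameters['-parameter_1'], []).append(run.runtime)

# Average runtime per setting.
mean_runtime = {value: sum(times) / len(times) for value, times in groups.items()}
```

Excluding timed-out runs before averaging matters here, since their runtime would otherwise just reflect the time limit.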
The following functions are available for common parsing tasks that you may want to perform when building your experimental framework.
The get_value(file_name, regex, value_type = float) function is used to retrieve a single value from an output file.
- file_name: Path of the output file.
- regex: Regex string that is used to match for the value. (e.g. .*size:(\d+).*)
- value_type: (optional) Parameter to specify the type of the value (e.g. int)
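The role of the capture group and of value_type can be sketched with the standard re module. `get_value_sketch` is a hypothetical stand-in, not the krrt implementation; matching line by line and returning None when nothing matches are both assumptions.

```python
import os
import re
import tempfile

def get_value_sketch(file_name, regex, value_type=float):
    # Hypothetical stand-in: convert the first capture group of the
    # first matching line with value_type.
    with open(file_name) as f:
        for line in f:
            match = re.match(regex, line)
            if match:
                return value_type(match.group(1))
    return None  # assumption: no match yields None

# Build a small output file to parse.
path = os.path.join(tempfile.mkdtemp(), 'output')
with open(path, 'w') as f:
    f.write('solving...\nsize:42 nodes\nruntime:3.02sec\n')

size = get_value_sketch(path, r'.*size:(\d+).*', int)
runtime = get_value_sketch(path, r'.*runtime:([0-9]+\.?[0-9]+)sec.*')
```

Writing the pattern as a raw string avoids having to double the backslashes in `\d` and `\.`.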
from krrt.utils import get_value

#--- Get the runtime from the file 'output' that is of the form "runtime:3.02sec"
runtime = get_value('output', r'.*runtime:([0-9]+\.?[0-9]+)sec.*', float)

The match_value(file_name, regex) function is used to check whether a regex matches anywhere inside a file.
- file_name: Path of the output file.
- regex: Regex string that is used to match for the value. (e.g. .*Timeout.*)
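A boolean check of this kind can be sketched in a few lines. `match_value_sketch` is a hypothetical stand-in, and testing the regex against each line separately is an assumption about how the match is performed.

```python
import os
import re
import tempfile

def match_value_sketch(file_name, regex):
    # Hypothetical stand-in: True if any line of the file matches the regex.
    with open(file_name) as f:
        return any(re.match(regex, line) for line in f)

# Build a small output file to search.
path = os.path.join(tempfile.mkdtemp(), 'output')
with open(path, 'w') as f:
    f.write('starting search\nError: Timeout reached\n')

timed_out = match_value_sketch(path, r'.*Timeout.*')
crashed = match_value_sketch(path, r'.*Segmentation fault.*')
```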
from krrt.utils import match_value

#--- Check if the file 'output' has the string "Timeout" inside of it.
timed_out = match_value('output', '.*Timeout.*')

The get_lines(file_name, lower_bound = None, upper_bound = None) function is used to retrieve a contiguous sequence of lines from a file based on lines that surround the targeted text (non-inclusive). If lower_bound is not supplied, then all lines from the start of the file are included (similarly with upper_bound).
- file_name: Path of the output file.
- lower_bound: (optional) Parameter for indicating the lower bounding line to match on.
- upper_bound: (optional) Parameter for indicating the upper bounding line to match on.
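The non-inclusive bounds are easiest to see on a small list of lines. `get_lines_sketch` is a hypothetical stand-in that works on a list rather than a file, and it assumes a bounding line is any line containing the bound string; how the library actually compares lines is not stated above.

```python
def get_lines_sketch(lines, lower_bound=None, upper_bound=None):
    # Hypothetical stand-in operating on a list of lines, not a file.
    # Assumption: a bounding line is one that contains the bound string.
    # Raises StopIteration if a requested bound never appears.
    start = 0
    if lower_bound is not None:
        start = next(i for i, line in enumerate(lines) if lower_bound in line) + 1
    end = len(lines)
    if upper_bound is not None:
        end = next(i for i in range(start, len(lines)) if upper_bound in lines[i])
    # Neither bounding line is part of the returned slice.
    return lines[start:end]

lines = ['header', 'start_results', 'a 1', 'b 2', 'end_results', 'footer']
between = get_lines_sketch(lines, 'start_results', 'end_results')
before = get_lines_sketch(lines, upper_bound='start_results')
after = get_lines_sketch(lines, lower_bound='end_results')
```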
from krrt.utils import get_lines

#--- Get the lines of the output file between the lines "start_results" and "end_results"
result_lines = get_lines('output', lower_bound = 'start_results', upper_bound = 'end_results')

Additionally, the utils package provides the following functionality:
- get_opts(): Returns a tuple (opts, flags) of command line parameters, where:
  - opts: Dictionary of options where the key is of the form -<option> and the value is just a string.
  - flags: List of strings that weren't part of an -<option> <value> pair.
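The split into opts and flags can be sketched as follows. `parse_opts` is a hypothetical stand-in that takes the argument list explicitly instead of reading the real command line, and its pairing rule (a token starting with '-' followed by a token that does not start with '-') is an assumption rather than the documented behaviour of get_opts.

```python
def parse_opts(argv):
    # Hypothetical stand-in that takes the argument list explicitly.
    # Assumption: '-<option>' followed by a non-option token forms a pair.
    opts, flags = {}, []
    i = 0
    while i < len(argv):
        token = argv[i]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith('-')
        if token.startswith('-') and has_value:
            opts[token] = argv[i + 1]  # values are kept as plain strings
            i += 2
        else:
            flags.append(token)  # anything not part of a pair is a flag
            i += 1
    return opts, flags

opts, flags = parse_opts(['-input', 'data.csv', 'verbose', '-limit', '15'])
```

Since every value stays a string, numeric options such as the limit above still need converting, for example with int(opts['-limit']).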