read
Functionality
Internally, reading should use the read_csv function from pandas. A few behaviors should be hardcoded by default:
Use the sep and encoding options provided in __init__
Pandas should not detect an index column from the data
Pandas should not try to infer datetime formats (or cast values to np.datetime64 objects). Any datetime column should be left as dtype 'object'
Pandas should not error on a badly formatted line. We should just raise a warning and read the remaining lines
After reading the data, we should use it to infer a MultiTableMetadata object. (Even if there is only 1 CSV file, we should still create a MultiTableMetadata object.)
Parameters
(required) folder_name: The name of the folder that contains the CSV files; it may include the full path to the folder
file_names: A list of file names inside the folder to read
(default) None: Read all files in the folder that end with ".csv"
list(str): Only files with these names will be read into Python
Returns
data: A dictionary mapping each table name to a pandas DataFrame with the data. The table name is the same as the file name (excluding the '.csv' suffix)
metadata: A MultiTableMetadata object that describes the data
write
Functionality
Internally, writing should use the to_csv function from pandas. A few behaviors should be hardcoded by default.
Parameters
(required) synthetic_data: A dictionary that maps each table name to a pandas.DataFrame containing that table's data
file_name_suffix: An optional suffix to add when writing each file
(default) None: Do not add a suffix. The file name will be the same as the table name with a '.csv' extension
string: Append the suffix after the table name. E.g., the suffix '_synthetic' will write a file named 'TABLENAME_synthetic.csv'
mode: A string signaling which mode of writing to use
(default) 'x': Write to new files, raising an error if a file with the same name already exists
'w': Write to the files, overwriting any existing files with the same name
'a': Append the new CSV rows to any existing files
Additional context
We will add a number of local file handlers for different file types (see Add ExcelHandler #1950). Therefore, the implementation of this class should also introduce a base class.
Optionally, the __init__, read and write functions can accept a subset of the arguments that the corresponding pandas functions use:
If a parameter behaves the same for both reading and writing in pandas (e.g. decimal), put it in __init__.
We can ignore most of these parameters; only add the ones that seem impactful.
If some parameters are common across the different file types, consider adding them to the base class.
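One way the shared base class could look. The name BaseLocalHandler and the choice of decimal as the shared parameter are illustrative assumptions, not part of the issue:

```python
from abc import ABC, abstractmethod


class BaseLocalHandler(ABC):
    """Illustrative base class for local file handlers (CSV, Excel, ...).

    Parameters shared by pandas' read and write paths (e.g. ``decimal``)
    live here; format-specific options go on the subclasses.
    """

    def __init__(self, decimal='.'):
        self.decimal = decimal

    @abstractmethod
    def read(self, folder_name, file_names=None):
        """Return (data, metadata) read from ``folder_name``."""

    @abstractmethod
    def write(self, synthetic_data, folder_name, file_name_suffix=None,
              mode='x'):
        """Write each table in ``synthetic_data`` to ``folder_name``."""
```

Declaring read and write as abstract methods keeps every future handler (CSV, Excel, ...) on the same call signature while letting each one pick its own pandas backend.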
Problem Description
As a user, I'd like a streamlined way to load my data and metadata from files so that I can get right to using SDV.
Expected behavior
In the sdv.io subpackage, add a folder called local containing a CSVHandler class with the __init__, read and write functionality described above.