Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix detecting index column when reading from CSV in C++ #714

Merged
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
4fe6771
Fix finding index col
dagardner-nv Feb 16, 2023
b4b833a
Use load_table_from_file instead of CuDFTableUtil::load_table
dagardner-nv Feb 16, 2023
749488f
Add _should_use_cpp class method giving class methods in message clas…
dagardner-nv Feb 16, 2023
cc9018f
Expose make_from_file method
dagardner-nv Feb 16, 2023
5c1dd2a
New tests
dagardner-nv Feb 16, 2023
ce1562b
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Feb 16, 2023
db45ee8
Remove unused includes
dagardner-nv Feb 16, 2023
f26fb18
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Feb 22, 2023
94e4aa5
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Feb 22, 2023
34bb25b
Merge branch 'david-from-file-no-index' of github.com:dagardner-nv/Mo…
dagardner-nv Feb 22, 2023
44961f1
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Feb 23, 2023
a0fcc45
Split off the mutating bit of get_index_col_count into prepare_df_index
dagardner-nv Feb 23, 2023
06212ac
Split GetIndexColCount test into GetIndexColCountNoIdx and GetIndexCo…
dagardner-nv Feb 23, 2023
eee1e4d
Make index_regex a const so we aren't building it on every invocation…
dagardner-nv Feb 23, 2023
2526817
More tests
dagardner-nv Feb 23, 2023
c044213
fix year
dagardner-nv Feb 23, 2023
597c397
Use prepare_df_index
dagardner-nv Feb 23, 2023
05ea4ad
wip
dagardner-nv Feb 23, 2023
857b581
Make serializers visible
dagardner-nv Feb 23, 2023
d6a4aa0
wip
dagardner-nv Feb 23, 2023
7e4853a
Support make_from_file in subclasses
dagardner-nv Feb 24, 2023
685846e
Applying Devin's changes
dagardner-nv Feb 24, 2023
6f101b4
Merge branch 'david-devin-test-embedded' into david-from-file-no-index
dagardner-nv Feb 24, 2023
91a9419
wip
dagardner-nv Feb 24, 2023
4154591
wip
dagardner-nv Feb 24, 2023
292f40a
Merge branch 'david-devin-test-embedded' into david-from-file-no-index
dagardner-nv Feb 24, 2023
53b5d22
Fix tests
dagardner-nv Feb 24, 2023
19205e1
Remove unused include
dagardner-nv Feb 24, 2023
652bff3
Remove unused include
dagardner-nv Feb 24, 2023
accac71
Merge branch 'david-devin-test-embedded' into david-from-file-no-index
dagardner-nv Feb 24, 2023
8211db9
Add missing include
dagardner-nv Feb 24, 2023
a2cd582
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Feb 27, 2023
92acc38
Revert classmethod
dagardner-nv Feb 27, 2023
199af79
Update load_table_from_file to use determine_file_type, add read_file…
dagardner-nv Feb 27, 2023
1e5baf5
Remove FileTypesInterfaceProxy as determine_file_type didn't need any…
dagardner-nv Feb 27, 2023
ba412a2
Add binding for read_file_to_df
dagardner-nv Feb 27, 2023
9c46e22
FileTypes::Auto is working
dagardner-nv Feb 27, 2023
e81b6f9
Use the C++ deserializers when C++ is enabled and df_type is cudf
dagardner-nv Feb 27, 2023
b750a9c
Update tests
dagardner-nv Feb 27, 2023
9690518
Formatting
dagardner-nv Feb 27, 2023
f07b091
IWYU changes
dagardner-nv Feb 28, 2023
d97f363
Remove unused import
dagardner-nv Feb 28, 2023
5a99001
Add a comment about the cudf issue I ran into
dagardner-nv Feb 28, 2023
6a58650
Fix pybind11 link error
dagardner-nv Feb 28, 2023
9a212ae
IWYU
dagardner-nv Feb 28, 2023
3a136da
Merge branch 'branch-23.03' into david-from-file-no-index [no ci]
dagardner-nv Mar 7, 2023
89e345b
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 7, 2023
5403d67
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 8, 2023
c26d719
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 13, 2023
58e79b6
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 13, 2023
092837d
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 17, 2023
3d9c333
Fix merge errors
dagardner-nv Mar 17, 2023
2728e5c
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 20, 2023
b116f74
Merge branch 'branch-23.03' into david-from-file-no-index
dagardner-nv Mar 20, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions morpheus/_lib/src/io/deserializers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -111,23 +111,23 @@ cudf::io::table_with_metadata load_table_from_file(const std::string& filename)
int get_index_col_count(cudf::io::table_with_metadata& data_table)
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
{
int index_col_count = 0;
auto& col_names = data_table.metadata.column_names;

// Check if we have a first column with INT64 data type
if (data_table.metadata.schema_info.size() >= 1 &&
data_table.tbl->get_column(0).type().id() == cudf::type_id::INT64)
if (col_names.size() >= 1 && data_table.tbl->get_column(0).type().id() == cudf::type_id::INT64)
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
{
std::regex index_regex(R"((unnamed: 0|id))", std::regex_constants::ECMAScript | std::regex_constants::icase);

// Get the column name
auto col_name = data_table.metadata.schema_info[0].name;
auto col_name = col_names[0];

// Check it against some common terms
if (std::regex_search(col_name, index_regex))
{
// Also, if its the hideous 'Unnamed: 0', then just use an empty string
if (col_name == "Unnamed: 0")
{
data_table.metadata.schema_info[0].name = "";
col_names[0] = "";
}

index_col_count = 1;
Expand Down
6 changes: 4 additions & 2 deletions morpheus/_lib/src/messages/meta.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

#include "morpheus/messages/meta.hpp"

#include "morpheus/io/deserializers.hpp"
#include "morpheus/objects/mutable_table_ctx_mgr.hpp"
#include "morpheus/objects/python_data_table.hpp"
#include "morpheus/objects/table_info.hpp"
Expand Down Expand Up @@ -134,9 +135,10 @@ MutableTableCtxMgr MessageMetaInterfaceProxy::mutable_dataframe(MessageMeta& sel
std::shared_ptr<MessageMeta> MessageMetaInterfaceProxy::init_cpp(const std::string& filename)
{
// Load the file
auto df_with_meta = CuDFTableUtil::load_table(filename);
auto df_with_meta = load_table_from_file(filename);
int index_col_count = get_index_col_count(df_with_meta);

return MessageMeta::create_from_cpp(std::move(df_with_meta));
return MessageMeta::create_from_cpp(std::move(df_with_meta), index_col_count);
}

SlicedMessageMeta::SlicedMessageMeta(std::shared_ptr<MessageMeta> other,
Expand Down
1 change: 1 addition & 0 deletions morpheus/_lib/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ list(APPEND CMAKE_MESSAGE_CONTEXT "tests")
# Keep all source files sorted
add_executable(test_libmorpheus
# test_cuda.cu
test_deserializers.cpp
test_main.cpp
test_matx_util.cpp
test_morpheus.cpp
Expand Down
55 changes: 55 additions & 0 deletions morpheus/_lib/tests/test_deserializers.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "./test_morpheus.hpp" // IWYU pragma: associated

#include "morpheus/io/deserializers.hpp"

#include <cudf/io/types.hpp>
#include <gtest/gtest.h>

#include <filesystem>
#include <string>
#include <vector>

using namespace morpheus;

TEST_CLASS(Deserializers);

TEST_F(TestDeserializers, GetIndexColCount)
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
{
std::filesystem::path morpheus_root{std::getenv("MORPHEUS_ROOT")};
auto test_data_dir = morpheus_root / "tests/tests_data";

{
// First test a files without an index
std::vector<std::filesystem::path> input_files{test_data_dir / "filter_probs.csv",
test_data_dir / "filter_probs.jsonlines"};
for (const auto& input_file : input_files)
{
auto table = load_table_from_file(input_file);
EXPECT_EQ(get_index_col_count(table), 0);
}
}

{
// now test a file with an index
auto input_file = morpheus_root / "tests/tests_data/filter_probs_w_id_col.csv";
auto table = load_table_from_file(input_file);
EXPECT_EQ(get_index_col_count(table), 1);
}
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
}
6 changes: 5 additions & 1 deletion morpheus/messages/message_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,10 +48,14 @@ class MessageBase(metaclass=MessageImpl):
class has an associated C++ implementation (`cpp_class`), returns the Python implementation for all others.
"""

@classmethod
def _should_use_cpp(cls):
return getattr(cls, "_cpp_class", None) is not None and CppConfig.get_should_use_cpp()

def __new__(cls, *args, **kwargs):

# If _cpp_class is set, and use_cpp is enabled, create the C++ instance
if (getattr(cls, "_cpp_class", None) is not None and CppConfig.get_should_use_cpp()):
if (cls._should_use_cpp()):
return cls._cpp_class(*args, **kwargs)

# Otherwise, do the default init
Expand Down
9 changes: 9 additions & 0 deletions morpheus/messages/message_meta.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@
import pandas as pd

import morpheus._lib.messages as _messages
from morpheus._lib.common import FileTypes
from morpheus.io.deserializers import read_file_to_df
from morpheus.messages.message_base import MessageBase


Expand Down Expand Up @@ -80,6 +82,13 @@ def __init__(self, df: pd.DataFrame) -> None:
self._mutex = threading.RLock()
self._df = df

@classmethod
def make_from_file(cls, filename: str):
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
if (cls._should_use_cpp()):
return cls._cpp_class.make_from_file(filename)
else:
return MessageMeta(read_file_to_df(filename, file_type=FileTypes.Auto, df_type="cudf"))
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved
dagardner-nv marked this conversation as resolved.
Show resolved Hide resolved

@property
def df(self) -> pd.DataFrame:
msg = ("Warning the df property returns a copy, please use the copy_dataframe method or the mutable_dataframe "
Expand Down
25 changes: 25 additions & 0 deletions tests/test_file_in_out_stage_pipe.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,3 +110,28 @@ def test_file_rw_multi_segment_pipe(tmp_path, config, output_type):
# Somehow 0.7 ends up being 0.7000000000000001
output_data = np.around(output_data, 2)
assert output_data.tolist() == input_data.tolist()


@pytest.mark.slow
@pytest.mark.parametrize("input_file",
[
os.path.join(TEST_DIRS.tests_data_dir, "filter_probs.csv"),
os.path.join(TEST_DIRS.tests_data_dir, "filter_probs_w_id_col.csv")
])
def test_file_rw_index_pipe(tmp_path, config, input_file):
out_file = os.path.join(tmp_path, 'results.csv')

pipe = LinearPipeline(config)
pipe.set_source(FileSourceStage(config, filename=input_file))
pipe.add_stage(WriteToFileStage(config, filename=out_file, overwrite=False, include_index_col=False))
pipe.run()

assert_path_exists(out_file)

validation_file = os.path.join(TEST_DIRS.tests_data_dir, "filter_probs.csv")
validation_data = np.loadtxt(validation_file, delimiter=",", skiprows=1)
output_data = np.loadtxt(out_file, delimiter=",", skiprows=1)

# Somehow 0.7 ends up being 0.7000000000000001
output_data = np.around(output_data, 2)
assert output_data.tolist() == validation_data.tolist()
18 changes: 18 additions & 0 deletions tests/test_message_meta.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,12 @@
import operator
import os

import numpy as np
import pytest

from morpheus._lib.common import FileTypes
from morpheus.io.deserializers import read_file_to_df
from morpheus.io.serializers import df_to_csv
from morpheus.messages.message_meta import MessageMeta
from utils import TEST_DIRS

Expand Down Expand Up @@ -67,3 +69,19 @@ def test_copy_dataframe(config):

assert meta.copy_dataframe()['v2'][3] != 47
assert meta.df != 47


def test_make_from_file(config, tmp_path):
input_file = os.path.join(TEST_DIRS.tests_data_dir, "filter_probs_w_id_col.csv")
out_file = os.path.join(tmp_path, 'results.csv')
meta = MessageMeta.make_from_file(input_file)
with meta.mutable_dataframe() as df:
assert list(df.columns) == ['v1', 'v2', 'v3', 'v4']

with open(out_file, 'w') as fh:
fh.writelines(df_to_csv(df, include_header=True, include_index_col=True))

input_data = np.loadtxt(input_file, delimiter=",", skiprows=1)
output_data = np.loadtxt(out_file, delimiter=",", skiprows=1)
output_data = np.around(output_data, 2)
assert output_data.tolist() == input_data.tolist()
3 changes: 3 additions & 0 deletions tests/tests_data/filter_probs_w_id_col.csv
Git LFS file not shown