What is the exact behavior of CreateTextLoader<TInput> when dataSample is given? #4898

Closed
@najeeb-kazmi

Description

We have this overload for CreateTextLoader<TInput>, where the schema is defined in TInput.

    public static TextLoader CreateTextLoader<TInput>(this DataOperationsCatalog catalog,
        char separatorChar = TextLoader.Defaults.Separator,
        bool hasHeader = TextLoader.Defaults.HasHeader,
        IMultiStreamSource dataSample = null,
        bool allowQuoting = TextLoader.Defaults.AllowQuoting,
        bool trimWhitespace = TextLoader.Defaults.TrimWhitespace,
        bool allowSparse = TextLoader.Defaults.AllowSparse)
        => TextLoader.CreateTextLoader<TInput>(CatalogUtils.GetEnvironment(catalog), hasHeader, separatorChar, allowQuoting,
            allowSparse, trimWhitespace, dataSample: dataSample);

The dataSample argument is meant to be used to infer the schema. However, since TInput must contain at least one field, the schema always has at least one column. As a result, Utils.Size(cols) == 0 is always false in the condition below, the && short-circuits before TryParseSchema is called, and dataSample is never used to infer the schema with the CreateTextLoader<TInput> overload.

if (Utils.Size(cols) == 0 && !TryParseSchema(_host, headerFile ?? dataSample, ref options, out cols, out error))

The presence of the dataSample argument is confusing here because it implies that a sample can be provided to infer the schema. In other places the sample is used exactly that way, so users would expect the same behavior here, but this overload ignores dataSample.
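To make the short-circuit concrete, here is a minimal, self-contained sketch that mirrors the shape of the condition above. It does not use ML.NET: SchemaDemo, ResolveSchema, SampleWasRead, and this TryParseSchema are hypothetical stand-ins, and columns are modeled as plain strings. It only illustrates that when the column list is already non-empty (as it always is when it comes from TInput), the sample is never read.

```csharp
using System;

// Hypothetical stand-ins for illustration only; none of these are ML.NET APIs.
public static class SchemaDemo
{
    // Records whether the sample was ever inspected.
    public static bool SampleWasRead;

    // Stand-in for schema inference from a sample: splits a header line on commas.
    static bool TryParseSchema(string dataSample, out string[] cols)
    {
        SampleWasRead = true;
        cols = dataSample.Split(',');
        return cols.Length > 0;
    }

    // Same shape as the condition in the issue: inference runs only
    // when no columns were supplied up front.
    public static string[] ResolveSchema(string[] colsFromTInput, string dataSample)
    {
        string[] cols = colsFromTInput;
        if (cols.Length == 0 && !TryParseSchema(dataSample, out cols))
            throw new InvalidOperationException("Could not infer a schema.");
        return cols;
    }

    public static void Main()
    {
        // Columns derived from TInput are never empty, so && short-circuits.
        var cols = ResolveSchema(new[] { "Label", "Features" }, "a,b,c");
        Console.WriteLine(string.Join(",", cols)); // prints "Label,Features"
        Console.WriteLine(SampleWasRead);          // prints "False"
    }
}
```

Passing an empty column array instead would take the other branch and return the columns parsed from the sample, which is the behavior a caller of CreateTextLoader<TInput> might wrongly expect.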

I will update the documentation to reflect this, but the dataSample argument should be removed from this overload. Since removing it would be a breaking API change, this should be revisited for 2.0.

Metadata

Assignees

No one assigned

    Labels

    API: Issues pertaining the friendly API
    P1: Priority of the issue for triage purpose: Needs to be fixed soon.
