[ML] find_file_structure not detecting CSV header with many long and highly variable field values #45047

Closed
@droberts195

elastic/kibana#42114 contains an example of a CSV file where the find_file_structure endpoint didn't detect that the first row contained the column names.

The explanation is:

    "First row is not unusual based on length test: [1347.0] and [count=313, min=1231.000000, average=4363.025559, max=8911.000000]",
    "First row is not unusual based on Levenshtein test [count=100, min=1871.000000, average=4357.270000, max=8512.000000] and [count=100, min=1711.000000, average=3648.230000, max=5914.000000]"

In other words:

  1. The first row length is 1347 and other rows vary in length between 1231 and 8911 characters.
  2. The average Levenshtein distance between the first row and each of the next 100 rows is 4357.27 while the average distance between 100 other pairs of rows is 3648.23.
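To make the two observations above concrete, here is a minimal sketch that plugs the reported figures into two simple checks. The class `FirstRowStats` and both methods are hypothetical names, and the "outside the min/max range" rule is an assumption made for illustration; this issue does not state the exact criteria that find_file_structure applies.

```java
/** Hypothetical helper for illustration; not the Elasticsearch implementation. */
final class FirstRowStats {

    /** Assumed rule: a value only looks unusual if it lies outside [min, max]. */
    static boolean isOutsideRange(double value, double min, double max) {
        return value < min || value > max;
    }

    /** How many times larger the first-row average distance is than the other-pairs average. */
    static double ratio(double firstRowAverage, double otherPairsAverage) {
        return firstRowAverage / otherPairsAverage;
    }
}
```

With the figures quoted above, `FirstRowStats.isOutsideRange(1347.0, 1231.0, 8911.0)` is false, and `FirstRowStats.ratio(4357.27, 3648.23)` is roughly 1.19, so neither number separates the first row clearly from the rest.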

The file in question contains AirBNB listings data. Some owners have written huge amounts about their properties and other owners have written very little, and the two current tests are confused by this.

To a human it's flagrantly obvious that the first row is a header row, so we should be able to improve this.

One idea is to extend the rule of _excluding_ the biggest difference, as described in:

    /**
     * Sum of the Levenshtein distances between corresponding elements
     * in the two supplied lists _excluding_ the biggest difference.
     * The reason the biggest difference is excluded is that sometimes
     * there's a "message" field that is much longer than any of the other
     * fields, varies enormously between rows, and skews the comparison.
     */

to exclude all fields that are over a certain length in any row, as such lengths suggest freeform text fields (and the AirBNB data has more than one such field per row).
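A rough sketch of that first idea might look like the following. Everything here is hypothetical: the class `FilteredRowDistance`, its method names, and the `maxFieldLength` threshold are illustrative assumptions, not the existing Elasticsearch code, and the Levenshtein routine is a plain textbook version.

```java
import java.util.List;

/** Hypothetical sketch; not the Elasticsearch implementation. */
final class FilteredRowDistance {

    /** Textbook two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Marks every column whose value exceeds maxFieldLength in at least one row. */
    static boolean[] longFieldMask(List<List<String>> rows, int maxFieldLength) {
        int columns = rows.stream().mapToInt(List::size).max().orElse(0);
        boolean[] mask = new boolean[columns];
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (row.get(i).length() > maxFieldLength) {
                    mask[i] = true;
                }
            }
        }
        return mask;
    }

    /** Sum of per-field Levenshtein distances, skipping the masked columns. */
    static int distanceIgnoring(List<String> first, List<String> second, boolean[] excluded) {
        int fields = Math.min(first.size(), second.size());
        int sum = 0;
        for (int i = 0; i < fields; i++) {
            if (i < excluded.length && excluded[i]) {
                continue; // likely freeform text, so leave it out of the comparison
            }
            sum += levenshtein(first.get(i), second.get(i));
        }
        return sum;
    }
}
```

The mask is computed once over all sampled rows, so a column is dropped from every comparison as soon as any row has a long value in it. Choosing a sensible threshold is left open here.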

Another idea would be to look at the number of distinct characters in each row. In the AirBNB data this could well notice a difference between the first row and others because the first row is all commas, lowercase letters and underscores whereas the other lines have many other characters.
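The distinct-character idea could be sketched as below. The class `DistinctCharCheck` and the decision rule (the first row must have strictly fewer distinct characters than every other row) are assumptions for illustration only; the issue suggests comparing the counts but does not specify how.

```java
import java.util.List;

/** Hypothetical sketch; not the Elasticsearch implementation. */
final class DistinctCharCheck {

    /** Number of distinct UTF-16 code units in the row. */
    static int distinctChars(String row) {
        return (int) row.chars().distinct().count();
    }

    /**
     * Assumed rule: the first row looks like a header if it uses strictly
     * fewer distinct characters than every other row.
     */
    static boolean headerHasFewestDistinctChars(List<String> rows) {
        if (rows.size() < 2) {
            return false; // nothing to compare against
        }
        int headerCount = distinctChars(rows.get(0));
        for (String row : rows.subList(1, rows.size())) {
            if (distinctChars(row) <= headerCount) {
                return false;
            }
        }
        return true;
    }
}
```

A strict "every other row" rule is probably too brittle for real data, since a single terse row would defeat it. Comparing the first row's count against the distribution of the other rows' counts may be more robust.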

Labels: :ml Machine learning