Add more data sanity checks #74

Open
@piskvorky

Description

There's been a steady trickle of reports that LSI/LDA misbehave, produce degenerate models, crash Python, etc.

Typically this is a user data problem (bad input data, feature id mismatch, ...), but since gensim targets the wide general public, this is gensim's "fault" anyway.

Create utility functions that perform basic sanity checks on user's input data:

  1. check that all feature ids in a corpus are compatible with the user-provided dictionary (should avoid issues like http://projects.scipy.org/scipy/ticket/1582 )
  2. check that the data range is valid -- look for NaNs, Infs, explicit zeros => these are all illegal in gensim input.
  3. check that the data is not degenerate => all vectors identical/empty/?/model looks weird
  4. check corpus type and warn the user if it's plain list (promote the memory-friendly generator interface, shown in tutorials) NOT NEEDED
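
Checks 1-3 above could be sketched roughly as follows. This is only an illustration, not an existing gensim API: the function name `check_corpus`, its report format, and the use of a plain `dict` as the id-to-word mapping are all assumptions. It expects the corpus in bag-of-words form (each document a list of `(feature_id, weight)` pairs) and makes a single pass, so it also works on a streamed corpus.

```python
import math


def check_corpus(corpus, id2word):
    """Return a list of human-readable problems found in a bag-of-words corpus.

    Hypothetical helper: `corpus` is any iterable of documents, each a list of
    (feature_id, weight) pairs; `id2word` is any mapping keyed by feature id.
    """
    problems = []
    num_docs = 0
    first_vec = None       # first document, used to detect "all identical"
    all_identical = True
    all_empty = True

    for docno, doc in enumerate(corpus):
        num_docs += 1
        for feature_id, weight in doc:
            # Check 1: every feature id must be known to the dictionary.
            if feature_id not in id2word:
                problems.append(f"doc {docno}: unknown feature id {feature_id}")
            # Check 2: NaN, Inf and explicit zeros are illegal weights.
            if not math.isfinite(weight):
                problems.append(f"doc {docno}: non-finite weight for id {feature_id}")
            elif weight == 0:
                problems.append(f"doc {docno}: explicit zero for id {feature_id}")

        # Check 3 bookkeeping: compare each document with the first one.
        vec = sorted((fid, w) for fid, w in doc)
        if vec:
            all_empty = False
        if docno == 0:
            first_vec = vec
        elif vec != first_vec:
            all_identical = False

    # Check 3: degenerate corpora (no documents, all empty, all identical).
    if num_docs == 0:
        problems.append("corpus is empty")
    elif all_empty:
        problems.append("all documents are empty")
    elif all_identical and num_docs > 1:
        problems.append(f"all {num_docs} documents are identical")
    return problems


# Usage with a toy dictionary and a corpus containing three deliberate errors.
id2word = {0: "human", 1: "computer"}
corpus = [[(0, 1.0), (1, 2.0)], [(2, 1.0)], [(0, float("nan")), (1, 0.0)]]
for problem in check_corpus(corpus, id2word):
    print(problem)
```

Returning a list of messages rather than raising on the first problem is a design choice made for this sketch: it lets the caller decide whether to log warnings or abort. The "model looks weird" part of check 3 is not covered, since it needs a trained model rather than the raw input.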

Metadata

    Labels

    difficulty medium - Medium issue: required good gensim understanding & python skills
    feature - Issue described a new feature
    testing - Issue related with testing (code, documentation, etc)
    wishlist - Feature request
