Open
Description
There's been a steady trickle of reports that LSI/LDA misbehave, produce degenerate models, crash Python, etc.
Typically this is a user data problem (bad input data, feature id mismatch, ...), but since gensim targets the wide general public, this is gensim's "fault" anyway.
Create utility functions that perform basic sanity checks on user's input data:
- check that all feature ids in a corpus are compatible with the user-provided dictionary (should avoid issues like http://projects.scipy.org/scipy/ticket/1582 )
- check that the data range is valid -- look for NaNs, Infs, explicit zeros => these are all illegal in gensim input.
- check that the data is not degenerate => all vectors identical/empty/?/model looks weird
- check corpus type and warn the user if it's a plain list (promote the memory-friendly generator interface, shown in tutorials) -- NOT NEEDED
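
A rough sketch of what the feature id, data range and degeneracy checks could look like as a single streaming pass over the corpus. Nothing here is existing gensim API: the function name `check_corpus`, its `num_terms` parameter and the wording of the messages are all made up for illustration. It assumes the usual bag-of-words layout, i.e. each document is a list of `(feature_id, value)` pairs.

```python
import math


def check_corpus(corpus, num_terms):
    """Return a list of problems found in a bag-of-words corpus.

    Hypothetical helper, not part of gensim. `corpus` is any iterable of
    documents, each a list of (feature_id, value) pairs; `num_terms` is the
    number of features the dictionary knows about.
    """
    problems = []
    num_docs = 0
    num_empty = 0
    first_doc = None
    all_identical = True

    for docno, doc in enumerate(corpus):
        num_docs += 1
        if not doc:
            num_empty += 1

        for feature_id, value in doc:
            # Feature ids must be valid indices into the dictionary.
            if not 0 <= feature_id < num_terms:
                problems.append(
                    "doc %d: feature id %s outside [0, %d)"
                    % (docno, feature_id, num_terms)
                )
            # NaNs, Infs and explicit zeros are all illegal input.
            if math.isnan(value) or math.isinf(value):
                problems.append(
                    "doc %d: non-finite value for feature %s" % (docno, feature_id)
                )
            elif value == 0:
                problems.append(
                    "doc %d: explicit zero for feature %s" % (docno, feature_id)
                )

        # Track whether every document equals the first one (order-insensitive).
        normalized = sorted(doc)
        if first_doc is None:
            first_doc = normalized
        elif normalized != first_doc:
            all_identical = False

    # Degeneracy checks on the corpus as a whole.
    if num_docs == 0:
        problems.append("corpus is empty")
    elif num_empty == num_docs:
        problems.append("all documents are empty")
    elif num_docs > 1 and all_identical:
        problems.append("all documents are identical")

    return problems


if __name__ == "__main__":
    corpus = [[(0, 1.0), (5, 2.0)], [(1, float("nan"))], [(2, 0.0)], []]
    for problem in check_corpus(corpus, num_terms=3):
        print(problem)
```

With a gensim dictionary, `num_terms` would presumably be `len(dictionary)`. The pass only iterates the corpus once and keeps a single document in memory, so it should also work on streamed (generator-style) corpora. The "model looks weird" part of the degeneracy check is not covered here.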