[New Check] Entire column of a table is not NaN#231
Conversation
for more information, see https://pre-commit.ci
nwbinspector/checks/tables.py
Outdated
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | ||
| if np.any(~np.isnan(column[subindex_selection])): | ||
| continue | ||
| else: | ||
| yield InspectorMessage( | ||
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | ||
| ) |
There was a problem hiding this comment.
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | |
| if np.any(~np.isnan(column[subindex_selection])): | |
| continue | |
| else: | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) | |
| if np.all(np.isnan(column[:nelems])): | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) |
There was a problem hiding this comment.
It's written this way in case np.any has an early return condition once it detects a single non-NaN value
There was a problem hiding this comment.
np.all can have the same early stopping
There was a problem hiding this comment.
Well, interesting results...
Indeed as you say, all does indeed have early stopping just like any...
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.006882190704345703s!
Took 3.719329833984375e-05s!... however numpy does not appear to (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from Python built-ins might be nicer.
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
np.all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
np.all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.05324196815490723s!
Took 0.05084705352783203s!There was a problem hiding this comment.
By sparse do you mean short?
Either approach is fine with me, either using all or np.all.
There was a problem hiding this comment.
I guess 'sparsity' (% of the number of False items being passed into all) isn't as relevant - I just mean for any case where there is at least one False element occurring somewhat early on in the input to all (as opposed to occurring later in the vector, as seen above), it will exit a few ~ms faster per check (which could add up to seconds when streaming over an entire dataset).
Anyway, I implemented the all logic here.
…awithoutborders/nwbinspector into check_table_col_not_nan
for more information, see https://pre-commit.ci
…awithoutborders/nwbinspector into check_table_col_not_nan
|
The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too? |
Done |
Co-authored-by: Ben Dichter <ben.dichter@gmail.com>
|
how would you feel about moving this one line into the check instead of making it its own function? |
Codecov Report
@@ Coverage Diff @@
## dev #231 +/- ##
==========================================
+ Coverage 94.87% 94.93% +0.06%
==========================================
Files 17 17
Lines 936 948 +12
==========================================
+ Hits 888 900 +12
Misses 48 48
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Co-authored-by: Ben Dichter <ben.dichter@gmail.com>
Co-authored-by: Ben Dichter <ben.dichter@gmail.com>
Co-authored-by: Ben Dichter <ben.dichter@gmail.com>
Yeah that's fine since it's so parred down now, the only complicated bit is the logic for determining the 'by' of the slice but that should be readable enough as is - thanks for helping make that so much simpler! |
Along with #229, finishes #155
Whenever an entire column of a table is
NaN, informs the user that there probably isn't a need to include it. Samples thenelemsuniformly across the length of the column since it's quite possible to have a block of consecutive entries (evennelemsmany) that are purposefully and even informativelyNaN.Only giving it a
SUGGESTIONlevel of importance since this will fail for large columns of sparse non-NaN, and also could be a design choice of the user.