fread fails on file with inconsistent # columns #2267

MichaelChirico · 2017-07-12T00:08:34Z

I've got a file I made and continued to add to. Unfortunately at one point I switched from writing 15 to 14 columns and kept using the same file.

I expect two things from this file: 1) a plain fread call should fail, but with an error that reports the inconsistent number of columns; 2) with fill = TRUE, the read should succeed.

Unfortunately, neither is true:

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-07-11 18:43:20 UTC; travis

URL = paste0('https://gist.githubusercontent.com/MichaelChirico/',
             '0f1a9ae0d419160ad8ef5b7ac5469336/raw/',
             'db7936fafaf2602e03e657bbfc9e49dd526260af/bad_fill.csv')
x = fread(URL, verbose = TRUE)

# Input contains no \n. Taking this to be a filename to open
# [1] Check arguments
# Using 2 threads (omp_get_max_threads()=2, nth=2)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# [2] Opening the file
# Opening file /tmp/RtmpMqWBHa/filee366e213235
# File opened, size = 34.88MB (36578984 bytes).
# Memory mapping ... ok
# [3] Detect and skip BOM
# [4] Detect end-of-line character(s)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# [6] Skipping initial rows if needed
# Positioned on line 1 starting: <<train_set,delx,dely,alpha,eta,>>
#   [7] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 100 lines of 15 fields using quote rule 0
# Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<train_set,delx,dely,alpha,eta,>>
#   Quote rule picked = 0
# [8] Determine column names
# All the fields on line 1 are character fields. Treating as the column names.
# [9] Detect column types
# Number of sampling jump points = 101 because (36578905 bytes from row 1 to eof) / (2 * 15974 jump0size) == 1144
# Type codes (jump 000)    : 655552525552255  Quote rule 0

Error in fread(URL, verbose = TRUE) : Could not find first good line start after jump point 73 when sampling.

x = fread(URL, fill = TRUE, verbose = TRUE)

# Input contains no \n. Taking this to be a filename to open
# [1] Check arguments
# Using 2 threads (omp_get_max_threads()=2, nth=2)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# [2] Opening the file
# Opening file /tmp/RtmpMqWBHa/filee361f04c03
# File opened, size = 34.88MB (36578984 bytes).
# Memory mapping ... ok
# [3] Detect and skip BOM
# [4] Detect end-of-line character(s)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# [6] Skipping initial rows if needed
# Positioned on line 1 starting: <<train_set,delx,dely,alpha,eta,>>
#   [7] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 100 lines of 15 fields using quote rule 0
# Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<train_set,delx,dely,alpha,eta,>>
#   Quote rule picked = 0
# fill=true and the most number of columns found is 15
# [8] Determine column names
# All the fields on line 1 are character fields. Treating as the column names.
# [9] Detect column types
# Number of sampling jump points = 101 because (36578905 bytes from row 1 to eof) / (2 * 15974 jump0size) == 1144
# Type codes (jump 000)    : 655552525552255  Quote rule 0

Error in fread(URL, fill = TRUE, verbose = TRUE) : Could not find first good line start after jump point 73 when sampling.

I was able to overcome the problem and fix my file by identifying the exact row where the switch occurred and doing:

x = fread('head -n 164161 ~/Desktop/fire_random_search.csv')
y = fread('tail -n +164162 ~/Desktop/fire_random_search.csv',
          col.names = names(x)[-ncol(x)])
z = rbind(x, y, fill = TRUE)

fwrite(z, '~/Desktop/fire_random_search.csv')
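
For anyone hitting the same thing, one way to find the row where the field count changes is base R's count.fields(). This is only a sketch on a small made-up file (the temp file and its contents are hypothetical, not the file from this report), and it assumes a plain comma-separated file:

```r
# Hypothetical stand-in for the real CSV: the header and first rows have
# 3 fields, the later rows have 2.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "7,8", "9,10"), tmp)

# count.fields() (utils, attached by default) returns the number of
# fields found on each line of the file.
n_fields <- count.fields(tmp, sep = ",")

# Line number of the first line whose field count differs from the header's.
first_bad <- which(n_fields != n_fields[1L])[1L]
first_bad
```

Here first_bad - 1 is the value to pass to head -n, and first_bad itself is the value for tail -n +. Note that count.fields() has its own quote and comment.char defaults, so those may need adjusting for files with quoted fields or # characters.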

Also, this worked as expected in 1.10.4:

fread(URL, fill = TRUE)
#         train_set     delx     dely alpha      eta       lt     theta   k
#      1:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      2:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      3:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      4:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      5:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#     ---                                                                  
# 227516:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227517:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227518:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227519:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227520:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
#            l1    l2   kde.bw kde.lags kde.win        pei      pai
#      1: 0.000 0e+00 615.3886        1      11 0.09523810 24.42448
#      2: 0.000 1e-05 615.3886        1      11 0.02380952  6.10612
#      3: 0.000 5e-05 615.3886        1      11 0.00000000  0.00000
#      4: 0.000 1e-04 615.3886        1      11 0.02380952  6.10612
#      5: 0.000 5e-04 615.3886        1      11 0.02380952  6.10612
#     ---                                                          
# 227516: 0.001 1e-05 501.8807        1      15 0.02857143       NA
# 227517: 0.001 5e-05 501.8807        1      15 0.02857143       NA
# 227518: 0.001 1e-04 501.8807        1      15 0.02857143       NA
# 227519: 0.001 5e-04 501.8807        1      15 0.02857143       NA
# 227520: 0.001 1e-03 501.8807        1      15 0.01904762       NA


st-pasha added bug fread labels Jul 12, 2017

st-pasha mentioned this issue Jul 12, 2017

Master task for fread bugs / proposals #2247

Closed

st-pasha mentioned this issue Jan 10, 2018

Fix chunk boundaries detection logic h2oai/datatable#686

Merged

mattdowle added a commit that referenced this issue Feb 13, 2018

Pencilled in test from #2267. Added 'nocov' in C code to see if that works.

7f48c74

mattdowle added this to the v1.10.6 milestone Feb 13, 2018

mattdowle mentioned this issue Feb 14, 2018

Better jump sync and run-on #2627

Merged

3 tasks

mattdowle added a commit that referenced this issue Feb 15, 2018

Added test for #2267

7516eba

mattdowle closed this as completed in #2627 Feb 16, 2018

TobiasGold mentioned this issue Feb 28, 2019

improve fread behaviour on inconsistent number of columns #3436

Closed
