Integer column randomly typed as string when threading enabled #1047

@klaff

The following code:

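using CSV
using DataFrames
using Random

NCOLS = 30
NROWS = 150
Random.seed!(1)

# Write a CSV file of random Int16 values with CRLF line endings.
fname = tempname()
f = open(fname, "w")
write(f, join("col".*string.(1:NCOLS), ","))
write(f, "\r\n")
for i in 0:NROWS  # note: 0:NROWS writes NROWS + 1 data rows
    write(f, join(string.(rand(Int16, NCOLS)), ","))
    write(f, "\r\n")
end
close(f)

# Read the same file with the default (multithreaded) settings and with a single task.
df_by_threads = CSV.read(fname, DataFrame)
df_single_threaded = CSV.read(fname, DataFrame; ntasks=1)

# Compare the inferred column element types; they should be identical.
print(eltype.(eachcol(df_by_threads)) == eltype.(eachcol(df_single_threaded)))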

will print false, or at least it does on my Windows box with 8 threads, running CSV v0.10.7.

If I reduce NCOLS or NROWS, it prints true. With a different random seed it may also print true.

Inspecting the columns of df_by_threads shows at least one column of type String7. Which column is affected varies between runs, and sometimes two columns are affected, even though the data written to the file is fixed.
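
To see which columns differ rather than only whether any do, something like the following sketch may help. The helper name `mismatched_columns` is hypothetical (it is not part of CSV.jl or DataFrames.jl), and it assumes both data frames have the same column names. The two small data frames here are stand-ins so the snippet runs on its own.

```julia
using DataFrames

# Hypothetical helper: list the columns whose element types differ
# between two data frames that share the same column names.
function mismatched_columns(a::AbstractDataFrame, b::AbstractDataFrame)
    return [(name = n, a = eltype(a[!, n]), b = eltype(b[!, n]))
            for n in names(a) if eltype(a[!, n]) != eltype(b[!, n])]
end

# Stand-in data: col2 is parsed as strings in one frame and as integers in the other.
df_a = DataFrame(col1 = [1, 2], col2 = ["3", "4"])
df_b = DataFrame(col1 = [1, 2], col2 = [3, 4])

for m in mismatched_columns(df_a, df_b)
    println(m.name, ": ", m.a, " vs ", m.b)
end
```

With the reproducer above, calling `mismatched_columns(df_by_threads, df_single_threaded)` should list the column or columns that came back as String7 on a given run.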
