Write csv not save all lines of dataframe #3783

Miyake-Diogo · 2022-10-10T20:02:54Z

Describe the bug
When I try to save a dataframe as CSV, only around 400K lines are saved, but the data has more than 1M lines.

To Reproduce
My code:

use datafusion::prelude::*;
use log::{debug, info, LevelFilter, trace};
use crate::datapipeline::data_utils::*;
pub mod datapipeline;
use datafusion::logical_plan::when;

use datafusion::arrow::datatypes::DataType::{Int64,Utf8};
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  let ctx: SessionContext = SessionContext::new();
  let raw_fato_path: &str = "data/minilake/raw/fato_census/Data8277.csv";
  let stage_fato_path: &str = "data/minilake/stage/fato_census/";
  let fato_census_df = ctx.read_csv(raw_fato_path,  
                                  CsvReadOptions::new()).await?;
  
  let fato_census_df = fato_census_df.with_column("area",cast(
    col("Area"),
    Utf8))?;

  let fato_census_df = fato_census_df
    //.with_column("Area",concat_ws("-", &vec![lit("A"),col("Area")]))?
    .select(vec![
      col("Year").alias("year"),
      col("Age").alias("age"),
      col("Ethnic").alias("ethnic"),
      col("Sex").alias("sex"),
      col("Area").alias("area"),
      col("count").alias("total_count")
      ])?;
  
  // We can see the ..C values in Count column
  fato_census_df.show_limit(5).await?;
  print_schema_of_dataframe(&fato_census_df).await?;
  // Build an expression that replaces the "..C" placeholder with 0
  let transform_count_data = when(col("total_count")
    .eq(lit("..C")), lit(0_u32))
    .otherwise(col("total_count"))?;

  //Cast column datatype
  let fato_census_df = fato_census_df.with_column(
    "total_count",
    cast(transform_count_data, Int64))?;
  
  fato_census_df.write_csv(stage_fato_path).await?;

  Ok(())
  }

Dataset:

Age and sex by ethnic group (grouped total responses), for census usually resident population counts, 2006, 2013, and 2018 Censuses (RC, TA, SA2, DHB)

Expected behavior
All lines of the dataframe should be saved, but only about 400K of them are written.

andygrove · 2022-10-11T19:29:02Z

@Miyake-Diogo So part-0.csv only has 400k lines but were there other csv files?

I tried running this code but it has dependencies that are not here:

error[E0583]: file not found for module `datapipeline`
 --> src/main.rs:4:1
  |
4 | pub mod datapipeline;

Do you have this code in GitHub somewhere? I am happy to help debug if you have a public repro case.

andygrove · 2022-10-11T19:47:05Z

Here is a smaller repro case:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx: SessionContext = SessionContext::new();
    let raw_fato_path: &str = "/mnt/bigdata/census/Data8277.csv";
    let stage_fato_path: &str = "/tmp/stage";
    let fato_census_df = ctx.read_csv(raw_fato_path, CsvReadOptions::new()).await?;
    fato_census_df.write_csv(stage_fato_path).await?;
    Ok(())
}

$ wc -l /tmp/stage/part-0.csv 
425985 /tmp/stage/part-0.csv

I tested with DataFusion 11, 12, and 13, and all of them have the same issue.
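
To answer the earlier question about whether other part files exist, the line counts of every CSV file in the output directory can be checked in one pass. The following is a minimal sketch using only the Rust standard library; the helper name `count_csv_lines` is hypothetical, and the default directory `/tmp/stage` is simply the path used in the repro above.

```rust
use std::fs;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Count the lines of every `.csv` file directly inside `dir`.
/// Returns (file name, line count) pairs sorted by file name.
/// Hypothetical helper, not part of DataFusion.
fn count_csv_lines(dir: &Path) -> io::Result<Vec<(String, usize)>> {
    let mut counts = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Skip anything that is not a `.csv` file.
        if path.extension().and_then(|e| e.to_str()) != Some("csv") {
            continue;
        }
        let lines = BufReader::new(fs::File::open(&path)?).lines().count();
        let name = path
            .file_name()
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_default();
        counts.push((name, lines));
    }
    counts.sort();
    Ok(counts)
}

fn main() {
    // First command-line argument, or the repro's output directory by default.
    let dir = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "/tmp/stage".to_string());
    match count_csv_lines(Path::new(&dir)) {
        Ok(counts) => {
            let total: usize = counts.iter().map(|(_, n)| n).sum();
            for (name, n) in &counts {
                println!("{} {}", n, name);
            }
            println!("{} total", total);
        }
        Err(e) => eprintln!("could not read {}: {}", dir, e),
    }
}
```

If the total across all part files is still well below the number of input lines, the rows were never written, rather than being written to a different file.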

andygrove · 2022-10-11T21:03:07Z

@Miyake-Diogo The issue is that this error is happening:

Error: ArrowError(ParseError("Error while parsing value CMB07601 for column 4 at line 431740"))

I recommend specifying the schema for the file since it contains mixed types for this column.
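
Why a mixed-type column only fails deep into the file can be illustrated with a toy model. This is a sketch under the assumption that the reader infers column types from a limited sample of leading rows and then parses the whole file with those types; the names `ColType`, `infer_type`, and `first_bad_row` are hypothetical, the rows are made up, and none of this is DataFusion or Arrow code.

```rust
/// Column types this toy reader can infer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int64,
    Utf8,
}

/// Infer a type for column `col` by looking only at the first `sample` rows.
fn infer_type(rows: &[Vec<&str>], col: usize, sample: usize) -> ColType {
    let all_ints = rows
        .iter()
        .take(sample)
        .all(|row| row[col].parse::<i64>().is_ok());
    if all_ints {
        ColType::Int64
    } else {
        ColType::Utf8
    }
}

/// Parse every row with the chosen type and return the 1-based number
/// of the first row whose value does not fit, if any.
fn first_bad_row(rows: &[Vec<&str>], col: usize, ty: ColType) -> Option<usize> {
    match ty {
        // Any text is valid for a string column.
        ColType::Utf8 => None,
        ColType::Int64 => rows
            .iter()
            .position(|row| row[col].parse::<i64>().is_err())
            .map(|idx| idx + 1),
    }
}

fn main() {
    // Made-up rows with two columns: Year, Area.
    let rows = vec![
        vec!["2018", "1"],
        vec!["2018", "2"],
        vec!["2018", "CMB07601"],
    ];

    // Sampling only the first two rows makes Area look like an integer column.
    let inferred = infer_type(&rows, 1, 2);
    println!("inferred type: {:?}", inferred);
    println!("first bad row: {:?}", first_bad_row(&rows, 1, inferred));

    // Declaring the column as a string up front avoids the failure.
    println!(
        "first bad row with declared Utf8: {:?}",
        first_bad_row(&rows, 1, ColType::Utf8)
    );
}
```

In this model the parse error appears only when the first non-numeric value is reached, which is consistent with the error above being reported at line 431740 rather than at the start of the file.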

You did not see the error because of a bug that caused it to be ignored. The fix for that bug is in #3801.
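
The effect of an ignored error on the output can be shown with a small standard-library sketch. It is only an illustration of the general pattern (a writer that stops at the first failed batch but still reports success), not the actual DataFusion code or the actual change in #3801; `Batch`, `sample_batches`, `write_ignoring_errors`, and `write_propagating_errors` are hypothetical names.

```rust
/// Stand-in for a stream of parsed batches: rows, or a parse error.
type Batch = Result<Vec<String>, String>;

/// Made-up input: one good batch, one failing batch, one more good batch.
fn sample_batches() -> Vec<Batch> {
    vec![
        Ok(vec!["row 1".to_string(), "row 2".to_string()]),
        Err("Error while parsing value".to_string()),
        Ok(vec!["row 3".to_string()]),
    ]
}

/// Buggy pattern: stop at the first error but still report success.
fn write_ignoring_errors(batches: Vec<Batch>, out: &mut Vec<String>) -> Result<(), String> {
    for batch in batches {
        match batch {
            Ok(rows) => out.extend(rows),
            Err(_) => break, // the error is dropped here
        }
    }
    Ok(())
}

/// Fixed pattern: `?` hands the first error back to the caller.
fn write_propagating_errors(batches: Vec<Batch>, out: &mut Vec<String>) -> Result<(), String> {
    for batch in batches {
        out.extend(batch?);
    }
    Ok(())
}

fn main() {
    let mut partial = Vec::new();
    let result = write_ignoring_errors(sample_batches(), &mut partial);
    println!("ignoring: {:?}, rows written: {}", result, partial.len());

    let mut checked = Vec::new();
    let result = write_propagating_errors(sample_batches(), &mut checked);
    println!("propagating: {:?}, rows written: {}", result, checked.len());
}
```

Both writers produce the same truncated output, but only the second one tells the caller that something went wrong, which is the difference between a silently short part-0.csv and a visible ParseError.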

Miyake-Diogo · 2022-10-11T23:01:19Z

Hi @andygrove, all the code is in this repo: https://gitlab.com/miyake-diogo/rust-big-data-playground
How can I specify the schema on read? I didn't find any example in the documentation.

andygrove · 2022-10-30T17:09:26Z

@Miyake-Diogo Apologies for the late reply, but schema can be set in CsvReadOptions.

The root issue of not writing all results was fixed in #3801

Miyake-Diogo · 2022-10-31T23:47:23Z

Don't worry, @andygrove. Thanks for answering me.

Miyake-Diogo added the bug label Oct 10, 2022

Miyake-Diogo changed the title ~~Write cssv not save all lines~~ Write csv not save all lines of dataframe Oct 10, 2022

andygrove mentioned this issue Oct 11, 2022

Stop ignoring errors when writing DataFrame to csv, parquet, json #3801

Merged

andygrove mentioned this issue Oct 11, 2022

[RUST][Datafusion] What causes "Error: Execution("file size of 4 is less than footer")" error? #3800

Closed

Dandandan mentioned this issue Oct 12, 2022

Don't try to infer nulls in CSV schema inference apache/arrow-rs#2859

Closed

andygrove added this to the 14.0.0 milestone Oct 30, 2022

andygrove closed this as completed Oct 30, 2022
