Write csv not save all lines of dataframe #3783

Closed
Miyake-Diogo opened this issue Oct 10, 2022 · 6 comments
Labels
bug Something isn't working
Comments

@Miyake-Diogo

Describe the bug
When I try to save a DataFrame as CSV, only about 400K lines are saved, although the data has more than 1M lines.

To Reproduce
My code:

use datafusion::prelude::*;
use log::{debug, info, LevelFilter, trace};
use crate::datapipeline::data_utils::*;
pub mod datapipeline;
use datafusion::logical_plan::when;

use datafusion::arrow::datatypes::DataType::{Int64,Utf8};
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  let ctx: SessionContext = SessionContext::new();
  let raw_fato_path: &str = "data/minilake/raw/fato_census/Data8277.csv";
  let stage_fato_path: &str = "data/minilake/stage/fato_census/";
  let fato_census_df = ctx.read_csv(raw_fato_path,  
                                  CsvReadOptions::new()).await?;
  
  let fato_census_df = fato_census_df.with_column("area",cast(
    col("Area"),
    Utf8))?;

  let fato_census_df = fato_census_df
    //.with_column("Area",concat_ws("-", &vec![lit("A"),col("Area")]))?
    .select(vec![
      col("Year").alias("year"),
      col("Age").alias("age"),
      col("Ethnic").alias("ethnic"),
      col("Sex").alias("sex"),
      col("Area").alias("area"),
      col("count").alias("total_count")
      ])?;
  
  // We can see the ..C values in Count column
  fato_census_df.show_limit(5).await?;
  print_schema_of_dataframe(&fato_census_df).await?;
  // Create an expression that performs the transformation
  let transform_count_data = when(col("total_count")
    .eq(lit("..C")), lit(0_u32))
    .otherwise(col("total_count"))?;

  //Cast column datatype
  let fato_census_df = fato_census_df.with_column(
    "total_count",
    cast(transform_count_data, Int64))?;
  
  fato_census_df.write_csv(stage_fato_path).await?;

  Ok(())
  }

Dataset:

Age and sex by ethnic group (grouped total responses), for census usually resident population counts, 2006, 2013, and 2018 Censuses (RC, TA, SA2, DHB)

Expected behavior
All lines are saved:

[image]

But only this many lines are saved:

[image]

@Miyake-Diogo Miyake-Diogo added the bug Something isn't working label Oct 10, 2022
@Miyake-Diogo Miyake-Diogo changed the title Write cssv not save all lines Write csv not save all lines of dataframe Oct 10, 2022
@andygrove
Member

@Miyake-Diogo So part-0.csv only has 400k lines but were there other csv files?

I tried running this code but it has dependencies that are not here:

error[E0583]: file not found for module `datapipeline`
 --> src/main.rs:4:1
  |
4 | pub mod datapipeline;

Do you have this code in GitHub somewhere? I am happy to help debug if you have a public repro case.

@andygrove
Member

andygrove commented Oct 11, 2022

Here is a smaller repro case:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx: SessionContext = SessionContext::new();
    let raw_fato_path: &str = "/mnt/bigdata/census/Data8277.csv";
    let stage_fato_path: &str = "/tmp/stage";
    let fato_census_df = ctx.read_csv(raw_fato_path, CsvReadOptions::new()).await?;
    fato_census_df.write_csv(stage_fato_path).await?;
    Ok(())
}
$ wc -l /tmp/stage/part-0.csv 
425985 /tmp/stage/part-0.csv

I tested with DataFusion 11, 12, and 13, and all of them have the same issue.

@andygrove
Member

@Miyake-Diogo The issue is that this error is happening:

Error: ArrowError(ParseError("Error while parsing value CMB07601 for column 4 at line 431740"))

I recommend specifying the schema for the file since it contains mixed types for this column.

You did not see the error because of a bug that caused it to be silently ignored. The fix for that bug is in #3801.
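
To illustrate why the parse error only appears deep into the file: when no schema is given, the CSV reader infers column types from a leading sample of records, so a column that looks numeric at the top of the file can hit a text value much later. The following is a simplified, standard-library-only sketch of that failure mode. It is not DataFusion's actual inference code, and the names `ColumnType`, `infer_type`, and `first_parse_error` as well as the sample values are hypothetical.

```rust
/// Simplified stand-in for the column types relevant to this issue.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ColumnType {
    Int64,
    Utf8,
}

/// Infer a column type by looking only at the first `sample` values.
fn infer_type(values: &[&str], sample: usize) -> ColumnType {
    if values.iter().take(sample).all(|v| v.parse::<i64>().is_ok()) {
        ColumnType::Int64
    } else {
        ColumnType::Utf8
    }
}

/// Parse every value with the given type and return the 1-based
/// position of the first value that fails to parse, if any.
fn first_parse_error(values: &[&str], ty: ColumnType) -> Option<usize> {
    match ty {
        // Any value is a valid string, so nothing can fail.
        ColumnType::Utf8 => None,
        ColumnType::Int64 => values
            .iter()
            .position(|v| v.parse::<i64>().is_err())
            .map(|i| i + 1),
    }
}

fn main() {
    // Hypothetical "Area" column: numeric codes first, a text code later.
    let area = ["1", "2", "3", "4", "CMB07601", "5"];

    // Only the first three values are sampled, so the column looks numeric.
    let inferred = infer_type(&area, 3);
    println!("inferred type: {:?}", inferred);

    match first_parse_error(&area, inferred) {
        Some(pos) => println!("parse error at value {}: {}", pos, area[pos - 1]),
        None => println!("all values parsed"),
    }

    // Declaring the column as a string up front avoids the failure.
    assert_eq!(first_parse_error(&area, ColumnType::Utf8), None);
}
```

Declaring the mixed-type column as a string, as recommended above, removes the dependency on what the sampled rows happen to contain.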

@Miyake-Diogo
Author

Hi @andygrove, all the code is in this repo: https://gitlab.com/miyake-diogo/rust-big-data-playground
How can I specify the schema on read? I didn't find any example in the documentation...

@andygrove
Member

@Miyake-Diogo Apologies for the late reply, but schema can be set in CsvReadOptions.

The root issue of not writing all results was fixed in #3801
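
For anyone looking for an example, here is a minimal sketch of passing an explicit schema through `CsvReadOptions::schema`. It assumes the `datafusion` and `tokio` crates as dependencies and the API of the DataFusion versions discussed in this thread (11 to 13), where `write_csv` takes only a path. The column names are taken from the `select` call in the original report. Reading every column as `Utf8` is an assumption made here to sidestep the mixed-type values; the real types of the other columns are not confirmed in this thread.

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Assumption: declare every column as a string so that mixed values
    // such as "CMB07601" in Area or "..C" in count cannot fail to parse.
    let schema = Schema::new(vec![
        Field::new("Year", DataType::Utf8, true),
        Field::new("Age", DataType::Utf8, true),
        Field::new("Ethnic", DataType::Utf8, true),
        Field::new("Sex", DataType::Utf8, true),
        Field::new("Area", DataType::Utf8, true),
        Field::new("count", DataType::Utf8, true),
    ]);

    // With an explicit schema, no type inference is needed for this file.
    let options = CsvReadOptions::new().has_header(true).schema(&schema);

    let df = ctx
        .read_csv("data/minilake/raw/fato_census/Data8277.csv", options)
        .await?;

    // Columns can be cast to numeric types afterwards, once placeholder
    // values such as "..C" have been replaced.
    df.write_csv("data/minilake/stage/fato_census/").await?;

    Ok(())
}
```

Another option that may be enough for some files is raising `schema_infer_max_records` on `CsvReadOptions` so that inference sees more rows, although an explicit schema is the more predictable choice when a column mixes types.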

@Miyake-Diogo
Author

Don't worry @andygrove thanks for answering me.
