dbunt1tled/parquet2csv

CSV ⇄ Parquet Converter

A fast, reliable CLI tool for converting between CSV and Apache Parquet in either direction. Built in Go with the Cobra CLI framework, it is designed for data workflows that need efficient, schema-aware columnar storage.

Features

  • 🔄 Bidirectional conversion: CSV ↔ Parquet
  • ⚡ High performance: Batch processing with configurable flush intervals
  • 🗜️ Compression support: Multiple compression algorithms
  • 🎯 Schema-aware: Automatic schema detection and type inference
  • 📊 Verbose statistics: Runtime performance and memory usage reporting
  • 🛠️ Flexible CLI: Powered by Cobra with intuitive subcommands

Dependencies

  • Cobra CLI Framework: github.com/spf13/cobra v1.10.1
  • Parquet Processing: github.com/xitongsys/parquet-go v1.6.2
  • High-Performance JSON: github.com/bytedance/sonic v1.14.1
  • Error Handling: github.com/pkg/errors v0.9.1
  • String Utilities: github.com/iancoleman/strcase v0.3.0
  • Dynamic Structs: github.com/ompluscator/dynamic-struct v1.4.0

Installation

From Source

git clone https://github.com/dbunt1tled/parquet2csv.git
cd parquet2csv
go build -o csv2parquet main.go

Using Go Install

go install github.com/dbunt1tled/parquet2csv@latest

Command Reference

Global Commands

csv2parquet                     # Root command
  ├── parquet <input> <output>  # Convert CSV to Parquet
  └── csv <input> <output>      # Convert Parquet to CSV

Available Flags

Flag           Short  Type    Default  Description
--compression  -c     int     0        Compression type (0=UNCOMPRESSED, 1=SNAPPY, 2=GZIP, 3=LZO)
--delimiter    -d     string  ","      Field delimiter for CSV files
--flush        -f     int     10000    Number of rows to process before flushing to disk
--verbose      -v     bool    false    Show detailed statistics and performance metrics
--help         -h     bool    false    Display help information
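
To make the integer values accepted by --compression easier to read, here is a minimal sketch of how such a value could be validated and mapped to a codec name. The helper codecName and the map codecNames are hypothetical names used for illustration; they are not taken from this project's source, and the tool itself may handle out-of-range values differently.

```go
package main

import (
	"fmt"
)

// codecNames mirrors the values listed for --compression in the flag table.
// The map itself is an illustration, not the tool's internal representation.
var codecNames = map[int]string{
	0: "UNCOMPRESSED",
	1: "SNAPPY",
	2: "GZIP",
	3: "LZO",
}

// codecName is a hypothetical helper: it validates a --compression value
// and returns the codec name, or an error for values outside 0..3.
func codecName(v int) (string, error) {
	name, ok := codecNames[v]
	if !ok {
		return "", fmt.Errorf("unsupported compression value %d (expected 0-3)", v)
	}
	return name, nil
}

func main() {
	for _, v := range []int{0, 1, 2, 3, 7} {
		name, err := codecName(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("--compression %d selects %s\n", v, name)
	}
}
```

Rejecting unknown values up front keeps a typo such as --compression 7 from silently falling back to an unintended codec.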

Help Commands

./csv2parquet --help                   # General help
./csv2parquet parquet --help           # CSV to Parquet help
./csv2parquet csv --help               # Parquet to CSV help

Examples

Basic Conversion Examples

# CSV to Parquet with default settings
./csv2parquet parquet data.csv

# Parquet to CSV with custom delimiter
./csv2parquet csv data.parquet --delimiter ";"

# CSV to Parquet with compression and verbose output
./csv2parquet parquet large_dataset.csv --compression 1 --verbose

Advanced Usage

# Process large files with custom flush interval
./csv2parquet parquet big_file.csv big_file.parquet \
  --flush 50000 \
  --compression 2 \
  --verbose

# Convert with pipe delimiter and detailed stats
./csv2parquet csv analytics.parquet analytics.csv \
  --delimiter "|" \
  --flush 1000 \
  --verbose
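
The --verbose flag reports runtime and memory statistics. As a rough illustration of how figures like these can be gathered in Go, the sketch below wraps a unit of work with time.Since and runtime.ReadMemStats. The names runStats and measure are hypothetical, and this is not necessarily how the tool collects its own numbers.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// runStats holds the figures a verbose report might print (hypothetical type).
type runStats struct {
	Elapsed    time.Duration
	AllocBytes uint64 // bytes allocated on the heap while work ran (cumulative)
	NumGC      uint32 // garbage collection cycles completed while work ran
}

// measure runs work and records elapsed time plus allocation and GC deltas.
func measure(work func()) runStats {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()
	work()
	elapsed := time.Since(start)
	runtime.ReadMemStats(&after)
	return runStats{
		Elapsed:    elapsed,
		AllocBytes: after.TotalAlloc - before.TotalAlloc,
		NumGC:      after.NumGC - before.NumGC,
	}
}

func main() {
	stats := measure(func() {
		// Stand-in workload: build 1000 small rows in memory.
		rows := make([][]string, 0, 1000)
		for i := 0; i < 1000; i++ {
			rows = append(rows, []string{fmt.Sprint(i), "value"})
		}
		_ = rows
	})
	fmt.Printf("elapsed=%s allocated=%d bytes gc_runs=%d\n",
		stats.Elapsed, stats.AllocBytes, stats.NumGC)
}
```

TotalAlloc only ever grows, so the difference between two readings gives the bytes allocated in between, regardless of how much the garbage collector has since reclaimed.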

Performance Features

  • Batch Processing: Configurable row batch sizes for optimal memory usage
  • Compression: Support for multiple compression algorithms (SNAPPY, GZIP, LZO)
  • Memory Management: Efficient memory pooling and garbage collection
  • Progress Tracking: Runtime statistics including processing time and memory usage
  • Schema Optimization: Automatic type inference and schema generation

File Format Support

CSV Features

  • Custom delimiters (comma, semicolon, pipe, tab, etc.)
  • Header row detection and processing
  • Automatic type inference
  • Large file handling with streaming

Parquet Features

  • Columnar storage optimization
  • Schema preservation
  • Multiple compression algorithms
  • Efficient read/write operations
  • Row group size optimization (128MB default)

Development

Project Structure

├── cmd/                    # Cobra CLI commands
│   ├── root.go            # Root command definition
│   ├── csv2parquet.go     # CSV to Parquet conversion
│   └── parquet2csv.go     # Parquet to CSV conversion
├── internal/
│   ├── file/              # File operations and I/O
│   ├── helper/            # Utility functions
│   └── schema/            # Schema management
└── main.go                # Application entry point

Running Tests

go test ./...                 # Run all tests
go test -v ./...             # Verbose test output
go test -bench . ./...       # Run benchmarks

Building

go build -o csv2parquet main.go   # Build binary
make build                        # Using Makefile (if available)

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments
