High-performance ODBC to Arrow data conversion for Python, built with Rust.
- 🚀 High Performance: Built with Rust for maximum speed
- 🔄 ODBC Integration: Direct connection to any ODBC-compatible database
- 📊 Arrow Format: Native Apache Arrow support for efficient data processing
- 🐼 Pandas/Polars Ready: Seamless integration with popular Python data libraries
- 🛡️ Type Safe: Rust-powered reliability with Python convenience
- 🎯 Two-Level API: Simple wrappers for common use + raw functions for advanced control
```bash
pip install ibarrow
```

- GitHub: https://github.com/thomazyujibaba/ibarrow
- PyPI: https://pypi.org/project/ibarrow/
- Documentation: https://github.com/thomazyujibaba/ibarrow#readme
Important: You need an ODBC driver installed on your system for ibarrow to work.
Windows:
- SQL Server: ODBC Driver for SQL Server
- PostgreSQL: psqlODBC
- MySQL: MySQL Connector/ODBC
- Oracle: Oracle Instant Client + ODBC

Linux:
- SQL Server: Microsoft ODBC Driver for SQL Server on Linux
- PostgreSQL: `sudo apt-get install odbc-postgresql` (Ubuntu/Debian) or `sudo yum install postgresql-odbc` (RHEL/CentOS)
- MySQL: `sudo apt-get install libmyodbc` (Ubuntu/Debian) or `sudo yum install mysql-connector-odbc` (RHEL/CentOS)
- Note: macOS support is currently not available. Please use Windows or Linux for now.
You can verify that your ODBC drivers are installed by checking the system:

Windows:

```bash
# Open the ODBC Data Source Administrator to check installed drivers
odbcad32.exe
```

Linux/macOS:

```bash
# List available drivers
odbcinst -q -d
```

ibarrow provides a two-level API designed for different user needs:
High-level (simple wrappers):
- `query_polars()`: Direct Polars DataFrame (zero-copy, fastest)
- `query_pandas()`: Direct Pandas DataFrame (maximum compatibility)

Low-level (advanced control):
- `query_arrow_ipc()`: Raw Arrow IPC bytes (maximum compatibility)
- `query_arrow_c_data()`: Raw Arrow C Data Interface (maximum performance)
| User Type | Recommended Function | Use Case |
|---|---|---|
| Beginners | `query_polars()` | 95% of cases - simple and fast |
| Pandas Users | `query_pandas()` | When you need Pandas compatibility |
| Advanced Users | `query_arrow_ipc()` | When you need raw Arrow data |
| Performance Critical | `query_arrow_c_data()` | When you need maximum control |
ibarrow supports two ways to connect to databases:
```python
# Requires a pre-configured DSN in ODBC Data Sources
conn = ibarrow.connect(
    dsn="my_database_dsn",
    user="username",
    password="password"
)
```

```python
# Direct connection like pyodbc - no DSN configuration needed
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)
```

```python
import ibarrow
# Option 1: Using DSN (Data Source Name)
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Option 2: Using direct connection string (like pyodbc)
conn = ibarrow.connect(
    dsn="DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;",
    user="username",
    password="password"
)

# Query and get Polars DataFrame
df = conn.query_polars("SELECT * FROM your_table")
print(df)
```

```python
import ibarrow
# Create config with custom batch size
config = ibarrow.QueryConfig(batch_size=2000)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Query with custom batch size
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
```

```python
import ibarrow
# Create custom configuration
config = ibarrow.QueryConfig(
    batch_size=2000,                  # Rows per batch
    read_only=True,                   # Read-only connection
    connection_timeout=30,            # Connection timeout in seconds
    query_timeout=60,                 # Query timeout in seconds
    max_text_size=32768,              # Max text field size
    max_binary_size=16384,            # Max binary field size
    isolation_level="read_committed"  # Transaction isolation
)

# Create connection with configuration
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password",
    config=config
)

# Use the connection
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")
```

```python
import ibarrow
# Direct conversion to Polars/Pandas DataFrames (uses pl.read_ipc internally)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame
df_polars = conn.query_polars("SELECT * FROM your_table")

# Get Pandas DataFrame
df_pandas = conn.query_pandas("SELECT * FROM your_table")

print(df_polars)
print(df_pandas)
```

For maximum performance, use the Arrow C Data Interface functions that completely eliminate serialization:
```python
import ibarrow
import polars as pl
import pyarrow as pa

# Zero-copy conversion to Polars DataFrame (fastest)
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Polars DataFrame directly
df_polars = conn.query_arrow_c_data("SELECT * FROM your_table", return_dataframe=True)

# Or get raw PyCapsules for manual control
schema_capsule, array_capsule = conn.query_arrow_c_data("SELECT * FROM your_table")

# Convert to PyArrow Table using zero-copy
schema = pa.Schema._import_from_c(schema_capsule)
array = pa.Array._import_from_c(array_capsule)
table = pa.Table.from_arrays([array], schema=schema)

# Convert to Polars
df = pl.from_arrow(table)
```

Arrow C Data Interface Benefits:
- 🚀 Zero serialization: Data passes directly via pointers
- 💾 Zero copies: Eliminates memory overhead
- ⚡ Maximum speed: Ideal for large datasets
- 🔄 Compatibility: Works with PyArrow, Polars, Pandas
```python
import ibarrow
import polars as pl

# Get raw Arrow IPC bytes
# Create connection
conn = ibarrow.connect(
    dsn="your_dsn",
    user="username",
    password="password"
)

# Get Arrow IPC bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Convert to Polars DataFrame manually
df = pl.read_ipc(arrow_bytes)
print(df)
```

`ibarrow.connect(dsn, user, password, config=None)`: Creates a connection object for database operations.
Parameters:
- `dsn` (str): ODBC Data Source Name or full connection string
  - DSN format: `"your_dsn"` (requires a pre-configured DSN)
  - Connection string format: `"DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"` (direct connection)
- `user` (str): Database username
- `password` (str): Database password
- `config` (QueryConfig, optional): Configuration object
Returns: IbarrowConnection object
Connection String Examples:
```python
# SQL Server
dsn = "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;"

# PostgreSQL
dsn = "DRIVER={PostgreSQL};SERVER=localhost;PORT=5432;DATABASE=mydb;"

# MySQL
dsn = "DRIVER={MySQL ODBC 8.0 Driver};SERVER=localhost;PORT=3306;DATABASE=mydb;"

# Oracle
dsn = "DRIVER={Oracle in OraClient19Home1};DBQ=localhost:1521/XE;"
```

`conn.query_arrow_ipc(sql)`: Execute a SQL query and return Arrow IPC bytes.
Parameters:
- `sql` (str): SQL query to execute

Returns: `bytes` - Arrow IPC format data

Raises:
- `PyConnectionError`: Database connection issues
- `PySQLError`: SQL syntax or execution errors
- `PyArrowError`: Arrow data processing errors
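Because the return value is a standard Arrow IPC stream, any Arrow-aware library can consume it, not just Polars. A minimal sketch using PyArrow (the DSN, credentials, and table name are placeholders):

```python
import pyarrow as pa
import ibarrow

conn = ibarrow.connect(dsn="your_dsn", user="username", password="password")

# Raw Arrow IPC stream as bytes
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Read the stream with PyArrow and materialize it as a Table
table = pa.ipc.open_stream(arrow_bytes).read_all()
print(table.schema)
```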
`conn.query_polars(sql)`: Execute a SQL query and return a Polars DataFrame directly.

Parameters: Same as `query_arrow_ipc`

Returns: `polars.DataFrame` - Ready-to-use DataFrame

Note: Uses `pl.read_ipc()` directly with the IPC bytes for optimal performance.
`conn.query_pandas(sql)`: Execute a SQL query and return a Pandas DataFrame directly.

Parameters: Same as `query_arrow_ipc`

Returns: `pandas.DataFrame` - Ready-to-use DataFrame

Note: Converts Arrow IPC to Pandas via PyArrow for compatibility.
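The note above corresponds roughly to the following conversion path, sketched here for illustration only (the library's internal code may differ):

```python
import pyarrow as pa

def ipc_to_pandas(arrow_bytes: bytes):
    """Illustrative sketch: Arrow IPC bytes -> pyarrow.Table -> pandas.DataFrame."""
    table = pa.ipc.open_stream(arrow_bytes).read_all()
    return table.to_pandas()
```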
`ibarrow.QueryConfig`: Configuration class for advanced query settings.

Parameters:
- `batch_size` (int, optional): Number of rows per batch for processing (default: 1000)
- `read_only` (bool, optional): Read-only connection to avoid locks (default: True)
- `connection_timeout` (int, optional): Connection timeout in seconds
- `query_timeout` (int, optional): Query timeout in seconds
- `max_text_size` (int, optional): Maximum text field size in bytes (default: 65536)
- `max_binary_size` (int, optional): Maximum binary field size in bytes (default: 65536)
- `isolation_level` (str, optional): Transaction isolation level. Supported values: "read_uncommitted", "read_committed", "repeatable_read", "serializable", "snapshot"
What each setting controls:
- `batch_size`: Controls memory usage and performance. Larger batches = more memory but faster processing
- `read_only`: Prevents locks and improves performance for read-only operations (effective only if the ODBC driver supports this flag)
- `connection_timeout`: Protects against hanging connections
- `query_timeout`: Prevents long-running queries from blocking
- `max_text_size`: Handles large text fields (VARCHAR, TEXT) efficiently
- `max_binary_size`: Handles large binary data (BLOB, VARBINARY) efficiently
- `isolation_level`: Controls transaction isolation for concurrent access
Implementation notes:
- `read_only`: Currently implemented via the ODBC connection string (`ReadOnly=1`).
- `batch_size`: Controls how many rows are fetched per batch from the database, avoiding row-by-row fetching for better performance.
- `query_timeout`: Implemented via the statement handle using `stmt.set_query_timeout()`, which is more reliable than connection string timeouts.
- `isolation_level`: Standardized mapping from common names (e.g., "read_committed") to driver-specific ODBC connection string values (e.g., "Isolation Level=ReadCommitted"); an illustrative mapping is sketched after this list.
- `query_polars`: Uses an Arrow IPC stream with `pl.read_ipc()` for maximum compatibility and performance.
- Native Types: Always preserves ODBC native types (INT, DECIMAL, FLOAT) as Arrow native types (Int64Array, Float64Array), avoiding expensive string conversions for maximum performance.
- Pipelining: Always processes data in streaming fashion, writing each batch immediately as it's fetched. This keeps memory usage constant (e.g., 10MB) regardless of dataset size (even 80GB+).
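For illustration, the standardized isolation mapping described above looks roughly like the dict below. Only the `read_committed` pair appears in this document; the remaining values are extrapolated, and the exact strings are driver-specific:

```python
# Illustrative only - NOT the library's actual lookup table.
# Common isolation names -> ODBC connection string attributes.
ISOLATION_LEVELS = {
    "read_uncommitted": "Isolation Level=ReadUncommitted",  # extrapolated
    "read_committed": "Isolation Level=ReadCommitted",      # documented example
    "repeatable_read": "Isolation Level=RepeatableRead",    # extrapolated
    "serializable": "Isolation Level=Serializable",         # extrapolated
    "snapshot": "Isolation Level=Snapshot",                 # extrapolated
}
```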
| Method | Level | Serialization | Memory Copies | Performance | Ideal Use |
|---|---|---|---|---|---|
| `query_polars` | High | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐⭐ | 95% of cases (recommended) |
| `query_pandas` | High | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Pandas compatibility |
| `query_arrow_ipc` | Low | Arrow IPC Stream | 1x (serialization) | ⭐⭐⭐ | Maximum compatibility |
| `query_arrow_c_data` | Low | Zero | Zero | ⭐⭐⭐⭐⭐ | Maximum performance |
Indicative timings (1M numeric rows; a reproduction sketch follows):
- `query_polars`: ~200ms (Arrow IPC + `polars.read_ipc`) ⭐ RECOMMENDED
- `query_pandas`: ~600ms (Arrow IPC + pyarrow + pandas)
- `query_arrow_ipc`: ~500ms (Arrow IPC serialization)
- `query_arrow_c_data`: ~50ms (zero-copy via pointers)
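To reproduce this kind of comparison on your own data, a simple timing harness is enough (DSN, credentials, and table are placeholders; absolute numbers vary with driver, network, and hardware):

```python
import time
import ibarrow

conn = ibarrow.connect(dsn="your_dsn", user="username", password="password")
sql = "SELECT * FROM your_table"

for name in ("query_polars", "query_pandas", "query_arrow_ipc"):
    start = time.perf_counter()
    getattr(conn, name)(sql)  # Run the same query through each API
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```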
Native Types (Always Enabled):
- INT columns → Int64Array (native Arrow types)
- DECIMAL columns → DecimalArray (native Arrow types)
- FLOAT columns → Float64Array (native Arrow types)
- Performance: ~200ms for 1M numeric rows (Arrow IPC)
Pipelining (Always Enabled):
- Memory usage: Constant (~10MB) regardless of dataset size
- Processing: Streaming (fetch + write immediately)
- Latency: Lower (Python can start consuming data before completion)
- Example: 80GB dataset uses only ~10MB RAM
Recommendations:
- `query_polars()`: 95% of cases - simple, fast, zero-copy
- `query_pandas()`: When you need Pandas compatibility
- `query_arrow_ipc()`: Maximum compatibility, save to disk (see the sketch below)
- `query_arrow_c_data()`: Maximum performance, full control over data
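Since `query_arrow_ipc()` returns plain Arrow IPC bytes, saving results to disk is straightforward. A minimal sketch (file and table names are placeholders):

```python
import ibarrow
import polars as pl

conn = ibarrow.connect(dsn="your_dsn", user="username", password="password")
arrow_bytes = conn.query_arrow_ipc("SELECT * FROM your_table")

# Persist the raw IPC payload as-is
with open("result.arrow", "wb") as f:
    f.write(arrow_bytes)

# Later: read it back with any Arrow-aware library
df = pl.read_ipc("result.arrow")
```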
The library provides specific exception types for different error scenarios:
```python
import ibarrow

try:
    # Create connection
    conn = ibarrow.connect(dsn, user, password)
    # Run the query
    df = conn.query_polars(sql)
except ibarrow.PyConnectionError as e:
    print(f"Connection failed: {e}")
except ibarrow.PySQLError as e:
    print(f"SQL error: {e}")
except ibarrow.PyArrowError as e:
    print(f"Arrow processing error: {e}")
```

Requirements:
- Python 3.8+
- ODBC driver for your database
- Rust toolchain (for development)
```bash
# Clone the repository
git clone https://github.com/thomazyujibaba/ibarrow.git
cd ibarrow

# Install maturin
pip install "maturin[patchelf]"

# Install in development mode
maturin develop
```

```bash
# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/ -v
```

```bash
# Build wheel
maturin build --release

# Build and install
maturin develop
```

MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
"Driver not found" errors:
- Ensure the ODBC driver is properly installed
- Check that the driver name in your DSN matches exactly (the snippet below lists installed driver names)
- Verify the driver architecture (32-bit vs 64-bit) matches your Python installation
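One quick way to see the exact installed driver names from Python is pyodbc (a separate package, not required by ibarrow):

```python
# Requires: pip install pyodbc
import pyodbc

# Prints installed ODBC driver names exactly as they must appear
# inside a DRIVER={...} connection string
for name in pyodbc.drivers():
    print(name)
```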
Connection timeout errors:
- Check network connectivity to the database server
- Verify firewall settings
- Ensure the database server is running and accessible
Permission errors:
- Verify database credentials
- Check user permissions on the database
- Ensure the ODBC driver has necessary privileges
Performance issues:
- Adjust `batch_size` in `QueryConfig` for optimal memory usage
- Use `read_only=True` for read-only operations
- Consider connection pooling for high-frequency queries (see the sketch below)
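This README does not document a built-in pool, so here is a minimal do-it-yourself sketch using only the standard library (all names and sizes are placeholders):

```python
import queue
import ibarrow

class SimpleConnectionPool:
    """Naive fixed-size pool of ibarrow connections."""

    def __init__(self, size, **connect_kwargs):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(ibarrow.connect(**connect_kwargs))

    def query_polars(self, sql):
        conn = self._pool.get()      # Block until a connection is free
        try:
            return conn.query_polars(sql)
        finally:
            self._pool.put(conn)     # Always return the connection

pool = SimpleConnectionPool(4, dsn="your_dsn", user="username", password="password")
df = pool.query_polars("SELECT * FROM your_table")
```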
For issues and questions, please use the GitHub Issues page.