Added some docstrings and updated README
laserson committed Jan 26, 2016
1 parent ef85aac commit 4608782
Showing 5 changed files with 164 additions and 24 deletions.
37 changes: 20 additions & 17 deletions README.md
@@ -1,45 +1,47 @@
 # impyla

-Python DBAPI 2.0 client for Impala/Hive distributed query engine.
+Python client for HiveServer2 implementations (e.g., Impala, Hive) for
+distributed query engines.

-For higher-level Impala functionality, see the [Ibis project][ibis].
+For higher-level Impala functionality, including a Pandas-like interface over
+distributed data sets, see the [Ibis project][ibis].

 ### Features

-* Lightweight, `pip`-installable package for connecting to Impala and Hive
-databases
+* HiveServer2 compliant; works with Impala and Hive, including nested data

 * Fully [DB API 2.0 (PEP 249)][pep249]-compliant Python client (similar to
 sqlite or MySQL clients) supporting Python 2.6+ and Python 3.3+.

-* Connects to HiveServer2; runs with Kerberos, LDAP, SSL
+* Works with Kerberos, LDAP, SSL

 * [SQLAlchemy][sqlalchemy] connector

 * Converter to [pandas][pandas] `DataFrame`, allowing easy integration into the
 Python data stack (including [scikit-learn][sklearn] and
-[matplotlib][matplotlib])
+[matplotlib][matplotlib]); but see the [Ibis project][ibis] for a richer
+experience

 ### Dependencies

 Required:

 * Python 2.6+ or 3.3+

-* `six`
+* `six`, `bit_array`

-* `thrift_sasl`
+* `thrift` (on Python 2.x) or `thriftpy` (on Python 3.x)

-* `bit_array`
+For Hive and/or Kerberos support:

-* `thrift` (on Python 2.x) or `thriftpy` (on Python 3.x)
+* `thrift_sasl`

-Optional:
+* `python-sasl` (for Python 3.x support, requires
+[cloudera/python-sasl@cython][python-sasl-cython] branch)

-* `pandas` for conversion to `DataFrame` objects
+Optional:

-* `python-sasl` for Kerberos support (for Python 3.x support, requires
-laserson/python-sasl@cython)
+* `pandas` for conversion to `DataFrame` objects; but see the [Ibis project][ibis] instead

 * `sqlalchemy` for the SQLAlchemy engine

@@ -54,7 +56,7 @@ Install the latest release (`0.12.0`) with `pip`:
 pip install impyla
 ```

-For the latest (dev) version, clone the repo:
+For the latest (dev) version, install directly from the repo:

 ```bash
 pip install git+https://github.com/cloudera/impyla.git
@@ -89,7 +91,7 @@ py.test --connect impyla
 Leave out the `--connect` option to skip tests for DB API compliance.


-### Quickstart
+### Usage

 Impyla implements the [Python DB API v2.0 (PEP 249)][pep249] database interface
 (refer to it for API details):
@@ -99,7 +101,7 @@ from impala.dbapi import connect
 conn = connect(host='my.host.com', port=21050)
 cursor = conn.cursor()
 cursor.execute('SELECT * FROM mytable LIMIT 100')
-print cursor.description # prints the result set's schema
+print cursor.description  # prints the result set's schema
 results = cursor.fetchall()
 ```

@@ -132,3 +134,4 @@ df = as_pandas(cur)
 [pytest]: http://pytest.org/latest/
 [sqlalchemy]: http://www.sqlalchemy.org/
 [ibis]: http://www.ibis-project.org/
+[python-sasl-cython]: https://github.com/laserson/python-sasl/tree/cython/sasl
62 changes: 62 additions & 0 deletions impala/dbapi.py
@@ -42,6 +42,68 @@ def connect(host='localhost', port=21050, database=None, timeout=None,
             password=None, kerberos_service_name='impala', use_ldap=None,
             ldap_user=None, ldap_password=None, use_kerberos=None,
             protocol=None):
+    """Get a connection to HiveServer2 (HS2).
+
+    These options are largely compatible with the impala-shell command line
+    arguments. See those docs for more information.
+
+    Parameters
+    ----------
+    host : str
+        The hostname for HS2. For Impala, this can be any of the `impalad`s.
+    port : int, optional
+        The port number for HS2. The Impala default is 21050. The Hive port is
+        likely different.
+    database : str, optional
+        The default database. If `None`, the result is
+        implementation-dependent.
+    timeout : int, optional
+        Connection timeout in seconds. Default is no timeout.
+    use_ssl : bool, optional
+        Enable SSL.
+    ca_cert : str, optional
+        Local path to the third-party CA certificate. If SSL is enabled but
+        the certificate is not specified, the server certificate will not be
+        validated.
+    auth_mechanism : {'NOSASL', 'PLAIN', 'GSSAPI', 'LDAP'}
+        Specify the authentication mechanism. `'NOSASL'` for unsecured Impala.
+        `'PLAIN'` for unsecured Hive (because Hive requires the SASL
+        transport). `'GSSAPI'` for Kerberos and `'LDAP'` for Kerberos with
+        LDAP.
+    user : str, optional
+        LDAP user, if applicable.
+    password : str, optional
+        LDAP password, if applicable.
+    kerberos_service_name : str, optional
+        Authenticate to a particular `impalad` service principal. Uses
+        `'impala'` by default.
+    use_ldap : bool, optional
+        Specify `auth_mechanism='LDAP'` instead.
+
+        .. deprecated:: 0.11.0
+    ldap_user : str, optional
+        Use `user` parameter instead.
+
+        .. deprecated:: 0.11.0
+    ldap_password : str, optional
+        Use `password` parameter instead.
+
+        .. deprecated:: 0.11.0
+    use_kerberos : bool, optional
+        Specify `auth_mechanism='GSSAPI'` instead.
+
+        .. deprecated:: 0.11.0
+    protocol : str, optional
+        Do not use. HiveServer2 is the only protocol currently supported.
+
+        .. deprecated:: 0.11.0
+
+    Returns
+    -------
+    HiveServer2Connection
+        A `Connection` object (DB API 2.0-compliant).
+    """
     # pylint: disable=too-many-locals
     if use_kerberos is not None:
         warn_deprecate('use_kerberos', 'auth_mechanism="GSSAPI"')
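
Note on the deprecated keywords: the docstring above maps `use_kerberos`, `use_ldap`, `ldap_user`, and `ldap_password` onto `auth_mechanism`, `user`, and `password`. A minimal sketch of that mapping follows. It assumes a hypothetical helper, `modernize_connect_kwargs`, which is not part of impyla and only illustrates how legacy call sites might be migrated.

```python
import warnings


def modernize_connect_kwargs(**kwargs):
    """Translate deprecated connect() keywords to their documented replacements.

    Hypothetical helper for illustration only; it is not part of impyla.
    """
    kwargs = dict(kwargs)  # work on a copy, leave the caller's dict alone

    # Deprecated boolean flags and the auth_mechanism value that replaces them.
    for flag, mechanism in (('use_kerberos', 'GSSAPI'), ('use_ldap', 'LDAP')):
        if kwargs.pop(flag, None):
            warnings.warn("%s is deprecated; use auth_mechanism='%s'"
                          % (flag, mechanism), DeprecationWarning)
            kwargs['auth_mechanism'] = mechanism

    # Deprecated credential names and the parameters that replace them.
    for old, new in (('ldap_user', 'user'), ('ldap_password', 'password')):
        value = kwargs.pop(old, None)
        if value is not None:
            warnings.warn('%s is deprecated; use %s' % (old, new),
                          DeprecationWarning)
            kwargs[new] = value

    # The docstring says HiveServer2 is the only supported protocol.
    kwargs.pop('protocol', None)
    return kwargs


legacy = dict(host='my.host.com', port=21050, use_ldap=True,
              ldap_user='alice', ldap_password='secret')

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    modern = modernize_connect_kwargs(**legacy)
```

The resulting dictionary could then be passed on as `connect(**modern)`.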
73 changes: 67 additions & 6 deletions impala/hiveserver2.py
@@ -77,6 +77,23 @@ def rollback(self):
         raise NotSupportedError

     def cursor(self, user=None, configuration=None, convert_types=True):
+        """Get a cursor from the HiveServer2 (HS2) connection.
+
+        Parameters
+        ----------
+        user : str, optional
+        configuration : dict of str keys and values, optional
+            Configuration overlay for the HS2 session.
+        convert_types : bool, optional
+            When `False`, timestamps and decimal values will not be converted
+            to Python `datetime` and `Decimal` values. (These conversions are
+            expensive.)
+
+        Returns
+        -------
+        HiveServer2Cursor
+            A `Cursor` object (DB API 2.0-compliant).
+        """
         # PEP 249
         log.debug('Getting a cursor (Impala session)')

@@ -101,6 +118,10 @@ def cursor(self, user=None, configuration=None, convert_types=True):


 class HiveServer2Cursor(Cursor):
+    """The DB API 2.0 Cursor object.
+
+    See the PEP 249 specification for more details.
+    """
     # PEP 249
     # HiveServer2Cursor objects are associated with a Session
     # they are instantiated with alive session_handles
@@ -200,6 +221,26 @@ def _reset_state(self):
         self._last_operation = None

     def execute(self, operation, parameters=None, configuration=None):
+        """Synchronously execute a SQL query.
+
+        Blocks until results are available.
+
+        Parameters
+        ----------
+        operation : str
+            The SQL query to execute.
+        parameters : str, optional
+            Parameters to be bound to variables in the SQL query, if any.
+            Impyla supports all DB API `paramstyle`s, including `qmark`,
+            `numeric`, `named`, `format`, `pyformat`.
+        configuration : dict of str keys and values, optional
+            Configuration overlay for this query.
+
+        Returns
+        -------
+        NoneType
+            Results are available through a call to `fetch*`.
+        """
         # PEP 249
         self.execute_async(operation, parameters=parameters,
                            configuration=configuration)
@@ -208,6 +249,27 @@ def execute(self, operation, parameters=None, configuration=None):
         log.debug('Query finished')

     def execute_async(self, operation, parameters=None, configuration=None):
+        """Asynchronously execute a SQL query.
+
+        Immediately returns after query is sent to the HS2 server. Poll with
+        `is_executing`. A call to `fetch*` will block.
+
+        Parameters
+        ----------
+        operation : str
+            The SQL query to execute.
+        parameters : str, optional
+            Parameters to be bound to variables in the SQL query, if any.
+            Impyla supports all DB API `paramstyle`s, including `qmark`,
+            `numeric`, `named`, `format`, `pyformat`.
+        configuration : dict of str keys and values, optional
+            Configuration overlay for this query.
+
+        Returns
+        -------
+        NoneType
+            Results are available through a call to `fetch*`.
+        """
         log.debug('Executing query %s', operation)

         def op():
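
Note on the polling pattern: the `execute_async` docstring says to poll with `is_executing` before fetching. A minimal sketch of such a loop is below. `FakeAsyncCursor` and `wait_for_query` are hypothetical names introduced here so the example runs without a server, and the sketch assumes `is_executing` is callable as a method; a real cursor obtained from `conn.cursor()` would take the stand-in's place.

```python
import time


class FakeAsyncCursor(object):
    """Stand-in for a cursor exposing execute_async/is_executing/fetchall.

    Illustration only: it reports "still running" for a fixed number of
    polls and never talks to a server.
    """

    def __init__(self, running_polls=2):
        self._running_polls = running_polls
        self._rows = []

    def execute_async(self, operation, parameters=None, configuration=None):
        # Pretend the query was submitted; canned rows become the result set.
        self._rows = [(1,), (2,)]

    def is_executing(self):
        if self._running_polls > 0:
            self._running_polls -= 1
            return True
        return False

    def fetchall(self):
        return self._rows


def wait_for_query(cursor, poll_interval=0.01, max_polls=1000):
    """Poll until the query stops executing; return how many polls saw it running."""
    polls = 0
    while cursor.is_executing():
        polls += 1
        if polls >= max_polls:
            raise RuntimeError('query still running after %d polls' % polls)
        time.sleep(poll_interval)
    return polls


cursor = FakeAsyncCursor(running_polls=2)
cursor.execute_async('SELECT * FROM mytable LIMIT 100')
polls = wait_for_query(cursor)
rows = cursor.fetchall()
```

Bounding the loop with `max_polls` is a design choice of this sketch, so that a stuck query surfaces as an error instead of spinning forever.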
@@ -344,10 +406,10 @@ def fetchcolumnar(self):
                                    "fetching")
         batches = []
         while True:
-            batch = (self._last_operation
-                     .fetch(self.description,
-                            self.buffersize,
-                            convert_types=self.convert_types))
+            batch = (self._last_operation.fetch(
+                self.description,
+                self.buffersize,
+                convert_types=self.convert_types))
             if len(batch) == 0:
                 break
             batches.append(batch)
@@ -389,8 +451,7 @@ def __next__(self):
             raise StopIteration

     def ping(self):
-        """Checks connection to server by requesting some info from the
-        server."""
+        """Checks connection to server by requesting some info."""
         log.info('Pinging the impalad')
         return self.session.ping()

2 changes: 1 addition & 1 deletion impala/interface.py
@@ -249,7 +249,7 @@ def _bind_parameters_dict(operation, parameters):


 def _bind_parameters(operation, parameters):
-    # If parameters is a list, assume either qmark or numeric
+    # If parameters is a list, assume either qmark, format, or numeric
     # format. If not, assume either named or pyformat parameters
     if isinstance(parameters, (list, tuple)):
         return _bind_parameters_list(operation, parameters)
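
Note on the dispatch: the comment above describes choosing a binding strategy from the type of `parameters`. The sketch below illustrates that idea for two of the paramstyles (`qmark` and `named`). It is not impyla's implementation: `quote`, `bind_list`, `bind_dict`, and `bind` are hypothetical names, the quoting rule is a simplification, and the regular expressions do not skip placeholders that appear inside string literals.

```python
import re


def quote(value):
    """Render a Python value as a SQL literal (simplified, illustration only)."""
    if value is None:
        return 'NULL'
    if isinstance(value, str):
        # Assumption: escape single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def bind_list(operation, parameters):
    """Substitute qmark (?) placeholders in positional order."""
    values = iter(parameters)
    return re.sub(r'\?', lambda match: quote(next(values)), operation)


def bind_dict(operation, parameters):
    """Substitute named (:name) placeholders by key."""
    return re.sub(r':(\w+)',
                  lambda match: quote(parameters[match.group(1)]),
                  operation)


def bind(operation, parameters):
    # Lists and tuples are treated as positional; anything else as a mapping.
    if isinstance(parameters, (list, tuple)):
        return bind_list(operation, parameters)
    return bind_dict(operation, parameters)


by_position = bind('SELECT * FROM t WHERE a = ? AND b = ?', [1, 'x'])
by_name = bind('SELECT * FROM t WHERE a = :a', {'a': "o'k"})
```

String substitution of this kind should not be relied on to defend against SQL injection; the sketch only shows how the type of `parameters` selects the code path.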
14 changes: 14 additions & 0 deletions impala/util.py
@@ -40,6 +40,20 @@ def get_logger_and_init_null(logger_name):


 def as_pandas(cursor):
+    """Return a pandas `DataFrame` out of an impyla cursor.
+
+    This will pull the entire result set into memory. For richer pandas-like
+    functionality on distributed data sets, see the Ibis project.
+
+    Parameters
+    ----------
+    cursor : `HiveServer2Cursor`
+        The cursor object that has a result set waiting to be fetched.
+
+    Returns
+    -------
+    DataFrame
+    """
     from pandas import DataFrame  # pylint: disable=import-error
     names = [metadata[0] for metadata in cursor.description]
     return DataFrame.from_records(cursor.fetchall(), columns=names)
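
Note on `cursor.description`: `as_pandas` takes the first element of each `description` entry as the column name, which is the layout PEP 249 specifies. The sketch below shows the same extraction on the standard-library `sqlite3` module, used here only as a stand-in DB API 2.0 driver so the example runs without Impala or pandas. The table and its contents are made up for illustration.

```python
import sqlite3

# In-memory database as a stand-in for a HiveServer2 connection.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (id INTEGER, name TEXT)')
cursor.executemany('INSERT INTO mytable VALUES (?, ?)', [(1, 'a'), (2, 'b')])
cursor.execute('SELECT id, name FROM mytable ORDER BY id')

# Same extraction as in as_pandas: the column name is item 0 of each entry.
names = [metadata[0] for metadata in cursor.description]

# fetchall() pulls the whole result set into memory, as the docstring warns.
rows = cursor.fetchall()
records = [dict(zip(names, row)) for row in rows]
conn.close()
```

With pandas installed, `DataFrame.from_records(rows, columns=names)` would build the frame from the same two pieces, as the function body above does.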