546 changes: 546 additions & 0 deletions datafusion/_modules/datafusion.html

Large diffs are not rendered by default.

5 changes: 3 additions & 2 deletions datafusion/_modules/index.html
@@ -397,8 +397,9 @@

<h1>All modules for which code is available</h1>
<ul><li><a href="builtins.html">builtins</a></li>
<li><a href="datafusion/functions.html">datafusion.functions</a></li>
<li><a href="functions.html">functions</a></li>
<li><a href="datafusion.html">datafusion</a></li>
<ul><li><a href="datafusion/functions.html">datafusion.functions</a></li>
</ul><li><a href="functions.html">functions</a></li>
</ul>

</div>
18 changes: 18 additions & 0 deletions datafusion/_sources/cli/index.rst.txt
@@ -23,6 +23,24 @@ The Arrow DataFusion CLI is a command-line interactive SQL utility that allows
queries to be executed against CSV and Parquet files. It is a convenient way to
try DataFusion out with your own data sources.

Install and run using Homebrew (on macOS)
=========================================

The easiest way to try the DataFusion CLI on macOS is via Homebrew. Install it like any other pre-built package:

.. code-block:: bash

    brew install datafusion
    # ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/manifests/5.0.0
    # ######################################################################## 100.0%
    # ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/blobs/sha256:9ecc8a01be47ceb9a53b39976696afa87c0a8
    # ==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:9ecc8a01be47ceb9a53b39976
    # ######################################################################## 100.0%
    # ==> Pouring datafusion--5.0.0.big_sur.bottle.tar.gz
    # 🍺 /usr/local/Cellar/datafusion/5.0.0: 9 files, 17.4MB

    datafusion-cli

Run using Cargo
===============

13 changes: 6 additions & 7 deletions datafusion/_sources/python/index.rst.txt
@@ -39,11 +39,10 @@ Simple usage:
.. code-block:: python

import datafusion
from datafusion import functions as f
from datafusion import col
import pyarrow

# an alias
f = datafusion.functions

# create a context
ctx = datafusion.ExecutionContext()

@@ -56,8 +55,8 @@ Simple usage:

# create a new statement
df = df.select(
f.col("a") + f.col("b"),
f.col("a") - f.col("b"),
col("a") + col("b"),
col("a") - col("b"),
)

# execute and collect the first (and only) batch
@@ -77,7 +76,7 @@ UDFs

udf = f.udf(is_null, [pyarrow.int64()], pyarrow.bool_())

df = df.select(udf(f.col("a")))
df = df.select(udf(col("a")))


UDAF
@@ -117,7 +116,7 @@ UDAF

df = df.aggregate(
[],
[udaf(f.col("a"))]
[udaf(col("a"))]
)


27 changes: 24 additions & 3 deletions datafusion/_sources/specification/roadmap.md.txt
@@ -61,6 +61,7 @@ to provide:
- Additional constant folding / partial evaluation [#1070](https://github.com/apache/arrow-datafusion/issues/1070)
- More sophisticated cost based optimizer for join ordering
- Implement advanced query optimization framework (Tokomak) #440
- Finer optimizations for group by and aggregate functions

## Datasources

@@ -92,8 +93,28 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](h
- publishing to apt, brew, and possibly a NuGet registry so that people can use it more easily
- adopt a shorter name, like dfcli?

## Ballista
# Ballista

# Vision
Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
in the cluster.

TBD
Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
remain language-agnostic so that executors can be built in languages other than Rust.

## Ballista Roadmap

## Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
the entire query at once. Instead, it breaks the query down into a directed acyclic graph (DAG) of stages and executes a
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
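
The idea of running a bounded number of ready stages at a time can be sketched in a few lines of Python. This is only an illustration of DAG-ordered scheduling, not Ballista's actual scheduler: the function `schedule_stages`, the `max_concurrent` parameter, and the stage names are all hypothetical.

```python
def schedule_stages(dependencies, max_concurrent):
    """Group stages into waves that respect the DAG order.

    `dependencies` maps a stage id to the set of stage ids it depends on
    (a hypothetical representation). Each wave holds at most
    `max_concurrent` stages whose dependencies have all completed.
    """
    remaining = {stage: set(deps) for stage, deps in dependencies.items()}
    completed = set()
    waves = []
    while remaining:
        # A stage is ready once every stage it depends on has completed.
        ready = sorted(s for s, deps in remaining.items() if deps <= completed)
        if not ready:
            raise ValueError("unsatisfiable dependencies (cycle or unknown stage)")
        wave = ready[:max_concurrent]
        waves.append(wave)
        for stage in wave:
            del remaining[stage]
        completed.update(wave)
    return waves


# Hypothetical plan: two scans feed a join, which feeds an aggregate.
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}
waves = schedule_stages(plan, max_concurrent=2)
print(waves)
```

With `max_concurrent=2` both scans land in the first wave; with `max_concurrent=1` they run one after the other. In principle the same loop could drive either threads in one process or executors in a cluster, which is the motivation given above.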

## Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
it is desirable to load the smaller side of the join into memory, but in some cases we cannot predict which side will
be smaller until execution time.
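
As a rough illustration of the hash join example, the sketch below picks the build side only after both input stages have produced their rows, using the observed row counts. The names `StageOutput`, `choose_build_side`, and `hash_join` are hypothetical and do not correspond to DataFusion or Ballista APIs.

```python
from dataclasses import dataclass


@dataclass
class StageOutput:
    """Result of a finished stage (hypothetical structure)."""
    name: str
    rows: list

    @property
    def num_rows(self):
        # Exact statistic, known only once the stage has executed.
        return len(self.rows)


def choose_build_side(left, right):
    """Return (build, probe): the smaller input becomes the in-memory side."""
    if left.num_rows <= right.num_rows:
        return left, right
    return right, left


def hash_join(left, right, key):
    """Inner join on `key`, building the hash table from the smaller input."""
    build, probe = choose_build_side(left, right)
    table = {}
    for row in build.rows:
        table.setdefault(row[key], []).append(row)
    joined = []
    for row in probe.rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return build.name, joined


orders = StageOutput("orders", [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 1},
    {"order_id": 12, "cust_id": 3},
])
customers = StageOutput("customers", [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Lin"},
])
build_name, joined = hash_join(orders, customers, key="cust_id")
print(build_name, len(joined))
```

Here the decision is based on row counts for simplicity; a real optimizer would more likely compare byte sizes or other statistics.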
21 changes: 21 additions & 0 deletions datafusion/cli/index.html
@@ -395,6 +395,11 @@

<nav id="bd-toc-nav">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#install-and-run-using-homebrew-on-macos">
Install and run using Homebrew (on macOS)
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#run-using-cargo">
Run using Cargo
@@ -453,6 +458,22 @@ <h1>DataFusion Command-line<a class="headerlink" href="#datafusion-command-line"
<p>The Arrow DataFusion CLI is a command-line interactive SQL utility that allows
queries to be executed against CSV and Parquet files. It is a convenient way to
try DataFusion out with your own data sources.</p>
<div class="section" id="install-and-run-using-homebrew-on-macos">
<h2>Install and run using Homebrew (on macOS)<a class="headerlink" href="#install-and-run-using-homebrew-on-macos" title="Permalink to this headline">¶</a></h2>
<p>The easiest way to try the DataFusion CLI on macOS is via Homebrew. Install it like any other pre-built package:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>brew install datafusion
<span class="c1"># ==&gt; Downloading https://ghcr.io/v2/homebrew/core/datafusion/manifests/5.0.0</span>
<span class="c1"># ######################################################################## 100.0%</span>
<span class="c1"># ==&gt; Downloading https://ghcr.io/v2/homebrew/core/datafusion/blobs/sha256:9ecc8a01be47ceb9a53b39976696afa87c0a8</span>
<span class="c1"># ==&gt; Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:9ecc8a01be47ceb9a53b39976</span>
<span class="c1"># ######################################################################## 100.0%</span>
<span class="c1"># ==&gt; Pouring datafusion--5.0.0.big_sur.bottle.tar.gz</span>
<span class="c1"># 🍺 /usr/local/Cellar/datafusion/5.0.0: 9 files, 17.4MB</span>

datafusion-cli
</pre></div>
</div>
</div>
<div class="section" id="run-using-cargo">
<h2>Run using Cargo<a class="headerlink" href="#run-using-cargo" title="Permalink to this headline">¶</a></h2>
<p>Use the following commands to clone this repository and run the CLI. This will require the Rust toolchain to be installed. Rust can be installed from <a class="reference external" href="https://rustup.rs/">https://rustup.rs</a>.</p>
32 changes: 24 additions & 8 deletions datafusion/genindex.html
@@ -477,18 +477,24 @@ <h2 id="B">B</h2>
<h2 id="C">C</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.Expression.html#datafusion.Expression.cast">cast() (datafusion.Expression method)</a>
</li>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext.catalog">catalog() (datafusion.ExecutionContext method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.ceil">ceil() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.character_length">character_length() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.chr">chr() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.col">col() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.collect">collect() (datafusion.DataFrame method)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.collect">collect() (datafusion.DataFrame method)</a>
</li>
<li><a href="python/generated/datafusion.Expression.html#datafusion.Expression.column">column() (datafusion.Expression static method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.concat">concat() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.concat_ws">concat_ws() (in module datafusion.functions)</a>
@@ -519,6 +525,8 @@ <h2 id="D">D</h2>
<h2 id="E">E</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext.empty_table">empty_table() (datafusion.ExecutionContext method)</a>
</li>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext">ExecutionContext (class in datafusion)</a>
</li>
</ul></td>
@@ -547,11 +555,13 @@ <h2 id="I">I</h2>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.Volatility.immutable">immutable() (datafusion.functions.Volatility static method)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.in_list">in_list() (in module datafusion.functions)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.initcap">initcap() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.Expression.html#datafusion.Expression.is_null">is_null() (datafusion.Expression method)</a>
</li>
</ul></td>
</tr></table>
@@ -572,6 +582,8 @@ <h2 id="L">L</h2>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.limit">limit() (datafusion.DataFrame method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.lit">lit() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.Expression.html#datafusion.Expression.literal">literal() (datafusion.Expression static method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.ln">ln() (in module datafusion.functions)</a>
</li>
@@ -657,6 +669,8 @@ <h2 id="R">R</h2>
<h2 id="S">S</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.schema">schema() (datafusion.DataFrame method)</a>
</li>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.select">select() (datafusion.DataFrame method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.sha224">sha224() (in module datafusion.functions)</a>
@@ -673,14 +687,14 @@ <h2 id="S">S</h2>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.sin">sin() (in module datafusion.functions)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.DataFrame.html#datafusion.DataFrame.sort">sort() (datafusion.DataFrame method)</a>

<ul>
<li><a href="python/generated/datafusion.Expression.html#datafusion.Expression.sort">(datafusion.Expression method)</a>
</li>
</ul></li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.split_part">split_part() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext.sql">sql() (datafusion.ExecutionContext method)</a>
@@ -703,14 +717,16 @@ <h2 id="S">S</h2>
<h2 id="T">T</h2>
<table style="width: 100%" class="indextable genindextable"><tr>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext.table">table() (datafusion.ExecutionContext method)</a>
</li>
<li><a href="python/generated/datafusion.ExecutionContext.html#datafusion.ExecutionContext.tables">tables() (datafusion.ExecutionContext method)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.tan">tan() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.to_hex">to_hex() (in module datafusion.functions)</a>
</li>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.to_hex">to_hex() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.translate">translate() (in module datafusion.functions)</a>
</li>
<li><a href="python/generated/datafusion.functions.html#datafusion.functions.trim">trim() (in module datafusion.functions)</a>
2 changes: 1 addition & 1 deletion datafusion/index.html
@@ -458,7 +458,7 @@ <h2>Table of content<a class="headerlink" href="#table-of-content" title="Permal
<ul>
<li class="toctree-l1"><a class="reference internal" href="specification/roadmap.html">Roadmap</a></li>
<li class="toctree-l1"><a class="reference internal" href="specification/roadmap.html#datafusion">DataFusion</a></li>
<li class="toctree-l1"><a class="reference internal" href="specification/roadmap.html#vision">Vision</a></li>
<li class="toctree-l1"><a class="reference internal" href="specification/roadmap.html#ballista">Ballista</a></li>
<li class="toctree-l1"><a class="reference internal" href="specification/invariants.html">DataFusion’s Invariants</a></li>
<li class="toctree-l1"><a class="reference internal" href="specification/output-field-name-semantic.html">Datafusion output field name semantic</a></li>
</ul>
Binary file modified datafusion/objects.inv
Binary file not shown.