Skip to content

Latest commit

 

History

History
 
 

SplitPipeline

<html><title>README</title><body>
<h1>Parallel Data Processing in PowerShell</h1>
<p>PowerShell module for parallel data processing. Split-Pipeline splits the
input, processes parts by parallel pipelines, and outputs data for further
processing. It may work without collecting the whole input, large or infinite.</p>
<h2>Quick Start</h2>
<p><strong>Step 1:</strong> Get and install SplitPipeline.
SplitPipeline is distributed as the NuGet package <a href="https://www.nuget.org/packages/SplitPipeline">SplitPipeline</a>.
Download it to the current location as the directory <em>&quot;SplitPipeline&quot;</em> by this PowerShell command:</p>
<pre><code>iex (New-Object Net.WebClient).DownloadString('https://raw.github.com/nightroman/SplitPipeline/master/Download.ps1')
</code></pre>

<p>Alternatively, download it by NuGet tools or <a href="http://nuget.org/api/v2/package/SplitPipeline">directly</a>.
In the latter case rename the package to <em>&quot;.zip&quot;</em> and unzip. Use the package
subdirectory <em>&quot;tools/SplitPipeline&quot;</em>.</p>
<p>Copy the directory <em>SplitPipeline</em> to a PowerShell module directory, see
<code>$env:PSModulePath</code>, normally like this:</p>
<pre><code>C:/Users/&lt;User&gt;/Documents/WindowsPowerShell/Modules/SplitPipeline
</code></pre>

<p><strong>Step 2:</strong> In a PowerShell command prompt import the module:</p>
<pre><code>Import-Module SplitPipeline
</code></pre>

<p><strong>Step 3:</strong> Take a look at help:</p>
<pre><code>help about_SplitPipeline
help -full Split-Pipeline
</code></pre>

<p><strong>Step 4:</strong> Try these three commands performing the same job simulating long
but not processor consuming operations on each item:</p>
<pre><code>1..10 | . {process{ $_; sleep 1 }}
1..10 | Split-Pipeline {process{ $_; sleep 1 }}
1..10 | Split-Pipeline -Count 10 {process{ $_; sleep 1 }}
</code></pre>

<p>Output of all commands is the same, numbers from 1 to 10 (Split-Pipeline does
not guarantee the same order without the switch <code>Order</code>). But consumed times
are different. Let's measure them:</p>
<pre><code>Measure-Command { 1..10 | . {process{ $_; sleep 1 }} }
Measure-Command { 1..10 | Split-Pipeline {process{ $_; sleep 1 }} }
Measure-Command { 1..10 | Split-Pipeline -Count 10 {process{ $_; sleep 1 }} }
</code></pre>

<p>The first command takes about 10 seconds.</p>
<p>Performance of the second command depends on the number of processors which is
used as the default split count. For example, with 2 processors it takes about
6 seconds.</p>
<p>The third command takes about 2 seconds. The number of processors is not very
important for such sleeping jobs. The split count is important. Increasing it
to some extent improves overall performance. As for intensive jobs, the split
count normally should not exceed the number of processors.</p>
</body></html>