rename Parallel to Distributed #20486

Merged: 1 commit into master on Feb 15, 2017

Conversation

@amitmurthy (Contributor)

This PR renames module Parallel to Distributed as discussed here - #20428 (comment)

The thinking is to differentiate multi-node distributed computation from other types of parallelism - threads, tasks for IO, GPU, etc.

Will keep this open for a few days.

Note that the manual also needs updating to better differentiate between the different types of parallelism. We have an open issue for that (#19579); it is not in the scope of this PR.
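(For context: the rename is internal at this stage, and user code is unaffected because the symbols are re-exported from Base. A minimal sketch, where Base.Parallel is the pre-PR module path:)

```julia
# Before this PR: the internal module path.
Base.Parallel.pmap(x -> x^2, 1:10)

# After this PR: same functionality under the new name.
Base.Distributed.pmap(x -> x^2, 1:10)

# Unchanged either way, since the symbols are re-exported from Base:
pmap(x -> x^2, 1:10)
```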

@tkelman (Contributor) commented Feb 7, 2017

This isn't strictly multi-node though.

@ViralBShah (Member)

The purpose of all those primitives is to do distributed computing. I think parallel is too broad, and this is good.

@amitmurthy (Contributor, Author)

Going forward, I think most of the needs of single-node parallelism will be addressed via multi-threading and GPU computing. Distributed will be largely used for true multi-node computation or single-node distributed computation (like the testing infrastructure).

We can recover the "Parallel" name for a framework that abstracts over all forms of parallelism, which is a distant goal at this point.

@tkelman (Contributor) commented Feb 7, 2017

"Single node distributed" is a contradiction. It's about multiple processes, coordination between them and data movement, whether single-node or multi-node. Most of the current usage of this is likely single node for which this would be a misnomer - that may change a bit if our threading capabilities improve, but it'll be at the cost of a smaller portion of people using the multi-process model overall.

@amitmurthy (Contributor, Author)

Considering the unit of distribution as physical processors and not OS processes, this can be presented as:

  • Threading leverages multiple processors on a single node.
  • Distributed distributes computation across multiple processors, on a single node or across multiple nodes.

"Single node distributed" is not a contradiction if Distributed is presented as distributing across CPU cores, whether single- or multi-node.

It comes down to how the literature presents it.

Let us also take a vote for Multiprocessing as an option. Any other suggestions are welcome too.
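For concreteness, a minimal sketch of the two models as exposed in Julia at the time (Threads.@threads was still experimental, and requires JULIA_NUM_THREADS to be set):

```julia
# Shared memory: threads on one node mutate a common array in place.
a = zeros(Int, 10)
Threads.@threads for i in 1:10
    a[i] = Threads.threadid()
end

# Distributed: separate worker processes, no shared memory.
addprocs(4)                     # four extra Julia processes, here on the local node
squares = pmap(x -> x^2, 1:10)  # work is farmed out to the workers
```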

@tkelman (Contributor) commented Feb 7, 2017

It isn't really physical processors either, unless you're very careful about managing affinity, OS-level scheduling, hyperthreading, etc., which we don't do a whole lot of (we do some thread-level affinity, but not process-level as far as I'm aware). The literature does distinguish distributed memory from shared memory parallelism, so I'd be fine with DistributedMemory as a more specific module name. That would also address the part-of-speech problem.

@amitmurthy (Contributor, Author)

Multiprocessing at least lends itself to distributing computation over "multiple julia processes". IMO, DistributedMemory is more suited to something like DArrays which provide a single view of distributed memory.

Just FYI (not related to this conversation): we do have some process-CPU affinity capability via https://github.com/JuliaParallel/ClusterManagers.jl#using-localaffinitymanager-for-pinning-local-workers-to-specific-cores, Linux only for now.
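A hedged sketch of that usage, going by the linked README of the time (the exact keyword arguments are an assumption here and may differ in current releases):

```julia
# Pin local workers to specific cores (Linux only); argument names
# are taken from the ClusterManagers.jl README and may have changed.
using ClusterManagers
addprocs(LocalAffinityManager(np=4, mode=BALANCED))
```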

@tkelman (Contributor) commented Feb 7, 2017

more suited to something like DArrays which provide a single view of distributed memory.

I usually see that referred to in the literature as "partitioned global address space," whereas "distributed memory" is referring to conventional message passing or the underlying communication layer between workers that are used to implement a PGAS model. Global consistency and "single view" aren't implied by distributed memory (usually the opposite), as that can be expensive and isn't always needed.

@amitmurthy (Contributor, Author)

The API exposed by this module (remotecall, @spawn, pmap, @parallel, etc.) lends itself more to distributed computation than to any notion of distributed memory.
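Those primitives in brief (a minimal sketch against the Julia 0.5-era API):

```julia
addprocs(2)                   # two local worker processes

f = @spawn rand(3)            # run on some worker; returns a Future
fetch(f)

r = remotecall(+, 2, 1, 2)    # invoke + on worker 2
fetch(r)                      # => 3

pmap(x -> x^2, 1:10)          # map a function over the workers

@parallel (+) for i in 1:100  # parallel for-loop with a (+) reduction
    i
end                           # => 5050
```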

@ararslan added the parallelism (Parallel or distributed computation) label on Feb 7, 2017
@JeffBezanson (Member)

I think it's nearly obvious that one can use a distributed computing library on a single machine as a degenerate special case. One would also expect such a library to support running N processes on each of M machines. The fact that that special case exists doesn't mean the library is misnamed.

If you're into the whole brevity thing, we could call it Multi, short for multiprocessing or multinode.

@Sacha0 (Member) commented Feb 7, 2017

In favor of not conflating abstractions and hardware one way or another:

Threads and processes are the standard language for the respective abstractions. Those abstractions stand independent of hardware, and the common association of processes with multi-node concurrency and threads with single-node concurrency often fails in practice: Expressing concurrency with processes is and will continue to be common on single node (and even single physical execution pipeline) systems as in e.g. CSP. Expressing concurrency with threads on systems traditionally considered physically distributed can also occur with e.g. PGAS and RDMA.

Moreover, the concepts of physical nodes and physically shared/distributed resources are progressively blurring (with, for example, [proliferation and tight integration of daughterboards with DMA], generalized NUMA, and RDMA); those concepts are tied to hardware, subject to change at the pace of hardware evolution, and likely will become only hazier with time (particularly as we march towards heterogeneity and virtualization as is happening now in HPC).

On the other hand, threads and processes are largely atemporal models. Basing a system's model and terminology on the abstractions of threads and processes, and maintaining clear separation between those abstractions and hardware, is a future-proof decision.

Best!

@StefanKarpinski (Member)

@Sacha0, that's really convincing – but I'm not sure what I'm convinced of... 😬

@Sacha0 (Member) commented Feb 8, 2017

@Sacha0, that's really convincing – but I'm not sure what I'm convinced of... 😬

😄 One word: Plastics.

(Edit: This post reflects a misunderstanding of this module's long-term purpose. Please see #20486 (comment).)

I would hope the above convinces you to avoid names that risk conflation of abstractions and hardware (e.g. distributed), and favor instead names that eschew that conflation insofar as reasonable in practice (e.g. threads and processes). Apart from that sentiment I lack strong feelings.

Multithreading and Multiprocessing have a nice ring and symmetry with the terms threads and processes, though multithreading and multiprocessing both have context-dependent meanings (multithreading in software, multithreading in hardware, and multiprocessing in hardware and software, first and third paragraphs respectively), and whether multiprocessing (software context) necessarily implies parallelism is hazy. But perhaps the preceding potential concerns are negligible, given the context of their use and the possible story "Multithreading refers to thread-based concurrency/parallelism and Multiprocessing to process-based concurrency/parallelism" being delightfully simple and memorable. Multithread and Multiprocess might sidestep those potential concerns. Best!

@StefanKarpinski (Member)

We can interpret the word "distributed" as meaning distributed across processes, rather than distributed across physical machines and then the name is fine. Bonus that it's often both.

@tkelman (Contributor) commented Feb 9, 2017

But then can't you also interpret @threads as being "distributed" across threads? Calling this message passing or (inter-process) communication would be more specific to what it's doing, though those are somewhat loaded terms as well.

@Sacha0 (Member) commented Feb 9, 2017

We can interpret the word "distributed" as meaning distributed across processes, rather than distributed across physical machines

That's a bit like saying

We can interpret the word "airship" as meaning a vessel that moves through air (including cars and boats), rather than airship as in dirigible

You can say that, but such usage is sufficiently removed from common usage that confusion is inevitable (and particularly conflation of the process-based-concurrency/parallelism abstraction with distribution across multiple nodes). Re. common usage:

Googling "distributed" yields two hits related to computer science on the first page: Wikipedia's distributed computing entry ("... a model in which components located on networked computers communicate and coordinate their actions ...") and distributed.net, a web-scale distributed computing system (in the preceding sense).

Googling "computer science distributed" yields hits related to "distributed computing", "distributed systems", and "distributed storage", with all hits but one on the first page using "distributed" in the manner of the Wikipedia entry insofar as I see.

Googling "distributed parallelism" yields similar results, though also with references to "distributed memory". Re. top "distributed memory" hits, LLNL's Introduction to Parallel Computing page states "Distributed Memory. In hardware, refers to network based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing."

The common usage this googling suggests matches my experience in scientific computing.

Simultaneously, the existing documentation on this functionality uses the term "multiprocessing". From the first paragraph of those docs: "... Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once. ..."

(For the same reasons I would argue that e.g. ClusterManager is a misnomer; a name like ProcessManager would be more accurate.)

(Edit: To clarify, this was not an argument for the term multiprocessing, but rather points out the inconsistency between this module's present use / documentation and the name distributed. Please see #20486 (comment).)

Bonus that it's often both.

It's often just the one as well :). Best!

@JeffBezanson (Member)

Yes I agree, "distributed" means among multiple machines, coordinated over a network. Now, does this module implement such a thing? Yes it does. What should it do when the number of machines equals 1? Should that be disallowed?

The airship analogy: unlike a dirigible, a car is incapable of flying through the air. But this module is capable of doing distributed computing. What we're talking about here is more like saying "the word 'airplane' is wrong, since airplanes also drive along the ground using their landing gear".

@catawbasam (Contributor)

I concur with the comments arguing that 'distributed' implies moving information around a network. Using the term to refer to multiple threads or processes on a single node is at best counter-intuitive.

@JeffBezanson (Member)

This package does in fact use message passing over a network.

If you add the ability to use e.g. unix domain sockets as a communication layer in a distributed computing package, can you no longer call it a distributed computing package?

@StefanKarpinski (Member)

Distributed means that there are one or more nodes, which is precisely the case here – one is just a degenerate case of distributed computing. Multiprocessing, on the other hand, actually does generally imply a single node – yes, with multiple processes. That implication is actively wrong here since there are potentially many nodes. As Jeff said, this is like insisting on calling a plane "a car" because it can roll around on wheels.

@StefanKarpinski (Member) commented Feb 10, 2017

You folks do understand that this is for real, actual distributed functionality, right? Just like Hadoop or Spark. It does also, as a degenerate case, function as a multiprocessing library on a single node. But you can run frameworks like Hadoop and Spark on a single node. Would the objectors here all describe Spark or Hadoop as "multiprocessing frameworks"? If so, perhaps we should write to them and let them know that they should stop calling their projects "distributed".

@catawbasam (Contributor)

re: "Just like Hadoop or Spark."

vs. "Let’s try this out. Starting with julia -p n provides n worker processes on the local machine. " I add the added the emphasis on local, because I think it is important.

I spend part of most days working on Hadoop -- generally not the happiest part. On our cluster, the default is distributed, not local, when working on e.g. Pig or Hive. This matters.

@JeffBezanson (Member)

It's perfectly possible to install julia on a cluster and provide a 1-line script that starts julia on every node via this library. Hadoop also requires a non-zero amount of configuration.
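For example, with the built-in SSHManager the multi-node case is a one-liner (hostnames here are hypothetical; passwordless ssh is assumed):

```julia
addprocs(["node1", "node2"])   # one worker on each remote machine, via ssh
addprocs([("node1", 4)])       # or four workers on node1
```

From the shell, `julia --machinefile hosts.txt` likewise starts a worker on every host listed in the file.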

@JeffBezanson (Member)

Note also that we're talking about the name of the module, and you can't rename it based on how it's used, or how the defaults are set, or on how a majority of users use it. Presumably we're not going to rename it if the common style of use shifts over time, or if we change what the -p command line option does.

@JaredCrean2 (Contributor)

If you look at section 1.8 of the MPI 3.0 standard (and I hope we can all agree that MPI is well-established prior art for parallel computing), it makes the point that a standard which is designed for distributed memory computing can be implemented for a shared memory system. The distinction between the distributed memory and shared memory programming models is whether there is a fundamental assumption that different lines of execution (processes/threads/whatever you want to call them) have access to some common memory. This determines which constructs can be implemented efficiently. Where processes actually run is irrelevant. The important distinction is what kind of programming model the user has to work with.

Renaming Parallel to Distributed is a positive change because it informs the user that they should be using a distributed memory programming model, regardless of where the processes are run.
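A tiny Julia illustration of that distinction: under the distributed-memory model, data sent to a worker is copied, so mutation on the worker never aliases the caller's memory, wherever the worker happens to run.

```julia
addprocs(1)                                   # the new worker gets id 2
a = zeros(3)
# The array is serialized to worker 2, which mutates its own copy:
remotecall_fetch(x -> (x[1] = 1.0; x), 2, a)  # => [1.0, 0.0, 0.0]
a                                             # still [0.0, 0.0, 0.0] locally
```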

Would the objectors here all describe Spark or Hadoop as "multiprocessing frameworks"? If so, perhaps we should write to them and let them know that they should stop calling their projects "distributed".

@StefanKarpinski I'm not sure if this is how you meant it, but this reads rather snide to me. I enjoy a good debate as much as the next person, but let's not go overboard.

@catawbasam (Contributor) commented Feb 10, 2017

@JeffBezanson re "Hadoop also requires a non-zero amount of configuration." That's an understatement. The defaults on Hadoop are often poor as well. If anything, Spark is worse. Julia can and should do much better.

The name of the module is an indicator to users of its intended use. This code may or may not be used on a cluster, and it seems reasonable to expect further development of nodes with 64+ CPUs that can handle large jobs without the hassle of cluster management.

@JaredCrean2, I cannot agree that "Where processes actually run is irrelevant." Good luck making this argument to programmers working on satellites or interplanetary missions. Distance matters, and physics is a real thing.

@JeffBezanson (Member)

it seems reasonable to expect further development of nodes with 64+ CPUs that can handle large jobs without the hassle of cluster management

Agreed, but for that you'd probably want multithreading, and not this package. Such hardware shifts are part of why I'd rather name this package based on what it inherently does, and not how it will be used. And what it does is implement a distributed memory, message-based programming model. @tkelman suggested DistributedMemory, which is reasonably accurate but (1) pretty long, and (2) to me implies data parallelism, which this package isn't specifically focused on.

Good luck making this argument to programmers working on satellites or interplanetary missions.

I believe the point is that the purpose of a package like this is to provide a particular API, and the name mostly applies to the API and not to where your processes run.

@StefanKarpinski (Member) commented Feb 10, 2017

Let's take a look how wikipedia defines "multiprocessing":

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system.

It seems that calling what we have "multiprocessing" would be actively misleading since it directly contradicts the definition of the term. Well, maybe it's not actually used that way in practice. Let's take a look at some top Google hits for the word, e.g. Python's multiprocessing library:

The multiprocessing package offers both local and remote concurrency

Ok, that sounds promising – maybe they support real distributed computing like this module does! Oh wait, no, it's a client-server model where all jobs are run on a single node. But jobs can be submitted by other nodes. Not really the same thing. It's useful, but it's not a distributed computing framework.

Let's look at some others, like Node: no sign of any ability to run on multiple machines. How about Lisp? Surely if the term "multiprocessing" encompasses distributed computing, then Lisp implementations will have this feature! Nope: there is no mention of anything distributed or of multiple machines there either.

How about some Google search stats:

  • "multiprocessing future promise" – 286 thousand results
  • "multiprocessing remote reference" – 1.2 million results
  • "distributed future promise" – 40 million results
  • "distributed remote reference" – 58 million results

People – wikipedia, other programming languages, the internet – do not use the term "multiprocessing" to refer to systems designed to support distributed computing. They do, however, use the term "distributed" for systems that support many machines interacting with features like futures, promises and remote references. I find it hard to understand why we should use a term that is actively misleading instead of one that accurately describes what this code supports.

@amitmurthy force-pushed the amitm/rename_parallel branch 3 times, most recently from 35e23e8 to bb6c0cb, on February 10, 2017 12:10
@amitmurthy force-pushed the amitm/rename_parallel branch from bb6c0cb to f448222 on February 11, 2017 12:45
@amitmurthy (Contributor, Author)

Distributed it is.

Merging after a green CI.

@tkelman (Contributor) commented Feb 11, 2017

Those results merely suggest to me that Distributed is an overly broad term.

DistributedMemory [...] to me implies data parallelism

How so? I don't see that implication at all.

@tkelman added the needs news (A NEWS entry is required for this change) label on Feb 11, 2017
@amitmurthy (Contributor, Author)

This change is internal and does not need a mention in NEWS just yet. The symbols are re-exported from Base.

@amitmurthy (Contributor, Author)

CI has passed. I'll let the BDFLs take a call on the name and request them to merge (or not).

@amitmurthy removed the needs news (A NEWS entry is required for this change) label on Feb 11, 2017
@tkelman (Contributor) commented Feb 11, 2017

People have been using unexported symbols via Base.whatever and would appreciate a heads-up about where they've gone.

@amitmurthy (Contributor, Author)

These are the ones -

process_messages,
remoteref_id,
channel_from_id,
worker_id_from_socket,
cluster_cookie,
start_worker,

Still available from Base.
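So, for example, qualified uses keep working (a sketch; Base.Distributed is the post-rename internal path):

```julia
Base.cluster_cookie()              # unchanged for existing callers
Base.Distributed.cluster_cookie()  # the new fully qualified location
```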

@amitmurthy (Contributor, Author)

Even the renaming of module Parallel to Distributed (or something else) need not be permanent at this time. It can be revisited when we fully separate it out as an independent module to be explicitly imported. The symbols previously available in Base are still available as such -

julia/base/exports.jl, lines 1344 to 1378 at 379f18e:

# Parallel module re-exports
@spawn,
@spawnat,
@fetch,
@fetchfrom,
@everywhere,
@parallel,
addprocs,
CachingPool,
clear!,
ClusterManager,
default_worker_pool,
init_worker,
interrupt,
launch,
manage,
myid,
nprocs,
nworkers,
pmap,
procs,
remote,
remotecall,
remotecall_fetch,
remotecall_wait,
remote_do,
rmprocs,
workers,
WorkerPool,
RemoteChannel,
Future,
WorkerConfig,
RemoteException,
ProcessExitedException

At this time the name change is more relevant to contributors, with a zero impact on users.

@tkelman (Contributor) commented Feb 11, 2017

It wasn't complete; you missed AbstractRemoteRef, which caused DeferredFutures.jl to start failing. Possibly others as well.

@amitmurthy (Contributor, Author)

Well, those should be fixed independently, and NEWS.md should not carry non-user-facing changes at this time.

@amitmurthy (Contributor, Author)

AbstractRemoteRef should be exported from Base as well.

@Sacha0 (Member) commented Feb 11, 2017

You folks do understand that this is for real, actual distributed [computing] functionality, right?

... this package ... what it does is implement a distributed memory, message-based programming model.

This was the crux of the matter on my end: I fundamentally misunderstood the long-term vision for this module (as opposed to its present state, use, and documentation). Rereading all concurrency / parallelism issues and mailing list threads remedied that misunderstanding. Distributed or something similar now strikes me as a solid name.

To avoid future confusion, @StefanKarpinski's suggestion along similar lines of renaming @parallel to @distributed seems great. (And particularly worthwhile to avoid confusion with the forthcoming Cilk-style parallel loop constructs.) Likewise this suggestion of a name like DistributedRef or DistributedAccumulator instead of ParallelAccumulator. The concurrency/parallelism manual section rewrite also seems imperative (particularly making clear that going forward this module's functionality will almost never be the tool of choice for concurrency / parallelism on a single node).
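Under that suggestion, today's reducing @parallel loop would be spelled as follows (a sketch of the proposed rename, not something implemented in this PR):

```julia
nheads = @distributed (+) for i in 1:200_000
    Int(rand(Bool))
end
```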

@tkelman suggested DistributedMemory, which is reasonably accurate but (1) pretty long, and (2) to me implies data parallelism, which this package isn't specifically focused on.

A similarly descriptive alternative that does not imply data parallelism might be DistributedProcessing. It answers the "distributed what?" question, though it's similarly long.

If you look at section 1.8 of the MPI 3.0 standard (and I hope we can all agree that MPI is well established prior art for parallel computing), it makes the point that a standard which is designed for distributed memory computing can be implemented for a shared memory system. The distinction between the distributed memory and shared memory programming models is whether there is a fundamental assumption that different lines of execution (processes/threads/whatever you want to call them) have access to some common memory. This determines which constructs can be implemented efficiently. Where processes actually run is irrelevent. The important distinction is what kind of programming model the user has to work with.

I believe the point is that the purpose of a package like this is to provide a particular API, and the name mostly applies to the API and not to where your processes run.

💯 That the programming model / abstraction rather than realization in hardware is the important part is the sentiment I hoped to convey with #20486 (comment). These statements convey that sentiment better. Best!

@JeffBezanson (Member) commented Feb 11, 2017

DistributedMemory [...] to me implies data parallelism

How so? I don't see that implication at all.

DistributedMemory to me sounds like a memory-style API, designed to do load and store. There is also such a thing as "distributed shared memory" (sounds like an oxymoron, but is a real term), which makes many machines appear to have one big shared memory. This package does nothing like that, but those are the connotations I get from the word "memory".

@amitmurthy merged commit a3ebe1a into master on Feb 15, 2017
@amitmurthy deleted the amitm/rename_parallel branch on February 15, 2017 06:11
@tkelman (Contributor) commented Feb 15, 2017

DistributedProcessing would be a better name for this. You don't type module names in full all that often, and we have tab completion. If this were a package (which it should be made anyway before 1.0), Distributed would be too general according to the naming guidelines - "Err on the side of clarity, even if clarity seems long-winded to you."

I'll let the BDFLs take a call on the name and request them to merge (or not).

So much for that then?

@kpamnany (Member) commented Mar 4, 2017

My $0.02 FWIW is that threads offer shared memory parallelism, and anything that crosses a process boundary is distributed memory parallelism. Whether the processes are local to a node or remote should only be relevant to the communication layer. Programmatically, a distributed memory application should not concern itself with the locality of the participating processes (excluding outlier situations like embedded platforms and such).

Having said that, I like @Sacha0's suggestion best: Multithreading and Multiprocessing. Symmetrical and unambiguous.

@Sacha0 (Member) commented Mar 5, 2017

Multiprocessing does seem regrettably ambiguous though, as discussed in e.g. #20486 (comment) and #20486 (comment)? Best!

Labels: parallelism (Parallel or distributed computation)

10 participants