Blue Sky: Difference between revisions
m (moved CpsParallelism to Blue Sky) |
No edit summary |
||
(3 intermediate revisions by the same user not shown) | |||
Line 21: | Line 21: | ||
It's perhaps a weird idea, passing along the context to the variable, but if messages have high overhead, then saving half the messages makes it seem like it could be worthwhile. | It's perhaps a weird idea, passing along the context to the variable, but if messages have high overhead, then saving half the messages makes it seem like it could be worthwhile. | ||
---- | |||
Active Messages are more of an implementation technique. Seem like a natural way to implement this idea. Here's a recent paper that might help introduce active messages: [http://portal.acm.org/citation.cfm?id=1854323 AM++] | |||
Reading about GASNet is interesting: [http://gasnet.cs.berkeley.edu/258-paper.pdf GASNet] | |||
A lot of old Cilk ideas seem related. Need to read more. | A lot of old Cilk ideas seem related. Need to read more. | ||
Line 32: | Line 33: | ||
If we have full bits, it's pretty simple to manipulate queues of tasks waiting at a sync variable, since the complete context is available locally. Note that both the cotext and the queue are in private memory. | If we have full bits, it's pretty simple to manipulate queues of tasks waiting at a sync variable, since the complete context is available locally. Note that both the cotext and the queue are in private memory. | ||
---- | |||
there are regimes, perhaps the ones we'll be operating under, where this would make sense. Such regimes are where one cannot fully utilize anything close to the bisection BW of the network and the added cost of formulating the context is low compared to the cost of a message. Bear in mind, this doesn't just save 1 message, but N/2 messages (where N is the remote requests). | |||
Regrettably, I think we'll be in such a space and we should look into this idea further. (I say regrettably because I really don't want to be there). | |||
---- | |||
Just catching up on this thread. Do note that you can but need not | |||
think in terms of applying the CPS (continuation-passing-style) | |||
transformation to the source program. You can instead achieve the | |||
same affect in the implementation from the compiler hacker's view: | |||
* Don't maintain your call-stack frames as usual. Rather, allocate | |||
them in the heap, just like everything else. The call-stack becomes a | |||
linked list (or whatever data-structure engineering you prefer) where | |||
each callee as an explicit pointer to the caller. | |||
* Treat shared-memory read/write as function calls. | |||
That's all that's really going on: instead of doing some remote memory | |||
access and then communicating the answer back to a node, you pass the | |||
continuation to the node doing the access. But you need not pass the | |||
entire linked-list representing the control stack, you can just pass | |||
the first element of it and each subsequent element will be retrieved | |||
in due course by whatever node needs it. | |||
---- | |||
Right. And we shouldn't be distracted by Lisp or Scheme; all the same ideas work for C or whatever. | |||
There's lots of room for engineering the representation of the stack. I would make the compiler think hard about what parts of the stack need to be in global memory and what parts can be private, then pass along the private parts so they can be accessed quickly. | |||
Also seems worth while to be able to make guarantees about the allocation of certain structures. For instance, we might provide ways that a vector of reasonable size can be allocated in global memory so that the compiler can be certain all elements are on the same node. Then when a thread goes to that node to examine an element, it can be sure that the other elements are available for local access without a new hop. | |||
(I foresee a new opportunity for punning terminology. Now we can literally chase pointers.) | |||
If we can know something about the allocation of very large vectors that are necessarily spread across many nodes, probably such info is only available at run time, then we can do things like dispatch a worker to each relevant node to deal with that node's share of the vector. | |||
And I suppose there's always a choice, perhaps made at runtime, of going to the data or fetching the data. To make blocking synchronization simple, I'd say that all synchronizing refs should go to the data. For other cases, we can maybe choose based on locality considerations; i.e., if we're going to do many references over "there" and we're done "here", then go. | |||
---- | |||
Burton pointed out ''traveling threads'', Peter Kogge and his student, Richard Murphy. |
Latest revision as of 21:49, 22 June 2011
Yesterday in our meeting, Jacob and I went briefly off on a tangent before being hauled back to earth by more sober-minded folks. Nevertheless, it was a cool idea and I spend a little quality time (in bed at 4am) thinking some more about it.
I don't recall how it arose, but the basic idea was to convert programs to continuation-passing style (CPS) and send one-way messages (closures, with context + continuation address) in place of shared memory references. So all threads migrate all the time, chasing down the values of shared variables to build up context, then performing some computation whenever adequate values have been collected (in whatever core happened to be managing the last memory accessed), and continuing.
Perhaps like Active Messages, though I don't know much about them.
I can't explain CPS in a short note, but I think it's pretty familiar to most folks. Here's a trivial example of what I mean.
Imagine we want to add a couple of integers stored in shared memory, say X and Y. First, let's think about what we'll do with the sum when we have it. In best Lisp style, we'll call a function named "continuation" and pass the sum as an argument. OK, now let's begin by gen'ing up a message to the processor that stores A (or rather, the core that manages the memory where A is stored, we'll call it procA). The message will have the address of A plus a pointer to some code.
In procA, the message comes along and procA executes a little code loading the value of A from the memory it controls and gen'ing up a message to procB. The message will have a the address of B, the value of A, and a pointer to some code.
In procB, the message comes along and procB executes a little code loading the value of B from the memory it controls. Since procB has the values of both A and B handy, we can add those values, then jump to "continuation", passing in the sum we just computed.
Note that this "gen'ing up a message" phrase is an entirely local computation, involving no references to shared memory, so it should run pretty fast on an x86 core.
No stacks are maintained, there are no calls, read and writes are always performed locally and the computation continues on the core where the memory reference was handled. Same for AMOs. (and since the delegate manages all references to its share of global memory, sequential consistency is easy, even without fences and such). A core never replies to a message.
I expect we could hack up an example pretty quickly, compiling a Scheme-like language. When I say scheme-like, I don't mean lists, I mean Scheme's control structures (nested functions, calls, if-then-else, global variables, ...) plus something like futures for parallelism.
It's perhaps a weird idea, passing along the context to the variable, but if messages have high overhead, then saving half the messages makes it seem like it could be worthwhile.
Active Messages are more of an implementation technique. Seem like a natural way to implement this idea. Here's a recent paper that might help introduce active messages: AM++
Reading about GASNet is interesting: GASNet
A lot of old Cilk ideas seem related. Need to read more.
If we have full bits, it's pretty simple to manipulate queues of tasks waiting at a sync variable, since the complete context is available locally. Note that both the cotext and the queue are in private memory.
there are regimes, perhaps the ones we'll be operating under, where this would make sense. Such regimes are where one cannot fully utilize anything close to the bisection BW of the network and the added cost of formulating the context is low compared to the cost of a message. Bear in mind, this doesn't just save 1 message, but N/2 messages (where N is the remote requests).
Regrettably, I think we'll be in such a space and we should look into this idea further. (I say regrettably because I really don't want to be there).
Just catching up on this thread. Do note that you can but need not think in terms of applying the CPS (continuation-passing-style) transformation to the source program. You can instead achieve the same affect in the implementation from the compiler hacker's view:
- Don't maintain your call-stack frames as usual. Rather, allocate
them in the heap, just like everything else. The call-stack becomes a linked list (or whatever data-structure engineering you prefer) where each callee as an explicit pointer to the caller.
- Treat shared-memory read/write as function calls.
That's all that's really going on: instead of doing some remote memory access and then communicating the answer back to a node, you pass the continuation to the node doing the access. But you need not pass the entire linked-list representing the control stack, you can just pass the first element of it and each subsequent element will be retrieved in due course by whatever node needs it.
Right. And we shouldn't be distracted by Lisp or Scheme; all the same ideas work for C or whatever.
There's lots of room for engineering the representation of the stack. I would make the compiler think hard about what parts of the stack need to be in global memory and what parts can be private, then pass along the private parts so they can be accessed quickly.
Also seems worth while to be able to make guarantees about the allocation of certain structures. For instance, we might provide ways that a vector of reasonable size can be allocated in global memory so that the compiler can be certain all elements are on the same node. Then when a thread goes to that node to examine an element, it can be sure that the other elements are available for local access without a new hop.
(I foresee a new opportunity for punning terminology. Now we can literally chase pointers.)
If we can know something about the allocation of very large vectors that are necessarily spread across many nodes, probably such info is only available at run time, then we can do things like dispatch a worker to each relevant node to deal with that node's share of the vector.
And I suppose there's always a choice, perhaps made at runtime, of going to the data or fetching the data. To make blocking synchronization simple, I'd say that all synchronizing refs should go to the data. For other cases, we can maybe choose based on locality considerations; i.e., if we're going to do many references over "there" and we're done "here", then go.
Burton pointed out traveling threads, Peter Kogge and his student, Richard Murphy.