[Arm-netbook] Adapteva Parallella: Thoughts?

Luke Kenneth Casson Leighton lkcl at lkcl.net
Thu Dec 29 08:19:38 GMT 2016


---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Thu, Dec 29, 2016 at 8:03 AM, Lauri Kasanen <cand at gmx.com> wrote:
> On Thu, 29 Dec 2016 00:44:33 +0000
> Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
>
>> thats interesting. i worked for aspex semi in 2003 and they had the exact
>> same problem, programming ultra parallel devices is limited to a few
>> hundred competent people in the entire world.
>>
>> interesting to me because ericsson bought aspex.
>
> Surely that's changed now, with the ubiquity of modern GPUs? It's
> entirely normal there to handle hundreds or thousands of threads at
> once, at massively parallel workloads.

 for the ASP: no.  not a chance.  it was a massively-parallel (deep
SIMD) *two bit* string-array processor with (in some variations) 256
bits of content-addressable memory, which used "tagging" to decide
whether any one SIMD instruction would actually be executed on a
given element or not.
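
 a crude software model of that "tagging" idea, purely for
illustration - the names and sizes below are invented, and of course
the real thing was 2-bit hardware, not c:

    #define NUM_APUS 4096

    /* hypothetical per-APU state: a tag bit plus some local storage */
    struct apu {
        unsigned char tag;      /* 1 = this APU takes part in the next op */
        unsigned char reg[8];   /* local storage (2-bit ALU in reality)   */
    };

    /* a broadcast SIMD "add" only takes effect where the tag is set */
    void simd_add(struct apu *a, int dst, int src)
    {
        for (int i = 0; i < NUM_APUS; i++)
            if (a[i].tag)
                a[i].reg[dst] += a[i].reg[src];
    }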

 it was so far outside of mainstream processing "norms" that they
actually had to use gcc -E pre-processing "macros" to substitute in
pre-defined pipeline-stuffing c code, loaded with hexadecimal
representations of the assembly-code instructions to be sent to the
SIMD unit.

 a similar trick was deployed by ingenic for their X-Burst VPU: their
pre-processing mechanism is a dog's dinner of awk and perl that
looks for appropriate patterns in pre-existing c code, whereas
Aspex's technique was to just put capitalised macros directly
interspersed in the c code and let the pre-processing phase
explicitly take care of the substitution.
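
 roughly what that looked like in practice (purely illustrative: the
macro names, fifo address and opcodes below are all invented, only
the shape of the trick is real):

    /* hypothetical memory-mapped fifo feeding the SIMD unit's pipeline */
    #define ASP_FIFO          (*(volatile unsigned int *)0x80000000u)

    /* push one raw (hexadecimal) instruction word into the pipeline */
    #define ASP_QUEUE(word)   (ASP_FIFO = (unsigned int)(word))

    /* a "capitalised macro" the programmer writes inline in ordinary c,
       which gcc -E expands into pipeline-stuffing code before compilation */
    #define ASP_ADD_STRING(width) do {                  \
            ASP_QUEUE(0xa1000000u | ((width) & 0xffu)); \
            ASP_QUEUE(0x00000001u);                     \
        } while (0)

    void example(void)
    {
        int x = 42;            /* ordinary c carries on around it...        */
        ASP_ADD_STRING(64);    /* ...with the SIMD work interspersed inline */
        (void)x;
    }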

 it was utterly horrible and insane, and it was only tolerated on the
*promise* that, at the time each architecture was announced, it could
do *CERTAIN* tasks a hundred times faster than the silicon AVAILABLE
at the time.

of course... by the time each architecture revision actually came out
(18+ months later) the speed of pentium processors had increased so
greatly that the advantage had shrunk to only 20, 10 or even 5
times....

to write code for the ASP you measured productivity in DAYS per line
of (assembly-style) code.  you actually had to write a spreadsheet to
work out whether it was more efficient to map the operands one bit
per processor (linear) or to use the "string" feature to process
operands spread out in parallel across multiple neighbouring APUs.

the factor which made this analysis so insanely complex was that the
"load and unload" had to be done linearly over a standard memory
bus, and took a looong time relative to the clock rate of the APUs.
thus, if you only needed to do a small amount of computation it was
best to use the single-bit technique (4,000 answers in a slower time,
to match the "load and unload" time), but if you had a lot of
computation to perform it was better to use the parallel technique, in
order to keep the little buggers busy whilst waiting for load or
unload.
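
a rough back-of-envelope version of what that spreadsheet was
computing (every constant here is invented - the real numbers are
long gone - it just shows the shape of the trade-off you had to
tabulate):

    #include <stdio.h>

    /* cycles per answer for one candidate operand layout:
       answers   = results produced per load/unload batch
       io_cycles = cycles to stream that batch over the linear memory bus
       op_cycles = cycles per SIMD operation in that layout
       n_ops     = operations needed per answer                           */
    double cycles_per_answer(double answers, double io_cycles,
                             double op_cycles, double n_ops)
    {
        return (io_cycles + n_ops * op_cycles) / answers;
    }

    int main(void)
    {
        double n_ops = 200.0;   /* workload size per answer - made up */

        /* 1 bit per APU: 4096 answers per batch, slow bit-serial ops */
        printf("1-bit  : %.1f cycles/answer\n",
               cycles_per_answer(4096.0, 500000.0, 64.0, n_ops));

        /* operands spread across 32 neighbouring APUs: far fewer answers
           per batch, far fewer cycles per op, less data to stream        */
        printf("32-wide: %.1f cycles/answer\n",
               cycles_per_answer(128.0, 16000.0, 2.0, n_ops));
        return 0;
    }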

... or... anything in between.  2, 4, 5, 6, 8, 12, 24, 32, 64, 96, 128
or 256 bit parallel computation, it was all the same to an
array-string massively-parallel deep SIMD *bit-level* processor.

but it made programming it absolutely flat-out totally impractical and
even undesirable, except for those very very rare cases, usually
related to the ultra-fast content-addressable-memory capability.

i.e. extremely, extremely rare.

putting a "normal" c compiler on top of the ASP, or porting OpenCL to
it, would be an estimated 50-man-year research and programming effort
all on its own.  just... not worth the effort, sadly.

l.


