[Arm-netbook] [review] SoC proposal

Thu Feb 9 16:56:39 GMT 2012

lkcl luke wrote:
> On Thu, Feb 9, 2012 at 3:13 PM, Gordan Bobic <gordan at bobich.net> wrote:
>> lkcl luke wrote:
>>> On Thu, Feb 9, 2012 at 11:41 AM, Gordan Bobic <gordan at bobich.net> wrote:
>>>> lkcl luke wrote:
>>>>> On Thu, Feb 9, 2012 at 7:41 AM, Vladimir Pantelic <vladoman at gmail.com> wrote:
>>>>>
>>>>>>>   we evaluated the possibility of coping with 1080p30 video decode, and
>>>>>>> worked out that after one of the cores has been forced to deal with
>>>>>>> CABAC decode all on its own, the cores could then carry out the
>>>>>>> remaining parts of 1080p30 decode in parallel, at about 1ghz, quantity
>>>>>>> 4.
>>>>>> I would not recommend fully loading the cpu while decoding video, HD
>>>>>> video is becoming a commodity and people might soon use it as an "animated"
>>>>>> wallpaper while doing other CPU intensive stuff
>>>>>  last year the target speed was 1.5ghz, 4 cores.  this time we
>>>>> envisage 8 cores at over 1.2ghz, and the cores can be made to support
>>>>> VLIW which can result in 3x the effective clock-rate.  so i don't
>>>>> think that CPU horsepower is something to worry about.  the only thing
>>>>> that's of concern is to not put too _much_ horsepower down so that it
>>>>> goes beyond the gate-count budget.
>>>> I think you need to look at this from the practical standpoint.
>>>> Specifically:
>>>>
>>>> 1) Is there GCC support for this SoC's instruction set,
>>>  yes.
>>>
>>>> including VLIW,
>>>  don't know.  tensilica have a proprietary "pre-processor" compiler
>>> that turns c and c++ into VLIW-capable c and c++.
>> That sounds like another one of those hare-brained solutions like the
>> JZ4760. If it's going to be mainstream, it needs to be done cleanly,
>> i.e. a proper GCC back-end - preferably one that delivers decent
>> performance unlike the current vectorization efforts.
>>
>>>> SSE,
>>>  what is SSE?
>> Sorry, s/SSE/SIMD/
> 
>  ahh.  yes i believe so.

I rather doubt that - last time I checked, vectorization didn't produce 
meaningful gains even on x86/SSE. I'd like to think that things have 
improved recently, but I'd do some testing before believing it.

>>>> How many man-hours will that take before it is
>>>> sufficiently tested and stable for an actual product that the end
>>>> consumers can use?
>>>  don't know.  tensilica have proprietary software libraries for audio
>>> and video CODECs already (obviously)
>> Proprietary as in closed-source?
> 
>  yyyep.  good enough for commercial mass-volume purposes though.

So much for FOSS...

>>>> Look at the rate of progress Linaro is making, and they have a
>>>> multi-million $ budget to pay people to push things along, and an OSS
>>>> community that already has ARM well boot-strapped and supported.
>>>  yes.  that just adds to the cost of the CPUs, which is unacceptable.
>>>
>>>  i'd like to apply a leeetle bit more intelligence to the task, namely
>>> for example to add instruction extensions that will help optimise 3D,
>>> and to say... find out how much effort it would take to port llvm and
>>> to evaluate whether gallium3d on llvmpipe would be "good enough" if
>>> optimised to use the 3D-accelerating instructions etc. etc.
>> Don't put too much stock in that yet. Writing a good optimizing compiler
>> that produces beneficial SIMD/VLIW binary code is _hard_.
> 
>  well, tensilica seem to have done a decent job.

With their own proprietary compiler or have they contributed a 
half-decent GCC back-end?

>> So hard that
>> only Intel have so far managed to do a decent job of SIMD (SSE).
> 
>  add tensilica to the list [proprietary compiler of course... *sigh*].

That explains it. Good luck with building the kernel and glibc with that...

>> See the difference between AMD and Nvidia GPUs, for example. Radeon has
>> always had much, much higher theoretical throughput, but has always
>> lagged behind GeForce in practice. Unified shaders are generic so even a
>> relatively crap compiler can do something sensible.
> 
>  i still don't fully grok what the heck shaders are all about.  so i
> have no idea if optimising instruction sets to help _will_ help!
> *wandering aimlessly*....

It's basically a processor. A tiny processor that can do math 
operations. Unified shaders means that any shader can do any math 
operation. Non-unified shaders means that some shaders can do some math 
operations, while other shaders can do other operations, but not both. 
All shaders, unified or otherwise, can run simultaneously.

AMD GPUs have non-unified shaders, but more of them. If you can break 
your code up so that it maps well onto those shaders and their 
capabilities, you have more shaders that can do the calculations at the 
same time. This is sufficiently hard that AMD haven't managed to produce 
a compiler that makes their hardware work anywhere near the limit of its 
potential in most cases. One exception is bitcoin mining, where AMD's 
GPUs are much faster than Nvidia's.

Nvidia GPUs have unified shaders. Any shader can do anything, so your 
compiler can be relatively primitive - it doesn't have to look at the 
code to identify independently executable operations and paths, map them 
onto the shaders available, and produce the results with maximum 
parallelism.

The problem with non-unified shaders is that typical code does one 
operation on a large array of data, which means that if your compiler is 
crap, it'll only use the shaders that do that operation in that pass, 
while the shaders not capable of that operation will sit idle.
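
To put rough numbers on the idle-shader point, here is a toy model in 
Python. Everything in it is made up for illustration: the utilisation() 
helper, the "mul"/"add" operation names and the shader counts are 
assumptions, not figures for any real GPU. It assumes a naive compiler 
that issues exactly one operation per pass.

```python
def utilisation(workload, shaders):
    """Fraction of shader slots doing useful work.

    workload: list of passes, each naming the single operation applied
              to the whole array in that pass (naive compiler assumed).
    shaders:  list of sets, each holding the operations one shader can do.
    """
    busy = 0
    for op in workload:
        # Only the shaders capable of this pass's operation are busy.
        busy += sum(1 for caps in shaders if op in caps)
    return busy / (len(workload) * len(shaders))


# Hypothetical hardware: 4 unified shaders versus 8 specialised ones.
unified = [{"mul", "add"}] * 4
split = [{"mul"}] * 4 + [{"add"}] * 4

# Hypothetical workload: one operation over the whole array per pass.
workload = ["mul", "add", "mul", "mul"]

print(utilisation(workload, unified))
print(utilisation(workload, split))
```

In this model the 8 specialised shaders get no more work done per pass 
than the 4 unified ones, because half of them sit idle in every pass. 
Closing that gap means the compiler has to find a "mul" and an "add" 
that can run in the same pass.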

From what I understand, VLIW is kind of similar to what the Radeons do 
because it allows you to fire off independent instructions to saturate 
different "engines" (shaders in GPU terms) of the CPU simultaneously. 
The downside is that you need a damn good compiler to leverage it - and 
GCC isn't it.
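
As a rough sketch of what the compiler has to do for VLIW, here is a 
greedy bundle packer in Python. It is an illustration only, not how 
tensilica's compiler or GCC works: pack_bundles(), the unit names 
("alu", "mul", "mem") and the tiny program are all invented, each bundle 
is assumed to have one slot per unit, every result is assumed to be 
ready in the next bundle, and each register is assumed to be written 
only once.

```python
def pack_bundles(instrs):
    """Greedily pack instructions into VLIW-style bundles.

    instrs: list of (dest, unit, srcs) in program order. Each dest is
            written once, so only read-after-write dependencies matter.
    Returns a list of bundles, each a dict mapping unit -> dest.
    """
    bundles = []
    produced_in = {}  # dest -> index of the bundle that computes it
    for dest, unit, srcs in instrs:
        # Must come after every bundle that produces one of our inputs.
        i = max((produced_in[s] + 1 for s in srcs if s in produced_in),
                default=0)
        # Then find the first bundle whose slot for this unit is free.
        while True:
            if i == len(bundles):
                bundles.append({})
            if unit not in bundles[i]:
                break
            i += 1
        bundles[i][unit] = dest
        produced_in[dest] = i
    return bundles


# Hypothetical program with some independent work in it.
program = [
    ("a", "alu", ["x", "y"]),  # a = x + y
    ("b", "mul", ["x", "z"]),  # b = x * z
    ("c", "mem", ["p"]),       # c = load p
    ("d", "alu", ["a", "b"]),  # d = a + b
    ("e", "mul", ["d", "c"]),  # e = d * c
]

# Hypothetical program where every step needs the previous result.
chain = [
    ("a", "alu", ["x"]),
    ("b", "alu", ["a"]),
    ("c", "alu", ["b"]),
]

for bundle in pack_bundles(program):
    print(bundle)
```

The first program fits five instructions into three bundles because the 
first three are independent and use different units. The chain gains 
nothing: one instruction per bundle, with the other units idle. Real 
code looks a lot more like the chain unless the compiler works hard to 
expose independent operations.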

Gordan