[Arm-netbook] [review] SoC proposal

lkcl luke luke.leighton at gmail.com
Wed Feb 8 21:18:29 GMT 2012


On Wed, Feb 8, 2012 at 8:16 PM, Tony Garnock-Jones <tonyg at ccs.neu.edu> wrote:
> On 02/08/2012 02:58 PM, lkcl luke wrote:
>>
>> that means that it could be used in at least the following products:
>> [everything]
>
>
> It sounds *amazing*. Is the cache coherency done in software, as you
> mentioned near the top of your message?

 no: the company behind the CPU core we've found has already done a
hardware SMP 1st level cache, for up to 4 cores - extending that to 8
shouldn't be hard.

 the only thing is: it's 32-bit only, meaning a hard limit of 4gb of
RAM, and the peripherals will need to be memory-mapped into that
space.

> What size caches would exist on each
> core?

 haven't picked that yet, but there's definitely an upper limit based
on the target pricing.

 1st level caches probably something like 32k, 2nd level maybe 128k
each - we can't go too large because we want to be able to get around
10,000 CPUs on a single wafer (wafers being about $8k each, in batches
of 16): that means the area mustn't be bigger than 10 sq.mm including
a 1mm gap for cutting them up.

 12in wafer, 6*6*pi sq.in => appx 113 sq.in => appx 730 sq.cm => each
CPU gets appx 7.3 sq.mm, err about 2.5x2.5 mm of actual die once you
allow for the cutting gap, which kinda sets a limit of somewhere
around 4 to 6 million gates.
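
 the same back-of-the-envelope as a quick python sketch, using only
the rough numbers above (12in wafer, 10,000 dies per wafer) - purely
illustrative, not a yield calculation:

    # rough die-area budget: 12in wafer, target of 10,000 dies per wafer
    import math

    wafer_diameter_mm = 12 * 25.4
    wafer_area_sqmm   = math.pi * (wafer_diameter_mm / 2) ** 2   # appx 73,000 sq.mm
    dies_per_wafer    = 10000

    area_per_die_sqmm = wafer_area_sqmm / dies_per_wafer         # appx 7.3 sq.mm
    die_pitch_mm      = math.sqrt(area_per_die_sqmm)             # appx 2.7 mm per side
    print(round(area_per_die_sqmm, 1), round(die_pitch_mm, 2))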

 each CPU will be about 50k gates, the FPU per CPU will be another
50k; each bit of the cache will be a memory cell, which is 1/2 a gate,
so if you set a limit of 128k that's 131072 * 8 / 2 = half a million
gates _just_ for the 2nd level cache *per cpu* - across 8 CPUs that's
4 million gates _just_ for the 2nd level cache.  whoops.

 the I/O peripherals and the routing are probably going to be another
1 million, possibly more, because it's a hell of a lot of AMBA bus
routing, and the I/O pads, in order to carry the current, often need
to be as big as or bigger than the RISC core alone!

 so with that adding up to appx 6 million gates, it's pretty tight already.
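
 in python, the same gate-budget sum looks like this (all the figures
are the rough estimates above, not synthesis results):

    # rough gate budget for an 8-core SoC, using the estimates above
    num_cpus   = 8
    core_gates = 50000                 # RISC core, per CPU
    fpu_gates  = 50000                 # FPU, per CPU
    l2_bytes   = 128 * 1024            # 2nd level cache, per CPU
    l2_gates   = l2_bytes * 8 // 2     # 1 bit = 1 memory cell = half a gate
    io_gates   = 1000000               # I/O pads, peripherals, AMBA routing (total)

    per_cpu_gates = core_gates + fpu_gates + l2_gates      # appx 624k per CPU
    total_gates   = num_cpus * per_cpu_gates + io_gates    # appx 6 million
    print(per_cpu_gates, total_gates)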


> (I read on the FONC list the other day about a Tilera-based setup where a
> full Smalltalk image was running entirely in the 2MB of L2 of each core,
> treating main memory as remote storage...)

 cool :)

 well... that's not going to be possible here, unless the L2 cache is
shared across the AMBA bus between all the CPUs.  that would do it,
but would result in 8-way contention.  4 CPUs isn't so much of a
problem, but 8 is where CPUs start to tread on each other's toes if
you're not careful.  crossbar-switch designs and all that, which of
course introduce either massive amounts of routing or additional
latency, depending on which way you do it.

 in some ways i'm not so concerned about that happening: the choice
of 8 CPUs is more about getting good 3D performance (CPU-bound rather
than memory-bound) and lower power, and also, well... because 8 has
significance in chinese culture as being associated with financial
success.

 ... but, if you know of anyone who knows about this kinda stuff, and
the implications at an application level, i'd really like to hear from
them.  these kinds of decisions - bus widths, cache sizes, internal
routing architecture - are going to be critical to make sure that the
processor can deliver.

 for example, it's pretty obvious that you have to do at least
1080p30 broadcast video, which will be about 8 mbit/s of H.264.  so we
know immediately that there will need to be a hardware CABAC decode
block (idiots, idiots, idiots who designed H.264 to be
non-parallelisable), and then enough memory to hold several decoded
frames all at once, and then _also_ enough memory bandwidth to be able
to cope with appx 187 million bytes per second (1920 x 1080 x 24-bit x
30fps) on _top_ of all the other processing, just to feed the video
framebuffer.
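
 the framebuffer arithmetic as a quick python sketch (the number of
decoded frames held at once is just an illustrative guess, not a
spec):

    # 1080p30 framebuffer feed: raw bandwidth and decoded-frame memory
    width, height   = 1920, 1080
    bytes_per_pixel = 3                                # 24-bit colour
    fps             = 30

    frame_bytes = width * height * bytes_per_pixel     # appx 6.2 million bytes per frame
    feed_rate   = frame_bytes * fps                    # appx 187 million bytes per second
    frames_held = 4                                    # assumption: several reference frames in RAM
    decode_ram  = frames_held * frame_bytes            # appx 25 million bytes just for frames
    print(frame_bytes, feed_rate, decode_ram)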

the numbers are just completely nuts: when you look at it, it's no
wonder the OMAP3530 was struggling to do 1080p a few years ago.

l.


