Dear luke,
I was thinking about your problem of needing to drive 128 gates from one.
<disclaimer> There comes a time in life when people, or rather I, must demonstrate to the world just how stupid I really am. And there comes a time in life when people like you rebuke my idea in the foolish expectation that it will *never* be proposed again. :)
I am going to propose a solution that works on paper. I have NO experience with logic programming, architecture, or chip design of any sort. What I propose causes a major increase in complexity, while keeping the number of transistors roughly the same and maintaining the speed and power efficiency you desired. </disclaimer>
I have no idea how well my idea will scale with respect to the number of driven gates, but I have researched it as thoroughly as I can. You can find a nice table, which I can only partially understand, here: https://courseware.ee.calpoly.edu/~dbraun/courses/ee307/F99/01_10/01_Francoi... It explains how many gates you can drive from one, given the maximum and minimum currents that the gates accept at their inputs and supply at their outputs. The capacitance and resistance of the gates and wires may prove somewhat more difficult to ascertain.
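As a rough illustration of the current-based limit described above, here is a minimal Python sketch of the usual static fan-out estimate: divide what one output can supply by what one input draws, for both logic levels, and take the smaller result. The function name `dc_fan_out` and the current values are hypothetical (not taken from the linked table), and the sketch ignores the capacitance and wire resistance mentioned above.

```python
def dc_fan_out(i_oh_ua, i_ih_ua, i_ol_ua, i_il_ua):
    """Estimate static fan-out from current magnitudes given in microamps.

    i_oh_ua / i_ol_ua: current one output can supply when high / low.
    i_ih_ua / i_il_ua: current one input draws when high / low.
    """
    high_limit = i_oh_ua // i_ih_ua  # inputs that can be driven high
    low_limit = i_ol_ua // i_il_ua   # inputs that can be driven low
    # The weaker of the two logic levels sets the limit.
    return min(high_limit, low_limit)


# Hypothetical figures, chosen only to show the arithmetic.
print(dc_fan_out(i_oh_ua=400, i_ih_ua=20, i_ol_ua=8000, i_il_ua=400))
```

If the number this gives is well below 128, a single gate cannot drive all the loads directly, which is the case a tree layout is meant to address.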
Attached are two images (no, ASCII art would not have worked here). The first shows a set of NAND gates in an ordinary layout. I chose NAND gates because NAND is one of the two universal gates (the other being NOR). The second image shows a modified version of the first. It uses a recursive layout of NAND gates to produce a tree-like effect.
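To give a feel for how a tree-like layout scales, here is a small Python sketch that counts the intermediate gates needed to reach a given number of loads when no gate may drive more than a fixed number of inputs. It models a generic buffer tree, which is an assumption and not necessarily the layout in the attached drawings; `tree_levels` and the fan-out limit of 4 are hypothetical.

```python
def tree_levels(loads, fan_out):
    """Return the gate count at each intermediate level, source side first.

    loads: number of gate inputs that must finally be driven.
    fan_out: maximum number of inputs any single gate may drive.
    """
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2 for a tree to help")
    levels = []
    remaining = loads
    # Work backwards from the loads towards the single source gate.
    while remaining > fan_out:
        remaining = -(-remaining // fan_out)  # ceiling division
        levels.append(remaining)
    levels.reverse()
    return levels


levels = tree_levels(loads=128, fan_out=4)
print(levels, "intermediate gates in total:", sum(levels))
```

With these assumed numbers the source drives 2 gates, those drive 8, and those drive 32, which in turn drive the 128 loads. If each intermediate gate is a NAND wired as an inverter, an odd number of levels delivers the signal inverted, so the level count matters for polarity as well as for delay.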
These were both drawn while riding in a car so don't expect perfection. Also, my scanner got both sides of the paper on one page (technically, image), which was annoying. So I cleaned the images up a bit.
I'm excited about my idea, however pitiful it is. Tell me what you think!
Thanks luke,
David