There are a few points I would like to raise regarding Raven's post. Don't worry, it's not an argumentative or humongous post.
The reason Power Architecture compatibility was chosen is simply that it's a proven link to a current architecture and supports multiple operating systems. Existing Power applications can apparently run unmodified on the Cell, the idea being to make porting things across easier. Developing a completely new architecture could have taken ten years (bear in mind the idea for the Cell CPU was conceived around 2001, IIRC) and would have been even more of a shock to developers. At least this way programmers have a conventional, familiar platform to start from.
The SPEs are a little more clever than you give them credit for. They are dynamically configurable to provide support for content protection, and they have local memory, asynchronous coherent DMA, and a large unified register file to improve effective memory bandwidth. You mention that streaming is difficult to do, but it's something the Cell does well. The SPEs can be set up as serial or parallel pipelines, with streams passing from one SPE to the next, each performing its own task on the data.
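To give a feel for how an SPE streams data through its local store, here's a minimal sketch of the classic double-buffered DMA loop, assuming the spu_mfcio.h interface from the Cell SDK. The chunk size, the process_chunk() function and the effective-address parameter are made up for illustration, not anything from the actual SDK samples.

```c
#include <spu_mfcio.h>            /* mfc_get / mfc_put / tag helpers */

#define CHUNK 4096                /* bytes per DMA (must be a multiple of 16) */

/* Two local-store buffers so one can be filled by DMA
   while the other is being processed. */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned int size);  /* hypothetical */

void stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int cur = 0;

    /* Prime the pipeline: start fetching the first chunk. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int next = cur ^ 1;

        /* Kick off the DMA for the next chunk before we wait. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, next, 0, 0);

        /* Wait only on the tag for the buffer we're about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_chunk(buf[cur], CHUNK);  /* compute overlaps the other DMA */

        cur = next;
    }
}
```

The point of the double buffering is exactly the streaming overlap mentioned above: the SPE computes on one buffer while the DMA engine fetches the next one behind its back.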
As for saying it's not a multicore processor: it has a dual-threaded, dual-issue PPE (Power Processor Element) with an on-chip memory controller and a controller for a configurable I/O interface, as well as the 8 SPEs. There is a high-bandwidth coherent on-chip bus, plus high-bandwidth memory, for power-hungry applications and for interaction between the processor elements. The PPE and SPEs can also share address space, which has got to make things speedy. The whole design is 'short-wire' (no long wires) to limit communication delays each cycle. The core of the Cell interleaves instructions from two computational threads at the same time, so it could be called a two-way multiprocessor with shared dataflow; this makes the best use of issue slots for maximum efficiency and keeps pipeline depth down. To give an idea of speed, IBM state that simple arithmetic functions execute and forward their results in two cycles, and a double-precision floating-point instruction executes in ten cycles.
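On the programming side, the PPE really does drive the SPEs like extra cores. Here's a rough sketch of what that looks like, assuming the libspe2 interface from the Cell SDK; the embedded SPU program name (worker_spu) and the minimal error handling are placeholders, not anything from a real project.

```c
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_SPES 8                        /* the Cell exposes up to 8 SPEs */

extern spe_program_handle_t worker_spu;   /* hypothetical embedded SPU binary */

/* Each SPE context is run from its own PPE thread, because
   spe_context_run() blocks until the SPU program stops. */
static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx[NUM_SPES];
    pthread_t thread[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++) {
        ctx[i] = spe_context_create(0, NULL);    /* one context per SPE   */
        spe_program_load(ctx[i], &worker_spu);   /* load the SPU image    */
        pthread_create(&thread[i], NULL, run_spe, ctx[i]);
    }

    for (int i = 0; i < NUM_SPES; i++) {
        pthread_join(thread[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    return 0;
}
```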
The Cell processor also has built-in support for virtualisation, meaning you can actually have multiple operating systems running at the same time (like virtual PCs). A big part of the design brief also included CPU support for networking and communication. They have done this by making each SPE capable of autonomously scheduling and receiving DMAs as well as interrupts, so networking or user input can be acknowledged and dealt with immediately.
I am guessing that most developers for it are probably writing single-threaded code for the PPE, as with other systems, because it's the easiest (and most familiar) thing to do. This could be why people say the power is not yet being fully utilised. The PPE in the Cell supports SIMD (Single-Instruction, Multiple-Data), which has been shown to speed up multimedia applications (and gaming) and is also found in PC processors. Compilers that generate SIMD instructions automatically are still maturing, though. Once developers are comfortable with SIMD on the PPE they can then begin to move those instructions over to the SPEs, which also support SIMD. That would really improve efficiency, but it may take a little while. Another hurdle for developers, and one that can greatly increase performance if handled well, is management of the local store memory. At the moment this has to be done by the programmer or with pre-made libraries; if the compiler could look after it, things would be much easier and faster.
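For anyone who hasn't seen PPE-style SIMD, here's a minimal sketch of the kind of thing the VMX/AltiVec intrinsics let you do, processing four floats per instruction. The array size and the scale-and-add operation are arbitrary examples of mine, and it assumes the compiler has AltiVec support enabled (e.g. gcc -maltivec) and 16-byte aligned data.

```c
#include <altivec.h>

#define N 1024   /* arbitrary length, must be a multiple of 4 floats */

/* y[i] = a * x[i] + y[i], four elements per iteration.
   x and y must be 16-byte aligned for vec_ld / vec_st. */
void saxpy_vmx(float a, const float *x, float *y)
{
    vector float va = (vector float){a, a, a, a};

    for (int i = 0; i < N; i += 4) {
        vector float vx = vec_ld(0, &x[i]);   /* load 4 floats           */
        vector float vy = vec_ld(0, &y[i]);
        vy = vec_madd(va, vx, vy);            /* fused multiply-add on 4 */
        vec_st(vy, 0, &y[i]);                 /* store 4 results         */
    }
}
```

The same style of code carries over to the SPEs, which is why starting with SIMD on the PPE is a sensible stepping stone.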
You talk about memory being an issue too, but they have also improved on other processors in the way memory is used, to avoid the old 'memory wall' problem that would otherwise hurt performance (i.e. processor speed outstripping memory latency and bandwidth, so the CPU spends its time waiting on memory rather than computing). The SPEs' local store adds another level of memory hierarchy beyond the registers of other processor architectures. This allows a large number of memory transactions to be in flight simultaneously without requiring the 'deep speculation' that other processors rely on. On top of that, the new Rambus XDR DRAM delivers 12.8 GB/s per 32-bit memory channel, and two channels are supported for a total bandwidth of 25.6 GB/s.
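If you're wondering where the 12.8 GB/s figure comes from: XDR signals at an effective 3.2 Gbit/s per pin (assuming I have the rate right), so a 32-bit channel moves 32 x 3.2 Gbit/s = 102.4 Gbit/s, which is 12.8 GB/s, and the two channels together give the quoted 25.6 GB/s.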
Your point that the word 'synergistic' is made up could equally be applied to the word 'Pentium', which has only existed since Intel made it up in 1993. 'Synergistic' does at least apply to something new, so I think we can forgive STI for choosing a new name (however much it sounds like a buzzword).
For anyone interested, more info on the way the Cell works is described in the IBM paper entitled 'Introduction to the Cell multiprocessor'.
This post has ended up longer than I intended. Sorry if I bored anyone.