AMIDAR: A Class of Reconfigurable Processors
To test the applicability of our model we derived the architecture of a Java bytecode processor from it. A simulator implements this model to verify it and to prove that adaptivity leads to a measurable performance gain. The following figure shows the architecture of the Java bytecode processor.
Currently, the simulator adapts the communication structure to the requirements of the application (adding and removing connections to buses, splitting and merging buses) and can exchange functional units.
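The four bus operations named above can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class BusTopology, its methods, and the idea of modelling a bus as a plain set of functional unit names are hypothetical and are not taken from the simulator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model (not the simulator's code): each bus is represented
// by the set of functional units (FUs) currently connected to it.
final class BusTopology {
    private final List<Set<String>> buses = new ArrayList<>();

    // Creates a new bus with the given FUs attached and returns its index.
    int addBus(String... fus) {
        buses.add(new LinkedHashSet<>(Arrays.asList(fus)));
        return buses.size() - 1;
    }

    // Adds a connection between an FU and a bus.
    void connect(int bus, String fu) {
        buses.get(bus).add(fu);
    }

    // Removes a connection between an FU and a bus.
    void disconnect(int bus, String fu) {
        buses.get(bus).remove(fu);
    }

    // Splits a bus: the FUs in 'moved' that are on the bus leave it and
    // form a new bus, whose index is returned.
    int split(int bus, Set<String> moved) {
        Set<String> source = buses.get(bus);
        Set<String> created = new LinkedHashSet<>(moved);
        created.retainAll(source);
        source.removeAll(created);
        buses.add(created);
        return buses.size() - 1;
    }

    // Merges bus 'from' into bus 'into' and removes 'from'.
    // Indices of buses after 'from' shift down by one.
    void merge(int into, int from) {
        if (into == from) {
            return;
        }
        buses.get(into).addAll(buses.get(from));
        buses.remove(from);
    }

    int busCount() {
        return buses.size();
    }

    // Read-only view of the FUs connected to a bus.
    Set<String> fusOn(int bus) {
        return Collections.unmodifiableSet(buses.get(bus));
    }
}
```

In this sketch, a single shared bus could be split so that, for example, the ALU and the object heap communicate on their own bus, and the two buses could later be merged again. How the simulator actually represents and selects such changes is not described here.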
To demonstrate the effectiveness of adaptivity in the Java bytecode processor, we selected two test applications with different characteristics that are expected to lead to different structures: a data-dominated application (signal convolution) and a control-dominated application (calculation of Ackermann's function).
A Java program that calculates the convolution of two signals was used as the data-dominated application. The control-dominated application calculates Ackermann's function ack(3,2), which mainly results in recursive method calls and if-then-else structures. The resulting structures differ because the two test applications use some functional units very differently. The convolution program heavily uses the object heap and the ALU, whereas the Ackermann program uses the jump unit and the method stack. Both applications use the operand stack and the local variable memory in the same manner. The different characteristics are reflected in the bytecodes that occur. The convolution program consists basically of array and ALU operations. In contrast, the Ackermann program uses if-bytecodes, invokes, returns and only a few ALU operations.
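The original test programs are not reproduced here. As a rough sketch of what the two kinds of workload look like in Java, the following hypothetical class TestApplications contains a recursive Ackermann function and a direct-form convolution; the class name, method names and the use of int signals are assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative stand-ins for the two test applications; the names and
// the int signal type are assumptions, not the original sources.
final class TestApplications {

    // Control-dominated: Ackermann's function. The body is almost only
    // conditional branches, static invokes and returns.
    static int ack(int m, int n) {
        if (m == 0) {
            return n + 1;
        }
        if (n == 0) {
            return ack(m - 1, 1);
        }
        return ack(m - 1, ack(m, n - 1));
    }

    // Data-dominated: discrete convolution of two signals. The inner
    // loop is dominated by array loads/stores and multiply-add work.
    static int[] convolve(int[] x, int[] h) {
        if (x.length == 0 || h.length == 0) {
            return new int[0];
        }
        int[] y = new int[x.length + h.length - 1];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < h.length; j++) {
                y[i + j] += x[i] * h[j];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println(ack(3, 2)); // prints 29
        System.out.println(Arrays.toString(
                convolve(new int[] {1, 2, 3}, new int[] {1, 1}))); // prints [1, 3, 5, 3]
    }
}
```

Compiled with a standard Java compiler, a method like ack consists mostly of if-bytecodes, invokestatic and ireturn, while the inner loop of convolve is built from array accesses (iaload, iastore) and arithmetic (imul, iadd), which matches the bytecode profiles described above.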
To evaluate the speedup of dynamic bus adaptation and FU exchange in comparison to a static architecture, both test programs were run in the simulator in different modes. First, we measured the minimal-cost configuration (worst-case performance), which consists of one single bus for all components and the slowest FUs. Then we measured the best-case performance, using faster FUs wherever they led to a performance improvement. It turns out that the maximum speedup is between 23% and 29%.
Second, we ran the applications in adaptivity mode with a cost limit. The adaptive circuit for the convolution program performs 6% slower than the best-case architecture, and the circuit for the Ackermann program is less than 1% slower than the best-case circuit. The cost limit in both cases is set to 83% of the cost of the best-case circuit. The next figure shows clearly that the adaptive circuit is nearly as fast as the best case but requires only half the cost increase of the best case.
The following figure shows the speedup in relation to the cost limit.
The non-monotonic behavior of the runtime is due to the coarse-grained cost of the functional units. Thus, it can happen that cost points are transferred to the FU area, which degrades the communication. This behavior can probably be eliminated with a more elaborate heuristic.
The aim of our future research will be: