Heuristics

The heuristics in AMIDAR processors have to decide, how the available chip area should be divided. There must be a global heuristics that calculates the ratio between chipsize for FUs and chipsize for busses. Furthermore, we need local heuristics to assign the resources within the different areas of adaptivity (communication and FUs). The following figure shows the interaction between global and local heuristics.

Interaction between global and local heuristics

Statistical Data

All heuristics are based on stalls and utilization of functional units and busses. The stalls are divided into input stalls and wait stalls, so every component requires two counters to store them. An input stall occurs, if a component (FU or bus) could not accept data because it is currently working. Input stalls can occur in busses and functional units. This is in indicator, that a component is apparently not efficient enough to handle incoming data. More precisely, an input stall in an FU indicates that the throughput of the FU is not high enough. Wait stalls occur if a component is ready to work, but has to wait for input data. Wait stalls are saved in the sender component, because the sender component is not efficient enough to deliver data in time, i.e., the latency is to high.

Global Heuristics

The collected statistical data is the input for the heuristics. The global heuristics is very straightforward. The main idea of the global heuristics is a slider, which divides the overall chipsize into chipsize for communication and chipsize for functional units. The following figure illustrates this idea.

Slider for Global Heuristics

Local Heuristics

The local heuristics decide autonomously how to deal with the globally assigned chip area. There are local heuristics for the communication structure and for the functional units.

If the local heuristics for busses can use more chipsize, the heuristics tries to split the bus with most stalls. The source and destination information for delayed data packages are used to make new connections between functional units. If the local heuristics for communication structures has to save chip area, it will first try to merge two busses. These busses are chosen by looking at time conflicts within the last clock ticks. Only busses can be merged that have time conflicts in less than a given percentage over the last n ticks. If no busses can be merged, connections of busses that were very seldom in use within the last n clock ticks would be removed.

The cost of busses are proportional to the number of functional units that are connected to this bus. The cost model is influenced by the idea of a connection matrix where every connection between a functional unit and a bus can be implemented as a tristate buffer with constant cost.

The heuristics for functional units has more different operations. If the heuristics can use more chipsize, it is possible to replace an FU by a more efficient one, duplicate an FU and synthesize a new FU. First of all, the heuristics tries to exchange the FU with most stalls by a more efficient one. The different types of stalls control, which variation of an FU is used to replace the current FU.

If an FU has many input stalls, it cannot accept data because it is currently working. The FU is replaced by a throughput optimized one. If the FU has many wait stalls, this indicates, that another FU was waiting for data. So we need a latency optimized version of the FU. A more generalized heuristics may be based on a vector space, where the number of stalls is the length of a vector in the corresponding dimension. The angle of the resulting vector to the reference vectors of different FU variations is the criterion, which FU is used.

If a functional unit could not be improved anymore, the heuristics may consider to duplicate this FU. This sounds to be a simple process, but requires an adaption of token distribution which has to care about load balancing between similar FUs.

If the heuristics for functional units has to save chipsize, it first removes duplicate FUs. Again, removing duplicate FUs is possible but not trivial due to the adaption of token distribution. If the heuristics has to save more chipsize, FUs with the fewest stalls and lowest utilization are replaced by slower but smaller versions.

A heuristics that triggers a direct synthesis of a specialized FU cannot be based on stalls inside an FU. The token generator has to track which opcode sequences or methods/functions are often used. If these sequences are often slowed down by stalls in the FUs, it may be feasible to trigger a synthesis process for this opcode sequence.