The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. Extrapolating current trends, this fraction is likely to increase in the near future. In this paper, a technique is proposed which uses an additional mini-cache, called the L0-cache, located between the I-cache and the CPU core. This mechanism can provide the instruction stream to the datapath and, when managed properly, can efficiently eliminate the need for high utilization of the more expensive I-cache.
Cache memories account for an increasing fraction of a chip's transistors and overall energy dissipation. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache's set-associativity, while the other, referred to as selective-sets, varies the number of cache sets; and (2) resizing strategy, where one proposal statically sets the cache size prior to an application's execution, while the other allows for dynamic resizing both across and within applications.
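The difference between the two organizations can be illustrated with a small sketch. The line size, set count, and associativity below are illustrative assumptions, not parameters from the paper; the point is only that the two schemes reach the same capacities along different axes.

```python
# Hypothetical parameters for a full-sized cache (assumed values).
LINE_SIZE = 32    # bytes per cache line
FULL_SETS = 512   # number of sets at full size
FULL_WAYS = 4     # associativity at full size

def size_selective_ways(enabled_ways):
    """Selective-ways: vary the associativity, keep the set count fixed."""
    return FULL_SETS * enabled_ways * LINE_SIZE

def size_selective_sets(enabled_sets):
    """Selective-sets: vary the number of sets (a power of two),
    keep the associativity fixed. Note this also changes how many
    address bits index the cache."""
    return enabled_sets * FULL_WAYS * LINE_SIZE

full = FULL_SETS * FULL_WAYS * LINE_SIZE      # 64 KB at full size
# Both schemes can reach half capacity, but along different axes:
assert size_selective_ways(2) == full // 2    # 2 of 4 ways enabled
assert size_selective_sets(256) == full // 2  # 256 of 512 sets enabled
```

One practical consequence of this difference: selective-ways keeps the set-index bits constant across resizings, whereas selective-sets changes them, which affects how blocks must be remapped on a dynamic resize.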
Five techniques are proposed and evaluated that dynamically analyze the program's instruction access behavior and use the results to proactively guide the L0-cache.
The basic idea is that only the most frequently executed portion of the code should be stored in the L0-cache, since this is where the program spends most of its time. Experimental results indicate that more than 60% of the energy dissipated in the I-cache subsystem can be saved.
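The "most frequently executed portion of the code" idea can be sketched offline as a coverage computation over a basic-block execution trace. This is a toy illustration, not one of the paper's five techniques; the function name, the block-id trace representation, and the 90% coverage threshold are assumptions.

```python
from collections import Counter

def hot_blocks(trace, threshold=0.9):
    """Return the smallest set of basic blocks that covers `threshold`
    of all executed instructions -- the portion of the code worth
    keeping in a small L0-cache. `trace` is a sequence of basic-block
    ids, one entry per executed block (a hypothetical representation)."""
    counts = Counter(trace)
    total = sum(counts.values())
    covered, hot = 0, []
    for block, n in counts.most_common():   # hottest blocks first
        if covered / total >= threshold:
            break
        hot.append(block)
        covered += n
    return hot

# A tight loop dominates this toy trace: block 7 executes 90 times,
# while five other blocks execute only twice each.
trace = [7] * 90 + [1, 2, 3, 4, 5] * 2
assert hot_blocks(trace) == [7]   # one block already covers 90%
```

The skewed toy trace mirrors the locality argument in the text: because execution concentrates in a few hot blocks, a very small L0-cache can intercept most fetches and leave the larger I-cache idle.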
Pipeline Microarchitecture
Figure 1 shows the processor pipeline we model in this research. The pipeline is typical of embedded processors such as the StrongARM. There are five stages in the pipeline: fetch, decode, execute, mem and writeback. There is no external branch predictor. All branches are predicted "untaken", and there is a two-cycle delay for "taken" branches. Instructions can be delivered to the pipeline from one of three sources: the line buffer, the I-cache and the DFC. There are three ways to determine where to fetch instructions:
- serial: the sources are accessed one by one in a fixed order;
- parallel: all the sources are accessed in parallel;
- predictive: the access order can be serial with flexible order or parallel, based on prediction.
Serial access results in minimal power because the most power-efficient source is always accessed first, but it also incurs the highest performance degradation because every miss in the first accessed source generates a bubble in the pipeline. Parallel access, on the other hand, causes no performance degradation, but the I-cache is always accessed and there are no power savings in instruction fetch. Predictive access, if accurate, can combine the power efficiency of serial access with the low performance degradation of parallel access; it is therefore the approach we adopt. As shown in Figure 1, a predictor decides which source to access first based on the current fetch address. Another function of the predictor is pipeline gating. Suppose a DFC hit is predicted at cycle N for the next fetch. The fetch stage is disabled at cycle N + 1 and the decoded instruction is sent from the DFC to latch 5. Then, at cycle N + 2, the decode stage is disabled and the decoded instruction is sent from latch 5 to latch 2. If an instruction is fetched from the I-cache, the hit cache line is also sent to the line buffer, which can then provide instructions for subsequent fetches to the same line.
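The energy/performance trade-off among the three access strategies can be sketched with a toy cost model. The per-source energy costs and the serial probe order below are illustrative assumptions, not measurements from this work.

```python
# Assumed unit energy cost per access to each fetch source, and an
# assumed serial probe order (cheapest source first).
COST = {"line_buffer": 1, "DFC": 2, "I-cache": 10}
ORDER = ["line_buffer", "DFC", "I-cache"]

def serial_fetch(hit_source):
    """Probe the sources one by one in fixed order; every miss in an
    earlier source adds one pipeline bubble. Returns (energy, bubbles)."""
    energy, bubbles = 0, 0
    for src in ORDER:
        energy += COST[src]
        if src == hit_source:
            return energy, bubbles
        bubbles += 1

def parallel_fetch(hit_source):
    """Access all sources at once: no bubbles, but every source
    (including the I-cache) burns energy on every fetch."""
    return sum(COST.values()), 0

def predictive_fetch(hit_source, predicted):
    """Probe the predicted source first; on a misprediction, fall back
    to the remaining sources in serial order, paying bubbles."""
    energy, bubbles = COST[predicted], 0
    if predicted == hit_source:
        return energy, bubbles
    for src in ORDER:
        if src == predicted:
            continue
        bubbles += 1
        energy += COST[src]
        if src == hit_source:
            return energy, bubbles

# An accurate prediction gets serial-like energy with zero bubbles:
assert predictive_fetch("DFC", predicted="DFC") == (2, 0)
# Serial pays bubbles on an I-cache hit; parallel pays full energy always:
assert serial_fetch("I-cache") == (13, 2)
assert parallel_fetch("DFC") == (13, 0)
```

Under this model, predictive access matches serial energy and parallel latency exactly when the predictor is right, which is why the text adopts it; its worst case (a misprediction) degrades toward the serial figures.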