Asynchronous Queue Machines with Explicit Forwarding

Lennart Beringer

Doctor of Philosophy
School of Informatics
University of Edinburgh
2002
Abstract

We consider computational models motivated by processors which exhibit architectural asynchrony and allow operands to bypass the register bank using a forwarding mechanism. We analyse the interaction between asynchrony and forwarding, derive constraints on the usage of forwarding for various models of operation, and study consequences for compilers targeting such processors.

Our approach to reasoning about processor behaviour is programming language based. We introduce an assembly language in which forwarding is explicitly visible. Operational models corresponding to processor abstractions are expressed as structural operational semantics for this language. The benefits of this approach for defining program execution and for relating processor models formally are demonstrated. Furthermore, we study the restrictions on the class of admissible programs for each operational model. Under our programming language perspective, these constraints are expressed as static semantics and formalised as type systems. Suitability of forwarding schemes for particular models of operation follows from soundness and completeness results which are established by standard programming language proof techniques.

Well-typed programs are structurally correct and cannot experience run-time errors due to ill usage of the forwarding mechanism.

Exposing asynchrony and forwarding to the programmer allows a compiler to optimise forwarding behaviour by scheduling operands. We show how program analysis can decide which values to communicate through registers and which ones to forward. The analysis is expressed as a dataflow problem for an intermediate language and is proven sound with respect to a dynamic semantics. Solutions to the dataflow equations yield translations into the assembly language which are functionally faithful to the operational semantics and also structure-preserving as resulting programs are well-typed.

The theoretical development of the translation is complemented by a prototypical implementation. Experimental results are included for a symbolic conversion of Java virtual machine code into the intermediate language, indicating that application programs contain sufficient opportunities for forwarding to make our approach viable.

In conclusion, we demonstrate the benefits of a programming language based view for reasoning about programs targeting architectures with asynchrony and forwarding.
Acknowledgements

First and foremost I would like to thank Colin Stirling, whose continuous guidance and support has been invaluable. I admire his ability to foresee the scientific development of my work, and many problems en route could not have been solved without his insight and experience. Colin’s commitment and encouragement at times when things did not look so bright is particularly acknowledged.

Secondly, I am grateful to Ian Stark, who joined our weekly meetings during the final year, and whose knowledgeable expertise (not only) on programming language topics guided the development of the later parts of this thesis. Colin and Ian also read several versions of this thesis, and their detailed feedback on content and presentation helped enormously to bring this document into its final shape.

I would like to thank D K Arvind for (twice) attracting me to Edinburgh, and for introducing me to forwarding and asynchrony.

Like many earlier generations of students, I experienced the atmosphere of LFCS and ICSA as extremely stimulating, both scientifically and socially. I am grateful to all their members for their encouragement, and to the secretarial, administrative and technical staff for providing a smooth organisational environment. Of course, my time in Edinburgh would not have been half as enjoyable had it not been for the fellow PhD students, flatmates and friends from EUMS and the resulting numerous social activities and discussions.

Financially, I am grateful to the Division of Informatics for covering my fees, and to the German Academic exchange Service (DAAD) for a one-year scholarship under the HSP III program, reference number D/98/04840. During the last months, I have been employed by the EU-funded research project Mobile Resource Guarantees (MRG), and I am indebted to its leaders Don Sannella and Martin Hofmann, as well as all other participants, for giving me the time to finish this thesis.

Paul Taylor’s prooftree package was used for typesetting proof rules and derivation trees.

I dedicate this thesis to my parents for their continuing support and encouragement.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise, and that this work has not been submitted for any other degree or professional qualification. An earlier version of the material in Chapters 3 and 4 was presented at the Fourth International Symposium on Theoretical Aspects of Computer Software (TACS’01) [Ber01].
# Table of Contents

1 Introduction .................................................. 1
   1.1 Programming language based reasoning ............... 2
   1.2 Program analysis for scheduling communication ..... 3
   1.3 Summary of contributions ............................... 4
   1.4 Synopsis ............................................... 6

2 Background and motivation ................................. 9
   2.1 Asynchronous processors ............................... 10
   2.2 Forwarding and compounding .......................... 14
      2.2.1 Little or no forwarding ........................... 14
      2.2.2 Register bypass unit .............................. 14
      2.2.3 Forwarding in the control unit .................. 15
      2.2.4 Compiler-based forwarding ....................... 15
      2.2.5 Related synchronous architectures ............... 19
   2.3 Asynchronous queue machines .......................... 20
   2.4 Processor design and verification ...................... 22
   2.5 Programming language based reasoning ............... 24
   2.6 Related work .......................................... 27
      2.6.1 Programming language technology and processor design ... 27
      2.6.2 Typing in intermediate and low-level languages .... 29
      2.6.3 Program analysis frameworks ..................... 34
   2.7 Notation ............................................... 35
3 Sequential queue machines
3.1 ALEF: a language for explicit forwarding .......................... 40
3.2 Dynamic semantics ...................................................... 43
3.3 Static semantics .......................................................... 50
3.4 Principal types ............................................................ 57
3.5 Program execution ......................................................... 60
  3.5.1 Dynamic semantics .................................................. 61
  3.5.2 Static semantics ....................................................... 64
  3.5.3 Type inference ........................................................ 66
3.6 Discussion ................................................................. 72
  3.6.1 Summary and outlook ............................................... 74

4 Asynchronous queue machines ............................................. 75
4.1 Syntax and dynamic semantics ......................................... 76
4.2 Determinism, safety and completeness ............................... 81
4.3 Dealing with non-determinism ......................................... 86
  4.3.1 A static semantics for determinism ............................... 87
  4.3.2 Pipeline-dependencies ............................................. 96
  4.3.3 Data-dependencies .................................................. 99
  4.3.4 Other sequentialisation mechanisms .............................. 105
4.4 Program execution ....................................................... 106
  4.4.1 Dynamic semantics ................................................ 107
  4.4.2 Static semantics ..................................................... 108
4.5 Finite operand queues .................................................. 113
  4.5.1 Distributed execution .............................................. 119
4.6 AQM’s and Tomasulo’s algorithm ...................................... 123
  4.6.1 Tomasulo’s algorithm .............................................. 123
  4.6.2 Mapping to AQM’s ................................................ 127
  4.6.3 Limitations ........................................................... 127
  4.6.4 Conclusion .......................................................... 130
4.7 Discussion ............................................................... 130
  4.7.1 Summary and outlook .............................................. 133
5 Queue machines with registers

5.1 Sequential execution
5.1.1 Dynamic semantics
5.1.2 Static semantics
5.1.3 Program execution

5.2 Distributed execution
5.2.1 Dynamic semantics
5.2.2 Determinism, safety and completeness
5.2.3 A static semantics for determinism
5.2.4 Program execution

5.3 Discussion
5.3.1 AQM’s and compounding
5.3.2 Conclusion and outlook

6 Intermediate program analysis

6.1 Static single assignment

6.2 An intermediate language IL
6.2.1 Syntax and regular programs
6.2.2 Programs in Static Single Assignment form
6.2.3 Dynamic semantics

6.3 Analysis of variable usage
6.3.1 Dataflow analysis for liveness
6.3.2 Usage analysis for regular programs
6.3.3 Usage analysis for SSA

6.4 Soundness of usage analysis
6.4.1 Non-standard dynamic semantics
6.4.2 Soundness for regular programs
6.4.3 Soundness for SSA programs
6.4.4 Discussion

7 Translating intermediate programs into ALEF

7.1 Deciding forwardability
7.2 Eliminating Φ-instructions .............................................. 230
7.3 Translation .............................................................. 238
7.4 Type-correctness ......................................................... 241
7.5 Allocating operand queues and registers ............................ 255
7.6 Functional correctness .................................................. 259
7.7 Discussion ............................................................... 281
  7.7.1 Summary of the compilation ...................................... 281
  7.7.2 Extensions and future work ...................................... 282
  7.7.3 Conclusion .......................................................... 287

8 Implementation ............................................................. 289
  8.1 Analysis of JVM bytecode ............................................. 290
    8.1.1 Java virtual machine code .................................... 290
    8.1.2 Translating each stack position into a variable .......... 292
    8.1.3 Translating each push operation into a variable .......... 297
    8.1.4 Implemented instruction set .................................. 301
  8.2 Results ...................................................................... 302
    8.2.1 Conversion into IL ............................................... 302
    8.2.2 Translation into ALEF .......................................... 305
  8.3 Discussion ............................................................... 309

9 Discussion ................................................................. 313
  9.1 Summary ............................................................... 313
  9.2 Shortcomings and future work ..................................... 314
    9.2.1 Full implementation ............................................ 314
    9.2.2 Extensions of the computational model ................... 315
    9.2.3 Connections to linear logic and other intermediate forms . 317
    9.2.4 Explicitness of architectural features ...................... 318
    9.2.5 Formalisations .................................................. 319
  9.3 A future application .................................................. 321

Bibliography ................................................................. 323
## List of Figures

<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Sutherland’s micro-pipeline</td>
<td>12</td>
</tr>
<tr>
<td>2.2</td>
<td>Compoundings for [AM99]’s example program</td>
<td>16</td>
</tr>
<tr>
<td>2.3</td>
<td>Verification against ISA using abstraction functions</td>
<td>24</td>
</tr>
<tr>
<td>3.1</td>
<td>General organisation of processors suitable for ALEF</td>
<td>41</td>
</tr>
<tr>
<td>3.2</td>
<td>Visualisation of program (3.1)</td>
<td>43</td>
</tr>
<tr>
<td>3.3</td>
<td>Architectural model for sequential dynamic semantics</td>
<td>44</td>
</tr>
<tr>
<td>3.4</td>
<td>Dynamic semantics of opcodes</td>
<td>46</td>
</tr>
<tr>
<td>3.5</td>
<td>Sequential dynamic semantics of instruction sequences</td>
<td>46</td>
</tr>
<tr>
<td>3.6</td>
<td>Example derivation for $C \xrightarrow{L} D$</td>
<td>47</td>
</tr>
<tr>
<td>3.7</td>
<td>Proof of determinacy for $t = sr$</td>
<td>49</td>
</tr>
<tr>
<td>3.8</td>
<td>Sequential dynamic semantics including jumps</td>
<td>62</td>
</tr>
<tr>
<td>3.9</td>
<td>Example derivation for $C \xrightarrow{L} D$ including jumps</td>
<td>62</td>
</tr>
<tr>
<td>3.10</td>
<td>Sequential dynamic semantics for programs $P = (N, A)$</td>
<td>63</td>
</tr>
<tr>
<td>3.11</td>
<td>Example program for type inference via unification</td>
<td>71</td>
</tr>
<tr>
<td>4.1</td>
<td>Architectural model for distributed dynamic semantics</td>
<td>77</td>
</tr>
<tr>
<td>4.2</td>
<td>Execution of a non-deterministic program</td>
<td>81</td>
</tr>
<tr>
<td>4.3</td>
<td>Execution traces for $\pi_s$ (left) and $\pi_r$ (right)</td>
<td>84</td>
</tr>
<tr>
<td>4.4</td>
<td>Type system for deterministic execution</td>
<td>88</td>
</tr>
<tr>
<td>4.5</td>
<td>Confluence of $u$ and $t$</td>
<td>96</td>
</tr>
<tr>
<td>4.6</td>
<td>Type system for finite operand queues</td>
<td>115</td>
</tr>
<tr>
<td>4.7</td>
<td>Interleavings of program (4.9) for various length restrictions</td>
<td>120</td>
</tr>
<tr>
<td>4.8</td>
<td>Tomasulo’s algorithm</td>
<td>124</td>
</tr>
<tr>
<td>4.9</td>
<td>Snapshot after issuing instruction [1]</td>
<td>125</td>
</tr>
</tbody>
</table>
4.11 Snapshot after issuing all instructions of program (4.10) ....... 126
4.12 Example for instruction execution in zero time ....... 129

5.1 Architecture for sequential dynamic semantics with registers ....... 138
5.2 Sequential execution with registers: dynamic semantics ....... 139
5.3 Sequential dynamic semantics including jumps and registers ....... 149
5.4 Dynamic semantics for programs involving registers ....... 165

6.1 Intermediate language IL: syntax ....... 175
6.2 Relationship between dataflow solutions, $\Rightarrow_\alpha$ and $\rightarrow$ ....... 216

7.1 States $\Pi_i$ and $\Delta_i$ ....... 235
7.2 Change of $Q(q)$ when executing $\llbracket ass \rrbracket_p = \text{id} q q$ ....... 270
7.3 Compilation from IL to ALEF ....... 281
### List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>Type system</td>
<td>52</td>
</tr>
<tr>
<td>3.2</td>
<td>Type system including jumps</td>
<td>64</td>
</tr>
<tr>
<td>6.1</td>
<td>Dataflow information for programs (6.5) to (6.8)</td>
<td>197</td>
</tr>
<tr>
<td>6.2</td>
<td>Equations (6.13) applied to program (6.5)</td>
<td>198</td>
</tr>
<tr>
<td>6.3</td>
<td>Simplification of Table 6.2</td>
<td>199</td>
</tr>
<tr>
<td>6.4</td>
<td>Equations (6.13) applied to program (6.6)</td>
<td>199</td>
</tr>
<tr>
<td>7.1</td>
<td>Translation $[P]_P$ for IL program $P = (\text{H}, \text{A}, \text{succ}, \text{entry})$</td>
<td>239</td>
</tr>
<tr>
<td>7.2</td>
<td>Type system for extended instruction set</td>
<td>239</td>
</tr>
<tr>
<td>8.1</td>
<td>Operational semantics of some JVML instructions</td>
<td>291</td>
</tr>
<tr>
<td>8.2</td>
<td>Translation scheme A for non-jumps</td>
<td>293</td>
</tr>
<tr>
<td>8.3</td>
<td>Translation scheme A for jumps</td>
<td>294</td>
</tr>
<tr>
<td>8.4</td>
<td>Translation scheme B for non-jumps</td>
<td>297</td>
</tr>
<tr>
<td>8.5</td>
<td>Translation scheme B for jumps</td>
<td>298</td>
</tr>
<tr>
<td>8.6</td>
<td>Implemented JVML instructions</td>
<td>302</td>
</tr>
<tr>
<td>8.7</td>
<td>Linpack results for scheme A: Translation JVM → IL</td>
<td>303</td>
</tr>
<tr>
<td>8.8</td>
<td>Linpack results for scheme B: Translation JVM → IL</td>
<td>304</td>
</tr>
<tr>
<td>8.9</td>
<td>Linpack results for scheme A: Translation JVM → ALEF</td>
<td>305</td>
</tr>
<tr>
<td>8.10</td>
<td>Register/queue access operations (scheme A)</td>
<td>307</td>
</tr>
<tr>
<td>8.11</td>
<td>Linpack results for scheme B: Translation JVM → ALEF</td>
<td>308</td>
</tr>
<tr>
<td>8.12</td>
<td>Register/queue access operations (scheme B)</td>
<td>310</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

In this thesis we study computational models of asynchronous processors with operand forwarding. Using this architectural feature, instructions may communicate operands directly by sending them to the functional unit where they will be consumed, bypassing the register bank. In synchronous architectures, forwarding allows one to schedule instructions more aggressively as the number of pipeline-slots separating the producer of an operand from its consumer may be reduced [Fly95]. In these architectures, forwarding may for example be implemented using Tomasulo’s algorithm where data is broadcast to all functional units and consumed using a tag-matching operation [Tom67]. In asynchronous architectures, broadcasting is undesirable as it globally synchronises operations and couples functional units too tightly. On the other hand, a distributed forwarding regime where operands may be sent to specific destinations fits much better with the data-driven nature of asynchronous computation. Consequently, several recent asynchronous architectures explored different forwarding schemes or closely related approaches. However, the interaction between operand forwarding and asynchronous computation is complex and difficult to reason about due to the distribution of asynchronous state. As a result, forwarding schemes are often implemented in an ad-hoc manner without being complemented by detailed studies of the legality of forwarding schemes or the resulting programming model.
1.1 Programming language based reasoning

In order to fill this gap, we propose a programming language based approach where forwarding is explicitly visible to the programmer. While some architectures allow forwarding information to occur in the assembly language in order to expose scheduling opportunities to the compiler, a systematic investigation into the structural requirements for code using forwarding has not yet been provided. For example, the SCALP asynchronous architecture admits programs which use the forwarding mechanism in unsafe ways, potentially leading to a deadlock of the processor, operand overflow, operand starvation or non-determinism [End96]. Our programming language approach allows one to eliminate these hazards before runtime by characterising safe code, resulting in proven execution safety. Programs which might run into hazards are rejected before an attempt is made to execute them. We achieve the necessary mathematical rigour by following the approach of modern programming language design which complements the syntactic definition of a language by structural descriptions of runtime behaviour and of static program properties. These static and dynamic semantics allow us to reason about program characteristics in an implementation-independent way, to compare the operational behaviour of implementations systematically and to relate syntactically enforceable program properties to runtime behaviour. Often formalised as type systems, static semantics aim to eliminate all dynamic hazards under consideration at compile time, admitting only those programs for execution which could be proven to be free of hazards. As a result, not only is the safety of program execution ensured, but also performance benefits are observed as runtime checks and routines for recovering from hazardous situations need not be implemented.

As a test-bed for reasoning about forwarding schemes we define a small assembly language for explicit forwarding called ALEF. Different dynamic semantics are given to ALEF, corresponding to different abstractions of processor behaviour. The first dynamic semantics models sequential execution similar to conventional descriptions of the instruction set architecture (ISA). Later, we consider (interleaved) parallel execution and execution under limitations on hardware resources. The dynamic semantics are compared to each other formally, and give rise to different error conditions. Each model’s characteristic hazards are subsequently eliminated by typing calculi. Sound-
ness results are included, showing that well-typed programs will indeed not exhibit the runtime hazards under consideration.

1.2 Program analysis for scheduling communication

Exposing forwarding to the programmer allows the compiler to schedule the communication by deciding for each operand by which mechanism it should be sent. The architectures we concentrate on implement forwarding using operand queues where the head value is removed from the queue during consumption. Hence the compiler’s options for scheduling are limited as not all values are suitable for forwarding. In the second part of the thesis we analyse these restrictions and develop the program analysis necessary for detecting the forwardability of values in an intermediate language. We show that the forwarding discipline imposes linearity conditions on the usage of intermediate values which may be expressed using dataflow equations. We consider an operational semantics of the intermediate language which captures the essence of forwarding by restricting reading and writing capabilities of program variables. We then prove the dataflow equations sound with respect to this operational model. The earlier insight into the structural requirements on assembly programs is subsequently exploited as we demonstrate how dataflow solutions in the intermediate language correspond to well-typedness on the assembly level. In particular, we provide a translation from the intermediate language into the assembly language which is both functionally correct and structure preserving, including the allocation of operand queues and registers. We thus demonstrate that forwarding may be lifted from the architectural level to the level of program analysis. This links our work to current research in the programming language community such as typed compilation and typed assembly languages, as well as to recent work on the relationship between different program analysis frameworks [NNH99]. On the other hand, our work shows how programming language concepts may be used to raise the level of reasoning about computer architectures. We expect this to be relevant for other architectures which expose the communication structure to the programmer such as (reconfigurable) field-programmable gate arrays (FPGA’s [CH02]), transport-triggered architectures [CM91] and the Raw
model of computation [TKM+02]. In all of these models, safety of execution relies on the coordinated usage of the underlying hardware which is susceptible to structural properties of code.

Finally, we provide a prototypical implementation of our compilation approach, consisting of the dataflow analysis in the intermediate language, queue and register allocation for the forwardable and non-forwardable entities respectively, and the translation into ALEF. By linking our implementation to two symbolic conversions of Java virtual machine code [LY97] into our intermediate language, we give experimental results for well-known benchmark programs. This allows us to address the question whether the requirement of linear usage limits the forwarding opportunities to an extent which makes forwarding inviable. Both the conceptual analysis and the experimental results indicate that indeed a huge number of forwarding opportunities exist in application programs as virtually the whole JVM operand stack may be turned into forwarding. Under the assumption that forwarding queues may be built in a more energy-efficient way than large register banks with multiple read/write ports, our work hence provides a means to reduce the power consumption in virtual machine implementations which compile bytecode into machine language [Sun01], [AAB+00]. A particular benefit should result for mobile JVM’s running on personal digital assistants (PDA’s) or mobile phones where battery lifetime is limited.

### 1.3 Summary of contributions

On a pragmatic level, our main contribution consists of an argument for a programming language based approach to reasoning about processor architecture, compiler and instruction set design. The need for such a methodology is outlined in Chapter 2, motivated by the complexity of modern system architecture. The key ingredients are operational semantics for describing processor behaviour and static semantics for characterising some program properties. Although these ingredients have been of central importance in the programming language community, their application to processor architectures has to our knowledge not yet been explored.

We exercise our methodology by applying it to the forwarding of operands in pro-
cessors with instruction-level asynchrony. The second contribution thus consists of a novel abstract machine model, *asynchronous queue machines* (AQM’s). AQM’s unify existing architectures with respect to forwarding, highlighting their similarities and differences.

On a more technical level, the first part (Chapters 3 to 5) provides

- an assembly language for AQM’s called ALEF (Assembly Language for Explicit Forwarding). Exposing forwarding to the programmer/compiler, ALEF facilitates the usage of programming language technology and serves as our main vehicle for analysing forwarding behaviour and linking it to program analysis.

- various dynamic semantics for ALEF, each corresponding to a particular aspect of processor behaviour. These semantics demonstrate the flexibility gained by employing programming language notions for defining processor abstractions uniformly and comparing them formally.

- static semantics which eliminate structurally incorrect programs before runtime. As each dynamic semantics gives rise to characteristic error situations, the benefits of statically enforceable safety conditions become apparent. These conditions are expressed as type systems which are tied to the dynamic semantics by soundness and completeness results.

In order to demonstrate the approach without being overwhelmed by implementation details, we present simplified processor models and a deliberately restricted language. Starting from a sequential model of operation for pure forwarding (sequential AQM’s), we add other features (concurrency, hardware limitations, registers) as we proceed. At each step, dynamic and static semantics arise naturally from the corresponding notions of the previous model. This demonstrates the advantages of a structured approach, even more so as proofs of correctness generalise equally well. For example, in the model for concurrent operation (Chapter 4), we distinguish between the derivation of conditions necessary for correct execution and means for discharging such constraints. This separation reappears when registers are introduced in Chapter 5: as the sources of error conditions are generalisations of the ones examined in Chapter 4, the means for discharging the constraints are seen to be correct by invoking the earlier results.
Chapter 1. Introduction

The second part (Chapters 6 to 8) substantiates the claim that a programming language based approach is beneficial for linking processor behaviour to program analysis and compiler optimisation. We provide

- the program analysis for deciding which operand communication mechanism should be used for each intermediate result. The analysis is presented as a dataflow problem for an intermediate language, and is formally proven sound with respect to a non-standard operational semantics (Chapter 6).

- a compilation route from intermediate code into ALEF which is based on solutions to the dataflow equations and includes an allocation phase for registers and queue names. We prove functional correctness by relating the dynamic semantics of the intermediate language to the dynamic semantics of ALEF. The translation is also shown to be structure preserving as each dataflow solution yields ALEF code which is well-typed according to the linear type system of Chapters 3 to 5. We thus lay the foundations for a verifiable compilation from high-level code into ALEF, similar to recent advances in the compilation of functional programming languages (Chapter 7).

- a prototypical implementation of the compilation process (Chapter 8). Although concrete instructions are replaced by symbolic operations, program analysis and translation proceed as outlined in Chapters 6 and 7, starting from Java virtual machine (JVM) code. The resulting ALEF programs indicate the multitude of forwarding opportunities in application programs.

The last item thus provides modest experimental evidence for our approach, complementing the more theoretical justification presented in the preceding chapters.

1.4 Synopsis

The thesis is structured as follows. Chapter 2 describes the background and motivation for this work, summarising trends in (asynchronous) processor architecture, arguing the applicability of programming language concepts, discussing related work, and introducing some notational conventions.
The assembly language for explicit forwarding ALEF is introduced in Chapter 3. It is equipped with an operational semantics where instructions are executed sequentially. In order to understand the basic properties of forwarding in isolation, all operands are forced to be communicated via the forwarding mechanism, i.e. no support for registers is provided. The characteristic runtime hazards are discussed, and subsequently eliminated by a static semantics consisting of a type system based on notation from linear logic [Gir86]. Types abstractly characterise the shape of configurations, given by the number of items in each operand queue.

Chapter 4 considers alternative dynamic semantics which are closer to processor implementations. First, we consider a distributed model of execution where instructions await their operands in instruction queues, and head instructions may execute as soon as all their operands have arrived. The resulting out-of-order execution is purely governed by the delays inside the functional units, on which no assumptions are made. Consequently, an additional class of runtime hazards is observed as instructions compete for writing access to operand queues. The dynamic semantics is formally related to the earlier sequential dynamic semantics. Instead of modifying the operational semantics until full compatibility between the two models is achieved we characterise the programs for which compatibility holds. This approach is motivated by the desire to constrain the operational behaviour as little as possible, i.e. make no compromises regarding distributed asynchrony. Programs which fulfil the compatibility conditions are considered safe for distributed execution, and the earlier typing system is extended to eliminate unsafe programs. As the analysis is split into a phase of deriving conditions which must be fulfilled for safe execution and a phase of discharging conditions which are fulfilled, a scalable approach is taken where more programs may be proven safe at the cost of a more involved program analysis.

Secondly, we consider operational models for implementations where operand queues are of finite length. Again, the resulting restrictions are treated as conditions on the programs, and are indeed shown to be related to the properties necessary for safe distributed execution.

Chapter 5 introduces registers and reexamines the earlier operational models correspondingly. The robustness of our approach is demonstrated as extending the static se-
mantics in the expected way proves sufficient for guaranteeing hazard-free execution. In particular, the characteristic property of registers to allow the repeated consumption of values is reflected in the linear typing calculi by the expected exponentials.

Chapters 6 and 7 describe the program analysis for intermediate code and the translation of intermediate code into ALEF. We introduce a small intermediate language with assignments and branch instructions, and develop the dataflow equations for detecting single usage of program variables. We also consider programs in static single assignment (SSA) form [AWZ88], and show how the dataflow equations may be extended appropriately. In both cases, the dataflow equations are proven sound with respect to a dynamic semantics of the intermediate language which restricts the read/write capabilities of some variables similarly to the low-level forwarding. On the other hand, the semantics are also related to the standard dynamic semantics of the the intermediate language. Chapter 7 presents the translation of programs from the intermediate language into ALEF. We show how each solution to the dataflow equations results in a different ALEF program, and prove that the result is always well-typed with respect to the calculus of Chapter 5. The corresponding formal proof relates intermediate-level structure (dataflow equations) to low-level structure (assembly-level typing calculus), independently of the concrete allocation of operand queues and registers. Subsequently, we treat the functional correctness of the translation. We characterise when forwardable entities need to be mapped to different queues, and obtain operand queue allocations by colouring the corresponding conflict graph. The proof of functional correctness finally shows that certain invariants hold during ALEF program execution, corresponding to properties in the execution of the intermediate program.

Chapter 8 presents the experimental results. We discuss the two conversion schemes from JVM code into the intermediate language and compare their effectiveness for transforming operand stack communicated values into forwardable entities, based on the program analysis and the translation introduced in Chapters 6 and 7.

Finally, we conclude in Chapter 9 by discussing shortcomings and possible extensions of our work and outlining some topics for future research.
Chapter 2

Background and motivation

This chapter provides the background and motivation for the work presented in this thesis, outlines our approach, and discusses related work. We touch on a number of areas, ranging from asynchrony and hardware verification to programming language technology and program analysis.

The motivation for our research arises from the desire to reason about operand forwarding in asynchronous processors. Although we are not aiming to justify circuit-level asynchrony from a hardware design perspective, we summarise some arguments for asynchronous operation in Section 2.1 in order to explain the concepts underlying our later computational models. We then discuss the various schemes of forwarding occurring in recent architectures, leading from no forwarding via exclusively hardware-based forwarding to explicit forwarding and compounding (Section 2.2). Armed with this overview, Section 2.3 introduces asynchronous queue machines as a unifying model of computation for globally asynchronous architectures with forwarding. Subsequently, we discuss current approaches to processor verification in Section 2.4. This is followed by motivating why a programming language based approach may be more suitable for understanding forwarding conceptually, complementing the existent verification techniques (Section 2.5). Related work is discussed in Section 2.6, concentrating on applications of structural techniques to hardware design, and of type systems to assembly-level languages and compilation. Finally, Section 2.7 introduces notational conventions employed throughout the thesis.
2.1 Asynchronous processors

The operational models we consider in this thesis are inspired by processor architectures which employ architectural asynchrony, i.e. the activities of functional components of a processor are not centrally synchronised by a global clock. Instead, components may execute their tasks independently at different speed. Some effects of architectural asynchrony occur in globally synchronous systems, such as out-of-order execution in super-scalar architectures or multi-processor chips. The archetypical situation for architectural asynchrony however arises if components are clocked at different rates and communicate by handshaking (the locally synchronous, globally asynchronous regime) or if the system clock is removed altogether, resulting in a purely asynchronous regime. In the latter case, execution delays may be influenced by environmental factors such as voltage and temperature, variations during fabrication, or the actual data to be processed.

Both forms of asynchrony are being investigated as potential alternatives to traditional hardware design philosophies, motivated by the growing design complexity for globally synchronous systems. Advocates such as [BS95], [DN97], [Hau95] and [Man00] argue that a number of technological drawbacks found in conventional design methodologies may be overcome using global asynchrony:

Cost of computation versus cost of communication. As the integration density continues to increase, the relative cost of communicating values (wiring) dominates the cost of actually performing computation. The absence of a central clock removes the need to communicate clock signals and thus decreases the chip area needed for routing.

Absence of clock skew. Ensuring that the clock signal arrives at all components of a chip at the same time is one of the most time-consuming tasks during chip design. Timing closure is often achieved by using gated clocks and clock trees, with negative effects on power consumption or die size. Asynchronous systems do not rely on the arrival of a signal at a gate at a globally determined point in time, so the clock skew problem does not arise.

Average case performance. The clock speed of a synchronous system is determined
by most pessimistic assumptions on the speed of individual components, propagation delays, incoming data and environmental conditions. Unless the clock speed can be adapted at runtime, presence of more favourable situations cannot be exploited, and time and energy are wasted. Asynchronous systems often allow a computation to proceed at its natural speed and can thus adapt to various runtime conditions, resulting in expected average-case rather than worst-case performance.

**Modularity and modifiability.** As decoupled components do not rely on global timing assumptions, they can be designed in a more modular way. The performance of a design can be improved more easily by replacing bottleneck components by improved implementations, without having to reevaluate the timing constraints or re-routing other areas of the chip. Consequently, the life-time of a design is extended, leading to a more favourable cost-per-design ratio.

**Power consumption.** Although modern synchronous systems switch off or slow down unused parts of a chip when no useful work is performed, the overhead of the corresponding regulation mechanisms disappears if the activity of gates is determined locally, as is the case in globally asynchronous systems.

**Security.** By probing and observing characteristic current spikes, an outside attacker can identify executed instructions and analyse the data of a computation. Successful experiments on retrieving the complete keys of commonly used smart-card security protocols (RSA, DES) by probing the electro-magnetic (EM) characteristics have been performed [GMO01]. It is argued in [MAK00] that asynchronous systems blur patterns in the curves of current or EM radiation and thus help inhibiting attacks based on power analysis. Furthermore, the vulnerability of processors to modulation of clock and power by applying additional external signals and thus changing the processor’s behaviour is reduced under the asynchronous regime.

Most of these advocated advantages have been validated only in isolation or for special-purpose applications, and are counterbalanced by increased transistor counts, negatively effecting die area, power consumption and design complexity. In addition,
asynchronous systems need non-standard encoding techniques for both signals (handshaking protocols, transition signalling) and data (1-out-of-\(n\), \ldots) and are difficult to verify and test [LGS95][HBB94]. For these reasons, most mainstream hardware design is done using synchronous design styles.

Nevertheless, circuit-level design of asynchronous systems has improved over the last decade with the advent of specific design methodologies. These are often based on concurrency formalisms such as CSP [Mar86] [Bur87], CCS [LBM96] [BM00] or Petri nets [SKLY97] [YGL00] [CM99a] and have partially resulted in design and synthesis tools. Some of these formal methods have been validated during the design of full processors [MLM97] [DGY93] or parts thereof [MLM99] [CM99a].

On a higher level, novel processor architectures have been proposed [Pav94] [GFC99] [RB95] [SSM94] [Mul01] [OIU01] [WA01], most of which are based on Sutherland’s micro-pipeline [Sut89].

This basic processing structure (see Figure 2.1) consists of a series of storage elements realised by asynchronous composition and controlled by specific logic gates (Muller C gates). In the absence of circuitry between the stages, the micro-pipeline acts as an elastic FIFO queue as the number of values held in the queue may vary dynamically. If logic functionality is inserted between the stages, the micro-pipeline acts like a conventional processor pipeline, but delays are determined dynamically. A functional component may start to process data incoming from its left neighbour as soon as it has communicated the previous result to the neighbour to its right. If some stages are left without circuitry, an elastic processing pipeline is obtained holding a varying number
of items. The actions of each stage are governed by local constraints such as the availability of data, and buffering between the stages decouples the progress in adjacent functional units, resulting in a self-adapting throughput. For example, a long latency for a specific instruction in the first pipeline stage may be compensated by short latency for the same instruction in the second stage. In contrast, a synchronous pipeline would artificially extend the latency of the second stage to match the latency of the first stage, resulting in a reduced throughput. Concerning the design complexity, it could be argued that synchronous design violates one of the principles of modern processor design outlined in [HP96]: to make the frequent case fast and the infrequent cases (at least) correct. Assuming that the above latency behaviour occurs for a small percentage of instructions, design time and hardware spent on improving the speed of the first stage will never pay off, but leaving it unimproved wastes runtime performance in the common case.

The processors described in [SSM94] and [WA01] use counterflow pipelines where instructions and values travel in opposite directions. The rotary pipeline processor [MRW96] employs a circular datapath which values traverse clock-wise, visiting the respective functional units. Most other architectures mentioned above take a less radical approach and stick to the conventional order of functional stages as found in a RISC pipeline or employ a super-scalar approach by parallelising either functional units in the execution stage or full pipelines. The resulting out-of-order execution is often complemented by write-back stages containing reordering mechanisms, allowing precise exception handling and hiding execution delays from the programmer. Most of these architectures also employ some form of last result reuse, register bypassing or operand forwarding as outlined in the next section.

A recent article motivating architectural asynchrony in the context of dataflow and multi-threaded architectures can be found in [SRU01].

Asynchrony has also been proposed as a feature of biologically inspired reconfigurable electronic systems in the area of embryonics [JT01].
2.2 Forwarding and compounding

We concentrate our analysis on the high-level communication of operands between instructions. In traditional (synchronous) ISA’s, all operands are communicated through registers or memory [HP96]. Modern pipelined processor implementations however use an additional mechanism called *operand forwarding* or *register bypassing* as a fast mechanism to communicate operands. Avoiding the latency of register communication, the distance between producer and consumer in the pipeline can be reduced and data-dependent instructions can be executed in close succession [HP96] [Fly95]. As a further consequence the demand on the register bank to provide instructions with operands is reduced. As more instruction level parallelism (ILP) is exploited, this allows one to use register files with less read/write-ports, reducing hardware complexity and processor area. Often, either the control unit or the register bank contain special hardware to detect forwarding opportunities, based on the Tomasulo-algorithm [Tom67], score-boards [Tho64] or bypassing logic. In the context of asynchrony, forwarding is a natural means to capture the data-driven aspect of computation, and asynchronous architectures have consequently explored a wide spectrum of forwarding techniques.

2.2.1 Little or no forwarding

Most early asynchronous architectures did not implement any means of data forwarding [DGY93] or relied on register locking (Amulet1, [Pav94]), scoreboards [RB95], local reuse of a previously computed result inside a functional unit [FDG96], [RB95] and reuse of values previously loaded from memory [FDG96]. The recent DCAP-architecture [WA01] concentrates on the implementation of dynamic scheduling, does not include any forwarding and employs register renaming and other techniques in the control unit to resolve data dependencies.

2.2.2 Register bypass unit

A hardware-based solution consists of a bypass unit in front of the register bank which holds recent results and serves all operand requests for values it contains. Such a so-
2.2. Forwarding and compounding

Solution is employed in [MLM+97], but might not scale well as the number of operand requests grows. The design of [MLM+97] also presents a simple mechanism for reordering the instructions before the write-back stage: during (in-order) instruction issue, a tag is entered into a FIFO queue which allows the write-back stage to poll the functional units for results in the same order in which the instructions were issued. A number of low-level design decisions were explored during the development reported in [MLM+97], and a CSP-like formalism was employed.

2.2.3 Forwarding in the control unit

Under this regime, the control or instruction issue unit identifies opportunities for forwarding when decoding an instruction. Based on an analysis of data-dependencies, forwarding requests or bypassing tags are generated dynamically. An improved queue implementation proposed in [GG97] reducing power consumption and easing reading and writing access was implemented in Amulet3 [GFC99]. The Cascade-ALU [OIU+01] and some schemes in [Mul01] also employ control-unit based forwarding which has the advantage of finding more forwarding opportunities than static, compiler-based methods. The downside is runtime overhead in terms of speed and power consumption as additional hardware for detecting dependencies and avoiding deadlocks is employed.

2.2.4 Compiler-based forwarding

Several authors proposed to include forwarding information in the instruction code, either explicitly [End96] or by annotations [Mul01] or special interpretations of register names [RB95]. In all of these cases, the responsibility to schedule operand forwarding lies with the compiler, and consequently the hardware is vulnerable to compiler-errors leading to unsafe, deadlocking or functionally incorrect usage of the forwarding mechanism.

In FRED [RB96] [Ric96], the register name r1 carries a special meaning by allowing access to a queue of values, which delivers a different value each time it is accessed. Execution of branches is split into an evaluation of the branch condition and an operation reading the branch outcome and performing the jump. Although the mechanism
was considered interesting enough to be included in FRED, it was not given a prominent status in the further development and an evaluation of the usage of this mechanism is not provided.

2.2.4.1 Compounding

The compounding approach [AM00] [AM99] [Mul01] identifies forwarding opportunities in the data-dependence graph of programs by grouping adjacent instructions. For example, the program

\[
\begin{align*}
(1) & \quad r5 = \text{mem}[r2+4] \\
(2) & \quad r1 = \text{mem}[r2] \\
(3) & \quad r2 = r1 * 321 \\
(4) & \quad r3 = r2 + r5
\end{align*}
\]

(taken from [AM99]) can be compounded as indicated in Figure 2.2. At runtime, encircled data-dependencies result in forwarding and cross-compound dependencies in register usage. A single bit in the instruction code suffices to indicate compound membership provided that only syntactically neighbouring instructions can inhabit the same compound – the second and fifth compoundings in Figure 2.2 violate this requirement. If instructions may be reordered during compilation, arbitrary adjacency relations may thus result in forwarding, as long as each compound is a linear chain of instructions and the graph of compound instructions and inter-compound dependencies is acyclic. [Mul01] proposed a number of implementations for dynamic scheduling and forwarding, and explored their effect on overall performance for varying latencies of functional units and registers. The motivation for our investigation into forwarding grew from the
desire to characterise legal compoundings more abstractly, allow non-linear shapes, and extend compounding to forwarding across basic block boundaries.

### 2.2.4.2 SCALP

In SCALP [End96], identifiers of input ports of functional units may appear in the opcode, as in

```plaintext
add -> mul_b
```

This instruction would add the two values at the (implicit) input ports \textit{a} and \textit{b} of the functional unit \textit{ALU} and send the result to port \textit{b} of functional unit \textit{MUL}. Incorporating the destination of forwarded values explicitly was motivated by the desire for high power efficiency, realised by high code density and asynchronous parallelism. The resulting paradigm is more flexible than simple compounding as the need to restrict forwarding to neighbouring instructions is removed. Consider for example the program

```plaintext
(1) ld 4 -> mul_a;
(2) ld 2 -> alu_a;
(3) ld 3 -> mul_b;
(4) inc -> mem_a;
(5) mul -> mem_b
(6) st
```

where instruction (4) consumes the result of instruction (2). As input ports are generalised to operand queues, functional units become decoupled, and availability of several input ports at functional units enables compounds where forwarding chains merge. For example, the multiplication instruction (5) in the above program obtains operands from (1) and (3). On the other hand, introduction of a \textit{duplication} instruction where a value is sent to two destinations allows us to implement branching compounds. In the program

```plaintext
(1) ld 2 -> alu_a;
(2) ld 4 -> mul_a;
(3) dupl -> mem_a, mul_b;
(4) mul -> mem_b;
(5) st
```
the duplication instruction for the functional unit ALU duplicates the value 2, sending it to mem_a and mul_b.

Finally, forwarding across basic block boundaries becomes possible as shown in the program

(1) ld 2 -> mul_a;
(2) ld 4 -> alu_a;
(3) ld 7 -> mul_b;
(4) mul -> mul_a;
(5) dec -> alu_a;
(6) dupl -> bu_a alu_a;
(7) ifz (3) (8);
(8) . . .

Together, these generalisations allow more generous forwarding patterns than linear compounds, paid for by fixing the syntactic order of instructions when compared to the instruction dependence graph and by a more costly encoding of forwarding information (names of input ports instead of a single bit).

The specific implementation pursued in SCALP did not fulfil the motivating expectations of low power consumption and high code density. Architecturally, one of the reasons appears to be the decision to model the register bank as another functional unit, which requires additional transfer instructions for each register access. From our point of view, a more important question left unanswered in [End96] is about the programming model resulting from explicit forwarding. As the structure of the implementing hardware becomes visible to the programmer, the processor is sensitive to fine-grained properties of compiled code. The discussion in [End96] lists a number of illegal programs which should not be executed because they would run into hazards such as non-determinism, deadlock or queue over- or underflow. For some hazards, ad-hoc solutions are proposed such as the existence of sequentialisation instructions, while other errors are essentially assumed not to be present in programs. Given that explicit forwarding admits more generous forwarding schemes than compounding, structural incorrectness of code becomes difficult to detect. If runtime hazards cannot be tolerated, techniques are consequently needed for reasoning about the legality of forwarding schemes.
2.2.5 Related synchronous architectures

We briefly mention two architectural approaches from the synchronous processor design community related to forwarding.

2.2.5.1 Register files with queues in VLIW processors

Motivated by the increase of hardware complexity and processor area of traditional register files in VLIW architectures, queue register files have been proposed in [FLT97], with further exploration reported in [FLT98]. The results presented indicate the competitiveness of operand queues as the architecture is scaled, provided the right organisational structure is chosen. As mapping of instructions to functional units in VLIW processors is performed by the compiler and legality of operand queue usage depends on this mapping, the authors present a condition when instructions may share an operand queue. This appears to be similar to the task we are considering in Chapter 4, but the solution provided cannot be transferred to our models of computation as VLIW architectures behave highly synchronously. Indeed, scheduling for VLIW architectures is based on a detailed model of instruction delays, given by the number of cycles required for a particular instruction and functional unit. The success of the analysis presented in [FLT97] thus depends on the ability to characterise the availability of operands and functional units at particular cycle times.

2.2.5.2 Transport-triggered architectures

In transport-triggered architectures (TTA, [CM91]), only the movement of operands is explicitly specified by instructions, while the operations themselves are implicitly triggered by the availability of operands at input ports of functional units. The following example program taken from [MHC97]

\[
\begin{align*}
(CNST-10: ADD^O & \parallel CNST-20: ADD^T) \\
(ADD^R: ADD^O & \parallel CNST-30: ADD^T) \\
(ADD^R: REG^T & \parallel NOP)
\end{align*}
\]

is an optimised program for a clocked architecture where the data bus can transfer two values in each cycle. The first pair of instructions delivers two constant values to the
input ports \( O \) and \( T \) of functional unit \( \text{ADD} \). In the second cycle, their sum is transferred from the result port \( R \) of \( \text{ADD} \) back to the input port \( O \) and a third constant is delivered to port \( T \). The third pair of instructions writes the overall sum to a register.

Realised so far only in synchronous hardware, the programming paradigm offered by TTA is radically different from that of conventional architectures. For deadlock-free scheduling of TTA programs, reasoning mechanisms for structural properties of TTA code would be beneficial, but have apparently not been developed so far.

### 2.3 Asynchronous queue machines

Despite the differences between the implementations of forwarding, correct execution in all architectures mentioned relies on an understanding of the interaction between operand communication and concurrent instruction execution. In order to study these interactions systematically, models of computation are necessary which abstract from architectural details, concentrating on the relevant issues. This thesis proposes *asynchronous queue machines* (AQM’s) as a family of such models, and studies their suitability for the task at hand. AQM’s consist of a set of functional units located in parallel to each other, where instructions exchange operands by inserting them into (and reading them out of) *operand queues*. Different functional units may implement various sets of operations and operate independently of each other. Internally, they may be realised synchronously or asynchronously, and the outside environment may not make any assumptions regarding their latency, apart from finiteness. Operand queues are expected to be FIFO-queues, and architectures subscribing to the AQM regime may impose additional disciplines on the way operand queues may be used. For example, a processor may restrict the capability to read from/write to an operand queue to instructions executing on a particular functional unit, or may complement the operand queues by a set of registers. In fact, the queue machines considered in this thesis implement operand access destructively, i.e. the head value of an operand queue is deleted from the queue during its consumption.

Studying AQM’s amounts to reasoning about processor behaviour and program execution under unknown latencies of functional units, with particular emphasis being paid
2.3. Asynchronous queue machines

on the operand queues and their usage by an application program. The granularity of models of processor behaviour and the way in which a program uses the forwarding mechanism influences the resulting properties. One way to deal with this would be to fix a particular model of operation and a certain forwarding policy (such as an asynchronous implementation of a score-board) and to verify properties of the resulting architecture. Instead, we aim at reasoning about forwarding schemes, i.e. we develop conditions on forwarding policies. Verification under this perspective consequently means to show that any program which respects a particular forwarding policy fulfils the runtime properties under consideration.

We argue that such a general approach is beneficial for understanding the interaction between forwarding and asynchronous operation, for the following reasons.

- Firstly, our results are not specifically tied to any particular algorithm for deciding where forwarding should be used. In fact, we are not concerned with the task to implement or verify any such algorithm but concentrate on the resulting forwarding scheme directly. Indeed, we argue that once one understands the constraints on forwarding, one may compare the efficiency of various hardware- or software-based algorithms systematically, without having to verify each implementation individually. For example, the task to allocate operand queues to forwarded values may be performed by the processor (for example in the control unit) or the compiler (under knowledge of the hardware resources), but this design decision is independent from the legality of the resulting forwarding scheme.

- Secondly, we are able to study several models of processor behaviour and to compare the resulting restrictions on forwarding schemes. This enables system-level engineers to study design alternatives using a consistent formalism for various processor models. Reasoning about the consequences on forwarding (and other aspects) may thus influence design decisions.

- Thirdly, as no assumptions are made regarding the delays of functional units, large parts of our analysis remain valid when functional components are upgraded or functionality is added.
In this thesis, we consider various models of AQM operation and explore their forwarding constraints. We study sequential instruction execution as is captured by an instruction set architecture description as well as distributed instruction execution as found in (idealised) super-scalar implementations. In addition, we consider AQM’s with and without additional registers, and with operand queues of infinite and finite length.

Each of these operational models gives rise to specific conditions on the way forwarding is used, which manifest themselves in characteristic classes of runtime hazards. These are similar to error conditions occurring in SCALP or situations where forwarding has to be cancelled dynamically in the compounding approach. This demonstrates that our approach is useful in highlighting common properties of these architectures.

Before outlining our approach to verifying forwarding schemes in more detail, we discuss some aspects of processor verification.

### 2.4 Processor design and verification

The development of hardware architectures and the subsequent implementation of compilers and processors centers around the instruction set architecture (ISA). This interface between hard- and software defines the set of available instructions, their encodings, types and sizes of operands, and memory addressing schemes. Modern approaches [HP96] advocate to evaluate instruction set decisions by quantitative simulation of benchmark and application programs. During the validation of an architecture, this approach is useful for performance optimisation and testing. However, different concepts are needed to verify functional correctness, and processor verification is consequently a well established area of research. Indeed, techniques such as abstract state machines [CCL+97], (higher-order) logics [Mel88], theorem proving [CRSS94], model checking [BCL+94] [VB+97] and symbolic simulation [SB93] have been successfully applied on various levels of abstraction and represent the most prominent application area of formal methods. For a comparison of some of these techniques, see [Seg93] [Kro97].

Hardware verification has traditionally been separated into verification of the data path
and that of the control logic. Theorem proving and (higher-order) logics have been particularly successful in the former area while control issues have traditionally been tackled using model checking approaches. One of the reasons for this separation is the high regularity of control logic, enabling a fully mechanised exploration of the state space. The size of the state space may be reduced by symbolic techniques or compositional methods. In contrast, theorem proving usually involves considerable manual intervention. A variety of application problems has been successfully tackled, such as algorithms of Intel’s PentiumPro’s floating point instructions [OZGS99] or asynchronous micro-pipelines [BFGW97]. Recently, several proposals for merging the strengths of theorem proving and model checking have been put forward [Uri00] [Amj01] [SS99]. Symbolic simulation [SB93] arises from model checking by factoring the state space using data abstraction and restricting the expressiveness of the logic, such that a property may be proven by a simulation over the abstracted system.

At the architectural level, Burch and Dill [BD94] introduced abstraction functions between states of an implementation and ISA states. These are used in correctness proofs by showing that diagrams between the ISA level specification and the microprocessor implementation commute (see Figure 2.3). In order to verify a single instruction step, the processor is virtually stalled by flushing the pipeline: a sequence of micro-steps is considered (horizontal steps in Figure 2.3) where no issuing takes place but currently executing instructions may complete. Correctness is obtained if the two compositions of flushing and issuing one instruction (vertical steps) commute. By employing uninterpreted function symbols instead of concrete representations, the verification complexity is reduced. Burch and Dill’s approach was extended and modified by several groups who generate abstraction functions automatically, define completion functions for unfinished instructions and improve mechanisation [Cyr93] [SRC97] [SJD98] [SM95] [HSG98] [HSG99]. These improvements enabled the verification of (synchronous) microprocessors with realistic features such as pipelining, out-of-order execution and branch prediction. Recent combinations with model checking aim at improved automatic abstraction functions, compositional reduction and induction over cycle time [BBCZ98], [JM01].

Damm and Pnueli proposed an alternative approach for verifying out-of-order exe-
cutions using synchronous transition systems [DP97]. Interpretative approaches have been proposed [Moo98] and were applied to commercial microprocessors [BKM96].

2.5 Programming language based reasoning

The above techniques are very well suited for verifying that a processor implementation satisfies the specification of a processor. In particular, the approach using abstraction functions (Figure 2.3) ensures correctness of pipelined architectures with respect to the sequential ISA semantics. This method can consequently be used for verifying forwarding algorithms implemented in hardware. In fact, [BBCZ98] and [HGS99] use Tomasulo’s algorithm to demonstrate their method. The correctness resulting from Figure 2.3 is indeed strong: for all application programs, the processor implementation (or its abstraction as formalised in the small-step transition system) shows the same observable behaviour as its ISA specification.

In this thesis however, we do not aim at proving such a strong correctness result for a particular processor implementation, but wish to reason about forwarding schemes more generally. In particular, the above verification method appears not applicable to the compiler-based approaches to forwarding, as correctness of execution depends as much on properties of the program as it does on properties of the processor. In order to give structural properties of a program a more prominent status, we advocate an approach based on programming language technology, with the following ingredients.

First, we make forwarding explicitly visible at the ISA level by including operand queues in the syntax of the assembly language. This turns forwarding from a processor-internal feature into a programmable mechanism to schedule communication. Logi-
cally extending SCALP’s concept of explicit forwarding, we make both the destination \textit{and the source} operand queue(s) of an instruction visible and give operand queue names the same status as register names.

Second, the explicitness of forwarding in the programming language enables the application of reasoning techniques from the programming language design community. Models of processor behaviour may be expressed as dynamic semantics for the assembly language, as demonstrated by models for sequential (ISA-like) and distributed (super-scalar-like) execution. Indeed, as the dynamic semantics will be given in structural operational semantics (SOS) form [Plo81], we relate processor models structurally and prove properties by structural induction. At first sight, the presence of \textit{several} dynamic semantics might appear to deviate from the usage of SOS in programming language design, where a single formal definition is aimed at [MTHM97]. However, the process of evaluating design alternatives with respect to compilation, hardware implementation and runtime performance should not be compared with a final programming language, but with the process of designing one. Indeed, exploration of design alternatives regarding programming language constructs involves a number of dynamic semantics and occasionally several experimental languages.

The benefit of our approach becomes apparent in the third step as we express forwarding schemes as program properties. Instead of verifying processor behaviour, we classify the programs according to which runtime behaviour they exhibit. By complementing the dynamic semantics by static semantics (type systems), a better understanding of the interplay between operational model and forwarding is obtained. As the runtime hazards associated with the respective operational models manifest themselves as properties of the corresponding SOS systems (or incompatibilities between such systems), we design type systems for characterising programs which are safe for execution: a program which passes the type checker is guaranteed not to experience the runtime hazard under consideration. Thanks to the structural formalisation of dynamic and static semantics, the soundness of the typing calculi may be proven formally, using well-established proof methods such as structural induction [Bur69].

The separation between static and dynamic semantics meets our earlier requirement that the \textit{implementation} of a forwarding algorithm and the \textit{legality} of the resulting for-
warding scheme be separated. Indeed, while our calculi can also be employed for type inference, the decision how operand queues are used may also be made elsewhere. If our language with operand queues is chosen as the visible assembly language, we may in fact accept code from arbitrary sources, provided that we type-check it before execution. Alternatively, it is possible to understand the typing calculi as a specification of a hardware implementation of forwarding. For example, a forwarding implementation in the control unit may be considered correct if its output (the program as issued to the respective functional units) is well-typed. Although this aspect is not explored further in this thesis, we expect that it allows one to partition the verification of processors with hardware forwarding into a functional verification of the algorithm implemented in the control unit and an invocation of our results for the operational model realised in the data path.

The fourth aspect of our approach consists of its mediating role linking architecture and compilation. This benefit is demonstrated in the second part of this dissertation, where we show how software-based forwarding may be implemented by program analysis, formalised as a dataflow problem. While in general the output of such a compiler may be treated like any other program by type-checking it before execution, our earlier insight into forwarding is employed for eliminating this check. Indeed, code emitted from a compiler using our forwardability analysis is guaranteed to be well-typed and hence structurally safe for execution.

Finally, the fifth aspect of programming language technology consists in the ability to modify operational models and typing calculi quickly, and to extend them with annotations. This is of particular interest during the system design phase when configurational choices need to be explored. For example, by extending an SOS system with cost models comprising, say, measures for energy consumption in operand queues and registers, performance may be evaluated as required by [HP96]. A system designer may thus explore various operand communication schemes and optimise the corresponding scheduling algorithms using a consistent framework with solid formal foundations.
2.6 Related work

We now discuss related work on applying programming language concepts to hardware design, and on types in compilation and low-level languages.

2.6.1 Programming language technology and processor design

Immediately after the inception of structural operational semantics, Cardelli [Car82] explored its applicability to hardware description, layout, and VLSI design using algebraic techniques. Most of this work is concerned with lower level abstractions, and with the verification of properties regarding layout and timing. Since then, applications of structural operational semantics in the area of hardware have mostly concentrated on description languages such as ELLA [Goo93], VHDL [vT93], [TE01], Verilog [BJQ00] or Esterel [BG92] by following [Plo81]’s general approach of using SOS for defining the meaning of languages. As is the case for higher programming languages [Win93], SOS and typing turn out to be effective for defining hardware description languages precisely, reasoning about their properties and uncovering errors and ambiguities of earlier (and usually informal) descriptions. The resulting semantics may hence serve as specifications for hardware development tools such as simulators and synthesisers. While it is possible to obtain a precise description of a processor architecture by specifying its behaviour in a hardware description language, the programming language based reasoning we advocate acts on the meta-level of such a formalisation. In particular, abstract entities such as register or queue names, assembly instructions or operands are only available implicitly, rather than syntactically. The above work does therefore not solve the task we set.

Term rewriting systems (TRS’s) were used by Shen, Arvind et al. in a series of papers and technical reports for designing, verifying and comparing processor models with various architectural features including super-scalar and out-of-order execution [SA98b] [SA98a] [PHA98] [AS99]. This formalism expresses the above abstract entities explicitly and hence shares the advantages of our method with respect to the first two aspects mentioned in the previous section. On the other hand, the proof method consists of showing a confluence result similar to the one of Figure 2.3. Thus, the cited
work still aims at verifying hardware behaviour rather than classifying programs and reasoning about their execution. Consequently, the later steps (in particular complementing the dynamic semantics by static semantics and providing a link to compile-time analysis) are not provided. On the other hand, the authors report in [HA99] how the TRS descriptions can be transformed for generating hardware. This is a task we have not attempted, but we expect that the process could also be carried out starting from SOS descriptions.

The structured operational description of processor behaviour most related to our approach is represented by [MHC97]. This work gives an SOS description of transport-triggered architectures where the structure of the semantics follows the structure of the architecture and the hierarchy of entities is mirrored in the syntax of instructions. In contrast to VLIW architectures where the width of an instruction is determined by the number (and kinds of) functional units, instructions in transport-triggered architectures are parallel compositions of (conditional) *moves* and consist of as many parallel components as there are transport buses. For example, the program we gave in Section 2.2.5.2 is suitable for an architecture with two transport buses. Individual move operations are composed of a guard (for implementing conditional instructions) and one source and target socket each. Source sockets may be constant values or output ports of functional units while target sockets may be data or trigger input ports of functional units, where a value at a port of the latter sort may be a data item or a trigger for initiating the execution of an operation. Semantically, the effect of a move consists of transferring a value from the socket indicated in its source component to the one given in the target component, provided that the guard evaluates to true. Only a single dynamic semantics is given in [MHC97], with execution modelled as follows. An individual *move* operation consists of two phases. A *source* phase executes the operand access as required by the source component, resulting in the corresponding value. The second step combines this value with the current state, yielding a substitution which indicates the necessary update to the state, as required by the target component. An *instruction* is executed by performing the constituent move operations concurrently and then applying the resulting substitutions to the initial state. A *program* execution step first executes the instruction pointed to by the program counter (which is subse-
2.6. Related work

This document discusses the process of state modification in functional units, focusing on the modifications performed in response to functional operations. The semantics of these modifications are illustrated using small example programs, and some initial properties are proven by giving derivation trees. Regarding well-definedness of the substitutions, the authors note that conflicts may arise if several updates to a particular socket are required in the same cycle. These conflicts are similar to those we treat in Chapter 4, with the difference that conflicts in TTA's can only occur between components of the same instruction as the operation is synchronous and cycle-based. The authors remark that code legality hence relies on the ability of the compiler to schedule code in a way which avoids conflicts and mention the possibility to enforce this property using a static semantics. However, no details are given in [MHC97], and the topic was apparently not pursued any further. Likewise, the ability of a static semantics to enforce correctness in the case of a non-fully connective move network is mentioned, but not elaborated on.

2.6.2 Typing in intermediate and low-level languages

Static type systems for ensuring safety of execution as early as possible have recently found their way from high-level programming languages to intermediate and low-level languages.

2.6.2.1 Types in compilation (TIC)

Inclusion of types during the compilation process (TIC) has been advocated and explored in the FOX and FLINT projects [TMC+96] [PCHS00] [SA95] [Sha01]. While the FOX work, as well as [SA95], concentrates on compilation of functional languages (in particular, ML), the FLINT project [Sha01] aims more generally to provide a common, typed intermediate language for various higher-order, typed languages, including ML, Haskell and Java. In both cases, typed intermediate languages preserve high-level typing information throughout the compilation (including optimisation phases), with additional type information resulting from program analysis. As our program analysis is formalised in terms of dataflow equations, the typed-intermediate-languages concept does not apply directly. However, in the light of similarities between dataflow and
typing (see below) it may be possible to transfer our analysis to a typed framework. In particular, this might allow one to recast our compilation into assembly code with the conversion of dataflow properties to typing statements as a type-preserving translation. We will briefly return to this point in the discussion.

### 2.6.2.2 Typed assembly languages (TAL)

On even lower levels, typed assembly languages have been designed which enforce well-behavedness of code with respect to heap allocation, data layout, runtime stack and type-safe operand behaviour [CM99b] [AC01]. The TAL approach annotates assembly code with information for tracking the types of register contents and heap locations, including initialisation [CM99b]. In particular, by using polymorphic types, one may type code which is applied to data of various shapes. The concept was extended to stack-based languages in [MCGW98], with subsequent application to an Intel-X86-like architecture [MCG+99]. [MWCG98] showed how TAL may serve as the target language of a typed compilation, consisting of a number of (type-preserving) compilation steps. This work has demonstrated the applicability of typing at low levels, and we see the processor-architecture-based motivation as complementing it. In particular, we expect that aspects of polymorphism may be useful to our work when including code blocks such as procedures or function calls.

Xi and Harper’s DTAL extends TAL by a limited form of dependent types, allowing dynamic array bounds checks and tag checks for sum types to be eliminated [XH01]. As all typed-assembly-language work is approached from the compiler’s point of view, it is not surprising that all only single dynamic semantics are considered. The usage of several dynamic semantics motivated by our processor-architecture driven viewpoint thus adds a new facet to typing of assembly languages adds further motivation to the general concept.

### 2.6.2.3 Typing of Java bytecode

Typing in low-level languages received a further stimulus from the introduction of the Java virtual machine (JVM). As the safety of Java virtual machine language (JVML) programs represents a cornerstone of the JVM’s architecture, types have been used
2.6. Related work

for verifying operand stack behaviour, object initialisation, nesting of subroutines and method invocation [SA98c] [FM98] [FM99] [HT98]. Processor-architecture-inspired AQM’s differ from the abstract machine model of the JVM in various respects, such as the differences between a single stack and distributed queues and single and multiple dynamic semantics. However, general similarities exists (as exemplified by our implementation in Chapter 8), and the treatment of composition of typing statements for basic blocks in Chapter 3 was in fact motivated by [SA98c].

2.6.2.4 Ohori’s assembly-level Curry-Howard isomorphism

As a foundation of typed compilation, Ohori [Oho99a] presents an assembly-level isomorphism between typed low-level languages and formal proof systems. The low-level language is that of A-normal forms, an intermediate format closely related to continuation-passing style (CPS) [FSDF93]. Translating typed λ-calculus terms into this language is first shown to amount to a proof transformation taking a proof in natural deduction form to a proof in a restricted form of a Gentzen-style sequent calculus. Then, the evaluation of A-normal form programs using runtime environments is shown to correspond precisely to proof reduction in the sequent calculus. As [Oho99a] argues, this relation between types and formulae models low-level execution of code more precisely than the correspondence between high-level function application and cut elimination in the usual Curry-Howard isomorphism.

Complementing this work, [Oho99b] proposes a general style of logical abstract machines, the sequential sequent calculus (SSC). This proof system represents machine-level programs in a linear form where each instruction is modelled by a corresponding proof rule and rule applications are composed in program order. Dynamic program behaviour consequently amounts to proof reduction as the execution of an instruction eliminates the last inference step. The concept is presented for register-based and stack-based architectures, and code generation is related to proof transformation. The stack-based architecture was subsequently applied to a subset of the Java virtual machine, yielding an elegant de-compilation algorithm from JVML into a functional language [KO01].

The typing of our assembly language follows a more traditional rule format with an
explicit cut rule rather than an internalisation of the cut as rule application. As intermediate work towards this dissertation, we formalised some aspects of our typing calculus in SSC, but the details of this work and an understanding of the conversion between the two styles are left for future research.

2.6.2.5 Proof-carrying code

Building on the technology of intermediate and assembly level typing, proof-carrying code (PCC) has recently emerged as a novel technology for formally certifying programs [Nec97]. Instead of relying on trusted third parties for authentication, a formal proof object asserting intrinsic properties of a program is bundled with the code using type-theoretic concepts [BM92]. The task of program verification is then split into two phases. The code producer (e.g. compiler) constructs the proof together with the program, and bundles them together. Depending on the specification logic, proof construction may either be done fully automatically or involve manual intervention, possibly supported by verification condition generators, theorem provers or other tools. The code consumer (e.g. execution engine) validates the consistency of the transmitted code-proof pair. Thanks to the type-theoretic technology, the second step amounts to type-checking, which is often more feasible than proof inference or full verification.

From the point of view of the code consumer, the means employed for constructing the proof is irrelevant – indeed, the proof does not need to originate from the code producer at all as long as it matches the code’s properties.

Up to now, PCC has been mostly used for ensuring security properties, such as restricting the access to local data or bounding the consumption of computational resources. In the context of Java, additional security issues arise from code mobility, such as the choice of class loader used, and from class file verification. Specifying security policies in a (type-theoretic) specification logic allows one to employ PCC for enforcing safety of execution. Dynamically transmitted code is rejected by the virtual machine’s verifier unless it is equipped with a proof that certifies obedience to the safety policy. Third-party intervention during code transmission leading to code which violates the security model is detected. The intervention will manifest itself either by the proof not matching the program’s properties any longer or by an inconsistency between the proof
and the specified security properties. In contrast, outside intervention which leaves the security aspects untouched (or manipulates code and proof in a consistent and secure way) is not detected. Proof-carrying transmitted code may hence still behave functionally incorrect, but is guaranteed to respect the security policy.

Several extensions of the original PCC approach are currently being explored. These use sophisticated type systems for reducing the size of the trusted code base (verification condition generator, proof checker, . . . ) and logical embeddings of assembly and intermediate languages into higher-order theorem provers [AF00] [MA00] [App01] [SSTP02]. As is the case for earlier work such [Cur92], we expect that these embeddings may be relevant for a formalisation of our approach. In particular, sophisticated type systems should be beneficial for verifying programs with respect to architectures where not all forwarding paths are available and for developing typed compilation schemes starting from functional intermediate languages. These aspects are not treated in this dissertation but represent appropriate topics for future work, and we will elaborate on some ideas in our concluding discussion.

2.6.2.6 Typed register allocation

Two pieces of research consider typed approaches to register allocation. Thiemann [Thi98] presents implementation types which make resource allocation and conversion explicit. Register allocation is formalised in a type-directed way, and the transfer of values between registers and memory is tracked syntactically. While the source language consist of a restricted class of A-normal forms, the target code comprises a generic assembly language with simple register operations and immediate values. In order to model not only the assignment of locations to values, the type system is formalised using effects capturing those registers which are affected by the execution of an instruction, or by a function call. According to [Thi98], implementation types allow flexible calling conventions for function calls (caller-saved, callee-saved, . . . ) and support lightweight closures where function representations store only parts of their environment [SW97]. The last aspect should be relevant in an extension of our work for compiling functional languages, as will be briefly discussed in Chapter 9. As register allocation in our work represents only a minor aspect and our intermediate language
is not as well-structured as A-normal forms (though related to it), we did not formalise
our allocation using type and effect systems.

The same remark holds for the work of Agat [Aga97] who presents a type and effect
system for register usage in assembly languages. His motivation is to guarantee le-
gality of register allocation schemes for functional languages. Assembly programs are
expressed in a λ-calculus like notation, where for example a program

\[
\text{mov } 3, %r8 \ \text{mov } 5, %r6 \ \text{add } %r8, %r6, %r2
\]

is expressed as

\[
\text{let } x = 3_{r8} \text{ in let } y = 5_{r6} \text{ in ADD}_{r8, r6, r2} xy
\]

This program is assigned the type \( \text{Int}_{r2!}\{r2, r6, r8\} \), indicating that the result (of type
\( \text{Int} \)) will be delivered in register \( r2 \), and that all registers used for producing the result
are amongst \( r2, r6 \) and \( r8 \). This mechanism is extended to (higher-order) functions and
closures, and soundness of the corresponding typing system with respect to a dynamic
semantics with (finitely many) registers and an operand stack is proven. Again, we
expect that this calculus might provide guidance for reasoning about function calls and
a more formal treatment of allocation than ours.

It would also be interesting to explore which modifications are needed to the calculi
presented in [Thi98] and [Aga97] for treating (linear) forwarding or operand queue
allocation.

### 2.6.3 Program analysis frameworks

The second part of this thesis illustrates the benefit of explicit forwarding for schedul-
ing communication in the compiler. The program analysis presented uses the concept
of dataflow equations which are a well-known technology in compiler construction
[App98a]. The recent book [NNH99] stresses the similarities between dataflow anal-
ysis and other program analysis techniques such as abstract interpretation, typing and
constraint-based analysis. The techniques are presented using a uniform language,
with operational models given in SOS form. Included in [NNH99] is a soundness
proof of the dataflow equations for liveness analysis which shares similarities with our
approach to usage analysis. We will discuss these similarities but also differences in Chapter 6.

2.7 Notation

We follow the tradition of presenting programming language technology using formal derivation calculi. These are based on formal languages of *judgements* (*sequents*) whose grammatical structure depends on the purpose of the calculus. In our case, operational semantics of assembly languages will be defined using judgements of the form $C \vdash D$ where $C$ and $D$ are configurations and $t$ is a piece of assembly code. Typing calculi will employ judgements of the form $\Gamma \vdash t : \tau$ where $\tau$ denotes a type, $\Gamma$ an (optional) set of typing assumptions (also called a typing context) and $t$ is again a program fragment.

Judgements are related to each other by a set of rules

\[
\begin{array}{c}
\text{Name} \\
H_1 \quad \ldots \quad H_n \\
\hline
C
\end{array}
\]

linking hypothetical judgements $H_1, \ldots, H_n$ to concluding judgement $C$. Since most derivation systems contain large (often infinite) sets of rules, derivation systems are usually presented as a set of rule *schemas*

\[
\begin{array}{c}
\text{Name} \\
H_1 \quad \ldots \quad H_n \\
\hline
C \\
\text{side conditions}
\end{array}
\]

where judgements may contain meta-variables and side-conditions govern instantiation: any substitution of appropriate terms for the meta-variables satisfying the side conditions yields a valid rule, and the denoted derivation system is determined by the set of all such instantiations.

Rules for which the set of hypothetical judgements is empty (i.e. $n = 0$ holds) are called *axioms*, and instantiations are recursively combined to derivation trees with final sequents (roots) $C$:

- any instantiation of an axiom with concluding judgement $C$ is a derivation tree with root $C$, and
an instantiation of a rule with conclusion \( C \) and hypothesis \( H_1, \ldots, H_n \) yields a
derivation for \( C \) if there are trees \( T_i \) with final sequents \( H_i \) for all \( 1 \leq i \leq n \).

The purpose of the derivation rules thus consists of constructively defining subsets of
judgements comprising exactly those sequents \( C \) which are derivable (i.e. for which
a derivation tree with root \( C \) exists). In our case, judgements \( C \xrightarrow{T} D \) will define the
dynamic semantics of our assemble language: derivability of this sequent will encode
the fact that executing program \( t \) in initial configuration \( C \) leads to final configuration
\( D \). Likewise, the typing calculi will implicitly define a subset of programs, the set of
well-typed programs.

The structure of derivations may subsequently be exploited for proving properties of
derivable judgements, using the principle of (rule) induction: for showing that a property \( \phi \) holds for all judgements defined by a derivation system with rule schemata
\( R_1, \ldots, R_n \), it suffices to show the following

- \( \phi \) holds for all judgements which are final sequents of axiom instantiations, and
- \( \phi \) holds for any conclusion \( C \) of a rule with hypothesis \( H_1, \ldots, H_n \) provided that
  \( \phi \) holds for all hypothesis \( H_i \).

In our case, we will inductively link typing calculi and dynamic semantics in order to
eliminate runtime hazards: we will prove that the dynamic execution of a program \( t \)
fulfils certain properties whenever a judgement \( \Gamma \vdash t : \tau \) is derivable for some \( \Gamma \) and \( \tau \).

Often, judgements involve a syntactic category \( C \) such that each grammatical form of
terms in \( C \) occurs in the conclusion of at most one rule schema. This restricts the set
of rules which may have been applied in the last (closest to the root) derivation step,
which may be exploited to simplify a proof using rule induction.

The grammatical structure of terms of a particular syntactic category \( C \) is also exploited
in \textit{structural} induction. Here, hypothesis in a derivation system for judgements with
respect to terms of \( C \) may consist of judgements in a derivation system for judgements
for terms of a different syntactic category \( D \), where \( D \)-terms occur as components of
(some) \( C \)-terms. In this case, proving a property \( \phi \) of \( C \)-terms involves proving some
related property \( \psi \) for \( D \)-terms. For example, our dynamic semantics \( C \xrightarrow{\mu} D \) will be
defined in terms of a relation \( C \xrightarrow{\mu} D \) where \( \mu \) is a syntactic constituent of program \( t \).
A proof of a property regarding judgements $\text{C} \overset{t}{\rightarrow} \text{D}$ consequently rests on a proof of a related property for judgements $\text{C} \overset{\mu}{\rightarrow} \text{D}$.

The following conventions are followed throughout this dissertation.

The set of words over a syntactic category $\text{C}$ is denoted by $\text{C}^*$, and is ranged over by variables $w, v, \ldots$, obtained by juxtaposing elements from $\text{C}$. The empty word is denoted by $\lambda$, and the length of a word $w = c_1 \ldots c_n$ is $|w| = n$, with $|\lambda| = 0$.

Binary relations $\mathcal{R}$ are usually written in infix notation. We denote the reflexive and transitive closure of $\mathcal{R}$ by $\mathcal{R}^*$ and the transitive closure by $\mathcal{R}^+$. Functions between sets $A$ and $B$ are denoted by $f : A \rightarrow B$, with domain $\text{dom } f = A$ and codomain $\text{cod } f = B$. Partial maps with finite domain are denoted by $f : A \rightarrow B$, and $f[a \mapsto b]$ is used for both maps and functions to denote the modified function $f'$ which maps $a$ to $b$ and acts like $f$ on all other elements from $A$.

$\mathcal{P}(S)$ denotes the powerset of a set $S$, and for a directed graph $G = (V, E)$ with vertices $V$ and edges $E \subseteq V \times V$, the sets of predecessors and successors of $v \in V$ are given by $\text{preds}(v) = \{u \in V \mid (u, v) \in E\}$ and $\text{succs}(v) = \{u \in V \mid (v, u) \in E\}$, respectively.
Chapter 3

Sequential queue machines

The first step in understanding forwarding consists of defining an assembly language in which forwarding information is captured explicitly in the syntax and thus exposed to the programmer. The benefit of this formalisation is that different dynamic semantics may be related to each other and to static semantics expressing well-formedness conditions. Reasoning techniques for programming languages can be used to prove these relationships. Furthermore, explicitness of forwarding information represents a well-defined interface to the compiler which may thus optimise the forwarding behaviour. Concepts such as low-level typing systems can be used for proving structural correctness of code emitted by a compiler and for giving hints to the runtime system regarding constraints for dynamic scheduling.

This chapter consequently introduces ALEF, an assembly language for explicit forwarding. In ALEF, operand queue names and register names occur syntactically in the same position, so a compiler can choose which mechanism to use for each operand. A dynamic semantics for ALEF is given which models sequential execution similar to the instruction set architecture of a processor. This model aims at understanding the basic requirements of forwarding schemes. For this analysis, registers can be disregarded and the dynamic semantics only supports operands which are sent through forwarding queues. At first, only straight-line code is considered, i.e. code which does not contain jump instructions. Its execution can encounter runtime errors which inhibit further execution, similar to deadlocks in SCALP programs. As a first example of the benefits of a programming language based view we present a static semantics for eliminating
these errors using a type system based on notation from linear logic [Gir86]. Types abstract from the order of operands in operand queues and from individual values. We then extend dynamic and static semantics to programs with jumps, following an approach introduced by Stata and Abadi [SA98c] when combining the typings of basic blocks. We require that input and output types of neighbouring basic blocks match, so that the number of elements in operand queues is statically bound.

Static and dynamic semantics are related by soundness theorems which guarantee that well-typed programs are free of dynamic errors. In the case of straight-line code, a completeness theorem complements soundness, stating that any error-free program is indeed typable. We also show that the type system admits a notion of principal types, from which any other typing for the same program may be obtained. The corresponding type inference task for basic blocks is solved by unification.

**Synopsis**  We start by defining the syntax of ALEF in Section 3.1. Subsequently, we restrict our attention to programs without registers or branch instructions. The dynamic semantics for the restricted set of programs is defined in Section 3.2. Its characteristic dynamic errors motivate a static semantics which is introduced in Section 3.3. Soundness and completeness with respect to the operational model are discussed, proving that programs which pass the type system will not experience the dynamic hazards under consideration. In Section 3.4, we show that the type system admits a notion of principal types. For each well-typed program \( t \) we can single out a type of which all other types for \( t \) are generalisations. In Section 3.5, we generalise our approach to programs involving branch instructions. Again, we first give a dynamic semantics, before introducing the static semantics and discussing type inference. Finally, Section 3.6 summarises the achievements and discusses some additional aspects.

### 3.1 ALEF: a language for explicit forwarding

ALEF, the assembly language for explicit forwarding, is intended for execution on architectures whose general organisation is shown in Figure 3.1. Functional units of different types occur in parallel, each executing a specific subset of the instructions.
3.1. ALEF: a language for explicit forwarding

Figure 3.1: General organisation of processors suitable for ALEF

Instructions are sent to their functional unit by the control unit and may exchange operands through registers or named operand queues before retiring after execution. The syntax of ALEF is given by the following grammar where instructions \( \text{ins} \) are pairs \([n]\text{code} \) of unique labels \( n \in \mathbb{N} \) and opcodes with operand queue names \( q \) and registers \( r \) as arguments. Some instructions are parametric in the type \( \text{fu} \) of the functional unit they execute on. Operand values \( a \in \text{Val} \) and memory locations both range over integers.

\[
\begin{align*}
a \in \text{Val} & ::= \ldots -1 \mid 0 \mid 1 \ldots \\
n \in \mathbb{N} & ::= 1 \mid 2 \mid \ldots \\
q \in Q & ::= q_n \\
r \in R & ::= r_n \\
op \in \text{OP} & ::= q \mid r \\
\text{fu} \in \text{FU} & ::= \text{ALU} \mid \text{MEM} \mid \text{MUL} \mid \text{BU} \\
\text{code} \in \text{Opcode} & ::= \text{dec} \ op_1 \ op_2 \mid \text{add} \ op_1 \ op_2 \ op_3 \mid \text{mul} \ op_1 \ op_2 \ op_3 \\
& \mid \text{ldc} \ a \ op \mid \text{ld} \ op_1 \ op_2 \mid \text{st} \ op_1 \ op_2 \mid \text{idfu} \ op_1 \ op_2 \\
& \mid \text{dupfu} \ op_1 \ op_2 \ op_3 \mid \text{skipfu} \ op \mid \text{if} \ op \ n_1 \ n_2 \mid \text{jmp} \ n \\
\text{ins} \in \text{Instr} & ::= [n]\text{code} \\
t, s, \ldots \in \text{Iseq} & ::= \varepsilon \mid \text{ins} \mid st
\end{align*}
\]
Chapter 3. Sequential queue machines

The first three opcode forms are arithmetic instructions for decrementing, adding and multiplying values. For example, the instruction \( \text{add } q_1 q_2 q_2 \) removes the head values from operand queues \( q_1 \) and \( q_2 \) and inserts their sum into \( q_2 \). The next three instructions are memory instructions. \( \text{ldc } a \text{ op} \) inserts the value \( a \) into \( \text{op} \) while \( \text{ld } \text{op}_1 \text{ op}_2 \) interprets the value of \( \text{op}_1 \) as memory address and inserts the value found at that location into \( \text{op}_2 \). \( \text{st } \text{op}_1 \text{ op}_2 \) stores the value found in \( \text{op}_2 \) at the location given by the value from \( \text{op}_1 \). The next three instruction forms are parametric in the functional unit. For each parameter \( f_u \), \( \text{ld}^{f_u} \) transfers the value of \( \text{op}_1 \) to \( \text{op}_2 \) while \( \text{dup}^{f_u} \) in addition sends a copy of it to \( \text{op}_3 \) and \( \text{skip}^{f_u} \) simply consumes the value read from \( \text{op} \). The last two instructions are conditional and unconditional jumps and will be dealt with in Section 3.5.

Instruction sequences \( s, t, \ldots \) are lists of instructions with \( \varepsilon \) denoting the empty sequence. It can be proven that composition behaves associatively and \( \varepsilon \) neutrally for all notions in this thesis, and instruction sequences are consequently often treated as flat programs \( \text{ins}_1 \ldots \text{ins}_n \). Occasionally, we omit labels of instructions and confuse opcodes and instructions.

When compared to simple compounds, ALEF imposes fewer restrictions on the forwarding discipline. Firstly, forwarding between instructions which are not syntactical neighbours is possible. Secondly, a program may use forwarding across basic block boundaries. Thirdly, the linearity restriction is lifted as binary instructions may obtain both operands through forwarding and a \( \text{dup}^1 \) instruction may send its result to two operand queues. The program

\[
\begin{align*}
[1] & \text{ldc } 4 \text{ q}_1 \\
[2] & \text{ldc } 2 \text{ q}_2 \\
[3] & \text{ldc } 7 \text{ r}_1 \\
[4] & \text{dupl } q_2 q_2 q_2 \\
[5] & \text{add } q_1 q_2 q_1 \\
[6] & \text{dec } r_1 r_1 \\
[7] & \text{if } r_1 8 4 \\
[8] & \text{add } q_2 q_1 r_1
\end{align*}
\]

contains examples for all three forwarding patterns. The Java excerpt

```java
int i = 4; int j = 2;
for (int k=7; k>0; k--){i = i+j;}
k = i+j;
```
may result in this code, and Figure 3.2 shows a visualisation of the forwarding behaviour, where forwardings across basic block boundaries are shown as dashed arrows and solid arrows represent forwardings inside basic blocks.

When compared to SCALP, ALEF specifies not only the destination of instruction’s results but also the sources of the operands. In SCALP, a typical binary instruction such as add would expect its operands at the two input ports alu_a and alu_b of its functional unit. In contrast, the source queue of forwarded operands is explicitly visible in ALEF and different add instructions may well obtain their operands from different queues.

The remainder of this chapter as well as Chapter 4 aim at deriving basic principles of explicit forwarding. These can be examined most purely in the absence of registers, and the following sections hence completely ignore registers and assume that all operand fields op are of the form q.

### 3.2 Dynamic semantics

The dynamic semantics for sequential execution corresponds to the instruction set architecture (ISA) of a processor. A simplified architectural model is shown in Figure 3.3. The structure of forwarding paths in modern architectures motivate an asymmetric access policy for operand queues [Fly95]. While instructions can send their result to any operand queue, they can obtain their operands only from queues which are associated to their functional unit. No reduction is possible if an instruction attempts to read
from a foreign operand queue. We let $\gamma(q)$ denote the functional unit associated to $q$. In SCALP [End96], instructions in fact do not specify the source queues of operands at all but implicitly consume values from the input ports of their functional unit.

The dynamic semantics executes one instruction at a time, determined by a relation $\mathcal{C} \rightarrow \mathcal{D}$ where an instruction sequence $t$ relates initial configuration $\mathcal{C}$ and final configuration $\mathcal{D}$. In this section, configurations are pairs $(Q, M)$ consisting of total maps $Q : \mathcal{Q} \rightarrow \text{Val}^\mathcal{Q}$ and $M : \text{Val} \rightarrow \text{Val}$. The queue configuration $Q$ assigns a word of values to each operand queue, while $M$ represents the memory state by mapping addresses to values. We write $Q(q)$ to represent $Q$'s entry at $q$ and $Q[q \mapsto w]$ for the queue configuration which agrees with $Q$ everywhere except $q$ and maps $q$ to the word $w$. For the memory component $M$, we let $M(a)$ denote the value at address $a$ in $M$ and write $M[a \mapsto b]$ for the memory $M$ updated by value $b$ at address $a$. Often, we only mention the non-trivial parts of queue configurations by writing $[q_1 \mapsto w_1, \ldots, q_n \mapsto w_n]$ for the $Q$ with $Q(q_i) = w_i$ and $Q(q) = \lambda$ for $q \notin \{q_1, \ldots, q_n\}$. A similar convention is used for memories $M$.

The dynamic semantics is defined structurally, based on (semantic) micro-instructions \textit{read(fu)} and \textit{write} which are ranged over by $\mu$. These model the low-level access to operand queues, and their actions are parametrised by the operand queue they access.
and the value they operate on.

\[
\begin{align*}
\text{WR} & \quad w \xrightarrow{\text{write}, a, q} wa \\
\text{RD} & \quad aw \xrightarrow{\text{read}(f, a, q)} \gamma(q) = fu
\end{align*}
\]

The enqueueing operation (rule WR) appends the value \( a \) to the word \( w \in \text{Val}^* \) and carries no side condition. Dequeueing (rule RD) removes the head value \( a \) of a word \( aw \). The rule is only applicable if the side condition \( \gamma(q) = fu \) holds and the current queue content is of the form \( aw \).

The effect of micro-instructions is promoted to the level of configurations by

\[
\begin{align*}
\text{\( \mu \quad Q(q) \xrightarrow{\mu, a, q} w \quad \text{(Q,M)} \xrightarrow{\mu, a, q} (Q[q \leftarrow w], M) \)}
\end{align*}
\]

and the dynamic semantics of opcodes is given in terms of micro-instructions by the rules CODE in Figure 3.4. Finally, the dynamic semantics of instruction sequences is defined in terms of the relation \( C \xrightarrow{\text{code}} D \) in the rules INSTR and COMP (Figure 3.5).

**Example.** Let \( t \) be the program \([1] \text{d} \text{c} \ 4 \ q_1 \ [2] \text{a} \text{d} \ q_1 \ q_2\), and the configurations \( C, \ldots, G \) be given by

- \( C = ([q_1 \leftarrow 3, q_2 \leftarrow 5], M) \)
- \( D = ([q_1 \leftarrow \lambda, q_2 \leftarrow 57], M) \)
- \( E = ([q_1 \leftarrow 34, q_2 \leftarrow 5], M) \)
- \( F = ([q_1 \leftarrow 4, q_2 \leftarrow 5], M) \)
- \( G = ([q_1 \leftarrow \lambda, q_2 \leftarrow 5], M) \)

where \( M \) is arbitrary. Figure 3.6 shows a derivation for \( C \xrightarrow{L} D \), proving that executing \( t \) in configuration \( C \) leads to configuration \( D \) provided that \( \gamma(q_1) = \text{ALU} \) holds. For typographical reasons, the side conditions are shown in the position of the hypothesis and rule names have been omitted.

The final configuration \( D \) in the above example is uniquely determined once \( C \) and \( t \) are fixed, as the sequential dynamic semantics is deterministic.
Chapter 3. Sequential queue machines

**Figure 3.4: Dynamic semantics of opcodes**

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Box 1</th>
<th>Box 2</th>
<th>Box 3</th>
<th>Box 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>LDC</td>
<td>$C \xrightarrow{\text{write},a,op} D$</td>
<td>$C \xrightarrow{\text{dec},a,op} D$</td>
<td>$C \xrightarrow{\text{read}(\text{ADD}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{write},a-1,op_2} D$</td>
</tr>
<tr>
<td>DEC</td>
<td>$C \xrightarrow{\text{dec},op_1,op_2} D$</td>
<td>$C \xrightarrow{\text{read}(\text{ADD}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{write},a,op_1} D$</td>
<td>$C \xrightarrow{\text{write},a,op_2} D$</td>
</tr>
<tr>
<td>SKIP</td>
<td>$C \xrightarrow{\text{skip},op_1} D$</td>
<td>$C \xrightarrow{\text{read}(\text{fu}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{write},op_2} D$</td>
<td>$C \xrightarrow{\text{write},op_1} D$</td>
</tr>
<tr>
<td>ID</td>
<td>$C \xrightarrow{\text{read}(\text{fu}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{write},op_2} D$</td>
<td>$C \xrightarrow{\text{write},a,op_2} D$</td>
<td>$C \xrightarrow{\text{write},a,op_2} D$</td>
</tr>
<tr>
<td>LD</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),a,op_1} (Q,M) \xrightarrow{\text{write},M(a),op_2} D$</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),a,op_1} (Q,M) \xrightarrow{\text{write},M(a),op_2} D$</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),a,op_1} (Q,M) \xrightarrow{\text{write},M(a),op_2} D$</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),a,op_1} (Q,M) \xrightarrow{\text{write},M(a),op_2} D$</td>
</tr>
<tr>
<td>ST</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),a,op_1} D$</td>
<td>$C \xrightarrow{\text{read}(\text{MEM}),b,op_2} (Q,M) \xrightarrow{\text{write},M(a \mapsto b)} (Q,M)$</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADD</td>
<td>$C \xrightarrow{\text{read}(\text{ALU}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{read}(\text{ALU}),b,op_2} F$</td>
<td>$C \xrightarrow{\text{write},a+b,op_3} D$</td>
<td></td>
</tr>
<tr>
<td>MUL</td>
<td>$C \xrightarrow{\text{read}(\text{MUL}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{read}(\text{MUL}),b,op_2} F$</td>
<td>$C \xrightarrow{\text{write},a\cdot b,op_3} D$</td>
<td></td>
</tr>
<tr>
<td>DUPL</td>
<td>$C \xrightarrow{\text{read}(\text{fu}),a,op_1} E$</td>
<td>$C \xrightarrow{\text{write},a,op_2} F$</td>
<td>$C \xrightarrow{\text{write},a,op_3} D$</td>
<td></td>
</tr>
</tbody>
</table>

**Figure 3.5: Sequential dynamic semantics of instruction sequences**

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Box 1</th>
<th>Box 2</th>
<th>Box 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>INSTR</td>
<td>$C \xrightarrow{\text{code}} D$</td>
<td>$C \xrightarrow{\text{[n]code}} D$</td>
<td>$C \xrightarrow{\text{st}} D$</td>
</tr>
<tr>
<td>COMP</td>
<td>$C \xrightarrow{s \rightarrow E \rightarrow f \rightarrow D}$</td>
<td>$C \xrightarrow{\text{st}} D$</td>
<td></td>
</tr>
</tbody>
</table>


### Figure 3.6: Example derivation for $C \xrightarrow{t} D$.

#### Proposition 1. (Determinacy of sequential execution) If $C \xrightarrow{t} D$ and $C \xrightarrow{t} E$ then $D = E$.

![Derivation Diagram]

**Proof.** We perform a structural induction on $t$. We first show determinacy of the rules $WR$, $RD$ and $\mu$. Next, we show that each rule $CODE$ is deterministic and promote this result to the instruction level by rule INSTR. Finally, we perform the induction on the structure of $t$.

**WR Claim 1:** If $w \xrightarrow{(\text{write}, a, q)} u$ and $w \xrightarrow{(\text{write}, a, q)} v$ then $u = v$.

**Proof** For $w \xrightarrow{(\text{write}, a, q)} u$, the definition of rule WR implies $u = wa$, and for $w \xrightarrow{(\text{write}, a, q)} v$ the definition of WR implies $v = wa$, hence $u = v$ holds.

**RD Claim 2:** If $w \xrightarrow{(\text{read}(fa), a, q)} u$ and $w \xrightarrow{(\text{read}(fa), b, q)} v$ then $a = b$ and $v = u$.

**Proof** For $w \xrightarrow{(\text{read}(fa), a, q)} u$ and $w \xrightarrow{(\text{read}(fa), b, q)} v$, the definition of rule RD implies $w = au$ and $w = bv$, hence $au = w = bv$ follows and thus $a = b$ and $v = u$.

**$\mu$** There are two claims.

**Claim 3:** If $C \xrightarrow{(\text{write}, a, q)} D$ and $C \xrightarrow{(\text{write}, a, q)} E$ then $D = E$.

**Proof** Writing $C = (Q, M)$, the definition of rule $\mu$ implies that there are $w$ and $v$ with
• \( Q(q) \xrightarrow{\text{write},a,q} w \) and \( D = (Q[q \mapsto w], M) \) and

• \( Q(q) \xrightarrow{\text{write},a,q} v \) and \( E = (Q[q \mapsto v], M) \).

By claim 1 (determinacy of WR) \( w = v \) follows, hence \( Q[q \mapsto w] = Q[q \mapsto v] \) and \( D = E \).

Claim 4: If \( C \xrightarrow{\text{read}(fu),a,q} D \) and \( C \xrightarrow{\text{read}(fu),b,q} E \) then \( D = E \) and \( a = b \).

Proof For \( C = (Q, M), D = (P, N) \) and \( E = (R, L) \), the definition of rule \( \mu \) implies that there are \( w \) and \( v \) with

• \( Q(q) \xrightarrow{\text{read}(fu),a,q} w \) and \( (P, N) = (Q[q \mapsto w], M) \) and

• \( Q(q) \xrightarrow{\text{read}(fu),b,q} v \) and \( (R, L) = (Q[q \mapsto v], M) \).

By claim 2 (determinacy of RD) \( a = b \) and \( w = v \) follow, hence \( D = (P, N) = (Q[q \mapsto w], M) = (Q[q \mapsto v], M) = (R, L) = E \) and \( a = b \).

CODE. Claim 5: If \( C \xrightarrow{\text{code}} D \) and \( C \xrightarrow{\text{code}} E \) then \( D = E \).

Proof The claim is proven by a case distinction on \( \text{code} \). We treat the rule ADD for \( \text{code} = \text{add} \; op_1 \; op_2 \; op_3 \) explicitly, the other cases being similar.

Case \( \text{code} = \text{add} \; op_1 \; op_2 \; op_3 \). For \( C \xrightarrow{\text{code}} D \) and \( C \xrightarrow{\text{code}} E \) there are by rule ADD configurations \( F, G, H \) and \( I \) such that

\[
\begin{align*}
C & \xrightarrow{\text{read}(ALU),a_1,op_1} F & & \text{and} & & C & \xrightarrow{\text{add} \; op_1 \; op_2 \; op_3} D \\
F & \xrightarrow{\text{read}(ALU),b_1,op_2} G & & \text{and} & & G & \xrightarrow{\text{write} \; a_1 + b_1,op_3} D \\
C & \xrightarrow{\text{add} \; op_1 \; op_2 \; op_3} E \\
H & \xrightarrow{\text{read}(ALU),b_2,op_2} I & & \text{and} & & I & \xrightarrow{\text{write} \; a_2 + b_2,op_3} E \\
C & \xrightarrow{\text{add} \; op_1 \; op_2 \; op_3} E
\end{align*}
\]

Applying the second claim for rule \( \mu \) (claim 4) twice yields first \( F = H \) and \( a_1 = a_2 \) and then \( G = I \) and \( b_1 = b_2 \). Consequently, \( a_1 + b_1 = a_2 + b_2 \), and applying the first claim for rule \( \mu \) (claim 3) yields \( D = E \).

Cases \( \text{code} \neq \text{add} \; op_1 \; op_2 \; op_3 \). Similar.
3.2. Dynamic semantics

Figure 3.7: Proof of determinacy for \( t = sr \).

**INSTR Claim 6:** If \( C \xrightarrow{ins} D \) and \( C \xrightarrow{ins} E \) then \( D = E \).

**Proof** For \( ins = [n] code \), \( C \xrightarrow{ins} D \) and \( C \xrightarrow{ins} E \), rule INSTR implies \( C \xrightarrow{code} D \) and \( C \xrightarrow{code} E \), so applying the statements for rules CODE (claim 5) yields \( D = E \).

Finally, the structural induction for showing that \( C \xrightarrow{t} D \) and \( C \xrightarrow{t} E \) implies \( D = E \) consists of three cases.

**If** \( t = \epsilon \), **then** the preconditions \( C \xrightarrow{t} D \) and \( C \xrightarrow{t} E \) are false as no rule for deriving \( C \xrightarrow{\epsilon} D \) exists.

**If** \( t = ins \), **then** the last rule used in the derivations for \( C \xrightarrow{t} D \) and \( C \xrightarrow{t} E \) is INSTR, so \( D = E \) follows from claim 6.

**If** \( t = sr \), **then** (see Figure 3.7) the last rule in the derivations for \( C \xrightarrow{t} D \) and \( C \xrightarrow{t} E \) must have been COMP, so there are \( F \) and \( G \) such that

\[
\begin{align*}
C \xrightarrow{s} F \quad & F \xrightarrow{t} D \\
\hline
C \xrightarrow{sr} D
\end{align*}
\quad \text{and} \quad
\begin{align*}
C \xrightarrow{s} G \quad & G \xrightarrow{t} E \\
\hline
C \xrightarrow{sr} E
\end{align*}
\]

By applying the induction hypothesis twice we obtain first \( F = G \) and then \( D = E \).

\( \square \)

Programs fail to execute (i.e. do not have a derivation) whenever a side condition of rule RD is not fulfilled. Failure of the condition \( \gamma(op) = fu \) represents a (static) operand...
queue mismatch where an instruction tries to access an operand queue which is not connected to its FU. For example, program (3.2)

\[
[1] \text{ldc } 4 \text{ q}_2 \ [2] \text{dec } \text{q}_2 \text{ q}_1
\]  

(3.2)
deadlocks after executing the first instruction unless \(\gamma(\text{q}_2) = \text{ALU}\) holds. In the SOS system, all derivations \(C \rightarrow D\) contain an axiom with the side condition \(\gamma(\text{q}_2) = \text{ALU}\).

Failure of the second condition is a dynamic error which occurs if previous instructions or the initial configuration did not provide sufficiently many operands. For example, for \(C = ([\text{q}_2 \leftarrow 3], M)\) the program

\[
[1] \text{ldc } 4 \text{ q}_2 \ [2] \text{add } \text{q}_2 \text{ q}_1 \text{ q}_1
\]  

(3.3)
deadlocks after executing the first instruction as the addition can not be executed. No \(D\) exists such that \(C \rightarrow D\) is derivable. We call this error starvation or a deadlock due to operand queue underflow. Other configurations might well have derivations. For example, for configurations \(E = ([\text{q}_1 \leftarrow 5], M)\) and \(F = ([\text{q}_1 \leftarrow 9], M)\) we can derive

\[
E \rightarrow [1] \text{ldc } 4 \text{ q}_2 \ [2] \text{add } \text{q}_2 \text{ q}_1 \text{ q}_1 \rightarrow F.
\]

Both kinds of hazards are undesirable as they occur at runtime, when other parts of the program have successfully completed. In order to avoid costly monitoring of program execution or deadlock resolution at runtime, one would like to eliminate both hazards prior to program execution. Both example programs are syntactically correct, hence more sophisticated technology is needed to discover that they should not be executed. The following static semantics performs the corresponding program analysis, formalised as a type system. Derivations show the requirements of a program: the shape of admissible initial configurations is indicated in the final typing judgement and all necessary relations \(\gamma(q) = fu\) are visible in axioms.

### 3.3 Static semantics

The static semantics appropriate for the sequential dynamic semantics eliminates illegal operand queue access and deadlock due to operand queue underflow. It is formalised as a type system, where each instruction is assigned a type characterising its
net-effect on the number of items in operand queues. As the type system aims at abstracting from the order of items in queues and the values of items, we use a linear notation based on Girard’s linear logic [Gir86].

Judgements are of the form \( t : A \rightarrow B \) where \( t \) is an instruction sequence and \( A \) and \( B \) are formal products over the set of operand identifiers \( op \).

\[
\begin{align*}
A, B \ldots & \in \text{Product} \quad ::= \quad 1 \mid op \otimes A \\
\tau & \in \text{Type} \quad ::= \quad A \rightarrow B
\end{align*}
\]

Linear operators \( \otimes \) and \( \rightarrow \) are motivated by the fact that the multiplicity of items in an operand queue matters. We treat \( \otimes \) associatively and commutatively with \( 1 \) as neutral element, and for \( i \geq 0 \) we write \( op^i \) for \( \underbrace{op \otimes \ldots \otimes op}_i \), with \( op^0 = 1 \). Types thus abstract from the particular values of operands and from the order of items in each queue. Occasionally, we say that \( op \) divides \( A \), occurs in \( A \) or is a factor of \( A \) if \( A \) can be written as \( A = op \otimes B \) for some \( B \).

For each instruction form \( ins \), the type system contains an axiom

\[
\text{AX} \quad \frac{\text{code}}{X \otimes A_{\text{code}} \rightarrow X \otimes B_{\text{code}} \quad \text{SC(code)}}
\]

where products \( A_{\text{code}} \) and \( B_{\text{code}} \) are given in Table 3.1, product \( X \) is arbitrary and the side condition \( \text{SC(code)} \) is

\[
\gamma(q_1) = \ldots = \gamma(q_n) = FU(\text{code})
\]

for \( A_{\text{code}} = q_1 \otimes \ldots \otimes q_n \).

Typed program fragments are combined using a cut rule.

\[
\text{CUT} \quad \frac{\text{AX}}{s : A \rightarrow B \quad t : B \rightarrow C \quad st : A \rightarrow C}
\]

The type system fulfils the following property.

**Proposition 2.** If \( t : A \rightarrow B \) and \( t : C \rightarrow D \) then \( A = C \) iff \( B = D \).

**Proof.** Induction on the structure of \( t \).
<table>
<thead>
<tr>
<th>code</th>
<th>FU(code)</th>
<th>A_{code}</th>
<th>B_{code}</th>
</tr>
</thead>
<tbody>
<tr>
<td>dec op_1 op_2</td>
<td>ALU</td>
<td>op_1</td>
<td>op_2</td>
</tr>
<tr>
<td>add op_1 op_2 op_3</td>
<td>ALU</td>
<td>op_1 \otimes op_2</td>
<td>op_3</td>
</tr>
<tr>
<td>mul op_1 op_2 op_3</td>
<td>MUL</td>
<td>op_1 \otimes op_2</td>
<td>op_3</td>
</tr>
<tr>
<td>ldc a op</td>
<td>MEM</td>
<td>1</td>
<td>op</td>
</tr>
<tr>
<td>ld op_1 op_2</td>
<td>MEM</td>
<td>op_1</td>
<td>op_2</td>
</tr>
<tr>
<td>st op_1 op_2</td>
<td>MEM</td>
<td>op_1 \otimes op_2</td>
<td>1</td>
</tr>
<tr>
<td>id^fu op_1 op_2</td>
<td>fu</td>
<td>op_1</td>
<td>op_2</td>
</tr>
<tr>
<td>dupl^fu op_1 op_2 op_3</td>
<td>fu</td>
<td>op_1</td>
<td>op_2 \otimes op_3</td>
</tr>
<tr>
<td>skip^fu op</td>
<td>fu</td>
<td>op</td>
<td>1</td>
</tr>
</tbody>
</table>

Table 3.1: Type system

If $t = \varepsilon$, no derivations for $t : A \rightarrow B$ and $t : C \rightarrow D$ exist and the claim is trivially fulfilled.

If $t = \text{ins}$, the typing judgements $t : A \rightarrow B$ and $t : C \rightarrow D$ were obtained by the rule AX, so there are $X$ and $Y$ such that for $\text{ins} = [n] \text{code}$ the equalities $A = A_{code} \otimes X$, $B = B_{code} \otimes X$, $C = A_{code} \otimes Y$, $D = B_{code} \otimes Y$ hold. Consequently, $A = C$ implies $X = Y$ and hence $B = D$, and $B = D$ implies $X = Y$ and hence $A = C$.

If $t = \text{sr}$, the last rule in the derivations for $t : A \rightarrow B$ and $t : C \rightarrow D$ was CUT, so there are $E$ and $F$ such that $s : A \rightarrow E$ and $r : E \rightarrow B$ and $s : C \rightarrow F$ and $r : F \rightarrow D$. Consequently, for $A = C$ the induction hypothesis yields first $E = F$ and then $B = D$, and for $B = D$ the induction hypothesis yields first $E = F$ and then $A = C$.

The type system eliminates both sources of runtime hazards. For example, program (3.2) can only be typed if the condition $\gamma(q_2) = \text{ALU}$ is fulfilled. Likewise, a typing $t : A \rightarrow B$ for program (3.3) requires $A$ to contain at least the factor $q_1$, indicating that any initial configuration should at least contain one value in that queue. Formally, the static semantics and the sequential dynamic semantics are linked using products which represent the shape of configurations.
**Definition 1.** The shape of $C = (Q, M)$ is $\text{shape}(C) = \otimes_{q \in Q} |Q(q)|$.

The following result shows soundness and completeness of the type system.

**Theorem 1.** Let $t$ be an instruction sequence.

1. If $t : A \rightarrow B$ and $\text{shape}(C) = A$ then there is a unique $D$ such that $C \xrightarrow{t} D$, and $\text{shape}(D) = B$ holds.

2. If $C \xrightarrow{t} D$ then $t : \text{shape}(C) \rightarrow \text{shape}(D)$.

![Diagram](image)

**Proof.** Both parts are shown separately, and the proofs follow the syntactic structure.

1. We first prove the corresponding claims for rules WR, RD, $\mu$ and CODE.

**RD Claim 1:** If $w = av$ and $\gamma(q) = fu$ then there are unique $u$ and $b$ with $w \xrightarrow{\text{read}[fu], b.q} u$. Furthermore, $|u| = |w| - 1$ holds.

**Proof** For $w = av$ and $\gamma(q) = fu$ both side conditions of rule RD are fulfilled and the claim holds for the unique $a = b$ and $u = v$.

**WR Claim 2:** For arbitrary $w$, $a$ and $q$ there is a unique $u$ such that $w \xrightarrow{\text{write}, a, q} u$. Furthermore, $|u| = |w| + 1$ holds.

**Proof** Rule WR does not carry any side conditions, so the claim follows for $u = wa$.

**$\mu$** There are two claims.

**Claim 3:** If $\text{shape}(C) = q \otimes X$ and $\gamma(q) = fu$ then there are unique $D$ and $a$ with $C \xrightarrow{\text{read}[fu], a.q} D$. Furthermore, $\text{shape}(D) = X$ holds.

**Proof** For $C = (Q, M)$ and $\text{shape}(C) = q \otimes X$ we have $|Q(q)| \geq 1$, so $Q(q) = aw$ for some unique $a$ and $w$. By claim 1, there are unique $u$ and $b$ with $Q(q) \xrightarrow{\text{read}[fu], b.q} u$, and $|u| = |Q(q)| - 1$ holds. Hence, $C \xrightarrow{\text{read}[fu], a.q} D$ is
derivable for $D = (Q[q \mapsto u], M)$, and uniqueness of $D$ and $a$ follow as in Proposition 1.

Furthermore, $|Q[q \mapsto u](q)| = |u| = |Q(q)| - 1$ holds and $|Q[q \mapsto u](q')| = |Q(q')|$ for $q' \neq q$, hence $\text{shape}(D) \otimes q = \text{shape}(C) = q \otimes X$ and therefore $\text{shape}(D) = X$.

**Claim 4:** If $\text{shape}(C) = A$ and $q$ and $a$ arbitrary, then there is a unique $D$ with $C \xrightarrow{\text{write}_{a,q}} D$. Furthermore, $\text{shape}(D) = A \otimes q$ holds.

**Proof** For $C = (Q, M)$ there is by claim 2 a unique $u$ with $Q(q) \xrightarrow{\text{write}_{a,q}} u$, and $u = |Q(q)| + 1$ holds. Hence, $C \xrightarrow{\text{write}_{a,q}} D$ is derivable for $D = (Q[q \mapsto u], M)$ and uniqueness of $D$ follows as in Proposition 1.

Furthermore, $|Q[q \mapsto u](q)| = |u| = |Q(q)| + 1$ holds and $|Q[q \mapsto u](q')| = |Q(q')|$ for $q' \neq q$, hence $\text{shape}(D) = \text{shape}(C) \otimes q = A \otimes q$.

**CODE Claim 5:** If $[n] \text{code} : A \rightarrow B$ and $\text{shape}(C) = A$ then there is a unique $D$ such that $C \xrightarrow{\text{code}} D$. Furthermore, $\text{shape}(D) = B$ holds.

**Proof** We treat the representative case ADD explicitly.

**Case code = add op$_1$ op$_2$ op$_3$.** For $[n] \text{add op}_1 \text{ op}_2 \text{ op}_3 : A \rightarrow B$, the typing axiom AX and Table 3.1 guarantee $\gamma(op_1) = \gamma(op_2) = \text{ALU}$ and $A = op_1 \otimes op_2 \otimes X$ and $B = op_3 \otimes X$ for some $X$. By the assumption that all operand identifiers $op$ are operand queue identifiers $q$, there are $q_1$, $q_2$, and $q_3$ such that $q_i = op_i$ holds for $i \in \{1, 2, 3\}$. Therefore, $\text{shape}(C) = A = q_1 \otimes q_2 \otimes X$ and $\gamma(q_1) = \gamma(q_2) = \text{ALU}$, so by claim 3 there are unique $E$ and $a$ such that $C \xrightarrow{\text{(read ALU), a, q}_1} E$, and $\text{shape}(E) = q_2 \otimes X$ holds. Applying claim 3 again yields the existence of unique $F$ and $b$ with $E \xrightarrow{\text{(read ALU), b, q}_2} F$, and $\text{shape}(F) = X$ holds. By applying claim 4, there is a unique $D$ with $F \xrightarrow{\text{write}_{a+b,q_3}} D$, and $\text{shape}(D) = X \otimes q_3$ holds. Consequently, the following derivation is possible

$$
\text{ADD} \xrightarrow{C \xrightarrow{\text{(read ALU), a, q}_1}} E \xrightarrow{E \xrightarrow{\text{(read ALU), b, q}_2}} F \xrightarrow{F \xrightarrow{\text{write}_{a+b,q_3}} D} C \xrightarrow{\text{add}_{q_1 q_2 q_3}} D
$$

Using $q_i = op_i$ for $i \in \{1, 2, 3\}$ we thus obtain a derivation for $C \xrightarrow{\text{code}} D$,
and $\text{shape}(D) = X \otimes q_3 = X \otimes op_3 = B$ holds. Uniqueness of $D$ with respect to $C \xrightarrow{\text{code}} D$ follows as in Proposition 1.

**Cases** $\text{code} \neq \text{add} \; op_1 \; op_2 \; op_3$. Similar.

For proving the main claim, we perform a structural induction on $t$.

**If** $t = e$, no judgement $t : A \rightarrow B$ is derivable, so the claim is trivially fulfilled.

**If** $t = \text{ins}$, then claim 5 guarantees that there is a unique $D$ such that $C \xrightarrow{\text{code}} D$ holds, where $\text{ins} = [n]\text{code}$. By rule INSTR we obtain $C \xrightarrow{\text{ins}} D$. Furthermore, claim 5 guarantees $\text{shape}(D) = B$, and uniqueness of $D$ with respect to $C \xrightarrow{\text{ins}} D$ follows from Proposition 1.

**If** $t = \text{sr}$, then the typing rule CUT guarantees that there is a $C$ such that $s : A \rightarrow C$ and $r : C \rightarrow B$, and a simple structural induction shows that $C$ is unique. Applying the induction hypothesis to $s$ and $r$ yields unique $E$ and $D$ for $C \xrightarrow{s} E$ and $E \xrightarrow{r} D$, and $\text{shape}(E) = C$ and $\text{shape}(D) = B$ hold, so $C \xrightarrow{\text{sr}} D$. Uniqueness of $D$ with respect to $C \xrightarrow{\text{sr}} D$ follows from Proposition 1.

2. Again, the claim follows by induction on $t$, based on claims for the rules WR, RD, $\mu$ and CODE.

**RD Claim 1**: If $w \xrightarrow{(\text{read}(fu),a,q)} v$ then $w = av$ and $\gamma(q) = fu$ and $|w| = |v| + 1$ hold.

**Proof** For $w \xrightarrow{(\text{read}(fu),a,q)} v$, $w$ must have the form $w = av$ by rule RD and the side condition $\gamma(q) = fu$ must hold. Hence $|w| = |av| = |v| + 1$.

**WR Claim 2**: If $w \xrightarrow{(\text{write},a,q)} v$ then $v = wa$ and $|v| = |w| + 1$.

**Proof** For $w \xrightarrow{(\text{write},a,q)} v$, $v$ must have the form $v = wa$ by rule WR, so $|v| = |wa| = |w| + 1$ holds.

**$\mu$ There are two claims.**

**Claim 3**: If $C \xrightarrow{(\text{read}(fu),a,q)} D$ then $\text{shape}(C) = \text{shape}(D) \otimes q$ and $\gamma(q) = fu$.

**Proof** For $C \xrightarrow{(\text{read}(fu),a,q)} D$ to hold, rule $\mu$ requires $Q(q) \xrightarrow{(\text{read}(fu),a,q)} w$ for $C = (Q,M)$ and $D = (Q[q \rightarrow w],M)$. By claim 1, $\gamma(q) = fu$ and $Q(q) = aw$
and $|Q(q)| = |w| + 1$ hold. Hence, we obtain

$$\text{shape}(C) = \otimes_{p \neq q} p^{Q(p)} q^{w+1}$$

$$= \otimes_{p \neq q} p^{Q(p)} q^{w} q$$

$$= \otimes_{p \neq q} p^{Q[q \to w](p)} q^{w} q$$

$$= \text{shape}(D) \otimes q.$$

**Claim 4:** If $C \xrightarrow{\text{write}, a, q} D$ then $\text{shape}(D) = \text{shape}(C) \otimes q.$

**Proof** For $C \xrightarrow{\text{write}, a, q} D$ to hold, rule $\mu$ requires $Q(q) \xrightarrow{\text{write}, a, q} w$ for $C = (Q, M)$ and $D = (Q[q \to w], M)$. By claim 3, $w = Q(q)a$ and $|w| = |Q(q)| + 1$ hold. Hence, we obtain

$$\text{shape}(D) = \otimes_{p \neq q} p^{Q[q \to w](p)} q^{w}$$

$$= \otimes_{p \neq q} p^{Q(p)} q^{Q(q)+1}$$

$$= \otimes_{p \neq q} p^{Q(p)} q^{Q(q)} q$$

$$= \text{shape}(C) \otimes q.$$

**CODE Claim 5:** If $C \xrightarrow{\text{code}} D$ then $[a] \text{code} : \text{shape}(C) \to \text{shape}(D)$.

**Proof** The claim is proven by a case distinction on $\text{code}$, and we treat the case ADD explicitly.

**Case code $= \text{add op}_1 \text{ op}_2 \text{ op}_3.$** For $C \xrightarrow{\text{add}, \text{ op}_1 \text{ op}_2 \text{ op}_3} D$, the definition of rule ADD yields the existence of configurations $E$ and $F$ such that

$$C \xrightarrow{\text{read(\text{ALU}), a, q}_1} E, \quad E \xrightarrow{\text{read(\text{ALU}), b, q}_2} F \quad \text{and} \quad F \xrightarrow{\text{write, a+b, q}_3} D$$

hold, where $q_i = \text{op}_i$ holds for $i \in \{1, 2, 3\}$ by the assumption that all operand identifiers are queue names. By applying claim 3 to the statement $E \xrightarrow{\text{read(\text{ALU}), b, q}_2} F$ we obtain $\text{shape}(E) = \text{shape}(F) \otimes q_2$ and $\gamma(q_2) = \text{ALU}$. Likewise, applying claim 3 to $C \xrightarrow{\text{read(\text{ALU}), a, q}_1} E$ yields $\text{shape}(C) = \text{shape}(E) \otimes q_1$ and $\gamma(q_1) = \text{ALU}$. Applying claim 4 to $F \xrightarrow{\text{write, a+b, q}_3} D$ results in $\text{shape}(D) = \text{shape}(F) \otimes q_3$.

Summarising, we have

$$\text{shape}(C) = \text{shape}(E) \otimes q_1 = \text{shape}(F) \otimes q_2 \otimes q_1,$$

$$\text{shape}(D) = \text{shape}(F) \otimes q_3.$$
3.4 Principal types

The type system admits a derived weakening rule

\[
\text{WK} \quad \frac{t : A \rightarrow B}{\frac{t : A \otimes X \rightarrow B \otimes X}
\]

and \( \gamma(q_1) = \gamma(q_2) = \text{ALU} = FU(\text{add } op_1 \ op_2 \ op_3) \). Using the weakening \( X = \text{shape}(F) \), axiom AX yields \([n] \text{add } op_1 \ op_2 \ op_3 : \text{shape}(C) \rightarrow \text{shape}(D)\).

**Cases** \( code \neq \text{add } op_1 \ op_2 \ op_3 \). Similar.

For proving the main claim, we again perform a structural induction on \( t \).

**If** \( t = \varepsilon \), no judgement \( C \vdash \text{D} \) is derivable, so the claim is trivially fulfilled.

**If** \( t = \text{ins} = \lbrack n \rbrack \text{code} \), then the last rule used for deriving \( C \vdash \text{D} \) must have been \( \text{INSTRU} \), so we must have \( C \xrightarrow{\text{code}} \text{D} \), and applying claim 5 yields \( \text{ins} : \text{shape}(C) \rightarrow \text{shape}(D) \).

**If** \( t = \text{sr} \), then the last rule in the derivation for \( C \vdash \text{D} \) must have been \( \text{COMP} \), so there is an \( E \) with \( C \xrightarrow{\text{sr}} E \) and \( E \vdash \text{D} \). Applying the induction hypothesis yields \( s : \text{shape}(C) \rightarrow \text{shape}(E) \) and \( r : \text{shape}(E) \rightarrow \text{shape}(D) \), so the rule \( \text{CUT} \) implies \( s \text{r} : \text{shape}(C) \rightarrow \text{shape}(D) \).

The type system presented in this section is based on the same architectural information as the operational model: the mapping of instructions to functional units and the binding of operand queues to functional units are identical. While this architectural transparency is useful during system design, type-checkers in realistic compilers are not expected to have full architectural information available. Instead, they will derive architectural conditions which a processor has to fulfil in order to be admissible. We will discuss an extension of our framework which might support such reasoning in Chapter 9.
where $X$ is arbitrary. Operationally, typings have different effects on the function a program calculates. For example, if program (3.2) is given type $1 \rightarrow q_1$ and $C$ is a configuration of shape $1$, the first instruction will insert the value 4 into the (initially empty) queue $q_2$, and the second instruction will decrement this value by one and hence deposit 3 in $q_1$. If a typing $A \rightarrow B$ is chosen where $A = q_2 \otimes X$ for some $X$, then a configuration of shape $A$ will contain an initial value in $q_2$. The second instruction will consequently decrement that initial value rather than the value provided by $1 \rightarrow c \ 4 \ q_2$. Hence, a different input-output behaviour is observed.

The legality of a program with respect to sequential execution is not affected by the weakening rule. For straight-line code, the following notion of principal type can be defined.

**Definition 2.** A typing $t : A \rightarrow B$ is principal if for every typing $t : C \rightarrow D$ there is an $X$ such that $C = A \otimes X$ and $D = B \otimes X$.

![Diagram showing principal types](image)

Every well-typed program has a unique principal type which can be obtained by type inference using a modified type system with judgements of the form $t :: A \rightarrow B$. Axioms are of the non-weakened form

$P-AX \quad n \text{code} :: A_{\text{code}} \rightarrow B_{\text{code}} \quad SC(\text{code})$

and composition performs the minimal weakening necessary for a cut.

$P-CUT \quad s :: A \rightarrow B \quad t :: C \rightarrow D \quad st :: A \otimes U \rightarrow V \otimes D \quad B/C = V/U$

Here $(V, U)$ is the cancellation of $(B, C)$, defined as follows.

**Definition 3.** Products $A$ and $B$ are relatively prime, written $\text{prime}(A, B)$, if the factors of $A$ and $B$ are disjoint.
3.4. Principal types

Types V and U are called the cancellation of A and B, written $A/B = V/U$, if V and U are relatively prime and $U \otimes A = V \otimes B$.

Occasionally, we write $A/B = C V/U$ where the greatest common divisor C is given by the (unique) product such that $A = C \otimes V$ and $B = C \otimes U$.

**Proposition 3.** $t : A \rightarrow B$ iff $t : A \rightarrow B$ is the principal typing for $t$.

**Proof.** Structural induction on $t$.

If $t = \varepsilon$, there is no derivation for $t : A \rightarrow B$ or $t : A \rightarrow B$, so the claim is trivially fulfilled.

If $t = [n] code$, we perform a case distinction on code.

Case code = add $op_1 \ op_2 \ op_3$. If $t : A \rightarrow B$ then the side condition $SC(code)$ is fulfilled and $A = op_1 \otimes op_2$ and $B = op_3$. Take $X = 1$ to obtain a valid typing $t : A \rightarrow B$. Suppose $t : C \rightarrow D$ is another typing for $t$, then by the typing rule for add we have $C = op_1 \otimes op_2 \otimes Y$ and $D = op_3 \otimes Y$ for some $Y$, so $C \rightarrow D$ can be obtained from $t : A \rightarrow B$ by weakening.

Conversely, suppose $t : A \rightarrow B$ is the principal typing for $t$. Then $t : A \rightarrow B$ holds, so $\gamma(op_1) = \gamma(op_2) = \text{ALU}$ and $A = op_1 \otimes op_2 \otimes X$ and $B = op_3 \otimes X$ for some $X$. Take $C = op_1 \otimes op_2$ and $D = op_3$, then $t :: C \rightarrow D$ and $t : C \rightarrow D$ hold since $\gamma(op_1) = \gamma(op_2) = \text{ALU}$. Since $t : A \rightarrow B$ is the principal typing for $t$, there must by a $Y$ such that $op_1 \otimes op_2 = C = A \otimes Y = op_1 \otimes op_2 \otimes X \otimes Y$ and $op_3 = D = B \otimes Y = op_3 \otimes X \otimes Y$. Consequently, $X = Y = 1$ and $A = C$ and $B = D$.

Cases code $\neq$ add $op_1 \ op_2 \ op_3$. Similar.

If $t = sr$, then $t : A \rightarrow B$ implies that there are $C,D,E,F$ such that

\[
P\text{-CUT} \quad \frac{s :: C \rightarrow D \quad r :: E \rightarrow F}{t :: C \otimes U \rightarrow V \otimes F \quad D/E = V/U}
\]

where $A = C \otimes U$ and $B = V \otimes F$. Consequently, $U \otimes D = V \otimes E$. By induction hypothesis, we have $s : C \rightarrow D$ and $r : E \rightarrow F$, so $s : C \otimes U \rightarrow D \otimes U$ and $r : E \otimes V \rightarrow F \otimes V$ by weakening. By the cut rule, we obtain $sr : C \otimes U \rightarrow F \otimes V$,
i.e. \( sr : A \rightarrow o B \) as a valid typing. Suppose \( sr : G \rightarrow o H \). Then there is an \( I \) such that \( s : G \rightarrow o I \) and \( r : I \rightarrow o H \). By induction hypothesis, \( s :: C \rightarrow o D \) and \( r :: E \rightarrow o F \) are principal, so there are \( X \) and \( Y \) with \( G = C \otimes X \), \( I = D \otimes X \), \( I = E \otimes Y \) and \( H = F \otimes Y \). Consequently, \( D \otimes X = I = E \otimes Y \). Since \( V \) and \( U \) are prime and fulfil \( U \otimes D = V \otimes E \), it can be shown that there must be a \( Z \) with \( X = U \otimes Z \) and \( Y = V \otimes Z \). Hence, \( G = C \otimes X = C \otimes U \otimes Z = A \otimes Z \) and \( H = F \otimes Y = F \otimes V \otimes Z = B \otimes Z \), so \( t : G \rightarrow o H \) can be obtained from \( t : A \rightarrow o B \) by weakening.

Conversely, let \( sr : A \rightarrow o B \) be the principal typing for \( t \). Then there is a \( C \) such that \( s : A \rightarrow o C \) and \( r : C \rightarrow o B \). Let \( s : D \rightarrow o E \) and \( r : F \rightarrow o G \) be the principal typings for \( s \) and \( r \), respectively. Then by induction hypothesis, \( s :: D \rightarrow o E \) and \( r :: F \rightarrow o G \) hold, and by principality there are \( X \) and \( Y \) such that \( A = D \otimes X \), \( E \otimes X = C = F \otimes Y \) and \( B = G \otimes Y \). Let \( E/F = V/U \). Then \( sr :: D \otimes U \rightarrow o V \otimes G \). Since \( V \) and \( U \) are prime, there is a \( Z \) such that \( U \otimes Z = X \) and \( V \otimes Z = Y \). Consequently, \( A = D \otimes X = D \otimes U \otimes Z \) and \( B = G \otimes Y = G \otimes V \otimes Z \). Since \( sr :: D \otimes U \rightarrow o V \otimes G \) implies \( sr : D \otimes U \rightarrow o V \otimes G \), we have \( Z = 1 \) by the principality of \( sr : A \rightarrow o B \). Thus \( sr :: A \rightarrow o B \).

\[ \square \]

We will see in the next section that weakening plays an important role for programs involving jump instructions.

### 3.5 Program execution

In this section we consider the language including the conditional and unconditional jumps \( if \ op n_1 \ n_2 \) and \( jmp \ n \), but no registers. Both dynamic and static semantics are defined in terms of basic blocks, i.e. sequences of instructions where at most the last instruction is a jump instruction and no instruction except the first one is the target of an incoming control-flow arrow.

Formally, a program \( P = (N, A) \) consists of a partial map \( N : N \rightarrow Iseq \), together with an association \( A \subset dom N \times dom N \) such that
the labelling of instructions is unique

- the first instruction of $n(n)$ has label [$n$]

- at most the last instruction of $n(n)$ is a jump instruction

- $A$ contains exactly those pairs $(n,m)$ where the opcode of last instruction in $n(n)$ is either jmp $m$ or if $op_n_1 n_2$ with $m \in \{n_1,n_2\}$.

### 3.5.1 Dynamic semantics

We modify the dynamic semantics from Section 3.2 by extending configurations by a third component. In a configuration $(Q,M,\bar{n})$ the component $\bar{n} \in N \cup \{nil\}$ represents the optional label of the basic block to be executed next. Operand queue configuration $Q$ and memory component $M$ are as before, and the effect of opcodes is still defined in terms of the relation $(Q,M) \xrightarrow{\{mu.a,q\}} (P,N)$. The dynamic semantics for instructions is given in Figure 3.8, embedding the rules from Figure 3.4 by the rule INJ. Additional rules define the meaning of jump instructions. These insert the branch target into the third component of a configuration, possibly depending on an evaluation of a branch condition. Both jump instructions are mapped to the functional unit $BU$ (branch unit), and there are two rules for the opcode if $op_n_1 n_2$, one for each possible outcome of the branch condition. All jump instructions require the third component of their initial configuration to be empty, in accordance with the definition of the rules for issuing basic blocks which will be given shortly. The rules INSTR and COMP are formally identical to the rules in Figure 3.5, but now relate configurations of the extended form.

A typical derivation for a statement $C \xrightarrow{L} D$ is shown in Figure 3.9, where

\[
C = ([],M,nil) \\
E = ([q_1 \leftarrow 4],M,nil) \\
D = ([],M,5)
\]

Determinacy of sequential execution carries over to the extended dynamic semantics, i.e. $D = E$ holds provided that $C \xrightarrow{L} D$ and $C \xrightarrow{L} E$.

Programs $P$ are executed by issuing the constituent basic blocks according to the outcome of branches. The corresponding relation is $C \xrightarrow{w} D$ where $w \in N^*$ is a word over
IF-T \[ (Q, M) \xrightarrow{ \text{(read(BU),0,op)} } (P, N) \]

\[ (Q, M, \text{nil}) \xrightarrow{ \text{if op n}_1 n_2 } (P, N, n_1) \]

IF-F \[ (Q, M) \xrightarrow{ \text{(read(BU),a,op)} } (P, N) \]

\[ (Q, M, \text{nil}) \xrightarrow{ \text{if op n}_1 n_2 } (P, N, n_2) \] \( a \neq 0 \)

JMP \[ (Q, M, \text{nil}) \xrightarrow{ \text{jmp n} } (Q, M, n) \]

INJ \[ (Q, M) \xrightarrow{ \text{code} } (P, N) \]

\[ (Q, M, \tilde{n}) \xrightarrow{ \text{code} } (P, N, \tilde{n}) \] \( \text{FU(code)} \neq \text{BU} \)

INSTR \[ C \xrightarrow{\text{code}} D \]

\[ C \xrightarrow{\text{[n] code}} D \]

COMP \[ C \xrightarrow{s \cdot E, E \cdot t \cdot D} D \]

\[ C \xrightarrow{s t} D \]

\[ \lambda \xrightarrow{\text{write, 4, q}_1} 4 \]

\[ ([], M) \xrightarrow{\text{write, 4, q}_1} ([q_1 \rightarrow 4], M) \]

\[ ([], M) \xrightarrow{\text{l} \text{dc} 4 \text{q}_1} ([q_1 \rightarrow 4], M) \]

\[ C \xrightarrow{\text{l} \text{dc} 4 \text{q}_1} E \]

\[ C \xrightarrow{[1] \text{l} \text{dc} 4 \text{q}_1 \cdot E} \]

\[ \gamma(q_1) = \text{BU} \]

\[ 4 \xrightarrow{\text{(read(BU),4,q}_1)} \lambda \]

\[ ([q_1 \rightarrow 4], M) \xrightarrow{\text{(read(BU),4,q}_1)} ([], M) \]

\[ E \xrightarrow{\text{if q}_1 \text{3} \text{5}} D \]

\[ E \xrightarrow{[2] \text{if q}_1 \text{3} \text{5}} D \]

\[ C \xrightarrow{[1] \text{l} \text{dc} 4 \text{q}_1 \cdot [2] \text{if q}_1 \text{3} \text{5}} D \]

Figure 3.8: Sequential dynamic semantics including jumps.

Figure 3.9: Example derivation for \( C \xrightarrow{l} D \) including jumps.
The rules for \( C \xrightarrow{w} D \) are given in Figure 3.10. The rule EXECUTE represents the second component of the interplay between the execution of jump instructions and the issuing of basic blocks. As issuing the instructions of the target block deletes the basic block label \( n \) in the third component of the initial configuration, a jump at the end of \( \mathbb{N}(n) \) will again be able to execute.

By an induction on \(|w| > 0\) one can show that program execution is deterministic.

**Proposition 4.** If \( C \xrightarrow{w} D, C \xrightarrow{v} E \) and \(|w| = |v| > 0\) then \( v = w \) and \( D = E \).

*Proof.* For \(|w| = 1\), \( w = n \) holds for some \( n \) and \( C \xrightarrow{n} D \) is only derivable if \( C \) has the form \( (Q, M, n) \) with \( \mathbb{N}(n) = t \) and \( (Q, M, \text{nil}) \xrightarrow{t} D \). Likewise, \( v = m \) holds for some \( m \) and \( C \xrightarrow{m} E \) is only derivable if \( C \) has the form \( (Q, M, m) \) with \( \mathbb{N}(m) = s \) and \( (Q, M, \text{nil}) \xrightarrow{s} E \). Since \( C \) is unique it must be \( n = m \), so \( v = w \) and \( t = \mathbb{N}(n) = \mathbb{N}(m) = s \), and therefore \( D = E \) by (generalised) Proposition 1.

For \(|w| = |v| > 1\), there are non-empty \( w_1, w_2 \) and \( F \) and non-empty \( v_1, v_2 \) and \( G \) such that the last steps in the derivations for \( C \xrightarrow{w} D \) and \( C \xrightarrow{v} E \) are

\[
\text{COMPOSE } \frac{C \xrightarrow{w_1} F \quad F \xrightarrow{w_2} D}{C \xrightarrow{w} D} \quad \text{and} \quad \text{COMPOSE } \frac{C \xrightarrow{v_1} G \quad G \xrightarrow{v_2} E}{C \xrightarrow{v} E}
\]

Without loss of generality, \(|w_1| = |v_1| \) holds – any derivation where the leftmost branch \( C \xrightarrow{u} H \) does not fulfil \(|u| = 1\) can be transformed into a derivation for \( C \xrightarrow{u} H \) where the leftmost branch does fulfil \(|u| = 1\) by reordering the axioms. Hence, we obtain \( w_1 = v_1 \) and \( F = G \) by induction hypothesis as \( w_1 \) and \( v_1 \) are non-empty. Also, \(|w_2| = |w| - |w_1| = |v| - |v_1| = |v_2| \) follows, so by applying the induction hypothesis again we obtain \( D = E \).

By the definition of programs, an execution will never attempt to issue a basic block for which no code exists. However, programs might experience deadlocks due to operand queue underflow at the boundary between basic blocks. The following section presents an extension of the static semantics which eliminates these errors.
### 3.5.2 Static semantics

Basic blocks can be typed using the type system from Section 3.3 if Table 3.1 is extended by the data given in Table 3.2. Instantiating the rule AX correspondingly amounts to the explicit rules

\[
\text{AX-IF}_{[n]} \text{if } op\ n_1\ n_2 : X \otimes op \rightarrow \gamma(op) = BU
\]

and

\[
\text{AX-JMP}_{[n]} \text{jmp } m : X \rightarrow \gamma X
\]

with \(X = 1\) for the calculus of principal types.

Theorem 1 (soundness and completeness for sequential execution) can be generalised to the bodies of basic blocks where \(\text{shape}(C) = \text{shape}(Q,M)\) for \(C = (Q,M,n)\) generalises Definition 1.

For programs \(P = (N,A)\) to be executable, the interfaces of basic blocks must be compatible. We follow an approach taken by Stata-Abadi [SA98c] where the shapes of basic blocks linked by a control flow dependency must match. All operands expected by a basic block must be provided by earlier basic blocks and all operands delivered must be either consumed or passed on to successor basic blocks. For example, if \(N(1),\ N(3)\) and \(N(5)\) are of the form

\[
\begin{align*}
N(1) & = [1] ldc\ 4\ q_1\ [2] \text{jmp}\ 5 \\
N(3) & = [3] ldc\ 7\ q_1\ [4] \text{jmp}\ 5 \\
N(5) & = [5] dec\ q_1\ q_2\ [6] \text{skip}\ q_2
\end{align*}
\]

then \(N(1)\) and \(N(3)\) each provide exactly the right number of operands for \(N(5)\).

Formally, Stata-Abadi’s condition means that \(B_n = A_m\) should hold whenever \(N(n) : A_n \rightarrow \gamma B_n, \ N(m) : A_m \rightarrow \gamma B_m\) and \((n,m) \in \Delta\). We hence introduce a judgement of the

<table>
<thead>
<tr>
<th>code</th>
<th>FU(code)</th>
<th>(A_{\text{code}})</th>
<th>(B_{\text{code}})</th>
</tr>
</thead>
<tbody>
<tr>
<td>if (op\ n_1\ n_2)</td>
<td>BU</td>
<td>(op)</td>
<td>1</td>
</tr>
<tr>
<td>jmp (n)</td>
<td>BU</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Table 3.2: Type system including jumps
form $\Sigma \vdash P : \circ$ where $\Sigma$ is a type environment mapping basic block labels to types. For each basic block $n(n)$, $\Sigma$ has exactly one entry $n : A \rightarrow B$ where $N(n) : A \rightarrow B$ must hold. The rule for $\Sigma \vdash P : \circ$ then requires the above equality to hold for all arcs.

$$\Sigma = \bigcup_{n \in \text{dom } N} \{ n : A_n \rightarrow B_n \}$$

$$(n,m) \in A \text{ implies } B_n = A_m$$

$\text{PROG}$$

$$\Sigma \vdash P : \circ \quad \text{implies} \quad P = (N, A)$$

Consequently, basic blocks can be composed as required by the outcome of branches.

**Proposition 5.** Let $\Sigma \vdash P : \circ$, $C = (Q, M, n)$ and let $n : \text{shape}(C) \rightarrow B$ be the entry for $n$ in $\Sigma$. Then there is a unique $D$ such that $C \xrightarrow{n} D$. Furthermore, $\text{shape}(D) = B$ holds, and if $D = (P, L, m)$ and $m \neq \text{nil}$ then $m : B \rightarrow C$ holds for $m$’s entry in $\Sigma$ and some $C$.

$$C \xrightarrow{[n]} D = (P, L, m) \xrightarrow{[m]} \ldots$$

Proof. The definition of $\Sigma$ implies $n(n) : \text{shape}(C) \rightarrow B$, so from Theorem 1 (extended to ternary configurations) we obtain $(Q, M, \text{nil}) \xrightarrow{n(n)} D$ for a unique $D$, and $\text{shape}(D) = B$ holds. The rule EXECUTE consequently guarantees $C \xrightarrow{n} D$ as claimed. If $D$ has the form $(P, L, m)$ with $m \neq \text{nil}$, then $n(n)$ contains an instruction $\text{jmp } m$ or $\text{if operator } m_1 \text{ } m_2$ with $m \in \{ m_1, m_2 \}$. By the definition of programs $P$ there is at most one such instruction in $N(n)$, and it must be the last instruction, so $(n, m) \in A$ holds, and $m \in \text{dom } N$. Hence, there is an entry $m : A \rightarrow C$ in $\Sigma$, and by the second condition of the rule PROG, $\Sigma \vdash P : \circ$ implies $B = A$.

Hence, programs either do not terminate due to loops or terminate in a well-defined configuration. No runtime errors related to the usage of the forwarding mechanism occur.

The type system is restrictive in that only one entry for each block is allowed in $\Sigma$. This is motivated by the desire to control the shape of final configurations. Likewise, a more generous type system where constraints $B_n = A_m$ are replaced by constraints $B_n = A_m \otimes$
Chapter 3. Sequential queue machines

X for arbitrary X would still guarantee the executability of programs. However, loops would be admitted which increase the size of a configuration in each iteration. Since the number of iterations can in general not be determined statically, the restriction \( X = 1 \) is necessary for controlling the size of configurations.

### 3.5.3 Type inference

Starting from typings \( \mathcal{N}(n) : A_n \rightarrow B_n \) for the bodies of basic blocks, the typing rule for \( \Sigma \vdash P : \circ \) requires that the types of neighbouring basic blocks be unifiable. This means that factors \( X_i \) and \( X_j \) should exist such that for the weakenings \( A_i \otimes X_i \rightarrow B_i \otimes X_i \) and \( A_j \otimes X_j \rightarrow B_j \otimes X_j \), the equation \( B_i \otimes X_i = A_j \otimes X_j \) holds whenever \( (i, j) \in A \). Type inference has to (dis)prove the existence of factors \( X_i \), and we employ the following notion of unifier for this task.

**Definition 4.** Let \( \mathcal{N} \) be a partial map \( \mathcal{N} : \mathcal{N} \rightarrow \text{Iseq}, \text{dom} \mathcal{N} = \{n_1, \ldots, n_k\} \), \( A \subseteq \text{dom} \mathcal{N} \times \text{dom} \mathcal{N} \) and for \( i \leq k \) let \( \mathcal{N}(n_i) : A_i \rightarrow B_i \). A tuple \( \bar{X} = (X_1, \ldots, X_k) \) unifies \( \mathcal{N} \) for \( A \) if \( (n_i, n_j) \in A \) implies \( B_i \otimes X_i = A_j \otimes X_j \).

The following proposition shows the appropriateness of unifiers for type inference.

**Proposition 6.** Let \( P = (\mathcal{N}, A) \) be a program.

1. If \( \bar{X} \) is a unifier for \( \mathcal{N} \) and \( A \) and \( \Sigma = \cup_{n_i \in \text{dom} \mathcal{N}} \forall n_i : A_i \otimes X_i \rightarrow B_i \otimes X_i \) then \( \Sigma \vdash P : \circ \).

2. If \( \Sigma \vdash P : \circ \) and \( \Sigma = \bigcup_{n_i \in \text{dom} \mathcal{N}} n_i : A_i \rightarrow B_i \) then \( \bar{X} \) unifies \( \mathcal{N} \) for \( A \).

**Proof.**

1. For \( \Sigma = \bigcup_{n_i \in \text{dom} \mathcal{N}} n_i : A_i \otimes X_i \rightarrow B_i \otimes X_i \) and \( (n_i, n_j) \in A \) we have \( B_i \otimes X_i = A_j \otimes X_j \) since \( \bar{X} \) unifies \( \mathcal{N} \) for \( A \). Consequently, all conditions in the rule for \( \Sigma \vdash P : \circ \) are fulfilled.

2. By the rule for \( \Sigma \vdash P : \circ \), \( B_i = A_j \) must hold whenever \( (n_i, n_j) \in A \), where \( \Sigma = \bigcup_{n_i \in \text{dom} \mathcal{N}} n_i : A_i \rightarrow B_i \). Consequently for \( X_i = X_j = 1 \) we have \( B_i \otimes X_i = A_j \otimes X_j \) for all \( (n_i, n_j) \in A \). 

Of particular interest is the minimal unifier.
**Definition 5.** For $k$-tuples $\bar{Y}$ and $\bar{Z}$ let $\bar{Y} \otimes \bar{Z} = (Y_1 \otimes Z_1, \ldots, Y_k \otimes Z_k)$.

If $\bar{X}$ unifies $\bar{N}$ for $\bar{A}$, it is minimal if for every $\bar{Y}$ which unifies $\bar{N}$ for $\bar{A}$ there is a $\bar{Z}$ such that $\bar{X} \otimes \bar{Z} = \bar{Y}$.

For given $\bar{N}$ and $\bar{A}$ at most one minimal unifier exists.

**Lemma 1.**

1. If $\bar{X}$ and $\bar{Y}$ unify $\bar{N}$ for $\bar{A}$ and $\bar{X} = \bar{Y} \otimes \bar{Z}$ for some $\bar{Z}$, then $Z_i = Z_j$ holds for all $(n_i, n_j) \in A^*$.  

2. If $\bar{X}$ and $\bar{Y}$ are minimal unifiers for $\bar{N}$ and $\bar{A}$ then $\bar{X} = \bar{Y}$.

**Proof.**

1. Let $\bar{X}$ and $\bar{Y}$ unify $\bar{N}$ for $\bar{A}$ and let $\bar{X} = \bar{Y} \otimes \bar{Z}$. Then for $(n_i, n_j) \in A$ we have

$$A_j \otimes Y_j \otimes Z_j = A_j \otimes X_j = B_i \otimes X_i = B_i \otimes Y_i \otimes Z_i = A_j \otimes Y_j \otimes Z_i$$

and thus $Z_j = Z_i$. Consequently, $Z_i = Z_j$ whenever $(n_i, n_j)$ is in the transitive closure of $A$. Closure under symmetry of $A$ is trivial.

2. Since $\bar{X}$ is minimal and $\bar{Y}$ unifies $\bar{N}$ for $\bar{A}$, there is a $\bar{Z}$ such that $\bar{X} \otimes \bar{Z} = \bar{Y}$. Likewise, minimality of $\bar{Y}$ implies the existence of a $\bar{W}$ such that $\bar{Y} \otimes \bar{W} = \bar{X}$. Together, we have $\bar{X} = \bar{Y} \otimes \bar{W} = \bar{X} \otimes \bar{Z} \otimes \bar{W}$, i.e. $X_i = X_i \otimes Z_i \otimes W_i$ for all $i$. This can only be fulfilled for $W_i = Z_i = 1$, so $X_i = Y_i$ as required.

The following theorem shows how to obtain a minimal unifier for $\bar{A} \cup \{(n, m)\}$ from one for $\bar{A}$ provided that it exists.

**Theorem 2.** Suppose $\bar{X}$ is the minimal unifier for $\bar{N}$ and $\bar{A}$ and for $n, m \in \text{dom} \ N$ let $a = (m, n) \notin A$. Let

$$S = \{l | (l, m) \in A^*\}, T = \{l | (l, n) \in A^*\}$$

and $(B_m \otimes X_m)/(A_n \otimes X_n) = V/U$. Then

1. $\bar{X}$ is the minimal unifier for $\bar{A} \cup \{a\}$ if $a \in A^*$ and $V \otimes U = 1$.

2. $\bar{A} \cup \{a\}$ has no unifier if $a \in A^*$ and $V \otimes U \neq 1$. 

Chapter 3. Sequential queue machines

3. if \( a \notin A^S \), then \( S \cap T = \emptyset \) and \( \bar{Y} \) is the minimal unifier for \( A \cup \{a\} \), where

\[
Y_l = \begin{cases} 
X_l \otimes U & \text{if } l \in S \\
X_l \otimes V & \text{if } l \in T \\
X_l & \text{otherwise}
\end{cases}
\]

Proof. The proof consists of four parts.

1. Show that \( \bar{X} \) is the minimal unifier for \( N \) and \( A \cup \{a\} \) if \( a \in A^S \) and \( V \otimes U = 1 \).

If \((i, j) \in A\) then \( B_i \otimes X_i = A_j \otimes A_j \) since \( \bar{X} \) unifies \( N \) for \( A \).

If \((i, j) = (m, n)\) then

\[
B_m \otimes X_m = B_m \otimes X_m \otimes U = A_n \otimes X_n \otimes V = A_n \otimes X_n.
\]

Therefore, \( \bar{X} \) unifies \( N \) for \( A \cup \{a\} \).

Suppose that \( \bar{Y} \) is minimal unifier for \( A \cup \{a\} \). Then \( \bar{Y} \) unifies \( A \), so \( \bar{X} \otimes \bar{Z} = \bar{Y} \) for some \( \bar{Z} \) by the minimality of \( \bar{X} \) for \( A \). But \( \bar{Y} \) is minimal for \( A \cup \{a\} \) and \( \bar{X} \) unifies \( A \cup \{a\} \), so \( \bar{Y} \otimes \bar{W} = \bar{X} \) for some \( \bar{W} \). Consequently, \( \bar{Z} = \bar{W} = (1, \ldots, 1) \).

Therefore, \( \bar{X} \) is minimal for \( N \) and \( A \cup \{a\} \).

2. Show that \( A \cup \{a\} \) has no unifier if \( a \in A^S \) and \( V \neq 1 \) or \( U \neq 1 \).

Suppose \( V \otimes U \neq 1 \) and \( a \in A^S \) and \( \bar{Y} \) unifies \( N \) for \( A \cup \{a\} \). Then \( \bar{Y} \) also unifies \( N \) for \( A \), so \( \bar{X} \otimes \bar{Z} = \bar{Y} \) for some \( \bar{Z} \). By Lemma 1(1) we have \( Z_m = Z_n \) since \( a = (m, n) \in A^S \). Consequently,

\[
Z_m \otimes B_m \otimes X_m = B_m \otimes Y_m = A_n \otimes Y_n = A_n \otimes X_n \otimes Z_m = A_n \otimes X_n \otimes Z_m.
\]

and hence \( B_m \otimes X_m = A_n \otimes X_n \), in contradiction to \( V \otimes U \neq 1 \).

3. Show that for \( a \notin A^S \) the vector \( \bar{Y} \) as given in the theorem unifies \( N \) for \( A \cup \{a\} \).

If \((i, j) \in A\), then \((i, j) \in A^S\), so one of the following three cases applies.

- \{i, j\} \subseteq S. We have \( Y_i = X_i \otimes U \) and \( Y_j = X_j \otimes U \), so

\[
B_i \otimes Y_i = B_i \otimes X_i \otimes U = A_j \otimes X_j \otimes U = A_j \otimes Y_j
\]

because \( \bar{X} \) unifies \( A \).
• \{i, j\} \subset T. Similar to the first case, we obtain $B_i \otimes Y_i = A_j \otimes Y_j$ due to $Y_i = X_i \otimes V$ and $Y_j = X_j \otimes V$.

• \{i, j\} \cap (S \cup T) = \emptyset. We have $B_i \otimes Y_i = B_i \otimes X_i = A_j \otimes X_j = A_j \otimes Y_j$.

If \((i, j) = (m, n)\), then $m \in S$ and $n \in T$ and thus $Y_m = X_m \otimes U$ and $Y_n = X_n \otimes V$. Consequently,

$$B_m \otimes Y_m = B_m \otimes X_m \otimes U = A_n \otimes X_n \otimes V = A_n \otimes Y_n.$$  

4. Show that for $a \not\in A^* \bar{Y}$ is minimal for $A \cup \{a\}$.

Assume that $\bar{K} \neq \bar{Y}$ is minimal unifier for $A \cup \{a\}$. Then there is a $\bar{Z}$ such that $\bar{Y} = \bar{K} \otimes \bar{Z}$, and $\bar{K} \neq \bar{Y}$ implies that $Z_\alpha \neq 1$ holds for some $\alpha \leq k$, where $k$ is the length of $\bar{Z}$.

One of the following three cases applies.

(a) $\alpha \in S$. We have $Z_l = Z_\alpha$ for all $l \in S$ by Lemma 1(1) because $\bar{Y}$ and $\bar{K}$ both unify $A$. Therefore,

$$\forall l \in S. K_l \otimes Z_\alpha = Y_l = X_l \otimes U \quad (3.4)$$

In particular, $m \in S$, and thus

$$A_n \otimes K_n \otimes Z_\alpha = B_m \otimes K_m \otimes Z_\alpha = B_m \otimes X_m \otimes U = A_n \otimes X_n \otimes V$$

and thus

$$K_n \otimes Z_\alpha = X_n \otimes V. \quad (3.5)$$

For $l \in T$ we have $Y_l = X_l \otimes V$, so $X_l \otimes V = K_l \otimes Z_l$. In particular, $n \in T$ and thus

$$X_n \otimes V = K_n \otimes Z_n. \quad (3.6)$$

Combining (3.6) and (3.5), we obtain $Z_n = Z_\alpha$. Applying Lemma 1(1) again, we obtain $Z_l = Z_\alpha$ for all $l \in T$. 


Let \( U / Z_{\alpha} = M A / B \) and \( V / Z_{\alpha} = N C / D \). Using equation (3.4) we obtain
\[
X_l \otimes M \otimes A = X_l \otimes U = K_l \otimes Z_{\alpha} = K_l \otimes M \otimes B
\]
and hence \( X_l \otimes A = K_l \otimes B \) for \( l \in S \). Since \( A \) and \( B \) are prime this implies that there are \( W_l \) with \( K_l = A \otimes W_l \) and hence \( X_l = W_l \otimes B \) for all \( l \in S \).

Similarly for \( l \in T \), we have
\[
X_l \otimes N \otimes C = X_l \otimes V = Y_l = K_l \otimes Z_l = K_l \otimes Z_{\alpha} = K_l \otimes N \otimes D
\]
and thus \( X_l \otimes C = K_l \otimes D \). Therefore, for \( l \in T \) there are \( W_l \) such that \( K_l = C \otimes W_l \) and \( X_l = W_l \otimes D \).

For \( l \not\in (S \cup T) \) take \( W_l = X_l \).

For showing that \( \bar{W} \) unifies \( A \), let \((i, j) \in A \). There are three cases.

i. \( \{i, j\} \subset S \). Then \( B_i \otimes W_i \otimes B = B_i \otimes X_i = A_j \otimes X_j = A_j \otimes W_j \otimes B \).

ii. \( \{i, j\} \subset T \). Then \( B_i \otimes W_i \otimes D = B_i \otimes X_i = A_j \otimes X_j = A_j \otimes W_j \otimes D \).

iii. \( \{i, j\} \cap (S \cup T) = \emptyset \). Then \( B_i \otimes W_l = B_i \otimes X_i = A_j \otimes X_j = A_j \otimes W_j \).

In all three cases we obtain \( B_i \otimes W_i = A_j \otimes W_j \).

Also, \( \bar{W} \) occurs in \( \bar{X} \) via \( \bar{X} = \bar{W} \otimes \bar{L} \) where
\[
L_l = \begin{cases} 
B & \text{if } l \in S \\
D & \text{if } l \in T \\
1 & \text{otherwise}
\end{cases}
\]

In order to obtain a contradiction to the minimality of \( \bar{X} \) for \( A \), suppose \( B = D = 1 \). Then \( N = Z_{\alpha} = M \) holds, and \( U = M \otimes A \) and \( V = N \otimes C = M \otimes C \) imply \( M = 1 \) because \( U \) and \( V \) are prime. Consequently, \( Z_{\alpha} = M = 1 \), in contradiction to the initial assumption \( Z_{\alpha} \neq 1 \). Therefore \( B \neq 1 \) or \( D \neq 1 \) and so \( L_l \neq 1 \) for \( l = m \) or \( l = n \) which means that \( \bar{X} \) is not minimal.

(b) \( \alpha \in T \). Similar to case (4a)

(c) \( \alpha \not\in S \cup T \). Let \( R = \{ l \mid (l, \alpha) \in A^* \} \) and \( Q = \{ l \mid l \not\in S \cup T \cup R \} \). Then \( S, T, R \) and \( Q \) are mutually disjoint and by Lemma 1(1) \( Z_l = Z_{\alpha} \) holds for all \( l \in R \).

Take \( W_l = K_l \) for \( l \in R \) and \( W_l = X_l \) for \( l \not\in R \).
3.5. Program execution

Figure 3.11: Example program for type inference via unification

Then \( \widehat{W} \) unifies \( A \): For \( \{i, j\} \subset R \) we have

\[
Z_\alpha \otimes B_i \otimes W_i = B_i \otimes K_i \otimes Z_i = B_i \otimes Y_i = A_j \otimes Y_j = A_j \otimes K_j \otimes Z_j = Z_\alpha \otimes A_j \otimes W_j
\]

and thus \( B_i \otimes W_i = A_j \otimes W_j \). For \( \{i, j\} \cap R = \emptyset \), \( B_i \otimes W_i = A_j \otimes W_j \) holds trivially.

Again, we obtain a contradiction to the minimality of \( \bar{X} \) for \( A \) because \( \bar{X} = \bar{W} \otimes \bar{L} \) where \( L_l = Z_\alpha \neq 1 \) for \( l \in R \) and \( L_l = 1 \) for \( l \notin R \) and \( R \neq \emptyset \).

Consequently, type inference for \( P = (N, A) \) can proceed by typing the bodies of all basic blocks separately and then visiting the arcs \( A \) in any order. An algorithm based on Theorem 2 either successively calculates minimal unifiers for arcs \( \emptyset \subset A_1 \subset \ldots \subset A_n = A \) or rejects the program as being ill-formed.

**Example.** Consider the program \( P \) with basic blocks \( N1, \ldots, N4 \) and arrows and initial typings as shown in Figure 3.11.

Visiting the arrow \( (N1, N2) \) first, we find that \( X_1 = q_2 \) and \( X_2 = q_3 \) are needed for unifying \( B_1 \) and \( A_2 \). Visiting \( (N2, N3) \) next, the extended product \( B_2 \otimes X_2 = q_1 \otimes q_3 \) is equal to \( A_3 \otimes 1 \), so no further weakening is needed for unifying \( N_2 \) and \( N_3 \) and we define \( X_3 = 1 \). Closing the right loop is possible when visiting \( (N3, N1) \) next as \( B_3 \otimes X_3 = q_2 = A_1 \otimes X_1 \) holds.
Visiting \((N2, N4)\) next, we see that \(N2\) has to be further weakened by \(q_4\) in order to unify with \(A_4\). \(N4\) in turn has to be weakened by \(q_1\), so we define \(Y_4 = q_1\) and \(Y_2 = X_2 \otimes q_4\).

In order to keep the earlier equations valid, we have to promote the new weakening to all blocks already connected to \(N2\), i.e. define \(Y_1 = X_1 \otimes q_4\) and \(Y_3 = X_3 \otimes q_4\). Notice that by construction, the right loop stays unified.

Finally, when we visit the last arrow \((N4, N1)\) we find that the blocks are not unifiable. The blocks \(N4\) and \(N1\) are already transitively related and the identity \(B_4 \otimes Y_4 = A_1 \otimes Y_1\) does not hold. Any weakening we apply to \(A_1 \otimes Y_1\) is promoted via \(N2\) back to \(N4\) and does hence not lead to unification. Consequently, the program is rejected as non-unifiable. This is correct as each iteration through the left loop produces an item in \(q_1\) and consumes an item in \(q_2\). Since the number of iterations can in general not be determined statically, we should not admit the program for execution.

\[\Diamond\]

### 3.6 Discussion

This chapter introduced an assembly language where operand queue names appear in the position of register identifiers. The language generalised the simple compounding approach as instructions can receive several arguments through forwarding, send their result to several operand queues and can forward across basic block boundaries. The language was given a dynamic semantics corresponding to an instruction set architecture of a processor, where only operand queues can be used. A static semantics complemented the dynamic semantics, ruling out characteristic error situations similar to those present in SCALP. Loops were required to be of neutral net-effect, i.e. each iteration could consume only as many items as it produced, and vice versa.

For the patterns of forwarding expressible in ALEF the notation based on linear logics could have been replaced by other formalisms, such as formal monomials over the set of queue names. The linear formalism was motivated by the fact that linear logics has already been successfully applied to type systems for programming languages and concurrency calculi [Laf88], [TWM95], [BS94], [EW90]. Furthermore, more generous forwarding schemes than the one realised by ALEF might need additional operators for combining types, for which some of the hitherto unused connectives of linear
logics may prove useful.
We briefly discuss two aspects of our design decision to capture operand queue names explicitly in the syntax of assembly programs.

Firstly, the computational model of ALEF results in a less flexible forwarding mechanism than hardware-based forwarding. For example, a translation of program

\[
\begin{align*}
\text{for (int } i=1; i<n; i++) \{ j = j*k; \\
\text{    } j = j-1;
\end{align*}
\]

into ALEF needs to fix the destination queue for the result of the multiplication statically. The two consuming instructions, however, execute on different functional units, and the exclusive binding of operand queues to functional units forbids a common destination. Consequently, a translation is only possible at the expense of an additional instruction, such as in

\[
\begin{align*}
[3]... \\
[4]\text{mul } q_1 q_2 q_1 \\
[5]\text{if }...3 6 \\
[6]\text{idmul } q_1 q_3 \\
[7]\text{dec } q_3 q_4
\end{align*}
\]

where we assume that \( q_1 \) and \( q_2 \) are bound to \( \text{MUL} \) and \( q_3 \) to \( \text{ALU} \). In contrast, a hardware based forwarding mechanism might generate different forwarding requests in each iteration, matching the effect of the additional instruction.

It is expected that more generous forwarding schemes can be permitted for an extended instruction set and more sophisticated type systems. For example, the destination of the result of the multiplication coincides with the outcome of the branch instruction. The destination should be \( q_1 \) whenever the branch condition is fulfilled, and \( q_3 \) otherwise. This relationship can be discovered statically, and a more flexible language would contain a corresponding conditional forwarding mechanism. It should thus be possible to match most of the flexibility provided by hardware-based forwarding in a static semantics. Weighting the costs and limitations of static typing against dynamic error recovery then amounts to a design decision where the complexity of the language and the compiler is traded off against the hardware complexity and the costs of error recovery at runtime.
Secondly, exposing the architectural structure at the assembly level means that the legality of programs is affected by architectural modifications. In particular, alterations to the binding of queues to functional units are not shielded from the programmer by the ISA boundary and directly influence the behaviour of programs. Traditionally, system designers would consider this visibility a disadvantage of our computational model as programs need to be type-checked whenever the architecture is modified. However, for reasoning about architectural modifications and their effect on program compilation, explicitness of architectural information is beneficial as formulae and propositions may directly refer to architectural entities. Furthermore, the usage of specific operand queues and the scheduling of operands to communication paths represents an important aspect of program optimisation, which may only be performed statically if the compiler has sufficient knowledge of the executing hardware. The ALEF model supports this optimisation phase by enabling the compiler to refer to architectural entities directly. Finally, future developments of our techniques will show that not all architectural information needs to be revealed, as long as program and processor fulfil some key requirements. Our computational model thus helps clarifying which parts of an architecture should be explicitly exposed to the programmer, with benefits for future instruction set designers and compiler writers.

### 3.6.1 Summary and outlook

The introduction of ALEF delivered the first technical contribution promised in Section 1.3. We also discussed how ALEF’s model of computation arises from, and relates to, core aspects of SCALP. Dynamic semantics (second item in Section 1.3) and static semantics (third item) were given for AQM’s with sequential execution, and their compatibility was demonstrated by proofs of soundness and completeness.

In the next chapter, we will consider AQM’s with concurrent execution and execution under operand queues of finite length. This will motivate modifications to the calculi for both dynamic and static semantics, but their general shape will remain unchanged.
Chapter 4

Asynchronous queue machines

Program execution in modern processors does not follow the strict sequentiality suggested by the ISA definition. Instead, instructions are partitioned into micro-operations corresponding to stages of a processing pipeline. Super-scalar architectures contain several instruction pipelines and process instructions concurrently. Branch prediction and speculative execution lead to additional complexity.

This chapter presents an alternative dynamic semantics for ALEF which takes the concurrency of multiple pipelines into account and is thus more closely related to the execution in modern processors. Studying the interaction between this distributed dynamic semantics and forwarding, we discover that interleaved execution may experience additional error conditions. This makes program execution unsafe as a distributed execution does not honour the ISA specification. We characterise the source of these hazards and study various alternatives for eliminating them. This is supported by our programming language based view as it enables us to characterise programs which are unsafe.

Having obtained safety of distributed execution, we modify the architecture further by limiting the length of operand queues. Additional hazards arise both with respect to the sequential as well as the distributed dynamic semantics as programs might deadlock due to operand queue overflows. We show how a static semantics can derive the minimal queue capacities for executing a program safely and consider the effect of extending the queue lengths. It is shown that absence of queue overflows is only preserved for those programs which were safe in the unrestricted distributed model.
Synopsis The chapter is structured similarly to the previous chapter. Section 4.1 introduces the dynamic semantics for distributed execution of straight-line code. The operational model interleaves reductions of the sequential model on the instruction level. Observing that the static semantics from Chapter 3 guarantees executability but not determinacy, we discuss the relationship between determinism, safety and typability in Section 4.2. This analysis leads to a static semantics which separates the detection of race conditions from the task of eliminating them. An extended type system is given in Section 4.3.1 where judgements contain constraints on the relative execution order of instructions. These constraints are shown to imply determinacy and hence safety of execution: distributed and sequential execution yield the same result. Sections 4.3.2 to 4.3.4 discuss how the constraints may be discharged by exploiting instruction orderings arising from the dynamic semantics or the program structure. Hence, safety of distributed execution may be obtained by (design time) analysis of the dynamic semantics, (compile time) analysis of the program and runtime mechanisms. Following this, we reintroduce branching instructions in Section 4.4 and adapt the analysis from Section 3.5 correspondingly. Models with operand queues of finite length are treated in Section 4.5. In Section 4.6, we discuss how distributed AQM’s relate to Tomasulo’s algorithm. This demonstrates the expressiveness of our computational model and provides a novel exposition of Tomasulo’s algorithm itself. Finally, we discuss technical improvements and comment on some design decisions in Section 4.7.

As in Chapter 3, all operands in this chapter are communicated using the forwarding mechanism and registers are ignored in the discussion.

4.1 Syntax and dynamic semantics

The dynamic semantics for distributed execution models the instruction-level interleaving in super-scalar asynchronous processors (see Figure 4.1). Instructions are loaded from memory and inserted into waiting stations in front of functional units. These are implemented as instruction queues, and progress in different pipelines is decoupled: once an instruction has entered its waiting station, no central control is exercised and execution of head instructions is purely triggered by the availability of operands.
4.1. Syntax and dynamic semantics

Syntactically, we model execution by parallel programs with as many components as there are instruction pipelines. For simplicity, the pipelines and the functional units are in one-to-one correspondence, and each pipeline can hence be represented by an instruction sequence of instructions mapped to the same functional unit. For straight-line code, the syntax of distributed programs is given by

$$\pi \in P ::= t_1 \parallel t_2 \parallel t_3$$

where $\parallel$ is non-commutative. For $\pi_1 = t_1 \parallel t_2 \parallel t_3$ and $\pi_2 = s_1 \parallel s_2 \parallel s_3$ we define $\pi_1 \pi_2 = t_1 s_1 \parallel t_2 s_2 \parallel t_3 s_3$.

We denote the parallel program resulting from inserting all instructions of a sequential program $t$ into the correct pipelines by $\pi_t$, given inductively by

$$\pi_\varepsilon = \varepsilon \parallel \varepsilon \parallel \varepsilon$$

$$\pi_{\text{ins}} = \begin{cases} 
\text{ins} \parallel \varepsilon \parallel \varepsilon & \text{if } FU(\text{ins}) = \text{ALU} \\
\varepsilon \parallel \text{ins} \parallel \varepsilon & \text{if } FU(\text{ins}) = \text{MUL} \\
\varepsilon \parallel \varepsilon \parallel \text{ins} & \text{if } FU(\text{ins}) = \text{MEM} 
\end{cases}$$

$$\pi_{st} = \pi_s \pi_t$$

The dynamic semantics is obtained by embedding sequential operation into the distributed model. We define $\mathcal{C} \vdash_{fit} D$ to hold if $\mathcal{C} \vdash D$ is derivable in the sequential dynamic semantics from Section 3.2 and all instructions $[n] code$ in $t$ fulfil $FU(code) = fu$. 
Formally, the relation $C \overset{t}{\rightarrow} D$ is obtained by the rules

\[
\text{FU}_1 \quad \frac{C \overset{n \text{code}}{\rightarrow} \text{D}}{\text{FU} \overset{\text{code}}{\rightarrow} \text{fu}} \quad \text{FU} \overset{\text{code}}{\rightarrow} \text{fu} \quad \text{and} \quad \text{FU}_2 \quad \frac{C \overset{\text{fu}}{\rightarrow} E \overset{t}{\rightarrow} D}{\text{Fu} \overset{t}{\rightarrow} E \overset{\text{fu}}{\rightarrow} D}
\]

where the premise in rule $\text{FU}_1$ refers to the calculus for binary configurations in Figure 3.5. We define distributed (interleaved) execution $C \overset{\pi}{\rightarrow} D$ by

\[
\text{DIS}_1 \quad \frac{C \overset{\text{fu}}{\rightarrow} D \overset{\pi}{\rightarrow} \text{D}}{\text{DIS}} \quad \text{and} \quad \text{DIS}_2 \quad \frac{C \overset{\pi_1 \pi_2}{\rightarrow} E \overset{\pi_1 \pi_2}{\rightarrow} \text{D}}{\text{DIS}}
\]

The relationship between sequential and distributed execution is as follows.

**Proposition 7.**

1. If $C \overset{t}{\rightarrow} D$ then $C \overset{\pi}{\rightarrow} D$.

2. If $C \overset{\pi}{\rightarrow} D$ then there is a $t$ such that $\pi_t = \pi$ and $C \overset{t}{\rightarrow} D$.

3. If $C \overset{\pi_t}{\rightarrow} D$, $t = \text{ins}_1 \ldots \text{ins}_n$ and $\text{ins}_i = [n_i \text{code}_i]$ then $(\otimes_{i=1}^{n} \text{code}_i) \otimes \text{shape}(D) = (\otimes_{i=1}^{n} \text{B}_{\text{code}_i}) \otimes \text{shape}(C)$.

**Proof.** All three parts are shown separately.

1. Induction on the structure of $t$.

   **Case** $t = \varepsilon$. As there is no derivation for $C \overset{t}{\rightarrow} D$, the claim is trivially fulfilled.

   **Case** $t = \text{add} \; op_1 \; op_2 \; op_3$. From $C \overset{t}{\rightarrow} D$ and $\text{FU} \overset{\text{add} \; op_1 \; op_2 \; op_3}{\rightarrow} \text{ALU}$ we obtain $C \overset{\text{ALU}}{\rightarrow} D$, hence $C \overset{\pi_t}{\rightarrow} D$ follows by rule $\text{DIS}_1$.

   **Other base cases similar.**

   **Case** $t = \text{su}$. For $C \overset{\text{su}}{\rightarrow} D$, rule $\text{COMP}$ guarantees that there is an $E$ with $C \overset{s}{\rightarrow} E$ and $E \overset{u}{\rightarrow} D$. Applying the induction hypothesis to $C \overset{s}{\rightarrow} E$ and $E \overset{u}{\rightarrow} D$ yields $C \overset{\pi_t}{\rightarrow} E$ and $E \overset{\pi_u}{\rightarrow} D$, so the claim follows by rule $\text{DIS}_2$ and $\pi_s \pi_u = \pi_{su} = \pi_t$.

2. If $C \overset{\pi}{\rightarrow} D$ holds via $\text{DIS}_1 \quad \frac{C \overset{\text{fu}}{\rightarrow} D \overset{\pi}{\rightarrow} \text{D}}{\text{DIS}}$ then $\pi = \pi_s$ and the claim is fulfilled for $t = s$.

   If $C \overset{\pi}{\rightarrow} D$ holds via $\text{DIS}_2 \quad \frac{C \overset{\pi_1 \pi_2}{\rightarrow} E \overset{\pi_1 \pi_2}{\rightarrow} \text{D}}{\text{DIS}}$ with $\pi = \pi_1 \pi_2$, then we have $C \overset{s}{\rightarrow} E$ and $E \overset{u}{\rightarrow} D$ for some $s$ and $u$ with $\pi_s = \pi_1$ and $\pi_u = \pi_2$ by induction hypothesis. For $t = su$ we consequently obtain $\pi_t = \pi_{su} = \pi_s \pi_u = \pi_1 \pi_2 = \pi$ and $\text{COMP} \overset{\pi}{\rightarrow} D$.
3. By part (2), there is an $s$ with $\pi_s = \pi_t$ and $C \xrightarrow{\sim} D$, so $s$ consists of the same instructions as $t$. We may therefore prove the claim by showing that $(\otimes_{i=1}^n A_{\text{code}_i}) \otimes \text{shape}(D) = (\otimes_{i=1}^n B_{\text{code}_i}) \otimes \text{shape}(C)$ holds for an instruction sequence $s$ which consists of instructions $\text{ins}_i$ and fulfills $C \xrightarrow{\sim} D$.

**Case INSTR.** For $C \xrightarrow{\text{ins}_1} D$, Theorem 1 implies $\text{ins}_1 : \text{shape}(C) \rightarrow \text{shape}(D)$, and rule AX yields $\text{shape}(C) = A_{\text{code}_1} \otimes X$ and $\text{shape}(D) = B_{\text{code}_1} \otimes X$ for some $X$. Combining these two facts yields

$$A_{\text{code}_1} \otimes \text{shape}(D) = A_{\text{code}_1} \otimes B_{\text{code}_1} \otimes X = B_{\text{code}_1} \otimes \text{shape}(C).$$

**Case COMP.** For $s = ru$ there must be an $E$ such that $C \xrightarrow{r} E$ and $E \xrightarrow{u} D$, so by induction hypothesis we obtain

$$(\otimes_{j=1}^m A_{\text{code}_{ij}}) \otimes \text{shape}(E) = (\otimes_{j=1}^m B_{\text{code}_{ij}}) \otimes \text{shape}(C)$$

and

$$(\otimes_{j=m+1}^n A_{\text{code}_{ij}}) \otimes \text{shape}(D) = (\otimes_{j=m+1}^n B_{\text{code}_{ij}}) \otimes \text{shape}(E)$$

where $r = \text{ins}_{i_1}\ldots\text{ins}_{i_m}$ and $u = \text{ins}_{i_{m+1}}\ldots\text{ins}_{i_n}$. Hence,

$$(\otimes_{i=1}^n A_{\text{code}_i}) \otimes \text{shape}(D) = (\otimes_{j=1}^m A_{\text{code}_{ij}}) \otimes (\otimes_{j=m+1}^n A_{\text{code}_{ij}}) \otimes \text{shape}(D)$$

$$= (\otimes_{j=1}^m A_{\text{code}_{ij}}) \otimes (\otimes_{j=m+1}^n B_{\text{code}_{ij}}) \otimes \text{shape}(E)$$

$$= (\otimes_{j=1}^m B_{\text{code}_{ij}}) \otimes (\otimes_{j=m+1}^n B_{\text{code}_{ij}}) \otimes \text{shape}(C)$$

$$= (\otimes_{i=1}^n B_{\text{code}_i}) \otimes \text{shape}(C).$$

In particular, part (3) of Proposition 7 guarantees $\text{shape}(D) = \text{shape}(E)$ whenever $C \xrightarrow{\pi_t} D$ and $C \xrightarrow{\pi_r} E$.

As a further consequence, the type system presented in Section 3.3 is sound for distributed execution in the sense that typability entails executability.
**Proposition 8.** If $t : A \rightarrow B$ and $\text{shape}(C) = A$ then there is a $D$ such that $C \xrightarrow{\pi_i} D$ and $\text{shape}(D) = B$.

**Proof.** For $t : A \rightarrow B$ and $\text{shape}(C) = A$ there is by Theorem 1 a (unique) $D$ such that $\text{shape}(D) = B$ and $C \xrightarrow{\pi} D$. By Proposition 7 we have $C \xrightarrow{\pi_i} D$.

It is no longer complete, i.e. $C \xrightarrow{\pi_i} D$ does not imply $t : \text{shape}(C) \rightarrow \text{shape}(D)$.

**Example.** Let $t = [1]\text{add } q_1 q_2 q_3 [2]1\text{dc } 2 q_2$. $M$ arbitrary, $C = ([q_1 \rightarrow 3], M)$ and $D = ([q_3 \rightarrow 5], M)$. Then $\text{shape}(C) = q_1$, $\text{shape}(D) = q_3$ and $C \xrightarrow{\pi} D$, but $t$ cannot be given the type $q_1 \rightarrow q_3$. Indeed, the principal typing for $t$ is $t : q_1 \otimes q_2 \rightarrow q_3 \otimes q_2$ and a sequential execution of $t$ starting in $C$ deadlocks.

In fact, distributed programs do in general not execute deterministically.

**Definition 6.** A distributed program $\pi$ is deterministic if $C \xrightarrow{\pi} D$ and $C \xrightarrow{\pi} E$ implies $D = E$.

**Example.** Let $\pi = [1]\text{add } q_1 q_2 q_3 \parallel \varepsilon \parallel [2]1\text{dc } 1 q_3$ and $C = ([q_1 \rightarrow 2, q_2 \rightarrow 3], M)$. Then there are $D \neq E$ with $C \xrightarrow{\pi} D$ and $C \xrightarrow{\pi} E$, as shown in Figure 4.2.

Non-determinism arises from race hazards between instructions which execute on different functional units but forward to the same operand queue.

The type system’s adequacy for parallel execution is thus limited, in theoretical (incompleteness) and practical (non-determinism) respects. Even a well-typed program may yield different results as the configuration $D$ in Proposition 8 need not be unique.

The shape of all final configurations are identical by Proposition 7, but the values inside the operand queues may differ.

Rather than studying distributed programs and their non-determinism directly, we are more interested in characterising instruction sequences for which distributed execution coincides with sequential execution, and with typability. The following section therefore discusses the relationship between distributed execution, sequential execution, race hazards and typing for individual instruction sequences.
4.2 Determinism, safety and completeness

We first introduce some terminology.

Definition 7. An instruction sequence \( t \) is called

- deterministic if \( C \xrightarrow{\text{p}} D \) and \( C \xrightarrow{\text{p}} E \) implies \( D = E \)
- safe if \( C \xrightarrow{\text{p}} D \) implies \( C \xrightarrow{\text{L}} D \)
- complete if \( C \xrightarrow{\text{p}} D \) implies \( t : \text{shape}(C) \rightarrow \text{shape}(D) \).

For practical purposes, determinacy of \( t \) is important for studying the processor’s behaviour as modelled in the distributed dynamic semantics, without referring to the ISA definition. For establishing correctness of processor behaviour in relation to the ISA specification, safety is most appropriate as for safe programs, sequential and distributed dynamic semantics coincide. Proposition 7 showed that the processor may deliver the same result as the ISA semantics promises, and safety guarantees that any result delivered by the processor is indeed the one promised. Completeness finally allows us to reason about distributed execution in the compiler. It guarantees that whatever interleaving the processor chooses (possibly influenced by the outcome of race conditions),

Figure 4.2: Execution of a non-deterministic program
the shape of final configurations is determined by the shape of the initial configuration, based on Proposition 2.

The three notions relate to each other as follows.

**Proposition 9.** For an instruction sequence \( t \), the following (non-)implications hold.

- \( t \) safe \( \iff \) \( t \) complete and \( t \) deterministic
- \( t \) deterministic \( \not\implies \) \( t \) safe
- \( t \) deterministic \( \not\implies \) \( t \) complete
- \( t \) complete \( \not\implies \) \( t \) safe
- \( t \) complete \( \not\implies \) \( t \) deterministic

Before proving Proposition 9, we introduce relativised variants of the above notions.

**Definition 8.** An instruction sequence \( t \) is called

- deterministic for \( A \) if for all \( C \) with \( \text{shape}(C) = A \), \( C \xrightarrow{\pi} D \) and \( C \xrightarrow{\pi} E \) implies \( D = E \)
- safe for \( A \) if for all \( C \) with \( \text{shape}(C) = A \), \( C \xrightarrow{\pi} D \) implies \( C \xrightarrow{t} D \)
- complete for \( A \) if for all \( C \) with \( \text{shape}(C) = A \), \( C \xrightarrow{\pi} D \) implies \( t : \text{shape}(C) \rightarrow \text{shape}(D) \).

By definition, the unrelativised notions are obtained from the relativised notions by quantifying over all products \( A \).

**Lemma 2.** The following (non-)implications hold for instruction sequence \( t \) and product \( A \).

- \( t \) safe for \( A \) \( \iff \) \( t \) complete for \( A \) and \( t \) deterministic for \( A \)
- \( t \) deterministic for \( A \) \( \not\implies \) \( t \) safe for \( A \)
- \( t \) deterministic for \( A \) \( \not\implies \) \( t \) complete for \( A \)
\begin{itemize}
\item \(t\) complete for \(A\) \(\not\Rightarrow t\) safe for \(A\)
\item \(t\) complete for \(A\) \(\not\Rightarrow t\) deterministic for \(A\)
\end{itemize}

\textbf{Proof.} The first claim consists of two parts.

\(\Rightarrow.\) We have to show that for \(\text{shape}(C) = A, C \xrightarrow{\pi_t} D\) and \(C \xrightarrow{\pi} E\) implies \(D = E\) and \(t : \text{shape}(C) \rightarrow \text{shape}(D)\).

By the definition of safety with respect to \(A, C \xrightarrow{\pi_t} D\) implies \(C \xrightarrow{t} D\) and \(C \xrightarrow{\pi} E\) implies \(C \xrightarrow{t} E\). By Proposition 1, \(D = E\) follows. Also, \(C \xrightarrow{t} D\) implies \(t : \text{shape}(C) \rightarrow \text{shape}(D)\) by Theorem 1.

\(\Leftarrow.\) We have to show that \(\text{shape}(C) = A\) and \(C \xrightarrow{\pi_t} D\) implies \(C \xrightarrow{t} D\).

By the definition of completeness with respect to \(A, C \xrightarrow{\pi_t} D\) yields \(t : \text{shape}(C) \rightarrow \text{shape}(D)\), so by Theorem 1 there is an \(E\) with \(C \xrightarrow{t} E\), and \(\text{shape}(E) = \text{shape}(D)\) holds. By applying Proposition 7(part (1)), we obtain \(C \xrightarrow{\pi} E\), so determinism with respect to \(A\) implies \(D = E\), hence \(C \xrightarrow{t} D\).

For the non-implications, consider the programs \(s\) and \(t\)

\[
\begin{align*}
    s &= [1] \text{dec } q_1 q_2 [2] 1 \text{dc } 5 q_1 \\
    t &= [1] 1 \text{dc } 3 q_1 [2] 1 \text{dc } 5 q_2 [3] \text{dec } q_1 q_2
\end{align*}
\]

whose execution traces for initial configurations of shape \(1\) are shown in Figure 4.3.

Program \(s\) is deterministic for \(1\), since for any configuration \(C\) with empty queue \(q_1\), the distributed program \(\pi_s\) can only follow the interleaving \([2][1]\), so \(C \xrightarrow{\pi_s} D\) and \(C \xrightarrow{\pi} E\) implies \(D = E\). However, \(s\) is neither safe for \(1\) nor complete for \(1\). It is not safe because for any \(C\) with empty \(q_1\) there is no \(D\) with \(C \xrightarrow{s} D\). It is not complete because \(s\) cannot be given type \(1 \rightarrow B\) for any \(B\). Hence, program \(s\) proves the second and the third claim.

Program \(t\) is complete for \(1\), because \(t : 1 \rightarrow q_2 \otimes q_2\) holds, and for any \(C\) and \(D\) with \(\text{shape}(C) = 1\) and \(C \xrightarrow{\pi_t} D\) we have \(\text{shape}(D) = q_2 \otimes q_2\). On the other hand, \(t\) is neither safe for \(1\) nor deterministic for \(1\). It is not safe because for \(C = ([], M)\) and \(D = ([q_2 \rightarrow 25])\) we have \(\text{shape}(C) = 1\) and \(C \xrightarrow{\pi_t} D\) via the interleaving \([1][3][2]\), but \(C \xrightarrow{t} D\). It
is not deterministic for 1 because for the same C, configuration E = (q₂ → 52) fulfills C \( \pi_t \) E via the interleaving [1][2][3], but \( D \neq E \) holds. Hence, program \( t \) proves the forth and fifth claim.

The results in Proposition 9 are now obtained as corollaries of the corresponding statements in Lemma 2. Safety of \( t \) is equivalent to completeness and determinacy because for each product A this equivalence holds relative to A. Likewise, the examples in the proof of Lemma 2 also prove the non-implications in Proposition 9. In each case, we can exhibit a program \( t \) and (an A and) a configuration C (of shape A) demonstrating that the implication does not hold.

**Proposition 10.** If \( t : A \rightarrow B \) then \( t \) complete for A.

**Proof.** For \( t : A \rightarrow B \), \( shape(C) = A \) and \( C \xrightarrow{\pi_t} D \) we have to show \( t : shape(C) \rightarrow shape(D) \).

Since \( t : A \rightarrow B \) and \( shape(C) = A \) hold, Theorem 1 implies \( C \xrightarrow{\pi_t} E \) for some E with \( shape(E) = B \). By applying Proposition 7(part (1)) we obtain \( C \xrightarrow{\pi_t} E \), so Proposition
7(part (3)) yields \( \text{shape}(D) = \text{shape}(E) \), hence \( \text{shape}(D) = B \) as claimed. □

In the next section, we will extend the type system by constraints which guarantee (relativised) determinism. Fulfilling the constraints will hence result in (relativised) safety, based on the following consequences of Propositions 9 and 10.

**Corollary 1.** Let \( t \) be an instruction sequence and \( t : A \rightarrow B \).

- \( t \) is deterministic for \( A \) \( \Leftrightarrow \) \( t \) is safe for \( A \).

- If \( A = 1 \) then \( t \) is complete.

- If \( A = 1 \) then \( t \) is deterministic iff \( t \) is safe.

**Proof.**

- \( t : A \rightarrow B \) implies that \( t \) is complete for \( A \) by Proposition 10, so the first claim of Lemma 2 implies the equivalence.

- By applying the weakening rule WK we may type \( t : C \rightarrow C \otimes B \) for any \( C \), and for each choice Proposition 10 implies that \( t \) is complete for \( C \). Hence \( t \) is complete for all \( C \) and thus complete.

- Combine the previous item and Proposition 9. □

None of the relativised notions are monotone with respect to \( A \).

**Example.** Consider the programs

\[
t = [1] \text{add } q_1 q_2 q_1 [2] \text{dc } 3 q_1 \quad \text{and} \quad s = [1] \text{add } q_1 q_2 q_3 [2] \text{dc } 3 q_1
\]

with the respective principal typings \( t : q_1 \otimes q_2 \twoheadrightarrow q_1 \otimes q_1 \) and \( s : q_1 \otimes q_2 \twoheadrightarrow q_1 \otimes q_3 \). Then \( t \) is complete for \( 1 \) and \( q_2 \otimes q_1 \) but not for \( q_2 \), and it is deterministic for \( q_2 \) but not for \( q_2 \otimes q_1 \). Program \( s \) is safe for \( 1 \) and \( q_2 \otimes q_1 \) but not for \( q_2 \). □
4.3 Dealing with non-determinism

Race hazards leading to non-determinism can be detected at compile-time by analysing the forwarding behaviour, and several choices exist for handling them.

In some applications, non-determinism might be desirable. For example, the latency in globally asynchronous systems depends on environmental conditions such as the temperature. In applications for security, the temperature might hence be (a part of) the secret key necessary for decoding a message correctly, and attempts to decode a message under incorrect temperature settings experience different interleavings and thus fail.

For applications where non-determinism cannot be tolerated, race hazards can be eliminated either in hardware or in software. A simple hardware-based approach would aim at modifying the processor implementation to block the progress in one pipeline until other pipelines have reached a certain state. This corresponds to changing the dynamic semantics by imposing side-conditions on the reduction rules. If implemented naively, such a solution depends on a notion of global state, is thus likely to result in central control and therefore undesirable in an asynchronous setting as the operand-driven mode of operation is compromised.

The programming language view allows us to understand race-hazards and their elimination as two separate issues. First, there are requirements on the relative order of some instructions which need to be fulfilled for determinism to hold. These constraints are specific to each input program. Second, there are mechanisms for proving that requirements are fulfilled. These discharging techniques exploit existent ordering relations between instructions which originate from the dynamic semantics or are specific to the input program or the initial configuration. They may be of arbitrary complexity as long as they can be implemented as static or dynamic program analysis.

This section therefore presents an extended type system which derives constraints necessary for determinism. Following that, we examine two techniques for discharging constraints, based on the dynamic semantics and on program analysis. The first technique fulfils constraints irrespectively of the shape of initial configurations, and can therefore be used for obtaining determinism. The second technique depends on the shape of an initial configuration and can thus yield at most relativised determinism.
4.3. Dealing with non-determinism

As the extended type system contains the rules for deriving $t : A \rightarrow B$, the respective determinacy property immediately yields the corresponding safety property, and thus guarantees the predictability of program execution according to the distributed dynamic semantics.

4.3.1 A static semantics for determinism

The type system from Section 3.3 is extended such that it derives sequentiality constraints which guarantee deterministic execution: whenever two instructions forward their results to the same operand queue their execution order must agree in all interleavings. These constraints will later be discharged using instruction dependencies inherent in the dynamic semantics or the structure of the program.

For the dynamic semantics $C \xrightarrow{n} D$, the only source of non-determinism in a well-typed program is the presence of a race condition between instructions which execute independently but forward to the same operand queue. In order to reason about these output conflicts, we extend sequents $t : A \rightarrow B$ by annotations $\Gamma$ and $<$. This yields sequents of the form $(\Gamma, <) \vdash t : A \rightarrow B$. The set $\Gamma$ contains entries of the form $op_1 : (l_1, k_1), \ldots, op_n : (l_n, k_n)$ such that the $op_i$ are distinct and the $l_i, k_i \in \mathbb{N}$ are labels of instructions in $t$. An entry $op : (l, k)$ in $\Gamma$ represents the fact that instructions $l$ and $k$ forward their result to $op$, and $l$ is the first such instruction in $t$ and $k$ the last one.

The second component, $<$, is a partial order over $\mathbb{N}$ which relates instructions in $t$ by $[n] < [m]$ whenever they forward to the same destination such that $[n]$ precedes $[m]$ in $t$. Soundness of our calculus consists of showing that $t$ is deterministic in the sense of Section 4.2 provided that all executions of $t$ contain $[n]$ before $[m]$. By embedding the calculus $t : A \rightarrow B$, we restrict our attention to programs which are sequentially executable and aim for safety rather than pure determinism.

The typing rules for deriving $(\Gamma, <) \vdash t : A \rightarrow B$ are shown in Figure 4.4 and are defined in terms of the typing relation $t : A \rightarrow B$ from Section 3.3. They use the following definitions.

**Definition 9.** For $B_{\text{code}} = op_1^{k_1} \otimes \ldots \otimes op_m^{k_m}$ we set

$$\Gamma_{[n]_{\text{code}}} = \{ op_i : (n, n) \mid k_i > 0 \}. $$
For contexts \((\Gamma, \subset)\) and \((\Delta, \prec)\), the context \((\Gamma \# \Delta, \prec)\) is given by

\[
\Gamma \# \Delta = \{ op : (l,h) | op : (l,k) \in \Gamma \text{ and } op : (g,h) \in \Delta \} \\
\cup \{ op : (l,k) | op : (l,k) \in \Gamma \text{ and } op \notin \text{dom} \Delta \} \\
\cup \{ op : (g,h) | op : (g,h) \in \Delta \text{ and } op \notin \text{dom} \Gamma \}
\]

and

\[
\prec = (\subset \cup \prec \cup \{ (k,g) | op : (l,k) \in \Gamma \text{ and } op : (g,h) \in \Delta \})^+.
\]

The type system is faithful to the calculus from Section 3.3.

**Lemma 3.**

1. \(t : A \rightarrow \omega B\) holds exactly if there are \(\Gamma\) and \(\prec\) such that \((\Gamma, \prec) \vdash t : A \rightarrow \omega B\) holds.

2. If \((\Gamma, \prec) \vdash t : A \rightarrow \omega B\) and \((\Delta, \prec) \vdash t : C \rightarrow \omega D\) then \(\Gamma = \Delta\) and \(\prec = \omega\).

3. If \((\Gamma, \prec) \vdash t : A \rightarrow \omega B\), then

   - all labels \(n\) occurring in \(\Gamma\) or \(\prec\) are labels of instructions in \(t\).
   - if \(n < m\) then \(t\) can be written as \(s[n] \text{code}_n r[m] \text{code}_m u\) and \(B \text{code}_n\) and \(B \text{code}_m\) have a common factor \(op\).
   - \(op : (n, \omega) \in \Gamma\) iff \(n\) is the first instruction \([m] \text{code}\) in \(t\) for which \(op\) is a factor of \(B \text{code}\).
   - \(op : (\omega, n) \in \Gamma\) iff \(n\) is the last instruction \([m] \text{code}\) in \(t\) for which \(op\) is a factor of \(B \text{code}\).
   - \(op : (n, n) \in \Gamma\) iff \(n\) is the only instruction \([m] \text{code}\) in \(t\) for which \(op\) is a factor of \(B \text{code}\).
4.3. Dealing with non-determinism

4. If \((\Gamma, \prec) \vdash t : A \rightarrow B\), \(t\) has the form \(s[n] \text{code}_1 r[m] \text{code}_2 u\) and \(op\) is a factor of both \(B_{\text{code}_1}\) and \(B_{\text{code}_2}\), then \(n < m\) holds.

Proof. We prove all four parts simultaneously by induction on the structure of \(t\).

Case \(t = \varepsilon\). Neither \(t : A \rightarrow B\) nor \((\Gamma, \prec) \vdash t : A \rightarrow B\) are derivable, so all claims are trivially fulfilled.

Case \(t = \text{ins}\). 1. If \(\text{ins} : A \rightarrow B\) then \((\Gamma, \prec) \vdash t : A \rightarrow B\) holds for \(\Gamma = \Gamma_{\text{ins}}\) and \(< = \emptyset\) by rule AX<.

If \((\Gamma, \prec) \vdash \text{ins} : A \rightarrow B\) is derivable then \(\text{ins} : A \rightarrow B\) is derivable because it occurs as the premise of AX<.

2. For \((\Gamma, \prec) \vdash \text{ins} : A \rightarrow B\) and \((\Delta, \prec) \vdash \text{ins} : C \rightarrow D\) we have \(\Gamma = \Gamma_{\text{ins}} = \Delta\) and \(< = \emptyset = \prec\) by rule AX<.

3. For \((\Gamma, \prec) \vdash \text{ins} : A \rightarrow B\) we have \(\Gamma = \Gamma_{\text{ins}} = \emptyset\), so the first two claims hold trivially, and \(op : (m, l)\) occurs in \(\Gamma\) exactly if \(m = l\) is the label of \(\text{ins}\) and for \(B_{\text{code}} = op_{k_1}^{k_1} \otimes \ldots \otimes op_{k_n}^{k_n}\) there is an \(i\) with \(op = op_i\) and \(k_i > 0\), i.e. \(op\) is a factor of \(B_{\text{code}}\).

4. \(\text{ins}\) does not have the form \(s[n] \text{code}_1 r[m] \text{code}_2 u\), so the claim is trivially fulfilled.

Case \(t = \text{sr}\). 1. The last step in the derivation for \(t : A \rightarrow B\) must be

\[
\text{CUT} \quad \frac{s : A \rightarrow C \quad r : C \rightarrow B}{t : A \rightarrow B}
\]

for some \(C\). By induction hypothesis, there are hence \(\Lambda, \subset, \Delta\) and \(<\) such that \((\Lambda, \subset) \vdash s : A \rightarrow C\) and \((\Delta, \prec) \vdash r : C \rightarrow B\) hold. For \(\Gamma = \Lambda \# \Delta\) and \(<\) defined by

\[(\subset \cup < \cup \{(k, g) \mid op : (l, k) \in \Lambda\) and \(op : (g, h) \in \Delta\})^+\]

we obtain \((\Gamma, <) \vdash t : A \rightarrow B\) by rule \(\text{CUT}<\).

Conversely, the last step in a derivation for \((\Gamma, <) \vdash t : A \rightarrow B\) must be of the form

\[
\text{CUT}< \quad \frac{(\Lambda, \subset) \vdash s : A \rightarrow C \quad (\Delta, \prec) \vdash r : C \rightarrow B}{(\Gamma, <) \vdash t : A \rightarrow B}
\]
where $\Gamma = \Lambda\#\Delta$ and $<$ equals

\[
(c \cup \prec \cup \{(k, g) \mid \text{op}: (l, k) \in \Lambda \text{ and } \text{op}: (g, h) \in \Delta\})^+.
\]

By induction hypothesis, $s : A \to C$ and $r : C \to B$ hold, hence $sr : A \to B$ follows by rule CUT.

2. The last rule in the derivations for $(\Gamma, \prec) \vdash t : A \to B$ and $(\Delta, \prec) \vdash t : C \to D$ is CUT$^<$, so there are $\Lambda_i, \Sigma_i, \subset_i$ and $\prec_i$ for $i \in \{1, 2\}$ such that

\[
\begin{align*}
(\Lambda_1, \subset_1) & \vdash s : A \to E \\
\text{CUT}^< & \frac{\Sigma_1, \prec_1 \vdash r : E \to B}{(\Gamma, \prec) \vdash sr : A \to B} \quad \text{and} \quad \text{CUT}^< \frac{\Sigma_2, \prec_2 \vdash r : F \to D}{(\Delta, \prec) \vdash sr : C \to D}
\end{align*}
\]

In particular, $\Gamma = \Lambda_1\#\Sigma_1, \Delta = \Lambda_2\#\Sigma_2$ and

\[
\begin{align*}
\prec &= (c_1 \cup \prec_1 \cup \{(k, g) \mid \text{op}: (l, k) \in \Lambda_1 \text{ and } \text{op}: (g, h) \in \Sigma_1\})^+ \\
\prec &= (c_2 \cup \prec_2 \cup \{(k, g) \mid \text{op}: (l, k) \in \Lambda_2 \text{ and } \text{op}: (g, h) \in \Sigma_2\})^+.
\end{align*}
\]

By induction hypothesis, $\Lambda_1 = \Lambda_2, \Sigma_1 = \Sigma_2, \subset_1 = \subset_2$ and $\prec_1 = \prec_2$ hold, hence

\[
\Gamma = \Lambda_1\#\Sigma_1 = \Lambda_2\#\Sigma_2 = \Delta
\]

and

\[
\begin{align*}
\prec &= (c_1 \cup \prec_1 \cup \{(k, g) \mid \text{op}: (l, k) \in \Lambda_1 \text{ and } \text{op}: (g, h) \in \Sigma_1\})^+ \\
&= (c_2 \cup \prec_2 \cup \{(k, g) \mid \text{op}: (l, k) \in \Lambda_2 \text{ and } \text{op}: (g, h) \in \Sigma_2\})^+ \\
&= \prec.
\end{align*}
\]

3. The last step in the derivation for $(\Gamma, \prec) \vdash sr : A \to B$ must have been

\[
\begin{align*}
(\Lambda, \subset) & \vdash s : A \to C \\
\text{CUT}^< & \frac{\Delta, \prec \vdash r : C \to B}{(\Gamma, \prec) \vdash sr : A \to B}
\end{align*}
\]

for some $\Lambda, \Delta, \subset, \prec$ and $C$. By induction hypothesis, all labels occurring in $\Lambda$ or $\subset$ are labels of instructions in $s$, and all labels occurring in $\Delta$ or $\prec$ are labels of instructions in $r$, so all labels occurring in $\Gamma$ or $\prec$ are labels
of instructions in \( t = sr \) by the definition of \# and \(<\). This proves the first item.

For the third item, observe that for \( op : (n, \_), \in \Gamma \), either \( op : (n, \_), \in \Lambda \) or \( op \notin \text{dom } \Lambda \) and \( op : (n, \_), \in \Delta \) hold. In the first case, \([n]\) is the first instruction for which \( op \) occurs in \( B_{\text{code}} \) in \( s \) by induction hypothesis, hence \( n \) is the first such instruction in \( sr = t \). In the second case, there is no instruction in \( s \) for which \( op \) occurs in \( B_{\text{code}} \) by induction hypothesis (reverse direction), but \([n]\) is the first instruction for which \( op \) occurs in \( B_{\text{code}} \) in \( r \), also by induction hypothesis. Hence \([n]\) the first such instruction in \( sr = t \).

Conversely, if \( n \) is the first instruction in \( s \) for which \( op \) occurs in \( B_{\text{code}} \) in \( sr \) then either \( n \) is the first such instruction or \( s \) does not have any instruction for which \( op \) occurs in \( B_{\text{code}} \) and \([n]\) is the first such instruction in \( r \). In the first case, the induction hypothesis implies \( op : (n, \_), \in \Lambda \), so \( op : (n, \_), \in \Gamma \).

In the second case, the induction hypothesis implies \( op \notin \text{dom } \Lambda \), and \( op : (n, \_), \in \Delta \) holds, so \( op : (n, \_), \in \Gamma \).

The forth and fifth item are proven similarly to the third one.

Finally, for proving the second item, note that by the definition of \( < \), any \( n \) and \( m \) with \( n < m \) fulfill either \( n \subset m \) or \( n \prec m \) or there are \( l, h \) and \( op \) such that \( op : (l, n), \in \Lambda \) and \( op : (m, h), \in \Delta \). In the first case, the induction hypothesis guarantees that \( s \) can be written as \( s_1[n]s_2[m]s_3 \) and there is a common factor of \( B_{\text{code}_n} \) and \( B_{\text{code}_m} \), so \( t = sr \) can be written as \( s_1[n]s_2[m]s_3r \) and there is a common factor of \( B_{\text{code}_n} \) and \( B_{\text{code}_m} \). Similarly, the second case results in \( t \) having the form \( sr_1[n]r_2[m]r_3 \) with a common factor of \( B_{\text{code}_n} \) and \( B_{\text{code}_m} \). In the third case, the first item guarantees that \( s \) is of the form \( s_1[n]s_2 \) and that \( r \) is of the form \( r_1[m]r_2 \), so \( t \) is of the form \( t_1[n]t_2[m]t_3 \) for \( t_1 = s_1, t_2 = s_2r_1 \) and \( t_3 = r_2 \). Items four and five guarantee that \( op \) is a factor of both \( B_{\text{code}_n} \) and \( B_{\text{code}_m} \).

4. For \((\Gamma, <) \vdash st : A \rightarrow B\) and \( sr \) is of the form

\[
u[n]\text{code}_1v[m]\text{code}_2w
\]

such that \( op \) is a factor of \( B_{\text{code}_1} \) and \( B_{\text{code}_2} \), the rule \( \text{CUT}^< \) implies that
there are $\Lambda, \subset, \Delta$ and $\prec$ such that $(\Lambda, \subset) \vdash s : A \rightarrow \circ C$ and $(\Delta, \prec) \vdash r : C \rightarrow \circ B$
hold, where $\Gamma = \Lambda \# \Delta$ and $\prec$ is equal to

$$(\subset \cup \prec \cup \{(k, g) \mid \text{op} : (l, k) \in \Lambda \text{ and } \text{op} : (g, h) \in \Delta\})^+. $$

There are three cases.

(a) Instructions $[n]$ and $[m]$ both occur in $s$. Then $s$ can be written in the form $s_1 [n] \text{code}_1 s_2 [m] \text{code}_2 s_3$, and the induction hypothesis yields $[n] \subset [m]$, so from the definition of $\prec$ we obtain $[n] \prec [m]$.

(b) Instructions $[n]$ and $[m]$ both occur in $r$. Similarly to the previous case, we obtain $[n] \prec [m]$ by induction hypothesis and thus $[n] \prec [m]$ from the definition of $\prec$.

(c) Instruction $[n]$ occurs in $s$ and instruction $[m]$ occurs in $r$. Let $[l]$ be the last instruction in $s$ such that $\text{op}$ is a factor of $B_{\text{code}}$ and $[k]$ be the first such instruction in $r$. By induction hypothesis, $l \neq n$ implies $n \subset l$ and $m \neq k$ implies $k \prec m$. It therefore suffices to show $l < k$, which holds by the definition of $\prec$ as item (3) yields $\text{op}(\omega, l) \in \Lambda$ and $\text{op}(k, \omega) \in \Delta$.

For $(\Gamma, \prec) \vdash t : A \rightarrow \circ B$, instructions in $t$ are thus related by $\prec$ whenever they forward to the same operand queue. The order of such instructions for each queue agrees with the order in which they appear in $t$.

**Definition 10.** Let $t$ be an instruction sequence and $\sqsubseteq$ a partial order over $\mathbb{N}$. Then $t$ respects $\sqsubseteq$ if $t$ can be written as $t = s [n] \text{code}_n r [m] \text{code}_m u$ whenever $[n] \sqsubseteq [m]$ holds.

For proving soundness, we show that $(\Gamma, \prec) \vdash t : A \rightarrow \circ B$ yields determinism relative to $A$, provided that all constraints $\prec$ are met: for $\text{shape}(C) = A$, we suppose that all distributed executions of $t$ respect $\prec$, i.e. that for $C \xrightarrow{\pi_s} E$, all $s$ which fulfil $C \xrightarrow{\pi_t} E$ and $\pi_s = \pi_t$ respect $\prec$. We show that for the unique $D$ with $C \xrightarrow{\pi} D$ (existent by Theorem 1) the equality $D = E$ holds.

**Definition 11.** Program $s$ is an interleaving for $t$ and $A$ if $\pi_s = \pi_t$ holds and for any $C$ with $\text{shape}(C) = A$ there is a $D$ with $C \xrightarrow{\pi} D$. 
For a partial order $\sqsubseteq$ over $\mathbb{N}$, program $t$ is called sequential for $\sqsubseteq$ and $A$ if all interleavings $s$ for $t$ and $A$ respect $\sqsubseteq$.

**Lemma 4.** Let $(\Gamma, \prec) \vdash t : A \rightarrow B$, $\text{shape}(C) = A$, and $u = \text{ins'}$ an interleaving of $t$ and $A$. Let $t$ be sequential for $\sqsubseteq$ and $A$, let $\prec$ be contained in $\sqsubseteq$, and let $C \xrightarrow{\text{ins}} D_1$ and $t = t_1 \text{inst}_2$. Then there are $\Delta$ and $\prec'$ such that for $C = \text{shape}(D_1)$ and $s = t_1 t_2$ the following derivation exists

$$
\frac{(\Gamma_{\text{ins}}, \emptyset) \vdash \text{ins} : A \rightarrow C \quad (\Delta, \prec) \vdash s : C \rightarrow B}{(\Gamma_{\text{ins}} \# \Delta, \prec) \vdash \text{ins}\, s : A \rightarrow B}
$$

In particular, $\prec'$ is contained in $\prec$.

**Proof.** Since $u$ is an interleaving for $t$ and $A$, $t$ is sequential for $\sqsubseteq$ and $A$ and $\text{shape}(C) = A$ holds, there is a (unique) $D$ such that $C \xrightarrow{\text{ins}} D_1 \rightarrow D$. Theorem 1 implies $\text{ins} : A \rightarrow C$, hence $(\Gamma_{\text{ins}}, \emptyset) \vdash \text{ins} : A \rightarrow C$ by the rule $AX\prec$.

Since $(\Gamma, \prec) \vdash t : A \rightarrow B$ holds, Lemma 3 implies $t : A \rightarrow B$, so there is by Theorem 1 a (unique) $E$ with $C \xrightarrow{t} E$, and $\text{shape}(E) = B$ holds. The typing $t : A \rightarrow B$ also implies that there are $D$ and $E$ such that

$$
t_1 : A \rightarrow D \quad \text{ins} : D \rightarrow E
$$

$$
t_1 \text{ins} : A \rightarrow E \quad t_2 : E \rightarrow B
$$

is a valid derivation. Since $\text{ins} : D \rightarrow E$ holds, we know that there is an $X$ with

$$D = A_{\text{code}} \otimes X \quad \text{and} \quad E = B_{\text{code}} \otimes X$$

where $\text{ins} = [n] \text{code}$.

On the other hand, $\text{ins} : A \rightarrow C$ implies

$$A = A_{\text{code}} \otimes Y \quad \text{and} \quad C = B_{\text{code}} \otimes Y$$

for some $Y$. Consequently,

$$t_1 : A_{\text{code}} \otimes Y \rightarrow A_{\text{code}} \otimes X \quad \text{and} \quad t_2 : B_{\text{code}} \otimes X \rightarrow B.$$

Since $\text{ins}$ is the head instruction of its pipeline, instructions in $t_1$ do not access elements in $A_{\text{code}}$, so $A_{\text{code}}$ is used as a weakening in the typing of $t_1$, i.e. we have $t_1 : Y \rightarrow X$. By applying the alternative weakening $B_{\text{code}}$, we obtain $t_1 : Y \otimes B_{\text{code}} \rightarrow X \otimes B_{\text{code}}$, so

$$s : Y \otimes B_{\text{code}} \rightarrow B.$$
Since \( Y \otimes B_{\text{code}} = C \) holds, we have \( s : C \rightarrow B \), and hence
\[
(\Delta, \prec) \vdash s : C \rightarrow B \quad \text{for some } \Delta \text{ and } \prec
\]
by Lemma 3. Consequently,
\[
(\Gamma_{\text{ins}} \# \Delta, \sqsubset) \vdash \text{ins } s : A \rightarrow B
\]
for \( \sqsubset = (\prec \cup \{([n],[m]) | \text{op} : ([m],u) \in \Delta, \text{op factor of } B_{\text{code}} \})^+ \). Instruction sequences \( u \) and \( t \) are both interleavings for \( t \) and \( A \rightarrow u \) by assumption and \( t \) due to \( \pi_t = \pi_t \) and \( t : A \rightarrow B \). Since \( t \) is sequential for \( \sqsubset \) and \( A, u \) and \( t \) both respect \( \sqsubset \).

Therefore, \([n] \) must be minimal in \( \sqsubset \). For suppose there is an \([m] \) with \([m] \sqsubset [n] \), then \( u \)
must have the form \( u_1 [m] u_2 [n] u_3 \), in contradiction to \( u = \text{ins } u' \).

As \([m] < [n] \) implies \([n] \sqsubset [m] \), instruction \([n] \) is also minimal for \( < \), and \([n] \) does not occur in \( \Delta \) or \( \prec \), since it does not occur in \( s \) (see the previous Lemma).

Let \([l] < [k] \). By Lemma 3, \( s = t_1 t_2 \) has the form \( v_1 [l] \text{code}_1 v_2 [k] \text{code}_k v_3 \) and there is a common factor \( \text{op} \) of \( B_{\text{code}_1} \) and \( B_{\text{code}_k} \). Consequently, \([l] \) occurs in front of \([k] \) also in \( t = t_1 \text{inst}_2 \), so \([l] < [k] \) holds by the last item of Lemma 3.

\[\square\]

**Proposition 11.** Let \((\Gamma, \prec) \vdash t : A \rightarrow B \) and \( \prec \) be contained in the partial order \( \sqsubset \). If \( t \) is sequential for \( \sqsubset \) and \( A \) then \( t \) is deterministic for \( A \).

**Proof.** Since \((\Gamma, \prec) \vdash t : A \rightarrow B \) implies \( t : A \rightarrow B \), it suffices by Corollary 1 to prove that for all \( C \) with \( \text{shape}(C) = A \), \( C \xrightarrow{\pi_t} D \) implies \( C \xrightarrow{t} D \). So let \( \text{shape}(C) = A \) and \( C \xrightarrow{\pi_t} D \). By Proposition 7(part (2)) there is a \( u \) with \( \pi_u = \pi_t \) and \( C \xrightarrow{u} D \), so \( u \) is an interleaving for \( t \) and \( A \). As \( t \) is sequential for \( \sqsubset \) and \( A, u \) respects \( \sqsubset \), and thus \( \prec \). We write \( t = t_1 \ldots t_n \) and \( u = u_1 \ldots u_n \) and show \( C \xrightarrow{t} D \) by induction on \( n \).

- For \( n = 1 \), \( \pi_u = \pi_t \) implies \( u = u_1 = t_1 = t \), so \( C \xrightarrow{u} D \) implies \( C \xrightarrow{t} D \).

- For \( n > 1 \), let and \( u = u_1 u' \) and \( t = t_1 t' \). \( C \xrightarrow{u} D \) implies the existence of a \( D_1 \) with \( C \xrightarrow{u_1} D_1 \) and \( D_1 \xrightarrow{u'} D \). Since \((\Gamma, \prec) \vdash t : A \rightarrow B \) holds, Lemma 3(part (1)) implies \( t : A \rightarrow B \), so by Theorem 1 and \( \text{shape}(C) = A \) there is a (unique) \( E \) such that \( C \xrightarrow{t} E \) and \( \text{shape}(E) = B \). Consequently, \( t \) is an interleaving for \( t \) and \( A \), so it respects \( \sqsubset \) and \( \prec \), and there is an \( E_1 \) such that \( C \xrightarrow{u_1} E_1 \) and \( E_1 \xrightarrow{t'} E \). We perform a case distinction on whether \( FU(u_1) = FU(t_1) \) holds.
4.3. Dealing with non-determinism

Case $FU(u_1) = FU(t_1)$. Instructions $u_1$ and $t_1$ both execute in configuration $C$, and they are in the same pipeline, so they must be equal, and $D_1 = E_1$ and $\pi_{u_1} = \pi_{t_1}$ follow. By Lemma 4, there are $\Delta$ and $\prec$ such that for $C = shape(E_1) = shape(D_1)$ there is a derivation for $(\Delta, \prec) \vdash t' : C \rightsquigarrow B$, and $\prec$ is contained in $\prec$ and hence in $\subseteq$. Furthermore, $t'$ is sequential for $\subseteq$ and $C$ as for any interleaving $s$ for $t'$ and $C$, the program $t_1 s$ is an interleaving for $t$ and $A$ and thus respects $\subseteq$, so $s$ must respect $\subseteq$. We can hence apply the induction hypothesis to obtain that $t'$ is deterministic for $C$. Therefore, for any $F$ with $shape(F) = C$, $F \xrightarrow{\pi_{u_1}} G$ and $F \xrightarrow{\pi_{t_1}} H$ implies $G = H$. We choose $F = D_1 = E_1$, $G = D$, $F \xrightarrow{\pi_{t_1}} D$ via interleaving $u'$, and $H = E$ via interleaving $t'$. Consequently, $E = D$ holds.

Case $FU(u_1) \neq FU(t_1)$. Instructions $u_1$ and $t_1$ are completely independent: for their principal typings $u_1 :: A_{u_1} \rightsquigarrow B_{u_1}$ and $t_1 :: A_{t_1} \rightsquigarrow B_{t_1}$, $prime(A_{u_1}, A_{t_1})$ and $prime(B_{u_1}, B_{t_1})$ hold, proven as follows.

Suppose $q$ is a joint factor of $A_{u_1}$ and $A_{t_1}$. Then $q$ fulfils $\gamma(q) = FU(u_1) \neq FU(t_1) = \gamma(q)$, a contradiction. Suppose $q$ is a joint factor of $B_{u_1}$ and $B_{t_1}$, then Lemma 3(part (4)) implies $t_1 < u_1$ because $u_1 = t_1$ holds for some $l > 1$ due to $\pi_{u_1} = \pi_{t_1}$. As $\prec$ is contained in $\subseteq$, this contradicts the fact that $u$ respects $\subseteq$, i.e. no instruction may precede $u_1$ in $\subseteq$.

Consequently, $shape(C)$ contains both $A_{u_1}$ and $A_{t_1}$ and can thus be written as $shape(C) = A_{u_1} \otimes A_{t_1} \otimes X$ for some $X$. Theorem 1 now implies

$$shape(D_1) = B_{u_1} \otimes A_{t_1} \otimes X \quad and \quad shape(E_1) = A_{u_1} \otimes B_{t_1} \otimes X.$$ 

Hence $t_1$ can be executed in $D_1$ and $u_1$ in $E_1$, and the above properties on the disjointness of the $A_i$ and $B_i$ imply that there is a common $C_1$ with

$$E_1 \xrightarrow{u_1} C_1, \quad D_1 \xrightarrow{t_1} C_1 \quad and \quad shape(C_1) = B_{u_1} \otimes B_{t_1} \otimes X$$

as visualised in Figure 4.5. As $u_1 = t_1$ occurs in $t'$ and $t_1 = u_k$ in $u'$ for some $k$, there are $s', s'', r'$ and $r''$ such that

$$t' = s' u_1 s'' \quad and \quad u' = r' t_1 r''$$
Furthermore, $u'$ and $t'$ are sequential for \( \text{shape}(D_1) \) and \( \text{shape}(E_1) \) as they are postfixed for $u$ and $t$. We can thus apply Lemma 4 to the outer spines of the execution tree in Figure 4.5, so there are $\Delta$, $\Lambda$, $\prec$ and $\sqsubset$ such that $\Delta, \prec \vdash s's'' : \text{shape}(C_1) \rightarrow B$ and $(\Lambda, \sqsubset) \vdash r'r'' : \text{shape}(C_1) \rightarrow B$ where $\prec$ and $\sqsubset$ are contained in $\prec$ and thus in $\sqsubset$.

By applying the induction hypothesis we obtain determinacy of $t'$ and $u'$ for \( \text{shape}(E_1) \) and \( \text{shape}(D_1) \), respectively. Code sequences $u_1 s's''$ and $t_1 r'r''$ are interleavings of $t'$ and $s'$ for \( \text{shape}(E_1) \) and \( \text{shape}(D_1) \), respectively, as $u_1$ and $t_1$ are at the head positions in $E_1$ and $D_1$ and $\pi_{u_1 s's''} = \pi_{s'd} u_1 s'' = \pi_{t'}$ and $\pi_{t_1 r'r''} = \pi_{r't_1 r''} = \pi_{u_1}$ hold. Thus, $C_1 \xrightarrow{s's''} E$ and $C_1 \xrightarrow{r'r''} D$ hold. By construction, for $s = s's''$ and $r = r'r''$ the equality $\pi_r = \pi_s$ holds as both sides result from $\pi_t$ by executing $t_1$ and $u_1$. Again, $s$ and $r$ are sequential for \( \text{shape}(C) \) as they are interleavings of a postfix of $t$, hence the claim follows from the induction hypothesis.

\[ \square \]

The derived order $\prec$ thus represents the constraints necessary for (relativised) determinism. Arbitrary means may be used for establishing the existence of a relation $\sqsubset$ which contains $\prec$.

### 4.3.2 Pipeline-dependencies

Many programs can be proven deterministic by exploiting the order of instructions in the instruction pipelines. As the dynamic semantics inserts instructions into wait-
Dealing with non-determinism

As stations in program order and functional units remove only head instructions, the relative order of instructions $[n]code_n$ and $[m]code_m$ which are mapped to the same functional unit is preserved.

**Definition 12.** For instruction sequence

$$s[n]code_n r[m]code_m u,$$

$[m]$ is pipeline-dependent on $[n]$, written $[n] \preceq [m]$, if $FU(code_n) = FU(code_m)$ holds.

Members of the same pipeline can never show a race hazard because no two instructions can be in operation concurrently.

**Proposition 12.** Let $t$ be a code sequence, $A$ arbitrary and let $\preceq$ be the order of pipeline-dependencies for $t$. Then $t$ is sequential for $\preceq$ and $A$.

**Proof.** For interleaving $s$ for $t$ and $A$ we have to show that $s$ respects $\preceq$, i.e. that $[n] \preceq [m]$ implies that $s$ can be written as $s'[n]code_n s''[m]code_m s'''$.

As $[n] \preceq [m]$ is the pipeline-order with respect to $t$, there are $t', t''$ and $t'''$ such that $t = t'[n]code_n t''[m]code_m t'''$ and $FU(code_n) = FU(code_m)$ holds. For the component $i \in \{1, 2, 3\}$ corresponding to $FU(code_n)$ and $\pi_s = t_1 \parallel t_2 \parallel t_3$ we thus obtain $t_i = u'[n]code_n u''[m]code_m u'''$ for some $u', u''$ and $u'''$. As $s$ is an interleaving of $t$, $\pi_s = \pi_t$ holds, so for $\pi_s = s_1 \parallel s_2 \parallel s_3$ we have $s_i = t_i$. By the definition of $\pi_s$ the claim follows.

Consequently, programs are safe if all constraints $<$ can be discharged using pipeline-dependencies:

**Corollary 2.** If $(\Gamma, \prec) \vdash t : A \rightarrow B$ and $\prec$ is contained in the order $\prec$ of pipeline-dependencies in $t$, then $t$ is safe for $A$. If additionally $A = 1$ holds, then $t$ is safe.

**Proof.** Combine Propositions 12 and 11 and Corollary 1.

As $(\Gamma, \prec) \vdash t : A \rightarrow B$ implies $(\Gamma, \prec) \vdash t : A \otimes X \rightarrow B \otimes X$, safety with respect to $A$ implies safety with respect to $A \otimes X$, provided that only pipeline-dependencies are used for discharging the constraints.
Often, serialisation by pipeline-dependencies suffices to ensure determinism. For example, consider the program (4.1)

\[
\begin{align*}
[1] & \text{ldc } q_1 & [3] & \text{ldc } q_2 & [5] & \text{mul } q_2 q_1 q_1 & [7] & \text{st } q_3 q_4 \\
[2] & \text{ldc } q_2 & [4] & \text{ldc } q_4 & [6] & \text{add } q_1 q_1 q_3
\end{align*}
\] (4.1)

Provided that \(\gamma(q_1) = \text{ALU}, \gamma(q_2) = \text{MUL}\) and \(\gamma(q_3) = \gamma(q_4) = \text{MEM}\) hold, this program can be given the type \((\Gamma, <) \vdash t : 1 \rightarrow 1\) where \([2] < [3], [1] < [5]\) and \(\Gamma = \{q_1 : ([1], [5]), q_2 : ([2], [3]), q_3 : ([6], [6]), q_4 : ([4], [4])\}. \) If an architecture has sufficiently many queues such that one queue can be dedicated to each forwarding path \(fu_1 \rightarrow fu_2\), program (4.1) can be transformed to

\[
\begin{align*}
[1'] & \text{ldc } q_5 & [3'] & \text{ldc } q_2 & [5'] & \text{mul } q_2 q_1 q_1 & [7'] & \text{st } q_3 q_4 \\
[2'] & \text{ldc } q_2 & [4'] & \text{ldc } q_4 & [6'] & \text{add } q_5 q_1 q_3
\end{align*}
\] (4.2)

where the new queue \(q_5\) is dedicated to forwardings \(\text{MEM} \rightarrow \text{ALU}\) and the other queues are assigned to forwardings

\[
q_1 : \text{MUL} \rightarrow \text{ALU}, q_2 : \text{MEM} \rightarrow \text{MUL}, q_3 : \text{ALU} \rightarrow \text{MEM} \text{ and } q_4 : \text{MEM} \rightarrow \text{MEM}.
\]

The typing for the transformed program reads \((\Delta, <) \vdash t' : 1 \rightarrow 1\) where \([2'] < [3']\) is the only conflict. This constraint is easily seen to be fulfilled by \(<\) as instructions \([2']\) and \([3']\) both execute on \(\text{MEM}\). The program is deterministic for all \(A\) as the shape of initial configurations has no influence on the execution order of instructions in the same pipeline. Functionally, distributed and sequential executions of program (4.2) also agree with the configuration resulting from (sequential execution of) program (4.1).

Not every architecture has enough operand queues for the above transformation to be applied – the number of operand queues appears to grow quadratically in the number of functional units. Furthermore, some programs need more than one queue for each forwarding path in order to guarantee functional correctness. For example, suppose \(\gamma(q_1) = \text{ALU} \text{ and } \gamma(q_2) = \gamma(q_3) = \text{MEM}\). The program

\[
\begin{align*}
[1] & \text{ldc } q_1 & [3] & \text{ldc } q_4 q_1 & [5] & \text{add } q_1 q_1 q_3 \\
[2] & \text{dec } q_1 q_2 & [4] & \text{ldc } q_4 & [6] & \text{st } q_3 q_2
\end{align*}
\] (4.3)
4.3. Dealing with non-determinism

cannot be blindly converted into the form where only one queue of connectivity $\text{ALU} \rightarrow \text{MEM}$ is employed. The resulting program

$$
[1] \text{ldc} 4 \ q_1 \quad [3] \text{ldc} 6 \ q_1 \quad [5] \text{add} \ q_1 \ q_2 \ q_2 \\
[2] \text{dec} \ q_1 \ q_2 \quad [4] \text{ldc} 8 \ q_1 \quad [6] \text{st} \ q_2 \ q_2
$$

would be functionally different as the values inside queue $q_2$ are swapped, and the unsymmetric instruction [6] would hence produce an incorrect result. Occasionally, the compiler might be able to reorder instructions to recover the correct behaviour, but as a general transformation rule, the above translation is incorrect.

However, an architecture might have a mechanism for varying the connectivity pattern, and for these architectures the above transformation might be appropriate. Consequently, the number of queues does not necessarily have to grow quadratically in the number of functional units, and at certain times more than one queue may be available for a particular forwarding path. For example, in reconfigurable architectures such as field-programmable gate arrays, the communication pattern may be modified at predefined moments in time. The above descriptions such as $q_1 : \text{MUL} \rightarrow \text{ALU}$ can thus be interpreted as generalisations of the association function $\gamma$ which describe the connectivity needed for a particular region of code. This approach may be extensible to verify ordinary code and reconfiguration instructions in combination, using a type-based calculus where connectivity descriptions are part of the typing context. In architectures where dynamic reconfiguration is not possible, such a calculus for regions of connectivity may still be useful. In this context, connectivity descriptions capture an aspect of the scheduling policy: for the code region in question, operand queues are used as described by the context, and at the boundaries between regions a virtual reconfiguration is performed, i.e. a change of usage policy.

4.3.3 Data-dependencies

In addition to serialisation by pipeline-dependencies, a compiler may fulfil serialisation constraints by exploiting data-dependencies. Operand queues serialise the execution of the producer of a value and its consumer. Each item needs to be inserted into an operand queue before it can be consumed, and consumption removes an item from its
queue. For example, program

\[ t = [1] \text{dc } 4 \text{ q1 } [2] \text{dec } q1 \text{ q1} \]  \hspace{1cm} (4.5)

can be typed \((\Gamma, [1] < [2]) \vdash t : 1 \rightarrow q_1\). For an initial configuration of shape 1, no interleaving of \(t\) will execute \([2]\) before \([1]\), so the race hazard in fact does not occur and the constraint can be discharged. Serialisation using data-dependencies can be combined with pipeline-dependencies. Program

\[ [1] \text{dc } 7 \text{ q2 } [2] \text{dc } 4 \text{ q1 } [3] \text{dec } q1 \text{ q2} \]  \hspace{1cm} (4.6)

carries the constraint \([1] < [3]\) which can be fulfilled for any configuration of shape 1 by observing that \([2]\) is always pipeline-dependent on \([1]\), and that \([3]\) is data-dependent on \([2]\) relative to that shape, so \([1]\) always precedes \([3]\) and the serialisation constraint is fulfilled.

Serialisation by operand queues concerns the observed data-dependencies, and the compiler should ensure that these coincide with the true data-dependencies in the original program. At a low level, analysing data-dependencies is more complex than analysing pipeline-dependencies because the former ones are influenced by contextual instructions and the shape of initial configurations. For example, prefixing program (4.5) by code of type \(A \rightarrow q_1\) where \(A\) is arbitrary destroys the dependency between \([1]\) and \([2]\) and results in a race hazard. In particular, data-dependencies are not preserved by the weakening rule.

**Definition 13.** Let \(t = \text{ins}_1 \ldots \text{ins}_m \ldots \text{ins}_n\) where \(m \geq 1\) and \(\text{ins}_i = [n_i] \text{code}_i\). Then \(\text{ins}_m\) is called the \(i\)-th writer to \(q\) in \(t\) if there are \(l, X, Y\) and \(k > 0\) such that \(\otimes_{j=1}^{m-1} B_{\text{code}_j} = q^l \otimes X, B_{\text{code}_m} = q^k \otimes Y, \text{prime}(q, X \otimes Y)\) and \(l < i \leq k + l\).

Likewise, \(\text{ins}_m\) is called the \(i\)-th reader from \(q\) in \(t\) if there are \(l, X, Y\) and \(k > 0\) such that \(\otimes_{j=1}^{m-1} A_{\text{code}_j} = q^l \otimes X, A_{\text{code}_m} = q^k \otimes Y, \text{prime}(q, X \otimes Y)\) and \(l < i \leq k + l\).

For \(q = q_1 = q_2\), an instruction like \(\text{add } q_1 q_2 q_3\) is the \(i\)-th as well as the \((i + 1)\)-th reader from \(q\) in \(t\), and a \(\text{dup1}\) instruction may similarly be a multiple writer.

**Definition 14.** For \(A\) with \(t : A \rightarrow B, [m]\) is data-dependent on \([n]\) for \(A\), written \([n] <_A [m]\), if there are \(q, i > 0\) and \(X\) such that \([n]\) is the \(i\)-th writer to \(q\) in \(t\) and \([m]\) is the \((i + k)\)-th reader, where \(A = q^k \otimes X\) and \(\text{prime}(q, X)\).
Proposition 13. Let $t : A \rightarrow B$. Then $\prec_A$ respects the order of instructions in $t$.

Proof. We have to show that for $[n] \prec_A [m]$ there are $r$, $s$ and $u$ with

$$t = r[n] \text{code}_n s[m] \text{code}_m u.$$ 

For $[n] \prec_A [m]$ there are by definition $q$, $i > 0$ and $X$ such that $[n]$ is the $i$-th writer to $q$ in $t$ and $[m]$ is the $(i + k)$-th reader, where $A = q^k \otimes X$ and $\text{prime}(q, X)$. In particular, instructions $[n]$ and $[m]$ occur in $t$, and $B_{\text{code}_n} = q^{kn} \otimes Y$ and $A_{\text{code}_m} = q^{km} \otimes Z$ hold with $k_n > 0$ and $k_m > 0$ and $\text{prime}(q, Y \otimes Z)$.

Suppose $[m]$ occurs before $[n]$, i.e. there are $r'$, $s'$ and $u'$ with

$$t = r'[m] \text{code}_m s'[n] \text{code}_n u'.$$

Since $t : A \rightarrow B$ holds, there are $C, \ldots, F$ with

$$r' : A \rightarrow C \quad [m] : C \rightarrow D \quad s' : D \rightarrow E \quad [n] : E \rightarrow F \quad u' : F \rightarrow B$$

$$r'[m] \text{code}_m s'[n] \text{code}_n u' : A \rightarrow B$$

This typing yields

$$k_1 = k + k_2 - k_3$$

whenever $C = q^{k_1} \otimes X_1$, $B_{\text{code}} = q^{k_2} \otimes X_2$, $A_{\text{code}} = q^{k_3} \otimes X_3$ and $\text{prime}(q, X_1 \otimes X_2 \otimes X_3)$ hold.

We now perform an induction on $i$, aiming at a contradiction to the typability of $t$.

Case $i = 1$. As $[n]$ is the first writer to $q$ in $t$, there are no writers to $q$ in $r'[m] s'$. In particular, there are no writers to $q$ in $r'$, so all instructions $[.] \text{code}$ in $r'$ fulfil $\text{prime}(q, B_{\text{code}})$ and $k_2 = 0$ follows.

On the other hand, $[m]$ is the $(k + 1)$-th reader from $q$ in $t$, so $r'$ contains $k$ readers, hence $k_3 = k$. As $A$ contains $q^k$ and there are $k$ readers in $r'$ and no writers, $\text{prime}(q, C)$ holds:

$$k_1 = k + k_2 - k_3 = 0.$$ 

Consequently, no composition of $r'$ with $[m]$ is possible as $[m]$ is a reader from $q$, so requires $C = A_{\text{code}_m} \otimes Z = q \otimes Z' \otimes Z$ for some $Z'$ and $Z$. 
Case $i > 1$. Then $r'$ includes at most $i-1$ writers to $q$ as exactly $i-1$ writers precede $[n]$ in $t$. Hence $k_2 \leq i-1$. As $k+i-1$ readers precede $[m]$ in $t$, there must be $k+i-1$ readers in $r'$, i.e. $k_3 = k+i-1$. Hence,

$$k_1 = k + k_2 - k_3 \leq k + i - 1 - k - i + 1 = 0.$$ 

Again, $\text{prime}(q,C)$ follows and no composition with $[m] : q \otimes C' \rightarrow D$ is possible.

In both cases we obtain a contradiction to the well-typedness of $t$.

Program (4.5) suggests that $<_A$ can be used for discharging constraints, as $[1] <_1 [2]$ holds. Likewise, program (4.6) motivates us to use pipeline-dependencies and data-dependencies in combination, as $[2] <_1 [3]$ and $[1] < [2]$ holds, so the constraint $[1] < [3]$ can be fulfilled. However, some care is needed, since data-dependencies do not entail sequentiality, and neither does their union with pipeline-dependencies.

**Example.** Let $\gamma(q_1) = \text{ALU}$ and $\gamma(q_2) = \text{MUL}$ and

$$t = [1] \text{dc} 3 \ q_1[2] \text{id}^\text{\text{MUL}} \ q_2 \ q_1[3] \text{dc} \ q_1 \ q_3.$$  (4.7)

For $A = q_2$ and $B = q_1 \otimes q_3$ there is a derivation for $(\Gamma, [1] < [2]) \vdash t : A \rightarrow B$, and $[1] <_A [3]$ holds since $[1]$ is the first writer to $q_1$, $[3]$ is the first reader from $q_3$ and $\text{prime}(q_1, A)$. There are no pipeline-dependencies, so $<_A$ and $<_ \cup <_A$ consist of the pair $([1], [3])$.

Take $u = [2][3][1]$. Then $u : A \rightarrow B$ is an interleaving for $t$ and $A$ as $\pi_u = [3] \parallel [2] \parallel [1] = \pi_t$ holds and for any $C$ with $\text{shape}(C) = A$ there is a $D$ with $C \xrightarrow{u} D$. However, $u$ does not respect $<_A$, hence $t$ is not sequential for $<_A$ or $<_ \cup <_A$ and $A$.

Consequently, an analogue to Proposition 12 for data-dependencies would fail. The data-dependency in the program (4.7) is only fulfilled provided that $[3]$ is bound never to consume the result of $[2]$. Of course, in order to prove this fact one needs to show that any interleaving executes $[1]$ before $[2]$ -- which is what we set out to prove in the first place when aiming at discharging $[1] < [2]$. This appears to imply that data-dependencies cannot be used for discharging constraints as they need sequentiality before they can be used for proving sequentiality. However, pipeline-dependencies
and data-dependencies can be combined in a different way. Instead of taking their union, we extend the set of pipeline-dependencies by a single data-dependency at a time. An entry \([n] <_A [m]\) is added provided that all instructions which are in an output conflict with \([n]\) are already related to \([n]\). In fact, this inductive process succeeds if \(<_A\) is replaced by any order \(\leq\) for which \(t\) is sequential.

**Proposition 14.** Let \(t\) be sequential for \(\subset\) and \(A\) and let \([n] <_A [m]\). For all instructions \([l]\) in \(t\) with \(n \neq l \neq m\), let

- \([l] \subset [n] \); or \([n] \subset [l]\) hold if there is a \(q\) which occurs in \(B_{\text{code}_1}\) and \(B_{\text{code}_n}\)
- \([l] \subset [m] \); or \([m] \subset [l]\) hold if there is a \(q\) which occurs in \(A_{\text{code}_1}\) and \(A_{\text{code}_m}\).

If \(\subset \cup \{(n, [m])\}\) is a partial order, then \(t\) is sequential for \(\subset \cup \{(n, [m])\}\) and \(A\).

**Proof.** Let \(s\) be an interleaving for \(t\) and \(A\), i.e. \(\pi_s = \pi_t\) and for any \(C\) with \(\text{shape}(C)\) there is a \(D\) such that \(C \preceq D\). We have to show that \(s\) respects \(\subset \cup \{(n, [m])\}\) where \(\subset\) and \([n] <_A [m]\) are as given in the proposition and the side condition on competitors \([l]\) is fulfilled.

Any pair \(([k_1], [k_2])\) in \(\subset \cup \{(n, [m])\}\) with \(([k_1], [k_2]) \neq (n, [m])\) is contained in \(\subset\) (modulo the transitive closure) and hence respected by \(s\) by assumption, i.e. \([k_1]\) occurs before \([k_2]\) in \(s\). Hence, we only have to show that \(s\) respects \((n, [m])\).

Suppose \([m]\) occurs before \([n]\) in \(s\). Then \(s\) can be written as

\[s = r[m] u[n] v\]

for some code sequences \(r\), \(u\) and \(v\). For \(C\) with \(\text{shape}(C) = A\) we have \(C \preceq D\) for some \(D\), so there are \(D_1\) and \(D_2\) such that

\[C \xrightarrow{r} D_1 \xrightarrow{[m]} D_2 \xrightarrow{u[n]v} D.\]

Since \(s\) is an interleaving of \(t\) for \(A\) which respects \(\subset\),

- \(r\) contains all instructions \(\text{ins}\) with \(\text{ins} \subset [m]\).
- \(u[n] v\) contains all instructions \(\text{ins}\) with \([m] \subset \text{ins}\).
- \(r[m] u\) contains all instructions \(\text{ins}\) with \(\text{ins} \subset [n]\).
Chapter 4. Asynchronous queue machines

- $v$ contains all instructions $ins$ with $[n] \subset ins$.

Since $[n] \prec_A [m]$ holds, there are $q$ and $i > 0$ such that $[n]$ is the $i$-th writer to $q$ in $t$ and $[m]$ the $(k+i)$-th reader, where $A = q^h \otimes X$ and $\text{prime}(q,X)$. By the side condition on instructions forwarding to the same queue as $[n]$, instructions $[l]$ which are the $j$-th writer to $q$ in $t$ fulfil

- $[l] \subset [n]$ (or $l = n$) if $1 \leq j < i$
- $[n] \subset [l]$ (or $l = n$) if $j > i$.

In particular, any writer $[l]$ to $q$ which occurs in $r$ cannot fulfil $[n] \subset [l]$ (see above, $s$ obeys $\subset$), so it must fulfil $[l] \subset [n]$ as $l = n$ does not hold. There are at most $i - 1$ such writers as other writers fulfil $[n] \subset [l]$.

Similarly, readers $[l]$ from $q$ which occur in $t$ before $[m]$ must occur in $r$ as they fulfil $[l] \subset [m]$ – if they occurred in $u[n]v$ then $s$ would not respect $\subset$. There are exactly $k + i - 1$ such readers since $[m]$ is the $(k+i)$-th reader and other readers fulfil $[m] \subset [l]$.

Consequently, for $\text{shape}(D_1) = q^h \otimes Y$ with $\text{prime}(q,Y)$ we obtain

$$h \leq k + (i - 1) - (k + i - 1) = 0.$$  

This contradicts $D_1 \xrightarrow{[m]} D_2$ as $[m]$ is a reader from $q$, so $q$ is a factor of $A_{\text{code}_m}$ and $D_1$ must consequently contain at least one entry in $q$. $\square$

Plugging in pipeline-dependencies $\ll$ for the initial $\subset$ is merely a convenient choice for starting the iterative process – an axiom in a formal inductive derivation system for sequentialisation orders in which the process described in the proposition represents the step case. Using pipeline-dependencies ensures that the side condition relating readers $[l]$ with $[m]$ is always fulfilled.

**Example.** In program (4.5), there is no $[1] \neq [l] \neq [2]$, so the side condition for extending the (empty!) pipeline-dependencies is trivially fulfilled and we can discharge $[1] \prec [2]$.

4.3. Dealing with non-determinism

In program (4.7), the pipeline-dependencies do not get the iterative process started. All instructions queue in different pipelines, and the data-dependency $[1] <_A [3]$ can never be used as the condition on competitor $[1'] = [2]$ of $[1]$ is not fulfilled.

4.3.4 Other sequentialisation mechanisms

We already discussed program transformations as a means for creating pipeline-dependencies, and other transformations may prove useful, too. For example, the constraint $[1] < [2]$ in

$$[1]1d q_1 q_2 [2]dec q_3 q_2$$

may be resolved by introducing artificial dependencies. A compiler could (for the typing $A \rightarrow B$ where $A = q_1 \otimes q_3$) insert additional instructions to yield

$$[1]1d q_1 q_2 [1']1dc 0 q_4 [1''] skip^{\text{ALJ}} q_4 [2]dec q_3 q_2 : A \rightarrow B$$

and thus discharge $[1] < [2]$ using $[1] \ll [1'] <_A [1''] \ll [2]$. Little instruction-level parallelism is compromised as the instructions $[1]$ and $[2]$ are not supposed to execute concurrently anyway. On the other hand, performance degrades if this transformation is employed excessively as additional instructions need to be loaded from memory, inserted into pipelines and executed. Therefore, a compiler should first aim to achieve sequentialisation by reordering instructions (aiming at utilising more pipeline-dependencies) before introducing such instructions.

SCALP contains a sequentialisation instruction with behaviour of the shape

$$\gamma(op_1) = \gamma(op_2)$$

$$\text{seq } op_1 op_2 op_3 op_4 : op_1 \otimes op_2 \rightarrow op_3 \otimes op_4$$

which awaits operands in $op_1$ and $op_2$ and then sends them unchanged to $op_3$ and $op_4$. However, this requires operands to be communicated more often than necessary and apparently forces $op_1$ and $op_2$ to be distinct.

More sophisticated assembly languages might encode the above communication patterns into the opcodes by allowing instructions to exchange notifications of completion. These signals could be realised by special-purpose operand queues $p_i$ and encodings.
the (initially empty) queue $p_1$ achieves the same result as the above code insertion if $\text{ldp}$ and $\text{pdec}$ are typed via axioms of the form

$$\text{AX-P: } 1\text{dp } op_1 \text{ op}_2 \text{ p : X } \otimes op_1 \text{ op}_2 \text{ p } \otimes X \gamma(op_1) = \text{MEM}$$

and

$$\text{P-AX: } \text{pdec } p \text{ op}_1 \text{ op}_2 : p \otimes op_1 \otimes X \rightarrow op_2 \otimes X \gamma(p) = \gamma(op_1) = \text{ALU}$$

and executed by

$$\text{LD-P: } C \xrightarrow{(\text{read(MEM), } a, op_1)} (Q, M) \rightarrow (Q, M) \xrightarrow{(\text{write}, M(a) \text{, op}_2)} E \xrightarrow{(\text{write}, \cdot, p)} D$$

and

$$\text{P-DEC: } C \xrightarrow{(\text{read(ALU), } \cdot, p)} F \rightarrow F \xrightarrow{(\text{read(ALU), } a, op_1)} E \xrightarrow{(\text{write}, a-1, \text{op}_2)} D$$

where $\cdot$ denotes an anonymous value. Queues $p_i$ need only be wide enough to carry a single signal (no data) and may thus be not too expensive in hardware, although some runtime costs are involved. A high-performance compiler would replace ordinary instructions with instructions of the new form only when other sequentialisation techniques fail.

Our calculus thus has the necessary ingredients for exploring the trade-offs between program analysis, code transformation techniques and hardware modifications. If the dynamic semantics is equipped with performance annotations a system designer can study design alternatives by simulation.

### 4.4 Program execution

Programs including jump instructions are treated similarly to Section 3.5. Basic blocks consist of instruction sequences where at most the last instruction may be a jump, and
programs \( P \) are represented as pairs \( P = (N, A) \) of basic blocks and control flow arrows. Modern processor implementations employ a variety of means how execution of instructions inside basic blocks and scheduling of new blocks interact. Our dynamic semantics models a scheme where the loading of instructions from memory may interleave arbitrarily with the execution of instructions. Whenever a branch target becomes known, the issue unit may load instructions from memory and insert them into the pipelines.

### 4.4.1 Dynamic semantics

The dynamic semantics generalises the ternary parallel programs of the previous sections to programs with four components \( t_1 \parallel t_2 \parallel t_3 \parallel t_4 \) where the rightmost component represents the BU pipeline. Composition \( \pi_1 \pi_2 \) and conversion of a sequential program \( t \) into its parallel program \( \pi_t \) are defined similarly to Section 4.1. Formally, reduction \( C \xrightarrow{\text{fu}} D \) is defined by the rules

\[
\text{3FU}_1 \quad \frac{C \xrightarrow{[n]\text{code}} D}{C \xrightarrow{\text{fu}} D} \quad \begin{cases} \text{FU}(\text{code}) = \text{fu} \\ C = (Q, M, \bar{n}) \end{cases}
\]

and

\[
\text{3FU}_2 \quad \frac{C \xrightarrow{s\text{fu}} E \quad E \xrightarrow{t\text{fu}} D}{C \xrightarrow{st\text{fu}} D} \quad C = (Q, M, \bar{n})
\]

where the side conditions \( C = (Q, M, \bar{n}) \) indicate that configurations are of the ternary shape from Section 3.5.1 (Figure 3.8), and similarly for

\[
\text{3DIS}_1 \quad \frac{C \xrightarrow{\pi_t} D}{C \xrightarrow{\text{fu}} D} \quad C = (Q, M, \bar{n}) \quad \text{and} \quad \text{3DIS}_2 \quad \frac{C \xrightarrow{\pi_1} E \quad E \xrightarrow{\pi_2} D}{C \xrightarrow{\pi_1\pi_2} D} \quad C = (Q, M, \bar{n})
\]

The propositions and statements relating distributed and sequential dynamic execution, can be transferred to ternary configurations and the extended instruction set.

Interleaving the execution of instructions with the loading of basic blocks depending on the outcome of branch conditions is achieved through a relation \( C \xrightarrow{w} D \), similar to the rules EXECUTE and COMPOSE in Section 3.5.1. We present a semantics where the issue unit is represented as a forth component of configurations, and consists of a
parallel program \( \pi \). Loading the basic block labelled \( n \) from the memory inserts the instructions \( n(n) \) into the issue unit. Arbitrary prefixes of \( \pi \) may then enter the pipelines and execute completely unsupervised. Consequently, the reloading and instruction execution may interleave freely. Formally, the dynamic semantics for distributed program execution is given by the rules EXEC, LOAD and COMPOSE.

\[
\text{LOAD} \quad (Q, M, n, \pi) \xrightarrow{n} (Q, M, \text{nil}, \pi \pi_t) \quad n(n) = t
\]

\[
\text{EXEC} \quad (Q, M, \tilde{m}, \pi_1 \pi_2) \xrightarrow{\pi_1} (P, N, \tilde{m}) \quad \lambda (Q, M, \tilde{m}, \pi_1 \pi_2) \xrightarrow{\lambda} (P, N, \tilde{m}, \pi_2)
\]

\[
\text{COMPOSE} \quad C \xrightarrow{\nu} E \quad E \xrightarrow{\nu} D \quad C \xrightarrow{\nu} D
\]

Rule COMPOSE formally agrees with the identically named rule in Section 3.5.1, but now relates configurations with four components.

### 4.4.2 Static semantics

Instantiating the typing rules from Section 4.3.1 (Figure 4.4) to \( \text{if} \) and \( \text{jmp} \) amounts to the explicit rules

\[
[\text{if}] \quad \text{if} \ op \ n_1 \ n_2 : A \rightarrow B \quad \text{and} \quad \text{jmp} \ m : A \rightarrow B
\]

and race conditions within basic blocks can be dealt with as discussed before. It should be noted that the analysis of data-dependencies \( <_A \) has to be performed with respect to the initial shape \( A \) arising from matching a block with its neighbours, since applying the weakening rule changes the observed data-dependencies. Consequently, determinism and safety can be achieved for the bodies of basic blocks as discussed in the previous sections.

For typing programs \( P = (N, A) \), we insert a judgement \( (\Gamma, <) \vdash n : A \rightarrow B \) into the type environment \( \Sigma \) provided that \( n \in \text{dom} N \) and \( (\Gamma, <) \vdash n(n) : A \rightarrow B \) holds such that \( t \) is safe for \( A \). Again \( \Sigma \) contains exactly one such entry for each \( n \in \text{dom} N \). Given that pipelines might include instructions of several basic blocks we also have to consider
race hazards between instructions of different basic blocks. We therefore introduce additional constraints \([l] < [k]\) between instructions labelled \([l]\) and \([k]\) whenever there is a queue \(q\) such that \([l]\) is the last writer to \(q\) in its block \(B(n)\), \([k]\) is the first writer to \(q\) in block \(B(m)\) and there is a path from \(n\) to \(m\) in \(A\). Since different copies of a basic block execute identical sets of instructions, it suffices to consider simple paths \(n, \ldots, m\) where at most the two outermost nodes \(n\) and \(m\) are equal and no intermediate block mentions \(q\). We denote such a path by \(SP(n, op, m)\). In the typing rule

\[
\Sigma = \bigcup_{n \in dom \mathbb{N}} \{ (\Gamma_n, <_n) \vdash n : A_n \rightarrow B_n \}
\]

\(SP(n, op, m)\) is a queue \(\text{Proof.}\) The constraints \(\{<_n | (\Gamma_n, <_n) \vdash n : A_n \rightarrow B_n \in \Sigma \} \cup\) are interpreted modulo instantiations due to loops: in a constraint \([l] < [k]\) the right \([l]\) refers to the instance of \([l]\) in the next iteration.

These constraints suffice for making any program slice deterministic:

**Proposition 15.** Let \((\Sigma, <) \vdash P : \Box\) and \((n_1, \ldots, n_k)\) be a path in \(A\). Then there are \(\Gamma\) and \(<\) such that \((\Gamma, <) \vdash B(n_1) \ldots B(n_k) : A_{n_1} \rightarrow B_{n_k}\) holds and \(<\) is contained in \(\{<_n | (\Gamma_n, <_n) \vdash n : A_n \rightarrow B_n \in \Sigma \} \cup\).

**Proof.** Case \(k = 1\). Since \(n_1 \in dom \mathbb{N}\), we have \((\Gamma_{n_1}, <_{n_1}) \vdash n_1 : A_{n_1} \rightarrow B_{n_1} \in \Sigma\) for some \(\Gamma_{n_1}, <_{n_1}, A_{n_1}\) and \(B_{n_1}\). By the definition of \(\Sigma\), \((\Gamma_{n_1}, <_{n_1}) \vdash B(n_1) : A_{n_1} \rightarrow B_{n_1}\) holds. Take \(<_n = <_1\) and \(\Gamma = \Gamma_{n_1}\).

Case \(k > 1\). For a path \((n_1, \ldots, n_k, n_{k+1})\) in \(A\) and \(i \leq k + 1\), there are \(\Gamma_{n_i}, <_{n_i}, A_{n_i}\) and \(B_{n_i}\) with \((\Gamma_{n_i}, <_{n_i}) \vdash B(n_i) : A_{n_i} \rightarrow B_{n_i} \in \Sigma\). By the definition of \(\Sigma\), this implies

\[
(\Gamma_{n_i}, <_{n_i}) \vdash B(n_i) : A_{n_i} \rightarrow B_{n_i} \in \Sigma\text{ for all }i \leq k + 1
\]

(4.8)

For \(1 \leq i \leq k\) it is \((n_i, n_{i+1}) \in A\) by the definition of a path, so \(B_{n_i} = A_{n_{i+1}}\) by the side-condition for \(P : \Box\). In particular, \(B_{n_k} = A_{n_{k+1}}\).
By induction hypothesis, there are Δ and □ such that

$$\Delta, \Box \models \Pi(n_1) \ldots \Pi(n_k) : A_{n_1} \circ B_{n_k}$$

and □ is contained in $$\bigcup_{n \in \text{dom}_N} \{< n \mid (\Gamma_n, < n) \models \Pi(n) : A_n \circ B_n \in \Sigma \} \cup <.$$ Also, specialising property (4.8) for $$i = k + 1$$ yields

$$(\Gamma_{n_{k+1}}, <_{n_{k+1}}) \models \Pi(n_{k+1}) : A_{n_{k+1}} \circ B_{n_{k+1}}.$$

The cut rule (see Figure 4.4), thus implies

$$(\Gamma, <) \models \Pi(n_1) \ldots \Pi(n_{k+1}) : A_{n_1} \circ B_{n_{k+1}}$$

for $$\Gamma = \Delta \# \Gamma_{n_{k+1}}$$ and < equal to

$$\Box \cup <_{n_{k+1}} \cup \{(m, g) \mid \text{op : (l, m) } \in \Delta \text{ and op : (g, h)} \in \Gamma_{n_{k+1}} \}.$$ 

As remarked above, □ is contained in $$\bigcup_{n \in \text{dom}_N} \{< n \mid (\Gamma_n, < n) \models \Pi(n) : A_n \circ B_n \in \Sigma \} \cup <$$, and so is (obviously) $$<_{n_{k+1}}$$. For showing that all of < is contained in this relation, it remains to show that pairs $$(m, g)$$ where op : (l, m) $$\in \Delta$$ and op : (g, h) $$\in \Gamma_{n_{k+1}}$$ are contained in this relation.

For any such pair $$(m, g)$$, there is a $$j < k + 1$$ such that instruction m is the last writer to op in $$\Pi(n_j)$$ and g is the first writer to op in $$\Pi(n_{k+1})$$, and no instruction in $$\Pi(n_{j+1}), \ldots, \Pi(n_k)$$ writes to op. In particular, all blocks $$\Pi(n_{j+1}), \ldots, \Pi(n_k)$$ are different from $$\Pi(n_j)$$. By removing loops in $$\Pi(n_{j+1}), \ldots, \Pi(n_k)$$, the path $$(\Pi(n_j), \ldots, \Pi(n_k), \Pi(n_{k+1}))$$ contains a simple path $$SP(\Pi(n_j), op, \Pi(n_{k+1}))$$. By the side condition on $$(\Sigma, <) \models P : \Box$$, the pair $$(m, g)$$ is hence related by $$[m] < [g]$$.

Consequently, distributed executions are safe if we can fulfil the constraints in <. For inserting the entries $$(\Gamma_n, < n) \models n : A_n \circ B_n$$ into $$\Sigma$$ we had to prove them safe, but no further work needs to be spent on their bodies.

The following result relates distributed program execution as defined in this section to the dynamic semantics for sequential execution in Section 3.5.

**Proposition 16.** Let $$(\Sigma, <) \models P \circ \Box, n \in \text{dom}_N$$ and $$(\Gamma_n, < n) \models \Pi(n) : A_n \circ B_n \in \Sigma.$$
4.4. Program execution

1. If \((Q, M, n) \xrightarrow{w} (P, L, \bar{m})\) then \((Q, M, n, \pi_e) \xrightarrow{w} (P, L, \bar{m}, \pi_e)\).

2. Let \(C = (Q, M, n, \pi_e)\), \(\text{shape}(C) = A_n\) and \(C \xrightarrow{w} (P, L, \bar{m}, \pi_e)\). Let \(\sqsubset\) be a partial order containing \(<\) and for all simple paths \(SP(n(n_i), op, n(n_j))\) let the instruction sequence \(n(n_i) \ldots n(n_j)\) : \(A_n \rightarrow B_{n_j}\) be sequential for \(\sqsubset\) and \(A_n\). Then \((Q, M, n) \xrightarrow{w} (P, L, \bar{m})\).

**Proof.**

1. Rule induction for \((Q, M, n) \xrightarrow{w} (P, L, \bar{m})\) (see Section 3.5.1).

   **Case** EXECUTE. By the definition of the rule EXECUTE, \(w\) has the form \(n\) and \((Q, M, \text{nil}) \xrightarrow{t} (P, L, \bar{m})\) holds where \(t = N(n)\). Proposition 7 implies \((Q, M, \text{nil}) \xrightarrow{\pi_e} (P, L, \bar{m})\), so

   \[
   \begin{align*}
   &\frac{(Q, M, \text{nil}) \xrightarrow{\pi_e} (P, L, \bar{m}) \quad \text{by rule COMPOSE, where } C = (Q, M, \text{nil}, \pi_e).}{}
   &\frac{(Q, M, n, \pi_e) \xrightarrow{n} C \quad \text{and for all simple paths } SP(n(n_i), \text{op}, n(n_j)) \quad \text{let the instruction sequence } n(n_i) \ldots n(n_j) : A_n \rightarrow B_{n_j} \quad \text{be sequential for } \sqsubset \quad \text{and } A_n.}{(Q, M, n, \pi_e) \xrightarrow{n} (P, L, \bar{m}, \pi_e)}
   \end{align*}
   \]

   **Case** COMPOSE. By the definition of the rule, there are \(v, w\) and \(E\) such that \((Q, M, n) \xrightarrow{v} E\) and \(E \xrightarrow{w} (P, L, \bar{m})\), so for \(E = (Q', M', \bar{l})\) it must be \(\bar{l} \neq \text{nil}\), hence \((Q, M, n) \xrightarrow{v} (Q', M', \bar{l})\). The induction hypothesis consequently yields

   \[
   (Q, M, n, \pi_e) \xrightarrow{w} (Q', M', \bar{l}, \pi_e)
   \]

   and

   \[
   (Q', M', \bar{l}, \pi_e) \xrightarrow{w} (P, L, \bar{m}, \pi_e).
   \]

   Now \((Q, M, n, \pi_e) \xrightarrow{vw} (P, L, \bar{m}, \pi_e)\) follows by rule COMPOSE.

2. Induction on the length of \(w\).

   For \(|w| = 0\) we have \(w = \lambda\), so the only rule applicable is EXEC, but no reduction is possible as the forth component of \(C\) is \(\pi_e\).

   For \(w = 1\), the derivation of \(C \xrightarrow{w} (P, L, \bar{m}, \pi_e)\) must have the form

   \[
   \begin{array}{ccc}
   \text{LOAD} & \xrightarrow{n} & \text{EXEC} \xrightarrow{\lambda} \text{COMPOSE} \\
   \text{COMPOSE} & \xrightarrow{n} & \text{EXEC} \\
   \text{COMPOSE} & \xrightarrow{n} & \text{COMPOSE} \\
   \end{array}
   \]

   : 

   \[
   \begin{align*}
   &\frac{C \xrightarrow{n} D \quad \text{EXEC} \xrightarrow{\lambda} E \quad \text{COMPOSE} \xrightarrow{n} \text{COMPOSE} \quad \text{COMPOSE}}{}
   &\frac{C \xrightarrow{n} \text{EXEC} \xrightarrow{\lambda} E \quad \text{COMPOSE} \xrightarrow{n} \text{COMPOSE}}{}
   &\frac{C \xrightarrow{n} (P, L, \bar{m}, \pi_e)}{}
   \end{align*}
   \]
or

\[
\begin{array}{c|c|c}
\text{LOAD} & \text{EXEC} & \text{COMPOSE} \\
\hline
C \xrightarrow{n} D & \frac{D \xrightarrow{\lambda} E}{E \xrightarrow{\lambda} (P, L, \tilde{m}, \pi_e)} & D \xrightarrow{\lambda} (P, L, \tilde{m}, \pi_e) \\
\hline
\text{COMPOSE} & \frac{C \xrightarrow{n} (P, L, \tilde{m}, \pi_e)}{D \xrightarrow{\lambda} (P, L, \tilde{m}, \pi_e)} \\
\end{array}
\]

because the only reduction possible from C is LOAD. By LOAD’s definition, D is of the form \((Q, M, \text{nil}, \pi_t)\), and both derivation trees yield \((Q, M, \text{nil}, \pi_t) \xrightarrow{\lambda} (P, L, \tilde{m}, \pi_e)\). The instructions involved in this reduction are exactly \(n(n)\), and all single steps are applications of the rule \text{EXEC}. For each step, \(D = D_0 \xrightarrow{\lambda} D_1 \xrightarrow{\lambda} \ldots \xrightarrow{\lambda} D_m = E\), there are by rule \text{EXEC} prefixes \(\pi_i\) such that for the ternary configurations \(D_i\) corresponding to the configurations \(C\) the reductions \(D_i \xrightarrow{\pi_i} D_j\) hold. By Proposition 7 an \(u_i\) with \(\pi_{u_i} = \pi_i\) and \(D_i \xrightarrow{\pi_{u_i}} D'_i\), so \(u_1 \ldots u_m\) is an interleaving for \(n(n)\) and \(A\). As \(n(n)\) is safe for \(A\) due to the formation condition of \(\Sigma\) and the shape of \(D\), \(D'_0 \xrightarrow{n(n)} (P, L, \tilde{m})\) holds, i.e. \((Q, M, \text{nil}) \xrightarrow{n(n)} (P, L, \tilde{m})\). By rule \text{EXECUTE} we obtain \((Q, M, n) \xrightarrow{n} (P, L, \tilde{m})\).

For \(|\nu| > 1\), we have \(\nu = vu\) for some \(v\) and \(u\), and there is an \(E\) such that \(C \xrightarrow{\nu} E\) and \(E \xrightarrow{\nu} G\) where \(G = (P, L, \tilde{m}, \pi_e)\). For \(E = (R, N, \tilde{\pi}, \pi)\) we have two cases.

**Case** \(\pi = \pi_e\). Then \(\tilde{\pi} \neq \text{nil}\) must hold – otherwise no further reductions would be possible. The induction hypothesis yields

\[
(Q, M, n) \xrightarrow{\nu} (R, N, l) \text{ and } (R, N, l) \xrightarrow{\nu} (P, L, \tilde{m})
\]

so \((Q, M, n) \xrightarrow{\nu} (P, L, \tilde{m})\) holds by rule \text{COMPOSE}.

**Case** \(\pi = \pi_1 \pi_2 \neq \pi_e\). Since \text{COMPOSE} is associative, we can assume that the left branch of \(E \xrightarrow{\nu} G\) is either LOAD or \text{EXEC}. If it is a LOAD step, its index must be \(l\), i.e.

\[
\begin{array}{c|c|c|c}
\text{LOAD} & \text{COMPOSE} & \text{X} \\
\hline
E \xrightarrow{l} F & F \xrightarrow{\nu_1} D & \frac{F \xrightarrow{\nu_2} G}{X \xrightarrow{\nu_1 \nu_2} G} \\
\hline
\text{COMPOSE} & & \\
\end{array}
\]

and the step \(X\) cannot be a LOAD because \(F = (Q', M', \text{nil}, \pi_3, \pi_3)\) (again, step \(X\) is w.l.o.g. a single step). Hence, \(X\) must be \text{EXEC}, so \(\nu_1 = \lambda\). If
the reduction $F \xrightarrow{\lambda} D$ involves instructions from $N(l)$, we can delay them because $\prec$ is contained in $\subseteq$ and by Proposition 15 no race conditions exist in the composition $N(n_1) \ldots N(l)$ of the basic blocks $n_1, \ldots, l$, since $n_1, \ldots, l$ is a prefix of $w$ and thus a path in $A$. We can thus reorder and obtain $E \xrightarrow{\lambda^*} H \xrightarrow{I} D$ for some $H$. Repeated application leads to $H$ with empty issue unit, i.e. $H = (\varepsilon, \varepsilon, l, \pi_\varepsilon)$. Now the first case applies.

\[ \square \]

Again, for satisfying $[l] < [g]$ several natural serialisations can be used. In practice, analysis of data-dependencies along (simple) paths might be too costly. On the other hand, the condition on simple paths may be strengthened and thus be made more easy to verify. As pipeline dependencies can be analysed more efficiently, one can for example require that $FU(code_l) = FU(code_g)$ holds whenever $op : (k, l) \in \Gamma_n$ and $op : (g, h) \in \Gamma_m$ and $(n, m) \in A^+$. Additionally, control-dependencies may be used for discharging constraints along one particular path: any conditional jump between $l$ and $g$ which is dependent on $l$ guarantees that $l$ will have retired when $g$ is issued.

Consequently, typing $\Sigma \vdash p : \Box$ is sound for distributed execution. Provided that the constraints are met, an execution is either infinite or terminates in the same configuration as sequential execution (see Proposition 5).

### 4.5 Finite operand queues

Processor implementations of our computational models restrict operand queues to be of finite length. In contrast, the sequential and distributed dynamic semantics considered so far admitted queues with arbitrarily many entries. In this section, the dynamic semantics are modified such that for each operand queue a maximal number of elements is specified. No further execution of instructions is possible if all successor configurations violate the length restrictions. We modify the static semantics to determine length restrictions necessary for deadlock-free execution and examine the effect of relaxing length restrictions.

The dynamic semantics for execution under finite operand queues are obtained by re-
placing the rule WR by the rule WR-LIM(C)

\[
\text{WR-LIM}(C) \xrightarrow{\text{write},{a},q} \exists X. q^{[\text{write}]} \otimes X = C
\]

where C is a \(\otimes\)-product. We write \(C \overset{t}{\rightarrow} D\) and \(C \overset{\pi}{\rightarrow} D\) if \(\text{shape}(C) \otimes X = C\) holds for some \(X\) and \(C \overset{t}{\rightarrow} D\) and \(C \overset{\pi}{\rightarrow} D\) can be derived using WR-LIM(C) instead of WR. Using structural induction one can show that \(\text{shape}(D) \otimes Y = C\) holds for some \(Y\) whenever \(C \overset{t}{\rightarrow} D\) or \(C \overset{\pi}{\rightarrow} D\).

**Example.** For \(C = ([], M)\), \(D = ([q_1 \mapsto 3], M)\) and any \(M\), program

\[
t = [1] \text{dc} 4 \ q_1 \ [2] \text{dc} \ q_1 \ q_1
\]

fulfils \(C \overset{t}{\rightarrow}_{q_1 \otimes q_2} D\) but not \(C \overset{t}{\rightarrow}_{q_2} D\).

Sequential execution under finite operand queues is deterministic as \(C \overset{t}{\rightarrow} D\) implies \(C \overset{\pi}{\rightarrow} D\). Determinism is preserved when length limitations are relaxed, and distributed executability under identical limitations is guaranteed.

**Proposition 17.** If \(C \overset{t}{\rightarrow} D\) and \(X\) arbitrary then \(C \overset{\pi}{\rightarrow} C \otimes X \overset{\pi}{\rightarrow} D\).

**Proof.** For \(C \overset{\pi}{\rightarrow} D\) choose the interleaving \(t\) of \(\pi.\) For \(C \overset{t}{\rightarrow} C \otimes X \overset{\pi}{\rightarrow} D\) observe that \(\text{shape}(E) \otimes Y = C\) implies \(\text{shape}(E) \otimes Y \otimes X = C \otimes X\) and apply this to \(E = C\) and \(E = D\).

For eliminating deadlock due to operand queue overflow, Figure 4.6 shows a modified type system for sequential execution. Judgements \(\vdash_C t : A \rightarrow B\) capture operand queue lengths necessary for executing \(t\) in an initial configuration of shape \(A\). In fact, \(\vdash_C t : A \rightarrow B\) gives the tightest length limitations necessary for executing a program successfully.

**Proposition 18.** Let \(\vdash_C t : A \rightarrow B\) and \(\text{shape}(C) = A\). Then

1. \(A \otimes X = C = B \otimes Y\) for some \(X\) and \(Y\).

2. \(C \overset{t}{\rightarrow} D\) for some \(D\).
### 4.5. Finite operand queues

\[
\begin{align*}
\text{AX-LIM} & \quad \frac{}{\Gamma \vdash X \otimes A_{\text{code}} \otimes V \ \text{code} : X \otimes A_{\text{code}} \rightarrow X \otimes B_{\text{code}}} \quad \left\{ \begin{array}{c}
\text{SC(code)} \\
B_{\text{code}} \otimes A_{\text{code}} = V / U
\end{array} \right. \\
\text{CUT-LIM} & \quad \frac{C_1 \vdash C_2 \vdash C_1 \vdash D \quad C_2 \vdash B \quad C_1 \otimes V \vdash st : A \rightarrow B}{C_2 / C_1 = V / U}
\end{align*}
\]

Figure 4.6: Type system for finite operand queues

3. $C \overset{t}{\rightarrow} D \implies D = C \otimes W$ for some $W$.

**Proof.** All three parts are proven separately.

1. Induction on the rules for $\vdash_C t : A \rightarrow B$. In the rule for single instructions, $A = X \otimes A_{\text{code}}$ occurs explicitly in $C$ and $B = X \otimes B_{\text{code}}$ occurs in $C$ due to $C = X \otimes A_{\text{code}} \otimes V = X \otimes B_{\text{code}} \otimes U$. In the cut rule, we may assume $A \otimes X_1 = C_1 = D \otimes Y_1$ and $D \otimes X_2 = C_2 = B \otimes Y_2$ by induction hypothesis. Hence $C = C_1 \otimes V = A \otimes X_1 \otimes V$ and $C = C_1 \otimes V = C_2 \otimes U = B \otimes Y_2 \otimes U$.

2. We first prove auxiliary claims regarding rules WR-LIM($C$), RD, $\mu$ and CODE.

**RD Claim 1:** If $w = av$ and $\gamma(q) = fu$ then there are unique $u$ and $b$ with $w \overset{\text{read}(fu),b,q}{\rightarrow} C u$. Furthermore, $|u| = |w| - 1$ holds.

**Proof** For $w = av$ and $\gamma(q) = fu$ both side conditions of rule RD are fulfilled and the claim holds for the unique $a = b$ and $u = v$.

**WR-LIM($C$) Claim 2:** For $C = q^k \otimes X$ with $\text{prime}(q,X)$, $|w| < k$ and arbitrary $a$ there is a unique $u$ such that $w \overset{\text{write},a,q}{\rightarrow} C u$. Furthermore, $|u| = |w| + 1$ holds.

**Proof** Rule WR-LIM($C$) carries exactly the assumptions as side conditions, so the claim follows for $u = wa$.

**$\mu$** There are two claims.

**Claim 3:** If $\text{shape}(C) = q \otimes X$, $\gamma(q) = fu$ and $C$ arbitrary, then there are unique $D$ and $a$ with $C \overset{\text{read}(fu),a,q}{\rightarrow} D$. Furthermore, $\text{shape}(D) = X$ holds.

**Proof** For $C = (Q,M)$ and $\text{shape}(C) = q \otimes X$ we have $|Q(q)| \geq 1$, so $Q(q) = aw$ for some unique $a$ and $w$. By claim 1, there are unique $u$ and $b$ with
Chapter 4. Asynchronous queue machines

Claim 4: If $\text{shape}(C) = A$, $A \otimes q \otimes X = C$ for some $X$ and $a$ arbitrary, then there is a unique $D$ with $C \xrightarrow{\text{write}(a,q)} D$. Furthermore, $\text{shape}(D) = A \otimes q$ holds.

Proof The definition of $\otimes$ implies that for $A \otimes q \otimes X = C$, $C = q^k \otimes Y$, $A = q^l \otimes Z$ and $\text{prime}(q,Y \otimes Z)$, $l < k$ holds. Hence, for $C = (Q,M)$ with $\text{shape}(C) = A$ we obtain $|Q(q)| = l < k$. By claim 2, there is consequently a unique $u$ with $Q(q) \xrightarrow{\text{write}(a,q)} C u$, and $u = |Q(q)| + 1$ holds. Hence, $C \xrightarrow{\text{write}(a,q)} C D$ is derivable for $D = (Q[q \mapsto u], M)$ and uniqueness of $D$ follows as in Proposition 1.

Furthermore, $|Q[q \mapsto u](q)| = |u| = |Q(q)| + 1$ holds and $|Q[q \mapsto u](q')| = |Q(q')|$ for $q' \neq q$, hence $\text{shape}(D) = \text{shape}(C) \otimes q = A \otimes q$.

CODE Claim 5: If $\vdash_C [n] \text{code} : A \rightarrow B$ and $\text{shape}(C) = A$ and then there is a unique $D$ such that $C \xrightarrow{\text{code}} D$. Furthermore, $\text{shape}(D) = B$ holds.

Proof We treat the representative case ADD explicitly.

Case code = add $op_1 \ op_2 \ op_3$. For $\vdash_C [n] \text{add} op_1 \ op_2 \ op_3 : A \rightarrow B$, the axiom AX-LIM and Table 3.1 guarantee

- $\gamma(op_1) = \gamma(op_2) = \text{ALU}$
- $A = op_1 \otimes op_2 \otimes Y$ and $B = op_3 \otimes Y$ for some $Y$
- $C = op_1 \otimes op_2 \otimes Y \otimes V$ where $op_3/\otimes op_1 \otimes op_2 = V/U$.

Therefore, $Y \otimes op_3 \otimes U = C$. By the assumption that all operand identifiers $op$ are operand queue identifiers $q$, there are $q_1$, $q_2$ and $q_3$ such that $q_i = op_i$ holds for $i \in \{1,2,3\}$. Therefore, $\text{shape}(C) = A = q_1 \otimes q_2 \otimes Y$ and $\gamma(q_1) = \gamma(q_2) = \text{ALU}$, so by claim 3 there are unique E
and a with \( C \xrightarrow{\text{(read [ALU],a,q_1)}} C E \), and \( \text{shape}(E) = q_2 \otimes Y \) holds. Applying claim 3 again yields unique \( F \) and \( b \) with \( E \xrightarrow{\text{(read [ALU],b,q_2)}} C F \), and \( \text{shape}(F) = Y \) holds.

The above \( Y \otimes \text{op}_3 \otimes U = C \) implies \( \text{shape}(F) \otimes q_3 \otimes U = C \), so by applying claim 4, there is a unique \( D \) with \( F \xrightarrow{\text{(write},a+b,q_3)} C D \), and \( \text{shape}(D) = \text{shape}(F) \otimes q_3 \) holds. Consequently, the following derivation is possible

\[
\begin{align*}
\text{ADD} & \quad C \xrightarrow{\text{(read [ALU],a,q_1)}} C E \\
& \quad E \xrightarrow{\text{(read [ALU],b,q_2)}} C F \\
& \quad F \xrightarrow{\text{(write},a+b,q_3)} C D \\
& \quad C \xrightarrow{\text{add} \ q_1 \ q_2 \ q_3} C D
\end{align*}
\]

Using \( q_i = \text{op}_i \) for \( i \in \{1,2,3\} \) we thus obtain a derivation for \( C \xrightarrow{\text{code}} D \), and \( \text{shape}(D) = \text{shape}(F) \otimes q_3 = Y \otimes \text{op}_3 = B \) holds. Uniqueness of \( D \) with respect to \( C \xrightarrow{\text{code}} D \) follows as in Proposition 1.

**Cases** \( \text{code} \neq \text{add \ op}_1 \ \text{op}_2 \ \text{op}_3 \). Similar.

For proving the main claim, we perform a structural induction on \( t \).

**If** \( t = \varepsilon \), no typing judgement \( \vdash_C t : A \rightarrow B \) is derivable, so the claim is trivially fulfilled.

**If** \( t = \text{ins} \), then claim 5 guarantees that there is a unique \( D \) with \( C \xrightarrow{\text{code}} C D \), where \( \text{ins} = [n] \text{code} \). By rule INSTR we obtain \( C \xrightarrow{\text{ins}} C D \). Furthermore, claim 5 guarantees \( \text{shape}(D) = B \), and uniqueness of \( D \) with respect to \( C \xrightarrow{\text{ins}} C D \) follows from Proposition 1.

**If** \( t = \text{sr} \), then the typing rule CUT-LIM guarantees that there are \( D, C_1 \) and \( C_2 \) with \( \vdash_{C_1} s : A \rightarrow D, \vdash_{C_2} r : D \rightarrow B \) and \( C = C_1 \otimes V \), where \( C_2/C_1 = V/U \). A simple structural induction shows that \( \vdash_{C_1 \otimes V} s : A \rightarrow D \) and \( \vdash_{C_2 \otimes U} r : D \rightarrow B \) are derivable, i.e. \( \vdash_C s : A \rightarrow D \) and \( \vdash_C r : D \rightarrow B \). As \( \text{shape}(C) = A \) holds, we can thus apply the induction hypothesis to \( s \) to obtain a unique \( E \) with \( C \xrightarrow{s} E \) and \( \text{shape}(E) = D \). Applying the induction hypothesis again yields the existence of a \( D \) with \( E \xrightarrow{s} D \), and \( \text{shape}(D) = B \) hold. Consequently, \( C \xrightarrow{\text{sr}} C D \). Uniqueness of \( D \) with respect to \( C \xrightarrow{\text{sr}} C D \) follows from Proposition 1.
3. Again, the claim follows by induction on $t$.

If $t = e$, no judgement $C \vdash_D D$ is derivable, so the claim is trivially fulfilled.

If $t = \text{ins} = [n]\text{code}$, the rule $\text{AX-LIM}$ implies that there is a (unique) $X$ with $A = X \otimes A_{\text{code}}$ and $B = X \otimes B_{\text{code}}$, and that for $B_{\text{code}}/A_{\text{code}} = V/U$, $C = A \otimes V$ holds. The assumption $C \vdash_{[n]\text{code}} E$ yields $C \vdash_{[n]\text{code}} E$, so for $\text{shape}(C) = A$, $\text{shape}(E) = B$ follows, and the definition of $\rightarrow_D$ guarantees the existence of some $Y$ and $Z$ with

$$X \otimes A_{\text{code}} \otimes Y = \begin{array}{c}
A \otimes Y = \text{shape}(C) \otimes Y = D
\end{array}$$

$$= \begin{array}{c}
\text{shape}(E) \otimes Z = B \otimes Z = X \otimes B_{\text{code}} \otimes Z
\end{array}$$

Consequently, $A_{\text{code}} \otimes Y = B_{\text{code}} \otimes Z$ follows, so the minimality of $V$ and $U$ implies that there is a $W$ such that

$$Y = V \otimes W \text{ and } Z = U \otimes W,$$

from which the claim follows due to $D = A \otimes Y = A \otimes V \otimes W = C \otimes W$.

If $t = sr$, then the last rule in the derivation for $C \vdash_D E$ must have been $\text{COMP}$, so there is a unique $F$ with $C \vdash_D F$ and $F \vdash_D E$. Furthermore, for $\vdash_C sr : A \rightarrow B$ to hold there must by rule $\text{CUT-LIM}$ be $C_1, C_2$ and $E$ with

$$\vdash_{C_1} s : A \rightarrow E, \quad \vdash_{C_2} t : E \rightarrow B \quad \text{and} \quad C = C_1 \otimes V = C_2 \otimes U$$

where $C_2/C_1 = V/U$, and it can easily be seen that the entities $C_1, C_2$ and $E$ are uniquely determined. By induction hypothesis there are thus $W_1$ and $W_2$ such that

$$C_1 \otimes W_1 = D = C_2 \otimes W_2.$$

Since $V$ and $U$ are prime and fulfil $C_1 \otimes V = C_2 \otimes U$, there must hence be a $W$ such that $W_1 = V \otimes W$ and $W_2 = U \otimes W$, and for this $W$ we obtain $D = C_1 \otimes W_1 = C_1 \otimes V \otimes W = C \otimes X$. 

$\square$
4.5.1 Distributed execution

For distributed execution, a deadlock occurs if progress in all pipelines is blocked, due either to a lack of operands or to an overflow in a queue used as the destination for a forwarding operation. Such a situation occurs if an interleaving of \( t \) enters a configuration \( E \) but is unable to reach a final configuration from \( E \) under the same restrictions \( C \).

**Definition 15.** A program \( t \) is deadlock-free for the pair \((A, C)\) if \( A \otimes X = C \) for some \( X \) and for all \( C, E, \pi_1 \) and \( \pi_2 \) with shape \( (C) = A, \pi_1 \pi_2 = \pi_t \) and \( C \xrightarrow{\pi_1} E \) there is a \( D \) such that \( E \xrightarrow{\pi_2} C \) \( D \).

Absence of deadlock is a priori not preserved when operand queues are extended.

**Example.** For \( \gamma(q_2) = \gamma(q_3) = \text{MUL} \) and \( \gamma(q_1) = \text{ALU} \) consider the program \( t \) in (4.9)

\[
\begin{align*}
1 & \text{ldc } 3 \ q_1 \\
2 & \text{dec } q_1 \ q_2 \\
3 & \text{ldc } 4 \ q_1 \\
4 & \text{ldc } 3 \ q_2 \\
5 & \text{dec } q_1 \ q_3 \\
6 & \text{mul } q_3 \ q_2 \ q_1 \\
7 & \text{ldc } 5 \ q_2
\end{align*}
\]

(4.9)

The distributed program \( \pi_t \) is \([2] [5] \parallel [6] \parallel [1][3][4][7]\). For \( C = q_1 \otimes q_2 \otimes q_2 \otimes q_3 \) we obtain \( \vdash_{\pi} t : 1 \rightarrow q_1 \otimes q_2 \otimes q_2 \), and for \( C = ([], M) \) and \( D = ([q_1 \mapsto [6], q_2 \mapsto [3,5], M) \), \( C \xrightarrow{\pi_1} D \) follows. Indeed, \( t \) is deadlock-free for \((A, C)\), as shown in Figure 4.7 (left), where the interleavings starting in \( C \) under restrictions \( C \) are depicted. The states are given as triples \((i, j, k)\) such that \( q_1^i \otimes q_2^j \otimes q_3^k \) is the shape of the corresponding configuration. States violating the length restrictions are marked, and execution paths are cut off at these points. Absence of deadlock can be detected by observing that each non-marked configuration has at least one successor which is also non-marked.

The in-order execution is seen to obey restrictions \( C \) as the interleaving \([1] \ldots [7]\) is a successful path.

Relaxing length restrictions to \( C \otimes X \) with \( X = q_4 \) enables more interleavings – see the graph in Figure 4.7 (right). Again, states violating the (extended) length restrictions are marked. Some new interleavings are harmless, such as any interleaving starting with \([1][3][2]\). However, the configuration reached by the prefix \([1][3][4][7]\) is a deadlock -
Figure 4.7: Interleavings of program (4.9) for various length restrictions.

its only successor violates the extended length restrictions. This path could have been completed, had we extended $C$ by $q_1 \otimes q_2$ instead of $q_1$. \hfill $\lozenge$

Failure of this monotonicity property is practically relevant. Extending some operand queues supports the exploitation of the instruction-level parallelism inherent in a program. The performance gain should not be paid for by new deadlocks in programs which were previously deadlock-free. Of course, reexamination of programs should also be avoided. For programs without race hazards this is possible – and the above example program is indeed non-deterministic for shape 1.

**Proposition 19.** Let $(\Gamma, <) \vdash t : A \rightarrow B$, $<$ be contained in $\sqsubset$ and $t$ sequential for $\sqsubset$ and $A$. If $t$ is deadlock-free for $(A, C)$ then $t$ is deadlock-free for $(A, C \otimes X)$.

**Proof.** Let $\text{shape}(C) = A$, $\pi_1 \pi_2 = \pi_t$ where both $\pi_i$ are non-empty and $C \xrightarrow{\pi_1} C \otimes X E$. We have to show that $E \xrightarrow{\pi_2} C \otimes X D$ holds for some $D$.

Let $n$ be the length of $t$ and $u$ be the interleaving of $\pi_1$, i.e. $C = C_0 \xrightarrow{u_1} C \otimes X C_1 \xrightarrow{u_2} C \otimes X \ldots \xrightarrow{u_l} C \otimes X C_l = E$ for some $l \geq 0$. Let $0 \leq m \leq l$ be smallest such that there is no $X_m$
with \( \text{shape}(C_m) \otimes X_m = C \),
\[
C_0 \xrightarrow{u_1} C \xrightarrow{u_2} C \cdots \xrightarrow{u_{m-1}} C \xrightarrow{u_m} C \otimes X \xrightarrow{u_{m+1}} C \otimes X \cdots \xrightarrow{u_l} C \otimes X \xrightarrow{C_{l+1}} C
\]
and suppose there are no \( w \) and \( C_{l+1} \) with \( C_l \xrightarrow{w} C \otimes X \xrightarrow{C_{l+1}} C \).

Since \( t \) is deadlock-free for \( (A, C) \) there is a \( Y \) such that \( \text{shape}(C) \otimes Y = A \otimes Y = C \), hence \( m > 0 \) must hold. Furthermore, absence of deadlock for \( (A, C) \) implies that there is an interleaving \( v \) of \( \pi_{u_m} \ldots \pi_l \pi_2 \) such that \( C_{m-1} \xrightarrow{v} C \xrightarrow{D} \).

Let \( v = v_1 \ldots v_k \).

1. If \( v_1 \) occurs in \( u_m \ldots u_l \), say \( v_1 = u_i \), then instructions \( u_m \) to \( u_{i-1} \) are completely independent from \( v_1 \), i.e. they queue in different instruction queues and forward to different operand queues. This follows from the sequentiality of \( t \) for \( \sqsubseteq \) and \( A \): since \( v_1 \) is executable in \( C_{m-1} \) it must be at the head of its instruction queue. Since \( v_1 \) is not in \( u_m \ldots u_{i-1} \), it is still at the head position in \( C_{i-1} \). If an instruction within \( u_m \ldots u_{i-1} \) forwarded to the same queue as \( v_1 \) we had a race hazard in the unrestricted model, in contradiction to Proposition 11. Consequently, we can reorder the instructions and obtain
\[
C_{m-1} \xrightarrow{v_1} C \xrightarrow{D_1} \cdots \xrightarrow{u_{i-1}} D_{i-1} \xrightarrow{u_{i+1}} D \cdots \xrightarrow{u_l} D_k = C_l
\]
We have thus decreased the distance between the last configuration of shape \( C \) and \( C_l \) by one instruction, hence by induction on \( l - m \) we obtain \( C_0 \xrightarrow{\pi_u} C \xrightarrow{C_l} \). Consequently \( C_l \xrightarrow{v} C \xrightarrow{D} \) and \( \pi_v = \pi_2 \) follow, hence \( C_l \xrightarrow{v} C \xrightarrow{D} \) and \( C_l \xrightarrow{\pi_2} C \xrightarrow{D} \) by Proposition 17.

2. If \( v_1 \) does not occur in \( u_m \ldots u_l \) then it must still be at the head of its instruction queue in configuration \( C_l \). Since all its operands were available in \( C_{m-1} \), they are still available in \( C_l \) (no intermediate instruction can have removed them because they queue in different queues) and no instruction in \( u_m \ldots u_l \) can have forwarded to a queue to which \( v_1 \) forwards (otherwise we had a race condition). Consequently, we have \( C_l \xrightarrow{v_1} C \xrightarrow{D_1} \) for some \( D_1 \), in contradiction to the assumption that no \( w \) and \( C_{l+1} \) exist with \( C_l \xrightarrow{w} C \xrightarrow{C_{l+1}} C \). Consequently, the case \( v_1 \notin \{ u_m \ldots u_l \} \) does not occur.

\( \square \)
The calculi $\Gamma, < \vdash t : A \rightarrow B$ and $(\Gamma, <) \vdash t : A \rightarrow B$ can be combined to a calculus for sequential and deterministic distributed execution with finite queues where judgements are of the form $(\Gamma, <) \vdash t : A \rightarrow B$. For this calculus, we obtain the following result.

**Theorem 3.** Let $(\Gamma, <) \vdash t : A \rightarrow B$ and $t$ sequential for $\square$ and $A$ where $<$ is contained in $\square$. Let shape$(C) = A$. Then

1. $C \xrightarrow{t} C \otimes X \ D$ for some unique $D$ and any $X$.
2. $t$ is deterministic for $A$ and deadlock-free with respect to $(A, C \otimes X)$ for any $X$.
3. $C \xrightarrow{\pi_t} C \otimes X \ D$ for any $X$.

**Proof.**

1. Combine Propositions 18 and 17 to obtain $C \xrightarrow{t} C \otimes X \ D$ for some $D$ and any $X$. Uniqueness of $D$ follows from Proposition 1.

2. Given Propositions 11 and 19, it remains to show that $t$ is deadlock-free with respect to $(A, C)$.

Let $\pi_t = \pi_1 \pi_2$ and $C \xrightarrow{\pi_1} C \ C_1$ such that $u = u_1 \ldots u_l$ is the interleaving of $\pi_1$ and

$$C \xrightarrow{u_1} C \ C_1 \xrightarrow{u_2} C \ldots \xrightarrow{u_l} C \ C_l.$$  

Furthermore, part (1) implies

$$C \xrightarrow{t_1} C \ D_1 \xrightarrow{t_2} C \ldots \xrightarrow{t_n} C \ D$$

for some $D_i$ and $t = t_1 \ldots t_n$. Let $k$ be smallest such that $t_k \notin \{u_1, \ldots, u_l\}$. Then $C_1 \xrightarrow{t_k} C \ E$ for some $E$ because

(a) $t_k$ is at the head of its instruction queue: since $t_1, \ldots, t_{k-1}$ are in $\{u_1, \ldots, u_l\}$ there is an $i \leq l$ such that $t_k$ is at the head of its instruction queue in $C_i$. Since distributed execution respects $<$, $t_k$ is still at the head of its instruction queue in $C_l$.

(b) all operands for $t_k$ are available in $C_l$: they are available in $C_i$, and no instruction in $u_{i+1}, \ldots, u_l$ consumed them.
4.6. AQM’s and Tomasulo’s algorithm

In order to demonstrate how AQM’s with distributed execution relate to hardware implementations of forwarding we briefly discuss Tomasulo’s algorithm [Tom67] [HP96]. As the material presented here will not be needed in later chapters, a reader who is more interested in the technical development may skip this section and proceed directly to the discussion (Section 4.7).

4.6.1 Tomasulo’s algorithm

In Tomasulo’s algorithm (see Figure 4.8), functional units are equipped with reservation stations. At issue time, instructions are assigned a symbolic tag and sent to a reservation station associated to a suitable functional unit. If an operand is not available at this time, the corresponding operand field is filled with the tag of the instruction producing the operand. Instructions inside the reservation stations whose operand fields contain valid operands may execute. The functional units attach the instruction’s tags to the results and then arbitrate for access to a common data bus (CDB). The CDB delivers the result to the register bank, possibly mediated by a reorder buffer. Results are also made available to instructions inside the reservation stations. These match the tags of outstanding operands against the tag of a result provided by the CDB and substitute...
the result into the operand field whenever the match succeeds. Correctness of execution under Tomasulo’s algorithm relies on the allocation of tags. In most implementations, tags represent either the index of the reservation station holding the instruction which will produce the result [HP96], or its index in the reorder buffer [KMP99]. Due to the complexity of out-of-order operation in the functional units, concurrent tag matching in reservation stations and tag allocation in the issue unit, Tomasulo’s algorithm has been a popular case study for formal hardware verification [HGS99] [McM98] [SH98] [AP99] [McM00] [JM01].

**Example.** We consider an architecture using [HP96]’s interpretation of tags, but without a reorder buffer. For the code sequence

\[
\begin{align*}
[1] & \text{ld } r_2 & r_{16} & \quad [4] & \text{sub } r_{16} & r_3 & r_{18} \\
[2] & \text{ld } r_3 & r_{12} & \quad [5] & \text{div } r_{14} & r_{10} & r_{20} \\
[3] & \text{mul } r_{12} & r_{14} & r_{10} & \quad [6] & \text{add } r_{18} & r_{10} & r_{16}
\end{align*}
\]  

\[(4.10)\]

Figure 4.9 shows a possible situation after issuing instruction [1]. Reservation station \text{Mem1} contains [1]’s code and the operand obtained from register \(r_2\). Also, the tag
\textbf{4.6. AQM's and Tomasulo's algorithm}

Mem1 has been written to the target register of [1], \( r_{16} \). Thus, when instructions [2] to [5] are issued (Figure 4.10), any read access to \( r_{16} \) yields the tag Mem1 instead of any former value. For example, when instruction [4] is allocated to reservation station ALu1, its first operand field is set to Mem1. Notice that reservation stations for a particular functional unit are not allocated in any specific order. Finally, when instruction [6] is issued, the content of \( r_{16} \) is overwritten – see Figure 4.11 where [6] is allocated to ALu3. Any read access from \( r_{16} \) by instructions [7] and later will yield the tag ALu3.

When an instruction is executed, its result is written into the target register containing the producer’s reservation station index: if instruction [1] executes and retires prior to the issuing of [6], its result will overwrite the tag Mem1 - further instructions reading from \( r_{16} \) would thus obtain the operand directly. If instruction [1] retires after instruction [6] has issued, no register contains the tag Mem1 any longer and the result of [1] will not be written back. Independently of instruction [6]’s issue status at the time of [1]’s retirement, the result of [1] will be substituted for the placeholder tag in the first
Figure 4.10: Snapshot after issuing instructions [1] to [5].

Figure 4.11: Snapshot after issuing all instructions of program (4.10).
operand of instruction [4] in reservation station \texttt{Al1}. At the same time, the reservation station \texttt{Mem1} will be freed, again triggered by the availability of the current entry’s result on the CDB.

\subsection*{4.6.2 Mapping to AQM’s}

For mapping Tomasulo’s architecture to the AQM model, we replace the unspecific broadcasting of results via the CDB by directed forwarding to the consuming instructions. As forwarding in AQM’s is initiated by the producer of the result and no assumptions are made regarding execution delays inside functional units, the order in which operands arrive at a particular reservation station is undetermined. In contrast to the concurrent matching of result tags against placeholder tags in reservation station entries, correct behaviour is achieved by having several operand queues delivering values to each functional unit (see Figure 4.1). Verifying the allocation of operand queues to operands is thus equally important as verifying the allocation of tags in Tomasulo’s algorithm, but is in ALEF under the control of the compiler.

\textbf{Example.} Suppose an architecture with $\gamma(q_1) = \text{ALU}$ and $\gamma(q_2) = \text{MUL}$ is given, and an initial configuration of shape \texttt{1}. For program (4.10), registers $r_{16}$ and $r_{18}$ may both be replaced by forwarding to $q_1$ as no race condition can develop. If additionally, the usage of $r_{12}$ is replaced by forwarding to $q_2$ we obtain

\begin{align*}
[1] & \text{ld} r_2 q_1 & [4] & \text{sub} q_1 r_3 q_1 \\
[2] & \text{ld} r_3 q_2 & [5] & \text{div} r_{14} r_{10} r_{20} \\
[3] & \text{mul} q_2 r_{14} r_{10} & [6] & \text{add} q_1 r_{10} r_{16}
\end{align*}

\subsection*{4.6.3 Limitations}

While the general AQM model is powerful enough to simulate Tomasulo’s algorithm, the restrictions of ALEF make the translation cumbersome.
Firstly, in ALEF a result can only be forwarded to a small (and fixed) number of operand queues. Hence, the effect of the broadcast to substitute a result for many tags at the same time can only be matched non-atomically, using duplication instructions.

**Example.** As the result of instruction [3] is consumed by two instructions, $\tau_{10}$ can only be eliminated at the cost of a duplication instruction.

\[
\begin{align*}
[1] \text{ld } r_2 & q_1 \\
[2] \text{ld } r_3 & q_2 \\
[3] \text{mul } q_2 & \tau_{14} q_2 \\
[3'] \text{dupl } \text{MUL} & q_2 q_2 q_3 \\
[4] \text{sub } q_1 & \tau_3 q_1 \\
[5] \text{div } \tau_{14} & q_2 r_{20} \\
[6] \text{add } q_1 & q_3 \tau_{16}
\end{align*}
\]

Provided that we chose the duplication on $\text{MUL}$, the forwarding from [3] to [3'] as well as the forwarding from [3'] to [5] may reuse $q_2$. For the forwarding to instruction [6], a race condition with the result of [4] is avoided by introducing a new operand queue $q_3$ with $\gamma(q_3) = \text{ALU}$. The remaining registers lack either instructions writing to them (registers $r_2$, $r_3$ and $\tau_{14}$) or instructions reading from them (registers $\tau_{16}$ and $r_{20}$) and can thus not be eliminated – see the following chapter for a treatment of programs which use operand queues and registers.

A similar situation occurs if a result in Tomasulo’s algorithm updates a register but is also consumed by a reservation station entry. Again, a duplication instruction is needed in ALEF for achieving a similar effect.

Another aspect related to duplication is [Tom67]’s variation where some instructions effectively execute in zero time. For example, consider the instruction sequence

\[
\begin{align*}
[1] \text{add } & r_1 r_2 r_3 \\
[2] \text{id } & \text{MUL } r_3 r_4,
\end{align*}
\]

for which issuing under the standard scheme may lead to the situation shown in Figure 4.12 on the left. Under Tomasulo’s variation, the issue unit would not insert instruction [2] into a reservation station but instead copy the tag found in $r_3$ into $r_4$ – see Figure 4.12 on the right. Consequently, the execution of instruction [1] would automatically update both registers and thus achieve the same effect as the original program. In ALEF, this feature cannot be simulated, but future extensions of ALEF might well
combine duplication with ordinary instructions. This would not only allow the elimination of the identity instruction but also lead to a more efficient simulation of broadcasting to several locations.

Secondly, Tomasulo’s algorithm and AQM’s as treated in this chapter differ in the relative order of instructions which are mapped to the same functional unit. In contrast to the in-order execution resulting from AQM’s instruction queues, reservation stations may release instructions in an arbitrary order. This difference represents a particularity of the operational semantics rather than a conceptual limitation of the AQM model. In fact, by defining alternative operational semantics for ALEF the consequences of such architectural changes may easily be studied.

Thirdly, modern micro-processors combine Tomasulo’s algorithm with mechanisms for exception handling and speculative execution, often relying on the reorder buffer which rearranges the instructions into program order before retirement. It is beyond the scope of this thesis to study the interaction between forwarding and these techniques, and we have not evaluated possibilities for extending AQM’s by a reorder buffer. Notice, however, that extended opcodes such as $\text{addr} \ op_1 \ op_2 \ op_3 \ p : op_1 \otimes op_2 \rightarrow op_3 \otimes p$ may in principle signal the completion of instructions to data structures similar to the reorder buffer.

On the other hand, Tomasulo’s synchronous broadcasting and the CDB are not expected to scale well as the number of functional units and the size of reservation sta-
tions grow. In contrast, forwarding paths in AQM’s may be implemented distributedly where the availability of individual connections is indicated in architecture descriptions. We will briefly discuss in Chapter 9 how type systems could be made parametric in abstract descriptions of processor configurations.

### 4.6.4 Conclusion

The preceding discussion indicates that AQM’s are powerful enough to simulate Tomasulo’s algorithm. However, several features of super-scalar processors remain unsupported at the time of writing. We hope that the methodology proposed in this thesis will be helpful for overcoming these deficiencies. As the outlined mapping provides a novel explanation of Tomasulo’s algorithm which complements existing informal expositions as well as formal verifications of individual implementations, AQM’s may also be useful for studying and comparing future variations of Tomasulo’s architecture.

### 4.7 Discussion

This chapter considered program execution in processors with independent instruction-pipelines and operand queues of infinite and finite length. We modified the dynamic semantics of ALEF appropriately and updated the static semantics for guaranteeing desirable properties. In the distributed model, we were concerned with determinism of execution. We showed that the programming language approach allows one to tackle non-determinism using a variety of means, based on the relationship between determinism, safety and completeness. In the model for finite queues, we ensured absence of queue overflow for sequential and distributed execution. In the latter case, we observed that extending queues is only safe for deterministic programs. We believe it would have been difficult to guarantee robustness of program execution with respect to this architectural change using a purely hardware-based verification approach. Finally, the comparison to Tomasulo’s algorithm demonstrated that AQM’s can simulate existent hardware implementations of forwarding and highlight some of their technical aspects.
We briefly comment on two design decisions, and how the corresponding limitations may be overcome. First, we considered interleaving at the instruction level as each instruction was executed atomically. An obvious dynamic semantics for more fine-grained interleaving is an an SOS for the underlying micro-instructions read and write. For sequential execution, this merely amounts to a switch of granularity in the small-step semantics. For distributed execution, changing the level of granularity exposes more asynchrony as read- and write-operations of different instructions may interleave. In particular, non-atomic execution of a \texttt{dup1} \( q_1 \) \( q_2 \) \( q_3 \) instruction allows an instruction which is data-dependent on the value sent through \( q_2 \) to start (and finish) execution before the forwarding to \( q_3 \) has been performed. Consequently, a new type of race conditions is observed, like in program

\[
[1]\texttt{dup1}^{fu} q_1 ~ q_2 ~ q_3 \quad [2]\texttt{dec} q_2 ~ q_3.
\]

Although a race occurs for writing to \( q_3 \), the program is admitted by our system. The typing calculus correctly discovers the race condition and inserts the constraint \([1] < [2]\). For \( fu \neq \text{ALU} \), this constraint cannot be eliminated using a pipeline-dependency, but for initial configurations of shape \( q_1 \) a data-dependency exists. A compiler using data-dependencies for discharging constraints will thus (falsely) discharge \([1] < [2]\).

Hardware-based solutions for making \texttt{dup1} appear atomic may be difficult to implement under an asynchronous regime. However, a small modification of the definition of data-dependencies solves the problem: when discharging serialisation constraints, only data-dependencies carried through the second destination queue of \texttt{dup1} may be used. This expresses the fact that the \texttt{dup1} \( q_1 \) \( q_2 \) \( q_3 \) can only be considered completed once the second forwarding has been carried out. Consequently, the above race-condition \([1] < [2]\) cannot be discharged and the program is rejected as being non-deterministic.

In contrast, a program such as

\[
[1]\texttt{dup1}^{fu} q_1 ~ q_2 ~ q_3 \quad [2]\texttt{dec} q_3 ~ q_2
\]

is (rightly) accepted.

If the forwardings for \texttt{dup1} are initiated in a different order, other data-dependencies must be considered – if the order is unspecified, no data-dependencies involving results
of $\text{dup1}$-instructions may be used for discharging constraints. As the processor’s behaviour is expressed as a dynamic semantics for ALEF, our approach is adaptable to all these situations – one has to change the definition of data-dependencies appropriately and prove a result similar to Proposition 14.

We thus expect our approach can handle fine-grained models of execution, and supports the implementation of instructions such as $\text{dup1}$ in an asynchronous environment.

The second design decision concerns the one-to-one correspondence of functional units to instruction pipelines. In a realistic processor, several functional units of a particular functionality may be present, and different association patterns between functional units and instruction queues are possible. For example, several functional units of type ALU could be served jointly by a single queue or jointly by several queues, or each functional might have its own private queue, or several private queues. Also the mapping of instructions to functional units and/or to instruction queues can be statically fixed or be determined dynamically. Finally, operand queues can be associated to single functional units, blocks of functional units, single instruction queues or blocks of instruction queues which are connected to functional units in any of the above ways.

For some of these scenarios even functional correctness is difficult to achieve as the mapping of instruction to pipelines needs to be consistent with the operand queues used by earlier instructions for communicating operands. Clearly, the exploration of various design alternatives needs a systematic framework. The static semantics presented in this chapter provides a first step towards such a framework. This is particular so for the distinction between the derivation of constraints and their discharge, as many of the above design alternatives result either in additional constraints or in restrictions on the techniques for discharging them. For example, hazards for reading from an operand queue occur if operand queues serve several functional units or instruction pipelines. Extending the static semantics by constraints $[n] \prec [m]$ whenever $[n]$ and $[m]$ access the same queue for reading, we can retain determinism by the same approach as before. If several pipelines serve a single functional unit, we use only those instructions for discharging a constraint which inhabit the same pipeline - it suffices to modify the definition of pipeline-dependencies such that two instruction’s instruction queues are
compared rather than the functional units on which they execute.
The ability to treat various architectural alternatives using similar techniques highlights
the generality of the programming language approach and should make AQM’s applica-
table to a range of target processors.

Finally, we note that the technical requirement in Proposition 14 that a competitor \([l]\)
be related to \([n]\) by \([l] \subseteq [n]\); or \([n] \subseteq [l]\) is a little too restrictive. For example, program

\[ [1] \text{ldc} 5 \quad [2] \text{dup1\texttt{mul}} \quad q_1 \quad q_2 \quad [3] \text{dec} \quad q_2 \quad q_1 \]

is (wrongly) rejected for shape 1. Although instruction \([3]\) is data-dependent via \(q_2\)
on \([2]\) and the (empty) order of pipeline-dependencies can extended by \([2] <_I [3]\), this
cannot be used when the \(q_1\)-carried data-dependency between \([1]\) and \([2]\) is examined
as the competitor \([3]\) is not yet related to \([1]\). Further work is needed in Proposition
14 to see whether the condition may be relaxed such that competitor \([l]\) only fulfils
\([l] \subseteq [n]\); or \([n] \subseteq [l]\).

### 4.7.1 Summary and outlook

We extended the dynamic and static semantics of ALEF to cover concurrent execution
and finite operand queues. As in Chapter 3, soundness of the static semantics with
respect to the dynamic semantics was shown. In the static semantics, we separated de-
ving constraints from discharging them and discussed several techniques for the latter
task. Our analysis paid off when the same criteria could be used to preserve absence
of deadlock as queue lengths were extended. We also discussed how Tomasulo’s algo-
thesis relates to AQM’s model of computation. This demonstrates the expressiveness of
the AQM model, but also provides an alternative exposition of Tomasulo’s algorithm
itself. Assessing the status of our main contributions (Section 1.3), the second and
third items have thus been provided for AQM’s which realise some features of modern
processors.

In the next chapter, we will introduce registers and thus cover the full language ALEF.
Revisiting the computational models considered so far, we will extend the configura-
tions of the dynamic semantics by a register bank. This will trigger modifications in the
static semantics, but the changes will leave the essential characteristics untouched. In particular, the distinction between deriving constraints for eliminating race conditions and means for discharging such constraints will remain valid and will be extended to new classes of hazards. The robustness of this distinction highlights an advantage of our methodology as orthogonal issues may be analysed separately.
Chapter 5

Queue machines with registers

The previous chapters introduced computational models in which all operands are communicated via forwarding. These models are well suited for studying the fundamental properties of explicit forwarding, but they are not satisfactory models of practical computation. The necessity to enter values into operand queues repeatedly if they are used more than once leads to congestion in the forwarding network and an increased static and dynamic instruction count. For example, variable \( k \) in the Java excerpt

\[
\ldots
\text{for (int } i=1; \ i<n; \ i++ \{ \\
\quad j = j \times k; \\
\} \\
\quad j = j + k;
\]

is accessed \( n \) times where \( n \) might only be known at runtime. If the computation of \( k \) is costly and cannot be moved inside the loop, its value must be duplicated at the start of each iteration using a \( \text{dup1} \) instruction. One copy is sent to the multiplication and the other copy is forwarded to the next iteration or – in the last iteration – to the instruction following the loop. If the \( \text{dup1}^{ALU} \) instruction is used, the value of \( k \) will be available at the correct functional unit in both cases. Similar duplication schemes are necessary for variables \( n \) and \( j \), and one can easily obtain programs where each iteration contains several uses of a value. Furthermore, an \( \text{id} \) instruction is needed directly after the loop.
since the loop-carried forwarding for \( j \) has destination \( \text{MUL} \) whereas the instruction following the loop executes on \( \text{ALU} \). A translation of program (5.1) into ALEF where no sharing of operand queues is implemented thus results in

```c
// n in q5, k in q4, j in q1
[1]ldc 1 q3    // i=1
[2]jmp 3     // jmp 3
[3]dupl ALU q4 q2 q4    // dupl k
[4]mul q1 q2 q1     // j = j * k1
[5]inc q3 q3    // i++
[6]dupl ALU q3 q3 q3    // dupl i
[7]dupl ALU q5 q5 q5    // dupl n
[8]sub q5 q3 q7     // x = n1 - i1
[8]if q7 9 3   // if x=0 13 12
[9]id MUL q1 q6     // move j
[10]add q6 q4 q6    // j = j* k2
```

where we assume \( \gamma(q_1) = \gamma(q_2) = \text{MUL}, \gamma(q_3) = \gamma(q_4) = \gamma(q_5) = \gamma(q_6) = \text{ALU} \) and \( \gamma(q_7) = \text{BU} \). Indeed, the result is well-typed in the calculus of Chapter 3, with typings

\[
N_1 : 1 \rightarrow A, N_3 : A \rightarrow A \text{ and } N_9 : A \rightarrow q_5 \otimes q_3 \otimes q_5 \text{ for } A = q_3 \otimes q_4 \otimes q_5 \otimes q_1.
\]

Even a better allocation of operand queues cannot avoid that values such as \( k \) need to be inserted into the queues repeatedly. As a further drawback, queues which are used inside a loop become essentially unavailable for holding other values as \( i, j, k \) and \( n \) must be at the head of their queue(s) in each iteration. Consequently, any finite number of operand queues effectively limits the nesting depth of programs\(^1\).

It is hence to be expected that architectures with explicit forwarding will in addition contain a small register bank. In most cases, registers will contain values which are repeatedly accessed or whose range of liveness spans across many instructions or several loops. This fits well to the motivation for forwarding in conventional architectures: to provide a fast but less costly mechanism for communicating operands between instructions executed in close succession.

---

\(^1\)By inserting a chain of \( \text{id}^{(q)} q q \) instructions one can let a value reach the head of its queue while retaining the relative order of other entries, but this scheme has to be considered extremely inefficient.
This chapter consequently considers dynamic semantics for the full language, including registers. The earlier models for sequential, distributed and finite-queue execution and the type systems for eliminating runtime failure are reexamined. In most cases, the calculi for static and dynamic semantics arise as variations or extensions of the calculi presented in the previous chapters. As this correspondence carries over to many proofs of the core statements, the compositional nature of structured programming language formalisms and machine models is demonstrated, supporting our argument for programming language based system development.

5.1 Sequential execution

Motivated by similar reasons as ours, the explicit-forwarding architecture SCALP also contains a register bank. This takes the form of another functional unit, but the performance results given in [End96] indicate that this solution is unsatisfactory. Explicit instructions are needed for transferring values between the register bank and operand queues, thus adding to the pressure on the instruction memory and the decoding unit to provide the processor with instructions. Hence, increased latency and energy consumption are observed as well as low code density.

Consequently, we employ a more traditional register bank which is directly accessible by ordinary instructions. The architectural configuration is shown in Figure 5.1 where a global register bank is connected to the functional units by an operand bus.

This architectural choice is reflected in the structure of ALEF instructions where registers and operand queue names appear in the same position.

A program can use only a finite number of registers, and this set can be statically determined as indirect register addressing is not possible. Should the number of used registers (or operand queues) exceed the number of available registers, remaining values have to be spilled into memory using conventional techniques [App98a].

5.1.1 Dynamic semantics

The dynamic semantics is given by extending the operational model from Chapter 3. Configurations $C$ are equipped with a register file $R : R \rightarrow \text{Val}$, a finite map from reg-
isters $r$ to values $a$. For straight-line code, this results in configurations $C = (Q, R, M)$. Programs involving jump instructions use configurations of the form $C = (Q, R, M, \bar{n})$. The operational model (Figures 3.4 and 3.5 in Section 3.2) is extended such that relation $C \xrightarrow{t} D$ relates the above ternary configurations. This relation is obtained by

- replacing each rule of the form

$$
\begin{array}{c}
\text{CODE} \quad \begin{array}{c}
\mu_{1}, a_{1}, op_{1}
\end{array} \\
\begin{array}{c}
C_{1}
\end{array}
\begin{array}{c}
C_{2}
\end{array}
\begin{array}{c}
\ldots
\end{array}
\begin{array}{c}
C_{n-1}
\end{array}
\begin{array}{c}
\mu_{n-1}, a_{n}, op_{n-1}
\end{array} \\
\begin{array}{c}
C_{n}
\end{array}
\end{array}
\xrightarrow{\text{code}}
\begin{array}{c}
C_{1}
\end{array}
\begin{array}{c}
C_{n}
\end{array}
\end{array}
$$

of Figure 3.4 by the rule

$$
\begin{array}{c}
\text{CODE} \quad \begin{array}{c}
\mu_{1}, a_{1}, op_{1}
\end{array} \\
\begin{array}{c}
D_{1}
\end{array}
\begin{array}{c}
D_{2}
\end{array}
\begin{array}{c}
\ldots
\end{array}
\begin{array}{c}
D_{n-1}
\end{array}
\begin{array}{c}
\mu_{n-1}, a_{n}, op_{n-1}
\end{array} \\
\begin{array}{c}
D_{n}
\end{array}
\end{array}
\xrightarrow{\text{code}}
\begin{array}{c}
D_{1}
\end{array}
\begin{array}{c}
D_{n}
\end{array}
$$

where $D_{i} = (Q_{i}, R_{i}, M_{i})$ for $C_{i} = (Q_{i}, M_{i})$.

- replacing the rules $\mu$, INSTR and COMP by the rules $\mu$-Q, INSTR and COMP given in Figure 5.2. Although the latter two rules appear identical to their earlier counterparts, they are different as configurations contain a third component.

- renaming WR and RD to WR-Q and RD-Q, respectively, without changing their definition.
5.1. Sequential execution

![Diagram](Image)

Figure 5.2: Sequential execution with registers: dynamic semantics

- adding the four new rules RD-R, WR-R, µ-R1 and µ-R2 given in Figure 5.2.

Determinacy of execution is preserved as Proposition 1 carries over from Chapter 3.

**Proposition 20.** If $C \xrightarrow{t} D$ and $C \xrightarrow{t} E$ then $D = E$.

**Proof.** The rules RD-R and WR-R are easily seen to be deterministic, and determinacy of µ-R1 and µ-R2 follows similarly as the one for µ (now called µ-Q) in the proof of Proposition 1, and so do the claims for the rules CODE and the INSTR and the subsequent induction on the structure of $t$. □

In addition to deadlocks due to empty operand queues, execution also stalls if a register is accessed for which no entry exists in the register bank. In contrast, operand queue mismatches have no corresponding failure scenario for registers since registers are globally accessible.
5.1.2 Static semantics

Similar to Section 3.3, a type system can eliminate programs which would otherwise deadlock. For the same grammar of types as before, we treat products modulo the identity

\[ r \otimes r = r \]

for any register \( r \). This is motivated by the fact that registers can be repeatedly queried for their content (including zero times) but contain only one value. Commutativity and associativity of \( \otimes \) remains.

For \( A = r_1 \otimes \ldots \otimes r_m \otimes q_1 \otimes \ldots \otimes q_n \) we define \( rg(A) \) to be the register part \( r_1 \otimes \ldots \otimes r_m \) and \( q(A) \) to be the operand queue part \( q_1 \otimes \ldots \otimes q_n \). The above identity yields \( A \otimes rg(A) = A \) and \( rg(A) \otimes rg(A) = rg(rg(A)) = rg(A) \) for any \( A \).

The static semantics is given by axioms

\[
\text{R-AX}_{[n] code} : X \otimes A_{code} \rightarrow X \otimes B_{code} \otimes rg(A_{code}) \quad \text{SC(code)}
\]

where side conditions \( \gamma(r) = f u \) are always fulfilled and products \( A_{code} \) and \( B_{code} \) are as given in Table 3.1. Program fragments are combined using the same cut rule \text{CUT} as in Section 3.3.

The usage of \( rg(A_{code}) \) to the right of \( \rightarrow \) in R-AX prevents a register from being deleted. For example, in the program

\[
[1] \text{ldc } r_1 \quad [2] \text{dec } r_1 \quad q_1
\]

the value in \( r_1 \) is still available after the execution of the second instruction. This is captured in the derivation

\[
[1] \text{ldc } r_1 : [1] \rightarrow r_1 \quad [2] \text{dec } r_1 \quad q_1 : [1] \rightarrow r_1 \otimes q_1
\]

The same program can also be typed if a value is already present in \( r_1 \)

\[
[1] \text{ldc } r_1 : [1] \otimes r_1 \rightarrow r_1 \quad [2] \text{dec } r_1 \quad q_1 : [1] \rightarrow r_1 \otimes q_1
\]

using the identity \( r_1 \otimes r_1 = r_1 \). No typing is possible where the value in \( r_1 \) is deleted as any \( r \) appearing to the left of \( \rightarrow \) also appears to its right.
Lemma 5. If $t : A \rightarrow B$ then $B = \text{rg}(A) \otimes B$.

Proof. Induction on the structure of $t$. For all base cases, the rule R-AX guarantees that there is an $X$ such that $A = A_{\text{code}} \otimes X$ and $B = B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes X$, so $\text{rg}(A) = \text{rg}(A_{\text{code}}) \otimes \text{rg}(X)$ and thus

$$B \otimes \text{rg}(A) = B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes X \otimes \text{rg}(A_{\text{code}}) \otimes \text{rg}(X) = B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes X.$$

For $s : A \rightarrow C$ and $t : C \rightarrow B$

$$st : A \rightarrow B$$

$C = C \otimes \text{rg}(A)$ and $B = B \otimes \text{rg}(C)$ follow from the induction hypothesis. Consequently, $\text{rg}(C) = \text{rg}(C) \otimes \text{rg}(A) = \text{rg}(C) \otimes \text{rg}(A)$, so $B = B \otimes \text{rg}(C) \otimes \text{rg}(A) = B \otimes \text{rg}(A)$. 

For relating static and dynamic semantics, the definition of shape is adapted to reflect the modified notion of configurations.

Definition 16. The shape of a configuration $C = (Q, R, M)$ is

$$\text{shape}(C) = \bigotimes_{q \in Q} Q[q] \otimes \bigotimes_{r \in \text{dom } R} R[r].$$

We obtain soundness and completeness for sequential execution as in Theorem 1.

Theorem 4. Let $t$ be an instruction sequence.

1. If $t : A \rightarrow B$ and $\text{shape}(C) = A$ then there is a unique $D$ such that $C \xrightarrow{t} D$, and $\text{shape}(D) = B$ holds.

2. If $C \xrightarrow{t} D$ then $t : \text{shape}(C) \rightarrow \text{shape}(D)$.

Proof. 1. The proof proceeds similarly to the proof of Theorem 1. The sub-cases for rules WR-Q, RD-Q and $\mu$-Q transfer directly. For rules RD-R, WR-R, $\mu$-R1 and $\mu$-R2 they are as follows.

RD-R Claim 1: For arbitrary $a, fu$ and $r$, there are $b$ and $c$ with $a \xrightarrow{(\text{read}(fu), b, r)} c$. Furthermore, $a = b = c$ holds.

Proof. Follows directly from the definition of rule RD-R.

WR-R Claim 2: For arbitrary $w, a$ and $r$ there is a $b$ such that $w \xrightarrow{(\text{write}, a, r)} b$. Furthermore, $a = b$ holds.

Proof. Follows directly from the definition of rule WR-R.
\(\mu\text{-R1}\) There are two claims.

**Claim 3:** If \(\text{shape}(C) = r \otimes X\) and \(fu\) arbitrary, then there are unique \(D\) and \(a\) with \(C \xrightarrow{\text{read}(fu),a,r} D\). Furthermore, \(\text{shape}(D) = \text{shape}(C)\) holds.

**Proof:** For \(C = (Q,R,M)\) and \(\text{shape}(C) = r \otimes X\) we have \(r \in \text{dom } R\), so \(R(r) = a\) holds for some unique \(a\). By claim 1, there are \(b\) and \(c\) with \(R(r) \xrightarrow{\text{read}(fu),b,r} c\), and \(a = b = c\) holds. Hence, \(C \xrightarrow{\text{read}(fu),a,q} D\) is derivable for \(D = (Q,R[r \mapsto a],M)\) i.e. \(D = C\) by the uniqueness of \(D\) (Proposition 20). Consequently, \(\text{shape}(D) = \text{shape}(C)\) follows.

**Claim 4:** If \(\text{shape}(C) = r \otimes X\) and \(a\) arbitrary, then there is a unique \(D\) with \(C \xrightarrow{\text{write},a,r} D\). Furthermore, \(\text{shape}(D) = \text{shape}(C)\) holds.

**Proof:** For \(C = (Q,R,M)\) and \(\text{shape}(C) = r \otimes X\) we have \(r \in \text{dom } R\), so \(R(r) = b\) for some unique \(b\). Applying claim 2 yields a unique \(c\) with \(b \xrightarrow{\text{write},a,r} c\), and \(c = a\) holds, hence \(R(r) \xrightarrow{\text{write},a,r} a\) and \(C \xrightarrow{\text{write},a,r} D\) follow for \(D = (Q,R[r \mapsto a],M)\), and uniqueness of \(D\) follows from Proposition 20.

Furthermore, \(\text{shape}(D) = \text{shape}(C) \otimes r = r \otimes X \otimes r = r \otimes X = \text{shape}(C)\) follows by applying the definition of \(\text{shape}(D)\) and the identity \(r \otimes r = r\).

\(\mu\text{-R2}\) **Claim 5:** If \(\text{prime}(r,\text{shape}(C))\) and \(a\) arbitrary, then there is a unique \(D\) with \(C \xrightarrow{\text{write},a,r} D\). Furthermore, \(\text{shape}(D) = r \otimes \text{shape}(C)\) holds.

**Proof:** By Definition 16, \(\text{prime}(r,\text{shape}(C))\) implies \(r \notin \text{dom } R\). Applying claim 2 to \(w = \varepsilon\) yields a unique \(b\) such that \(\varepsilon \xrightarrow{\text{write},a,r} b\) holds, with \(a = b\). Consequently, rule \(\mu\text{-R2}\) yields \(C \xrightarrow{\text{write},a,r} D\) for \(D = (Q,R[r \mapsto a],M)\) where \(C = (Q,R,M)\). Uniqueness of \(D\) follows as in Proposition 20, and Definition 16 implies \(\text{shape}(D) = \text{shape}(C) \otimes r\) as required.

For the rules CODE, we prove that if \([n]\text{code} : A \rightarrow B\) and \(\text{shape}(C) = A\) hold, then there is a unique \(D\) with \(C \xrightarrow{t} D\), and additionally \(\text{shape}(D) = B\). We treat the case \(\text{dec } op_1 \ op_2\) explicitly.

**Case** \(\text{code} = \text{dec } op_1 \ op_2\). By the typing rule R-AX, there is an \(X\) such that \(\text{shape}(C) = A = op_1 \otimes X\) and \(B = op_2 \otimes X\), and \(X\) is unique modulo identities \(r \otimes r = r\). For deriving \(C \xrightarrow{\text{dec } op_1 \ op_2} D\) for some \(D\), we have to show
that there are $E$ and $a$ such that
\[
C \xrightarrow{\text{read}(\text{ALU}, a, op_1)} E \quad \text{and} \quad E \xrightarrow{\text{write}, a-1, op_2} D.
\]

In the case of $op_1 = q$ for some $q$, R-AX implies $\gamma(q) = FU(\text{dec}) = \text{ALU}$, so the claim for rule $\mu$-Q (claim 3 in the proof of Theorem 1) implies the (unique) existence of $E$ and $a$ with $C \xrightarrow{\text{read}(\text{ALU}, a, op_1)} E$, and $\text{shape}(E) = X$.

If $op_1 = r$ for some $r$, the property $A = op_1 \otimes X$, together with claim 3 above, implies the same existence of $E$ and $a$, with $\text{shape}(E) = \text{shape}(C) = r \otimes X$.

For showing $E \xrightarrow{\text{write}, a-1, op_2} D$, we apply claim 4 of the proof of Theorem 1 in the case of $op_2 = q'$ for some $q'$, and apply the above claims 4 or 5 in the case of $op_2 = r'$ for some $r'$, depending on whether $\text{prime}(r', \text{shape}(E))$ holds of not. In the former case, we obtain $\text{shape}(D) = \text{shape}(E) \otimes q' = B$, and in the latter case we obtain $\text{shape}(D) = \text{shape}(E) \otimes r' = B$. Uniqueness of $D$ follows in both cases from Proposition 20.

**Cases code $\neq \text{dec} op_1 op_2$.** Similar.

Finally, the induction on the structure of $t$ proceeds exactly as in the proof of Theorem 1.

2. We perform a structural induction similar to the proof of the second part of Theorem 1. The sub-cases for rules WR-Q, RD-Q and $\mu$-Q transfer directly. For rules RD-R, WR-R, $\mu$-R1 and $\mu$-R2 they are as follows.

**RD-R Claim 1:** If $w \xrightarrow{\text{read}(\text{fu}), a, r} v$ then $w = v = a$.

**Proof** Follows directly from the definition of rule RD-R.

**WR-R Claim 2:** If $w \xrightarrow{\text{write}, a, r} v$ then $v = a$.

**Proof** Follows directly from the definition of rule WR-R.

**$\mu$-R1 and $\mu$-R2 Claim 3:** If $C \xrightarrow{\text{read}(\text{fu}), a, r} D$ then $\text{shape}(C) = r \otimes \text{shape}(D)$.

**Proof** If the last rule in the derivation of $C \xrightarrow{\text{read}(\text{fu}), a, r} D$ was $\mu$-R1, then $D = (Q, R[r \mapsto b], M)$ must hold for some $b$ with $R(r) \xrightarrow{\text{read}(\text{fu}), a, r} b$ and
C = (Q, R, M), so \( r \in \text{dom} \ R \). By claim 1, \( R(r) = b = a \) follows, and the definition of \( \text{shape} \) implies

\[
\begin{align*}
\text{shape}(D) & = r \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R[r \mapsto b]} R[r \mapsto b] r' \\
& = r \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R} R[r \mapsto b] r' \\
& = \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R} R[r \mapsto b] r' = \text{shape}(C)
\end{align*}
\]

If the last rule in the derivation for \( C \xrightarrow{(\text{read} (fu), a, r)} D \) was \( \mu\text{-R2} \), then \( r \not\in \text{dom} \ R \), then \( e \xrightarrow{(\text{read} (fu), a, r)} b \) and \( D = (Q, R[r \mapsto b], M) \) must hold for some \( b \) where \( C = (Q, R, M) \). By claim 1, however, the second condition cannot be fulfilled, so the last rule for deriving \( C \xrightarrow{(\text{read} (fu), a, r)} D \) cannot have been \( \mu\text{-R2} \).

In particular, \( \text{prime}(r, \text{shape}(C)) \) implies that there are no \( D \) and \( a \) with \( C \xrightarrow{(\text{read} (fu), a, r)} D \).

**Claim 4:** If \( C \xrightarrow{(\text{write}, a, r)} D \) then \( \text{shape}(D) = r \otimes \text{shape}(C) \).

**Proof** No matter which of the two rules was used, a derivation with final sequent \( C \xrightarrow{(\text{write}, a, r)} D \) where \( C = (Q, R, M) \) forces \( D = (Q, R[r \mapsto b], M) \) for some \( b \), so the definition of \( \text{shape} \) implies

\[
\begin{align*}
\text{shape}(D) & = \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R[r \mapsto b]} R[r \mapsto b] r' \\
& = \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R} R[r \mapsto b] r' \\
& = r \otimes \otimes_{q \in \mathbb{Q} q}^{|Q(q)|} \otimes \otimes_{r' \in \text{dom} \ R} R[r \mapsto b] r' = r \otimes \text{shape}(C).
\end{align*}
\]

For the rules \text{CODE}, we prove that \( C \xrightarrow{\text{code}} D \) implies \( [n] \text{code} : \text{shape}(C) \rightarrow \text{shape}(D) \). Again, we treat only one case explicitly.

**Case** \text{code} = \text{dec op1 op2}. For \( C \xrightarrow{\text{dec op1 op2}} D \) to hold, there must be \( \text{E} \) and \( a \) with

\[
C \xrightarrow{(\text{read} (\text{ALU}, a, op1)} \text{E and E} \xrightarrow{(\text{write}, a \text{-1, op2)}} D
\]

If \( op_1 = q \), the claim for rule \( \mu\text{-Q} \) (claim 3 in the proof of Theorem 1) implies \( \text{shape}(C) = \text{shape}(\text{E}) \otimes q = \text{shape}(\text{E}) \otimes \text{op1} \) and \( \gamma(op_1) = \gamma(q) = \text{ALU} \).

If \( op_1 = r \), claim 3 above implies \( \text{shape}(C) = r \otimes \text{shape}(E) = \text{shape}(E) \otimes \text{op1} \), and \( \gamma(op_1) = \gamma(r) = \text{ALU} \) is vacuously true.
Furthermore, for $op_2 = q'$, the claim for write and rule $\mu$-Q (claim 4 in the proof of Theorem 1) implies $shape(D) = shape(E) \otimes q' = shape(E) \otimes op_2$, and for $op_2 = r'$, the above claim 4 implies $shape(D) = shape(E) \otimes r' = shape(E) \otimes op_2$.

With $X = shape(E)$, the typing axiom R-AX indeed derives $[n]dec op_1 op_2 : op_1 \otimes X \rightarrow op_2 \otimes X$, i.e. $[n]dec op_1 op_2 : shape(C) \rightarrow shape(D)$.

**Cases** $code \neq dec op_1 op_2$. Similar.

The induction on the structure of $t$ now proceeds exactly as in the proof of Theorem 1 (second part).

Consequently, well-typed programs cannot run into deadlocks (operand queue starvation or empty registers) or operand queue mismatch. In particular, every register is guaranteed to be initialised when it is accessed for the first time.

Similar to Section 3.4, the type system admits the derived rule

$$\frac{t : A \rightarrow B}{t : A \otimes X \rightarrow B \otimes X}$$

and has principal typings. These can be obtained by the typing rules

$$\frac{[n]code :: A \rightarrow B_{code} \otimes \text{rg}(A_{code})}{SC(code)}$$

and

$$\frac{s :: A \rightarrow B \quad t :: C \rightarrow D \quad B/C = V/U}{st :: A \otimes U \rightarrow V \otimes D}$$

where a modified definition of cancellation ensures that the identity $r \otimes r = r$ does not introduce additional products.

**Definition 17.** The pair $(V, U)$ is called the cancellation of $A$ and $B$, written $A/B = V/U$, if prime$(V, U)$ and prime$(\text{rg}(V), \text{rg}(B))$ hold and also prime$(\text{rg}(U), \text{rg}(A))$ and $U \otimes A = V \otimes B$.

The definition of principal types transfers literally from Chapter 3.

**Definition 18.** A typing $t : A \rightarrow B$ is principal if for every typing $t : C \rightarrow D$ there is an $X$ such that $C = A \otimes X$ and $D = B \otimes X$. 
Example. For \([1]\) `\text{ldc } 4 \ x_1` and \([2]\) `\text{add } x_1 \ x_2 \ x_3`\text{, rule P-R-AX yields}\

\[
[1] :: 1 \rightarrow x_1 \text{ and } [2] :: x_1 \times x_2 \rightarrow x_1 \times x_2 \times x_3.
\]

We have \((x_1 \times x_2) / (x_1 \times x_2) = 1 / x_2\), hence

\[
[1] [2] :: x_2 \rightarrow x_1 \times x_2 \times x_3.
\]

If the condition \(\text{prime}(\text{rg}(U), \text{rg}(A))\) was dropped in Definition 17, the identity \(x_1 = x_1 \times x_1\) would yield the alternative cancellation \((x_1 \times x_2) / (x_1 \times x_2) = 1 / x_1 \times x_2\) and thus the typing \([1] [2] :: x_2 \times x_1 \rightarrow x_1 \times x_2 \times x_3\). While this is indeed a valid typing for \([1] [2]\), it is not the principal typing.

\[\Diamond\]

**Proposition 21.** \(t :: A \rightarrow B\) iff \(t : A \rightarrow B\) is the principal typing for \(t\).

**Proof.** There are three parts.

\(t :: A \rightarrow B\) implies \(t : A \rightarrow B\). Structural induction on \(t\). For single instructions, the claim is trivial as the weakening \(X = 1\) can be used in the axioms for \(t : A \rightarrow B\).

For the cut rule where \(s :: A \rightarrow B\), \(t :: C \rightarrow D\) and \(st :: A \times U \rightarrow V \times D\) with \(B / C = V / U\), we have \(s :: A \rightarrow B\) and \(t :: C \rightarrow D\) by induction hypothesis, and the weakening rule implies \(s :: A \times U \rightarrow o B \times U\). The definition of cancellation yields \(B \times U = V \times C\), and by applying the weakening \(V\) to the typing for \(t\) we obtain \(t : V \times C \rightarrow V \times D\). Applying the cut rule yields \(st :: A \times U \rightarrow o V \times D\).

\(t :: A \rightarrow B\) implies \(t : A \rightarrow B\) is principal typing for \(t\). We show that for \(t :: A \rightarrow B\) and \(t :: C \rightarrow D\) there is an \(X\) such that \(C = A \times X\) and \(D = B \times X\), again by structural induction on \(t\).

For a single instruction, \(A = A_{\text{code}}\) and \(B = B_{\text{code}} \times \text{rg}(A_{\text{code}})\) hold, and the typing rule for \(t :: C \rightarrow D\) implies \(C = A_{\text{code}} \times Z\) and \(D = B_{\text{code}} \times \text{rg}(A_{\text{code}}) \times Z\) for some \(Z\). Take \(X = Z\).

For \(st :: A \rightarrow B\) and \(st :: C \rightarrow D\) there are by the typing rules products such that

\[
\frac{s :: E \rightarrow F \quad t :: G \rightarrow H}{st :: E \times U \rightarrow V \times H} \quad F / G = V / U
\]
and
\[
\begin{align*}
  s : C &\rightarrow I \\
  t : I &\rightarrow D \\
  st : C &\rightarrow D
\end{align*}
\]

where \(A = E \otimes U\) and \(B = V \otimes H\). Since \(s\) is part of \(st\) and \(s :: E \rightarrow F\) holds, the induction hypothesis guarantees that \(s : E \rightarrow F\) is the principal typing for \(s\), so for \(s : C \rightarrow I\) there is a \(Z\) with

\[
E \otimes Z = C \text{ and } F \otimes Z = I.
\]

Likewise, \(t : G \rightarrow H\) is principal, so for \(t : I \rightarrow D\) there is a \(Y\) with

\[
G \otimes Y = I \text{ and } H \otimes Y = D.
\]

Consequently, \(F \otimes Z = I = G \otimes Y\) and by the definition of cancellation \(F / G = V / U\) there is an \(M\) such that \(U \otimes M = Z\) and \(V \otimes M = Y\). Therefore,

\[
C = E \otimes Z = E \otimes U \otimes M = A \otimes M
\]

and

\[
D = H \otimes Y = H \otimes V \otimes M = B \otimes M.
\]

Take \(X = M\).

\(t : A \rightarrow B\) principal typing for \(t\) implies \(t :: A \rightarrow B\). If \(t\) is a single instruction, then \(A = A_{\text{code}}\) and \(B = B_{\text{code}} \otimes \text{rg}(A_{\text{code}})\) and the typing \(t :: A_{\text{code}} \rightarrow B_{\text{code}} \otimes \text{rg}(A_{\text{code}})\) is derivable.

For \(st : A \rightarrow B\) principal, the last rule was CUT, so there is a \(C\) with \(s : A \rightarrow C\) and \(t : C \rightarrow B\). Let \(s : F \rightarrow G\) and \(t : H \rightarrow I\) be the principal typings for \(s\) and \(t\), respectively. Since \(s\) and \(t\) occur inside \(st\) we can apply the induction hypothesis, so \(s :: F \rightarrow G\) and \(t :: H \rightarrow I\). By the principality of \(s : F \rightarrow G\) and \(t : H \rightarrow I\) there are \(Z\) and \(Y\) with

\[
A = F \otimes Z, G \otimes Z = C = H \otimes Y \text{ and } B = I \otimes Y.
\]

Hence, for \(G/H = V/U\) we obtain both

\[
st :: F \otimes U \rightarrow I \otimes V
\]
and \( U \otimes L = Z \) and \( V \otimes L = Y \) for some \( L \).

We have to show \( F \otimes U = A \) and \( I \otimes V = B \). By the first part of this proof, \( st :: F \otimes U \rightarrow I \otimes V \) implies \( st : F \otimes U \rightarrow I \otimes V \), so the principality of \( st : A \rightarrow B \) guarantees that there is an \( X \) such that

\[
A \otimes X = F \otimes U \quad \text{and} \quad B \otimes X = I \otimes V.
\]

On the other hand,

\[
A = F \otimes Z = F \otimes U \otimes L \quad \text{and} \quad B = I \otimes Y = I \otimes V \otimes L
\]

holds. By combining the equations, we have

\[
\begin{align*}
A &= L \otimes F \otimes U = L \otimes A \otimes X \\
B &= I \otimes V \otimes L = L \otimes B \otimes X
\end{align*}
\]

so \( X \) and \( L \) cannot contain any operand queue names as factors, i.e. it it \( X = \text{rg}(X) \) and \( L = \text{rg}(L) \). Hence

\[
\begin{align*}
F \otimes U &= A \otimes X = L \otimes A \otimes X \otimes \text{rg}(X) = L \otimes A \otimes X = A \\
I \otimes V &= B \otimes X = L \otimes B \otimes X \otimes \text{rg}(X) = L \otimes B \otimes X = B
\end{align*}
\]

as claimed.

\[\square\]

### 5.1.3 Program execution

Similarly to Section 3.5, execution of programs with registers and jump instructions is defined in terms of basic blocks. A program \( P = (\mathbb{N}, A) \) consists of a partial map \( \mathbb{N} : \mathbb{N} \rightarrow \text{Iseq} \), together with an association \( A \subset \text{dom} \mathbb{N} \times \text{dom} \mathbb{N} \) such that

- the labelling of instructions is unique
- the first instruction of \( \mathbb{N}(n) \) has label \( [n] \)
- at most the last instruction of \( \mathbb{N}(n) \) is a jump instruction
- \( A \) contains exactly those pairs \( (n,m) \) where the opcode of last instruction in \( \mathbb{N}(n) \) is either \( \text{jmp} \ m \) or \( \text{if op} \ n_1 \ n_2 \) with \( m \in \{n_1, n_2\} \).
5.1. Sequential execution

5.1.3.1 Dynamic semantics

The dynamic semantics is obtained by similar rules as in Section 3.5, modulo the updated notion of configurations \((Q, R, M, \bar{m})\) (see Figure 5.3) and the additional rules RD-R, WR-R, \(\mu\)-R1 and \(\mu\)-R2. Consequently, determinism of execution may be proven in a similar way as in Proposition 4.

5.1.3.2 Static semantics

Instantiating the axiom R-AX with the data in Table 3.2 yields the typing rules

\[
R-AX-IF \frac{\text{IF} \ op n_1 \ n_2}{\text{IF} h \ op n_1 \ n_2} \quad \gamma(op) = \text{BU}
\]

Figure 5.3: Sequential dynamic semantics including jumps and registers
and

\[ R-AX-JMP \quad [n] \text{jmp } m : X \rightarrow \circ X \]

where any side condition \( \gamma(r) = \emptyset \) is considered true.

The rule for composing basic blocks is generalised to

\[
\Sigma = \cup_{n \in \text{dom } \mathbb{N}} \{ n : A_n \rightarrow B_n \} \\
(n, m) \in A \text{ implies } \exists X \text{ s.t. } B_n = A_m \otimes X \text{ and } B_m \otimes X = B_m \\
\Sigma \vdash (N, A) : \circ
\]

such that \( \mathbb{N}(n) \) delivers at least all register-carried operands needed by (possibly weakened) \( \mathbb{N}(m) \), and additional registers must be amongst those promised by \( B_m \) to its successors. In particular, the additional items \( X \) fulfil \( X = \text{rg}(X) \) and occur in \( \text{rg}(B_m) \). For operand queues, the rule agrees with the rule from Section 3.5.2, but registers written to in \( \mathbb{N}(m) \) may or may not contain values before \( \mathbb{N}(m) \) is called. For example,

\[
\begin{align*}
N(1) &= [1] \text{ldc } 7 \ q_1 \ [2] \text{jmp } 3 \\
N(3) &= [3] \text{ldc } 9 \ r_1 \ [4] \text{add } q_1 \ r_1 \ q_1 \ [5] \text{jmp } 3
\end{align*}
\]

can be typed in context \( \Sigma = [1 \rightarrow 1 \rightarrow q_1, \ 3 \rightarrow q_1 \rightarrow r_1 \otimes q_1] \), where the arrow \( (N(1), N(3)) \) uses \( X = 1 \) and the arrow \( (N(3), N(3)) \) uses \( X = r_1 \). The earlier rule

\[
\Sigma = \cup_{n \in \text{dom } \mathbb{N}} \{ n : A_n \rightarrow B_n \} \\
(n, m) \in A \text{ implies } B_n = A_m \\
\Sigma \vdash (N, A) : \circ
\]

would have required to weaken the typings to

\[
\Sigma' = [1 \rightarrow r_1 \rightarrow r_1 \otimes q_1, \ 3 \rightarrow r_1 \otimes q_1 \rightarrow r_1 \otimes q_1],
\]

forcing an initial configuration to contain an element in \( r_1 \).

As before, the \( A_i \) and \( B_i \) are obtained from typings of the bodies by weakenings.

Proposition 5 carries over from Chapter 3 with similar modifications regarding the shapes of configurations as in the previous section:

**Proposition 22.** Let \( \Sigma \vdash p : \circ \), and for \( C = (Q, R, M, n) \) let \( n : A \rightarrow B \) be the entry for \( n \) in \( \Sigma \). Let shape \((C) = A \otimes X \) with \( B \otimes X = B \) for some \( X \). Then there is a unique \( D \) such that \( C \xrightarrow{\mu} D \), and shape \((D) = B \) holds. Furthermore, if \( D = (P, S, L, m), m \neq \text{nil} \) and \( m : C \rightarrow D \) is the entry for \( m \) in \( \Sigma \), then there is a \( Y \) such that \( B = C \otimes Y \) and \( D \otimes Y = D \).
5.1. Sequential execution

Definition 19. Let \( N : \mathbb{N} \rightarrow \text{Iseq} \) be a partial map, \( \text{dom} \ N = \{ n_1, \ldots , n_k \} \), \( \Lambda \subseteq \text{dom} \ N \times \text{dom} \ N \) and for \( i \leq k \) let \( N(n_i) : A_i \rightarrow B_i \). A tuple \( \bar{X} = (X_1, \ldots , X_k) \) unifies \( N \) for \( \Lambda \) if \( (n_i, n_j) \in \Lambda \) implies \( B_i \otimes X_i = A_j \otimes X_j \otimes Z_i^j \) for some \( Z_i^j \) with \( B_j \otimes X_j \otimes Z_i^j = B_j \otimes X_j \).

The equivalent to Proposition 6 is the following.

Proposition 23. 1. If \( \Sigma = \bigcup_{n_i \in \text{dom} \ N} N(n_i) : A_i \otimes X_i \rightarrow B_i \otimes X_i \) and \( \bar{X} \) is a unifier for \( N \) and \( \Lambda \) then \( \Sigma \vdash (N, \Lambda) : \circ \).

2. If \( \Sigma \vdash (N, \Lambda) : \circ \) and \( \Sigma = \bigcup_{n_i \in \text{dom} \ N} N(n_i) : A_i \rightarrow B_i \) then \( \bar{X} \) unifies \( N \) for \( \Lambda \).

Proof. 1. For \( \Sigma = \bigcup_{n_i \in \text{dom} \ N} N(n_i) : A_i \otimes X_i \rightarrow B_i \otimes X_i \) and \( (n_i, n_j) \in \Lambda \) take \( X = Z_i^j \).

Then \( B_i \otimes X_i = A_j \otimes X_j \otimes Z_i^j = A_j \otimes X_j \otimes X \) and \( B_j \otimes X_j \otimes X = B_j \otimes X_j \otimes Z_i^j = B_j \otimes X_j \) since \( \bar{X} \) unifies \( N \) for \( \Lambda \). Consequently, all conditions in the rule for \( \Sigma \vdash (N, \Lambda) : \circ \) are fulfilled.

2. The rule for \( \Sigma \vdash (N, \Lambda) : \circ \) implies \( \Sigma = \bigcup_{n_i \in \text{dom} \ N} N(n_i) : A_i \rightarrow B_i, \) and \( B_i = A_j \otimes X \) holds for some \( X \) with \( B_j \otimes X = B_j \) whenever \( (n_i, n_j) \in \Lambda \). Consequently for
Again, minimal unifiers are of particular interest.

**Definition 20.** A unifier $\vec{X}$ for $\mathbb{N}$ and $A$ is minimal if for any unifier $\vec{Y}$ for $\mathbb{N}$ and $A$ there is a $\vec{Z}$ such that $\vec{X} \otimes \vec{Z} = \vec{Y}$.

In particular, the register part of the minimal unifier $\vec{X}$ occurs in any unifier $\vec{Y}$, i.e. $Y_l \otimes \text{rg}(X_l) = Y_l$ holds in every component $l$.

In contrast to the case for operand queues, a unification is always possible, based on the following iterative approach. For ease of notation, we concentrate on exclusive register usage, i.e. require that all typings $t : A \rightarrow B$ fulfill $A = \text{rg}(A)$ and $B = \text{rg}(B)$.

**Theorem 5.** Let $\vec{X}$ be the minimal unifier for $\mathbb{N}$ and $A$, and for all $n \in \text{dom } \mathbb{N}$ let $\mathbb{N}(n) : A_n \rightarrow B_n$ with $A_n = \text{rg}(A_n)$ and $B_n = \text{rg}(B_n)$. Let $m, n \in \mathbb{N}$ and $a = (m, n) \notin A$. Define $S = \{l \mid (l, m) \in A^*\}$ and $T = \{l \mid (n, l) \in A^*\}$

and let $(X_m \otimes B_m)/(A_n \otimes X_n) = V/U$ and $V/B_n = W/Y$. For each factor $r$ of $U$ let $S_r$ be the least subset of $S$ with

- $m \in S_r$
- if $k \in S_r$, $(l, k) \in A$ and prime$(r, B_l \otimes X_l)$ then $l \in S_r$

and for each factor $r$ of $W$ let $T_r$ be the least subset of $T$ with

- $n \in T_r$
- if $k \in T_r$, $(l, k) \in A$ and prime$(r, B_l \otimes X_l)$ then $l \in T_r$
- if $k \in T_r$, $(k, l) \in A$ and prime$(r, B_l \otimes X_l)$ then $l \in T_r$

Then for $Y_l = X_l \otimes \otimes_{l \in S, r} \otimes \otimes_{l \in T, r}$, the vector $\vec{Y}$ is the minimal unifier for $\mathbb{N}$ and $A \cup \{a\}$.
5.1. Sequential execution

Proof. Show that $\bar{Y}$ is unifier for $\mathbb{N}$ and $\mathbb{A} \cup \{a\}$. Let $(i, j) \in \mathbb{A} \cup \{a\}$. We show that there is a $Z$ such that $B_i \otimes Y_i = A_j \otimes Y_j \otimes Z$ and $B_j \otimes Y_j \otimes Z = B_j \otimes Y_j$.

Case $(i, j) = a = (m, n)$ and $n \neq m$. By the definitions in the theorem, $m \in S_r$ holds for each factor $r$ of $U$ and $n \in T_r$ holds for each factor $r$ of $W$. Therefore,

$$Y_m = X_m \otimes U \otimes W' \text{ and } Y_n = X_n \otimes W \otimes U'$$

holds for some factors $U' = \otimes_{n \in S_r} r$ of $U$ and $W' = \otimes_{m \in T_r} r$ of $W$.

We first show $W' = 1$. Each factor $r$ of $W'$ fulfils $m \in T_r$, so by $n \neq m$, there must be a $k \in T_r$ such that $\text{prime}(r, B_m \otimes X_m)$ and $(k, m) \in \mathbb{A}$ or $(m, k) \in \mathbb{A}$. On the other hand, since $r$ is a factor of $W'$, $r$ is a factor of $W$, thus of $V$ and then of $B_m \otimes X_m$. Contradiction, so $W'$ has no factors $r$, hence $Y_m = X_m \otimes U$.

Similarly, each factor $r$ of $U'$ fulfils $n \in S_r$, so there must be a $k \in S_r$ with $(k, n) \in \mathbb{A}$ or $(n, k) \in \mathbb{A}$ and $\text{prime}(r, B_n \otimes X_n)$. On the other hand, $r$ is a factor of $U'$, hence of $U$. Since $U$ occurs in $A_n \otimes X_n$ and $A_n$ in $B_n$ by Lemma 5, $r$ occurs in $B_n \otimes X_n$. Contradiction, so $U'$ has no factors $r$, hence $Y_n = X_n \otimes W$.

Let $Z$ be the common part of $V$ and $B_n$, i.e. $V / B_n = Z \; W / Y$. Then

$$B_m \otimes Y_m = B_m \otimes X_m \otimes U = A_n \otimes X_n \otimes V = A_n \otimes X_n \otimes Z \otimes W = A_n \otimes Y_n \otimes Z$$

and observing $Z = \text{rg}(Z)$ as $Z$ occurs in $B_n$ yields $B_n \otimes Y_n = B_n \otimes Y_n \otimes Z$.

Case $(i, j) = a = (m, n)$ and $n = m$. Since $X_n = X_m$ occurs both in $X_m \otimes B_m$ and $X_n \otimes A_n$ and $A_n = \text{rg}(A_n)$ occurs in $\text{rg}(B_n) = \text{rg}(B_m)$ Lemma 5 implies $U = 1$. For the same reason, $V$ is a factor of $B_m$, hence $W = 1$ and $Y_n = Y_m = X_n$.

Consequently,

$$B_m \otimes Y_m = B_m \otimes X_m = B_m \otimes X_m \otimes U = A_n \otimes X_n \otimes V = A_n \otimes Y_n \otimes V$$

so we take $Z = V$. Then $B_m \otimes Y_m \otimes Z = B_m \otimes Y_m \otimes V = B_m \otimes Y_m$ as $V = \text{rg}(V)$ occurs in $B_n$.

Case $n = j \neq i \neq m$. With the same argument as in the fist case, $U' = 1$ follows, so $Y_n = X_n \otimes W$. Since $\bar{X}$ unifies $N$ for $\mathbb{A}$ and $(i, j) \neq a$ we have

$$B_i \otimes Y_i = B_i \otimes X_i \otimes i \in S_r, r \otimes \otimes i \in T_r, r = A_j \otimes X_j \otimes Z \otimes \otimes i \in S_r, r \otimes \otimes i \in T_r, r$$
where \(B_j \otimes X_j \otimes Z = B_j \otimes X_j\).

For any \(r\) with \(i \in S_r\), \(\text{prime}(r, B_i \otimes X_i)\) holds due to \(m \neq i\). Also such an \(r\) is a factor of \(U\), so of \(A_n \otimes X_j = A_j \otimes X_j\), hence

\[
A_j \otimes X_j \otimes \otimes_{i \in S_r} r = A_j \otimes X_j.
\]

Any \(r\) with \(i \in T_r\) is a factor of \(W\) and \(i \neq j = n\) implies \(\text{prime}(r, B_i \otimes X_i)\), so \(W\) can be written as \(W = W_a \otimes W_b\) where

\[
W_a = \otimes_{i \in T_r} r \text{ factor of } Wr \text{ and } W_b = \otimes_{i \notin T_r} r \text{ factor of } W r.
\]

A factor \(r\) of \(W\) with \(i \notin T_r\) occurs in \(B_i \otimes X_i\) since \(n \in T_r\) and \((i, j) \in A\) holds due to \(i \neq m\). Hence

\[
B_i \otimes X_i = B_i \otimes X_i \otimes W_b.
\]

Combining the equations and taking \(Z = Z\) we obtain

\[
B_i \otimes Y_i = B_i \otimes X_i \otimes \otimes_{i \in S_r} \otimes_{i \in T_r} r = B_i \otimes X_i \otimes \otimes_{i \in S_r} r \otimes W_a
\]

\[
= B_i \otimes X_i \otimes W_b \otimes \otimes_{i \in S_r} r \otimes W_a = A_j \otimes X_j \otimes Z \otimes W \otimes \otimes_{i \in S_r} r
\]

\[
= A_j \otimes X_j \otimes Z \otimes W = A_j \otimes Y_j \otimes Z
\]

\[
= A_j \otimes Y_j \otimes Z
\]

and \(B_j \otimes Y_j \otimes Z = B_j \otimes X_j \otimes W \otimes Z = B_j \otimes X_j \otimes W = B_j \otimes Y_j\).

**Case \(i = j\).** For \(B_i \otimes X_i = A_j \otimes X_j \otimes Z\) with \(B_j \otimes X_j \otimes Z = B_j \otimes X_j\) we obtain

\[
B_i \otimes Y_i = B_i \otimes X_i \otimes \otimes_{i \in S_r} \otimes_{i \in T_r} r = A_j \otimes X_j \otimes Z \otimes \otimes_{i \in S_r} r \otimes \otimes_{i \in T_r} r
\]

\[
= A_j \otimes X_j \otimes Z \otimes \otimes_{j \in S_r} r \otimes \otimes_{j \in T_r} r = A_j \otimes Y_j \otimes Z
\]

so take \(Z = Z\). Then

\[
B_j \otimes Y_j \otimes Z = B_j \otimes X_j \otimes \otimes_{j \in S_r} \otimes_{j \in T_r} r \otimes Z
\]

\[
= B_j \otimes X_j \otimes \otimes_{j \in S_r} \otimes_{j \in T_r} r = B_j \otimes Y_j
\]

**Case \(i \neq j \neq n\).** Any \(r\) with \(j \in T_r\) fulfils

- \(r\) is a factor of \(W\)
5.1. Sequential execution

- \( \text{prime}(r, B_j \otimes X_j) \) since \( j \neq n \), hence \( \text{prime}(r, A_j \otimes X_j) \)
- \( i \in T_r \) or \( r \) a factor of \( B_i \otimes X_i \) since \( (i, j) \in \Lambda \).

and any \( i \in T_r \) fulfills

- \( j \in T_r \) or \( r \) a factor of \( B_j \otimes X_j \) and
- \( r \) occurs in \( W \).

Consequently, any \( r \) with \( i \in T_r \) but \( j \notin T_r \) occurs in \( B_j \otimes X_j \) and any \( r \) with \( j \in T_r \) but \( i \notin T_r \) occurs in \( B_i \otimes X_i \).

Hence,

\[
B_j \otimes X_j = B_j \otimes X_j \otimes \bigotimes_{i \in T_r, j \notin T_r} r
\]  
(5.3)
\[
B_i \otimes X_i = B_i \otimes X_i \otimes \bigotimes_{j \in T_r, i \notin T_r} r
\]  
(5.4)

Similarly, any \( r \) with \( i \in S_r \) occurs in \( U \) and either fulfills \( j \in S_r \) or occurs in \( B_j \otimes X_j \) and any \( r \) with \( j \in S_r \) is in \( U \) and either fulfills \( i \in S_r \) or occurs in \( B_i \otimes X_i \). Hence,

\[
B_i \otimes X_i = B_i \otimes X_i \otimes \bigotimes_{j \in S_r, i \notin S_r} r
\]  
(5.5)
\[
B_j \otimes X_j = B_j \otimes X_j \otimes \bigotimes_{i \in S_r, j \notin S_r} r
\]  
(5.6)

Hence, for \( B_i \otimes X_i = A_j \otimes X_j \otimes \overline{Z} \) and \( B_j \otimes X_j \otimes \overline{Z} = B_j \otimes X_j \) we obtain

\[
B_i \otimes Y_i = B_i \otimes X_i \otimes \bigotimes_{i \in S_r, r \otimes \bigotimes_{i \in T_r}} r
\]
\[
= B_i \otimes X_i \otimes \bigotimes_{i \in S_r, j \in S_r} r \otimes \bigotimes_{i \in S_r, j \notin S_r} r \otimes \bigotimes_{i \in T_r, j \in T_r} r \otimes \bigotimes_{i \in T_r, j \notin T_r} r
\]
\[
= B_i \otimes X_i \otimes \bigotimes_{i \in S_r, j \notin S_r} r \otimes \bigotimes_{i \in S_r, j \notin S_r} r \otimes \bigotimes_{i \in T_r, j \notin T_r} r
\]
\[
= A_j \otimes X_j \otimes \overline{Z} \otimes \bigotimes_{i \in T_r, j \notin T_r} r \otimes \bigotimes_{i \in S_r, j \notin S_r} r
\]
\[
= A_j \otimes X_j \otimes \overline{Z} \otimes \bigotimes_{i \in T_r, j \notin T_r} r \otimes \bigotimes_{i \in S_r, j \notin S_r} r
\]
\[
= A_j \otimes Y_j \otimes \overline{Z} \otimes \bigotimes_{i \in S_r, j \notin S_r} r \otimes \bigotimes_{i \in T_r, j \notin T_r} r
where the third equality holds by (5.5), the fourth by (5.4) and the other equalities result from reordering indices and applying the definitions of \( Y_i, Y_j \) and \( Z \). For \( Z = \mathbb{Z} \otimes \otimes_{i \in S_r, j \notin T_r} r \otimes \otimes_{i \in T_r, j \notin T_r} r \) we consequently obtain \( B_i \otimes Y_i = A_j \otimes Y_j \otimes Z \) and

\[
B_j \otimes Y_j \otimes Z = B_j \otimes X_j \otimes \otimes_{j \notin T_r} r \otimes \otimes_{j \in S_r} r \otimes \bar{Z} \otimes \otimes_{i \in S_r, j \notin S_r} r \\
= B_j \otimes X_j \otimes \otimes_{j \notin T_r} r \otimes \otimes_{j \in S_r} r \otimes \bar{Z} \\
= B_j \otimes X_j \otimes \otimes_{j \notin T_r} r \otimes \otimes_{j \in S_r} r \\
= B_j \otimes Y_j
\]

where the second equality holds due to (5.3) and (5.6) and the other equalities follow from the definition of \( Y_j \) and \( Z \).

**Show that \( Y \) is minimal.** We show that for the minimal unifier \( \bar{Z} \) of \( \mathbb{A} \cup \{a\} \) the equation

\[
\bar{Z} \otimes \text{rg}(Y) = \bar{Z}
\]

holds, i.e. that all \( l \) fulfil \( Z_l \otimes \text{rg}(Y_l) = Z_l \). The claim follows from proving that each factor \( r \) of \( Y_l \) occurs in \( Z_l \), so let \( r \) occur in \( Y_l \). By the definition of \( Y_l \), \( r \) occurs in \( X_l \) or \( l \in S_r \) or \( l \in T_r \) holds.

In the first case, we know that \( \bar{Z} \) unifies \( \mathbb{N} \) also for \( \mathbb{A} \), but \( \bar{X} \) is the minimal unifier for \( \mathbb{N} \) and \( \mathbb{A} \), so \( Z_l \otimes \text{rg}(X_l) = Z_l \) and \( r \) consequently occurs in \( Z_l \).

The other two cases are proven inductively, according to the definition of \( S_r \) and \( T_r \).

**Claim 1: Let \( l \in S_r \) and \( r \) occur in \( Y_l \). Then \( r \) is a factor of \( Z_l \).** The base case is \( m = l \). For \( m \in S_r \), \( r \) is a factor of \( U \), hence of \( A_n \otimes X_n \), and the definition of cancellation (Definition 17) implies \( \text{prime}(r, B_m \otimes X_m) \). As \( \bar{Z} \) unifies \( \mathbb{A} \cup \{a\} \),

\[
B_m \otimes Z_m = A_n \otimes Z_n \otimes Z \text{ for some } Z \text{ with } B_n \otimes Z_n \otimes Z = B_n \otimes Z_n
\]
holds. Since $\bar{Z}$ unifies $A$, $Z_m \otimes \text{rg}(X_m) = Z_m$ and $Z_n \otimes \text{rg}(X_n) = Z_n$ hold, hence

$$B_m \otimes Z_m \otimes \text{rg}(X_m) = A_n \otimes Z_n \otimes \text{rg}(X_n) \otimes Z.$$

Since $r$ is a factor of $A_n \otimes X_n$, $r$ is also a factor of $B_m \otimes Z_m \otimes \text{rg}(X_m)$, so $\text{prime}(r, B_m \otimes X_m)$ implies that $r$ occurs in $Z_m$.

Any $l \neq m$ with $l \in S_r$ due to $(l, k) \in A$ for some $k \in S_r$ and $\text{prime}(r, B_l \otimes X_l)$ fulfills

$$B_l \otimes Z_l = A_k \otimes Z_k \otimes Z \text{ for some } Z \text{ with } B_k \otimes Z_k \otimes Z = B_k \otimes Z_k$$

since $\bar{Z}$ unifies $\mathbb{A}$ for $\mathbb{A} \cup \{a\}$. The definition of $\bar{Y}$ implies that $r$ occurs in $Y_k$ since $k \in S_r$. By induction hypothesis, $r$ is thus a factor of $Z_k$, hence of $B_l \otimes Z_l$. Now $\text{prime}(r, B_l \otimes X_l)$ implies $\text{prime}(r, B_l)$, so $r$ must occur in $Z_l$.

Any $l \neq m$, $l \in S_r$ due to $(k, l) \in A$ for some $k \in S_r$ and $\text{prime}(r, B_l \otimes X_l)$ fulfills

$$B_k \otimes Z_k = A_l \otimes Z_l \otimes Z \text{ for some } Z \text{ with } B_l \otimes Z_l \otimes Z = B_l \otimes Z_l$$

since $\bar{Z}$ unifies $\mathbb{A}$ for $\mathbb{A} \cup \{a\}$. Again, $Z_l = Z_l \otimes \text{rg}(X_l)$ holds as $\bar{Z}$ unifies $\mathbb{A}$, and $Z_k = Z_k \otimes \text{rg}(Y_k)$ holds by induction hypothesis, and $k \in S_r$ implies that $r$ occurs in $\text{rg}(Y_k)$ by the definition of $\bar{Y}$. Thus, $r$ is a factor of

$$B_k \otimes Z_k \otimes \text{rg}(Y_k) = B_k \otimes Z_k = A_l \otimes Z_l \otimes Z = A_l \otimes Z_l \otimes \text{rg}(X_l) \otimes Z.$$

On the other hand, $r$ cannot occur in $A_l \otimes X_l$ as $B_l = B_l \otimes \text{rg}(A_l)$ holds by Lemma 5 and $\text{prime}(r, B_l \otimes X_l)$. So $r$ must be a factor of $Z \otimes Z_l$ and hence of $Z_l$ as $B_l \otimes Z_l \otimes Z = B_l \otimes Z_l$ holds and $\text{prime}(r, B_l)$.

**Claim 2:** Let $l \in T_r$ and $r$ occur in $Y_l$. Then $r$ is a factor of $Z_l$. The base case is $n = l$. For $n \in T_r$, $r$ is a factor of $W$, hence of $V$ and $B_m \otimes X_m$, and the definition of cancellation (Definition 17) implies $\text{prime}(r, A_n \otimes X_n)$ and $\text{prime}(r, B_n)$. As $\bar{Z}$ unifies $\mathbb{A} \cup \{a\}$,

$$B_m \otimes Z_m = A_n \otimes Z_n \otimes Z \text{ for some } Z \text{ with } B_n \otimes Z_n \otimes Z = B_n \otimes Z_n$$
holds. Since $\tilde{Z}$ unifies $\lambda$, $Z_m \otimes \text{rg}(X_m) = Z_m$ and $Z_n \otimes \text{rg}(X_n) = Z_n$ follow, hence

$$B_m \otimes Z_m \otimes \text{rg}(X_m) = A_n \otimes Z_n \otimes \text{rg}(X_n) \otimes Z.$$ 

Since $r$ is a factor of $B_m \otimes X_m$, $r$ is thus a factor of $A_n \otimes Z_n \otimes \text{rg}(X_n) \otimes Z$.

Now $\text{prime}(r, B_n)$ and $\text{prime}(r, A_n \otimes X_n)$ imply that $r$ must occur in $Z_n \otimes Z$ and hence in $Z$ since $B_n \otimes Z_n \otimes Z = B_n \otimes Z_n$ and $\text{prime}(r, B_n)$.

The two step cases are proven similarly to the first claim.

Type inference for programs involving registers and operand queues proceeds by combining the approaches described in Theorems 2 and 5. First, type inference is attempted with respect to the operand queues, but might fail. In case of success, type inference is carried out for the register components, and will always succeed.

\section{5.2 Distributed execution}

\subsection{5.2.1 Dynamic semantics}

The computational model for distributed execution involving registers mirrors the approach described in Chapter 4. Execution in the processor is highly unsupervised and a type system ensures deterministic execution. Distributed programs $\pi$ are parallelly composed sequential programs, and the operational model is defined as interleaved execution by embedding the sequential rules from the previous section via the rules

\begin{align*}
R\text{-FU}_1 & \quad \frac{C \vdash [n]\text{code} \rightarrow D}{FU(\text{code}) = fu} \quad R\text{-FU}_2 \quad \frac{C \vdash \frac{x}{fu} \rightarrow D}{C \vdash \frac{x}{fu} \rightarrow D} \\
R\text{-DIS}_1 & \quad \frac{C \vdash \frac{\pi_1}{D}}{C \vdash \frac{\pi_1}{D}} \quad R\text{-DIS}_2 \quad \frac{C \vdash \frac{\pi_1}{D}}{C \vdash \frac{\pi_1 \pi_2}{D}}
\end{align*}

As before, instructions are inserted into instruction queues according to the functional unit they execute in. Consequently, the relationship between sequential and distributed execution is as follows.

\textbf{Proposition 24.} 1. If $C \vdash D$ then $C \vdash \frac{\pi}{D}$. 


5.2. Distributed execution

2. If $C \xrightarrow{\pi} D$ then there is a $t$ such that $\pi_t = \pi$ and $C \xrightarrow{t} D$.

3. If $C \xrightarrow{\pi} D$, $t = \text{ins}_1 \ldots \text{ins}_n$ and $\text{ins}_i = [n_i]\text{code}_i$ then $(\otimes_{i=1}^{n} \text{code}_i) \otimes \text{shape}(D) = (\otimes_{i=1}^{n} B_{\text{code}_i}) \otimes \text{shape}(C)$.

Proof. The first two parts are proven exactly as those of Proposition 7. The third part proceeds similarly to the proof of Proposition 7, too, but in the base case of the induction (rule INSTR), $\text{ins}_1 : \text{shape}(C) \rightarrow \pi \text{shape}(D)$ holds by Theorem 4, and rule R-AX yields $\text{shape}(C) = A_{\text{code}_1} \otimes X$ and $\text{shape}(D) = B_{\text{code}_1} \otimes X \otimes \text{rg}(A_{\text{code}_1})$ for some $X$. We obtain

\[
A_{\text{code}_1} \otimes \text{shape}(D) = A_{\text{code}_1} \otimes B_{\text{code}_1} \otimes X \otimes \text{rg}(A_{\text{code}_1}) \]

\[
= A_{\text{code}_1} \otimes B_{\text{code}_1} \otimes X = B_{\text{code}_1} \otimes \text{shape}(C)
\]

as claimed, using the identity $r \otimes r = r$. The step case (rule COMP) is again proven as before.

Likewise, soundness of the sequential static semantics with respect to distributed execution follows as in Chapter 4.

Proposition 25. If $t : A \rightarrow B$ and $\text{shape}(C) = A$ then there is a $D$ such that $C \xrightarrow{\pi_D} D$ and $\text{shape}(D) = B$.

Proof. Exactly as the proof of Proposition 8.

Again, completeness is lost, with additional non-determinism resulting from registers. In fact, the following three types of race conditions may occur for instructions $[n]\text{code}_n$ and $[m]\text{code}_m$ if $[n]$ occurs before $[m]$ in $t$ but $\text{FU}(\text{code}_n) \neq \text{FU}(\text{code}_m)$.

**output dependence:** also known under the name `write-after-write`, this conflict occurs if $[n]$ and $[m]$ compete for writing to the same location. In addition to the analysis in Chapter 4 we now have to include registers, as the program $t_{\text{RAW}} = [1]\text{inc} \text{q}_1 \text{r}_1 [2]\text{mul} \text{q}_2 \text{q}_3 \text{r}_1$ shows.

**true dependence:** (read-after-write) occurs if $[m]$ might read from a location before $[n]$ has written the intended operand to that location. For example, the program $t_{\text{RAW}} = [1]\text{ldc} 2 \text{r}_1 [2]\text{inc} \text{r}_1 \text{r}_1 [3]\text{mul} \text{r}_1 \text{r}_1 \text{q}_1$ contains a race condition between the second and the third instruction.
anti-dependence: (write-after-read) occurs if \([m]\) might write to a location before \([n]\) has read the previous content. In \(t_{\text{WAR}} = [1] ldc 2 r_1 [2] inc r_1 q_1 [3] ldc 3 r_1\) for example, a race occurs involving the second and the third instruction.

The fourth possibility where \([n]\) and \([m]\) read from the same register in the order \([m]\) before \([n]\) is uncritical as they access the same value. Remember that such read-after-read conflicts cannot occur for operand queues due to the sequential order of instructions in instruction queues.

All example programs are well-typed, with principal types

\[
\begin{align*}
\tau_{\text{WAR}} &: q_1 \otimes q_2 \otimes q_3 \rightarrow r_1 \\
\tau_{\text{RAW}} &: 1 \rightarrow r_1 \otimes q_2 \\
\tau_{\text{WAR}} &: 1 \rightarrow r_1 \otimes q_2
\end{align*}
\]

Again, hardware solutions can be introduced for discovering race hazards at runtime and delay the execution of particular instructions. In fact, modern processors achieve similar tasks by register renaming and register locking [HP96].

**Definition 21.** A distributed program \(\pi\) is deterministic if \(C \xrightarrow{\pi} D\) and \(C \xrightarrow{\pi} E\) implies \(D = E\).

With the same motivation as before, we first explore the sources of non-determinism before choosing a solution for eliminating race hazards.

### 5.2.2 Determinism, safety and completeness

For the same definitions of determinism, safety and completeness of programs \(t\), the results of Section 4.2 transfer literally. In the proofs, the references to Propositions 7 and Theorem 1 are replaced by invoking Proposition 24 and Theorem 4 respectively.

### 5.2.3 A static semantics for determinism

Non-determinism may be dealt with by various means, similarly to the discussion in Chapter 4. We again outline how the static semantics can be extended to detect race...
conditions, extending the approach from Section 4.3. We equip judgements $t : A \rightarrow B$ with contexts

$$\Gamma = (\Gamma^W, \Gamma^R)$$

and sequentiality constraints $<$. Both components of $\Gamma$ contain entries of the form $op : (l, k)$, where the $l$ and $k \in \mathbb{N}$ denote (respectively) the first and last writer to $op$ in case of $\Gamma^W$ and the first and last reader from $op$ in case of $\Gamma^R$. For $\Gamma^R$ it suffices to consider only entries for $op$ of the form $r$. As before, $<$ is a partial order over $\mathbb{N}$.

The typing axiom and the cut rule are modified to

$$\frac{\text{ins} : A \rightarrow B}{(\Gamma_{\text{ins}}, \emptyset) \vdash \text{ins} : A \rightarrow B}$$

$$\frac{(\Delta, <) \vdash t : B \rightarrow C}{(\Gamma \# \Delta, <) \vdash st : A \rightarrow C}$$

where $\#$ is defined by applying $\#$ from Section 4.3.1 in both components and $\Gamma_{\text{ins}} = (\Gamma^W_{\text{ins}}, \Gamma^R_{\text{ins}})$ and $<$ are given by

$$\Gamma^W_{[n]code} = \{ op : (n, n) \mid \exists B, B_{\text{code}} = op \otimes B \}$$

$$\Gamma^R_{[n]code} = \{ r : (n, n) \mid \text{rg}(A_{\text{code}}) = r \otimes \text{rg}(A_{\text{code}}) \}$$

$$< = (\emptyset \cup < \cup \{ (k, g) \mid op : (l, k) \in \Gamma^W, op : (g, h) \in \Delta^W \} \cup \{ (k, g) \mid op : (l, k) \in \Gamma^W, op : (g, h) \in \Delta^R \} \cup \{ (k, g) \mid op : (l, k) \in \Gamma^R, op : (g, h) \in \Delta^W \} )^+$$

In the definition of $<$ each of the three sets corresponds to one type of conflict involving registers and the first set contains additionally the constraints for operand queues.

We obtain similar results as in Section 4.3.1.

**Lemma 6.**

1. $t : A \rightarrow B$ holds exactly if there are $\Gamma$ and $<$ such that $(\Gamma, <) \vdash t : A \rightarrow B$.

2. If $(\Gamma, <) \vdash t : A \rightarrow B$ and $(\Delta, <) \vdash t : C \rightarrow D$ then $\Gamma = \Delta$ and $<=<$. 
3. If \((\Gamma^W, \Gamma^R), <) \vdash t : A \rightarrow B,\) then

- all labels \(n\) occurring in \(\Gamma = (\Gamma^W, \Gamma^R)\) or \(<\) are labels of instructions in \(t\).
- if \(n < m\) then \(t\) can be written as \(s[n] \text{code}_n r[m] \text{code}_m u\), there a common factor \(op\) of both \(B \text{code}_n \otimes \text{rg}(A \text{code}_n)\) and \(B \text{code}_m \otimes \text{rg}(A \text{code}_m)\).
- \(op : (n, \_ \in \Gamma^W (\Gamma^R)\) iff \([n]\) is the first instruction \([m]\) \text{code} in \(t\) for which \(op\) is a factor of \(B \text{code} (\text{rg}(A \text{code}))\).
- \(op : (\_, n \in \Gamma^W (\Gamma^R)\) iff \([n]\) is the last instruction \([m]\) \text{code} in \(t\) for which \(op\) is a factor of \(B \text{code} (\text{rg}(A \text{code}))\).
- \(op : (n, n \in \Gamma^W (\Gamma^R)\) iff \([n]\) is the only instruction \([m]\) \text{code} in \(t\) for which \(op\) is a factor of \(B \text{code} (\text{rg}(A \text{code}))\).

4. If \((\Gamma, <) \vdash t : A \rightarrow B,\) \(t\) has the form \(s[n] \text{code}_1 r[m] \text{code}_2 u\) and \(op\) is a factor of

- \(B \text{code}_1\) and \(B \text{code}_2\) or
- \(\text{rg}(A \text{code}_1)\) and \(B \text{code}_2\) or
- \(B \text{code}_1\) and \(\text{rg}(A \text{code}_2)\)

then \(n < m\) holds.

**Proof.** Proceeds similarly to the Proof of Lemma 3 by an induction on \(t\), proving all four claims simultaneously. \(\square\)

**Lemma 7.** Let \((\Gamma, <) \vdash t : A \rightarrow B, \text{shape}(C) = A,\) and \(u = \text{ins}^u\) an interleaving of \(t\) and \(A\). Let \(t\) be sequential for \(\sqsubseteq\) and \(A\), let \(<\) be contained in \(\sqsubseteq\), and let \(C \overset{\text{ins}}{\rightarrow} D_1\) and \(t = t_1 \text{inst}_2\). Then there are \(\Delta\) and \(<\) such that for \(C = \text{shape}(D_1)\) and \(s = t_1 t_2\) the derivation

\[
(\Gamma_{\text{ins}}, \emptyset) \vdash \text{ins} : A \rightarrow C \quad (\Delta, <) \vdash s : C \rightarrow B
\]

\[
(\Gamma_{\text{ins}} \# \Delta, <) \vdash \text{ins} s : A \rightarrow B
\]

exists. In particular, \(<\) is contained in \(<\).

**Proof.** Since \(u\) is an interleaving for \(t\) and \(A\), \(t\) is sequential for \(\sqsubseteq\) and \(A\) and \(\text{shape}(C) = A\) holds, there is a (unique) \(D\) such that \(C \overset{\text{ins}}{\rightarrow} D_1 \overset{d'}{\rightarrow} D\). Theorem 4 implies \(\text{ins} : A \rightarrow C,\) hence

\[
(\Gamma_{\text{ins}}, \emptyset) \vdash \text{ins} : A \rightarrow C
\]
holds by the rule R-AX^<.

By Lemma 6, (Γ, <) ⊢ t : A → B implies t : A → B, so there is by Theorem 4 a unique E with C \rightarrow E, and shape(E) = B holds. The typing t : A → B also implies that there are D and E such that

\[
\begin{array}{c}
\frac{t_1 : A \rightarrow D \quad \text{ins} : D \rightarrow E}{t_1 \text{ins} : A \rightarrow E} \\
t_2 : E \rightarrow B
\end{array}
\]

is a valid derivation. Since \text{ins} : D \rightarrow E holds, we know that there is an X with

\[D = A_{\text{code}} \otimes X \quad \text{and} \quad E = B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes X\]

where \(A_{\text{code}}\) and \(B_{\text{code}}\) are those from Table 3.1, and \text{ins} = [n]_{\text{code}}. On the other hand, \text{ins} : A \rightarrow C implies \(A = A_{\text{code}} \otimes Y\) and \(C = B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes Y\) for some Y, hence

\[t_1 : A_{\text{code}} \otimes Y \rightarrow A_{\text{code}} \otimes X \quad \text{and} \quad t_2 : B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \otimes X \rightarrow B.\]

Since \text{ins} is the head instruction of its pipeline, instructions in \(t_1\) don’t access elements in the queue component of \(A_{\text{code}}\), so \(q(A_{\text{code}})\) is used as a weakening in the typing of \(t_1\), i.e. we have \(t_1 : \text{rg}(A_{\text{code}}) \otimes Y \rightarrow X \otimes \text{rg}(A_{\text{code}})\). By applying the alternative weakening \(B_{\text{code}}\), we obtain

\[t_1 : Y \otimes B_{\text{code}} \otimes \text{rg}(A_{\text{code}}) \rightarrow X \otimes B_{\text{code}} \otimes \text{rg}(A_{\text{code}})\]

\[t_2 : E \rightarrow B\]

\[s : C \rightarrow B\]

so (Δ, ⊑) ⊢ s : C → B for some Δ and ⊑ by Lemma 6. Consequently, (Γ\_\text{ins}#Δ, <) ⊢ \text{ins} s : A → B for

\[< = (\subseteq \cup \{(n,m) \mid \text{op factor of } B_{\text{code}}, \text{op : (m, _)} \in \Delta^W\}) \]

\[\cup \{(n,m) \mid \text{r factor of } B_{\text{code}}, \text{r : (m, _)} \in \Delta^R\}\]

\[\cup \{(n,m) \mid \text{r factor of } A_{\text{code}}, \text{r : (m, _)} \in \Delta^W\}\]^+

Instruction sequences \(u\) and \(t\) are both interleavings for \(t\) and \(A \rightharpoonup u\) by assumption and \(t\) due to \(\pi_t = \pi_t\) and \(t : A \rightarrow B\). Since \(t\) is sequential for ⊑ and \(A\), \(u\) and \(t\) thus both respect ⊑, and the claim follows from the minimality of \([n]\) with respect to ⊑ as in the proof of Lemma 4. □
**Proposition 26.** Let \((\Gamma, <) \vdash t : A \rightarrow B\) and \(<\) be contained in \(\square\). If \(t\) is sequential for \(\square\) and \(A\) then \(t\) is deterministic for \(A\).

**Proof.** The proof proceeds similarly to the one of Proposition 11. In the case \(FU(u_1) \neq FU(t_1)\), the instructions \(u_1\) and \(t_1\) with the principal typings \(t_1 :: A_{t_1} \rightarrow B_{t_1}\) and \(u_1 :: A_{u_1} \rightarrow B_{u_1}\) are not completely independent but \(\text{prime}(rg(A_{u_1}), B_{t_1}), \text{prime}(B_{u_1}, rg(A_{t_1}))\) and \(\text{prime}(B_{u_1}, B_{t_1})\) hold: for suppose any of these claims is violated, then \(t_1 < u_1\) holds by Lemma 6(part (4)) (as \(t_1\) is the first instruction of \(t\)), in contradiction to the fact that \(u\) respects \(<\).

Any joint factor of \(A_{u_1}\) and \(A_{t_1}\) is a register \(r\) (because of \(FU(u_1) \neq FU(t_1)\)), and the values in \(r\) in configurations \(D_1\) and \(E_1\) agree. Formally, such an \(r\) occurs in \(\text{shape}(D_1)\) and \(\text{shape}(E_1)\) as \(B_{t_1} = B_{t_1} \otimes rg(A_{t_1})\) and \(B_{u_1} = B_{u_1} \otimes rg(A_{u_1})\) hold. Hence executability of \(u_1\) in \(D_1\) and \(t_1\) in \(E_1\) is preserved, and \(\text{prime}(B_{u_1}, B_{t_1})\) implies the existence of a unique \(C_1\) such that \(D_1 \Downarrow C_1\) and \(E_1 \Downarrow C_1\).

The increased number of race hazards resulting from the introduction of registers is not matched by the existence of additional tools for discharging constraints. Pipeline-dependencies still result in serialisation, and so do data-dependencies for values which are communicated through operand queues. No serialisation is achieved by register-communicated values, exemplified by the existence of RAW hazards.

### 5.2.4 Program execution

The distributed dynamic semantics of programs with registers and jumps combines the SOS systems for distributed program execution (Section 4.4.1) and program execution involving registers (Section 5.1.3.1). Configurations are equipped with a register component and reduction is defined by the rules in Figure 5.4. Likewise, the static semantics is given by composing the approaches in Sections 4.4.2 and 5.1.3.2. Instructions \(\text{if}\) and \(\text{jmp}\) are typed according to the rules

\[
\begin{align*}
(\Gamma \text{ if } op \ n_1 \ n_2) & \vdash [n] \text{ if } op \ n_1 \ n_2 : A \rightarrow B \\
(\emptyset, \emptyset) & \vdash [n] \text{ jmp } m : X \rightarrow X
\end{align*}
\]

At boundaries of basic blocks, additional constraints are introduced according to the three race conditions. Judgements \((\Gamma_n, <_n) \vdash n : A \rightarrow B\) are inserted into \(\Sigma\) provided
5.2. Distributed execution

\[\begin{align*}
4FU_1 & \quad C \xrightarrow{\text{ins}} D \\
4FU_2 & \quad C \xrightarrow{s, \gamma} E \quad E \xrightarrow{t, \gamma} D \\
4DIS_1 & \quad C \xrightarrow{\pi, \gamma} D \\
4DIS_2 & \quad C \xrightarrow{\pi_1, \pi_2} D
\end{align*}\]

\[\begin{align*}
\text{LOAD} & \quad (Q, R, M, n, \pi) \xrightarrow{n} (Q, R, M, n, \pi)
\end{align*}\]

\[\begin{align*}
\text{EXEC} & \quad (Q, R, M, \pi) \xrightarrow{\pi_1} (P, S, L, \bar{m}) \\
& \quad (Q, R, M, \pi_2) \xrightarrow{\lambda} (P, S, L, \bar{m}, \pi_2)
\end{align*}\]

\[\begin{align*}
\text{COMPOSE} & \quad C \xrightarrow{\nu} E \quad E \xrightarrow{\nu} D \\
& \quad C \xrightarrow{\nu, \nu} D
\end{align*}\]

Figure 5.4: Dynamic semantics for programs involving registers

that for the typing \((\Gamma_n, <_n) \vdash N(n) : A \rightarrow B\) holds, and constraints \([l] < [g]\) are inserted whenever one of the following three conditions holds where \(n, \ldots, m\) are related by a path in \(A\)

- \(op : (k, l) \in \Gamma_n^W, op : (g, h) \in \Gamma_m^W\)
- \(r : (k, l) \in \Gamma_n^R, r : (g, h) \in \Gamma_m^W\)
- \(r : (k, l) \in \Gamma_n^W, r : (g, h) \in \Gamma_m^R\)

Again, it suffices to consider loop-free paths where at most the endpoint are equal and no intermediate block mentions \(op\). As before, constraints \([n] < [n]\) in the rule

\[\Sigma = \bigcup_{n \in \text{dom}_N} \{ (\Gamma_n, <_n) \vdash n : A_n \rightarrow B_n \} \]

\((n, m) \in A\) implies \(\exists X\ s.t. B_n = A_n \otimes X\ and B_m \otimes X = B_m\)

\(SP(n, op, m), op : (k, l) \in \Gamma_n^W, op : (g, h) \in \Gamma_m^W\) implies \([l] < [g]\)

\(SP(n, op, m), op : (k, l) \in \Gamma_n^W, op : (g, h) \in \Gamma_m^R\) implies \([l] < [g]\)

\(SP(n, op, m), op : (k, l) \in \Gamma_n^R, op : (g, h) \in \Gamma_m^W\) implies \([l] < [g]\)

\(\text{R-DisPROG-}\)

\((\Sigma, <) \vdash (N, A) : \Box\)
are interpreted modulo loop instantiations. Consequently, similar properties hold as in Section 4.4.2.

**Proposition 27.** Let \((\Sigma, <) \vdash P : \square\) and \((n_1, \ldots, n_k)\) be a path in \(A\). Then there are \(\Gamma\) and \(<\) such that \((\Gamma, <) \vdash \Xi(n_1) \ldots \Xi(n_k) : A_n \rightarrow B_n\) holds and \(<\) is contained in \\{\(n : A_n \rightarrow B_n \in \Sigma\)\} \cup <.

**Proof.** Similar to the proof of Proposition 15. In the case \(k > 1\), \(B_i = A_{i+1} \otimes X_i\) holds for some \(X_i\) with \(B_{i+1} \otimes X_i = B_{i+1}\) and all \(1 \leq i \leq k\), and \(<\) has the form

\[
< = \square \cup <_{k+1} \cup \{(m, g) | op : (l, m) \in \Delta^W \text{ and } op : (g, h) \in \Gamma^W_{k+1}\}
\]

\[
\cup \{(m, g) | r : (l, m) \in \Delta^W \text{ and } r : (g, h) \in \Gamma^R_{k+1}\}
\]

\[
\cup \{(m, g) | r : (l, m) \in \Delta^R \text{ and } r : (g, h) \in \Gamma^W_{k+1}\}.
\]

\[\square\]

**Proposition 28.** Let \((\Sigma, <) \vdash P : \square\), \(n \in \text{dom} \Xi\) and \((\Gamma_n, <_n) \vdash \Xi(n) : A_n \rightarrow B_n \in \Sigma\).

1. If \((Q, R, M, n) \xrightarrow{w} (P, S, L, \tilde{m})\) then \((Q, R, M, n, \pi_e) \xrightarrow{w} (P, S, L, \tilde{m}, \pi_e)\).

2. Let \(C = (Q, R, M, n, \pi_e)\), \(\text{shape}(C) = A_n\) and \(C \xrightarrow{w} (P, S, L, \tilde{m}, \pi_e)\). Let \(\sqsubseteq\) be a partial order containing \(<\) and for all \(\Xi(n_i)\) \ldots \(\Xi(n_j) : A_{n_i} \rightarrow B_{n_j}\) let the instruction sequence \(\Xi(N_i) \ldots \Xi(n_j) : A_{n_i} \rightarrow B_{n_j}\) be sequential for \(\sqsubseteq\) and \(A_{n_i}\). Then \((Q, R, M, n) \xrightarrow{w} (P, S, L, \tilde{m})\).

**Proof.** Similar to the proof of Proposition 16, using configurations extended by the register bank and the modified rules EXEC, LOAD, EXECUTE and COMPOSE. \[\square\]

### 5.3 Discussion

This section gave dynamic and static semantics for the full language ALEF, again for sequential as well as distributed execution. Most properties were obtained by transferring and combining the results from earlier chapters, and the ease with which this was possible demonstrates the robustness of our approach. In the type system, registers were modelled as products fulfilling the axiom \(r \otimes r = r\) which turns them into exponentials in the setting of linear logic [Gir86]. Indeed, accessing a register corresponds
to the familiar axiom $!r \rightarrow r \otimes !r$, and for $A = op_1 \otimes \ldots \otimes op_n$, our notation $rg(A)$ represents the weakest formula $A^?$ such that $A \otimes A^? = !A$ holds, where $!A = op_1^1 \otimes \ldots \otimes op_n^1$ and $op' = \begin{cases} \text{!}r & \text{if } op = r \\ op & \text{otherwise} \end{cases}$

Our approach employs only a tiny fragment of linear logic as types are (non-curried) first-order, no connectives other than $\otimes$ and $\rightarrow$ appear and the exponential is only applied to atoms $op$. We have not investigated further connections to linear proof theory, but doing so might be useful for modelling more complex forwarding behaviour.

### 5.3.1 AQM’s and compounding

The static and dynamic semantics presented in this chapter allows us to examine the compounding approach more closely. The description given in Chapter 2 suggests that communication within a compound should coincide with operand queue usage while inter-compound communication should correspond to register usage. This motivates us to require that compounds have type $rg(A) \rightarrow rg(B)$ for some $A$ and $B$. However, this property does not characterise compounds uniquely. The program

\[
[1]\text{dup}^1_{\text{MEM}} r_1 q_1 q_2 [2]\text{dec} q_1 q_3 [3]\text{add} q_2 q_3 r_2 :: r_1 \rightarrow \circ r_1 \otimes r_2
\]

is not a compound in the sense of [Mul01] as there are two forwarding paths between [1] and [3]. ALEF notation highlights the difference between linear chains of instructions in the instruction dependence graph and a chain of data-dependencies.

A second issue concerns the underlying dynamic semantics. The results of this chapter show that differences in the dynamic semantics influence what notions of compounding should be admitted. The following examples demonstrate the problems arising from defining compounds as *units of forwarding* but understanding and using them also as *units of scheduling*. The program

\[
t = [1]\text{ldc} 5 q_1 [2]\text{dup}^1_{\text{BUF}} q_1 q_2 r_1 [3]\text{dec} r_1 r_2 [4]\text{mul} q_2 r_2 r_3 \quad (5.7)
\]

consists of two compounds, $C_1 = [1][2][4]$ and $C_2 = [3]$. Both schedules $C_1C_2$ and $C_2C_1$ execute deterministically for an initial configuration of shape 1 under the distributed model of execution, resulting in the same final configuration. In fact,

\[
\pi_{C_1C_2} = \pi_t = \pi_{C_2C_1} = [3] \parallel [4] \parallel [1] \parallel [2]
\]
holds and any issue order of the instructions of \( t \) leads to this distributed program. However, under the sequential model, both schedules deadlock. The order \( C_1C_2 \) results in the interleaving \([1][2][4][3]\) and deadlocks after executing \([2]\). The order \( C_2C_1 \) results in the interleaving \([3][1][2][4]\) and deadlocks immediately. On the other hand, the in-order execution of \( t \) succeeds. The above compounding was perfectly fine for distributed execution but not for sequential one.

For sequential execution, additional constraints on the formation of compounds have been proposed. For example, acyclic compoundings require the graph of compounds and and inter-compound dependencies to be loop-free. This eliminates program (5.7) where \( C_2 \) and \( C_1 \) are mutually dependent on each other. Of course, this requirement is conservative as there are cyclic compoundings which may be admitted. On the other hand, for a distributed underlying semantics in the sense of this Chapter, even acyclic compounding does not guarantee determinism. For example, the program

\[
t = [1]ldc\ 5\ q_1\ [2]dup\ \text{ALU}\ q_1\ r_1\ r_2\ [3]dup\ \text{MUL}\ r_2\ r_4\ q_2\ [4]ld\ q_2\ r_3\ [5]ld\ r_1\ q_3\ [6]dec\ q_3\ r_5\ [7]add\ r_5\ r_4\ r_3
\]

(5.8)

is compounded acyclicly, with \( C_1 = [1][2], C_2 = [3][4], C_3 = [5][6] \) and \( C_4 = [7] \). The two schedules respecting the inter-compound dependencies are \( C_1C_2C_3C_4 \) and \( C_1C_3C_2C_4 \). For initial configurations of shape \( 1 \), both schedules execute successfully under the sequential model of execution, and indeed lead to the same result. However, for distributed execution only the schedule \( C_1C_2C_3C_4 \) is deterministic, while \( C_1C_3C_2C_4 \) shows a race condition between instructions \([4]\) and \([7]\). In contrast, the original program \( t \) is deterministic for \( 1 \) as any interleaving of \( \pi_t = [2][6][7] \| [3] \| [1][4][5] \| \varepsilon \) agrees with the sequential execution. The typing of \( t \) is

\[
(\Gamma, [4] \prec [7]) \vdash t : 1 \rightarrow r_1 \otimes r_2 \otimes r_3 \otimes r_4 \otimes r_5,
\]

and the constraint \([4] \prec [7]\) may be discharged using the \( q_3 \)-carried data-dependence between \([5]\) and \([6]\) and the pipeline-dependencies \([4] \preceq [5]\) and \([6] \preceq [7]\). Thanks to the SOS- and type-based formalism, we discover that the race hazard is introduced through understanding compounds as units of scheduling. Type-checking the schedule \( C_1C_3C_2C_4 \) fails as the constraint \([4] \prec [7]\) cannot be fulfilled any longer: \([4] \preceq [5]\) does not hold.
We hence argue that AQM’s are well suited for reasoning about compoundings as they capture the interaction between forwarding, model of execution and scheduling clearly.

5.3.2 Conclusion and outlook

The preceding chapters presented a programming language based analysis of forwarding. We introduced AQM’s as a novel machine model, realised as an assembly language in which operand queue names have the same status as registers. Explicitness of forwarding was shown to be beneficial for defining processor behaviour at various levels of abstraction, ranging from ISA-style sequential execution with operand queues of unbounded length and exclusive forwarding of operands to distributed execution with registers and operand queues of bounded length. We discussed how AQM’s and ALEF relate to three previous models of operand communication (SCALP, compounding, and Tomasulo’s algorithm), demonstrating that AQM’s unify various concrete realisations of forwarding. The benefits of our approach became apparent as the dynamic semantics were complemented by static semantics eliminating characteristic runtime hazards. The core of these type systems remained static as we proceeded from simple towards more complex models, demonstrating the robustness of the formalisms. The separation into the derivation of serialisation constraints and their discharge was introduced in Chapter 4, but could be reused for AQM’s with registers and might also be transferrable to other dynamic semantics of ALEF.

This completes the first part of this thesis. Starting with the following chapter, we substantiate the second advocated benefit of a programming language based approach as we link forwarding in ALEF to intermediate program analysis and compilation.
Chapter 6

Intermediate program analysis

This chapter together with Chapter 7 presents the core of a compiler-backend with target language ALEF. As operands in ALEF may either be forwarded or be passed through registers, compilation needs to decide which mechanism to use for each intermediate result, respecting the fundamental properties of operand queues:

1. a value in an operand queue can be consumed only once

2. the order in which values are inserted into an operand queue must agree with the order in which they are consumed

In this chapter, we concentrate on the first property and analyse how intermediate values are used by a program. Our method consists of a dataflow analysis for an intermediate language - variables for which the analysis can prove that each assignment corresponds to a single future usage are candidates for forwarding while variables with different usage patterns must be communicated through registers.

The second property restricts the allocation of operand queues to those variables which are forwardable, and will be dealt with in Chapter 7. Together, these two chapters provide a translation from the intermediate language into ALEF where the allocation of operand queues and registers is based on solutions to the dataflow equations.

Both the analysis and the translation are proven correct with respect to static and dynamic semantics. In case of the dataflow equations, we show that variables identified as being used linearly may indeed be deleted during their first usage. This result is
obtained by defining a non-standard semantics which is sound with respect to the stan-
dard semantics and also related to the dataflow solutions. In case of the translation,
we prove functional correctness of the resulting ALEF program with respect to the
original program in the intermediate language, and well-typedness with respect to the
calculus of Chapter 5, where the types relate to the dataflow solutions.

Synopsis of Chapter 6 Our intermediate language is a subset of languages found in
modern compilers for imperative or object-oriented programming languages [App98a].
It contains assignments $x = e$ where $x$ is a variable and $e$ a simple expression, condi-
tional and unconditional jumps, and pseudo-assignments of the form $x = \Phi(x_1, \ldots, x_n)$
for expressing programs in static single assignment (SSA) form. We provide the pro-
gram analysis necessary for converting SSA and non-SSA programs into ALEF code,
based on a usage analysis for IL variables.

We start by an informal introduction to the SSA discipline in Section 6.1. Section 6.2
then defines the syntax and dynamic semantics of IL programs and formalises the im-
PLICIT conventions of SSA. We also summarise how non-SSA programs may be trans-
formed into SSA form and define more restrictive SSA disciplines. Subsequently, the
analysis of the dynamic number of accesses to a variable using dataflow equations is
presented in Section 6.3. As usage analysis arises as a generalisation of liveness anal-
ysis, our discussion starts with a brief summary on the dataflow equations for liveness,
before introducing usage analysis for non-SSA and SSA programs. Finally, Section
6.4 shows the soundness of the analysis by first defining a non-standard dynamic se-
mantics where the number of read operations for a variable may be restricted and then
proving that restrictions arising from solutions to the equations execute successfully.

6.1 Static single assignment

Static single assignment (SSA) is a compiler-intermediate program representation in-
troduced in [AWZ88] for detecting equality of variables. By restricting the way in
which variables are named, SSA forces all reading accesses to a variable to refer to the
same assigning instruction. Consequently, the structure of a program is more explic-
6.1. Static single assignment

Itly available, with benefits for various program analysis techniques. In the past, SSA and its generalisations [JP93] [Ana99] have successfully served as a common representation for dead code elimination, constant propagation [SWG94], value numbering [BCS97] and partial redundancy elimination [KCL+99].

The main principle of SSA requires that each variable \( x \) has only one point of definition, i.e. that a program can have at most one instruction assigning to \( x \). For straight-line code, this is achieved by introducing a new variable \( x_i \) whenever a previously assigned variable \( x \) is assigned to again. The uses (occurrences of \( x \) in the right hand side of an assignment or in a conditional branch) are renamed such that they always refer to the most recently introduced copy of \( x \). For example, in code 6.1 (left), the two assignments to \( x \) may be renamed to \( x_1 \) and \( x_2 \) and the three assignments to \( y \) to \( y_1 \), \( y_2 \) and \( y_3 \).

\[
\begin{align*}
[1] & : y = 5 & [1] & : y_1 = 5 \\
[3] & : y = x + y & [3] & : y_2 = x_1 + y_1 \\
[4] & : x = x + y & [4] & : x_2 = x_1 + y_2 \\
\end{align*}
\]

(6.1)

Updating the uses of \( x \) and \( y \) accordingly results in code 6.1 (right). Data-dependencies between instructions are explicit and each site of use refers to the correct assignment. Furthermore, the life span of variables \( x_i \) resulting from a variable \( x \) do not overlap: for example, \( y_1 \) is dead after instruction \( [3] \) as it will never be used again.

At points where control flow paths merge, the requirement that each variable have a unique static defining instruction is maintained by pseudo-instructions

\[
x = \Phi(x_1, \ldots, x_n)
\]

where \( 1, \ldots, n \) is an arbitrary but fixed enumeration of the incoming control flow arrows. Similar to other assignments, a \( \Phi \)-instruction \( x = \Phi(x_1, \ldots, x_n) \) represents a point of definition for variable \( x \) and a point of use for the \( x_i \). Semantically, the execution of \( x = \Phi(x_1, \ldots, x_n) \) assigns the value \( x_i \) to \( x \) whenever the the \( \Phi \)-instruction is reached via control flow arrow numbered \( i \). If a program is treated as a graph of basic blocks, \( \Phi \)-instructions need only occur at the beginning of blocks and a group of \( \Phi \)-instructions is assumed to be executed concurrently, i.e. as a single instruction which
first extracts the correct values from the tuples on the right-hand side and then assigns them atomically to the variables on the left-hand side. As an example, consider

\[
\begin{align*}
[1] & \text{if } v \ 2 \ 5 \\
[2] & x = 5 \\
[3] & y = 6 \\
[4] & \text{jmp } 8 \\
[5] & x = 2 \\
[6] & y = 9 + v \\
[7] & \text{jmp } 8 \\
[8] & z = x \times y \\
[9] & z = z + v
\end{align*}
\]

the program 6.2 (left) where (depending on the value of \( v \)) instruction [8] is reached either through the control flow edge \([4] \rightarrow [8]\) or through the edge \([7] \rightarrow [8]\). When the blocks \([2]..[4]\) and \([5]..[7]\) are transformed into SSA, different names are used for the assignments to \( x \) and \( y \) in order to ensure that each variable has a single point of definition. Inserting \( \Phi \)-instructions \([8a]\) and \([8b]\) guarantees that instruction [8] accesses the correct values no matter which path was used. In the code 6.2 (right), the first component in both \( \Phi \)-instructions is implicitly associated with the control flow arrow \([4] \rightarrow [8]\) and the second component with the arrow \([7] \rightarrow [8]\). The order of instructions \([8a]\) and \([8b]\) is semantically irrelevant. No \( \Phi \)-instruction is needed for \( v \) since neither of the two branches assigns to \( v \) and the earlier value is still valid at point [8].

### 6.2 An intermediate language IL

This section introduces an imperative intermediate language IL as a formal basis for the translation of programs into ALEF. The language is similar to quadruple-based intermediate languages [App98a], but additionally contains \( \Phi \)-instructions. It can thus express general (i.e. non-SSA) programs as well as programs in SSA form.
bop ∈ BinOp ::= + | − | *

e ∈ Expr ::= a | x | x bop a | x bop x

x(k) ∈ Var^k ::= \[ \begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix} \]

n(m) ∈ N^m ::= \[ \begin{pmatrix} n_1 \\ \vdots \\ n_m \end{pmatrix} \]

M(k,m) ∈ Var^{k×m} ::= \[ \begin{pmatrix} x_{1,1} & \cdots & x_{1,m} \\ \vdots \\ x_{k,1} & \cdots & x_{k,m} \end{pmatrix} \]

ass ∈ Assignment ::= [n]x = e

ϕ ∈ Φ-Block ::= [n]x(k) = M(k,m)n(m)

jump ∈ Jump ::= [n]iff x m l | [n] jmp m

Figure 6.1: Intermediate language IL: syntax

We first give the syntax of IL and collect some basic definitions regarding the usage of variables in programs. We then summarise an approach from the literature for transforming IL programs into SSA form and a more restrictive form called edge-split SSA. Finally, we define the dynamic semantics of IL, including the execution of Φ-instructions.

### 6.2.1 Syntax and regular programs

For a set \( Var = \{x, y, \ldots \} \) of program variables and values \( a ∈ Val \) and numbers \( n ∈ N \) as before, the syntax of IL is given by the grammar in Figure 6.1. We let \( I \) range over the set \( Instrs \) of instructions (assignments, Φ-blocks and jumps) and \( \bar{I} \) over sequences of instructions. In accordance with the general definitions in Chapter 2, the length \( n \) of \( \bar{I} = I_1 \ldots I_n \) is denoted by \( |\bar{I}| \), and for \( n > 0 \) we write \( fst(\bar{I}) \) for \( I_1 \) and \( lst(\bar{I}) \) for \( I_n \).
For simplicity, IL does not contain memory operations. These can however be added without further complications at the cost of extending the operational models by a memory component.

**Example.** The abstract syntax for $\Phi$-Blocks ensures that all pseudo-assignments $x = \Phi(y_1, \ldots, y_k)$ in a $\Phi$-Block are of the same arity and use the same numbering of incoming control flow arrows. For example, the basic block $[8a][8b][8][9]$ in program 6.2 (right) takes the form

\[
[8a] \begin{pmatrix} x_3 \\ y_3 \end{pmatrix} = \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \end{pmatrix} \begin{pmatrix} 4 \\ 7 \end{pmatrix}
\]

\[ [8]z1 = x3 \times y3 \]

\[ [9]z2 = z1 + v \]

The operational behaviour (to be defined shortly) assigns $x_1$ to $x_3$ and $y_1$ to $y_3$ if the control flow reached $[8a]$ through instruction $[4]$ while entering $[8a]$ through the path $[7] \rightarrow [8a]$ results in the assignments $x_3 = x_2$ and $y_3 = y_2$. In both cases, the respective pair of assignments is carried out atomically.

\[ \Box \]

**Definition 22.** For instruction $I$ the sets $\text{defs}(I)$ and $\text{uses}(I)$ of variables defined and used in $I$ are given by

\[
\begin{align*}
\text{uses}(a) &= \emptyset \\
\text{uses}(x) &= \{x\} \\
\text{uses}(xbop a) &= \{x\} \\
\text{uses}(xbop y) &= \{x, y\} \\
\text{uses}\left(\begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{km} \end{pmatrix}\right) &= \{x_{ij} \mid 1 \leq i \leq k, 1 \leq j \leq m\} \\
\text{uses}([n]x = e) &= \text{uses}(e) \\
\text{uses}([n]x = Mn) &= \text{uses}(M) \\
\text{uses}([n]\text{if } x m l) &= \{x\} \\
\text{uses}([n]\text{jmp } m) &= \emptyset
\end{align*}
\]
and

\[
\text{defs}(\begin{array}{c}
x_1 \\
\vdots \\
x_k
\end{array}) = \{x_i | 1 \leq i \leq k\}
\]

\[
\text{defs}([n]x = e) = \{x\}
\]

\[
\text{defs}([n]x = Mn) = \text{defs}(x)
\]

\[
\text{defs}([n]\text{if } x \text{ m l}) = \emptyset
\]

\[
\text{defs}([n]\text{jmp } m) = \emptyset
\]

**Definition 23.** A basic block is an instruction sequence \( \vec{I} = I_1 \ldots I_n \) where \( n \geq 0 \) holds, at most \( I_n \) is a jump instruction and at most \( I_1 \) is a \( \Phi \)-block. We denote the set of basic blocks by \( \text{Blocks} \), and let \( B \) range over this set.

**Definition 24.** A program \( P = (N, A, \text{succ}, \text{entry}) \) consists of a partial function \( N : N \rightarrow \text{Blocks} \), a relation \( A \subseteq \text{dom } N \times \text{dom } N \), a partial function \( \text{succ} : \text{Instrs} \rightarrow \mathcal{P}(\text{Instrs}) \) and a label \( \text{entry} \in \text{dom } N \) such that

- the labelling of instructions is unique
- for all \( n \in \text{dom } N \), \( |N(n)| > 0 \) holds and \( \text{fst}(N(n)) \) has the label \([n]\)
- the relation \( A \) is given by

\[
A = \{(n,m) | \text{lst}(N(n)) \in \{[.]\text{jmp } m, [.]\text{if } x \text{ m l}, [.]\text{if } x \text{ l m}\}\}
\]

- the function \( \text{succ} \) is given by

\[
\text{succ}(I) = \{J | \exists n \in \text{dom } N \text{ s.t. } N(n) = \vec{KIJL}\}
\]

\[
\cup \{\text{fst}(N(m)) | (n,m) \in A, I = \text{lst}(N(n))\}
\]

A partial map \( \pi : N \rightarrow \text{Instrs}_P \) with \( \pi(i + 1) \in \text{succ}(\pi(i)) \) is called an infinite path in \( P \) if \( \text{dom } \pi = N \) holds, and is called a finite path (of length \( n \)) if \( \text{dom } \pi = \{1, \ldots, n\} \). In the latter case, we also write \( \pi \) as \( (\pi(1), \ldots, \pi(n)) \) and say that \( \pi \) is a path from \( \pi(1) \) to \( \pi(n) \).
Chapter 6. Intermediate program analysis

for \( x \in \text{uses}(I_n) \), a path \((I_1, \ldots, I_n)\) in \( P \) with \( I_1 = \text{fst}(\text{entry}) \), and \( I_n \notin \Phi\)-Block, there is an \( i < n \) such that \( x \in \text{defs}(I_i) \) holds

where the (finite) sets of instructions, variables and labels used by \( P \) are denoted by \( \text{Instrs}_P \), \( \text{Var}_P \subset \text{Var} \) and \( \text{N}_P \subset \text{N} \), respectively, and the set of instruction sequences \( J \) such that there is a \( B \in \text{cod} \ N \) with \( B = IJ \) by \( \text{Blocks}_P \).

**Definition 25.** Program \( P = (\text{N}, \text{A}, \text{succ}, \text{entry}) \) is regular if no basic block \( B \in \text{cod} \ N \) has the form \( \phi I \).

### 6.2.2 Programs in Static Single Assignment form

The informal discussion of SSA in the previous section is imprecise as no syntactic conditions on SSA programs are given. For example, the requirement that all \( \Phi\)-functions of a block have distinct variables on the left-hand side is necessary for proving properties of SSA code, but follows only implicitly from the semantics - all \( \Phi\)-functions of a block are executed concurrently. Treatments of SSA in the literature often leave many properties implicit or consider SSA programs resulting from particular algorithms. In order to avoid later complications, the following definition of SSA programs captures syntactically those conditions which we require. It is based on the dominance notion which stipulates that each use of a variable be preceded by an execution of its (unique) defining instruction.

**Definition 26.** For \( I, J \in \text{Instrs}_P \), \( I \) dominates \( J \) if each path from \( \text{fst}(\text{entry}) \) to \( J \) in \( P \) contains \( I \).

**Definition 27.** Program \( P = (\text{N}, \text{A}, \text{succ}, \text{entry}) \) is an (SSA) program if

- \( \text{fst}(\text{entry}) \notin \Phi\)-Block
- for each \( x \in \text{Var}_P \) there is at most one \( I \in \text{Instrs}_P \) with \( x \in \text{defs}(I) \). We denote this instruction by \( I_x \).
- for each \( \phi \in [l]x(k) = M(k,m)n(m) \) in \( P \), the following conditions are fulfilled
  - for all \( 1 \leq j \leq m \), \( n_j = n \) holds iff there is an \( I \) such that \( \phi \in \text{succ}(I) \) and \( n \) is the label of \( I \)

---
6.2. An intermediate language IL

- $1 \leq i \neq j \leq m$ implies $n_i \neq n_j$
- $1 \leq i \neq j \leq k$ implies $x_i \neq x_j$
- if $n_j$ is the label of $I$, $\phi \in \text{succ}(I)$ and $x_{ij}$ occurs in $M$ then $I_{x_{ij}}$ dominates $I$

Informally, the last item captures the following properties.

- Each control flow edge $I \rightarrow \phi$ corresponds to exactly one entry in $n$ (and consequently one column of $M$).
- All variables on the LHS of a $\phi$-block are distinct (necessary for concurrent execution of the assignments).
- All variables used in a row $j$ are well-defined when the control flow enters the $\phi$-block through the arrow associated to $n_j$.

6.2.2.1 Translation into SSA form

In order to transform a non-SSA program into an SSA program, variables have to be renamed suitably and $\Phi$-instructions have to be inserted at appropriate points. In order to avoid unnecessary $\Phi$-instructions, several algorithms with varying complexity have been proposed ([CFR+91], [JP93], [BP96]). Our summary follows one of the original algorithms [CFR+91] based on the dominance relation (see also [App98a]). This relation is most often represented in form of the dominator tree with instructions as nodes, root $\text{fst}(\text{entry})$ and immediate dominance relations as arrows.

**Definition 28.** $I$ is the immediate dominator of $J$ if $I$ dominates $J$, $I \neq J$ and for all $K \neq I$ which dominate $J$, $I$ does not dominate $K$. $I$ strictly dominates $J$ if $I$ dominates $J$ and $I \neq J$.

It can be shown that immediate dominators are unique and Lengauer and Tarjan showed how to compute them efficiently using a depth-first traversal of the program graph [LT79]. The dominance frontier of $I$ consists of instructions $J$ which are not strictly dominated by $I$ themselves, but one of their predecessors is:
**Definition 29.** The dominance frontier of $I$ is given by

$$DF(I) = \{J \mid I \text{ does not strictly dominate } J\}$$

$$\cap \{J \mid \exists K.J \in \text{succ}(K) \text{ and } I \text{ dominates } K\}.$$ 

An instruction $J$ in $DF(I)$ thus represents a point of convergence, i.e. a common endpoint of otherwise disjoint paths $(I, \ldots, J)$ and $(K, \ldots, J)$. Hence, a $\Phi$-instruction is needed just before $J$ for each variable $x$ which fulfils $x \in \text{defs}(I)$ and is used in $J$ or a successor of $J$. One can show that iterating this dominance frontier criterion by including the $\text{defs}$ arising from $\Phi$-instructions inserted in the $i$-th step during the iteration $i + 1$ leads to the insertion of the fewest number of $\Phi$-instructions necessary for satisfying the property that any use of a variable be dominated by its definition.

After all $\Phi$-Blocks have been inserted, the program is transformed into SSA form by traversing the program graph or the dominator tree and renaming variables such that for each $x$ there is only one $I$ with $x \in \text{defs}(I)$.

### 6.2.2.2 Restricted SSA forms

The translation in Chapter 7 requires SSA programs to fulfil additional constraints. The first constraint is known under the name edge-split SSA.

**Definition 30.** An SSA program $P = (\mathbb{N}, A, \text{succ, entry})$ is in edge-split form if $\phi \in \text{succ}(I)$ implies $I \neq [n] if x m l$.

The second constraint concerns the order of $\Phi$-instructions in a block. As the translation needs to fix a particular order of these instructions, functional correctness can only be ensured if programs adhere to the following policy.

**Definition 31.** A program is in standard edge-split form if it is in edge-split form and for all $\phi$ of the form $[n] x = \text{Mn} x_{ij} \neq x_i$ holds for all $j$ and $i$ and all $l > i$.

Ordinary SSA programs may be transformed into edge-split form in a semantics-preserving way by inserting extra basic blocks whenever the condition in Definition
30 is not fulfilled. For example, program 6.3 (left)

\[ N(1) = \begin{align*}
\begin{bmatrix} 1 \end{bmatrix}u &= \ldots \\
\begin{bmatrix} 2 \end{bmatrix}v &= \\
\begin{bmatrix} 3 \end{bmatrix}w &= \\
\begin{bmatrix} 4 \end{bmatrix}&\text{if } u 5 8
\end{align*} \]

\[ N(5) = \begin{align*}
\begin{bmatrix} 5 \end{bmatrix} &= \begin{bmatrix} x \\ y \\ w \end{bmatrix} = \begin{bmatrix} v & x \\ w & x \end{bmatrix} \begin{bmatrix} 4 \\ 7 \end{bmatrix} \\
\begin{bmatrix} 6 \end{bmatrix}z &= x - 1 \\
\begin{bmatrix} 7 \end{bmatrix}&\text{if } z 8 5
\end{align*} \]

\[ N(8) = \begin{align*}
\begin{bmatrix} 8 \end{bmatrix} &= \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} v & y \\ u & x \end{bmatrix} \begin{bmatrix} 4 \\ 7 \end{bmatrix} \\
\begin{bmatrix} 9 \end{bmatrix}&\ldots
\end{align*} \]

is transformed into edge-split form by introducing two new basic blocks, containing instructions \([5a]\) and \([8a]\), respectively.

In most cases, standard edge-split form may be obtained from edge-split form by reordering the \(\Phi\)-instructions in each basic block – program 6.3 (right) shows the effect of swapping the \(\Phi\)-instructions for \(x\) and \(y\). If cyclic dependencies exist between variables in a \(\Phi\)-Block, additional variables need to be inserted. For example, the \(\Phi\)-Block

\[ [1] \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} y & x1 \\ y1 & x \end{bmatrix} \begin{bmatrix} 5 \\ 7 \end{bmatrix} \]

\[ [2]z = x \ast y \]

may be transformed to

\[ [1] \begin{bmatrix} w \\ y \end{bmatrix} = \begin{bmatrix} y & x1 \\ y1 & x \end{bmatrix} \begin{bmatrix} 5 \\ 7 \end{bmatrix} \]

\[ [2a]x = w \]

\[ [2]z = x \ast y \]

where \([2a]\) is a fresh label and \(w\) a fresh variable. We expect that in SSA programs resulting from application programs few additional variables are needed – \(\Phi\)-instructions
manage different copies \( x_i \) of a program variable \( x \) and the situation shown in the example should not arise. However, our program analysis treats SSA programs irrespectively of how they were obtained. Consequently, a syntactic restriction is necessary for the development in this thesis. An embedding of our analysis in a compiler which generates SSA internally may replace the above transformation with a formal proof that the translation into SSA satisfies the condition of standard edge-split SSA.

For the development in this chapter, ordinary SSA discipline suffices. In the remainder of this chapter, we will thus always assume that programs \( P \) are either regular (and not treat cases for \( \Phi \)-instructions) or in SSA form (and include \( \Phi \)-instructions in our analysis).

### 6.2.3 Dynamic semantics

The dynamic semantics of an IL-program \( P \) is given by the transition relation \( S \rightarrow T \) between ternary configurations of the form

\[
(\sigma, n, B) \in (\text{Var}_P \rightarrow \text{Val}) \times \mathbb{N}_P \times \text{Blocks}_P.
\]

The components are interpreted as follows.

- the partial map \( \sigma \) represents the state by mapping variables to their content
- \( n \) is the label of the instruction from which the control flow was transferred to the current instruction, and is used for executing \( \Phi \)-blocks
- \( B \) represents the remaining instructions of the current basic block.

The update operation \( \sigma[x \mapsto a] \) is defined according to the general definition in Chapter 2 and is generalised to

\[
\sigma[x_1 \mapsto a_1, \ldots, x_n \mapsto a_n] = \sigma[x_1 \mapsto a_1] \ldots [x_n \mapsto a_n]
\]

provided that the \( x_1, \ldots, x_n \) are distinct.

The evaluation of expressions is defined by the relation \( \sigma, e \downarrow a \)

\[
\frac{\sigma, a \downarrow a}{\sigma, a \downarrow a} \quad \frac{\sigma, x \downarrow a_1}{\sigma, x \downarrow a_1} \quad \frac{\sigma, x \downarrow a_1}{\sigma, x \downarrow a_1} \quad \frac{\sigma, y \downarrow a_2}{\sigma, y \downarrow a_2}
\]

\[
\frac{x \in \text{dom} \sigma}{\sigma, x \downarrow \sigma(x)} \quad \frac{\sigma, x \downarrow a_1 \quad \sigma, y \downarrow a_2}{\sigma, x \downarrow a_1 \quad \sigma, y \downarrow a_2}
\]

\[
\frac{\sigma, x \downarrow BOP(a_1, a)}{\sigma, x \downarrow BOP(a_1, a)}
\]
where \(BOP\) is the semantic operation corresponding to the syntactic operator \(bop\).

For a program \(P\), the (standard) dynamic semantics \(\rightarrow\) is given by the rules

\[
\begin{align*}
\text{Jmp} & \quad (\sigma, n, [k] \text{ jmp } m) \rightarrow (\sigma, k, N(m)) \\
\text{If-t} & \quad (\sigma, n, [k] \text{ if } x m_1 m_2) \rightarrow (\sigma, k, N(m_1)) \\
\text{If-f} & \quad (\sigma, n, [k] \text{ if } x m_1 m_2) \rightarrow (\sigma, k, N(m_2)) \quad (6.4) \\
\text{Ass} & \quad (\sigma, n, [k] x = e) \rightarrow (\sigma[x \mapsto a], n, \overline{I}) \\
\text{Phi} & \quad (\sigma, n, \phi \overline{I}) \rightarrow (\sigma', n, \overline{I}) \quad \begin{cases} 
\phi = [\ell]x(k) = M(m, k)n(m) \\
n = n_j \\
\sigma' = \sigma[x_i \mapsto a_i] 
\end{cases}
\end{align*}
\]

As usual, we denote the transitive closure of \(\rightarrow\) by \(\rightarrow^+\) and the reflexive transitive closure by \(\rightarrow^*\).

**Definition 32.** For program \(P = (N, A, \text{ succ, entry})\), the initial configuration is \(S_0 = ([], \text{ entry, N(entry)})\).

Some basic properties of \(\rightarrow\) are collected in the following lemmas and propositions.

**Lemma 8.** Let \((\sigma, n, B) \rightarrow (\sigma', n', B')\).

1. If \(|B'| > 0\) then \(\text{fst}(B') \in \text{ succ}(\text{fst}(B))\).
2. If \(|B'| = 0\) then \(\text{ succ}(\text{fst}(B)) = 0\).

In both cases, \(\text{dom } \sigma \subset \text{ dom } \sigma'\) holds, and \(x \in \text{ def s}(\text{fst}(B))\) implies \(x \in \text{ dom } \sigma'\).

**Proof.** Induction on the rules for \((\sigma, n, B) \rightarrow (\sigma', n', B')\)

1. Both \(\text{fst}(B')\) and \(\text{fst}(B)\) are well-defined as \(|B'| > 0\) holds and no reduction \((\sigma, n, B) \rightarrow (\sigma', n', B')\) exists for \(|B| = 0\).
Lemma 9. Let $S_0 \rightarrow^* (\sigma, n, I\bar{I})$.

1. If $I \not\in \Phi$-Block then $x \in \text{uses}(I)$ implies $x \in \text{dom } \sigma$. 

Chapter 6. Intermediate program analysis
2. If $I$ has the form $[l].\mathbf{x}(k) = M(k,m)\mathbf{n}(m)$ and $\mathbf{n} = (n_1, \ldots, n_m)$ then there is a unique $j$ such that $n = n_j$. Furthermore, $x_{ij} \in \text{dom } \sigma$ holds for all $i$.

Proof. 1. If $(\sigma, n, I) = S_0$ then $\text{uses}(I) = \emptyset$ because the last condition in Definition 24 requires for each $x \in \text{uses}(I)$ the existence of an $J \neq I$ on the trivial path $(\text{fst}(\text{n(entry})) = (I)$ with $x \in \text{defs}(J)$.

For $(\sigma, n, I) \neq S_0$ and $S_0 \rightarrow \ldots \rightarrow S_l = (\sigma, n, I)$ and $S_l = (\sigma_i, n_i, B_i)$, Lemma 8 implies $\text{fst}(B_{i+1}) \in \text{succ}(\text{fst}(B_i))$ for $0 \leq i < l$. Consequently, the last condition in Definition 24 guarantees that for each $x \in \text{uses}(I)$ there is a $j < l$ such that $x \in \text{defs}(\text{fst}(B_j))$. By Lemma 8 $x \in \text{dom } \sigma_{j+1}$ holds and by induction on $l - j$ we obtain $x \in \text{dom } \sigma'$.

2. For $I = \phi$, $P$ cannot be a regular program since no $\Phi$-Blocks occur in regular programs.

For an SSA program $P$ and $I = \phi$, $S_0 \neq (\sigma, n, I)$ holds because $I = \text{n(entry)}$ and Definition 27 requires $\text{fst}(\text{n(entry)}) \neq \emptyset$. Hence there are $h > 0$ and $S_{h-1}$ such that $S_0 \rightarrow^* S_{h-1} \rightarrow S_h = (\sigma, n, I)$. We write $S_{h-1} = (\sigma', n', B)$. Since $I$ is of the form $[l].\mathbf{x} = \mathbf{Mn}$, the step $S_{h-1} \rightarrow^* (\sigma, n, I)$ must have been a Jmp, If-t or If-f step, as the form of component $B$ on the LHS in the rules Ass and Phi implies that the component $B$ in the RHS of the same rule does not start with a $\Phi$-block. In the three rules Jmp, If-t and If-f, the value in the second component of the resulting configuration equals the label of the jump instruction, so Lemma 8 yields $I \in \text{succ}(\text{fst}(B))$. By Definition 27 there is thus a $j$ such that $n_j = n$, and $j$ is unique since all components of $\mathbf{n}$ are distinct by Definition 27. By the last clause of that definition, $I_{x_{ij}}$ dominates $\text{fst}(B)$, so $x_{ij} \in \text{dom } \sigma$ follows in a similar way as in part (1).

\[ \square \]

**Proposition 29.** (Determinacy of $\rightarrow$) If $S \rightarrow T$ and $S \rightarrow R$ then $T = R$.

Proof. Follows by induction on the rules for $\rightarrow$, using the determinacy of $\downarrow$ and $\sigma[x \rightarrow a]$, and in rule Phi the distinctness of the components of $\mathbf{x}$ (Definition 27).

\[ \square \]
**Proposition 30.** If $S_0 \rightarrow^* (\sigma, n, B)$ and $|B| > 0$ then there is a unique $T$ such that $(\sigma, n, B) \rightarrow T$.

**Proof.** For $B = I\bar{I}$ and $I \neq \emptyset$ Lemma 9 implies that $\text{uses}(I) \subset \text{dom} \sigma$, so for all rules the conditions $\sigma, e \Downarrow a$ succeed since in all rules all variables in the expression $e$ are contained in $\text{uses}(I)$. Also, evaluation $\sigma, e \Downarrow a$ is deterministic. In the rules for (unconditional and conditional) jumps, $I \in \{[k] \text{jmp}(m), [k] \text{if } x m l, [k] \text{if } x l m\}$ implies $m \in \text{dom } \mathbb{N}$ by Definition 24, so the branch target $\mathbb{N}(m)$ is guaranteed to exist.

If $I$ has the form $[l]x(k) = M(k,m)\mathbb{N}(m)$ then Lemma 9 implies that there is a unique $j$ such that $n = n_j$ and $x_{ij} \in \text{dom } \sigma$ for all $1 \leq i \leq k$. Again, this implies that $\sigma, x_{ij} \Downarrow a_i$ is well-defined and for each $i$ a unique such $a_i$ exists, so the definition of $\sigma'$ in rule Phi is unique.

In particular, targets of a conditional or unconditional jump are always in the domain of $\mathbb{N}$ and the corresponding rules Jmp, If-t and If-t are well-defined.

### 6.3 Analysis of variable usage

This section presents the program analysis on which the translation from IL into ALEF is based, and we treat both regular and SSA programs. The analysis consists of determining the usage pattern for each variable. In general, a variable whose value is accessed exactly once may be suitable for forwarding whereas a variable whose content is accessed repeatedly has to be communicated through a register.

The usage analysis we are interested in concerns the *dynamic* number of reading accesses rather than the *static* number of accessing instructions. For example, in program

$$
\begin{align*}
N(1) &= [1]v = \ldots & N(6) &= [6]u = x * 2 \\
 & \quad [2]x = 2 & & \quad [7] \text{ jmp } 8 \\
 & \quad [3] \text{ if } v 4 6 & & \\
 & \quad [5] \text{ jmp } 8
\end{align*}
$$

(6.5)

the variable $x$ is always accessed exactly once although there are two (static) instructions reading from $x$. Also $u$ is accessed exactly once, although the values might stem from different assigning instructions.
The opposite situation may also occur: a variable may be dynamically accessed multiple times despite the existence of only one accessing instruction. For example, the number of accesses to \( y \) in code

\[
N(1) = [1]v = \ldots \quad N(4) = [4]u = y + 3 \quad N(7) = [7]v = 7
\]

\[
\]

\[
[3]if \ v > 7 \quad [6]if \ v > 7
\]

depends on the initial value of \( v \).

Other variables are accessed an unknown number of times but each access corresponds to one execution of an assignment to that variable. This is for example the case for both \( w \) and \( z \) in program

\[
N(1) = [1]v = \ldots \quad N(4) = [4]z = w \times 2 \quad N(8) = [8]v = w + 1
\]

\[
\]

\[
[3]if \ v > 8 \quad [6]v = v - 1
\]

\[
[7]if \ v > 8
\]

Each execution of an assignment to \( w \) or \( z \) corresponds to exactly one reading access, independently from the initial value of \( v \).

A complete analysis would detect all situations where the number of dynamic assignments equals the number of dynamic accesses in all executions. However, this task is in general not computable, as is shown by program

\[
N(1) = [1]v = 0 \quad N(6) = [6]y3 = y1 \times y1
\]

\[
[2]y1 = 2 \quad [7]jmp \ 8
\]

\[
[3]if \ v > 6
\]

\[
N(4) = [4]y2 = y1 + 3 \quad N(8) = [8]v = 7
\]

\[
[5]jmp \ 8
\]

where \( y1 \) is dynamically assigned to exactly once, and is also accessed exactly once provided that the left branch is taken. Indeed, the assignment \( v = 0 \) ensures that this is the case. However, the general task to detect what path the control flow takes at runtime is undecidable. Thus, any usage analysis is necessarily conservative in the sense that some opportunities for forwarding are not detected. The analysis we present does not
involve techniques for analysing the values of variables or the control flow and will hence mark variable y1 in as being non-forwardable. On the other hand, our analysis is powerful enough to detect the linear usage of variables w, x and z in programs (6.5) and (6.7) and to separate it from the possibly non-linear usage of y in program (6.6).

Our method consists of a dataflow analysis similar to the task of determining the liveness of a variable. We therefore briefly review liveness analysis before adapting it to usage analysis. Our summary follows the expositions in [App98a] and [NNH99], and the more interested reader is referred to these sources for more in-depth descriptions.

### 6.3.1 Dataflow analysis for liveness

Liveness analysis aims at detecting whether the value of a variable is needed beyond a particular point in a program.

**Definition 33.** Variable $x \in \text{Var}_\mathcal{P}$ is called live at instruction $I \in \text{Instr}_\mathcal{P}$ if there is a $J \in \text{Instr}_\mathcal{P}$ with $x \in \text{uses}(J)$ and a path from $I$ to $J$ in $\mathcal{P}$ which does not contain any $K$ with $x \in \text{defs}(K)$.

Like other dataflow problems, liveness can be calculated from the sets of variables defined and used in instructions as follows.

To each instruction $I$, two sets of variables $\text{liveIn}(I)$ and $\text{liveOut}(I)$ are associated containing (approximations to) the set of variables live before and after the execution of $I$, respectively. Formally, this corresponds to defining two functions

$$
\text{liveIn}, \text{liveOut} : \text{Instr}_\mathcal{P} \rightarrow \mathcal{P}(\text{Var}_\mathcal{P})
$$

and we denote the function space $\text{Instr}_\mathcal{P} \rightarrow \mathcal{P}(\text{Var}_\mathcal{P})$ by $\text{FSP}_{lv}$. A variable $x$ is live-out at $I$ if it is needed in any of the successors of $I$. It is live-in at $I$ if it is either live-out but not assigned to in $I$ or it is used in the RHS of $I$. This motivates us to require that the functions fulfil the pointwise constraints

$$
\begin{align*}
\text{liveOut}(I) & \supseteq \bigcup_{J \in \text{succ}(I)} \text{liveIn}(J) \\
\text{liveIn}(I) & \supseteq (\text{liveOut}(I) \setminus \text{defs}(I)) \cup \text{uses}(I)
\end{align*}
$$

(6.9)

Any pair $(f, g)$ of functions mapping instructions to subsets of $\text{Var}_\mathcal{P}$ which fulfils the above inclusions for each $I$ (i.e. can be substituted for $\text{liveOut}$ and $\text{liveIn}$) gives some
liveness information. However, the larger the subsets grow, the less useful information they convey – in particular the trivial pair \((\lambda I. \text{Var}, \lambda I. \text{Var})\) conveys no information at all – it states that all variables are (potentially) always live. The following approach results in non-trivial solutions which are as small as possible.

The endofunction

\[
\text{update} : (\text{FSP}_{lv} \times \text{FSP}_{lv}) \rightarrow (\text{FSP}_{lv} \times \text{FSP}_{lv})
\]

corresponding to the above constraints, defined by

\[
\text{update}(f, g) = (\lambda J. \cup_{J \in \text{succ}(I)} g(J), \lambda I. (f(I) \setminus \text{def}(I)) \cup \text{uses}(I))
\]

is monotone in both arguments where the partial order over functions in \(\text{FSP}_{lv}\) is defined by

\[
f \sqsubseteq g \iff \forall I \in \text{Instr}_P. f(I) \subseteq g(I)
\]

and \(\lambda I. \emptyset\) is the least element. The codomain of \(\text{FSP}_{lv}\) is a finite lattice, consisting of the subsets of \(\text{Var}_P\) ordered by set inclusion, with \(\perp = \emptyset\) and \(\top = \text{Var}_P\). Consequently, \(\text{update}\) has a (unique) least fixed point \((f^{lfp}, g^{lfp})\) by the theorem of Knaster-Tarski [Tar55]: there is a unique pair \((f^{lfp}, g^{lfp})\) which fulfils the constraints (6.9) with equality and is minimal with respect to that property.

The above motivation suggests that the least fixed point represents the preferred solution to the liveness constraints as it results in the smallest sets of live variables at each program point. The calculation of the fixed-point solution proceeds by initialising

\[
\text{liveIn}(I) = \text{liveOut}(I) = \emptyset
\]

for all \(I\) and iterating the \(\text{update}\) operation – termination is guaranteed. Rapid convergence can be achieved by work-list algorithms and traversing the nodes in reverse postorder in each update iteration, i.e. by updating the sets for successors of \(I\) prior to visiting \(I\). Since \(\text{liveIn}(I)\) is defined in terms of \(\text{liveOut}(I)\), the flow of information is in the opposite direction of the program (execution) flow, and liveness analysis is called a backwards dataflow problem.
Example. For program
\[
\begin{align*}
\end{align*}
\]

the constraints (6.9) expand to
\[
\begin{align*}
liveIn([1]) & \supseteq (liveOut([1]) \setminus \{x\}) \cup \emptyset \\
liveOut([1]) & \supseteq liveIn([2]) \supseteq (liveOut([2]) \setminus \{y\}) \cup \emptyset = liveOut([2]) \setminus \{y\} \\
liveOut([2]) & \supseteq liveIn([3]) \supseteq (liveOut([3]) \setminus \emptyset) \cup \{y\} = liveOut([3]) \cup \{y\} \\
liveOut([3]) & \supseteq liveIn([4]) \cup liveIn([7]) \\
& \supseteq ((liveOut([4]) \setminus \{x\}) \cup \{x\}) \cup ((liveOut([7]) \setminus \{y\}) \cup \{x, y\}) \\
& = (liveOut([4]) \cup \{x\}) \cup (liveOut([7]) \cup \{x, y\}) \\
& \supseteq (liveOut([5]) \setminus \{z\}) \cup \{x, y\} \\
liveOut([6]) & \supseteq liveIn([1]) \\
liveOut([7]) & \supseteq liveOut([8]) \supseteq liveOut([8]) \setminus \emptyset = liveOut([8]) \\
liveOut([8]) & \supseteq liveIn([1])
\end{align*}
\]

resulting in the recursive relation
\[
\begin{align*}
liveOut([1]) & \supseteq liveOut([2]) \setminus \{y\} \\
& \supseteq (liveOut([3]) \cup \{y\}) \setminus \{y\} = liveOut([3]) \setminus \{y\} \\
& \supseteq ((liveOut([4]) \cup \{x\}) \cup (liveOut([7]) \cup \{x, y\})) \setminus \{y\} \\
& = (liveOut([4]) \cup liveOut([7]) \cup \{x\}) \setminus \{y\} \\
& \supseteq ((liveOut([5]) \setminus \{z\}) \cup \{x, y\} \cup liveOut([8]) \cup \{x\}) \setminus \{y\} \\
& = ((liveOut([5]) \setminus \{z\}) \cup liveOut([8]) \cup \{x\}) \setminus \{y\} \\
& \supseteq ((liveOut([1]) \setminus \{x, z\}) \cup (liveOut([1]) \setminus \{x\}) \cup \{x\}) \setminus \{y\} \\
& = (liveOut([1]) \cup \{x\}) \setminus \{y\}
\end{align*}
\]

Its least fixed point is \( liveOut([1]) = \{x\} \), and substituting this into the constraints (6.11) yields the (again minimal) sets
\[
\begin{align*}
liveOut([2]) &= liveOut([3]) = liveOut([4]) = \{x, y\} \\
liveOut([5]) &= liveOut([6]) = liveOut([7]) = liveOut([8]) = \emptyset.
\end{align*}
\]
6.3. Analysis of variable usage

Any solution contains the minimal solution, hence the analysis shows that $x$ and $y$ are jointly live-out at instructions [2], [3] and [4]. During register allocation, this liveness information is used for deducing that $x$ and $y$ must be mapped to different registers.

In contrast to non-minimal solutions such as $\text{liveOut} = \lambda I. \{x, y, z\}$ the above solution allows $z$ to share a register with either $x$ or $y$.  

Another application of liveness analysis is the elimination of useless variables. Any variable $x$ for which $x \in \text{defs}(I)$ implies $x \notin \text{liveOut}(I)$ may be eliminated from the program. For example, the above minimal solution for program (6.10) indicates that $z$ is useless as it is not live-out at its defining instruction. In general, deleting defining instructions of a useless variables opens further opportunities for optimisation. Not only may other variables become useless themselves, but also repeating the liveness-analysis may reduce the number of conflicts between variables for register allocation.

For proving soundness, (solutions to) the dataflow equations need to be related to the dynamic semantics. [NNH99] present an elegant indirect approach: they prove that the value to which a variable $x$ is mapped is irrelevant at program points where $x$ is not live. A corresponding equivalence relation equates states which only differ in the content of non-live variables and is shown to be preserved by the dynamic semantics.

6.3.2 Usage analysis for regular programs

Usage analysis arises from liveness analysis by replacing the question whether a variable is used by the question how often it is accessed. In this section we aim to detect variables with exactly one use for each execution of an assigning instruction. The direction of the flow of information agrees with that of liveness: once we know how often a variable is used in the part of the program following an instruction $I$ we can deduce the number of times it is used in the part including $I$ by inspecting the form of $I$. In addition, usage analysis for variable $x$ can be restricted to the part of the program in which $x$ is live – and algorithms may thus work backwards from the liveness frontier to the instructions defining $x$.

The two pieces of information associated with each instruction need to be modified.
Instead of assigning sets of variables to instructions via functions

\[
\text{liveIn, liveOut} : \text{Instrs}_P \rightarrow \mathcal{P}(\text{Var}_P)
\]
or equivalently functions \(\text{liveIn, liveOut} : \text{Instrs}_P \rightarrow \text{Var}_P \rightarrow \mathcal{B}\) from variables to the lattice of booleans, we employ functions

\[
fwdIn, fwdOut : \text{Instrs}_P \rightarrow \text{Var}_P \rightarrow \mathcal{L}
\]

where \(\mathcal{L}\) is the lattice

\[
\begin{array}{c c c c}
& 0 & 1 & \top \\
\hline
\bot & 0 & 1 & \top \\
0 & 0 & 1 & \top \\
1 & 1 & 1 & \top \\
\top & \top & \top & \top \\
\end{array}
\]

The bottom element represents unknown information, whereas \(\top\) indicates contradictory usage in different branches or existence of more than one usage. Elements 0 and 1 represent exactly none and exactly one usage, respectively.

We define the operation \(\oplus\) over \(\mathcal{L}\) by

\[
\begin{array}{c|c c c c}
\oplus & \bot & 0 & 1 & \top \\
\hline
\bot & \bot & \bot & 1 & \top \\
0 & \bot & 0 & 1 & \top \\
1 & 1 & 1 & \top & \top \\
\top & \top & \top & \top & \top \\
\end{array}
\]

and use \(\sqcup\) to denote least upper bounds of elements in \(\mathcal{L}\), with \(\sqcup 0 = \bot\).

The operation \(\oplus\) is commutative and associative and fulfils the following properties.

**Lemma 10.** Let \(a, b, \ldots \in \mathcal{L}\). Then \(a \oplus b \sqsubseteq 0\) holds exactly if \(a \sqsubseteq 0\) and \(b \sqsubseteq 0\). Furthermore, \(\oplus\) is pointwise monotone: if \(a \sqsubseteq c\) and \(b \sqsubseteq d\) then \(a \oplus b \sqsubseteq c \oplus d\).

**Proof.** Inspection of the table defining \(\oplus\) and the structure of \(\mathcal{L}\). \(\square\)
On the other hand, \( \oplus \) is not increasing as \( a \sqsubseteq a \oplus b \) does not hold in general.

We define functions \( \text{uses} : \text{Expr} \rightarrow \text{Var} \rightarrow \mathcal{L} \) and \( \text{uses} : \text{Instr} \rightarrow \text{Var} \rightarrow \mathcal{L} \) by

\[
\begin{align*}
\text{uses}(a)(y) &= 0 \\
\text{uses}(x)(y) &= \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases} \\
\text{uses}(x \text{ bop } e)(y) &= \text{uses}(x)(y) \oplus \text{uses}(e)(y) \\
\text{uses}([n]x = e)(y) &= \text{uses}(e)(y) \\
\text{uses}([n]\text{ if } x n_1 n_2)(y) &= \text{uses}(x)(y) \\
\text{uses}([n]\text{ jmp } m)(y) &= 0
\end{align*}
\]

(6.12)

generalising the \( \text{uses} \) predicate of liveness analysis. The functions \( \text{fwdIn} \) and \( \text{fwdOut} \) are given by the mutually recursive equations

\[
\begin{align*}
f\text{wdOut}(I)(x) &= \begin{cases} 0 & \text{if } \text{succ}(I) = \emptyset \\
\sqcup_{J \in \text{succ}(I)} f\text{wdIn}(J)(x) & \text{otherwise} \end{cases} \\
f\text{wdIn}(I)(x) &= \begin{cases} \text{uses}(I)(x) & \text{if } x \in \text{defs}(I) \\
\text{uses}(I)(x) \oplus f\text{wdOut}(I)(x) & \text{otherwise} \end{cases}
\end{align*}
\]

(6.13)

The \( \text{update} \) function corresponding to equations (6.13) is

\[
\begin{align*}
\text{update}(f, g) &= (\lambda I. \lambda x. \text{if } \text{succ}(I) = \emptyset \text{ then } 0 \text{ else } \sqcup_{J \in \text{succ}(I)} g(J)(x), \\
&\quad \lambda I. \lambda x. \text{if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(x) \text{ else } \text{uses}(I)(x) \oplus f(I)(x))
\end{align*}
\]

where \( f \) and \( g \) are of functionality

\[
\text{FSP}_\text{fwd} = \text{Instr} \rightarrow (\text{Var} \rightarrow \mathcal{L}).
\]

The lattice structure of \( \mathcal{L} \) is inherited by the function space \( \text{Var} \rightarrow \mathcal{L} \) where functions \( F, G : \text{Var} \rightarrow \mathcal{L} \) are ordered pointwise by

\[
F \sqsubseteq G \text{ iff } \forall x \in \text{Var} \cdot F(x) \sqsubseteq G(x)
\]

and minimal and maximal elements are \( \lambda x. \bot \) and \( \lambda x. \top \). For \( f, g : \text{FSP}_\text{fwd} \) with

\[
f \sqsubseteq g \text{ iff } \forall I \in \text{Instr} \cdot f(I) \sqsubseteq g(I)
\]
the function

\[ \text{update} : (\text{FSP}_{\text{fwd}} \times \text{FSP}_{\text{fwd}}) \to (\text{FSP}_{\text{fwd}} \times \text{FSP}_{\text{fwd}}) \]

is thus monotone in both arguments: \( f \sqsubseteq f' \) implies by definition \( f(I) \sqsubseteq f'(I) \) for all \( I \) and thus \( f(I)(x) \sqsubseteq f'(I)(x) \) for all \( I \) and \( x \), so \( I \oplus f(I)(x) \sqsubseteq I \oplus f'(I)(x) \) holds (by Lemma 10) for all \( I \), \( x \) and \( l \in L \), hence

\[
\text{update}(f, g) = \begin{cases} 
\lambda I. \lambda x. & \text{if } \text{succ}(I) = \emptyset \text{ then } 0 \text{ else } \bigcup_{J \in \text{succ}(I)} g(J)(x), \\
\lambda I. & \lambda x. \text{ if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(x) \text{ else } \text{uses}(I)(x) \oplus f(I)(x) \\
\subseteq & \lambda I. \lambda x. \text{ if } \text{succ}(I) = \emptyset \text{ then } 0 \text{ else } \bigcup_{J \in \text{succ}(I)} g(J)(x), \\
\lambda I. & \lambda x. \text{ if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(x) \text{ else } \text{uses}(I)(x) \oplus f'(I)(x) \\
= & \text{update}(f', g)
\end{cases}
\]

and similarly one obtains \( \text{update}(f, g) \sqsubseteq \text{update}(f, g') \) for \( g \sqsubseteq g' \). Consequently, fixed points exist by the theorem of Knaster-Tarski. In accordance with the general observation of [NNH99] on the relation of fixed points to the direction of the lattice operation \( \sqsubseteq \) we observe that the least fixed point to the equations (6.13) represents the preferred solution to the usage analysis problem as it detects the most linearly used variables.

The result of the forwardability analysis is represented in the \( \text{fwdOut} \) component of fixed points: we will prove in Section 6.4 that the value assigned to \( x \) in instruction \( I \) is accessed exactly once if \( \text{fwdOut}(I)(x) = 1 \) holds.

**Example.** The structure of the lattice \( L \) may be motivated using the two programs

\[
\begin{align*}
N(1) = & \quad [1]x = 2 \\
& \quad [2]y = 3 \\
& \quad [3] \text{ jmp } 4 \\
N(4) = & \quad [4]y = y + x \\
& \quad [5] \text{ jmp } 4 \\
N(3) = & \quad [3]y = x + 1 \\
N(5) = & \quad [5]z = y + 2 \\
\end{align*}
\]

In the program on the left, \textit{uses} and \textit{defs} are given by

\[
\text{uses}([4])(x) = \text{uses}([4])(y) = 1 \text{ and } \text{uses}(I)(x) = \text{uses}(I)(y) = 0 \text{ for } I \neq [4]
\]

and

\[
\text{defs}([1]) = \{x\}, \text{defs}([2]) = \text{defs}([4]) = \{y\} \text{ and } \text{defs}([3]) = \text{defs}([5]) = \emptyset.
\]
The definition of \( \text{fwdIn} \) and \( \text{fwdOut} \) results in the recursive equations

\[
\begin{align*}
\text{fwdOut}([4])(x) &= 0 \oplus 1 \oplus \text{fwdOut}([4])(x) \\
\text{fwdOut}([4])(y) &= 0 \oplus 1 \\
\text{fwdOut}([2])(x) &= 0 \oplus 1 \oplus \text{fwdOut}([4])(x) \\
\text{fwdOut}([2])(y) &= 0 \oplus 1 \\
\text{fwdOut}([1])(x) &= 0 \oplus 0 \oplus 1 \oplus \text{fwdOut}([4])(x) \\
\text{fwdOut}([1])(y) &= 0.
\end{align*}
\]

We aim at an analysis which detects that \( x \) is possibly used infinitely often but allows \( y \) to be forwarded: each value which is assigned to \( y \) is subsequently accessed exactly once. By the above correspondence between \( \text{fwdOut}(I)(x) \) and \( x \in \text{defs}(I) \) we thus aim at a fixed point with

\[
\begin{align*}
\text{fwdOut}([1])(x) &= \top \\
\text{fwdOut}([2])(y) &= \text{fwdOut}([4])(y) = 1
\end{align*}
\]

and the last condition motivates us to define

\[
0 \oplus 1 = 1.
\]

Monotonicity \( 1 \oplus \bot \subseteq 1 \oplus 0 \) yields \( 1 \oplus \bot \subseteq 1 \), but defining \( 1 \oplus \bot = \bot \) would lead to the least fixed point for \( \text{fwdOut}([4])(x) \) being \( \bot \), hence to \( \text{fwdOut}([1])(x) = \bot \neq \top \), in contradiction to our requirement. We therefore stipulate

\[
1 \oplus \bot \neq \bot.
\]

For the program on the right, the interesting equations are

\[
\begin{align*}
\text{fwdOut}([3])(y) &= \text{fwdIn}([4])(y) = 0 \oplus (\text{fwdIn}([5])(y) \sqcup \text{fwdIn}([3])(y)) \\
\text{fwdIn}([5])(y) &= 1 \oplus 0 \\
\text{fwdIn}([3])(y) &= 0
\end{align*}
\]

which simplify to

\[
\text{fwdOut}([3])(y) = 0 \oplus ((1 \oplus 0) \sqcup 0) = 0 \oplus (1 \sqcup 0).
\]
The intended value is \( \text{fwdOut}([3])(y) = \top \) since \( y \) cannot be forwarded – it is not used inside the loop. This motivates us to require \( 1 \sqcup 0 = \top \) which yields the diamond shape of \( L \).

For defining \( \oplus \), the entries for \( \top \oplus l \) and the values on the diagonal are clearly motivated by the purpose of the analysis. We observed \( 0 \oplus 1 = 1 \) above, and noted \( \bot \neq 1 \oplus \bot \subseteq 1 \), so the only option left is to set
\[
1 \oplus \bot = 1.
\]

Finally, the only choice for \( \bot \oplus 0 \) is
\[
\bot \oplus 0 = \bot
\]
since all other choices violate monotonicity – consider \( 1 \oplus 0 = 1 \) for the alternative values \( 0 \) and \( \top \) and \( 0 \oplus 0 = 0 \) for \( 1 \).

\[\diamondsuit\]

**Example.** The analysis for the example programs (6.5) to (6.8) is as follows. Table 6.1 shows the sets \( \text{defs}(I) \) and the pointwise functions \( \text{uses}(I) \) for all instructions. For example, \( \{z\} = \text{defs}([4]) \) holds in program (6.7) and \( \text{uses}([6])(y1) = \top \) in program (6.8).

Applying equations (6.13) to program (6.5) yields Table 6.2 from which
\[
\text{fwdIn}([6])(x) = 1 \oplus \text{fwdOut}([6])(x)
\]
and
\[
\text{fwdOut}([3])(u) = \text{fwdIn}([4])(u) \sqcup \text{fwdIn}([6])(u)
\]
can be deduced. The data in Table 6.2 simplifies to Table 6.3. We can discover the linear usage of \( x \) by observing that \( \text{fwdOut}([2])(x) = 1 \) holds where \([2]\) is the only instruction \( I \) with \( x \in \text{defs}(I) \). Likewise, both assignments to \( u \) are used linearly as for \( I \in \{[4],[6]\} \) we have \( u \in \text{defs}(I) \) and \( \text{fwdOut}(I)(u) = 1 \).

Similarly, Table 6.4 shows the result of applying equations (6.13) to program (6.6). For the variable \( y \) we obtain
\[
\text{fwdOut}([2])(y) = \ldots = \text{fwdIn}([4])(y) \sqcup \text{fwdIn}([7])(y)
\]
\[
\text{fwdIn}([4])(y) = \ldots = 1 \oplus (\text{fwdIn}([4])(y) \sqcup \text{fwdIn}([7])(y))
\]
\[
\text{fwdIn}([7])(y) = \ldots = 0
\]
6.3. Analysis of variable usage

<table>
<thead>
<tr>
<th>( I )</th>
<th>(6.5)</th>
<th>(6.6)</th>
<th>(6.7)</th>
<th>(6.8)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( \text{defs} )</td>
<td>( \text{uses} )</td>
<td>( \text{defs} )</td>
<td>( \text{uses} )</td>
</tr>
<tr>
<td>1</td>
<td>{( v )}</td>
<td>0 0 0</td>
<td>{( v )}</td>
<td>0 0 0</td>
</tr>
<tr>
<td>2</td>
<td>{( x )}</td>
<td>0 0 0</td>
<td>{( y )}</td>
<td>0 0 0</td>
</tr>
<tr>
<td>3</td>
<td>0 0 1</td>
<td>0 0 1</td>
<td>0 0 1</td>
<td>0 0 1</td>
</tr>
<tr>
<td>4</td>
<td>{( u )}</td>
<td>1 0 0</td>
<td>{( u )}</td>
<td>1 0 0</td>
</tr>
<tr>
<td>5</td>
<td>0 0 0</td>
<td>{( v )}</td>
<td>0 0 1</td>
<td>{( w )}</td>
</tr>
<tr>
<td>6</td>
<td>{( u )}</td>
<td>1 0 0</td>
<td>0 0 1</td>
<td>{( v )}</td>
</tr>
<tr>
<td>7</td>
<td>0 0 0</td>
<td>{( v )}</td>
<td>0 0 0</td>
<td>0 0 1</td>
</tr>
<tr>
<td>8</td>
<td>{( v )}</td>
<td>0 1 0</td>
<td>{( v )}</td>
<td>1 0 0</td>
</tr>
</tbody>
</table>

Table 6.1: Dataflow information for programs (6.5) to (6.8).

Since the least fixed point of \( I = 1 \oplus (L \uplus 0) \) is \( \top \), this equation yields \( \text{fwdIn}(\{4\})(y) = \top \), hence \( \text{fwdOut}(\{2\})(y) = \top \), and we discover the non-linear usage of \( y \) due to \( y \in \text{defs}(\{2\}) \).

For program (6.7) we similarly obtain the equations

\[
\begin{align*}
\text{fwdOut}(\{2\})(w) &= \text{fwdIn}(\{4\})(w) \sqcup \text{fwdIn}(\{8\})(w) \\
\text{fwdOut}(\{5\})(w) &= \text{fwdIn}(\{4\})(w) \sqcup \text{fwdIn}(\{8\})(w) \\
\text{fwdIn}(\{4\})(w) &= \text{fwdIn}(\{8\})(w) = 1 \\
\text{fwdOut}(\{4\})(z) &= 1 \oplus (\text{fwdIn}(\{4\})(z) \sqcup \text{fwdIn}(\{8\})(z)) \\
\text{fwdIn}(\{4\})(z) &= \text{fwdIn}(\{8\})(z) = 0 \\
\text{fwdOut}(\{1\})(v) &= 1 \oplus (\text{fwdIn}(\{4\})(v) \sqcup \text{fwdIn}(\{8\})(v)) \\
\text{fwdIn}(\{4\})(v) &= 1 \\
\text{fwdIn}(\{8\})(v) &= 0
\end{align*}
\]

and thus correctly identify \( z \) and \( w \) as being used linearly and \( v \) as being non-linear.

Finally, in program (6.8), \( \text{uses}(\{6\})(y_1) = \top \) yields \( \text{fwdOut}(\{2\})(y_1) = \top \), and we correctly identify \( y_1 \) as being used non-linearly.

\( \Box \)
<table>
<thead>
<tr>
<th>( I )</th>
<th>( \text{fwdIn} )</th>
<th>( \text{fwdOut} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td>( \text{fwdOut([1])} )</td>
<td>( \text{fwdOut([1])} )</td>
</tr>
<tr>
<td>[2]</td>
<td>( 0 )</td>
<td>( \text{fwdOut([2])} )</td>
</tr>
<tr>
<td>[3]</td>
<td>( \text{fwdOut([3])} )</td>
<td>( \text{fwdOut([3])} )</td>
</tr>
<tr>
<td>[4]</td>
<td>( \text{fwdOut([4])} )</td>
<td>( 0 )</td>
</tr>
<tr>
<td>[5]</td>
<td>( \text{fwdOut([5])} )</td>
<td>( \text{fwdOut([5])} )</td>
</tr>
<tr>
<td>[6]</td>
<td>( \text{fwdOut([6])} )</td>
<td>( 0 )</td>
</tr>
<tr>
<td>[7]</td>
<td>( \text{fwdOut([7])} )</td>
<td>( \text{fwdOut([7])} )</td>
</tr>
<tr>
<td>[8]</td>
<td>( \text{fwdOut([8])} )</td>
<td>( 1 \oplus \text{fwdOut([8])} )</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>( I )</th>
<th>( \text{fwdOut} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td>( \text{fwdIn([2])} )</td>
</tr>
<tr>
<td>[2]</td>
<td>( \text{fwdIn([3])} )</td>
</tr>
<tr>
<td>[3]</td>
<td>( \text{fwdIn([4])} \sqcup \text{fwdIn([6])} )</td>
</tr>
<tr>
<td>[4]</td>
<td>( \text{fwdIn([5])} )</td>
</tr>
<tr>
<td>[5]</td>
<td>( \text{fwdIn([8])} )</td>
</tr>
<tr>
<td>[6]</td>
<td>( \text{fwdIn([7])} )</td>
</tr>
<tr>
<td>[7]</td>
<td>( \text{fwdIn([8])} )</td>
</tr>
<tr>
<td>[8]</td>
<td>( 0 )</td>
</tr>
</tbody>
</table>

Table 6.2: Equations (6.13) applied to program (6.5).
### Table 6.3: Simplification of Table 6.2

<table>
<thead>
<tr>
<th>$I$</th>
<th>$fwdIn$</th>
<th>$fwdOut$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$x$</td>
<td>$u$</td>
</tr>
<tr>
<td>[1]</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>[2]</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>[3]</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>[4]</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>[5]</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>[6]</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>[7]</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>[8]</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

### Table 6.4: Equations (6.13) applied to program (6.6)

<table>
<thead>
<tr>
<th>$I$</th>
<th>$fwdIn$</th>
<th>$fwdOut$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$y$</td>
<td>$u$</td>
</tr>
<tr>
<td>[1]</td>
<td>$fwdOut([1])$</td>
<td>$fwdOut([1])$</td>
</tr>
<tr>
<td>[2]</td>
<td>0</td>
<td>$fwdOut([2])$</td>
</tr>
<tr>
<td>[3]</td>
<td>$fwdOut([3])$</td>
<td>$fwdOut([3])$</td>
</tr>
<tr>
<td>[4]</td>
<td>$1 \oplus fwdOut([4])$</td>
<td>0</td>
</tr>
<tr>
<td>[5]</td>
<td>$fwdOut([5])$</td>
<td>$fwdOut([5])$</td>
</tr>
<tr>
<td>[6]</td>
<td>$fwdOut([6])$</td>
<td>$fwdOut([6])$</td>
</tr>
<tr>
<td>[7]</td>
<td>$fwdOut([7])$</td>
<td>$fwdOut([7])$</td>
</tr>
</tbody>
</table>
6.3.3 Usage analysis for SSA

In the presence of $\Phi$-functions, we use modified definitions of $\text{uses}, \text{fwdIn}$ and $\text{fwdOut}$. As motivation, consider a block $B = \phi I$ with $\phi$ of the form $[k](x_3) = \begin{pmatrix} x_1 & x_2 \\ n & m \end{pmatrix}$ and incoming control flow arrows from instructions $\text{lst}(B_1)$ and $\text{lst}(B_2)$ labelled $n$ and $m$, respectively, such that each $B_i$ contains an assignment to $x_i$. When calculating $\text{fwdIn}(\phi)$, only the usage of $x_1$ should be promoted upwards to $B_1$ (it is not available in $B_2$) and only the usage of $x_2$ to $B_2$.

By providing the control flow successor through an additional argument, the functionalities of $\text{uses}, \text{fwdIn}$ and $\text{fwdOut}$ are generalised to

$$\begin{align*}
\text{uses} : \text{Instrs}_P &\to \mathbb{N}_P \to \text{Var}_P \to \mathbb{L} \\
\text{fwdIn}, \text{fwdOut} : \text{Instrs}_P &\to \mathbb{N}_P \to \text{Var}_P \to \mathbb{L}
\end{align*}$$

For $\phi$ of the form $[l]x(k) = M(k,m)n(m)$, we define

$$\text{uses}(\phi)(n)(y) = \begin{cases} 
\bigoplus_{i=1}^k \text{uses}(x_{ij})(y) & \text{if } n = n_j \text{ for some } j \\
0 & \text{if } \forall 1 \leq j \leq m, n \neq n_j
\end{cases}$$

and for non-Phi instructions we define

$$\text{uses}(I)(n)(y) = \text{uses}'(I)(y)$$

where $\text{uses}'$ is the definition of $\text{uses}$ in the previous section. For all instruction forms, functions $\text{fwdIn}$ and $\text{fwdOut}$ are defined by

$$\text{fwdIn}(I)(n)(y) = \begin{cases} 
\text{uses}(I)(n)(y) & \text{if } y \in \text{defs}(I) \\
\text{uses}(I)(n)(y) \oplus \text{fwdOut}(I)(n)(y) & \text{otherwise}
\end{cases}$$

and

$$\text{fwdOut}(I)(n)(x) = \begin{cases} 
0 & \text{if } \text{succ}(I) = \emptyset \\
\bigsqcup_{J \in \text{succ}(I)} \text{fwdIn}(J)(k)(x) & \text{if } \text{succ}(I) \neq \emptyset
\end{cases}$$

where $k$ is the label of $I$.

Similarly to the previous subsection, we obtain solutions to the dataflow equations by calculating the (least) fixed point. Functions in $\mathbb{N}_P \to \text{Var}_P \to \mathbb{L}$ may be ordered pointwise by

$$\varphi \sqsubseteq \psi \text{ iff } \forall n \in \mathbb{N}_P. \varphi(n) \sqsubseteq \psi(n)$$
where $\varphi(n) \sqsubseteq \psi(n)$ is the partial order over $\text{Var}_p \rightarrow L$ defined above. Consequently, the function space

$$\text{FSP}_{\text{SSA}} = \text{Instrs}_p \rightarrow N_p \rightarrow \text{Var}_p \rightarrow L$$

is ordered by

$$f \sqsubseteq g \text{ iff } \forall I \in \text{Instrs}_p. f(I) \sqsubseteq g(I)$$

where $\lambda I. (\lambda n. (\lambda x. \bot))$ is the least element. Hence,

$$\text{update} : (\text{FSP}_{\text{SSA}} \times \text{FSP}_{\text{SSA}}) \rightarrow (\text{FSP}_{\text{SSA}} \times \text{FSP}_{\text{SSA}})$$

defined by

$$\text{update}(f, g) = \begin{align*}
&\lambda I. \lambda n. \lambda x. & \text{ if } \text{succ}(I) = \emptyset \text{ then } 0 \\
&\quad \text{ else let } k = \text{label}(I) \in \bigsqcup_{J \in \text{succ}(I)} g(J)(k)(x), \\
&\quad \lambda I. \lambda n. \lambda x. & \text{ if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(n)(x) \\
&\quad \text{ else } \text{uses}(I)(n)(x) \oplus f(I)(n)(x))
\end{align*}$$

is monotone: $f \sqsubseteq f'$ implies by definition $f(I) \sqsubseteq f'(I)$ for all $I \in \text{Instrs}_p$, hence $f(I)(n) \sqsubseteq f'(I)(n)$ for all $I \in \text{Instrs}_p$ and $n \in N_p$ and $f(I)(n)(x) \sqsubseteq f'(I)(n)(x)$ for all $I \in \text{Instrs}_p$, $n \in N_p$ and $x \in \text{Var}_p$, thus $l \oplus f(I)(n)(x) \sqsubseteq l \oplus f'(I)(n)(x)$ for all $I \in \text{Instrs}_p$, $n \in N_p$, $x \in \text{Var}_p$ and $l \in L$ and therefore

$$\text{update}(f, g) = \begin{align*}
&\lambda I. \lambda n. \lambda x. & \text{ if } \text{succ}(I) = \emptyset \text{ then } 0 \\
&\quad \text{ else let } k = \text{label}(I) \in \bigsqcup_{J \in \text{succ}(I)} g(J)(k)(x), \\
&\quad \lambda I. \lambda n. \lambda x. & \text{ if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(n)(x) \\
&\quad \text{ else } \text{uses}(I)(n)(x) \oplus f(I)(n)(x))
\end{align*} \sqsubseteq \begin{align*}
&\lambda I. \lambda n. \lambda x. & \text{ if } \text{succ}(I) = \emptyset \text{ then } 0 \\
&\quad \text{ else let } k = \text{label}(I) \in \bigsqcup_{J \in \text{succ}(I)} g(J)(k)(x), \\
&\quad \lambda I. \lambda n. \lambda x. & \text{ if } x \in \text{defs}(I) \text{ then } \text{uses}(I)(n)(x) \\
&\quad \text{ else } \text{uses}(I)(n)(x) \oplus f'(I)(n)(x))
\end{align*} = \text{update}(f', g)$$

Similarly, one may show $\text{update}(f, g) \sqsubseteq \text{update}(f, g')$ for $g \sqsubseteq g'$, hence the finiteness of the lattice $\text{FSP}_{\text{SSA}}$ and the theorem of Knaster-Tarski guarantee the existence of (least) fixed points.
Example. Consider the following program $P = (N, A, \text{succ}, 1)$

$$
\begin{align*}
N(1) &= [1]v = 7 \\
N(3) &= [3]y1 = 3 \\
N(6) &= [6]y2 = 7 \\
N(9) &= [9] (y) = \begin{pmatrix} y1 & y2 \\ z1 & z2 \end{pmatrix}^{5} \\
&= [10]w = z + y
\end{align*}
$$

The definition of uses yields

$$
\begin{align*}
\text{uses}(2)(m)(v) &= \text{uses}(7)(m)(y2) = 1 \text{ for all } m \\
\text{uses}(9)(5)(y1) &= \text{uses}(9)(5)(z1) = 1 \\
\text{uses}(9)(8)(y2) &= \text{uses}(9)(8)(z2) = 1 \\
\text{uses}(10)(m)(y) &= \text{uses}(10)(m)(z) = 1 \text{ for all } m
\end{align*}
$$

and uses(I)(m)(x) = 0 elsewhere. We consequently obtain

$$
\begin{align*}
fwdOut([10])(m)(x) &= 0 \\
fwdIn([10])(m)(x) &= \begin{cases} 
\text{uses}(10)(m)(x) & \text{if } x = w \\
\text{uses}(10)(m)(x) \oplus fwdOut([10])(m)(x) & \text{otherwise}
\end{cases} \\
&= \begin{cases} 
1 & \text{if } x \in \{z, y\} \\
0 & \text{otherwise}
\end{cases}
\end{align*}
$$

$$
\begin{align*}
fwdOut([9])(m)(x) &= fwdIn([10])([9])(x) = \begin{cases} 
1 & \text{if } x \in \{z, y\} \\
0 & \text{otherwise}
\end{cases} \\
fwdIn([9])(m)(x) &= \begin{cases} 
\text{uses}(9)(m)(w) & \text{if } x \in \{y, z\} \\
\text{uses}(9)(m)(w) \oplus fwdOut([9])(m)(w) & \text{otherwise}
\end{cases} \\
&= \begin{cases} 
1 & \text{if } x \in \{z1, y1\} \text{ and } m = 5 \\
1 & \text{if } x \in \{z2, y2\} \text{ and } m = 8 \\
0 & \text{otherwise}
\end{cases}
\end{align*}
$$
6.3. Analysis of variable usage

\[
fwdOut([8])(m)(x) = fwdIn([9])([8])(x)
= \begin{cases} 
1 & \text{if } x \in \{z2, y2\} \\
0 & \text{otherwise}
\end{cases}
\]

\[
fwdIn([8])(m)(x) = \text{uses}([8])(m)(x) \oplus fwdOut([8])(m)(x)
= \begin{cases} 
1 & \text{if } x \in \{z2, y2\} \\
0 & \text{otherwise}
\end{cases}
\]

\[
fwdOut([7])(m)(x) = fwdIn([8])([7])(x) = \begin{cases} 
1 & \text{if } x \in \{z2, y2\} \\
0 & \text{otherwise}
\end{cases}
\]

\[
fwdIn([7])(m)(x) = \begin{cases} 
\text{uses}([7])(m)(x) & \text{if } x \in \{z2\} \\
\text{uses}([7])(m)(x) \oplus fwdOut([7])(m)(x) & \text{otherwise}
\end{cases}
\]

\[
fwdOut([6])(m)(x) = fwdIn([7])([6])(x) = \begin{cases} 
\top & \text{if } x \in \{y2\} \\
0 & \text{otherwise}
\end{cases}
\]

\[
fwdIn([6])(m)(x) = \begin{cases} 
\text{uses}([6])(m)(x) & \text{if } x \in \{y2\} \\
\text{uses}([6])(m)(x) \oplus fwdOut([6])(m)(x) & \text{otherwise}
\end{cases}
\]

\[
fwdOut([5])(m)(x) = fwdIn([9])([5])(x)
= \begin{cases} 
1 & \text{if } x \in \{z1, y1\} \\
0 & \text{otherwise}
\end{cases}
\]

\[
fwdIn([5])(m)(x) = \text{uses}([5])(m)(x) \oplus fwdOut([5])(m)(x)
= \begin{cases} 
1 & \text{if } x \in \{z1, y1\} \\
0 & \text{otherwise}
\end{cases}
\]
\[ \text{fwdOut}([4])(m)(x) = \text{fwdIn}([5])([4])(x) = \begin{cases} 1 & \text{if } x \in \{z1,y1\} \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{fwdIn}([4])(m)(x) = \begin{cases} \text{uses}([4])(m)(x) & \text{if } x \in \{z1\} \\ \text{uses}([4])(m)(x) \oplus \text{fwdOut}([4])(m)(x) & \text{otherwise} \end{cases} \]

\[ = \begin{cases} 1 & \text{if } x \in \{y1\} \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{fwdOut}([3])(m)(x) = \text{fwdIn}([4])([3])(x) = \begin{cases} 1 & \text{if } x \in \{y1\} \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{fwdIn}([3])(m)(x) = \begin{cases} \text{uses}([3])(m)(x) & \text{if } x \in \{y1\} \\ \text{uses}([3])(m)(x) \oplus \text{fwdOut}([3])(m)(x) & \text{otherwise} \end{cases} \]

\[ = 0 \]

\[ \text{fwdOut}([2])(m)(x) = \text{fwdIn}([3])([2])(x) \cup \text{fwdIn}([6])([2])(x) = 0 \]

\[ \text{fwdIn}([2])(m)(x) = \text{uses}([2])(m)(x) \oplus \text{fwdOut}([2])(m)(x) \]

\[ = \begin{cases} 1 & \text{if } x \in \{v\} \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{fwdOut}([1])(m)(x) = \text{fwdIn}([2])([1])(x) = \begin{cases} 1 & \text{if } x \in \{v\} \\ 0 & \text{otherwise} \end{cases} \]

\[ \text{fwdIn}([1])(m)(x) = \begin{cases} \text{uses}([1])(m)(x) & \text{if } x \in \{v\} \\ \text{uses}([1])(m)(x) \oplus \text{fwdOut}([1])(m)(x) & \text{otherwise} \end{cases} \]

\[ = 0 \]

Note that for all \( I, x \) and \( m_1, m_2, \text{fwdOut}(I)(m_1)(x) = \text{fwdOut}(I)(m_2)(x) \) holds whenever \( x \in \text{defs}(I) \). The non-linear usages of \( y2 \) and \( w \) can be detected by observing that \( \text{fwdOut}([6])(m)(y2) = T \) and \( \text{fwdOut}([10])(m)(w) = 0 \) hold, where \([6] = I_{y2} \) and \([10] = I_w \). For all other pairs \((I, x)\) with \( x \in \text{defs}(I) \), \( \text{fwdOut}(I)(m)(x) = 1 \) holds and these variables are indeed used exactly once. \( \Diamond \)
6.4 Soundness of usage analysis

For proving the soundness of the dataflow analysis we define a more fine-grained operational model $\Rightarrow$. This non-standard dynamic semantics reflects our understanding of limited reading access. Each time a variable is assigned to, the semantics associates a tag with the value saying how many times the value may be accessed for reading. We distinguish between unlimited reading capability, single reading capability and no reading capability. Variables of the latter category may not be accessed for reading but may be assigned to. Variables whose tag is of the 'read-once' form may be accessed for reading but not be assigned to, and the first reading operation changes the tag to 'no-reads'. Variables carrying an 'unlimited' tag may be read arbitrarily often and may also be reassigned to.

We show that the dynamic semantics including tags is sound with respect to the semantics $\rightarrow$ of IL: any $\Rightarrow$-execution abstracts to a $\rightarrow$-execution. On the other hand, $\Rightarrow$ is relative complete for $\rightarrow$ in the sense that for any $\rightarrow$-execution there is an annotation for the variables for which $\Rightarrow$ mirrors $\rightarrow$. Finally, we show that the dataflow analysis is sound for $\Rightarrow$, i.e. that annotations resulting from the dataflow analysis guarantee that a program does not get stuck by performing too many reading accesses to a variable or by reassigning too early.

6.4.1 Non-standard dynamic semantics

The capabilities consist of elements of the lattice $\mathcal{L}$: the unlimited capability is $\top$ and the 'read-once' and 'no-read' capabilities are given by 1 and 0 respectively. Capabilities are incorporated into the operational model by extending the codomain of states to $\mathcal{L} \times \text{Val}$, and extended states are ranged over by mappings $\Sigma : \text{Var}_F \rightarrow (\mathcal{L} \times \text{Val})$.

The update operations $\Sigma[x \mapsto (l,a)]$ and $\Sigma[x_1 \mapsto (l_1,a_1), \ldots, x_n \mapsto (l_n,a_n)] = \Sigma[x_1 \mapsto (l_1,a_1)] \ldots [x_n \mapsto (l_n,a_n)]$ (for distinct $x_i$) are given by the general notions defined in Chapter 2.

The evaluation relation $\Downarrow$ modifies the capabilities as described above

\[
\begin{align*}
\Sigma, a & \Downarrow \Sigma, a \\
\Sigma, x \downarrow \Sigma_1, a & \quad \Sigma, x \downarrow \Sigma_1, BOP(a,b)
\end{align*}
\]
\[ x \in \text{dom } \Sigma \quad \Sigma(x) = (l, a) \quad l \not\equiv 1 \quad \Sigma_1 = \Sigma[x \mapsto (l \ominus 1, a)] \]
\[ \Sigma, x \downarrow \Sigma_1, a \]

\[ \begin{array}{c}
\Sigma, x \downarrow \Sigma_1, a \\
\Sigma_1, y \downarrow \Sigma_2, b \\
\Sigma, x \, \text{bop } y \downarrow \Sigma_2, \text{BOP}(a, b)
\end{array} \]

where we define
\[ l_1 \ominus l_2 = \sqcup \{ l \mid l_1 \ominus l_2 = l_1 \}. \]

In particular, the condition \( l \not\equiv 1 \) in the rule for \( \Sigma, x \downarrow \Sigma_1, a \) forbids reading from a variable tagged with 0. The complementary write permission is denoted by \( \downarrow \).

**Definition 34.** For \( \Sigma : \text{Var}_p \rightarrow (L \times \text{Val}) \) and \( x \in \text{Var}_p \), we write \( \Sigma \downarrow x \) if \( \Sigma(x) = (l, a) \) implies \( l \not\equiv 0 \).

The definition of \( \ominus \) may be expanded using \( \sqcup \emptyset = \bot \), which yields the table

<table>
<thead>
<tr>
<th>( \ominus )</th>
<th>( \bot )</th>
<th>0</th>
<th>1</th>
<th>( \top )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( \bot )</td>
<td>( \bot )</td>
<td>( \bot )</td>
<td>( \bot )</td>
<td>( \bot )</td>
</tr>
<tr>
<td>0</td>
<td>( \bot )</td>
<td>( \bot )</td>
<td>( \bot )</td>
<td>( \bot )</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>( \bot )</td>
</tr>
<tr>
<td>( \top )</td>
<td>( \top )</td>
<td>( \top )</td>
<td>( \top )</td>
<td>( \top )</td>
</tr>
</tbody>
</table>

and the following properties.

**Lemma 11.** Let \( a, b, c \in L \). Then

1. \( a \ominus a \not\equiv 0 \)

2. \( (a \ominus b) \ominus b \not\equiv a \)

3. \( (1 \ominus a \ominus b) \ominus a \not\equiv 1 \)

4. \( (a \ominus b) \ominus c = a \ominus (b \ominus c) \) for \( b, c \neq \bot \).

5. \( a \not\equiv b \neq \bot \) implies \( a \ominus c \not\equiv b \ominus c \).

**Proof.** Inspection of the tables defining \( \ominus \) and \( \ominus \). \( \square \)
The dynamic semantics for extended configurations is given by the relation $S \Rightarrow^\alpha T$ which is parametric in an association function $\alpha$. This function assigns to each pair $(I, x)$ such that $I \in \mathbf{Instr}_p$ and $x \in \mathbf{uses}(I)$ an element $\alpha(I, x) \in \mathcal{L}$, representing the tag to be associated with $x$ during an assignment.

$$\begin{align*}
\text{JMP} & : (\Sigma, n, [k], \text{jmp } m) \Rightarrow^\alpha (\Sigma, k, n(m)) \\
\text{IF-T} & : \Sigma, x \Downarrow \Sigma_1, 0 \\
& \quad (\Sigma, n, [k] \text{ if } x \downarrow m_1 \downarrow m_2) \Rightarrow^\alpha (\Sigma_1, k, n(m_1)) \\
\text{IF-F} & : \Sigma, x \Downarrow \Sigma_1, a \\
& \quad (\Sigma, n, [k] \text{ if } x \downarrow m_1 \downarrow m_2) \Rightarrow^\alpha (\Sigma_1, k, n(m_2)) \\
\text{ASS} & : (\Sigma, n, \text{ass } \bar{I}) \Rightarrow^\alpha (\Sigma, \text{ass } [x \mapsto (\alpha(\text{ass}, x), a)], n, \bar{I}) \\
\text{PHI} & : \forall 1 \leq i \leq k, \Sigma_i, x_i \Downarrow \Sigma_{i+1}, a_i \\
& \quad (\Sigma, n, \phi \bar{I}) \Rightarrow^\alpha (\Sigma, n, \bar{I}) \\
\text{ASS} & : (\Sigma, n, \text{ass } \bar{I}) \\
\text{PHI} & : \forall 1 \leq i \leq k, \Sigma_i, x_i \Downarrow \Sigma_{i+1}, a_i \\
& \quad (\Sigma, n, \phi \bar{I}) \Rightarrow^\alpha (\Sigma, n, \bar{I}) \\
& \quad \phi = [l]x(k) = M(m, k)n(m) \\
& \quad n = n_j \\
& \quad \forall 1 \leq i \leq k, \Sigma_{i+1} \Downarrow x_i \\
& \quad \Sigma = \Sigma_{k+1}[x_i \mapsto (\alpha(\phi, x_i), a_i)]
\end{align*}$$

In order to justify the occurrence of lattice elements in the dynamic semantics, we observe that an entry $\Sigma(x) = (l, a)$ admits

- no reading access if $l = 0$ since $\Sigma, x \not\Downarrow \Sigma_1, b$ for all $\Sigma_1$ and $b$ because the side condition $l \uparrow 1$ is violated
- exactly one reading access if $l = 1$, since $\Sigma, x \Downarrow \Sigma_1, b$ implies $b = a$ and $\Sigma_1(x) = (1 \ominus 1, a) = (0, a)$.
- arbitrarily many reading accesses (including zero) if $l = \top$ as $\Sigma, x \Downarrow \Sigma_1, b$ implies $b = a$ and $\Sigma_1 = \Sigma$ due to $\top \ominus 1 = \top$
- a writing access exactly if $l \Downarrow 0$ holds, due to the side condition $\Sigma \Downarrow x$ in rules ASS and PHI.
Initialising annotations \( l \) for variable \( x \) during each assignment to \( x \) thus determines how often the assigned value can be read, but also how often it must be read prior to a subsequent assignment. In particular, a value with \( l = 1 \) must be read exactly once before it can be assigned to again, but also can be read from only once. Values with \( l = 1 \) are thus candidates for forwarding.

By incorporating the lattice elements into the operational model, the dataflow analysis may be proven sound in a different approach than [NNH99]. Configurations are naturally related to the functions \( \text{fwdIn}/\text{fwdOut} \) using the tags, and the duality between \( \ominus \) and \( \oplus \) embodies the directional relationship between (forward) program execution and (backward) dataflow analysis.

Before proving soundness, we relate \( \Rightarrow \) to \( \rightarrow \).

**Definition 35.** The tag-free state \( \widehat{\Sigma} : \text{Var}_P \rightarrow \text{Val} \) corresponding to \( \Sigma : \text{Var}_P \rightarrow (\mathcal{L} \times \text{Val}) \) is given pointwise by

\[
\widehat{\Sigma}(x) = a \text{ iff } \exists l. \Sigma(x) = (l, a).
\]

For \( S = (\Sigma, n, B) \) we also write \( \widehat{S} \) for the configuration \( (\widehat{\Sigma}, n, B) \).

**Proposition 31.** (Soundness of \( \Rightarrow \) w.r.t. \( \rightarrow \))

\[
S \Rightarrow_{\alpha} T \text{ implies } \widehat{S} \rightarrow \widehat{T}
\]

**Proof.** Induction on the rule used for \( (\Sigma, n, B) \Rightarrow_{\alpha} (\Sigma', n', B') \).

**Case JMP.** For \( B = [k] \text{ jmp } m \), the definition of rule JMP implies \( \Sigma = \Sigma', n' = n \) and \( B' = N(m) \). Thus, \( \widehat{\Sigma} = \widehat{\Sigma'} \) holds and the claim holds trivially by rule Jmp.

**Case IF-T.** For \( B = [k] \text{ if } x \text{ m } l \), the definition of rule IF-T implies \( \Sigma = \Sigma_1, n' = k \) and \( B' = N(m) \) where \( \Sigma, x \parallel \Sigma_1, 0 \). For \( \Sigma, x \parallel \Sigma_1, 0 \) to hold, we must have \( \Sigma(x) = (l, 0) \) with \( 1 \sqsubset l \) and \( \Sigma_1 = \Sigma[x \mapsto (l \ominus 1, 0)] \). Consequently, \( \widehat{\Sigma}(x) = 0 \) and \( \widehat{\Sigma} = \widehat{\Sigma}_1 \), so the claim follows by rule If-t.

**Case IF-F.** Similar to the previous case.
Case ASS. For $B = \text{ass} \overline{T}$ and $\text{ass} = [k]x = e$, the definition of rule ASS implies $\Sigma' = \Sigma_1[x \mapsto (\alpha(\text{ass},x),a)], n' = n$ and $B' = \overline{T}$ where $\Sigma, e \vdash \Sigma_1, a$. Consequently, $\widehat{\Sigma}, e \downarrow a$, $\widehat{\Sigma}_1 = \widehat{\Sigma}$ and 

$$\widehat{\Sigma}' = \Sigma_1[x \mapsto (\alpha(\text{ass},x),a)] = \widehat{\Sigma}_1[x \mapsto a] = \widehat{\Sigma}[x \mapsto a]$$

hold so the claim follows from the definition of rule Ass.

Case PHI. For $B = \phi \overline{T}$ and $\phi = [f]x(k) = M(k,m)n(m)$, the definition of rule PHI implies

- $n = n_j$ for a $1 \leq j \leq m$,
- $\Sigma_i, x_i \downarrow \Sigma_{i+1}, a_i$ for some $\Sigma = \Sigma_1, \ldots, \Sigma_{k+1}$ and all $1 \leq i \leq k$,
- $\Sigma_{k+1} \downarrow x_i$ for all $1 \leq i \leq k$,
- $n' = n, B' = \overline{T}$ and $\Sigma' = \Sigma_{k+1}[x_i \mapsto (\alpha(\phi,x),a_i)]$

By the definition of $\downarrow$, $\widehat{\Sigma}_i = \widehat{\Sigma}_1$ and $\widehat{\Sigma}_i, x_i \downarrow a_i$ hold for all $1 \leq i \leq k$. Also, $\widehat{\Sigma}_{k+1} = \widehat{\Sigma}_1$ and $\widehat{\Sigma}' = \widehat{\Sigma}_1[x_i \mapsto a_i]$, and the claim follows from the definition of rule Phi.

By combining Propositions 31 and 29 we obtain that $\widehat{T} = \widehat{R}$ holds whenever $S \Rightarrow_\alpha T$ and $S \Rightarrow_\beta R$, for any $\alpha$ and $\beta$. As $\downarrow$ and $\downarrow$ are deterministic, one may also show that $T = R$ holds if $S \Rightarrow_\alpha T$ and $S \Rightarrow_\alpha R$ by adapting the proof of Proposition 29 to $\Rightarrow_\alpha$.

**Definition 36.** For program $P$ and configuration $S = (\sigma, n, B)$, the annotation $(\sigma)_\alpha$ of $\sigma$ according to annotation function $\alpha : (\text{Instrs}_P \times \text{Var}_P) \rightarrow \mathcal{L}$ is given by

$$(\sigma)_\alpha(x) = (\downarrow_{x \in \text{def}(I)} \alpha(I,x), a) \text{ iff } \sigma(x) = a.$$ 

and the $\alpha$-annotated configuration $(S)_\alpha$ by $((\sigma)_\alpha, n, B)$.

Consequently, 

$$\widehat{(\sigma)_\alpha} = \sigma$$
and

$$(\Sigma)_{\alpha}(x) = (\bigcup_{x \in \text{defs}(I)} \alpha(I, x), a)$$

iff $\Sigma(x) = (l, a)$

hold. For SSA programs, Definition 36 simplifies to

$$(\sigma)_{\alpha}(x) = (\alpha(I_x, x), a)$$

iff $\sigma(x) = a$

and hence

$$(\hat{\Sigma})_{\alpha}(x) = (\alpha(I_x, x), a)$$

iff $\Sigma(x) = (l, a)$.

**Proposition 32.** *(Completeness)* There is an annotation function $\alpha$ such that

$$S \rightarrow T \text{ implies } (S)_{\alpha} \Rightarrow (T)_{\alpha}.$$

*Proof.* Take $\alpha(I, x) = T$ whenever $x \in \text{defs}(I)$ and observe that for this $\alpha$, $\Sigma, e \downarrow \Sigma_1, a$ and $\Sigma = (\sigma)_{\alpha}$ implies $\Sigma = \Sigma_1$ since $T \oplus I = T$ holds.

Consequently, any $\rightarrow$-execution can be matched by a $\Rightarrow$-execution for the trivial $\alpha = \lambda x. T$. This gives unlimited read/write capabilities to all variables and corresponds to communicating all variables through registers. The result of Proposition 32 is thus not satisfactory for reasoning about forwardability - it does not relate executions to solutions for the dataflow equations. In the next subsection, we will prove the dataflow equations sound by showing that association functions $\alpha$ originating from a solution for the equations respect $\Rightarrow$.

In the following, we let $S_0$ also denote the initial state with respect to $\Rightarrow$. Confusion with the initial state with respect to $\rightarrow$ may always be resolved based on the context in which the notation is used.

**Lemma 12.** If $S_0 \Rightarrow^*_\alpha S$, $S = (\Sigma, n, I\overline{I})$ and $I \neq \emptyset$, then $x \in \text{uses}(I)$ implies $x \in \text{dom} \Sigma$.

*Proof.* For $S_0 \Rightarrow^*_\alpha S$, Proposition 31 implies $\Sigma \rightarrow^* \Sigma$, so for $x \in \text{uses}(I)$ Lemma 9 (first part) yields $x \in \text{dom} \Sigma$. Now apply $\text{dom} \Sigma = \text{dom} \hat{\Sigma}$.  

**Definition 37.** For $S = (\Sigma, n, B)$ and $f : \text{Var}_B \rightarrow \mathcal{L}$ we write $f \subseteq S$ if for all $x \in \text{dom} \Sigma$, $\Sigma(x) = (l, a)$ implies $f(x) \subseteq l$. 
6.4.2 Soundness for regular programs

We prove the equations (6.13) sound by showing that for any association $\alpha$ originating from a solution, $\Rightarrow_{\alpha}$ is implied by $\rightarrow$.

**Definition 38.** Let $P$ be a regular program. A pair of functions $$ \text{In, Out : Instrs}_P \rightarrow \text{Var}_P \rightarrow \mathcal{L} $$ is called a solution for $P$ if equations (6.13) are valid for the substitution $\text{In/fwdIn, Out/fwdOut}$.

In the remainder of this subsection, let $P = (\Sigma, A, \text{succ, entry})$ be a regular program and $\text{In}/\text{Out}$ a solution for $P$.

**Proposition 33.** If $x \in \text{Var}_P$, $I \in \text{Instrs}_P$ and $J \in \text{succ}(I)$ then $\text{Out}(I)(x) \supseteq \text{In}(J)(x)$.

**Proof.** From $J \in \text{succ}(I)$ we obtain $\text{succ}(I) \neq \emptyset$, hence

$$ \text{Out}(I)(x) = \sqcup_{K \in \text{succ}(I)} \text{In}(K)(x) \supseteq \text{In}(J)(x) $$

follows from equations (6.13) for any solution $\text{In}/\text{Out}$.

**Definition 39.** The association function $\alpha_{\text{Out}}$ is given by

$$ \alpha_{\text{Out}}(I, x) = \text{Out}(I)(x) $$

for all pairs $(I, x)$ such that $I \in \text{Instrs}_P$ and $x \in \text{defs}(I)$. For association functions $\alpha$ and $\beta$ we write $\alpha \sqsupseteq \beta$ if $\alpha(I, x) \supseteq \beta(I, x)$ for all $(I, x)$ such that $I \in \text{Instrs}_P$ and $x \in \text{defs}(I)$.

The following proposition represents the heart of the soundness proof. It guarantees that an execution sequence starting in $S_0$ whose capabilities are (at least as generous as) given by a solution to the dataflow equations can always proceed. Furthermore, the resulting configuration is related to the dataflow equations.

**Proposition 34.** (Progress) Let $S = (\Sigma, n, I \bar{T})$ and $\text{In}(I) \subseteq S$. Let $\alpha \sqsupseteq \alpha_{\text{Out}}$ and $S_0 \Rightarrow^*_\alpha S$. Then there is a (unique) $T$ such that $S \Rightarrow_{\alpha} T$. Furthermore, $T$ fulfils $\text{Out}(I) \subseteq T$.

**Proof.** Case distinction on the structure of $I$. 
**Case** $x = a$. $\Sigma, a \downarrow \Sigma, a$ holds by the definition of $\downarrow$.

In order to show $\Sigma \downarrow x$, let $\Sigma(x) = (l, b)$. Since $\text{In}(I) \sqsubseteq S$ implies $\text{In}(I)(x) \sqsubseteq l$, $\text{In}$ fulfils equations (6.13), and $x \in \text{defs}(I)$, we obtain

$$l \sqsupseteq \text{In}(I)(x) = \text{uses}(I)(x) = \text{uses}(a)(x) = 0.$$

For $T = (\Sigma_1, n, \bar{I})$ where $\Sigma_1 = \Sigma[x \mapsto (\alpha(I, x), a)]$ we thus have $S \Rightarrow_\alpha T$.

For showing $\text{Out}(I) \sqsubseteq T$, let $y \in \text{dom} \Sigma_1$ with $\Sigma_1(y) = (\bar{I}, \bar{a})$. There are two cases.

1. If $x = y$, then $\bar{I} = \alpha(I, x) \sqsubseteq \alpha_{\text{Out}}(I, x) = \text{Out}(I)(x)$ follows.

2. If $x \neq y$ then $\Sigma(y) = (\bar{I}, \bar{a})$ follows from the definition of $\Sigma_1$, so $\text{In}(I) \sqsubseteq S$ yields $\text{In}(I)(y) \sqsubseteq \bar{I}$. From $y \notin \text{defs}(I)$ we thus obtain

$$\bar{I} \sqsupseteq \text{In}(I)(y) = \text{uses}(I)(y) \oplus \text{Out}(I)(y) = \text{Out}(I)(y).$$

In both cases, $\text{Out}(I)(y) \sqsubseteq \bar{I}$ is fulfilled.

**Case** $x = x$. Since $S_0 \Rightarrow_\alpha^* S$ and $x \in \text{uses}(I)$ hold, Lemma 12 implies $x \in \text{dom} \Sigma$, so $\Sigma(x) = (l, a)$ for some $l$ and $a$, and by assumption $\text{In}(I)(x) \sqsubseteq l$.

For $\Sigma_1 = \Sigma[x \mapsto (l \ominus 1, a)]$, $\Sigma, x \downarrow \Sigma_1, a$ holds: $\text{In}$ fulfils equations (6.13), so $l \sqsupseteq \text{In}(I)(x) = \text{uses}(I)(x) = \text{uses}(x)(x) = 1$.

Also, $\Sigma_1 \downarrow x$ holds, because $\Sigma_1(x) = (l', a')$ yields $a' = a$ and $l' = l \ominus 1 \sqsupseteq 0$.

For the unique $T = (\Sigma_2, n, \bar{I})$ with $\Sigma_2 = \Sigma[x \mapsto (\alpha(I, x), a)]$ we thus have $S \Rightarrow_\alpha T$.

For showing $\text{Out}(I) \sqsubseteq T$, let $y \in \text{dom} \Sigma_2$ and $\Sigma_2(y) = (\bar{I}, \bar{a})$. There are two cases.

1. If $x = y$ then $\bar{I} = \alpha(I, x) \sqsubseteq \alpha_{\text{Out}}(I, x) = \text{Out}(I)(x)$ follows.

2. If $x \neq y$ then $\Sigma(y) = \Sigma_1(y) = \Sigma_2(y) = (\bar{I}, \bar{a})$ follows from the definition of $\Sigma_2$ and $\Sigma_1$. In particular, $x \in \text{dom} \Sigma$ holds, so the assumption $\text{In}(I) \sqsubseteq S$ yields $\text{In}(I)(y) \sqsubseteq \bar{I}$. Since $\text{In}$ fulfils equations (6.13), we obtain

$$\bar{I} \sqsupseteq \text{In}(I)(y) = \text{uses}(I)(y) \oplus \text{Out}(I)(y) = 0 \oplus \text{Out}(I)(y).$$

In both cases, $\text{Out}(I)(y) \sqsubseteq \bar{I}$ is fulfilled.
6.4. Soundness of usage analysis

Case \( x = z \) where \( z \neq x \). Since \( S_0 \Rightarrow \alpha S \) and \( z \in uses(I) \) hold, Lemma 12 implies \( z \in \text{dom } \Sigma \), so \( \Sigma(z) = (l, a) \) for some \( l \) and \( a \), and by assumption \( \text{In}(I)(z) \subseteq l \).

For \( \Sigma_1 = \Sigma[z \mapsto (l \ominus 1, a)] \), \( \Sigma, z \not\in \Sigma_1, a \) holds: \( \text{In} \) fulfills equations (6.13), so

\[
1 \supseteq \text{In}(I)(z) = \text{uses}(I)(z) \ominus \text{Out}(I)(z) = 1 \ominus \text{Out}(I)(z) \supseteq 1. \tag{6.22}
\]

Also, \( \Sigma_1 \downarrow x \) holds, because \( \Sigma_1(x) = (l', a') \) implies \( \Sigma(x) = (l', a') \) by the definition of \( \Sigma_1 \). By the assumption \( \text{In}(I) \subseteq S \), \( 0 = \text{uses}(I)(x) = \text{In}(I)(x) \subseteq l' \) follows.

For the unique \( T = (\Sigma_2, n, \tilde{l}) \) with \( \Sigma_2 = \Sigma[z \mapsto (\alpha(I, x), a)] \) we thus have \( S \Rightarrow \alpha T \).

For showing \( \text{Out}(I) \subseteq T \), let \( y \in \text{dom } \Sigma_2 \) and \( \Sigma_2(y) = (\tilde{l}, \tilde{a}) \). There are three cases.

1. If \( x = y \) then \( \tilde{l} = \alpha(I, x) \supseteq \alpha_{\text{Out}}(I, x) = \text{Out}(I)(x) \) follows.

2. If \( y = z \) then \( \tilde{l} = l \ominus 1 \) follows from \( z \neq x \) and

\[
\Sigma_2 = \Sigma_1[y \mapsto (\alpha(I, x), a)] = \Sigma[z \mapsto (l \ominus 1, a)][x \mapsto (\alpha(I, x), a)].
\]

By property (6.22), \( 1 \supseteq l \) holds, hence \( l \in \{1, \top\} \).

If \( l = 1 \) then \( \tilde{l} = l \ominus 1 = 0 \) holds, and property (6.22) implies

\[
1 = l \supseteq 1 \ominus \text{Out}(I)(z) = 1 \ominus \text{Out}(I)(y).
\]

Consequently, \( \text{Out}(I)(y) \in \{0, \bot\} \) and \( \text{Out}(I)(y) \subseteq \tilde{l} \).

If \( l = \top \) then \( \tilde{l} = l \ominus 1 = \top \), so \( \text{Out}(I)(y) \subseteq \tilde{l} \) is trivially fulfilled.

3. If \( x \neq y \neq z \) then \( \Sigma(y) = (\tilde{l}, \tilde{a}) \) holds due to

\[
\Sigma_2 = \Sigma_1[y \mapsto (\alpha(I, x), a)] = \Sigma[z \mapsto (l \ominus 1, a)][x \mapsto (\alpha(I, x), a)].
\]

The assumption \( \text{In}(I) \subseteq S \) implies \( \text{In}(I)(y) \subseteq \tilde{l} \), and as \( \text{In} \) fulfills the equations (6.13) and \( y \not\in \text{defs}(I) \) holds, we obtain

\[
\tilde{l} \supseteq \text{In}(I)(y) = \text{uses}(I)(y) \ominus \text{Out}(I)(y) = 0 \ominus \text{Out}(I)(y).
\]

In all three cases, \( \text{Out}(I)(y) \subseteq \tilde{l} \) follows.
**Case** \( x = x \ bop a \) For \( S_0 \Rightarrow^*_\alpha S \) and \( x \in \text{uses}(I) \), Lemma 12 implies \( x \in \text{dom} \Sigma \), so \( \Sigma(x) = (l,a') \) holds for some \( l \) and \( a' \), and the assumption \( \text{In}(I) \subseteq S \) yields \( \text{In}(I)(x) \subseteq l \).

For \( \Sigma_1 = \Sigma[x \mapsto (l \cup 1,a')] \) we obtain \( \Sigma,x \downarrow \Sigma_1,a' \): \( \text{In} \) fulfils equations (6.13), so

\[
I \supseteq \text{In}(I)(x) = \text{uses}(I)(x) = \text{uses}(x \ bop a)(x) = 1.
\]

Also \( \Sigma_1,a \downarrow \Sigma_1,a \) holds, and thus \( \Sigma,x \ bop a \downarrow \Sigma_1,b \) for \( b = \text{BOP}(a',a) \).

Furthermore, \( \Sigma_1(x) = (\tilde{l},\tilde{a}) \) implies \( \tilde{l} = l \cup 1 \) and \( \tilde{a} = a' \), and \( l \supseteq 1 \) yields \( \tilde{l} \supseteq 0 \) and thus \( \Sigma_1 \downarrow x \).

For the unique \( \mathcal{T} = (\Sigma_2,n,\tilde{l}) \) where

\[
\Sigma_2 = \Sigma_1[x \mapsto (\alpha(I,x),b)] \\
= \Sigma[x \mapsto (l \cup 1,a')][x \mapsto (\alpha(I,x),b)] \\
= \Sigma[x \mapsto (\alpha(I,x),b)]
\]

we thus obtain \( S \Rightarrow^\alpha \mathcal{T} \).

For showing \( \text{Out}(I) \subseteq \mathcal{T} \), let \( y \in \text{dom} \Sigma_2 \) and \( \Sigma_2(y) = (\tilde{l},\tilde{a}) \). There are two cases.

1. If \( x = y \) then \( \tilde{l} = \alpha(I,x) \supseteq \alpha_{\text{Out}}(I,x) = \text{Out}(I)(x) \) follows.

2. If \( x \neq y \) then \( \Sigma(y) = (\tilde{l},\tilde{a}) \) follows from the definition of \( \Sigma_2 \), hence \( y \in \text{dom} \Sigma \). The assumption \( \text{In}(I) \subseteq S \) yields \( \text{In}(I)(y) \subseteq \tilde{l} \), and from equations (6.13) and \( y \notin \text{defs}(I) \) we obtain

\[
\tilde{l} \supseteq \text{In}(I)(y) = \text{uses}(I)(y) \oplus \text{Out}(I)(y) = 0 \oplus \text{Out}(I)(y).
\]

In both cases, \( \text{Out}(I)(y) \subseteq \tilde{l} \) holds.

**Remaining cases** \( x = z \ bop a \) and \( x = z \ bop y \) Similar to the above cases. Consider \( z = x \) and \( z \neq x \) for the various forms of \( y \).

**Case** \( \text{jmp} \ m \) For the unique \( \mathcal{T} = (\Sigma,k,N(m)) \) where \( k \) is the label of \( I \), \( S \Rightarrow^\alpha \mathcal{T} \) holds.
In order to show \( \text{Out}(I) \subseteq T \), let \( y \in \text{dom} \Sigma \) and \( \Sigma(y) = (l, a) \). The assumption \( \text{In}(I) \subseteq S \) guarantees \( \text{In}(I)(y) \subseteq l \), and as \( \text{In} \) fulfills the equations (6.13),

\[
l \supseteq \text{In}(I)(y) = \text{uses}(I)(y) \oplus \text{Out}(I)(y) = \text{Out}(I)(y)
\]

follows as required.

**Case if \( x \neq m \).** Since \( S_0 \Rightarrow^*_\alpha S \) and \( x \in \text{uses}(I) \) hold, Lemma 12 implies \( x \in \text{dom} \Sigma \), so \( \Sigma(x) = (l, a) \) holds for some \( l \) and \( a \), with (by assumption) \( \text{In}(I)(x) \subseteq l \).

For \( \Sigma_1 = \Sigma[x \mapsto (l \ominus 1, a)] \), \( \Sigma, x \downarrow \Sigma_1, a \) holds: \( \text{In} \) fulfills equations (6.13), so

\[
l \supseteq \text{In}(I)(x) = \text{uses}(I)(x) \oplus \text{Out}(I)(x) = 1 \oplus \text{Out}(I)(x) \supseteq 1.
\]

If \( a = 0 \), \( S \Rightarrow^\alpha T \) holds for the unique \( T = (\Sigma_1, k, N(m)) \) where \( k \) is the label of \( I \), and for \( a \neq 0 \) the same holds for \( T = (\Sigma_1, k, N(I)) \).

For showing \( \text{Out}(I) \subseteq T \), let \( y \in \text{dom} \Sigma_1 \) and \( \Sigma_1(y) = (\tilde{l}, \tilde{a}) \). There are two cases.

1. If \( x = y \) then Lemma 11 (items (2) and (5)) and \( 1 \oplus \text{Out}(I)(x) \neq \bot \) yield

\[
\tilde{l} = l \ominus 1 \supseteq (1 \oplus \text{Out}(I)(x)) \ominus 1 \supseteq \text{Out}(I)(x).
\]

2. If \( y \neq x \) then \( \Sigma(y) = \Sigma_1(y) = (\tilde{l}, \tilde{a}) \), so \( \tilde{l} \supseteq \text{In}(I)(y) \) by assumption \( S \supseteq \text{In}(I) \), hence

\[
\tilde{l} \supseteq \text{In}(I)(y) = \text{uses}(I)(y) \oplus \text{Out}(I)(y) = 0 \oplus \text{Out}(I)(y) = \text{Out}(I)(y).
\]

In both cases, we obtain \( \text{Out}(I)(y) \subseteq \tilde{l} \).

As progress at each point in a computation is ensured, single steps \( S_i \Rightarrow^\alpha S_{i+1} \) may be composed. The property \( \text{Out}(I) \supseteq \text{In}(J) \) for \( J \in \text{succ}(I) \) guarantees composability, as it ensures that the precondition \( \text{In}(J) \subseteq S_{i+1} \) for step \( i+1 \) follows from the precondition \( \text{In}(I) \subseteq S_i \) of step \( i \). As all association functions respect the (deterministic) standard semantics \( \Rightarrow^\alpha \) by Proposition 31, we thus obtain soundness for any association function \( \alpha \supseteq \alpha_{Out} \).
Chapter 6. Intermediate program analysis

Theorem 6. (Soundness of dataflow equations for regular programs) Let \( \alpha \supseteq \alpha_{Out} \). If \( \hat{S}_0 \rightarrow \rightarrow \)-terminates in a configuration \( T \) then \( S_0 \Rightarrow \alpha \)-terminates in a configuration \( S \) with \( \hat{S} = T \) and \( \lambda x.0 \subseteq S \).

Proof. Let \( \hat{S}_0 = T_0 \rightarrow T_1 \rightarrow \ldots \rightarrow T_n = T \) be a \( \rightarrow \)-reduction sequence such that for \( T_i = (\sigma_i, n_i, B_i) \), \( I_i = \text{fst}(B_i) \) holds for \( 1 \leq i < n \) and \( |B_n| = 0 \). As \( |\text{N(entry)}| > 0 \) holds, we have \( n > 0 \). We show that there are \( S_1, \ldots, S_n \) such that (see Figure 6.2) for all \( 0 \leq i < n \)

1. \( \text{In}(I_i) \subseteq S_i \)

2. \( S_i \Rightarrow \alpha S_{i+1} \) and \( \text{Out}(I_i) \subseteq S_{i+1} \)

3. \( \hat{S}_{i+1} = T_{i+1} \)

4. \( S_0 \Rightarrow \alpha S_{i+1} \)

For \( i = 0 \), \( \hat{S}_0 = (\Sigma_0, n_0, B_0) \) implies \( \text{dom } \Sigma_0 = \text{dom } \sigma_0 = \emptyset \), so \( \text{In}(I_0) \subseteq S_0 \) holds trivially. \( B_0 = \text{N(entry)} \) implies \( |B_0| > 0 \), so Proposition 34 guarantees the existence of a unique...
6.4. Soundness of usage analysis

$S_1$ with $S_0 \Rightarrow^\alpha S_1$, and $Out(I_0) \subseteq S_1$. Finally, Proposition 31 implies $\widehat{S}_0 \Rightarrow \widehat{S}_1$, hence $\widehat{S}_1 = T_1$ by Proposition 29.

For $i + 1 < n$, the induction hypothesis implies $Out(I_i) \subseteq S_{i+1}$. Also, $|B_{i+1}| > 0$ holds, so $I_{i+1} \in \text{succ}(I_i)$ follows by Lemma 8, and Proposition 33 yields

$$In(I_{i+1}) \subseteq Out(I_i) \subseteq S_{i+1}. $$

As $S_0 \Rightarrow^*_\alpha S_{i+1}$ holds by induction hypothesis, Proposition 34 applies, hence there is a unique $S_{i+2}$ with

$$S_{i+1} \Rightarrow^\alpha S_{i+2}, $$

and

$$Out(I_{i+1}) \subseteq S_{i+2} $$

holds. The induction hypothesis also guarantees $\widehat{S}_{i+1} = T_{i+1}$, so Proposition 31 yields $T_{i+1} \Rightarrow \widehat{S}_{i+2}$, resulting in

$$\widehat{S}_{i+2} = T_{i+2} $$

by Proposition 29. Also, from $S_0 \Rightarrow^*_\alpha S_{i+1}$ and $S_{i+1} \Rightarrow^\alpha S_{i+2}$, we obtain

$$S_0 \Rightarrow^*_\alpha S_{i+2}, $$

completing the inductive proof.

Specialising to $i = n - 1$ yields an $S = S_n$ with $S_0 \Rightarrow^*_\alpha S_n = S$,

$$\hat{S} = \hat{S}_n = T_n = T $$

and $Out(I_{n-1}) \subseteq S_n = S$. As $|B_n| = 0$ holds, Lemma 8 implies $\text{succ}(I_{n-1}) = \emptyset$, so

$$\lambda x.0 \subseteq Out(I_{n-1}) \subseteq S_n = S $$

follows from equation (6.13).

Notice that in Theorem 6, as well as in Proposition 34, the association function $\alpha$ is not required to correspond to a solution to the dataflow equations. This freedom may be used during the allocation phase in Chapter 7 for assigning even forwardable variables to registers.
6.4.3 Soundness for SSA programs

Soundness of the dataflow solutions for SSA programs is proven similarly to soundness for regular programs.

**Definition 40.** For an SSA program $\mathcal{P}$, a pair of functions $\mathsf{In}, \mathsf{Out}: \mathsf{Instrs}_\mathcal{P} \to \mathbb{N}_\mathcal{P} \to \mathsf{Var}_\mathcal{P} \to \mathcal{L}$ is called a solution for $\mathcal{P}$ if equations (6.18) and (6.19) are fulfilled for the substitution $\mathsf{In}/\mathsf{fwdIn}, \mathsf{Out}/\mathsf{fwdOut}$.

In the remainder of this subsection, let $\mathcal{P} = (\mathbb{N}_\mathcal{P}, \alpha, \mathsf{succ}, \mathsf{entry})$ be an SSA program and $\mathsf{In}, \mathsf{Out}: \mathsf{Instrs}_\mathcal{P} \to \mathbb{N}_\mathcal{P} \to \mathsf{Var}_\mathcal{P} \to \mathcal{L}$ a solution for $\mathcal{P}$.

**Proposition 35.** Let $x \in \mathsf{Var}_\mathcal{P}$, $I \in \mathsf{Instrs}_\mathcal{P}$, $m, m_1, m_2 \in \mathbb{N}_\mathcal{P}$ arbitrary and $n \in \mathbb{N}_\mathcal{P}$ the label of $I$. Then

1. $\mathsf{Out}(I)(m_1)(x) = \mathsf{Out}(I)(m_2)(x)$
2. $J \in \mathsf{succ}(I)$ implies $\mathsf{Out}(I)(m)(x) \supseteq \mathsf{In}(J)(n)(x)$.

**Proof.** Both parts follow by applying equation (6.19).

1. If $\mathsf{succ}(I) = \emptyset$ then $\mathsf{Out}(I)(m_1)(x) = 0 = \mathsf{Out}(I)(m_2)(x)$ follows immediately, while $\mathsf{succ}(I) \neq \emptyset$ implies

   $$\mathsf{Out}(I)(m_1)(x) = \bigsqcup_{K \in \mathsf{succ}(I)} \mathsf{In}(K)(n)(x) = \mathsf{Out}(I)(m_2)(x).$$

2. Equation (6.19) implies $\mathsf{Out}(I)(m)(x) = \bigsqcup_{K \in \mathsf{succ}(I)} \mathsf{In}(K)(n)(x) \supseteq \mathsf{In}(J)(n)(x)$.

**Definition 41.** The association function $\alpha_{\mathsf{Out}}$ is given by

$$\alpha_{\mathsf{Out}}(I, x) = \bigsqcup_{m \in \mathbb{N}_\mathcal{P}} \mathsf{Out}(I)(m)(x)$$

for all pairs $(I, x)$ such that $I \in \mathsf{Instrs}_\mathcal{P}$ and $x \in \mathsf{defs}(I)$.

**Proposition 36.** (Progress) Let $S = (\Sigma, n, I\overline{I})$ and $\mathsf{In}(I)(g) \sqsubseteq S$ for all $g \in \mathbb{N}_\mathcal{P}$. Let $\alpha \sqsupseteq \alpha_{\mathsf{Out}}$ and $S_0 \Rightarrow^* \alpha S$. Then there is a (unique) $T$ such that $S \Rightarrow \alpha T$. Furthermore, $T$ fulfils $\mathsf{Out}(I)(h) \sqsubseteq T$ for all $h \in \mathbb{N}_\mathcal{P}$. 


6.4. Soundness of usage analysis

Proof. The proof proceeds by case distinction on $I$.

Case $I \notin \Phi$-Block Follows from Proposition 34.

Case $I = \emptyset$. For $\phi$ of the form $x(k) = M(m,k)n(m)$, the soundness of $\Rightarrow$ (Proposition 31) and $S_0 \Rightarrow^* \alpha \Rightarrow^* \hat{S}$ imply $S_0 \Rightarrow^* \hat{\Sigma}$, so by Lemma 9(part (2)) there is a unique $j$ such that $n = n_j$, and $x_i \in dom \Sigma = dom \Sigma$ holds for all $1 \leq i \leq k$. Consequently, for each $1 \leq i \leq k$, there are $l_i$ and $a_i$ such that $\Sigma(x_i) = (l_i,a_i)$. For all $1 \leq i \leq k$ we have $In(I)(n) \subseteq \Gamma$ by assumption, hence $In(I)(n_j)(x_i) \subseteq l_i$. Furthermore, $In(I)(n_j)(x_i) \subseteq 1$ holds for all $i$, since $x_i \in \text{defs}(I)$ implies

$$In(I)(n_j)(x_i) = \bigoplus_{\mu}^k \text{uses}(x_{\mu j})(x_i) = 1 \bigoplus \bigoplus_{\mu}^k \text{uses}(x_{\mu j})(x_i) \subseteq 1$$

and $x_i \notin \text{defs}(I)$ implies

$$In(I)(n_j)(x_i) = (\bigoplus_{\mu}^k \text{uses}(x_{\mu j})(x_i)) \bigoplus Out(I)(n_j)(x_i) = 1 \bigoplus (\bigoplus_{\mu}^k \text{uses}(x_{\mu j})(x_i)) \bigoplus Out(I)(n_j)(x_i) \subseteq 1.$$

The proof now proceeds in four steps.

1. We first show that there are $\Sigma = \Sigma_i, \ldots, \Sigma_{k+1}$ such that for all $1 \leq i \leq k$ the following properties hold.
   - $\Sigma_i, x_i \vdash \Sigma_{i+1}, a_i$
   - for all $v \leq k$, $\Sigma_{i+1}(x_{vj}) = (l_v \ominus (\bigoplus_{\mu}^i \text{uses}(x_{\mu j})(x_{\mu j})), a_v)$
   - $dom \Sigma_{i+1} = dom \Sigma$
   - if $y \in dom \Sigma_{i+1}$ and $y \neq x_{vj}$ holds for all $1 \leq v \leq k$ then $\Sigma_{i+1}(y) = \Sigma(y)$.

For $i = 1$ we define $\Sigma_2 = \Sigma_1[x_{1j} \mapsto (l_1 \ominus 1, a_1)] = \Sigma[x_{1j} \mapsto (l_1 \ominus 1, a_1)]$.

Then
   - $\Sigma_1, x_{1j} \vdash \Sigma_2, a_1$ is fulfilled, because $x_{1j} \in dom \Sigma_1 = dom \Sigma$ holds with $\Sigma_1(x_{1j}) = \Sigma(x_{1j}) = (l_1, a_1)$, where $l_1 \equiv In(I)(n_j)(x_{1j}) \equiv 1$ (see above).
   - for $1 \leq v \leq k$, we have

     if $x_{vj} \neq x_{1j}$: $\bigoplus_{\mu}^i \text{uses}(x_{\mu j})(x_{\mu j}) = \text{uses}(x_{vj})(x_{1j}) = 0$, hence

     $$\Sigma_2(x_{vj}) = \Sigma(x_{vj}) = (l_v, a_v) = (l_v \ominus (\bigoplus_{\mu}^i \text{uses}(x_{\mu j})(x_{\mu j})), a_v)$$
If \( x_{ij} = x_{1j} \): \( \bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j}) = \text{uses}(x_{ij})(x_{1j}) = 1 \) and \( (l_1, a_1) = \Sigma(x_{1j}) = \Sigma(x_{ij}) = (l_v, a_v) \), hence

\[ \Sigma_2(x_{ij}) = \Sigma_2(x_{1j}) = (l_1 \ominus 1, a_1) = (l_v \ominus (\bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j})), a_v) \]

- \( \text{dom } \Sigma_2 = \{x_{1j}\} \cup \text{dom } \Sigma_1 = \{x_{1j}\} \cup \text{dom } \Sigma = \text{dom } \Sigma \) by the definition of \( \Sigma_2 \) and because of \( x_{1j} \in \text{dom } \Sigma \) (see above).
- for \( y \in \text{dom } \Sigma_2, y \neq x_{ij} \) for all \( 1 \leq v \leq k \) implies \( y \neq x_{1j} \), hence

\[ \Sigma_2(y) = \Sigma_1[x_{1j} \mapsto \ldots ](y) = \Sigma_1(y) = \Sigma(y). \]

For \( 1 < i \leq k \), we first show that there is a (necessarily unique) \( \Sigma_{i+1} \) such that \( \Sigma_i, x_{ij} \downarrow \Sigma_{i+1}, a_i \). The induction hypothesis (item (2)) yields

\[ \Sigma_i(x_{ij}) = (l_v \ominus (\bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j})), a_v) \]

for all \( 1 \leq v \leq k \), so in particular for \( v = i \) we obtain

\[ \Sigma_i(x_{ij}) = (l_i \ominus (\bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j})), a_i). \]

Abbreviating \( l_i \ominus (\bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j})) \) by \( \tilde{l} \), we thus have \( \Sigma_i(x_{ij}) = (\tilde{l}, a_i) \), and we need to show \( \tilde{l} \uplus 1 \). There are two cases.

If \( x_{ij} \in \text{defs}(I) \), then

\[
\begin{align*}
  l_i & \equiv \text{In}(I)(n_j)(x_{ij}) = \bigoplus_{\mu = 1}^{k-1} \text{uses}(x_{\mu j})(x_{ij}) \\
  & = 1 \bigoplus (\bigoplus_{\mu = 1}^{k-1} \text{uses}(x_{\mu j})(x_{ij})) \\
  & = \begin{cases} 
    1 & \text{if } \forall \mu \in \{1, \ldots, k\} \setminus \{i\}. \text{uses}(x_{\mu j})(x_{ij}) = 0 \\
    \top & \text{if } \exists \mu \in \{1, \ldots, k\} \setminus \{i\}. \text{uses}(x_{\mu j})(x_{ij}) \equiv 1
  \end{cases}
\end{align*}
\]

holds, where the first case, i.e. \( l_i = 1 \), implies

\[ \bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{ij})(x_{\mu j}) = \bigoplus_{\mu = 1}^{i-1} \text{uses}(x_{\mu j})(x_{ij}) = \bigoplus_{\mu = 1}^{i-1} 0 = 0, \]

hence \( \tilde{l} = l_i \ominus 0 = l_i \uplus 1 \), while the second case, i.e. \( l_i = \top \), implies \( \tilde{l} = l_i \ominus \ldots = \top \uplus 1 \) directly.
6.4. Soundness of usage analysis

If \( x_{ij} \notin \text{defs}(I) \), then

\[
\begin{align*}
\text{l}_i & \equiv \text{In}(I)(n_j)(x_{ij}) = (\oplus_{\mu=1}^{k} \text{uses}(x_{\mu j})(x_{ij})) \oplus \text{Out}(I)(n_j)(x_{ij}) \\
& = 1 \oplus (\oplus_{\mu=1}^{k} \text{uses}(x_{\mu j})(x_{ij})) \oplus \text{Out}(I)(n_j)(x_{ij}) \\
& = \left\{ \begin{array}{ll}
1 & \text{if } \text{Out}(I)(n_j)(x_{ij}) \subseteq 0 \text{ and } \\
\forall \mu \in \{1, \ldots, k\} \setminus \{i\}, \text{uses}(x_{\mu j})(x_{ij}) = 0 \\
\top & \text{otherwise}
\end{array} \right.
\end{align*}
\]

holds, so \( \tilde{l} \equiv 1 \) follows in a similar way as in the first case.

As both cases yield \( \tilde{l} \equiv 1 \), the precondition for \( \vdash \) is satisfied, and for

\[
\Sigma_{i+1} = \Sigma_{i}[x_{ij} \mapsto (\tilde{l} \ominus 1, a_i)]
\]

all four items required of \( \Sigma_{i+1} \) are fulfilled:

- \( \Sigma_{i}, x_{ij} \downarrow \Sigma_{i+1}, a_i \) is fulfilled by the definition of \( \downarrow \)
- for \( 1 \leq v \leq k \), \( \Sigma_{i+1}(x_{vj}) = (l_v \ominus (\oplus_{\mu=1}^{v-1} \text{uses}(x_{vj})(x_{\mu j})), a_v) \) holds:
  - if \( x_{vj} \neq x_{ij} \), then \( \text{uses}(x_{vj})(x_{ij}) = 0 \) holds, so the second item of the induction hypothesis yields

  \[
  \Sigma_{i+1}(x_{vj}) = \Sigma_i(x_{vj}) = (l_v \ominus (\oplus_{\mu=1}^{v-1} \text{uses}(x_{vj})(x_{\mu j})), a_v)
  \]

  \[
  = (l_v \ominus (\oplus_{\mu=1}^{v-1} \text{uses}(x_{vj})(x_{\mu j})), a_v).
  \]

  if \( x_{vj} = x_{ij} \), then \( (l_v, a_v) = \Sigma(x_{vj}) = \Sigma(x_{ij}) = (l_i, a_i) \) and

  \[
  \tilde{l} \ominus 1 = (l_i \ominus (\oplus_{\mu=1}^{i-1} \text{uses}(x_{ij})(x_{\mu j})))) \ominus 1
  \]

  \[
  = l_i \ominus ((\oplus_{\mu=1}^{i-1} \text{uses}(x_{ij})(x_{\mu j}))) \oplus 1
  \]

  \[
  = l_i \ominus (\oplus_{\mu=1}^{i-1} \text{uses}(x_{ij})(x_{\mu j}))
  \]

  \[
  = l_v \ominus (\oplus_{\mu=1}^{v-1} \text{uses}(x_{vj})(x_{\mu j}))
  \]

follow, hence

\[
\Sigma_{i+1}(x_{vj}) = \Sigma_{i+1}(x_{ij}) = (\tilde{l} \ominus 1, a_i)
\]

\[
= (l_v \ominus (\oplus_{\mu=1}^{v-1} \text{uses}(x_{vj})(x_{\mu j})), a_v).
\]
the definition of $\Sigma_{i+1}$ implies $\text{dom } \Sigma_{i+1} = \{x_{ij}\} \cup \text{dom } \Sigma_i$, so the claim follows using the induction hypothesis $\text{dom } \Sigma_i = \text{dom } \Sigma$ and the property $x_{ij} \in \text{dom } \Sigma$ which was shown at the very beginning of the proof.

- for $y \in \text{dom } \Sigma_{i+1}$ and $y \not= x_{ij}$ for all $1 \leq v \leq k$, the induction hypothesis yields $\Sigma_i(y) = \Sigma(y)$, hence

$$\Sigma_{i+1}(y) = \Sigma_i[x_{ij} \mapsto (\overline{\ell} \odot 1, a_i)](y) = \Sigma_i(y) = \Sigma(y).$$

2. For $1 \leq i \leq k$, we now show that $\Sigma_{k+1} \downarrow x_i$ holds, i.e. that for $x_i \in \text{dom } \Sigma_{k+1}$ and $\Sigma_{k+1}(x_i) = (l, a), l \ni 0$ holds. There are two cases.

**Case $\exists v \leq k$ such that $x_i = x_{vj}$.** Then $x_i = x_{vj} \in \text{dom } \Sigma_{k+1} = \text{dom } \Sigma$ follows from part (1), and $x_{vj} = x_i \in \text{defs}(I)$ holds, hence

$$l_v \ni \text{In}(I)(n_j)(x_{vj})$$

$$= \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{\mu j})(x_{vj})$$

$$= 1 \bigoplus \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{\mu j})(x_{vj})$$

$$= \begin{cases} 1 & \text{if for all } \mu \in \{1, \ldots, k\} \backslash \{v\}, x_{\mu j} \neq x_{vj} \text{ holds} \\ \top & \text{otherwise} \end{cases}$$

and, by Lemma 11 (item (5))

$$l = l_v \bigoplus \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{vj})(x_{\mu j})$$

$$\ni \begin{cases} 1 \bigoplus \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{vj})(x_{\mu j}) & \text{if } x_{\mu j} \neq x_{vj} \text{ holds for all } \mu \in \{1, \ldots, k\} \backslash \{v\} \\ \top & \text{otherwise} \end{cases}$$

$$= \begin{cases} 1 \bigoplus \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{vj})(x_{\mu j}) & \text{if } x_{\mu j} \neq x_{vj} \text{ holds for all } \mu \in \{1, \ldots, k\} \backslash \{v\} \\ \top & \text{otherwise} \end{cases}$$

$$\ni 0$$

**Case $x_i \neq x_{vj}$ holds for all $v \leq k$.** For $x_i \in \text{dom } \Sigma_{k+1}$, part (1) (forth item) implies $\Sigma_{k+1}(x_i) = \Sigma(x_i)$, so the assumption $\text{In}(I)(n_j) \ni S$ together with $x_i \in \text{defs}(I)$ yields

$$l \ni \text{In}(I)(n_j)(x_i) = \text{uses}(I)(n_j)(x_i) = \bigoplus_{\mu \neq v}^{k} \text{uses}(x_{\mu j})(x_i) = 0$$
6.4. Soundness of usage analysis

where the last equality is valid as \( \text{uses}(x_{\mu j})(x_i) = 0 \) follows for each \( \mu \) from \( x_{\mu j} \neq x_i \).

3. Consequently, for the (unique!) \( T = (\Sigma', n, \tilde{I}) \) where

\[
\Sigma' = \Sigma_{k+1}[x_1 \mapsto (l_1', a_1), \ldots, x_k \mapsto (l_k', a_k)]
\]

and

\[
l_i' = \alpha(I, x_i)
\]

we obtain \( S \Rightarrow \alpha \ T \) by the rule PHI.

4. Finally, for proving \( \text{Out}(I)(h) \subseteq T \) for all \( h \in \mathbf{N}_p \), we show that \( y \in \text{dom } \Sigma' \) and \( \Sigma'(y) = (l, a) \) implies \( \text{Out}(I)(h)(y) \subseteq l \). By Proposition 35 (part (1)), it suffices to show \( \text{Out}(I)(n)(y) \subseteq l \), so let \( y \in \text{dom } \Sigma' \) and \( \Sigma'(y) = (l, a) \). There are two cases.

**Case** \( y = x_i \) for some \( 1 \leq i \leq k \). By the definition of SSA programs (Definition 27) there is at most one \( i \) with \( y = x_i \). The definition of \( \Sigma' \) implies \( l = l_i' \), so

\[
l = l_i' = \alpha(I, x_i) \sqsubseteq \alpha_{\text{Out}}(I, x_i) = \sqsubseteq_{g \in \mathbf{N}_p} \text{Out}(I)(g)(x_i) \sqsubseteq \text{Out}(I)(n)(x_i) = \text{Out}(I)(n)(y)
\]

**Case** \( y \neq x_i \) holds for all \( 1 \leq i \leq k \). In this case, the definition of \( \Sigma' \) implies \( y \in \text{dom } \Sigma_{k+1} \) and \( \Sigma_{k+1}(y) = \Sigma'(y) = (l, a) \). There are two sub-cases.

**If** \( y = x_{\nu j} \) for some \( 1 \leq \nu \leq k \), then part (1) (second item), \( y \notin \text{defs}(I) \) and the assumption \( \text{In}(I)(n_j) \sqsubseteq S \) guarantee

\[
l_{\nu} \sqsubseteq \text{In}(I)(n_j)(y) = (\bigoplus_{\mu=1}^{k} \text{uses}(x_{\nu j})(x_{\mu j})) \oplus \text{Out}(I)(n_j)(y) = 1 \oplus (\bigoplus_{\mu=1}^{k, \mu \neq \nu} \text{uses}(x_{\nu j})(x_{\mu j})) \oplus \text{Out}(I)(n_j)(y) = \begin{cases} 1 & \text{if } \text{Out}(I)(n_j)(y) \subseteq 0 \text{ and } x_{\mu j} \neq x_{\nu j} \\
 & \text{holds for all } \mu \in \{1, \ldots, k\} \setminus \{\nu\}
\end{cases}
\]

otherwise.
By Lemma 11 (item (5)) we obtain

\[ l = \bigvee \left( \bigoplus_{\mu=1}^{k} uses(y)(x_{\mu j}) \right) \]

\[ \bigvee \left\{ \begin{array}{ll}
1 \bigoplus (1 \bigoplus \bigoplus_{\mu=1}^{k} uses(y)(x_{\mu j})) & \text{if } Out(I)(n_j)(y) \subseteq 0 \\
\top \bigoplus \ldots & \text{and } x_{\mu j} \neq x_{v_j} \text{ holds for all } \mu \in \{1, \ldots, k\} \setminus \{v\} \\
0 & \text{otherwise}
\end{array} \right. \]

\[ = \bigvee \left\{ \begin{array}{ll}
0 & \text{if } Out(I)(n_j)(y) \subseteq 0 \text{ and } x_{\mu j} \neq x_{v_j} \text{ holds for all } \\
\mu \in \{1, \ldots, k\} \setminus \{v\} & \text{otherwise}
\end{array} \right. \]

\[ \bigvee Out(I)(n_j)(y). \]

If \( y \neq x_{v_j} \) holds for all \( 1 \leq v \leq k \), then part (1) (forth item) implies \( (l, a) = \Sigma_{k+1}(y) = \Sigma(y) \). From \( y \neq x_i \) for all \( 1 \leq i \leq k \), we obtain \( y \notin \text{defs}(I) \), hence

\[ l \geq In(I)(n_j)(y) = \bigoplus_{\mu=1}^{k} uses(x_{\mu j})(y) \bigoplus Out(I)(n_j)(y) \]

\[ = \bigoplus_{\mu=1}^{k} 0 \bigoplus Out(I)(n_j)(y) \]

\[ = Out(I)(n_j)(y). \]

Both sub-cases yield \( l \geq Out(I)(n)(y) \) as required.

\[ \square \]

**Theorem 7.** *(Soundness of dataflow equations for SSA programs)* Let \( \alpha \supseteq \alpha_{Our} \). If \( S_0 \quad \rightarrow \text{- terminates in a configuration } \top \) then \( S_0 \quad \rightarrow_{\alpha} \text{- terminates in a configuration } S \) with \( \hat{S} = \top \) and \( \lambda x.0 \subseteq S \).

**Proof.** Similar to the proof of Theorem 6, using Propositions 35 and 36 instead of Propositions 33 and 34.

\[ \square \]

**6.4.4 Discussion**

Our soundness proof does not directly follow the approach taken in [NNH99]. Instead of incorporating the dataflow solutions in the operational semantics by factoring the configuration space with respect to an equivalence relation, we incorporate them
directly using annotated configurations. We expect that our approach also applies to liveness analysis, if the codomain of states is extended to pairs \((\mathcal{B} \times \text{Val})\). The non-standard semantics was canonically constructed from the lattice used in the dataflow analysis using \(\otimes\). This allowed us to relate the fixed points of the dataflow equations to the operational behaviour based on the progress lemma. Future work is needed for comparing the two approaches in more depth and for evaluating how they generalise to other dataflow problems.
Chapter 7

Translating intermediate programs into ALEF

This chapter presents a translation from IL into ALEF based on solutions to the dataflow equations. Variables which have been identified as being linear with respect to all points of definition are candidates for forwarding. They can be passed through operand queues provided that all consuming instructions execute on the same functional unit. Remaining variables are communicated through registers. The translation is introduced in several stages which are shown to be functionally correct and faithful to the dataflow solution. The resulting ALEF program is well-typed in the sense of Chapter 5.

The first stage (Section 7.1) decides whether a variable can be forwarded based on the dataflow solution and the above restrictions. For programs in SSA form, the second stage (Section 7.2) eliminates $\Phi$-blocks and inserts compensating instruction sequences. Although the resulting program does not fulfil the SSA requirements any longer, it is functionally equivalent to the original program, and forwardability of variables is preserved. The translation into ALEF is defined in Section 7.3, and is parametric in an allocation function $\rho$ which assigns operand queue names to forwardable variables and register names to non-forwardable variables. In Section 7.4, we show that the resulting program is type-correct with respect to the calculus in Section 5.1 for any $\rho$ that satisfies some mild compatibility conditions regarding the dataflow solutions. Section 7.5 describes how an allocation of queues can be obtained based on
colouring the conflict graphs for variables. Functional correctness of any allocation $\rho$ which maps adjacent nodes to different queues or registers is proven in Section 7.6.

In order to achieve the outlined results, the translation itself is kept rather basic, and Section 7.7 consequently discusses a number of improvements and extensions.

### 7.1 Deciding forwardability

Based on a solution $\text{In/Out}$ to the dataflow equations and the intended translation scheme, the first step classifies each variable as being forwardable or not. Before giving the formal definition of forwardability, we briefly motivate its core conditions.

In Section 6.3.1, we observed that the information whether a variable is useless can be read off a solution to the dataflow equations for liveness by inspecting $\text{liveOut}(I)(x)$ where $I$ is an assignment to $x$. Similarly, we now observe that forwardability information for $x$ is captured in a solution $\text{In/Out}$ to equations (6.13), in component $\text{Out}(I)(x)$ where $x \in \text{defs}(I)$. If $\text{Out}(I)(x) = \top$ holds then $x$ is not forwardable since the dataflow analysis failed to prove linear usage. Variables with $\text{Out}(I)(x) = 0$ cannot be forwarded either, because this appearance of $x$ was proven to be useless (other occurrences of $x$ might still be used, so we cannot remove $x$ completely). A linearly used appearance of $x$ consist of an instruction $I$ with $\text{Out}(I)(x) = 1$, and is forwardable if all its uses are in instructions executing on the same functional unit. In order to map variable $x$ to an operand queue, two conditions must be fulfilled:

- all instructions $I$ with $x \in \text{defs}(I)$ must satisfy $\text{Out}(I)(x) = 1$
- all uses of $x$ are in instructions executing on the same functional unit

The first condition ensures that in a regular program, variables with multiple sites of definition are translated correctly. For example, variable $x$ in program (7.1)

\begin{align*}
N(1) &= [1]v = \ldots \\
       &\quad [2] \text{if } v \text{ 3 6}
N(3) &= [3]x = 3
\quad [4]y = 5
\quad [5] \text{jmp 9}
N(6) &= [6]x = 9
\quad [7]y = x + 1
\quad [8] \text{jmp 9}
N(9) &= [9]z = x \ast y
\quad [10] \ldots
\end{align*}

(7.1)
must be communicated through a register. Its definition in instruction [6] is used non-linearly (indeed, any solution to its dataflow equations fulfils $Out([6])(x) = \top$), and the consuming instruction [9] must hence be translated to $\mathtt{mul \ r op_y \ op_z}$ for some register $r$. Consequently, the assignment to $x$ in instruction [3] must also be translated to a register usage despite the linear use of the assigned value.

The second condition arises from the binding of operand queue names to functional units in ALEF. For example, the condition requires variables $x$ and $y$ in program (7.2)

$$
\begin{align*}
N(1) &= [1]v = \ldots \\
N(5) &= [5]z = x + y \\
N(6) &= [6]z = x \times y \\
[2]x &= 5 \\
[3]y &= 7 \\
[4] \text{if } v \leq 6
\end{align*}
$$

(7.2)

to be communicated through registers, despite their linear usage. For suppose $x$ or $y$ are mapped to operand queues $q_1$ and $q_2$, respectively, then the typing rules for ALEF yield $\gamma(q_1) = ALU \neq MUL = \gamma(q_1)$ and $\gamma(q_2) = ALU \neq MUL = \gamma(q_2)$ and the resulting program is ill-typed and fails to execute.

In contrast, uses of $x$ in IL-instructions of the form $y = x$ or in $\Phi$-blocks do not constrain forwardability as these instructions will be translated to instructions $\text{id}^{fu} \text{op}_x \text{op}_y$ which exist for each functional unit $fu$.

For regular programs, we require that all instructions $J$ with $x \in \text{uses}(J)$ be mapped to the same functional unit. This is slightly conservative as disjoint sites of definition occasionally refer to different sets of using instructions. As explicitness of these structural dependencies are the main reason for using SSA, our requirement represents a trade-off for not using SSA. Indeed, SSA programs do not contain conflicts of the kind shown in program (7.1) as only one defining instruction exists for each variable.

In the following definition, we use the notation $Out(I)(x)$ for both regular and SSA programs, justified by part (1) of Proposition 35.

**Definition 42.** Let $P$ be a regular or SSA program and $\text{In}/\text{Out}$ a solution for $P$. Variable $x \in \text{Var}_P$ is called forwardable to $fu$ if

- $Out(I)(x) \notin \{\bot, \top\}$ holds for all $I \in \text{Instr}_P$

- $Out(I)(x) = 1$ holds for all $I \in \text{Instr}_P$ with $x \in \text{defs}(I)$
• for all $I \in \text{Instr}_p$ with $x \in \text{uses}(I)$, one of the following conditions holds

- $I \in \Phi$-Block

- $I$ has the form $[n]y = x$ for some $y$

- $I$ has the form $[n]y = z + e$ or $[n]y = z - e$ and $fu = ALU$

- $I$ has the form $[n]y = z \times e$ and $fu = MUL$

- $I$ has the form $[n]\text{if } x m l$ and $fu = BU$.

While the last two conditions are motivated by the above discussion, the first one is a technical condition whose significance will become apparent in the later development.

**Example.** In the SSA program (6.20), the solution discussed in Section 6.3.3 characterises variables $z$ and $y$ as being forwardable to ALU while $v$ is forwardable to BU. Variables $z_1$, $z_2$ and $y_1$ are forwardable to any $fu$, while $y_2$ and $w$ are not forwardable.

\[\checkmark\]

### 7.2 Eliminating $\Phi$-instructions

For SSA programs, the second step consists of eliminating $\Phi$-instructions. For this step, we assume that input programs are in standard edge-split form as described in Section 6.2.2.2. In particular, $\Phi$-blocks must not occur as successors of conditional jumps and the order of $\Phi$-instructions in each $\Phi$-block must be such that a variable assigned to in the $i$-th row does not occur on the right-hand side in rows $l > i$.

We eliminate each $\Phi$-block by inserting compensating assignments into its control flow predecessors, preceding the jump instructions. For example, eliminating instruction $[9]$ from program (6.20) requires the insertion of $y = y_1; z = z_1$ between instructions $[4]$ and $[5]$ and of $y = y_2; z = z_2$ between instructions $[7]$ and $[8]$.

Formally, for $P = (N, A, \text{succ}, \text{entry})$ and $\text{fst}(N(l)) = \phi$, we define a program $P_{\phi}$ which results from $P$ by eliminating the instruction $\phi$ as follows.
7.2. Eliminating $\Phi$-instructions

- Any instruction $[n] \text{jmp } l$ is prefixed with the sequence
  \[ \tilde{I}_n = x_1 = x_{1j} \]
  \[ \vdots \]
  \[ x_k = x_{kj} \]
  where $\phi = [l] x(k) = M(k,m)n(m)$ and $n = n_j$.

- $\phi$ is deleted from $N(l)$.

- $[n] \text{jmp } l$ is replaced by $[n] \text{jmp } l'$ where $l'$ is the label of the (unique) $J \in \text{succ}(\phi)$.

- $N$, $A$ and $\text{succ}$ are updated to reflect the changes.

**Example.** For program (6.20) we indeed obtain

\[ I_5 = y = y_1 \quad \text{and} \quad I_6 = y = y_2 \]

\[ z = z_1 \quad \text{and} \quad z = z_2 \]

so the program $P_{[9]}$ resulting from eliminating instruction $[9]$ is

\[ N(1) = [1]v = 7 \quad N(6) = [6]y_2 = 7 \]

\[ [2]\text{if } v \geq 6 \quad [7]z_2 = y_2 * 2 \]

\[ N(3) = [3]y_1 = 3 \quad [103]y = y_2 \]

\[ [4]z_1 = 2 \quad [104]z = z_2 \quad (7.3) \]

\[ [101]y = y_1 \quad [8]\text{jmp } 10 \]

\[ [102]z = z_1 \]

\[ [5]\text{jmp } 10 \quad N(10) = [10]w = z + y \]

\[ \diamond \]

Repeated elimination of $\phi$-blocks transforms an edge-split SSA program into a regular program. The edge-split condition guarantees that at most one compensation sequence is inserted into each block, hence the $\phi$-blocks may be eliminated in an arbitrary order. Forwardability of variables is preserved as solutions to the dataflow-equations after an elimination step coincide (modulo the relabelling of instructions) with those of the
original program. Each use of a variable $x_{ij}$ corresponds to a use in a new instruction $x_i = x_{ij}$ in a $\bar{I}_n$ with $n = n_j$. Furthermore, all compensating instruction sequences contain assignments to all $x_i$ occuring in the LHS of a $\phi$-block. Consequently, each variable $y$ with $y \in \text{defs}(\phi)$ fulfils $y = x_i$ for some $1 \leq i \leq k$, and by Definition 27 this $i$ is unique. Also for all $i > j$, $x_{ij} \neq y$ holds for all $j$. Consequently, for $n = n_j$ and $\bar{I}_n = I_1 \ldots I_k$, instruction $I_i$ has the form $y = x_i$, and $\text{fwdOut}(I_{x_{ij}})(y)$ agrees with $\text{fwdOut}(\phi)(y)$. Likewise $\text{fwdIn}(I_1)(y) = \text{fwdIn}(\phi)(n)(y)$, since $I_1 \ldots I_{n-1}$ contains exactly the assignments corresponding to position $n = n_j$.

\[
\oplus_{i=1}^{I_{n-1}} \text{uses}(I_i)(n)(z) = \oplus_{i=1}^{I_{n-1}} \text{uses}(x_{ij})(z) = \text{uses}(\Phi')(n)(z)
\]

where $\Phi'$ contains the first $i$ lines of $\phi$, and $y \notin \text{defs}(I_{l})$ for $l < i$.

Similarly, for variables $y$ not occurring in $\text{defs}(\phi)$, $\text{fwdOut}(\phi)(n)(y) = \text{fwdOut}(I_k)(y)$ and $\text{fwdIn}(\phi)(n)(y) = \text{fwdIn}(I_1)(y)$ hold. Since $\bar{I}_n$ and $\phi$ have the same successors (the immediate successor of $\phi$) and predecessors (the body of $n$’s block), the dataflow equations agree. Consequently, each solution to the equations for the program containing $\phi$ corresponds to a solution for the program containing the $\bar{I}_n$ but not $\phi$.

**Example.** The dataflow equations for the above (regular) program $P_9$ are

\[
\begin{align*}
\text{fwdOut}([1])(v) & = 1 \\
\text{fwdOut}([3])(y) & = \text{fwdOut}([4])(z) = 1 \\
\text{fwdOut}([101])(y) & = \text{fwdOut}([102])(z) = 1 \\
\text{fwdOut}([6])(y) & = \top \\
\text{fwdOut}([7])(z) & = 1 \\
\text{fwdOut}([103])(y) & = \text{fwdOut}([104])(z) = 1 \\
\text{fwdOut}([10])(u) & = 0
\end{align*}
\]

(7.4)

The least fixed point corresponding to these equations yields the same forwardability as the one for the original SSA program: $y$ and $z$ are forwardable to ALU because $\text{fwdOut}$ assigns the value 1 to their defining instructions $[101]/[103]$ and $[102]/[104]$. Variable $v$ is forwardable to BU, and variables $z1$, $z2$ and $y1$ are forwardable to any fu. Finally, $y2$ and $w$ are not forwardable.

Furthermore, eliminating a $\Phi$-block respects the functional semantics. The condition
on the shape of the matrix $M$ ensures that the sequential execution of the compensating instruction sequence leads to the same result as the concurrent assignment during the execution of the original $\phi$-block. In particular, the side condition $x_i \neq x_{lj}$ for all $i$ and all $l > i$ in Definition 31 ensures that the correct values are copied into the $x_i$ variables, as the definition of $\overline{I}_n$ respects the order of single $\phi$-instructions in a $\Phi$-block. This is formally proven in Proposition 37 using the following technical lemma.

**Lemma 13.** If $(\Sigma, m_1, a\tilde{s} \overline{I}) \Rightarrow^* (\Sigma', m_1, \overline{I})$ then $(\Sigma, m_2, a\tilde{s} \overline{I}) \Rightarrow^* (\Sigma', m_2, \overline{I})$ for any $m_2 \in N_\Phi$ and $\overline{I}$.

**Proof.** Induction on the length $n$ of $a\tilde{s} = ass_1 \ldots ass_n$.

For $n = 1$, the rule ASS guarantees that $ass_1$ is of the form $[k]x = e$, and that $\Sigma, e \nleftarrow \Sigma_1, a$ holds for some $\Sigma_1$ and $a$ such that $\Sigma_1 \downarrow x$ and $\Sigma' = \Sigma_1[x \mapsto \alpha(ass_1, x), a]$. Consequently, the derivation

$$
\Sigma, e \nleftarrow \Sigma_1, a
$$

is possible.

For $n > 1$, $a\tilde{s} = ass_1 \ldots ass_n$, $(\Sigma, m_1, a\tilde{s} \overline{I}) \Rightarrow^* (\Sigma', m_1, \overline{I})$ yields

$$(\Sigma, m_1, ass_1 ass_2 \ldots ass_n \overline{I}) \Rightarrow^* (\Sigma_1, m_1, ass_2 \ldots ass_n \overline{I})$$

for some $\Sigma_1$ with $ass_1 = [k]x = e$, $\Sigma, e \nleftarrow \Sigma, a, \Sigma \downarrow x$, $\Sigma_1 = \Sigma[x \mapsto \alpha(ass_1, x), a]$ and

$$(\Sigma_1, m_1, ass_2 \ldots ass_n \overline{I}) \Rightarrow^* (\Sigma', m_1, \overline{I}).$$

These properties of $\Sigma$, $x$ and $a$ allow us to derive

$$
\Sigma, e \nleftarrow \Sigma, a
$$

and applying the induction hypothesis yields $(\Sigma_1, m_2, ass_2 \ldots ass_n \overline{I}) \Rightarrow^* (\Sigma', m_2, \overline{I})$ so $(\Sigma, m_2, a\tilde{s} \overline{I}) \Rightarrow^* (\Sigma', m_2, \overline{I})$ can be derived.

The proposition shows that prefixing the compensation code to the jump has the same effect as executing the jump followed by the $\phi$-block.
Proposition 37. Let $S = (\Sigma, l, [n] \text{ jmp } h)$ and $\Pi(h) = \phi \bar{I}$ where $\phi$ has the form $[h] x(k) = M(k, m) n(m)$. For $n = n_j$ let $\bar{K}$ denote the code sequence $\bar{I}_n$, and let

$$S \Rightarrow \alpha (\Sigma, n, \phi \bar{I}) \Rightarrow \alpha (\Sigma', n, \bar{I})$$

be an execution for $\mathcal{P}$. Then

$$(\Sigma, l, \bar{K} [n] \text{ jmp } h) \Rightarrow^* \alpha (\Sigma', n, \bar{I})$$

is an execution for $\mathcal{P}_{\phi}$.

Proof. We will show that

$$(\Sigma, l, \bar{K} \bar{I}) \Rightarrow^* \alpha (\Sigma', l, \bar{I})$$

holds. By Lemma 13, this will imply

$$(\Sigma, l, \bar{K} \bar{I}) \Rightarrow^* \alpha (\Sigma', l, \bar{I})$$

for any $\bar{I}$. By choosing $\bar{I} = [n] \text{ jmp } h$ we will thus obtain

$$(\Sigma, l, \bar{K} [n] \text{ jmp } h) \Rightarrow^* \alpha (\Sigma', [n] \text{ jmp } h),$$

from which the claim follows due to $\Pi(h) = \bar{I}$ and

$$(\Sigma', l, [n] \text{ jmp } h) \Rightarrow \alpha (\Sigma', n, \Pi(h)).$$

In order to show $(\Sigma, l, \bar{K} \bar{I}) \Rightarrow^* (\Sigma', l, \bar{I})$ observe that by the definition of rule PHI there are $\Sigma = \Sigma_1, \ldots, \Sigma_{k+1}$ such that for all $1 \leq i \leq k$,

$$\Sigma_i, x_{ij} \downarrow \Sigma_{i+1}, a_i \text{ and } \Sigma_{k+1} \downarrow x_i$$

and

$$\Sigma' = \Sigma_{k+1}[x_1 \mapsto (l_1, a_1), \ldots, x_k \mapsto (l_k, a_k)]$$

holds, where $l_i = \alpha(\phi, x_i)$. Each $\Sigma_i, x_{ij} \downarrow \Sigma_{i+1}, a_i$ implies $\Sigma_i(x_{ij}) = (\bar{l}_i, a_i)$ for some $\bar{l}_i \text{ jmp } h$ with $\Sigma_{i+1} = \Sigma_i[x_{ij} \mapsto (\bar{l}_i \text{ jmp } a_i)]$.

This sequence of read operations for $\Sigma_i, x_{ij} \downarrow \Sigma_{i+1}, a_i$ is depicted horizontally at the top in Figure 7.1. The subsequent write operations during the assignment $\Sigma' = \Sigma_{k+1}[x_1 \mapsto (l_1, a_1), \ldots, x_k \mapsto (l_k, a_k)$ are shown vertically on the right hand side. The remainder of
this proof shows that executing each line of the $\Phi$-block separately is possible due to the conditions in rules PHI and ASS, and that executing the compensating assignments in the same order as the entries in $\phi$ leads to the same final configuration $\Sigma'$. This is achieved by defining intermediate configurations $\Delta_i$ and $\Pi_i$ where $\Pi_i$ represents the configuration after executing assignments $1, \ldots, i-1$ and $\Delta_{i+1}$ denotes the configuration after executing assignments $1, \ldots, i-1$ and the reading operation of assignment $i$.

In particular, we show that the lattice annotations obtained for the sequence of assignments coincide with the lattice annotations for $\phi$ in rule PHI, and that the same holds for the values $a_i$ and $b_i$ read during the two executions. This property rests on the restriction on the appearance of variables in $\phi$ for SSA programs in standard edge-split form (Definition 31): as a variable $y$ occurring as an $x_i$ on the left may not occur in a subsequent line on the right, the sequential (non-atomic) execution of the assignments will never overwrite a variable which is needed later.

Formally, for $1 \leq \nu \leq k+1$ we define states $\Pi_\nu$ by (see Figure 7.1)

\[
\Pi_1 = \Sigma_1 \\
\Pi_\nu = \Delta_\nu[x_{\nu-1} \mapsto (l_{\nu-1}, b_{\nu-1})] \text{ for } \nu > 1
\]
where $\Delta_v$ and $b_{v-1}$ follow from $\Pi_{v-1}(x_{v-1}) = (l'_{v-1}, b_{v-1})$ with $l'_{v-1} \sqsupseteq 1$ and

$$
\Delta_v = \Pi_{v-1}[x_{v-1} \mapsto (l'_{v-1} \odot 1, b_{v-1})].
$$

By definition, $(l'_1, b_1) = \Pi_1(x_1) = \Sigma_1(x_1) = (\tilde{l}_1, a_1)$ holds, hence $l'_1 = \tilde{l}_1 \sqsupseteq 1$, $b_1 = a_1$ and $\Delta_2 = \Pi_1[x_1 \mapsto (l'_1 \odot 1, b_1)] = \Sigma_1[x_1 \mapsto (\tilde{l}_1 \odot 1, a_1)] = \Sigma_2$. By induction on $v \geq 2$ we now show $l'_{v} = \tilde{l}_v \sqsupseteq 1$, $b_v = a_v$ and

$$
\Delta_{v+1} = \Sigma_{v+1}[x_1 \mapsto (l_1, a_1)] \ldots [x_{v-1} \mapsto (l_{v-1}, a_{v-1})].
$$

(7.5)

For $v = 2$, Definition 31 yields $x_{2j} \notin x_1$ and thus

$$(l'_2, b_2) = \Pi_2(x_{2j}) = \Delta_2[x_1 \mapsto (l_1, b_1)](x_{2j}) = \Delta_2(x_{2j}) = \Sigma_2(x_{2j}) = (\tilde{l}_2, a_2).$$

Consequently, $l'_2 = \tilde{l}_2 \sqsupseteq 1$, $b_2 = a_2$ and

$$
\Delta_3 = \Pi_2(x_{2j} \mapsto (l'_2 \odot 1, b_2])
= \Delta_2[x_1 \mapsto (l_1, b_1)](x_{2j} \mapsto (l'_2 \odot 1, b_2])
= \Delta_2[x_{2j} \mapsto (l'_2 \odot 1, b_2)][x_1 \mapsto (l_1, b_1)]
= \Sigma_2[x_{2j} \mapsto (l'_2 \odot 1, b_2)][x_1 \mapsto (l_1, b_1)]
= \Sigma_2[x_{2j} \mapsto (\tilde{l}_2 \odot 1, a_2)][x_1 \mapsto (l_1, a_1)]
= \Sigma_3[x_1 \mapsto (l_1, a_1)].
$$

For $v > 2$ we obtain $l'_{v-1} = \tilde{l}_{v-1} \sqsupseteq 1$ and $w_{v-1} = a_{v-1}$ by induction hypothesis. Applying $x_i \neq x_{ij}$ for $l > i$ from Definition 31 (and the induction hypothesis) yields

$$(l'_v, b_v) = \Pi_v(x_{vj})
= \Delta_v[x_{v-1} \mapsto (l_{v-1}, b_{v-1})](x_{vj})
= \Delta_v(x_{vj}) = \Sigma_v[x_1 \mapsto (l_1, a_1)] \ldots [x_{v-2} \mapsto (l_{v-2}, a_{v-2})](x_{vj})
= \Sigma_v(x_{vj}) = (\tilde{l}_v, a_v)$$
and thus \( l'_v \cap l \equiv 1 \) and \( b_v = a_v \). Consequently,

\[
\Delta_{i+1} = \Pi_v[x_v \mapsto (l'_v \cap 1, b_v)]
\]

\[
= \Delta_v[x_{v-1} \mapsto (l_{v-1}, b_{v-1})][x_v \mapsto (l'_v \cap 1, b_v)]
\]

\[
= \Delta_v[x_{v-1} \mapsto (l_{v-1}, a_{v-1})][x_v \mapsto (l'_v \cap 1, a_v)]
\]

\[
= \Sigma_v[x_1 \mapsto (l_1, a_1)] \cdots [x_{v-2} \mapsto (l_{v-2}, a_{v-2})]
\]

\[
[x_{v-1} \mapsto (l_{v-1}, a_{v-1})][x_v \mapsto (l'_v \cap 1, a_v)]
\]

\[
= \Sigma_v[x_{v,j} \mapsto (l'_v \cap 1, a_i)] x_1 \mapsto (l_1, a_1) \cdots [x_{v-1} \mapsto (l_{v-1}, a_{v-1})]
\]

\[
= \Sigma_{i+1}[x_1 \mapsto (l_1, a_1)] \cdots [x_{v-1} \mapsto (l_{v-1}, a_{v-1})]
\]

completing the induction.

By definition, \( \tilde{K} = l_1 \ldots l_k \) where each assignment \( I_i \) has the form \( x_i = x_{i,j} \). By rule ASS we obtain \((\Pi_i, l_i, I_i, \ldots I_k) \Rightarrow a (\Pi_{i+1}, l, I_{i+1}, \ldots I_k)\) for each \( i \), due to

\[
\Pi_i, x_{i,j} \downarrow \Delta_{i+1}, a; \: \Pi_i(x_i) = (l'_i, b_i) = (l_i, a_i) \text{ with } l_i \equiv 1 \text{ and } \Delta_{i+1} = \Pi_i[x_i \mapsto (l'_i \cap 1, b_i)] = \Pi_i[x_{i,j} \mapsto (l_i \cap 1, a_i)]
\]

\( \Delta_{i+1} \downarrow x_i \): Equation (7.5) implies \( \Delta_{i+1} = \Sigma_{i+1}[x_1 \mapsto (l_1, a_1)] \cdots [x_{i-1} \mapsto (l_{i-1}, a_{i-1})] \), and Definition 27 implies \( x_i \neq x_l \) for \( l < i \), hence for \( x_i \in dom \Delta_{i+1} \) we obtain

\[\Delta_{i+1}(x_i) = \Sigma_{i+1}[x_1 \mapsto (l_1, a_1)] \cdots [x_{i-1} \mapsto (l_{i-1}, a_{i-1})](x_i) = \Sigma_{i+1}(x_i).\]

On the other hand, the definition of the \( \Sigma \) states yields

\[\Sigma_{k+1} = \Sigma_{i+1}[x_{i+1,j} \mapsto \ldots] \cdots [x_{k,j} \mapsto \ldots]\]

so using the property \( x_{i,j} \neq x_i \) for all \( l > i \) from Definition 31 we obtain

\[\Sigma_{i+1}(x_i) = \ldots = \Sigma_{k+1}(x_i)\]

and thus

\[\Delta_{i+1}(x_i) = \Sigma_{k+1}(x_i).\]

Consequently, \( \Delta_{i+1} \downarrow x_i \) holds since \( \Sigma_{k+1} \downarrow x_i \) holds in rule PHI.

\[\Pi_{i+1} = \Delta_{i+1}[x_i \mapsto (\alpha(I_i, x_i), a_i)] \text{ by definition, } \Pi_{i+1} = \Delta_{i+1}[x_i \mapsto (l_i, b_i)] \text{ holds and we showed } b_i = a_i. \text{ As the dataflow solutions for } P \text{ and } P' \text{ agree, } l_i = \alpha(\theta, x_i) = \alpha(I_i, x_i) \text{ holds and the claim follows.}\]
Consequently, \((\Pi_1, l, I_1 \ldots I_k ) \Rightarrow^* (\Pi_{k+1}, l, I_k )\) holds, and the claim follows from \(\Pi_1 = \Sigma_1 = \Sigma, K = I_1 \ldots I_k\) and

\[
\begin{align*}
\Pi_{k+1} & = \Delta_{k+1}[x_k \mapsto (l_k, a_k)] \\
& = \Sigma_{k+1}[x_1 \mapsto (l_1, a_1)] \ldots [x_{k-1} \mapsto (l_{k-1}, a_{k-1})][x_k \mapsto (l_k, a_k)] \\
& = \Sigma'.
\end{align*}
\]

From now on we assume that all \(\Phi\)-blocks of a program have been eliminated and that \(P\) is a regular program.

### 7.3 Translation

The translation \([\cdot]\) of a regular IL program \(P\) into ALEF is defined by induction on the structure of \(P\). For each instruction form, a corresponding ALEF instruction is defined, and the translation of \(P = (N, A, \text{succ, entry})\) results from translating each basic block \(N(m)\). The translation is parametric in the allocation of operand queues and registers to variables, which is represented by an allocation function \(\rho : \text{Var}_P \rightarrow \text{OP}\). Type-correctness of the translation is independent of the exact allocation chosen as long as the following property is satisfied.

**Definition 43.** An allocation function \(\rho : \text{Var}_P \rightarrow \text{OP}\) is called compatible with the dataflow solution \(\text{In}/\text{Out}\) if for all operand queues \(q\),

\[\rho(x) = q\] implies \(x\) forwardable to \(\gamma(q)\).

The translation \([\cdot]_\rho\) is defined by Table 7.1 where an extended ALEF instruction set is used which includes instructions with immediate operands. The choice to map an instruction \([n]x = y\) with \(\rho(y) = r\) to \(\text{id}^{\text{ALF}}\) instead of any other \(\text{id}^{fu}\) is arbitrary, and one could in fact choose different functional units each time such an instruction is translated.

The typing information for the new ALEF instructions is given in Table 7.2.
7.3. Translation

| $[n]x = a$ | $[n]ldc \ a (x)$ |
| $[n]x = y$ | $\begin{cases} [n]ldc(q) \ p(x) & \text{if } p(y) = q \\ [n]ldc(r) \ p(x) & \text{if } p(y) = r \end{cases}$ |
| $[n]x = y - a$ | $[n]sub \ a \ p(y) \ p(x)$ |
| $[n]x = y + a$ | $[n]add \ a \ p(y) \ p(x)$ |
| $[n]x = y \* a$ | $[n]mul \ a \ p(y) \ p(x)$ |
| $[n]x = y - z$ | $[n]sub \ p(y) \ p(z) \ p(x)$ |
| $[n]x = y + z$ | $[n]add \ p(y) \ p(z) \ p(x)$ |
| $[n]x = y \* z$ | $[n]mul \ p(y) \ p(z) \ p(x)$ |
| $[n]if \ ml \ x$ | $[n]if \ p(x) \ ml$ |
| $[n]jmp \ m$ | $[n]jmp \ m$ |
| $B$ | $[[I_1]_p \ldots [[I_m]_p$ if $B = I_1 \ldots I_m$ |
| $P$ | $(\lambda m. [[N(m)]_p, A]$ |

Table 7.1: Translation $[[P]]_p$ for IL program $P = (N, A, succ, entry)$

<table>
<thead>
<tr>
<th>code</th>
<th>$FU(code)$</th>
<th>$A_{code}$</th>
<th>$B_{code}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>add $a \ op_1 \ op_2$</td>
<td>ALU</td>
<td>$op_1$</td>
<td>$op_2$</td>
</tr>
<tr>
<td>sub $op_1 \ op_2 \ op_3$</td>
<td>ALU</td>
<td>$op_1 \otimes op_2$</td>
<td>$op_3$</td>
</tr>
<tr>
<td>subi $a \ op_1 \ op_2$</td>
<td>ALU</td>
<td>$op_1$</td>
<td>$op_2$</td>
</tr>
<tr>
<td>muli $a \ op_1 \ op_2$</td>
<td>MUL</td>
<td>$op_1$</td>
<td>$op_2$</td>
</tr>
</tbody>
</table>

Table 7.2: Type system for extended instruction set
Example. We continue the example (7.3) from the previous section (program $P_{[9]}$ and equations (7.4)). Take $\gamma(q_1) = \gamma(q_3) = ALU$, $\gamma(q_2) = MEM$ and $\gamma(q_4) = BU$, and define $\rho$ by

$$
\rho(v) = q_4 \quad \rho(z1) = q_1 \\
\rho(y1) = q_1 \quad \rho(z2) = q_2 \\
\rho(y2) = r_2 \quad \rho(z) = q_3 \\
\rho(y) = q_1 \quad \rho(w) = r_1
$$

Then $\rho$ is compatible with the forwardability information resulting from the least solution to the dataflow equations. For example, $\rho(y) = q_1$ holds, and $y$ is indeed forwardable to $\gamma(q_1) = ALU$. On the other hand, $w$ is not forwardable to any $fu$ and is mapped to a register.

The resulting ALEF program $\llbracket P_{[9]} \rrbracket_\rho = (N,A)$ is

$$
\begin{align*}
N(1) = \llbracket N(1) \rrbracket_\rho &= [1]ldc 7 q_4 \\
N(3) = \llbracket N(3) \rrbracket_\rho &= [2]if q_4 3 6 \\
N(6) = \llbracket N(6) \rrbracket_\rho &= [6]ldc 7 r_2 \\
N(10) = \llbracket N(10) \rrbracket_\rho &= [10]add q_3 q_1 r_1
\end{align*}
$$

and $A = \{(1,3), (1,6), (3,10), (6,10)\}$.

Notice that Definition 43 does not force the compiler to actually map forwardable variables to operand queues. The compiler may thus optimise the scheduling of operand communication by exploring different allocations.

The following two properties will be used in the following section.

**Lemma 14.** If $\rho$ is an allocation function for $P$, $I \in \text{Instr}_P$ and $x \in \text{defs}(I)$ then $\rho(x)$ is a factor of $B_{\text{code}}$ where $\llbracket I \rrbracket_\rho = [n] \text{code for some } n$.

**Proof.** Inspection of Tables 7.1, 3.1 and 7.2.  

$\square$
Lemma 15. Let \( \rho \) be an allocation function for regular \( \mathbb{P} \) which is compatible with a solution \( \text{In/Out} \) of the dataflow equations for \( \mathbb{P} \) and let \( \rho(x) = q \). For \( I \in \text{Instrs}_\mathbb{P} \), \( \text{Out}(I)(x) = 1 \) holds whenever \( x \in \text{defs}(I) \) or \( \text{Out}(I)(x) \sqsupseteq 1 \).

Proof. For \( \rho(x) = q \), the definition of compatibility implies that \( x \) is forwardable to \( \gamma(q) \), so by Definition 42 \( \text{Out}(I)(x) \neq \top \) holds for all \( I \in \text{Instrs}_\mathbb{P} \), and \( \text{Out}(I)(x) = 1 \) holds for all \( I \) with \( x \in \text{defs}(I) \). Thus, both \( x \in \text{defs}(I) \) and \( \text{Out}(I)(x) \sqsupseteq 1 \) imply \( \text{Out}(I)(x) = 1 \). \( \square \)

7.4 Type-correctness

In order to show that the translation preserves the forwarding structure, we relate dataflow solutions for IL to \( \otimes \)-types in ALEF. In the following discussion, let \( \mathbb{P} = (\mathbb{N}, \mathbf{A}, \text{succ}, \text{entry}) \) be a fixed regular IL program and \( \text{In/Out} \) be a fixed (not necessarily least) solution to its dataflow equations.

Any allocation function \( \rho \) which is compatible with a dataflow solution for \( \mathbb{P} \) leads to \( \llbracket \mathbb{P} \rrbracket_\rho \) being well-typed in the typing calculus for sequential execution of ALEF programs in Section 5.1, where types arise from \( \text{In/Out} \) as follows.

Definition 44. For allocation function \( \rho \) and \( I \in \text{Instrs}_\mathbb{P} \), let

\[
\begin{align*}
\text{uses}_\rho(I)(x) &= \begin{cases} 
\rho(x) & \text{if } \text{uses}(I)(x) \sqsupseteq 1 \\
1 & \text{otherwise}
\end{cases} \\
\text{Out}_\rho(I)(x) &= \begin{cases} 
1 & \text{if } x \notin \text{defs}(I), \text{Out}(I)(x) \sqsubseteq 0 \text{ and } \rho(x) = q \\
\rho(x) & \text{otherwise}
\end{cases} \\
\text{In}_\rho(I)(x) &= \begin{cases} 
\text{uses}_\rho(I)(x) & \text{if } x \in \text{defs}(I) \\
\text{uses}_\rho(I)(x) \otimes \text{Out}_\rho(I)(x) & \text{otherwise}
\end{cases}
\end{align*}
\]

Then the types \( \text{uses}_\rho(I) \), \( \text{In}_\rho(I) \) and \( \text{Out}_\rho(I) \) are given by

\[
\begin{align*}
\text{uses}_\rho(I) &= \otimes_{x \in \text{Var}_\mathbb{P}} \text{uses}_\rho(I)(x) \\
\text{In}_\rho(I) &= \otimes_{x \in \text{Var}_\mathbb{P}} \text{In}_\rho(I)(x) \\
\text{Out}_\rho(I) &= \otimes_{x \in \text{Var}_\mathbb{P}} \text{Out}_\rho(I)(x)
\end{align*}
\]
**Example.** The IL program

\[ [1]v = 2 \ [2]w = v + 5 \ [3] \text{if } w 1 1 \]  

(7.6)

has a solution fulfilling

\[ \text{Out}([1])(v) = \text{Out}([2])(w) = 1 \text{ and } \text{In}([2])(v) = \text{In}([3])(w) = 1. \]

This solution is compatible with an allocation \( \rho(v) = q_1, \rho(w) = q_2 \) where \( \gamma(q_1) = \text{ALU} \) and \( \gamma(q_2) = \text{BU} \), and leads to

\[ [[1][2][3]]_\rho = [1]1 \text{dc } 2 \ q_1 \ [2]\text{add } 5 \ q_1 \ q_2 \ [3]\text{if } q_2 \ 1 1 : 1 \rightarrow 1 \]

and

\[
\begin{align*}
\text{In}^\rho([1])(v) &= 1 & \text{In}^\rho([1])(w) &= 1 \\
\text{Out}^\rho([1])(v) &= q_1 & \text{Out}^\rho([1])(w) &= 1 \\
\text{In}^\rho([2])(v) &= q_1 & \text{In}^\rho([2])(w) &= 1 \\
\text{Out}^\rho([2])(v) &= 1 & \text{Out}^\rho([2])(w) &= q_2 \\
\text{In}^\rho([3])(v) &= 1 & \text{In}^\rho([3])(w) &= q_2 \\
\text{Out}^\rho([3])(v) &= 1 & \text{Out}^\rho([3])(w) &= 1 \\
\end{align*}
\]

In particular, \( [[1][2][3]]_\rho : \text{In}^\rho([1]) \rightarrow \text{Out}^\rho([3]) \) holds. \( \diamond \)

The translation of a single instruction may be typed according to its component in \( \text{In}/\text{Out} \).

**Proposition 38.** Let \( \rho \) be compatible with \( \text{In}/\text{Out} \) and \( I \in \text{Instr}_\rho \). Then

1. \( [[I]]_\rho : \text{In}^\rho(I) \rightarrow \text{Out}^\rho(I) \) and

2. for \( J \in \text{succ}(I) \) there is a factor \( \text{rg}(Y) \) of \( \text{rg}(\text{Out}^\rho(J)) \) with

\[ \text{Out}^\rho(I) = \text{In}^\rho(J) \otimes \text{rg}(Y). \]

**Proof.** 1. Case distinction on the form of \( I \).
Case $’[n]x = a’$. The definition of $\llbracket I \rrbracket_\rho$ implies $\llbracket I \rrbracket_\rho = [n] 1 \text{dc } a \rho(x)$, hence

\begin{align*}
In^\rho(I) &= \bigotimes_{y \in \text{Var}_\gamma} In^\rho(I)(y) = 1 \bigotimes_{y \neq x} (1 \otimes Out^\rho(I)(y)) \\
Out^\rho(I) &= \bigotimes_{y \in \text{Var}_\gamma} Out^\rho(I)(y) = \rho(x) \bigotimes_{y \neq x} Out^\rho(I)(y)
\end{align*}

and the typing rule for $1 \text{dc}$ succeeds for $X = \bigotimes_{y \neq x} Out^\rho(I)(y)$.

Case $’[n]x = z’$ where $x \neq z$ and $\rho(z) = q$. The definition of $\llbracket I \rrbracket_\rho$ implies

\[ \llbracket I \rrbracket_\rho = [n] \text{id}^\gamma q \rho(x) \]

and equation (6.12) implies

\[ \text{uses}(I)(y) = \begin{cases} 1 & \text{if } y = z \\ 0 & \text{otherwise} \end{cases} \]

Hence,

\begin{align*}
In^\rho(I) &= 1 \otimes \rho(z) \otimes Out^\rho(I)(z) \bigotimes_{y \neq \{x,z\}} (1 \otimes Out^\rho(I)(y)) \\
&= q \otimes \bigotimes_{y \neq x} Out^\rho(I)(y) \\
Out^\rho(I) &= \rho(x) \bigotimes_{y \neq x} Out^\rho(I)(y)
\end{align*}

and the typing rule for $\text{id}$ succeeds with $X = \bigotimes_{y \neq x} Out^\rho(I)(y)$, because the side condition $SC$ reads $\gamma(q) = \gamma(q)$ and is trivially fulfilled.

Case $’[n]x = z’$ where $x \neq z$ and $\rho(z) = r$. The definition of $\llbracket I \rrbracket_\rho$ yields

\[ \llbracket I \rrbracket_\rho = [n] \text{id}^\text{ALU} r \rho(x) \]

and

\[ \text{uses}(I)(y) = \begin{cases} 1 & \text{if } y = z \\ 0 & \text{otherwise} \end{cases} \]

holds by equation (6.12). Hence,

\begin{align*}
In^\rho(I) &= 1 \otimes \rho(z) \otimes Out^\rho(I)(z) \bigotimes_{y \neq \{x,z\}} (1 \otimes Out^\rho(I)(y)) \\
&= r \otimes \bigotimes_{y \neq x} Out^\rho(I)(y) \\
Out^\rho(I) &= \rho(x) \bigotimes_{y \neq x} Out^\rho(I)(y)
\end{align*}

due to $Out^\rho(I)(z) = r$ and $r \otimes r = r$. Again, the typing judgement for $\llbracket I \rrbracket_\rho$ follows for $X = \bigotimes_{y \neq x} Out^\rho(I)(y)$ because the side condition $SC$ is void.
Case \( \lbrack n \rbrack x = x' \) where \( \rho(x) = q \). From the definition of \( \llbracket \_ \rrbracket_\rho \) we obtain \( \llbracket I \rrbracket_\rho = \llbracket n \rrbracket_\rho d^\rho(q) q q \), hence

\[
\begin{align*}
\text{In}^\rho(I) & = q \otimes \otimes_{y \neq x}(1 \otimes \text{Out}^\rho(I)(y)) \\
\text{Out}^\rho(I) & = q \otimes \otimes_{y \neq x} \text{Out}^\rho(I)(y)
\end{align*}
\]

The claim follows for \( X = \otimes_{y \neq x} \text{Out}^\rho(I)(y) \) and the side condition is fulfilled.

Case \( \lbrack n \rbrack x = x' \) where \( \rho(x) = r \). The definition of \( \llbracket \_ \rrbracket_\rho \) yields \( \llbracket I \rrbracket_\rho = \llbracket n \rrbracket_\rho d^\rho_L r r \), hence

\[
\begin{align*}
\text{In}^\rho(I) & = r \otimes \otimes_{y \neq x}(1 \otimes \text{Out}^\rho(I)(y)) \\
\text{Out}^\rho(I) & = r \otimes \otimes_{y \neq x} \text{Out}^\rho(I)(y)
\end{align*}
\]

The claim follows for \( X = \otimes_{y \neq x} \text{Out}^\rho(I)(y) \) and the side condition is fulfilled.

Cases \( \lbrack n \rbrack x = z \) bop \( a' \) and \( \lbrack n \rbrack x = x \) bop \( a' \). Similar to the cases \( \lbrack n \rbrack x = z' \) and \( \lbrack n \rbrack x = x' \), with \( fu = \text{MUL} \) for \( \text{bop} \in \{\ast\} \) and \( fu = \text{ALU} \) for \( \text{bop} \in \{-, +\} \).

Case \( \lbrack n \rbrack x = z \) bop \( v' \) where \( z \neq x \neq v \). We treat the case \( \text{bop} \in \{\ast\} \) and \( \rho(z) = q_1 \neq q_2 = \rho(v) \) explicitly, all other cases being similar.

The definition of \( \llbracket \_ \rrbracket_\rho \) yields \( \llbracket I \rrbracket_\rho = \llbracket n \rrbracket_\rho \text{mu}1 q_1 q_2 \rho(x) \), and equation (6.12) implies

\[
\begin{align*}
\text{In}^\rho(I) & = \text{In}^\rho(I)(x) \otimes \text{In}^\rho(I)(z) \otimes \text{In}^\rho(I)(v) \otimes \otimes_{y \neq \{x, z, v\}} \text{In}^\rho(I)(y) \\
& = 1 \otimes q_1 \otimes \text{Out}^\rho(I)(z) \otimes q_2 \otimes \text{Out}^\rho(I)(v) \otimes \\
& \otimes_{y \neq \{x, z, v\}} (1 \otimes \text{Out}^\rho(I)(y)) \\
& = q_1 \otimes q_2 \otimes \otimes_{y \neq x} \text{Out}^\rho(I)(y)
\end{align*}
\]

\[
\begin{align*}
\text{Out}^\rho(I) & = \text{Out}^\rho(I)(x) \otimes \otimes_{y \neq x} \text{Out}^\rho(I)(y) = \rho(x) \otimes \otimes_{y \neq x} \text{Out}^\rho(I)(y).
\end{align*}
\]

With \( X = \otimes_{y \neq x} \text{Out}^\rho(I)(y) \) the typing judgement \( \llbracket I \rrbracket_\rho : X \otimes q_1 \otimes q_2 \rightarrow X \otimes \rho(x) \) is derivable. The side condition reads \( \gamma(q_1) = \gamma(q_2) = \text{MUL} \) and is fulfilled because the compatibility of \( \rho \) with In/Out implies that \( z \) and \( v \)
are forwardable to $\gamma(q_1)$ and $\gamma(q_2)$, respectively. By the syntactic form of $I$, these variables are only forwardable to $fu = \text{MUL}$ by Definition 42, so $\gamma(q_1) = \gamma(q_2) = \text{MUL}$ holds.

Other cases \( \llbracket n \rrbracket x = \text{z bop} \). Similar.

Case \( \llbracket n \rrbracket \text{if } x \text{ m} \). The definition of \( \llbracket \text{if} \rrbracket_{\rho} \) yields \( \llbracket \text{if} \rrbracket_{\rho} = \llbracket n \rrbracket \text{if } \rho(x) \text{ m} \), and the earlier definition yield $\text{defs}(I) = \{x\}$, $\text{uses}(I)(x) = 1$ and $\text{uses}(I)(y) = 0$ for all $y \neq x$, hence

\[
\begin{align*}
\text{In}^\rho(I) & = \rho(x) \otimes \otimes_{y \in \text{Var}_\rho} (1 \otimes \text{Out}^\rho(I)(y)) \\
\text{Out}^\rho(I) & = \otimes_{y \in \text{Var}_\rho} \text{Out}^\rho(I)(y)
\end{align*}
\]

The claim follows with $X = \otimes_{y \in \text{Var}_\rho} \text{Out}^\rho(I)(y)$. The side condition reads $\gamma(\rho(x)) = B\text{U}$ and is trivially fulfilled if $\rho(x) = r$, and is fulfilled by the definition of compatibility and forwardability if $\rho(x) = q$.

Case \( \llbracket n \rrbracket \text{jmp } m \). The definition of \( \llbracket \text{jmp} \rrbracket_{\rho} \) yields \( \llbracket \text{jmp} \rrbracket_{\rho} = \llbracket n \rrbracket \text{jmp } m \), and we obtain

\[
\begin{align*}
\text{In}^\rho(I) & = \otimes_{y \in \text{Var}_\rho} \text{uses}^\rho(I)(y) \otimes \text{Out}^\rho(I)(y) = \otimes_{y \in \text{Var}_\rho} \text{Out}^\rho(I)(y) = \text{Out}^\rho(I)
\end{align*}
\]

so with $X = \text{In}^\rho(I)$ the typing rule for jmp succeeds.

2. We show that for each $x \in \text{Var}_\rho$ there is a factor $\text{rg}(Y_x)$ of $\text{rg}(\text{Out}^\rho(J))$ such that $\text{Out}^\rho(I)(x) = \text{In}^\rho(J)(x) \otimes \text{rg}(Y_x)$ holds. Then $\text{rg}(Y) = \otimes_{x \in \text{Var}_\rho} \text{rg}(Y_x)$ is a factor of $\text{rg}(\text{Out}^\rho(J))$ and the claim follows due to

\[
\text{Out}^\rho(I) = \otimes_{x \in \text{Var}_\rho} \text{Out}^\rho(I)(x) = \otimes_{x \in \text{Var}_\rho} (\text{In}^\rho(J)(x) \otimes \text{rg}(Y_x)) = \text{In}^\rho(J) \otimes \text{rg}(Y).
\]

We argue by case distinction on the value of $\text{Out}^\rho(I)(x)$.

Case $\text{Out}^\rho(I)(x) = 1$. The definition of $\text{Out}^\rho(I)(x)$ implies $x \notin \text{defs}(J)$, $\rho(x) = q$ and $\text{Out}(I)(x) \sqsubseteq 0$. From Proposition 33 we obtain $\text{In}(J)(x) \sqsubseteq \text{Out}(I)(x) \sqsubseteq 0$, hence $\text{uses}(J)(x) \sqsubseteq 0$ and

\[
\text{uses}^\rho(J)(x) = 1.
\]

There are now two cases, depending on whether $x \in \text{defs}(J)$ holds or not.
If \( x \in \text{defs}(J) \) then \( \text{In}^\rho(J)(x) = \text{uses}^\rho(J)(x) = 1 \) follows, so the claim holds with \( Y_x = 1 \).

If \( x \notin \text{defs}(J) \) then \( 0 \supseteq \text{In}(J)(x) = \text{uses}(J)(x) \oplus \text{Out}(J)(x) \) holds, hence \( 0 \supseteq \text{Out}(J)(x) \) and \( \text{Out}^\rho(J)(x) = 1 \). Therefore,

\[
\text{In}^\rho(J)(x) = \text{uses}^\rho(J)(x) \otimes \text{Out}^\rho(J)(x) = 1
\]

and the claim holds for \( Y_x = 1 \).

In both cases, \( \text{rg}(Y_x) = 1 \) is trivially a factor of \( \text{rg}(\text{Out}^\rho(J)) \).

**Case** \( \text{Out}^\rho(I)(x) = q \). The definition of \( \text{Out}^\rho(I)(x) \) forces \( \rho(x) = q \), and \( x \in \text{defs}(I) \) or \( \text{Out}(I)(x) \supseteq 1 \) hold. By Lemma 15, \( \text{Out}(I)(x) = 1 \) follows in either case, and from Proposition 33 we obtain \( \text{In}(J)(x) \subseteq \text{Out}(I)(x) = 1 \), hence \( \text{In}(J)(x) \in \{ \bot, 1 \} \). Again there are two cases.

If \( x \in \text{defs}(J) \) then \( \{ \bot, 1 \} \supseteq \text{In}(J)(x) = \text{uses}(J)(x) \in \{ 0, 1, \top \} \) holds, hence \( 1 = \text{In}(J)(x) = \text{uses}(J)(x) \) and \( \text{In}^\rho(J)(x) = \text{uses}^\rho(J)(x) = q \). Take \( Y_x = 1 \).

If \( x \notin \text{defs}(J) \), then \( \{ \bot, 1 \} \supseteq \text{In}(J)(x) = \text{uses}(J)(x) \oplus \text{Out}(J)(x) \) holds. In the case of \( \text{uses}(J)(x) = 1 \), \( \text{Out}(J)(x) \subseteq 0 \) follows, so \( \text{Out}^\rho(J)(x) = 1 \) and

\[
\text{In}^\rho(J)(x) = \text{uses}^\rho(J)(x) \otimes \text{Out}^\rho(J)(x) = \rho(x) \otimes 1 = q.
\]

In the case of \( \text{uses}(J)(x) = 0 \) and \( \text{Out}(J)(x) = 1 \),

\[
\text{uses}^\rho(J)(x) = 1 \quad \text{and} \quad \text{Out}^\rho(J)(x) = \rho(x) = q
\]

follow, hence

\[
\text{In}^\rho(J)(x) = \text{uses}^\rho(J)(x) \otimes \text{Out}^\rho(J)(x) = q.
\]

In both cases, take \( Y_x = 1 \) which is a factor of \( \text{rg}(\text{Out}^\rho(J)) \). The case of \( \text{uses}(J)(x) = 0 \) and \( \text{Out}(J)(x) = \bot \) cannot occur: from \( \rho(x) = q \) and the definition of compatibility we know that \( x \) is forwardable to \( \gamma(q) \), i.e. that \( \text{Out}(J)(x) \notin \{ \bot, \top \} \) holds.

**Case** \( \text{Out}^\rho(I)(x) = r \). Then \( \rho(x) = r \) follows, and there are three cases.
Case uses\( (J) (x) = 0 \) and \( x \in \text{defs}(J) \). Then
\[
\text{In}^\rho (J) (x) = \text{uses}^\rho (J) (x) = 1
\]
holds, and we take \( Y_x = r \). By Lemma 14, \( \rho (x) = r = Y_x \) is a factor of \( B_{[\rho]} \), hence \( Y_x \otimes Z = B_{[\rho]} \) holds for some \( Z \). The first part of this proposition implies \( \text{Out}^\rho (J) = B_{[\rho]} \otimes W \) for some \( W \), so \( \text{Out}^\rho (J) = Y_x \otimes Z \otimes W \) and \( \text{rg}(Y_x) = r \) is a factor of \( \text{rg}(\text{Out}^\rho (J)) \).

Case uses\( (J) (x) \models 1 \) and \( x \in \text{defs}(J) \). Then \( \text{Out}^\rho (J) (x) = r \) holds due to \( \rho (x) = r \), and \( \text{In}^\rho (J) (x) = \text{uses}^\rho (J) (x) = r \). We take \( Y_x = 1 \).

Case \( x \notin \text{defs}(J) \). Again, \( \text{Out}^\rho (J) (x) = r \) holds due to \( \rho (x) = r \), and we have \( \text{uses}^\rho (J) (x) \in \{1, r\} \), so \( \text{In}^\rho (J) (x) = r \) follows and we take \( Y_x = 1 \).

\( \square \)

The situation in the case ":Out\( \rho (I) (x) = q\)" in the proof of part (2) motivates the condition \( \text{Out}(J) (x) \neq \top \) for all \( I \) in the definition of forwardability (Definition 42). Consider the program \( P \),
\[
\begin{align*}
N(1) &= \ [1]\!\!v = \ldots \quad N(3) &= \ [3]\!\!x = 2 \quad N(5) = \ [5]\!\!\{\!\!1\!\!\text{if } v \!\!3\!\!5 \} \\
& \quad \quad \quad \quad \ [2]\!\!\{\!\!\text{if } v \!\!3\!\!5 \} \\
& \quad \quad \quad \quad \ [4]\!\!y = x + 1
\end{align*}
\]
and its dataflow equations for \( x \)
\[
\begin{align*}
fwdIn([1]) (x) &= 0 \oplus fwdOut([1]) (x) = fwdIn([2]) (x) \\
fwdIn([2]) (x) &= 0 \oplus fwdOut([2]) (x) = fwdIn([3]) (x) \sqcup fwdIn([5]) (x) \\
fwdIn([3]) (x) &= 0 \\
fwdOut([3]) (x) &= fwdIn([4]) (x) = 1 \oplus fwdOut([4]) (x) \\
fwdOut([4]) (x) &= 0 \\
fwdIn([5]) (x) &= 0 \oplus fwdOut([5]) (x) = fwdIn([3]) (x) \sqcup fwdIn([5]) (x)
\end{align*}
\]

In particular, the last equation simplifies to
\[
\text{fwdIn}([5]) (x) = 0 \sqcup \text{fwdIn}([5]) (x)
\]
(7.7)
for which a (non-minimal) fixed point is

\[ f_{\text{fwdIn}}([5])(x) = \top. \]

A solution \( \text{In}/\text{Out} \) based on this fixed point fulfils

\[ \text{In}([1])(x) = \text{Out}([1])(x) = \text{In}([2])(x) = \text{Out}([2])(x) = \top. \]

The condition \( \text{Out}(I)(x) \neq \top \) in Definition 42 ensures that for this solution variable \( x \) is not forwardable, despite being used linearly.

Suppose the condition was dropped. Then, an allocation \( \rho(x) = q \) would be possible due to \( f_{\text{fwdOut}}([3])(x) = 1 \), leading to

\[
\begin{align*}
\text{In}^\rho([1])(x) &= 1 \otimes \text{Out}^\rho([1])(x) = q \\
\text{In}^\rho([2])(x) &= 1 \otimes \text{Out}^\rho([2])(x) = q \\
\text{In}^\rho([3])(x) &= 1
\end{align*}
\]

and Lemma 15 and Proposition 38 (part (2)) would be violated as \( [3] \in \text{succ}([2]) \) holds. Furthermore, \( \text{In}^\rho([1])(x) = q \) means that an initial ALEF configuration would have to provide a value in \( q \) which results in functionally incorrect behaviour for \( v = 0 \) as instruction \( \llbracket [4] \rrbracket_\rho \) increments the head value of \( q \).

On the other hand, Definition 42 does allow \( x \) to be forwarded to ALU. For this, however, a different fixed point must be chosen. In particular, the least fixed point \( f_{\text{fwdIn}}([5])(x) = 0 \) to equation (7.7) leads to

\[ \text{In}([1])(x) = \text{Out}([1])(x) = \text{In}([2])(x) = \text{Out}([2])(x) = 0. \]

This solution fulfils the condition in Definition 42 and also Lemma 15, and for \( \rho(x) = q \) it yields

\[
\begin{align*}
\text{In}^\rho([1])(x) &= \ldots = \text{Out}^\rho([2])(x) = 1 \\
\text{In}^\rho([3])(x) &= 1 \\
\text{In}^\rho([5])(x) &= \text{Out}^\rho([5])(x) = 1.
\end{align*}
\]

Hence, Proposition 38 (part (2)) is satisfied and functionally correct behaviour is observed as an initial configuration does not need to provide a value in \( q \).
For registers, a similar mismatch between $\text{Out}^\rho(I)$ and $\text{In}^\rho(J)$ motivates the factor $Y$ in Proposition 38, and indeed requires most registers to be initialised at the start of an ALEF execution.

**Example.** Consider the IL program

$$[1]v = 2 \ [2]w = 5 \ [3]z = 6 \quad (7.8)$$

and the allocation $\rho$ with $\rho(x) = r_x$ for $x \in \{v,w,z\}$ and $r_x \neq r_y$ for $x \neq y$. We obtain

$$\text{In}^\rho([1]) = r_v \otimes r_z \quad \|[[1]]_\rho = 1\text{dc} \ 2 \ r_v$$
$$\text{In}^\rho([2]) = r_v \otimes r_z \quad \|[[2]]_\rho = 1\text{dc} \ 5 \ r_v$$
$$\text{In}^\rho([3]) = r_v \otimes r_w \quad \|[[3]]_\rho = 1\text{dc} \ 6 \ r_z$$

and for $I \in \{[1],[2],[3]\}$, $\text{Out}^\rho(I) = r_v \otimes r_w \otimes r_z$ and $\|I : \text{In}^\rho(I) \rightarrow \text{Out}^\rho(I)$ follow, hence

$$\text{Out}^\rho([1]) = \text{In}^\rho([2]) \otimes r_w$$
$$\text{Out}^\rho([2]) = \text{In}^\rho([3]) \otimes r_z$$

and $\|[[1] [2] [3]]_\rho : \text{In}^\rho([1]) \rightarrow \text{Out}^\rho([3])$. \hfill $\diamond$

The definition of $\text{In}^\rho$ and $\text{Out}^\rho$ yields the following result.

**Corollary 3.** Let $\rho$ be compatible with $\text{In}/\text{Out}$ and $B = I_1 \ldots I_n$ with $|B| > 0$. If $\text{rg}(Y)$ is a factor of $\text{rg}(\text{Out}^\rho(I_1))$ then

$$\|B\|_\rho : \text{In}^\rho(I_1) \otimes \text{rg}(Y) \rightarrow \text{Out}^\rho(I_n)$$

and $\text{Out}^\rho(I_n) = \text{rg}(\text{Out}^\rho(I_1)) \otimes X$ for some $X$.

**Proof.** Induction on $n$.

**Case $n = 1$.** Applying Proposition 38 (part (1)) yields $\|I_1\|_\rho : \text{In}^\rho(I_1) \rightarrow \text{Out}^\rho(I_1)$, so by the results of Chapter 5, $\|I_1\|_\rho : \text{In}^\rho(I_1) \otimes \text{rg}(Y) \rightarrow \text{Out}^\rho(I_1)$ holds for any factor $\text{rg}(Y)$ of $\text{Out}^\rho(I_1)$. Take $X = \text{Out}^\rho(I_1)$. 
Case $n > 1$. For $B = I_1 I_2 \ldots I_n$ the base case yields
\[
[[I_1]]_\rho : \mathit{In}^\rho (I_1) \otimes \mathit{rg}(Y) \rightarrow \mathit{Out}^\rho (I_1)
\]
and Proposition 38 (part (2)) yields
\[
\mathit{Out}^\rho (I_1) = \mathit{In}^\rho (I_2) \otimes \mathit{rg}(Y_2)
\]
for some factor $\mathit{rg}(Y_2)$ of $\mathit{rg}(\mathit{Out}^\rho (I_2))$, i.e.
\[
\mathit{rg}(Y_2) \otimes W = \mathit{rg}(\mathit{Out}^\rho (I_2)) \text{ for some } W.
\]
Consequently, the induction hypothesis can be applied to $I_2 \ldots I_n$, resulting in
\[
[[I_2 \ldots I_n]]_\rho : \mathit{In}^\rho (I_2) \otimes \mathit{rg}(Y_2) \rightarrow \mathit{Out}^\rho (I_n)
\]
and by applying the cut rule we obtain
\[
[[I_1 \ldots I_n]]_\rho : \mathit{In}^\rho (I_1) \otimes \mathit{rg}(Y) \rightarrow \mathit{Out}^\rho (I_n)
\]
as claimed. The induction hypothesis also results in $\mathit{Out}^\rho (I_n) = \mathit{rg}(\mathit{Out}^\rho (I_2)) \otimes V$ for some $V$. The identity $\mathit{Out}^\rho (I_2) \otimes \mathit{rg}(Y_2) = \mathit{Out}^\rho (I_2)$ yields
\[
[[I_2]]_\rho : \mathit{In}^\rho (I_2) \rightarrow \mathit{Out}^\rho (I_2)
\]
and by applying Lemma 5 (in Chapter 5) we obtain
\[
\mathit{Out}^\rho (I_2) = \mathit{rg}(\mathit{In}^\rho (I_2) \otimes \mathit{rg}(Y_2)) \otimes \mathit{Out}^\rho (I_2)
\]
\[
= \mathit{rg}(\mathit{Out}^\rho (I_1)) \otimes \mathit{Out}^\rho (I_2).
\]
Consequently, $\mathit{rg}(\mathit{Out}^\rho (I_2)) = \mathit{rg}(\mathit{Out}^\rho (I_1)) \otimes \mathit{rg}(\mathit{Out}^\rho (I_2))$ holds and thus
\[
\mathit{Out}^\rho (I_n) = \mathit{rg}(\mathit{Out}^\rho (I_2)) \otimes V = \mathit{rg}(\mathit{Out}^\rho (I_1)) \otimes \mathit{rg}(\mathit{Out}^\rho (I_2)) \otimes V,
\]
hence $\mathit{Out}^\rho (I_n) = \mathit{rg}(\mathit{Out}^\rho (I_1)) \otimes X$ for $X = \mathit{rg}(\mathit{Out}^\rho (I_2)) \otimes V$.

\[\Box\]

We are thus able to show well-typedness of the result of the translation.
Theorem 8. If \( \rho \) is an allocation compatible with a solution \( \text{In}/\text{Out} \) of a regular program \( P \) then \( \Sigma \vdash \llbracket P \rrbracket_\rho : \circ \) for some \( \Sigma \).

Proof. For \( P = (N, A, \text{succ}, \text{entry}) \) and any \( m \in \text{dom} \, N \), Corollary 3 with \( \text{rg}(Y) = 1 \) yields

\[
\llbracket N(m) \rrbracket_\rho : \text{In}^\rho (\text{fst}(N(m))) \rightarrow \text{Out}^\rho (\text{lst}(N(m)))
\]

and

\[
\text{Out}^\rho (\text{lst}(N(m))) = \text{rg}(\text{Out}^\rho (\text{fst}(N(m)))) \otimes X_m
\]

for some \( X_m \). For the abbreviations

\[ A_m = \text{In}^\rho (\text{fst}(N(m))) \quad \text{and} \quad B_m = \text{Out}^\rho (\text{lst}(N(m))) \]

there are thus ALEF basic blocks \( N(m) = \llbracket N(m) \rrbracket_\rho \) such that

\[ N(m) : A_m \rightarrow B_m \]

and \( B_m = \text{rg}(\text{Out}^\rho (\text{fst}(N(m)))) \otimes X_m \). Take

\[ \Sigma = \bigcup_{m \in \text{dom} \, N} N_m : A_m \rightarrow B_m. \]

The graph of basic blocks in ALEF contains exactly the same arrows as the IL program, by Table 7.1. Consequently, for \( (m, k) \in \lambda_{\text{ALEF}} \) we obtain \( \text{fst}(N(k)) \in \text{succ}(\text{lst}(N(m))) \), and by Proposition 38 (part (2)) there is a factor \( \text{rg}(Y_{mk}) \) of \( \text{rg}(\text{Out}^\rho (\text{fst}(N(k)))) \) such that

\[ B_m = A_k \otimes \text{rg}(Y_{mk}) \]

holds. Thus,

\[ B_m = A_k \otimes \text{rg}(Y_{mk}) \quad \text{and} \quad \text{rg}(Y_{mk}) \otimes \text{rg}(Z) = \text{rg}(\text{Out}^\rho (\text{fst}(N(k)))) = \text{rg}(B_k) \]

hold for some \( Z \) whenever \( (m, k) \in \lambda_{\text{ALEF}} \), so the condition on arrows in rule PROG-R is fulfilled. \( \square \)

The result of Theorem 8 may be used to justify the condition \( \text{Out}(I)(x) \neq \bot \) for all \( I \) in the definition of forwardability (Definition 42).
Example. The dataflow equations for variable \( w \) in program (7.9)

\[
\begin{align*}
&\quad [2] \text{if } v \leq 6 & \quad [4] \text{if } v > 6 \\
\end{align*}
\]

are

<table>
<thead>
<tr>
<th>( I )</th>
<th>( \text{In}(I)(w) )</th>
<th>( \text{Out}(I)(w) )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( [1] )</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
</tr>
<tr>
<td>( [2] )</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
</tr>
<tr>
<td>( [3] )</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
<td>\text{In}([3])(w) \sqcup \text{In}([6])(w)</td>
</tr>
<tr>
<td>( [4] )</td>
<td>\text{In}([5])(w) \sqcup \text{In}([6])(w)</td>
<td>\text{In}([5])(w) \sqcup \text{In}([6])(w)</td>
</tr>
<tr>
<td>( [5] )</td>
<td>\text{In}([5])(w) \sqcup \text{In}([6])(w)</td>
<td>\text{In}([5])(w) \sqcup \text{In}([6])(w)</td>
</tr>
<tr>
<td>( [6] )</td>
<td>\text{In}([6])(w)</td>
<td>\text{In}([6])(w)</td>
</tr>
</tbody>
</table>

The least solution \( \text{In}/\text{Out} \) arises from choosing \( \text{In}([6])(w) = \bot \):

\[
\begin{align*}
\text{Out}([3])(w) &= \text{In}([4])(w) = \text{Out}([4])(w) = 1 \\
\text{In}([6])(w) &= \text{Out}([6])(w) = \bot \\
\text{In}([1])(w) &= \text{Out}([1])(w) = \text{In}([2])(w) = \text{Out}([2])(w) = 0
\end{align*}
\]

If the condition \( \forall I. \text{Out}(J)(x) \neq \bot \) in Definition 42 was dropped, \( w \) would be forwardable, and we could choose an allocation \( \rho \) which maps \( w \) to operand queue \( q \), and \( v \) and \( x \) to (say) \( r_1 \) and \( r_2 \). However, this allocation falsifies Proposition 38(part (2)) as it yields \( \text{Out}^p([4])(w) = q \) and \( \text{In}^p([6])(w) = 1 \). More importantly, Theorem 8 is also violated as the translation \( [P_{(7.9)}]_\rho \) is ill-typed. The principal typings

\[
\begin{align*}
\llbracket N(1) \rrbracket_\rho &:: 1 \rightarrow r_1 \\
\llbracket N(3) \rrbracket_\rho &:: r_1 \rightarrow q \otimes r_1 \\
\llbracket N(5) \rrbracket_\rho &:: q \rightarrow r_2 \\
\llbracket N(6) \rrbracket_\rho &:: 1 \rightarrow 1
\end{align*}
\]

of the translations

\[
\begin{align*}
\llbracket N(1) \rrbracket_\rho &= [1] \text{ldc 17 } r_1 \\
&\quad [2] \text{if } r_1 \leq 6 \\
\llbracket N(3) \rrbracket_\rho &= [3] \text{ldc 2 } q \\
&\quad [4] \text{if } r_1 > 6 \\
\llbracket N(5) \rrbracket_\rho &= [5] \text{ldc } q \ r_2 \\
\llbracket N(6) \rrbracket_\rho &= [6] \text{jmp 6}
\end{align*}
\]
cannot be unified. Indeed, all alternative fixed points to $\text{In}([6])(w)$ yield solutions $\text{In}/\text{Out}$ for which $w$ is not forwardable.

Similar to the situation for the condition $\text{Out}(I)(x) \neq \top$, the restriction $\text{Out}(I)(x) \neq \bot$ may often be satisfied without losing forwardability, if a non-minimal fixed point is chosen.

**Example.** The dataflow equations for variable $w$ for program

\[
\begin{align*}
N(1) &= [1]v = 3 & N(4) &= [4]\text{jmp} \ 4 & N(5) &= [5]x = w \\
\quad & [2]w = 6 \\
\quad & [3]\text{if } v \ 4 \ 5
\end{align*}
\]

are

<table>
<thead>
<tr>
<th>$I$</th>
<th>$\text{In}(I)(w)$</th>
<th>$\text{Out}(I)(w)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>[2]</td>
<td>0</td>
<td>$\text{In}([4])(w) \sqcup \text{In}([5])(w)$</td>
</tr>
<tr>
<td>[3]</td>
<td>$\text{In}([4])(w) \sqcup \text{In}([5])(w)$</td>
<td>$\text{In}([4])(w) \sqcup \text{In}([5])(w)$</td>
</tr>
<tr>
<td>[4]</td>
<td>$\text{In}([4])(w)$</td>
<td>$\text{In}([4])(w)$</td>
</tr>
<tr>
<td>[5]</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>

The least solution $\text{In}/\text{Out}$ arises from choosing $\text{In}([4])(w) = \bot$:

\[
\begin{align*}
\text{In}([4])(w) &= \text{Out}([4])(w) = \bot \\
\text{Out}([2])(w) &= \text{In}([3])(w) = \text{Out}([3])(w) = 1
\end{align*}
\]

If the condition $\forall I.\text{Out}(I)(x) \neq \bot$ was dropped, $w$ would be forwardable with respect to this solution. Again, choosing $\rho(w) = q$ would violate Proposition 38(part (2)) as it would imply $\text{Out}^\rho([3])(w) = q$ and $\text{In}^\rho([4])(w) = 1$. In contrast to the previous example, Theorem 8 would still hold, but not for the types constructed in its proof: the resulting program $[P(7.10)]_\rho$ may only be unified after weakening the typing of $[N(4)]_\rho$ by $q$. However, the solution arising from the alternative fixed point $\text{In}([4])(w) = 1$ fulfills

\[
\begin{align*}
\text{Out}([2])(w) &= \text{In}([3])(w) = \text{Out}([3])(w) = \text{Out}([4])(w) = 1
\end{align*}
\]
and allows \( w \) to be forwarded. For the same allocation as above, we obtain

\[
In^\rho(I)(w) = \begin{cases} 1 & \text{if } I \in \{[1],[2]\} \\ q & \text{otherwise} \end{cases}
\quad \text{and} \quad
Out^\rho(I)(w) = \begin{cases} 1 & \text{if } I \in \{[1],[5]\} \\ q & \text{otherwise} \end{cases}
\]

Hence, Proposition 38 is fulfilled, \( [P_{(7.10)}] \rho \) is well-typed and its typing arises from \( In/Out \) as described in the proof of Theorem 8.

For the solutions arising from the remaining choices \( In([4])(w) = 0 \) and \( In([4])(w) = \top \), variable \( w \) is not forwardable.

\[ \Diamond \]

As the earlier example program (7.8) shows, initial ALEF configurations have to provide values in nearly all registers, despite the requirement that any IL-variable be assigned to prior to its first use. This property results from using the least upper bound operation in the definition of \( fwdOut \) in equations (6.13) and the slack-less equality \( fwdIn(I)(x) = uses(I)(x) \) if \( x \in \text{defs}(I) \). While the condition on variable assignments ensures that this mismatch does not lead to functionally incorrect behaviour, a tighter correspondence between \( In \) and \( In^\rho \) would be desirable. This may be possible by a second dataflow analysis in forward-direction after the decision on forwardability has been made. Such a dual analysis might resemble the effect of the operation \( \ominus \) (see Section 6.4), but a detailed study is left for future research. An alternative may be to employ translations \( \tau^\rho \) which relate \( \oplus \) and \( \otimes \) homomorphically, like

\[
\tau^\rho_{x}(\top) = \rho(x) \quad \tau^\rho_{x}(0) = 1 \\
\tau^\rho_{x}(1) = \rho(x) \quad \tau^\rho_{x}(\bot) = 1
\]

with subsequent definitions

\[
defs(I)(x) = \begin{cases} 1 & \text{if } x \in \text{defs}(I) \\ 0 & \text{otherwise} \end{cases}
\]

and

\[
\begin{align*}
\tau^\rho_{\text{defs}}(I) &= \otimes_{x \in \text{defs}(I)} \tau^\rho_{x}(\text{defs}(I)(x)) \\
\tau^\rho_{\text{uses}}(I) &= \otimes_{x \in \text{uses}(I)} \tau^\rho_{x}(\text{uses}(I)(x)) \\
\tau^\rho_{\text{In}}(I) &= \otimes_{x \in \text{Var}} \tau^\rho_{x}(\text{In}(I)(x)) \\
\tau^\rho_{\text{Out}}(I) &= \otimes_{x \in \text{Var}} \tau^\rho_{x}(\text{Out}(I)(x)).
\end{align*}
\]
7.5 Allocating operand queues and registers

Operand queues and registers are assigned to variables by defining the function $\rho : \text{Var}_P \rightarrow \text{OP}$. For register allocation, many algorithms exist in the literature which ensure that variables which share a register cannot be in use at the same time. These algorithms are often presented as (extensions of) graph colouring problems where nodes represent variables and undirected arcs $(x, y)$ represent conflicts. Conflicts arise between variables $x$ and $y$ if there is an instruction $I$ at which both $x$ and $y$ are live. Any node colouring of the conflict graph which maps adjacent nodes to different colours yields a register allocation in which conflicting variables are assigned to different registers. The task to colour the nodes of a graph with a minimal number of colours can be proven to be NP-complete. In practice, algorithms therefore compute non-optimal solutions which use (in general) a higher number of registers than necessary. Numerous varieties of register allocation algorithms exist, coping with finite supply of registers, restrictions on the usability of registers for values of a particular type, or supporting register windowing for procedure calls.

For allocating operand queues and registers to IL programs, we assume an infinite supply of operand queues and registers, leaving the combination with register or operand queue spilling [App98a] for future research. Restricting ourselves to the most basic case, we allocate registers to non-forwardable variables using the above colouring approach and assign operand queues to forwardable variables by colouring a a family of graphs, one for each type of functional unit.

Due to the order of values inside operand queues, the conflict condition for operand queue mapped variables differs from that for register mapped variables. Instead of employing joint liveness, variables $x$ and $y$ which are forwardable to the same functional unit conflict if there is a path of instructions in $P$ which assigns to $x$ and $y$ in the order $x$ before $y$ reads in the order $y$ before $x$.

**Definition 45.** For IL program $P = (N, A, \text{succ}, \text{entry})$ and $x \in \text{Var}_P$, a path $\pi$ in $P$ is called an $x$-path if $x \in \text{defs}(\pi(1))$ and

- if $\pi$ is finite and $n$ its length, then $x \in \text{uses}(\pi(n))$ holds and $x \not\in \text{defs}(\pi(i))$ for all $1 < i < n$
• if \( \pi \) is infinite then \( x \notin \text{defs}(\pi(i)) \) holds for all \( 1 < i \)

Conflicts between variables which are forwardable to the same \( fu \) may be expressed as containment between paths.

**Definition 46.** For \( x \neq y \in \text{Var}_p \), a (finite or infinite) \( x \)-path \( \pi \) contains a finite \( y \)-path \( \tau \) if

• there is a \( k \geq 0 \) such that for all \( i \in \text{dom} \, \tau \), \( \tau(i) = \pi(i + k) \) holds, and

• if \( \pi \) is finite and the last instructions of \( \pi \) and \( \tau \) coincide, then this last instruction is of the form \( z = y \text{ bop } x \) for some \( z \).

In particular, no \( x \)-path contains an infinite \( y \)-path. In contrast, a finite \( x \)-path \((I_1, \ldots, I_n)\) contains \((I_2, \ldots, I_{n-1})\) always, and contains \((I_2, \ldots, I_n)\) if \( I_n \) is of the form \( z = y \text{ bop } x \), but not if \( I_n \) is of the form \( I_n = z = x \text{ bop } y \).

**Definition 47.** For regular \( P \), solution \( \text{In/Out} \) and functional unit \( fu \), the (undirected) conflict graph \( G_{fu} = (V_{fu}, E_{fu}) \) consists of nodes \( x \in \text{Var}_P \) which are forwardable to \( fu \) with respect to \( \text{In/Out} \) and edges \((x, y)\) whenever there is an \( x \)-path \( \pi \) and a \( y \)-path \( \tau \) and \( \pi \) contains \( \tau \).

The register conflict graph \( G_{Reg} = (V_{Reg}, E_{Reg}) \) is given by the variables \( x \in \text{Var}_P \) as nodes and conflict edges for all pairs \((x, y)\) such that there is an instruction \( I \in \text{Instrs}_P \) with \( x, y \in \text{liveOut}(I) \).

**Example.** Consider the three programs

\[
N(1) = \begin{align*}
[1] & \quad x = 2 \\
[2] & \quad y = 5 \\
[3] & \quad z = 7 \\
[4] & \quad \text{jmp 5} \\
[5] & \quad \text{jmp 5} \\
\end{align*}
\]

\[
N(1) = \begin{align*}
[1] & \quad x = 4 \\
[2] & \quad y = 7 \\
[3] & \quad z = x - y \\
[4] & \quad \text{jmp 5} \\
[5] & \quad x = x + z \quad \text{(7.11)} \\
\end{align*}
\]

\[
N(5) = \begin{align*}
[1] & \quad x = 4 \\
[2] & \quad y = 7 \\
[3] & \quad z = y - x \\
[4] & \quad \text{jmp 5} \\
[5] & \quad x = y + 1 \\
[6] & \quad z = z - 1 \\
[7] & \quad \text{if } z \ 9 \ 5 \\
[8] & \quad \text{if } z \ 9 \ 5 \\
\end{align*}
\]

\[
N(9) = \begin{align*}
[1] & \quad x = 2 \\
[2] & \quad y = 5 \\
[3] & \quad z = 7 \\
[4] & \quad \text{jmp 5} \\
[5] & \quad \text{jmp 5} \\
[6] & \quad y = y + 1 \\
[7] & \quad x = x - y \\
\end{align*}
\]
In the program on the left, there is one x-path $\pi = ([1],[2],[3])$ and one y-path $\tau = ([2],[3])$ and both variables are forwardable to ALU. Although all instructions of $\tau$ occur in $\pi$, $\tau$ is not contained in $\pi$ as the instruction $[3]$ does not have the form $z = y \text{ bop } x$. Indeed, no conflict exists between $x$ and $y$ and they may share the same operand queue.

The second program has the same x-paths and y-paths, but this time instruction $[3]$ is of the form $z = y \text{ bop } x$. Consequently, a conflict exists between $x$ and $y$. The conflict graph $G_{\text{ALU}}$ of variables forwardable to ALU contains an edge between $x$ and $y$ and a colouring will hence map them to different operand queues. This is necessary for functionally correct execution: suppose an allocation $\rho$ was admitted with $\rho(x) = \rho(y) = q$. Then the translated program $[[P]]_{\rho}$ would read

$$[[P]]_{\rho} = \begin{array}{c}
1 \text{ ldc } 4 \ & 2 \ \text{ ldc } 7 \ & 3 \ \text{ sub } q \ & q \ \rho(z)
\end{array}$$

and the last instruction would assign $-3$ to $\rho(z)$ instead of $3$.

In the third program, $x$ and $y$ are forwardable to ALU, and $z$ is not forwardable. There are three x-paths

\begin{align*}
\pi_1 &= ([1],[2],[3],[4],[5]) \\
\pi_2 &= ([5],[6],[7],[8],[9]) \\
\pi_1 &= ([5],[6],[7],[8],[5])
\end{align*}

and three y-paths

\begin{align*}
\tau_1 &= ([2],[3],[4],[5],[6]) \\
\tau_2 &= ([6],[7],[8],[9]) \\
\tau_1 &= ([6],[7],[8],[5],[6])
\end{align*}

No $\pi_i$ contains any $\tau_j$, and no $\tau_j$ contains any $\pi_i$, hence no conflict exists between $x$ and $y$. In particular $\pi_2$ does not contain $\tau_2$ as instruction $[9]$ is not of the form $z = y \text{ bop } x$.

For $\gamma(q) = \text{ALU}$ the allocation $\rho(x) = \rho(y) = q_1$, $\rho(z) = r_1$ yields the translation

\begin{align*}
N(1) &= 1 \text{ ldc } 2 \ q_1 \\
N(5) &= 5 \text{ add } q_1 \ r_1 \ q_1 \\
N(9) &= 9 \text{ sub } q_1 \ q_1 \ r_1 \\
& \begin{array}{c}
2 \ \text{ ldc } 5 \ q_1 \\
3 \ \text{ ldc } 7 \ r_1 \\
4 \ \text{ jmp } 5
\end{array}
\begin{array}{c}
6 \ \text{ inc } q_1 \ q_1 \\
7 \ \text{ dec } r_1 \ r_1 \\
8 \ \text{ if } r_1 \ 9 \ 5
\end{array}
\end{align*}

\[^1\text{In fact, for commutative bop the second condition in Definition 46 may be dropped.}\]
with principal typings \( N(1) : 1 \rightarrow q_1 \otimes q_1 \otimes r_1 \), \( N(5) : q_1 \otimes r_1 \rightarrow q_1 \otimes r_1 \) and \( N(9) : q_1 \otimes q_1 \rightarrow r_1 \). Weakening the typing for \( N(5) \) by \( q_1 \) and the typing for \( N(9) \) by \( r_1 \) yields

\[
\begin{align*}
N(1) & : 1 \rightarrow q_1 \otimes q_1 \otimes r_1 \\
N(5) & : q_1 \otimes q_1 \otimes r_1 \rightarrow q_1 \otimes q_1 \otimes r_1 \\
N(9) & : q_1 \otimes q_1 \otimes r_1 \rightarrow r_1 
\end{align*}
\]

and makes the program type-correct. It is also functionally correct due to \( N(5) \)'s invariant: the head value of \( q_1 \) is the updated value of \( x \), the second element of \( q_1 \) contains \( y \) and \( r_1 \) contains the value of \( z \). Thus, instruction [9] correctly subtracts \( y \) from \( x \). \( \diamond \)

Allocating queue and registers comprises node-colouring the conflict graphs.

**Definition 48.** For regular \( P \) and solution \( In/Out \), allocation function \( \rho : \text{Var}_P \rightarrow OP \) is called an allocation for \( P \) if

- \( \rho(x) = q \) implies \( x \in V'_{\gamma(q)} \)
- \( \rho(x) = \rho(y) = r \) implies \( (x, y) \notin \mathcal{E}_{\text{Reg}} \)
- \( \rho(x) = \rho(y) = q \) implies \( (x, y) \notin \mathcal{E}_{\gamma(q)} \)

**Lemma 16.** If \( \rho \) is an allocation for \( P \) with respect to solution \( In/Out \) then \( \rho \) is compatible with \( In/Out \).

**Proof.** Let \( \rho(x) = q \) and suppose \( x \) is not forwardable to \( \gamma(q) \). Then \( x \notin V'_{\gamma(q)} \) so \( \rho(x) \neq q \). Contradiction. \( \square \)

By Theorem 8, translating \( P \) with respect to an allocation for \( P \) consequently leads to an ALEF program which is well-typed. Variables which are forwardable to several functional units can be inserted into any graph, each choice yielding a different instantiation of the polymorphic instruction \( \text{idfu} \).
7.6 Functional correctness

We finally prove functional correctness of the translation. This is achieved by relating the ALEF execution of $\llbracket P \rrbracket_\rho$ to the execution of $P$ in the non-standard dynamic semantics $\Rightarrow_\alpha$. We first define the association function $\alpha_\rho$, which arises from allocation $\rho$ according to the read/write capabilities motivated earlier: register-mapped variables are associated to the lattice element $\top$ and queue-mapped ones to $1$. For an allocation arising from a dataflow solution, the resulting $\alpha_\rho$ fulfils the condition of Theorem 6 which guarantees that the functional correctness also holds with respect to the standard semantics $\rightarrow$ of IL.

We then establish a satisfaction relation between IL states $S$ and ALEF states $C$ which expresses the fact that values in IL-component $\sigma$ occur as entries in $Q$ or $R$ in $C$, depending on their entry in the association function $\alpha_\rho$.

The satisfaction relation is shown to be preserved by program execution, including the order of values in operand queues. In particular, this result implies that loops fulfil an invariant similar to the one observed in the rightmost program in (7.11).

Again, we consider $P$ to be a fixed regular program and $In/Out$ a fixed solution for $P$. Furthermore, we let $\rho$ be an allocation for $P$ with respect to $In/Out$ according to Definition 48.

**Definition 49.** The association function $\alpha_\rho$ is given by

$$
\alpha_\rho(I,x) = \begin{cases} 
1 & \text{if } x \in \text{defs}(I) \text{ and } \rho(x) = q \\
\top & \text{if } x \in \text{defs}(I) \text{ and } \rho(x) = r
\end{cases}
$$

The following Lemma ensures that $\Rightarrow_\alpha$ is faithful to the standard execution $\rightarrow$ of $P$, based on Theorem 6.

**Lemma 17.** $\alpha_\rho \sqsupseteq \alpha_{Out}$.

**Proof.** Let $x \in \text{defs}(I)$.

If $\rho(x) = q$ then $x$ is forwardable to $\gamma(q)$ as $\rho$ is compatible with $In/Out$, so $x \in \text{defs}(I)$ implies $Out(I)(x) = 1$, so $\alpha_\rho(I,x) = 1 = \alpha_{Out}(I,x)$.

If $\rho(x) = r$ then $\alpha_\rho(I,x) = \top$, so $\alpha_\rho(I,x) \sqsupseteq \alpha_{Out}(I,x)$ is trivially fulfilled. \qed
In particular, Lemma 17 explains why Theorem 6 was stated for arbitrary $\alpha \sqsubseteq \alpha_{\text{Out}}$, although the proof used only properties of $\alpha_{\text{Out}}$. Consider an allocation which maps some variables to registers, regardless of their forwardability. Such an allocation is admitted by all definitions we gave so far, and should certainly be admitted for translation into ALEF, too. However, its association function does not necessarily correspond to any solution to the dataflow equations. As an extreme example consider the trivial allocation $\rho$ which maps all variables to registers. Its association function is $\alpha_{\rho}(I,x) = \top$ whenever $x \in \text{defs}(I)$ holds, but $\lambda x.\top$ is in general not a fixed point for equations (6.13). Thus, stating Theorem 6 only for associations arising from solutions would have ruled out allocation $\rho$.

The stated version of Theorem 6 leaves the decision whether a forwardable variable is indeed forwarded to the allocation phase rather than to the dataflow analysis. This is the right choice as many compilers aim to perform program analysis independently from architectural limitations such as the availability of a certain number of operand queues or registers.

The satisfaction relation between IL-configurations $S$ and ALEF-configurations $C$ is motivated as follows. Firstly, $S = (\Sigma,n,B)$ may contain entries $\Sigma(x) = (\top,a)$ and $\Sigma(y) = (\top,a)$ for variables $x,y \in \text{Var}_P$ with $\rho(x) = r = \rho(y)$. On the other hand, the register component $R$ of an ALEF configuration $C = (Q,R,M,\bar{\kappa})$ maps $r$ to a single value. Our definition relating $S$ to $C$ therefore requires that such variables $x$ and $y$ cannot both be live-in at $\text{fst}(B)$, and stipulates that the entry in $\Sigma$ for the live variable is the one found in $C$.

Second, the order of values in operand queues is not specified in $\Sigma$. Our definition requires for each $q \in Q$ that each entry $a$ in $Q(q)$ corresponds precisely to one variable $x$ with $\Sigma(x) = (1,a)$ and $\rho(x) = q$. Furthermore, the order of values $a_1 \ldots a_n$ in $Q(q)$ must correspond to the order of the last $n$ instructions $I_1,\ldots,I_n$ in the IL-execution history leading to $S$, where each $I_i$ assigns value $a_i$ to a variable $x$ with $\rho(x) = q$.

**Definition 50.** Let $\rho$ be an allocation for regular $P$, $S = (\Sigma,1,\_)$ and $C = (Q,R,M,\bar{\kappa})$. We write $C \models_{\alpha_{\rho}} S$ if

1. there are $S_1,\ldots,S_n$ such that $S_n = S$ and $S_i \Rightarrow_{\alpha_{\rho}} S_{i+1}$ for $0 \leq i < n$, where $S_0$ is the initial configuration. We write $S_i = (\Sigma_i,m_i,B_i)$. 

2. if \( x \in \text{dom } \Sigma, \Sigma(x) = (\top, a) \) and \( \rho(x) = r \) then \( r \in \text{dom } \mathcal{R} \).

3. if \( r \in \text{dom } \mathcal{R} \) and \( |B_n| > 0 \) then there is at most one \( x \in \text{live}\text{ln}(\text{fst}(B_n)) \) with \( \rho(x) = r \), and any such \( x \) fulfills \( \Sigma(x) = (\top, R(r)) \).

4. for each \( q \in Q, |Q(q)| = |\{ x \in \text{dom } \Sigma | \rho(x) = q, \exists b. \Sigma(x) = (1, b) \}| \) holds.

5. for each \( q \in Q \) and \( Q(q) = a_1 \ldots a_m \) there are \( x_1, \ldots, x_m \in \text{Var}_\rho \) and \( I_1, \ldots, I_m \in \text{Instrs}_\rho \) such that for all \( 1 \leq i \leq m \)

   - \( I_i = \text{fst}(B_{j_i}) \) for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \)
   - \( x_i \in \text{defs}(I_i) \)
   - \( \rho(x_i) = q \)
   - \( \Sigma(x_i) = (1, a_i) \)
   - \( x_i \notin \text{defs}(J) \cup \text{uses}(J) \) for \( J = \text{fst}(B_k) \) and all \( j_i < k < n \)

   and \( \bigcup_{i=1}^{m} x_i = \{ x \in \text{dom } \Sigma | \rho(x) = q, \exists b. \Sigma(x) = (1, b) \} \).

In particular, the length of \( Q(q) \) equals the number of variables mapped to \( q \) which have been assigned to most recently but not yet been read, with assigning instructions \( \text{fst}(B_{j_i}) \).

Before proving that the satisfaction relation is preserved by execution, we collect two facts.

**Lemma 18.** If \( S_0 \Rightarrow^*_\alpha S, S = (\Sigma, B), |B| > 0 \) and \( x \in \text{dom } \Sigma \), then

- \( \rho(x) = r \) implies \( \Sigma(x) = (\top, a) \) for some \( a \).
- \( \rho(x) = q \) implies \( \Sigma(x) = (l, a) \) for some \( l \in \{0, 1\} \).

**Proof.** Since \( S_0 = (\Sigma_0, \bot, B_0) \) is initial, \( x \notin \text{dom } \Sigma_0 \) holds, so for \( x \in \text{dom } \Sigma \) and \( \Sigma(x) = (l, a) \) we must have \( l = \alpha_\rho \bigoplus \ldots \bigoplus 1 \) for some \( n \). By the definitions of \( \alpha_\rho \) and \( \bigoplus \), \( \rho(x) = r \) implies \( \alpha_\rho = \top \), hence \( l = \top \).

For \( \rho(x) = q \) we obtain \( \alpha_\rho = 1 \), hence \( l \in \{ \bot, 0, 1 \} \). On the other hand, the definitions of compatibility and forwardability imply \( \text{Out}(\text{fst}(B))(x) \in \{0, 1\} \), hence \( \bot \neq \text{In}(\text{fst}(B))(x) \). From \( \text{In}(\text{fst}(B)) \subseteq S \) we thus obtain \( l \in \{0, 1\} \). \( \square \)
Lemma 19. If \( \rho \) is an allocation for \( \mathbb{P} \), \( \rho(x) = q \) and \( \pi \) is a path in \( \mathbb{P} \) with \( x \in \text{defs}(\pi(1)) \) and \( x \not\in \text{defs}(\pi(i)) \cup \text{uses}(\pi(i)) \) for all \( i > 1 \), then \( \pi \) is the prefix of an \( x \)-path in \( \mathbb{P} \).

Proof. If \( \pi \) is infinite, then the claim holds trivially as \( \pi \) is an \( x \)-path.

For finite \( \pi \), and \( n \) its length, suppose \( \text{succ}(\pi(n)) = \emptyset \) holds. Then equations (6.13) imply \( \text{Out}(\pi(n))(x) = 0 \), and for all \( 1 < i \leq n \), \( \text{uses}(\pi(i))(x) = 0 \) holds due to \( x \not\in \text{uses}(\pi(i)) \), hence

\[
0 = \text{Out}(\pi(n))(x) = \text{In}(\pi(n))(x) \subseteq \text{Out}(\pi(n-1))(x) = \text{In}(\pi(n-1))(x) \subseteq \\
\ldots \subseteq \text{Out}(\pi(1))(x) = 1
\]

where \( \text{Out}(\pi(1))(x) = 1 \) follows from Lemmas 16 and 15. This contradicts the structure of the lattice \( L \), as \( 0 \not\subseteq 1 \) holds. Hence \( \text{succ}(\pi(n)) \neq \emptyset \) follows.

Similarly, \( J \in \text{succ}(\pi(n)) \) with \( x \in \text{defs}(J) \) but \( x \not\in \text{uses}(J) \) leads to the same contradiction \( 0 \subseteq \text{Out}(\pi(n))(x) \subseteq \ldots \subseteq \text{Out}(\pi(1))(x) = 1 \) since \( 0 = \text{In}(J)(x) \subseteq \text{Out}(\pi(n))(x) \) holds. Consequently, \( x \in \text{defs}(J) \) implies \( x \in \text{uses}(J) \), and the path resulting from appending any such \( J \) to \( \pi \) is an \( x \)-path whose existence we had to show.

If there is no \( J \in \text{succ}(\pi(n)) \) with \( x \in \text{defs}(J) \), then we choose an arbitrary \( J \) from \( \text{succ}(\pi(n)) \neq \emptyset \) and extend \( \pi \) by \( J \). The resulting path is of the form required in the claim, and is of length \( n + 1 \).

By repeating this process, we either obtain a path of which \( \pi \) is a prefix, or we extend \( \pi \) to an infinite path for which the claim holds trivially. \( \square \)

The proof of functional correctness shows that initial configurations \( S_0 \) and \( C_0 \) are always related, and that each \( \Rightarrow_{\alpha_\rho} \)-step corresponds to an ALEF step, preserving the satisfaction relationship between configurations. We first state the result for assignments.

Proposition 39. Let \( C_0 = ([],[],M,\bar{\kappa}) \), \( C = (Q,R,M,\bar{\kappa}) \), \( S = (\Sigma,m,B) \) and \( |B| > 0 \). Then

\begin{itemize}
  \item \( C_0 \models_{\alpha_\rho} S_0 \)
  \item If \( \text{fst}(B) \) is an assignment \( \text{ass} \), \( C \models_{\alpha_\rho} S \) and \( S \Rightarrow_{\alpha_\rho} T \) then there is a unique \( D \) such that \( C \xrightarrow{\text{ass}}_{\alpha_\rho} D \). Furthermore, \( D \models_{\alpha_\rho} T \) and \( D = (\omega,\omega,M,\bar{\kappa}) \) hold.
\end{itemize}
Proof. For the first claim, observe that the definition of $S_0$ implies $\Sigma_0 = []$, hence for $n = 0$ all points in Definition 50 are trivially fulfilled.

For the second item, we first note that for $C \models_{\alpha_p} S$ and $S \Rightarrow_{\alpha_p} T$ with $S = (\Sigma, m, ass, \tilde{I})$, the definition of $\Rightarrow_{\alpha_p}$ implies $T = (\Sigma_2, m, \tilde{I})$ for the unique

$$\Sigma_2 = \Sigma_1[x \mapsto (\alpha_p(ass, x), a)]$$

where $ass$ is of the form $x = e$ and $\Sigma, e \downarrow \Sigma_1, a$ and $\Sigma_1 \downarrow x$ hold. The proof proceeds by case distinction on the structure of $ass$, and we treat one representative case.

Case $[.]x = x'$. The definition of $\Sigma, x \downarrow \Sigma_1, a$ implies $\Sigma(x) = (l, a)$ for some $l \sqsupseteq 1$ with $\Sigma_1 = \Sigma[x \mapsto (l \ominus 1, a)]$, and $\Sigma_1 \downarrow x$ implies $l \ominus 1 \sqsubseteq 0$. There are two cases, depending on the value of $\rho(x)$.

Case $\rho(x) = r$. The definition of $\llbracket . \rrbracket_\rho$ yields $\llbracket ass \rrbracket_\rho = 1d_{\text{ALU}} r r$, and the assumption $C \models_{\alpha_p} S$ implies that there is an $n$ with

1. $S_0 \Rightarrow_{\alpha_p} S$, and in particular $ass = \text{fst}(B_n) = \text{fst}(B)$.
2. if $y \in \text{dom} \Sigma, \Sigma(y) = (\top, b)$ and $\rho(y) = r'$ then $r' \in \text{dom} R$.
3. for $r' \in \text{dom} R$ there is at most one $y \in \text{liveIn}(ass)$ with $\rho(y) = r'$, and any such $y$ fulfills $\Sigma(y) = (\top, R(r'))$.
4. for each $q \in Q$, $|Q(q)| = |\{y \in \text{dom} \Sigma \mid \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}|$ holds.
5. for each $q \in Q$ and $Q(q) = a_1 \ldots a_m$ there are $x_1, \ldots, x_m \in \text{Var}_P$ and $I_1, \ldots, I_m \in \text{Instr}_P$ such that for all $1 \leq i \leq m$
   - $I_i = \text{fst}(B_{j_i})$ for some $0 \leq j_1 \leq \ldots \leq j_m < n$
   - $x_i \in \text{defs}(I_i)$
   - $\rho(x_i) = q$
   - $\Sigma(x_i) = (1, a_i)$
   - $x_i \notin \text{defs}(J) \cup \text{uses}(J)$ for $J = \text{fst}(B_k)$ and all $j_i < k < n$

and $\cup_{i=1}^m x_i = \{y \in \text{dom} \Sigma \mid \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}$. 

Step 1. We first show that there is a D with $C \xrightarrow{[\text{ass}]_{\rho}} D$.

From $S_0 \Rightarrow_{\alpha_\rho}^n S$ (fact (1)) and $S \Rightarrow_{\alpha_\rho} T$ we obtain $S_0 \Rightarrow_{\alpha_\rho}^{n+1} T$.

We noticed $\Sigma(x) = (l, a)$ above, so Lemma 18 implies $l = T$, hence $\Sigma_1 = \Sigma$ and fact (2) imply $r \in \text{dom } R$. Furthermore, $x \in \text{uses}(ass)$ holds, hence $x \in \text{liveIn}(ass)$, and fact (3) yields $\Sigma(x) = (T, R(r))$. From $\Sigma(x) = (l, a)$ and uniqueness of the map $R$ we thus obtain $R(r) = a$.

Consequently, the definition of ALEF execution yields $C \xrightarrow{[\text{ass}]_{\rho}} D$ for the unique $D = C$.

Step 2. Second, we show that for $D = C$, the relation $D \models_{\alpha_\rho} T$ holds.

1. $S_0 \Rightarrow_{\alpha_\rho}^{n+1} T$ was shown above.

2. For $y \in \text{dom } \Sigma_2$, $\Sigma_2(y) = (\top, b)$ and $\rho(y) = r'$ there are two cases.
   - If $x = y$ then $r' = \rho(y) = \rho(x) = r$, hence $r' \in \text{dom } R$
   - If $x \neq y$ then $y \in \text{dom } \Sigma_2$ implies $y \in \text{dom } \Sigma_1 = \text{dom } \Sigma$ with $\Sigma(y) = \Sigma_2(y) = (\top, b)$, so fact (2) implies $r' \in \text{dom } R$.

   In both cases, $r' \in \text{dom } R$ holds.

3. For $r' \in \text{dom } R$ and $\vert \tilde{I} \vert > 0$, suppose that $\{y, z\} \subseteq \text{liveIn}(\text{fst}(\tilde{I}))$ holds for some $y \neq z$ with $\rho(y) = r' = \rho(z)$. From $\text{succ}(ass) = \{\text{fst}(\tilde{I})\}$ and equations (6.9) we obtain $\{y, z\} \subseteq \text{liveOut}(ass)$. Consequently, $y$ and $z$ are adjacent in $G_{\text{Reg}}$ so $\rho(y) \neq \rho(z)$ as $\rho$ is an allocation for $P$. This contradicts $\rho(y) = r' = \rho(z)$, hence there is at most one $y \in \text{liveIn}(\text{fst}(\tilde{I}))$ with $\rho(y) = r'$.

   It remains to show that any such $y$ fulfils $\Sigma_2(y) = (T, R(r'))$. There are two cases.
   - if $y = x$, then $r' = r$ follows, and the definition of $\alpha_\rho$ yields
     $\Sigma_2(y) = \Sigma_2(x) = (\alpha_\rho(ass, x), a) = (T, R(r))$.
   - if $y \neq x$ then equations (6.9) imply
     $y \in \text{liveIn}(\text{fst}(\tilde{I})) \subseteq \text{liveOut}(ass)$,
     hence $y \in \text{liveIn}(ass)$ holds due to $y \notin \{x\} = \text{defs}(ass)$. By fact (3), $\Sigma_2(y) = \Sigma_1(y) = \Sigma(y) = (T, R(r'))$ follows.
4. For \( q \in Q \), we have to show

\[
|Q(q)| = |\{ y \in \text{dom} \Sigma_2 | \rho(y) = q, \exists b. \Sigma_2(y) = (1, b) \}|.
\]

Abbreviating \( \{ y \in \text{dom} \Sigma_2 | \rho(y) = q, \exists b. \Sigma_2(y) = (1, b) \} \) by \( M_q \), any \( y \in M_q \) fulfils \( y \neq x \) as \( \rho(y) = q \neq r = \rho(x) \) holds. The definition of \( \Sigma_2 \) thus implies

\[
M_q = \{ y \in \text{dom} \Sigma_2 | \rho(y) = q, \exists b. \Sigma_2(y) = (1, b) \}
\]

so fact (4) yields \( |Q(q)| = |M_q| \).

5. For each \( q \in Q \) and \( Q(q) = a_1 \ldots a_m \) we have to show that there are \( z_1, \ldots, z_m \in \text{Var}_p \) and \( J_1, \ldots, J_m \in \text{Instr}_p \) such that for all \( 1 \leq i \leq m \)

- \( J_i = \text{fst}(B_i) \) for some \( 0 \leq l_1 \leq \ldots \leq l_m < n + 1 \)
- \( z_i \in \text{defs}(J_i) \)
- \( \rho(z_i) = q \)
- \( \Sigma(z_i) = (1, a_i) \)
- \( z_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_h) \) and all \( l_i < h < n + 1 \)

and \( \cup_{i=1}^m z_i = M_q \) for the \( M_q \) as defined above.

Defining \( z_i = x_i \) and \( J_i = I_i \) from fact (5) for \( 1 \leq i \leq m \) yields

- \( J_i = I_i = \text{fst}(B_i) \) for \( j_i = l_i \), hence \( 1 \leq l_1 \leq \ldots \leq l_m < n < n + 1 \)
- \( z_i = x_i \in \text{defs}(I_i) = \text{defs}(J_i) \)
- \( \rho(z_i) = \rho(x_i) = q \)
- each \( z_i \) fulfils \( z_i \neq x \) due to \( \rho(z_i) = q \neq r = \rho(x) \), hence \( \Sigma_2(z_i) = \Sigma_1(z_i) = \Sigma(x_i) = (1, a_i) \).
- for \( l_i < h < n \) we obtain \( z_i = x_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_h) \)
by fact (5), and for \( n \leq h < n + 1 \) we have \( h = n \), so
\[
\begin{align*}
z_i = x_i \notin \{x\} &= \text{defs}(\text{ass}) \cup \text{uses}(\text{ass}) \\
&= \text{defs}(\text{fst}(B_h)) \cup \text{uses}(\text{fst}(B_h))
\end{align*}
\]
and \( \bigcup_{i=1}^{m} z_i = \bigcup_{i=1}^{m} x_i = M_q \) using the above equality
\[
\{y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\} = M_q.
\]

**Case** \( \rho(x) = q \). The definition of \([\text{ass}]_{\rho} \) yields \([\text{ass}]_{\rho} = \text{id}^{\rho(q) q q}\). Lemma 16 implies that \( \rho \) is compatible with the solution \( \text{In/Out} \) of \( P \), so \( x \) is forwardable to \( \gamma(q) \). The assumption \( C \models_{\alpha_{\rho}} S \) implies that there is an \( n \) with
1. \( S_0 \Rightarrow_{\alpha_{\rho}} S \), and in particular \( \text{ass} = \text{fst}(B_n) \).
2. if \( y \in \text{dom} \Sigma, \Sigma(y) = (\top, b) \) and \( \rho(y) = r \) then \( r \in \text{dom} R \).
3. for \( r \in \text{dom} R \) there is at most one \( y \in \text{liveIn}(\text{ass}) \) with \( \rho(y) = r \), and any such \( y \) fulfils \( \Sigma(y) = (\top, R(r)) \).
4. for each \( q' \in Q, |Q(q')| = |\{y \in \text{dom} \Sigma | \rho(y) = q', \exists b. \Sigma(y) = (1, b)\}| \) holds.
5. for each \( q' \in Q \) and \( Q(q') = a_1 \ldots a_m \) there are \( x_1, \ldots, x_m \in \text{Var}_P \) and \( I_1, \ldots, I_m \in \text{Instrs}_P \) such that for all \( 1 \leq i \leq m \)
   - \( I_i = \text{fst}(B_{j_i}) \) for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \)
   - \( x_i \in \text{defs}(I_i) \)
   - \( \rho(x_i) = q' \)
   - \( \Sigma(x_i) = (1, a_i) \)
   - \( x_i \notin \text{defs}(J) \cup \text{uses}(J) \) for \( J = \text{fst}(B_k) \) and all \( j_i < k < n \)
and \( \bigcup_{i=1}^{m} x_i = \{y \in \text{dom} \Sigma | \rho(y) = q', \exists b. \Sigma(y) = (1, b)\} \).

**Step 1.** We first show that there is a \( D \) with \( C \models_{\alpha_{\rho}} D \).

From fact (1) and the assumption \( S \Rightarrow_{\alpha_{\rho}} T \) we obtain \( S_0 \Rightarrow_{\alpha_{\rho}} T \). We noticed \( \Sigma(x) = (l, a) \) for some \( l \geq 1 \) above, and Lemma 18 now implies \( l = 1 \) due to \( \rho(x) = q \), hence \( \Sigma(x) = (1, a) \) and \( \Sigma_1 = \Sigma[x \mapsto (0, a)] \). Therefore, \( x \in M_q \) where
\[
M_q = \{y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}.
\]
By fact (4), \( Q(q) \) has thus at least one element, i.e. there is an \( m \geq 1 \) with \( Q(q) = a_1 \ldots a_m \). By fact (5), there are \( x_1, \ldots, x_m \) and \( I_1, \ldots, I_m \) such that \( M_q = \bigcup_{i=1}^{m} x_i \) and

- \( I_i = \text{fst}(B_{j_i}) \) for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \)
- \( x_i \in \text{defs}(I_i) \)
- \( \rho(x_i) = q \)
- \( \Sigma(x_i) = (1, a_i) \)
- \( x_i \notin \text{defs}(J) \cup \text{uses}(J) \) for \( J = \text{fst}(B_{k}) \) and all \( j_i < k < n \).

\( x \in M_q \) implies that \( x = x_i \) holds for some \( 1 \leq i \leq m \), and we now show \( x = x_1 \), i.e. \( a = a_1 \).

Suppose \( x \neq x_1 \). The sequence of instructions

\[
\text{fst}(B_{j_1}), \text{fst}(B_{j_1+1}), \ldots, \text{fst}(B_{j_2}), \ldots, \text{fst}(B_{j_m}), \ldots, \text{ass} = \text{fst}(B_n)
\]

is a path \( \pi_0 \in P \) since \( 0 \leq j_1 \leq \ldots \leq j_m < n \) and IL execution \( \Rightarrow_{\alpha_p} \) respects the relation \( \text{succ} \) of \( P \) by Proposition 31 and Lemma 8. Furthermore, fact (5) yields \( x_1 \in \text{defs}(I_1) = \text{defs}(B_{j_1}) \) and \( x_1 \notin \text{defs}(K) \cup \text{uses}(K) \) for all \( K \neq I_1 \) on \( \pi_0 \). By Lemma 19, \( \pi_0 \) is hence the prefix of an \( x_1 \)-path \( \pi \).

Since \( x = x_i \) holds for some \( 1 \leq i < m \), \( x_i \in \text{defs}(I_i) \) and \( x \in \text{uses}(\text{ass}) \),

\[ \tau = I_1, \ldots, I_m, \ldots, \text{ass} \]

is an \( x \)-path – fact (5) guarantees that \( \text{ass} \) is the first use of \( x_i \) following \( I_j \).
Thus \( \pi \) contains \( \tau \). Contradiction: \( \rho(x) = q = \rho(x_1) \) holds by definition of \( x_1 \), but the nodes \( x \) and \( x_1 \) are adjacent in \( G_{p(q)} \) as \( \pi \) contains \( \tau \), so \( \rho(x) \neq \rho(x_1) \) as \( \rho \) is an allocation for program \( P \). Therefore, \( x = x_1 \) holds.

For the unique \( D = (P, R, M, \bar{k}) \) with \( P = Q[q \mapsto a_2 \ldots a_m a] \) we thus have \( C \xrightarrow{\|\text{ass}\|_{\rho}} D \).

**Step 2.** Second, we show that \( D \models_{\alpha_p} \top \) holds.

1. We have \( S_0 \Rightarrow^{n+1}_{\alpha_p} \top \) as noticed above.
2. For \( y \in \text{dom } \Sigma_2 \), \( \Sigma_2(y) = (\top, b) \) and \( \rho(y) = r, y \neq x \) follows due to \( \rho(x) = q \). Hence \( y \in \text{dom } \Sigma \) with \( (\top, b) = \Sigma_2(y) = \Sigma_1(y) = \Sigma(x) \), so fact (2) implies \( r \in \text{dom } R \).
3. For $r \in dom \mathbf{R}$, $|\mathcal{I}| > 0$ and variables $y \neq z$ with $\rho(y) = r = \rho(z)$ and $\{y, z\} \subseteq liveIn(fst(\mathcal{I}))$, we have $\{y, z\} \subseteq liveOut(ass)$ by equations (6.9) and $\text{succ}(ass) = \{fst(\mathcal{I})\}$. Consequently, $y$ and $z$ are adjacent in $G_{\text{Reg}}$ so $\rho(y) \neq \rho(z)$.

For $y \in liveIn(fst(\mathcal{I}))$ with $\rho(y) = r$, $y \neq x$ holds due to $\rho(x) = q$, and $y \in liveIn(fst(\mathcal{I}))$ implies $y \in liveOut(ass) \subseteq liveIn(ass)$ by equations (6.9) as $y \notin \{x\} = \text{defs}(ass)$. Hence, $\Sigma_2(y) = \Sigma(y) = (1, R(r))$ by fact (3).

4. For $q' \in Q$ there are two cases.

- If $q \neq q'$ then any $y \in dom \Sigma$ with $\rho(y) = q'$ and $\Sigma(y) = (1, b)$ for some $b$ fulfils $x \neq y$ due to $\rho(x) = q$, hence

$$|P(q')| = |Q(q')|$$

$$= |\{y \in dom \Sigma | \rho(y) = q', \exists b. \Sigma(y) = (1, b)\}|$$

$$= |\{y \in dom \Sigma | y \neq x, \rho(y) = q', \exists b. \Sigma(y) = (1, b)\}|$$

$$= |\{y \in dom \Sigma_1 | y \neq x, \rho(y) = q', \exists b. \Sigma_1(y) = (1, b)\}|$$

$$= |\{y \in dom \Sigma_2 | y \neq x, \rho(y) = q', \exists b. \Sigma_2(y) = (1, b)\}|$$

$$= |\{y \in dom \Sigma_2 | \rho(y) = q', \exists b. \Sigma_2(y) = (1, b)\}|$$

holds by fact (4) and the definitions of $P$ and $\Sigma_2$.

- If $q = q'$ then fact (4), $\rho(x) = q$ and $\Sigma(x) = (1, a)$ imply

$$|P(q')| = |Q(q)|$$

$$= |\{y \in dom \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}|$$

$$= |\{x \} \cup \{y \in dom \Sigma | x \neq y, \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}|$$

$$= |\{x \} \cup \{y \in dom \Sigma_1 | x \neq y, \rho(y) = q, \exists b. \Sigma_1(y) = (1, b)\}|$$

$$= |\{x \} \cup \{y \in dom \Sigma_2 | x \neq y, \rho(y) = q, \exists b. \Sigma_2(y) = (1, b)\}|$$

$$= |\{y \in dom \Sigma_2 | \rho(y) = q, \exists b. \Sigma_2(y) = (1, b)\}|$$
where the last step follows from $\Sigma_2(x) = (\alpha_p(ass,x),a)$ where $\alpha_p(ass,x) = 1$ holds by Definition 49 due to $x \in \text{defs}(ass)$ and $\rho(x) = q$.

5. For $q \in Q$ and $P(q') = b_1 \ldots b_\mu$ we have to show that there are $z_1, \ldots, z_\mu$ and $J_1, \ldots, J_\mu$ such that $\cup_{i=1}^{\mu} z_i = \{ y \in \text{dom} \Sigma_2 \mid \rho(y) = q', \exists b. \Sigma_2(y) = (1, b) \}$ and for all $1 \leq i \leq \mu$ the following properties hold.

- $J_i = \text{fst}(B_i)$ for some $0 \leq l_1 \leq \ldots \leq l_\mu < n + 1$
- $z_i \in \text{defs}(J_i)$
- $\rho(z_i) = q'$
- $\Sigma_2(z_i) = (1, b_i)$
- $z_i \notin \text{defs}(K) \cup \text{uses}(K)$ for $K = \text{fst}(B_i)$ and all $l_i < h < n + 1$

Again, there are two cases.

- If $q \neq q'$ then we take (for all $i$) $z_i = x_i$ and $J_i = I_i$ from fact (5), and obtain
  - $J_i = I_i = \text{fst}(B_i) = \text{fst}(B_{i_1})$ for $j_i = l_i$, hence $0 \leq l_1 \leq \ldots l_\mu < n < n + 1$
  - $z_i = x_i \in \text{defs}(I_i) = \text{defs}(J_i)$
  - $\rho(z_i) = \rho(x_i) = q'$
  - Each $z_i$ fulfills $z_i \neq x$ due to $\rho(z_i) = q' \neq q = \rho(x)$, hence $\Sigma_2(z_i) = \Sigma_1(z_i) = \Sigma(z_i) = (1, b_i)$
  - $z_i = x_i \notin \text{defs}(K) \cup \text{uses}(K)$ for $K = \text{fst}(B_i)$ and all $l_i < h < n$ by fact (5), and $z_i = x_i \notin \text{defs}(\text{fst}(B_{n})) \cup \text{uses}(\text{fst}(B_n)) = \text{defs}(ass) \cup \text{uses}(ass) = \{ x \}$ holds due to $z_i \neq x$.

- If $q = q'$ then $\mu = m$ and
  \[
  b_i = \begin{cases} 
  a_{i+1} & \text{for } i < m \\
  a_1 & \text{for } i = m
  \end{cases}
  \]

follow, and we define
\[
(z_i, J_i) = \begin{cases} 
(x_{i+1}, l_{i+1}) & \text{for } i < m \\
(x, ass) & \text{for } i = m
\end{cases}
\]
where the $x_i$ and $l_i$ are those from fact (5). Then (see Figure 7.2)

$$\bigcup_{i=1}^{m} z_i = \{z_m\} \cup_{i=1}^{m-1} z_i$$

$$= \{x\} \cup_{i=1}^{m-1} x_{i+1}$$

$$= \{x_1\} \cup_{i=1}^{m-1} x_{i+1}$$

$$= \bigcup_{i=1}^{m} x_i$$

using $x = x_1$ from above, and with the definition

$$l_i = \begin{cases} j_{i+1} & \text{for } i < m \\ n & \text{for } i = m \end{cases}$$

we obtain

- $0 < j_1 < j_2 = l_1 < j_3 = l_2 < \ldots < j_m = l_{m-1} < n = l_m < n + 1,$
- $J_i = I_{i+1} = fst(B_{j_{i+1}}) = fst(B_{l_{i+1}})$ for $1 \leq i < m$ and $J_m = ass = fst(B_n) = fst(B_{l_m})$
- $z_i = x_{i+1} \in \text{defs}(I_{i+1}) = \text{defs}(J_i)$ for $1 \leq i < m$ and $z_m = x \in \text{defs}(ass) = \text{defs}(J_m)$
- $\rho(z_i) = \rho(x_{i+1}) = q' = q$ for $1 \leq i < m$ and $\rho(z_m) = \rho(x) = q$
- for $1 \leq i \neq i' \leq m$ we have $z_i \neq z_{i'}$, because

$$\bigcup_{i=1}^{m} z_i = \bigcup_{i=1}^{m} x_i$$

$$= \{y \in \text{dom } \Sigma \mid \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}$$

$$= \{y \in \text{dom } \Sigma_2 \mid \rho(y) = q, \exists b. \Sigma_2(y) = (1, b)\}$$
where the last equality was shown in step (4). For \( i \neq m \) we thus obtain \( z_i \neq x \) and

\[
\Sigma_2(z_i) = \Sigma_1(z_i) = \Sigma(z_i) = \Sigma(x_{i+1}) = (1, a_{i+1}) = (1, b_i),
\]

while \( i = m \) yields \( \Sigma_2(z_m) = \Sigma_2(x) = (1, a) = (1, a_1) = (1, b_m) \).

- For \( 1 \leq i < m \), \( z_i = x_{i+1} \notin \text{defs}(K) \cup \text{uses}(K) \) holds by fact (5) for \( K = \text{fst}(B_k) \) and all \( j_{i+1} < k < n \), i.e. \( l_i < k < n \), and for \( k = n \), \( z_i \notin \{x\} = \text{defs}(\text{fst}(B_k)) \cup \text{uses}(\text{fst}(B_k)) \) holds due to \( z_i \neq x \) shown above. For \( i = m \), the requirement reads \( z_m = x \notin \text{defs}(\text{fst}(B_k)) \cup \text{uses}(\text{fst}(B_k)) \) for all \( n = l_m = l_i < k < n + 1 \) and is thus trivially fulfilled as no such \( k \) exists.

\( \square \)

A similar result relates IL states and ALEF configurations for jump instructions.

**Proposition 40.** Let \( C, S \) and \( B \) as before. If \( \text{fst}(B) \) is a jump instruction \( \text{jump} \), \( \bar{k} = \text{nil} \), \( C \models_{\alpha_p} S \) and \( S \Rightarrow_{\alpha_p} T \) then there is a unique \( D \) such that \( C \xrightarrow{[\text{jump}]_{\rho}} D \). Furthermore, \( D \models_{\alpha_p} T \) and \( D = (\omega, M, 1) \) hold where \( T = (\omega, N(1)) \).

**Proof.** There are two cases, depending on the form of \( \text{jump} \). In both cases, the proof follows the same style as the proof of the previous proposition.

**Case** \( \text{jump} = [h] \text{jmp } m_1 \). The definition of \( [\_]_{\rho} \) yields

\[
[jump]_{\rho} = [h] \text{ jmp } m_1,
\]

and for \( S \Rightarrow_{\alpha_p} T \), rule JMP implies \( T = (\Sigma, h, N(m_1)) \).

The assumption \( C \models_{\alpha_p} S \) implies that there is an \( n \) with

1. \( S_0 \Rightarrow_{\alpha_p} S \), and in particular \( \text{jump} = \text{fst}(B_n) = \text{fst}(B) \).
2. if \( y \in \text{dom } \Sigma, \Sigma(y) = (T, b) \) and \( \rho(y) = \rho' \) then \( \rho' \in \text{dom } R \).
3. for \( \rho' \in \text{dom } R \) there is at most one \( y \in \text{liveIn}(\text{jump}) \) with \( \rho(y) = \rho' \), and any such \( y \) fulfils \( \Sigma(y) = (T, R(\rho')) \).
(4) for each \( q \in Q \), \( |Q(q)| = |\{ y \in dom \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}| \) holds.

(5) for each \( q \in Q \) and \( Q(q) = a_1 \ldots a_m \) there are variables \( x_1, \ldots, x_m \in \text{Var}_p \) and \( I_1, \ldots, I_m \in \text{Instrs}_p \) such that for all \( 1 \leq i \leq m \)

- \( I_i = \text{fst}(B_{j_i}) \) for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \)
- \( x_i \in \text{defs}(I_i) \)
- \( \rho(x_i) = q \)
- \( \Sigma(x_i) = (1, a_i) \)
- \( x_i \notin \text{defs}(J) \cup \text{uses}(J) \) for \( J = \text{fst}(B_k) \) and all \( j_i < k < n \)

and \( \bigcup_{i=1}^{m} x_i = \{ y \in dom \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\} \).

For the unique \( D = (Q, R, M, m_1) \), \( C \xrightarrow{[\text{jump}]} \rho \rightarrow D \) holds, and the claim regarding \( t \) is fulfilled. For showing \( D \models_{\alpha_p} T \), we observe that the first two components \( Q \) and \( R \) of \( D \) agree with those of \( C \), and that the first component \( \Sigma \) of \( T \) agrees with the first component of \( S \), so all five claims follow from facts (1) to (5).

**Case \( \text{jump} = [h|\text{if } x \text{ m}_1 \text{ m}_2] \).** The definition of \( [\_]_\rho \) yields

\[
[jump]_\rho = [h|\text{if } \rho(x) \text{ m}_1 \text{ m}_2],
\]

and for \( S \models_{\alpha_p} T \), rules IF-T and IF-F imply that \( \Sigma, x \downarrow \Sigma_1, a \) holds for some \( a \) with \( T = (\Sigma_1, h, N(m_1)) \) if \( a = 0 \) and \( T = (\Sigma_1, h, N(m_2)) \) if \( a \neq 0 \). The definition of \( \downarrow \) implies \( \Sigma(x) = (l, a) \) for some \( l \sqsupseteq 1 \) and \( \Sigma_1 = \Sigma[x \mapsto (l \sqsupseteq 1, a)] \), and the assumption \( C \models_{\alpha_p} S \) implies that there is an \( n \) with

1. \( S_0 \models_{\alpha_p} S \), and in particular \( \text{jump} = \text{fst}(B_n) = \text{fst}(B) \).
2. if \( y \in dom \Sigma \), \( \Sigma(y) = (\top, b) \) and \( \rho(y) = r' \) then \( r' \in dom R \).
3. for \( r' \in dom R \) there is at most one \( y \in \text{liveIn}(\text{jump}) \) with \( \rho(y) = r' \), and any such \( y \) fulfils \( \Sigma(y) = (\top, R(r')) \).
4. for each \( q \in Q \), \( |Q(q)| = |\{ y \in dom \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}| \) holds.
5. for each \( q \in Q \) and \( Q(q) = a_1 \ldots a_m \) there are variables \( x_1, \ldots, x_m \in \text{Var}_p \) and \( I_1, \ldots, I_m \in \text{Instrs}_p \) such that for all \( 1 \leq i \leq m \)

- \( I_i = \text{fst}(B_{j_i}) \) for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \).
\[ \begin{align*}
&\bullet x_i \in \text{defs}(I_i) \\
&\bullet \rho(x_i) = q \\
&\bullet \Sigma(x_i) = (1, a_i) \\
&\bullet x_i \notin \text{defs}(J) \cup \text{uses}(J) \text{ for } J = \text{fst}(B_k) \text{ and all } j_i < k < n \end{align*} \]

and \( \bigcup_{i=1}^{m} x_i = \{ y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b) \} \).

Again, there are two cases, depending on the value of \( \rho(x) \).

**Case** \( \rho(x) = r \). The proof proceeds in two steps.

**Step 1.** We first show that there is a D with \( C \xrightarrow{\text{jump}} \rho D \).

From \( S_0 \Rightarrow_{\alpha_0}^n S \) (fact (1)) and \( S \Rightarrow \alpha_0 T \) we obtain \( S_0 \Rightarrow_{\alpha_0}^{n+1} T \).

We noticed \( \Sigma(x) = (l, a) \) above, so Lemma 18 implies \( l = \top \), hence \( \Sigma_1 = \Sigma \), and fact (2) implies \( r \in \text{dom} R \). Consequently, fact (3) yields \( \Sigma(x) = (\top, R(r)) \) as \( x \in \text{uses}(\text{jump}) \subseteq \text{liveIn}(\text{jump}) \) holds. Uniqueness of the map \( R \) and \( \Sigma(x) = (l, a) \) thus result in \( R(r) = a \).

Consequently, for \( D = (Q, R, M, t) \) where

\[ t = \begin{cases} 
  m_1 & \text{if } a = 0 \\
  m_2 & \text{if } a \neq 0 
\end{cases} \]

we obtain \( C \xrightarrow{\text{jump}} \rho D \), and the value in the last position agrees with the basic block \( M(t) \) in \( T \).

**Step 2.** We now show \( D \models_{\alpha_0} T \).

1. \( S_0 \Rightarrow_{\alpha_0}^{n+1} T \) was shown above

2. For \( y \in \text{dom} \Sigma_1, \Sigma_1(y) = (\top, b) \) and \( \rho(y) = r' \) there are two cases.
   - If \( x = y \) then \( r' = \rho(y) = \rho(x) = r \), hence \( r' \in \text{dom} R \)
   - If \( x \neq y \) then \( y \in \text{dom} \Sigma_1 \) implies \( y \in \text{dom} \Sigma \) with \( \Sigma(y) = \Sigma_1(y) = (\top, b) \), so fact (2) implies \( r' \in \text{dom} R \).

   In both cases, \( r' \in \text{dom} R \) holds.

3. For \( r' \in \text{dom} R \) suppose there are variables \( y \neq z \in \text{liveIn}(\text{fst}(M(t))) \)
   with \( \rho(y) = r' = \rho(z) \). Then equations (6.9) and \( \text{fst}(M(t)) \in \text{succ}(\text{jump}) \)
yield \( \{ y, z \} \subseteq \text{liveOut}(\text{jump}) \). Consequently, \( y \) and \( z \) are adjacent in \( G_{\text{Reg}} \) so \( \rho(y) \neq \rho(z) \) as \( \rho \) is an allocation for \( P \). This contradicts \( \rho(y) = r' = \rho(z) \), hence at most one \( y \in \text{liveIn}(\text{fst}(\mathbb{N}(1))) \) with \( \rho(y) = r' \) exists.

For showing that any such \( y \) fulfils \( \Sigma_1(y) = (\top, R(r')) \), there are two cases.

- if \( y = x \) then \( r' = r \) and \( \Sigma_1(y) = \Sigma_1(x) = (\top, R(r)) \) follows
- if \( y \neq x \) then equations (6.9) imply

\[
y \in \text{liveIn}(\text{fst}(\mathbb{N}(1))) \subseteq \text{liveOut}(\text{jump}),
\]

hence \( y \in \text{liveIn}(\text{jump}) \) holds due to \( y \not\in \emptyset = \text{defs}(\text{jump}) \). By fact (3), \( \Sigma_1(y) = \Sigma(y) = (\top, R(r')) \) follows.

4. For \( q \in Q \), we have to show

\[
|Q(q)| = |\{ y \in \text{dom} \Sigma_1 | \rho(y) = q, \exists b. \Sigma_1(y) = (1, b) \}|.
\]

Abbreviating \( \{ y \in \text{dom} \Sigma_1 | \rho(y) = q, \exists b. \Sigma_1(y) = (1, b) \} \) by \( M_q \), any \( y \in M_q \) fulfils \( y \neq x \) as \( \rho(y) = q \neq r = \rho(x) \) holds. The definition of \( \Sigma_1 \) thus implies

\[
M_q = \{ y \in \text{dom} \Sigma_1 | \rho(y) = q, \exists b. \Sigma_1(y) = (1, b) \}
= \{ y \in \text{dom} \Sigma_1 | y \neq x, \rho(y) = q, \exists b. \Sigma_1(y) = (1, b) \}
= \{ y \in \text{dom} \Sigma | y \neq x, \rho(y) = q, \exists b. \Sigma(y) = (1, b) \}
= \{ y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b) \}
\]

so fact (4) yields \( |Q(q)| = |M_q| \).

5. For each \( q \in Q \) and \( Q(q) = a_1 \ldots a_m \) we have to show that there are \( z_1, \ldots, z_m \in \text{Var}_P \) and \( J_1, \ldots, J_m \in \text{Instrs}_P \) such that for all \( 1 \leq i \leq m \)

- \( J_i = \text{fst}(B_{l_i}) \) for some \( 0 \leq l_1 \leq \ldots \leq l_m < n + 1 \)
- \( z_i \in \text{defs}(J_i) \)
- \( \rho(z_i) = q \)
- \( \Sigma(z_i) = (1, a_i) \)
• \( z_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_h) \) and all \( l_i < h < n + 1 \)

and \( \bigcup_{i=1}^{m} z_i = M_q \) for the \( M_q \) as defined above.

Defining \( z_i = x_i \) and \( J_i = J_l \) from fact (5) for \( 1 \leq i \leq m \) yields

• \( J_i = J_l = \text{fst}(B_{j_l}) \) for \( j_i = l_i \), and \( 1 \leq l_i \leq \ldots \leq l_i < n < n + 1 \) holds

• \( z_i = x_i \in \text{defs}(I_i) = \text{defs}(J_i) \)

• \( \rho(z_i) = \rho(x_i) = q \)

• each \( z_i \) fulfils \( z_i \neq x \) due to \( \rho(z_i) = q \neq r = \rho(x) \). Therefore,

\[
\Sigma_2(z_i) = \Sigma_1(z_i) = \Sigma(z_i) = \Sigma(x_i) = (1, a_i).
\]

• for \( l_i < h < n \) we obtain \( z_i = x_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_h) \)

by fact (5), and for \( n \leq h < n + 1 \) we have \( h = n \), so

\[
\Sigma(x_i) = (l, a)
\]

and \( \bigcup_{i=1}^{m} z_i = \bigcup_{i=1}^{m} x_i = M_q \) using the above equality

\[
\{y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\} = M_q.
\]

**Case** \( \rho(x) = q \). Again, the proof proceeds in two steps.

**Step 1.** We first show that there is a \( D \) with \( C \xrightarrow{[\text{jump}]_{\rho}} D \).

From \( S_0 \Rightarrow_{\alpha}^n S \) (fact (1)) and \( S \Rightarrow_{\alpha} T \) we obtain \( S_0 \Rightarrow_{\alpha_{\rho}}^{n+1} T \).

We noticed \( \Sigma(x) = (l, a) \) above, so Lemma 18 implies \( l = 1 \), hence \( \Sigma_1 = \Sigma[x \mapsto (0, a)] \), and

\[
x \in M_q = \{y \in \text{dom} \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}.
\]

By fact (4), \( |Q(q)| > 1 \) follows, and for \( Q(q) = a_1 \ldots a_m \) there are (by fact (5)) \( x_1 \ldots x_m \) and \( I_1 \ldots I_m \) such that \( I_i = \text{fst}(B_{j_i}) \) holds for some \( 0 \leq j_1 \leq \ldots \leq j_m < n \) and all properties in fact (5) are fulfilled. Therefore, \( a = a_i \) holds for some \( 1 \leq i \leq m \).

Similarly to the proof of the case \( 'x = x' \), we show \( x = x_1 \), i.e. \( a = a_1 \).

Suppose \( x \neq x_1 \). Then the sequence

\[
\text{fst}(B_{j_1}) \ldots \text{fst}(B_{j_2}) \ldots \text{fst}(B_{j_m}) \ldots \text{jump} = \text{fst}(B_n)
\]
Chapter 7. Translating intermediate programs into ALEF

is a path \( \pi_0 \) in \( P \). Also, for \( 1 \leq k \), fact (5) implies \( x_1 \in \text{defs}(\pi_0(1)) \) and \( x_1 \notin \text{defs}(\pi_0(k)) \cup \text{uses}(\pi_0(k)) \). By Lemma 19, \( \pi_0 \) is hence the prefix of an \( x_1 \)-path \( \pi \). Since \( x = x_i \) holds for some \( 1 \leq i < n \), \( x_i \in \text{defs}(I_i) \) and \( x \in \text{uses}(\text{jump}) \), the path

\[
\tau = I_1, \ldots, I_m, \ldots, \text{jump}
\]

is an \( x \)-path – fact (5) guarantees that \( \text{jump} \) is the first use of \( x_i \) following \( I_i \). Thus \( \pi \) contains \( \tau \), in contradiction to \( \rho \vdash x_1 \) and \( \rho \vdash x \). By Lemma 19, \( \pi_0 \) is hence the prefix of an \( x_1 \)-path \( \pi \). Since \( x = x_i \) holds for some \( 1 \leq i < n \), \( x_i \in \text{defs}(I_i) \) and \( x \in \text{uses}(\text{jump}) \), the path

\[
\tau = I_1, \ldots, I_m, \ldots, \text{jump}
\]

is an \( x \)-path – fact (5) guarantees that \( \text{jump} \) is the first use of \( x_i \) following \( I_i \). Thus \( \pi \) contains \( \tau \), in contradiction to \( \rho \vdash x_1 \) and \( \rho \vdash x \). Therefore, \( x = x_1 \) holds.

For \( D = (P, R, M, I) \) where

\[
\begin{align*}
\tau &= \begin{cases} 
m_1 & \text{if } a = 0 \\
m_2 & \text{if } a \neq 0 
\end{cases}
\end{align*}
\]

and \( P = Q[q \mapsto a_2 \ldots a_m] \) we obtain \( C \models [\text{jump}]_{\rho} D \), and the value \( \tau \) in the last component of \( D \) coincides with the component \( N(\tau) \) in \( T \).

**Step 2.** We now show \( D \models_{\alpha_p} T \).

1. \( S_0 \Rightarrow_{\alpha_p} T \) was shown above

2. For \( y \in \text{dom } \Sigma_1 \), \( \Sigma_1(y) = (\top, b) \) and \( \rho(y) = r' \), \( y \neq x \) follows due to \( \rho(x) = q \), so \( y \in \text{dom } \Sigma_1 \) implies \( y \in \text{dom } \Sigma \) with \( \Sigma(y) = \Sigma_1(y) = (\top, b) \), and fact (2) implies \( r' \in \text{dom } R \).

3. For \( r' \in \text{dom } R \) suppose there are variables \( y \neq z \in \text{liveIn}(\text{fst}(N(\tau))) \) with \( \rho(y) = r' = \rho(z) \). Then equations (6.9) and \( \text{fst}(N(\tau)) \in \text{succ}(\text{jump}) \) yield \( \{y, z\} \subseteq \text{liveOut}(\text{jump}) \). Consequently, \( y \) and \( z \) are adjacent in \( \mathcal{G}_{\text{Reg}} \) so \( \rho(y) \neq \rho(z) \) as \( \rho \) is an allocation for \( P \). This contradicts \( \rho(y) = r' = \rho(z) \), hence at most one \( y \in \text{liveIn}(\text{fst}(N(\tau))) \) exists with \( \rho(y) = r' \).

Again, any such \( y \) fulfils \( y \neq x \) due to \( \rho(y) = r' \neq q = \rho(x) \) and equations (6.9) and \( 0 = \text{defs}(\text{jump}) \) imply

\[
y \in \text{liveIn}(\text{fst}(N(\tau))) \subseteq \text{liveOut}(\text{jump}) \subseteq \text{liveIn}(\text{jump})
\]

By fact (3), \( \Sigma_1(y) = \Sigma(y) = (\top, R(r')) \) follows.
4. For \( q' \in Q \), we show
\[
|P(q')| = |\{y \in \text{dom } \Sigma_1 | \rho(y) = q', \exists b. \Sigma_1(y) = (1, b)\}|
\]
- If \( q \neq q' \) then any \( y \in \text{dom } \Sigma \) with \( \rho(y) = q' \) and \( \Sigma(y) = (1, b) \) for some \( b \) fulfills \( x \neq y \) due to \( \rho(x) = q \), hence
\[
|P(q')| = |Q(q')| - 1
\]
\[
= |\{y \in \text{dom } \Sigma | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}| - 1
\]
\[
= |\{x \in \text{dom } \Sigma | x \neq y, \rho(y) = q, \exists b. \Sigma(y) = (1, b)\}| - 1
\]
\[
= |\{y \in \text{dom } \Sigma_1 | x \neq y, \rho(y) = q, \exists b. \Sigma_1(y) = (1, b)\}|
\]
\[
= |\{y \in \text{dom } \Sigma_1 | \rho(y) = q, \exists b. \Sigma_1(y) = (1, b)\}|
\]

5. For \( q' \in Q \) and \( P(q') = b_1 \ldots b_\mu \) we have to show that there are \( z_1, \ldots, z_\mu \) and \( J_1, \ldots, J_\mu \) such that \( \cup_{i=1}^\mu z_i = \{y \in \text{dom } \Sigma_1 | \rho(y) = q', \exists b. \Sigma_1(y) = (1, b)\} \) and for all \( 1 \leq i \leq \mu \) the following properties hold.
- \( J_i = \text{fst}(B_{l_i}) \) for some \( 0 \leq l_1 \leq \ldots \leq l_\mu < n + 1 \)
- \( z_i \in \text{defs}(J_i) \)
- \( \rho(z_i) = q' \)
- \( \Sigma_1(z_i) = (1, b_i) \)
- \( z_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_g) \) and all \( l_i < g < n + 1 \)

Again, there are two cases.
• If \( q \neq q' \) then we take (for all \( i \)) \( z_i = x_i \) and \( J_i = I_i \) from fact (5), and obtain

- \( J_i = I_i = \text{fst}(B_{j_i}) = \text{fst}(B_{l_i}) \) for \( j_i = l_i \), hence \( 0 \leq l_1 \leq \ldots \leq l_{\mu} < n < n + 1 \)
- \( z_i = x_i \in \text{defs}(I_i) = \text{defs}(J_i) \)
- \( \rho(z_i) = \rho(x_i) = q' \)
- each \( z_i \) fulfils \( z_i \neq x \) due to \( \rho(z_i) = q' \neq q = \rho(x) \), hence \( \Sigma_1(z_i) = \Sigma(x_i) = (1, b_i) \)
- \( z_i = x_i \notin \text{defs}(K) \cup \text{uses}(K) \) for \( K = \text{fst}(B_g) \) and all \( l_i < g < n \) by fact (5) and \( z_i = x_i \notin \text{defs}(\text{fst}(B_n)) \cup \text{uses}(\text{fst}(B_n)) = \{x\} \) holds due to \( z_i \neq x \).

• If \( q = q' \) then \( \mu = m - 1 \), and \( b_i = a_{i+1} \) holds for \( 1 \leq i < m \) (where \( m \) is the length of \( Q(q) \)), and we define

\[
(z_i, J_i) = (x_{i+1}, I_{i+1})
\]

(again for \( 1 \leq i < m \)), where the \( x_i \) and \( I_i \) are those from fact (5). Then

\[
\bigcup_{i-1}^{m} z_i = \bigcup_{i-2}^{m} x_i = \{y \in \text{dom } \Sigma_1 | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\},
\]

and with the definition

\[
I_i = j_{i+1}
\]

we obtain

- \( 0 < j_1 < j_2 = l_1 < j_3 = l_2 < \ldots < j_m = l_{m-1} < n < n + 1, J_i = I_{i+1} = \text{fst}(B_{j_{i+1}}) = \text{fst}(B_{l_i}) \) for \( 1 \leq i < m \)
- \( z_i = x_{i+1} \in \text{defs}(I_{i+1}) = \text{defs}(J_i) \) for \( 1 \leq i < m \)
- \( \rho(z_i) = \rho(x_{i+1}) = q' = q \) for \( 1 \leq i < m \)
- from

\[
\bigcup_{i-1}^{m} z_i = \bigcup_{i-2}^{m} x_i
\]

\[
= \{y \in \text{dom } \Sigma_1 | \rho(y) = q, \exists b. \Sigma(y) = (1, b)\} \setminus \{x\}
\]

\[
= \{y \in \text{dom } \Sigma_1 | \rho(y) = q, \exists b. \Sigma_2(y) = (1, b)\}
\]
we obtain for $z_i \neq z'_i$ for $1 \leq i \neq i' \leq m$, hence $z_i \neq x$ follows for $i \neq \mu$, and we obtain

$$\Sigma_1(z_i) = \Sigma(x_{i+1}) = (1, a_{i+1}) = (1, b_i)$$

for $1 \leq i \leq \mu = m - 1$

- for $1 \leq i \leq \mu = m - 1$, $z_i = x_{i+1} \notin \text{defs}(K) \cup \text{uses}(K)$ holds by fact (5) for $K = \text{fst}(B_k)$ and all $j_{i+1} < k < n$, i.e. $l_i < k < n$, and for $k = n$, $z_i \notin \{x\} = \text{defs}(\text{fst}(B_k)) \cup \text{uses}(\text{fst}(B_k))$ follows from $z_i \neq x$.

Consequently, the execution of basic blocks in IL and ALEF agree.

**Proposition 41.** For $B \in \text{Blocks}_p$ let $S = (\Sigma, n, B)$, $C = (Q, R, M, \text{nil})$ and $C \models_{\alpha_p} S$. If $S \Rightarrow_{\alpha_p} B \Rightarrow T$ then there is a unique $D$ such that $C \Rightarrow_{\alpha_p} B \Rightarrow D$. Furthermore, for $D = (Q', R', M', \tilde{k})$ and $T = (\Sigma', n', B')$ the following hold

- $D \models_{\alpha_p} T$

- $|B'| = 0$ implies $\tilde{k} = \text{nil}$

- $|B'| > 0$ implies $\tilde{k} \neq \text{nil}$ and $B' = \text{Nil}(k)$. 

**Proof.** Induction on $|B|$. For $|B| = 1$, $\text{fst}(B)$ is either a jump instruction $\text{jump}$ or an assignment $\text{ass}$.

In the former case, there is a unique $D$ with $C \Rightarrow_{\alpha_p} B \Rightarrow D$ by Proposition 40, and $D \models_{\alpha_p} T$ holds. Hence $C \Rightarrow_{\alpha_p} B \Rightarrow D$, and by the rules $\text{JMP}$, $\text{IF-T}$ and $\text{IF-F}$, $B' = \text{Nil}(l)$ follows for some $l$. By Definition 24, $|B'| > 1$ holds, and $\text{nil} \neq k = l$ indeed holds by Proposition 40.

In the latter case, the second part of Proposition 39 implies $C \Rightarrow_{\alpha_p} B \Rightarrow D$ for some unique $D$, and $D \models_{\alpha_p} T$ holds. From $|B| = 1$, $|B'| = 0$ follows by rule $\text{ASS}$, and the second part of Proposition 39 indeed guarantees $\tilde{k} = \text{nil}$ as the last components of $C$ and $D$ agree.
For $|B| > 1$, $\text{fst}(B) = \text{ass}$ holds by Definition 24, and for $S \Rightarrow_\alpha T$ and $S_1 = (\Sigma_1, m_1, B_1)$, rule ASS implies $B = \text{ass} B_1$ with $|B_1| \geq 0$. Applying the second part of Proposition 39 yields a (unique) $E$ such that $C \frac{\equiv [\text{fst}(B)]_\rho}{\Rightarrow_\alpha} E$, and $E \models_\alpha S_1$ holds, and the last component of $E$ equals that of $C$ and is hence $\text{nil}$. Applying the induction hypothesis to $S_1 \Rightarrow_\alpha T$ and $E \models_\alpha S_1$ yields a (unique) $D$ with $E \frac{\equiv [B_1]_\rho}{\Rightarrow_\alpha} D$, and $D \models_\alpha T$ and the two claims regarding the last component of $D$ hold. Composing $C \frac{\equiv [\text{fst}(B)]_\rho}{\Rightarrow_\alpha} E$ and $E \frac{\equiv [B_1]_\rho}{\Rightarrow_\alpha} D$ thus yields $C \frac{\equiv [B]_\rho}{\Rightarrow_\alpha} D$ for a $D$ of the required shape. 

Consequently, $\Rightarrow_\alpha$-execution is faithfully matched by ALEF execution:

**Theorem 9.** Let $P = (N, A, \text{succ}, \text{entry})$ be a regular IL-program and $\rho$ an allocation for $P$. If $S_0 \Rightarrow_\alpha$-terminates in a configuration $T$ then there are unique $D$ and $w$ such that for $C = (\llbracket \cdot \rrbracket, \llbracket \cdot \rrbracket, M, \text{entry})$, $C \rightarrow w D$ holds with $D \models_\alpha T$.

**Proof.** By the first part of Proposition 39, $C_0 = (\llbracket \cdot \rrbracket, \llbracket \cdot \rrbracket, M, \text{nil})$ fulfills $C_0 \models_\alpha S_0$, where $S_0 = (\llbracket \cdot \rrbracket, \text{entry}, N(\text{entry}))$ by Definition 32. As $|N(\text{entry})| > 0$ holds by Definition 24, there is a unique $T_1$ such that $S_0 \Rightarrow_\alpha T_1$, and applying Proposition 41 yields a unique $D_1$ with $C_0 \frac{\equiv [N(\text{entry})]_\rho}{\Rightarrow_\alpha} D_1$, hence $C \frac{\equiv \text{entry}}{\Rightarrow_\alpha} D$ by rule EXECUTE. Furthermore, $D_1 \models_\alpha T_1$ holds, for $T_1 = (\Sigma, m, B)$, $|B| = 0$ implies $D_1 = (\_, \_, \_, \text{nil})$, while $|B| = 1$ implies $B = N(l)$ for some $l$, and $D_1 = (\_, \_, \_, l)$. With an induction on the number of basic blocks executed during the reduction $S \Rightarrow_\alpha T$ the claim follows. 

By combining Theorems 6 and 9 we thus obtain functional correctness of the translation based on dataflow solutions for regular programs.

**Corollary 4.** Let $P = (N, A, \text{succ}, \text{entry})$ be a regular IL-program and $\rho$ an allocation for $P$. If $S_0 \rightarrow$-terminates in a configuration $T$ then there is are unique $D$ and $w$ such that for $C = (\llbracket \cdot \rrbracket, \llbracket \cdot \rrbracket, M, \text{entry})$, $C \rightarrow w D$ holds, and $D \models_\alpha T$.

For SSA programs, the similar result follows by Theorem 7 and Proposition 37.
7.7 Discussion

7.7.1 Summary of the compilation

Chapters 6 and 7 outlined the back-end of a compiler with a program analysis phase for an intermediate language and a translation into ALEF. The various routes a compilation may follow are shown in Figure 7.3.

Starting from a regular IL program, forwardability information may either be obtained directly from a solution to the dataflow equations for variable usage. Alternatively, the program may be transformed into (standard edge-split) SSA form as described in Section 6.2. We may also accept SSA programs which did not arise from such a trans-
formation by checking that the syntactic requirements are met. In both cases, dataflow equations and forwardability analysis are performed on the SSA code, before $\Phi$-blocks are eliminated in a forwardability-preserving (and functionally correct) way. Subsequently, for each forwardable variable it is decided whether forwarding should actually be used, possibly governed by restrictions on the number of queues and their length. Allocating queues and registers by colouring the conflict graphs yields the final ALEF program. Provided that each step in the translation fulfilled the requirements of the corresponding propositions (compatibility, etc.) the resulting ALEF program agrees functionally with the input program, and is well-typed where the typing corresponds to the chosen dataflow solution.

### 7.7.2 Extensions and future work

In order to obtain the formal results, we presented a rather simple dataflow analysis and a very basic translation. The following sections outline some improvements which would be desirable for a full compiler, as well as some topics for more theoretical future work.

#### 7.7.2.1 Alternative patterns of usage

**Multiple usage** Instead of requiring a value to be used exactly once on each path, one can require a value to be used exactly $k$ times, for arbitrary fixed $k$. The corresponding dataflow analysis extends $\mathcal{L}$ to the flat lattice $\mathcal{L}_k$ of values $0, \ldots, k$, with addition defined by

<table>
<thead>
<tr>
<th>$\oplus$</th>
<th>$\bot$</th>
<th>0</th>
<th>1</th>
<th>$\cdots$</th>
<th>$k-1$</th>
<th>$k$</th>
<th>$\top$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\bot$</td>
<td>$\bot$</td>
<td>$\bot$</td>
<td>1</td>
<td>$\cdots$</td>
<td>$k-1$</td>
<td>$k$</td>
<td>$\top$</td>
</tr>
<tr>
<td>0</td>
<td>$\bot$</td>
<td>0</td>
<td>1</td>
<td>$\cdots$</td>
<td>$k-1$</td>
<td>$k$</td>
<td>$\top$</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>$\cdots$</td>
<td>$k$</td>
<td>$\top$</td>
<td>$\top$</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\ddots$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
</tr>
<tr>
<td>$k-1$</td>
<td>$k-1$</td>
<td>$k-1$</td>
<td>$k$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
</tr>
<tr>
<td>$k$</td>
<td>$k$</td>
<td>$k$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
</tr>
<tr>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
<td>$\top$</td>
</tr>
</tbody>
</table>
and ⊕ defined by $l_1 ⊕ l_2 = \bigcup \{ l | l \oplus l_2 = l_1 \}$.

In the context of ALEF, the case $k = 2$ is of particular interest, as variables with exactly two uses can be translated into a duplication instruction, provided that all pairs of consuming instructions execute on the same sets of functional units. For example the IL-program \( [1]x = 4; \ [2]y = x * 2; \ [3]z = y + x \) may be translated to

\[
[1]\text{ldc } 4 \ q_1 \ [1a]\text{dup1 } MUL \ q_1 \ q_3 \ q_1 \ [2]\text{mul } 2 \ q_1 \ q_2 \ [3]\text{add } q_2 \ q_3 \ r_1
\]

where we assume $\gamma(q_1) = MUL$ and $\gamma(q_2) = \gamma(q_3) = ALU$. In an extended ALEF instruction set, an instruction might exist which combines the first two instructions to $[1]\text{ldc } 4 \ q_3 \ q_1$.

Variables which are used exactly $k > 2$ times may in principle be translated into a cascade of duplication instructions. For example,

\[
[1]x = 4; \ [2]y = x * x; \ [3]z = y + x
\]

may be translated to

\[
[1]\text{ldc } 4 \ q_1 \ [1a]\text{dup1 } MUL \ q_1 \ q_1 \ [1b]\text{dup1 } MUL \ q_1 \ q_3 \ q_1 \ [2]\text{mul } q_1 \ q_1 \ q_2 \ [3]\text{add } q_2 \ q_3 \ r_1.
\]

As $k$ grows however, we expected that register usage will outperform dup1-cascades.

**Up to $k$ uses** More generously, it may be possible to include paths with *at most* $k$ uses, rather than requiring *exactly* $k$ uses. By modifying the definition of $fwdOut$ to

\[
fwdOut(I)(x) = \begin{cases} 
\bot & \text{if } \text{succ}(I) = \emptyset \\
\bigcup_{J \in \text{succ}(I)} \text{fwdIn}(J)(x) & \text{otherwise}
\end{cases}
\]

values which are not consumed may be left over at the end of a program run. Inside loops, no spare values are admitted, and the above definition in fact simplifies to

\[
fwdOut(I)(x) = \bigcup_{J \in \text{succ}(I)} \text{fwdIn}(J)(x)
\]

using $\bot \ldots = \bot$. On the other hand, this modification may require one to revisit the translation between dataflow solutions and ALEF types.
**Unbalanced branches** In situations where the two branches of a conditional jump differ in the number of uses of a variable, the compiler may insert additional duplication instructions. Likewise, by inserting `skip` instructions we can remove redundant copies of variables, and inserting `id`-instructions allows us to move values between operand queues connected to different functional units. We expect that these techniques in combination will make all variables forwardable. Whenever a variable is accessed, a duplication instruction is inserted, and one copy is used immediately while the other one is kept for later. In principle, forwarding should be possible for all variables for logical reasons: in Chapter 5 we observed that registers play the role of exponentials in the linear setting, so the duplication instruction should eliminate the need for registers.

### 7.7.2.2 Efficient translations

Translating `Φ`-instructions into several `id` instructions often results in an intolerably high static and dynamic instruction count. Consider for example the Java excerpt

```java
int f(){
    int x=3;
    int y=2;
    for (int z=10, z>0, z--){x=x+y;}
    x=x+1
}
```

and its naive translation into SSA-IL

\[
\begin{align*}
N(1) &= [1]x1 = 3 \\
[2]y1 = 2 \\
[3]z1 = 10 \\
[4]jmp 5 \\
N(5) &= [5] \begin{pmatrix}
x2 \\
z2
\end{pmatrix} = \begin{pmatrix}
x1 & x3 \\
z1 & z3
\end{pmatrix} \begin{pmatrix}
4 \\
9
\end{pmatrix} \\
[6] if z2 \ 10 \\
N(7) &= [7]x3 = x2 + y1 \\
[8]z3 = z2 - 1 \\
[9] jmp 5 \\
N(10) &= [10]x4 = x2 + 1
\end{align*}
\]
In our translation the $\Phi$-block results in four $\text{id}$-instructions, two at the end of block 1 and one at the end of block 7. All these instructions represent the only uses of their respective operands $x_1$, $x_3$, $z_1$ and $z_3$. Their results are used exactly once, and are indeed both forwardable to ALU. All this information is captured in the dataflow analysis and a less wasteful translation would reduce the number of $\text{id}$ instructions, in effect reversing the splitting of variables performed during the transformation into SSA.

In general, optimisation can be achieved either by transformations during compilation or by post-compilation optimisation, similar to peephole optimisation. The presence of types in the compilation output is beneficial in both cases. For example, $\text{dec } q_1 \text{id}^{\gamma(q_1)} q_1 q_2$ may only be replaced by $\text{dec } q_2$ if the value consumed by the second instruction actually is the one resulting from the first one. These context-dependencies are captured in types, as was shown when considering data-dependence between instruction with output-conflicts in Chapter 4.

### 7.7.2.3 Integrating dataflow analysis and forwardability

The decision whether a value is forwardable occurs as a separate step in our approach, between the dataflow analysis and the allocation. It can however be integrated with the dataflow analysis by using the more sophisticated lattice

```
T

0      ALU      MUL      MEM      BU

1

1
```

The functional units sit flatly between 1 and $\top$ but behave like 1 with respect to $\oplus$. The element $1$ is used for polymorphism in $\text{id}$ instructions arising from assignments $y = x$. Alternatively, it may be possible to integrate the forwardability decision with processor-dependent compiler phases such as queue and register allocation.
7.7.2.4 Allocation for distributed and bounded execution

The conflict graph constructed in Section 7.5 is suited for allocating operand queues with respect to sequential execution of ALEF programs. Race conditions arising from out-of-order execution do not surface using x-paths and are therefore not eliminated. Neither did we require the colouring to satisfy length limitation of operand queues – in fact an optimal colouring will most likely yield long operand queues.

One approach for obtaining allocations for distributed and finite-queue execution consists of post-processing sequential allocations using the techniques from Chapter 4. More elegantly, it may be possible to capture additional requirements in the conflict graph or the dataflow equations. For example, edges may be introduced between intermediate variables whose defining instruction(s) execute on different functional units. This generalises the forwardability predicate by complementing the destination information with the source of a value, similar to pipeline-dependencies on the ALEF level. Leaving the details for future research, we expect that data-dependencies and finiteness of queues may be tackled by restricting the graph colourings or by a more fine-grained dataflow analysis which keeps track of values which are jointly in-flight under distributed execution.

7.7.2.5 Relating program analysis frameworks

The construction of non-standard dynamic semantics in Chapter 6 followed a very natural regime: incorporate the lattice of the dataflow equations into the dynamic semantics, taking into account that the dataflow analysis proceeds in the opposite direction as the program execution by dualising $\oplus$ to $\ominus$. If this approach can be generalised to arbitrary dataflow-equations, proving these correct would amount to relating the natural dynamic semantics arising from (a solution to) the dataflow equations to the standard dynamic semantics. Further effort is needed to see whether this is indeed possible, to formalise such correspondences in frameworks such Nielsen et al’s [NNH99], and to explore whether a connection to general approaches for relating syntax and operational semantics exists.

We noted in Section 7.4 that the correspondence between dataflow solutions and low-level types is slightly unsatisfactory and future work should consider homomorphic
7.7. Discussion

The analysis and compilation substantiate the argument that a programming language based approach is beneficial for linking processor structure and program analysis, allowing compilers to optimise operand communication. We draw two conclusions.

1. Ensuring safe operand usage in ALEF programs can be lifted from the assembly level to earlier stages of the compiler. This demonstrates that our approach of typing the operand usage fits well with other approaches pursued in the typed-assembly-language community, and with the types-in-compiler paradigm.

2. The restrictions we imposed on ALEF programs for being type-correct are generous enough to accommodate many forwardable variables. In particular, the unification condition during the composition of basic blocks appeared rather restrictive on the assembly level, but is naturally met by programs originating from IL via the correspondence between dataflow solutions and ALEF typings.

The next chapter will describe an initial implementation of the compilation and report on results obtained from translating Java code into ALEF.
In this chapter, we describe a prototypical implementation of our compilation approach and give experimental results for a standard benchmark suite. The purpose of this implementation is to validate some of the calculi described in the earlier chapters rather than to provide a full compiler of high efficiency. We chose ML as implementation language, using the Moscow-ML compiler kit [RRS00].

Concentrating on the regular fragment of IL, our implementation follows the approach of Chapters 6 and 7. Dataflow analysis and register and queue allocation phases are implemented using a simple iterative algorithm and greedy colouring algorithms respectively. While for production-strength compilers more efficient implementations are necessary, the chosen algorithms provide sufficient insight for our purpose.

In order to exercise the translation on larger programs, we implemented two symbolic conversions from a subset of the Java Virtual Machine Language (JVML, [LY97]) into IL. Although some manual intervention is currently required to convert JVM code into the format accepted by our conversion tool, an overall translation process is achieved accepting JAVA source code and emitting ALEF code. We exercised this compiler flow on the Linpack benchmark suite (Java version [DWM]), a numerically intensive linear equation solver commonly used for evaluating parallelising compilers and computer architectures [Don89].

The following section summarises the conversion schemes for translating JVML code into IL. Subsequently, Sections 8.2 and 8.3 present and discuss our experimental results.
8.1 Analysis of JVM bytecode

We briefly summarise the Java virtual machine language (JVML) before outlining how JVML code may be converted into IL.

8.1.1 Java virtual machine code

The semantics of JVML is defined (see [LY97]) by an operational model for bytecode instructions which manipulate an operand stack, local variables, an object heap, a call stack of method invocations and a set of available classes. Both translations concentrate on the bodies of methods, interpret the invocation of a method as a single symbolic instruction and aim at converting the operand stack carried values into forwardable entities. Ignoring call stack, heap and class-file environment, the semantics of an instruction can be approximated by triples of the form \((s, LV, pc)\) where \(s\) is a stack of 32-bit values, \(LV\) a map from local variables to values and \(pc\) the program counter. Table 8.1 shows a subset of JVML’s integer instructions and their semantic actions. Arguments \(i\) of \(iinc\), \(iload\) and \(istore\) represent the index of the local variable to be manipulated, and the static semantics of JVM code ensures that variable \(i\) is a legal entry in \(LV\). The instruction \(ireturn\) pushes the top value of the current operand stack to the operand stack of the calling method and then returns, and the primed terms in Table 8.1 represent the respective components of the calling frame.

For converting JVML code into IL instructions, we considered two translation schemes which differ in the way IL variables are assigned to entities on the operand stack. Due to the limited instruction sets of both IL and ALEF, we decided to implement symbolic conversions which model the consumption of operands correctly but interpret some instructions in a functionally imprecise way. For example, the execution of Linpack programs requires the implementation of floating point instructions (of double precision), which our conversions replace by integer operations on symbolic data. Likewise, method invocations and function calls translate into symbolic consumption and creation of operands and return values rather than in assembly-level procedure calls. While this symbolic approach suffices for analysing the usage of operands, it precludes the possibility of executing the resulting code in a meaningful way.
8.1. Analysis of JVM bytecode

<table>
<thead>
<tr>
<th><strong>Bytecode</strong></th>
<th><strong>Semantics</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>iadd</td>
<td>((abs, LV, pc) \rightarrow (b + a) s, LV, pc + 1)</td>
</tr>
<tr>
<td>isub</td>
<td>((abs, LV, pc) \rightarrow (b - a) s, LV, pc + 1)</td>
</tr>
<tr>
<td>imul</td>
<td>((abs, LV, pc) \rightarrow (b*a) s, LV, pc + 1)</td>
</tr>
<tr>
<td>iinc i a</td>
<td>((s, LV, pc) \rightarrow (s, LV[i \mapsto (LV i) + a], pc + 1))</td>
</tr>
<tr>
<td>iload i</td>
<td>((s, LV, pc) \rightarrow ((LV i) s, LV, pc + 1))</td>
</tr>
<tr>
<td>istore i</td>
<td>((as, LV, pc) \rightarrow (s, LV[i \mapsto a], pc + 1))</td>
</tr>
<tr>
<td>icode a</td>
<td>((s, LV, pc) \rightarrow (as, LV, pc + 1)) where (a \in {-1, \ldots, 5})</td>
</tr>
<tr>
<td>ineg</td>
<td>((as, LV, pc) \rightarrow ((-a) s, LV, pc + 1))</td>
</tr>
<tr>
<td>dup</td>
<td>((as, LV, pc) \rightarrow (aas, LV, pc + 1))</td>
</tr>
<tr>
<td>pop</td>
<td>((as, LV, pc) \rightarrow (s, LV, pc + 1))</td>
</tr>
<tr>
<td>goto pc’</td>
<td>((s, LV, pc) \rightarrow (s, LV, pc’))</td>
</tr>
<tr>
<td>if_icmplt pc’</td>
<td>((abs, LV, pc) \rightarrow \begin{cases} (s, LV, pc’) &amp; \text{if } b &lt; a \ (s, LV, pc + 1) &amp; \text{otherwise} \end{cases})</td>
</tr>
<tr>
<td>iflt pc’</td>
<td>((as, LV, pc) \rightarrow \begin{cases} (s, LV, pc’) &amp; \text{if } a &lt; 0 \ (s, LV, pc + 1) &amp; \text{otherwise} \end{cases})</td>
</tr>
<tr>
<td>ireturn</td>
<td>((as, LV, pc) \rightarrow (as’, LV’, pc’))</td>
</tr>
</tbody>
</table>

Table 8.1: Operational semantics of some JVML instructions
Both translations introduce IL variables for operand stack communicated values and translate each JVML instruction into a sequence of IL instructions such that creation and consumption of stack values corresponds exactly to assignment and usage of these stack variables. Furthermore, the translations employ local variables for encoding JVM variables in LV and auxiliary variables for modelling the effect of method return and symbolic instruction execution. The three categories of IL variables are separated by syntactic means in IL, and we write $x_i$ for stack variables, $y_i$ for local variables and $w_i$ for auxiliary variables. The forwardability analysis concentrates on variables of the first category.

### 8.1.2 Translating each stack position into a variable

The first translation (henceforth referred to as scheme $A$) statically assigns one IL variable to each position of the operand stack. Consequently, the number of stack variables employed coincides with the maximal height of the operand stack. This height is statically known at compile time of a JAVA program and explicitly available in a JVML class-file. For example, the JVM program

\[
\begin{align*}
0 & \text{i}load\ 1 & 4 & \text{i}add \\
1 & \text{i}const\ 2 & 5 & \text{i}sub \\
2 & \text{i}load\ 3 & 6 & \text{i}add \\
3 & \text{i}const\ 3 & 7 & \text{i}return
\end{align*}
\]

(8.1)

uses an operand stack of height 4 and is translated into the IL program

\[
\begin{align*}
[0] & x_0 = y_1 \\
[1] & x_1 = 2 \\
[2] & x_2 = y_3 \\
[3] & x_3 = 3 \\
[4] & x_2 = x_2 + x_3 \\
[5] & x_1 = x_1 - x_2 \\
[6] & x_0 = x_0 + x_1 \\
[7] & w_0 = x_0
\end{align*}
\]

(8.2)

where each variable $x_i$ corresponds to the stack position $i$. Local variables are translated bijectively into IL variables of the second category (see instructions $[0]$ and $[2]$), while auxiliary variables are created as needed for the correct modelling of operand consumption (instruction $[7]$).
8.1. Analysis of JVM bytecode

<table>
<thead>
<tr>
<th>Bytecode</th>
<th>IL code</th>
<th>sp update</th>
</tr>
</thead>
<tbody>
<tr>
<td>iadd</td>
<td>(x_{sp-2} = x_{sp-2} + x_{sp-1})</td>
<td>(sp := sp - 1)</td>
</tr>
<tr>
<td>isub</td>
<td>(x_{sp-2} = x_{sp-2} - x_{sp-1})</td>
<td>(sp := sp - 1)</td>
</tr>
<tr>
<td>imul</td>
<td>(x_{sp-2} = x_{sp-2} \times x_{sp-1})</td>
<td>(sp := sp - 1)</td>
</tr>
<tr>
<td>iinc a</td>
<td>(y_1 = y_1 + a)</td>
<td>* -</td>
</tr>
<tr>
<td>iload i</td>
<td>(x_{sp} = y_1)</td>
<td>(sp := sp + 1)</td>
</tr>
<tr>
<td>istore i</td>
<td>(y_1 = x_{sp-1})</td>
<td>(sp := sp - 1)</td>
</tr>
<tr>
<td>iconst a</td>
<td>(x_{sp} = a)</td>
<td>(sp := sp + 1)</td>
</tr>
<tr>
<td>ineg</td>
<td>(x_{sp-1} = \sim x_{sp-1})</td>
<td>* -</td>
</tr>
<tr>
<td>dup</td>
<td>(\begin{align*} w_0 &amp;= x_{sp-1} \ x_{sp} &amp;= w_0 \ x_{sp-1} &amp;= w_0 \end{align*} )</td>
<td>(sp := sp + 1)</td>
</tr>
<tr>
<td>pop</td>
<td>(w_0 = x_{sp-1})</td>
<td>(sp := sp - 1)</td>
</tr>
</tbody>
</table>

Table 8.2: Translation scheme A for non-jumps

The translation inside each basic block follows the sequence of instructions using the translation scheme given in Table 8.2, where the additional unary IL operator \(\sim\) denotes negation. The effect of instruction execution is modelled by internally maintaining the stack pointer \(sp\) of variables \(x_i\). For each instruction, the context conditions given in [LY97] ensure that Table 8.2 is well-defined as the stack pointer never decreases beyond zero and is always greater than or equal to the highest index used in the corresponding line in Table 8.2. Most JVM instructions pop their operands off the stack during execution, which yields a single usage for most stack-communicated values. This pattern may be extended to stack-manipulating instructions pop and dup at the cost of additional assignments to auxiliary variables. Alternatively, one could implement pop without emitting any IL code by purely decrementing the stack pointer. Likewise, dup could be implemented by a single instruction \(x_{sp} = x_{sp-1}\). The translation shown in Table 8.2 thus sacrifices runtime efficiency for the uniform consumption of stack variables and for the coherence with the behaviour of the corresponding ALEF instructions skip and dup1.
At the end of basic blocks, programs compose correctly by the stipulation in the JVM specification that all control flow edges deliver operand stacks of the same height — indeed, even the stacks’ shapes must coincide, i.e. the types of values at each stack position must agree. As our translation does not distinguish between values of different types, monitoring the height of the operand stack suffices. All conditional jump instructions are symbolically modelled by the IL conditional if, and the consumption of operands is translated into assignments to an auxiliary variable (Table 8.3).

**Example.** Consider the Java function \(f_1\)

\[
\text{int } f1()\{
    \text{int } \text{sum} = 5;
    \text{for (int } i = 0; i < 5; i++)\{\text{sum }+= 3;\} \quad (8.3)
    \text{return } \text{sum};
\}
\]

and its translation into JVML

\[
\begin{align*}
0 & \text{iconst } 5 & 4 & \text{goto } 13 & 14 & \text{iconst } 5 \\
1 & \text{istore } 1 & 7 & \text{iinc } 13 & 15 & \text{if icmplt } 7 \\
2 & \text{iconst } 0 & 10 & \text{iinc } 21 & 18 & \text{iload } 1 \\
3 & \text{istore } 2 & 13 & \text{iload } 2 & 19 & \text{ireturn}
\end{align*}
\]
Inserting explicit unconditional jumps for the fall-through case of conditional jumps, the translation scheme A results in the IL program

\[
\begin{align*}
N(0) = [0]|x_0 &= 5 & N(13) = [13]|x_0 &= y_2 \\
[1]|y_1 &= x_0 & [14]|x_1 &= 5 \\
[2]|x_0 &= 0 & [15a]|w_0 &= x_0 \\
[3]|y_2 &= x_0 & [15b]|w_0 &= x_1 \\
[4]|\text{jmp} 13 & [15]|\text{if } w_0 7 18 \\
N(7) = [7]|y_1 &= y_1 + 3 \\
[10]|y_2 &= y_2 + 1 & N(18) = [18]|x_0 &= y_1 \\
[100]|\text{jmp} 13 & [19]|w_0 &= x_0
\end{align*}
\]

The stack regime results in each assignment to a stack variable \(x_4\) being used linearly, as the \(x_0\)-paths ([0], [1], [2], [3]), ([13], [14], [15a]) and ([18], [19]) and the \(x_1\)-path ([14], [15a], [15b]) show. Indeed, both variables are forwardable to any functional unit as all consuming instructions are of the form \(x = y\) and will be translated to id instructions in ALEF.

All stack variables are used linearly as most JVM instructions pop their operands from the stack. Instructions whose sole purpose consists of modifying the stack (pop and dup) conform to this discipline by virtue of the assignments to auxiliary variables. However, not all stack positions translate into forwardable stack variables as variable names are extensively reused, and the respective values are consumed by instructions executing on different functional units.

**Example.** The program

```java
int f2(){
    int sum = 5; int j = 2;
    for (int i =0; i<5; i++)
        \{ j++; sum = sum+j; \}
    return sum * j;
}
```

(8.4)
results in the JVML code

\[
\begin{align*}
0 & \text{ iconst 5} & 5 & \text{ istore 3} & 14 & \text{ iadd} & 21 & \text{ if,icmplt 19} \\
1 & \text{ istore 1} & 6 & \text{ goto 19} & 15 & \text{ istore 1} & 24 & \text{ iload 1} \\
2 & \text{ iconst 2} & 9 & \text{ iinc 2 1} & 16 & \text{ iinc 3 1} & 25 & \text{ iload 2} \\
3 & \text{ istore 2} & 12 & \text{ iload 1} & 19 & \text{ iload 3} & 26 & \text{ imul} \\
4 & \text{ iconst 0} & 13 & \text{ iload 2} & 20 & \text{ iconst 5} & 27 & \text{ ireturn}
\end{align*}
\]

which translates into

\[
\begin{align*}
N(0) &= [0] \quad x_0 = 5 \\
[1] & \quad y_1 = x_0 \\
[2] & \quad x_0 = 2 \\
[3] & \quad y_2 = x_0 \\
[4] & \quad x_0 = 0 \\
[5] & \quad y_3 = x_0 \\
[6] & \quad \text{ jmp 19}
\end{align*}
N(9) &= [9] \quad y_2 = y_2 + 1 \\
[12] & \quad x_0 = y_1 \\
[13] & \quad x_1 = y_2 \\
[14] & \quad x_0 = x_0 + x_1 \\
[15] & \quad y_1 = x_0 \\
[16] & \quad y_3 = y_3 + 1 \\
[100] & \quad \text{ jmp 19}
\]

\[
\begin{align*}
N(19) &= [19] \quad x_0 = y_3 \\
[20] & \quad x_1 = 5 \\
[21a] & \quad w_0 = x_0 \\
[21b] & \quad w_0 = x_1 \\
[21] & \quad \text{ if } w_0 \ 9 \ 24
\end{align*}
N(24) &= [24] \quad x_0 = y_1 \\
[25] & \quad x_1 = y_2 \\
[26] & \quad x_0 = x_0 * x_1 \\
[27] & \quad w_0 = x_0
\]

The usages of \(x_0\) and \(x_1\) are linear - the dataflow equations have a solution with

\[
\begin{align*}
Out([0])(x_0) &= Out([2])(x_0) = Out([4])(x_0) = Out([12])(x_0) = Out([14])(x_0) \\
&\quad = Out([19])(x_0) = Out([24])(x_0) = Out([26])(x_0) = 1
\end{align*}
\]

and

\[
\begin{align*}
Out([13])(x_1) &= Out([20])(x_1) = Out([25])(x_1) = 1
\end{align*}
\]

but neither variable is forwardable due to the conflicting uses in \([14]\) \(x_0 = x_0 + x_1\) and \([26]\) \(x_0 = x_0 * x_1\).

Translation according to scheme A consequently results in a low number of IL variables (and in a fast computation of dataflow solutions), but also limits the opportunities for forwarding.
8.1. Analysis of JVM bytecode

<table>
<thead>
<tr>
<th>Bytecode</th>
<th>internal stack (pre)</th>
<th>IL code</th>
<th>internal stack (post)</th>
</tr>
</thead>
<tbody>
<tr>
<td>iadd</td>
<td>$xyS$</td>
<td>$x_v = y + x$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>isub</td>
<td>$xyS$</td>
<td>$x_v = y - x$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>imul</td>
<td>$xyS$</td>
<td>$x_v = y * x$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>iinc i a</td>
<td>$S$</td>
<td>$y_i = y_i + a$</td>
<td>$S$</td>
</tr>
<tr>
<td>iload i</td>
<td>$S$</td>
<td>$x_v = y_i$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>istore</td>
<td>$xS$</td>
<td>$y_i = x$</td>
<td>$S$</td>
</tr>
<tr>
<td>iconst a</td>
<td>$S$</td>
<td>$x_v = a$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>ineg</td>
<td>$xS$</td>
<td>$x_v = \sim x$</td>
<td>$x_vS$</td>
</tr>
<tr>
<td>dup</td>
<td>$xS$</td>
<td>$\begin{cases} x_{v1} = w_0 \ x_{v2} = w_0 \end{cases}$</td>
<td>$x_{v2}x_{v1}S$</td>
</tr>
<tr>
<td>pop</td>
<td>$xS$</td>
<td>$w_0 = x$</td>
<td>$S$</td>
</tr>
</tbody>
</table>

Table 8.4: Translation scheme B for non-jumps

8.1.3 Translating each push operation into a variable

The second translation (referred to as scheme $B$) aims to make more variables forwardable. This is achieved by introducing a fresh IL variable for each assignment to a stack position. Instead of monitoring the height of the operand stack using a stack pointer, the translation internalises the stack behaviour by maintaining a stack of IL variables.

For simple instructions, the translation scheme is given in Table 8.4 where in each line, $x_v$ represents a globally fresh variable name of the stack variable category. Again, instructions `pop` and `dup` could be implemented more efficiently than by assignments to auxiliary variables. The translation results in an SSA-like usage of variables as no variable is (statically) assigned to repeatedly. For example, example program (8.1) from the previous section is translated into

\[
\begin{align*}
[0] & x_0 = y_1 \\
[1] & x_1 = 2 \\
[2] & x_2 = y_3 \\
[3] & x_3 = 3 \\
[4] & x_4 = x_2 + x_3 \\
[5] & x_5 = x_1 - x_4 \\
[6] & x_6 = x_0 + x_5 \\
[7] & w_0 = x_6
\end{align*}
\] (8.5)
### Table 8.5: Translation scheme B for jumps

<table>
<thead>
<tr>
<th>Bytecode</th>
<th>internal stack (pre)</th>
<th>IL code</th>
<th>internal stack (post)</th>
</tr>
</thead>
</table>
| goto pc'       | \( S \)              | \( \text{jmp } pc' \) \[
  \begin{align*}
  w_0 &= y \\
  w_0 &= x
  \end{align*}\] | \( S \) |
| if_icmplt pc'  | \( xyS \)            | \( \{ \)
  \( \text{if } w_0 \ p c' \ p c + 1 \)? \[
  \begin{align*}
  w_0 &= x \\
  w_0 &= x
  \end{align*}\] | \( S \) |
| iflt pc'       | \( xS \)             | \( \{ \)
  \( \text{if } w_0 \ p c' \ p c + 1 \)? \[
  \begin{align*}
  w_0 &= x_k \\
  \vdots \\
  w_0 &= x_1 \\
  w_0 &= x
  \end{align*}\] | \( \varepsilon \) |
| ireturn        | \( xx_1 \ldots x_k \) | \( \{ \) |

if the creation of fresh \( x_i \) variables follows the natural numbers.

At basic block boundaries, compensation code is introduced similar to the one arising from the elimination of \( \Phi \)-blocks, without prior explicit insertion of \( \Phi \)-instructions. Again based on the JVM requirement that stacks delivered by separate control flow edges must match, a cascade of assignments \( x_i = x_j \) is inserted where each variable on the left is (syntactically) the one expected by the following basic block and each \( x_j \) is the variable name at the same stack position as delivered by the current basic block. Jump instructions are translated as shown in Table 8.5. and a prior phase is employed to ensure (standard) edge-split form. For example, the Java function

```java
final int abs (int i) {return (i >= 0) ? i : -i;}
```

results in JVM code

```
0 iload 1 8 iload 1
1 iflt 8 9 ineg
4 iload 1 10 ireturn
5 goto 10
```
which translates to

\[
\begin{align*}
N(0) & = [0]x_0 = y_1 & N(8) & = [8]x_2 = y_1 \\
[1] & \text{if } x_0 \ 8 \ 4 & [9] & x_3 = ~ x_2 \\
N(4) & = [4]x_1 = y_1 & [100] & x_1 = x_3 \\
[5] & \text{jump 10} & N(10) & = [10]w_0 = x_1
\end{align*}
\]

The stack at the entry of block \(N(10)\) contains one element, and instruction \([100]\) is the glue instruction moving \(x_3\) to \(x_1\). Our implementation differs slightly from our theoretical treatment in Section 7.2 as basic block \(N(4)\) is not equipped with compensation code, and \(N(10)\) instead operates on the stack delivered by \(N(4)\) directly. Notice that in contrast to scheme A, most \(x_4\) variables are now not only used linearly but also forwardable.

**Example.** Returning to the two example programs from the previous section, program (8.3) translates to

\[
\begin{align*}
N(0) & = [0]x_0 = 5 & N(13) & = [13]x_2 = y_2 \\
[1] & y_1 = x_0 & [14] & x_3 = 5 \\
[2] & x_1 = 0 & [15a] & w_0 = x_2 \\
[3] & y_2 = x_1 & [15b] & w_0 = x_3 \\
[4] & \text{jump 13} & [15] & \text{if } w_0 \ 7 \ 18 \\
N(7) & = [7]y_1 = y_1 + 3 & N(18) & = [18]x_4 = y_1 \\
[10] & y_2 = y_2 + 1 & [19] & w_0 = x_4 \\
[100] & \text{jump 13} & &
\end{align*}
\]

using scheme B, and all stack variables \(x_i\) are used linearly and indeed forwardable to
any functional unit. Likewise, program (8.4) translates to

\[
\begin{align*}
N(0) & = [0] x_0 = 5 & N(9) & = [9] y_2 = y_2 + 1 \\
[1] y_1 = x_0 & & [12] x_9 = y_1 \\
[3] y_2 = x_1 & & [14] x_{10} = x_8 + x_9 \\
[4] x_2 = 0 & & [15] y_1 = x_{10} \\
[5] y_3 = x_2 & & [16] y_3 = y_3 + 1 \\
[6] \text{jmp 19} & & [100] \text{jmp 19}
\end{align*}
\]

\[
\begin{align*}
N(19) & = [19] x_3 = y_3 & N(24) & = [24] x_5 = y_1 \\
[20] x_4 = 5 & & [25] x_6 = y_2 \\
[21a] \bar{w}_0 = x_3 & & [26] x_7 = x_5 * x_6 \\
[21b] \bar{w}_0 = x_4 & & [27] \bar{w}_0 = x_7 \\
[21] \text{if } \bar{w}_0 924 & &
\end{align*}
\]

and all stack variables \( x_1 \) are used linearly and are indeed forwardable: \( x_5 \) and \( x_6 \) are forwardable to MUL, \( x_8 \) and \( x_9 \) to ALU and the remaining \( x_1 \) to any functional unit. \( \diamond \)

In general, some stack variables remain non-forwardable in scheme B such as \( x_0 \) in program

\[
\begin{align*}
N(0) & = [0] x_0 = y_1 & N(5) & = [5] x_2 = 3 & N(8) & = [8] x_4 = 7 \\
[1] x_1 = 2 & & [6] x_3 = x_0 + x_2 & & [9] x_5 = x_0 * x_4 \\
[2] \text{if } x_1 58 & & [7] \bar{w}_0 = x_3 & & [10] \bar{w}_0 = x_5 
\end{align*}
\]

which results from the JVML program

\[
\begin{align*}
0 \text{ iload 1} & & 5 \text{ iconst 3} & & 8 \text{ iconst 7} \\
1 \text{ iconst 2} & & 6 \text{ iadd} & & 9 \text{ imul} \\
2 \text{ iflt 8} & & 7 \text{ ireturn} & & 10 \text{ ireturn.}
\end{align*}
\]

However, the experimental results in the following section indicate that this pattern of operand usage does not occur often in JVM code generated by javac.
8.1.4 Implemented instruction set

We implemented both translation schemes for an instruction set which extends the one given in Table 8.1 by (see Table 8.6)

**instructions operating on 64-bit values:** The definition of the JVM stipulates that values of datatypes `double` and `long` are stored using two operand stack positions of width 32 bit each. As both translation schemes relate single stack positions to IL variables, two variables are used for to represent a 64-bit value. Consequently, most instructions involving datatypes `double` and `long` result in a sequence of (at least) two IL assignments. Furthermore, the values of types `double` and `long` are represented symbolically by integer values (in both components) as functionally precise assembly instructions are not available in IL or ALEF.

**instructions involving JVM references and arrays:** References into the object heap are represented as 32-bit entities in JVML and are manipulated on the operand stack. As our conversions do not model the structure of the heap, instructions operating on references are again interpreted symbolically. Likewise, arrays of various datatypes are treated symbolically due to the absence of a memory component and references in IL.

**additional stack-manipulating instructions:** These are converted in a way similar to `pop`, `iconst` and `ireturn`. As mentioned above the conversion of duplication is such that linear usage is preserved, at the cost of an additional assignment to an auxiliary variable.

**more variants of conditionals:** Comparison operations and conditional branches are parameterised by the comparison operator (`≤, <, =, ≠, ≥, >, …`) used, but all variants are translated to IL’s single `if` instruction.

**instructions for object creation and field access:** as the object heap is not modelled in IL, we convert `new` into the creation of a symbolic value representing the resulting object reference on the operand stack. Instructions `getfield`, `putfield` and `getstatic` operate on symbolic data.
<table>
<thead>
<tr>
<th>Category</th>
<th>Bytecodes</th>
</tr>
</thead>
<tbody>
<tr>
<td>64-bit operations</td>
<td>dload, dstore, dconst, dneg, dmul, ddiv, dadd, dsub, dcmpl, dcmpg</td>
</tr>
<tr>
<td>arrays and references</td>
<td>anewarray, newarray, multianewarray, aconst_null, aaload, aastore, aload, astore, iaload, iastore, daload, dastore</td>
</tr>
<tr>
<td>stack manipulation</td>
<td>ldc, bipush, 12d, i2d, d2i, 1dc2_w, i2l, dup2, sipush, return, dreturn, areturn</td>
</tr>
<tr>
<td>conditionals</td>
<td>if_acmp op, if op, if_icmp op</td>
</tr>
<tr>
<td>object management</td>
<td>new, getfield, putfield, getstatic</td>
</tr>
<tr>
<td>method invocation</td>
<td>invokevirtual, invokespecial, invokevirtual, invokevirtual,</td>
</tr>
</tbody>
</table>

Table 8.6: Implemented JVML instructions

**method invocations:** Depending on the number and types of arguments, and the type of the return value of the method being called, a varying number of operands is popped from / pushed onto the operand stack. As the signature of the called method is statically known at compile time, our translations insert as many assignments to auxiliary variables and stack variables as there are operands consumed and returned, respectively. Again, double and long operands are modelled by pairs of assignments.

This instruction set allows us to convert all but one of the fifteen methods of the Linpack benchmark suite (Java version). The remaining method actionPerformed uses exception handling and was therefore not considered.

## 8.2 Results

### 8.2.1 Conversion into IL

We first present results for the conversion from JVM to IL.
<table>
<thead>
<tr>
<th>Method name</th>
<th>Basic blocks</th>
<th>Bytecode instrs</th>
<th>IL instrs</th>
<th>Stack variables total</th>
<th>Stack variables linear</th>
<th>Stack variables fwd</th>
<th>Local variables</th>
</tr>
</thead>
<tbody>
<tr>
<td>abs</td>
<td>4</td>
<td>9</td>
<td>21</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>daxpy</td>
<td>16</td>
<td>88</td>
<td>134</td>
<td>9</td>
<td>9</td>
<td>3</td>
<td>13</td>
</tr>
<tr>
<td>ddot</td>
<td>15</td>
<td>85</td>
<td>114</td>
<td>7</td>
<td>7</td>
<td>1</td>
<td>13</td>
</tr>
<tr>
<td>dgesfa</td>
<td>17</td>
<td>145</td>
<td>228</td>
<td>10</td>
<td>10</td>
<td>6</td>
<td>15</td>
</tr>
<tr>
<td>dgesl</td>
<td>21</td>
<td>191</td>
<td>321</td>
<td>12</td>
<td>12</td>
<td>7</td>
<td>13</td>
</tr>
<tr>
<td>dmxpy</td>
<td>7</td>
<td>30</td>
<td>52</td>
<td>8</td>
<td>8</td>
<td>4</td>
<td>8</td>
</tr>
<tr>
<td>dscal</td>
<td>10</td>
<td>46</td>
<td>77</td>
<td>6</td>
<td>6</td>
<td>2</td>
<td>9</td>
</tr>
<tr>
<td>epsilon</td>
<td>4</td>
<td>31</td>
<td>69</td>
<td>5</td>
<td>5</td>
<td>1</td>
<td>11</td>
</tr>
<tr>
<td>idamax</td>
<td>29</td>
<td>132</td>
<td>207</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>12</td>
</tr>
<tr>
<td>matgen</td>
<td>19</td>
<td>89</td>
<td>155</td>
<td>6</td>
<td>6</td>
<td>5</td>
<td>10</td>
</tr>
<tr>
<td>Linpack</td>
<td>1</td>
<td>27</td>
<td>48</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td>second</td>
<td>3</td>
<td>17</td>
<td>42</td>
<td>4</td>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>16</td>
<td>244</td>
<td>472</td>
<td>7</td>
<td>7</td>
<td>2</td>
<td>22</td>
</tr>
<tr>
<td>init</td>
<td>1</td>
<td>383</td>
<td>602</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>7</td>
</tr>
</tbody>
</table>

Table 8.7: Linpack results for scheme A: Translation JVM → IL

Table 8.7 shows for each method the number of basic blocks and the number of instructions of the JVM code as obtained from javac in the leftmost columns. For translation scheme A, the fourth column gives the number of resulting IL instructions. The increase over the number of JVM instructions results from the conversion of (single) JVM instructions operating on values of type `double` into pairs of IL-instructions and from additional instructions involving auxiliary variables or symbolic data. The remaining columns show the total number of stack variables $x_l$, the share of linearly used and forwardable stack variables, respectively, and the number of IL variables of the local-variable category. For example, method `abs` uses four stack variables, corresponding to the maximal operand stack height of the original JVM code. All four variables are used linearly each time they occur, and all but one are forwardable. Additionally, three local IL variables are used. The data given in the table confirms our characterisation of translation scheme A: all stack variables are used linearly as values.
<table>
<thead>
<tr>
<th>Method name</th>
<th>Basic blocks</th>
<th>Bytecode instrs</th>
<th>IL instrs</th>
<th>Stack variables</th>
<th>Local variables</th>
</tr>
</thead>
<tbody>
<tr>
<td>abs</td>
<td>4</td>
<td>9</td>
<td>23</td>
<td>11</td>
<td>11</td>
</tr>
<tr>
<td>daxpy</td>
<td>16</td>
<td>88</td>
<td>134</td>
<td>82</td>
<td>82</td>
</tr>
<tr>
<td>ddot</td>
<td>15</td>
<td>85</td>
<td>114</td>
<td>72</td>
<td>72</td>
</tr>
<tr>
<td>dgefa</td>
<td>17</td>
<td>145</td>
<td>228</td>
<td>126</td>
<td>126</td>
</tr>
<tr>
<td>dgesl</td>
<td>21</td>
<td>191</td>
<td>321</td>
<td>176</td>
<td>176</td>
</tr>
<tr>
<td>dmxpy</td>
<td>7</td>
<td>30</td>
<td>52</td>
<td>28</td>
<td>28</td>
</tr>
<tr>
<td>dscal</td>
<td>10</td>
<td>46</td>
<td>77</td>
<td>43</td>
<td>43</td>
</tr>
<tr>
<td>epsilon</td>
<td>4</td>
<td>31</td>
<td>69</td>
<td>43</td>
<td>43</td>
</tr>
<tr>
<td>idamax</td>
<td>29</td>
<td>132</td>
<td>215</td>
<td>119</td>
<td>119</td>
</tr>
<tr>
<td>matgen</td>
<td>19</td>
<td>89</td>
<td>157</td>
<td>75</td>
<td>75</td>
</tr>
<tr>
<td>Linpack</td>
<td>1</td>
<td>27</td>
<td>48</td>
<td>23</td>
<td>23</td>
</tr>
<tr>
<td>second</td>
<td>3</td>
<td>17</td>
<td>42</td>
<td>24</td>
<td>24</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>16</td>
<td>244</td>
<td>476</td>
<td>280</td>
<td>280</td>
</tr>
<tr>
<td>init</td>
<td>1</td>
<td>383</td>
<td>602</td>
<td>300</td>
<td>300</td>
</tr>
</tbody>
</table>

Table 8.8: Linpack results for scheme B: Translation JVM → IL

are popped from the stack during their first usage. However, the repeated usage of variables for communicating different entities results in drastic variations in forwardability. For some methods, all stack variables are forwardable (init, Linpack), while other methods do not benefit from forwarding at all (second). On average, 57.3% of the stack variables are forwardable.

The same data for conversion scheme B is given in Table 8.8. While the first two columns agree with Table 8.7, the differences in the third column result from compensation code at the boundary of basic blocks in the under scheme B. The fourth column shows that scheme B uses far more stack variables. Again confirming our expectations, these are all used linearly and forwardable. Our results thus indicate that the operand stack may de facto be implemented using forwarding. This is of particular interest for compilation into native code as is performed by just-in-time compilers like HotSpot [Sun01] or high-performance compilers like Jalapeno [AAB+00].
### Table 8.9: Linpack results for scheme A: Translation JVM $\rightarrow$ ALEF

<table>
<thead>
<tr>
<th>Method name</th>
<th>ALEF instructions</th>
<th>Registers no fwd</th>
<th>fwd</th>
<th>Stack registers no fwd</th>
<th>fwd</th>
<th>Queues no fwd</th>
<th>fwd</th>
</tr>
</thead>
<tbody>
<tr>
<td>abs</td>
<td>21</td>
<td>9</td>
<td>6</td>
<td>4</td>
<td>1</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>daxpy</td>
<td>134</td>
<td>23</td>
<td>20</td>
<td>9</td>
<td>6</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>ddot</td>
<td>114</td>
<td>22</td>
<td>21</td>
<td>7</td>
<td>6</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>dgefa</td>
<td>228</td>
<td>26</td>
<td>20</td>
<td>10</td>
<td>4</td>
<td>0</td>
<td>6</td>
</tr>
<tr>
<td>dgesl</td>
<td>321</td>
<td>27</td>
<td>20</td>
<td>12</td>
<td>5</td>
<td>0</td>
<td>7</td>
</tr>
<tr>
<td>dmxpy</td>
<td>52</td>
<td>17</td>
<td>13</td>
<td>8</td>
<td>4</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>dscal</td>
<td>77</td>
<td>16</td>
<td>13</td>
<td>6</td>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>epsilon</td>
<td>69</td>
<td>19</td>
<td>18</td>
<td>5</td>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>idamax</td>
<td>207</td>
<td>17</td>
<td>14</td>
<td>4</td>
<td>1</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>matgen</td>
<td>155</td>
<td>18</td>
<td>13</td>
<td>6</td>
<td>1</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>Linpack</td>
<td>48</td>
<td>6</td>
<td>3</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>second</td>
<td>42</td>
<td>8</td>
<td>8</td>
<td>4</td>
<td>4</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>472</td>
<td>31</td>
<td>29</td>
<td>7</td>
<td>5</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>init</td>
<td>602</td>
<td>14</td>
<td>9</td>
<td>5</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
</tbody>
</table>

**8.2.2 Translation into ALEF**

IL code resulting from the two conversion schemes was translated into ALEF using the approach of Chapters 6 and 7. For scheme A, Table 8.9 shows the size of the resulting ALEF programs, and the number of registers and operand queues needed. Register usage is given as the total number of allocated registers and as registers employed for stack variables (stack registers). For comparison, we also give the data resulting from an allocation where all variables are mapped to registers, regardless of their forwardability. Columns marked no fwd thus indicate register requirements in traditional assembly code, while the column marked fwd contain the values where all forwardable variables are indeed mapped to operand queues. The table shows that for those cases where there are enough forwardable variables, few queues suffice to reduce the number of (stack) registers significantly.
The number of registers and queues allocated is a means to evaluate the static costs of forwarding in terms of hardware needed for a processor implementation. However, one of the motivations for introducing forwarding queues is the reduction of dynamic costs such as the energy consumption during access operations. Therefore, a more relevant performance measure is the number of read and write operations to registers and queues during program execution. As a realistic simulation of the emitted ALEF code is precluded by the symbolic nature of our translation, we decided to approximate the number of access operations using a symbolic interpretation which executes each instruction once. Although the resulting control flow does not represent a legal program execution, we expect that the number of access operations along this trace gives a rough indication of the benefits of forwarding. Table 8.10 presents the resulting data, again for each function of the Linpack suite separately. The first two columns contain the total number of register read and write operations (including registers used for local and auxiliary variables) for the translation where no forwarding is used. The following four columns show the effect of forwarding, and give the number of read and write operations to registers and operand queues, respectively. Notice that the total number of read and write operations for the forwarding and non-forwarding cases agree. Differences between the number of reads and writes for operand queues result from the hypothetical trace in combination with non-empty operand stacks at basic block boundaries in JVM programs.

Again, the performance benefit of translation scheme A varies across the different methods. Forwarding is (obviously) of no benefit if no forwardable variables exist (method seccond), and yields little performance benefit for programs with few forwardable stack variables (methods dscal, ddot and epslon). On the other hand, for JVM programs where a large percentage of operand stack positions can be turned into forwardable variables, more than half of the dynamic register access operations may be replaced by queue operations (method Linpack). On average, forwarding reduces the number of register read and write operations to 76.8% and 79.1%, respectively, of the corresponding values for the non-forwarding case.

For conversion scheme B, registers and queues allocated and the number of access operations are shown in Tables 8.11 and 8.12. Regarding the number of allocated
## 8.2. Results

<table>
<thead>
<tr>
<th>Method name</th>
<th>No forwarding Registers</th>
<th>Forwarding Registers</th>
<th>Queues</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>R</td>
<td>W</td>
<td>R</td>
</tr>
<tr>
<td>abs</td>
<td>15</td>
<td>17</td>
<td>10</td>
</tr>
<tr>
<td>daxpy</td>
<td>133</td>
<td>118</td>
<td>115</td>
</tr>
<tr>
<td>ddot</td>
<td>114</td>
<td>99</td>
<td>112</td>
</tr>
<tr>
<td>dgefa</td>
<td>206</td>
<td>211</td>
<td>196</td>
</tr>
<tr>
<td>dgesl</td>
<td>291</td>
<td>300</td>
<td>234</td>
</tr>
<tr>
<td>dmxpy</td>
<td>48</td>
<td>45</td>
<td>32</td>
</tr>
<tr>
<td>dscal</td>
<td>76</td>
<td>67</td>
<td>72</td>
</tr>
<tr>
<td>epsilon</td>
<td>61</td>
<td>65</td>
<td>59</td>
</tr>
<tr>
<td>idamax</td>
<td>173</td>
<td>178</td>
<td>114</td>
</tr>
<tr>
<td>matgen</td>
<td>126</td>
<td>136</td>
<td>81</td>
</tr>
<tr>
<td>Linpack</td>
<td>33</td>
<td>47</td>
<td>10</td>
</tr>
<tr>
<td>second</td>
<td>31</td>
<td>39</td>
<td>31</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>417</td>
<td>456</td>
<td>401</td>
</tr>
<tr>
<td>init</td>
<td>448</td>
<td>601</td>
<td>148</td>
</tr>
</tbody>
</table>

Table 8.10: Register/queue access operations (scheme A)
<table>
<thead>
<tr>
<th>Method name</th>
<th>ALEF instructions</th>
<th>Registers no fwd</th>
<th>Registers fwd</th>
<th>Stack registers no fwd</th>
<th>Stack registers fwd</th>
<th>Queues no fwd</th>
<th>Queues fwd</th>
</tr>
</thead>
<tbody>
<tr>
<td>abs</td>
<td>23</td>
<td>9</td>
<td>5</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>daxpy</td>
<td>134</td>
<td>23</td>
<td>14</td>
<td>9</td>
<td>0</td>
<td>0</td>
<td>4</td>
</tr>
<tr>
<td>ddot</td>
<td>114</td>
<td>22</td>
<td>15</td>
<td>7</td>
<td>0</td>
<td>0</td>
<td>4</td>
</tr>
<tr>
<td>dgefa</td>
<td>228</td>
<td>26</td>
<td>16</td>
<td>10</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>dgesl</td>
<td>321</td>
<td>27</td>
<td>15</td>
<td>12</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>dmxpy</td>
<td>52</td>
<td>17</td>
<td>9</td>
<td>8</td>
<td>0</td>
<td>0</td>
<td>4</td>
</tr>
<tr>
<td>dscal</td>
<td>77</td>
<td>16</td>
<td>10</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>epslon</td>
<td>69</td>
<td>19</td>
<td>14</td>
<td>5</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>idamax</td>
<td>215</td>
<td>17</td>
<td>13</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>matgen</td>
<td>157</td>
<td>18</td>
<td>12</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>Linpack</td>
<td>48</td>
<td>6</td>
<td>3</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>second</td>
<td>42</td>
<td>8</td>
<td>4</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>476</td>
<td>31</td>
<td>24</td>
<td>7</td>
<td>0</td>
<td>0</td>
<td>6</td>
</tr>
<tr>
<td>init</td>
<td>602</td>
<td>14</td>
<td>9</td>
<td>5</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
</tbody>
</table>

Table 8.11: Linpack results for scheme B: Translation JVM → ALEF
registers, Table 8.11 confirms that all stack registers may be replaced by forwarding queues and again, a low number of queues suffices. Also, if no forwarding is used, the number of registers holding stack variables (column five of Table 8.11) agrees with the height of the operand stack (column five of Tables 8.7 or 8.9). This is a consequence of the instruction order in the translated program and will be briefly discussed in the next section.

As all stack variables are forwarded, the effect on the dynamic number of access operations is even more favourable than that of scheme A. Table 8.12 indicates that stack variables are responsible for a large amount of the total dynamic register usage. Replacing these registers by forwarding eliminates at least half of the register read operations, with best case reductions of up to 75% (method second) and average reduction of 65%. At the same time, register write operations are reduced also by up to 75% (method ddot), with an average reduction of 62%.

8.3 Discussion

The results indicate that many opportunities for forwarding exist in application programs, which can be extracted via a compilation to JVML and subsequent conversion of the operand stack into forwarding. Our translation followed a simple pattern where the order of IL (and ALEF) instructions corresponds to the order of JVM instructions. While this is sufficient for analysing forwardability, it is wasteful regarding the number of registers and operand queues allocated, as these favour different instruction orders. A pathological example is the JVM instruction sequence

\[
\text{iconst } 1 \quad \ldots \quad \text{iconst } 10 \quad \text{iadd } \quad \ldots \quad \text{iadd}
\]

which (for scheme B) results in IL code

\[
x_0 = 1 \quad \ldots \quad x_9 = 10 \quad x_{10} = x_8 + x_9 \quad \ldots \quad x_{18} = x_{17} + x_0.
\]

The operands are consumed in the reverse order of their creation. Consequently, the operand stack height cannot be converted into the length of a single operand queue, and ten queues are required. Reordering the IL instructions to

\[
x_8 = 9 \quad x_9 = 10 \quad x_{10} = x_8 + x_9 \quad x_7 = 8 \quad x_{10} = x_8 + x_{10} \ldots
\]
<table>
<thead>
<tr>
<th>Method name</th>
<th>No forwarding</th>
<th></th>
<th>Forwarding</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Registers</td>
<td></td>
<td>Registers</td>
<td></td>
<td>Queues</td>
</tr>
<tr>
<td></td>
<td>R</td>
<td>W</td>
<td>R</td>
<td>W</td>
<td>R</td>
</tr>
<tr>
<td>abs</td>
<td>17</td>
<td>19</td>
<td>6</td>
<td>6</td>
<td>11</td>
</tr>
<tr>
<td>daxpy</td>
<td>133</td>
<td>118</td>
<td>51</td>
<td>36</td>
<td>82</td>
</tr>
<tr>
<td>ddot</td>
<td>114</td>
<td>99</td>
<td>41</td>
<td>26</td>
<td>73</td>
</tr>
<tr>
<td>dgefa</td>
<td>206</td>
<td>211</td>
<td>80</td>
<td>85</td>
<td>126</td>
</tr>
<tr>
<td>dgesl</td>
<td>291</td>
<td>300</td>
<td>115</td>
<td>124</td>
<td>176</td>
</tr>
<tr>
<td>dmxpy</td>
<td>48</td>
<td>45</td>
<td>20</td>
<td>17</td>
<td>28</td>
</tr>
<tr>
<td>dscal</td>
<td>76</td>
<td>67</td>
<td>33</td>
<td>24</td>
<td>43</td>
</tr>
<tr>
<td>epsilon</td>
<td>61</td>
<td>65</td>
<td>18</td>
<td>22</td>
<td>43</td>
</tr>
<tr>
<td>idamax</td>
<td>181</td>
<td>186</td>
<td>62</td>
<td>59</td>
<td>119</td>
</tr>
<tr>
<td>matgen</td>
<td>126</td>
<td>138</td>
<td>51</td>
<td>61</td>
<td>75</td>
</tr>
<tr>
<td>Linpack</td>
<td>33</td>
<td>47</td>
<td>10</td>
<td>24</td>
<td>23</td>
</tr>
<tr>
<td>second</td>
<td>31</td>
<td>39</td>
<td>7</td>
<td>15</td>
<td>24</td>
</tr>
<tr>
<td>run_benchmark</td>
<td>409</td>
<td>460</td>
<td>129</td>
<td>176</td>
<td>280</td>
</tr>
<tr>
<td>init</td>
<td>448</td>
<td>601</td>
<td>148</td>
<td>301</td>
<td>300</td>
</tr>
</tbody>
</table>

Table 8.12: Register/queue access operations (scheme B)
results in a program which needs two registers if no forwarding is used and one queue in the presence of forwarding. In addition to implementing more sophisticated allocation algorithms, a compiler may consequently reduce the number of registers and queues by reordering the IL instructions. The task of transforming a stack oriented instruction order into a queue oriented one resembles the task of optimising the evaluation order of expressions. We therefore expect that modifications of standard algorithms [ASU86] will lead to improvements in Tables 8.9 and 8.11. Furthermore, code optimisations as discussed in Section 7.7 and the usage of dedicated floating-point instructions should be beneficial for decreasing the instruction count and the number of operands communicated.

Finally, the exceptional case of scheme $B$ where a variable is communicated to different branches and read by instructions executing on different functional units did not occur in our experiments. This may either be a characteristic of the chosen benchmark programs or represent an artefact of javac’s compilation strategy. Code inspection suggests that the latter is that case as javac uses the operand stack almost exclusively for references and intermediate values during the evaluation of expressions while Java program variables result in local JVM variables. Consequently, most basic blocks produced by javac operate on an initially empty stack and also deliver an empty stack. Further exploration of different compilation schemes of Java programs into JVM and development of direct translation from Java into IL are needed to evaluate the benefits of forwarding across basic block boundaries more thoroughly.
Chapter 9

Discussion

9.1 Summary

This thesis studied computational models resulting from the interaction between asynchronous processor operation and operand forwarding. We

- proposed to employ programming language concepts such as the separation into static and dynamic aspects using structural techniques

- unified existing architectures with forwarding by a novel computational model, asynchronous queue machines

- introduced an assembly language which makes forwarding explicit and supports the formal study of various dynamic models of operation, based on structured operational semantics

- demonstrated that typing calculi support the reasoning about forwarding schemes

- presented the program analysis for forwardability at the level of an intermediate language using dataflow equations

- exhibited a compilation approach which transforms intermediate language programs into assembly programs whose low-level structure is guaranteed to conform to the type systems
• supported the theoretical investigations with results obtained from a prototypical implementation.

To our knowledge, this work represents the first systematic investigation into the interactions between program analysis, asynchronous architectures and forwarding. Although we concentrated on very limited operational models and simplifying architectural abstractions, we argue that this thesis demonstrates the validity of the advocated approach. The implementational results show that despite these restrictions a compile-time analysis can uncover a relevant number of forwarding opportunities. This motivates the further development of mathematically rigorous reasoning techniques at the boundary between computer architecture, program analysis and optimisation, and the scheduling of communication.

9.2 Shortcomings and future work

We finally outline some topics for future research, ranging from more detailed implementations of the proposed calculi to more ambitious research into advanced program analysis and applications of typing frameworks.

9.2.1 Full implementation

Our implementation concentrated on the most basic analysis of regular programs and their translation into ALEF. A future implementation should realise some of the modifications mentioned at the end of Chapter 7 including multiple forwarding, balancing of branches with different usage patterns and compilation of SSA code. This implementation should interface to a compilation framework including other optimisation phases and better graph colouring algorithms. The target language should be an extended ALEF instruction set with integer and floating point operations, and the operational semantics should include cost measures regarding the latency and energy consumption of registers and operand queues. This would enable a detailed exploration of different compilation strategies and forwarding schemes for a high-level programming language. Benefits for just-in-time compilation of JVM code could subsequently be explored by turning the two symbolic conversion schemes into full translations.
9.2.2 Extensions of the computational model

The computational model of ALEF ignores several modern architectural issues, including speculation, branch prediction, and precise interrupts, and does not support procedure calls. In the future, the interaction of all these topics with forwarding should be explored, and we hope that formalisms such as the ones we presented will be useful for this task.

The forwarding capabilities of ALEF’s computational model are less powerful than dynamic forwarding schemes of some hardware-based implementations. These restrictions appeared necessary for achieving an initial understanding of forwarding and for developing the presented reasoning formalisms. Relaxing these limitations could be supported by our techniques in three stages.

9.2.2.1 Dependent forwarding

As operand queue names and registers appear fixed in an ALEF instruction’s opcode, the decision whether and to which queue a value should be forwarded must be made at compile-time. When motivating the introduction of registers, we saw that this scheme results in additional instructions if values are repeatedly accessed or used by different functional units. Operationally, it is not difficult to extend the instruction set to enable more flexible forwarding patterns. For example, assigning different destinations to forwarded values depending on the outcome of a branch may be modelled by splitting branch instructions into one instruction which computes the outcome of a branch condition and one instruction which actually performs the jump operation. The Fred architecture [RB96] employed a similar scheme, where code inside basic blocks is subsequently reordered to allow the former instruction to be computed earlier. As a result, instructions of the target basic blocks may be loaded from memory earlier, similar to our scheme of interleaving program execution and refilling of instruction pipelines in Chapter 4. Likewise, current architectures often store the branch status in a dedicated register inside the branch unit [HP96]. Making this value explicitly available to other instructions yields the above dependent forwarding. Unfortunately, incorporating this dependency in the static semantics appears not to be equally straight-forward. The interaction between control flow (outcome of branch conditions) and data flow (forward-
Chapter 9. Discussion

In principle, the capturing of behaviour may be achieved through dependent types such as the dependent function space with argument type \( \text{bool} \). Future work is needed to see how this can be integrated into the rule for composing basic blocks and how the effect of successive branch instructions can be modelled correctly. As an alternative to dependent types in this context, union and intersection types should be explored.

### 9.2.2.2 Indirect forwarding

Additional forwarding flexibility may be obtained by introducing instructions which can issue forwarding request or modify the destination queue of other values or the queue in which an instruction expects its operands. At first, this may be achieved by interpreting some values as the index of operand queues. A more structured approach turns operand queue names into forwardable entities. Ultimately, this yields computational models with *dynamically programmable forwarding* where instructions may influence each other’s forwarding behaviour by communicating appropriate queue names. Both schemes are expected to admit extremely powerful forwarding schemes which are difficult to reason about without formal support. On the other hand, type theory and concurrency theory offer calculi for combining types with computation, and for communicating names. These calculi might give guidance as to what restrictions need to be imposed on forwarding disciplines in order to obtain flexible and yet analysable behaviour.

### 9.2.2.3 Multi-clustered architectures

Advancing integration density offers the opportunity to implement multi-processor architectures on a single chip. For example, [ONH+96]’s architecture contains several computational nodes, each equipped with some local memory. Nodes are connected by a communication network and may share data using global on-chip caches. Alternatively, multi-threaded architectures [TEL95] execute instructions from different processor contexts simultaneously. These may either compute a given task cooperatively by operating mainly independently and exchanging results only occasionally, or may stem from independent application programs. Understanding and verifying the issue logic and the interaction between different components will become more diffi-
9.2. Shortcomings and future work

cult as the number of components increases. In our asynchronous framework, nodes of such processors may be modelled as a collection of functional units and local forwarding capabilities, with additional operand queues and registers for global data exchange. The set of forwarding queues would consequently be partitioned into private queues for exchanging data within a node and public queues for communication between nodes. In order to support reasoning about concurrent execution in a compositional way, these abstraction barriers should be reflected in the type system, so that local forwarding may be reasoned about locally before the global communication is considered.

Again, similar topics have been studied in concurrency theory, and solutions such as restricting the visibility of (operand queue) names should enable one to incorporate the distinction public/private into the type system.

9.2.3 Connections to linear logic and other intermediate forms

We employed a rather intuitive interpretation of the linear logic-based notation, without attempting to establish deeper connections to formal proof theory. A more detailed exploration might not only help in modelling more complex forwarding schemes using other (linear) logical connectives, but might also allow us to relate executional models of different granularity. For example, different interleavings in the distributed model of execution resemble the structure of different derivations for the same final sequent. A particularly useful result would be a logics-based mechanism to relate the instruction-level interleaving to the fine-grained interleaving which arises from the small-step transition system of the micro-instructions read and write.

Since its inception [Gir86], linear logic has been seen as a logic for reasoning about resource consumption. Most widely known are its applications to concurrency formalisms such as the π-calculus [BS94] and Petri nets [MOM91] [EW90] [BG90], and to functional programming languages [Wad90] [TW99] [Bar96]. In particular, the latter work aims to detect linear usage of values in functional programs, similar to our usage analysis in Chapter 6. Due to the high level of functional programming languages, these calculi are not immediately applicable to the analysis of low-level architectural behaviour. However, if these calculi could be transferred from high-level programming language to functional intermediate or low-level forms [App98a], [FSDF93], forward-
ability analysis may be possible. In fact, functional intermediate forms and SSA are closely related [App98b], suggesting that it might be possible to obtain a structural correspondence between linearity types for a functional language and the dataflow solutions in our imperative intermediate language. A particular challenge will be to avoid the overhead of copying or moving operands at the boundaries of basic blocks which appears to arise in the translation of [App98b]. A possible compromise between functional abstraction and low-level behaviour might be to investigate the relation between \( \lambda \)-lifting/dropping [Dan99] and control flow properties such as the dominance property in the imperative intermediate language, similar to lightweight closures [SW97]. An alternative may be to follow the sequence of compilation steps of [MWCG98].

### 9.2.4 Explicitness of architectural features

The decision to model operand forwarding explicitly on the assembly level made the detailed calculus dependent on architectural configurations. We discussed the benefits of this design decision for reasoning and verification, as well as for compiler-based optimisation of operand communication. Future work should explore whether our calculi can be made parametric in configurations (see below) and how they apply to existing hardware-based forwarding schemes in more detail. Practically, algorithms for scheduling the communication of operands should be developed which are parametric in the performance and availability of individual forwarding paths. In addition to yielding specific compiler optimisation phases, the inclusion of performance characteristics in architectural descriptions would then allow one to study and compare architectural families abstractly by classifying them according to the performance of the various operand communication mechanisms.

More generally, future work should evaluate programming language based reasoning for other architectures, synchronous or asynchronous. As was alluded to in the discussion of Chapter 4, (virtually) reconfigurable architectures might be particularly well-suited due to their interaction between configurational changes and program execution. Other recent architectures which exposed the communication structure to the compiler and should thus profit from static semantics include the already mentioned TTA and RAW models of computation, or digital signal processors.
9.2.5 Formalisations

We expect that most of the material in Chapters 3 to 5 can be formalised in a state-of-the-art theorem prover such as Isabelle/HOL or PVS. This might not only reveal more insight into the subtleties involved in relating the operational models to each other and to the static semantics, but would also link our work to formalisations of traditional hardware implementations of forwarding schemes such as Tomasulo’s algorithm [McM98] [HSG98] and score-boards [MP96].

Our architectural models were restricted as mappings of instructions to functional units, number and types of functional units and bindings of operand and instruction queues to functional units were fixed. In contrast, a future compiler should be parametric in architectural descriptions of the target machine. At least two generalisations would be desirable. Firstly, verification should be modular in the configuration of an architectural scheme. For example, a system designer might need to explore the consequences of introducing a second functional unit of type ALU or of associating a different number of operand queues to a particular functional unit. Secondly, it may be desirable to vary the architectural constraints, for example by allowing functional units to be served by several instruction queues. Despite the conceptual simplicity, the entities of our models yield a high degree of design freedom: instruction queues might serve single or multiple functional units of identical or different type, operand queues might be associated to instruction queues or functional units (again either exclusively or non-exclusively), and the network might restrict the paths which a forwarded value may travel. Additional complexity arises if instructions may be allocated to instruction pipelines dynamically.

Evaluating the effect of any such architectural choice on the legality of forwarding schemes requires formal support. A type-theoretic solution may be to understand architectural configurations as contextual declarations which represent associations between entities of different categories:

\[ iq : fu : an\ instruction\ queue\ for\ instructions\ executing\ on\ functional\ unit\ fu \]
\[ fu : ALU : declaring\ functional\ unit\ fu\ to\ be\ of\ type\ ALU \]
\[ q : fu : allowing\ operand\ queue\ q\ to\ hold\ values\ sent\ from\ any\ functional\ unit\ to\ fu \]
\( q : fu' \rightarrow fu \): allowing operand queue \( q \) to hold values for the specific forwarding path \( fu' \rightarrow fu \)

\( q : iq \): an operand queue whose entries can only be consumed by instructions in instruction queue \( iq \)

In a first step, conformance of a program to a specific configuration is expressed by including the context in the typing derivation. For example, a rule such as

\[
\frac{\Gamma \vdash q_1 : iq \quad \Gamma \vdash iq : fu \quad \Gamma \vdash fu : ALU}{\Gamma \vdash [n]dec\ q_1\ q_2 : X \otimes q_1 \rightarrow X \otimes q_2}
\]

captures the fact that instruction \([n]dec\ q_1\ q_2\) may only be mapped to an instruction queue \( iq \) which serves a functional unit of type \( ALU \), where \( q_1 \) is associated to \( iq \). The three hypothesis are instantiations of a generic look-up rule

\[
\frac{x : y \in \Gamma}{\Gamma \vdash x : y}
\]

Consistency of configurations over the whole program is ensured in a cut rule

\[
\frac{\Gamma \vdash s : A \rightarrow B \quad \Gamma \vdash t : B \rightarrow C}{\Gamma \vdash st : A \rightarrow C}
\]

and a similar rule for combining basic blocks. Based on meta-theoretic results regarding substitution, weakening and strengthening, it may subsequently be possible to make the architectural configuration less opaque, and thus the typing calculi less dependent on exact architectural naming conventions.

In a second step, constraints on the formation of contexts may be used to classify architectural families. Examples are the restriction that instruction queues and functional units are in one-to-one correspondence or the stipulation that no forwarding path between two particular functional units exists.

This logical frameworks-perspective [HHP87] [Pfe01] should support the exploration of the relationship between architectural constraints and processor configurations and their consequences on forwarding schemes. In particular, we expect that serialisation constraints and their discharge may be incorporated in this framework by using judgements for pipeline-dependencies. Furthermore, such a formalisation would stress the connections between ALEF and the TIC/TAL/PCC work.
As mentioned in Chapter 2, we have had some initial success in formalising ALEF’s types using Ohori’s SCC [Oho99b]. The proof-theoretic foundations of transforming a language with a single, central stack and a single operational model (JVM) into one with several, distributed queues and multiple operational models (ALEF) are beyond the scope of this thesis, and left for future research. Notice however, that such an understanding might represent the formal basis of conversions between JVML and ALEF as presented in Chapter 8.

9.3 A future application

The need to verify structural correctness of assembly code has recently been recognised more widely as programs are dynamically (re)compiled and parts of applications are dynamically downloaded from untrustworthy sources. The concept of proof-carrying code (PCC) offers mathematically sound technology to enforce security properties and to ensure that performance requirements are met. In future, mobile applications will pose additional challenges regarding power consumption and resource capabilities of the underlying machinery. As statically fixed processor architectures appear unlikely to offer the flexibility necessary for applications of different domains, virtual architectures and reconfiguration will become more prevalent. In particular, it is anticipated that in addition to application programs, parts of configurational descriptions may only be available dynamically. This will require to specify available resources at abstract levels. Due to the inherent complexity of underlying hardware, reconfiguration and mobile code, formal verification techniques will be mandatory. A natural extension of the PCC approach under these conditions are proof-carrying processor configurations (PCPC). These bundle descriptions of configurations with formal proofs which are unforgeable, witness the sanity of the configurations with respect to the underlying hardware and certify the appropriateness for the intended application program.
Bibliography


[Ber01] Lennart Beringer. Typing assembly programs with explicit forwarding. In Naoki Kobayashi and Benjamin C. Pierce, editors, *Proceedings of


[BM92] Rod Burstall and James McKinna. Deliverables: A categorical approach to program development in type theory. Technical Report ECS-LFCS-


408, Laboratory for Computer Science, Massachusetts Institute of Technology, June 1998.


[Sha01] Zhong Shao. Type-based certifying compilation. In *Tutorial presented at the ACM SIGPLAN Conference on Programming Language*


[SSM94] Robert F. Sproull, Ivan E. Sutherland, and Charles E. Molnar. Counter-


