3. TYPES (Part 1)
This chapter addresses the data type facility for the RED language. The use of
the various type features is explained and illustrated, the major issues underlying
the design decisions are discussed, and alternative approaches are compared.
Section 3.1 presents the notions of type and subtype and analyzes a set of
implementation techniques which can be used for code optimization.
Section 3.2 examines the type equivalence issue and justifies the approach
taken by RED.
Section 3.3 discusses the "type opacity" question: when a type is declared in
terms of an underlying type, what is the relationship between the two? Literals and
constructors are also described here.
Section 3.4 covers a variety of issues concerning the predefined RED types
BOOL, INT, and FLOAT, and explains why no fixed point facility was included.
Section 3.5 discusses a number of topics connected with enumeration types.
Section 3.6 addresses the design decisions relevant to aggregate types (arrays,
records, unions, strings, sets).
Finally, Section 3.7 is concerned with indirect (i.e., "pointer") types, their
use, and their implementation.
3.1 THE NOTIONS OF TYPE AND SUBTYPE
3.1.1 BASIC CONCEPTS
One of the most troublesome errors in assembly language programming is to apply
to a data item an operation inappropriate to the way the data should be used. For
example, treating an integer bit pattern as though it were an address (or even an
instruction), or treating a floating point bit pattern as though it were a sequence of
characters. With the development of High Order Languages has come an understanding
that the best way to prevent such errors is via "strong typing" (SM 3A); that is, a
program's data are partitioned into (disjoint) categories called types and only such
operations may be applied to a data item as are consistent with its type.
In many cases, however, the notion of "type" is not sufficiently restrictive.
To illustrate this point, we consider the integer data type (INT, in RED). In the
interest of portability, readability and efficiency, Steelman 3-1C requires that the
lower and upper bounds of each INT variable be specified explicitly by the
programmer. These range bounds cannot be considered to be part of the type -- i.e.,
it would be incorrect for the language to define INT(i..j) as a type -- for two
reasons:
- The bounds may be run-time determinable (3-1C), whereas the type must be
known at translation time (3A).
- It must be possible to write an assignment statement where source and
destination have different ranges, without requiring an explicit conversion
(3-1C), but implicit conversions between types are prohibited (3B).
Range bounds do not define new types, but rather serve to constrain the value space
of the (single) INT type and are called attributes associated with the INT type. A
type, together with values for its attributes, is called a subtype. Thus INT(1..10)
and INT(-i..i) are INT subtypes, denoting the integer ranges 1,2,...,10 and
-i,...,-1,0,1,...,i, respectively.
In RED, the notion of subtype applies uniformly to all types, whether built-in
or user-defined. A subtype must be specified for each variable at the time of its
allocation and cannot be changed during the lifetime of the variable. The values
for attributes may be computed at run-time and interrogated via a "dot
qualification" form. Either a subtype or a type may be specified at the declaration
of a formal parameter to a routine. Allowing a type with no subtype constraints to
appear in such contexts is extremely useful (and is, in fact,
required by SM 7G);
e.g., a WRITE procedure must be able to output INT values from arbitrary INT
subtypes.
The requirement in RED that subtype constraints be specified for variables is
somewhat stronger than is necessary to satisfy SM 3D which states that "the value of
a subtype constraint for a variable may be specified when the variable is declared."
We adopted the more restrictive (but uniform) rule in the interest of language
simplicity, readability, and efficiency. Note, however, that subtype constraints
are not necessary when a constant is declared. That is, the user may simply declare
CONST message := "MANUAL INTERVENTION REQUIRED";
instead of counting characters as in
CONST message : STRING[ASCII](28) := "MANUAL INTERVENTION REQUIRED";
As an illustration of the principles of type and subtype, it is straightforward
to write a function which finds the maximum value in an array of FLOATs
and to invoke the function on actual arrays of FLOATs:
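(A sketch only; the FUNC form and the names max_val, v1, v2, and m are assumptions,
as are the particular bounds. The point is that the formal parameter is specified
with the type ARRAY INT OF FLOAT, while each actual parameter has a specific subtype.)

     FUNC max_val(a : ARRAY INT OF FLOAT) => FLOAT;
        ...
     END FUNC max_val;

     VAR v1 : ARRAY INT(1..10) OF FLOAT(8, -1000.0 .. 1000.0);
     VAR v2 : ARRAY INT(1..100) OF FLOAT(4, 0.0 .. 1.0);
     VAR m : FLOAT(8, -1000.0 .. 1000.0);
     m := max_val(v1);
     m := max_val(v2);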
Note that attempted declarations which omit the subtype specification, such as the
following, are illegal, since subtype specifications are required for variable
declarations:
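(Illustrative only; the identifiers are arbitrary.)

     VAR a : INT;
     VAR b : FLOAT;
     VAR c : ARRAY INT OF FLOAT;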
A dot qualification form is used to interrogate the values of attributes, with
the attribute name used as the selector. This convention applies uniformly across
all types, built-in and user-defined. Thus, consider declarations of the following form.
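(A sketch; vector is assumed to be a user-defined type whose single formal parameter
is named n.)

     VAR i : INT(1..10);
     VAR v : vector(8);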
With these declarations we have i.MIN=1 and i.MAX=10 (MIN and MAX denote the bounds
attributes for INT data), and v.n=8 (the formal parameter identifiers in type
declarations denote the attributes associated with user-defined types). The same
notation was chosen for
both attribute inquiry and record component selection, since conceptually the user
can regard a data item as though it were a record comprising the values for the
attributes (constant for the lifetime of the data item) and a data value (modifiable
if the data item is a variable). We emphasize that in practice it is not always
necessary to store the values for attributes at run-time; this issue is discussed in
the next section.
3.1.2 IMPLEMENTATION ALTERNATIVES
Avoiding Run-Time "Dope Vectors"
A data item has a type (known at translation time) and a set of values for the
attributes associated with the type (in general, not known until run time). A
natural implementation is to store the subtype information -- the values for the
attributes -- at run time along with the data value (e.g., in a "dope vector").1 A
typical case where this would occur is array data. If a routine has a formal
parameter which is, say, an ARRAY INT OF BOOL, then the lower and upper bounds of
the array index can be passed at run time with the actual parameter to each
invocation of the routine.
This does not imply that such "dope vector" information must always be stored
at run time; there are important cases where the translator can effect considerable
efficiency by maintaining subtype information at compile time. For example, if the
user declares
VAR x : INT(1..8);
then the translator can keep track of the bounds values for x. Suppose that x is
passed as an actual parameter to a routine R which expects a formal parameter f of
subtype INT(m..n) where m and n are manifest. (The bounds of f can also be
maintained by the translator.) Independent of the binding class for f, no run-time
dope vector is needed for x:
- In the case of VAR or READONLY, the translator can check the validity of the
invocation R(x); the subtypes must be identical.
- With the other bindings, a run-time check may be necessary to ensure that x
is in the range m..n (before entry to R, when f is CONST), or that f is in
the range 1..8 (after exit from R, when f is OUT). The range bounds,
however, need not be stored with either the actual or the formal parameter
at run time.
If m or n is not manifest, then the translator can generate code which first
produces a dope vector for x comprising 1 and 8, and then passes as the actual
parameter the dope vector together with the value of x. Thus the dope vector need
only be generated when it is required.
What does the translator do if the formal parameter is specified simply with a
type (INT) and not a subtype? In this case it can certainly take the second
approach mentioned above and generate the dope vector at the point of invocation.
If it is important to conserve storage for the executable code bodies, at the
possible expense of data storage and execution time, then such an approach can be
adopted. There are, however, two alternative techniques which can be employed.
First, the translator can regard the routine as though it were generic with respect
to the subtype of the formal parameter (it might adopt such an approach if the
routine contains the INLINE pragmat). For example, given the declaration
PROC p(f : INT); ... END PROC p;
the translator can produce separate versions of p for each different manifest
subtype of the actual parameters to invocations of p; if p is invoked with an actual
parameter whose subtype is not manifest, then an instance of p can be generated
which expects the range bounds of its formal parameter in a run-time dope vector.
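For example, with invocations of the following sort (the bounds are illustrative),
two instances of p could be produced, one tailored to the subtype INT(1..100) and
one to INT(-1000000..1000000):

     VAR h : INT(1..100);
     VAR d : INT(-1000000..1000000);
     p(h);
     p(d);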
As a second approach, the translator can avoid passing "dope vector"
information without relying on such expansion. The main point is that if the
attributes of a routine's formal parameter are not interrogated (directly or
indirectly) from the routine, then there is no need to store the values for these
attributes with the actual parameter when the routine is invoked. This is a
particularly useful optimization for CONST and READONLY parameters, especially for
attributes such as range which are relevant only for the target of an assignment and
which are thus not likely to be interrogated. In the above example, if the MIN and
MAX attributes of f are never used, then the range bounds for x need not be passed
at the invocation p(x) -- even when these bounds are not manifest. Note that, to
perform this optimization, the translator must check if f is passed as an actual
parameter to a routine which interrogates its formal parameter's attributes, etc.
Types Supporting Multiple Representations
As a general principle of language design, it is advisable to constrain data
types to single representations. In fact, one of the changes from Ironman to
Steelman was the deletion of the requirement that the language support multiple
representations per type. The basic problem with multiple representations is that
efficiency may be sacrificed (via run-time code to perform conversions between
representations) without the programmer's knowledge.
Although the full generality of multiple representations per type is not
justified, there are some important special cases where efficiency can be gained,
rather than lost, when the translator implements the same type under different
storage schemes. The two most frequent cases are INT and FLOAT, since many machine
architectures support integer data of different lengths (e.g., halfword, word,
doubleword) and floating point data of different precisions (e.g., single and
double). The implementation issue may be summarized as follows: how can the
translator select the smallest storage unit compatible with the declaration of a
data item, and how can a routine be compiled if a formal parameter is specified with
type (rather than subtype)? For ease of exposition, we treat the INT and FLOAT
cases separately.
The selection of a storage unit for an INT variable can proceed as follows:
- When the range bounds are known at translation time, the minimum storage
unit which can support that range may be chosen.
- When the range bounds are determined at run time, it will frequently be the
case that the bounds of the range bounds are known at translation time. For
example, given declarations of the form sketched after this list, the
translator can determine that the maximal range for x is
INT(m.MIN .. n.MAX), i.e., INT(0..1000). Thus the minimal storage unit
supporting this range can be chosen. (Note that the actual range m..n is
still used for checking on assignments to x; the bounds m and n will in
general be stored at run time.)
- When the range bounds are determined at run time but the approach of case
(2) is inapplicable, the translator can choose a storage unit consistent
with its efficiency considerations. In particular, it need not make "worst
case" assumptions; e.g., it may adopt full-word integers as the standard
representation and provide double-word representations only when manifest
bounds require them.
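As an illustration of the second case above, consider declarations of the following
form (the particular bounds are assumed for the sake of the example):

     VAR m : INT(0..100);
     VAR n : INT(900..1000);
     VAR x : INT(m..n);

Here m and n have manifest bounds, so the maximal range for x is INT(0..1000) even
though the actual bounds of x, namely m and n, are run-time values.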
The preceding discussion focused on the translator's choice of a storage unit
for an INT variable based on the subtype information specified for it. What happens
when a formal INT parameter is declared and no subtype information is present,
either on the formal parameter or through ASSERTions? One approach is to treat the
routine as though it were generic with respect to subtype, as discussed earlier in
connection with dope vector avoidance. Alternatively, the translator can treat the
routine as though it were generic with respect to the representation. That is, each
invocation of the routine passes an INT of some particular representation (halfword,
word, doubleword), and a body for the routine can be compiled with the formal
parameter having the same storage unit as the actual. For example, if x and y are
halfword INTs, x+y could be compiled into halfword arithmetic; if they are full-word
INTs, full-word arithmetic code would be generated. It is important to note in this
regard that the language does not guarantee a unique maximal range for the results
of arithmetic operations; this allows the implementation to produce either a
halfword, fullword, or doubleword as the result of x+y, based on the sizes of the
operands and the size of the target for the result. Note also that if the
translator chooses the above approach and treats a routine with a formal INT
parameter as though it were generic with respect to representation, then it may
perform the analysis described earlier to avoid passing subtype information at run
time. That is, if the attributes of the parameter are not interrogated (directly or
indirectly) by the invoked routine, they need not be passed at run time.
The implementation of multiple representations for the FLOAT type is analogous
to INT, with respect to both the selection of storage size and the translation of
routines taking formal FLOAT parameters with no subtype information.
The determination of storage size is actually somewhat easier in the case of
FLOAT than with INT, since the specified precision when a FLOAT variable is declared
must be manifest (SM 3-1D). Thus the translator can choose a storage size (and
actual precision) large enough to support the specified precision, subject to a
target-machine dependent limit (e.g., double precision).
When a routine is declared with one or more formal FLOAT parameters lacking
subtype information, the translator has considerable flexibility regarding
implementation. At one extreme, the translator may represent all FLOAT data at
maximal actual precision and pass the subtype information (specified precision,
range bounds) as part of a run-time dope vector. This minimizes the number of code
bodies at the expense of run-time data storage and execution. Although clearly
unacceptable in production environments, such a strategy may be appropriate for
"debugging" translators, where speed of translation rather than object code
efficiency is important. At the other extreme, the translator may treat the routine
as though it were generic with respect to subtype. Thus the routine would have a
separate instantiation for each different manifest subtype of actual parameters in
invocations, as well as instantiations which expect the range bounds in a dope
vector. For example, if x and y were declared to be of subtypes
FLOAT(4, 0.0 .. 1.0) and FLOAT(8, -10.0 .. 10.0) then the translator could produce
code for x+y which was tailored to these subtypes, taking into account that x has
single precision and y has double precision. This strategy conserves execution time
and data space at the expense of code space.
In practice, we expect a production-quality translator to take a compromise
position between these extremes and in the process generate efficient run-time code
without proliferating code bodies. In particular, the following strategy is
recommended:
- Treat routines as generic with respect to specified precision but not with
respect to range. Thus, with the above example, code for x+y could be
tailored to precisions 4 and 8.
- To avoid passing range bounds at run time, the techniques discussed earlier
in this section can be employed. For example, since the "+" routine does
not require the range bounds of its actual parameters (just a knowledge of
the precisions is sufficient), there is no need to pass these at run-time.
Note that in many cases the translator can optimize code space by treating a FLOAT
routine as though it were generic with respect to actual (instead of specified)
precision. In particular, if the specified precision of the formal parameter is not
interrogated (directly or indirectly) in the routine, then such an optimization can
be employed. For example, suppose that the user declares the following square root
function:
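(A sketch in outline only; the FUNC form is an assumption, and the body is elided.
What matters for the discussion is that the body passes x to initial_approx and
number_iterations and uses x as an operand of "/", without interrogating x's
specified precision.)

     FUNC sqrt(x : FLOAT) => FLOAT;
        ...
     END FUNC sqrt;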
Provided that the routines to which x is passed as an actual parameter
(number_iterations, initial_approx, "/") do not interrogate the specified precision
of their formal parameters, the translator can produce just two versions of sqrt --
one for single precision and one for double -- without passing specified precision
at run-time.
Use of Type Declarations
In the preceding discussion we have looked at techniques which the translator
can use to implement subtypes efficiently. It should be noted that the programmer
can directly effect the production of efficient code, at the sacrifice of some
generality, by turning "subtype" information into "type" information. Specifically,
this allows the translator to avoid storing such information at run time, since it
must be manifest and will be the same for each object of the type. The following
example shows how specified precision can thus be made to distinguish type:
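(A sketch; the TYPE declaration form is an assumption, by analogy with the ABBREV
form illustrated in Section 3.2, and the bounds are illustrative.)

     TYPE short_real : FLOAT(4, -1000.0 .. 1000.0);
     TYPE long_real : FLOAT(8, -1000.0 .. 1000.0);

     VAR s : short_real;
     VAR d : long_real;

Because short_real and long_real are distinct types, the precision of every object is
manifest from its type, and a routine written for one of them need be compiled for
only one representation.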
Summary
In this section we have addressed the main issues concerning the implementation
of subtypes, focusing particularly upon techniques for maintaining and interrogating
subtype information at translation time instead of run time. The basic strategy is
for the translator to apply the same techniques which are used in the implementation
of the generic facility. An analysis of the application of these methods to two
important cases, the INT and FLOAT types, demonstrated how the translator can
produce efficient object code and take advantage of target-machine specific storage
representations.
3.2 TYPE EQUIVALENCE
3.2.1 INTRODUCTION
In a strongly typed language, one of the most important issues which must be
addressed is that of type equivalence2; i.e., when are two data items considered to
have the same type? The language rules in this area have major impact on the
usability, readability, simplicity, and efficiency of programs. In this section we
discuss the alternatives which were considered for type equivalence in RED, and
justify the decision which we made. In summary, two type invocations in RED denote
the same type provided that their names are lexically identical, where a name is
obtained by expanding all abbreviations (if any) and then removing all
"round-bracketed" information. Type equivalence can always be checked at
translation time.
For example, consider data declarations of the following form.
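(The component bounds below are illustrative; only the record structure matters for
the example.)

     VAR x : RECORD[a : INT(1..10), b : INT(1..10)];
     VAR y : RECORD[c : INT(1..10), d : INT(1..10)];
     VAR z : RECORD[a : INT(0..100), b : INT(0..100)];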
The types of x, y, and z are then RECORD[a : INT, b : INT], RECORD[c : INT, d : INT], and
RECORD[a : INT, b : INT], respectively. Thus x and z have the same type, which is
distinct from the type of y.
The rules for type identity must take into account built-in types (such as
BOOL, INT, FLOAT), generated types (such as enumeration, record, array, union, and
indirect types), and new (i.e., user-defined) types. Moreover, the rules must be
consistent with the language conventions for type opacity; this implies that the
declaration of an ABBREViation does not introduce a new type, whereas the
declaration of a type does. Four alternative approaches were considered with
respect to type identity, which we will denote by the terms "pure name equivalence",
"structural equivalence", "single-instance generators", and "extended name
equivalence". The approaches differ primarily with respect to rules regarding the
equivalence of generated types.
3.2.2 PURE NAME EQUIVALENCE
The pure name equivalence approach simplifies the problem of type identity by
eliminating those cases which can cause complications. Under this scheme, it would
be illegal to invoke generated types except in the definition of new types. That
is, the attempted declaration
would be illegal, and the user would have to define a new type before declaring x;
e.g.,
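(A sketch; the TYPE declaration form and the particular index and component subtypes
are assumptions. The first declaration invokes the ARRAY generator directly and would
be rejected; the second pair shows the required style.)

     VAR x : ARRAY INT(1..10) OF FLOAT(8, -1000.0 .. 1000.0);

     TYPE x_type : ARRAY INT(1..10) OF FLOAT(8, -1000.0 .. 1000.0);
     VAR x : x_type;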
The question of type identity would then be resolved simply by checking type
identifiers for lexical identity. (Two identifiers are considered "lexically
identical" if they are textually identical and correspond to the same declaration.)
This approach is basically that proposed for the Phase 1 RED language [Br78].
Its main disadvantage is that it introduces into programs a proliferation of type
names, many of which are only used once, with a resulting loss of readability.
Additionally, it interacts badly with subarray selection: with the type and
variable declarations for x_type and x given above, it is not clear what the type of
x(2..5) should be. As a result of these drawbacks, the pure name equivalence
approach was rejected.
3.2.3 STRUCTURAL EQUIVALENCE
At the opposite extreme from pure name equivalence, the structural equivalence
approach allows maximal freedom in determining type identity for generated types.
Stated somewhat loosely, any two invocations of the same generator would produce the
same type provided that their component types are the same. The following examples
illustrate the intent of this rule:
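(A sketch; the TYPE declaration form and the subtype constraints are assumptions.)

     TYPE t1 : RECORD[a : INT(1..10), b : INT(1..10)];
     TYPE t2 : RECORD[a : INT(1..10), b : INT(1..10)];
     VAR xt1 : t1;
     VAR xt2 : t2;
     VAR xa1 : RECORD[a : INT(1..10), b : INT(1..10)];
     VAR xa2 : RECORD[a : INT(1..10), b : INT(1..10)];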
Variables xt1 and xt2 would have different types -- t1 and t2 -- whereas xa1 and xa2
would have the same type -- RECORD[INT,INT].
With the structural equivalence strategy, two enumeration types would be the
same if they comprise the same elements in the same order, and two array types would
be the same if their corresponding index types and component types are the same.
The rules for union types would be the same as for record types; field names are not
relevant with respect to type identity.
Problems with structural equivalence arise primarily in connection with
indirect types. First, the presence of cycles in ABBREViation declarations would
make it difficult for the reader to decide when two types are the same. Consider
examples of the sort sketched below.
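(A sketch only; the mutually recursive ABBREViations and the component subtypes are
assumptions, intended merely to suggest the shape of the problem.)

     ABBREV link1 : PTR RECORD[info : INT(0..100), next : link2];
     ABBREV link2 : PTR RECORD[info : INT(0..100), next : link1];
     ABBREV link3 : PTR RECORD[info : INT(0..100), next : link3];
     VAR p1 : link1;
     VAR p2 : link2;
     VAR p3 : link3;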
Do p1, p2, and p3 have the same type? Kieburtz et al. [KBH76] state a rule which
implies an affirmative answer: two indirect types "are judged equivalent if they
are alleged to be equivalent, and the allegation is not disproved by a structural
equivalence test upon the types pointed to." However, the formulation of a recursive
algorithm to solve the problem does not eliminate the confusion that readers of the
program are likely to feel.
The second problem with structural equivalence concerns the role of subtype
information in indirect types and arises when "anonymous" indirect types are allowed
(e.g., on data declarations). Suppose that the user intends to declare a variable
which is constrained to point to a string of a given fixed length, as in the
declarations sketched below.
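(A sketch; the notation for attaching a length constraint to the designated string is
an assumption, but the constraints 7, 10, and n follow the discussion below.)

     VAR x1 : PTR STRING[ASCII](7);
     VAR x2 : PTR STRING[ASCII](10);
     VAR x3 : PTR STRING[ASCII](n);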
The assignment x2 := x1 should not be permitted. (If it were, then either the
length information from x1 would be inherited by x2, conflicting with the
declaration of x2 which constrains the length to 10, or a subsequent assignment such
as x2(8) := x2(1) could modify the contents of an arbitrary location in memory.) The
assignment x2 := x3 should be permitted only if n = 10. Thus, x1, x2, and x3 would
have the same type -- PTR STRING[ASCII] -- and subtype constraints 7, 10, and n
which are checked on assignment and parameter passing. Consider now, however, the
case where a variable is intended to point to strings of arbitrary length; one
possible syntax for such a declaration is sketched below.
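(One conceivable notation, with the length constraint simply omitted; the form is an
assumption.)

     VAR x4 : PTR STRING[ASCII];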
The type of x4 is PTR STRING[ASCII], but the notion of the subtype of x4 is not so
clear, since x4 can be used more freely as the target of an assignment than can x1,
x2, or x3. If anonymous indirect types are permitted, as implied by the structural
equivalence approach, special rules would be needed in the language to explain the
role of subtypes for indirect types.
An additional problem with structural equivalence concerns the role of field
identifiers in record and union types. If the field identifiers are not relevant
with respect to type identity, then in an example such as the one sketched below,
z and p would have the same type, which was probably not the programmer's intent.
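(A sketch; the field names and subtypes are assumptions. The intent is that z holds a
complex number and p a point in the plane.)

     VAR z : RECORD[re : FLOAT(8, -1000.0 .. 1000.0), im : FLOAT(8, -1000.0 .. 1000.0)];
     VAR p : RECORD[x : FLOAT(8, -1000.0 .. 1000.0), y : FLOAT(8, -1000.0 .. 1000.0)];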
Although this problem could be solved with a suitable rephrasing of the language
rules, the difficulties intrinsic to structural equivalence when anonymous indirect
types are allowed have caused us to reject this approach to type identity.
3.2.4 SINGLE-INSTANCE GENERATORS
One possible compromise between the extremes of pure name equivalence and
structural equivalence is for each invocation of a type generator to produce a
separate type, even when the invocations are lexically identical. This approach
will be called that of "single-instance generators". For example, consider the
declarations sketched below.
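(A sketch; the enumeration literals and bounds are illustrative, and the
multi-variable form in the declaration of r1 and r2 is an assumption, the point being
that r1 and r2 arise from a single invocation of the RECORD generator while r3 arises
from a second, textually identical invocation.)

     VAR i1 : INT(1..10);
     VAR i2 : INT(1..10);
     VAR e1 : ENUM['on, 'off];
     VAR e2 : ENUM['on, 'off];
     VAR r1, r2 : RECORD[a : INT(1..10), b : INT(1..10)];
     VAR r3 : RECORD[a : INT(1..10), b : INT(1..10)];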
Since INT is a type (not a type generator), i1 and i2 would have the same type.
However, e1 and e2 would have different types (ENUM is a type generator); r1 and r2
would have the same type, different from the type of r3.
Although reasonably simple to implement, the single-instance generator approach
detracts from the ease with which the language could be learned and used.
- A semantically significant distinction is made between a type and a type
generator, but in practice it is not so easy to remember into which category
a given candidate falls. For example, there is no intuitive reason to
regard enumeration types as instances of type generators; given the analogy
with integers with respect to range specifications, one might expect that e1
and e2 above should have the same type. In the case of user-defined (or
library-provided) types the issue is more subtle. For example, does
FILE[INT(0..100)] produce a new type every time it is invoked?
- The interaction with ABBREViations can lead to confusion. If the user
declares
ABBREV color_type : ENUM['red, 'yellow, 'green, 'blue];
then the semantics of a pair of variable declarations c1 and c2, each declared
with color_type, must be explained (and learned) very carefully. In particular, the
abbreviations cannot simply be expanded lexically (as one might expect),
since this would imply that c1 and c2 have different types.
- Type generators become useless on formal parameter specifications, since no
actual parameter (except possibly a constructor whose type was ambiguous)
could ever have the same type as the formal. For example, though it might
be useful to have a procedure which sorted arbitrary arrays of floating
point values, the straightforward approach of declaring the formal parameter x with
the type ARRAY INT OF FLOAT would not work, since the type of that formal parameter
would be distinct from all other ARRAY INT OF FLOAT types.
- The fact that two textually identical declarations involving T would sometimes
yield the same type for the declared variables (when T is a type), and sometimes
not (when T is an invocation of a type generator), is anomalous.
- It would become very clumsy to refer (in comments or informal discourse) to
the type of an object. Given a declaration of x whose type is produced by an
invocation of ARRAY INT OF FLOAT, how does one describe the type of x? The
denotation "ARRAY INT OF FLOAT"
would be incorrect, since this denotes a new type, distinct from that of x.
- The semantics of array slicing would be somewhat confusing. Given the
declarations of several array variables by means of invocations of an array
generator, what is the type of x(1..3), and which assignments involving such
slices would be legal?
The above problems have caused us to reject the single-instance generator
approach to type equivalence. Though some of these difficulties could be avoided by
the user's declaring new types instead of using anonymous generators, this would
reduce the approach to (and thus result in the drawbacks of) the pure name
equivalence scheme discussed earlier.
3.2.5 EXTENDED NAME EQUIVALENCE
Extended name equivalence represents a compromise between pure name equivalence
and structural equivalence that avoids the anomalies of the single-instance
generators approach. The following are the essential attributes of extended name
equivalence.
- Anonymous (and abbreviated) indirect types are prohibited. Thus, if one
intends to use an indirect type, a new type must be declared.
- A name is derived from a type invocation by expanding all ABBREViations and
comma-separated field lists, and then removing all "round-bracketed"
information.
- Two type invocations yield the same type provided that their resulting names
are lexically identical.
These rules apply consistently to built-in, generated, and user-defined types, as
illustrated by the following examples:
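(A sketch; the declarations and bounds are illustrative.)

     ABBREV index : INT(1..10);
     VAR v1 : ARRAY index OF FLOAT(8, -1000.0 .. 1000.0);
     VAR v2 : ARRAY INT(1..20) OF FLOAT(4, 0.0 .. 1.0);
     VAR r1 : RECORD[a, b : INT(1..10)];
     VAR r2 : RECORD[a : INT(0..5), b : INT(0..5)];

After expanding the abbreviation and the comma-separated field list and removing the
round-bracketed information, v1 and v2 both have the type ARRAY INT OF FLOAT, and r1
and r2 both have the type RECORD[a : INT, b : INT]; in each case the two variables
have the same type but different subtypes.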
The extended name equivalence approach was adopted for the RED language because
it provides intuitively reasonable semantics without dictating a particular
programming style, and because it meshes well with the language's approach to types
and subtypes. It has an advantage over pure name equivalence in allowing anonymous
enumeration, array, record, and union types. It has an advantage over structural
equivalence in avoiding the non-intuitive semantics of anonymous indirect types and
in treating field identifiers as relevant to type identity. And it has an advantage
over single-instance generators in avoiding the variety of anomalous behaviors
described above.
***********
1. Note that "dope vector" information need not be stored contiguously with the data. Thus
several data items of the same subtype could share the same dope vector.
2. The terms "equivalent types", "identical types", and "same types" are used
interchangeably.