David Holdsworth
CAMiLEON Project
University of Leeds
LS2 9JT UK
My own experience is in the preservation of an operating system for an obsolete mainframe system of the 1970s. The techniques advocated in this short (I hope) document are a direct consequence of this work, which has been done by writing emulation code in both C and Java.
Jeff Rothenberg has proposed that at the time of preservation of the actual bytes (or bits) that comprise the preserved object, we also preserve the specification of the platform upon which this object ran. My own experience suggests that it will be too difficult to capture all relevant information with confidence unless an emulation is actually achieved (see Appendix B below).
IBM and the British Library are involved in a project in which the intention is to design a Universal Virtual Machine (UVM), which is then used for actual emulator implementation. It seems courageous to me to christen something "universal" even before it has been designed, but perhaps if you are IBM you can make such nomenclature stick. The view taken is that a high-level language is too transient, and will not stand the test of time. History, however, teaches us that some languages achieve such pre-eminence (and have such software investment dependent upon them) that they outlast virtual architectures. The only hardware architecture in the same league is the IBM360/370/390, and that is younger than FORTRAN, although it can probably claim to pre-date C.
My own work (with a colleague, Delwyn Holroyd) has implemented emulation of the ICL1900 system to the extent that we can run the George3 operating system, including its time-sharing feature. This also gives us access to software systems written to run under George3, including the world's first Algol68 compiler.
Java's view of memory is much more abstract, but still allows arrays of integers, which make a convenient representation of emulated main storage. Java has a multi-tasking model as an integral part of the language, whereas the thread facilities in C are a more recent feature of the language, and not necessarily supported on all platforms.
In our emulation of the ICL system, we have used C for emulating the main 1900 processor, and used Java for emulation of the communications processor (7903) for which the multi-tasking aspects are valuable. Although still imperfect in some respects, the system works well enough to evoke immediate recognition by those who know the original system, and to vindicate the techniques used in its construction. The emulation has run successfully on Win32, Irix, Solaris and Linux. We have not tried any other platforms.
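To give a flavour of what emulating a processor in C involves, here is a minimal sketch of a fetch-execute loop. It is not taken from the 1900 emulator: the four-instruction toy machine, its instruction encoding, the 24-bit wrap-around rule used to set the overflow flag, and the names store, make_instr and run are all assumptions made for this example. The code deliberately avoids macros, address arithmetic and variadic functions, and represents emulated main storage as a plain array.

```c
/* Hypothetical toy machine for illustration only; NOT the ICL 1900 order code. */

enum { STORE_WORDS = 64 };                    /* size of emulated main store */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

const long WORD_LIMIT = 16777216L;            /* 2^24: data words are 0..2^24-1 */

long store[STORE_WORDS];                      /* emulated main storage */
long acc = 0;                                 /* accumulator */
int overflow = 0;                             /* set when an addition wraps */

/* Pack an opcode and an address into one instruction word (assumed encoding). */
long make_instr(int op, int addr)
{
    return (long)op * 4096L + (long)addr;
}

/* Execute from 'start' until HALT.
   Returns the number of instructions executed, or -1 on a fault. */
int run(int start)
{
    int pc = start;
    int steps = 0;

    for (;;) {
        long word;
        int op;
        int addr;

        if (pc < 0 || pc >= STORE_WORDS) return -1;   /* fetch outside store */
        word = store[pc];
        op = (int)(word / 4096L);
        addr = (int)(word % 4096L);
        pc = pc + 1;
        steps = steps + 1;

        if (op == OP_HALT) return steps;
        if (addr < 0 || addr >= STORE_WORDS) return -1;  /* operand outside store */

        if (op == OP_LOAD) {
            acc = store[addr];
        } else if (op == OP_ADD) {
            acc = acc + store[addr];
            if (acc >= WORD_LIMIT) {          /* wrap to 24 bits and flag it */
                acc = acc - WORD_LIMIT;
                overflow = 1;
            }
        } else if (op == OP_STORE) {
            store[addr] = acc;
        } else {
            return -1;                        /* unknown opcode */
        }
    }
}
```

A caller would place instruction words in store with make_instr, put data in other words of store, and call run(0). Using long for a word is a deliberate choice: the C standard guarantees it holds at least 32 bits, whereas int need not.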
However, we wish to address the longer term. I think it unlikely that some alternative programming paradigm (e.g. functional programming) will completely eclipse the traditional style. When we look at C we can observe that many of its features are to be found in other languages. Here I am concerned with features at the semantic level. As an example, the assignment statement exists in C, Algol60, Algol68, Pascal, Ada83/95 and Java, to name but a few. There is a syntactic difference in that C and Java have x = y, whereas the others have x := y. On the other hand, there are features of C that have been deliberately discarded in newer languages, e.g. macros, address arithmetic, variadic parameter lists.
I propose that we recommend a subset of C in which to write emulators for long-term preservation. My personal experience is that the amount of work involved is by no means excessive. Our emulation of the ICL1900 was achieved as a spare-time activity over a period of about 18 months, while both of us were in full-time employment.
Tentative proposals for selection of the subset are in Appendix A.
The expectation is that over time it will become necessary either to modify the subset if it turns out to contain features that are removed from the language (indicating a bad choice of subset), or to move the policy to use a subset of a different language.
A further opportunity is opened up by this approach. We must consider the time when C becomes computational Latin, and is replaced by another lingua franca (let us call it E). The C yacc parser could be the vehicle for implementation of software for the automatic translation of C emulators into E.
Of course, it is always possible that yacc may not last for ever either, but yacc is itself a C program, and could possibly be translated into the C subset once it was no longer seen as part of the standard kit of parts. Making it generate E may be more problematic.
It is likely that the restrictions of the subset will make some things impossible. The desire to exclude variadic parameter lists, for example, would restrict the use of printf. We thus propose that a program written in the subset may require linkage with a small (and we stress small) section of code written in full C. It is assumed that any migration away from C would involve hand coding of these small C sections. It may prove possible to make this C section common to more than one emulator.
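As an illustration of how small such a support section might be, the sketch below confines the one variadic call (sprintf) to a single routine written in full C, and lets the rest of the program use only fixed-parameter functions. The names fmt_decimal, put_text and report_register are hypothetical, invented for this example; they are not part of any existing emulator.

```c
#include <stdio.h>
#include <string.h>

/* ---- Small support section, written in full C ---- */

/* Format 'value' as decimal text into buf (capacity 'size' chars).
   Returns the number of characters written, or -1 if buf is too small.
   This is the only place where a variadic function is called. */
int fmt_decimal(char buf[], int size, long value)
{
    char tmp[32];                       /* ample for a 64-bit long plus sign */
    int n = sprintf(tmp, "%ld", value);

    if (n < 0 || n >= size) return -1;
    strcpy(buf, tmp);
    return n;
}

/* Write a string to standard output without any format processing. */
void put_text(const char text[])
{
    fputs(text, stdout);
}

/* ---- Code in the restricted style: fixed parameter lists only ---- */

/* Print a line such as "ACC = 42" using the support routines above. */
void report_register(const char name[], long value)
{
    char digits[32];

    put_text(name);
    put_text(" = ");
    if (fmt_decimal(digits, 32, value) >= 0) {
        put_text(digits);
    }
    put_text("\n");
}
```

On migration to another language, only fmt_decimal and put_text would need rewriting by hand; report_register and anything else written in the restricted style could in principle be translated mechanically.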
Features for omission from C:
The George3 operating system, which runs under our emulator, was written in assembler by a team of programmers. As a result it seems to use every quirk of the machine's order code at some point. The final breakthrough into reliable operation came when we implemented a property of the overflow register that was not hinted at in the summary chart, and was detailed only once in a thick four-volume manual. It seems likely that such a property might escape the specification process.
The source text of George3 was an invaluable reference from time to time. Some of the later features of the system's interfaces were not in the mainstream manuals, although they may have featured in software notices. The thought of reading through many hundreds of these was sufficient disincentive to make inspection of the source code a more fruitful way to investigate mysteries. One particular feature of the interface to the communications processor was only revealed by a comment in the source code, after which dim recollection of 25-year-old knowledge was sufficient.
We have deliberately steered clear of system-dependent features in our use of C. Each of the two authors routinely uses a different compiler (Visual C++ and Cygnus gcc), and from time to time checks operability on other systems. We have factored out the parts that are necessarily platform specific.
David Holdsworth
December 2000