\documentclass[preprint]{sigplanconf}
\usepackage{includes}
\usepackage{natbib}
\bibpunct[: ]{[}{]}{,}{n}{}{;} 

\begin{document}
\newcommand{\anonymize}[2]{#2}

\urlstyle{tt}

\title{Subtleties of the \ISO{} \C{} standard}
\titlebanner{draft version}
\authorinfo{\anonymize {Robbert Krebbers \and Freek Wiedijk}
             {\emph{author(s) omitted because of double blind reviewing}}}
           {\anonymize {Radboud University Nijmegen}{}}
           {}
\maketitle

\begin{abstract}
In our
\anonymize{\Formalin{} project}{efforts}
to formalize \Celeven{} (the \ISO{} 
standard of the \C{} programming language) we discovered many unexpected
subtleties that make formalization of that standard difficult.
Most of these difficulties are the result of the \C{} standard giving
compilers room for strong optimizations based on aliasing analysis.

We discuss some of these subtleties and indicate how they may be addressed in
a formal \C{} semantics.
Furthermore,
we discuss why the \C{} standard should address the possibility
of stack overflow,
and
we argue that evaluation of \C{} expressions does not
preserve typing in the presence of variable length array types.
\end{abstract}

\keywords
C programming language,
system programming languages,
formal methods,
programming language standardization

\section{Introduction}

\subsection{Problem}
Current programming technology is rather fragile: programs regularly crash, 
hang, or even give viruses free rein.
An important reason is that many programs are developed using low-level 
programming languages.
One of the most extreme instances is the widespread use of the \C{} 
programming language.
In the \TIOBE{} programming language popularity index~\cite{tio:xx}, \C{} holds
(as of fall 2012) the top position.

Whereas most modern programming languages require an exception to be thrown
when exceptional behavior occurs (\eg{} when dereferencing a 
\NULL{} pointer, when accessing an array out of its bounds, or on integer overflow),
\C{}~\cite{ker:ric:88,iso:12} does not impose such requirements.
Instead, it classifies these behaviors as \emph{undefined} and allows
a program to do literally anything in such situations~\cite[3.4.3]{iso:12}.
On the one hand, this allows a compiler to omit runtime checks and to generate
more efficient code, but on the other hand these undefined behaviors often
lead to security vulnerabilities~\cite{cow:wag:pu:bea:wal:00,lac:des:08,wan:che:che:jia:zel:kaa:12}.
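To make this concrete, the checks that other languages perform automatically
have to be written by hand in \C{}; the following sketch (our own illustration,
not taken from the standard) guards the three examples mentioned above:

\begin{lstlisting}
#include <limits.h>
#include <stdio.h>

int main(void) {
  int *p = NULL, a[3] = {1, 2, 3}, n = INT_MAX;
  size_t i = 3;
  /* each unguarded operation would be undefined behavior */
  if (p != NULL) printf("%d\n", *p); else puts("null pointer");
  if (i < 3) printf("%d\n", a[i]); else puts("out of bounds");
  if (n < INT_MAX) printf("%d\n", n + 1); else puts("overflow");
  return 0;
}
\end{lstlisting}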

There are two main approaches for improving this situation:
\begin{itemize}
\item Switch to a more modern and higher level programming language.
This approach reduces the number of programming errors, and if an error
remains, the chance of it being exploitable is much lower.
%
One disadvantage of this approach is that there will be a thicker layer between
the program and the hardware of the system.
This costs performance, both in execution speed and in memory usage, but it
also means a reduction in control over the behavior of the system.
Especially for embedded systems and operating system kernels this is an
undesired consequence.

\item Stick to a low-level programming language like \C{}, but add a formal
methods layer on top of it to establish that programs do not exhibit undefined
behavior.
%
Such a layer might allow the developer to annotate their programs with
invariants and to prove that these invariants indeed hold.
To be practical most of these invariants should be proven automatically, and
the remaining ones by interactive reasoning.

This approach is an extension of \emph{static analysis}. But whereas static 
analysis tools often yield false positives, interactive reasoning allows the
developer to prove that a reported issue is not an actual error.

For functional correctness, this approach has also been successful.
For example, there are various projects to prove the \C{} source code of a microkernel
operating system correct~\cite{coh:ah:hil:lei:mos:san:sch:tob:09,kle:09}.
\end{itemize}
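Such invariant annotations might, for instance, be written in the style of
ACSL, the specification language of the \FramaC{} tool mentioned below; the
following sketch is our own invention and has not been checked with any tool:

\begin{lstlisting}
#include <stdio.h>

/*@ requires 0 <= n;
    ensures \result >= n; */
int bump(int n) {
  if (n < 1000) return n + 1;  /* no overflow: n < 1000 here */
  return n;
}

int main(void) {
  printf("%d\n", bump(41));
  return 0;
}
\end{lstlisting}

\noindent
Because the annotations live inside comments, the code still compiles as
plain \C{}.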

\noindent
There are many tools for the second approach, like
\VCC~\cite{coh:ah:hil:lei:mos:san:sch:tob:09},
\Verifast~\cite{jac:pie:08} and \FramaC~\cite{moy:mar:09}.
However, these tools do not use an explicit formal \C{} semantics and only
implicitly `know' about the semantics of \C{}. 
Therefore the connection between the correctness proof and the behavior of
the program when compiled with a real world compiler is shallow.
The soundness of these tools is thus questionable~\cite{ell:ros:12:slides}.

For this reason,
\anonymize{
we started in 2011 at the \RU{} a project to provide a formal
semantics of the \C{} programming language: the \Formalin{} % or \CHtwoO{} 
project~\cite{kre:wie:11}.
}{
we recently started a project to provide a formal
semantics of the \C{} programming language.
}
This semantics was to be developed for interactive theorem provers, allowing one
to base formal proofs on it.
Although there already exist various versions of a formal semantics of 
significant fragments of \C{} (see Section~\ref{section:related} for an
overview), our goal was to formalize the `official' semantics of \C, as written
down in the \Celeven{} standard%
\anonymize{\footnote{Back then, the target was
\Cninetynine, as \Celeven{} was not finished yet. However, nearly all of the
issues we describe were already present in \Cninetynine{} and have not been
changed in \Celeven.}.}{.}
We intended not to gloss over the more difficult aspects of \C{} and to provide
a formal semantics of the whole language.

Unfortunately,
\anonymize{the \Formalin{} project has}{our formalization efforts have}
turned out to be much harder than we
anticipated: the \Celeven{} standard proved much more difficult
to formalize than expected.
We were aware that \Celeven{} includes many features, meaning we would
need to write a large formalization to cover them all.
Also, since the standard is written in English, we knew we had to deal with
inherent ambiguity and incompleteness.
But we had not realized how difficult things would be in this respect.

Already, the very basis of our formalization, the memory model, turned out to be
almost impossible to bring into line with the standard text. 
The main problem is that \C{} allows both \emph{high-level} (by means of 
typed expressions) and \emph{low-level} (by means of bit manipulation) access 
to the memory.
The \Cninetynine{} and \Celeven{} standards have introduced various restrictions
on the interaction between these two levels to allow compilers to make more
effective non-aliasing hypotheses based on typing.
As also observed in~\cite{iso:09,mac:01}, these restrictions have led
to unclarities and ambiguities in the standard text.
We claim that such
issues are inherent to programming languages that wish to combine high-level
and low-level memory access while still allowing strong optimizations.

\subsection{Approach}
The aim of this paper is to discuss the situation.
We illustrate various issues with small example programs, discuss what the
\Celeven{} standard says about them, and indicate how a formal semantics may
handle them.

\anonymize{
During the year that the \Formalin{} project has been running,}
{As part of this work,}
we have developed
an (implicit) prototype of a \Celeven{} semantics in the form of a large 
\Haskell{} program.
This program can be seen as a \emph{very} critical \C{} interpreter.
If the standard says that a program has undefined behavior, our \Haskell{} 
interpreter will terminate in a state that indicates this.

The intention of our prototype was to develop a clear semantics of the
high-level part.
To this end, we postponed low-level details such as bytes, object
representations, padding, and alignment. 
Due to the absence of low-level details, we were able to support features
that are commonly left out, or handled incorrectly, in existing
formal versions of \C. 
In particular, we treat effective types, the common initial segment rule, 
indeterminate values, pointers to one past the last element, variable length
arrays, and \lstinline|const|-qualified objects.
But even without the low-level part, we experienced many other difficulties,
which are also described in this paper.
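As an illustration of one of these features, the common initial segment rule
permits inspecting the shared leading members of two structures through either
member of a union containing both; the structures below are our own example,
not taken from the standard:

\begin{lstlisting}
#include <stdio.h>

struct point  { int tag, x, y; };
struct circle { int tag, x, y, radius; };
union shape { struct point p; struct circle c; };

int main(void) {
  union shape s;
  s.c = (struct circle){ .tag = 1, .x = 2, .y = 3, .radius = 4 };
  /* tag, x and y form a common initial segment, so while a union
     of both structures is visible, they may also be read through
     the point member */
  printf("%d %d\n", s.p.tag, s.p.x);
  return 0;
}
\end{lstlisting}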

Our prototype is currently being ported to the interactive theorem
prover \Coq.
\anonymize{The source of the \Haskell{} version can be inspected at
\url{http://ch2o.cs.ru.nl/}}{In the final version of the paper we will
give the URL of the source of the \Haskell{} version, but because
of double blind reviewing it is omitted here.}

While working on a formal version of the \Celeven{} standard, we had four rules
that guided our thinking:
\begin{enumerate}
\item If the standard is absolutely clear about something, our semantics
should not deviate from that.
That is, if the standard clearly states that certain programs should not
exhibit undefined behavior, we are not allowed to take the easy way out and
let \emph{our} version of the semantics assign undefined behavior to it.

\item If it is \emph{not} clear how to read the standard, our semantics should 
err on the side of caution.
Generally, this means assigning undefined behavior as we do not want our
semantics to allow one to prove that a program has a certain property, when
under a different reading of the standard this property might not hold.

\item Any \C{} idiom that is heavily used in practice should not be considered to 
exhibit undefined behavior, even if the standard is not completely clear about it.

\item If real-world \C{} compilers like \GCC{} and \clang{} in \ISO{} C mode
exhibit behavior that is in conflict with a straightforward reading of the
standard, but that can be explained by a contrived reading of the standard, our
semantics should take the side of the compilers and allow their behavior.
\end{enumerate}

\noindent
Of course there is a tension between the second and third rule.
Furthermore, the fourth rule is a special case of the second, but
we included it to stress that compiler behavior can be taken as
evidence of where the standard is unclear.

\subsection{Related Work}
\label{section:related}
This related work section consists of three parts: a discussion of related work
on unclarities in the \C{} standard, a discussion of related work on undefined
behavior, and a brief comparison of other formal versions of a \C{} semantics.

An important related document is a post by Maclaren~\cite{mac:01}
on the standard committee's mailing list where he expresses his concerns about
the standard's notion of an \emph{object} and \emph{effective type}, and
discusses their relation to multiprocessing.
Like our paper, he presents various issues by considering example programs.
Most importantly, he describes three directions to a more consistent \C{}
standard.
We will treat those in Section~\ref{section:bits}.

The standard committee's website contains a list of defect reports.
These reports describe issues about the standard, and after
discussion by the standard committee, they may lead to a revision or clarification of the
official standard text.
\drcite{260} raises issues similar to ours and will be discussed thoroughly
throughout this paper.

There is little related work on undefined behavior and its relation to bugs
in both programs and compilers.
Wang \etal{}~\cite{wan:che:che:jia:zel:kaa:12} classified various kinds of
undefined behavior and studied their consequences for real-world systems.
They have shown that undefined behavior is a problem in practice and
that various popular open-source projects (like the \Linux{} kernel and
\PostgreSQL{}) use compiler workarounds for it.
However, they do not treat the memory model, and non-aliasing specifically,
and also do not consider how to deal with undefined behavior in a formal
\C{} semantics.

Yang \etal{}~\cite{yan:che:eid:reg:11} developed a tool to randomly generate
\C{} programs to find compiler bugs.
This tool has discovered a significant number of previously unknown bugs in
state-of-the-art compilers.
In order to do this effectively, they had to minimize the number of generated
programs that exhibit undefined behavior.
However, they do not seem to treat the kinds of undefined behavior that
are considered in this paper.

Lastly, we will briefly compare the most significant already existing formal 
versions of a \C{} semantics (there will be a more extensive discussion of
these in Section~\ref{section:bits}).
There are also many others like~\cite{coo:sub:94, pap:98,lei:08},
but these only cover small fragments of \C{} or are not recent enough to
include the troublesome features of \Cninetynine{} and \Celeven{} that are the
topic of this paper.

Norrish defined a semantics of a large part of \Ceightynine{}
in the interactive theorem prover \HOL{}~\cite{nor:98}.
His main focus was to precisely capture the non-determinism in evaluation of
expressions and the standard's notion of \emph{sequence points}.
However, the problems described in our paper stem from features of the
standard more recent than those covered by Norrish's work.

Blazy and Leroy~\cite{bla:ler:09} defined a semantics of a
large part of \C{} in the interactive theorem prover \Coq{} to prove the 
correctness of the optimizing compiler \CompCert.
\CompCert{} treats some of the issues we raise in this paper, but as 
its main application is to compile code for embedded systems, its developers
are more interested in giving a semantics to various undefined behaviors
(such as wild pointer casts) and compiling those faithfully
than in supporting \C{}'s non-aliasing features to their full extent (private
communication with Leroy).

A project that builds on the \CompCert{} semantics~\cite{cam:pol:10} is the
\CerCo{} project~\cite{cer:xx}, whose goal is to prove a compiler to be
complexity preserving.

Ellison and Rosu~\cite{ell:ros:12, ell:ros:12:tr} defined an executable
semantics of the \Celeven{} standard in the \Kframework, and described various
ways to deal with undefined behavior in a semantics of a \C-like language.
Although their semantics is very complete, has been thoroughly 
tested, and has some interesting applications, it seems infeasible to use it
in interactive theorem provers.
Besides, it is unclear whether their current memory model conforms to the
\C{} standard with respect to the issues we present.

\subsection{Scope}
It might seem that this paper is only of interest to those who
are specifically interested in \C{}.
However, the issues that we discuss are not specific to just \C.
Any programming language that combines the following two design goals will have
to deal with the kind of problems discussed in this article:
\begin{itemize}
\item It can be used for system programming, \ie, it allows
low-level access to the bit level representation of the data that is stored
in the memory.
\item It has pointers, and allows a compiler to perform aggressive
optimizations based on aliasing analysis.
\end{itemize}

\noindent
Furthermore, this paper is not just a `list of bugs in the standard'.
Given that the standard text does not use a mathematical formalism, bugs are
almost inevitable, but generally not a big problem in practice and easily fixed.
We will argue that the issues we present are more fundamental than just
`bugs' and are not easily resolved.
%In this case it is not just a matter of choosing what approach for resolving the problems
%is best.
%We know of \emph{no}
%mathematically consistent way to resolve these issues beyond
%taking extreme measures like (as for instance the \CompCert{} semantics does)
%taking bytes not to consist of bits but to be abstract entities.
%Clearly this is too foreign to \C{} as it is used in practice
%to be really considered a fully acceptable solution.

\subsection{Contribution}
The contribution of this paper is fourfold:

\begin{enumerate}
\item We indicate various subtleties of the \Celeven{} memory model and type
system that we discovered while working on our formal semantics
(Section \ref{section:bits}, \ref{section:structs}, \ref{section:uninitialized}
and \ref{section:subjectreduction}).

\item We point out some imperfections of \Celeven{}: 
the absence of (an abstract variant of) stack overflow (Section
\ref{section:stackoverflow}), and the lack of preservation of typing (Section
\ref{section:subjectreduction}).

\item We present many small example programs that can be used as a
`benchmark' for comparing different formal versions of a \C{} semantics.

\item We give some considerations on how best to proceed with formalizing
the \C{} standard, given that the existing standard text is
imprecise and maybe even inconsistent.
\end{enumerate}

\section{Undefined behavior}
The \C{} standard uses the following notions of under-specification.
\begin{itemize}
\item \textbf{Unspecified behavior}~\cite[3.4.4]{iso:12}.
	Constructs for which the standard provides two or more possibilities,
	\eg{} order of evaluation.
	The behavior may vary for each use of the construct.
\item \textbf{Implementation defined behavior}~\cite[3.4.1]{iso:12}.
	Unspecified behavior, but the implementation has to document its choice,
	\eg{} size of integer types, endianness.
\item \textbf{Undefined behavior}~\cite[3.4.3]{iso:12}.
	Constructs for which the standard imposes no requirements at all,
	\eg{} dereferencing a \NULL{} pointer, integer overflow.
\end{itemize}
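As an illustration (our own sketch), the following fragment marks where each
notion applies; the printed results happen not to depend on the choices an
implementation makes:

\begin{lstlisting}
#include <stdio.h>

int f(void) { return 1; }
int g(void) { return 2; }

int main(void) {
  int r = f() + g();  /* unspecified: order of the two calls */
  printf("%d\n", r);  /* ...but the sum is 3 either way */
  /* implementation defined: the size of int; here we only test
     the minimum the standard guarantees */
  printf("%d\n", sizeof(int) >= 2);
  /* undefined: e.g. *(int *)0 = 1; -- no requirements at all */
  return 0;
}
\end{lstlisting}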

\noindent
Under-specification is used extensively to make \C{} portable, and to allow
compilers to generate fast code.
An important feature of undefined behavior is that it is dynamic, rather
than static (many undefined behaviors are even impossible to detect statically).
When it occurs, a program is allowed to do literally anything, so as to
avoid compilers having to insert (possibly expensive) dynamic checks to handle
corner cases.

\C's excessive use of undefined behavior (it describes more than 200
circumstances in which it occurs~\cite[J.2]{iso:12}) is quite different from
more modern programming languages.
There, most incorrect programs are ruled out statically by a strong type system
(as in \Haskell{} for example), or a program is required to throw an exception
(as in \Java{} for example).

Undefined behavior is important to allow strong optimizations.
Based on undefined behavior, a compiler may assume that certain behaviors
cannot occur, and optimize accordingly.
We give examples of this in this paper.
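A well-known instance is signed integer overflow: since it is undefined, a
compiler may assume it never happens, and may compile the function below to
always return \lstinline|1| without evaluating the addition. The sketch is our
own; no particular compiler's documentation promises this exact optimization.

\begin{lstlisting}
#include <stdio.h>

/* as signed overflow is undefined, a compiler may assume
   x + 1 never wraps and fold the comparison to 1 */
int plus_one_is_bigger(int x) { return x + 1 > x; }

int main(void) {
  printf("%d\n", plus_one_is_bigger(30));
  return 0;
}
\end{lstlisting}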

In a formal semantics, unspecified behavior corresponds to non-determinism, and
implementation defined behavior corresponds to parametrization by an
environment describing properties of the implementation.
Programs exhibiting undefined behavior are incorrect, and a formal semantics should
therefore exclude those.
The job of a formal semantics is thus not only to describe the meaning of
correct programs, but also to exclude execution of incorrect programs.

\section{Pointer aliasing versus bit representations}
\label{section:bits}
An important feature of \C{} is to allow both \emph{high-level} (by means of 
typed expressions) and \emph{low-level} (by means of bit manipulation) access 
to the memory.
For low-level access, the standard requires that each value is represented as a 
sequence of bytes~\cite[3.6, 5.2.4.2.1]{iso:12},
called the \emph{object representation}~\cite[6.2.6.1p4, 6.5.3.4p2]{iso:12}.

In order to allow various compiler optimizations (in particular strong
non-aliasing analysis), the standard has introduced various restrictions on 
the interaction between these two levels of access. 
Let us consider the following program \cite{iso:09}:

\begin{lstlisting}
int x = 30, y = 31;
int *p = &x + 1, *q = &y;
if (memcmp(&p, &q, sizeof(p)) == 0)
  printf("%d\n", *p);
\end{lstlisting}

\noindent
Here we declare two objects \lstinline|x| and \lstinline|y| of type 
\lstinline|int| and use the \lstinline|&|-operator to take the address of both
(Figure~\ref{figure:adjacent_blocks}a).
Increasing the pointer \lstinline|&x| by one moves it \lstinline|sizeof(int)| 
bytes ahead and yields a pointer to the right edge of the \lstinline|x| 
block. 
It may seem strange that such pointers are allowed at all~\cite[6.5.6p8]{iso:12}
because they cannot be dereferenced, but their use is common programming
practice when looping through arrays\footnote{Note that \C{} allows one to
use objects that do not have array type as an array of length
one~\cite[6.5.6p7]{iso:12}.}.
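For example, the following common loop (our own sketch) computes and compares
against the one-past-the-end pointer \lstinline|a + 4|, but never dereferences
it:

\begin{lstlisting}
#include <stdio.h>

int main(void) {
  int a[4] = {13, 21, 34, 55}, sum = 0;
  /* a + 4 may be formed and compared, but not dereferenced */
  for (int *p = a; p != a + 4; p++)
    sum += *p;
  printf("%d\n", sum);
  return 0;
}
\end{lstlisting}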

\begin{figure}[t!]
\centering
\subfloat[]{\begin{tikzpicture}[]
	\draw[fill=green] (0,0) rectangle node {\lstinline|30|} +(1.2,-0.5);
	\draw[fill=yellow] (1.7,0) rectangle node {\lstinline|31|} +(1.2,-0.5);
	\draw[ultra thick,->] (0,-0.8) node[below] {\lstinline|&x|} -- +(0,0.3);
	\draw[ultra thick,->] (1.7,-0.8) node[below] {\lstinline|&y|} -- +(0,0.3);
\end{tikzpicture}} \hspace{1cm}
\subfloat[]{\begin{tikzpicture}[]
	\draw[fill=green] (0,0) rectangle node {\lstinline|30|} +(1.2,-0.5);
	\draw[fill=yellow] (1.7,0) rectangle node {\lstinline|31|} +(1.2,-0.5);
	\draw[ultra thick,->] (1.2,-0.8) node[below] {\lstinline|p|} -- +(0,0.3);
	\draw[ultra thick,->] (1.7,-0.8) node[below] {\lstinline|q|} -- +(0,0.3);
\end{tikzpicture}} \hspace{1cm}
\subfloat[]{\begin{tikzpicture}[]
	\draw[fill=green] (0,0) rectangle node {\lstinline|30|} +(1.2,-0.5);
	\draw[fill=yellow] (1.2,0) rectangle node {\lstinline|31|} +(1.2,-0.5);
	\draw[ultra thick,->] (0.8,-0.8) node[below] {\lstinline|p|} -- +(0.4,0.3);
	\draw[ultra thick,->] (1.6,-0.8) node[below] {\lstinline|q|} -- +(-0.4,0.3);
\end{tikzpicture}}
\caption{Adjacent blocks.}
\label{figure:adjacent_blocks}
\end{figure}

We store these pointers into objects \lstinline|p| and \lstinline|q| of type 
pointer to \lstinline|int| (Figure~\ref{figure:adjacent_blocks}b).
The next step is to check whether these pointers \lstinline|p| and \lstinline|q|
are equal (note: not whether the memory they point to is equal; that
would be checked by \lstinline{memcmp(p, q, sizeof(*p))}\penalty 100\lstinline{== 0}).
We do this by using the \lstinline|memcmp| function, which checks whether 
their object representations are equal.
It is important to use a bitwise comparison, instead of the ordinary 
\lstinline|p == q|, so that no additional information beyond the object
representations is taken into account.
If the object representations of the two are equal, we can conclude that both 
pointers point to the same memory location and do not contain conflicting bounds 
information.
From this we are allowed to conclude that \lstinline|x| and \lstinline|y| are
allocated adjacently (Figure~\ref{figure:adjacent_blocks}c).

Now we have ended up in a situation where the low- and high-level world are in 
conflict. 
On the one hand, \lstinline|p| is a pointer to the right edge of the \lstinline|x| 
block, and thus dereferencing it should lead to undefined behavior.
On the other hand, the object representation of \lstinline|p| is the same as
the object representation of \lstinline|q|, and so \lstinline|p| and
\lstinline|q| should behave identically.

Although the standard itself is very unclear about these problems,
in~\drcite{260} the committee expressed the following
judgment:
\begin{quote}
Implementations are permitted to track the origins of a bit-pattern and treat 
those representing an indeterminate value as distinct from those representing a
determined value. 
They may also treat pointers based on different origins as distinct even 
though they are bitwise identical. 
\end{quote}
\noindent
Apparently a value can contain additional information about its \emph{origin}
that is not reflected in its object representation.
The committee permits this in order to allow compiler optimizations
that would not be correct otherwise.
%For example, it allows them to perform register allocation more 
%efficiently or to change the order in which instructions are performed.

To show that this is a real issue, we changed the last example slightly
and compiled it with \GCC{}:\footnote{Using 
\lstinline|gcc -O2 -std=c99 -pedantic -Wall|, version 4.7.1. These compiler
flags are used for all future examples.}

\begin{lstlisting}
int x = 30, y = 31;
int *p = &x + 1, *q = &y;
if (memcmp(&p, &q, sizeof(p)) == 0) {
  *p = 10;
  printf("%d %d\n", *p, *q);
}
\end{lstlisting}

\noindent
This does not give any compiler warnings and executing the program prints two
distinct values `\lstinline|10 31|'.
Despite the fact that \lstinline|p| and \lstinline|q| are identical
at the bit level (which follows from the fact that the \lstinline|printf| is 
executed at all), they still behave differently at the object level, as
\dr{260} indeed allows.

\subsection{More extreme aliasing analysis}
\label{section:more_extreme_aliasing}
Given that \dr{260} allows a compiler to take the origin of a pointer value
into account, a natural question is whether that also holds for non-pointer
values.
Particularly, does this hold for the integer type \lstinline|intptr_t|?
This integer type has the property that any pointer of type \lstinline|(void *)| can be
converted to it, and when such an integer value is converted back to a
pointer of type \lstinline|(void *)|, it will compare equal to the original
pointer~\cite[7.20.1.4]{iso:12}.
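On implementations that provide \lstinline|intptr_t| (it is optional), the
round-trip guarantee itself is easily demonstrated; a sketch:

\begin{lstlisting}
#include <stdint.h>
#include <stdio.h>

int main(void) {
  int x = 30;
  void *p = &x;
  intptr_t i = (intptr_t)p;  /* pointer to integer... */
  void *q = (void *)i;       /* ...and back again */
  printf("%d\n", p == q);    /* guaranteed to compare equal */
  return 0;
}
\end{lstlisting}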
Consider the following program
\anonymize{(suggested to us by Marc Schoolderman)}
{(from the GCC bug tracker\footnote{
\url{http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54945}})}:

\begin{lstlisting}
int x = 30, y = 31;
int *p = &x + 1, *q = &y;
intptr_t i = (intptr_t)p, j = (intptr_t)q;
printf("%ld %ld %d\n", i, j, i == j);
\end{lstlisting}

\noindent
When compiled with \GCC{}, it outputs:

\begin{lstlisting}
140734814994316 140734814994316 0
\end{lstlisting}

\noindent
This means that although the integer variables \lstinline|i| and \lstinline|j|
have the same numerical value \lstinline|140734814994316|, they still compare as
different because the origins of their values differ.

\anonymize{
Now should this be taken to be a bug in \GCC{}, or is this a legitimate -- although
extreme -- variant of the behaviors that \dr{260} allows for?
We reported this problem to the \GCC{} bug
tracker\footnote{See~\url{http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54945}}.
In the reactions to that report this was unequivocally considered to be a bug.
}{
In the reactions to the bug report from which this example was taken, this was unequivocally
considered to be a bug.
}
In other words, although some compiler writers think the license for
optimizations that \dr{260} gives them is justified, this example was
too extreme for the \GCC{} community to accept in that light.

Regardless of whether optimizations like the one in the above example are
justified, it is not clear where the border of what is allowed should lie.
Specifically, is \lstinline|p == q| allowed to evaluate to \lstinline|0| in
case the storage of \lstinline|x| and \lstinline|y| is allocated adjacently?
And what about the bitwise comparison \lstinline|memcmp(&p, &q, sizeof(p))|,
could that also take `origin' into account?

To use the semantics to certify an optimizing compiler, it makes sense to let
these `dangerous' pointer comparisons exhibit undefined behavior.
For example, a first pass of the \CompCert{} compiler is to remove 
non-determinism, as this significantly eases correctness proofs~\cite{ler:09}.
However, this means that the phase that removes non-determinism has to
fix whether such a comparison is true or not.
This is problematic, as future phases may change the memory layout, and
therefore may change the truth of such comparisons.

\subsection{Three directions for improvement}
Maclaren describes similar unclarities of the standard with respect to the
size of array objects corresponding to pointer values~\cite{mac:01}.
He presents three directions the standard may take:
% We describe these directions with respect to pointer aliasing versus bit
% representations.
\begin{enumerate}
\item \textbf{The original Kernighan \& Ritchie approach.}
	Pointer values are simply addresses, and their object representations are
	all that matters.
\item \textbf {Pointers carry their origin.}
	Each pointer value carries its \emph{origin}: a history on how it is
	constructed.
	Certain operations (\eg{} dereferencing a pointer) are disallowed if the
	origin conflicts.
%	This history should be preserved by various operations on pointers.
%	Furthermore, if the pointer is refined to a subobject (for example by array
%	indexing, or by taking a
%	field of a structure), it is impossible to get the original pointer
%	back, even if they correspond to the same address in memory.
\item \textbf{Only visible information is relevant.}
	This is a weaker variant of the previous one where the history
	is limited to a certain visibility, for example the current scope.
\end{enumerate}

\noindent
The first approach is taken by the semantics of Norrish~\cite{nor:98}
in \HOL{}.
Although this approach is clearly
understood, and is convenient to formalize, it implies that pointers with equal
object representation always have to be treated as equal.
Therefore, optimizations such as those
performed by \GCC{} and allowed by \dr{260} are not sound with
respect to Norrish's semantics.

The second approach allows the most compiler optimizations and is therefore
the most appealing one.
However, it is unclear how bit-level operations and library functions like
\lstinline|memcpy| should deal with a pointer's origin.
An earlier version of the \CompCert{} memory model by Leroy and
Blazy~\cite{ler:bla:08} models bytes as abstract entities, instead of
simple sequences of bits (even at the level of the assembly), to deal with this
issue.
This approach ensures that the abstract information can easily be preserved,
even by byte-level operations.
The drawback is of course that many bit-level operations become undefined.
Hence, in a more recent version of their memory model~\cite{ler:app:bla:ste:12},
only bytes that constitute a pointer are abstract entities, whereas those that
constitute integers or floating point numbers are sequences of bits.

The semantics by Ellison and Rosu~\cite{ell:ros:12:tr} in the \Kframework{} 
uses a similar representation.
Symbolic execution is used to model bytes of pointers as abstract entities.

The third approach requires a careful definition of `visibility'.
It is unclear whether such a definition can be given that is both consistent
and that allows sufficient compiler optimizations~\cite{mac:01}.
This approach therefore seems not very attractive.

The memory model of our prototype semantics takes the second approach to its fullest
extent.
Our memory is a finite map from indexes to trees that expose the full structure
of values, and pointers are paths through these trees.
Subobjects (subarrays, fields of structures or unions) correspond to
subtrees of the memory.

The origin of pointers in our semantics is much more detailed than
its counterpart in \CompCert{}~\cite{ler:app:bla:ste:12} and in the semantics of
Ellison and Rosu~\cite{ell:ros:12:tr}.
There, only a pointer's block and offset into that block is stored, whereas
we store the entire path corresponding to the history of the construction of the pointer.
In particular, since we do not flatten arrays and structures, we are able to
distinguish subobjects.
For example, our semantics is aware that something special
is happening in the
following code:

\begin{lstlisting}
int a[2][2] = { {13, 21}, {34, 55} };
struct t { int *r, *p, *q; } s;
s.p = &a[0][2]; s.q = &a[1][0];
if (s.p == s.q)
  printf("%d\n", *s.p);
\end{lstlisting}

\noindent
Figure~\ref{figure:memory_tree} displays the representation of the memory
after executing the first three lines of this code in both a concrete memory
and our abstract memory as trees.

Our semantics gives special treatment to pointers that point one past
the last element of an object.
First of all, we do not allow these pointers to be dereferenced.
Secondly, when such a pointer is used in a comparison with a pointer to another
subobject, we can easily impose undefined or non-deterministic behavior.
For the moment (although this probably does not follow from the \C{} standard),
we use undefined behavior to be as cautious as possible with
respect to questionable situations as described in
Section~\ref{section:more_extreme_aliasing}.

%\CompCert{}~\cite{ler:app:bla:ste:12} does not allow pointers to the end of an
%object to be used at all.
%It seems not to be a fundamental limitation of their model to incorporate a
%similar treatment as ours and Ellison and Rosu's.

Of course, the drawback of our current memory model is that bytes are not
present at all, whereas the \CompCert{} memory model at least allows
byte-level operations on objects solely consisting of integers and floating
point numbers.
But since the intention of our semantics was to obtain a better understanding
of the high-level part of the \C{} memory, for the moment we postponed
accounting for object representations in our model.

\begin{figure}[t!]
\centering
\subfloat[flattened]{$\hspace{-1em}$\begin{tikzpicture}[]
\draw[densely dashed] (0,0) -- +(-0.4,0);
\draw[densely dashed] (0,-0.5) -- +(-0.4,0);
\draw[snake=brace,segment amplitude=4mm] (0,0) -- +(4, 0)
  node [pos=0.5,anchor=south,yshift=4mm] {\lstinline|a|};
\draw[fill=green] (0,0) rectangle node {\(13\)} +(1,-0.5);
\draw[fill=green] (1,0) rectangle node {\(21\)} +(1,-0.5);
\draw[fill=green] (2,0) rectangle node {\(34\)} +(1,-0.5);
\draw[fill=green] (3,0) rectangle node {\(35\)} +(1,-0.5);
\draw[densely dashed] (4,0) -- +(1,0);
\draw[densely dashed] (4,-0.5) -- +(1,0);
\draw[snake=brace,segment amplitude=4mm] (5,0) -- +(3, 0)
	node [pos=0.5,anchor=south,yshift=4mm] {\lstinline|s|};
\draw[fill=yellow] (5,0) rectangle node {\(\undef\)} +(1,-0.5);
\draw[fill=yellow] (6,0) rectangle node (d1) {\(\bullet\)} +(1,-0.5);
\draw[fill=yellow] (7,0) rectangle node (d2) {\(\bullet\)} +(1,-0.5);
\draw[ultra thick,->] (d1) .. controls (3,-1.3) .. (2,-0.5);
\draw[ultra thick,->] (d2) .. controls (3.5,-1.6) .. (2,-0.5);
\draw[densely dashed] (8,0) -- +(0.4,0);
\draw[densely dashed] (8,-0.5) -- +(0.4,0);
\end{tikzpicture}$\hspace{-1em}$}

\subfloat[abstract]{$\hspace{-1em}\begin{tikzpicture}
\draw[fill=green] (1.75,0) rectangle node {\lstinline|a|} +(1,-0.5);

\draw (2.25,-0.5) -- +(0,-0.5);
\draw[fill=green] (1.25,-1) rectangle node {} +(1,-0.5);
\draw[fill=green] (2.25,-1) rectangle node {} +(1,-0.5);

\draw (1.75,-1.5) -- +(-0.75,-0.5);
\draw (2.75,-1.5) -- +(0.75,-0.5);
\draw[fill=green] (0,-2) rectangle node {\(13\)} +(1,-0.5);
\draw[fill=green] (1,-2) rectangle node {\(21\)} +(1,-0.5);
\draw[fill=green] (2.5,-2) rectangle node {\(34\)} +(1,-0.5);
\draw[fill=green] (3.5,-2) rectangle node {\(35\)} +(1,-0.5);

\draw (6.75,-0.5) -- +(-1.25,-0.5);
\draw (6.75,-0.5) -- +(0,-0.5);
\draw (6.75,-0.5) -- +(1.25,-0.5);
\draw[fill=yellow] (6.25,0) rectangle node {\lstinline|s|} +(1,-0.5);
\draw[fill=yellow] (5,-1) rectangle node {\(\undef\)} +(1,-0.5);
\draw[fill=yellow] (6.25,-1) rectangle node (d1) {\(\bullet\)} +(1,-0.5);
\draw[fill=yellow] (7.5,-1) rectangle node (d2) {\(\bullet\)} +(1,-0.5);

\draw[ultra thick,->] (d1) .. controls (6,-3.3) and (2.4,-3.3) .. (2,-2.5);
\draw[ultra thick,->] (d2) .. controls (7,-3.2) and (2.9,-3.2).. (2.5,-2.5);
\end{tikzpicture}$\hspace{-1em}}

\begin{comment}
\subfloat[abstract]{\(\hspace{-1em}\)\begin{tikzpicture}
\draw (0,0) node {\lstinline|a|}
child [sibling distance=25mm] {
	node[](b1){\(\cdot\)}
	child [sibling distance=8mm]{node[]{\(13\)}}
	child [sibling distance=8mm] {node[](c1){\(21\)}}
	child [sibling distance=8mm] {node[](c2){} edge from parent[->,dotted]}
}
child [sibling distance=25mm] {
	node[]{\(\cdot\)}
	child [sibling distance=8mm] {node[](c3){\(34\)}}
	child [sibling distance=8mm] {node[]{\(55\)}}
	child [sibling distance=8mm] {node[]{} edge from parent[->,dotted]}
}
child [sibling distance=25mm] {node[]{} edge from parent[->,dotted]};
\draw(4.5,0) node[](a2){\lstinline|s|}
child [sibling distance=8mm] {node[]{\(\undef\)}}
child [sibling distance=8mm] {node[](d1){\(\bullet\)}}
child [sibling distance=8mm] {node[](d2){\(\bullet\)}};

\draw[ultra thick,->] (d1) .. controls (1,-4) .. (c2);
\draw[ultra thick,->] (d2) .. controls (1.5,-4) .. (c3);
\end{tikzpicture}\(\hspace{-1em}\)}
\end{comment}

\caption{Example contents of the memory.}
\label{figure:memory_tree}
\end{figure}

\section{The common initial sequence}
\label{section:structs}
\C{} supports various ways of building complex data
types: in particular, structure, union, and array types.
Structures are like product types and unions are like sum types.
Due to the low-level nature of \C{}, unions are \emph{untagged} 
instead of \emph{tagged}, which means that the current variant of the union is
not stored.
Consider:

\begin{lstlisting}
union int_or_float { int x; float y; };
\end{lstlisting}

\noindent
Given an object of type \lstinline|int_or_float|, it is not possible to
test whether it contains the \lstinline|int| or the \lstinline|float| variant.
This may seem unnatural, but it is in the
spirit of \C{} to let the programmer decide whether or 
not to store the tag.

\begin{figure}[t!]
\centering
\begin{tikzpicture}[]
\draw (0,-0.25) node {\lstinline|s1|};
\draw[fill=green] (0.5,0) rectangle node {\lstinline|m|} +(2,-0.5) ;
\draw[fill=yellow] (2.5,0) rectangle node {\lstinline|f|} +(4,-0.5) ;

\draw (0,-0.75) node {or};

\draw (0,-1.25) node {\lstinline|s2|};
\draw[fill=green] (0.5,-1) rectangle node {\lstinline|m|}+(2,-0.5) ;
\draw[fill=yellow] (2.5,-1) rectangle node {\lstinline|s|} +(1,-0.5) ;
\draw (3.5,-1) rectangle +(3,-0.5) ;

\draw[densely dashed] (0.5,-0.5) -- +(0, -1) ;
\draw[densely dashed] (2.5,-0.5) -- +(0, -1) ;
\draw[densely dashed] (6.5,-0.5) -- +(0, -1) ;
\end{tikzpicture}
\caption{A union containing two structures with a common initial sequence.}
\label{figure:cis}
\end{figure}

An interesting consequence of untagged unions is the standard's \emph{common 
initial sequence} rule~\cite[6.5.2.3]{iso:12}.
Consider:

\begin{lstlisting}
struct t1 { int m; float f; };
struct t2 { int m; short s; };
union { struct t1 s1; struct t2 s2; } u;
\end{lstlisting}

\noindent
The representation of the object \lstinline|u| might look as pictured in 
Figure~\ref{figure:cis}.
Although there may be additional space between the members (due to
alignment), the standard guarantees that the integer parts always coincide.
Even stronger, in this case it guarantees the following~\cite[6.5.2.3p6]{iso:12}:

\begin{quote}
\ldots{} it is permitted to inspect the common initial part of any of them 
anywhere that a declaration of the completed type of the union is visible.
\end{quote}

\noindent
For example, that means we are allowed to do things like:

\begin{lstlisting}
int main(void) {
  u.s2.m = 20;
  printf("%d\n", u.s1.m);
}
\end{lstlisting}

\noindent
So, we set the integer part via one variant of the union, and read it out
via the other.
However, the following program exhibits undefined behavior as the relation to
the union type is not visible in the body of the function \lstinline|f|.

\begin{lstlisting}
int f(struct t1 *p1, struct t2 *p2) {
  p2->m = 20;
  return p1->m;
}
int main(void) {
  printf("%d\n", f(&u.s1, &u.s2));
}
\end{lstlisting}

\noindent
This restriction allows compilers to make stronger aliasing assumptions about
\lstinline|p1| and \lstinline|p2| because their types are different.
Real compilers, like \GCC{}, happily use this, and indeed this example
can be adapted for \GCC{} such that something 
different from the naively expected `\lstinline|20|' is printed.

The standard's definition of `visible' is rather unclear, especially when
a pointer to a common initial segment is passed through another function.
For example, consider:

\begin{lstlisting}
int *f(int *p) { return p; }
int main(void) {
  u.s2.m = 20;
  printf("%d\n", *f(&u.s1.m));
}
\end{lstlisting}

\noindent
Does passing the pointer through \lstinline|f| remove the visibility of the
common initial segment?

The \GCC{} manual only allows the common initial segment rule to be used when
an object is accessed directly via the union type, and therefore gives no
guarantees about the previous program, and not even about simpler cases like the
following program:

\begin{lstlisting}
int main(void) {
  u.s2.m = 20;
  int *p = &u.s1.m;
  printf("%d\n", *p);
}
\end{lstlisting}

\noindent
Our semantics takes the same position as the \GCC{} manual.
We have implemented this by annotating each structure fragment in the path of
a pointer with a flag indicating whether the common initial segment rule may be
used.
When a pointer is stored in memory, this flag is cleared.
As a result, undefined behavior is imposed on the last two examples.

\section{Indeterminate values}
\label{section:uninitialized}
Uninitialized variables and padding bytes of objects of structure type take 
an \emph{indeterminate value}~\cite[6.2.6.1p6]{iso:12}. 
An indeterminate value is either an unspecified value 
or a \emph{trap representation}~\cite[3.17.2]{iso:12}.
A trap representation is an object representation that does not represent a 
value of the object type; reading it leads to undefined 
behavior~\cite[6.2.6.1p5]{iso:12}. 

\subsection{Uninitialized variables}
Since an object of type \lstinline|unsigned char| cannot have a trap
representation, reading it does not exhibit undefined behavior.
Instead it just gives an unspecified value.
This property is important to allow simple-minded bit-wise copying of 
structures, without having to worry about padding bytes between members.
For example:

\begin{lstlisting}
struct T { short a; int b; } x = {10, 11}, y;
for (size_t i = 0; i < sizeof(x); i++)
  ((unsigned char*)&y)[i] =
  ((unsigned char*)&x)[i];
\end{lstlisting}

\noindent
Figure~\ref{figure:structure_T} displays the representation of \lstinline|x| 
on a typical 32-bit architecture with a 4-byte alignment requirement for
integers.
This means that integers must be aligned at addresses that are multiples of 4,
and therefore there are 2 padding bytes between the members.
If reading indeterminate values of type \lstinline|unsigned char| (and in 
particular these padding bytes) exhibited undefined behavior, this copy 
would exhibit undefined behavior too.

\begin{figure}[t!]
\centering
\begin{tikzpicture}[]
\draw[fill=green] (0,0) rectangle node {\lstinline|a|} +(2,-0.5);
\draw[] (2,0) -- +(2,0);
\draw[] (2,-0.5) -- +(2,0);

\draw[fill=yellow] (4,0) rectangle node {\lstinline|b|} +(4,-0.5) ;
\draw[] (1,-0.4) -- +(0,-0.1);
\draw[] (1,0) -- +(0,-0.1);
\draw[] (3,0) -- +(0,-0.5);
\draw[] (5,0) -- +(0,-0.5);
\draw[] (6,-0.4) -- +(0,-0.1);
\draw[] (6,0) -- +(0,-0.1);
\draw[] (7,0) -- +(0,-0.5);
\end{tikzpicture}
\caption{\lstinline|struct T \{ short a; int b; \}|
on a typical 32-bit architecture.}
\label{figure:structure_T}
\end{figure}

An interesting property of indeterminate values is that \drcite{260} allows them
to change arbitrarily, so reading an indeterminate value twice might yield 
different results.
This is useful for an optimizing compiler, which may determine that the actual
lifetimes of two values are disjoint and therefore let them share a storage
location.
As an example (the mysterious \lstinline|&x| will be explained later),

\begin{lstlisting}
unsigned char x; &x;
printf("%d\n", x);
printf("%d\n", x);
\end{lstlisting}

\noindent
does \emph{not} exhibit undefined behavior (because an object of type 
\lstinline|unsigned char| cannot contain a trap value), but \dr{260} allows
the two printed values to be different.
This is not so strange: the compiler might do liveness analysis and decide
that \lstinline|x| does not need to be saved on the stack when calling
\lstinline|printf|.
And then of course \lstinline|printf| might clobber the register that
contains \lstinline|x|.

Unfortunately, the standard is very unclear about the imposed behavior of
various operations on indeterminate values, \eg{} what happens when they
are copied or used in expressions.
For example, should \lstinline|y| be indeterminate after evaluating

\begin{lstlisting}
unsigned char x, y; &x;
y = x/2;
\end{lstlisting}

\noindent
Surely the most significant bit of \lstinline|y| will be 0 after this?
Or is that not something that the standard guarantees?
But if \lstinline|y| is \emph{not} indeterminate after this,
what about:

\begin{lstlisting}
y = x/1;
\end{lstlisting}

\noindent
We just changed the constant,
so should \lstinline|y| also be determinate after this statement?
But after:

\begin{lstlisting}
y = x;
\end{lstlisting}

\noindent
will it still not be indeterminate?
This seems almost indistinguishable from the \lstinline|x/1| case,
but surely the liveness analysis argument now also applies to \lstinline|y|?
A formal C semantics will have to take a stance on this.

Also, should the following print `\lstinline|0|', or
may it print a different value as well, because \lstinline|x| may have changed
during the evaluation of the subtraction expression?

\begin{lstlisting}
unsigned char x; &x;
printf("%d\n", x - x);
\end{lstlisting}

\drcite{338} remarks that on some architectures (\eg{} IA-64) registers 
may hold trap values that do not exist in memory.
Thus, for such architectures, programs cannot safely copy uninitialized
variables of type \lstinline|unsigned char| because these might reside in
registers.
In the \Celeven{} standard this problem has been fixed by including the
following workaround~\cite[6.3.2.1p2]{iso:12}:

\begin{quote}
If the lvalue designates an object of automatic storage duration that could 
have been declared with the register storage class (never had its address 
taken), and that object is uninitialized (not declared with an initializer and 
no assignment to it has been performed prior to use), the behavior is undefined.
\end{quote}

\noindent
This is the reason we had to take the address of the uninitialized variables
in the preceding examples.
%Our argument that a compiler might clobber the register remains of course
%valid as on many architectures registers \emph{cannot} hold trap values that
%do not exist in memory.
Of course even with the \lstinline|&x| present, a compiler can decide to ignore it
and still use a register for \lstinline|x|, as this address is never used.

This workaround again shows that treating uninitialized objects as 
indeterminate has its difficulties.
In the memory model of our semantics we keep track of uninitialized (or freed) 
memory by special \(\undef\) nodes (see
Figure~\ref{figure:memory_tree}).
Since objects of structure or union type cannot be indeterminate, we only
allow these \(\undef\) nodes on the leaves.
Again, we take the most cautious route, and let operations on these special
\(\undef\) nodes, like reading them, exhibit undefined behavior.

\subsection{Pointers to freed memory}
The standard states that the value of a pointer variable becomes indeterminate when the 
object it points to reaches the end of its lifetime~\cite[6.2.4]{iso:12}.
In particular this means that whenever some memory is freed, all pointers 
to it become indeterminate. 
For example, assuming the \lstinline|malloc|s do not fail, the following 
program can still exhibit undefined behavior

\begin{lstlisting}
int *p = malloc(sizeof(int));
free(p);
int *q = malloc(sizeof(int));
printf("%d\n", p == q);
\end{lstlisting}

\noindent
because the value of the pointer \lstinline|p| has become indeterminate and
now can be a trap value.
Of course, we can still compare the bit patterns of both pointers, and if they
are equal, try to use \lstinline|p| instead.

\begin{lstlisting}
int *p = malloc(sizeof(int));
free(p);
int *q = malloc(sizeof(int));
if (memcmp(&p, &q, sizeof(p)) == 0)
  *p = 10;
\end{lstlisting}

\noindent
Again, \drcite{260} states that this program exhibits undefined behavior.
%To treat this in a formal semantics, the memory model needs to know which bits
%belong to pointers to freed memory.

The fact that a pointer object becomes indeterminate after the block
it points to has been freed means that if we copy pointers to various
places in memory, then all copies should become indeterminate and not
just the argument of \lstinline|free|
(which does not even need to be
an lvalue).
This means that a \lstinline|free| operation will affect the formal memory state globally.
And what about individual bytes of a pointer that have been copied?
Will these also become indeterminate after freeing that pointer?

\subsection{Padding bytes}
The standard states that when a value is stored in a member of an 
object of structure or union type, padding bytes take an unspecified 
value~\cite[6.2.6.1p6]{iso:12}, and that when a value is stored in a member of 
a union type, bytes that do not correspond to that member take an unspecified 
value~\cite[6.2.6.1p7]{iso:12}. 
Consider:\footnote{Adapted from Shao Miller, Bounds Checking as Undefined 
Behaviour, \texttt{comp.std.c} newsgroup, July 29, 2010.}

\begin{lstlisting}
union { int a[1]; int b[2]; } x;
x.b[0] = 10; x.b[1] = 11;
x.a[0] = 12;
printf("%d\n", x.b[1]);
\end{lstlisting}

\noindent
This example can lead to undefined behavior, as assigning to \lstinline|x.a|
makes the bytes of \lstinline|x.b| that do not belong to \lstinline|x.a| 
unspecified, and thus \lstinline|x.b[1]| possibly indeterminate.
Taking \dr{260} into account, it is unclear whether the bytes
that belong to \lstinline|x.b[1]| after this may change arbitrarily. 

This program also suggests that the representation of pointers in a formal semantics
should contain information describing which parts of the memory should
become indeterminate upon assignment to them. 
If instead of assigning to \lstinline|x.a[0]| directly we do:

\begin{lstlisting}
int *p = &x.a[0];
*p = 12;
\end{lstlisting}

\noindent
\lstinline|x.b[1]| will \emph{still} become indeterminate.
But the assignment to \lstinline|p| might happen anywhere in the program, even
in a context where the union type is not visible at all.

%As usual, the information of the fact that assigning to \lstinline|*p| will 
%change the status of a different part of the memory does not have to be 
%reflected in the object representation.

\section{Stack overflow}
\label{section:stackoverflow}
A programming language implementation typically organizes its storage into 
two parts: the \emph{stack} and the \emph{heap}.
On a function call, the stack is extended with a frame containing the
function's arguments, local variables and return address (the address of the instruction to be 
executed when the function is finished).
The heap contains dynamically allocated storage.

The standard abstracts from implementation details like these, and
thus also allows implementations that do not organize their memory in this way.
We agree that this is the right approach, but we do believe it should at 
least account for (an abstract version of) the problem of \emph{stack overflow}.
Unfortunately, the notion of stack overflow is not mentioned by the
standard~\cite{iso:12} or the standard's rationale~\cite{iso:03} at all.
This is very troublesome, as for most actual implementations stack overflow
is a real problem.
Let us consider the following function.

\begin{lstlisting}
/*@ decreases n; ensures \result == n; */
unsigned long f(unsigned long n) {
  if (n != 0) return f(n - 1) + 1;
  else return 0;
}
\end{lstlisting}

\noindent
With most tools for \C{} verification one can prove that
the function \lstinline|f| behaves as the identity function
(for example, the \Jessie{} plug-in for \FramaC{}~\cite{moy:mar:09} allowed us
to do so).
%turns these four lines into
%92 lines of Why3ML code, leading to four verification conditions
%that are all trivially provable).
However, in a real \C{} implementation, a call like \lstinline|f(10000000)|
will \emph{not} return \lstinline|10000000|, but will crash with a
message like `\lstinline|segmentation fault|'.

Stack overflow does not necessarily have to result in a crash
with an error message, but might also overwrite non-stack parts
of the memory (possibly putting the address of virus code there).
Also, it can occur without function calls.
For example, the program

\begin{lstlisting}
int main(void) { int a[10000000]; }
\end{lstlisting}

\noindent
might also crash\footnote{
In order to make this really happen with an optimizing compiler,
one might need to use the array
\lstinline|a| to prevent it from being optimized away.
For example when compiled with
\GCC{} or \clang{} (with \texttt{-O2}) the following crashes when
running it with the standard stack size:
\begin{lstlisting}
int main(void) {
  int a[10000000];
  for (int i = 0; i < 10000000; i++) a[i] = i;
  return a[10];
}
\end{lstlisting}}.

This all means that a proof of correctness of a program with respect
to the standard only guarantees correctness relative to the assumption that the
stack does not overflow.
As we have seen, this assumption does not hold in general.
But worse, it is not even clear for which \(n\) the program

\begin{lstlisting}
int main(void) { int a[$n$]; }
\end{lstlisting}

\noindent
is guaranteed not to overflow the stack.
On a microcontroller this might already happen for rather small \(n\).
All this means that a correctness proof of a program with respect to the \C{}
standard actually does not guarantee anything.

\begin{comment}
Therefore, as there is no a clear division between the \(n\) that are safe and
the ones that are not, it seems reasonable that even the extreme program,
which is a variation on the
case for \(n = 0\),

\begin{lstlisting}
int main(void) { }
\end{lstlisting}

\noindent
potentially could overflow a (very) small stack.
For one thing, there need to be space on the stack for the \lstinline|int|
return value, and then the runtime environment that calls \lstinline|main|
also might need a bigger stack than is there.
\end{comment}

The obvious way to change the text of the \C{} standard to address this issue
would be to add to~\cite[6.5.2.2]{iso:12} something like:

\begin{quote}
A function call %(even the initial call to \lstinline|main|)
might overflow the
stack, in which case the behavior is undefined.
It is implementation defined under what circumstances this is guaranteed not to
happen.
\end{quote}

\noindent
It is important that it is not just \emph{unspecified} when these overflows
happen, for then it would still not be possible to
formally guarantee anything about a program from the text of the standard.

A consequence of this addition would be that \emph{strictly conforming programs}
no longer exist, as one of their defining properties is that they `shall not
produce output dependent on {\dots} implementation defined
behavior'~\cite[4.5]{iso:12}.
Of course, once one takes stack overflow into account
no program has that property anymore.
%
Another consequence is that then \emph{all} \C{} implementations become
\emph{conforming}, as that notion is defined by quantification over all 
strictly conforming programs~\cite[4.6]{iso:12}.
Therefore, the definition of conforming \C{} implementations should also 
be adapted.

Section 5.2.4.1 of the standard states that each implementation
`shall be able to translate and execute at least one program that [satisfies a list of properties].'
This seems to imply that for each implementation
at least \emph{one} program should exist that does \emph{not} crash with
stack overflow.
However, this does not imply that there exists a program that
\emph{all} implementations will execute correctly (the program
might be different for different implementations), and hence
it still does not imply the existence of any strictly conforming program.
Also, the section really seems to be about lower bounds for limits
of the compilation process: limits on the execution of the program
are not really taken into account.

The fact that the standard does not provide a portable way to test for stack
overflow is to us one of the main omissions of the language.
A call to \lstinline|malloc| will return a \NULL{} pointer when there
is no storage available, which means that a program can test for that 
situation.
But there is no counterpart to this for function calls, or for entering a block.
Variable length arrays, discussed in
Section~\ref{section:subjectreduction}, make this an even bigger omission.

Nearly all existing formal \C{} semantics seem to ignore stack
overflow entirely.
For example, \CompCert{}~\cite{ler:09} models an infinite memory and stack,
even at the level of the assembly language.
Although this is undesirable, it is unclear how to deal with stack overflow
in a formal semantics in such a way that it can still be used for compiler
correctness proofs.
The following approaches are clearly not fully satisfactory:

\begin{itemize}
\item Change the compiler correctness proof to guarantee that a program is
either compiled to a program that exhibits the correct behavior, or to a
program that runs out of memory or stack.
In this case, compilation to a program that always runs out of memory would be
considered correct.

\item Change the compiler correctness proof to guarantee that a program is
compiled correctly only if the source program has some upper bound on its stack
and memory usage.
This approach is non-trivial because a compiler does not necessarily preserve
the stack memory consumption of the program it compiles. 
For example, inlining of functions and spilling of variables may
increase the stack usage arbitrarily~\cite{ler:09}.
\end{itemize}

\noindent
Verification with stack overflow
taken into account is an instance of verification of the resource
consumption of a program.
The \CerCo{} project~\cite{cer:xx} aims at implementing a formally
verified complexity-preserving
\C{} compiler, and therefore falls into this class as well.

\section{Failure of the subject reduction property}
\label{section:subjectreduction}
A desirable property of a typed programming language is \emph{subject reduction},
which means that evaluation preserves typing.
As proven by Norrish, this property holds for (his small-step expression 
semantics of) \Ceightynine~\cite{nor:98}.
We will argue that due to the introduction of variable length arrays in
\Cninetynine, this property no longer holds.
Before pursuing the problem of subject reduction, we briefly introduce variable
length arrays and some of the other problems they raise.

Before \Cninetynine, arrays were required to have a size that could be
determined at compile-time.
To work around this restriction, programmers had to use dynamically allocated memory 
(through \lstinline|malloc| and \lstinline|free|) for
arrays of dynamic size.
To loosen this restriction, \Cninetynine{} introduced support for
\emph{variable length arrays} (VLAs) to the language.

The first shortcoming of VLAs is related to the fact that there is no portable way
to test if there is sufficient storage available (on the stack) when entering a block
(see Section~\ref{section:stackoverflow}).
Since most implementations use the stack to store VLAs, not being able to 
perform such a test makes VLAs dangerous in practice.
Consider the following program.

\begin{lstlisting}
int main(void) {
  int n;
  scanf("%d", &n);
  int a[n];
}
\end{lstlisting}

\noindent
Since there is no way to test if there is enough space on the 
stack, it is impossible to portably ensure that the program does not crash.

Another problem of VLAs is that \C{} allows casts to variable length types.
Since size expressions in such casts may have side effects, this
makes the situation rather complex.
For example, consider the following program:\footnote{The type
\lstinline|int(*)[f(5)]| should be read as `pointer to integer array of length
\lstinline|f(5)|'.}

\begin{lstlisting}
1 ? (int(*)[f(5)])0 : (int(*)[f(3)])0
\end{lstlisting}

\noindent
where \lstinline|f| is a function returning a positive \lstinline|int|.
We were unable to find anything in the standard on which of the function
calls \lstinline|f(5)| and \lstinline|f(3)| may (or should) be evaluated.
It is reasonable to allow implementations to evaluate neither of them, 
as programs can generally be executed correctly without needing to know the
size of array bounds in pointer casts.
But what about:

\begin{lstlisting}
printf("%d\n",
  sizeof(*(1 ? 0 : (int(*)[f(3)])0)));
\end{lstlisting}

\noindent
Here an implementation clearly needs to evaluate the function call,
to obtain the value of the \lstinline|sizeof|, even though
it is in the branch of the conditional that is not taken.
The standard includes the following related clause~\cite[6.7.6.2p5]{iso:12}:

\begin{quote}
Where a size expression is part of the operand of a \lstinline|sizeof| 
operator and changing the value of the size expression would not affect the 
result of the operator, it is unspecified whether or not the size expression is 
evaluated.
\end{quote}

\noindent
Given that the size expression in the `not taken' branch of
this example \emph{has} to be evaluated, one may wonder whether the expression in
the `not taken' branch of our earlier example is also allowed to be evaluated.

It is also unclear how these function calls in casts are evaluated with
respect to sequence points.
As they are evaluated in branches that are not taken, it seems they are
exempt from the normal execution flow of an expression.
But if they already can be executed before the sequence point that
starts the subexpression that contains them,
is there a reason they cannot be executed before the evaluation of the statement that
contains them starts?

The standard's (implicit) distinction between \emph{static} and \emph{run-time}
typing is the reason that subject reduction breaks.
This means that an expression can get a more restricted type during its
evaluation.
For example, statically the expression \lstinline|(int(*)[f(5)])0| has type
`pointer to integer array of variable length', whereas at run-time it 
will become of the more restricted type `pointer to integer array of length 
\(n\)' where \(n\) is the result of evaluating \lstinline|f(5)|.

The previous example already indicates that one has to sacrifice either subject 
reduction or uniqueness of types (that is, each expression has a unique type).
However, we will argue that the situation is worse, and that for a reduction
semantics whose rules are applied locally, subject reduction will fail even
if expressions are allowed to have multiple types.
Consider:

\begin{lstlisting}
1 ? (int(*)[f(5)])0 : (int(*)[3])0
\end{lstlisting}

\noindent
In this example the two subexpressions \lstinline|(int(*)[f(5)])0| and 
\lstinline|(int(*)[3])0| have (static) types \lstinline|int(*)[*]| and
\lstinline|int(*)[3]|, respectively, where \lstinline|$T$[*]| denotes the
variable length array type over \(T\).
By typing of the conditional and composite types~\cite[6.5.15p6, 6.2.7]{iso:12}
the full expression has type \lstinline|int(*)[3]|.

If the function call \lstinline|f(5)| gets evaluated and returns a value 
different from \lstinline|3|, typing breaks,
\ie{} a well typed expression is evaluated to a non-well typed expression.
Luckily, the standard imposes undefined behavior on this example by 
means of the following clause~\cite[6.7.6.2p5]{iso:12}:

\begin{quote}
If the two array types are used in a context which requires them to be 
compatible, it is undefined behavior if the two size specifiers evaluate 
to unequal values.
\end{quote}

\noindent
Currently, our \C{} semantics deals with this problem by allowing
evaluation of size expressions at any place in a bigger expression.
Only after the conditional gets reduced do we check whether both branches have
compatible types, and if not, assign undefined behavior.
This approach has two obvious drawbacks. First of all, it breaks subject
reduction, and secondly, undefined behavior is caught late, or 
might not be caught at all. For example, in

\begin{lstlisting}
g() ? (int(*)[f(5)])0 : (int(*)[3])0
\end{lstlisting}

\noindent
it may happen that \lstinline|f(5)| is evaluated first and returns a value 
unequal to \lstinline|3|, in which case typing has already failed, but
our semantics has not yet established this.
After that, it may happen that the call to \lstinline|g()| does
not return (for example because it invokes an \lstinline|exit| or loops
indefinitely), so the undefined behavior is never described by our
semantics.

But does this mean that in a formal \C{} semantics each reduction step has to
establish that typing does not break in order to catch undefined behavior?
This would restore some version of subject reduction, but it would destroy
the locality of a small-step reduction semantics.

\section{Conclusion}
\label{section:conclusion}

\subsection{Discussion}
We conclude our discussion of the subtleties of the \C{} standard by asking two questions:
\begin{enumerate}
\item Is the interpretation of the \C{} standard that we presented in this
	paper a reasonable reading of the standard?
\item Is the \C{} standard itself (under this interpretation) reasonable?
\end{enumerate}

\noindent
First of all, we claim that the standard is not fully clear:
the standard committee's response in \drcite{260} does not follow obviously
from the text of the standard.
Also, Maclaren~\cite{mac:01} presents various issues
about which the standard does not have an unambiguous reading, even when
taking \dr{260} into account.
%One might even claim that the exact text of the standard
%(without giving it a liberal reading) is inconsistent, \ie{} that from a logical
%point of view the standard implies anything one likes.
%But of course that position is not very productive.

Moreover, the relation between \dr{260} and the official standard text is not
entirely clear.
The report \emph{does} include a response by the standard committee, and
actual compilers make use of it, so it should not be taken lightly.
It was a clarification of the \Cninetynine{} version of the standard,
which suggests that there was some defect in that text.
However, the parts of the \Celeven{} standard relevant to the issues
discussed in the report have not changed at all from their counterparts
in the \Cninetynine{} standard.

The standard makes a very clear distinction between `normative' and
`informative' text, and the explanations in \dr{260} certainly
are not part of the normative text of the \Celeven{} standard.
Therefore, one could decide to ignore the committee's response in
this report, especially the notion of the \emph{origin} of an object, which
does not occur in the standard at all.
In that case, one could read the standard in a `Kernighan \& Ritchie' manner.
But then various optimizations performed by actual compilers \emph{will}
be incorrect, and one will get standard compliant behavior only when compiling
with flags like \texttt{-fno-strict-aliasing}.
Many real world programs, like for example the \Linux{} kernel, are
compiled with such flags anyway~\cite{wan:che:che:jia:zel:kaa:12}.
On the other hand, this attitude would mean that standard compliant
compilers would have to produce executables with less than optimal
performance, because they would need to generate extra code to test for
changes due to aliasing.
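The effect of such flags can be illustrated by the following sketch of our
own (the classic type-based alias analysis example; not taken from the
standard):

\begin{lstlisting}
long f(int *x, long *y) {
  *x = 0;
  *y = 1;
  return *x;  /* a compiler performing type-based alias analysis
                 may assume that x and y do not alias, and return 0
                 without rereading *x */
}
\end{lstlisting}

\noindent
On distinct objects \lstinline|i| and \lstinline|l|, the call
\lstinline|f(&i, &l)| returns \lstinline|0|; for the aliased call
\lstinline|f((int *)&l, &l)|, which violates the aliasing
rules~\cite[6.5p7]{iso:12}, the answers \lstinline|0| and \lstinline|1| may
both be observed in practice, depending on whether type-based alias analysis
is enabled.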

%One also wonders what the standard committee thinks of the subtleties
%discussed in this paper.
%Do they agree that the text of the standard is not fully clear about this?
%Also, do they think that these issues matter?

The second question, whether the standard can and should be improved,
is even more interesting.
It seems very difficult (or maybe even impossible)
to give a mathematically precise version of the memory model from the standard,
which is a sign that one should aim for a clearer explanation of the issues involved.
However, it is not at all clear what a mathematically precise clarification
of the standard should look like.

For example, in the memory model of \CompCert, bytes (objects of type
\lstinline|unsigned char|) that constitute a pointer are \emph{not} simple 
sequences of bits, but instead are abstract entities~\cite{ler:app:bla:ste:12}.
This means that treating them like numbers -- calculating with them, or
printing them -- exhibits undefined behavior.
This seems contrary to the spirit of the language.
On the other hand, for actual practical programs, this restriction
on the programs that are considered meaningful is probably not 
important at all.
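As a sketch of our own, consider the following idiom, which copies and
reconstructs the object representation of a pointer via
\lstinline|unsigned char|.
The two \lstinline|memcpy| calls are unproblematic in the \CompCert{}
model, but the summation treats the bytes as numbers:

\begin{lstlisting}
#include <stdio.h>
#include <string.h>

int main(void) {
  int x, *p = &x, *q;
  unsigned char buf[sizeof p];
  unsigned sum = 0;
  memcpy(buf, &p, sizeof p);  /* copy the object representation */
  memcpy(&q, buf, sizeof q);  /* reconstruct the pointer: defined by
                                 ISO C, and also fine in CompCert */
  for (size_t i = 0; i < sizeof p; i++)
    sum += buf[i];            /* calculate with the bytes: defined by
                                 ISO C, undefined in CompCert's model */
  printf("%u\n", sum);        /* print the byte sum of the pointer */
  return q == p ? 0 : 1;
}
\end{lstlisting}

\noindent
The \lstinline|memcpy| round trip is unproblematic in both models; only
the summation and printing of the bytes distinguishes them.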

\subsection{Future work}
\anonymize{
The future work in our \Formalin{} project has two directions.}
{We consider two directions for future work.}
On the one hand we are porting the prototype semantics that we have developed
in \Haskell{} to the theorem prover \Coq.
For the moment, this development mainly ignores the issues described in this
paper and aims at an elegant semantics for non-local control flow, block scoped
variables, expressions with side effects, and sequence points.
\anonymize{
In~\cite{kre:wie:12}, we give
}{
It includes
}
both an operational semantics
and an axiomatic semantics, and
\anonymize{prove}{proves}
that the axiomatic semantics is
correct with respect to the operational semantics.

On the other hand we are experimenting with a formal memory model
in which the issues from this paper are taken into account.
We are investigating two approaches.
The first is to extend the memory representation that we described in
this paper (where objects are represented as trees).
Instead of having values at the leaves, we use sequences of bytes at the leaves.
These bytes are similar to those in \CompCert~\cite{ler:app:bla:ste:12}: they
are either a concrete sequence of bits, or an abstract entity constituting a
fragment of a pointer.
This approach combines the best of both worlds: it still allows many
byte-wise operations while incorporating the aliasing restrictions.
Also, in this approach, the memory remains executable.
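The following sketch of our own (rendered in \C{} for concreteness, whereas
the actual development is carried out in a proof assistant; all names are
hypothetical) illustrates such a byte representation:

\begin{lstlisting}
#include <assert.h>

/* a byte is either a concrete sequence of bits, or the i-th
   fragment of a pointer (an abstract entity, as in CompCert) */
struct abyte {
  enum { BITS, PTR_FRAG } tag;
  union {
    unsigned char bits;                    /* BITS */
    struct { void *p; unsigned i; } frag;  /* PTR_FRAG */
  } u;
};

/* encode a pointer as sizeof(void *) abstract fragments */
static void encode_ptr(struct abyte *out, void *p) {
  for (unsigned i = 0; i < sizeof(void *); i++) {
    out[i].tag = PTR_FRAG;
    out[i].u.frag.p = p;
    out[i].u.frag.i = i;
  }
}

/* decoding succeeds only if all fragments belong to the same
   pointer and appear in order */
static void *decode_ptr(const struct abyte *in) {
  if (in[0].tag != PTR_FRAG)
    return 0;
  void *p = in[0].u.frag.p;
  for (unsigned i = 0; i < sizeof(void *); i++)
    if (in[i].tag != PTR_FRAG || in[i].u.frag.p != p
        || in[i].u.frag.i != i)
      return 0;  /* ill-formed encoding: no pointer value */
  return p;
}

int main(void) {
  int x;
  struct abyte buf[sizeof(void *)];
  encode_ptr(buf, &x);
  assert(decode_ptr(buf) == &x);
  return 0;
}
\end{lstlisting}

\noindent
Byte-wise operations on \lstinline|BITS| bytes remain possible, whereas
arithmetic on \lstinline|PTR_FRAG| bytes is ruled out by construction.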

However, this approach still does not allow pointer bytes to be used as
integers.
Therefore, we are also experimenting with a memory whose state consists of
a pair of an \emph{abstract} version of the memory (based on our tree
representation), and a \emph{concrete} version of the memory (which consists of
sequences of bytes encoding the data).
The puzzle is how to deal with information flow between these two components.
We have not been able to resolve these issues in a satisfactory
way yet.

If we succeed in creating such a `\dr{260}-compliant' formal memory
model, we will incorporate it into the operational semantics that we
are already working on.
We will then also be able to take a guaranteed consistent
position on the issues discussed in this paper.

\acks
\anonymize{
We thank
Rob Arthan,
Sandrine Blazy,
Hans Boehm,
Herman Geuvers,
Andreas Lochbihler,
Nick Maclaren,
Joseph Myers,
Mike Nahas,
Michael Norrish,
Erik Poll
and
Marc Schoolderman
for discussions or feedback on this topic, Xavier Leroy for details
on the \CompCert{} memory model, and the people on the \texttt{comp.std.c} 
newsgroup for helpful discussions on the \C{} standard.

This work is financed by the Netherlands Organisation for Scientific 
Research (NWO) project `A Formalization of the C99 Standard in \HOL,
\Isabelle{} and \Coq'.
}{
Acknowledgments omitted because of double blind reviewing.
}

\bibliographystyle{plain}
\def\bibfont{\small}
\bibliography{../CH2O}

\end{document}
