GHC

ブロック図

The flow of code through GHC is a process of translation between several intermediate languages and optimizations and transformations thereof. A common pattern for many of these AST types is they are parametrised over a binder type and at various stages the binders will be transformed, for example the Renamer pass effectively translates the HsSyn datatype from a AST parametrized over literal strings as the user enters into a HsSyn parameterized over qualified names that includes modules and package names into a higher level Name type.

  • Parser/Frontend: An enormous AST translated from human syntax that makes explicit possible all expressible syntax ( declarations, do-notation, where clauses, syntax extensions, template haskell, ... ). This is unfiltered Haskell and it is enormous.
  • Renamer takes syntax from the frontend and transforms all names to be qualified (base:Prelude.map instead of map) and any shadowed names in lambda binders transformed into unique names.
  • Typechecker is a large pass that serves two purposes, first is the core type bidirectional inference engine where most of the work happens and the translation between the frontend Core syntax.
  • Desugarer translates several higher level syntactic constructors
    • where statements are turned into (possibly recursive) nested let statements.
    • Nested pattern matches are expanded out into splitting trees of case statements.
    • do-notation is expanded into explicit bind statements.
    • Lots of others.
  • Simplifier transforms many Core constructs into forms that are more adaptable to compilation. For example let statements will be floated or raised, pattern matches will simplified, inner loops will be pulled out and transformed into more optimal forms. Non-intuitively the resulting may actually be much more complex (for humans) after going through the simplifier!
  • Stg pass translates the resulting Core into STG (Spineless Tagless G-Machine) which effectively makes all laziness explicit and encodes the thunks and update frames that will be handled during evaluation.
  • Codegen/Cmm pass will then translate STG into Cmm (flavoured C--) a simple imperative language that manifests the low-level implementation details of runtime types. The runtime closure types and stack frames are made explicit and low-level information about the data and code (arity, updatability, free variables, pointer layout) made manifest in the info tables present on most constructs.
  • Native Code The final pass will than translate the resulting code into either LLVM or Assembly via either through GHC's home built native code generator (NCG) or the LLVM backend.

Information for about each pass can dumped out via a rather large collection of flags. The GHC internals are very accessible although some passes are somewhat easier to understand than others. Most of the time -ddump-simpl and -ddump-stg are sufficient to get an understanding of how the code will compile, unless of course you're dealing with very specialized optimizations or hacking on GHC itself.

Flag Action


-ddump-parsed Frontend AST. -ddump-rn Output of the rename pass. -ddump-tc Output of the typechecker. -ddump-splices Output of TemplateHaskell splices. -ddump-types Typed AST representation. -ddump-deriv Output of deriving instances. -ddump-ds Output of the desugar pass. -ddump-spec Output of specialisation pass. -ddump-rules Output of applying rewrite rules. -ddump-vect Output results of vectorize pass. -ddump-simpl Ouptut of the SimplCore pass. -ddump-inlinings Output of the inliner. -ddump-cse Output of the common subexpression elimination pass. -ddump-prep The CorePrep pass. -ddump-stg The resulting STG. -ddump-cmm The resulting Cmm. -ddump-opt-cmm The resulting Cmm optimization pass. -ddump-asm The final assembly generated. -ddump-llvm The final LLVM IR generated.

コア

Core is the explicitly typed System-F family syntax through that all Haskell constructs can be expressed in.

To inspect the core from GHCi we can invoke it using the following flags and the following shell alias. We have explicitly disable the printing of certain metadata and longform names to make the representation easier to read.

alias ghci-core="ghci -ddump-simpl -dsuppress-idinfo \
-dsuppress-coercions -dsuppress-type-applications \
-dsuppress-uniques -dsuppress-module-prefixes"

At the interactive prompt we can then explore the core representation interactively:

$ ghci-core
λ: let f x = x + 2 ; f :: Int -> Int

==================== Simplified expression ====================
returnIO
  (: ((\ (x :: Int) -> + $fNumInt x (I# 2)) `cast` ...) ([]))

λ: let f x = (x, x)

==================== Simplified expression ====================
returnIO (: ((\ (@ t) (x :: t) -> (x, x)) `cast` ...) ([]))

ghc-core is also very useful for looking at GHC's compilation artifacts.

$ ghc-core --no-cast --no-asm

Alternatively the major stages of the compiler ( parse tree, core, stg, cmm, asm ) can be manually outputted and inspected by passing several flags to the compiler:

$ ghc -ddump-to-file -ddump-parsed -ddump-simpl -ddump-stg -ddump-cmm -ddump-asm

Reading Core

Core from GHC is roughly human readable, but it's helpful to look at simple human written examples to get the hang of what's going on.

id :: a -> a
id x = x
id :: forall a. a -> a
id = \ (@ a) (x :: a) -> x

idInt :: GHC.Types.Int -> GHC.Types.Int
idInt = id @ GHC.Types.Int
compose :: (b -> c) -> (a -> b) -> a -> c
compose f g x = f (g x)
compose :: forall b c a. (b -> c) -> (a -> b) -> a -> c
compose = \ (@ b) (@ c) (@ a) (f1 :: b -> c) (g :: a -> b) (x1 :: a) -> f1 (g x1)
map :: (a -> b) -> [a] -> [b]
map f []     = []
map f (x:xs) = f x : map f xs
map :: forall a b. (a -> b) -> [a] -> [b]
map =
  \ (@ a) (@ b) (f :: a -> b) (xs :: [a]) ->
    case xs of _ {
      []     -> [] @ b;
      : y ys -> : @ b (f y) (map @ a @ b f ys)
    }

Machine generated names are created for a lot of transformation of Core. Generally they consist of a prefix and unique identifier. The prefix is often pass specific ( i.e. ds for desugar generated name s) and sometimes specific names are generated for specific automatically generated code. A non exhaustive cheat sheet is given below for deciphering what a name is and what it might stand for:

Prefix Description
$f... Dict-fun identifiers (from inst decls)
$dmop Default method for 'op'
$wf Worker for function 'f'
$sf Specialised version of f
$gdm Generated class method
$d Dictionary names
$s Specialized function name
$f Foreign export
$pnC n'th superclass selector for class C
T:C Tycon for dictionary for class C
D:C Data constructor for dictionary for class C
NTCo:T Coercion for newtype T to its underlying runtime representation

Of important note is that the Λ and λ for type-level and value-level lambda abstraction are represented by the same symbol (\) in core, which is a simplifying detail of the GHC's implementation but a source of some confusion when starting.

-- System-F Notation
Λ b c a. λ (f1 : b -> c) (g : a -> b) (x1 : a). f1 (g x1)

-- Haskell Core
\ (@ b) (@ c) (@ a) (f1 :: b -> c) (g :: a -> b) (x1 :: a) -> f1 (g x1)

The seq function has an intuitive implementation in the Core language.

x `seq` y
case x of _ {
  __DEFAULT -> y
}

One particularly notable case of the Core desugaring process is that pattern matching on overloaded numbers implicitly translates into equality test (i.e. Eq).

f 0 = 1
f 1 = 2
f 2 = 3
f 3 = 4
f 4 = 5
f _ = 0


f :: forall a b. (Eq a, Num a, Num b) => a -> b
f =
  \ (@ a)
    (@ b)
    ($dEq :: Eq a)
    ($dNum :: Num a)
    ($dNum1 :: Num b)
    (ds :: a) ->
    case == $dEq ds (fromInteger $dNum (__integer 0)) of _ {
      False ->
        case == $dEq ds (fromInteger $dNum (__integer 1)) of _ {
          False ->
            case == $dEq ds (fromInteger $dNum (__integer 2)) of _ {
              False ->
                case == $dEq ds (fromInteger $dNum (__integer 3)) of _ {
                  False ->
                    case == $dEq ds (fromInteger $dNum (__integer 4)) of _ {
                      False -> fromInteger $dNum1 (__integer 0);
                      True -> fromInteger $dNum1 (__integer 5)
                    };
                  True -> fromInteger $dNum1 (__integer 4)
                };
              True -> fromInteger $dNum1 (__integer 3)
            };
          True -> fromInteger $dNum1 (__integer 2)
        };
      True -> fromInteger $dNum1 (__integer 1)
    }

Of course, adding a concrete type signature changes the desugar just matching on the unboxed values.

f :: Int -> Int
f =
  \ (ds :: Int) ->
    case ds of _ { I# ds1 ->
    case ds1 of _ {
      __DEFAULT -> I# 0;
      0 -> I# 1;
      1 -> I# 2;
      2 -> I# 3;
      3 -> I# 4;
      4 -> I# 5
    }
    }

See:

インライナ

infixr 0  $

($):: (a -> b) -> a -> b
f $ x =  f x

Having to enter a secondary closure every time we used ($) would introduce an enormous overhead. Fortunately GHC has a pass to eliminate small functions like this by simply replacing the function call with the body of it's definition at appropriate call-sites. There compiler contains a variety heuristics for determining when this kind of substitution is appropriate and the potential costs involved.

In addition to the automatic inliner, manual pragmas are provided for more granular control over inlining. It's important to note that naive inlining quite often results in significantly worse performance and longer compilation times.

{-# INLINE func #-}
{-# INLINABLE func #-}
{-# NOINLINE func #-}

For example the contrived case where we apply a binary function to two arguments. The function body is small and instead of entering another closure just to apply the given function, we could in fact just inline the function application at the call site.

{-# INLINE foo #-}
{-# NOINLINE bar #-}

foo :: (a -> b -> c) -> a -> b -> c
foo f x y = f x y

bar :: (a -> b -> c) -> a -> b -> c
bar f x y = f x y

test1 :: Int
test1 = foo (+) 10 20

test2 :: Int
test2 = bar (+) 20 30

Looking at the core, we can see that in test2 the function has indeed been expanded at the call site and simply performs the addition there instead of another indirection.

test1 :: Int
test1 =
  let {
    f :: Int -> Int -> Int
    f = + $fNumInt } in
  let {
    x :: Int
    x = I# 10 } in
  let {
    y :: Int
    y = I# 20 } in
  f x y

test2 :: Int
test2 = bar (+ $fNumInt) (I# 20) (I# 30)

Cases marked with NOINLINE generally indicate that the logic in the function is using something like unsafePerformIO or some other unholy function. In these cases naive inlining might duplicate effects at multiple call-sites throughout the program which would be undesirable.

See:

辞書

The Haskell language defines the notion of Typeclasses but is agnostic to how they are implemented in a Haskell compiler. GHC's particular implementation uses a pass called the dictionary passing translation part of the elaboration phase of the typechecker which translates Core functions with typeclass constraints into implicit parameters of which record-like structures containing the function implementations are passed.

class Num a where
  (+) :: a -> a -> a
  (*) :: a -> a -> a
  negate :: a -> a

This class can be thought as the implementation equivalent to the following parameterized record of functions.

data DNum a = DNum (a -> a -> a) (a -> a -> a) (a -> a)

add (DNum a m n) = a
mul (DNum a m n) = m
neg (DNum a m n) = n

numDInt :: DNum Int
numDInt = DNum plusInt timesInt negateInt

numDFloat :: DNum Float
numDFloat = DNum plusFloat timesFloat negateFloat
+ :: forall a. Num a => a -> a -> a
+ = \ (@ a) (tpl :: Num a) ->
  case tpl of _ { D:Num tpl _ _ -> tpl }

* :: forall a. Num a => a -> a -> a
* = \ (@ a) (tpl :: Num a) ->
  case tpl of _ { D:Num _ tpl _ -> tpl }

negate :: forall a. Num a => a -> a
negate = \ (@ a) (tpl :: Num a) ->
  case tpl of _ { D:Num _ _ tpl -> tpl }

Num and Ord have simple translation but for monads with existential type variables in their signatures, the only way to represent the equivalent dictionary is using RankNTypes. In addition a typeclass may also include superclasses which would be included in the typeclass dictionary and parameterized over the same arguments and an implicit superclass constructor function is created to pull out functions from the superclass for the current monad.

data DMonad m = DMonad
  { bind   :: forall a b. m a -> (a -> m b) -> m b
  , return :: forall a. a -> m a
  }
class (Functor t, Foldable t) => Traversable t where
    traverse :: Applicative f => (a -> f b) -> t a -> f (t b)
    traverse f = sequenceA . fmap f
data DTraversable t = DTraversable
  { dFunctorTraversable :: DFunctor t  -- superclass dictionary
  , dFoldableTraversable :: DFoldable t -- superclass dictionary
  , traverse :: forall a. Applicative f => (a -> f b) -> t a -> f (t b)
  }

Indeed this is not that far from how GHC actually implements typeclasses. It elaborates into projection functions and data constructors nearly identical to this, and are implicitly threaded for every overloaded identifier.

特殊化

Overloading in Haskell is normally not entirely free by default, although with an optimization called specialization it can be made to have zero cost at specific points in the code where performance is crucial. This is not enabled by default by virtue of the fact that GHC is not a whole-program optimizing compiler and most optimizations ( not all ) stop at module boundaries.

GHC's method of implementing typeclasses means that explicit dictionaries are threaded around implicitly throughout the call sites. This is normally the most natural way to implement this functionality since it preserves separate compilation. A function can be compiled independently of where it is declared, not recompiled at every point in the program where it's called. The dictionary passing allows the caller to thread the implementation logic for the types to the call-site where it can then be used throughout the body of the function.

Of course this means that in order to get at a specific typeclass function we need to project ( possibly multiple times ) into the dictionary structure to pluck out the function reference. The runtime makes this very cheap but not entirely free.

Many C++ compilers or whole program optimizing compilers do the opposite however, they explicitly specialize each and every function at the call site replacing the overloaded function with it's type-specific implementation. We can selectively enable this kind of behavior using class specialization.

module Specialize (spec, nonspec, f) where

{-# SPECIALIZE INLINE f :: Double -> Double -> Double #-}

f :: Floating a => a -> a -> a
f x y = exp (x + y) * exp (x + y)

nonspec :: Float
nonspec = f (10 :: Float) (20 :: Float)

spec :: Double
spec = f (10 :: Double) (20 :: Double)

Non-specialized

f :: forall a. Floating a => a -> a -> a
f =
  \ (@ a) ($dFloating :: Floating a) (eta :: a) (eta1 :: a) ->
    let {
      a :: Fractional a
      a = $p1Floating @ a $dFloating } in
    let {
      $dNum :: Num a
      $dNum = $p1Fractional @ a a } in
    * @ a
      $dNum
      (exp @ a $dFloating (+ @ a $dNum eta eta1))
      (exp @ a $dFloating (+ @ a $dNum eta eta1))

In the specialized version the typeclass operations placed directly at the call site and are simply unboxed arithmetic. This will map to a tight set of sequential CPU instructions and is very likely the same code generated by C.

spec :: Double
spec = D# (*## (expDouble# 30.0) (expDouble# 30.0))

The non-specialized version has to project into the typeclass dictionary ($fFloatingFloat) 6 times and likely go through around 25 branches to perform the same operation.

nonspec :: Float
nonspec =
  f @ Float $fFloatingFloat (F# (__float 10.0)) (F# (__float 20.0))

For a tight loop over numeric types specializing at the call site can result in orders of magnitude performance increase. Although the cost in compile-time can often be non-trivial and when used function used at many call-sites this can slow GHC's simplifier pass to a crawl.

The best advice is profile and look for large uses of dictionary projection in tight loops and then specialize and inline in these places.

Using the SPECIALISE INLINE pragma can unintentionally cause GHC to diverge if applied over a recursive function, it will try to specialize itself infinitely.

静的コンパイル

On Linux, Haskell programs can be compiled into a standalone statically linked binary that includes the runtime statically linked into it.

$ ghc -O2 --make -static -optc-static -optl-static -optl-pthread Example.hs
$ file Example
Example: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, not stripped
$ ldd Example
        not a dynamic executable

In addition the file size of the resulting binary can be reduced by stripping unneeded symbols.

$ strip Example

非ボックス型

The usual numerics types in Haskell can be considered to be a regular algebraic datatype with special constructor arguments for their underlying unboxed values. Normally unboxed types and explicit unboxing are not used in normal code, they are an implementation detail of the compiler and many optimizations exist to do the unboxing in a way that is guaranteed to be safe and preserve the high level semantics of Haskell. Nevertheless it is somewhat enlightening to understand how the types are implemented.

data Int = I# Int#

data Integer
  = S# Int#              -- Small integers
  | J# Int# ByteArray#   -- Large GMP integers

data Float = F# Float#
λ: :set -XMagicHash
λ: :m +GHC.Types
λ: :m +GHC.Prim

λ: :type 3#
3# :: GHC.Prim.Int#

λ: :type 3##
3## :: GHC.Prim.Word#

λ: :type 3.14#
3.14# :: GHC.Prim.Float#

λ: :type 3.14##
3.14## :: GHC.Prim.Double#

λ: :type 'c'#
'c'# :: GHC.Prim.Char#

λ: :type "Haskell"#
"Haskell"# :: Addr#

λ: :i Int
data Int = I# Int#      -- Defined in GHC.Types

λ: :k Int#
Int# :: #

An unboxed type with kind # and will never unify a type variable of kind *. Intuitively a type with kind * indicates a type with a uniform runtime representation that can be used polymorphically.

  • Lifted - Can contain a bottom term, represented by a pointer. ( Int, Any, (,) )
  • Unlited - Cannot contain a bottom term, represented by a value on the stack. ( Int#, (#, #) )
{-# LANGUAGE BangPatterns, MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.Prim

ex1 :: Bool
ex1 = gtChar# a# b#
  where
    !(C# a#) = 'a'
    !(C# b#) = 'b'

ex2 :: Int
ex2 = I# (a# +# b#)
  where
    !(I# a#) = 1
    !(I# b#) = 2

ex3 :: Int
ex3 = (I# (1# +# 2# *# 3# +# 4#))

ex4 :: (Int, Int)
ex4 = (I# (dataToTag# False), I# (dataToTag# True))

The function for integer arithmetic used in the Num typeclass for Int is just pattern matching on this type to reveal the underlying unboxed value, performing the builtin arithmetic and then performing the packing up into Int again.

plusInt :: Int -> Int -> Int
(I# x) `plusInt` (I# y) = I# (x +# y)

Where (+#) is a low level function built into GHC that maps to intrinsic integer addition instruction for the CPU.

plusInt :: Int -> Int -> Int
plusInt a b = case a of {
    (I# a_) -> case b of {
      (I# b_) -> I# (+# a_ b_);
    };
};

Runtime values in Haskell are by default represented uniformly by a boxed StgClosure* struct which itself contains several payload values, which can themselves either be pointers to other boxed values or to unboxed literal values that fit within the system word size and are stored directly within the closure in memory. The layout of the box is described by a bitmap in the header for the closure which describes which values in the payload are either pointers or non-pointers.

The unpackClosure# primop can be used to extract this information at runtime by reading off the bitmap on the closure.

{-# LANGUAGE MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O1 #-}

module Main where

import GHC.Exts
import GHC.Base
import Foreign

data Size = Size
  { ptrs  :: Int
  , nptrs :: Int
  , size  :: Int
  } deriving (Show)

unsafeSizeof :: a -> Size
unsafeSizeof a =
  case unpackClosure# a of
    (# x, ptrs, nptrs #) ->
      let header  = sizeOf (undefined :: Int)
          ptr_c   = I# (sizeofArray# ptrs)
          nptr_c  = I# (sizeofByteArray# nptrs) `div` sizeOf (undefined :: Word)
          payload = I# (sizeofArray# ptrs +# sizeofByteArray# nptrs)
          size    = header + payload
      in Size ptr_c nptr_c size

data A = A {-# UNPACK #-} !Int
data B = B Int

main :: IO ()
main = do
  print (unsafeSizeof (A 42))
  print (unsafeSizeof (B 42))

For example the datatype with the UNPACK pragma contains 1 non-pointer and 0 pointers.

data A = A {-# UNPACK #-} !Int
Size {ptrs = 0, nptrs = 1, size = 16}

While the default packed datatype contains 1 pointer and 0 non-pointers.

data B = B Int
Size {ptrs = 1, nptrs = 0, size = 9}

The closure representation for data constructors are also "tagged" at the runtime with the tag of the specific constructor. This is however not a runtime type tag since there is no way to recover the type from the tag as all constructor simply use the sequence (0, 1, 2, ...). The tag is used to discriminate cases in pattern matching. The builtin dataToTag# can be used to pluck off the tag for an arbitrary datatype. This is used in some cases when desugaring pattern matches.

dataToTag# :: a -> Int#

For example:

-- data Bool = False | True
-- False ~ 0
-- True  ~ 1

a :: (Int, Int)
a = (I# (dataToTag# False), I# (dataToTag# True))
-- (0, 1)

-- data Ordering = LT | EQ | GT
-- LT ~ 0
-- EQ ~ 1
-- GT ~ 2

b :: (Int, Int, Int)
b = (I# (dataToTag# LT), I# (dataToTag# EQ), I# (dataToTag# GT))
-- (0, 1, 2)

-- data Either a b = Left a | Right b
-- Left ~ 0
-- Right ~ 1

c :: (Int, Int)
c = (I# (dataToTag# (Left 0)), I# (dataToTag# (Right 1)))
-- (0, 1)

String literals included in the source code are also translated into several primop operations. The Addr# type in Haskell stands for a static contagious buffer pre-allocated on the Haskell heap that can hold a char* sequence. The operation unpackCString# can scan this buffer and fold it up into a list of Chars from inside Haskell.

unpackCString# :: Addr# -> [Char]

This is done in the early frontend desugarer phase, where literals are translated into Addr# inline instead of giant chain of Cons'd characters. So our "Hello World" translates into the following Core:

-- print "Hello World"
print (unpackCString# "Hello World"#)

See:

IOとST

Both the IO and the ST monad have special state in the GHC runtime and share a very similar implementation. Both ST a and IO a are passing around an unboxed tuple of the form:

(# token, a #)

The RealWorld# token is "deeply magical" and doesn't actually expand into any code when compiled, but simply threaded around through every bind of the IO or ST monad and has several properties of being unique and not being able to be duplicated to ensure sequential IO actions are actually sequential. unsafePerformIO can thought of as the unique operation which discards the world token and plucks the a out, and is as the name implies not normally safe.

The PrimMonad abstracts over both these monads with an associated data family for the world token or ST thread, and can be used to write operations that generic over both ST and IO. This is used extensively inside of the vector package to allow vector algorithms to be written generically either inside of IO or ST.

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO ( IO(..) )
import GHC.Prim ( State#, RealWorld )
import GHC.Base ( realWorld# )

instance  Monad IO  where
    m >> k    = m >>= \ _ -> k
    return    = returnIO
    (>>=)     = bindIO
    fail s    = failIO s

returnIO :: a -> IO a
returnIO x = IO $ \ s -> (# s, x #)

bindIO :: IO a -> (a -> IO b) -> IO b
bindIO (IO m) k = IO $ \ s -> case m s of (# new_s, a #) -> unIO (k a) new_s

thenIO :: IO a -> IO b -> IO b
thenIO (IO m) k = IO $ \ s -> case m s of (# new_s, _ #) -> unIO k new_s

unIO :: IO a -> (State# RealWorld -> (# State# RealWorld, a #))
unIO (IO a) = a
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
{-# LANGUAGE TypeFamilies #-}

import GHC.IO ( IO(..) )
import GHC.ST ( ST(..) )
import GHC.Prim ( State#, RealWorld )
import GHC.Base ( realWorld# )

class Monad m => PrimMonad m where
  type PrimState m
  primitive :: (State# (PrimState m) -> (# State# (PrimState m), a #)) -> m a
  internal :: m a -> State# (PrimState m) -> (# State# (PrimState m), a #)

instance PrimMonad IO where
  type PrimState IO = RealWorld
  primitive = IO
  internal (IO p) = p

instance PrimMonad (ST s) where
  type PrimState (ST s) = s
  primitive = ST
  internal (ST p) = p

See:

ghc-heap-view

Through some dark runtime magic we can actually inspect the StgClosure structures at runtime using various C and Cmm hacks to probe at the fields of the structure's representation to the runtime. The library ghc-heap-view can be used to introspect such things, although there is really no use for this kind of thing in everyday code it is very helpful when studying the GHC internals to be able to inspect the runtime implementation details and get at the raw bits underlying all Haskell types.

{-# LANGUAGE MagicHash #-}

import GHC.Exts
import GHC.HeapView

import System.Mem

main :: IO ()
main = do
  -- Constr
  clo <- getClosureData $! ([1,2,3] :: [Int])
  print clo

  -- Thunk
  let thunk = id (1+1)
  clo <- getClosureData thunk
  print clo

  -- evaluate to WHNF
  thunk `seq` return ()

  -- Indirection
  clo <- getClosureData thunk
  print clo

  -- force garbage collection
  performGC

  -- Value
  clo <- getClosureData thunk
  print clo

A constructor (in this for cons constructor of list type) is represented by a CONSTR closure that holds two pointers to the head and the tail. The integer in the head argument is a static reference to the pre-allocated number and we see a single static reference in the SRT (static reference table).

ConsClosure {
  info = StgInfoTable {
    ptrs = 2,
    nptrs = 0,
    tipe = CONSTR_2_0,
    srtlen = 1
  },
  ptrArgs = [0x000000000074aba8/1,0x00007fca10504260/2],
  dataArgs = [],
  pkg = "ghc-prim",
  modl = "GHC.Types",
  name = ":"
}

We can also observe the evaluation and update of a thunk in process ( id (1+1) ). The initial thunk is simply a thunk type with a pointer to the code to evaluate it to a value.

ThunkClosure {
  info = StgInfoTable {
    ptrs = 0,
    nptrs = 0,
    tipe = THUNK,
    srtlen = 9
  },
  ptrArgs = [],
  dataArgs = []
}

When forced it is then evaluated and replaced with an Indirection closure which points at the computed value.

BlackholeClosure {
  info = StgInfoTable {
    ptrs = 1,
    nptrs = 0,
    tipe = BLACKHOLE,
    srtlen = 0
  },
  indirectee = 0x00007fca10511e88/1
}

When the copying garbage collector passes over the indirection, it then simply replaces the indirection with a reference to the actual computed value computed by indirectee so that future access does need to chase a pointer through the indirection pointer to get the result.

ConsClosure {
  info = StgInfoTable {
    ptrs = 0,
    nptrs = 1,
    tipe = CONSTR_0_1,
    srtlen = 0
  },
  ptrArgs = [],
  dataArgs = [2],
  pkg = "integer-gmp",
  modl = "GHC.Integer.Type",
  name = "S#"
}

STG

After being compiled into Core, a program is translated into a very similar intermediate form known as STG ( Spineless Tagless G-Machine ) an abstract machine model that makes all laziness explicit. The spineless indicates that function applications in the language do not have a spine of applications of functions are collapsed into a sequence of arguments. Currying is still present in the semantics since arity information is stored and partially applied functions will evaluate differently than saturated functions.

-- Spine
f x y z = App (App (App f x) y) z

-- Spineless
f x y z = App f [x, y, z]

All let statements in STG bind a name to a lambda form. A lambda form with no arguments is a thunk, while a lambda-form with arguments indicates that a closure is to be allocated that captures the variables explicitly mentioned.

Thunks themselves are either reentrant (\r) or updatable (\u) indicating that the thunk and either yields a value to the stack or is allocated on the heap after the update frame is evaluated All subsequent entry's of the thunk will yield the already-computed value without needing to redo the same work.

A lambda form also indicates the static reference table a collection of references to static heap allocated values referred to by the body of the function.

For example turning on -ddump-stg we can see the expansion of the following compose function.

-- Frontend
compose f g = \x -> f (g x)
-- Core
compose :: forall t t1 t2. (t1 -> t) -> (t2 -> t1) -> t2 -> t
compose =
  \ (@ t) (@ t1) (@ t2) (f :: t1 -> t) (g :: t2 -> t1) (x :: t2) ->
    f (g x)
-- STG
compose :: forall t t1 t2. (t1 -> t) -> (t2 -> t1) -> t2 -> t =
    \r [f g x] let { sat :: t1 = \u [] g x; } in  f sat;
SRT(compose): []

For a more sophisticated example, let's trace the compilation of the factorial function.

-- Frontend
fac :: Int -> Int -> Int
fac a 0 = a
fac a n = fac (n*a) (n-1)
-- Core
Rec {
fac :: Int -> Int -> Int
fac =
  \ (a :: Int) (ds :: Int) ->
    case ds of wild { I# ds1 ->
    case ds1 of _ {
      __DEFAULT ->
        fac (* @ Int $fNumInt wild a) (- @ Int $fNumInt wild (I# 1));
      0 -> a
    }
    }
end Rec }
-- STG
fac :: Int -> Int -> Int =
    \r srt:(0,*bitmap*) [a ds]
        case ds of wild {
          I# ds1 ->
              case ds1 of _ {
                __DEFAULT ->
                    let {
                      sat :: Int =
                          \u srt:(1,*bitmap*) []
                              let { sat :: Int = NO_CCS I#! [1]; } in  - $fNumInt wild sat; } in
                    let { sat :: Int = \u srt:(1,*bitmap*) [] * $fNumInt wild a;
                    } in  fac sat sat;
                0 -> a;
              };
        };
SRT(fac): [fac, $fNumInt]

Notice that the factorial function allocates two thunks ( look for \u) inside of the loop which are updated when computed. It also includes static references to both itself (for recursion) and the dictionary for instance of Num typeclass over the type Int.

ワーカとラッパ

With -O2 turned on GHC will perform a special optimization known as the Worker-Wrapper transformation which will split the logic of the factorial function across two definitions, the worker will operate over stack unboxed allocated machine integers which compiles into a tight inner loop while the wrapper calls into the worker and collects the end result of the loop and packages it back up into a boxed heap value. This can often be an order of of magnitude faster than the naive implementation which needs to pack and unpack the boxed integers on every iteration.

-- Worker
$wfac :: Int# -> Int# -> Int# =
    \r [ww ww1]
        case ww1 of ds {
          __DEFAULT ->
              case -# [ds 1] of sat {
                __DEFAULT ->
                    case *# [ds ww] of sat { __DEFAULT -> $wfac sat sat; };
              };
          0 -> ww;
        };
SRT($wfac): []

-- Wrapper
fac :: Int -> Int -> Int =
    \r [w w1]
        case w of _ {
          I# ww ->
              case w1 of _ {
                I# ww1 -> case $wfac ww ww1 of ww2 { __DEFAULT -> I# [ww2]; };
              };
        };
SRT(fac): []

See:

Zエンコーディング

The Z-encoding is Haskell's convention for generating names that are safely represented in the compiler target language. Simply put the z-encoding renames many symbolic characters into special sequences of the z character.

String Z-Encoded String
foo foo
z zz
Z ZZ
. .
() Z0T
(,) Z2T
(,,) Z3T
_ zu
( ZL
) ZR
: ZC
# zh
. zi
(#,#) Z2H
(->) ZLzmzgZR

In this way we don't have to generate unique unidentifiable names for character rich names and can simply have a straightforward way to translate them into something unique but identifiable.

So for some example names from GHC generated code:

Z-Encoded String Decoded String
ZCMain_main_closure :Main_main_closure
base_GHCziBase_map_closure base_GHC.Base_map_closure
base_GHCziInt_I32zh_con_info base_GHC.Int_I32#_con_info
ghczmprim_GHCziTuple_Z3T_con_info ghc-prim_GHC.Tuple_(,,)_con_in
ghczmprim_GHCziTypes_ZC_con_info ghc-prim_GHC.Types_:_con_info

Cmm

Cmm is GHC's complex internal intermediate representation that maps directly onto the generated code for the compiler target. Cmm code code generated from Haskell is CPS-converted, all functions never return a value, they simply call the next frame in the continuation stack. All evaluation of functions proceed by indirectly jumping to a code object with it's arguments placed on the stack by the caller.

This is drastically different than C's evaluation model, where are placed on the stack and a function yields a value to the stack after it returns.

There are several common suffixes you'll see used in all closures and function names:

Symbol Meaning
0 No argument
p Garage Collected Pointer
n Word-sized non-pointer
l 64-bit non-pointer (long)
v Void
f Float
d Double
v16 16-byte vector
v32 32-byte vector
v64 64-byte vector

Cmm Registers

There are 10 registers that described in the machine model. Sp is the pointer to top of the stack, SpLim is the pointer to last element in the stack. Hp is the heap pointer, used for allocation and garbage collection with HpLim the current heap limit.

The R1 register always holds the active closure, and subsequent registers are arguments passed in registers. Functions with more than 10 values spill into memory.

  • Sp
  • SpLim
  • Hp
  • HpLim
  • HpAlloc
  • R1
  • R2
  • R3
  • R4
  • R5
  • R6
  • R7
  • R8
  • R9
  • R10

Examples

To understand Cmm it is useful to look at the code generated by the equivalent Haskell and slowly understand the equivalence and mechanical translation maps one to the other.

There are generally two parts to every Cmm definition, the info table and the entry code. The info table maps directly StgInfoTable struct and contains various fields related to the type of the closure, it's payload, and references. The code objects are basic blocks of generated code that correspond to the logic of the Haskell function/constructor.

For the simplest example consider a constant static constructor. Simply a function which yields the Unit value. In this case the function is simply a constructor with no payload, and is statically allocated.

Haskell:

unit = ()

Cmm:

[section "data" {
     unit_closure:
         const ()_static_info;
 }]

Consider a static constructor with an argument.

Haskell:

con :: Maybe ()
con = Just ()

Cmm:

[section "data" {
     con_closure:
         const Just_static_info;
         const ()_closure+1;
         const 1;
 }]

Consider a literal constant. This is a static value.

Haskell:

lit :: Int
lit = 1

Cmm:

[section "data" {
     lit_closure:
         const I#_static_info;
         const 1;
 }]

Consider the identity function.

Haskell:

id x = x

Cmm:

[section "data" {
     id_closure:
         const id_info;
 },
 id_info()
         { label: id_info
           rep:HeapRep static { Fun {arity: 1 fun_type: ArgSpec 5} }
         }
     ch1:
         R1 = R2;
         jump stg_ap_0_fast; // [R1]
 }]

Consider the constant function.

Haskell:

constant x y = x

Cmm:

[section "data" {
     constant_closure:
         const constant_info;
 },
 constant_info()
         { label: constant_info
           rep:HeapRep static { Fun {arity: 2 fun_type: ArgSpec 12} }
         }
     cgT:
         R1 = R2;
         jump stg_ap_0_fast; // [R1]
 }]

Consider a function where application of a function ( of unknown arity ) occurs.

Haskell:

compose f g x = f (g x)

Cmm:

[section "data" {
     compose_closure:
         const compose_info;
 },
 compose_info()
         { label: compose_info
           rep:HeapRep static { Fun {arity: 3 fun_type: ArgSpec 20} }
         }
     ch9:
         Hp = Hp + 32;
         if (Hp > HpLim) goto chd;
         I64[Hp - 24] = stg_ap_2_upd_info;
         I64[Hp - 8] = R3;
         I64[Hp + 0] = R4;
         R1 = R2;
         R2 = Hp - 24;
         jump stg_ap_p_fast; // [R1, R2]
     che:
         R1 = compose_closure;
         jump stg_gc_fun; // [R1, R4, R3, R2]
     chd:
         HpAlloc = 32;
         goto che;
 }]

Consider a function which branches using pattern matching:

Haskell:

match :: Either a a -> a
match x = case x of
  Left a -> a
  Right b -> b

Cmm:

[section "data" {
     match_closure:
         const match_info;
 },
 sio_ret()
         { label: sio_info
           rep:StackRep []
         }
     ciL:
         _ciM::I64 = R1 & 7;
         if (_ciM::I64 >= 2) goto ciN;
         R1 = I64[R1 + 7];
         Sp = Sp + 8;
         jump stg_ap_0_fast; // [R1]
     ciN:
         R1 = I64[R1 + 6];
         Sp = Sp + 8;
         jump stg_ap_0_fast; // [R1]
 },
 match_info()
         { label: match_info
           rep:HeapRep static { Fun {arity: 1 fun_type: ArgSpec 5} }
         }
     ciP:
         if (Sp - 8 < SpLim) goto ciR;
         R1 = R2;
         I64[Sp - 8] = sio_info;
         Sp = Sp - 8;
         if (R1 & 7 != 0) goto ciU;
         jump I64[R1]; // [R1]
     ciR:
         R1 = match_closure;
         jump stg_gc_fun; // [R1, R2]
     ciU: jump sio_info; // [R1]
 }]

Macros

Cmm itself uses many macros to stand for various constructs, many of which are defined in an external C header file. A short reference for the common types:

Cmm Description
C_ char
D_ double
F_ float
W_ word
P_ garbage collected pointer
I_ int
L_ long
FN_ function pointer (no arguments)
EF_ extern function pointer
I8 8-bit integer
I16 16-bit integer
I32 32-bit integer
I64 64-bit integer

Many of the predefined closures (stg_ap_p_fast, etc) are themselves mechanically generated and more or less share the same form ( a giant switch statement on closure type, update frame, stack adjustment). Inside of GHC is a file named GenApply.hs that generates most of these functions. See the Gist link in the reading section for the current source file that GHC generates. For example the output for stg_ap_p_fast.

stg_ap_p_fast
{   W_ info;
    W_ arity;
    if (GETTAG(R1)==1) {
        Sp_adj(0);
        jump %GET_ENTRY(R1-1) [R1,R2];
    }
    if (Sp - WDS(2) < SpLim) {
        Sp_adj(-2);
        W_[Sp+WDS(1)] = R2;
        Sp(0) = stg_ap_p_info;
        jump __stg_gc_enter_1 [R1];
    }
    R1 = UNTAG(R1);
    info = %GET_STD_INFO(R1);
    switch [INVALID_OBJECT .. N_CLOSURE_TYPES] (TO_W_(%INFO_TYPE(info))) {
        case FUN,
             FUN_1_0,
             FUN_0_1,
             FUN_2_0,
             FUN_1_1,
             FUN_0_2,
             FUN_STATIC: {
            arity = TO_W_(StgFunInfoExtra_arity(%GET_FUN_INFO(R1)));
            ASSERT(arity > 0);
            if (arity == 1) {
                Sp_adj(0);
                R1 = R1 + 1;
                jump %GET_ENTRY(UNTAG(R1)) [R1,R2];
            } else {
                Sp_adj(-2);
                W_[Sp+WDS(1)] = R2;
                if (arity < 8) {
                  R1 = R1 + arity;
                }
                BUILD_PAP(1,1,stg_ap_p_info,FUN);
            }
        }
        default: {
            Sp_adj(-2);
            W_[Sp+WDS(1)] = R2;
            jump RET_LBL(stg_ap_p) [];
        }
    }
}

Handwritten Cmm can be included in a module manually by first compiling it through GHC into an object and then using a special FFI invocation.

#include "Cmm.h"

factorial {
  entry:
      W_ n  ;
      W_ acc;
      n = R1 ;
      acc = n ;
      n = n - 1 ;

  for:
      if (n <= 0 ) {
          RET_N(acc);
      } else {
          acc = acc * n  ;
          n = n - 1 ;
          goto for ;
      }
      RET_N(0);
}
-- ghc -c factorial.cmm -o factorial.o
-- ghc factorial.o Example.hs -o Example

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnliftedFFITypes #-}
{-# LANGUAGE GHCForeignImportPrim #-}
{-# LANGUAGE ForeignFunctionInterface #-}

module Main where

import GHC.Prim
import GHC.Word

foreign import prim "factorial" factorial_cmm  :: Word# -> Word#

factorial :: Word64 -> Word64
factorial (W64# n) =  W64# (factorial_cmm n)

main :: IO ()
main = print (factorial 5)

See:

Cmm Runtime:

最適化ハック

Tables Next to Code

GHC will place the info table for a toplevel closure directly next to the entry-code for the objects in memory such that the fields from the info table can be accessed by pointer arithmetic on the function pointer to the code itself. Not performing this optimization would involve chasing through one more pointer to get to the info table. Given how often info-tables are accessed using the tables-next-to-code optimization results in a tractable speedup.

Pointer Tagging

Depending on the type of the closure involved, GHC will utilize the last few bits in a pointer to the closure to store information that can be read off from the bits of pointer itself before jumping into or access the info tables. For thunks this can be information like whether it is evaluated to WHNF or not, for constructors it contains the constructor tag (if it fits) to avoid an info table lookup.

Depending on the architecture the tag bits are either the last 2 or 3 bits of a pointer.

// 32 bit arch
TAG_BITS = 2

// 64-bit arch
TAG_BITS = 3

These occur in Cmm most frequently via the following macro definitions:

#define TAG_MASK ((1 << TAG_BITS) - 1)
#define UNTAG(p) (p & ~TAG_MASK)
#define GETTAG(p) (p & TAG_MASK)

So for instance in many of the precompiled functions, there will be a test for whether the active closure R1 is already evaluated.

if (GETTAG(R1)==1) {
    Sp_adj(0);
    jump %GET_ENTRY(R1-1) [R1,R2];
}

results matching ""

    No results matching ""