Dive into GHC: Pipeline

After reading Simon’s call for more volunteer writing about GHC I thought it would be timely to share some knowledge I’ve accumulated over the years about working with the with GHC internals.

I’m by no means an expert on GHC internals, but I have worked with them a fair bit for several projects and the deep dive style of blog posts tends to be a good format for helping ease into exploring the code for themselves. Often times simply a high-level overview and a small bit of runnable example code is enough to encourage further involvement with an open source project and this what I aim to write.

So begins a multipart writeup on the structure of GHC structured around several examples that use the GHC API for some small project that shows off some internal structure of the compiler.

Accompaying Source Code


Official Commentary

GHC core developers have actually spent a great deal of time over the sharing knowledge about the design of the compiler. Some good places to start are the following:

  1. GHC Commentary
  2. OPLSS: Adventures in Types
  3. 2006 GHC Hackathon Videos
  4. David Terei’s Notes
  5. Takenobu Tani’s GHC Illustrated
  6. Edward Yang’s Blog

On top of this there is a literature trail going back 25 years that shows how the historical context and the research that led up to GHC today.

  1. GHC Reading List

Toplevel

GHC is a quirky beast of a codebase, but as far compilers go it is a fairly well-engineered and documented project if you know where to look. Yes, it uses a somewhat idiosyncratic convention in places, but after all it is a 20-year old codebase.

To get the source for the compiler clone the official repo:

$ git clone --recursive git://git.haskell.org/ghc.git
$ cd ghc/

There are many utilities included with the compiler the encompass documentation and the build system, but the important toplevel directories for the compiler itself are primarily:

├── rts          # The Haskell runtime systems
├── compiler     # The Haskell compiler logic
├── includes     # Header files for runtime and code generation
└── libraries    # The base libraries and Prelude source

For this post we’ll concern ourselves with the compiler folder.

├── basicTypes   # Types used across all modules
├── cbits        # Misc C utilities
├── cmm          # Cmm langauge definitions
├── codeGen      # Cmm Compilers
├── coreSyn      # Core language definitions
├── deSugar      # Desugarer
├── ghci         # Interactive shell
├── hsSyn        # Frontend syntax
├── iface        # Interface files
├── llvmGen      # LLVM Code generator
├── main         # Compiler driver logic and options
├── nativeGen    # Assemblers for x86 / SPARC / PPC
├── parser       # Frontend Parser for HsSyn
├── prelude      # Wired-In Types /  Primops and Builtins
├── profiling    # Runtime profiing tools
├── rename       # Frontend renamer
├── simplCore    # Core-To-Core simplifier
├── simplStg     # Stg-To-Stg simplifier
├── specialise   # Specialisation pass ( Eliminates Overloading )
├── stgSyn       # Stg Core Language
├── stranal      # Strictness Analyzer
├── typecheck    # Typechecker
├── types        # Type language, data constructors, and type families
├── utils        # Misc functions and core data structures
└── vectorise    # Vectorisation optimiations

GHC API

Since GHC is itself written in Haskell, GHC is effectively a large library the encompasses the GHC API. The toplevel module is simply called GHC and contains a namespace dump of many of the core types that drive the compilation pipeline.

Beneath this is the main API for compiling plain Haskell source code called HscMain which contains the various drivers for different passes within the compilation. The six core passes make up the compilation pipeline:

  1. Parsing
  2. Renaming
  3. Typechecking
  4. Desugaring
  5. Simplification
  6. Code Generation

The result of this compilation is several artificats which are object files (.o), interface files (.hi) and executables.

GHC Monad

The heart of the compilation process is stored within the GHC Monad, a state monad that handles the internal session state of the compilation pipeline, error handling and sequencing of multi-module compilation.

newtype Ghc a = Ghc { unGhc :: Session -> IO a }

The abstract class GhcMonad provides a lifted version of the GHC monad functions to get at the internal session objects from within the various submonads used throughout compilation (renamer, typechecker, etc).

class (Functor m, MonadIO m, ExceptionMonad m, HasDynFlags m) => GhcMonad m where
  getSession :: m HscEnv
  setSession :: HscEnv -> m ()

The evaluation function takes in a path to the libdir and returns the result inside of IO.

runGhc :: Maybe FilePath -> Ghc a -> IO a

The filepaths are installation specific paths indicating the local installation and paths to the GHC compiler. These are provided by the ghc-paths package.

import GHC.Paths

libdir, docdir, ghc, ghc_pkg :: FilePath

At the heart of the session object is a very important structure called HscEnv which holds the internal state of compilation.

data HscEnv
  = HscEnv 
  { hsc_dflags :: DynFlags
  , hsc_targets :: [Target]
  , hsc_mod_graph :: ModuleGraph
  , hsc_IC :: InteractiveContext
  , hsc_HPT :: HomePackageTable
  -- Many more ... (truncated for brevity)
  }

The hsc_dflags holds the settings objects (more on this next). The hsc_targets holds the roots of the Module graph which are traversed bottom-up to build up the entire set of modules needed for compilation of the current package. The entire set of modules involved in this (roots and non-roots) is stored in hsc_mod_graph which holds the whole ModuleGraph, which is not necessarily in topological order. The hsc_IC field contains the interactive context which is used for the interactive shell and for when the end targets are linked in memory. Specific commands in GHCi such as adding modules to the top-level scope modifying this structure state fully.

The hsc_HPT holds the home package table which describes already-compiled home-package modules, When a module done being compiled, and is loaded with loadModule it is internally added to this mapping.

DynFlags

DynFlags contains a collection of flags relating to the compilation of a single file or GHC session. This is the core datatype that informs how compilation occurs and is passed to most of the various pass functions.

data DynFlags
  = DynFlags 
  { ghcMode :: GhcMode
  , ghcLink :: GhcLink
  , hscTarget :: HscTarget
  , settings :: Settings

  , flags :: [DynFlag]
  , extensionFlags :: [ExtensionFlag]

  , pkgState :: PackageState
  , pkgDatabase :: Maybe [PackageConfig]
  , packageEnv :: Maybe FilePath
  , packageFlags :: [PackageFlag]
  , extraPkgConfs :: [PkgConfRef] -> [PkgConfRef]
  -- Many more flags... (truncated for brevity)
  }

The GhcMode informs whether we’re doing multi-module compilation or one-shot single-file compilation. In the case of multi-module the ModuleGraph is built up via the Finder function which searches the home package for the dependent modules.

GhcMode
CompManager --make
OneShot ghc -c Foo.hs
MkDepend ghc -M

The HscTarget datatype defines the target code type of the compilation. By default this is HscAsm.

HscTarget
HscC Generate C code.
HscAsm Generate assembly using the native code generator.
HscLlvm Generate assembly using the llvm code generator.
HscInterpreted Generate bytecode.
HscNothing Don’t generate any code. See notes above.

After compilation is done (for multi-module) GHC then begins the linker phase and the GhcLink setting determines what to do with the resulting object files.

GhcLink
NoLink Don’t link at all
LinkBinary Link object code into a binary
LinkInMemory Use the in-memory dynamic linker (works for both bytecode and object code).
LinkDynLib Link objects into a dynamic lib (DLL on Windows, DSO on ELF platforms)
LinkStaticLib Link objects into a static lib

The simplest initializer of a GHC session simply uses the defaults and sets up a interpreted session that links any modules it is given in memory.

example :: IO ()
example = runGhc (Just libdir) $ do
  dflags <- getSessionDynFlags
  setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                              , ghcLink   = LinkInMemory
                              }

GHC exposes many compiler flags on the commandline and these are themselves reflected in various subfields of the DynFlags struct. The three major classes of flags are DumpFlag (example: -ddump-simpl), GeneralFlag (example: -fspec-constr) and ExtensionFlag (example: -XTypeInType). There are various helper functions that modifying the DynFlags to twiddle these flags on or off.

dopt_set :: DynFlags -> DumpFlag -> DynFlags
gopt_set :: DynFlags -> GeneralFlag -> DynFlags
xopt_set :: DynFlags -> ExtensionFlag -> DynFlags

Through the compilation GHC will query the state of these flags to dispatch to different codepaths based on whether a language extension is set or other flag behavior. This is done through querying the GhcMonad instance to get the dynflags and using one of the various flag specific functions.

xopt :: ExtensionFlag -> DynFlags -> Bool
gopt :: GeneralFlag -> DynFlags -> Bool
dopt :: DumpFlag -> DynFlags -> Bool

To enable various flags we use modify the current dflags object using the flag set functions.

example :: IO ()
example = runGhc (Just libdir) $ do
  dflags <- getSessionDynFlags
  let dflags' = dflags { hscTarget = HscInterpreted , ghcLink = LinkInMemory }
               `dopt_set` Opt_D_dump_BCOs        -- Set Dump Flag
               `xopt_set` Opt_OverloadedStrings  -- Set Language Extension Flag

Compilation

To start compilation we first add a target to the state. This modifies the hsc_targets field of the environment. To two types of targets are either module names or filenames. The guessTarget will discriminate on the given string’s extension it to determine which target object to create.

addTarget :: GhcMonad m => Target -> m ()
guessTarget :: GhcMonad m => String -> Maybe Phase -> m Target

Targets specify the source files or modules at the top of the dependency tree. For a executable program there is just a single target Main.hs, for a library the targets are visible module in the library.

Target
TargetModule A module name: search for the file
TargetFile FilePath A filename: preprocess and parse it to find the module name.

If with the modules added to the state we can then perform dependency analsysis to determine the module graph to proceed with multi-module compilation. Dependency analysis entails parsing the import directives of the module and resolving the ModuleGraph which is a type alias for a list of ModuleSummary which contains the targets. This is performed by the depanal function.

depanal :: GhcMonad m => [ModuleName] -> Bool -> m ModuleGraph

After a target is created the compiler is then run on the module yielding the resulting artifacts and it is loaded into the home package table. This is accomplished via the load command.

load :: GhcMonad m => LoadHowMuch -> m SuccessFlag
LoadHowMuch
LoadAllTargets Load all targets and its dependencies.
LoadUpTo Load only the given module and its dependencies.
LoadDependenciesOf Load only the dependencies of the given module, but not the module itself.

A full example of this would be the compilation of a module Example.hs in the current working directory that is interpreted and linked in memory.

example :: IO ()
example = runGhc (Just libdir) $ do
  dflags <- getSessionDynFlags
  setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                              , ghcLink   = LinkInMemory
                              }

  target <- guessTarget "Example.hs" Nothing
  addTarget target
  load LoadAllTargets

Interactive Context

On top of simply generating compiler artifacts. GHC can compile and link code into memory to be evaluated interactively. The state of the interpreter backing this is held in the InteractiveContext.

The set of modules in the interactive scope can be modified by the setContext function.

getContext :: GhcMonad m => m [InteractiveImport]
setContext :: GhcMonad m => [InteractiveImport] -> m ()

When a module is interpreted and loaded as an interactive import it has its full top-level scope available. We can manipulate, query and extend this scope using various function.

parseName can be used to resolve a name (or names) from a given string to a set of symbols in the interactive context. This returns a Name object (more on this later) which is GHC’s internal name type that holds position and a unique identifier.

parseName :: GhcMonad m => String -> m [Name]

To resolve the type of an given expression the exprType can be used to extract the type information within the current context.

exprType :: GhcMonad m => String -> m Type

And within the entire interactive context we can query the set of all names that have been brought into scope by imports. This is used for the interactive :browse command.

getNamesInScope :: GhcMonad m => m [Name]

And the most important function is evaluation of arbitrary expressions with in the interactive context. Which is accomplished via dynCompileExpr. This returns a Dynamic which can be safely cast using fromDynamic for any instance of Typeable. This is used to dynamically evaluate a string expression within the interactive context.

dynCompileExpr :: GhcMonad m => String -> m Dynamic
fromDynamic :: Typeable a => Dynamic -> a -> Maybe a

Package Database

In it’s default state GHC is aware of two package databases: the global package database in /usr/lib/ghc-x.x.x/ and the user database in ~/.ghc/lib.

This however can be extended via the “GHC_PACKAGE_PATH” environment variable which reads the path variable and applies the extraPkgConfs function to add it to the package database. This is used in the various modern sandboxing techniques used in tools like cabal and stack.

extraPkgConfs :: [PkgConfRef] -> [PkgConfRef]

To modify the given dynflags with a filepath, the following function can be used to extend the state.

addPkgDbs :: GhcMonad m => [FilePath] -> m ()
addPkgDbs fps = do 
  dfs <- getSessionDynFlags
  let pkgs = map PkgConfFile fps
  let dfs' = dfs { extraPkgConfs = (pkgs ++) . extraPkgConfs dfs }
  setSessionDynFlags dfs'
  _ <- initPackages dfs'
  return ()

Stack sets this when launching the shell with stack repl. More on modifying this will be discussed later.

Mini GHCi

Ok, so let’s a build a very small interactive shell for GHC. If you’re not familiar with Haskeline (the platform-agnostic readline abstraction) then read up on that first.

The Haskeline interface is exposed as a monad transformer InputT which inside of IO gives us our interactive repl monad.

type Repl a = InputT IO a

To set up the initial session set get the default dynflags, set the target to be interpreted and memory-linked and twiddle the -XExtendedDefaultRules flag. We set the interactive shell to import the Prelude and then monadically return the resulting session so that we can progressively add to it on each shell commnad.

initSession :: IO HscEnv
initSession = runGhc (Just libdir) $ do
  liftIO $ putStrLn "Setting up HscEnv"
  dflags <- getSessionDynFlags
  let dflags' = dflags { hscTarget = HscInterpreted , ghcLink = LinkInMemory }
                `xopt_set` Opt_ExtendedDefaultRules
  setSessionDynFlags dflags'
  setContext [ IIDecl $ simpleImportDecl (mkModuleName "Prelude") ]
  env <- getSession
  return env

Each our interactive shell commands is then wrapped in a helper function session which spins up a new Ghc monad but restores the session from the last compilation. The monadic action is then evaluated and the resulting session afterwards is returned as a value to be reused.

session :: HscEnv -> Ghc a -> IO HscEnv
session env m = runGhc (Just libdir) $ do
  setSession env 
  m
  env <- getSession
  return env

The evaluator function tries two different compilation steps. First it tries to compile the expression as is to see if it evaluates to a IO a action. If it does it is then evaluated directly within the monad. If it does not then the fromDynamic cast will simply yield a Nothing and we’ll try to wrap the expression in a print statement. The resulting compiled expression is guaranteed to be an IO a so we unsafely coerce the compiled code pointer that GHC gives us into IO and run it.

eval :: String -> Ghc ()
eval inp = do
  dyn <- fromDynamic <$> dynCompileExpr inp
  case dyn of
    Nothing -> do
      act <- compileExpr ("Prelude.print (" <> inp <> ")")
      -- 'print' is constrained to 'IO ()' so unsafeCoerce is "safe"
      liftIO (unsafeCoerce act)
    Just act -> liftIO $ act

To add an import we simply cons the import as a module name to the context and then yield the new state.

addImport :: String -> Ghc ()
addImport mod = do
  ctx <- getContext
  setContext ( (IIDecl $ simpleImportDecl (mkModuleName mod)) : ctx )

Then we do the naughty thing of catching all exceptions that are thrown and just printing them out. This is fairly justified in the case that if expression compilation fails we have to just trap and report the failure in the embedded interpreter logic.

ghcCatch :: MonadIO m => IO a -> m (Maybe a)
ghcCatch m = liftIO $ do
  mres <- try m
  case mres of
    Left (err :: SomeException) -> do
      liftIO $ print err
      return Nothing
    Right res -> return (Just res)

The REPL then just reads the user’s input and dispatch based on whether the line starts with the keyword import. Depending on the expression line it then spins up a GHC session with the currently held HscEnv from the last line and trys to compile it. If succesfully it then calls repl with the new env state ad-infinitum. Unix signals for aborting are handled by Haskeline monad.

repl :: HscEnv -> Repl ()
repl env = do
  minput <- getInputLine ">>> "
  case minput of
    Nothing -> outputStrLn "Goodbye."

    Just input | "import" `isPrefixOf` input -> do
      let mod = concat $ tail $ words input
      env' <- ghcCatch (session env (addImport mod))
      maybe (repl env) repl env'

    Just input -> do
      env' <- ghcCatch (session env (eval input))
      maybe (repl env) repl env'

Then putting it all together.

main :: IO ()
main = do
  env <- initSession
  runInputT defaultSettings (repl env)

We can then run our little shell.

$ stack build dive
$ stack exec dive 

Setting up HscEnv
>>> fmap (+1) [1..10]
[2,3,4,5,6,7,8,9,10,11]
>>> import Data.Text
>>>
Goodbye.

So that’s our custom Mini GHCi. In practice real GHCi does things a little differently, but some underlying machinery remains the same. Other features like name lookup and introspection are left as an exercise to the reader. A fun next project would be to create tiny shell with an introspection tool querying the original source code of any definition in scope.

Summary & Next Steps

This is the “Very High Level” API we can use to interact with GHC. Next we’ll concern ourselves with the guts of the internal artifacts used and how to introspect and build them programatically.