Dive into GHC: Intermediate Forms

In the last blog post, we discussed the high-level API used inside the GHC compiler. This time let’s zoom into the pieces that comprise the frontend of the compilation pipeline, namely:

Parsing
Renaming
Typechecking
Desugaring
Simplification
Code Generation

Accompaying Source Code

Passes

The following functions are the main passes are that are used inside of the compiler driver functions in HscMain .

-- Parse a module.
parseModule :: GhcMonad m => ModSummary -> m ParsedModule

-- Typecheck and rename a parsed module.
typecheckModule :: GhcMonad m => ParsedModule -> m TypecheckedModule

-- Desugar a typechecked module.
desugarModule :: GhcMonad m => TypecheckedModule -> m DesugaredModule

-- Extract ModGuts
coreModule :: DesugaredMod m => m -> ModGuts

-- Generated STG bindings from Core module
coreToStg :: DynFlags -> Module -> CoreProgram -> IO [StgBinding]

Each of the these functions manipulates an intermediate form used in the process of compiling a Haskell module into a form that can be compiled into machine code or interpreted.

Parser

The frontend syntax for Haskell is fairly large and is partially defined in the Haskell 2010 specification and extended by GHC. The parser is written using two custom tools:

Alex, for lexical analysis. See: Lexer.x
Happy, for the parser itself. See: Parser.y

Happy is a parser generator system for Haskell, similar to the tool `yacc’ for C. It works as a preprocessor with it’s own syntax that generates a parse table from two specifications, a lexer file and parser file. The Happy parser consists of a list of a production rules and a monad to running the parser in. Production rules consist of a set of options on the left and generating Haskell expressions on the right with indexed variables ($1, $2, …) mapping to the ordered terms on the left (i.e. in the second term term ~ $1, term ~ $2).

terms
    : term                   { [$1] }
    | term terms             { $1 : $2 }

The most complicated part of the lexer is with regards to whitespace-sensitive layout. To handle whitespace the lexer will analyze the line position of certain productions and will output virtual tokens (vbrace, vsemi, etc) which are used to canonicalize the syntax. For example the use of layout rule in do-notation:

main = do
  putStrLn "Hello"
  putStrLn "Haskell"

Is converted into explicit virtual braces and semicolons at parse time.

main = do { putStrLn "Hello"; putStrLn "Haskell" }

The main entry points are defined in the Parser and resolve into a monad P which holds state about the source buffer, and filename and location. The resulting AST is called HsSyn and contains position information preserved from the tokens.

parseType :: P (LHsType RdrName)
parseExpression :: P (LHsExpr RdrName)
parseModule :: P (Located (HsModule RdrName))

Located

Frontend syntax in GHC carries position information along with it that can be used in error reporting. Location information is defined by a Located type which has a single type paramater for the AST element to be “located”. Typically type synonyms follow the convention that a “L” before the type indicates a synonym for a located element. So a located HsExpr is a LHsExxpr.

data GenLocated l e = L l e
type Located e = GenLocated SrcSpan e

unLoc :: GenLocated l e -> e
unLoc (L _ e) = e

getLoc :: GenLocated l e -> l
getLoc (L l _) = l

noLoc :: e -> Located e
noLoc e = L noSrcSpan e

Most notably there is a functor instance for fmaping over a located element, which modifies the underlying type preserving the location.

instance Functor (GenLocated l) where
  fmap f (L l e) = L l (f e)

Frontend Syntax

The frontend syntax (called HsSyn) is spread across several data types which contain all possible syntax elements generated from the parser. These are parameterized over the name type so that transformations (renamer, etc) can alter the structure of bindings and identifiers while preserving the syntax tree structure.

HsModule - Top-level structure for a module. Contains module name, exports, imports, and toplevel declarations.
HsDecl - Type, class, value, and interface signature decls
HsBind - Value bindings used for both top level and nested bindings.
HsGroup - A HsGroup is a group of related HsDecls.
HsLit - Literal value (Int, Char, etc)
HsExpr - A Haskell expression.
HsType - Frontend type annotation.
Pat - Pattern match in a case statement or toplevel binding.

There are many variants of identifiers used in different phases of compilation. In order of generality:

OccName - Simplest name used, it holds the names as strings with the namespace that the name came from and position.
RdrName - Is the type of names that come directly from the parser. Contains information whether the name is unqualified or qualified (with module name).
Name - Is the type of names that have had their scoping and binding resolved. They have an OccName but also a Unique that disambiguates Names
Var - Is the type of names that not only have a Name but also a Type. Vars are either global or local.

Names are disambiguated from each other by an internal unique numerical key called a Unique , which are generated from a UniqSupply . These are often embedded within the monad transformer MonadUnique layer of the pass.

After typechecking we deal with Var and Id names which have the property that the Name and Type can always be recovered using the following functions.

varName :: Var -> Name
varType :: Var -> Kind

In addition, many lookup functions use a sum type TyThing which wraps “things” in scope which have types, such as identifiers and type constructors.

data TyThing
  = AnId Id
  | AConLike ConLike.ConLike
  | ATyCon TyCon
  | ACoAxiom (CoAxiom Branched)

Outputable

Inside the GHC API is a pretty printer which can be used to dump out a pretty printed summary of a large number of the internal types. The structure of the class is similar to Show typeclass.

class Outputable a where
  ppr :: a -> SDoc
  pprPrec :: Rational -> a -> SDoc

showPpr :: Outputable a => DynFlags -> a -> String

To simplify our life a bit we can write a wrapper showGhc which uses the default dynflags to dump out an instance of Outputable.

showGhc :: (Outputable a) => a -> String
showGhc = showPpr unsafeGlobalDynFlags

Tracing Compilation

Given the following example module we’d like to trace the transformation of the intermediate forms as it’s run through the compiler.

module Example where

data Animal = Cat | Dog

add :: Int -> Int -> Int
add x y = x + y

Using the machinery of the last post we import the module target, load it, and then instead of adding it to an interactive context we explicitly call the functions of the individual passes and dump our their structure to the screen.

main :: IO ()
main = runGhc (Just libdir) $ do
  env <- getSession
  dflags <- getSessionDynFlags
  setSessionDynFlags $ dflags { hscTarget = HscInterpreted }

  target <- guessTarget "Example.hs" Nothing
  setTargets [target]
  load LoadAllTargets
  modSum <- getModSummary $ mkModuleName "Example"

  pmod <- parseModule modSum      -- ModuleSummary
  tmod <- typecheckModule pmod    -- TypecheckedSource
  dmod <- desugarModule tmod      -- DesugaredModule
  let core = coreModule dmod      -- ModGuts
  stg <- liftIO $ coreToStg dflags (mg_module core) (mg_binds core)

  liftIO $ banner "Parsed Source"
  liftIO $ putStrLn $ showGhc ( parsedSource pmod )

  liftIO $ banner "Renamed Module"
  liftIO $ putStrLn $ showGhc ( tm_renamed_source tmod )

  liftIO $ banner "Typechecked Module"
  liftIO $ putStrLn $ showGhc ( tm_typechecked_source tmod )

  liftIO $ banner "Typed Toplevel Definitions"
  liftIO $ putStrLn $ showGhc ( modInfoTyThings (moduleInfo tmod) )

  liftIO $ banner "Typed Toplevel Exports"
  liftIO $ putStrLn $ showGhc ( modInfoExports (moduleInfo tmod) )

  liftIO $ banner "Core Module"
  liftIO $ putStrLn $ showGhc ( mg_binds core )

  liftIO $ banner "STG"
  liftIO $ putStrLn $ showGhc stg

Parsing

From the module summary we can extract the underlying text stream of the source file Example.hs. Running the parser over this generates a ParsedSource datatype which contains the HsSyn AST with raw OccNames.

data ParsedModule = ParsedModule 
  { pm_mod_summary :: ModSummary
  , pm_parsed_source :: ParsedSource
  , pm_extra_src_files :: [FilePath]
  , pm_annotations :: ApiAnns
  }

===============================Parsed Source===============================
module Example where
data Animal = Cat | Dog
add :: Int -> Int -> Int
add x y = x + y

Renaming

After parsing we run the renamer pass which is responsible for resolving scopes, operator precedences, and unique’ifying all names so that clashes cannot occur. The renamer resolves RdrNames into Names across all syntax elements. In addition it annotates each name with whether they are qualified and global or local to the current module. If a name is not in scope, compilation simply halts here and reports the missing name.

===============================Renamed Module===============================
Just (add :: Int -> Int -> Int
      add x y = x + y
      
      data Animal = Cat | Dog,
      [import (implicit) Prelude],
      Nothing,
      Nothing)

Typechecking

The renamed AST is then run through the typechecker which maps Names into Id and enriches them with inferred types using type inference.

data TypecheckedModule = TypecheckedModule 
  { tm_parsed_module :: ParsedModule
  , tm_renamed_source :: Maybe RenamedSource
  , tm_typechecked_source :: TypecheckedSource
  , tm_checked_module_info :: ModuleInfo
  , tm_internals_ :: (TcRnTypes.TcGblEnv, HscTypes.ModDetails)
  }

The resulting structure is a typechecked module which contains the toplevel types of all bindings and all subexpressions.

=============================Typechecked Module=============================
{AbsBinds [] []
   {Exports: [add <= add
                <>]
    Exported types: add :: Int -> Int -> Int
                    [LclId, Str=DmdType]
    Binds: add x y = (+) x y}}

The resulting datatype has a field for ModuleInfo, which contains various metadata about the typing environment, the names in scope, and typeclasses exported.

data ModuleInfo = ModuleInfo 
  { minf_type_env :: HscTypes.TypeEnv
  , minf_exports :: NameSet.NameSet
  , minf_rdr_env :: Maybe RdrName.GlobalRdrEnv
  , minf_instances :: [ClsInst]
  , minf_iface :: Maybe ModIface
  , minf_safe :: SafeHaskellMode
  , minf_modBreaks :: ModBreaks
  }

Using the coreModule we can also construct the gruesomely named ModGuts datatype which contains an expanded form of all the various exports and types with associated metadata.

data ModGuts = ModGuts 
  { mg_module :: !Module,
  , mg_boot :: IsBootInterface
  , mg_exports :: ![Avail.AvailInfo]
  , mg_deps :: !Dependencies
  , mg_dir_imps :: !ImportedMods
  , mg_used_names :: !NameSet.NameSet
  , mg_used_th :: !Bool
  , mg_rdr_env :: !RdrName.GlobalRdrEnv
  , mg_fix_env :: !FixityEnv
  , mg_tcs :: ![TyCon]
  , mg_insts :: ![ClsInst]
  , mg_fam_insts :: ![FamInst]
  , mg_patsyns :: ![PatSyn.PatSyn]
  , mg_rules :: ![CoreSyn.CoreRule]
  , mg_binds :: !CoreSyn.CoreProgram
  , mg_foreign :: !ForeignStubs
  , mg_warns :: !Warnings
  , mg_anns :: [Annotations.Annotation]
  , mg_hpc_info :: !HpcInfo
  , mg_modBreaks :: !ModBreaks
  , mg_vect_decls :: ![CoreSyn.CoreVect]
  , mg_vect_info :: !VectInfo
  , mg_inst_env :: InstEnv.InstEnv
  , mg_fam_inst_env :: FamInstEnv.FamInstEnv
  , mg_safe_haskell :: SafeHaskellMode
  , mg_trust_pkg :: Bool
  , mg_dependent_files :: [FilePath]
  }

The ModuleInfo can be queried for any type of information about types or names at this point. For instance if we were interesed in the types of a global name we could look it up as a TyThing inside the typing environment.

modInfoTyThings :: ModuleInfo -> [TyThing]
modInfoExports :: ModuleInfo -> [Name]

=========================Typed Toplevel Definitions=========================
[Type constructor ‘Animal’, Data constructor ‘Cat’,
 Data constructor ‘Dog’, Identifier ‘add’, Identifier ‘Cat’,
 Identifier ‘Dog’]

===========================Typed Toplevel Exports===========================
[Animal, Cat, Dog, add]

Core

The Core language is GHC’s most central data types. Core is a very small, explicitly-typed, variant of System FC. The output of the typecheckModule function is a datatype CoreModule which contains the Core program as the typing environment.

data CoreModule = CoreModule 
  { cm_module :: !Module
  , cm_types :: !HscTypes.TypeEnv
  , cm_binds :: CoreSyn.CoreProgram
  , cm_safe :: SafeHaskellMode
  }

Looking at the core module we see that each name in the program is now annotated with an explicit type and typeclass dictionaries have been inserted.

================================Core Module================================
[add :: Int -> Int -> Int
 [LclIdX, Str=DmdType]
 add = \ (x :: Int) (y :: Int) -> + @ Int $fNumInt x y]

Core then is translated into a intermediate form that is amenable to various analysis passes that optimize the program for code generation. This form is an abstract machine known a Spineless Tagless G-Machine. In this syntax every closure has an associated Static Reference Table or SRT. More on this later.

====================================STG====================================
[add :: Int -> Int -> Int
 [LclIdX, Str=DmdType] =
     \r srt:SRT:[r6ry :-> $fNumInt] [x y] + $fNumInt x y;]

Summary & Next Steps

We mentioned Core and STG only briefly. The process by which frontend syntax is transformed into Core will be our next topic. The design of Core defines a great deal of aspects about how the language is structured and informs the type system as well as code-generation.