Cooking Classes with Datatype Generic Programming

Haskell Generics are a somewhat misunderstood topic but are an extremely powerful technique for writing reusable and comparable interfaces across an enormous universe of types with very little effort. They are probably my favorite example of the advantages of an expressive type system endows us with.

Source for examples code is available here.

Generics are a form of datatype-generic programming, which although the namesake has some similarity to Java Generics they are different concepts entirely. GHC’s implementation of Generics fall out of the simple observation that all datatypes in Haskell can be written as a combination of a sum of products.

A sum type, is a data structure used to hold a value that could take on several different, but fixed, types. For example:

data Pastry
  = Turnover
  | Macaroon
  | Brownie
  | Cookie

A product type, is a data structure used to hold a fixed ordered set of several types. Selecting a single field is called projection.

data Person = Person
  { firstName       :: String
  , lastName        :: String
  , age             :: Int
  , height          :: Float
  , phoneNumber     :: String
  , flavor          :: String
  }

In Haskell all datatypes can be expressed as sums of products:

data Expr
  = Add { l :: Expr, r :: Expr }
  | Mul { l :: Expr, r :: Expr }
  | Sub { l :: Expr, r :: Expr }
  | Div { l :: Expr, r :: Expr }
  | Number { val :: Int }

During compilation most of the information about the structure of the datatypes is thrown out, by design Haskell erases all type information. Prior to type-checking a phase known as elaboration expands out all record selectors into toplevel functions which extract the named fields of a product.

data Point a b = Point { x :: a, y :: b }
x :: Point a b -> a
x (Point a _) = a

y :: Point a b -> b
y (Point _ b) = b

The rest of the information about products largely gets thrown out after compilation, and the product just get expanded into pattern matching code. For sum types the only information that is kept around is the tag for each constructor of the sum type. For instance Add is assigned the tag 1, Mul is assigned 2 etc. In a case statement the only information that is available at runtime is which branch we’re scrutinizing.

So what if consider not tossing out all this information and instead exposed it to our program so that we could write generic logic that can introspect the “structure” of our datatypes.

Compiler Hooks

Since GHC 6.10 we’ve had type families which, among other things, allow us to associate data types with our typeclass. So the structure of our Generic class can have a associated Rep type which can carry information along with the typeclass.

class Generic a where
  type Rep a :: * -> *
  from :: a -> (Rep a) x
  to :: (Rep a) x -> a

To represent the structure of our datatype we need to set up several datatypes to encode, sums, products, empty branches and various metadata about the names of fields, constructors and their types. All of which have a free parameter p which is bound to the head of typeclass instance when used in the associated datatype Rep a.

data    V1        p                       -- Empty
data    U1        p = U1                  -- ()
data    (:+:) f g p = L1 (f p) | R1 (g p) -- Sum
data    (:*:) f g p = (f p) :*: (g p)     -- Product
newtype K1    i c p = K1 { unK1 :: c }    -- a container for a c
newtype M1  i t f p = M1 { unM1 :: f p }  -- metadata wrapper

Now we could write this instance by hand for all of our datatypes, and for a simple enumeration it would look like the following:

data Ingredient
  = Flour
  | Sugar

instance Generic Ingredient where
  type Rep Ingredient = M1 D (T_Ingredient ((M1 C (C_Flour U1)) :+: (M1 C (C_Sugar U1))))

  from Flour = M1 (L1 (M1 U1))
  from Sugar = M1 (R1 (M1 U1))

  to (M1 (L1 (M1 U1))) = Flour
  to (M1 (R1 (M1 U1))) = Sugar

data T_Ingredient
data C_Flour
data C_Sugar

The instance here is purely mechanical and can be derived from GHC’s internal representation of it’s syntax tree, namely the types GHC.DataCon and GHC.TypeCon. Using the -XDeriveGeneric extension we can have GHC crank this typeclass out automatically:

{-# LANGUAGE DeriveGeneric #-}

data Ingredient
  = Flour
  | Sugar
  deriving (Generic)

Lest we not handwave away the work that GHC is doing, let’s actually recreate the introspection logic that GHC uses when instantiating a Generic class from a module’s data definitions. Let’s load a module dynamically, intercept the compilation and dump out the internal structure of the it’s datatypes to see how this would be mechanically translated into a typeclass instance.

import GHC
import GHC.Paths as Paths

import Name
import TyCon
import TypeRep
import DataCon
import HscTypes

import Text.Show.Pretty

main :: IO ()
main = do

  -- Inside the GHC Monad
  rep <- runGhc (Just Paths.libdir) $ do

    -- Spin up a GHC compiler environment
    dflags <- getSessionDynFlags
    setSessionDynFlags dflags

    -- Make a dummy module to inject
    let mn = mkModuleName "Test"

    -- Make a dummy target
    addTarget Target {
      targetId = TargetModule mn
    , targetAllowObjCode = True
    , targetContents = Nothing
    }

    -- Run the GHC pipeline
    load LoadAllTargets
    modSum <- getModSummary mn
    p <- parseModule modSum
    t <- typecheckModule p

    -- Pluck out the module tycons after we're done type-checking
    DesugaredModule tcmod modguts <- desugarModule t
    let tycons = mg_tcs modguts

    -- Deconstruct all datatypes into their sums-of-products.
    return (deconstruct tycons)

  putStrLn (ppShow rep)

Now that we have access to GHC’s internal representation of the “module guts” we can write our deconstructor logic. The logic is a slim few hundred lines of mostly ADT munging.

deconstruct :: [TyCon] -> [Data]
deconstruct = fmap go
  where
    go x
      | isProduct x = M1 $ D Datatype
        { dataTypeName = getOccString (tyConName x)
        , modName      = modString x
        , isNewtype    = isNewTyCon x
        , datatype     = Product (mkProduct x)
        , recursive    = isRecursiveTyCon x
        }

      | isVoid x = M1 $ D Datatype
        { dataTypeName = getOccString (tyConName x)
        , modName      = modString x
        , isNewtype    = isNewTyCon x
        , datatype     = V1
        , recursive    = isRecursiveTyCon x
        }

      | otherwise = M1 $ D Datatype
        { dataTypeName = getOccString (tyConName x)
        , modName      = modString x
        , isNewtype    = isNewTyCon x
        , datatype     = Sum (mkProduct x)
        , recursive    = isRecursiveTyCon x
        }

mkRecord :: TyCon -> [Data]
mkRecord x = concatMap mkRProduct (tyConDataCons x)

mkProduct :: TyCon -> [Data]
mkProduct x = fmap go (tyConDataCons x)
  where
    go :: DataCon -> Data
    go x | isRecord x   = Product (mkRProduct x)
    go x | isDProduct x = Product (mkDProduct x)
    go x                = M1 (C (Constructor (conNames x)))

mkDProduct :: DataCon -> [Data]
mkDProduct xs = [K1 (showType x) | x <- dataConOrigArgTys xs]

mkRProduct :: DataCon -> [Data]
mkRProduct x = [M1 (S (Selector (getOccString fld)) ty) | (fld, ty) <- zip (fieldNames x) (mkDProduct x)]

Setting up the dummy module Test.hs to run our decompilation script:

data PlatonicSolid
  = Tetrahedron
  | Cube
  | Octahedron
  | Dodecahedron
  | Icosahedron

data Person = Person
  { firstName       :: String
  , lastName        :: String
  , age             :: Int
  , height          :: Float
  , phoneNumber     :: String
  , flavor          :: String
  } deriving (Show)

data T
  = T1 { a :: Int, b :: Float }
  | T2 { c :: Int, d :: Double }

For PlatonicSolid we get the representation:

M1
   (D Datatype
        { dataTypeName = "PlatonicSolid"
        , modName = "Test"
        , isNewtype = False
        , datatype =
            Sum
              [ M1 (C Constructor { conName = "Tetrahedron" })
              , M1 (C Constructor { conName = "Cube" })
              , M1 (C Constructor { conName = "Octahedron" })
              , M1 (C Constructor { conName = "Dodecahedron" })
              , M1 (C Constructor { conName = "Icosahedron" })
              ]
        , recursive = False
        })

For Person we get the representation:

M1
    (D Datatype
         { dataTypeName = "Person"
         , modName = "Test"
         , isNewtype = False
         , datatype =
             Product
               [ M1 (S Selector { selName = "firstName" } (K1 "String"))
               , M1 (S Selector { selName = "lastName" } (K1 "String"))
               , M1 (S Selector { selName = "age" } (K1 "Int"))
               , M1 (S Selector { selName = "height" } (K1 "Float"))
               , M1 (S Selector { selName = "phoneNumber" } (K1 "String"))
               , M1 (S Selector { selName = "flavor" } (K1 "String"))
               ]
         , recursive = False
       })

For the sum of products T we get the representation:

M1
  (D Datatype
       { dataTypeName = "T"
       , modName = "Test"
       , isNewtype = False
       , datatype =
           Sum
             [ Product
                 [ M1 (S Selector { selName = "a" } (K1 "Int"))
                 , M1 (S Selector { selName = "b" } (K1 "Float"))
                 ]
             , Product
                 [ M1 (S Selector { selName = "c" } (K1 "Int"))
                 , M1 (S Selector { selName = "d" } (K1 "Double"))
                 ]
             ]
       , recursive = False
       })

These data points are then used to generate the Rep instance in the derived Generic instances. So that’s a rough approximation of how -XDeriveGeneric works under the hood, nothing terribly complicated just book keeping.

GHC.Generics

From the internal representation we crank out several typeclass instances which store the metadata about the various constructors.

class Datatype d where
  datatypeName :: t d f a -> String
  moduleName   :: t d f a -> String
  isNewtype    :: t d f a -> Bool
  isNewtype _ = False

class Selector s where
  selName :: t s f a -> String

class Constructor c where
  conName :: t c f a -> String

  conFixity :: t c f a -> Fixity
  conFixity _ = Prefix

  conIsRecord :: t c f a -> Bool
  conIsRecord _ = False

For example, for Ingredient example from before, we’d have several constructor instances automatically generated by which we could query the names from the AST.

type Rep Ingredient = M1 D (T_Ingredient ((M1 C (C_Flour U1)) :+: (M1 C (C_Sugar U1))))

data T_Ingredient
data C_Flour
data C_Sugar

instance Datatype T_Ingredient where
  datatypeName _ = "Ingredient"
  moduleName _ = "Main"

instance Constructor C_Flour where
  conName _ = "Flour"

instance Constructor C_Sugar where
  conName _ = "Sugar"

Unlike reflection in languages like Java, Generics are not pushing type information into the runtime. Apart from a dictionary lookup for they are a effectively free abstraction that has no overhead. We’re simply making more information from the compiler manifest in the types during the type-checking phase, all of which gets erased during compilation.

Example

I tried to come up a non-contrived example for illustrating the usefulness of generics, and there are plenty of examples (serializes for JSON, Protocol Buffers, SQL Generation, traversals, command line parsers, etc) that are well-documented elsewhere on the web. So let’s consider an example based on the silly pun in the title of this article, namely cooking typeclasses.

So we have a Pie type, naturally.

data Pie = Pie
  { filling :: Filling
  , topping :: Maybe Topping
  } deriving (Show, Generic)

data Filling = Apple | Cherry | Pumpkin
  deriving (Show, Generic)

data Topping = IceCream | WhipCream
  deriving (Show, Generic)

Using generics we’d like to a generate a list of the types of pie that we can put on a menu from the structure of the Haskell types. Records will denote named variations (“filling” vs “topping”) of the menu item, while sum types denote the various options in the variations (“cherry filling” vs “apple filling”).

data Item
  = Item Text [Item]
  | Variant Text [Item]
  | Choice Text
  deriving (Show, Generic)

We implement a typeclass with a default signature which gives us the option to manually specify how a type gets converted into a menu item, or fall back on using it’s generic representation to automatically generate it.

class Menu a where
  menu :: a -> [Item]
  default menu :: (Generic a, GMenu (Rep a)) => a -> [Item]
  menu _ = gmenu (Proxy :: Proxy a)

gmenu :: forall a. (Generic a, GMenu (Rep a)) => Proxy a -> [Item]
gmenu _ = gopts (Proxy :: Proxy (Rep a))

Our generic menu operates over various GHC.Generics types to expand out the sums and products into the Item categories that correspond to the menu. The instance for GMenu (K1 R f) has a Menu constraint which allows manual override for specific datatypes. Since we’re passing around a proxy we’ll have to manually thread the dictionary around sometimes by passing an undefined cast to the type of the instance we need to resolve.

-- Generic Menu
class GMenu a where
  gopts :: Proxy a -> [Item]

-- Datatype
instance GMenu f => GMenu (M1 D x f) where
  gopts _ = gopts (Proxy :: Proxy f)

-- Constructor Metadata
instance (GMenu f, Constructor c) => GMenu (M1 C c f) where
  gopts x
    | conIsRecord (undefined :: t c f a) =
      [Item (pack (conName m)) (gopts (Proxy :: Proxy f))]

    | otherwise = [Choice (pack (conName m))]
    where m = (undefined :: t c f a)

-- Selector Metadata
instance (GMenu f, Selector c) => GMenu (M1 S c f) where
  gopts _ = [Variant (pack (selName m)) (gopts (Proxy :: Proxy f))]
    where m = (undefined :: t c f a)

-- Constructor Paramater
instance (GMenu (Rep f), Menu f) => GMenu (K1 R f) where
  gopts _ = menu (undefined :: f)

-- Sum branch
instance (GMenu a, GMenu b) => GMenu (a :+: b) where
  gopts _ = gopts (Proxy :: Proxy a) ++ gopts (Proxy :: Proxy b)

-- Product branch
instance (GMenu a, GMenu b) => GMenu (a :*: b) where
  gopts _ = gopts (Proxy :: Proxy a) ++ gopts (Proxy :: Proxy b)

-- Void branch
instance GMenu U1 where
  gopts _ = []

Specifically we’ll override the Maybe type so that it simply expands out to a choice of either “AsIs” of the variant or just the list of choices endowed by the inner parameter.

instance Menu a => Menu (Maybe a) where
  menu _ = [Choice (pack "AsIs")] ++ (menu (undefined :: a))

As an example Maybe Topping expands out into three choices.

menu (a :: Maybe Topping) ~ [Choice "AsIs", Choice "IceCream", Choice "whipCream"]

Using GHC 7.10’s new -XDeriveAnyClass extension we can actually go back and automatically derive Menu inside the deriving clause.

data Pie = Pie
  { filling :: Filling
  , topping :: Maybe Topping
  } deriving (Show, Generic, Menu)

Now synthesizing a dictionary for Pie we can get a menu

menu (undefined :: Pie)

And voila:

[ Item
    "Pie"
    [ Variant
        "filling" [ Choice "Apple" , Choice "Cherry" , Choice "Pumpkin" ]
    , Variant
        "topping"
        [ Choice "AsIs" , Choice "IceCream" , Choice "WhipCream" ]
    ]
]

Since our logic is datatype generic, any Haskell we can write down can be automatically translated to a Menu just by deriving Menu. So now we can a new Crisp desert (my favorite!) and we get everything for free!

data Crisp = Crisp
  { contents :: Filling
  , temperature :: Temperature
  } deriving (Show, Generic, Menu)

data Temperature = Warm | Cold
  deriving (Show, Generic, Menu)

-- Add an instance for a pair of menu items. That expands into multiple items.
instance (Menu a, Menu b) => Menu (a,b) where
  menu _ = menu (undefined :: a) ++ menu (undefined :: b)

And we can generate the composite menu of both deserts:

menu (undefined :: (Pie, Crisp))
[ Item
    "Pie"
    [ Variant
        "filling" [ Choice "Apple" , Choice "Cherry" , Choice "Pumpkin" ]
    , Variant
        "topping"
        [ Choice "AsIs" , Choice "IceCream" , Choice "WhipCream" ]
    ]
, Item
    "Crisp"
    [ Variant
        "contents" [ Choice "Apple" , Choice "Cherry" , Choice "Pumpkin" ]
    , Variant "temperature" [ Choice "Warm" , Choice "Cold" ]
    ]
]

So that’s generics. One of the best goto examples of how an expressive type system and a few clever compiler hooks can make programmers lives easier by cooking our boilerplate for us and giving tastier more correct code.