The Joy and Agony of Haskell in Production

There have been several good talks about using Haskell in industry lately, and several people have asked me to write about my personal experiences. Although I can’t give specific details, I will speak broadly about some things I’ve learned and experienced.

The myths are true. Haskell code tends to be much more reliable and performant, easier to refactor, and easier to integrate with coworkers’ code without too much thinking. It’s also just enjoyable to write.

The myths are sometimes truisms. Haskell code tends to be of high quality by construction, but for several reasons that are merely correlated with, not caused by, the technical merits of Haskell. Just by virtue of the language being esoteric and having a relatively high barrier to entry, we end up working with developers who would write above-average code in any language. That said, the language actively encourages thoughtful consideration of abstractions and enforces a “brutal” (as John Carmack noted) level of discipline that high-quality code in other languages would demand anyway, but which Haskell makes mandatory.

Prefer to import libraries qualified. This is typically considered good practice for business logic libraries anyway, since it makes it easier to locate the source of symbol definitions. The only point of ambiguity I’ve seen is disagreement amongst developers about which core libraries are common enough to import unqualified and how to handle their symbols. Opinions span the full spectrum, from fully qualifying everything (Control.Monad.>>=), to qualifying common names (Data.Maybe.maybe), to only disambiguating clashing names like (Map.lookup).
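
As a small illustration of the middle-ground style (the lookupName function here is a hypothetical example):

import qualified Data.Map as Map
import qualified Data.Text as T
import Data.Maybe (fromMaybe)

-- Qualified imports make the origin of lookup and pack obvious at the
-- call site, while short aliases keep the noise tolerable.
lookupName :: Map.Map Int T.Text -> Int -> T.Text
lookupName table key = fromMaybe (T.pack "unknown") (Map.lookup key table)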

Consider rolling an internal prelude. As we’ve all learned the hard way, the Prelude is not your friend. The historical consensus has favored the “Small Prelude Assumption”, which presupposes that tools get pushed out into third-party packages, even the core tools necessary to do anything (text, bytestring, vector, etc.). This makes life easier for library authors at the cost of some struggle for downstream users.

In practice any non-trivial business logic module can easily have 100+ lines of imports alone, and frankly it gets tiring. One common way of abstracting this is to roll a custom prelude using module reexports. Consider a minimal use case like the following:

module MegaCorpPrelude ( 
  module Exports,
) where

import Data.Int as Exports
import Data.Tuple as Exports
import Data.Maybe as Exports
import Data.String as Exports
import Data.Foldable as Exports
import Data.Traversable as Exports

import Control.Monad.Trans.Except
  as Exports
  (ExceptT(ExceptT), Except, except, runExcept, runExceptT, 
   mapExcept, mapExceptT, withExcept, withExceptT)

This can be put into a cabal package which transitively pulls in the core dependencies, and which is then imported in our downstream modules:

{-# LANGUAGE NoImplicitPrelude #-}

import MegaCorpPrelude

There are several custom preludes that are available on Hackage in the Prelude category.

Haskell has world class libraries. There is an abundance of riches on Hackage in libraries like QuickCheck, mtl, pipes, conduit, tasty, attoparsec, sbv, and many more. Knowing where to start with the ecosystem can be a little tricky, and there are sometimes multiple competing solutions. A conservative start to a library might consist of something like the following build-depends in our cabal file:

  build-depends:       
    base                 >= 4.6   && <4.9,
    deepseq              >= 1.3   && <1.5,
    hashable             >= 1.2.2 && <1.3,

    text                 >= 1.1   && <1.3,
    bytestring           >= 0.10  && <0.11,
    split                >= 0.2   && <0.3,

    unordered-containers >= 0.2   && <0.3,
    containers           >= 0.5   && <0.6,
    vector               >= 0.11  && <0.12,

    mtl                  >= 2.2   && <3.0,
    transformers         >= 0.4   && <0.6,

    time                 >= 1.6   && <1.7,
    process              >= 1.1   && <1.3,
    directory            >= 1.2   && <1.3,
    optparse-applicative >= 0.10  && <0.13

For many problem domains the libraries simply aren’t written yet. Haskell is used in many domains, for tasks as diverse as trading systems, spam filtering, and web services. Chances are there is a plethora of libraries available for some tasks. Yet it goes without saying that Haskell is not Java or Python, and there simply isn’t an equivalent mindshare for certain tasks. If we need to connect to Microsoft SQL Server or a SOAP service, we’re probably going to have more trouble. The primitives are probably there to do it, but often there is no off-the-shelf solution.

Usually it boils down to: If you don’t write that library, no one else will.

There isn’t a global consensus on how to write Haskell. Being an abnormally expressive language means Haskell is written in wildly different styles, to the point of being almost a different language in some cases. There are equally divergent views on how to structure large applications, and not a whole lot is written about best practices for doing so. Most of the schools of thought differ on how far along the spectrum of correctness we should strive to be and what power-to-weight ratio is appropriate for a given task.

It’s hard to give universal advice about structuring Haskell logic that applies to all problems, and I’d be skeptical of anyone who tried. For a certain set of tasks, such as command line utilities, processing pipelines, or web services, it’s certainly possible to write applications that don’t involve monad transformers; but there are many other domains where the natural choice is to roll a stack of monad transformers that encapsulates common logic like errors and state, as sketched below.
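
As a minimal sketch of one such stack (the Config, AppState, and AppError types here are hypothetical placeholders):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.Except (ExceptT, MonadError, runExceptT)
import Control.Monad.Reader (ReaderT, MonadReader, runReaderT)
import Control.Monad.State (StateT, MonadState, evalStateT)

-- Hypothetical environment, state, and error types.
data Config   = Config { verbose :: Bool }
data AppState = AppState { requestCount :: Int }
data AppError = NotFound | Invalid String deriving (Show)

-- Read-only configuration, mutable state, and failure layered over IO.
newtype App a = App
  { unApp :: ReaderT Config (StateT AppState (ExceptT AppError IO)) a
  } deriving ( Functor, Applicative, Monad
             , MonadReader Config, MonadState AppState, MonadError AppError )

runApp :: Config -> AppState -> App a -> IO (Either AppError a)
runApp cfg st (App m) = runExceptT (evalStateT (runReaderT m cfg) st)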

If we look at the history of programming, there are many portents of the future of Haskell in the C++ community, another language where no two developers (that I’ve met) agree on which subset of the language to use.

Configuration. For configuration, Bryan O’Sullivan’s configurator library is invaluable. The library reads an external flat configuration file which can hold credentials, connections, and cluster topology information. A typical pattern is to embed this in a ReaderT and then ask for whatever fields are needed in downstream logic.

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.Reader (ReaderT, MonadReader, asks)
import qualified Data.Configurator as Config
import Data.Maybe (fromMaybe)
import Database.PostgreSQL.Simple (ConnectInfo(..))

newtype ConfigM a = ConfigM (ReaderT ConnectInfo IO a)
  deriving (Functor, Applicative, Monad, MonadReader ConnectInfo)

handleConfig :: FilePath -> IO ConnectInfo
handleConfig config_filename = do
    config <- Config.load [ Config.Required config_filename ]

    hostname <- Config.require config "database.hostname"
    username <- Config.require config "database.username"
    database <- Config.require config "database.database"
    password <- Config.lookup  config "database.password"

    return $ ConnectInfo
     { connectHost     = hostname
     , connectUser     = username
     , connectDatabase = database
     , connectPort     = 5432
     , connectPassword = fromMaybe "" password
     }
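
Downstream logic can then ask for individual fields. A minimal sketch, reusing the asks import above:

-- Pull a single field out of the ambient connection info.
getHost :: ConfigM String
getHost = asks connectHost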

The configuration file might look like the following:

database
{
  hostname = "mydb.rds.amazonaws.com"
  database = "employees"
  username = "stephen"
  password = "hunter2"
}

Haskell is full of distractions. There are plenty of rabbit holes that will lead us down the wrong path or divert us from our core business. Working in industry means that ultimately our job is to create software that drives the creation of revenue, not necessarily to advance the state of the art, although the two are not necessarily incompatible.

With this in mind, it’s important to note there are plenty of vocal Haskellers who work under a value system that is largely incompatible with industrial practice, much of it stemming from hobbyists and academics who use Haskell as a vehicle for their work. This is not to diminish these people or their work, but the metrics for success in that space are different enough that they tend to view programming from a perspective that can be incommensurable with that of industrial programmers.

The tension between academic and industrial influences in Haskell is one of the strongest driving forces for progress, but it also creates conflicts of interest for many of us who write code to support ourselves. If one decides to engage with the community, it’s important to realize that many of the topics being actively discussed are likely 3+ years away from being something one would want to bet the livelihood of a company on.

Testing and building. For development builds it’s usually essential to be able to pull in internal libraries that are not on Hackage. With cabal sandboxes this can be achieved with a script that provisions the vendored dependencies:

$ git clone https://github.com/bscarlet/llvm-general vendor/llvm-general
$ cd vendor/llvm-general
$ git checkout ca6489fdddde5c956a4032956e28099ff890a80b
$ cd ../..
$ cabal sandbox add-source vendor/llvm-general/llvm-general-pure

With stack, all of this can be configured in the stack.yaml file:

packages:
- location:
    git: https://github.com/bscarlet/llvm-general
    commit: ca6489fdddde5c956a4032956e28099ff890a80b
  subdirs:
    - llvm-general-pure/

Private TravisCI or Codeship setups are not worth the trouble if one ever envisions the project spanning multiple repositories; getting their virtual machines provisioned with the proper credentials to pull from multiple GitHub repos is still a source of trouble. For build slaves and continuous integration I’ve used BuildBot successfully with the usual cabal and stack toolchain.

For large multi-package builds, I can’t speak highly enough of Neil Mitchell’s build system shake, which is itself written in Haskell. Shake builds use Shakefiles, which are monadic descriptions of a graph of dependencies to resolve and the artifacts they produce. For a contrived example, consider running a Markdown file through Pandoc:

import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions $ do
    -- Request the final artifact.
    want ["book.html"]
    -- Rule describing how to build book.html from its dependency.
    "book.html" *> \out -> do
        need ["book.md"]
        system' "pandoc" ["book.md", "-o", out]

Fast builds lead to faster turnaround. If care isn’t taken, projects can quickly devolve into being unmanageably slow to compile, though the problem is usually avoidable. Deriving instances of Read/Show/Data/Generic for heavily recursive ADTs can sometimes lead to quadratic memory behavior when the nesting gets deep. The somewhat ugly hack to speed up compile times here is to run ghc -ddump-deriv Module.hs and then manually insert the resulting code instead of deriving it every time (see the example below). Not a great solution, but I’ve seen it drastically improve compilation memory footprint and time. Also be tactical with uses of INLINE and SPECIALIZE, as inlining at many call sites has a non-trivial cost. Avoid TemplateHaskell, as it can cause ridiculously inflated build times and enormous memory footprints in GHCi.
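
For example, using GHC’s -ddump-to-file flag to write the derived instances out next to the module for inspection:

$ ghc -ddump-deriv -ddump-to-file Module.hs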

It’s pretty common to use ghci and ghcid during development. Your mileage may vary with ghc-mod support for Vim and Emacs, which allows in-editor type introspection.
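
For example, a minimal ghcid session that reloads and re-typechecks on every save over a cabal project might look like:

$ ghcid --command="cabal repl"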

Provisioning a test server from Hackage can take minutes to hours depending on the size of our dependency tree. Fortunately it’s easy to set up a Hackage mirror containing all of our internal dependencies which can be served quickly from our local servers or an S3 bucket. We can then simply alter the remote-repo field in our ~/.cabal/config to point at the custom mirror.
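
The entry might look something like the following, where the mirror hostname is a hypothetical placeholder:

remote-repo: megacorp:http://hackage.megacorp.internal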

Records. The record system is a continual source of pain. It’s best to come up with an internal convention for naming record accessors and use qualified imports. It sucks for now, but there are some changes coming in the 8.0 release that will make life easier.

When using Generics to derive ToJSON and FromJSON instances, there is a fieldLabelModifier option that can be used to rewrite the derived field names so the serialized output does not have to match the Haskell record accessors. For example, we’ll drop the first three characters of each field name:

{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson
import Data.Text (Text)
import GHC.Generics (Generic)

data Login = Login
  { _lgusername :: Text
  , _lgpassword :: Text
  } deriving (Eq, Ord, Show, Generic)

instance ToJSON Login where
  toJSON = genericToJSON defaultOptions { fieldLabelModifier = drop 3 }

This will serialize out to:

{
  "username": "stephen",
  "password": "hunter2"
}

Performance and Monitoring. A common performance problem is many small updates to records with large numbers of fields. Records with hundreds of fields are somewhat pathological, but in practice they show up in a lot of business logic that needs to interact with large database rows. Too much of this can have a very noticeable impact on GC pressure, since every update allocates a fresh copy of the record. If you notice runaway memory behavior, one of the first places to look (after the usual suspects) is for overgrown records, possibly inside a StateT with lots of uses of modify in sequence; a contrived sketch follows.
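
This sketch uses a hypothetical Row type as a stand-in for a much wider record:

import Control.Monad.State (StateT, modify)

-- Hypothetical wide record; imagine dozens more fields.
data Row = Row
  { rowId    :: Int
  , rowName  :: String
  , rowCount :: Int
  } deriving (Show)

-- Every modify allocates a fresh copy of the entire record, so long
-- chains of small updates on wide records create real GC pressure.
step :: Monad m => StateT Row m ()
step = do
  modify (\r -> r { rowCount = rowCount r + 1 })
  modify (\r -> r { rowName  = rowName r ++ "!" })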

A very common library for live performance monitoring is ekg, which simply forks off a thread that tracks the state of the GHC runtime internals and can serve this data to other logging services via HTTP + JSON or via a built-in web server. For example:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Remote.Monitoring

main :: IO ()
main = do
  ekg <- forkServer "localhost" 8000
  putStrLn "Started server on http://localhost:8000"
  forever $ getLine >>= putStrLn

ekg has several large dependencies, so it’s sometimes desirable to enable it optionally with a cabal configuration flag so that it isn’t included unless we want a development build. We just guard our build-depends to include it only if the flag is set via cabal configure -fekg.

flag ekg
  manual: True
  default: True
  description: Compile with ekg monitoring.

library
  if flag(ekg)
    build-depends:
      ekg >= 0.4 && < 0.5

Strings. The string types are mature, but unwieldy to work with in practice. It’s best to just make peace with the fact that literally every module will have some boilerplate just to do simple string manipulation and IO. OverloadedStrings overcomes some of the issues, but it’s still annoying that you’ll end up playing string type-tetris a lot.

If you end up rolling a custom prelude it’s worth just correcting putStrLn and print to what they should be in a just world:

import Prelude hiding (putStr, putStrLn, print)
import qualified Prelude
import qualified Data.Text.IO
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Text (Text)

-- IO, lifted into any MonadIO and defaulting to Text.
putStr :: MonadIO m => Text -> m ()
putStr = liftIO . Data.Text.IO.putStr

putStrLn :: MonadIO m => Text -> m ()
putStrLn = liftIO . Data.Text.IO.putStrLn

print :: (MonadIO m, Show a) => a -> m ()
print = liftIO . Prelude.print

A common pattern is to use a multiparameter typeclass to do string conversions between all the common types (Data.Text.Text, Data.Text.Lazy.Text, Data.ByteString.UTF8, Data.ByteString.Lazy.UTF8, [Char]). You’ll end up eating at least one typeclass dictionary lookup per call to s, but this is fairly benign in most cases.

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, UndecidableInstances #-}

import qualified Data.ByteString.UTF8 as UTF8
import qualified Data.ByteString.Lazy.UTF8 as LUTF8

-- Conversions route through String as the common interchange type.
class ToString a where
  toString :: a -> String

class FromString a where
  fromString :: String -> a

class StringConvert a b where
  s :: a -> b

instance (ToString a, FromString b) => StringConvert a b where
  s = fromString . toString

instance ToString String where toString = id
instance FromString String where fromString = id

instance FromString UTF8.ByteString where
    fromString = UTF8.fromString

instance FromString LUTF8.ByteString where
    fromString = LUTF8.fromString

instance ToString UTF8.ByteString where
    toString = UTF8.toString

instance ToString LUTF8.ByteString where
    toString = LUTF8.toString
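
Usage then looks like a single polymorphic conversion function at each call site, for example:

-- Convert a String into a strict UTF-8 ByteString.
payload :: UTF8.ByteString
payload = s "hello world"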

There are libraries on Hackage (string-convert, string-conv) that implement this pattern.

Documentation is abysmal. Open source Haskell libraries are typically released with below-average or non-existent documentation. The reasons for this are a complicated confluence of technical and social phenomena that can’t really be traced back to one root cause. Basically, it’s complicated. What this means for industrial use is to always budget extra hours of lost productivity for reverse engineering libraries from their test suites just to get a minimal example running. It’s not great, but that’s the state of things.

You own the transitive dependency tree. With such a small ecosystem, anything we pull in we have to be able to maintain and support should the upstream source dry up or wither on the vine. The cold truth is that if there’s no community-provided documentation for a library and you depend on it for your product, you’ve just added technical debt to your company. The person you hand the code off to will have to read through your code and all its transitive dependencies, and while the undocumented upstream libraries might make sense to you, they may utterly confound your successor.

If you’re depending on your Haskell code being stable and supportable it’s worth being conservative in what dependencies you pull into your tree.

Avoid TemplateHaskell. Enough said; it’s an eternal source of pain and sorrow that I never want to see anywhere near code that I have to maintain professionally. The best quote about this is found in this StackOverflow thread:

Think about it. Half the appeal of Haskell is that its high-level design allows you to avoid huge amounts of useless boilerplate code that you have to write in other languages. If you need compile-time code generation, you’re basically saying that either your language or your application design has failed you.

If you need to crank out that much boilerplate just to get something done, go back and rethink. If an upstream library forces you to use it, don’t depend on that library. There is almost always a way to accomplish the task without falling back on TH.

Don’t be afraid to train people. Contrary to a lot of popular myths, with the right direction people can indeed pick up Haskell quite quickly. There are great developers outside the community who, given a little bit of insight into Haskell, will turn into great Haskellers. People with a little experience in Scheme, Clojure, Scala, or OCaml can learn the language especially fast.

I was fortunate enough to train a very talented intern named Dan, who came in not knowing any Haskell (he was primarily a Java developer) and within two weeks had picked up the language and was amazingly productive. Learning on your own is much more time-consuming than having a Haskell friend sitting next to you. It’s a time investment, but it can pay off exponentially with the right person, and with Dan it most certainly did.

Network with other industrial users. There is no shortage of hobbyist Haskell programmers to consult with about problems, though by my estimate there are probably only around 70-100 people in the United States working on Haskell full-time, and a good deal more working part-time or anticipating using it. It’s worth networking with other industrial users to share best practices.

Eventually, with enough use, many of these rough corners in the language will get polished over, best practices will become absorbed, and libraries will get built.