One Serialization Class to Rule Them All
In Haskell, the predominant way to serialize data is to use a type class. Libraries such as aeson
, binary
, cereal
, cborg
, protobuf
, msgpack
, capnproto
, avro
, thrift
, xml
, yaml
, toml
, bson
, etc. etc.
typically provide a type class that you can implement for your data type to encode and decode to and from a given format. In general, I find that code written for a particular domain is content to provide anywhere between one and
three data representations for their data types.
For example, a web application might have a data type for a user that looks like this:
data User = User
{ userId :: Int
, userName :: Text
, userEmail :: Text
, userCreatedAt :: UTCTime
, userUpdatedAt :: UTCTime
}
In this hypothetical application. We might also want to send it over the wire to a client, so we would write a ToJSON
and FromJSON
instance for it.
We might also want to store it in a cache where we care mostly about serialization speed and don't care about backwards compatibility, so we would write a Store
instance for it.
We might later find that we want to provide a stable serialization representation for it when used in a worker queue, so we would write a Protobuf
instance for it.
Depending on the application, it's also not uncommon to end up with multiple representations of the same data type, especially for common types provided by libraries.
For example, loads of libraries support serialization of ()
, Bool
, Int
, Text
, ByteString
, UTCTime
, due to their ubiquity and generally simple serialization formats.
For a given domain that you control, it's often simplest to just pick one serialization format and stick with it. But for libraries, it's often desirable to support multiple serialization formats so that users can choose the one that best fits their needs. Wouldn't it be nice if, as a library author, you could let consumers of your library hande you a list of serialization formats that they want to use, and you could just pick the best one for each data type?
In an ideal world, we could imagine writing something like this:
type RawPayload = RawPayload
{ payloadData :: ByteString
-- ^ The serialized data
, payloadMetadata :: Map Text ByteString
-- ^ Metadata about the payload, such as the encoding type
}
-- | We want to be able to serialize & deserialize the parameters of a function using an arbitrary serialization format.
class Codec fmt a where
-- | Similar to a content-type header, this is a string that identifies the format of the payload.
-- it will be set on the 'encoding' metadata field of the payload.
encodingType :: fmt -> Proxy a -> ByteString
messageType :: fmt -> a -> ByteString
default messageType :: (Typeable a) => fmt -> a -> ByteString
messageType _ _ = C.pack $ show $ typeRep (Proxy @a)
encodePayload :: fmt -> a -> ByteString
decode :: fmt -> RawPayload -> Either String a
encode :: forall fmt a. Codec fmt a => fmt -> a -> RawPayload
encode fmt x = RawPayload
(encodePayload fmt x)
(Map.fromList
[ ("encoding", (encodingType fmt (Proxy @a)))
, ("messageType", messageType fmt x)
]
)
Here, we've defined a Codec
type class that can be implemented for any serialization format. In order to understand details about the actual payload in question, we
also have the RawPayload
carry metadata with it. This is similar to how HTTP headers include a Content-Type
and Content-Encoding
header to inform clients how
they should interact with the contents of a request body.
Now, let's define a few instances of this type class:
data JSON = JSON
instance (Typeable a, Aeson.ToJSON a, Aeson.FromJSON a) => Codec JSON a where
encodingType _ _ = "json/plain"
encodePayload _ x = BL.toStrict $ Aeson.encode x
decode _ = Aeson.eitherDecodeStrict' . inputPayloadData
data Binary = Binary
instance Codec Binary ByteString where
encodingType _ _ = "binary/plain"
encodePayload _ x = x
decode _ = Right . inputPayloadData
data Protobuf = Protobuf
instance (Message a) => Codec Protobuf a where
messageType _ x = encodeUtf8 $ messageName $ pure x
encodingType _ _ = "binary/protobuf"
encodePayload _ x = encodeMessage x
decode _ = decodeMessage . inputPayloadData
These instances come from the aeson
, binary
, and proto-lens
libraries, respectively. Now, let's imagine a composite codec that can choose the best codec for a given type. Let's say
that it's desirable for our situation to use direct binary serialization where possible, followed by protobuf, followed by JSON. We can define a composite codec like this:
data Composite (codecs :: [Type]) where
CompositeNil :: Composite '[]
CompositeCons :: codec -> Composite codecs -> Composite (codec ': codecs)
defaultCodec :: Composite '[Binary, Protobuf, JSON]
defaultCodec = CompositeCons Binary $ CompositeCons Protobuf $ CompositeCons JSON CompositeNil
Here we have a fairly standard heterogeneous list. What would the Codec
instance for this look like?
Let's start with the base caes. Obviously, an empty list of codecs can't encode anything, so we'll make that a type error.
instance TypeError (('ShowType a) ':<>: 'Text " is not supported by any of the provided codecs") => Codec (Composite '[]) a where
messageType = error "unreachable"
encodingType = error "unreachable"
encodePayload = error "unreachable"
decode _ _ = Left "No recognized codec for this type"
Now, how do we handle the case where we have a non-empty list of codecs? We want to try each codec in order until we find one that works.
This is where things get tricky. We can easily write an instance that ensures that a value of type a
is supported by the first codec in the list,
but it's not really what we want:
instance Codec codec a => Codec (Composite (codec ': codecs)) a where
messageType (CompositeCons codec _) = messageType codec
encodingType (CompositeCons codec _) = encodingType codec
encodePayload (CompositeCons codec _) = encodePayload codec
decode (CompositeCons codec _) = decode codec
This instance will always pick the first codec in the list, even if it doesn't support the type. We want to be able to try each codec in order until we find one that works.
The fundamental issue is that Haskell doesn't natively admit a way to express "try this constraint, and if it fails, try this other constraint". With a few language extensions and a lot of boilerplate, however, we can eventually come with something like this:
class c || d where
resolve :: (c => r) -> (d => r) -> r
infixr 2 ||
This is a type class on constraints themselves with a single method, resolve
. In theory, this type class would resolve the value on the left-hand side using the provided
instance c
, if the constraint c
was satisfied, otherwise it would try to resolve the constraint d
instead. On a very very manual basis, this sort of works! Supposing that we wanted
to dispatch on whether or not a type provides a Show
instance, we could write an instance like this:
instance (Show Int || d) where resolve = \r _ -> r
For things that don't admit an instance, we can provide a constraint d
that makes it impossible for the constraint c
to be satisfied, and call the resolve
method with the
right-hand argument:
instance d => (Show (a -> b) || d) where resolve = \_ r -> r
🤯 The fact that we can do this at all is absolutely wild to me. Unfortunately, we can't generalize this process to write something like this:
instance Show a => (Show a || d) where resolve = \r _ -> r
To use this technique, we would have to write an instance for every. single. type. that we wanted to support. Given that the original goal was to reduce boilerplate and manual type wrangling for the user, this is obviously a big step backward. As Matt Parsons puts it:
The problem comes down to where methods live, like, at the runtime level. an Object in JS or Java contains, in addition to its attributes, a lookup table for all the methods that can operate on it. And Haskell just doesn't have that. So we can't ask a type "Can you respond to this message?" because Haskell objects just don't respond to messages (aside from, like, record field labels, lol).
GHC's instance stuff is like a global hashmap of Type -> MethodDict, and you can do Typeable stuff to get the dict at compile time -
foo :: Typeable a => a -> Value foo a = case cast a :: Maybe Int of Just i -> toJSON i -- GHC can statically insert the dict here Nothing -> mempty
Or you can grab the
MethodDict
usingDict :: c => Dict c
But there's not a nice way to smuggle a Dict (c a) inside of an a, allowing you to call methods on it
– Matt Parsons
OK, so blog post is over, right? We can't do it, so we're done. Thanks for coming to my TED talk.
Well, we have one means of redress left. There is a GHC plugin called if-instance
, that implements a magic (||)
type class
for us that works exactly as we want it to by satisfying the constraint on the left-hand side if it can, and otherwise satisfying the constraint on the right-hand side at compile time
without us having to write any instances. This is exactly what we want! All we have to do is add a pragma to the top modules that use the (||)
type class.
{-# OPTIONS_GHC -fplugin=IfSat.Plugin #-}
Let's turn back to our Codec
instance. Now that we have the (||)
type class, we can write an instance that tries each codec in order until it finds one that works:
instance TypeError (('ShowType a) ':<>: 'Text " is not supported by any of the provided codecs") => Codec (Composite '[]) a where
messageType = error "unreachable"
encodingType = error "unreachable"
encodePayload = error "unreachable"
decode _ _ = Left "No recognized codec for this type"
instance (Codec fmt a || Codec (Composite codecs) a) => Codec (Composite (fmt ': codecs)) a where
messageType fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
(case fmt of CompositeCons codec _ -> messageType codec)
(case fmt of CompositeCons _ codecs -> messageType codecs)
encodingType fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
(case fmt of CompositeCons codec _ -> encodingType codec)
(case fmt of CompositeCons _ codecs -> encodingType codecs)
encodePayload fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
(case fmt of CompositeCons codec _ -> encodePayload codec)
(case fmt of CompositeCons _ codecs -> encodePayload codecs)
decode fmt payload = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
(case fmt of
CompositeCons codec codecs -> if Just (encodingType codec (Proxy @a)) == payload.inputPayloadMetadata Map.!? "encoding"
then decode codec payload
else ifSat @(Codec (Composite codecs) a)
(decode codecs payload)
(Left "No codec for this type supports this payload")
)
(case fmt of CompositeCons _ codecs -> decode codecs payload)
For each method, we use the dispatch
function from if-instance
to try the codec on the left-hand side, and if it fails, try the list of codecs on the right-hand side.
Let's try it out!
> encode defaultCodec (1 :: Int)
RawPayload {inputPayloadData = "foo", inputPayloadMetadata = fromList [("encoding","binary/plain"),("messageType","ByteString")]}
> newtype JsonInt = JsonInt Int deriving (ToJSON, FromJSON)
> encode defaultCodec (JsonInt 1)
RawPayload {inputPayloadData = "1", inputPayloadMetadata = fromList [("encoding","json/plain"),("messageType","JsonInt")]}
We provide the basic mechanism needed to perform content negotiation between a set of known serialization formats backed by their respective libraries.
For a more complete example, check out the Temporal.Payload module in the hs-temporal-sdk library. This module underpins RPC-style argument serialization and deserialization for the Haskell client for the Temporal workflow engine.
Deep thanks go to:
- Noah Luck Easterly– for his writing on implementing constraint unions by hand in Haskell, which proved that this was even possible in the first place.
- Sam Derbyshire– for eliminating the deep drudgery of writing constraint unions by hand with his if-instance library, and for his extremely prompt bugfix when I reported an issue with it.