
Monday, August 25, 2008

Fast Serialization

We use a lot of serialization in the current system I work with; serializing/deserializing 100,000,000 objects in a day is pretty common. For a long time we knew that the binary formatter was fat and slow, but we never rationalized writing something custom because we were always fast enough. Unfortunately our data throughput has risen 400% in the last year (when you start with gigs and gigs of messages, this is a huge gain), and our little three or four year old dual Xeon 2.2 has turned into the little engine that could during peaks lately, so we finally bit the bullet and threw something together quickly.

A toast ... to the little server that could!

This solution is for a fairly niche condition and is heavily optimized, so please read the explanations below to see whether it will be good for your scenario before using it.

 

This is the first of a series of posts dealing with this ... let's start by introducing a new interface to our system:

    public interface ICustomBinarySerializable
    {
        void WriteDataTo(BinaryWriter _Writer);
        void SetDataFrom(BinaryReader _Reader);
    }

You then implement this interface in your object like this; only write out exactly what you need, and write it in the simplest way possible.

    class TestObject : ICustomBinarySerializable
    {
        public int Integer;

        public TestObject() {}

        public TestObject(int _Integer)
        {
            Integer = _Integer;
        }

        public virtual void WriteDataTo(BinaryWriter _Writer)
        {
            _Writer.Write((int) Integer);
        }

        public virtual void SetDataFrom(BinaryReader _Reader)
        {
            Integer = _Reader.ReadInt32();
        }
    }

Then I wrote a custom formatter that operates on objects implementing ICustomBinarySerializable. You may note that I use an integer for the index that represents the type. A short would probably be more appropriate than an integer, and we could save a few bytes there.

    public class CustomBinaryFormatter : IFormatter
    {
        private SerializationBinder m_Binder;
        private StreamingContext m_StreamingContext;
        private ISurrogateSelector m_SurrogateSelector;
        private readonly MemoryStream m_WriteStream;
        private readonly MemoryStream m_ReadStream;
        private readonly BinaryWriter m_Writer;
        private readonly BinaryReader m_Reader;
        private readonly Dictionary<Type, int> m_ByType = new Dictionary<Type, int>();
        private readonly Dictionary<int, Type> m_ById = new Dictionary<int, Type>();
        private readonly byte[] m_LengthBuffer = new byte[4];
        private readonly byte[] m_CopyBuffer;

        public CustomBinaryFormatter()
        {
            m_CopyBuffer = new byte[20000];
            m_WriteStream = new MemoryStream(10000);
            m_ReadStream = new MemoryStream(10000);
            m_Writer = new BinaryWriter(m_WriteStream);
            m_Reader = new BinaryReader(m_ReadStream);
        }

        public void Register<T>(int _TypeId) where T : ICustomBinarySerializable
        {
            m_ById.Add(_TypeId, typeof(T));
            m_ByType.Add(typeof(T), _TypeId);
        }

        public object Deserialize(Stream serializationStream)
        {
            if (serializationStream.Read(m_LengthBuffer, 0, 4) != 4)
                throw new SerializationException("Could not read length from the stream.");
            IntToBytes length = new IntToBytes(m_LengthBuffer[0], m_LengthBuffer[1], m_LengthBuffer[2], m_LengthBuffer[3]);
            //TODO make this support partial reads from stream
            if (serializationStream.Read(m_CopyBuffer, 0, length.i32) != length.i32)
                throw new SerializationException("Could not read " + length.i32 + " bytes from the stream.");
            m_ReadStream.Seek(0L, SeekOrigin.Begin);
            m_ReadStream.Write(m_CopyBuffer, 0, length.i32);
            m_ReadStream.Seek(0L, SeekOrigin.Begin);
            int typeid = m_Reader.ReadInt32();
            Type t;
            if (!m_ById.TryGetValue(typeid, out t))
                throw new SerializationException("TypeId " + typeid + " is not a registered type id");
            object obj = FormatterServices.GetUninitializedObject(t);
            ICustomBinarySerializable deserialize = (ICustomBinarySerializable) obj;
            deserialize.SetDataFrom(m_Reader);
            if (m_ReadStream.Position != length.i32)
                throw new SerializationException("Object of type " + t + " did not read its entire buffer during deserialization. This is most likely an imbalance between the writes and the reads of the object.");
            return deserialize;
        }

        public void Serialize(Stream serializationStream, object graph)
        {
            int key;
            if (!m_ByType.TryGetValue(graph.GetType(), out key))
                throw new SerializationException(graph.GetType() + " has not been registered with the serializer");
            ICustomBinarySerializable c = (ICustomBinarySerializable) graph; //this will always work due to the generic constraint on Register
            m_WriteStream.Seek(0L, SeekOrigin.Begin);
            m_Writer.Write((int) key);
            c.WriteDataTo(m_Writer);
            IntToBytes length = new IntToBytes((int) m_WriteStream.Position);
            serializationStream.WriteByte(length.b0);
            serializationStream.WriteByte(length.b1);
            serializationStream.WriteByte(length.b2);
            serializationStream.WriteByte(length.b3);
            serializationStream.Write(m_WriteStream.GetBuffer(), 0, (int) m_WriteStream.Position);
        }

        public ISurrogateSelector SurrogateSelector
        {
            get { return m_SurrogateSelector; }
            set { m_SurrogateSelector = value; }
        }

        public SerializationBinder Binder
        {
            get { return m_Binder; }
            set { m_Binder = value; }
        }

        public StreamingContext Context
        {
            get { return m_StreamingContext; }
            set { m_StreamingContext = value; }
        }
    }

So that it is clear how this works ... when you instantiate a custom formatter, you associate types back to integer ids. Example from my tests:

formatter.Register<TestObject>(1);

This says that when you read a type id of 1 it should be a TestObject, and vice versa: when you write a TestObject, give it a type id of 1.

 

When writing an object, the format is:

<4 bytes length><4 bytes type id><object data>

 

When we read the data we first read the 4 bytes of length (n), then read n bytes off the stream into our local copy buffer (see notes below). We then write that into an internal memory stream, seek back to its beginning, and tell the object to read its state using the binary reader we provide to it.
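
To make the flow concrete, here is a minimal round-trip sketch using the TestObject and registration shown above (the variable names are mine, not from the original tests):

    // Minimal round trip: register the type, write it out, read it back.
    CustomBinaryFormatter formatter = new CustomBinaryFormatter();
    formatter.Register<TestObject>(1);

    MemoryStream stream = new MemoryStream();
    formatter.Serialize(stream, new TestObject(42));              // writes <length><type id><object data>
    stream.Seek(0L, SeekOrigin.Begin);
    TestObject copy = (TestObject) formatter.Deserialize(stream); // reads the length, then the payload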


Performance

Before we look at all of the bad and evil things this is doing, let's try some basic performance tests. To run the tests I used the following simple object (what this library was designed to be really fast with). I grabbed this object off someone's blog who was also playing with serialization and added the interface, but I can't seem to find the link to give credit for saving me a good minute's worth of typing :).

 

    [Serializable]
    public class Customer : ICustomBinarySerializable
    {
        private String _lastname;
        private String _firstname;
        private String _address;
        private int _age;
        private int _code;

        public Customer()
        {
        }

        public Customer(String lastName, String firstName, String address, int age, int code)
        {
            _lastname = lastName;
            _firstname = firstName;
            _address = address;
            _age = age;
            _code = code;
        }

        public String LastName
        {
            get { return _lastname; }
            set { _lastname = value; }
        }

        public String FirstName
        {
            get { return _firstname; }
            set { _firstname = value; }
        }

        public String Address
        {
            get { return _address; }
            set { _address = value; }
        }

        public int Age
        {
            get { return _age; }
            set { _age = value; }
        }

        public int Code
        {
            get { return _code; }
            set { _code = value; }
        }

        public void WriteDataTo(BinaryWriter _Writer)
        {
            _Writer.Write((string) _lastname);
            _Writer.Write((string) _firstname);
            _Writer.Write((string) _address);
            _Writer.Write((Int32) _age);
            _Writer.Write((Int32) _code);
        }

        public void SetDataFrom(BinaryReader _Reader)
        {
            _lastname = _Reader.ReadString();
            _firstname = _Reader.ReadString();
            _address = _Reader.ReadString();
            _age = _Reader.ReadInt32();
            _code = _Reader.ReadInt32();
        }
    }

 

Speed

To test the speed of the serializer I chose to serialize / deserialize one of these objects 10,000,000 times to/from a MemoryStream.

    Test                   Time (lower is better)
    --------------------   ----------------------
    Serialize (Binary)     01:48.54
    Serialize (Custom)     00:06.73
    Deserialize (Binary)   02:01.29
    Deserialize (Custom)   00:08.55

So on serialization the custom formatter is a whopping 16x faster, and on deserialization it is roughly 14x faster. That's not too bad, as both are more than an order of magnitude.
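
The post doesn't show the timing harness itself; a comparable loop would look something like the following sketch (my own scaffolding, not the author's actual test code):

    // Hypothetical harness: times 10,000,000 serializations of one Customer
    // to a MemoryStream, comparable to the table above.
    Customer customer = new Customer("Smith", "John", "1 Main St", 40, 7);
    CustomBinaryFormatter formatter = new CustomBinaryFormatter();
    formatter.Register<Customer>(1);
    MemoryStream stream = new MemoryStream();

    System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < 10000000; i++)
    {
        stream.Seek(0L, SeekOrigin.Begin); // reuse the same stream; we only care about CPU cost
        formatter.Serialize(stream, customer);
    }
    watch.Stop();
    Console.WriteLine(watch.Elapsed);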

 

Size

The other area I really wanted to optimize is the size of each message, as it is common for us to have 40+ GB transaction files for a day (disk IO is expensive). Because we are not writing the same kind of schema information that the binary formatter does, we can also be quite a bit smaller than its output. For the object given, the BinaryFormatter produces 232 bytes of output while the custom formatter produces 41. This message has quite a few strings, which add to the amount of serialized data (on our messages (about 40 of them) we average about a 1/10 ratio between the two). Even so, it's still more than a 5x reduction in storage space required.
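
If you want to reproduce the size comparison, a sketch like this works (again my own scaffolding, assuming the Customer class above):

    // Hypothetical size check: serialize one Customer with each formatter
    // and compare the number of bytes produced.
    Customer customer = new Customer("Smith", "John", "1 Main St", 40, 7);

    MemoryStream binaryStream = new MemoryStream();
    new BinaryFormatter().Serialize(binaryStream, customer); // System.Runtime.Serialization.Formatters.Binary

    CustomBinaryFormatter custom = new CustomBinaryFormatter();
    custom.Register<Customer>(1);
    MemoryStream customStream = new MemoryStream();
    custom.Serialize(customStream, customer);

    Console.WriteLine("BinaryFormatter: {0} bytes, custom: {1} bytes",
                      binaryStream.Length, customStream.Length);

Don't let this fool you though; there are some ...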

 

Problems

There are a number of problems with this type of strategy. It is imperative that you know about the tradeoffs involved with this code before using it. This was written for a niche situation and it may really hurt you if you aren't careful!

 

Versioning

There is no versioning information in the data by default. You could easily provide it in your custom serialization implementation, but the formatter does not provide it for you.
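
For example (an illustration of one way to do it, extending the TestObject above; this is not something the formatter gives you), an object could prefix its own payload with a version byte:

    // Illustrative only: the object writes and checks its own version tag.
    public virtual void WriteDataTo(BinaryWriter _Writer)
    {
        _Writer.Write((byte) 1); // version of this object's wire format
        _Writer.Write((int) Integer);
    }

    public virtual void SetDataFrom(BinaryReader _Reader)
    {
        byte version = _Reader.ReadByte();
        Integer = _Reader.ReadInt32();
        // later versions could branch here: if (version >= 2) { read the new fields }
    }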

 

Endianness

One of the interesting things here is dealing with the length. I have done this using a quite unsafe (but faster) solution.

    [StructLayout(LayoutKind.Explicit)]
    public struct IntToBytes
    {
        public IntToBytes(Int32 _value) { b0 = b1 = b2 = b3 = 0; i32 = _value; }

        public IntToBytes(byte _b0, byte _b1, byte _b2, byte _b3)
        {
            i32 = 0;
            b0 = _b0;
            b1 = _b1;
            b2 = _b2;
            b3 = _b3;
        }

        [FieldOffset(0)]
        public Int32 i32;
        [FieldOffset(0)]
        public byte b0;
        [FieldOffset(1)]
        public byte b1;
        [FieldOffset(2)]
        public byte b2;
        [FieldOffset(3)]
        public byte b3;
    }

This has endianness problems if you use it across machines with different endianness, say Mono on a PPC vs. the CLR on x86. You could easily get around this by just using BitConverter instead (or doing some binary arithmetic, if you miss having real reasons for doing so :)). For us, however, most of these objects are being serialized between processes on the same machine, so it's not an issue.
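
A sketch of the binary-arithmetic route, which is byte-order independent because shifts operate on values rather than on memory layout (my code, not from the post):

    // Write the length as explicit little-endian bytes, independent of machine endianness.
    int length = (int) m_WriteStream.Position;
    serializationStream.WriteByte((byte) (length & 0xFF));
    serializationStream.WriteByte((byte) ((length >> 8) & 0xFF));
    serializationStream.WriteByte((byte) ((length >> 16) & 0xFF));
    serializationStream.WriteByte((byte) ((length >> 24) & 0xFF));

    // ...and reassemble it the same way on the read side.
    int read = m_LengthBuffer[0]
             | (m_LengthBuffer[1] << 8)
             | (m_LengthBuffer[2] << 16)
             | (m_LengthBuffer[3] << 24);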

 

Copying of Data

Another problem (read: decision) has to do with how the formatter deals with the stream internally. It copies data off the stream into an internal memory buffer; it does this so it can reuse the same BinaryReader/BinaryWriter every time. This makes it non-reentrant and forces the copy, but in testing with many very small messages the copying of the data turned out to be faster than creating a new reader/writer on the original stream on every iteration. This may turn out differently for you; I will leave it as an exercise for the reader to change this (I promise it won't take more than five minutes).
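
If you want a head start on that exercise, one possible shape of the change is below: a hypothetical method added to the formatter that pays for a per-call reader instead of the copy (a sketch only; I haven't benchmarked it):

    // Hypothetical variant: read directly from the caller's stream with a
    // per-call BinaryReader instead of copying into the internal buffer.
    public object DeserializeWithoutCopy(Stream serializationStream)
    {
        BinaryReader reader = new BinaryReader(serializationStream);
        int length = reader.ReadInt32(); // native-endian, as in the original
        int typeid = reader.ReadInt32();
        Type t;
        if (!m_ById.TryGetValue(typeid, out t))
            throw new SerializationException("TypeId " + typeid + " is not a registered type id");
        ICustomBinarySerializable obj = (ICustomBinarySerializable) FormatterServices.GetUninitializedObject(t);
        obj.SetDataFrom(reader);
        return obj; // note: skips the length-vs-position sanity check of the original
    }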

 

Typing

It's a lot of typing in your objects (we can work around this with some IL generation), but that's a whole other post, isn't it?

 

Anyway, I hope people enjoy this and can find a niche place of their own to use such a strategy.



