Rapid Addition’s Rupert Smith spoke to Werner Schuster at QCon London 2012.
Transcript and video courtesy of InfoQ
We are here at QCon London 2012 and I am sitting here with Rupert Smith. Rupert, who are you?
My name is Rupert Smith and I am currently working for Rapid Addition. Kevin Houstoun and I have been here today giving a talk on our FIX engine, specifically looking at low-latency performance.
So, why don't you like latency?
Latency can mean you miss your trade. Basically, if somebody is selling something in the market, you have to be the quickest to get to it if you want a chance. It's not necessarily just about being the quickest; as we were saying before, it's about making the software real time, which simply means making it more predictable. There are things like garbage collection and other sources of jitter that mean you have a certain chance of missing a trade, so even though our FIX engine might perform quite well and have low latency, there will always be that set of high-latency outliers that you might run into. So we initially started out just with the aim of making the FIX engine more predictable and removing the high-latency outliers, and what we actually found is that we managed to lower the latency baseline as well. So it's about being able to give people predictable execution and avoid running into latency spikes.
So what’s your approach to avoiding these? Do you use Java?
The FIX engine I have written was originally written in C#. Prior to this I worked on Apache Qpid and had various ideas about how messaging software should be written, which I hadn't necessarily been able to apply at the time. When I interviewed at Rapid Addition they told me about their ideas, so we merged our ideas together; a lot of what I have implemented is not my own, I have taken what existed in the C# engine. One thing they suggested doing, which is quite controversial, is to make the engine garbage free. The whole point of C# and Java is that these languages have garbage collectors, so you can allocate objects and not worry; they get cleaned up after you. And obviously, every time the garbage collector runs it can pause your program and inject latency into it. So they suggested we make it garbage free. When Java first came out, and we might be talking pre Java 1.0, people actually used to make their programs garbage free, because object allocation was very slow in some of the original versions. But now object pooling, for example, is considered an anti-pattern, something you shouldn't do because it's going to make your program slower, so it's quite a controversial thing to attempt.
So how did you do it? Did you use object pools, or did you rewrite your code to just use primitives?
We've got a hierarchy of four different techniques. First, the Java compiler itself is smart enough to eliminate some of the garbage: if you allocate an object within just one method and you don't return it, the compiler says, well, that object never escapes the scope it's in, so I can allocate its fields on the stack and it never has to be allocated on the heap and cleaned up afterwards. So usually I write the program, then run it under a profiler and only eliminate the garbage that is actually there, which might be specific to a particular implementation of Java, because there is not just the Oracle JDK, there's IBM's and others as well; I've just gone for Oracle Java. So that's the first thing we do: let the compiler take care of it. The second thing you can do is make your objects mutable, so you can change the fields. Generally that's considered bad practice; the Java libraries are immutable where they can be, because if I pass a String to a method you've written and your method were to change that String without telling me, that would be very annoying. But you can make an object mutable and reuse it by changing its fields at a later point in time. That can let you introduce bugs into your program, which makes things trickier, so it's not necessarily a nice thing to do, but it can save you garbage. The next level is object pooling, which is basically mutable objects that you keep in an array somewhere: you take some out, use them, and when you're finished you put them back again. You might use that in a situation where you don't know how many you need in advance. And the highest-level technique is reference counting, where you need to pass data from one thread to another, so if you have two threads operating on the same data, you don't know which one is going to finish first.
So when your reference count goes back to zero, you know they have both finished and it's safe to put the object back into the pool. In fact we don't use object pooling on its own very much; it's only used in one place in the code, where I don't know how many of a particular kind of object I have to allocate, and reference counting is really only used on the actual FIX message objects, where they're handed between one thread and another.
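As a rough illustration of the last two techniques described above, here is a minimal sketch in Java of a message pool combined with reference counting. The class names and fields (MessagePool, PooledMessage, price, quantity) are hypothetical, not Rapid Addition's actual code; the point is only that all allocation happens up front and the last thread to release a message returns it to the pool.

```java
import java.util.ArrayDeque;

// Mutable, reusable message object carrying a reference count so two
// threads can release it independently; the final release returns it
// to the pool instead of letting it become garbage.
final class PooledMessage {
    private final MessagePool owner;
    private int refCount;
    long price;      // mutable fields, overwritten for each new message
    long quantity;

    PooledMessage(MessagePool owner) { this.owner = owner; }

    void init(long price, long quantity) {
        this.price = price;
        this.quantity = quantity;
        refCount = 1;                   // the caller holds the first reference
    }

    void retain() { synchronized (this) { refCount++; } }

    void release() {
        boolean last;
        synchronized (this) { last = (--refCount == 0); }
        if (last) owner.giveBack(this); // back to the pool, no garbage created
    }
}

// Fixed-size pool: every object is allocated once, in the constructor.
final class MessagePool {
    private final ArrayDeque<PooledMessage> free = new ArrayDeque<>();

    MessagePool(int size) {
        for (int i = 0; i < size; i++) free.push(new PooledMessage(this));
    }

    synchronized PooledMessage acquire(long price, long quantity) {
        PooledMessage m = free.pop();   // reuse, rather than 'new'
        m.init(price, quantity);
        return m;
    }

    synchronized void giveBack(PooledMessage m) { free.push(m); }

    synchronized int available() { return free.size(); }
}
```

In this sketch a producer thread would `acquire` a message, `retain` it once more before handing it to a consumer thread, and each thread would call `release` when done; whichever finishes last puts the object back.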
But all of these techniques, they don't actually guarantee that there will never be a garbage collection, because if you use some library that does naughty things there might still be one.
Well, there is no third-party open source code in our FIX engine; we wrote all the libraries that we use ourselves. Obviously we use what is in the Java standard library, but if you are trying to be garbage free you might run into problems because something in the standard library is not actually garbage free, so then you have to write your own implementation. I've been fairly lucky with that; it caused me a few headaches, but mostly I was able to work around them.
So basically you have to look at all of the data paths and code paths and see whether they contain any 'new' operators?
Well, 'new' is OK if you are going to hang on to the object for a long time; it's when you are creating things and throwing them away quickly that there's a problem. If I'm trying to make some code garbage free, what I generally do is run it under JProfiler and look at the allocation hot spots. That tells me where objects are being allocated continuously and how much memory I am using, so I can focus quite quickly on where the garbage is coming from. You can also use command-line options with Java, like -Xloggc, which logs information every time the JVM does a garbage collection; I run with that and check that it's not actually running the garbage collector. I do things like leave it running for the weekend and come back on Monday to check that it ran all weekend with no garbage collection.
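For illustration, the GC logging he describes can be switched on with standard HotSpot flags like these; the main class name and heap sizes here are hypothetical, and only the flags themselves are real:

```shell
# Log every garbage collection to gc.log; on a truly garbage-free run,
# the log should stay free of collection events over a weekend soak test.
# Sizing -Xms equal to -Xmx also avoids heap-resizing pauses.
java -verbose:gc -Xloggc:gc.log \
     -Xms4g -Xmx4g \
     com.example.FixEngineSoakTest   # hypothetical main class
```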
That's very interesting; that is a pure code approach to low latency, basically no GCs. So you don't use any kind of realtime VM? Have you used realtime virtual machines?
No, I haven't at all, because a realtime virtual machine will only let you allocate things on the stack. I think the realtime Java VM is not free, though, so if we produced a version that required a realtime VM, that might put people off using it, because they would have to buy the realtime version.
So you are doing interesting things with FPGAs, programmable hardware. Why do you need that; isn't the CPU good enough?
The CPU is pretty good; CPUs are very fast, I mean they're clocked at up to 4 GHz, and an FPGA might only be clocked at about 600 MHz. But you can do things in parallel in hardware that you can't do in software, so even despite the lower clock speed you can actually get a speed-up, if you target your implementation at a very specific problem you are trying to solve.
So what kind of problems do you solve with the FPGA?
I suppose one thing is the serialization and deserialization of the message, which we have written in software. It's an ASCII-based protocol, and we are constantly converting the ASCII message into a binary format to give to the application: you'd write your application to expect the price, for example, as a decimal number, it doesn't want ASCII, and likewise when you send an order, that gets converted back into ASCII. So we are going to implement the ASCII-to-binary and binary-to-ASCII part in hardware to make it as quick as possible. In software a cycle of that loop may take about eight microseconds; in hardware we are looking to get it down to much less than that.
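To make the ASCII-to-binary step concrete, here is a hedged sketch of what such a conversion does in software: a FIX price arrives as ASCII digits (for example "101.25") and is turned into a scaled long (here with four implied decimal places) without allocating any intermediate String or BigDecimal. The class name, method, and scale are illustrative assumptions, not Rapid Addition's actual code or the format their engine uses.

```java
// Converts an ASCII decimal field, read directly from a message buffer,
// into a fixed-point long with SCALE implied decimal places.
final class AsciiDecimal {
    static final int SCALE = 4;  // assumed fixed scale: 101.25 -> 1012500

    // Parses bytes[from, to) without allocating anything on the heap.
    static long parse(byte[] bytes, int from, int to) {
        long value = 0;
        int decimalsSeen = -1;           // stays -1 until we pass the '.'
        boolean negative = false;
        for (int i = from; i < to; i++) {
            byte b = bytes[i];
            if (b == '-') {
                negative = true;
            } else if (b == '.') {
                decimalsSeen = 0;
            } else {
                value = value * 10 + (b - '0');
                if (decimalsSeen >= 0) decimalsSeen++;
            }
        }
        // Pad out to the fixed scale, e.g. "3.5" -> 35000 at SCALE 4.
        for (int i = Math.max(decimalsSeen, 0); i < SCALE; i++) value *= 10;
        return negative ? -value : value;
    }
}
```

Working on the raw byte buffer like this is what makes the software path garbage free, and it is also the kind of bit-level, field-at-a-time operation that maps naturally onto hardware.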
So, how do you actually use the hardware? Do you write Verilog or something?
Rupert Smith: It’s all written in Verilog but I am not actually doing that, somebody else is the hardware guy.
Werner Schuster: That’s always good.
Rupert Smith: Don’t ask me too many questions about that.
So how do you approach it; how do you design for an FPGA? Do you use different algorithms? How does that work?
An FPGA is really cool because a program for an FPGA, written in a hardware description language, runs entirely in parallel; every line of code is executing simultaneously. It's quite a bizarre thing to get your head around to begin with, but once you begin to understand it, it is really cool.
So with an FPGA do you have to write everything from scratch or can you use libraries?
Rupert Smith: Well, there are libraries for things, but it's not quite like software, where you have lots of libraries. You are doing very primitive operations with bits; that's why it can be fast, because you are designing an electrical circuit made out of gates, and ultimately that's juggling bits, so you can get right down to the lowest level of detail and make everything just how you want it. Basically the FIX engine has layers. It has a session layer on top, which handles things like logons and heartbeat messages for you; mostly messages just pass through and we give them to the application. Below that is the translation layer that does the serialization and deserialization between ASCII and binary, and below that is a network layer. We're moving the bottom two layers of that stack down into the hardware, so the message arrives from the network, we turn it into binary format and hand it straight up to the session layer, which really just passes it on to the application. So it's like zero-copy IO; there will not be any copying going on internally in the stack. During my talk I had a slide showing some of the copying that can actually go on between the network card and the application, within the TCP stack itself and within the FIX engine; we're eliminating all that copying so the data is handed over directly. That's something we can do in hardware that we can't do in software, and likewise cut-through. TCP is not really designed for cut-through, because there is a message length and a checksum in the header of the message, so you can't write those out until you have finished writing the body, because they go at the beginning.
But when a message is arriving off the network, we can certainly stream it into the FIX core straight away, before we have even received the last bit of it off the network. If we encounter an error in the message later on, after we've started passing it to the FIX core, we need to send a signal saying "that last message is corrupt, by the way; forget that last one". So even though the protocol is not designed with cut-through in mind, there are tricks we can do to get at least some kind of cut-through on it. Likewise, the PCI bus protocol is not really designed for cut-through; I am not actually too familiar with the format, I just know that it isn't. I read a paper comparing the PCI bus to the HyperTransport bus, and HyperTransport was designed with cut-through: you have a header on your message, a message body, and then a tail at the end, and you put the checksum in the tail, so you don't need to know it up front before you even start sending data. I don't know for certain, but I think the Intel bus is also much better designed for cut-through. We can do a similar trick, though: when you are sending a message you put it on the PCI bus, and we can start feeding it into the FIX core straight away; we may find out that the message is corrupt and cancel the operation, but at least in one direction we are getting some use of cut-through. So that lets us merge the layers of the stack a bit and have them running in parallel. That's where we get an advantage with the FPGA.
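A small sketch of why cut-through can work on the receive side: the FIX checksum (tag 10, at the end of the message) is defined as the sum of every preceding byte modulo 256, so it can be accumulated as bytes stream past and only compared once the trailer arrives; if it fails, the engine signals "discard that last message" after the fact, as described above. The class name here is illustrative, and this is a software model of the idea rather than the actual hardware.

```java
// Accumulates the FIX-style checksum incrementally, one byte at a time,
// so parsing can begin before the whole message has been received.
final class StreamingChecksum {
    private int sum;

    // Called for each byte as it streams off the network.
    void feed(byte b) { sum = (sum + (b & 0xFF)) & 0xFF; }

    // Called once the trailer arrives; 'declared' is the tag 10 value.
    // A mismatch means the already-forwarded message must be discarded.
    boolean matches(int declared) { return sum == declared; }

    void reset() { sum = 0; }
}
```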
Werner Schuster: That's a really powerful solution.
Rupert Smith: Yes, you can’t do these things in software, you don’t have that option.
Werner Schuster: Because you need signals to race ahead. OK, that's very useful.
Rupert Smith: Yes, and Intel have been very helpful; they have their QPI bus as well, and they have been quite keen to promote that to us as an option. So you'd have a network card where the network cable goes straight into something that plugs into the CPU socket and talks directly to the bus. That would be a lower-latency solution than PCI. The implementation we are doing now on the FPGA is going to be on a PCI card.
So these are all very useful techniques and I think we should all check out FPGAs for our low latency needs.
I would recommend getting a book; reading up on it is really quite fascinating. Although I am not doing the hardware implementation myself (someone else has been employed to do that), I did find out quite a lot about it so I can understand it, and I was quite blown away by what you can do. It's quite a cool thing.
So it seems to be a new avenue for lots of things.
Yes, I mean you can buy a Xilinx board on eBay for thirty dollars; that is a really basic starter board. Get that and a book and have a play around with it.
But you have to learn Verilog or VHDL.
Rupert Smith: Yes.
Werner Schuster: That’s a high level-ish language. It has curly braces.
Rupert Smith: I think VHDL is considered to be a bit more high level than Verilog; it has a type system. I don't actually know VHDL. Verilog is considered a slightly lower-level, get-down-and-dirty kind of language compared to VHDL.