Since FPGAs hit the institutional capital markets scene with force around 2008, adoption of FPGAs has been sparse within the capital markets ecosystem. But the successful development of a new use case that demonstrates the potential of hybridized FPGA-CPU solutions could open a door to a whole new range of alpha discovery and capture possibilities.
Despite the technology initially becoming available in the mid-1980s, FPGAs hit the institutional capital markets scene with force around 2008, with the introduction of the first low-latency, hardware-accelerated market data solutions. With the exception of additional reports of their use for real-time risk analysis among select Tier 1 banks in mid-2011 (and despite their growing popularity as a lower-cost ASIC substitute for use cases in other industries), news of further adoption of FPGAs has been sparse within the capital markets ecosystem.
TABB Group believes that this scenario is about to change. The successful development of a new use case around enhanced-performance FIX engines is on the verge of bringing FPGAs back into the headlines. These new hybridized and general purpose FIX engines feel and act just like 100 percent software solutions yet operate a heck of a lot faster. Moreover, this use case also represents a template wherein hybridized FPGA-CPU workload solutions demonstrate the potential to open a door to a whole new range of alpha discovery and capture possibilities. What follows here is the background and details for a potential resurgence of FPGA usage in capital markets:
Field-programmable gate arrays (FPGAs) are integrated circuits (ICs) that are designed to be configurable by customers or designers after they have been manufactured. FPGAs represent an alternative to functionality-specific or application-specific integrated circuits (ASICs), but with the key distinction of “field programmability.” This added flexibility offered by FPGAs gives end users the ability to implement custom logical functions (unlike ASICs, which have a fixed design built into them at the foundry), with the added benefit of post-shipping partial or total reconfiguration of design, no mask-set costs associated with an FPGA design, and far lower non-recurring engineering (NRE) costs relative to ASIC designs.
FPGAs offer numerous other advantages over ASICs, as well. Leading that list are general benefits that include a reputation for easier debugging, potential for rapid prototyping, lower total cost of ownership (TCO), and a lower risk of product obsolescence. In other words, FPGAs offer greater potential for “future-proofing.”
Perhaps most important for capital markets firms – more so than the inherent re-configurability and lower TCO of FPGAs – are the performance parameters. Ultralow latency, combined with extremely high I/O throughput, makes FPGAs highly desirable for certain types of high-performance computing (HPC) use cases. This is particularly true for a growing spectrum of trading- and risk-related applications in global markets.
Rise in demand for customizable integrated circuits, TCO considerations (including power efficiency), and higher-performance solutions all serve as the key long-term drivers for the growth of the overall FPGA market. With these factors as a backdrop, analysts who focus on the broader semiconductor market have consistently expected the FPGA market to grow faster than it ultimately has. The consensus conclusion to date for these misses has been that the relative stagnation of FPGA growth is mostly due to the inefficiency of the predominant SRAM-based FPGA technology. In other words, the “high programmability overhead” suggests that many of the current ASIC designs cannot be replaced by their FPGA equivalent. So what has really happened is that many designers chose to use older (ASIC) node standard cells instead of an FPGA – a facet of the narrative that is really beyond the scope of this analysis, but worth mentioning nonetheless.
Another challenge comes from design complexity: code updates for FPGAs can take months to develop and test, which challenges the aforementioned rapid prototyping thesis and suggests that specialized developer skills are required.
Fortunately, FPGA adoption is not dependent on any one industry, since FPGAs are used in so many, including telecommunications, automotive, and medical imaging, among others. With growth in use cases across multiple industry verticals, FPGAs represent a global market that is far larger than one would initially consider for such a specialized technology. TABB Group believes that increasing proliferation in high-performance and big data use cases across the board serves as a strong tailwind for higher-than-normal growth expectations. Grand View Research and other specialist research firms estimate that the global market for FPGAs (for all use cases) was valued at US$5.4 billion in 2013, and is expected to reach US$9.9 billion by 2020, growing at a CAGR of 9.1% from 2014 to 2020.
In Exhibits 1 and 2, below, we illustrate overall FPGA market growth forecasts based on both empirical data (which is highly correlated to the overall long-term semiconductor industry CAGR of 5.5%) and the enhanced expectations mentioned above (at CAGR 9.1%).
Exhibits 1 and 2: FPGA Market Dominance, Size and Growth
Sources: Grand View Research, EE Times, TABB Group
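The growth figures above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 2013 valuation is the base year and that growth compounds over seven annual periods to 2020, and the function names are hypothetical.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at a fixed annual rate."""
    return start_value * (1 + rate) ** years


# Figures quoted in the text, in US$ billions.
MARKET_2013 = 5.4
MARKET_2020 = 9.9
YEARS = 2020 - 2013  # assumption: seven compounding periods

implied_rate = cagr(MARKET_2013, MARKET_2020, YEARS)
baseline_2020 = project(MARKET_2013, 0.055, YEARS)  # semiconductor-like 5.5% CAGR

print(f"Implied CAGR 2013-2020: {implied_rate:.1%}")
print(f"2020 market size at a 5.5% CAGR: US${baseline_2020:.1f} billion")
```

Under these assumptions the implied rate lands close to the quoted 9.1%, and the gap between the two 2020 projections gives a rough sense of how much the enhanced expectations add over the long-term semiconductor trend.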
Beyond the growth issues, the FPGA market is dominated by a concentration of two manufacturers: Xilinx and Altera. Recent estimates illustrate that these two firms represent two-thirds of the market, with another six manufacturers comprising the balance. On top of this, TABB Group outreach suggests that performance differences are not material between the chipsets of the top manufacturers, thereby suggesting some level of commoditization.
Capital Markets Adoption
FPGAs came on the institutional capital markets scene around 2008-2009, near the peak of the HFT profitability boom – most notably with Activ Financial’s launch of its hardware-accelerated market data solution. Among the architectural benefits for speed, the ability to filter essential metadata items and feed them into nearby cores led to increasing FPGA adoption for these and related use cases.
In particular, the post-global financial crisis (GFC) focus on improved risk analytics, and higher update frequencies of those analytics, led Tier 1 banks to experiment aggressively – reputed to have tried “nearly everything” – on FPGAs. While most global banks eventually gave up, having found that successful implementations were way more difficult and expensive than originally anticipated (due to many factors related primarily to the cost of specialized skills and compute power), some notable success stories did emerge from early experimentation with FPGAs.
For instance, by mid-2011, JP Morgan claimed to be able to compress the computation time required for the pricing and associated risk analytics for a global credit portfolio to near real time – from an 8-hour runtime in 2008 to an FPGA-enhanced update frequency of 12 seconds. To achieve this, JP Morgan’s original C++ code – which used a lot of templates and objects – was rewritten, removing these and other C++ abstractions. This “flattening” exercise made it easier to identify opportunities for parallelism that the FPGA design could exploit. The resulting FPGA VHDL (or VHSIC Hardware Description Language) could be executed in parallel making extensive use of pipelining techniques. Additional sources of performance came from combining FPGAs with multi-node servers (with 2 FPGAs per node).
But aside from isolated success from proprietary developments, the buzz around FPGAs languished in the years following the GFC. TABB Group believes that this lull makes sense given the regulatory-induced distractions of the post-GFC era coupled with the overall declines in the profitability of high-frequency trading strategies. Yet, like the initial vendor-led usage of FPGAs around low-latency market data, it is the solution provider community today that is bringing FPGAs back to the forefront.
The Hybridized General Purpose FIX Engine
Enter: The new hybridized general purpose (HGP) FIX engine. A UK-headquartered firm, Rapid Addition, is the force behind the initial developments of this new FIX engine. At the base of it, the HGP-FIX approach achieves a clever balancing act: by removing code that would otherwise run on the central processing unit (CPU) and placing it into dedicated electronics on the FPGA, this approach reduces the amount of code running on the CPU. The FIX data model and the rules for parsing FIX messages are imaged onto the FPGA, and the business logic that interacts with that data model runs on the CPU.
Among the core benefits of this approach – other than the performance enhancements (which we will get to in a minute) – is the fact that the HGP-FIX engine solution is designed to be used for any FIX use case – just like the functionality of any other FIX engine (which would typically run entirely on the CPU). As such, this hybrid FIX engine can serve as a replacement for other FIX engines, such as QuickFIX, Cameron and Appia, wherever latency or broader performance needs have increased beyond the abilities of those products.
Here’s the setup behind the intended benefits of this approach: FIX engines don’t have any business logic built into them; they are essentially a workflow component used to send messages. Because the FIX data model is relatively static, it lends itself well to the use of FPGAs. The FPGA sits on a specialized network interface controller (NIC) card that is inserted into a standard PCIe slot, thereby creating the computational equivalent of a Swiss Army knife. Because of the separation of code between the FPGA and CPU, if your business logic changes, you don’t have to touch the FPGA. And since this approach is designed to allow firms to manage the business logic in Java, C or .NET code, end users retain the agility of a regular software-centric FIX engine, except with FPGA-level performance. In short, end user application code can remain unchanged.
Initial testing of this hybrid FPGA-CPU solution is compelling, resulting in an average round-trip latency of about 5 microseconds (µs) versus the software-only solution, which shows average round-trip latencies of 13µs (using a specialized NIC such as those from Solarflare or Mellanox) and 25µs (using a general purpose NIC), representing improvements of 100%-500% (see Exhibit 3, below). There also is much lower “jitter” (i.e., standard deviation of performance) of ~1µs in the hybrid FPGA-CPU version relative to ~4µs in the software-only solution. Furthermore, upcoming improvements in this hybrid FPGA-CPU approach not only bode well for FPGA adoption, but also may influence changes in choice of NICs for relevant use cases.
Exhibit 3: Hybrid FIX Engine – FPGA vs. Software
Source: Rapid Addition, TABB Group
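For readers who want to relate the rounded test figures to one another, the ratios can be computed directly. The sketch below uses only the approximate latencies quoted in the text (5µs, 13µs and 25µs); the variable and function names are hypothetical, and percentage ranges quoted elsewhere in this piece may rest on unrounded measurements that this arithmetic cannot reproduce.

```python
# Rounded average round-trip latencies quoted in the text, in microseconds.
HYBRID_US = 5.0
SOFTWARE_US = {
    "specialized NIC": 13.0,
    "general purpose NIC": 25.0,
}


def speedup(baseline_us: float, improved_us: float) -> float:
    """How many times faster the improved figure is than the baseline."""
    return baseline_us / improved_us


def fraction_of_baseline(baseline_us: float, improved_us: float) -> float:
    """Improved latency expressed as a fraction of the baseline latency."""
    return improved_us / baseline_us


for label, baseline in SOFTWARE_US.items():
    print(
        f"{label}: {speedup(baseline, HYBRID_US):.1f}x faster, "
        f"hybrid latency is {fraction_of_baseline(baseline, HYBRID_US):.0%} of software-only"
    )
```

Expressing the comparison both as a speed-up multiple and as a fraction of the baseline avoids the ambiguity that often creeps into "percent improvement" figures.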
Furthermore, consider an example of a rapidly evolving market landscape, such as smart order routing (SOR) for FX trading – a growing theme as the FX markets become increasingly automated and liquidity becomes increasingly targeted by highly automated and high-turnover trading strategies. Most technology interfaces in that marketplace are FIX-compliant, and latency clearly matters in this market as much as any other. If your firm’s SOR business logic is written into an FPGA – and then the competitive market dynamics change, which they often will at this stage of its evolution and level of competition – this logic may need frequent recalibration. Therefore, solutions that embed business logic images onto FPGAs (in hopes of improving overall performance) will fail to keep pace, often needing months to develop, test and implement new logic.
However, with the hybrid FPGA approach – where the FIX data model is imaged onto the FPGA and algorithms or other business logic are maintained on the CPU – there are opportunities for significant performance improvements while retaining the agility of a traditional software-only solution. (For this hybrid approach, the current and immediately previous versions of FIX are simultaneously maintained on the FPGA. New versions of the FIX data model can be loaded onto the FPGA at run time. A setup program can upgrade the FPGA image and also confirm that the latest updates have been loaded successfully.)
In practice, when you start a FIX engine, a list of the metadata tags that will be used is fed into that engine so that it knows how to parse incoming messages. This is essentially the same approach as the 100% software method, just about 100%-500% faster – and with throughput improvements approaching 800,000–1.2 million messages per second (MPS), too. In other words, it doesn’t matter which combination of FIX metadata fields is used for an execution report or market data message, for instance. Any combination of fields can be handled by the hybrid approach, same as traditional software-only solutions.
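To make the tag-dictionary idea concrete, here is a minimal software-only sketch in Python. It is not Rapid Addition's design or API: the class, the dictionary and the sample order are hypothetical, and the message omits BodyLength and CheckSum for brevity. What it does show is the division of labor described above: a parser that is configured once with a list of tags, and business logic that only ever sees parsed fields.

```python
from typing import Dict

SOH = "\x01"  # standard FIX tag=value field delimiter

# Hypothetical start-up dictionary: FIX tag number -> field name.
TAG_NAMES: Dict[int, str] = {
    8: "BeginString",
    35: "MsgType",
    55: "Symbol",
    54: "Side",
    38: "OrderQty",
    44: "Price",
}


class FixParser:
    """Parses tag=value messages using a dictionary supplied at start-up.

    In the hybrid design described in the text, this is the part that
    would be imaged onto the FPGA; here it is plain software.
    """

    def __init__(self, tag_names: Dict[int, str]) -> None:
        self.tag_names = dict(tag_names)

    def parse(self, raw: str) -> Dict[str, str]:
        fields: Dict[str, str] = {}
        for pair in raw.split(SOH):
            if not pair:
                continue  # skip the empty piece after the trailing delimiter
            tag, sep, value = pair.partition("=")
            if not sep or not (tag.isascii() and tag.isdigit()):
                raise ValueError(f"malformed field: {pair!r}")
            name = self.tag_names.get(int(tag))
            if name is None:
                raise ValueError(f"tag {tag} not in start-up dictionary")
            fields[name] = value
        return fields


def notional(fields: Dict[str, str]) -> float:
    """Hypothetical business logic: it touches parsed fields, never raw bytes."""
    return float(fields["OrderQty"]) * float(fields["Price"])


parser = FixParser(TAG_NAMES)
raw_order = SOH.join(
    ["8=FIX.4.4", "35=D", "55=EUR/USD", "54=1", "38=1000000", "44=1.25"]
) + SOH
order = parser.parse(raw_order)
print(order["Symbol"], notional(order))
```

Because `notional` depends only on the parsed dictionary, the business logic could change without any change to the parser, which is the property the hybrid approach relies on.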
Source of Speed (FIX Engine)
The essential source of speed for the new FIX engine comes from the hybrid platform architecture and how it allocates resources between the loads that are applicable to the FPGA and the loads that are applicable to the CPU (see Exhibit 4, below). Moving TCP/IP processing and the FIX message parser onto the NIC allows these steps to run in parallel rather than sequentially.
Exhibit 4: Hybrid vs. Traditional Platform Architecture
Source: Rapid Addition, TABB Group
For best performance, firms would typically code only a few business processes onto a server. Though the NICs (with FPGA embedded) can handle up to 32 FIX sessions – and therefore potentially connect to several markets per board – this strategy will not yield the lowest latencies. The reasons for targeting multiple sessions (connected to multiple markets) per board are principally space, design expediency, and the availability of multi-core (16 or maybe 32) servers. However, the highest speeds are achieved by dedicating and isolating one core per FIX session (and designating other cores to perform everything else).
Moreover, if you wanted one session to go as fast as technically possible, you would likely have one FPGA per box and then “pin” both the business logic (i.e., algo) and the hybrid FIX engine – in this case – to adjacent cores on the die. This is because different cores on the die have different levels of communication (and latencies) among them and other components of the server. In parallel with this tactic, all other computations – basically anything that could interrupt the FIX engine and its business logic – must be pinned to other cores so there is no risk of “cache pollution.” In other words, you don’t want any other code moving into the cache and slowing the targeted functionality.
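The pinning tactic can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a tuning recipe: the layout function and its role names are hypothetical, consecutive core IDs are assumed to be physically adjacent (which depends on the actual processor topology), and `os.sched_setaffinity` is only available on Linux.

```python
import os
from typing import Dict, Iterable, List, Union


def plan_core_layout(
    available_cores: Iterable[int],
) -> Dict[str, Union[int, List[int]]]:
    """Reserve two neighboring core IDs for the FIX engine and the algo.

    Assumption: consecutive IDs share the lowest-latency path; real
    topologies should be checked before relying on this.
    """
    cores = sorted(set(available_cores))
    if len(cores) < 3:
        raise ValueError("need two isolated cores plus at least one for everything else")
    return {"fix_engine": cores[0], "algo": cores[1], "other": cores[2:]}


def pin_current_process(core: int) -> bool:
    """Pin the calling process to a single core; returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})  # pid 0 means the calling process
        return True
    return False


# Hypothetical 8-core server; only the plan is computed here.
layout = plan_core_layout(range(8))
print(layout)
```

In a real deployment each process would call something like `pin_current_process` with its assigned core at start-up, and all remaining work would be restricted to the cores listed under `other` so that nothing else can evict the critical code and data from cache.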
A quick sidebar on the advantages of running less software on the CPU: If we were running a full FIX engine, the FIX engine (repository) component might take 60% of the cache, and the business logic / algo might take up to another 60% of the cache. In this scenario, the two programs together would want more than the available cache, so there might be some “spiking” as they compete for it. However, with the hybrid version, there is much less software because the parsing, for example, no longer uses the cache. In the hybrid FPGA FIX engine solution, the FIX engine component takes up potentially only 10% of the cache, leaving 90% of the cache available for the algorithm. And that yields an improvement in performance because getting something from the cache into the CPU usually takes low 10’s of nanoseconds (while retrieving something from main memory to CPU can take 100’s of nanoseconds).
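The effect of cache availability on speed can be illustrated with the standard weighted-average formula for memory access time. The numbers below are placeholders, not measurements: 10ns and 100ns stand in for the orders of magnitude mentioned above, and the two hit rates are purely hypothetical values chosen to contrast a crowded cache with a roomy one.

```python
def average_access_ns(
    hit_rate: float, cache_ns: float = 10.0, memory_ns: float = 100.0
) -> float:
    """Average access time: hits are served from cache, misses from main memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# Hypothetical hit rates, for illustration only.
crowded = average_access_ns(0.80)  # algo competing with a full software FIX engine
roomy = average_access_ns(0.98)    # algo with most of the cache to itself

print(f"crowded cache: {crowded:.1f} ns per access on average")
print(f"roomy cache:   {roomy:.1f} ns per access on average")
```

Because a miss costs roughly ten times as much as a hit in this sketch, even a modest change in hit rate moves the average noticeably, which is why freeing cache for the algorithm matters.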
The latency battlefield is nearing the end of a maturation phase for what are now becoming known as “traditional” applications involving market data consumption and high-speed execution. As testament to this, increments of advancement have been shrinking, now typically measured in tens or hundreds of nanoseconds. Focus is now migrating in a couple of key directions from pure speed to more comprehensive performance metrics plus TCO, where performance is increasingly defined as “bigger workloads, faster,” and where – as TABB Group has showcased in its recent research and modeling on Benchmarking TCO – TCO includes human capital costs, and also eventually includes a growing sensibility around the opportunity cost of time to market, among other intangibles.
In other words, sensitivity to latency is now also migrating full speed ahead into a much broader spectrum of use cases. In fact, as trading firms and solution providers experiment with the rapidly evolving suite of technical tools, new use cases will be discovered – and not the other way around. With few exceptions, new use cases do not precede new tools. The adoption of new FIX engines in fixed income and around swap execution facilities (SEFs) offers a couple of examples of such exceptions. All told, TABB Group believes that many of the most exciting and groundbreaking HPC use cases of the post-GFC era have yet to be conceived or discovered.
For this reason, a new FPGA-enhanced use case is symbolic of the discovery of a new frontier – one that could unfold far beyond FIX messaging, as outlined here. Specifically, by moving some processing from the software (in the CPU) to the hardware (namely, an FPGA) – and improving the communications interface between the CPU and FPGA – initial versions of such a hybrid approach have already demonstrated performance benefits, such as lower latency (20%-33% of traditional solution), less jitter (about 10% of traditional solution), and greater throughput (about 8x traditional solution).
On top of these, direct and indirect TCO benefits are potentially greater, an increasingly important feature set for any new solution in the current environment. Here’s the list of those key benefits:
- Leave existing business application software unchanged;
- Continue to program in familiar languages (Java, C++, and/or .NET);
- Offload the FIX repository and message handling to the FPGA – a significant component of CPU consumption in the traditional software-only approach – which frees up the CPU, and CPU cache, to handle more sophisticated business logic (thereby potentially making such business logic more competitive);
- Support all versions of FIX from 4.0 to 5.0 SP2; and
- Apply the same methodology to both FIX market data and transaction messages.
Moreover, each of these benefits incrementally adds to much faster time to market than a pure FPGA solution, thereby satisfying our earlier claim about the combination of both performance and TCO.
Perhaps more important for the longer term than anything else said here so far is what this hybrid template suggests about the discovery and refinement of new use cases. TABB Group believes that the applicability of these concepts is very broad. For starters, think of any scenario in which the problems can be broken down into FPGA-based components and the agile business logic is an assembly of those components controlled by code running on the CPU. For instance, consider what the application of reference data models (such as the new legal entity identifier – LEI – or any security master) could mean for high-performance pricing of fixed income products or low-latency detection of the sympathetic effects of catalytic data on fundamentally similar groups of stocks.
Of course, these and similar ideas may not be particularly new for industry leaders, but the idea that increasingly elusive alpha could be discovered and harvested from speeding up more computationally intensive workloads is a fiction that is due for its debut in market reality. On the back of innovations such as the hybrid FPGA FIX engine, a door is now opened to a potentially expansive range of possibilities.