
Applications of Computing in Industry: What is Low Latency All About?

eFX – January 2014

Divyakant Bengani
− Undergraduate degree in Management and IT from Manchester
− Vice President at CS, responsible for eFX Core Technologies
− Working in the banking industry since 2003, and at CS for ~3 years


eFX – What Do We Do?
− Cash FX only
− Spot, Forwards, and Swaps
− Continuous publication of prices
− Streaming executable rates
− Responding to requests for quotes
− Acceptance and booking of trades


Key Statistics
− ~200 currency pairs (e.g. EURUSD, GBPJPY)
− 3 billion prices broadcast per day
− 60,000 trades per day
− >200 client connections


Technologies Used
− Java
− C# for UIs
− GWT for web UIs
− Oracle Coherence
− Oracle DB
− Derby DB
− Azul Zing JVM
− Low-latency FIX engine


Protocols
− Socket connections
− Asynchronous JMS
− Java RMI
− HTTP (JSON, Hessian)


Payloads
− Google Protobuf
− Fixed-length byte arrays
− FIX (industry standard)
− JMS MapMessages
− Java serialization


eFX – Overall Architecture


Service Discovery
− Zero-conf: dynamically add and remove services
− Applications do not need to know about each other; they simply pick up whatever is advertised
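As a rough illustration of the pattern (all names here are invented for the sketch; the deck does not show its implementation), a zero-conf-style registry lets services advertise themselves and lets consumers react to whatever appears, so no application needs a hard-coded peer list:

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal in-memory service registry sketch. Real zero-conf systems use
// multicast or a discovery protocol; this only shows the advertise/lookup
// contract that lets applications stay decoupled.
public class ServiceRegistry {
    private final Map<String, String> services = new HashMap<>(); // name -> address
    private final List<Consumer<String>> listeners = new ArrayList<>();

    public synchronized void advertise(String name, String address) {
        services.put(name, address);
        for (Consumer<String> l : listeners) l.accept(name); // notify subscribers
    }

    public synchronized void withdraw(String name) {
        services.remove(name);
    }

    public synchronized Optional<String> lookup(String name) {
        return Optional.ofNullable(services.get(name));
    }

    // Consumers register interest instead of polling a fixed endpoint list.
    public synchronized void onAdvertised(Consumer<String> listener) {
        listeners.add(listener);
    }
}
```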


Automated Testing


Code Quality Analysis


Continuous Integration


How to Achieve Low Latency

Daniel Nolan-Neylan
− Graduated from UCL in 2004
− Started working at Credit Suisse in 2006
  First, networking for 4 years
  Now, an application developer in FX IT
− Different projects:
  Distributed caching system for static data
  Simplified credit-checking library
  Pricing and trading gateway (now team lead)

Corporate Design, HCBC 1

November 2011


Wait a Second!
Reminder: 1 second is:
− 1,000 milliseconds
− 1,000,000 microseconds
− 1,000,000,000 nanoseconds

Latency Numbers Every Programmer Should Know
L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns  (14x L1 cache)
Mutex lock/unlock                             25 ns
Main memory reference                        100 ns  (20x L2 cache, 200x L1 cache)
Compress 1K bytes with Zippy               3,000 ns
Send 1K bytes over 1 Gbps network         10,000 ns  (0.01 ms)
Read 4K randomly from SSD*               150,000 ns  (0.15 ms)
Read 1 MB sequentially from memory       250,000 ns  (0.25 ms)
Round trip within same datacenter        500,000 ns  (0.5 ms)
Read 1 MB sequentially from SSD*       1,000,000 ns  (1 ms; 4x memory)
Disk seek                             10,000,000 ns  (10 ms; 20x datacenter round trip)
Read 1 MB sequentially from disk      20,000,000 ns  (20 ms; 80x memory, 20x SSD)
Send packet CA->Netherlands->CA      150,000,000 ns  (150 ms)
By Jeff Dean:

http://research.google.com/people/jeff/

FX Trading – Latency Numbers
250 ms – a human responding to a price update
30 ms – bank accepting a trade
10 ms – credit-checking a client
9 ms – JVM garbage collection pause
5 ms – persisting a trade to disk
2 ms – JMS networking round trip
1 ms – raw socket networking round trip
0.5 ms – max wire-to-wire pricing latency
0.05 ms – min pricing latency
0.005 ms – writing a price to the FIX engine

Optimization Quotes
Michael A. Jackson: “The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.”
Rob Pike: “Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.”

Where to Optimize? Use a profiler.

Measuring Milliseconds and Nanoseconds in Java
Measure the time taken for operations and log it:
− System.currentTimeMillis()
  Good for taking a time/date that can be compared against other systems. Accuracy depends on the OS, but 1 ms accuracy is achievable on a modern Unix-based OS (Linux). Bad if more precise measurements are required.
− System.nanoTime()
  Good for sub-millisecond measurements. Bad if a time comparable with other systems is required.
− Realistically, you need to use both.
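A minimal sketch combining both clocks (the class and method names here are illustrative): nanoTime() measures the elapsed interval, while currentTimeMillis() supplies a wall-clock stamp that can be correlated with other systems' logs.

```java
// Illustrative timing helper: nanoTime() is monotonic and suitable for
// sub-millisecond deltas; currentTimeMillis() is a wall-clock stamp for logs.
public class LatencyTimer {
    public static long timeNanos(Runnable task) {
        long start = System.nanoTime();     // monotonic start reading
        task.run();
        return System.nanoTime() - start;   // elapsed nanoseconds
    }

    public static void main(String[] args) {
        long wallClock = System.currentTimeMillis(); // comparable across systems
        long elapsed = timeNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        });
        System.out.println("at=" + wallClock + "ms elapsed=" + elapsed + "ns");
    }
}
```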


Quote Journalling – log latency of every price


Our Soak Test Harness


…and the graphs it can produce


Removing Millisecond Delays
Identify the longest-running tasks − usually I/O delays:
− Disk: database activity, synchronous logging, writing files
− Network: calling network services, remote services far away (e.g. across the Atlantic, ~50 ms)

Removing Millisecond Delays (2)
Analyze whether delays can be eliminated:
− Disk
  Database activity -> use a cache
  Synchronous logging -> use asynchronous logging
  Writing files -> use buffers and write asynchronously
− Network
  Calling network services -> cache where possible
  Remote services far away -> co-locate in the same place
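The "use asynchronous logging" point can be sketched as a queue drained by a background thread (a minimal illustration, not the deck's actual logger): the hot path only enqueues, so disk latency never blocks it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative asynchronous logger: callers enqueue in microseconds; a
// background daemon thread absorbs the slow I/O off the hot path.
public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();  // blocks, but off the hot path
                    System.out.println(line);    // stand-in for a slow disk write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    // Hot path: just an enqueue, regardless of how slow the disk is.
    public boolean log(String line) {
        return queue.offer(line);
    }
}
```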

FX Trading – RFQ Example
Example: an incoming request for a price, with a target response time of 10 ms. We need to:
− Validate the request parameters
− Internally subscribe for prices
− Obtain a globally unique transaction ID
− Perform a credit check
How can all this be done in just 10 ms?

FX Trading – RFQ Example (2)
− Credit check: the old one took 30-200 ms; the new one takes 5-10 ms, using caching and co-location
− Parallelize all validation
− Pre-cache prices, by opening up price streams in advance of their being required
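"Parallelize all validation" can be sketched with a standard thread pool (the check names are hypothetical): all independent checks are submitted at once, so the total wait is only as long as the slowest check rather than the sum of all of them.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Illustrative parallel validation: invokeAll submits every check at once
// and blocks until all have completed.
public class ParallelValidation {
    public static boolean validate(ExecutorService pool,
                                   List<Callable<Boolean>> checks) {
        try {
            for (Future<Boolean> result : pool.invokeAll(checks)) {
                if (!result.get()) return false; // any failed check rejects the RFQ
            }
            return true;
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            return false; // treat interruption or failure as a rejection
        }
    }
}
```

A caller would pass checks such as parameter validation and the credit check as independent Callables sharing one pool.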

Don’t Optimize Too Soon
Remember:
− Only optimize what you need to optimize
− Remove the longest delays first: there is no point removing microseconds if you still have delays of milliseconds or worse
− Always measure your operations carefully: determine the minimum, maximum, mean, standard deviation, and high percentiles (99%, 99.9%, etc.)
− Watch for jitter, and solve it separately
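The measurement advice can be sketched as a small summary helper (the names are assumptions; the actual soak-test harness is not shown): sort the latency samples and read percentiles by rank.

```java
import java.util.Arrays;

// Illustrative latency summary: percentile by rank over sorted samples.
public class LatencyStats {
    // Value at or below which `pct` percent of samples fall.
    public static long percentile(long[] samplesNanos, double pct) {
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static double mean(long[] samplesNanos) {
        double sum = 0;
        for (long s : samplesNanos) sum += s;
        return sum / samplesNanos.length;
    }
}
```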

Removing Microsecond Delays
Intra-process delays:
− Unbalanced or slow queues
− Slow algorithms
− Expensive loops repeated many times
− Poor use of object creation / memory allocation
− Contended memory controlled with locks
− Wasted effort calculating unwanted results

FX Trading – Pricing Example
Achieving wire-to-wire latencies of 50 μs:
− Google protobuf parsers replaced with low-garbage versions; each GC stops the JVM for 9,000 μs (i.e. 9 ms)
− LMAX Disruptors used instead of queues, with busy-spin consumer threads and the single-writer principle
− A “PriceBigDecimal” class to replace Java's BigDecimal, which is slow to instantiate and impossible to mutate
− No synchronous logging or network calls
− Pre-cache static data before starting the price stream
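The BigDecimal replacement might look something like the following mutable fixed-point holder (the class name, scale, and layout are assumptions, not the deck's actual "PriceBigDecimal"): the price lives in a long, so updating a tick mutates in place and allocates nothing, avoiding the GC pauses mentioned above.

```java
// Illustrative low-garbage price holder: fixed-point arithmetic on a long,
// mutated in place instead of allocating a new BigDecimal per tick.
public class MutablePrice {
    public static final int SCALE = 5;   // 5 decimal places, typical for FX
    private long scaled;                 // price * 10^SCALE, e.g. 1.34215 -> 134215

    public void set(long scaled) { this.scaled = scaled; } // mutate, no allocation
    public void add(long deltaScaled) { scaled += deltaScaled; }
    public long scaled() { return scaled; }

    // For display/logging only; the hot path never touches doubles.
    public double toDouble() { return scaled / 100000.0; }
}
```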

Disruptor or Blocking Queues?


Java BigDecimal or use Low Latency replacement?


Removing Nanoseconds?
− Use specialist hardware (such as FPGAs)
− Understand low-level CPU interconnectivity with memory, and how CPU caching works (including cache lines): http://mechanical-sympathy.blogspot.com
− eFX: no need to pursue this level of performance at the moment

Latency vs Throughput
− Latency: the time taken (typically mean, percentile, or worst case) to complete a task
− Throughput: the number of tasks completed in a given time period (typically per second)

Throughput is 1/latency (per pipeline)

Increasing Throughput
Identify delays:
− Throughput is constrained by latency
− Blocking I/O calls delay unprocessed messages
Data bursts:
− What is the peak throughput required?
− What is the typical gap between bursts?

Techniques to Increase Throughput: Batching
− Sometimes latent calls are unavoidable
− Batching can strip the overhead of making a call per transaction
− The cost of batching is the delay incurred while waiting for new items to add to the batch
− It is more difficult to measure the delay per item accurately when multiple items are in a batch

FX Trading – Batching Example

− Legacy global server in London, with regional trade-acceptance components
− Latency between New York and London: 50 ms
− Per thread: 1/0.05 = 20 trades per second maximum
− How to increase this? More threads, or add batching per thread

Now, with a batch size of 5: 100 trades per second per thread.
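The batching arithmetic above can be checked directly: one 50 ms round trip per call gives 20 calls per second per thread, and a batch of 5 trades per call gives 100 trades per second.

```java
// Batching arithmetic from the example: throughput per thread is
// (calls per second) * (trades per call).
public class BatchThroughput {
    public static double tradesPerSecond(double roundTripMillis, int batchSize) {
        double callsPerSecond = 1000.0 / roundTripMillis; // one call per round trip
        return callsPerSecond * batchSize;
    }

    public static void main(String[] args) {
        System.out.println(tradesPerSecond(50.0, 1)); // 20.0
        System.out.println(tradesPerSecond(50.0, 5)); // 100.0
    }
}
```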

Techniques to Increase Throughput (2): Asynchronous Callbacks
− Synchronous calls: boolean doCall()
  Wait for the response; can be delayed for a varying time
− Asynchronous calls: void doCall(Callback callback)
  Do not wait; keep processing more events
  Can additionally overlay timeouts to improve resilience

FX Trading – Asynchronous Callbacks
− Submission of a trade to the price service for verification was originally synchronous: the call blocks for 50 ms, so at most 20 trades per second per thread
− After converting to asynchronous callbacks, the only delay is putting packets on the network buffer (microseconds), so there is effectively no delay; the maximum number of trades is very high
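The conversion can be sketched as follows (the interface and class names are illustrative, not the deck's actual code): the submitting thread hands the trade off and returns immediately, and the verification result arrives later on a callback.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative asynchronous trade verification: verify() returns immediately,
// so the submitting thread is never blocked by a slow downstream service.
public class AsyncTradeService {
    public interface Callback { void onVerified(String tradeId, boolean ok); }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Returns at once; the (possibly slow) verification runs elsewhere.
    public void verify(String tradeId, Callback callback) {
        worker.submit(() -> callback.onVerified(tradeId, true)); // stand-in check
    }

    public void shutdown() { worker.shutdown(); }
}
```

In a real system the stand-in check would be a network call, and a timeout would be overlaid on the callback for resilience, as the previous slide notes.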

Q&A eFX – January 2014
