

Platform Performance Gains with Arista Switches

03.29.2013

In late 2012 I wrote about the migration of our [DataSift.com's] Hadoop cluster to Arista switches, but what I didn't mention was that we also moved our real-time systems over to Arista.

Within the LAN

During our fact-finding trek through the Cisco portfolio we acquired a bunch of 4948 and 3750 switches, which were re-purposed into the live platform. Unfortunately, the live platform (as opposed to Hadoop-sourced historical data) would occasionally experience performance issues, with the fan-out design of our distributed architecture amplifying the impact of micro-bursts during high-traffic events.

Every interaction we receive is augmented with additional metadata such as language designation, sentiment analysis, trend analysis and more. To acquire these values an interaction is tokenised into the relevant parts (e.g. a Twitter user name for Klout score, sentences for sentiment analysis, trigrams for language analysis, etc.). Each of those tokens is then dispatched to the service endpoints for processing. A stream of 15,000 interactions a second can instantly become 100,000+ additional pieces of data traversing the network, which puts load on NICs, switch backplanes and core uplinks.
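
As a rough sketch of that fan-out step (the service names and the tokenise() helper below are purely illustrative, not our production code):

    # Illustrative fan-out: one interaction becomes many tokens, each sent to
    # the service that enriches it. Endpoint names are made up for the example.
    from concurrent.futures import ThreadPoolExecutor

    SERVICES = {
        "username": "klout-scorer",
        "sentence": "sentiment-analyser",
        "trigram":  "language-detector",
    }

    def tokenise(interaction):
        """Split one interaction into (token_type, value) pairs."""
        yield ("username", interaction["user"])
        for sentence in interaction["text"].split("."):
            if sentence.strip():
                yield ("sentence", sentence.strip())
        text = interaction["text"]
        for i in range(len(text) - 2):          # trigrams for language detection
            yield ("trigram", text[i:i + 3])

    def augment(interaction, dispatch):
        """Send every token to its service endpoint in parallel."""
        with ThreadPoolExecutor(max_workers=32) as pool:
            futures = [pool.submit(dispatch, SERVICES[kind], value)
                       for kind, value in tokenise(interaction)]
            return [f.result() for f in futures]

Even this toy version makes the amplification obvious: a single tweet-sized interaction yields one username, a handful of sentences and dozens of trigrams, every one of which crosses the network.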

If a particular request were to fail, precious time would be wasted waiting for the reply, processing the failure and then re-processing the request. To combat this you might duplicate calls to service endpoints (speculative execution, in Hadoop parlance) and double your chances of success, but then those ~100,000 streams would become ~200,000, putting further stress on all your infrastructure.
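
A minimal sketch of such a hedged call, assuming a generic call_endpoint(replica, payload) function rather than anything DataSift-specific:

    # Speculative execution sketch: fire the same request at two replicas and
    # take whichever answers first. It doubles the traffic but avoids stalling
    # on a failed or slow endpoint.
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def hedged_call(call_endpoint, replicas, payload, timeout=0.05):
        """call_endpoint(replica, payload) blocks until a reply or raises."""
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(call_endpoint, r, payload) for r in replicas]
            done, pending = wait(futures, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
            for f in pending:
                f.cancel()                  # best effort: drop the slower copy
            if not done:
                raise TimeoutError("no replica replied in time")
            return next(iter(done)).result()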

At DataSift we discuss internal platform latency in terms of microseconds and throughput in tens of gigabits so adding an unnecessary callout here or a millisecond extra there isn’t acceptable. We want to be as efficient, fast and reliable as possible.

When we started looking at ways of improving the performance of the real-time platform, it was obvious that many of the arguments that made Arista an obvious choice for Hadoop also made it ideal for our real-time system. The Arista 7050s we'd already deployed have some impressive latency statistics, so we needed little more convincing that we were on the right path (although the 1.28Tbps and 960,000,000 packets per second figures don't hurt either). For truly low-latency switching at the edge one would normally look at the 7150 series, but from our testing the 7048s were well within the performance threshold we wanted and enabled us to standardise our edge.

We made use of our failure-tolerant platform design (detailed further below) to move entire cabinets at a time over to the Arista 7048s with no interruption of service to customers.

Once all cabinets were migrated, and with no other optimisations at that point, we saw an immediate difference in key metrics.

Simply by deploying Arista switches for our ‘real time’ network we decreased augmentation latency from ~15,000µs down to ~2,200µs. Further optimisations to our stack, and to how we leverage the Linux kernel's myriad options, have improved things even more.
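
Those kernel-level tweaks deserve a post of their own, but to give a flavour of the kind of knobs involved, here is an illustrative bit of per-socket tuning (example values only, not our actual configuration):

    # Socket options of the sort used for latency-sensitive, high-throughput
    # services on Linux. The buffer sizes here are examples, not our settings.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle so small augmentation requests aren't held back and coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Enlarge the socket buffers to absorb micro-bursts (the kernel caps these
    # at the net.core.rmem_max / net.core.wmem_max sysctls).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)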

Epic Switches are only half the story

One of the great features of the Arista 7048 switches is their deep-buffer architecture, but in certain circumstances another buffer in the path is the last thing you want. Each buffer potentially adds latency to the system before the upstream can detect the congestion and react to it.

The stack needs to be free of bottlenecks to stop those buffers from filling up, and the 7048 switches can provide up to 40Gb/s of throughput to the core, which fits nicely with 40 1U servers in a 44U cabinet. That said, we're not ones to waste time and bandwidth by leaving the TOR switch if we don't have to.

By pooling resources into ‘cells’ we can reduce uplink utilisation and decrease the RTT of fan-out operations, splitting the workload into per-cabinet pools.

Coupled with the Aristas' non-blocking, full wire-speed forwarding, intelligent health checks and resource routing mean that if a resource pool suffers failures the processing servers can call out cross-rack with very little penalty.
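
In sketch form, the selection logic amounts to something like the following, where the cell map and health checks are illustrative names only:

    # Cell-aware endpoint selection: prefer healthy endpoints inside the
    # caller's own cabinet ("cell"); fall back cross-rack when the local pool
    # is degraded. All names here are hypothetical.
    def choose_endpoint(service, local_cell, cells, is_healthy):
        """cells maps cell name -> {service name: [endpoints]}."""
        # Stay inside the TOR switch whenever possible: no core uplink traffic.
        local = [e for e in cells[local_cell].get(service, []) if is_healthy(e)]
        if local:
            return local[0]
        # Cross-rack fallback: with non-blocking, wire-speed forwarding the
        # penalty is small, so availability wins over locality.
        for name, pools in cells.items():
            if name == local_cell:
                continue
            remote = [e for e in pools.get(service, []) if is_healthy(e)]
            if remote:
                return remote[0]
        raise RuntimeError("no healthy endpoint for %s" % service)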

That’s Great but I’m on the other side of the Internet

We are confident enough in our ability to provide low-latency, real-time, filtered and augmented content that we publish live latency statistics for a stream being consumed by an EC2 node on the other side of the planet on our status site: http://status.DataSift.com.
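
Conceptually the measurement is simple: each delivered interaction carries the time it hit our edge, and the consumer compares that with its own clock on arrival. A tiny sketch, with the field name assumed for illustration:

    # How a remote consumer could measure end-to-end delivery latency, assuming
    # each interaction carries a hypothetical "edge_received_at" epoch timestamp
    # and both ends are NTP-synchronised.
    import time

    def delivery_latency_ms(interaction):
        return (time.time() - interaction["edge_received_at"]) * 1000.0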

We can be this confident because we control and manage every aspect of our platform, from influencing how data traverses the Internet to reach us, through our routers and switches, all the way down to the SSD chipset or SAS drive spindle speed in the servers. (You can't say that if you're on someone's public cloud!)

When you consider the factors outside of our control it speaks volumes about the trust we have in what we’ve built.

User latency (they could be next door to a social platform DC or over in Antarctica): 10ms – 150ms
+ Source platform processing time (e.g. time taken for Facebook or Twitter to process & send it on): ???ms
+ Trans-Atlantic fibre optics (e.g. San Jose to our furthest European processing node): ~150ms
+ Trans-Pacific fibre optics (e.g. from a European processing node to a customer in Japan): ~150ms
= ~500ms

When dealing with social data on a global scale there can be a lot of performance uncertainty (undersea fibre cuts, carrier issues, even entire IX outages), but we can rest assured that once the data hits our edge we can process it with low latency and high throughput.

In conclusion, I've once again been impressed by Arista and would wholeheartedly recommend their switches to anyone working with high-volume, latency-sensitive data.

Reading List:
Arista switches were already a joy to work with (access to bash on a switch, what's not to love?), but Gary's insights and advice make them all the better.
Arista Warrior – Gary A. Donahue

Even with all the epicness of this hardware, if you're lazy with how you treat the steps your data goes through before it becomes a frame on the switch, you're gonna have a bad time. For heavy-duty reading on that front, the Linux TCP/IP stack book may help.
The Linux TCP/IP Stack: Networking for Embedded Systems – Thomas F Herbert




Published at DZone with permission of Gareth Llewellyn, author and DZone MVB. (source)
