Cloudflare
Cloudflare has become a popular (perhaps the most popular) provider for adding useful features such as CDN and DNS to websites and APIs. What sets it apart from competitors, though, is its sophisticated firewall and DDoS protection, allowing fairly fine-grained control over who can access your endpoints and protecting those endpoints from abuse. Cloudflare also has one of the best global networks, helping accelerate requests to your endpoints from anywhere in the world.
Sounds great. So what’s the problem?
Most major cloud providers optimize their networks in a similar way. They route your ingress and egress over their own private backbone so that traffic enters and exits as close as possible to the endpoints you're calling, spending very little time on the public internet.
Sounds even better. What's the point of this infodump?
Well, nothing… most of the time… unless you’re very latency sensitive.
Behind Cloudflare
I'm hosting an API exposed via Cloudflare, but my customers expect the absolute tightest latencies. You might be in a similar situation, or you might be a client trying to access an API with the best possible latency.
If latency is such a concern, why am I using Cloudflare at all? Maybe I'm worried about DDoS attacks, or I want to make sure my API can only be accessed from a certain region.
It doesn’t really matter what the exact use case is for the purpose of this article.
To see what my API consumers observe, I simulate their requests with a simple API client. My API is hosted in GCP us-west2 (Los Angeles), so I host my client code there as well: https://cloud.google.com/about/locations/
Even better, Cloudflare also has a data center in the same region: https://www.cloudflare.com/network/.
I run my client against my GCP endpoint directly, without Cloudflare, and see 1ms baseline latencies. Perfect! Hopefully Cloudflare won't add much to this, so I run the same test with the Cloudflare-proxied endpoint.
My baseline latency is now over 10ms.
What happened? Is this a Cloudflare-induced cost? If so, it seems a bit excessive.
Before we go ahead and try to debug this, why baseline latencies? Because I only care about the best possible latencies to my API. There will be network-induced jitter, which will be worse (sometimes significantly so) if the requests have to travel over the internet to Cloudflare. Managing the tradeoff of better latency vs. worse jitter is out of scope for this article.
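As an aside, a baseline like this can be approximated by repeating a request and keeping the minimum observed time. A minimal sketch of the selection step, using made-up sample timings (a real run would collect them with curl against your own endpoint; the URL in the comment is a placeholder):

```shell
# In a real run, samples would come from something like:
#   for i in $(seq 1 100); do
#     curl -s -o /dev/null -w '%{time_total}\n' https://api.example.com/health
#   done
# Here, made-up timings stand in for real measurements.
samples='0.0123
0.0011
0.0019
0.0542'
# The baseline latency is the minimum of the observed samples.
printf '%s\n' "$samples" | sort -n | head -n 1   # -> 0.0011
```

Taking the minimum rather than the mean filters out jitter, which is exactly why baselines are the right lens for this investigation.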
Digging deeper
Cloudflare provides a /cdn-cgi/trace endpoint that returns some useful debugging info about how an endpoint proxied behind Cloudflare is being accessed. Running that (anonymized):
```
...
colo=SJC
...
```
Notice the colo=SJC. SJC is the San Jose Cloudflare data center. In other words, the request is hitting the SJC data center even though I'm making the request from LA to an endpoint in LA, and Cloudflare has an LA location. What gives?
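The colo field can be pulled out of the trace output directly. A small sketch, using an anonymized sample of the trace format (the field values below are placeholders; a real run would pipe in `curl -s https://your-domain.example/cdn-cgi/trace` instead):

```shell
# Anonymized sample of /cdn-cgi/trace output; real values will differ.
trace='fl=000x0
ip=203.0.113.1
colo=SJC
http=http/2'
# Extract the data center code from the colo= line.
printf '%s\n' "$trace" | grep '^colo=' | cut -d= -f2   # -> SJC
```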
AWS us-west-1 (N. California) should also be pretty close, so let's try that as well. I now see 15ms baseline latencies. Turns out this is even worse:
```
...
colo=SEA
...
```
Couldn't be much worse. It's hitting Cloudflare Seattle (SEA).
More digging. Traceroute from GCP:
```
(traceroute output omitted)
```
MTR is equally unhelpful:
```
(mtr output omitted)
```
AWS turns out to be more helpful:
```
(traceroute output omitted; it shows 27 hops inside AWS before reaching Cloudflare in Seattle)
```
Now that’s interesting.
What is going on
The request from AWS takes a detour all the way to Seattle, with 27 hops within AWS itself before it hits Cloudflare. Presumably Cloudflare has some sort of direct connectivity to AWS via its SEA data center (perhaps in the AWS us-west-2 region) and advertises my endpoint there. AWS wants to optimize the network path to the endpoint over its dedicated network instead of the internet, which means routing the request all the way to SEA, thinking that it's bypassing the internet entirely.
Looking at the GCP docs on their network tiers suggests that GCP is doing something similar (and likely most other cloud providers as well): https://cloud.google.com/network-tiers/docs/overview
GCP has two network tiers: Standard and Premium. The Premium tier claims to deliver traffic between external systems and Google over Google's own network, with the ingress/egress to the internet happening at a PoP Google deems optimal.
AWS and GCP are both doing the same thing Cloudflare is: trying to optimize the network path. In my case, they are optimizing the path my request takes so that the egress is as close as possible to where they think the endpoint is; for AWS that is Cloudflare SEA, and for GCP that is Cloudflare SJC. And in doing so, they're just making things worse.
Fix
Remember the GCP Standard tier?
With the Standard tier, traffic exits Google's network at the closest PoP. Turning it on, I see a baseline latency of 2ms!
```
...
colo=LAX
...
```
Voila! We've hit the Cloudflare LAX data center. Turns out that in this case Google's Standard network tier works better than the Premium one.
Traceroute shows something similar: the requests route via if-ae-6-20.tcore1.eql-losangeles.as6453.net before hitting Cloudflare.
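For reference, the network tier in GCP can be set per instance (or per external address). A sketch of provisioning a client VM on the Standard tier with the gcloud CLI, assuming placeholder instance and zone names:

```shell
# Create a VM whose external address uses the Standard network tier,
# so egress leaves Google's network at a nearby PoP instead of being
# carried to a Google-chosen "optimal" PoP.
# Instance name and zone below are placeholders.
gcloud compute instances create my-api-client \
  --zone=us-west2-a \
  --network-tier=STANDARD
```

This is a configuration sketch rather than something you can run without a GCP project; the same `--network-tier` choice can also be made when reserving a static external address.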
I'm not sure what the equivalent of GCP's Standard tier is on AWS, or whether it is even possible to get AWS to have my requests exit to the internet at the nearest PoP.
TL;DR
If you're using Cloudflare in a latency-sensitive environment in the cloud, be aware that your cloud provider and Cloudflare might both be trying to optimize the network path, clashing with each other's improvements and ultimately resulting in worse performance.
In another article I will show how Cloudflare itself might be adding more unpredictability to this.