Cloudflare again
In a previous article I showed the latency impact of using a Cloud provider’s network while also using Cloudflare. Turns out that isn’t the end of the story. If anything it gets better (or worse depending on your perspective).
I’d suggest reading that article first, as this one picks up where it left off and tells the story of what followed.
The scenario
I ran the previous scenarios from GCP us-west2 (LA). This time I do the same from us-west4 (Vegas).
Google has a region in Vegas and so does Cloudflare. We already know by now that in some cases the Standard Tier networking avoids unnecessary hops to another city on Google’s internal network before hitting my Cloudflare proxied endpoint.
For my peace of mind, I double check the baseline latency on the Premium Tier network. I expect 1-2ms but get 13ms. From a client in Vegas to an endpoint in Vegas. Perfect! Everything is working as expected. The problem from the previous article wasn’t really limited to one specific region and can be replicated in another.
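For anyone who wants to reproduce a baseline number like this, here is a minimal sketch. It is not necessarily how the figures in this article were measured. It assumes that timing a TCP handshake is a fair proxy for one network round trip, and the hostname in the usage note is a placeholder rather than my real endpoint.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Time one TCP connection setup to host:port, in milliseconds.

    Note: if host is a name rather than an IP address, the DNS lookup
    is included in the measurement.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # handshake done; close the connection straight away
    return (time.perf_counter() - start) * 1000.0


def baseline_latency_ms(host: str, port: int = 443, samples: int = 20) -> float:
    """Median of several connection timings, to damp outliers."""
    timings = [tcp_connect_ms(host, port) for _ in range(samples)]
    return statistics.median(timings)
```

Calling baseline_latency_ms("api.example.com") from a VM on each network tier gives comparable numbers, as long as both runs target the same hostname and port.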
Last time the Standard Tier worked. The requests got kicked out to the internet and took a shorter path. So I try the Standard Tier network.
Running the same tests with this new setup results in 12ms baseline latency.
What?!
The problem
What went wrong this time?
Using the trusty Cloudflare debug endpoint (/cdn-cgi/trace), I check which data center is answering my requests.
That can’t be right. The requests from GCP Vegas are going all the way to Cloudflare LAX over the internet before being routed back to Vegas again, even though there is a Cloudflare LAS DC.
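To make the check repeatable, here is a small sketch that pulls the serving data center out of a /cdn-cgi/trace response. The endpoint returns plain key=value lines, and the colo field holds the data center code. The function names are my own for illustration, and whether a plain urllib request is accepted may depend on the site's Cloudflare settings.

```python
from typing import Dict, Optional
from urllib.request import urlopen


def fetch_trace(domain: str, timeout: float = 5.0) -> str:
    """Download the raw /cdn-cgi/trace body for a Cloudflare-proxied domain."""
    with urlopen(f"https://{domain}/cdn-cgi/trace", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_trace(body: str) -> Dict[str, str]:
    """Turn the key=value lines of a trace response into a dict."""
    fields: Dict[str, str] = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore blank or malformed lines
            fields[key.strip()] = value.strip()
    return fields


def serving_colo(body: str) -> Optional[str]:
    """Return the data center code (e.g. LAX or LAS), or None if absent."""
    return parse_trace(body).get("colo")
```

With a shortened, made-up response body such as "h=example.com\ncolo=LAX\nloc=US\n", serving_colo returns "LAX".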
A traceroute shows that my requests are definitely going over the internet, hitting ix-ae-54-0.tcore2.lvw-losangeles.as6453.net on the way to Cloudflare LAX.
So what is happening?
Cloudflare has a data center in Vegas, and my client is running from Vegas. Why are the requests not being routed via that DC, and instead going to LAX?
Looking at Cloudflare’s status page https://www.cloudflarestatus.com, everything looks good. The LAS data center has no incidents reported, nor do any other data centers in the vicinity.
Re-running the tests over the rest of the day yields no different results. Switching gears, I run some tests against other domains I can find that use Cloudflare, calling the /cdn-cgi/trace endpoint. Turns out many of them route via LAX, but not all. Some do hit LAS.
LAS is definitely alive and serving requests, but it’s not serving all of the domains I’m trying. In my somewhat random sample, only a very small proportion are served from LAS.
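A survey like this is easy to turn into numbers. The sketch below tallies which data center answers for each domain in a list. Here lookup is a hypothetical callable that returns the colo value from a domain's /cdn-cgi/trace response (or None if it has none), and the domains in the usage note are placeholders, not the ones I tested.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def tally_colos(
    domains: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Counter:
    """Count how many domains are served from each data center."""
    counts: Counter = Counter()
    for domain in domains:
        try:
            colo = lookup(domain)
        except OSError:
            colo = None  # network errors and timeouts count as unknown
        counts[colo or "unknown"] += 1
    return counts


def share(counts: Counter, colo: str) -> float:
    """Fraction of sampled domains served from the given data center."""
    total = sum(counts.values())
    return counts[colo] / total if total else 0.0
```

For example, share(tally_colos(["a.example", "b.example"], lookup), "LAS") gives the fraction of the sample that Vegas is actually serving. A small, hand-picked sample like mine only hints at the real distribution.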
Is Cloudflare perhaps sharding the proxied endpoints across datacenters?
Luckily, I have an Enterprise account with Cloudflare, so I just open a ticket to figure out what is going on here.
While I’m waiting for their response I repeat the tests over the next few days and get similar results. I also run tests without Cloudflare and verify that I can get 1-2ms latencies over the internet from Vegas to Vegas for my API endpoint.
The solution (or not)
Eventually Cloudflare support gets back to me. Turns out my endpoint had been given a set of IP addresses by Cloudflare, and those IP addresses are not being broadcast from their LAS data center.
Can they change that?
They want a good reason as to why.
Fortunately, I have an excellent reason.
I wait some more for them to do their thing.
In the end, I don’t get what I want, but I learn something, so maybe I do.
My little endpoint is too popular: Cloudflare doesn’t have the capacity at LAS and won’t be able to broadcast it from there. So I’ll have to live with LAX for now.
TL;DR
If you are using Cloudflare with anything latency sensitive, you’re at their mercy for how you’re being routed. Unlike with the Cloud, where you can switch regions to suit your needs, Cloudflare routing is opaque and offers limited control.