High-Precision Timing with TSC on x86 CPUs

I recently implemented high-precision timing measurements in the critical path of a few projects, where the Time Stamp Counter (TSC) came in handy. This post covers the implementation of a TSC-based clock and some considerations when using it.

The Time Stamp Counter (TSC)

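The TSC is a 64-bit register on x86 processors that counts up from zero at reset. Historically it incremented once per CPU cycle; on processors with an invariant TSC it ticks at a fixed rate that does not depend on the current core frequency. It is read with the RDTSC instruction, which makes it a popular choice for high-resolution timing on x86 platforms.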

Why Use TSC?

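  1. High Resolution: The TSC ticks at roughly the CPU's clock rate, so a single tick is well under a nanosecond on multi-GHz parts.
  2. Low Overhead: Reading the TSC is a single user-mode instruction with no system call, so it is typically cheaper than OS timing functions.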
  3. Monotonic: On most modern processors, the TSC is monotonic and consistent across cores.

Invariant TSC

An important concept when working with the TSC is the invariant TSC. Most modern x86 processors implement one, and it has several advantages for timing measurements. It runs at a constant rate regardless of CPU power state or frequency changes. On multi-core and multi-socket systems it is typically synchronized across all cores and sockets, so readings are consistent. It also keeps incrementing while a core is in a deep sleep state.

To check if your processor supports invariant TSC, you can use the CPUID instruction. The invariant TSC feature is indicated by bit 8 of the EDX register when CPUID is executed with EAX = 80000007H.
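The check described above can be sketched as follows. This is a minimal example that assumes GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`; the function name `has_invariant_tsc` is hypothetical and not part of the clock class in this post. On non-x86 targets the sketch simply reports `false`.

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#endif

// Hypothetical helper: returns true if CPUID reports an invariant TSC
// (leaf 0x80000007, EDX bit 8).
inline bool has_invariant_tsc() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 when the requested leaf is not supported.
    if (__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx) == 0) {
        return false;
    }
    return (edx & (1u << 8)) != 0;
#else
    return false;  // Not an x86 target, so there is no TSC to query.
#endif
}
```

A program could call this once at startup and fall back to `std::chrono::steady_clock` when it returns `false`.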

TSC-based Clock Implementation

The implementation is based on Intel’s whitepaper, with modifications for better performance. I will avoid rehashing information already in the whitepaper and recommend reading it to understand the implementation better.

Key points about this implementation:

  1. We use MFENCE and LFENCE as cheaper alternatives to CPUID for serializing.
  2. The TSC value is combined directly into a 64-bit value, simplifying register handling.
  3. For end measurements, RDTSCP is used instead of RDTSC for better ordering guarantees.

Calibration

To convert TSC ticks to real time, we need to know the TSC frequency:

static void calibrate_frequency() {
    using namespace std::chrono;
    auto start = steady_clock::now();
    uint64_t start_tsc = tsc();
    std::this_thread::sleep_for(seconds(1));
    auto end = steady_clock::now();
    uint64_t end_tsc = bench_end();
    duration<double> elapsed_seconds = end - start;
    tsc_frequency = static_cast<double>(end_tsc - start_tsc) / elapsed_seconds.count();
}

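This method estimates the TSC frequency by comparing the TSC delta against std::chrono::steady_clock over a fixed interval (one second in this example). The end reading uses bench_end(), the RDTSCP-based helper shown in the full implementation.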
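The same calibration idea can be written with the tick source and the interval as parameters, which makes it possible to try shorter intervals and to test the logic without touching the TSC. This is only a sketch: `estimate_tick_frequency` and its parameters are hypothetical names, not part of the class in this post.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: estimates ticks per second for any tick source by
// comparing its delta against steady_clock across a sleep of `interval`.
template <class TickFn>
double estimate_tick_frequency(TickFn read_ticks,
                               std::chrono::milliseconds interval) {
    using namespace std::chrono;
    const auto start = steady_clock::now();
    const std::uint64_t start_ticks = read_ticks();
    std::this_thread::sleep_for(interval);
    const auto end = steady_clock::now();
    const std::uint64_t end_ticks = read_ticks();
    const duration<double> elapsed_seconds = end - start;
    return static_cast<double>(end_ticks - start_ticks) / elapsed_seconds.count();
}
```

With the class from this post, the tick source would be something like `[] { return TscDurationClock::tsc(); }`. A shorter interval starts up faster but makes the fixed cost of the two clock reads a larger share of the measurement, so the estimate is likely to be noisier.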

Taking a Start Measurement

We first take a measurement representing the start time:

static uint64_t tsc() {
    uint64_t x;
    asm volatile (
        "mfence\n"  // Memory barrier
        "lfence\n"  // Load barrier
        "rdtsc\n"   // Read time-stamp counter
        "shl $32, %%rdx\n"  // Move the high 32 bits (from EDX) into the upper half of RDX
        "or %%rdx, %%rax"   // Combine both halves into RAX
        : "=a" (x)  // Output: RAX mapped to x
        :           // No inputs
        : "rdx"     // Clobbers RDX
    );
    return x;
}

Measuring Duration

To measure duration, we convert TSC ticks to real time:

static std::chrono::nanoseconds duration_since_tsc(uint64_t start) {
    uint64_t end = bench_end();
    uint64_t tsc_diff = end - start;
    return std::chrono::nanoseconds{
        static_cast<uint64_t>((static_cast<double>(tsc_diff) / tsc_frequency) * 1e9)
    };
}
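To make the arithmetic concrete, here is a small standalone sketch of the tick-to-nanosecond conversion. The name `ticks_to_ns` is hypothetical. Unlike the function above, this variant multiplies before dividing and rounds to the nearest nanosecond instead of truncating; that is a design choice of the sketch, not something the original code does.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical helper: converts a TSC tick delta to nanoseconds, given the
// calibrated frequency in ticks per second.
inline std::chrono::nanoseconds ticks_to_ns(std::uint64_t ticks, double tsc_hz) {
    // Multiply first so that round tick counts divide cleanly, then round
    // to the nearest integer nanosecond.
    const double ns = static_cast<double>(ticks) * 1e9 / tsc_hz;
    return std::chrono::nanoseconds{
        static_cast<std::chrono::nanoseconds::rep>(std::llround(ns))};
}
```

For example, at a calibrated frequency of 3 GHz, a delta of 1500 ticks corresponds to 500 ns.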

Considerations

While TSC-based timing can be precise, there are some caveats:

  1. CPU Frequency Changes: On systems without invariant TSC, the TSC rate might change.
  2. Multi-Socket Systems: TSCs might not be synchronized across multiple CPU sockets.
  3. Virtualization: Virtual machines might not provide reliable TSC readings.

Full Implementation

Here’s the complete TscDurationClock class:

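#include <chrono>   // std::chrono::steady_clock, nanoseconds
#include <cstdint>  // uint64_t
#include <thread>   // std::this_thread::sleep_for

class TscDurationClock {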
public:
    TscDurationClock() = delete;

    static uint64_t tsc() {
        uint64_t x;
        asm volatile (
            "mfence\n"  // Memory barrier
            "lfence\n"  // Load barrier
            "rdtsc\n"   // Read time-stamp counter
            "shl $32, %%rdx\n"  // Move the high 32 bits (from EDX) into the upper half of RDX
            "or %%rdx, %%rax"   // Combine both halves into RAX
            : "=a" (x)  // Output: RAX mapped to x
            :           // No inputs
            : "rdx"     // Clobbers RDX
        );
        return x;
    }

    static std::chrono::nanoseconds duration_since_tsc(uint64_t start) {
        uint64_t end = bench_end();
        uint64_t tsc_diff = end - start;
        return std::chrono::nanoseconds{
            static_cast<uint64_t>((static_cast<double>(tsc_diff) / tsc_frequency) * 1e9)
        };
    }

    static void calibrate_frequency() {
        using namespace std::chrono;
        auto start = steady_clock::now();
        uint64_t start_tsc = tsc();
        std::this_thread::sleep_for(seconds(1));
        auto end = steady_clock::now();
        uint64_t end_tsc = bench_end();
        duration<double> elapsed_seconds = end - start;
        tsc_frequency = static_cast<double>(end_tsc - start_tsc) / elapsed_seconds.count();
    }

private:
    static inline uint64_t bench_end() {
        uint64_t x;
        asm volatile (
            "rdtscp\n"  // Read time-stamp counter and processor ID
            "lfence\n"  // Load barrier
            "shl $32, %%rdx\n"  // Move the high 32 bits (from EDX) into the upper half of RDX
            "or %%rdx, %%rax"   // Combine both halves into RAX
            : "=a" (x)  // Output: RAX mapped to x
            :           // No inputs
            : "rdx", "rcx"  // Clobbers RDX and RCX (RDTSCP writes the processor ID to ECX)
        );
        return x;
    }

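    // Ticks per second, set by calibrate_frequency(). Declared inline (C++17)
    // so that no separate out-of-class definition is needed.
    static inline double tsc_frequency = 0.0;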
};