Feedback

Theory vs. Practice

Diagnosis is not the end, but the beginning of practice. — Martin H. Fischer





Critical wrk and wrk2 bugs: all wrk/wrk2 benchmarks since 2012 are bogus

Nowadays, benchmarking is not a walk in the park. As yet another coincidence, wrk and wrk2 (2012) were created to complement weighttp (2006) and AB, the "Apache Benchmark" (1996)... just after we published G-WAN's (ab.c, weighttp-based) ground-breaking 2011-2012 benchmarks, reaching previously unknown heights in an otherwise boring, infinitely self-complacent industry.

Below is the G-WAN/cache October 2025 version – tested with a heavily corrected wrk2 version renamed wrk3:

              G-WAN RPS            NGINX RPS        G-WAN is N times faster
  -----  -------------------   -------------------   -----------------------
  users  10s  30s  3m   30m    10s  30s  3m   30m     (all 4 tests combined)
  -----  -------------------   -------------------   -----------------------
     1   151k 142k 152k 141k   104k 103k 103k 104k     586 /  414 =     1.41  │ The G-WAN Speed
    10   977k 996k 945k 927k   623k 616k 608k 601k    3845 / 2448 =     1.57  │ (wrk's architecture
    1k   2.0m 1.9m 1.8m 1.8m   1.0m 963k 964k 956k    7.6m / 3.9m =     1.93  │ is the bottleneck)
   10k   3.4m 1.8m 896k 729k   789k 729k 696k 716k    6.8m / 2930 = 2,334.73  ║║║ The G-WAN MultiCore
   20k   5.6m 3.7m 1.0m 713k   755k 724k 671k 682k   11.1m / 2832 = 3,948.96  ║║║ Scalability (wrk's
   30k   9.0m 4.2m 1.1m 723k    Terminated (OOM)     15.1m /    0 = infinity  ║║║ architecture is
   40k  15.0m 5.8m 2.1m 723k    Terminated (OOM)     23.6m /    0 = infinity  ║║║ the bottleneck)

  G-WAN is 1.41 to 3,948.96 times faster than NGINX with 1 to 20k users and 10-second to 30-minute tests.
  G-WAN keeps going while NGINX has to stop, due to a lack of memory on this 192 GB RAM, Intel Core i9 PC
  (the wrk/2/3 tools consume 190 GB RAM at 40k users, and NGINX uses more RAM than G-WAN).

  G-WAN running for 30 minutes is as fast or faster than NGINX running for 10 seconds – on all concurrencies.
  While letting wrk3 test 1 to 40k users, G-WAN runs marathons faster than NGINX runs 100m sprints.

  So G-WAN is faster than NGINX at short, middle and long test runs, beating the market leader
  (with the favorite NGINX benchmark) for the 100m sprint, 5km, half and full marathon races.

  And G-WAN RPS keep growing when concurrency grows, while NGINX RPS peak at 1k users and decline,
  proof that wrk3 is the bottleneck for G-WAN, and that NGINX is the bottleneck for wrk3.

What follows is the (long) why and how, along with the wrk3 source code fixing 4 major wrk2 multi-threading programming errors.




SUMMARY OF THE FACTS
  • The old wrk2 (10-second) tests showed that at 10k users G-WAN was 453 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 10k users G-WAN is   2,334 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 20k users G-WAN is   3,948 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 30k+ users G-WAN is infinitely faster than NGINX.

Even better: instead of the G-WAN performance drop after 10k users (wrongly) reported by wrk2, wrk3 now (rightly) shows that G-WAN performance NEVER drops (G-WAN RPS grow with the number of users, which is not the case for NGINX after 1k users).

This demonstrates that wrk3 is the bottleneck for G-WAN, and that therefore my upcoming benchmarks (run using the benchmarking tool built into G-WAN) will be even better than those from wrk3.

Any serious expert can confirm these points just by looking at the tests. My wrk2 source code fixes documented below also demonstrate that wrk and wrk2 cannot be trusted (due to their faulty code, their scores cannot reflect the performance of the "tested server").


In 2023-2024, I first used wrk (clearly written to favor NGINX and make all others fail), which takes forever to complete benchmarks with a fast server because wrk waits for all the server replies before sending new requests: if the server needs 10 seconds to complete the test and wrk is 500 times slower than the server, then wrk will need 500 * 10 seconds = 5,000 seconds = 1 hour 23 minutes to complete a... "10-second" test.

In late 2024, an engineer suggested the "slower but more reliable" wrk2, which stops at the specified time – instead of taking forever... and delivering extreme volatility inviting people to cherry-pick – a capacity that was acceptable to make available to NGINX, wrk's first user, but clearly not to G-WAN (as we will see, these double standards are a constant).

These 4 consecutive G-WAN wrk tests (ufw firewall ON, powersave CPU) illustrate wrk's erratic volatility (explained later):

./wrk -t5k -c5k "http://127.0.0.1:8080/100.html"
Running 10s test @ http://127.0.0.1:8080/100.html
  5000 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.59ms    2.45ms  96.87ms   60.26%
    Req/Sec   316.65    484.74    24.66k    91.62%
  27896487 requests in 10.22s, 8.60GB read
Requests/sec: 2730160.42
Transfer/sec:    861.82MB

./wrk -t5k -c5k "http://127.0.0.1:8080/100.html"
Running 10s test @ http://127.0.0.1:8080/100.html
  5000 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.20ms    1.95ms  56.34ms   65.67%
    Req/Sec   242.80    239.83    19.81k    92.76%
  89898562 requests in 10.37s, 27.71GB read
Requests/sec: 8673159.26 ......................... 3.18x higher!
Transfer/sec:      2.67GB

./wrk -t10k -c10k "http://127.0.0.1:8080/100.html"
Running 10s test @ http://127.0.0.1:8080/100.html
  10000 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.04ms   71.67ms   2.00s    99.23%
    Req/Sec   155.65    294.40    47.14k    97.57%
  49782723 requests in 10.57s, 15.35GB read
  Socket errors: connect 0, read 0, write 0, timeout 5349
Requests/sec: 4711830.98
Transfer/sec:      1.45GB

./wrk -t10k -c10k "http://127.0.0.1:8080/100.html"
Running 10s test @ http://127.0.0.1:8080/100.html
  10000 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.11ms    4.49ms 610.32ms   67.86%
    Req/Sec   130.23    175.12    26.26k    93.32%
  279233782 requests in 10.56s, 86.08GB read
Requests/sec: 26436985.15 ......................... 5.61x higher!
Transfer/sec:      8.15GB

So, in April 2025, I published new [1k-40k users] wrk2 benchmarks (G-WAN reaching 242m RPS at 10k users). But a few months later, I discovered that wrk2, freshly installed on new machines, was crashing at... 10k users.

This was odd because 10k users is the concurrency where G-WAN (at 242m RPS) is vaporizing NGINX and others (which top out below 1m RPS at 1k users). But I did not have time to fix wrk2, and I thought that writing a G-WAN-based benchmark would be a much better value proposition than fixing the slow, obscure and large code of wrk2 (5,316 lines of code).

Near September 2025, I noticed that an OS update had slowed down G-WAN from 242m RPS to 8m RPS (so I wrote the G-WAN cache to bypass a suddenly 'faulty' Linux kernel syscall – restoring G-WAN performance to 281m RPS at 10k users, with wrk2 again).

I thought I was safe from this point. But in April 2026 (1 year after the G-WAN/no-cache tests, and 6 months after the G-WAN/cache tests), I was told that:

  • "G-WAN wrk2 benchmarks are not honest because they last a mere 10 seconds" (the default wrk/wrk2 duration).

    At the Olympic Games, a 100m sprint is as "honest" as a 42 km marathon, and nobody would dare to pretend that a sprint winner is not faster than a marathon winner... or that a runner able to win both sprints and marathons (where NGINX does not shine) should be disqualified!

    On computers, longer test durations make server RPS (unlike application-dependent CPU and RAM usage) converge, due to the OS kernel becoming the bottleneck. Then, all servers deliver similar performance because you no longer test the server: you benchmark the OS kernel.

    So the interesting question is: why, in a benchmark supposedly designed to differentiate the fast HTTP servers from the slow ones, do some "experts" insist on disqualifying the sprint competition? Or the servers that win both the sprints and the marathons?

    The ones that will benefit from such a decision are... the slowest HTTP servers. When the "neutral experts" revert to fallacies, all discussion is vain.



  • "You should use another benchmark" – the fastest suggested was an obese and asthmatic 12 MB Rust binary called "Oha":

      oha -c 1 -z 10s -w "http://127.0.0.1:8080/100.html"
      Requests/sec: 85219.9506
      Requests/sec: 92941.4151 --no-tui
    
      oha -c 10000 -z 10s -w "http://127.0.0.1:8080/100.html"
      Requests/sec: 546039.7608
      Requests/sec: 581882.1891 --no-tui
      
      ----- on top of disastrous single-core and multi-core performance, the "Oha" Rust tool
      ----- proudly promotes an agenda that can hardly be associated with fairness:
    
      $ oha -h
      "-z 
          Duration of application to send requests:
          On HTTP/1, when the duration is reached, ongoing requests are aborted and counted as "aborted due to deadline".
          On HTTP/2, when the duration is reached, ongoing requests are waited."
       
     As the -w switch prevents sabotaging HTTP/1, why not make it the default like for HTTP/2? Double standards.


    Here again, if you use a slow benchmark tool, then the server can't reply faster than the requests are sent to it, and the benchmark tool is the bottleneck. Then, all servers deliver similar performance because you no longer test the server. You test the benchmark tool.

    It's shocking how many such "experts" resort to fallacies. There should be public policies to disqualify these recurring outright liars from any public/private educational, research, media, legal and judicial activities (the positions of influence, where they proliferate).



  • "Creating many threads can take so much time that wrk2 may leave no time for the actual benchmark". The following patch was provided, where stop_at is computed after start (wrk2 was creating stop_at before start and before the threads, excluding the event-loop and thread creation times from the actual benchmark time!):
 --- a/src/wrk.c
 +++ b/src/wrk.c
 @@ -122,7 +122,8 @@

      uint64_t connections = cfg.connections / cfg.threads;
      double throughput    = (double)cfg.rate / cfg.threads;
 -    uint64_t stop_at     = time_us() + (cfg.duration * 1000000);
 +    uint64_t start       = time_us();
 +    uint64_t stop_at     = start + (cfg.duration * 1000000);

      for (uint64_t i = 0; i < cfg.threads; i++) {
          thread *t = &threads[i];
 @@ -163,7 +164,6 @@
      printf("  %"PRIu64" threads and %"PRIu64" connections\n",
              cfg.threads, cfg.connections);

 -    uint64_t start    = time_us();
      uint64_t complete = 0;
      uint64_t bytes    = 0;
      errors errors     = { 0 };

Wow. Here wrk2 is far more wrong than wrk (despite claiming to be "more correct than wrk"). In comparison, wrk (which wrk2 is based on) creates worker threads and then its main() waits for the specified benchmark duration:

 uint64_t start = time_us(); // GPG: after creating scripts + event-loops + threads
 sleep(cfg.duration);        // GPG: high RPS volatility with many threads (scheduler, context-switches)
 stop = 1;                   // GPG: tell any remaining threads to stop now (incorrect)

While technically absolutely incorrect (threads independently start and stop at different times; due to scheduling and system load, main() may be late at signaling the end of the party, and threads embarked in event-loops will certainly be late at noticing it), the wrk code is much less wrong than what wrk2 felt the urge to do in an even worse manner.

But wrk and wrk2, despite being well-promoted and widely praised, are not exactly what I would call champions:

Depending on the memory footprint of the tested HTTP server, the Linux kernel OOM (Out of Memory) kill-switch "Terminates" the benchmark tools wrk/2/3 for using 190+ GB at 30-50k users on my 192 GB RAM machine.

In contrast, G-WAN, which is doing many more things than wrk2, consumes around 700 MB of RAM at 40k users (a client being simpler than a server, this fact alone is revealing about how much expertise and care is dedicated to benchmark tools by the best-funded "scalability and benchmark experts"):

G-WAN/cache starting memory footprint: 1.7 MB (with www storing 745 files, 171 MB)

./gwan -p
- running 'gwan[*]' process(es) (use 'sudo ./gwan -p' if pathnames are missing):

   PID    PPID  THRDS   %CPU      VIRT       RSS      SHRD  EXE
338766    4565      2    0.0   25.0 MB    1.7 MB    0.0 KB  /home/gwan/gwan :8080

G-WAN/cache ending memory footprint: 700 MB (with www storing 745 files, 171 MB)

./gwan -p
- running 'gwan[*]' process(es) (use 'sudo ./gwan -p' if pathnames are missing):

   PID    PPID  THRDS   %CPU      VIRT       RSS      SHRD  EXE
338766    4565  40001  376.8    3.8 GB  700.1 MB    0.0 KB  /home/gwan/gwan :8080

That's why I felt the need to make my own benchmark, which will be integrated into and published with G-WAN. With it, it will be possible to benchmark high concurrencies on miniPCs with 4 GB of RAM. A much (much) welcome change for the unfunded crowds in a world with ever-rising acquisition and operating costs (hardware, energy, floor space, etc.).

Nevertheless, having promised to investigate the wrk2 issue further, I discovered that the situation was much worse than presented, as the proposed patch simply ignored the elephant in the room:

(1) wrk2's thread calibration takes as much time as the benchmark itself (default for both: 10 seconds – the benchmark duration can be specified on the command-line... but the calibration duration is silently extended: calibrate_delay = 10_seconds + (thread->connections * 5), total nonsense for all concurrencies, carefully hidden with macros stored in a dedicated file!).

(2) wrk2's main() sets up a stop_at time before creating the threads and a start time after creating and calibrating the threads, so benchmark_effective_duration = benchmark_specified_duration - calibration_duration
(what could possibly go wrong in wonderland, right?). Note: after some years wrk ditched calibration, but wrk2 never did.

(3) wrk2's main() computes the RPS as req_per_s = complete / runtime_s, a division that effectively becomes a multiplication (leading to bogus RPS values) when the actual benchmark time (default: 10 seconds) is reduced by the calibration time (default: 10 seconds) to less than 1 second (a textbook parallelization bug).

This deadly issue happens most of the time because the calibration time and actual benchmarking time are nearly identical!

The obvious fix was to do this in wrk.c, not in main() but rather in the threads' function (both for wrk and wrk2):

 thread->start   = time_us();
 thread->stop_at = thread->start + (cfg.duration * 1e6); // GPG: <= THE FIX
 aeMain(loop); // GPG: => the actual benchmark starts here, AFTER thread calibration was done

Now, we tell every thread to run for (at least) the user-specified time. wrk2 benchmarks will last longer than before because the thread calibration time will no longer be subtracted from the thread benchmarking execution time (the two now add up).

In real life, not all client threads will start and end at the same time, making benchmarks last even longer (than the default duration, or the one specified on the command-line). Also, the fairy tale of measuring latencies is a scam since the wrk2 client is massively slower than any decent server, with CPU-starved "ready" connections queuing in their event-based loop and waiting for their turn to get some CPU cycles.

This by-design issue generates yet another "Head of Line" (HoL) problem, especially with slow requests (databases, sub-requests, dynamic contents) because 1 single slow request will halt the progress of the event-queue for quite a while (hence the use of more client connections dedicated to these tasks, in a vain attempt to avoid blocking all the other "ready" connections for too long).

It's crazy how bad architectural choices (event-based queues) cumulate detrimental consequences that themselves generate more complexity leading to more bad choices (FastCGI). G-WAN did not attempt to resolve this mess, it has avoided it, hence its much higher scalability (especially for dynamic contents).

But since the starting time and execution time are different for each thread, we can't calculate the RPS in main() like wrk and wrk2 (actually wrk2 was doing much worse by swapping start and stop) have been doing it since 2012: by taking the start of the first thread and the end of the last one (yet another unpardonable parallelization bug).

Doing so is necessarily false (due to OS task and thread scheduling, background processes, etc., main() and the worker threads are independent: they do not start and stop at the same time, nor do they all have the same lifespan) – that's basic parallelism synchronization, a discipline standardized with the 1995 POSIX threads specification. In 2026, 30 years later, there is no excuse for doing it wrong by-design to such an extent... in a tool supposedly benchmarking high-performance multi-threaded servers!

Instead, the RPS must be accounted for in each thread – which in turn will contribute to report the final server performance (in RPS) more accurately, since all the thread execution durations more exactly match the specified benchmark time (yet the overhead of the wrk/2/3 architecture remains, so (1) it reports its own latencies rather than the servers', and (2) it is the bottleneck with G-WAN):

 thread->start   = time_us();
 thread->stop_at = thread->start + (cfg.duration * 1e6); // GPG: <= THE FIX
 aeMain(loop); // GPG: => the actual benchmark starts here, AFTER thread calibration was done
 thread->stop_at = time_us(); // GPG: save the REAL (not planned) thread exit time

Now, to understand what's next, a distinction must be made between the wall clock and the thread CPU time: the former is what you see on the office clock, and the latter is the amount of CPU time given by the OS scheduler to each thread (threads get time slices to emulate parallelism on non-realtime OSes, where many more programs run than there are available CPUs or Cores).

IF (1) the number of threads is less than or equal to the number of CPU Cores, (2) the system is only running the benchmark tool and the tested server, (3) each thread is pinned to a different CPU Core, and (4) threads don't share resources requiring locks or CPU cache reloads, THEN each thread should execute in parallel and get the same amount of CPU time as shown by the wall clock. In theory, it's just as if you had as many computers as you have CPU Cores.

In practice, modern desktop and server OSes are not real-time OSes, and they run many tasks in the background rather than just the benchmark tool and the tested server. As a result, even with the 4 code/design conditions listed above, some threads will be deprived of CPU time and will perform worse than other, luckier, threads.

And if only some (or none) of the ideal conditions described above are met, then the CPU time allocated to each thread is reduced immensely, to the point where – for the badly designed programs – you cannot reasonably predict the outcomes (below, in green, SLIMalloc is the only memory allocator to be constant, whatever the system load):

How is it possible for SLIMalloc to escape the fate of all the other memory allocators? Even in the last 2 charts (with heavy erratic background tasks), SLIMalloc is obviously slower than on a system not running background tasks – but SLIMalloc is the only one able to keep a constant execution time.

Well, that's a conjunction of the good multi-thread programming practices (listed above in the highlighted IF/THEN) and a low computational overhead so that each CPU time slice given by the OS scheduler to SLIMalloc (or G-WAN) benefits more to it than to its (unfortunate, badly-designed) competitors.

And, guess what, wrk/wrk2 enjoy none of these refinements, hence the highly volatile execution times and the bogus RPS and latencies "measured by an elite client tool" (which actually reveal the client's defects, when it is slower than the tested server). Inheriting its architecture from its predecessors, even wrk3 is slow (to the point of becoming the bottleneck with G-WAN), but at least it correctly reports its own client execution times (deceptively presented by the wrk2 author as the server latencies).

So, even before testing the new, "thread-lifespan patched" wrk3 we can tell that it will now show, both for G-WAN and NGINX:

  • higher RPS for low concurrencies, because the wall clock is almost equal to the CPU time allocated to each thread for the benchmark (unless you start heavy programs like video transcoding during the benchmark).


  • lower RPS for high concurrencies, because the CPU time allocated to each thread is much shorter than the wall clock during the benchmark (due to the overhead of the OS scheduler, context-switches, syscalls, locks, event-queue latencies, etc.).

Add the many bottlenecks introduced by the wrk/2/3 architecture and you can understand why performance is not linear with the number of involved threads, even when that number is less than or equal to the number of CPU Cores (and even when testing a fast and scalable server like G-WAN).

Proof: the G-WAN RPS always grow with the concurrencies (while, past 1k users, the opposite is true for NGINX).

After having modified wrk2 to do this properly... wrk2 reported even more bogus thread benchmark lifespans, incorrectly claiming that G-WAN delivers 469m RPS at 10k users. How is that possible?

(4) Unfortunately, wrk2 added to wrk yet another deadly issue: it stops the worker threads far before the planned time, and for no obvious reason. The potential causes are multiple: broken connections, event-loop errors, signals, and even more bugs in all of these organs (hence, probably, the loss of options for end-users trying to make sense out of the resulting incoherence).

This explains the erratic performance, and the "elegant bypass" of the author (who decided to bury the problem rather than resolve it by picking out of his hat, as seen previously, a nonsensical thread lifespan computed from main() without any consideration for the thread-calibration overhead), generating the dire consequences exposed here in all their splendor:

./wrk2 -d10s -t3 -c3 -R100m "http://127.0.0.1:8080/100.html"
Created 3 event-loop(s) in 0.000 seconds
Created 3 thread(s)     in 0.000 seconds
Running 10s test @ http://127.0.0.1:8080/100.html
  3 threads and 3 connections
- thread #? PLANNED start: 1,776,343,474.613 sec, stop:1,776,343,484.613 sec, duration:10.000 sec, cfg.dur:10
- thread #? PLANNED start: 1,776,343,474.613 sec, stop:1,776,343,484.613 sec, duration:10.000 sec, cfg.dur:10
- thread #? PLANNED start: 1,776,343,474.613 sec, stop:1,776,343,484.613 sec, duration:10.000 sec, cfg.dur:10
- thread #0 benchmark time: 0.018 sec (18,200 usec)
- thread #1 benchmark time: 0.029 sec (29,228 usec)
- thread #2 benchmark time: 0.034 sec (34,120 usec)

Note that this 4th wrk2 bug is visible with a mere 3 threads. It means that every RPS figure reported by wrk2 is wrong. Maybe less visibly wrong than with more threads, but clearly based on by-design incorrectly-calculated thread lifespans (so all the wrk2 tests ever made since 2012 are totally wrong: they bear no relationship to the actual performance of the "measured server"; the only involved criterion is by how much the division became a multiplication – preventing overflows is a basic programming skill).

No wonder, then, that the wrk2 results may be nonsensical (even after our previous much-needed patches!): a benchmark supposedly lasting 10 seconds (or 10 minutes) is erroneously reported as stopping before the worker threads have been running for even 1 second... turning the RPS division into a multiplication: req_per_s = complete / runtime_s;

Now comes the real problem, because what we see here should have never, ever happened. Something is deeply broken, somewhere, in that horrible mess of 5,316 lines of code called wrk2. And your mission, if you accept it, is to find it (well, nobody has ever done this in the past 14 years – not even the wrk2 author, nor the team handling wrk2 bugs and incidents on GitHub).

I am not paid by anyone. My products and papers have been censored for 3 decades by the friends of the above geniuses. Yet I have done the work that they failed to do – for their own benchmark tools. And I have documented the guilty party: after fixing the first 3 bugs, disabling the defective organ (presented as a "major achievement", a recurring pattern it seems) finally resolves the problem:

  //aeCreateTimeEvent(loop, calibrate_delay, calibrate, thread, NULL); // GPG: disable calibration
  aeCreateTimeEvent(loop, timeout_delay, check_timeouts, thread, NULL);

The tiny thread-lifespan issue is resolved (remember, I set up and check all thread benchmark times from within each thread):

./wrk2 -d10s -t3 -c3 -R100m "http://127.0.0.1:8080/100.html"
Created 3 event-loop(s) in 0.000 seconds
Created 3 thread(s)     in 0.000 seconds
Running 10s test @ http://127.0.0.1:8080/100.html
  3 threads and 3 connections
- thread #? PLANNED  start: 1,776,344,751.151 sec, stop:1,776,344,761.151 sec, duration:10.000 sec, cfg.dur:10
- thread #? PLANNED  start: 1,776,344,751.151 sec, stop:1,776,344,761.151 sec, duration:10.000 sec, cfg.dur:10
- thread #? PLANNED  start: 1,776,344,751.151 sec, stop:1,776,344,761.151 sec, duration:10.000 sec, cfg.dur:10
- thread #? ACTUAL    STOP: 1,776,344,761.151 sec, thread lifespan: 10.000 sec
- thread #? ACTUAL    STOP: 1,776,344,761.151 sec, thread lifespan: 10.000 sec
- thread #? ACTUAL    STOP: 1,776,344,761.151 sec, thread lifespan: 10.000 sec
- thread #0 benchmark time: 10.000 sec (10,000,005 usec)
- thread #1 benchmark time: 10.000 sec (10,000,006 usec)
- thread #2 benchmark time: 10.000 sec (10,000,004 usec)

See? By removing the "Crown Jewels" of wrk2 (its pointless and broken yet celebrated calibration)... and fixing its 3 other deadly bugs, you get back a usable tool. The advantage of using wrk2 is that it stops, more or less at the specified time, instead of taking forever to complete the benchmark, as wrk does (when it is slower than the tested HTTP server).

wrk2 was first published in 2012 by Gil Tene. In 2026, these 4 major by-design flaws are 14 years old – for something presented as "A constant throughput, correct latency recording variant of wrk".

Yet wrk, created by Will Glozer, "only" miscalculates the RPS by 3 orders of magnitude (wrk2 does much worse), and removed its calibration, worsening its (still present) benchmark-time flaw. Its remaining gap – beyond its bogus execution time and its slow architecture – is that it does not stop at the specified time when a server (like G-WAN) is faster than the benchmark tool, something some may have interpreted as a G-WAN flaw: "Oh, you see, this server is so slow that the test lasts forever!" while the opposite was actually true: wrk is slow, not G-WAN.

So it would be very interesting to hear why Gil Tene felt the need to introduce 3 major bugs and his (so badly designed) thread-calibration time in wrk2 (inflating the bogus wrk RPS by 2 additional orders of magnitude with the latest version of G-WAN), to the point where it completely defeats the purpose of benchmarking... while claiming that wrk2 is "more exact" than wrk!

After examining the wrk2 source code, there are very (very) strange things like variables and functions implemented but never used, redundant slow function calls, and... purposely misleading messages like "Initialised %d threads in %.3f ms" while the timing was for event-loop creation (thread creation was neither timed nor reported) – here we are not talking about a skills gap: fairness is absent, for decades (and in thousands of people rather than just the wrk2 author).

The source code of wrk2 would deserve a complete rewrite (if it were not so badly designed in the first place). Its only purpose seems to be to be as slow, faulty and inefficient as possible... while carefully hiding its sins under pointless layers, redundancy, chaos and complexity (as in "a haystack is required to hide a needle")... while eventually boosting the benchmark scores via unpardonable, elementary thread-synchronization programming mistakes.

More generally, using event-queues for high-latency networks, low concurrencies and mostly-idle clients works (slowly), but this model will quickly show its limits on localhost (or on fast networks), and generates VERY HIGH latencies ("ready" queued connections are starved while only one is processed at a time), while higher concurrencies (more users) will hit the small 2-second wrk2 timeouts (yet another by-design benchmark tool issue):

  #define SOCKET_TIMEOUT_MS     60000 // GPG: 1 minute, was 2000 (2 seconds)
  #define TIMEOUT_INTERVAL_MS   60000 // GPG: 1 minute, was 2000 (2 seconds)

In the same spirit, calibrate_delay = 10_seconds + (thread->connections * 5); is absolute nonsense, especially with high concurrencies (and has disastrous consequences when subtracted from the actual benchmark time, as wrk2, very wrongly, does).

Last but not least, the calibration disaster should have been made optional by its author – at least, without it, the wrk2 bugs would have been easier to find and fix, and wrk2 would have been useful (instead of a major nuisance for decades and myriads of end-users, all over the world).

Either these "widely praised scalability experts" are not familiar with the concepts of multi-threading, arithmetic overflow and compiler warnings... or they knew very well what they were doing. In both cases, their tools are not trustworthy – and the fact that nobody has felt the need to correct them (in 14 years, despite a dedicated team!) reveals how serious the whole self-congratulating cohort is (the one that feels the need to censor, denigrate and sabotage anyone doing better).

I have quickly corrected the most deadly bugs, added some useful comments and printed messages, added thousands separators for the readability of RPS and timings, a crash handler to show where and why the animal fails, etc., but I don't see the point of wasting more time on the outrageously amateurish wrk2 codebase. Stating "amateurish" is much nicer than "criminal" because there are many hints that all this mediocrity and these bad design choices were planned rather than merely due to utter incompetence: one cannot at the same time do difficult things and fail miserably at the most basic things... that are critical for the whole to operate correctly.

So much for the "plausible deniability" too often presented as an excuse by the serial wrongdoers: "don't attribute to maliciousness what can be attributed to stupidity". Yes, right. Strangely, the "stupid guys" are reaping the bounty every single time, by censoring, denigrating and sabotaging anyone doing better... and they only make mistakes that actually benefit them. Stupidity is supposedly enjoying a more random distribution than unleashed greed enjoying infinite impunity.

Reminder: wrk was clearly written with several biases that favor NGINX. wrk2 enthusiastically went even further in the promotion of NGINX by resorting to 4 unpardonable multi-threading errors. G-WAN, while faster than all others since 2009, was constantly censored, denigrated and sabotaged. On one side, there's relentless funding and promotion; on the other side, 18 years of constant sabotage. Call this "accidental" if you can.

SO, SINCE 2012 MOST WRK AND WRK2 BENCHMARKS ARE BOGUS – AND NOBODY HAS EVER NOTICED... IN 14 YEARS!

After fixing wrk2's latest available source code and recompiling it, I quickly tested it and... it crashed at 10k users. Wow, nobody seems to have addressed the crash I experienced a year ago.

I re-downloaded wrk2 from several sources to compare it to the version I downloaded in October 2024. In the 2024 source code, the RPS flaws were already there... but at least this 2024 version of wrk2 (published before the April 2025 G-WAN benchmarks) had no problem testing up to 40k users without crashing.

In the newest versions of wrk2 available on GitHub, in the Ubuntu repositories, etc., the Makefile has also been heavily rewritten (so they have time for this, but not to make better tools, or to correct the faulty ones they distribute). The resulting executable file is now 10 times smaller than before because it no longer embeds the libraries it relies on – so the executable will fail if copied to another machine, due to GNU glibc incompatibilities and shared-library versioning – and all these new versions crash at... 10k users!

If someone wanted to sabotage the tool that has allowed G-WAN (or any reasonably performing server) to shine (and that has revealed the defects of NGINX and all other servers), this is exactly what would have had to be done... and it was done, by everyone at the same time, as if sabotaging wrk2 even further was an emergency.

But wrk/2 caused even larger damage: sabotaging the instruments of measurement prevents R&D from knowing whether it makes progress, developers, hosting companies and Cloud operators from forecasting resource provisioning, and end users from choosing the HTTP server that is more efficient than another. It is reminiscent of the damage caused by the poor quality of academic papers, and by the promotion of nonsensical new programming languages that are much slower, less safe, and ever more complex than the C language used to create all of them:

  • Rust        (14 years, since 2012)    33 CVE records (  2.36 per year, "memory-safe" language)
  • Go          (17 years, since 2009)   339 CVE records ( 19.94 per year, "memory-safe" language)
  • .NET        (24 years, since 2002) 5,433 CVE records (226.37 per year, "memory-safe" language)
  • JavaScript  (31 years, since 1995) 8,587 CVE records (277.00 per year, "memory-safe" language)
  • Java        (31 years, since 1995) 3,781 CVE records (121.96 per year, "memory-safe" language)
  • GLibC       (39 years, since 1987)   224 CVE records (  5.74 per year, "memory-UNsafe" language)

The C language would be much safer than the Rust language had it adopted SLIMalloc, censored by the U.S. DARPA. Destroying the capacity of the whole planet to think correctly, to experiment, to understand the foundational concepts of a technical matter, is a crime – not an improvement.

Some among us have spent quite a lot of time and money designing new ways to "dumb down" all others – to dominate them. They use disloyal means aiming to make all of us dependent on them rather than striving, so they provide things like "assistants" to:

  • drive cars: then, you press buttons instead of interacting with the machine (and learning to find the optimal way of doing things, which involves many of your skills). Pressing buttons deprives you of this understanding of how mechanical parts work together, and you miss the opportunity to acquire skills (while the IoT operator can, at the press of a button, send your car and its passengers into a wall at full speed – in our troubled times, the ability to execute targeted or mass murders without a smoking gun has probably helped to justify these gargantuan investments);
  • educate children: giving our kids to total strangers who are told by a central committee what they should know and what they should ignore certainly helps to explain the ever-falling quality of education seen in countries where systematic bias and censorship are enforced by governments;
  • program computers: as with autonomous cars, programming languages that provide a function for every possible need may look handy (the only skill you need is to find the function name in a humongous collection of libraries) – but that's also the most certain way to limit what you can actually do (in terms of ease of use, performance, security, and features). As G-WAN illustrates, there's a gigantic gap between what is possible (and desirable) and what is taught in the best universities and sold by the GAFAM;
  • AI chatbots: these cause cognitive offloading, as AI users rely on the technology for problem-solving and decision-making rather than engaging in independent critical thinking. This leads to skill erosion as people no longer exercise their own capacity and can't diverge from the "ready to use" opinions delivered by the AI operators, leading to an ever-growing dependence on AI tools and ever-disappearing expertise and judgment.

And here I did not even touch the "sweet spot" of these bandits: when "experts", government officials, teachers, the media, judges and attorneys, or engineers rely on these tools without verifying the accuracy of what they produce (why use AI if you have to check the facts independently, right?), they become the unsuspecting victims of manipulation:

  • Unverified data: AI tools often generate plausible but incorrect outputs, as has been seen on a massive scale in legal proceedings and scientific papers where fabricated evidence or inaccurate calculations were introduced by the AI "assistants" (which belong to private companies that may sell this ability to influence or betray, on a per-case basis or at mass scale, punctually or constantly).
  • Erosion of expertise: outsourcing complex tasks to AI erodes the increasingly unexercised skills needed to critically evaluate or challenge evidence, leading "experts" to become mere parrots of an AI itself programmed by a private company.
  • Unaccountability: blind trust in an AI falsely presented as "knowing better" shifts responsibility away from its users, creating a dangerous precedent where errors (caused by AI operators that will resort to blaming their creation to avoid being jailed) are overlooked or dismissed – even in the case of outright fraud!

I am sure there will be people claiming that all of these coordinated acts are "benign, accidental mistakes" – but I hardly see why and how wrk2 crashing at 10k+ users was a necessary feature for a multicore benchmark tool so unanimously considered and celebrated as the "best of its class"... despite 4 unpardonable multi-threading programming bugs defeating its official purpose.

If there's no outright fraud here, I can't understand why it is so difficult to find a reasonably designed, performing and reliable benchmark tool for servers: all other tools, including the most recent ones made in Go and Rust, are even slower and less capable than wrk2... so ever-degrading quality and spiraling budgets are presented as "the inescapable march of progress"! Cui bono?

I have named this fixed version of wrk2 (without the 4 bogus by-design RPS flaws) wrk3; it doesn't crash at 10k users (download it here). It's also easier to compile: (1) it comes with its dependencies and (2) its Makefile uses them.

wrk3 gives "exact numbers" (that is, artificially slow due to its by-design bottlenecks, but far less volatile and inflated than wrk2 and wrk): G-WAN now tops at 3.4m RPS at 10k users and 15m RPS at 40k users on the same machine where the same (relatively old) version of G-WAN topped at 281m RPS at 10k users and 63m RPS at 40k users... with the original (now known to be faulty) wrk2 (and 500+m RPS with NGINX's wrk) – the very same benchmark tools published (without examination?) by the main Linux distributions, countless "scalability experts" and many hosting web sites.

But with 10s + 30s + 3m + 30m tests (the sprints, the middle-distance races and the marathons):

  • The old wrk2 (10-second) tests showed that at 10k users G-WAN was 453 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 10k users G-WAN is   2,334 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 20k users G-WAN is   3,948 times faster than NGINX.
  • The new wrk3 (10s to 30m) tests show that at 30k+ users G-WAN is infinitely faster than NGINX.

Even better, instead of the G-WAN performance drop after 10k users (wrongly) reported by wrk2, wrk3 now (rightly) shows that G-WAN performance NEVER drops (G-WAN's RPS grows with the number of users, which is not the case for NGINX after 1k users).

The 2026 G-WAN version is now even faster than the 2025 version (very badly) tested here, but that will be for another blog post, where we will compare G-WAN server benchmarks made by wrk3 (wrk2 being too broken to produce useful benchmarks) and the integrated G-WAN benchmark tool.

I share wrk3 to let people test their own work and G-WAN... because we all deserve better tools than the ones provided by the best-funded and best-promoted "experts" of this "big-tech" industry. Wake up: small is beautiful (and reliable, maybe because it can't afford to buy favorable media, fiscal and legal exposure).