Well, yes, the GPU is definitely not the bottleneck.
But the performance is a bit of a mystery to me too.
16-bit float is really a special case: CPUs don't support that data type natively yet. (Alder Lake actually has AVX-512 FP16 support in its P-cores, but since the E-cores don't have AVX-512 and current OS architecture doesn't allow mixed instruction sets, it is disabled. Which exact AVX-512 subsets Ryzen 7xxx will support is still to be revealed.) So what happens is that FP16 just gets converted to 32-bit float and then uses that code path.
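For illustration, that widening step is cheap but not free. Here is a minimal scalar sketch of IEEE binary16 → binary32 conversion, the kind of thing the F16C `vcvtph2ps` instruction does in hardware and a software fallback has to do per channel (this is a generic textbook conversion, not Krita's actual code):

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>
#include <cmath>

// Convert one IEEE 754 binary16 value (stored in a uint16_t) to float.
float halfToFloat(uint16_t h) {
    uint32_t sign = (h >> 15) & 1u;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign << 31;                         // signed zero
        } else {
            // Subnormal half: renormalize into a normal float.
            int e = -1;
            do { mant <<= 1; ++e; } while (!(mant & 0x400u));
            mant &= 0x3FFu;
            bits = (sign << 31) | ((127u - 15u - e) << 23) | (mant << 13);
        }
    } else if (exp == 31) {
        bits = (sign << 31) | 0x7F800000u | (mant << 13); // inf / NaN
    } else {
        bits = (sign << 31) | ((exp - 15u + 127u) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);                  // bit-exact reinterpret
    return f;
}
```

Doing this per channel for every pixel (and the reverse conversion on the way back) is part of why the FP16 path doesn't come for free even though the actual math runs as FP32.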
But I did notice that even my weak CPU with only 4 cores/threads can't sustain 100% CPU usage with brushes that should multithread well when using 16-bit int.
There have been some obvious code reasons for low performance: blending modes were not SIMD-optimized for 16-bit, and neither was some brush engine code.
But I implemented the former, and (most of?) the latter was addressed by other people in the meantime. While the synthetic brush benchmark shows great improvements, actual Krita usage didn't get much faster for me.
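For context, this is the sort of per-channel arithmetic those blending modes boil down to. A scalar sketch of a 16-bit "multiply" blend with exact rounding, i.e. the operation a SIMD version performs on many channels at once (function names are mine, not Krita's):

```cpp
#include <cstdint>
#include <cassert>

// Multiply two 16-bit channel values interpreted as fractions of 65535,
// with exact rounding: returns round(a * b / 65535) without an actual
// division, using the classic add-half-then-fold trick.
uint16_t mulU16(uint16_t a, uint16_t b) {
    uint32_t t = uint32_t(a) * b + 0x8000u;   // + half the divisor, to round
    return uint16_t((t + (t >> 16)) >> 16);   // exact division by 65535
}

// Straight alpha-over of one channel: src covers dst with weight `alpha`.
uint16_t blendOver(uint16_t src, uint16_t dst, uint16_t alpha) {
    return mulU16(src, alpha) + mulU16(dst, uint16_t(65535u - alpha));
}
```

The point of SIMD-optimizing this is that a single AVX2 register holds 16 such channels, so the vectorized loop does the same work in a fraction of the instructions; but as noted above, that only shows up in real brush strokes if the rest of the pipeline can keep the cores fed.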
My current suspicion is that the display path, where the 16-bit image projection gets uploaded to OpenGL textures, is the main issue: that part isn't multithreaded, and it may be so slow that the brush stroke threads run out of work at every canvas update. But I haven't investigated that yet.
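To put a rough (unmeasured, back-of-envelope) number on that suspicion: a full-canvas update of a 16-bit-per-channel RGBA projection moves width × height × 4 channels × 2 bytes through that single-threaded path:

```cpp
#include <cstddef>

// Bytes a full canvas update pushes toward the GPU for a projection with
// 4 channels (RGBA) at 2 bytes (16 bits) each.
std::size_t projectionUploadBytes(std::size_t width, std::size_t height) {
    return width * height * 4 * 2;
}
// e.g. a 4096x2160 canvas works out to roughly 70.8 MB per full update.
```

Even if only dirty tiles are re-uploaded per stroke update, a serial convert-and-upload of that much data at typical single-threaded throughput could easily eat milliseconds per update, which would explain the brush threads idling.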