how much NAT needs to be done before that gain is realized
In a provider network, per-device latency should be kept very small (ideally in the microsecond range) and, even more importantly, constant, so that no jitter is added to packets.
Hardware processing (an FPGA or ASIC, for example) can keep latency constant and very low; for example, all packets on all ports could be routed in a couple of clock cycles.
Another advantage of hardware processing is that, regardless of what you do at the software layer, the hardware path should stay bug-free (as long as the hardware logic itself is bug-free).
Multicore is a lot better than single-core processing, but it is still serial processing. This means that latency cannot be constant, and bugs can be introduced more easily.
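To make the point about constant latency and jitter concrete, here is a minimal sketch in Python. It treats jitter as the mean absolute difference between the latencies of consecutive packets, which is one common simplified definition and not the only one. The function name `mean_jitter` and the latency figures are made up for illustration; they are not measurements from any real device.

```python
from statistics import mean

def mean_jitter(latencies_us):
    """Mean absolute difference between consecutive per-packet latencies (in us)."""
    if len(latencies_us) < 2:
        return 0.0  # jitter is undefined for fewer than two packets; report 0
    diffs = [abs(b - a) for a, b in zip(latencies_us, latencies_us[1:])]
    return mean(diffs)

# Hypothetical per-packet latencies in microseconds (illustrative values only).
constant_path = [2.0, 2.0, 2.0, 2.0, 2.0]   # fixed-latency forwarding path
variable_path = [2.0, 9.0, 3.0, 12.0, 2.0]  # latency depends on load/scheduling

print(mean_jitter(constant_path))  # 0.0
print(mean_jitter(variable_path))  # 8.0
```

A path whose latency never changes adds zero jitter no matter how large that latency is, while a path with a similar best-case latency but variable processing time adds jitter on every packet.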