Author: Vincent Diepeveen
Date: 06:09:09 12/25/05
On December 23, 2005 at 16:25:16, William Kerr wrote:

>Hi All,
>
>Just some results from using Microsofts free Visual C++ Toolkit 2003. I have a
>test programm called TREE.cpp which is a tree searching program that uses
>alpha/beta, killer heuristic and iterative deepening to search a tree. Very
>simular to chess. In fact I also uses this program in my chess program. I used
>to compile this using Microsoft VC++ 6.0 default settings. I then used
>Microsofts free Visual C++ Toolkit 2003 to compile TREE.cpp and got a 3 to 1
>speed improvement. When using Visual C++ Toolkit 2003 I compile for speed for
>Intel/AMD.
>
>One interesting observation is the number of clock ticks per node. A Intel P4
>3.4 GHz takes 91 clock ticks per node, a 1.73 GHz Centrino takes 40 clock ticks
>per node. The Centrino executes 43,186,000 nodes per second whereas the P4
>3.4GHz executes only 37,325,000 nodes per second. By comparison, a AMD XP3000+
>running at 2.16Ghz executes 44,868,000 nodes per second with 48 clock ticks per
>node.
>
>Something to ponder
>Bill

It is indeed very well known that an outdated k7 is already faster than a P4 Prescott.

It really matters which core of the P4 you use. The old P4EE 3.2GHz had a far better cache subsystem than the cheaper and newer P4 3.4GHz.

Those cpu's are very complex, so the few things I write down here are not the only reasons why the k7 and k8 are faster. The k8 is the easiest to describe, as it is the newer one.

The internal bandwidth of the k8 is vastly superior to that of the P4. The important conclusion is that its caches are bigger and faster.

Now for such a simple program the size of the L1 probably doesn't matter that much yet, but for modern chess programs it matters a lot. The P4 in its P4EE version has an 8KB L1 data cache, whereas the Opteron (k8) has a 64KB L1 data cache.

The P4 Prescott 3.4GHz has a 16KB L1 cache, but getting 1 element out of L1 cache eats 4 cycles, and you can only get 1 element out of L1 cache at a time.
The K8 can get 2 elements out of L1 cache simultaneously, and that eats 3 cycles.

Then branch prediction. Now for a simple chess program this will matter, but if we look at the world top 50 it sure matters a hell of a lot more. The branch misprediction penalty on the P4 is 20 cycles as a minimum, more like 30+ cycles on average. On the k8 it is not clear what it is, but probably less than that 30+ cycles on average.

To keep branches from getting this 'death penalty', there is sophisticated logic in the processor, called the branch prediction unit. In Intel jargon: BTB (branch target buffer). Its size in the P3 is 512 entries if I remember well. In the K7 it is 2048 entries. In the K8 it is even 16384 entries. You know, I didn't even look up its size for the P4. Something real tiny by today's standards. My blind guess is 2048 entries.

Now we move to the L2 cache. The L2 cache of the k8 is 1024KB. Whether it's 512KB or 1024KB doesn't matter much for chess, but what really matters a lot is the SPEED at which it can serve. The k7 has a 20-cycle L2 and you can call that WEAK. The K8 has a 13-cycle L2 and that's real real good. Note that there is always some variation, so it's nearly never *exactly* 13 cycles. It depends upon what you do and how. The P4 is more like 30+ cycles there for L2 cache.

What the size of the L3 cache is, or whether it is there at all, you know, that's irrelevant: 95+% of all reads you do go to L1 cache anyway. The rest goes to L2.

Now let's touch the next subject: instruction cache and decoding. You know, it's real sad I have to mention this. This is a big weakness of all Intel cpu's, including the Itanium2. The k8 also keeps instructions in its L2 cache. None of the Intel cpu's AFAIK have this. You know, somewhere at the end of next year Montecito should have this, if they still go on producing that Itanium2 cpu. AMD has had this since 2003. It's not a new invention or anything like that. It's a bitter necessity, even for huge chess programs. The P4 has a tiny trace cache. It can decode only 1 instruction a cycle.
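To put the L1 and L2 numbers above together, here is a rough sketch in Python of the average cost of a read. It is a simplification, not a measurement: it assumes every read that misses L1 is served by L2 at the quoted L2 latency, it takes the "95+%" literally as a 95% L1 hit rate, it uses 30 cycles as a lower bound for the P4's "30+", and it ignores main memory completely. The function name is made up for this illustration.

```python
def avg_read_cycles(l1_cycles, l2_cycles, l1_hit_rate=0.95):
    """Average cycles per read if every L1 miss is served by L2.

    Simplified model: main memory and overlapping of reads are ignored.
    """
    return l1_hit_rate * l1_cycles + (1.0 - l1_hit_rate) * l2_cycles


# Latencies as quoted in this post (P4 L2 taken as 30, the low end of "30+").
p4_prescott = avg_read_cycles(l1_cycles=4, l2_cycles=30)
k8 = avg_read_cycles(l1_cycles=3, l2_cycles=13)

print(f"P4 Prescott: {p4_prescott:.2f} cycles per read")
print(f"K8:          {k8:.2f} cycles per read")
print(f"ratio:       {p4_prescott / k8:.2f}")
```

Under these assumptions the P4 Prescott comes out at about 5.3 cycles per read against about 3.5 for the K8, and that is before counting that the K8 can do 2 reads at a time.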
You know, you really must see the P4 as a processor which can on average execute 1 instruction a cycle at 3.4GHz. If you have such huge penalties everywhere for L1 and L2, and I didn't even mention memory yet, then what are we talking about, you know?

What is real bad about Intel is their habit of just putting nonsense on paper. For example, on paper the P4 has a 2-cycle L1 cache, with a very cleverly worded mention of "1 extra cycle". I already mentioned that Prescott has a 4-cycle L1. Intel didn't tell us that. Testers figured that out by simply benchmarking the Prescott core.

I remember someone, who now runs his chess program on a big supercomputer, predicting in advance that his program would get 1 million nodes a second hands down on that 1.6GHz Itanium2.

"4 integer execution units, 6 instructions a cycle"

You know, that's paper. Arturo Ochoa quoted me: "Paper supports everything"

That's Intel. On paper they are the greatest. In reality it's not so great at the moment.

However, I believe that in areas like hardware, things move along a sine curve. At this moment AMD is faster. With the next cpu release at the end of 2006, Intel is perhaps faster again, as the k8 will be outdated by then.

Of course, please realize that the k8 is a high-end cpu. From Intel only the Itanium2 is a high-end cpu. All P4's and Pentium-M's, also the Pentium-M's in Xeon form in future, are not real high-end.

The Pentium-M Xeons as announced might be fast as a single cpu, but please realize how ugly the L2 sharing is. If you need a fast L2 cache, then it gets interesting when you run 2 processes at the same time on such a cpu. How are they going to make it work hard in such a case?

In addition, the L2 communication from one cpu to another for memory is real ugly (that's not from core to core, as the cores share just 1 L2 cache, but for example the synchronisation mechanism in a quad machine).

So even before launch we already see performance issues in that architecture.

Vincent
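For reference, the clock ticks per node in Bill's post are just clock rate divided by nodes per second. A minimal sketch in Python that recomputes them from the figures he quotes; the function name and the dictionary are made up for this illustration, the numbers are his.

```python
def clock_ticks_per_node(clock_hz, nodes_per_second):
    """Clock rate divided by search speed gives clock ticks spent per node."""
    return clock_hz / nodes_per_second


# (clock rate in Hz, nodes per second), as quoted from Bill's post.
machines = {
    "P4 3.4 GHz": (3.4e9, 37_325_000),
    "Centrino 1.73 GHz": (1.73e9, 43_186_000),
    "AMD XP3000+ 2.16 GHz": (2.16e9, 44_868_000),
}

for name, (clock_hz, nps) in machines.items():
    ticks = clock_ticks_per_node(clock_hz, nps)
    print(f"{name}: {ticks:.0f} clock ticks per node")
```

Rounded, this gives back the 91, 40 and 48 clock ticks per node from the quoted post.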
Last modified: Thu, 15 Apr 21 08:11:13 -0700