Computer Chess Club Archives


Subject: Re: some quotes on switch and indirect branches

Author: Vincent Diepeveen

Date: 04:50:47 11/24/05


On November 23, 2005 at 22:34:13, Eugene Nalimov wrote:

>On November 23, 2005 at 21:32:18, Robert Hyatt wrote:
>
>>On November 23, 2005 at 15:51:01, Vincent Diepeveen wrote:
>>
>>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>>
>>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>>
>>>>>[...]
>>>>>I guess, you mean this as a substitution for
>>>>>  if (depth < 0)
>>>>>    fm = fm1;
>>>>>  else
>>>>>    fm = fm2;
>>>>>
>>>>>I am surprised, that compilers are not able to do this themselves. I
>>>>
>>>>Several times I tried to modify Visual C to recognize additional cases where
>>>>we should emit conditional moves (the last time was probably a year ago, for
>>>>the x64-targeting compiler). Every time I could demonstrate a win on a small
>>>>artificial test case, but every large real-world program either showed no
>>>>gain or slowed down.
>>>
>>>On *which* processor was it slower? AMD or EM64T?
>>>
>>>AMD has quite a big L1 cache and keeps instructions in L2 as well, if I
>>>understand correctly. That should make larger code sizes no problem.
>>
>>Last I saw, AMD64 had a unified L2.  typical split L1.  Have not seen a split L2
>>machine that I am aware of (although one could exist...)
>
>The new Itanium that will be released next year (it should have been this
>quarter, but slipped) has 1MB of L2 I-cache in addition to 256KB of L2 D-cache.

A sunken Itanium ship is not a good comparison: a chip that is not yet there,
priced at something like 8000 dollars a chip, and only if you buy 1000 of them;
otherwise it will be more like 20000 dollars a chip. Despite Intel giving masses
of those chips to SGI for free, SGI has still been removed from the New York
Stock Exchange. It's no longer there.

x64 is more interesting.

It allows Gerd, for example, to post assembly code. I have yet to see Gerd
write assembly for IA64. Writing assembly for IA64 is very complex.

Eugene, you mentioned x64, and you will never in your life make the typing
mistake of writing x64 when you mean IA64.

So I conclude that you tested the conditional move instructions only on Intel
hardware and not very carefully on AMD, and that conditional moves were thrown
out because they were slow on Intel and *not* because they were slow on AMD.

On paper CMOV is a 1 cycle direct path instruction on AMD (2 cycles in
practice), versus 7 cycles in practice on Intel.

Now in his bitboard code, Gerd perhaps doesn't have the right data at hand to
use conditional moves. In DIEP I do. And I have a zillion branches in DIEP and
the code size is already huge. So removing branches has priority.
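
For reference, here is a minimal C sketch of the kind of branch removal being
discussed, built on the if/else from Dieter's quote. The function names are
made up for illustration and this is not DIEP code. The ternary form only gives
the compiler the chance to emit CMOV; whether it does depends on the compiler
and target. The mask form has no branch in the source and assumes two's
complement integers.

```c
#include <assert.h>

/* Branching form, as in the quoted snippet:
   if (depth < 0) fm = fm1; else fm = fm2; */
static int select_branch(int depth, int fm1, int fm2)
{
    if (depth < 0)
        return fm1;
    return fm2;
}

/* Ternary form: a candidate for CMOV, but the compiler is free
   to emit a branch instead. */
static int select_ternary(int depth, int fm1, int fm2)
{
    return depth < 0 ? fm1 : fm2;
}

/* Mask form: no branch in the source (assumes two's complement).
   mask is -1 (all ones) when depth < 0, otherwise 0. */
static int select_mask(int depth, int fm1, int fm2)
{
    int mask = -(depth < 0);
    return fm2 ^ ((fm1 ^ fm2) & mask);
}

/* Hypothetical self-check: all three forms must agree. */
static void check_select_forms_agree(int depth, int fm1, int fm2)
{
    int expected = select_branch(depth, fm1, fm2);
    assert(select_ternary(depth, fm1, fm2) == expected);
    assert(select_mask(depth, fm1, fm2) == expected);
}
```

Calling check_select_forms_agree(-1, 10, 20) and check_select_forms_agree(3,
10, 20) exercises both outcomes. Which form is fastest has to be measured on
the actual processor.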

Mainly on AMD of course, as it's easier to get a quad Opteron for a tournament
than it is to get a quad Xeon. I fear that will still be the case in 2007.

However, those Opteron chips are here now and a compiler generating fast x64
code is not. Simply because there is only a Microsoft compiler, and Microsoft's
nickname here is Wintel.

This conditional move is fast on the Israeli processor line, and the Xeon group
will release a Pentium-M based dual core Xeon doing 4 instructions a cycle at
the end of 2006 (not to be confused with the dual core P4 Xeon that is due, on
paper, at the start of 2006). Does this mean that somewhere in 2007 the
Microsoft compiler can 'suddenly' do conditional moves on x64?

Or does the same problem exist on Pentium-M, with its medium sized L1 cache
(only 32KB data) and probably inferior L2 cache compared to AMD?

Of course from a multiprocessing viewpoint, sharing that L2 cache is a bad thing
for Pentium-M. Dual core Opteron doesn't do that. So scaling on Opteron should
be much better for Crafty, DIEP, Zappa and The Baron. Basically we should see
the dual core Intel as a single core with improved hyperthreading. Perhaps it
even scales 70%+.

However, the raw speed of a single Xeon core should be way faster, so the total
speed of the CPU should be significantly faster than the dual core Opteron.

Vincent

>Thanks,
>Eugene
>
>>>
>>>So I assume Intel EM64T became slower and as a result it was abandoned?
>>>
>>>Vincent
>>>
>>>>I suspect there are several reasons for this:
>>>>* branch predictors are good, and the majority of branches can be correctly
>>>>predicted
>>>>* CMOV is a long instruction; a short branch is shorter, so a program with
>>>>fewer CMOVs fits better into cache
>>>>* there is no 8-bit form of CMOV
>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to
>>>>load the immediate into a register, thus executing more instructions and
>>>>increasing register pressure -- a serious problem on x86
>>>>* for an invalid address, "CMOV reg, memory" will give you an access
>>>>violation even if the condition is false.
>>>>
>>>>Thanks,
>>>>Eugene
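
Eugene's last point, about "CMOV reg, memory" faulting even when the condition
is false, can be illustrated with a small C sketch. The function names are made
up for illustration. In the first function the branch protects the load, so a
compiler cannot simply replace it with a CMOV that takes *p as its memory
operand. The second function is one common workaround, assuming a valid
fallback address is available: select between two valid addresses first, then
load unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* The branch guards the load: *p is only read when p is not NULL.
   Turning this into "CMOV reg, [p]" would read *p unconditionally
   and could fault when p is NULL. */
static int load_or_default(const int *p, int fallback)
{
    if (p != NULL)
        return *p;
    return fallback;
}

/* Workaround sketch: pick one of two valid addresses (a register to
   register select, so a CMOV candidate), then do a single load that
   is always safe. Whether CMOV is actually emitted is up to the
   compiler. */
static int load_or_default_select(const int *p, int fallback)
{
    const int *src = (p != NULL) ? p : &fallback;
    return *src;
}

/* Hypothetical self-check: both forms must agree. */
static void check_load_forms_agree(const int *p, int fallback)
{
    assert(load_or_default_select(p, fallback) ==
           load_or_default(p, fallback));
}
```

Note that the second form always performs a memory load, so it trades the
branch for an extra memory access on the fallback path.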



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.