Computer Chess Club Archives

Subject: Re: Crafty multiplying matrice

Author: Robert Hyatt
Date: 08:01:40 02/17/04
On February 17, 2004 at 06:19:00, Vincent Diepeveen wrote:

>On February 16, 2004 at 22:09:28, Robert Hyatt wrote:
>
>>On February 16, 2004 at 18:25:19, Vincent Diepeveen wrote:
>>
>>>On February 16, 2004 at 13:28:21, Robert Hyatt wrote:
>>>
>>>>On February 16, 2004 at 12:08:28, Vincent Diepeveen wrote:
>>>>
>>>>>On February 16, 2004 at 12:02:16, Robert Hyatt wrote:
>>>>>
>>>>>>On February 16, 2004 at 11:30:57, Jorge Pichard wrote:
>>>>>>
>>>>>>>I still don't understand why neither Fritz nor Shredder has been able to get
>>>>>>>an AMD sponsor, since 95% of the time they are sponsored by a company that runs
>>>>>>>Intel inside. They need to get a different sponsor in order to beat Hydra in the
>>>>>>>World Championship.
>>>>>>>
>>>>>>>Hydra gets effectively around 4 million nodes a second
>>>>>>>
>>>>>>>I am very sure that a quad Opteron running a software program is
>>>>>>>faster than four 30 MHz FPGA cards.
>>>>>>
>>>>>>quad opteron box is NUMA.  There are some issues there that have to be addressed
>>>>>>by anyone using such a box.  Just taking a pure SMP program and dropping it in
>>>>>>may not produce such good results.  Dual opterons are a bit easier to use.
>>>>>
>>>>>It works great as SMP too. The latency when using it as SMP is still lower than
>>>>>a quad Xeon chipset can deliver to you.
>>>>
>>>>No it isn't.  A single cpu has a latency in the 60ns range.  Dual is 60 for
>>>>local, 120 for remote.  Go to 4-way and you get 60 for local, 120 for two of the
>>>>other banks, 180 for the last bank.
>>>
>>>
>>>>That is for a single memory reference.  Assuming a TLB hit.  IF you get a TLB
>>>>miss you _die_ just as you do anywhere, except that it is possible that the
>>>
>>>I didn't know Crafty was streaming sequentially nowadays and that all you do
>>>nowadays is multiply matrices.
>>
>>I didn't know that _all_ you do is random hash probes...  In my program, I only
>>do one call to HashProbe() per node.  I do a _lot_ of other stuff per node as
>
>So you are denying that each probe you do to hashtable is random and you're
>saying that you are not using Zobrist in Crafty?

No, I am saying that you can't read simple text and comprehend it.  One more
time...

"I do one hash probe per node.  My probes are random.  They use the Zobrist
hashing algorithm.  But the _rest_ of my memory references are not to random
places in memory.  If I do one random probe, and then 1000 memory probes that go
through the TLB normally, then another random probe that misses the TLB, then
another 1000 that do not miss, then that TLB miss is _not_ going to be a major
contributor to performance."

Is that _that_ hard for you to understand???

Or is this just another of those "your move generator is too slow so you can't
win" claims, even though move generation is less than 10% of the total search
time for most programs????

>
>Are you or are you not?
>
>>well, from generating moves, to ordering moves, to using the Swap() (SEE) code,
>>to evaluating positions, and so forth..  Most of those are not going to blow the
>>TLB.  Which means that for Crafty, memory access time is going to be almost
>>exactly memory latency time.  Only one reference per node requires the 3-4
>>access virtual-to-real translation overhead.  Out of _thousands_...
>
>
>
>>
>>
>>>
>>>>memory map tables are in remote memory as well, which means that your memory
>>>>access time (not latency) turns into 3x or 4x what it should.  Opteron uses a
>>>>3-level map because of the 48 bit virtual address space.  That means you do
>>>>three extra memory reads when you suffer a TLB miss.
>>>
>>>>My dual xeon has 150ns latency.  TLB misses turn that into 450.  The Opteron has
>>>
>>>>much more variability.  60ns on a TLB hit, up to 720ns for a TLB miss where the
>>>>page tables are in remote memory.
>>>
>>>>>Even without PGO and using an old GCC version, the SMP version of diep gets a
>>>>>lot of nps on that box, slightly less than it gets on an 8 processor Xeon. The
>>>>>NUMA version gets a lot less (not sharing evaluation tables nor pawn tables, and
>>>>>the NUMA version tested wasn't sharing qsearch hashtables either).
>>>>>
>>>>>See www.aceshardware.com for diep SMP tests at quad opteron boxes.
>>>>
>>>>Don't need to.  I have my own quad opteron numbers with things done right...
>>>>Whether you get good numbers with no work or not, you get _better_ numbers when
>>>>memory is done right.  And it can be _significantly_ better.  From experience.
>>>
>>>Well multiprocessing is way faster of course than multithreading at such
>>>machines, that includes 2-4 itaniums too.
>>
>>
>>No it isn't.  You just don't know how to do multi-threading, apparently.  I do
>
>You clearly have no idea how advanced microprocessors are nowadays.
>You're still toying with the 68000 designs I bet.
>
>By the way the 68000 has been named like that because it uses 68000 gates.
>
>Nowadays processors have tens of millions of transistors and work very complex
>and you have not even a remote clue on how they work.

I apparently have a better idea than you do, from many of your ridiculous
statements made in CCC.



>
>>and it is working just fine.  And is actually an efficient way to do things as
>>Eugene has told you many times.  Shared egtb buffers is one reason.  There is no
>>reason for threads to be bad...
>
>So you do not know how to share memory when running multiprocessor?
>


I told _you_ how to do it, so I suppose I do.  I've been doing it since 1978 in
various flavors, so I suppose I do.

hint:  system V shared memory is not the _only_ answer.




>>
>>
>>>
>>>Your thing is continuously busy with cache coherency, multiprocessor applications
>>>don't suffer from that of course. DIEP is multiprocessor.
>>
>>My thing is _not_ continually busy with cache coherency.  You just don't
>>understand how "my thing" works, apparently.
>>Perhaps you should look at your results, when trying to figure out whether what
>>you are doing is better than another approach.  My approach seems to be doing
>>just fine, based on recent results and performance measurements...
>
>I see that Crafty will never end above me in a world champs because you are
>fearing to join there as you would get crushed like an ant there.

Where were you at the last CCT?  Bet I would have finished above you there as
well...

With all your hot air, however, you might well rise above me if you put a big
sack over your head to catch some of it...

I noticed you didn't do that well against the "big 3" at the just-finished
event...



>
>>
>>>
>>>>>
>>>>>>Another issue is that AMD will likely want to see real 64 bit applications.
>>>>>>That is why they developed an interest in Crafty, in fact, because it really
>>>>>>needed the 64 bit internal stuff the opteron offers.  Fritz, et al don't need
>>>>>>nor will they use this particular part of the opteron...
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>Fritz and Shredder run in Paderborn on identically built Transtec
>>>>>>>workstations, each with two Intel Xeon processors at 3.06 GHz and 2
>>>>>>>gigabytes of memory. Deep Fritz will also use over 1 gigabyte of hash tables
>>>>>>>and, at 2 to 2.3 million positions/second, will reach a search depth of roughly
>>>>>>>14 to 16 plies, and around 20 plies in the endgame, depending strongly on
>>>>>>>position and material.
>>>>>>>
>>>>>>>When ordered a Quad Opteron cost perhaps $45k and fpga cards cost only $3000 a
>>>>>>>card and a 4 node cluster Quad Xeon 3.06Ghz costs less than $45k.
>>>>>>>
>>>>>>>Here are some comparison of a Dual Opteron versus a Dual Xeon:
>>>>>>>http://www.gamepc.com/labs/view_content.asp?id=opt248vsxeon32a&page=5
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.