Computer Chess Club Archives



Subject: Re: New intel 64 bit ?

Author: Vincent Diepeveen

Date: 15:42:25 07/09/03



On July 09, 2003 at 16:02:11, Robert Hyatt wrote:

>On July 09, 2003 at 00:23:52, Vincent Diepeveen wrote:
>
>>On July 08, 2003 at 11:58:58, Robert Hyatt wrote:
>>
>>>On July 08, 2003 at 08:49:48, Vincent Diepeveen wrote:
>>>
>>>>On July 07, 2003 at 10:48:02, Robert Hyatt wrote:
>>>>
>>>>>On July 05, 2003 at 23:37:47, Jay Urbanski wrote:
>>>>>
>>>>>>On July 04, 2003 at 23:33:46, Robert Hyatt wrote:
>>>>>>
>>>>>><snip>
>>>>>>>"way better than MPI".  Both use TCP/IP, just like PVM.  Except that MPI/OpenMP
>>>>>>>is designed for homogeneous clusters while PVM works with heterogeneous mixes.
>>>>>>>But for any of the above, the latency is caused by TCP/IP, _not_ the particular
>>>>>>>library being used.
>>>>>>
>>>>>>With latency a concern I don't know why you'd use TCP/IP as the transport for
>>>>>>MPI when there are much faster ones available.
>>>>>>
>>>>>>Even VIA over Ethernet would be an improvement.
>>>>>
>>>>>I use VIA over ethernet, and VIA over a cLAN giganet switch as well.  The
>>>>>cLAN hardware produces 0.5 usec latency which is about 1000X better than any
>>>>
>>>>Bob, the latencies that i quote are RASML : Random Average Shared Memory
>>>>Latencies.
>>>>
>>>>The latencies that you quote here are sequential latencies. Bandwidth divided by
>>>>the number of seconds = latency (according to the manufacturers).
>>>
>>>No it isn't.  It is computed by _me_.  By randomly sending packets to different
>>>nodes on this cluster and measuring the latency.  I'm not interested in any
>>
>>You need to ship a packet and then WAIT for it to get back. The simplest test
>>is using one-way pingpong. I will email you that program now.
>>
>>You will see about a 20-30 usec latency then.
>
>Want to bet?  How about "the loser stops posting here?"
>
>
>>
>>>kind of bandwidth number.  I _know_ that is high.  It is high on a gigabit
>>>ethernet switch.  I'm interested in the latency, how long does it take me to
>>>get a packet from A to B, and there ethernet (including gigabit) is slow.
>>
>>>The cLAN with VIA is not.
>>
>>>IE on this particular cluster, it takes about 1/2 usec to get a short
>>>packet from A to B.  The longer the packet, the longer the latency since I
>>>assume that I need the last byte before I can use the first byte, which
>>>might not always be true.
>>
>>Bob, this is not one-way ping-pong latency. Not to mention that it isn't a
>>full ship and receive.
>
>So what.  Ping-pong is the _only_ way I know to measure latency.  I told you
>that is what I did.  What is your problem with understanding that?

Bob, on this planet there are a thousand machines using the network cards you
have, and there are guys in Italy who are busy writing their own protocols in
order to get lower latencies, and they *manage*.

Now stop bragging about something you don't know. You do *not* know one-way
ping-pong latencies.

When you had just gotten your machine, I asked you: "what is the latency of
this thing?"

Then you took a while to get the manufacturer specs of the card and then came
back with: "0.5 usec".

However, as we all know, that's bandwidth divided by time.

A very poor understanding of latency.

Here is what the pingpong, as we use it, does: it ships 8 bytes and then waits
for those 8 bytes to come back.

After that it ships 8 bytes again and waits for those 8 bytes to come back.

If you want to, you may make that 4 bytes. I don't care.

The number of times per second you can do those shipments is called n.

The latency in microseconds = 1,000,000 / n.
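To make that concrete, here is a minimal sketch of such a pingpong in C with
MPI. It is not the exact program I emailed; the repetition count, the message
tag, and the use of MPI_Wtime for timing are just illustrative choices here:

/*
 * pingpong.c -- minimal MPI ping-pong latency sketch.
 * Rank 0 ships MSG_BYTES bytes to rank 1 and waits for them to come
 * back; repeat REPS times.  n = REPS / elapsed seconds, so the
 * round-trip latency in microseconds is 1,000,000 / n, and the
 * "one way pingpong" latency is that divided by 2.
 *
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run with 2 processes, on 2 different nodes, not the same node.
 */
#include <mpi.h>
#include <stdio.h>

#define REPS      100000   /* illustrative repetition count */
#define MSG_BYTES 8        /* make it 4 if you want, I don't care */

int main(int argc, char **argv)
{
    int rank, i;
    double t0, elapsed, n;
    char buf[MSG_BYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);        /* start both sides together */
    t0 = MPI_Wtime();

    for (i = 0; i < REPS; i++) {
        if (rank == 0) {                /* ship 8 bytes, wait for them back */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {         /* echo them straight back */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        n = REPS / elapsed;             /* round trips per second */
        printf("round trip: %.2f usec, one way pingpong: %.2f usec\n",
               1e6 / n, 1e6 / n / 2.0);
    }

    MPI_Finalize();
    return 0;
}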

So don't quote the same thing you quoted a bunch of years ago again.

That's not the latency we're after. Marketing managers have rewritten that
definition over and over until they had something that looks very fast.

For your information, your own machines have, if I remember well, 66MHz PCI
cards or something. Those are cool cards, much better than 33MHz cards. But
that means the latency of the PCI bus, which is about 4 usec, gets added to
that of the network cards when you do the measurement as above.

Is that new to you?

I am sure it is, as you're just quoting the same number I have been hearing
from you for several years.

Of course you didn't do a pingpong. If you get under 1 microsecond from node to
node, you made a mistake. The latency of PCI alone is already way above that.

Now the above latency, which is what we need in order to know what it takes to
send and receive a message, gets divided by 2 by the pingpong program. That is
then called the 'one way pingpong' latency.
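To give made-up example numbers: if the program manages n = 25,000 of those
8-byte round trips per second, the round-trip latency is 1,000,000 / 25,000 =
40 usec, so the one-way pingpong latency is 20 usec. And because every round
trip crosses the PCI bus in both nodes, that roughly 4 usec of bus latency is
sitting inside the measurement, exactly as described above.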

So better pack your bags.

>>
>>In computer chess you don't ship something without waiting for an answer
>>back. You *want* the answer back.
>>
>>Example if you want to split a node :)
>
>Wrong.  It is not hard to do this.  I say "do this" and that is all I need
>to do until I get the result back.  I don't need an "OK, I got that, I'll
>be back with the answer in a while."  It is easier to just keep going until
>the answer arrives back.
>
>>
>>The 0.5 usec latency is based upon shipping a terabyte of data without an
>>answer back.
>
>No it isn't.



>
>
>
>>
>>Bandwidth / time needed = latency then.
>>
>>What I tried to explain to you is RASML, but I know you won't understand it.
>>
>>Rather than waste more time on this, I'll just email the thing to you.
>>
>>Run it any time you like, but run it on 2 different nodes. Don't run it on
>>the same node :)
>
>
>You sent me some MPI crap that I'm not going to fool with.  As I said, I
>use VIA to use the cLAN stuff.  VIA.  Not MPI.
>
>But I'm not going to waste time running your crap anyway, as whenever I do it
>and you don't like the results, you just disappear for a while.
>
>
>
>>
>>>VIA has some cute stuff to "share memory" too.
>>>
>>>>
>>>>For computer chess that can't be used however.
>>>>
>>>>You can get a more accurate indication by using the well-known ping pong
>>>>program. What it does is ship messages over MPI and then WAIT for them to
>>>>come back. Then it divides that time by 2. That is then called the one-way
>>>>ping-pong latency.
>>>
>>>That's how _I_ measure latency.  I know of no other way, since keeping two
>>>machine clocks synced that accurately is not easy.
>>>
>>>
>>>>
>>>>If you multiply that by 2, you already get closer to the latency that it takes
>>>>to get a single bitboard out of memory.
>>>
>>>It doesn't take me .5usec to get a bitboard out of memory.  Unless you are
>>>talking about a NUMA machine where machine A wants the bitboard and it is
>>>not in its local memory.
>>>
>>>
>>>>
>>>>Even better is using the RASML test I wrote. That uses OpenMP, though
>>>>conversion to MPI is trivial (yet it slows things down so much that it is
>>>>less accurate than OpenMP).
>>>>
>>>>So the best indication you can get is by doing a simple pingpong latency test.
>>>
>>>I do this all the time.
>>>
>>>>
>>>>The best network cards are Myrinet cards (about $1300). I do not know which
>>>>chipset they have. On 133MHz 64-bit PCI-X (Jay might know more about the
>>>>specifications here) they can achieve something like 5 usec one-way
>>>>ping-pong latency, so that's a minimum of way more than 10 usec to get a
>>>>bitboard from the other side of the machine.
>>>
>>>Correct.  cLAN is faster.  It is also more expensive.  The 8-port switch we
>>>use cost us about $18,000 two years ago.  Myrinet was designed as a lower-cost
>>>network.  With somewhat lower performance.
>>>
>>>>
>>>>In your cluster you probably do not have such PCI stuff, Bob. Most likely
>>>>it is around 10 usec for one-way latency on your cluster, so getting a
>>>>message there and back costs a minimum of 20 usec.
>>>
>>>In my cluster I have PCI cards that are faster than Myrinet.  They were made by
>>>cLAN (again) and we paid about $1,500 each for them two years ago.  Again, you
>>>can find info about the cLAN stuff and compare it to myrinet if you want.  We
>>>have Myrinet stuff here on campus (not in any of my labs) and we have done the
>>>comparisons.  When we write proposals to NSF, they _always_ push us towards
>>>Myrinet because it is cheaper than the cLAN stuff, but it also is lower
>>>performance.
>>>
>>>
>>>
>>>>
>>>>Note that getting a cache line out of local memory on your quad Xeons
>>>>already takes about 0.5 usec. You can hopefully imagine that the usecs
>>>>quoted by the manufacturer for cLAN are based upon bandwidth / time needed,
>>>>and NOT the RASML latencies.
>>>
>>>Your number there is dead wrong.  My cluster is PIII based, with a cache
>>>line of 32 bytes.  It uses 4-way interleaving.  lmbench reports the latency
>>>as 132 nanoseconds, _total_.
>>>
>>>>
>>>>Best regards,
>>>>Vincent
>>>>
>>>>
>>>>>TCP/IP-ethernet implementation.  However, ethernet will never touch good
>>>>>hardware like the cLAN stuff.
>>>>>
>>>>>MPI/PVM use ethernet - tcp/ip for one obvious reason: "portability" and
>>>>>"availability".  :)


