Computer Chess Club Archives


Subject: Re: New intel 64 bit ?

Author: Vincent Diepeveen

Date: 03:26:12 07/11/03


On July 10, 2003 at 16:50:45, Robert Hyatt wrote:

Ping-pong does *not* need a process to be woken up. I don't know which OS you
use that might be doing that, but ping-pong over MPI, when properly
implemented, does *not* wake up processes at all.

I hope you will understand that. If ping-pong had to wait for a process to
wake up, it would run at 10 ms per message, because a process can wake up at
most 100 times a second in Linux (and all *nix flavours), as the scheduler
runs at 100 Hz.
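
For the record, here is roughly what such a ping-pong looks like in C over MPI
(a sketch of the standard test, not the exact program I mailed; note that MPI
implementations typically busy-wait inside MPI_Recv rather than sleep):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char buf[8] = {0};           /* the 8-byte message */
      const int iters = 100000;
      int rank, i;
      double t0, t1, n;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < iters; i++) {
          if (rank == 0) {         /* ship 8 bytes, wait for them back */
              MPI_Send(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {  /* bounce them straight back */
              MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0) {
          n = iters / (t1 - t0);   /* round trips per second */
          printf("round trip: %.2f usec\n", 1e6 / n);
          printf("one-way   : %.2f usec\n", 1e6 / n / 2.0);
      }
      MPI_Finalize();
      return 0;
  }

Run one process on each of 2 *different* nodes (mpirun -np 2 with a proper
machine file); run it on a single node and you only measure local loopback.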

If you do not care about the ping-pong test, then you obviously do not care
about chess programs either. Because if you need a hash-table entry from
another node, that latency is exactly what this test measures.
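
(To put numbers on that: a remote hash-table probe is a request plus a reply,
i.e. one full round trip. With, say, 20 usec one-way latency, that is 40 usec
per probe for an entry of only 16 bytes or so, so the cost of the probe is
nearly all latency; the bandwidth number tells you next to nothing.)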

Note that all HPC professors include ping-pong among the first tests they use
to measure supercomputers/clusters.

In fact, of the x cluster/supercomputer dudes I asked about the ping-pong
test, I got an answer within a second from all of them. For them the ping-pong
test *is* very relevant.

All presentations of new systems include this test.

See for example the homepage of Aad v/d Steen: http://www.phys.uu.nl/~steen/

>On July 09, 2003 at 18:42:25, Vincent Diepeveen wrote:
>
>>On July 09, 2003 at 16:02:11, Robert Hyatt wrote:
>>
>>>On July 09, 2003 at 00:23:52, Vincent Diepeveen wrote:
>>>
>>>>On July 08, 2003 at 11:58:58, Robert Hyatt wrote:
>>>>
>>>>>On July 08, 2003 at 08:49:48, Vincent Diepeveen wrote:
>>>>>
>>>>>>On July 07, 2003 at 10:48:02, Robert Hyatt wrote:
>>>>>>
>>>>>>>On July 05, 2003 at 23:37:47, Jay Urbanski wrote:
>>>>>>>
>>>>>>>>On July 04, 2003 at 23:33:46, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>><snip>
>>>>>>>>>"way better than MPI".  Both use TCP/IP, just like PVM.  Except that MPI/OpenMP
>>>>>>>>>is designed for homogeneous clusters while PVM works with heterogeneous mixes.
>>>>>>>>>But for any of the above, the latency is caused by TCP/IP, _not_ the particular
>>>>>>>>>library being used.
>>>>>>>>
>>>>>>>>With latency a concern I don't know why you'd use TCP/IP as the transport for
>>>>>>>>MPI when there are much faster ones available.
>>>>>>>>
>>>>>>>>Even VIA over Ethernet would be an improvement.
>>>>>>>
>>>>>>>I use VIA over ethernet, and VIA over a cLAN giganet switch as well.  The
>>>>>>>cLAN hardware produces .5 usec latency, which is about 1000X better than any
>>>>>>
>>>>>>Bob, the latencies that I quote are RASML: Random Average Shared Memory
>>>>>>Latencies.
>>>>>>
>>>>>>The latencies that you quote here are sequential latencies. Bandwidth divided by
>>>>>>the number of seconds = latency (according to the manufacturers).
>>>>>
>>>>>No it isn't.  It is computed by _me_.  By randomly sending packets to different
>>>>>nodes on this cluster and measuring the latency.  I'm not interested in any
>>>>
>>>>You need to ship a packet and then WAIT for it to get back. The simplest test
>>>>is using one-way ping-pong. I will email you that program now.
>>>>
>>>>You will see about a 20-30 usec latency then.
>>>
>>>Want to bet?  How about "the loser stops posting here?"
>>>
>>>
>>>>
>>>>>kind of bandwidth number.  I _know_ that is high.  It is high on a gigabit
>>>>>ethernet switch.  I'm interested in the latency, how long does it take me to
>>>>>get a packet from A to B, and there ethernet (including gigabit) is slow.
>>>>
>>>>>The cLAN with VIA is not.
>>>>
>>>>>I.e., on this particular cluster, it takes about 1/2 usec to get a short
>>>>>packet from A to B.  The longer the packet, the longer the latency since I
>>>>>assume that I need the last byte before I can use the first byte, which
>>>>>might not always be true.
>>>>
>>>>Bob, this is not one-way ping-pong latency. Not to mention that it isn't a
>>>>full ship-and-receive.
>>>
>>>So what.  Ping-pong is the _only_ way I know to measure latency.  I told you
>>>that is what I did.  What is your problem with understanding that?
>>
>>Bob, on this planet there are a thousand machines using the network cards you
>>have, and there are guys in Italy who are busy making their own protocols in
>>order to get lower latencies, and they *manage*.
>>
>>Now stop bragging about something you don't know. You do *not* know the
>>one-way ping-pong latencies.
>
>Actually, I _DO_ know one-way ping pong latencies.  Of course, you seem to
>know everything about everything, including everything everybody _else_ knows,
>so that's an argument that can't be won.  But just because you say it, does
>_not_ make it so.
>
>I knew what "ping pong" latency was before you were _born_.
>
>
>>
>>When you just got your machine I asked you: "what is the latency of this thing?"
>>
>>Then you took a while to get the manufacturer specs of the card, and then came
>>back with: "0.5 usec".
>
>No, I took a while to go run the test.  If you were to ask me the max
>sustainable I/O rate I would do the same thing.
>
>
>>
>>However, that is, as we all know, bandwidth divided by time.
>
>No it isn't.  Latency has _nothing_ to do with bandwidth in the context
>of networking.  And _nobody_ I know of computes it that way.
>
>
>>
>>A very poor understanding of latency.
>>
>>Here is what ping-pong, as we can use it, does. It ships 8 bytes and then
>>waits for those 8 bytes to come back.
>>
>>After that it again ships 8 bytes and waits for those 8 bytes to come back.
>>
>>If you want to, you may make that 4 bytes too. I don't care.
>
>That is _exactly_ what my latency measurement does.  As I have now said for
>the _fourth_ time.
>
>
>>
>>The number of times you can do those shipments per second is called n.
>>
>>The latency in microseconds = 1,000,000 / n
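
(With made-up numbers: n = 25,000 shipments a second gives 1,000,000 / 25,000 =
40 usec for the full round trip, and half of that, 20 usec, as the one-way
ping-pong latency.)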
>>
>>So don't quote the same thing you quoted a bunch of years ago again.
>>
>>That's not the latency we're looking for. Marketing managers have rewritten
>>and rewritten that definition until they had something very fast.
>
>You can define latency however you want.  I use _the_ definition that everybody
>else uses however, and will continue to do so.  Latency is the time taken to
>send a packet from A to B.  One way to measure it is to do the ping-pong test,
>although that is _not_ an accurate measurement.  If you want me to explain why
>I will be happy to do so.  But to make it simple, that kind of ping-pong test
>measures _more_ than latency.  Namely it includes the time needed to wake up
>and schedule a process on the other end, which is _not_ part of the latency.
>
>Of course, you won't understand that...
>
>But I thought I'd try.
>
>>
>>For your information, your own machines have, if I remember correctly, 66 MHz
>>PCI cards or something. Those are cool cards; they're much better than 33 MHz
>>cards. That means that the latency of the PCI bus, which is about 4 usec, is
>>added to that of the network cards when you do the measurement as above.
>>
>>Is that new for you?
>
>Yes, and it is wrong.  Here is a test I just ran:
>
>I did an "scp" copy of kqbkqb.nbw.emd from machine A to machine B.  That is
>almost 1 gigabyte of data, and it took 7.9 seconds to complete.  To do that
>copy, that "slow PCI bus" had to do the following:
>
>1.  Deliver a copy of 1 gigabyte from disk to memory.
>2.  Deliver a copy of 1 gigabyte from memory to the CPU for
>    encryption.
>3.  Deliver a copy of 1 gigabyte from the CPU back to memory (this is the
>    encrypted data).
>4.  Deliver 1 gigabyte from memory to the CPU (this is the TCP/IP layer
>    copying and stuffing the data into packets).
>5.  Deliver 1 gigabyte of data from the CPU to memory (this is the other half
>    of the data copying to get the stuff into TCP/IP packet buffers).
>6.  Deliver 1 gigabyte of data from memory to the network card.
>
>Your 4 usec numbers are a bit distorted.  Your 250,000 "messages per second"
>is not just distorted, but _wrong_.  My machine moved about 6 gigabytes of data
>in 7 seconds, and much of that delay was in the SCSI disk reads and the network
>writes (this is on a gigabit network).
>
>So please don't quote me any of your nonsense numbers, it is far easier to
>run the tests.  If you want the lm-bench numbers for memory speeds, I can
>easily provide that.  Without any hand-waving.
>
>
>>
>>I am sure it should be, as you're just quoting the same number I have heard
>>from you for several years already.
>
>I have had this cluster for 2.5 years.  You have been hearing the same number
>from me repeatedly since I got it.  Not before.
>
>
>>
>>Of course you didn't do a ping-pong. If you get under 1 microsecond from node
>>to node, you made a mistake. The latency of PCI alone is already way above that.
>
>
>As I said, you _must_ know what you are doing to measure this stuff.  You are
>factoring in way more than PCI latency.  Which is not a surprise, since you
>don't know beans about operating systems.
>
>
>
>>
>>Now the above latency time, which we need in order to know what it takes to
>>send and receive a message, is divided by 2 by the ping-pong program. That's
>>called 'one-way ping-pong' then.
>>
>>So better pack your bags.
>
>Righto.  You seem to confuse "ping pong" with a game played with two paddles,
>a net, and a small white ball.  But your latency measurement is not the way
>to do it.  One day I'll tell you how _I_ do the ping-pong test, which _really_
>measures latency.  _Not_ the way you do it, by the way....
>
>
>
>>
>>>>
>>>>In computer chess you don't ship something without waiting for an answer back.
>>>>You *want* the answer back.
>>>>
>>>>Example: if you want to split a node :)
>>>
>>>Wrong.  It is not hard to do this.  I say "do this" and that is all I need
>>>to do until I get the result back.  I don't need an "OK, I got that, I'll
>>>be back with the answer in a while."  It is easier to just keep going until
>>>the answer arrives back.
>>>
>>>>
>>>>The 0.5 usec latency is based upon shipping a terabyte of data without an answer back.
>>>
>>>No it isn't.
>>
>>
>>
>>>
>>>
>>>
>>>>
>>>>Bandwidth / time needed = latency then.
>>>>
>>>>What I tried to explain to you is RASML, but I know you won't understand it.
>>>>
>>>>In order not to waste more time on this, I'll just email the thing to you.
>>>>
>>>>Run it any time you like, but run it on 2 different nodes. Don't run it on
>>>>the same node :)
>>>
>>>
>>>You sent me some MPI crap that I'm not going to fool with.  As I said, I
>>>use VIA to use the cLAN stuff.  VIA.  Not MPI.
>>>
>>>But I'm not going to waste time running your crap anyway, as whenever I do it
>>>and you don't like the results, you just disappear for a while.
>>>
>>>
>>>
>>>>
>>>>>VIA has some cute stuff to "share memory" too.
>>>>>
>>>>>>
>>>>>>For computer chess that can't be used however.
>>>>>>
>>>>>>You can get a more accurate indication by using the well-known ping-pong
>>>>>>program. What it does is: over MPI it ships messages and then WAITS for them
>>>>>>to come back. Then it divides that time by 2. That is called the one-way
>>>>>>ping-pong latency.
>>>>>
>>>>>That's how _I_ measure latency.  I know of no other way, since keeping two
>>>>>machine clocks synced that accurately is not easy.
>>>>>
>>>>>
>>>>>>
>>>>>>If you multiply that by 2, you already get closer to the latency that it takes
>>>>>>to get a single bitboard out of memory.
>>>>>
>>>>>It doesn't take me .5usec to get a bitboard out of memory.  Unless you are
>>>>>talking about a NUMA machine where machine A wants the bitboard and it is
>>>>>not in its local memory.
>>>>>
>>>>>
>>>>>>
>>>>>>Even better is using the RASML test I wrote. That uses OpenMP, though
>>>>>>conversion to MPI is trivial (yet it slows things down so much that it is
>>>>>>less accurate than OpenMP).
>>>>>>
>>>>>>So the best indication you can get is by doing a simple ping-pong latency test.
>>>>>
>>>>>I do this all the time.
>>>>>
>>>>>>
>>>>>>The best network cards are Myrinet cards (about $1300). I do not know which
>>>>>>chipset they have. At 133 MHz, 64-bit PCI-X (Jay might know more about the
>>>>>>specifications here) they can achieve something like 5 usec one-way ping-pong
>>>>>>latency, so that's a minimum of way more than 10 usec to get a bitboard from
>>>>>>the other side of the machine.
>>>>>
>>>>>Correct.  cLAN is faster.  It is also more expensive.  The 8-port switch we
>>>>>use cost us about $18,000 two years ago.  Myrinet was designed as a lower-cost
>>>>>network.  With somewhat lower performance.
>>>>>
>>>>>>
>>>>>>In your cluster you probably do not have such PCI stuff, Bob. Most likely it
>>>>>>is around 10 usec for one-way latency on your cluster, so it takes a minimum
>>>>>>of 20 usec to get a message.
>>>>>
>>>>>In my cluster I have PCI cards that are faster than Myrinet.  They were made by
>>>>>cLAN (again) and we paid about $1,500 each for them two years ago.  Again, you
>>>>>can find info about the cLAN stuff and compare it to myrinet if you want.  We
>>>>>have Myrinet stuff here on campus (not in any of my labs) and we have done the
>>>>>comparisons.  When we write proposals to NSF, they _always_ push us towards
>>>>>Myrinet because it is cheaper than the cLAN stuff, but it also is lower
>>>>>performance.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>Note that getting a cache line out of the local memory of your quad Xeons
>>>>>>already takes about 0.5 usec. You can hopefully imagine that the usecs quoted
>>>>>>by the manufacturer for cLAN are based upon bandwidth / time needed, and NOT
>>>>>>the RASM latencies.
>>>>>
>>>>>Your number there is dead wrong.  My cluster is PIII based, with a cache
>>>>>line of 32 bytes.  It uses 4-way interleaving.  lm_bench reports the latency
>>>>>as 132 nanoseconds, _total_.
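
For what it is worth, figures like that lm_bench number come from a dependent
pointer chase, where each load must complete before the next one can start.
A minimal sketch of the idea (my own illustration, not lm_bench's code):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 22)     /* 4M pointers: much bigger than any cache */

  int main(void)
  {
      size_t *chain = malloc(N * sizeof *chain);
      size_t i, j, tmp;
      clock_t t0, t1;

      /* Sattolo's algorithm: a random single-cycle permutation, so the
         chase below visits every slot exactly once and defeats the
         prefetcher. (Assumes a large RAND_MAX, as in glibc.) */
      for (i = 0; i < N; i++) chain[i] = i;
      for (i = N - 1; i > 0; i--) {
          j = (size_t)rand() % i;
          tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
      }

      t0 = clock();
      for (i = 0, j = 0; i < N; i++)
          j = chain[j];   /* each load depends on the previous one */
      t1 = clock();

      printf("%.1f ns per dependent load (end=%lu)\n",
             (t1 - t0) / (double)CLOCKS_PER_SEC * 1e9 / N,
             (unsigned long)j);   /* use j so the loop is not optimized out */
      free(chain);
      return 0;
  }

Random access to memory is exactly what a hash-table probe does, which is why
this kind of number, and not the bandwidth, is the one that matters here.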
>>>>>
>>>>>>
>>>>>>Best regards,
>>>>>>Vincent
>>>>>>
>>>>>>
>>>>>>>TCP/IP-ethernet implementation.  However, ethernet will never touch good
>>>>>>>hardware like the cLAN stuff.
>>>>>>>
>>>>>>>MPI/PVM use ethernet - tcp/ip for one obvious reason: "portability" and
>>>>>>>"availability".  :)


