Computer Chess Club Archives


Subject: Re: New intel 64 bit ?

Author: Robert Hyatt

Date: 13:50:45 07/10/03


On July 09, 2003 at 18:42:25, Vincent Diepeveen wrote:

>On July 09, 2003 at 16:02:11, Robert Hyatt wrote:
>
>>On July 09, 2003 at 00:23:52, Vincent Diepeveen wrote:
>>
>>>On July 08, 2003 at 11:58:58, Robert Hyatt wrote:
>>>
>>>>On July 08, 2003 at 08:49:48, Vincent Diepeveen wrote:
>>>>
>>>>>On July 07, 2003 at 10:48:02, Robert Hyatt wrote:
>>>>>
>>>>>>On July 05, 2003 at 23:37:47, Jay Urbanski wrote:
>>>>>>
>>>>>>>On July 04, 2003 at 23:33:46, Robert Hyatt wrote:
>>>>>>>
>>>>>>><snip>
>>>>>>>>"way better than MPI".  Both use TCP/IP, just like PVM.  Except that MPI/OpenMP
>>>>>>>>is designed for homogeneous clusters while PVM works with heterogeneous mixes.
>>>>>>>>But for any of the above, the latency is caused by TCP/IP, _not_ the particular
>>>>>>>>library being used.
>>>>>>>
>>>>>>>With latency a concern I don't know why you'd use TCP/IP as the transport for
>>>>>>>MPI when there are much faster ones available.
>>>>>>>
>>>>>>>Even VIA over Ethernet would be an improvement.
>>>>>>
>>>>>>I use VIA over ethernet, and VIA over a cLAN giganet switch as well.  The
>>>>>>cLAN hardware produces 0.5 usec latency, which is about 1000X better than any
>>>>>
>>>>>Bob, the latencies that i quote are RASML : Random Average Shared Memory
>>>>>Latencies.
>>>>>
>>>>>The latencies that you quote here are sequential latencies. Bandwidth divided by
>>>>>the number of seconds = latency (according to the manufacturers).
>>>>
>>>>No it isn't.  It is computed by _me_.  By randomly sending packets to different
>>>>nodes on this cluster and measuring the latency.  I'm not interested in any
>>>
>>>You need to ship a packet and then WAIT for it to get back. The simplest test is
>>>using one-way pingpong. I will email you that program now.
>>>
>>>You will see about a 20-30 usec latency then.
>>
>>Want to bet?  How about "the loser stops posting here?"
>>
>>
>>>
>>>>kind of bandwidth number.  I _know_ that is high.  It is high on a gigabit
>>>>ethernet switch.  I'm interested in the latency, how long does it take me to
>>>>get a packet from A to B, and there ethernet (including gigabit) is slow.
>>>
>>>>The cLAN with VIA is not.
>>>
>>>>IE on this particular cluster, it takes about 1/2 usec to get a short
>>>>packet from A to B.  The longer the packet, the longer the latency since I
>>>>assume that I need the last byte before I can use the first byte, which
>>>>might not always be true.
>>>
>>>Bob this is not one way ping pong latency. Not to mention that it isn't a full
>>>ship and receive.
>>
>>So what.  Ping-pong is the _only_ way I know to measure latency.  I told you
>>that is what I did.  What is your problem with understanding that?
>
>Bob, on this planet there are a thousand machines using the network cards you
>have, and there are guys in Italy who are busy making their own protocols in order
>to get faster latencies, and they *manage*.
>
>Now stop bragging about something you don't know. You do *not* know one way ping
>pong latencies.

Actually, I _DO_ know one-way ping pong latencies.  Of course, you seem to
know everything about everything, including everything everybody _else_ knows,
so that's an argument that can't be won.  But just because you say it, does
_not_ make it so.

I knew what "ping pong" latency was before you were _born_.


>
>When you just got your machine i asked you: "what is the latency of this thing?"
>
>Then you took a while to get the manufacturer specs of the cards and then came
>back with: "0.5 usec".

No, I took a while to go run the test.  If you were to ask me the max
sustainable I/O rate I would do the same thing.


>
>However that's, as we all know, bandwidth divided by time.

No it isn't.  Latency has _nothing_ to do with bandwidth in the context
of networking.  And _nobody_ I know of computes it that way.


>
>A very poor understanding of latency.
>
>Here is what pingpong, as we can use it, does. It ships 8 bytes and then waits for
>those 8 bytes to get back.
>
>After that it ships again 8 bytes and then waits for those 8 bytes to get back.
>
>If you want to you may make that 4 bytes too. I don't care.

That is _exactly_ what my latency measurement does.  As I have now said for
the _fourth_ time.


>
>The number of times you can do those shipments a second is called n.
>
>The latency in microseconds = 1 million / n.
>
>So don't quote the same thing you quoted a bunch of years ago again.
>
>That's not the latency we're looking after. Marketing managers have rewritten
>and rewritten that definition until they had something very fast.

You can define latency however you want.  I use _the_ definition that everybody
else uses, however, and will continue to do so.  Latency is the time taken to
send a packet from A to B.  One way to measure it is to do the ping-pong test,
although that is _not_ an accurate measurement.  If you want me to explain why
I will be happy to do so.  But to make it simple, that kind of ping-pong test
measures _more_ than latency.  Namely it includes the time needed to wake up
and schedule a process on the other end, which is _not_ part of the latency.

Of course, you won't understand that...

But I thought I'd try.
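
For reference, here is a minimal sketch of the kind of ping-pong test being
argued about: ship 8 bytes, wait for them to come back, count round trips per
second (n), and call 1,000,000 / n / 2 the "one-way" latency.  It assumes an
MPI installation and exactly two ranks, and it is only an illustration, not
the actual program anyone in this thread is running:

/* Minimal one-way ping-pong latency sketch.  Assumes MPI and exactly two
 * ranks; illustrative only, not the test program discussed above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100000;
    char buf[8] = {0};                  /* the 8-byte payload described above */
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                /* ship 8 bytes, wait for the echo */
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else {                        /* echo the 8 bytes straight back */
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double n = iters / elapsed;             /* round trips per second */
        printf("round-trip latency: %.2f usec\n", 1e6 / n);
        printf("one-way latency:    %.2f usec\n", 1e6 / n / 2.0);
        /* Whatever this prints includes message-library and OS scheduling
         * overhead on both ends, not just wire latency. */
    }
    MPI_Finalize();
    return 0;
}

Run the two ranks on two different nodes; run on a single node it measures
shared memory, not the network.  And as explained above, the number it
reports includes more than the latency itself.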

>
>For your information, your own machines have, if I remember correctly, 66 MHz PCI
>cards or something. Those are nice cards, much better than 33 MHz cards. That
>means that the latency of the PCI bus, which is about 4 usec, is added to that of
>the network cards when you do the measurement as above.
>
>Is that new for you?

Yes, and it is wrong.  Here is a test I just ran:

I did an "scp" copy of kqbkqb.nbw.emd from machine A to machine B.  That is
almost 1 gigabyte of data, and it took 7.9 seconds to complete.  To do that
copy, that "slow PCI bus" had to do the following:

1.  Deliver a copy of 1 gigabyte from disk to memory.
2.  Deliver a copy of 1 gigabyte from memory to the CPU for
    encryption.
3.  Deliver a copy of 1 gigabyte from the CPU back to memory (this is the
    encrypted data).
4.  Deliver 1 gigabyte from memory to the CPU (this is the TCP/IP layer
    copying and stuffing the data into packets).
5.  Deliver 1 gigabyte of data from the CPU to memory (this is the other half
    of the copying, to get the data into TCP/IP packet buffers).
6.  Deliver 1 gigabyte of data from memory to the network card.

Your 4 usec numbers are a bit distorted.  Your 250,000 "messages per second"
is not just distorted, but _wrong_.  My machine moved about 6 gigabytes of data
in about 8 seconds, and much of that delay was in the SCSI disk reads and the
network writes (this is on a gigabit network).

So please don't quote me any of your nonsense numbers; it is far easier to
run the tests.  If you want the lmbench numbers for memory speeds, I can
easily provide them.  Without any hand-waving.
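
For what it's worth, here is that arithmetic as a tiny C sketch, using the
approximate figures above (a ~1 gigabyte file, the six transfers listed, and
the 7.9 second wall time); nothing in it is measured, it just does the
division:

/* Back-of-the-envelope throughput implied by the scp test above. */
#include <stdio.h>

int main(void) {
    const double gb_per_step = 1.0;   /* ~1 GB tablebase file               */
    const double steps       = 6.0;   /* the six transfers enumerated above */
    const double seconds     = 7.9;   /* measured scp wall time             */
    printf("aggregate data movement: %.2f GB/s\n",
           gb_per_step * steps / seconds);        /* roughly 0.76 GB/s */
    return 0;
}

Roughly 0.76 gigabytes per second of total data movement across disk, memory,
CPU and the network card combined is what that one copy implies, and that is
where the objection to your per-transfer numbers comes from.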


>
>I am sure it should be as you're just quoting the same number i hear already
>several years from you.

I have had this cluster for 2.5 years.  You have been hearing the same number
from me repeatedly since I got it.  Not before.


>
>Of course you didn't do a pingpong. If you get under 1 microsecond from node to
>node you made a mistake. Latency of PCI is already way above that.


As I said, you _must_ know what you are doing to measure this stuff.  You are
factoring in way more than PCI latency.  Which is not a surprise, since you
don't know beans about operating systems.



>
>Now the above latency time which we need in order to know what it takes to send
>and receive a message, is divided by 2 by the pingpong program. That's called
>'one way pingpong' then.
>
>So better pack your bags.

Righto.  You seem to confuse "ping pong" with a game played with two paddles,
a net, and a small white ball.  But your latency measurement is not the way
to do it.  One day I'll tell you how _I_ do the ping-pong test, which _really_
measures latency.  _not_ the way you do it, by the way....



>
>>>
>>>In computerchess you don't ship something without waiting for answer back.
>>>You *want* answer back.
>>>
>>>Example if you want to split a node :)
>>
>>Wrong.  It is not hard to do this.  I say "do this" and that is all I need
>>to do until I get the result back.  I don't need an "OK, I got that, I'll
>>be back with the answer in a while."  It is easier to just keep going until
>>the answer arrives back.
>>
>>>
>>>The 0.5 usec latency is based upon shipping a terabyte of data without an answer back.
>>
>>No it isn't.
>
>
>
>>
>>
>>
>>>
>>>Bandwidth / time needed = latency then.
>>>
>>>What i tried to explain to you is RASML but i know you won't understand it.
>>>
>>>In order to not waste more time on this, I'll just email the thing to you.
>>>
>>>Run it any time you like, but run it on 2 different nodes. Don't run it on the
>>>same node :)
>>
>>
>>You sent me some MPI crap that I'm not going to fool with.  As I said, I
>>use VIA to use the cLAN stuff.  VIA.  Not MPI.
>>
>>But I'm not going to waste time running your crap anyway, because whenever I do
>>and you don't like the results, you just disappear for a while.
>>
>>
>>
>>>
>>>>VIA has some cute stuff to "share memory" too.
>>>>
>>>>>
>>>>>For computer chess that can't be used however.
>>>>>
>>>>>You can get a more accurate indication by using the well-known ping pong
>>>>>program. What it does is, over MPI, ship messages and then WAIT for them to
>>>>>come back. Then it divides that time by 2. That is called the one-way ping pong
>>>>>latency.
>>>>
>>>>That's how _I_ measure latency.  I know of no other way, since keeping two
>>>>machine clocks synced that accurately is not easy.
>>>>
>>>>
>>>>>
>>>>>If you multiply that by 2, you already get closer to the latency that it takes
>>>>>to get a single bitboard out of memory.
>>>>
>>>>It doesn't take me .5usec to get a bitboard out of memory.  Unless you are
>>>>talking about a NUMA machine where machine A wants the bitboard and it is
>>>>not in its local memory.
>>>>
>>>>
>>>>>
>>>>>Even better is using the RASML test I wrote. That uses OpenMP, though conversion
>>>>>to MPI is trivial (yet it slows things down so much that it is less
>>>>>accurate than OpenMP).
>>>>>
>>>>>So the best indication you can get is by doing a simple pingpong latency test.
>>>>
>>>>I do this all the time.
>>>>
>>>>>
>>>>>The best ethernet network cards are Myrinet network cards (about $1300). I do not
>>>>>know which chipset they have. At 133 MHz 64-bit PCI-X (Jay might know more about
>>>>>the specifications here) they can achieve something like 5 usec one-way ping pong
>>>>>latency, so that's a minimum of way more than 10 usec to get a bitboard from the
>>>>>other side of the machine.
>>>>
>>>>Correct.  cLAN is faster.  It is also more expensive.  The 8-port switch we
>>>>use cost us about $18,000 two years ago.  Myrinet was designed as a lower-cost
>>>>network.  With somewhat lower performance.
>>>>
>>>>>
>>>>>In your cluster you probably do not have such PCI stuff, Bob. Most likely it is
>>>>>around 10 usec for one-way latency on your cluster, so it takes a minimum of
>>>>>20 usec to get a message.
>>>>
>>>>In my cluster I have PCI cards that are faster than Myrinet.  They were made by
>>>>cLAN (again) and we paid about $1,500 each for them two years ago.  Again, you
>>>>can find info about the cLAN stuff and compare it to myrinet if you want.  We
>>>>have Myrinet stuff here on campus (not in any of my labs) and we have done the
>>>>comparisons.  When we write proposals to NSF, they _always_ push us towards
>>>>Myrinet because it is cheaper than the cLAN stuff, but it also is lower
>>>>performance.
>>>>
>>>>
>>>>
>>>>>
>>>>>Note that getting a cache line out of local memory on your quad Xeons already
>>>>>takes about 0.5 usec. You can hopefully imagine that the usec figures quoted by the
>>>>>manufacturer for cLAN are based upon bandwidth / time needed, and NOT the RASM
>>>>>latencies.
>>>>
>>>>Your number there is dead wrong.  My cluster is PIII based, with a cache
>>>>line of 32 bytes.  It uses 4-way interleaving.  lmbench reports the latency
>>>>as 132 nanoseconds, _total_.
>>>>
>>>>>
>>>>>Best regards,
>>>>>Vincent
>>>>>
>>>>>
>>>>>>TCP/IP-ethernet implementation.  However, ethernet will never touch good
>>>>>>hardware like the cLAN stuff.
>>>>>>
>>>>>>MPI/PVM use ethernet - TCP/IP for one obvious reason: "portability" and
>>>>>>"availability".  :)


