Author: Vincent Diepeveen
Date: 03:26:12 07/11/03
On July 10, 2003 at 16:50:45, Robert Hyatt wrote:

Ping-pong does *not* need a processor to get woken up. I don't know which OS you use that might be doing that, but ping-pong over MPI, when properly implemented, does *not* wake up processes at all. I hope you will understand that. If ping-pong had to wait for a process to wake up, it would run at 10 ms per exchange, because a process can wake up at most 100 times a second in Linux (and all *nix flavours), as the scheduler runs at 100 Hz. (See the sleep-granularity sketch at the end of this post.)

If you do not care about the ping-pong test, then you obviously do not care about chess programs either. Because when you need a hashtable entry from another node, that latency is exactly what you pay, and the ping-pong test is what measures it.

Note that all HPC professors include ping-pong among the first tests they use to measure supercomputers/clusters. In fact, every one of the cluster/supercomputer people I asked about the ping-pong test answered within a second. For them the ping-pong test *is* very relevant. All presentations of new systems include this test. See for example the homepage of Aad v/d Steen: http://www.phys.uu.nl/~steen/

>On July 09, 2003 at 18:42:25, Vincent Diepeveen wrote:
>
>>On July 09, 2003 at 16:02:11, Robert Hyatt wrote:
>>
>>>On July 09, 2003 at 00:23:52, Vincent Diepeveen wrote:
>>>
>>>>On July 08, 2003 at 11:58:58, Robert Hyatt wrote:
>>>>
>>>>>On July 08, 2003 at 08:49:48, Vincent Diepeveen wrote:
>>>>>
>>>>>>On July 07, 2003 at 10:48:02, Robert Hyatt wrote:
>>>>>>
>>>>>>>On July 05, 2003 at 23:37:47, Jay Urbanski wrote:
>>>>>>>
>>>>>>>>On July 04, 2003 at 23:33:46, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>><snip>
>>>>>>>>>"way better than MPI". Both use TCP/IP, just like PVM. Except that MPI/OpenMP
>>>>>>>>>is designed for homogeneous clusters while PVM works with heterogeneous mixes.
>>>>>>>>>But for any of the above, the latency is caused by TCP/IP, _not_ the particular
>>>>>>>>>library being used.
>>>>>>>>
>>>>>>>>With latency a concern, I don't know why you'd use TCP/IP as the transport for
>>>>>>>>MPI when there are much faster ones available.
>>>>>>>>
>>>>>>>>Even VIA over Ethernet would be an improvement.
>>>>>>>
>>>>>>>I use VIA over ethernet, and VIA over a cLAN giganet switch as well. The
>>>>>>>cLAN hardware produces .5 usec latency, which is about 1000X better than any
>>>>>>
>>>>>>Bob, the latencies that I quote are RASML: Random Average Shared Memory
>>>>>>Latencies.
>>>>>>
>>>>>>The latencies that you quote here are sequential latencies. Bandwidth divided by
>>>>>>the number of seconds = latency (according to the manufacturers).
>>>>>
>>>>>No it isn't. It is computed by _me_. By randomly sending packets to different
>>>>>nodes on this cluster and measuring the latency. I'm not interested in any
>>>>
>>>>You need to ship a packet and then WAIT for it to get back. The simplest test is
>>>>using one-way ping-pong. I will email you that program now.
>>>>
>>>>You will see about a 20-30 usec latency then.
>>>
>>>Want to bet? How about "the loser stops posting here?"
>>>
>>>>
>>>>>kind of bandwidth number. I _know_ that is high. It is high on a gigabit
>>>>>ethernet switch. I'm interested in the latency: how long does it take me to
>>>>>get a packet from A to B, and there ethernet (including gigabit) is slow.
>>>>
>>>>>The cLAN with VIA is not.
>>>>
>>>>>IE on this particular cluster, it takes about 1/2 usec to get a short
>>>>>packet from A to B. The longer the packet, the longer the latency, since I
>>>>>assume that I need the last byte before I can use the first byte, which
>>>>>might not always be true.
>>>>
>>>>Bob, this is not one-way ping-pong latency. Not to mention that it isn't a full
>>>>ship and receive.
>>>
>>>So what. Ping-pong is the _only_ way I know to measure latency. I told you
>>>that is what I did. What is your problem with understanding that?
>>
>>Bob, on this planet there are a thousand machines using the network cards you
>>have, and there are guys in Italy who are busy making their own protocols in
>>order to get faster latencies, and they *manage*.
>>
>>Now stop bragging about something you don't know. You do *not* know one-way
>>ping-pong latencies.
>
>Actually, I _DO_ know one-way ping-pong latencies. Of course, you seem to
>know everything about everything, including everything everybody _else_ knows,
>so that's an argument that can't be won. But just because you say it, does
>_not_ make it so.
>
>I knew what "ping pong" latency was before you were _born_.
>
>>
>>When you had just gotten your machine, I asked you: "what is the latency of
>>this thing?"
>>
>>Then you took a while to get the manufacturer specs of the card and then came
>>back with: "0.5 usec".
>
>No, I took a while to go run the test. If you were to ask me the max
>sustainable I/O rate I would do the same thing.
>
>>
>>However, that is, as we all know, bandwidth divided by time.
>
>No it isn't. Latency has _nothing_ to do with bandwidth in the context
>of networking. And _nobody_ I know of computes it that way.
>
>>
>>A very poor understanding of latency.
>>
>>Here is what ping-pong, as we can use it, does. It ships 8 bytes and then waits
>>for those 8 bytes to get back.
>>
>>After that it ships 8 bytes again and then waits for those 8 bytes to get back.
>>
>>If you want to you may make that 4 bytes too. I don't care.
>
>That is _exactly_ what my latency measurement does. As I have now said for
>the _fourth_ time.
>
>>
>>The number of times you can do those shipments per second is called n.
>>
>>The latency in microseconds = 1 million / n.
>>
>>(A sketch of exactly this test appears at the end of this post.)
>>
>>So don't quote the same thing you quoted a bunch of years ago again.
>>
>>That's not the latency we're looking for. Marketing managers have rewritten
>>and rewritten that definition until they had something very fast.
>
>You can define latency however you want. I use _the_ definition that everybody
>else uses, however, and will continue to do so. Latency is the time taken to
>send a packet from A to B. One way to measure it is to do the ping-pong test,
>although that is _not_ an accurate measurement. If you want me to explain why,
>I will be happy to do so. But to make it simple, that kind of ping-pong test
>measures _more_ than latency. Namely, it includes the time needed to wake up
>and schedule a process on the other end, which is _not_ part of the latency.
>
>Of course, you won't understand that...
>
>But I thought I'd try.
>
>>
>>For your information, your own machines have, if I remember well, 66 MHz PCI
>>cards or something. Those are cool cards; they're much better than 33 MHz
>>cards. That means that the latency of the PCI bus, which is about 4 usec, is
>>added to that of the network cards when you do the measurement as above.
>>
>>Is that new for you?
>
>Yes, and it is wrong. Here is a test I just ran:
>
>I did an "scp" copy of kqbkqb.nbw.emd from machine A to machine B. That is
>almost 1 gigabyte of data, and it took 7.9 seconds to complete. To do that
>copy, that "slow PCI bus" had to do the following:
>
>1. Deliver a copy of 1 gigabyte from disk to memory.
>2. Deliver a copy of 1 gigabyte from memory to the CPU for
>   encryption.
>3. Deliver a copy of 1 gigabyte from the CPU back to memory (this is the
>   encrypted data).
>4. Deliver 1 gigabyte from memory to the CPU (this is the TCP/IP layer
>   copying and stuffing the data into packets).
>5. Deliver 1 gigabyte of data from the CPU to memory; this is the other half
>   of the data copying to get the stuff to TCP/IP packet buffers.
>6. Deliver 1 gigabyte of data from memory to the network card.
>
>Your 4 usec numbers are a bit distorted. Your 250,000 "messages per second"
>is not just distorted, but _wrong_. My machine moved about 6 gigabytes of data
>in 7 seconds, and much of that delay was in the SCSI disk reads and the network
>writes (this is on a gigabit network).
>
>So please don't quote me any of your nonsense numbers; it is far easier to
>run the tests. If you want the lmbench numbers for memory speeds, I can
>easily provide that. Without any hand-waving.
>
>>
>>I am sure it is, as you're just quoting the same number I have heard from you
>>for several years already.
>
>I have had this cluster for 2.5 years. You have been hearing the same number
>from me repeatedly since I got it. Not before.
>
>>
>>Of course you didn't do a ping-pong. If you get under 1 microsecond from node
>>to node you made a mistake. The latency of PCI is already way above that.
>
>As I said, you _must_ know what you are doing to measure this stuff. You are
>factoring in way more than PCI latency. Which is not a surprise, since you
>don't know beans about operating systems.
>
>>
>>Now the above latency time, which we need in order to know what it takes to
>>send and receive a message, is divided by 2 by the ping-pong program. That's
>>called 'one-way ping-pong' then.
>>
>>So better pack your bags.
>
>Righto. You seem to confuse "ping pong" with a game played with two paddles,
>a net, and a small white ball. But your latency measurement is not the way
>to do it. One day I'll tell you how _I_ do the ping-pong test, which _really_
>measures latency. _Not_ the way you do it, by the way....
>
>>
>>>>
>>>>In computer chess you don't ship something without waiting for an answer back.
>>>>You *want* an answer back.
>>>>
>>>>For example, if you want to split a node :)
>>>
>>>Wrong. It is not hard to do this. I say "do this" and that is all I need
>>>to do until I get the result back. I don't need a "OK, I got that, I'll
>>>be back with the answer in a while." It is easier to just keep going until
>>>the answer arrives back.
>>>
>>>>
>>>>The 0.5 usec latency is based upon shipping a terabyte of data without an
>>>>answer back.
>>>
>>>No it isn't.
>>>
>>>>
>>>>Bandwidth / time needed = latency then.
>>>>
>>>>What I tried to explain to you is RASML, but I know you won't understand it.
>>>>
>>>>Rather than waste more time on this, I'll just email the thing to you.
>>>>
>>>>Run it any time you like, but run it on 2 different nodes. Don't run it on
>>>>the same node :)
>>>
>>>You sent me some MPI crap that I'm not going to fool with. As I said, I
>>>use VIA to use the cLAN stuff. VIA. Not MPI.
>>>
>>>But I'm not going to waste time running your crap anyway, as whenever I do it
>>>and you don't like the results, you just disappear for a while.
>>>
>>>>
>>>>>VIA has some cute stuff to "share memory" too.
>>>>>
>>>>>>
>>>>>>For computer chess that can't be used however.
>>>>>>
>>>>>>You can get a more accurate indication by using the well-known ping-pong
>>>>>>program. What it does is: over MPI it ships messages and then WAITS for
>>>>>>them to come back. Then it divides that time by 2. Then it is called
>>>>>>one-way ping-pong latencies.
>>>>>
>>>>>That's how _I_ measure latency. I know of no other way, since keeping two
>>>>>machine clocks synced that accurately is not easy.
>>>>>
>>>>>>
>>>>>>If you multiply that by 2, you already get closer to the latency that it
>>>>>>takes to get a single bitboard out of memory.
>>>>>
>>>>>It doesn't take me .5 usec to get a bitboard out of memory. Unless you are
>>>>>talking about a NUMA machine where machine A wants the bitboard and it is
>>>>>not in its local memory.
>>>>>
>>>>>>
>>>>>>Even better is using the RASML test I wrote. That uses OpenMP, though
>>>>>>conversion to MPI is trivial (yet it slows things down so much that it is
>>>>>>less accurate than OpenMP). (A pointer-chase sketch in this spirit appears
>>>>>>at the end of this post.)
>>>>>>
>>>>>>So the best indication you can get is by doing a simple ping-pong latency
>>>>>>test.
>>>>>
>>>>>I do this all the time.
>>>>>
>>>>>>
>>>>>>The best ethernet network cards are Myrinet network cards (about $1300). I
>>>>>>do not know which chipset they have. At 133 MHz 64-bit PCI-X (Jay might
>>>>>>know more about the specifications here) they can achieve something like
>>>>>>5 usec one-way ping-pong latency, so that's a minimum of way more than
>>>>>>10 usec to get a bitboard from the other side of the machine.
>>>>>
>>>>>Correct. cLAN is faster. It is also more expensive. The 8-port switch we
>>>>>use cost us about $18,000 two years ago. Myrinet was designed as a lower-cost
>>>>>network. With somewhat lower performance.
>>>>>
>>>>>>
>>>>>>In your cluster you probably do not have such PCI stuff, Bob. Most likely
>>>>>>it is around 10 usec for one-way latency at your cluster, so it takes a
>>>>>>minimum of 20 usec to get a message.
>>>>>
>>>>>In my cluster I have PCI cards that are faster than Myrinet. They were made
>>>>>by cLAN (again) and we paid about $1,500 each for them two years ago. Again,
>>>>>you can find info about the cLAN stuff and compare it to Myrinet if you want.
>>>>>We have Myrinet stuff here on campus (not in any of my labs) and we have done
>>>>>the comparisons. When we write proposals to NSF, they _always_ push us
>>>>>towards Myrinet because it is cheaper than the cLAN stuff, but it also is
>>>>>lower performance.
>>>>>
>>>>>>
>>>>>>Note that getting a cache line out of local memory of your quad Xeons
>>>>>>already takes about 0.5 usec. You can hopefully imagine that the usecs
>>>>>>quoted by the manufacturer for cLAN are based upon bandwidth / time needed.
>>>>>>And NOT the RASML latencies.
>>>>>
>>>>>Your number there is dead wrong. My cluster is PIII based, with a cache
>>>>>line of 32 bytes. It uses 4-way interleaving. lmbench reports the latency
>>>>>as 132 nanoseconds, _total_.
>>>>>
>>>>>>
>>>>>>Best regards,
>>>>>>Vincent
>>>>>>
>>>>>>
>>>>>>>TCP/IP-ethernet implementation. However, ethernet will never touch good
>>>>>>>hardware like the cLAN stuff.
>>>>>>>
>>>>>>>MPI/PVM use ethernet - TCP/IP for one obvious reason: "portability" and
>>>>>>>"availability". :)
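For reference, here is a minimal sketch of the 8-byte ping-pong test described above. It is an illustration, not the exact program mentioned in the post; it assumes a working MPI installation (mpicc/mpirun) with ranks 0 and 1 placed on two different nodes, and the file name, iteration count, and message size are arbitrary choices.

/* pingpong.c -- the 8-byte ping-pong test as described in the post.
   Build:  mpicc -O2 -o pingpong pingpong.c
   Run:    mpirun -np 2 ./pingpong   (the two ranks on two different nodes) */
#include <stdio.h>
#include <mpi.h>

#define ITERS 100000     /* number of round trips; arbitrary but large */
#define BYTES 8          /* message size from the post; 4 works too */

int main(int argc, char **argv)
{
    char buf[BYTES] = {0};
    int rank;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);            /* start both ranks together */

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                    /* ship 8 bytes, WAIT for them back */
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {             /* echo them straight back */
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double n = ITERS / (t1 - t0);       /* n = round trips per second */
        printf("round trip: %.2f usec  one-way: %.2f usec\n",
               1e6 / n, 1e6 / n / 2.0);     /* latency = 1 million / n, halved */
    }
    MPI_Finalize();
    return 0;
}

Run on a single node it measures the MPI stack rather than the wire, which is exactly why the post says to run it on 2 different nodes.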
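The 100 Hz scheduler point at the top of the post can be checked directly: ask the kernel for a 1 usec sleep and measure what you actually get. This is a rough sketch, assuming a Unix system with usleep/gettimeofday; on a 100 Hz-tick kernel of that era the average comes out on the order of 10,000-20,000 usec, about 1000x the network latencies under discussion, which is why a properly implemented ping-pong must busy-wait instead of sleeping.

/* ticktest.c -- rough check of scheduler wake-up granularity.
   Build:  cc -O2 -o ticktest ticktest.c */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct timeval a, b;
    gettimeofday(&a, NULL);
    for (int i = 0; i < 100; i++)
        usleep(1);        /* ask for 1 usec; the kernel can only wake us on a tick */
    gettimeofday(&b, NULL);
    double us = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
    /* on a 100 Hz kernel this prints roughly 10,000-20,000 usec per wake-up */
    printf("average wake-up: %.0f usec\n", us / 100.0);
    return 0;
}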
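Finally, a sketch in the spirit of the RASML idea mentioned in the quotes: a random pointer chase over a buffer far larger than any cache, so every load is a dependent access to an unpredictable line. This is not Vincent's actual OpenMP test, just a single-threaded, local-memory illustration of what "random average memory latency" means; the buffer size and step count are arbitrary.

/* memlat.c -- random pointer chase over 64 MB of memory.
   Build:  cc -O2 -o memlat memlat.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N     (1 << 23)            /* 8M pointers = 64 MB, far beyond any cache */
#define STEPS (10 * 1000 * 1000)

int main(void)
{
    void **ring = malloc(N * sizeof(void *));
    size_t *perm = malloc(N * sizeof(size_t));
    size_t i;
    if (!ring || !perm) return 1;

    /* random permutation of 0..N-1 (Fisher-Yates; rand() is fine for a sketch) */
    for (i = 0; i < N; i++) perm[i] = i;
    for (i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    /* chain the cells in permuted order: one big cycle, so each load depends
       on the previous one and lands on a line the prefetcher cannot predict */
    for (i = 0; i + 1 < N; i++) ring[perm[i]] = &ring[perm[i + 1]];
    ring[perm[N - 1]] = &ring[perm[0]];
    free(perm);

    struct timeval a, b;
    void **p = &ring[0];
    gettimeofday(&a, NULL);
    for (long s = 0; s < STEPS; s++)
        p = (void **)*p;                /* one dependent random load per step */
    gettimeofday(&b, NULL);

    double ns = ((b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec)) * 1e3 / STEPS;
    /* printing p keeps the compiler from optimizing the chase away */
    printf("avg random load: %.1f ns  (%p)\n", ns, (void *)p);
    free(ring);
    return 0;
}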