Author: Robert Hyatt
Date: 18:53:33 02/01/05
On February 01, 2005 at 17:04:42, Matthew Hull wrote:

>On February 01, 2005 at 16:28:22, Robert Hyatt wrote:
>
>>On February 01, 2005 at 13:26:04, Vincent Diepeveen wrote:
>>
>>>On February 01, 2005 at 12:59:31, Robert Hyatt wrote:
>>>
>>>>On February 01, 2005 at 10:59:18, Vincent Diepeveen wrote:
>>>>
>>>>>On February 01, 2005 at 00:56:17, Robert Hyatt wrote:
>>>>>
>>>>>>On January 31, 2005 at 13:35:02, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On January 31, 2005 at 13:03:43, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On January 31, 2005 at 10:14:28, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>On January 31, 2005 at 10:01:16, Vincent Lejeune wrote:
>>>>>>>>>
>>>>>>>>>>News from 28/01/05 (more to come):
>>>>>>>>>>
>>>>>>>>>>http://hydrachess.com/hydra-scylla.html
>>>>>>>>>>
>>>>>>>>>>32 nodes (the previous version had 16); no information about CPU power
>>>>>>>>>>and FPGA cards yet...
>>>>>>>>>
>>>>>>>>>It's 4 nodes.
>>>>>>>>>
>>>>>>>>>1 node = 8-processor Xeon.
>>>>>>>>>
>>>>>>>>>The FPGA cards would get double speed, so they must be between 30 MHz and
>>>>>>>>>60 MHz. They only use development FPGA cards, so they never use the real
>>>>>>>>>power of FPGAs (which is printing your own processor, which can easily run
>>>>>>>>>at 600 MHz or more). They stick to development cards for reasons unknown
>>>>>>>>>to me.
>>>>>>>>>
>>>>>>>>>CPU power is not interesting at all, of course; the cards do the work.
>>>>>>>>>
>>>>>>>>>Vincent
>>>>>>>>
>>>>>>>>I hope not. The old machine used 8 boxes with 2 CPUs per box. Going to
>>>>>>>>8-way Xeons is a performance killer. The PCI bus just can't keep up.
>>>>>>>
>>>>>>>OK, neither the sheikh nor his right-hand man knew the architectural
>>>>>>>details very well, which we will forgive them.
>>>>>>>
>>>>>>>I have accurate information now.
>>>>>>>
>>>>>>>It is a 32-node system, each node a dual. They have 32 FPGA cards, however,
>>>>>>>because when ordering, the person filling in the form (and I am NOT going
>>>>>>>to post who it was, but it was NOT the sheikh) confused nodes with CPUs.
>>>>>>>
>>>>>>>So they now have a mighty 32-node Myrinet cluster with 64 processors but
>>>>>>>only 32 cards, so they effectively run 64 processors served by the 32
>>>>>>>cards, which do the job. Cards at 55 MHz.
>>>>>>>
>>>>>>>Please note that the PCI bus isn't the problem. They are using PCI-X.
>>>>>>
>>>>>>PCI-X falls flat if you have 8 CPUs in a single box. I have run on such
>>>>>
>>>>>It's not PCI-X that is the problem at all.
>>>>>
>>>>>It's that it is simply tough programming to get it to work.
>>>>
>>>>No it isn't. There is one bus to memory, with 8 processors hanging on that
>>>>bus trying to get to memory. They get in each other's way, and that is why
>>>>bus architectures don't scale well beyond 4 processors. Even going to 4
>>>>requires 4-way interleaving to keep up, and they don't go to 8-way
>>>>interleaving on the Dell-type boxes. Others do, but they have a price that
>>>>shows it...
>>>>
>>>>For 8 and up, a crossbar is really the right way to go, if the price can be
>>>>handled. Otherwise a NUMA-type approach like the AMD solution is the most
>>>>affordable.
>>>>
>>>>>You simply must limit any remote read or write to the absolute minimum.
>>>>>
>>>>>Multithreading, forget it.
>>>>
>>>>Multithreading versus multiple processes is not the issue here. What you can
>>>>do with one, you can do with the other. One just offers easier-to-use
>>>>features for certain applications.
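To make the bus-contention point above concrete (it is the same effect behind
the Crafty NPS numbers quoted further down: 2 copies give 2x, 4 give nearly 4x,
8 give barely 6x), here is a minimal memory-streaming microbenchmark. This is a
sketch of mine, not code from Crafty, DIEP, or Hydra; the buffer size, pass
count, and thread cap are arbitrary choices:

    /* Each thread streams through a private buffer much larger than L2,
     * so every read must go to DRAM over the shared bus. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_WORDS (8 * 1024 * 1024)   /* 64 MB of longs per thread */
    #define PASSES    4

    static volatile long sink;            /* defeats dead-code elimination */

    static void *stream(void *arg)
    {
        long *buf = malloc(BUF_WORDS * sizeof(long));
        long sum = 0;
        /* Touch every page so reads hit real DRAM, not the shared zero page. */
        memset(buf, 1, BUF_WORDS * sizeof(long));
        for (int p = 0; p < PASSES; p++)
            for (long i = 0; i < BUF_WORDS; i++)
                sum += buf[i];            /* pure streaming reads */
        sink = sum;
        free(buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 1;   /* run with 1, 2, 4, 8 */
        pthread_t tid[64];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++)
            pthread_create(&tid[i], NULL, stream, NULL);
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gb   = (double)n * PASSES * BUF_WORDS * sizeof(long) / 1e9;
        printf("%d threads: %.1f GB in %.2f s = %.2f GB/s aggregate\n",
               n, gb, secs, gb / secs);
        return 0;
    }

Run with 1, 2, 4, and 8 threads on a shared-bus 8-way Xeon, the aggregate GB/s
should flatten well before 8 threads; on a NUMA Opteron, with a memory
controller per chip, it keeps scaling, which is the point of the crossbar/NUMA
remark above.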
>>>>>
>>>>>>machines. PCI does pretty well on 4-way systems, but on 8-way, the overall
>>>>>>gain
>>>>>
>>>>>PCI is at least 4 times slower than PCI-X in latency.
>>>>>
>>>>>PCI-X can easily give 1 us if the network card is fast enough.
>>>>
>>>>We are not talking about networks. We are talking (at least I am talking)
>>>>about SMP-type boxes with 8 CPUs in a single chassis, like the 8-CPU Xeon
>>>>Dell sells, or like the 4-CPU Xeon boxes I have here...
>>>
>>>I'm not talking about 8-way SMP boxes. Those are more expensive than a
>>>32-processor cluster when you put the latest processors inside.
>>
>>Quote on:
>>------------------------------------------
>>
>>It's 4 nodes.
>>
>>1 node = 8-processor Xeon.
>>------------------------------------------
>>quote off
>>
>>"One node = 8-processor Xeon." That is _clearly_ SMP. And _that_ is the box I
>>was talking about. I've run on more than one of them. For chess they are bad:
>>running 2 copies of Crafty produces 2x the NPS of one copy, running 4 copies
>>produces nearly 4x the NPS of one copy, but running 8 copies produces barely
>>6x the 1-copy speed.
>
>Would this be considered a NUMA relationship between the two 4-way nodes? Does
>your NUMA Crafty handle this better than your older SMP-only Crafty?

It wasn't NUMA. It used the Intel "Fusion" chipset to tie two 4-way processor
groups together to provide a total of 8, but there was a bottleneck. It was one
of those "let's see what it will do" sort of things. Big-L2-cache processors
could provide good performance if memory requirements were limited. I'll take a
NUMA Opteron system any day.

>>_That_ was the PCI bottleneck I was talking about.
>>
>>>>>In practice even cheapo Myrinet gives 2.7 us.
>>>>>
>>>>>>seems to be 1.5x a 4-way, which is not that great. If you run a program
>>>>>>that runs out of cache quickly, this drops even further.
>>>>>
>>>>>You can't run SMP programs over networks, obviously.
>>>>
>>>>That's why I wasn't talking about networks. You originally said this machine
>>>>(new Hydra) has a node with 8 processors. That is what I am talking about.
>>>
>>>Perhaps in the future read the subject.
>>
>>Perhaps in the future you should remember what you wrote and what I responded
>>to? I haven't changed the subject a single time in my posts in this thread.
>>
>>>>>>>Latency of a one-way ping-pong is around 2.7 us with Myrinet. That
>>>>>>>excludes the router costs, which I guess will also be about 1 us for
>>>>>>>random data traffic (down to 35 ns for bandwidth traffic).
>>>>>>>
>>>>>>>Vincent
>>>>>>
>>>>>>All depends. We have Myrinet here and are probably going to use it in our
>>>>>>new Opteron cluster when we buy it, after the dual-core Opterons start
>>>>>>shipping in quantity...
>>>>>
>>>>>For chess, Myrinet sucks, to say it very politely, because it doesn't allow
>>>>>DSM (distributed shared memory).
>>>>>
>>>>>For just a few dollars more you can get Quadrics or Dolphin, which have
>>>>>better latencies (Dolphin: 1 us) and allow distributed shared memory.
>>>>>
>>>>>The real major problem with Myrinet is that the receiving process must
>>>>>non-stop receive the messages and process them. So you must do some kind of
>>>>>hand timing within the search process to do just that.
>>>>>
>>>>>With DSM your processes don't feel any of that.
>>>>>
>>>>>An 8-node Quadrics network is 13,095 dollars. That includes everything.
>>>>>
>>>>>Quadrics is used in the fastest supercomputers, like the nuclear
>>>>>supercomputer France ordered a while ago. It scales far better than Myrinet
>>>>>when you start scaling above those 8 nodes.
>>>>>
>>>>>For chess, using the DSM features in a program is not so trivial, but it is
>>>>>pretty easy compared to the task of parallelizing the product itself.
>>>>>
>>>>>Vincent
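To illustrate the "hand timing" Vincent describes: without DSM, a
message-passing engine has to poll for incoming traffic (split-point offers,
results, aborts) from inside the search itself. The sketch below shows only the
general pattern; it uses MPI calls purely for illustration, and the poll
interval and message layout are invented, not taken from DIEP or Hydra:

    #include <mpi.h>

    #define POLL_INTERVAL 1024          /* poll every N nodes (arbitrary) */

    struct cluster_msg { int type; int payload[15]; };  /* hypothetical */

    static long nodes_searched;

    static void poll_cluster_messages(void)
    {
        int flag;
        MPI_Status status;
        struct cluster_msg msg;

        /* Drain everything that arrived since the last poll. */
        for (;;) {
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &status);
            if (!flag)
                break;
            MPI_Recv(&msg, sizeof msg / sizeof(int), MPI_INT,
                     status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD,
                     &status);
            /* dispatch on msg.type: work request, result, abort, ... */
        }
    }

    int search(int alpha, int beta, int depth)
    {
        /* The explicit "hand timing" check, woven into the hot path. */
        if (++nodes_searched % POLL_INTERVAL == 0)
            poll_cluster_messages();

        /* ... normal alpha-beta search body ... */
        return alpha;
    }

The cost is exactly what Vincent complains about: the check sits in the hot
path, yet a message can still wait unserviced for up to POLL_INTERVAL nodes.
With DSM the remote side simply reads or writes your memory directly and none
of this code exists.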
>>>>
>>>>Just remember that most supercomputer applications don't care about latency;
>>>
>>>Just remember that I don't build a cluster for a matrix calculation, but for
>>>DIEP :)
>>>
>>>In which case Myrinet sucks :)
>>
>>That's fine. But "supercomputers" are not built to your specifications. They
>>are built to address the programming requirements of the large numerical
>>systems being run, and they do that well. Your car wasn't designed to fly
>>either; you'd have hell getting it to do so, although a lowly Piper Cub can,
>>since it was designed to do so...
>>
>>>>they care about bandwidth. Large applications are all about streaming data;
>>>>the latency for the first word is not important when several million words
>>>>are going to follow back-to-back. All that matters to the big applications
>>>>is how frequently the next word arrives; the latency for the first word gets
>>>>buried in the cost of transferring the remaining millions of words. That's
>>>>what makes vector computers so powerful for the right kinds of applications,
>>>>as opposed to these "toy supercomputers" that just use lots of
>>>>general-purpose processors and sloppy interconnects.
>>>
>>>The majority of supercomputers are not vector computers. Latency is important
>>>on supercomputers too. The majority of jobs run on supercomputers use 4-8
>>>processors and eat half the system time each year. The other half is matrix
>>>calculations that could trivially run on cheapo clusters.
>>>
>>>Only a few applications are really optimized, and I wonder why they don't run
>>>those on clusters instead of very expensive SGI-type hardware, which delivers
>>>very few flops per dollar.
>>>
>>>Vincent
>>
>>Because a cluster can't offer 1/100th the total memory bandwidth of a big Cray
>>vector box.
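The latency-versus-bandwidth argument above reduces to simple arithmetic:
transfer time is roughly latency + bytes/bandwidth. Here is a back-of-envelope
calculation using the 2.7 us Myrinet figure from this thread (the 250 MB/s
bandwidth is an assumed Myrinet-class number, and both message sizes are
illustrative):

    #include <stdio.h>

    int main(void)
    {
        const double latency   = 2.7e-6; /* s: Myrinet one-way, from the thread */
        const double bandwidth = 250e6;  /* bytes/s: assumed for the era */
        const double sizes[]   = { 64.0, 8e6 };  /* chess msg vs 1M 8-byte words */

        for (int i = 0; i < 2; i++) {
            double wire  = sizes[i] / bandwidth; /* time on the wire */
            double total = latency + wire;       /* simple cost model */
            printf("%8.0f bytes: %.3e s total, latency = %6.2f%% of it\n",
                   sizes[i], total, 100.0 * latency / total);
        }
        return 0;
    }

For a 64-byte chess-sized message, latency is about 91% of the total cost; for
an 8 MB stream it is under 0.01%. That is why the big numerical codes shrug at
first-word latency while a cluster chess search lives and dies by it.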