Author: Axel Schumacher
Date: 22:27:13 03/02/05
On March 03, 2005 at 00:21:44, Russell Reagan wrote:

>On March 02, 2005 at 18:59:18, Axel Schumacher wrote:
>
>>Hi all,
>>I have two questions regarding the storage requirements for information; I hope
>>somebody can help me with answering them. Please excuse me if my questions are
>>stupid.
>>
>>1. For each data point (e.g. let's say the position of a pawn on the chessboard)
>>one requires 1 bit (either 0 or 1). Right? However, the information does not
>>include where the pawn is located. So, how much data has to be stored to
>>describe e.g. the position of a pawn?
>
>Practically speaking, if you only want to store the location of a single pawn,
>you can do that in 6 bits. 6 bits will hold 64 different values.
>
>Theoretically speaking, you must also store information that maps each of the 64
>6-bit values to each of the 64 squares on a chess board. For example, "a pawn is
>on square 28" doesn't mean anything by itself. We need to know that square 28
>is e4 (or whatever).
>
>However, the right answer depends on more details. Do you want a practical or
>theoretical answer? Do you only want to store the location of one pawn, or many
>pawns, or many types of pieces? Do you want to store data about one position, or
>about many positions (like a database)? Do you want to consider illegal
>positions with more than 16 pawns? Do you want a solution that works well with
>sparsely populated boards, or densely populated boards, or a solution that
>handles both well? Are you more interested in the average case or the worst
>case? A good answer will depend upon details like these.
>

Hi Russell,

thanks for your in-depth answer. Let's get practical. I have to calculate (using the analogy of computer chess) how much capacity is needed to store the information of the human epigenome, that is, changes in the information content of the DNA which are not based on the base-pair code. If I only measure a subset of the information (e.g. if I treat a nucleosome, a ~180 base-pair unit of the genome, as one target), I get about 3.6x10^14 raw data points. As you said, at this point this information is useless. In addition, the location on the chromosome (base-pair position), the chromosome number, the sex of the individual, the kind of epigenetic information (e.g. methylation or acetylation) and so on have to be stored for each data point (my guess is about 12 different attributes). In terms of chess you would have to store the information of 3.6x10^14 different chess pieces (actually the other way around would be more logical: store the information of the 3.6x10^14 'squares' and calculate the state of each square at a given time point: is there a pawn or a knight, is it white or black, attacked or not, moving or not, and so on).

How much storage capacity would be needed? 12 x 3.6x10^14 bits? Or is it less, since a few bits can store many values? I guess the amount of space is huge. With this knowledge one can, again using principles of computer chess, make sense of the data. For example, a certain chemical state is not possible because this state (position) would be illegal, like moving the King into check (in an organism that would correspond to a lethal chemical combination). Due to this ‘data-overflow’ and the fact that not much is known about epigenetic mechanisms, we face the problem that our ability to generate data in vast quantities is running ahead of our ability to make sense of it. One has to keep in mind that an organism is a dynamic system, where cause and effect are subtle.
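To make the question about 12 x 3.6x10^14 bits concrete, here is a minimal back-of-the-envelope sketch in Python. The attribute names and bit widths below are purely illustrative assumptions (not the actual 12 attributes); the only point is that several attributes can share one fixed-width record, just as a pawn's square fits into 6 bits:

# Hypothetical field widths for one epigenetic data point.
# The attribute list and bit counts are illustrative assumptions.
fields = {
    "chromosome":        5,   # 1-22, X, Y, MT fit into 5 bits (32 values)
    "basepair_position": 28,  # up to ~2.5 x 10^8 bp on the largest chromosome
    "sex":               1,   # 0 = female, 1 = male
    "modification_type": 4,   # e.g. methylation, acetylation, ... (16 kinds)
    "state":             1,   # modified / not modified, the 0-or-1 bit above
    "other_attributes":  24,  # placeholder for the remaining ~7 attributes
}

bits_per_point = sum(fields.values())   # analogous to 6 bits per pawn square
n_points = 3.6e14                       # raw data points

total_bytes = bits_per_point * n_points / 8
print(f"bits per data point: {bits_per_point}")
print(f"total storage: {total_bytes / 1e12:.0f} TB")

With these made-up widths that comes to roughly 63 bits per point and on the order of a few petabytes in total, so "huge" is right. Note also that attributes which are constant for one individual (e.g. the sex) only need to be stored once, not once per data point, which already shrinks the table considerably.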
Complexity comes along with dramatically different short-term and long-term effects, locally and in other parts of the system, and obvious inputs (moves on the board) may produce non-obvious consequences (e.g. a sacrifice of a knight can still lead to a win). The overall emergent behaviour of a complex system is difficult to predict, even when the subsystem behaviour (e.g. a chemical modification of DNA, or the movement of chess pieces) is readily predictable. The final goal would be to distill the data-overflow to an understandable size, to make sense of it, and hence to understand the behaviour of a complex system such as the human body (in the way a chess program can make correct and logical moves in a game of chess). On the other hand, pattern recognition has to be dynamic. This is because the human body is not a perfect system: a wholly ordered organism would constitute a perfect equilibrium and would be dead (which would be either a loss or a draw in chess). Yet order is a fundamental requirement to orchestrate all genetic elements (or chess pieces) at all levels of organization (from the opening to the endgame). But no natural environment can be ordered in its sum total and still function. Hence it may be that in epigenetic studies, and by analogy also in computer chess, the mechanistic and deterministic Newtonian world-view, accentuating stability, order, uniformity, equilibrium and linear relationships of a system, has to be complemented by a new paradigm which deals with temporality, disorder, instability, diversity and disequilibrium.

I wonder if I make sense :-)

A.

P.S. Maybe somebody is interested in writing a scientific paper with me about that?!

>>points
>
>>2. How much calculation power is needed to calculate a certain amount of data? I
>>know, this may sound a little bit abstract and, of course, it depends on
>>the time factor. But let's say you have 1 terabyte of data in a data spreadsheet.
>>What calculation power (e.g. number of average desktop computers) is needed to
>>make simple algebraic calculations with such a data table?
>
>Here also we need more details. How many things does the one terabyte describe
>(i.e. how many things have to be processed)? At what rate can a desktop computer
>process the data? 1 entry per hour? 10 million entries per second? What exactly
>needs to be done? Do you need one final result (average, sum, etc.), or do you
>need to keep the results for all entries? Does all of the data have to be
>processed? Or are you looking for something like a maximum or minimum? If so,
>maybe we could skip some work. Depending upon exactly what you want to know,
>different algorithms will perform better, and that also will depend on more
>details. If the data is sorted, that could help (depending upon what we're
>searching for). Do you need an exact answer, or only an estimate? Do you need to
>prove that the exact answer or estimate is correct, or is a "pretty good guess"
>okay? A good answer will depend upon details like these.
>
>>I hope somebody can help me with this.
>>I'm writing a paper in which I draw an analogy between biostatistical calculations
>>and calculations in chess (e.g. in a typical chess program). The
>>reason for this is to exemplify how biological data can be stored and how it can
>>be interpreted. In this special case we are dealing with 3.6 x 10^14 raw data
>>points deriving from chemical modifications in the human genome (so-called
>>epigenetics).
>>For example, whether a specific DNA base in the genome is methylated or
>>not: we have the state 0 or 1 again (plus this data has to be referenced). These
>>information units could interact in an infinite number of ways, so that it seems
>>impossible to make sense of them. However, IMHO, the analogy with
>>the game of chess exemplifies that it should still be feasible to approach the
>>problem of complex genetic information. In chess, a small number of rules can
>>generate a huge number of board configurations (states), which are analogous to
>>the configurations of molecules obeying physiological laws. Chess, too, is known to
>>have a seemingly infinite number of possible combinations in its play, but in theory
>>the number is finite, since specific positions are impossible, just as not all
>>(epi)genetic factors can be found in all functional working combinations. E.g.
>>it is said that in chess ‘merely’ ~10^43 to 10^50 states (positions) are needed
>>to describe the state (or the game) of the system. Out of these subsets of
>>possible states, patterns can be established and calculated. So it is not
>>necessary to know every possible state. It is obvious that pure reductionism,
>>the theory that all complex systems can be completely understood in terms of
>>their components, may not be a fully fruitful approach.
>>Yet, recent developments in the field of complexity (e.g. statistical mechanics)
>>have come up with alternative statistical approaches. Such an approach considers
>>the average behaviour of a large number of components rather than the behaviour
>>of any individual component, drawing heavily on the laws of probability, and aims
>>to predict and explain the measurable properties of macroscopic systems on the
>>basis of the properties and behaviour of their microscopic constituents. Chess
>>programs don't rely on brute force alone anymore. Maybe such 'pattern
>>recognition' or reduction of legal states can help in making sense of
>>complex data.
>>Your opinion? Answers to the questions? :-)
>>
>>Axel
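Coming back to the quoted second question about the 1 terabyte spreadsheet: one rough way to frame it is time = entries x passes / (rate per machine x number of machines). A small sketch, again in Python, using the "10 million entries per second" figure from the quoted reply as an assumed rate rather than a real benchmark:

# Back-of-the-envelope for "how many desktop computers for 1 TB of simple
# algebra?". The per-machine rate is an assumption, not a measurement.
DATA_BYTES       = 1e12      # 1 terabyte of spreadsheet-like data
BYTES_PER_ENTRY  = 8         # assume one 64-bit number per entry
ENTRIES          = DATA_BYTES / BYTES_PER_ENTRY

RATE_PER_MACHINE = 10e6      # assumed 10 million simple operations per second

def hours_needed(n_machines, passes=1):
    """Hours to run `passes` simple arithmetic passes over all entries."""
    seconds = ENTRIES * passes / (RATE_PER_MACHINE * n_machines)
    return seconds / 3600.0

for machines in (1, 10, 100):
    print(f"{machines:4d} desktop(s): {hours_needed(machines):6.1f} hours per pass")

Under those assumptions a single desktop gets through one simple pass over 1.25x10^11 entries in a few hours, and ten machines in well under an hour. In practice the details decide, as the quoted reply says: for data of this size, reading it off the disk usually costs more than the arithmetic itself, and a smarter algorithm (sorting, sampling, or skipping entries that cannot matter) can easily beat simply adding more computers.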