Complexity - A Guided Tour - Part 6

These questions were all answered, at least in part, within the next ten years. The biggest break came when, in 1953, James Watson and Francis Crick figured out that the structure of DNA is a double helix. In the early 1960s, the combined work of several scientists succeeded in breaking the genetic code-how the parts of DNA encode the amino acids that make up proteins. A gene-a concept that had been around since Mendel without any understanding of its molecular substrate-could now be defined as a substring of DNA that codes for a particular protein. Soon after this, it was worked out how the code was translated by the cell into proteins, how DNA makes copies of itself, and how variation arises via copying errors, externally caused mutations, and sexual recombination. This was clearly a "tipping point" in genetics research. The science of genetics was on a roll, and hasn't stopped rolling yet.

The Mechanics of DNA.

The collection of all of an organism's physical traits-its phenotype-comes about largely due to the character of and interactions between proteins in cells. Proteins are long chains of molecules called amino acids.

Every cell in your body contains almost exactly the same complete DNA sequence, which is made up of a string of chemicals called nucleotides. Nucleotides contain chemicals called bases, which come in four varieties, called (for short) A, C, G, and T. In humans, strings of DNA are actually double strands of paired A, C, G, and T molecules. Due to chemical affinities, A always pairs with T, and C always pairs with G.

Sequences are usually written with one line of letters on the top, and the paired letters (base pairs) on the bottom, for example, TCCGATT ...

AGGCTAA ...

In a DNA molecule, these double strands weave around one another in a double helix (figure 6.1).

Subsequences of DNA form genes. Roughly, each gene codes for a particular protein. It does that by coding for each of the amino acids that make up the protein. The way amino acids are coded is called the genetic code. The code is the same for almost every organism on Earth. Each amino acid corresponds to a triple of nucleotide bases. For example, the DNA triplet AAG corresponds to the amino acid phenylalanine, and the DNA triplet CAC corresponds to the amino acid valine. These triplets are called codons.

So how do proteins actually get formed by genes? Each cell has a complex set of molecular machinery that performs this task. The first step is transcription (figure 6.2), which happens in the cell nucleus. From a single strand of the DNA, an enzyme (an active protein) called RNA polymerase unwinds a small part of the DNA from its double helix. This enzyme then uses one of the DNA strands to create a messenger RNA (or mRNA) molecule that is a letter-for-letter copy of the section of DNA. Actually, it is an anticopy: in every place where the gene has C, the mRNA has G, and in every place where the gene has A, the mRNA has U (its version of T). The original can be reconstructed from the anticopy.

FIGURE 6.1. Illustration of the double helix structure of DNA. (From the National Human Genome Research Institute, Talking Glossary of Genetic Terms [http://www.genome.gov/glossary.cfm.])

FIGURE 6.2. Illustration of transcription of DNA into messenger RNA. Note that the letter U is RNA's version of DNA's letter T.

The process of transcription continues until the gene is completely transcribed as mRNA.
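A minimal Python sketch of that complementing step may help; the function name and the example strand here are illustrative choices, and the strand is read in a simplified left-to-right way rather than as the enzyme actually moves:

```python
# Transcription sketch: build the mRNA "anticopy" of a DNA strand.
# As described above, C in the DNA gives G in the mRNA, G gives C,
# A gives U (RNA's version of T), and T gives A.
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(dna_strand: str) -> str:
    """Return the mRNA anticopy of a DNA strand."""
    return "".join(DNA_TO_MRNA[base] for base in dna_strand)

print(transcribe("TCCGATT"))  # -> AGGCUAA (compare the paired DNA strand AGGCTAA shown earlier)
```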

The second step is translation (figure 6.3), which happens in the cell cytoplasm. The newly created mRNA strand moves from the nucleus to the cytoplasm, where it is read, one codon at a time, by a cytoplasmic structure called a ribosome. In the ribosome, each codon is brought together with a corresponding anticodon residing on a molecule of transfer RNA (tRNA). The anticodon consists of the complementary bases. For example, in figure 6.3, the mRNA codon being translated is UAG, and the anticodon is the complementary bases AUC. A tRNA molecule that has that anticodon will attach to the mRNA codon, as shown in the figure. It just so happens that every tRNA molecule has attached to it both an anticodon and the corresponding amino acid (the codon AUC happens to code for the amino acid isoleucine in case you were interested). Douglas Hofstadter has called tRNA "the cell's flash cards."

FIGURE 6.3. Illustration of translation of messenger RNA into amino acids.

The ribosome cuts off the amino acids from the tRNA molecules and hooks them up into a protein. When a stop-codon is read, the ribosome gets the signal to stop, and releases the protein into the cytoplasm, where it will go off and perform whatever function it is supposed to do.
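The ribosome's read-and-attach loop can be sketched in the same spirit; the codon table below is only a tiny, illustrative slice of the full genetic code, and the mRNA string is an invented example:

```python
# Translation sketch: read mRNA one codon (three bases) at a time,
# look up the corresponding amino acid, and stop at a stop codon.
CODON_TABLE = {
    "AUG": "methionine",    # also the usual "start" codon
    "UUC": "phenylalanine",
    "GUG": "valine",
    "AUC": "isoleucine",
    "UAA": "STOP",          # one of the stop codons
}

def translate(mrna: str) -> list[str]:
    """Return the chain of amino acids encoded by an mRNA string."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break           # the ribosome releases the protein here
        protein.append(amino_acid)
    return protein

print(translate("AUGUUCGUGAUCUAA"))
# -> ['methionine', 'phenylalanine', 'valine', 'isoleucine']
```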

The transcription and translation of a gene is called the gene's expression; a gene is being expressed at a given time if it is being transcribed and translated.

All this happens continually and simultaneously in thousands of sites in each cell, and in all of the trillions of cells in your body. It's amazing how little energy this takes-if you sit around watching TV, say, all this subcellular activity will burn up fewer than 100 calories per hour. That's because these processes are in part fueled by the random motion and collisions of huge numbers of molecules, which get their energy from the "ambient heat bath" (e.g., your warm living room).

The paired nature of nucleotide bases, A with T and C with G, is also the key to the replication of DNA. Before mitosis, enzymes unwind and separate strands of DNA. For each strand, other enzymes read the nucleotides in the DNA strand, and to each one attach a new nucleotide (new nucleotides are continually manufactured in chemical processes going on in the cell), with A attached to T, and C attached to G, as usual. In this way, each strand of the original two-stranded piece of DNA becomes a new two-stranded piece of DNA, and each cell that is the product of mitosis gets one of these complete two-stranded DNA molecules. There are many complicated processes in the cell that keep this replication process on track. Occasionally (about once every 100 billion nucleotides), errors will occur (e.g., a wrong base will be attached), resulting in mutations.
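A sketch of that copying rule, under the same simplifying assumptions as the earlier snippets (the function name is mine):

```python
# Replication sketch: each original strand acts as a template for a new
# complementary strand, so one double strand becomes two.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def replicate(strand: str) -> tuple[str, str]:
    """Return the original strand together with its newly built complement."""
    new_strand = "".join(DNA_COMPLEMENT[base] for base in strand)
    return strand, new_strand

print(replicate("TCCGATT"))  # -> ('TCCGATT', 'AGGCTAA'), the pairing shown earlier
```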

It is important to note that there is a wonderful self-reference here: all this complex cellular machinery-the mRNA, tRNA, ribosomes, polymerases, and so forth-that effects the transcription, translation, and replication of DNA is itself encoded in that very DNA. As Hofstadter remarks: "The DNA contains coded versions of its own decoders!" It also contains coded versions of all the proteins that go into synthesizing the nucleotides the DNA is made up of. It's a self-referential circularity that would no doubt have pleased Turing, had he lived to see it explained.

The processes sketched above were understood in their basic form by the mid-1960s in a heroic effort by geneticists to make sense of this incredibly complex system. The effort also brought about a new understanding of evolution at the molecular level.

In 1962, Crick, Watson, and biologist Maurice Wilkins jointly received the Nobel Prize in Medicine for their discoveries about the structure of DNA. In 1968, Har Gobind Khorana, Robert Holley, and Marshall Nirenberg received the same prize for their work on cracking the genetic code. By this time, it finally seemed that the major mysteries of evolution and inheritance had been mostly worked out. However, as we see in chapter 18, it is turning out to be a lot more complicated than anyone ever thought.

CHAPTER 7.

Defining and Measuring Complexity.

THIS BOOK IS ABOUT COMPLEXITY, but so far I haven't defined this term rigorously or given any clear way to answer questions such as these: Is a human brain more complex than an ant brain? Is the human genome more complex than the genome of yeast? Did complexity in biological organisms increase over the last four billion years of evolution? Intuitively, the answer to these questions would seem to be "of course." However, it has been surprisingly difficult to come up with a universally accepted definition of complexity that can help answer these kinds of questions.

In 2004 I organized a panel discussion on complexity at the Santa Fe Institute's annual Complex Systems Summer School. It was a special year: 2004 marked the twentieth anniversary of the founding of the institute. The panel consisted of some of the most prominent members of the SFI faculty, including Doyne Farmer, Jim Crutchfield, Stephanie Forrest, Eric Smith, John Miller, Alfred Hubler, and Bob Eisenstein-all well-known scientists in fields such as physics, computer science, biology, economics, and decision theory. The students at the school-young scientists at the graduate or postdoctoral level-were given the opportunity to ask any question of the panel. The first question was, "How do you define complexity?" Everyone on the panel laughed, because the question was at once so straightforward, so expected, and yet so difficult to answer. Each panel member then proceeded to give a different definition of the term. A few arguments even broke out between members of the faculty over their respective definitions. The students were a bit shocked and frustrated. If the faculty of the Santa Fe Institute-the most famous institution in the world devoted to research on complex systems-could not agree on what was meant by complexity, then how can there even begin to be a science of complexity?

The answer is that there is not yet a single science of complexity but rather several different sciences of complexity with different notions of what complexity means. Some of these notions are quite formal, and some are still very informal. If the sciences of complexity are to become a unified science of complexity, then people are going to have to figure out how these diverse notions-formal and informal-are related to one another, and how to most usefully refine the overly complex notion of complexity. This is work that largely remains to be done, perhaps by those shocked and frustrated students as they take over from the older generation of scientists.

I don't think the students should have been shocked and frustrated. Any perusal of the history of science will show that the lack of a universally accepted definition of a central term is more common than not. Isaac Newton did not have a good definition of force, and in fact, was not happy about the concept since it seemed to require a kind of magical "action at a distance," which was not allowed in mechanistic explanations of nature. While genetics is one of the largest and fastest growing fields of biology, geneticists still do not agree on precisely what the term gene refers to at the molecular level. Astronomers have discovered that about 95% of the universe is made up of "dark matter" and "dark energy" but have no clear idea what these two things actually consist of. Psychologists don't have precise definitions for idea or concept, or know what these correspond to in the brain. These are just a few examples. Science often makes progress by inventing new terms to describe incompletely understood phenomena; these terms are gradually refined as the science matures and the phenomena become more completely understood. For example, physicists now understand all forces in nature to be combinations of four different kinds of fundamental forces: electromagnetic, strong, weak, and gravitational. Physicists have also theorized that the seeming "action at a distance" arises from the interaction of elementary particles. Developing a single theory that describes these four fundamental forces in terms of quantum mechanics remains one of the biggest open problems in all of physics. Perhaps in the future we will be able to isolate the different fundamental aspects of "complexity" and eventually unify all these aspects in some overall understanding of what we now call complex phenomena.

The physicist Seth Lloyd published a paper in 2001 proposing three different dimensions along which to measure the complexity of an object or process: How hard is it to describe?

How hard is it to create?

What is its degree of organization?

Lloyd then listed about forty measures of complexity that had been proposed by different people, each of which addressed one or more of these three questions using concepts from dynamical systems, thermodynamics, information theory, and computation. Now that we have covered the background for these concepts, I can sketch some of these proposed definitions.

To illustrate these definitions, let's use the example of comparing the complexity of the human genome with the yeast genome. The human genome contains approximately three billion base pairs (i.e., pairs of nucleotides). It has been estimated that humans have about 25,000 genes-that is, regions that code for proteins. Surprisingly, only about 2% of base pairs are actually parts of genes; the nongene parts of the genome are called noncoding regions. The noncoding regions have several functions: some of them help keep their chromosomes from falling apart; some help control the workings of actual genes; some may just be "junk" that doesn't really serve any purpose, or has some function yet to be discovered.

I'm sure you've heard of the Human Genome Project, but you may not know that there was also a Yeast Genome Project, in which the complete DNA sequences of several varieties of yeast were determined. The first variety that was sequenced turned out to have approximately twelve million base pairs and six thousand genes.

Complexity as Size.

One simple measure of complexity is size. By this measure, humans are about 250 times as complex as yeast if we compare the number of base pairs, but only about four times as complex if we count genes.
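(Using the figures above: roughly 3 billion base pairs divided by 12 million gives about 250, and 25,000 genes divided by 6,000 gives about 4.)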

Since 250 is a pretty big number, you may now be feeling rather complex, at least as compared with yeast. However, disappointingly, it turns out that the amoeba, another type of single-celled microorganism, has about 225 times as many base pairs as humans do, and a mustard plant called Arabidopsis has about the same number of genes that we do.

Humans are obviously more complex than amoebae or mustard plants, or at least I would like to think so. This means that genome size is not a very good measure of complexity; our complexity must come from something deeper than our absolute number of base pairs or genes (see figure 7.1).

Complexity as Entropy.

Another proposed measure of the complexity of an object is simply its Shannon entropy, defined in chapter 3 to be the average information content or "amount of surprise" a message source has for a receiver. In our example, we could define a message to be one of the symbols A, C, G, or T. A highly ordered and very easy-to-describe sequence such as "A A A A A A A... A" has entropy equal to zero. A completely random sequence has the maximum possible entropy.
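To make this concrete, here is a small Python sketch (my own, not anything specified above) that treats each base as an independent message and computes the average information content in bits per symbol:

```python
import math
from collections import Counter

def shannon_entropy(sequence: str) -> float:
    """Average information content, in bits per symbol, of a sequence
    whose symbols are treated as independent, memoryless messages."""
    counts = Counter(sequence)
    total = len(sequence)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(shannon_entropy("A" * 20))           # 0.0 bits: no surprise at all
print(shannon_entropy("ACGTTGCAACGTGTCA")) # 2.0 bits: each base equally likely,
                                           # the maximum for a four-letter alphabet
```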

FIGURE 7.1. Clockwise from top left: Yeast, an amoeba, a human, and Arabidopsis. Which is the most complex? If you used genome length as the measure of complexity, then the amoeba would win hands down (if only it had hands). (Yeast photograph from NASA, [http://www.nasa.gov/mission_pages/station/science/experiments/Yeast-GAP.html]; amoeba photograph from NASA [http://ares.jsc.nasa.gov/astrobiology/biomarkers/_images/amoeba.jpg]; Arabidopsis photograph courtesy of Kirsten Bomblies; Darwin photograph reproduced with permission from John van Wyhe, ed., The Complete Work of Charles Darwin Online [http://darwin-online.org.uk/].)

There are a few problems with using Shannon entropy as a measure of complexity. First, the object or process in question has to be put in the form of "messages" of some kind, as we did above. This isn't always easy or straightforward-how, for example, would we measure the entropy of the human brain? Second, the highest entropy is achieved by a random set of messages. We could make up an artificial genome by choosing a bunch of random As, Cs, Gs, and Ts. Using entropy as the measure of complexity, this random, almost certainly nonfunctional genome would be considered more complex than the human genome. Of course one of the things that makes humans complex, in the intuitive sense, is precisely that our genomes aren't random but have been evolved over long periods to encode genes useful to our survival, such as the ones that control the development of eyes and muscles. The most complex entities are not the most ordered or random ones but somewhere in between. Simple Shannon entropy doesn't capture our intuitive concept of complexity.

Complexity as Algorithmic Information Content.

Many people have proposed alternatives to simple entropy as a measure of complexity. Most notably, Andrey Kolmogorov, and independently Gregory Chaitin and Ray Solomonoff, proposed that the complexity of an object is the size of the shortest computer program that could generate a complete description of the object. This is called the algorithmic information content of the object. For example, think of a very short (artificial) string of DNA: ACACACACACACACACACAC (string 1).

A very short computer program, "Print A C ten times," would spit out this pattern. Thus the string has low algorithmic information content. In contrast, here is a string I generated using a pseudo-random number generator: ATCTGTCAAGACGGAACAT (string 2).

Assuming my random number generator is a good one, this string has no discernible overall pattern to it, and would require a longer program, namely "Print the exact string A T C T G T C A A G A C G G A A C A T." The idea is that string 1 is compressible, but string 2 is not, and so contains more algorithmic information. Like entropy, algorithmic information content assigns higher information content to random objects than to ones we would intuitively consider to be complex.
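There is no general way to compute the true shortest program, but an ordinary compressor gives a rough feel for the contrast; in the sketch below (my own, using longer versions of the two kinds of strings), the repetitive pattern shrinks dramatically while the pseudo-random one barely does:

```python
import random
import zlib

# Rough proxy only: compressed length is an upper bound on how short a
# description can be, not the true algorithmic information content.
regular = "AC" * 5000                                             # like string 1
scrambled = "".join(random.choice("ACGT") for _ in range(10000))  # like string 2

print(len(regular), "->", len(zlib.compress(regular.encode())))
print(len(scrambled), "->", len(zlib.compress(scrambled.encode())))
# The regular string compresses to a few dozen bytes; the pseudo-random one
# stays in the thousands.
```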

The physicist Murray Gell-Mann proposed a related measure he called "effective complexity" that accords better with our intuitions about complexity. Gell-Mann proposed that any given entity is composed of a combination of regularity and randomness. For example, string 1 above has a very simple regularity: the repeating A C motif. String 2 has no regularities, since it was generated at random. In contrast, the DNA of a living organism has some regularities (e.g., important correlations among different parts of the genome) probably combined with some randomness (e.g., true junk DNA).

To calculate the effective complexity, first one figures out the best description of the regularities of the entity; the effective complexity is defined as the amount of information contained in that description, or equivalently, the algorithmic information content of the set of regularities.

String 1 above has the regularity that it is A C repeated over and over. The amount of information needed to describe this regularity is the algorithmic information content of this regularity: the length of the program "Print A C some number of times." Thus, entities with very predictable structure have low effective complexity.

At the other extreme, string 2, being random, has no regularities. Thus there is no information needed to describe its regularities, and while the algorithmic information content of the string itself is maximal, the algorithmic information content of the string's regularities-its effective complexity-is zero. In short, as we would wish, both very ordered and very random entities have low effective complexity.

The DNA of a viable organism, having many independent and interdependent regularities, would have high effective complexity because its regularities presumably require considerable information to describe.

The problem here, of course, is how do we figure out what the regularities are? And what happens if, for a given system, various observers do not agree on what the regularities are?

Gell-Mann makes an analogy with scientific theory formation, which is, in fact, a process of finding regularities about natural phenomena. For any given phenomenon, there are many possible theories that express its regularities, but clearly some theories-the simpler and more elegant ones-are better than others. Gell-Mann knows a lot about this kind of thing-he received the 1969 Nobel Prize in Physics for his wonderfully elegant theory that finally made sense of the (then) confusing mess of elementary particle types and their interactions.

In a similar way, given different proposed sets of regularities that fit an entity, we can determine which is best by using the test called Occam's Razor. The best set of regularities is the smallest one that describes the entity in question and at the same time minimizes the remaining random component of that entity. For example, biologists today have found many regularities in the human genome, such as genes, regulatory interactions among genes, and so on, but these regularities still leave a lot of seemingly random aspects that don't obey any regularities-namely, all that so-called junk DNA. If the Murray Gell-Mann of biology were to come along, he or she might find a better set of regularities that is simpler than that which biologists have so far identified and that is obeyed by more of the genome.

Effective complexity is a compelling idea, though like most of the proposed measures of complexity, it is hard to actually measure. Critics also have pointed out that the subjectivity of its definition remains a problem.

Complexity as Logical Depth.

In order to get closer to our intuitions about complexity, in the early 1980s the mathematician Charles Bennett proposed the notion of logical depth. The logical depth of an object is a measure of how difficult that object is to construct. A highly ordered sequence of A, C, G, T (e.g., string 1, mentioned previously) is obviously easy to construct. Likewise, if I asked you to give me a random sequence of A, C, G, and T, that would be pretty easy for you to do, especially with the help of a coin you could flip or dice you could roll. But if I asked you to give me a DNA sequence that would produce a viable organism, you (or any biologist) would be very hard-pressed to do so without cheating by looking up already-sequenced genomes.

In Bennett's words, "Logically deep objects... contain internal evidence of having been the result of a long computation or slow-to-simulate dynamical process, and could not plausibly have originated otherwise." Or as Seth Lloyd says, "It is an appealing idea to identify the complexity of a thing with the amount of information processed in the most plausible method of its creation."

To define logical depth more precisely, Bennett equated the construction of an object with the computation of a string of 0s and 1s encoding that object. For our example, we could assign to each nucleotide letter a two-digit code: A = 00, C = 01, G = 10, and T = 11. Using this code, we could turn any sequence of A, C, G, and T into a string of 0s and 1s. The logical depth is then defined as the number of steps that it would take for a properly programmed Turing machine, starting from a blank tape, to construct the desired sequence as its output.
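A minimal sketch of just the encoding step (the function name is mine):

```python
# Encode a DNA sequence as a string of 0s and 1s using the two-digit
# code given above: A = 00, C = 01, G = 10, T = 11.
TWO_BIT_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode_dna(sequence: str) -> str:
    return "".join(TWO_BIT_CODE[base] for base in sequence)

print(encode_dna("ACGT"))      # -> 00011011
print(encode_dna("ACACACAC"))  # string 1's pattern becomes 0001 repeated
```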

Since, in general, there are different "properly programmed" Turing machines that could all produce the desired sequence in different amounts of time, Bennett had to specify which Turing machine should be used. He proposed that the shortest of these (i.e., the one with the least number of states and rules) should be chosen, in accordance with the above-mentioned Occam's Razor.

Logical depth has very nice theoretical properties that match our intuitions, but it does not give a practical way of measuring the complexity of any natural object of interest, since there is typically no practical way of finding the smallest Turing machine that could have generated a given object, not to mention determining how long that machine would take to generate it. And this doesn't even take into account the difficulty, in general, of describing a given object as a string of 0s and 1s.

Complexity as Thermodynamic Depth.

In the late 1980s, Seth Lloyd and Heinz Pagels proposed a new measure of complexity, thermodynamic depth. Lloyd and Pagels' intuition was similar to Bennett's: more complex objects are harder to construct. However, instead of measuring the number of steps of the Turing machine needed to construct the description of an object, thermodynamic depth starts by determining "the most plausible scientifically determined sequence of events that lead to the thing itself," and measures "the total amount of thermodynamic and informational resources required by the physical construction process."

For example, to determine the thermodynamic depth of the human genome, we might start with the genome of the very first creature that ever lived and list all the evolutionary genetic events (random mutations, recombinations, gene duplications, etc.) that led to modern humans. Presumably, since humans evolved billions of years later than amoebas, their thermodynamic depth is much greater.

Like logical depth, thermodynamic depth is appealing in theory, but in practice has some problems as a method for measuring complexity. First, there is the assumption that we can, in practice, list all the events that lead to the creation of a particular object. Second, as pointed out by some critics, it's not clear from Seth Lloyd and Heinz Pagels' definition just how to define "an event." Should a genetic mutation be considered a single event or a group of millions of events involving all the interactions between atoms and subatomic particles that cause the molecular-level event to occur? Should a genetic recombination between two ancestor organisms be considered a single event, or should we include all the microscale events that cause the two organisms to end up meeting, mating, and forming offspring? In more technical language, it's not clear how to "coarse-grain" the states of the system-that is, how to determine what are the relevant macrostates when listing events.

Complexity as Computational Capacity.

If complex systems-both natural and human-constructed-can perform computation, then we might want to measure their complexity in terms of the sophistication of what they can compute. The physicist Stephen Wolfram, for example, has proposed that systems are complex if their computational abilities are equivalent to those of a universal Turing machine. However, as Charles Bennett and others have argued, the ability to perform universal computation doesn't mean that a system by itself is complex; rather, we should measure the complexity of the behavior of the system coupled with its inputs. For example, a universal Turing machine alone isn't complex, but together with a machine code and input that produces a sophisticated computation, it creates complex behavior.

Statistical Complexity.

Physicists Jim Crutchfield and Karl Young defined a different quantity, called statistical complexity, which measures the minimum amount of information about the past behavior of a system that is needed to optimally predict the statistical behavior of the system in the future. (The physicist Peter Grassberger independently defined a closely related concept called effective measure complexity.) Statistical complexity is related to Shannon's entropy in that a system is thought of as a "message source" and its behavior is somehow quantified as discrete "messages." Here, predicting the statistical behavior consists of constructing a model of the system, based on observations of the messages the system produces, such that the model's behavior is statistically indistinguishable from the behavior of the system itself.

For example, a model of the message source of string 1 above could be very simple: "repeat A C"; thus its statistical complexity is low. However, in contrast to what could be done with entropy or algorithmic information content, a simple model could also be built of the message source that generates string 2: "choose at random from A, C, G, or T." The latter is possible because models of statistical complexity are permitted to include random choices. The quantitative value of statistical complexity is the information content of the simplest such model that predicts the system's behavior. Thus, like effective complexity, statistical complexity is low for both highly ordered and random systems, and is high for systems in between-those that we would intuitively consider to be complex.
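As a toy illustration of that point (a cartoon only, not the actual Crutchfield-Young construction), a model of each message source fits in a line or two of Python; the randomness in the second model is part of the model itself:

```python
import random

def source_for_string1(n: int) -> str:
    """Model of string 1's source: 'repeat A C'."""
    return "AC" * (n // 2)

def source_for_string2(n: int) -> str:
    """Model of string 2's source: 'choose at random from A, C, G, or T'."""
    return "".join(random.choice("ACGT") for _ in range(n))

print(source_for_string1(20))  # ACACACACACACACACACAC
print(source_for_string2(20))  # a fresh pseudo-random 20-letter string each run
```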

Like the other measures described above, it is typically not easy to measure statistical complexity if the system in question does not have a ready interpretation as a message source. However, Crutchfield, Young, and their colleagues have actually measured the statistical complexity of a number of real-world phenomena, such as the atomic structure of complicated crystals and the firing patterns of neurons.

Complexity as Fractal Dimension.