Kolmogorov Complexity

Kolmogorov complexity measures the complexity of describing an object, such as a binary string. The description is considered as a program which can be run on a computer. By Church’s thesis, all non-trivial computational models are equivalent; thus, one can consider any universal computer (which can simulate other computers).

Let \(x\) be a binary string and \(U\) be a universal computer. Denote by \(l(x)\) the length of \(x\). The Kolmogorov complexity of \(x\) with respect to \(U\) is defined as
K(x; U) = \min_{p\colon U(p) = x} l(p).
This is the length of the shortest program which prints \(x\) and halts.

The Kolmogorov complexity of a string with respect to two universal computers differs only by a constant which is independent of the string. This is by the universality of computation; that is, a universal computer can interpret instructions of another computer by a constant-size simulation program. Thus, we will simply write \(K(x)\) for the Kolmogorov complexity of \(x\).

We define the conditional Kolmogorov complexity of \(x\) given the length of \(x\) as \(K(x\mid l(x))\). The following are simple facts.

  • \(K(x\mid l(x)) \leq l(x) + O(1)\).
  • \(K(x) \leq K(x\mid l(x)) + 2 \log l(x) + O(1)\).

The second statement follows from the first by encoding the length into the description.

Consider a binary string of length \(n\) with \(k\) ones. One can compress the string into almost \(n H(k/n)\) bits, where \(H(p) = -p \log p -(1-p)\log(1-p)\). The final description could be the compressed string plus a constant-size decompression program. The total length is upper bounded by \(n H(k/n) + O(\log n)\). Another way of description is using the following program: generate strings with \(k\) ones lexicographically, and output the i-th string.

The total number of strings with complexity \(K(x) < k\) is less than \(2^k\). Following this, the fraction of binary strings of length \(n\) with Kolmogorov complexity less than \(n-k\) is less than \(2^{-k}\).

For any computer \(U\), \(\sum_{p\colon U(p)~halts} 2^{-l(p)} \leq 1\), by Kraft inequality.

Kolmogorov complexity and entropy

Consider a sequence of i.i.d. binary variables \(X^n =\{X_i\}\). Let \(p(x^n) = \prod_{i=1}^n p(x_i)\) be the probability distribution. Consider \(E[ K(X^n \mid n) ] = \sum_{x^n} K(x^n \mid n)\). The following holds.
n H(X) \leq E[ K(X^n \mid n) ] \leq n H(X) + O(\log n).

The lower bound is from the theory of source coding, where the expected codeword length is greater than the entropy. The upper bound is by the compressed encoding of each string, and also by Jensen’s inequality that
E[ H(\sum_{i=1}^n X_i/n) ] \leq H(E [\sum_{i=1}^n X_i/n] ) = H (E[X]).

The above result holds also with high probability.
K(X^n \mid n) / n \to H(X) \hbox{ in probability }.

A infinite string \(x\) is incompressible if
\lim_{n\to \infty} \frac{K (x^n\mid n)}{ n } = 1.
If a string is incompressible, then \(\sum x_i /n \to 1/2\). The upper bound of Kolmogrov complexity gives a lower bound for then entropy \(H(k/n)\).

Kolmogorov complexity is uncomputable; this follows from that the halting problem is uncomputable.


Kolmogorov complexity is different from computational complexity in many senses. In general, a program computes a function from inputs to outputs, where the input ranges from strings of arbitrary length, and the function is an “infinite” object. In computational complexity, we wish to find the most efficient (which must be finite to be interesting) program computing the function/problem. In Kolmogorov complexity, one is given a string, and wishes to find the shortest program generating the string. The program does not take any input; it will only produce an output.

Tagged with: ,
Posted in Probability, Theory

Unix Commands for Large Files

This tutorial is about how to handle large files using Linux/Unix commands, such as viewing the first few lines, counting the number of lines or bytes, splitting the file, etc.

wc — word, line, character, and byte count

The wc utility displays the number of lines, words, and bytes in the file. It has the following options.

  • -c: The number of bytes.
  • -l: The number of lines.
  • -w: The number of words.

The following are some examples:

  • wc -c myfile.txt: counts the number of bytes.
  • wc -l myfile.csv: counts the number of lines.

head — display first lines of a file

The head utility displays the first few lines or bytes of a file. A similar command tail displays the last lines. It has the following options.

  • -n count: display the first count lines.
  • -c count: display the first count bytes.

The following are some examples:

  • head -n 10 myfile.csv: displays the first 10 lines.

split — split a file into pieces

The split utility breaks the given file up into files of 1000 lines each. It has the following options.

  • -l count: split into files of count lines each.
  • -b count[k|m]: split into files of count [kilo|mega] bytes.

By default, the file is split into lexically ordered files named with the prefix “x”. The following are some examples:

  • split -l 10000 myfile.csv: splits the file into 10000 lines each, with names ‘xaa’, ‘xab’, etc
  • split -l 100 myfile.csv newfile: splits the file into 100 lines each, with names ‘newfileaa’, ‘newfileab’, etc
Tagged with: ,
Posted in BigData

Get or Remove the Last Char in Strings in PHP

PHP function substr() allows manipulating a string for the rear. The function is defined as below.

string substr ( string $string , int $start [, int $length ] )

It returns the portion of string specified by the start and length parameters.

  • If start is negative, the starting position is counted from the end of the string.
  • If length is given and positive, the returned string starts from the starting position but with at most the given length of characters.
  • If length is given and negative, the returned string starts from the starting position, but with the given length of characters omitted from the end.

This allows us to return the last character or remove the last character.

substr("theory", -1);       // returns "y"
substr("theory", 0, -1);    // returns "theor"

substr("theory", -2);       // returns "ry"
substr("theory", 0, -2);       // returns "theo"

substr("theory", 0, 2);       // returns "th"
substr("theory", 2);       // returns "eory"
Tagged with:
Posted in PHP

Decision Tree Complexity

Decision trees are a simple computation model. One can compute a boolean function using a decision tree by examining the input bits in specific orders. The complexity is characterized by the number of input bits examined, or simply, the depth of the decision tree.

Decision Trees

A decision tree for a boolean function \(f\colon\{0,1\}^n \to \{0,1\}\) is a labeled tree where

  • each internal node is labeled with a variable \(x_i\), and the two outgoing edges are labeled with the two assignments \(x_i=0\) and \(x_i=1\),
  • each leaf is labeled with an output 0 or 1.

Given an input, we examine its bits by following a path from the root of the tree to a leaf, which gives the output.

The depth of tree is the longest path from the root to a leaf. It is the maximum number of bits examined on an input. The decision tree complexity of a boolean function is the smallest depth of any decision tree computing the function.

Let \(dep(T)\) be the depth of the tree \(T\), and \(dep(T, x)\) be the length of the path examining the bits of an input \(x\). Then \(dep(T) = \max_{x} dep(T,x)\). The decision tree complexity of \(f\) is
D(f) = \min_{T} dep(T) = \min_{T} \max_{x} dep(T,x),
where \(T\) ranges over all decisions tree computing \(f\).

For example, the OR function on \(n\) inputs requires a decision tree of depth \(n\); otherwise, we can construct two inputs following the same path but may have different outputs, which is a contradiction.

As another example, the address function on \(n = k + 2^k\) inputs is defined as follows: use the value of the first \(k\) bits as an index, and output the bit within the rest \(2^k\) bits at the indexed position. This function has a decision tree of depth \(k+1 \leq \log n +1\).

Randomized Decision Trees

A randomized decision tree is a distribution over a set of (deterministic) decision trees. We assume all decision trees in the set compute the function correctly on every input. The hope is that a randomized decision tree may examine fewer bits in expectation.

The randomized decision tree complexity for \(f\) is defined as
R(f) = \min_{\cal T} max_x {\bf E}_{T\in {\cal T}} dep(T, x),
where \(\cal T\) ranges over all possible randomized decision trees (all possible distributions over trees), and \(T\) is a randomly chosen decision tree according to the distribution \(\cal T\).

Another way of defining a randomized decision tree is to examine a randomly chosen input bit at each internal node of the tree.

For example, the majority function on 3 inputs can be computed by a decision tree of depth 3. By randomly permuting the inputs, we may construct 6 different decision trees. If we uniformly choose one of such decision trees, then for an input, say 001, we will examine \(2\cdot 2/6 + 3 \cdot 4/6 = 8/3\) bits on average. Since this is the worst input, we have \(R(f) \leq 8/3\).

Let \(\cal A\) be a set of decision trees. Let \(\cal T\) be a distribution over \(\cal A\), that is a randomized decision tree. Let \(\cal D\) be a distribution over the inputs \(\{0,1\}^n\). Then Yao’s Min-Max Lemma says that
\min_{\cal T} max_{x} {\bf E}_{T\sim {\cal T}} dep(T, x) =
\max_{\cal D} min_{T\in {\cal A}} {\bf E}_{x\sim {\cal D}} dep(T, x) .

On the LHS, \(\cal T\) ranges over all distributions over decision trees. On the RHS, \(\cal D\) ranges over all distributions on the inputs.

For the example of the majority function on 3 inputs, we already have the LHS is at most 8/3, and now we show the RHS is at least 8/3, which means \(R(f) = 8/3\). We choose \({\cal D}\) to be a uniform distribution over all inputs except 000 and 111. Then, for any decision tree, the average number of bits examined is 8/3, and this is true for any distribution over decision trees.

Tagged with: ,
Posted in Theory