Chernoff Bound

Consider the sum of a sequence of independent random variables. One may expect its distribution is concentrated around its expected value. This is characterized by the Chernoff bound, which gives exponentially decreasing bounds on tail distributions of the sum.

Let \(X_i\), \(i=1,\ldots, n\), be independent Bernoulli variables, where
{\bf Pr}[X_i = 1] = 1 – {\bf Pr}[X_i = 0] = p_i.
Let \(X = \sum X_i\) and \(\mu = \sum p_i\). We wish to bound the probability that \(X\) is far from \(\mu\).

The moment generating function of \(X\) is defined as \(M_t(X) = {\bf E} [e^tX]\).
{\bf E}[e^{tX_i}] = p_i e^t + 1-p_i \leq e^{p_i(e^t-1)},
by the fact that \(1 + x\leq e^x\).
We have the following
{\bf E}[e^{tX}] = {\bf E}[\prod e^{tX_i}] = \prod {\bf E}[e^{tX_i}]
\leq \prod e^{p_i(e^t-1)} = e^{\mu(e^t-1)},
by the independency of \(X_i\)‘s.

{\bf Pr}[ X\geq (1+\delta) \mu ] = {\bf Pr}[ e^{tX}\geq e^{t(1+\delta) \mu} ]
\leq \frac{ {\bf E}[e^{tX}] }{ e^{t(1+\delta) \mu} }
\leq \frac{ e^{\mu(e^t-1)} }{ e^{t(1+\delta) \mu} }
\leq \left(\frac{ e^\delta }{ (1+\delta)^{(1+\delta)} }\right)^{\mu},
choosing \(e^t = 1+\delta\).

We can also get a similar bound on the deviation below the mean.
{\bf Pr}[ X\leq (1-\delta) \mu ]
\leq \left(\frac{ e^{-\delta} }{ (1-\delta)^{(1-\delta)} }\right)^{\mu}.

Tagged with:
Posted in Probability

Store compressed data in database using PHP

When storing big text contents in a database, it is better to store after compressing the data. The following is an example storing HTML contents in a database cache.

We first create a table mapping URLs to HTML contents. The field for HTML contents is BLOB since we will store compressed data.

CREATE TABLE mycache (
    url varchar(128) PRIMARY KEY,
    html blob default NULL,
    updated_time timestamp NOT NULL default CURRENT_TIMESTAMP

In PHP, use gzcompress and gzuncompress to compress and extract data. The following code compresses the text before storing into the database.

$url = "";
$html = @file_get_contents($url);

$query = 
    sprintf("replace into mycache(url, html) values('%s', '%s') ",
        mysql_escape_string(gzcompress($html)) );

The following code retrieves the compressed data and extract the text.

$url = "";
$query = sprintf("select * from mycache where url = '%s' ",
        mysql_escape_string($url) );
$result = mysql_query($query);
if ($result && $row = mysql_fetch_assoc($result))
    $html = gzuncompress($row['html']);
Tagged with: , ,
Posted in PHP, SQL

Format numbers with leading and trailing zeros in PHP

PHP provides str_pad() to add a string to a certain length with another string. Another way is to use sprintf() with formatting patterns.

Pad leading zeros to get exactly 3 digits:

$value = 12;
sprintf('%03d', $value);                    // 012
str_pad($value, 3, '0', STR_PAD_LEFT);      // 012
Posted in PHP

Sperner Theorem on Maximal Antichains

Let \(S\) be a family of subsets of \(\{1,\ldots,n\}\) such that no set in \(S\) contains another. This is often called an antichain.

Theorem. [Sperner’s Theorem]
|S| \leq {n \choose \lfloor n/2 \rfloor}.

The proof of Sperner’s Theorem follows from the following lemma, which is known as LYM (Lubell–Yamamoto–Meshalkin) inequality.

Lemma. [LYM Inequality]
\sum_{A\in S} \frac{1}{n \choose |A|} \leq 1.

Proof. The proof follows from a double counting argument, that is, counting the number of permutations of \(\{1,\ldots,n\}\) in two ways. A direct counting gives \(n!\).

Another way is to generate permutations according to subsets in \(S\). For each \(A\in S\), we first give a permutation of the elements in \(A\), and then concatenate a permutation of elements which are not in \(A\). Each \(A\) gives \(|A|!(n-|A|)!\) permutations. By the antichain property, permutations generated from different subsets are different. Therefore, the lemma follows directly from the following:
\sum_{A\in S} |A|!(n-|A|)! \leq n!.

Sperner’s Theorem follows from LYM Inequality by the fact that \({n \choose |A|} \leq {n \choose \lfloor n/2 \rfloor}\) for all \(A\).

One can also thinks of a probabilistic argument for the LYM Inequality. Suppose we get a uniformly random permutation. For each \(A\in S\), consider the event that \(A\) is exactly the first \(|A|\) elements in the random permutation. This event happens with probability \(1 / {n\choose |A|}\). For different subsets in \(S\), such events are disjoint. This gives the LYM Inequality.

Tagged with: , ,
Posted in Theory