%% ToC#247  v002/a008.tex   Jackson - Servedio

\documentclass{toc}

\tocdetails{volume=2, number=8, year=2006, firstpage=147,
      received={March 21, 2006}, published={September 19, 2006} }

\bibliographystyle{tocplain}

\newcommand{\la}{\langle}
\newcommand{\ra}{\rangle}
\newcommand{\eps}{\epsilon}
\newcommand{\bxxx}{{\bf XXXXXXX}}
\newcommand{\D}{{\cal D}}
\newcommand{\E}{{\bf E}}
\newcommand{\Mn}{{\cal M}_{n}}
\newcommand{\M}{{\cal M}}
\newcommand{\Mtk}{{\cal M}^{t,k}_{n}}
\newcommand{\Dn}{{\cal D}_{n}}
\newcommand{\Dtk}{{\cal D}^{t,k}_{n}}
\newcommand{\LRD}{{\bf LearnRandomDNF~}}
\newcommand{\ignore}[1]{}
\newcommand{\ol}[1]{\overline{#1}}
\newcommand{\qqed}[1]{\hfill{#1}\rule{7pt}{7pt}}
\newcommand{\dmany}{\delta_{\mbox{\small \rm many}}}
\newcommand{\dsmall}{\delta_{\mbox{\small \rm small}}}
\newcommand{\dshared}{\delta_{\mbox{\small \rm shared}}}
\newcommand{\dusat}{\delta_{\mbox{\small \rm usat}}}
\newcommand{\dpusat}{\delta'_{\mbox{\small \rm usat}}}
\newcommand{\dcooccur}{\delta_{\mbox{\small \rm co-occur}}}
\newcommand{\dclique}{\delta_{\mbox{\small \rm clique}}}
\newcommand{\dC}{\delta_C}
\newcommand{\poly}{\text{poly}}

\newcounter{algoctr}

\newenvironment{algo}{\begin{list}{\arabic{algoctr}.}{\usecounter{algoctr}\setlength{\parsep}{0pt} \setlength{\itemsep}{0pt}}}{\end{list}}

\newcommand{\algitem}[1]{\item \hspace*{#1 em}}

\runningauthor{J.~Jackson and R.~Servedio}
%%\runningtitle{On Learning Random DNF Formulas Under the Uniform
\runningtitle{Learning Random DNF Formulas}

\copyrightauthor{Jeffrey C. Jackson and Rocco A. Servedio}

\begin{document}
\begin{frontmatter}[classification=text]

\title{On Learning Random DNF Formulas Under the Uniform Distribution}

\tocpdftitle{On Learning Random DNF Formulas Under the Uniform Distribution}
\tocpdfauthor{Jeffrey C. Jackson and Rocco A. Servedio}

\author[jackson]{Jeffrey C. Jackson}
\author[servedio]{Rocco A. Servedio}

\begin{abstract}
  We study the average-case learnability of DNF formulas in the model
  of learning from uniformly distributed random examples.  We define a
  natural model of random monotone DNF formulas and give an efficient
  algorithm which with high probability can learn, for any fixed
  constant $\gamma>0$, a random $t$-term monotone DNF for any $t =
  O(n^{2-\gamma})$. We also define a model of random non-monotone DNF
  and give an efficient algorithm which with high probability can
  learn a random $t$-term DNF for any $t=O(n^{3/2 - \gamma})$. These
  are the first known algorithms that can learn a broad
  class of polynomial-size DNF in a reasonable average-case model of
  learning from random examples.
\end{abstract}

\tockeywords{computational learning theory, uniform-distribution learning, PAC learning, DNF formulas, monotone DNF}

\tocacm{I.2.6, F.2.2, G.1.2, G.3}

\tocams{68Q32, 68W20, 68W25, 60C05}

\end{frontmatter}


\section{Introduction}

\subsection{Motivation and background}

A \emph{disjunctive normal form} formula, or DNF, is an AND of ORs of
Boolean literals.  A question that has been open since Valiant's
initial paper on computational learning theory \cite{Valiant:84} is
whether or not efficient algorithms exist for learning polynomial
size DNF formulas in variants of the PAC (Probably Approximately
Correct) learning model introduced by Valiant.  Roughly speaking, in
these models a learning algorithm is required to generate a
high-accuracy (error rate at most $\epsilon$) hypothesis with high
probability (the algorithm must fail to generate such a hypothesis
with probability at most $\delta$); we give a more detailed
explanation of our learning scenario in~\secref{sec:prelim}.
The only positive result for learning general DNF formulas in such
frameworks to date is the Harmonic Sieve~\cite{Jackson:97}. The
Sieve is a membership-query algorithm (i.e.\ it requires black-box
query access to the unknown function $f$ that is being learned) that
efficiently PAC learns DNF when the error rate is defined with
respect to the uniform distribution over the space of all possible
$n$-bit example strings (and certain related distributions). The
approximating hypothesis produced by the Sieve is not itself
represented as a DNF; thus, the Sieve is an \emph{improper} learning
algorithm.


There has been little progress on polynomial-time algorithms for
learning arbitrary DNF since the discovery of the Sieve.  There are
two obvious relaxations of the uniform distribution membership query
model that can be pursued.  The first is to learn with respect to
arbitrary distributions using membership queries; in this setting,
the learning algorithm is given black-box (membership query) access
to the unknown function $f$, and also access to a source of random
labeled examples $(x,f(x))$ where each example $x$ is independently
drawn from a fixed probability distribution which is arbitrary and
not known to the learning algorithm.  The learner must generate a
high-accuracy hypothesis with respect to this unknown distribution.
Given standard cryptographic assumptions, it is known that learning
DNF in this framework is essentially as difficult as learning DNF
with respect to arbitrary distributions without membership queries~\cite{angkha95}.

The second obvious relaxation is to learn with respect to the
uniform distribution without membership queries. However, there are
substantial known obstacles to learning DNF in the model of uniform
distribution without membership queries.  In particular, no
algorithm which can be recast as a Statistical Query algorithm can
learn arbitrary polynomial-size DNF under the uniform distribution
in $n^{o(\log n)}$ time~\cite{BFJ+:94}.  (Roughly speaking, a
Statistical Query algorithm is an algorithm which is only allowed to
obtain statistical estimates of properties of the distribution over
labeled example pairs $(x,f(x))$; such an algorithm does not have
access to actual labeled examples $(x,f(x))$. See~\cite{Kearns:98}
for a detailed description of the Statistical Query model.)  Since
nearly all non-membership learning algorithms can be recast as
Statistical Query algorithms~\cite{Kearns:98}, a major conceptual
shift seems necessary to obtain an algorithm for efficiently
learning arbitrary DNF formulas from uniform examples alone.

An apparently simpler question is whether \emph{monotone} DNF
formulas, which contain only un-negated variables, can be learned
efficiently. Angluin showed that monotone DNF can be properly
learned with respect to arbitrary distributions using membership
queries~\cite{Angluin:88}. It has also long been known that with
respect to arbitrary distributions without membership queries,
monotone DNF are no easier to learn than arbitrary DNF~\cite{KLP+:87a}.
  This leaves the following enticing question (posed
in \cite{JacksonTamon:97,BBL:98,Blum:03tutorial}): are monotone DNF
efficiently learnable from uniform examples alone?

In 1990, Verbeurgt~\cite{Verbeurgt:90} gave an algorithm that can
properly learn any $\poly(n)$-size (arbitrary) DNF from uniform
examples in time $n^{O(\log n)}$.  More recently, the algorithm of
\cite{Servedio:01mdnf} learns any $2^{\sqrt{\log n}}$-term monotone
DNF in poly$(n)$ time. However, despite significant interest in the
problem, no algorithm faster than that of \cite{Verbeurgt:90} is
known for learning arbitrary poly$(n)$-size monotone DNF from
uniform examples, and no known hardness result precludes such an
algorithm (the Statistical Query result of~\cite{BFJ+:94} is at its
heart a hardness result for low-degree parity functions, and thus
does not apply to monotone DNF).

Since worst-case
versions of several DNF learning problems have remained stubbornly
open for a decade or more, it is natural to ask about DNF learning
from an average-case perspective, i.e., about learning \emph{random}
DNF formulas.  In fact, this question has been considered before:
Aizenstein and Pitt~\cite{AizensteinPitt:95} were the first to ask
whether random DNF formulas are efficiently learnable.  They
proposed a model of random DNF in which each of the $t$ terms is
selected independently at random from all possible terms, and gave a
membership and equivalence query algorithm which with high
probability learns a random DNF generated in this way. (See~\cite{AizensteinPitt:95} or~\cite{Angluin:88} for a description of
the membership and equivalence query learning framework.)
 However, as noted in~\cite{AizensteinPitt:95}, a limitation of this model is that with
very high probability all terms will have length $\Omega(n)$.  The
learning algorithm itself becomes quite simple in this situation.
Thus, while this is a ``natural'' average-case DNF model, from a
learning perspective it is not a particularly interesting model. To
address this deficiency, they also proposed another natural
average-case model which is parameterized by the expected length $k$
of each term as well as the number of independent terms $t$,
but left open the question of
whether or not random DNF can be efficiently
learned in such a model.

\subsection{Our results}
We consider an average-case DNF model very similar to the latter
Aizenstein and Pitt model, although we simplify slightly by assuming
that $k$ represents a fixed term length rather than an expected
length.  We show that, in the model of learning from uniform random
examples only, random monotone DNF are properly and
efficiently learnable for many interesting values of $k$ and $t$.  In
particular, for $t = O(n^{2 - \gamma})$ where $\gamma > 0$, and for
$k=\log t$, our algorithm can achieve any error rate $\eps>0$ in
$\poly(n, 1/\eps)$ time with high probability (over both the selection
of the target DNF and the selection of examples).  In addition, we
obtain slightly weaker results for arbitrary DNF: our algorithm can
properly and efficiently
learn random $t$-term DNF for $t$ such that
$t = O(n^{{\frac 3 2} - \gamma})$. This
algorithm cannot achieve arbitrarily small error
but can achieve error $\eps = o(1)$ for any $t=\omega(1)$.
For detailed result statements see Theorems~\ref{thm:mainmon} and~\ref{thm:mainnonmon}.

While our results would clearly be stronger if they held for any
$t= \poly(n)$ rather than the specific
polynomials given, they are a marked advance over the previous state
of affairs for DNF learning.  (Recall that in the standard
worst-case model, $\poly(n)$-time uniform-distribution learning of
$t(n)$-term DNF for any $t(n) = \omega(1)$ is an open problem
with an associated cash prize~\cite{Blum:03problem}.)

At this point a word or two is in order to clarify the relationship
between the random DNF model we consider and the models of random
CNF formulas that are often studied in the context of the Boolean
satisfiability problem.  In the study of random $k$-CNF formulas,
$k$ is often taken to be a fixed constant such as 3.  In contrast
with the satisfiability problem, in the learning arena taking $k$ to
be a fixed constant such as 3 is not an interesting choice, since it
is well known that $k$-CNF (or equivalently, DNF formulas in which
every term is of length at most $k$) can be easily learned with
respect to any distribution in time $n^{O(k)}$~\cite{Valiant:84}.
Intuitively, the
``interesting'' values of $k$ are different for the satisfiability
problem and the learning problem because in the satisfiability
problem the interesting cases occur when there are only a small
number of satisfying assignments, whereas in the learning framework
the interesting cases occur when the target DNFs are roughly
balanced between satisfying and unsatisfying assignments.  (From a
learning perspective balanced functions are generally more
interesting than unbalanced functions, since a constant function is
trivially a good approximator to a highly unbalanced function.)
Thus, for the learning problem, taking $k=\log t$ is a natural
choice when learning with respect to the uniform distribution. (We
actually allow a somewhat more general choice of $k$, as is
described in detail in the paper.)

Our results shed some light on which cases are \emph{not} hard to
learn in the worst-case uniform distribution model. While ``hard''
cases were previously known for arbitrary DNF~\cite{Blum:03problem},
our findings may be particularly helpful in guiding future research
on monotone DNF.  In particular, our algorithm learns any monotone
$t=O(n^{2 - \gamma})$-term DNF which (i) is near-balanced, (ii) has
every term uniquely satisfied with reasonably high probability,
(iii) has every pair of terms jointly satisfied with much smaller
probability, and (iv) has no variable appearing in significantly
more than a $1/\sqrt{t}$ fraction of the $t$ terms (this is made
precise in \lemref{lemma:mn}).  So in order to be ``hard,'' a
monotone DNF must violate one or more of these criteria.

Our algorithms work in two stages: they first identify pairs of
variables which co-occur in some term of the target DNF, and then use
these pairs to reconstruct terms via a specialized clique-finding
algorithm.  (This is why our results do not extend to random DNF
with more than $n^{2-\gamma}$ terms; for such formulas the variable
co-occurrence graph is with high probability dense or even complete,
so we cannot reconstruct terms from co-occurrence information.)
For monotone DNF we can with high probability determine
for every pair of variables whether or not the pair co-occurs in some
term.  For non-monotone DNF, with high probability we can identify most
pairs of variables which co-occur in some term; as we show, this
enables us to learn to fairly (but not arbitrarily) high accuracy.

We give preliminaries in \secref{sec:prelim}.  Sections~\ref{sec:monotone} and~\ref{sec:nonmonotone} contain our results for monotone and non-monotone DNF respectively.  \secref{sec:discuss}
concludes.

A preliminary version of this work appeared in the proceedings of
RANDOM 2005~\cite{JacksonServedio:05random}.  The current version of the
paper gives a more thorough exposition and includes many proofs that were 
omitted from the conference version due to space limitations.

\section{Preliminaries}
\label{sec:prelim}

We first describe our models of random monotone and non-monotone DNF.
Let $\Mtk$ be the probability distribution over monotone $t$-term DNF
induced by the following random process:
each term is independently and uniformly chosen at random
from all $\binom{n}{k}$ monotone ANDs of size exactly $k$ over
variables $v_1,\dots,v_n$.
For non-monotone DNF, we write $\Dtk$ to
denote the following distribution over $t$-term DNF:
first a monotone DNF is selected from $\Mtk$, and then each occurrence of each
variable in each term is independently negated with probability $1/2$.
(Equivalently, a draw from $\Dtk$ is done by independently selecting
$t$ terms from the set of all terms of length exactly $k$.)

Given a Boolean function $\phi:\{0,1\}^n \rightarrow \{0,1\}$, we
write $\Pr[\phi]$ to denote $\Pr_{x \sim U_n}[\phi(x) = 1]$, where
$U_n$ denotes the uniform distribution over $\{0,1\}^n$. We write
$\log$ to denote $\log_2$ and $\ln$ to denote natural log.


\subsection{Tail bounds}
\label{ap:bounds}
We use the following:
\medskip

\noindent \textbf{Chernoff bound} (see~\cite[Theorem A.12]{AlonSpencer}): Let $B(t,p)$ denote
the binomial distribution with parameter $p,$ i.e.\ a draw from
$B(t,p)$ is a sum of $t$ independent $p$-biased 0/1 Bernoulli
trials.  Then for $\beta > 1,$
$$ \Pr_{S \sim B(t,p)}[ S \geq \beta p t ] \leq
  \left( e^{\beta-1} \beta^{-\beta} \right)^{pt}
  <  \left( e/\beta \right)^{\beta pt}\enspace.
$$
The following bound will also be useful:

\medskip

\noindent \textbf{McDiarmid bound}~\cite{McDiarmid:89}:  {\em Let $X_1,
\dots X_m$ be independent random variables taking values in a set
$\Omega$. Let $F \colon \Omega^m \to \R$ be such that for all $i \in
[m]$ we have
\[
|F(x_1, \dots, x_m) - F(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots,
x_m)| \leq c_i
\]
for all $x_1, \dots, x_m$ and $x_i'$ in $\Omega$.  Let $\mu =
\E[F(X_1,\dots, X_m)]$.  Then for all $\tau> 0$,
\[
\Pr\left[\left|F(X_1, \dots, X_m) - \mu\right| > \tau\right] <
\exp(-{\tau^2}/(c_1^2 + \cdots + c_m^2))\enspace.
\]}




\subsection{The learning model}

In the uniform distribution learning model which we consider, the
learner is given a source of labeled examples $(x,f(x))$ where each
$x$ is uniformly drawn from $\{0,1\}^n$ and $f$ is the unknown
function to be learned. The goal of the learner is to efficiently
construct a hypothesis $h$ which with high probability (over the
choice of labeled examples used for learning) has low error relative
to $f$ under the uniform distribution, i.e. $\Pr_{x \sim U_n}[h(x)
\neq f(x)] \leq \epsilon$ with probability $1 - \delta$. This model
has been intensively studied in learning theory, see e.g.
\cite{Hancock:93,GHM:94,JKS:02,KOS:02,KMP:94,Servedio:01mdnf,Verbeurgt:90}.
In our average case framework, the target function $f$ will be drawn
randomly from either $\Mtk$ or $\Dtk$, and (as in~\cite{JacksonServedio:03}) our goal is to construct a low-error
hypothesis $h$ for $f$ with high probability over both the random
examples used for learning and the random draw of $f$.


\section{Learning random monotone DNF} 
\label{sec:monotone}

\subsection{Interesting parameter settings}

Consider a random draw of $f$ from $\Mtk$.  It is intuitively
clear that if $t$ is too large relative to $k$ then a random $f \in \Mtk$
will likely have $\Pr[f] \approx 1$; similarly if $t$ is too small
relative to $k$ then a random $f \in \Mtk$ will likely have $\Pr[f]
\approx 0$.
Such cases are not very interesting from a learning perspective
since a trivial algorithm can learn to high accuracy.
We are thus led to the following definition:

\begin{definition}
A pair of values $(k,t)$ is said to be \emph{monotone $\alpha$-interesting} if
\[
\alpha \leq \E_{f \in \Mtk}[\Pr[f]] \leq 1 - \alpha\enspace.
\]
\end{definition}

Throughout the paper we will assume that $0 < \alpha \leq .09$
is a fixed constant independent of $n$ and that $t \leq p(n)$, where
$p(\cdot)$ is a fixed polynomial (and we will also make assumptions
about the degree of $p$). The following lemma gives necessary
conditions for $(k,t)$ to be monotone $\alpha$-interesting. (As
\lemref{lem:monint} indicates, we may always think of $k$ as
being roughly $\log t$.)

\begin{lemma} \label{lem:monint}
For $n$ sufficiently large, if $(k,t)$ is monotone
$\alpha$-interesting then $\alpha 2^k \leq t \leq 2^{k+1} \ln {\frac
2 \alpha}$.
\end{lemma}

\begin{proof}
One side is easy: if $t < \alpha2^k$ then each of the $t$ terms
of $f$ is satisfied by a uniform random example with probability at
most $\alpha/t$, and consequently $\Pr[f(x) = 1] \leq \alpha$.  Note
that by our assumptions on $t$ and $\alpha$ we thus have that $k =
O(\log n)$ for any monotone $\alpha$-interesting pair $(k,t)$.

We now show that if $t > 2^{k+1} \log {\frac 2 \alpha}$, then
\[
\E_{f \in \Mtk}[\Pr[f]] > 1 - \alpha\enspace.
\]
Let us write $|x|$ to denote $x_1 +
\cdots + x_n$ for $x \in \{0,1\}^n$. It is easy to see that
$\Pr[f(x)=1]$, viewed as a random variable over the choice of $f \in
\Mtk$, depends only on the value of $|x|$. We have
$$
\E_{f \in \Mtk}[\Pr[f]] = \sum_{r=0}^n
\E_{f \in \Mtk}[\Pr[f(x) = 1 \mid |x| = r] \cdot \Pr[|x| = r]\enspace.
$$
A standard tail bound on the binomial distribution (which can be
derived, e.g., from the results in~\cite{McDiarmid:89})
implies that
$$
\Pr_{x \in U_n}\left[|x| \leq n/2 - \sqrt{n \log(2/\alpha)}\right] <
\alpha/2\enspace.
$$
Thus it suffices to show that for any $x$ with $|x| \geq n/2 -
\sqrt{n \log (2 / \alpha)}$, we have 
\[
\Pr_{f \in \Mtk}[f(x)=1] \geq
1 - \alpha/2\enspace.
\]

Fix an $x \in \{0,1\}^n$ with $|x| = w \geq n/2 - \sqrt{n \log
(2/\alpha)}$. Let $T_1$ be a random monotone term of length $k$. We
have
$$
\Pr_{T_1}[T_1(x) =1] = {\frac {w(w-1)\cdots(w-k+1)}{n(n-1)\cdots(n-k+1)}}
\geq {\frac 1 {2^{k+1}}}
$$
where the inequality holds for sufficiently large $n$ using the fact
that $k=O(\log n)$ and $\alpha = \Theta(1)$.
 Since the terms of $f$
are chosen independently, this implies that
$$
\Pr_f[f(x)=0] \leq \left(1 - {\frac 1 {2^{k+1}}}\right)^t \leq
\exp\left({\frac {-t}{2^{k+1}}}\right)\enspace.
$$
If $t/2^{k+1}> \ln {\frac 2 \alpha}$ then this bound is at most
$\alpha/2$.
\end{proof}

DNF expressions with either a constant number of terms or a constant
number of variables per term have long been known to be efficiently
learnable~\cite{Valiant:84} (this holds for non-monotone as well as
monotone DNF, and in fact holds for learning with respect to
arbitrary distributions, not only uniform).
  So we will assume
throughout that both $t$ and $k$ are $\omega(1)$; many of our
probability bounds are only meaningful given such an assumption, and
some of our lemmas explicitly depend on $t$ and/or $k$ being larger
than a certain (small) constant.  While this assumption is
sufficient for our purposes, we note briefly that in fact a stronger
assumption can be made concerning $t$.  If $t$ grows very slowly
relative to $n$, say, $t = O(n^{1/4})$, then with high probability a
random $f$ drawn from $\Mtk$ will have the property that every
variable in $f$ appears in exactly one term.  Such a read-once DNF,
even if it is non-monotone, is learnable with respect to the uniform
distribution~\cite{KLP+:87b}.  Thus, we can actually think of $t$ as
growing reasonably quickly with $n$.

\subsection{Properties of random monotone DNF}
\label{sec:propertiesmonotone}

Throughout the rest of \secref{sec:monotone} we assume that
$\alpha>0$ is fixed and $(k,t)$ is a monotone $\alpha$-interesting
pair where $t = O(n^{2-\gamma})$ for some $\gamma > 0$. In this
section we develop some useful lemmas regarding $\Mtk$.  

We first prove the following lemma, which will be useful
in subsequent proofs.  This lemma does not require that $f$ be
drawn from $\Mtk$.

\begin{lemma} \label{lem:pfalse}
  Any monotone DNF $f$ with $t \geq 2$ terms each of size $k$
 has $\Pr[\overline{f}] \geq \alpha^3/4$.
\end{lemma}
\begin{proof}
We write
$T_1,T_2,\dots,T_t$ to denote the terms of $f$.
We have
\begin{align}
  \Pr[\overline{f}] &= \Pr[\overline{T_1} \wedge \overline{T_2} \wedge \cdots
                      \wedge \overline{T_t}]
   =  \Pr[\overline{T_1} \mid \overline{T_2} \wedge \cdots \overline {T_t}]
        \Pr[\overline{T_2} \mid \overline{T_3} \wedge \cdots \overline {T_t}]
        \cdots \Pr[\overline{T_{t-1}} \mid \overline{T_t}] \Pr[\overline{T_t}] \nonumber  \\
  & \geq \prod_{i=1}^t \Pr[\overline{T_i}] \label{eq:asterix}\\
  &= \left(1-\frac{1}{2^k}\right)^t
  \geq \left(1-{\frac{1}{2^k}}\right)^{2^{k+1} \ln (2/\alpha)} \label{eq:interesting}
  \geq \left({\frac 1 4}\right)^{2 \ln {\frac 2 \alpha}} =
\left({\frac \alpha 2}\right)^{2 \ln 4} \geq \frac{\alpha^3}{4}\enspace.
\nonumber
\end{align}
The first inequality (\ref{eq:asterix}) holds since
$\Pr[f(x)=1 \mid g(x)=1] \geq \Pr[f(x)=1]$ for any monotone Boolean functions
$f,g$ on $\{0,1\}^n$ (see e.g. Corollary 7, p. 149 of
\cite{Bollobas:86}).  The second inequality holds by \lemref{lem:monint}.
The third inequality holds since $(1 - 1/x)^x \geq 1/4$
for all $x \geq 2$, and the fourth follows from the restriction
$\alpha \leq .09$.
\end{proof}



Let $f^i$ denote the projected function obtained from $f$
by first removing term $T_i$ from the monotone
DNF for $f$ and then restricting all of the variables which were present
in term $T_i$ to $1$.
For $\ell \neq i$ we write $T^i_\ell$ to denote the term obtained by setting
all variables in $T_i$ to 1 in $T_\ell$, i.e. $T^i_\ell$ is the term
in $f^i$ corresponding to $T_\ell$.  Note that if $T^i_\ell \not \equiv T_\ell$
then $T^i_\ell$ is smaller than $T_\ell$.

The following lemma
shows that each variable appears in a limited
number of terms and that therefore
not too many terms $T^i_\ell$ in $f^i$
are smaller than their corresponding terms $T_\ell$ in $f$.  In
this and later lemmas, ``$n$ sufficiently large''
means that $n$ is larger than a constant which depends
on $\alpha$ but not on $k$ or $t$.

\begin{lemma} \label{lem:vjNotInManyTerms}
Let 
\[
\dmany := n\left({\frac {ekt^{3/2}\log t} {n2^{k-1}\alpha^2}} \right)^{2^{k-1}\alpha^2/(\sqrt{t}\log t)}\enspace.
\]
For $n$ sufficiently large, with probability at least
$1 - \dmany$
over the draw of $f$
from $\Mtk$, both of the following conditions hold:
\begin{itemize}
\item Every variable $v_j$, $1 \leq j \leq n$,
appears in
at most $2^{k-1}\alpha^2/(\sqrt{t}\log t)$ terms of $f$; and
\item For all $1 \leq i \leq t$
at most $k 2^{k-1} \alpha^2/(\sqrt{t} \log t)$
terms $T^i_\ell$ with $\ell \neq i$
in the projection $f^i$ are smaller than the corresponding terms
$T_\ell$ in $f$.
\end{itemize}
\end{lemma}
Note that since $(k,t)$ is a monotone
$\alpha$-interesting pair and $t = O(n^{2-\gamma})$ for some
fixed $\gamma > 0$, for sufficiently large $n$ this probability bound
is non-trivial.


\begin{proof}[Proof of \lemref{lem:vjNotInManyTerms}]
We first prove that with high probability every variable appears
in a limited number of terms.
Fix any variable $v_j$.  For each term $T_\ell$
we have that
$v_j$ occurs in $T_\ell$ with probability $k/n$.
Since the terms are chosen independently, the number
of occurrences of $v_j$
is binomially distributed according to $B(t, p)$ with $p = k/n$.
Taking $\beta = n2^{k-1}\alpha^2/(kt^{3/2}\log t)$ in the Chernoff bound (which
is greater than $1$ for sufficiently large $n$),
the probability that $v_j$ appears in
$\beta p t = 2^{k-1}\alpha^2/(\sqrt{t} \log t)$ or more terms
is at most
\[
\left({\frac {ekt^{3/2}\log t} {n2^{k-1}\alpha^2}} \right)^{2^{k-1}\alpha^2/(\sqrt{t}\log t)}\enspace.
\]
The lemma follows by the union bound over the $n$ variables
$v_j$.

For the bound on the number of terms in $f^i$ smaller than those in
$f$, simply note that if every variable appears in at most
$2^{k-1}\alpha^2/(\sqrt{t} \log t)$ terms then, since there are $k$ variables
in term $T_i$, at most $k 2^{k-1}\alpha^2/(\sqrt{t} \log t)$ terms
$T_{\ell}^i$ with $\ell \neq i$
in
$f^i$ are smaller than the corresponding terms $T_{\ell}$ in $f$.
\end{proof}

The next lemma shows that 
there is probably little overlap between any pair
of terms in $f$:

\begin{lemma} \label{lem:fiNotVerySmallTerms}
Let $\dshared := t^2 ({\frac {k^2} {n}})^{\log \log t}$.
With probability at least
$1 - \dshared$
over the random draw of $f$
from $\Mtk$, for all $1 \leq i,j \leq t$
no set of $\log \log t$ or more variables belongs to two distinct
terms $T_i$ and $T_j$ in $f$.
\end{lemma}

\begin{proof}
We are interested in upper bounding
the probability $p_i$ that $\log \log t$ or more of the
variables in a fixed term $T_i$ belonging to $f$
also appear in some other term $T_\ell$ of $f$, for any $\ell \neq
i$.  First, a simple counting argument shows that the probability that
a fixed set of $\log \log t$ variables appears in a set
of $k$ variables randomly chosen from among $n$ variables is at most
$(k/n)^{\log \log t}$.  Since there are $\binom{k}{\log \log t}$ ways to choose a fixed
set of $\log \log t$ variables from term $T_i$, we have
\[
p_i \leq \binom{k}{\log \log t} \left( \frac{k}{n} \right)^{\log \log t} (t-1)\enspace.
\]
The lemma follows by the union bound over the $t$ probabilities
$p_i$.
\end{proof}

Using the preceding lemmas, we can show that
for $f$ drawn from $\Mtk$, with high probability
each term is ``uniquely satisfied'' by a noticeable
fraction of assignments.  More precisely, we have:
\begin{lemma}
\label{lem:uniquesat}
Let $\dusat := \dmany + \dshared$.
For $n$ sufficiently large and $k \geq 5$,
with probability at least $1 - \dusat$
over the random draw
of $f$ from $\Mtk$, $f$ is such that for all
$i = 1,\dots,t$
we have
\[
\Pr_x[T_i \;\text{is satisfied by $x$ but
no other $T_j$ is satisfied by $x$}]
\geq {\frac {\alpha^3} {2^{k+3}}}\enspace.
\]
\end{lemma}
\begin{proof}
  Given an $f$ drawn according to $\Mtk$ and given any term $T_i$ in
  $f$, we are interested in the probability over uniformly drawn
  instances that $T_i$ is satisfied and $T_\ell$ is not satisfied for
  all $\ell \neq i$.  Let $\overline{T_{\ell \neq i}}$ represent the
  formula that is satisfied by an assignment $x$ if and only if all of
  the $T_\ell$ with $\ell \neq i$ are not satisfied by $x$.  We want a
  lower bound on
\[
  \Pr[ T_i \wedge \overline{T_{\ell \neq i}} ] =
  \Pr[ \overline{T_{\ell \neq i}} \mid T_i] \cdot \Pr[ T_i ]\enspace.
\]
Since $\Pr[T_i] = 1/2^k$, what remains is to show that with very high
probability over random draw of $f$, $\Pr[\overline{T_{\ell \neq i}}\;
|\; T_i]$ is bounded below by $\alpha^3/8$ for all $T_i$.  That is, we
need to show that $\Pr[\overline{f^i}] \geq \alpha^3/8$ with very high
probability.

%jcj Let $T^i_\ell$ be the term in $f^i$ as described earlier.
\ignore{corresponding to $T_j$ in $f$
(assuming that $f^i$ is not identically true, each term
in $f$
except $T_i$ has a corresponding term in $f^i$ and the mapping is
onto, although the mapping may not be one-to-one).}
We have that
all of the following statements
hold with probability at least $1 - \dusat$
for every $1 \leq i \leq n$
for a random $f$ from $\Mtk$:
\begin{enumerate}
\item $\Pr[\overline{f^i}] \geq \prod_{\ell: \ell \neq i} \Pr[\overline{T^i_\ell}]$:  this follows
from Equation~\eqref{eq:asterix} in the proof of \lemref{lem:pfalse}.
\item $\prod_{\ell : T^i_\ell \equiv T_\ell} \Pr[\overline{T^i_\ell}] >
\alpha^3/4$.  This holds because the terms in this product are a subset
of the terms in Equation~\eqref{eq:asterix}
(in the proof of \lemref{lem:pfalse}).
\item At most $k 2^{k-1}\alpha^2/(\sqrt{t} \log t)$
terms $T_\ell$ with $\ell \neq i$ are smaller in
$f^i$ than they are in $f$ (by \lemref{lem:vjNotInManyTerms}).
\item No term in $f^i$ has fewer than $k - \log \log t$ variables (by \lemref{lem:fiNotVerySmallTerms}).
\end{enumerate}

These conditions together imply that
\[
\Pr[\overline{f^i}] \geq \left(\frac{\alpha^3}{4}\right)
 \left(\left(1 - \frac{\log t}{2^k} \right)^{2^k/(\log t)}
 \right)^{k \alpha^2/2\sqrt{t}}\enspace.
 \]
Note that $k \alpha^2/2\sqrt{t} \leq 1/2$ for all $k \geq 5$, since
for such $k$ we have
$k^2 \alpha^4 \leq \alpha k^2 < \alpha 2^k \leq t$.  Thus, since
$\left( 1 - {\frac 1 x} \right)^x \geq 1/4$ for
all $x \geq 2$, we have that
$\Pr[\overline{f^i}] \geq \alpha^3 / 8$.
\end{proof}

On the other hand, we can upper bound the probability that two terms
of a random DNF $f$ will be satisfied simultaneously:

\begin{lemma} \label{lem:sattwo}
With probability at least $1-\dshared$ over the random draw of
$f$ from $\Mtk$,
for all $1 \leq i < j \leq t$ we have
$\Pr[T_i \wedge T_j] \leq {\frac {\log t}{2^{2k}}}$.
\end{lemma}

\begin{proof}

By \lemref{lem:fiNotVerySmallTerms}, with probability at least
$1-\dshared$ $f$ is such that, for all $1 \leq i < j \leq n$, terms
$T_i$ and $T_j$ share at most $\log\log t$ variables.  Thus for
each pair of terms a specific
set of at least $2k - \log\log t$ variables must be simultaneously set
to $1$ in an instance in order for both terms to be satisfied.
\end{proof}



\subsection{Identifying co-occurring variables} \label{sec:idmon}

We now show how to identify pairs of variables that
co-occur in some term of $f$.
First, some notation.  Given a monotone DNF $f$ over variables
$v_1,\dots,v_n$, define DNF formulas $g_{**}$, $g_{1*}$, $g_{*1}$, and $g_{11}$ over variables
$v_{3},\dots,v_n$ as follows:
\begin{itemize}
\item  $g_{**}$ is the disjunction of the terms
in $f$ that contain neither $v_1$ nor $v_2$;
\item $g_{1*}$ is the disjunction of the terms in $f$
that contain $v_1$ but not $v_2$ (but with $v_1$ removed from
each of these terms);
\item  $g_{*1}$ is defined similarly as
the disjunction of the terms in $f$
that contain $v_2$ but not $v_1$ (but with $v_2$ removed from
each of these terms);
\item $g_{11}$ is the disjunction of the terms in $f$
that contain both $v_1$ and $v_2$ (with both variables removed from
each term).
\end{itemize}
We thus have
$f = g_{**} \vee (v_1 g_{1*}) \vee (v_2  g_{*1}) \vee
               (v_1v_2 g_{11})$.
Note that any of $g_{**}$, $g_{1*}$, $g_{*1}$, $g_{11}$ may be
an empty disjunction which is identically false.

We can empirically estimate each of the following using
uniform random examples $(x, f(x))$:
\begin{eqnarray*}
p_{00} &:=& \Pr_x[g_{**}] =
\Pr_{x \in U_n}[f(x) = 1 \mid x_1 = x_2 = 0]\\
p_{01} &:=& \Pr_x[g_{**} \vee g_{*1}]=
\Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 0, x_2 = 1] \\
p_{10} &:=& \Pr_x[g_{**} \vee g_{1*}]=
\Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 1, x_2 = 0] \\
p_{11} &:=& \Pr_x[g_{**} \vee g_{*1} \vee g_{1*} \vee g_{11}] =
\Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 1, x_2 = 1]\enspace.
\end{eqnarray*}
It is clear that $g_{11}$ is nonempty if and only if $v_1$ and $v_2$
co-occur in some term of $f$; thus we would ideally like to obtain
$\Pr_{x \in U_n}[g_{11}]$.  While we cannot obtain this probability
from $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$, the following lemma
shows that we can estimate a related quantity:

\begin{lemma} \label{lemma:psum}
Let $P$ denote $p_{11} - p_{10} - p_{01} + p_{00}$.  Then
$P =
\Pr[g_{11} \wedge
\overline{g}_{1*} \wedge \overline{g}_{*1} \wedge \overline{g}_{**}]
-\Pr[g_{1*} \wedge g_{*1} \wedge \overline{g}_{**}]$.
\end{lemma}
\begin{proof}
$P$ gets a net contribution of $0$ from those $x$ which belong to
$g_{*,*}$ (since each such $x$ is added twice and subtracted twice
in $P$). We proceed to analyze the contributions to $P$ from the
remaining 8 subsets of the events $g_{11}$, $g_{1*}$, and $g_{*1}$:
\begin{itemize}

\item $P$ gets a net contribution of 0 from those $x$ which are
in $g_{1*} \wedge \overline{g}_{*1} \wedge \overline{g}_{**}$ since
each such $x$ is counted in $p_{11}$ and $p_{10}$ but not in
$p_{01}$ or $p_{00}$.  Similarly $P$ gets a net contribution of 0
from those $x$ which are in $g_{*1} \wedge \overline{g}_{1*} \wedge
\overline{g}_{**}$.


\item $P$ gets a net contribution of $\Pr[g_{11} \wedge
\overline{g}_{1*} \wedge \overline{g}_{*1} \wedge
\overline{g}_{**}]$ since each such $x$ is counted in $p_{11}$.

\item $P$ gets a net contribution of $-\Pr[g_{1*} \wedge g_{*1} \wedge
\overline{g}_{**}]$ since each such $x$ is counted in $p_{01},
p_{10}$, and $p_{11}$.
\end{itemize}
\end{proof}



More generally, let $P_{ij}$ be defined as $P$ but with $v_i$, $x_i$,
$v_j$, and $x_j$ substituted for $v_1$, $x_1$, $v_2$, and $x_2$,
respectively, throughout the definitions of the $g$'s and $p$'s above.
The reader familiar with Boolean Fourier analysis will readily
recognize that $P_{ij}$ is a scaled (by a factor of $-2$)
version of the second-order
Fourier coefficient of $f$ corresponding to the pair of variables
$(v_i,v_j)$.
(This coefficient is equal to $2 \Pr_{x \in U_n}[f(x) = x_i \oplus x_j]-1$;
see~\cite{Mansour:94} for a nice overview of Boolean Fourier analysis
in the context of uniform-distribution learning.)
The following lemma shows that, for most random choices of $f$, for
all $1 \leq i,j \leq n$, the
value of $P_{ij}$ is a good indicator of whether or not $v_i$ and
$v_j$ co-occur in some term of $f$:

\begin{lemma} \label{lemma:mn}
For $n$ sufficiently large and $t \geq 16$,
with probability at least $1 - \dusat - \dshared - \dmany$
over the random draw
of $f$ from $\Mtk$, we have that for all $1 \leq i,j \leq n$
(i) if $v_i$ and $v_j$ do not co-occur in some term of $f$ then $P_{ij} \leq 0$;
(ii) if $v_i$ and $v_j$ do co-occur in some term of $f$ then
  $P_{ij} \geq {\alpha^4}/ {16t}$.
\end{lemma}

\begin{proof}
Part (i) holds for any monotone DNF by \lemref{lemma:psum}.
For (ii), we first note that with probability at least
$1-\dusat-\dshared-\dmany$, a randomly chosen $f$ has all the
following properties:
\begin{enumerate}
 \item
 Each term in $f$ is uniquely satisfied with probability at
   least $\alpha^3/2^{k+3}$ (by \lemref{lem:uniquesat});

 \item\begin{sloppypar} Each pair of terms $T_i$ and $T_j$ in $f$ are both satisfied
   with probability at most $\log t/2^{2k}$ (by \lemref{lem:sattwo}); and   \end{sloppypar}
 \item Each variable in $f$ appears in at most $2^{k-1}\alpha^2/
(\sqrt{t}\log t)$ terms
   (by \lemref{lem:vjNotInManyTerms}).
\end{enumerate}
We call such an $f$ \emph{well-behaved}.
For the sequel, assume that $f$ is well-behaved and also assume without
loss of generality that
$i=1$ and $j=2$.  We consider
separately the two probabilities 
\[
\rho_1 = \Pr[g_{11} \wedge
\overline{g}_{1*} \wedge \overline{g}_{*1} \wedge \overline{g}_{**}]
\qquad \text{and}\qquad \rho_2 = \Pr[g_{1*} \wedge g_{*1} \wedge \overline{g}_{**}]
\]
whose difference defines $P_{12}=P$.  By property (1) above,
$\rho_1 \geq \alpha^3/2^{k+3}$, since each instance $x$ that uniquely
satisfies a term $T_j$ in $f$ containing both $v_1$ and $v_2$ also
satisfies $g_{11}$ while falsifying all of $g_{1*}$, $g_{*1}$, and
$g_{**}$.  Since $(k,t)$ is monotone
$\alpha$-interesting, this implies that $\rho_1 \geq \alpha^4/8t$.  On
the other hand, clearly $\rho_2 \leq \Pr[g_{1*} \wedge g_{*1}]$.  By
property (2) above, for any pair of terms consisting of one
term from $g_{1*}$ and the other from $g_{*1}$, the probability that
both terms are satisfied is at most $\log t/2^{2k}$.  Since each of
$g_{1*}$ and $g_{*1}$ contains at most $2^{k-1}\alpha^2/(\sqrt{t}\log t)$
terms by property (3), by a union bound we have
$\rho_2 \leq \alpha^4/(4t\log t)$, and the lemma
follows:
\[
\rho_1 - \rho_2 \geq
{\frac {\alpha^4} {8t}}
-
{\frac {\alpha^4}{4t \log t}} \geq
{\frac {\alpha^4}{16t}}
\]
given the assumption that $t \geq 16$.
\end{proof}

Thus, our algorithm for finding all of the co-occurring pairs of a
randomly chosen monotone DNF consists of estimating $P_{ij}$ for each
of the $n(n-1)/2$ pairs $(i,j)$ so that all of our estimates
are---with probability at least $(1-\delta)$---within an additive factor
of $\alpha^4/32t$ of their true values.
Recalling that each $P_{ij}$ is a scaled version of the second-order
Fourier coefficient,
by the standard Hoeffding bound a uniform
random sample of size $O(t^2\ln(n^2/\delta)/\alpha^8)$ is sufficient
to estimate all of the $P_{ij}$'s to the specified tolerance with
overall probability at least $1-\delta$. We thus have the following theorem for monotone $\alpha$-interesting $(k,t)$
with $t = O(n^{2 - \gamma})$:

\begin{theorem} \label{thm:pairs}
  For $n$ sufficiently large and any $\delta > 0$, with
  probability at least 
  $1-\dusat -\dshared -\dmany -\delta$
  over the
  choice of $f$ from $\Mtk$ and the choice of random examples,
  our algorithm runs in $O(n^2 t^2 \log(n/\delta))$ time and
  identifies exactly those pairs $(v_i,v_j)$
  which co-occur in some term of $f$.
\end{theorem}

\subsection{Forming a hypothesis from pairs of co-occurring variables} \label{sec:reconmon}

Here we show how to construct an accurate DNF
hypothesis for a random $f$ drawn from $\Mtk$.


\paragraph{Identifying $k$-cliques.}
By \thmref{thm:pairs}, with high probability we have complete
information about which pairs of variables $(v_i,v_j)$ co-occur in some
term of $f$. We thus may consider the graph $G$ with vertices
$v_1,\dots,v_n$ and edges for precisely those pairs of variables
$(v_i,v_j)$ which co-occur in some term of $f$. This graph is a union
of $t$ randomly chosen $k$-cliques from $\{v_1,\dots,v_n\}$ which
correspond to the $t$ terms in $f$; we will call these the
\emph{$f$-cliques} of $G$.  Ideally, if we could identify exactly the
$t$ $f$-cliques of $G$, then we could exactly reconstruct $f$.  While
we do not know how to accomplish this, we do show how to find
a set of $k$-cliques corresponding to a set of terms whose union
closely approximates $f$.

Specifically, we will show how to efficiently identify (with
high probability over the choice of $f$ and random examples of $f$)
a set of $k$-cliques in $G$ that contains as a subset the set of
all of the $f$-cliques in $G$.
Once these
$k$-cliques have been identified, as we show later
it is easy to construct an accurate DNF hypothesis
for $f$.

The following lemma shows that with high probability over the choice
of $f$, each pair $(v_i,v_j)$ co-occurs in at most a constant number
of terms:

\begin{lemma} \label{lem:fewpairs}
Let $\dC := ({\frac {t k^2}{n^2}})^C$ ($\dC$ is a function of $C$ as
well as of $t$, $k$, and $n$) and
fix $1 \leq i < j \leq n$.
  For any $C \geq 0$ and all sufficiently large $n$, we have
\[
\Pr_{f \in \Mtk}[\text{some pair of variables $(v_i,v_j)$ co-occur in more than
  $C$ terms of $f$}] \leq \dC\enspace.
\]
\end{lemma}

\begin{proof}
For any fixed $r \in \{1,\dots,t\}$ we have that 
\[
\Pr[\text{$v_i$ and $v_j$ co-occur in term $T_r$}] = {\frac {k(k-1)}{n(n-1)}}\leq
{\frac {k^2} {n^2}}\enspace.
\]
Since these events are independent for all
$r$, the probability that there is any collection of $C$ terms  such
that $v_i$ and $v_j$ co-occur in all $C$ of these terms is at most 
\[
\binom{t}{C} \cdot \Bigl({\frac {k^2}{n^2}}\Bigr)^C \leq \Bigl({\frac {t
k^2}{n^2}}\Bigr)^C\enspace.
\]
\end{proof}



By \lemref{lem:fewpairs} we know that, for any given
pair $(v_i,v_j)$ of variables, with probability
at least $1 - \dC$
there are at most $Ck$ other variables $v_\ell$ such that
$(v_i,v_j,v_\ell)$ all co-occur in some term of $f$.
Suppose that we can efficiently (with high probability)
identify the set $S_{ij}$ of all such variables $v_\ell$.  Then we can perform
an exhaustive search over all $(k-2)$-element subsets $S'$ of $S_{ij}$
in at most $\binom{Ck}{k} \leq (eC)^k = n^{O(\log C)}$ time, and can
identify all of the sets $S'$ such that $S' \cup \{v_i,v_j\}$ is a
clique of size $k$ in $G$ that includes both $v_i$ and $v_j$.
Repeating this over all pairs
of variables $(v_i,v_j)$, we can with high probability identify a set
$G_k$ of
$k$-cliques in $G$ such that $G_k$ contains all of the $f$-cliques.

Thus, to identify $G_k$, it remains only to show that
for every pair of variables $v_i$ and $v_j$,
we can determine the set $S_{ij}$ of those variables $v_\ell$ that
co-occur in at least one term with both $v_i$ and $v_j$.  Assume that
$f$ is such that all pairs of variables co-occur in at most $C$ terms,
and let $T$ be a set of variables of cardinality at most $C$ having
the following properties:
\begin{itemize}
\item In the projection
$f_{T \leftarrow 0}$ of $f$
in which all of the variables of $T$ are fixed to $0$,
$v_i$ and $v_j$ do not co-occur in any term; and
\item For every set $T' \subset T$ such that $|T'| = |T|-1$,
$v_i$ and $v_j$ do co-occur in $f_{T' \leftarrow 0}$.
\end{itemize}
Then $T$ is clearly a subset of $S_{ij}$.  Furthermore, if we can
identify all such sets $T$, then their union will be
$S_{ij}$.  There are only $O(n^C)$ possible sets to consider,
so our problem now reduces to the following: given a
set $T$ of at most $C$ variables, determine whether
$v_i$ and $v_j$ co-occur in $f_{T \leftarrow 0}$.

The proof of \lemref{lemma:mn} shows that $f$ is well-behaved with
probability at least $1 - \dusat - \dshared - \dmany$ over the choice of
$f$.  Furthermore, if $f$ is well-behaved then it is easy to see that
for every $|T|\leq C$, $f_{T \leftarrow 0}$ is also well-behaved,
since $f_{T \leftarrow 0}$ is just $f$ with $O(\sqrt{t})$ terms
removed (by \lemref{lem:vjNotInManyTerms}).  That is, removing
terms from $f$ can only make it more likely that the remaining terms
are uniquely satisfied, does not change the bound on the probability
of a pair of remaining terms being satisfied, and can only decrease
the bound on the number of remaining terms in which a remaining
variable can appear.  Furthermore, \lemref{lemma:psum} holds for
any monotone DNF $f$.  Therefore, if $f$ is well-behaved then the
proof of \lemref{lemma:mn} also shows that for every $|T| \leq C$,
the $P_{ij}$'s of $f_{T \leftarrow 0}$ can be used to identify the
co-occurring pairs of variables within $f_{T \leftarrow 0}$.  It
remains to show that we can efficiently simulate a uniform example
oracle for $f_{T \leftarrow 0}$ so that these $P_{ij}$'s can be
accurately estimated.

In fact, for a given set $T$, we can simulate a uniform example oracle
for $f_{T \leftarrow 0}$ by filtering the examples from the uniform
oracle for $f$ so that only examples setting the variables in $T$ to
$0$ are accepted.  Since $|T|\leq C$, the filter accepts with constant
probability at least $1/2^C$.  A Chernoff argument shows that if all
$P_{ij}$'s are estimated using a single sample of size $2^{C+12} t^2
\ln(2(C+2)n^C/\delta)/\alpha^8$ (filtered appropriately when needed)
then all of the estimates will have the desired accuracy with
probability at least $1-\delta$.

In somewhat more detail, the $G_k$-finding algorithm can be written as:

\begin{itemize}
\item Given: $\alpha$, $\gamma$, $C$, $\delta$
\item (Note that $f$ is well-behaved and has the ``each pair occurs
in at most $C$ terms''
  property with probability at least $1-\dusat -\dshared -\dmany - \dC$.
  So assume this of $f$ for the remainder of the algorithm.)
\item Draw set $S$ of $O(t^2 \log^2(n/\delta))$ examples of $f$
\item For $1 \leq i < j \leq n$ (fewer than $n^2$ times)
  \begin{itemize}
  \item Estimate $P_{ij}$ over $S$ ($O(|S|)$ time)
  \item Add $(v_i,v_j)$ to the set of co-occurring pairs
    if estimated $P_{ij}$ exceeds the threshold
    $\alpha^4/(32t)$
  \end{itemize}
\item For each $|T|\leq C$  (at most $n^C$ times)
  \begin{itemize}
  \item For each co-occurring pair $(v_i,v_j)$ disjoint from $T$
    (less than $tk^2$ times)
    \begin{itemize}
    \item [(1)] Estimate $P_{ij}$ over ($T \leftarrow 0$)-filtered $S$  ($O(|S|)$ time)
    \item [(2)] For each subset $T' \subset T$, $|T'| = |T|-1$  (at most $C$ times)
      \begin{itemize}
      \item [(i)] Estimate $P_{ij}$ over ($T'\leftarrow 0$)-filtered $S$ ($O(|S|)$ time)
      \end{itemize}
    \item Add $T$ to $S_{ij}$ if it passes all threshold tests, i.e.\
    the estimate from (1) is at least $\alpha^4/(32t)$ and each estimate
    from (2)(i) is at most $\alpha^4/(32t)$  ($O(C)$ time)
    \end{itemize}
  \end{itemize}
\item For each co-occurring pair $(v_i,v_j)$  (less than $tk^2$ times)
  \begin{itemize}
  \item For each $(k-2)$-size subset $U$ of $S_{ij}$  ($n^{O(\log C)}$ times)
    \begin{itemize}
    \item Test if the union of $(v_i,v_j)$ and $U$ is a clique ($O(k^2)$ time)
    \end{itemize}
  \end{itemize}
\end{itemize}

The time bound for this algorithm is then
\[
O(|S| + n^2 |S| + n^C tk^2 C|S| + tk^2 n^{O(\log C)} k^2)
\]
which is dominated by the third term if $C \geq 2$.

The sample complexity $|S|$ is derived as follows.  We need a sample
large enough to
\begin{itemize}
\item succeed for all $n^2$ tests for co-occurring pairs (over the
  full sample), and
\item succeed for all $n^C (C+1)$ tests over filtered examples.
\end{itemize}
The total number of tests if $C \geq 2$ is bounded by $O((C+2) n^C)$.
Recalling that our estimates need to be accurate to within an additive
factor of $\alpha^4/32t$, we see that if all tests are run over
samples of size $m = 2^{11} t^2 \ln(2(C+2)n^C/\delta)/\alpha^8$ then, by
Hoeffding and the union bound, all tests succeed with probability at
least $1-\delta/2$.

We want $|S|$ large enough so that all $(C+1)n^C$ filtered samples
will be of size $m$ with probability $1-\delta/2$.  If a filter
accepts with probability $p$ over a sample of size $2m/p$, then the
probability that fewer than $m$ examples are accepted is at most
$e^{-m/4}$ by Chernoff.  Using the $m$ given in the previous paragraph
and the union bound, it can be seen that choosing $|S|=2m/p$ gives
us the desired probability of success over all tests.  

Thus, since we are using $p=1/2^C$ in the filtering, the final time
bound of the algorithm becomes (for arbitrary $C \geq 2$) $O((2n)^C
t^3 k^2 \log(Cn^C/\delta))$.  This gives us the following:

\begin{theorem}
  For $n$ sufficiently large, any $\delta > 0$, and any fixed $C \geq 2$,
with  probability at least $1-\dusat -\dshared -\dmany - \dC - \delta$ over the
random draw of $f$ from $\Mtk$ and the choice of
  random examples, a set $G_k$ containing all of the $f$-cliques of $G$
 can be identified in time
  $n^{O(C)} t^3 k^2 \log(n/\delta)$.
\end{theorem}

\paragraph{The main learning result for monotone DNF}
From $G_k$ we construct in the obvious way 
a list $T'_1,\dots,T'_N$ (with $N=|G_k|=O(n^C)$) of length-$k$ monotone terms
that contains all $t$ true terms $T_1,\dots,T_t$ of $f$.
Now note that the target function $f$ is simply an OR of
some subset of these $N$ ``variables'' $T_1,\dots,T_N$, so the
standard elimination algorithm for PAC learning disjunctions (under
any distribution) can be used to PAC learn the target function.
The algorithm requires $O((1/\eps)\log(1/\delta) + N/\eps)$ examples
and runs in time which is linear in its sample size; see e.g.~Chapters
1 and 2 of~\cite{KearnsVazirani:94}.

Call the above described entire learning algorithm $A$.  In summary, we have proved the following:

\begin{theorem} \label{thm:mainmon}
  Fix $\gamma$, $\alpha > 0$ and $C \geq 2$.
  Let $(k,t)$ be a monotone
  $\alpha$-interesting pair.  For any $\eps > 0$, $\delta > 0$,
  and $t = O(n^{2 - \gamma})$,
  algorithm $A$ will
with  probability at least $1-\dusat -\dshared -\dmany - \dC - \delta$
(over the random choice of DNF from $\Mtk$ and the randomness of the example
  oracle) produce a hypothesis $h$ that $\eps$-approximates the target
  with respect to the uniform distribution.  Algorithm $A$ runs in time
  polynomial in $n$, $\log (1/\delta)$, and $1/\eps$.
\end{theorem}

\section{Non-monotone DNF} \label{sec:nonmonotone}

\subsection{Interesting parameter settings} \label{sec:interestingnon}

As with $\Mtk$ we are interested in pairs $(k,t)$ for which
$\E_{f \in \Dtk}[\Pr[f]]$ is between $\alpha$ and $1 - \alpha$:

\begin{definition} For $\alpha > 0$,
the pair $(k,t)$ is said to be \emph{$\alpha$-interesting} if
$\alpha \leq \E_{f \in \Dtk}[\Pr[f]] \leq 1 - \alpha$.
\end{definition}
For any fixed $x \in \{0,1\}^n$ we have
\[
\Pr_{f \in \Dtk}[f(x) = 0] = \bigl(1 - {\frac 1 {2^k}}\bigr)^t\enspace,
\qquad \text{and thus} \qquad
\E_{f \in \Dtk}[\Pr[f]] = 1 - \bigl(1 - {\frac 1 {2^k}}\bigr)^t
\]
by linearity of expectation;
this formula will be useful later.

Throughout the rest of \secref{sec:nonmonotone}
we assume that $\alpha>0$ is fixed and
$(k,t)$ is an $\alpha$-interesting pair where $t = O(n^{3/2 - \gamma})$
for some $\gamma > 0$.

\subsection{Properties of random DNF}

In this section we develop analogues of Lemmas~\ref{lem:uniquesat}
and~\ref{lem:sattwo} for $\Dtk$.
The $\Dtk$ analogue of \lemref{lem:sattwo} follows directly from the proof of
\lemref{lem:sattwo}, and we have:
\begin{lemma} \label{lem:sattwonon}
With probability at least $1-\dshared$ over the random draw of
$f$ from $\Dtk$,
for all $1 \leq i < j \leq n$,
$\Pr[T_i \wedge T_j] \leq {\frac {\log t}{2^{2k}}}$.
\end{lemma}
In the following lemma  we use McDiarmid's bound to prove a $\Dtk$
version of \lemref{lem:uniquesat}:

\begin{lemma} \label{lem:uniquesatnon}
Let 
\[
\dpusat := t\left( (t-1)\Bigl({\frac {k^2}{n}}\Bigr)^{\log \log t} +
\exp\left({\frac {-\alpha^2 t}{16 \ln^2 (2/\alpha) \log^2 t}}\right)
\right)\enspace.
\]
With probability at least $1 - \dpusat$, a random $f$
drawn from $\Dtk$ is such that for each $i = 1,\dots,t$, we have
\[
P_i \equiv \Pr_x[\text{$T_i$ is satisfied by $x$ but no other $T_j$ is
satisfied by $x$}] \geq {\frac {\alpha}{2^{k+1}}}\enspace.
\]
\end{lemma}
\begin{proof}
We show that $P_1 \geq \alpha/ {2^{k+1}}$ with probability at
least $1 - \dpusat/t$; the lemma follows by a union bound. We first
show that $\E_{f \in \Dtk}[P_1] \geq \alpha/{2^k}$. For any
fixed $x \in T_1$, we have
\[
\Pr[\overline{T}_2(x) \wedge \cdots
\cdots \wedge \overline{T}_t(x)] = (1 - 2^{-k})^{t-1} > (1 -
2^{-k})^t \geq \alpha
\]
where the last inequality holds since $(k,t)$
is $\alpha$-interesting. Since a $2^{-k}$ fraction of all $x \in
\{0,1\}^n$ belong to $T_1$, by linearity of expectation we have
$\E_{f \in \Dtk}[P_1] \geq \alpha/{2^k}$.

Now we show that with high probability the deviation of $P_1$ from
its expected value is low.  Given any fixed length-$k$ term $T_1$,
let $\Omega$ denote the set of all length-$k$ terms $T$ which
satisfy $\Pr[T_1 \wedge T] \leq ({\log t})/{2^{2k}}$. By
reasoning as in the proof of \lemref{lem:sattwonon}, with
probability at least $1 - (t-1)({\frac {k^2}{n}})^{\log \log t}$
each of $T_2,\dots,T_t$ belongs to $\Omega$, so we henceforth assume
that this is in fact the case, i.e.\ we condition on the event
$\{T_2,\dots,T_t\} \subset \Omega$. Note that under this
conditioning we have that each of $T_2,\dots,T_t$ is selected
uniformly and independently from $\Omega$. Note also that this
conditioning can change the value of $P_1$ (a probability) by at
most $(t-1)({\frac {k^2}{n}})^{\log \log t} < {\frac \alpha
{2^{k+2}}}$, so under this conditioning we have $\E[P_1] \geq {\frac
3 4} \cdot {\frac \alpha {2^k}}$.

We now use McDiarmid's inequality where the random variables are the
randomly selected terms $T_2,\dots,T_t$ from $\Omega$ and
$F(T_2,\dots,T_t)$ denotes $P_1$, i.e.
$$
F(T_2,\dots,T_t) = \Pr_x[\text{$T_1$ is satisfied by $x$ but no $T_j$
with $j \geq 2$ is satisfied by $x$}]\enspace.
$$
Since each $T_j$ belongs to $\Omega$, we have 
\[
|F(T_2,\dots,T_t) -
F(T_2,\dots,T_{j-1},T'_{j},T_{j+1},\dots,T_t)| \leq c_i = {\frac
{\log t}{2^{2k}}}
\]
for all $j = 2,\dots,t$. Taking $\tau = {\frac 1
4} \cdot {\frac  {\alpha}{2^{k }}}$, McDiarmid's inequality implies
that $\Pr\left[P_1 < {\frac \alpha {2^{k+1}}}\right]$ is at most
$$
\exp \left( {\frac {- \alpha^2/(16 \cdot 2^{2k})} {(t-1) ({\frac
{\log t}{2^{2k}}})^2}} \right) = \exp\left({\frac {-\alpha^2
2^{2k}}{16 (t-1)\log^2 t}}\right) \leq \exp\left({\frac {-\alpha^2
2^{2k}}{16 t\log^2 t}}\right) \leq \exp\left({\frac {-\alpha^2 t}{16
\ln^2 (2/\alpha) \log^2 t}}\right)
$$
where the last inequality holds since $(k,t)$ is
$\alpha$-interesting. Combining all the failure probabilities, the
lemma is proved.
\end{proof}


\subsection{Identifying (most pairs of) co-occurring variables}

Recall that in \secref{sec:idmon}
we partitioned the terms of our monotone DNF into
four disjoint groups
depending on what subset of $\{v_1,v_2\}$ was present in each term.
In the non-monotone case, we will partition the terms of $f$
into nine disjoint groups depending on whether each of $v_1,v_2$ is unnegated,
negated, or absent:
\[
  f = g_{**} \vee (v_1 g_{1*}) \vee (\overline{v_1} g_{0*})
\vee (v_2g_{*1}) \vee (v_1 v_2 g_{11}) \vee (\overline{v_1} v_2 g_{01})
\vee (\overline{v_2}g_{*0}) \vee (v_1 \overline{v_2} g_{10}) \vee (\overline{v_1}\overline{ v_2} g_{00})
\]
Thus $g_{**}$ contains those terms of $f$ which contain neither $v_1$ nor $v_2$ in any form;
$g_{0*}$ contains the terms of $f$ which contain $\overline{v_1}$ but not $v_2$ in any
form (with $\overline{v_1}$
removed from each term); $g_{*1}$ contains the terms of $f$ which contain $v_2$ but not $v_1$ in any
form (with $v_2$ removed from each term); and so on.
Each $g_{\cdot, \cdot}$ is thus a DNF (possibly empty) over literals formed from $v_3,\dots,v_n$.

For all four possible values of $(a,b) \in (0,1)^2$,
we can empirically estimate
\begin{eqnarray*}
p_{ab} &:=& \textstyle{\Pr_x[g_{**} \vee g_{a*} \vee g_{*b} \vee g_{ab}] =
\Pr_{x}[f(x) = 1 \mid x_1 = a, x_2 = b]}\enspace.
\end{eqnarray*}

It is easy to see that $\Pr[g_{11}]$ is either 0 or else at least
$4 /{2^k}$ depending on whether $g_{11}$ is empty or not.
Ideally we would like to be able to accurately estimate each of
$\Pr[g_{00}]$, $\Pr[g_{01}]$, $\Pr[g_{10}]$, and $\Pr[g_{11}]$; if we could
do this then we would have complete information about which pairs of literals
involving variables $v_1$ and $v_2$ co-occur in terms of $f$.
Unfortunately, the probabilities
$\Pr[g_{00}]$, $\Pr[g_{01}]$, $\Pr[g_{10}]$, and $\Pr[g_{11}]$
cannot in general be obtained from $p_{00}$, $p_{01}$, $p_{10}$,
and $p_{11}$.  However, we will show that we can efficiently
obtain some partial information which enables us to learn
to fairly high accuracy.

As before, our approach is to accurately estimate the quantity
$P=p_{11} - p_{10} - p_{01} + p_{00}$. We have the following two
lemmas:

\begin{lemma} \label{lemma:empty}
If all four of $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$
are empty, then $P$ equals
\begin{eqnarray}
& &
\Pr[g_{1*} \wedge g_{*0} \wedge
(\text{no other $g_{\cdot,\cdot}$})]
\ + \Pr[g_{0*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot, \cdot}$})]
\nonumber \\
& & \ \ \ \ \
-\Pr[g_{1*} \wedge g_{*1} \wedge
(\text{no other $g_{\cdot,\cdot}$})]
\ -\Pr[g_{0*} \wedge g_{*0} \wedge (\text{no other $g_{\cdot, \cdot}$})]\enspace.
\label{eq:quant2}
\end{eqnarray}
\end{lemma}
\begin{proof}
Since all four of $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$ are empty we
need only consider the five events $g_{**}$, $g_{*0}$, $g_{0*}$,
$g_{*1}$, and $g_{1*}$. We now analyze the contribution to $P$ from
each possible subset of these 5 events:

\begin{itemize}

\item $P$ gets a net contribution of
$0$ from those $x$ which belong to $g_{*,*}$ (and to any other
subset of the remaining four events) since each such $x$ is counted
in each of $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$. It remains to
consider all 16 subsets of the four events $g_{*0}$, $g_{0*}$, $g_{*1}$, and $g_{1*}$.

\item $P$ gets a net contribution of $0$ from those $x$ which are
in at least 3 of the four events $g_{*0}, g_{0*},$ $g_{*1}$, and
$g_{1*}$ since each such $x$ is counted in each of $p_{00}$,
$p_{01}$, $p_{10}$, and $p_{11}$. $P$ also gets a net contribution of
$0$ from those $x$ which are in exactly one of the four events
$g_{*0}$, $g_{0*}$, $g_{*1}$, and $g_{1*}$. It remains to consider those $x$ which are in exactly two of the
four events $g_{1*}$, $g_{0*}$, $g_{*1}$, and $g_{*0}$.

\item $P$ gets a net contribution of $0$ from those $x$ which
are in $g_{1*}$ and $g_{0*}$ and no other events, since each such
$x$ is counted in each of $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$.
The same is true for those $x$ which are in $g_{*1}$ and $g_{*0}$
and no other events.

\item $P$ gets a net contribution of 
$$
-\Pr[g_{1*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot,\cdot}$ occurs})]
$$
from those $x$ which
are in $g_{1*}$ and $g_{*1}$ and no other event. Similarly, $P$ gets
a net contribution of
$$
-\Pr[g_{0*} \wedge g_{*0} \wedge (\text{no other $g_{\cdot, \cdot}$ occurs})]
$$
from those $x$ which are in
$g_{0*}$ and $g_{*0}$ and no other event. $P$ gets a net
contribution of 
$$
\Pr[g_{1*} \wedge g_{*0} \wedge (\text{no other $g_{\cdot,\cdot}$ occurs})]
$$
from those $x$ which are in $g_{1*}$
and $g_{*0}$ and no other event, and gets a net contribution of
$$
\Pr[g_{0*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot, \cdot}$ occurs})]
$$
from those $x$ which are in $g_{0*}$ and $g_{*1}$ and no
other event.
\end{itemize}
\end{proof}

\begin{lemma} \label{lemma:mess}
If
exactly one of $g_{00}, g_{01}, g_{10}$ and $g_{11}$
is nonempty (say $g_{11}$), then $P$ equals (\ref{eq:quant2}) plus
\begin{align*}
\Pr[&g_{11} \wedge g_{1*} \wedge g_{*0} \wedge
(\text{no other $g_{\cdot,\cdot}$})] + 
\Pr[g_{11} \wedge g_{0*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot, \cdot}$})]\\
 &- \Pr[g_{11} \wedge g_{1*} \wedge g_{*1} \wedge
(\text{no other $g_{\cdot,\cdot}$})]
 - \Pr[g_{11} \wedge g_{0*} \wedge g_{*0} \wedge (\text{no other $g_{\cdot, \cdot}$})]\\
 &+ \Pr[g_{11} \wedge g_{0*} \wedge (\mbox{no other $g_{\cdot,\cdot}$})]
+ \Pr[g_{11} \wedge g_{*0} \wedge (\text{no other $g_{\cdot,\cdot}$})]
+ \Pr[g_{11} \wedge (\text{no other $g_{\cdot,\cdot}$})]\enspace.
\end{align*}
\end{lemma}

\begin{proof}
We suppose that $g_{11}$ is nonempty. We wish to analyze the
contribution to $P$ from all 64 subsets of the six events $g_{**}$,
$g_{1*}$, $g_{0*}$, $g_{*1}$, $g_{*0}$, and $g_{11}$.  From \lemref{lemma:empty} we know this contribution for the 32 subsets which
do not include $g_{11}$ is (\ref{eq:quant2}) so only a few cases
remain:
\begin{itemize}

\item $P$ gets a net contribution of $0$ from those $x$
which are in $g_{11}$ and in $g_{**}$ and in any other subset of
events (each such $x$ is counted in each of $p_{11}$, $p_{01}$, $p_{10}$,
and $p_{00}$). Similarly, $P$ gets a contribution of 0 from those
$x$ which are in $g_{11}$ and in at least three of $g_{1*}$, $g_{0*}$,
$g_{*1}$, and $g_{*0}$. So it remains only to analyze the contribution from
subsets which contain $g_{11}$, contain at most two of $g_{1*}$,
$g_{0*}$, $g_{*1}$, $g_{*0}$, and contain nothing else.

\item An analysis similar to that of \lemref{lemma:empty} shows that $P$ gets a net contribution of
\begin{align*}
	\Pr&[g_{11} \wedge g_{1*} \wedge g_{*0} \wedge (\text{no other
$g_{\cdot,\cdot}$})] + \Pr[g_{11} \wedge g_{0*} \wedge g_{*1}
\wedge (\text{no other $g_{\cdot, \cdot}$})]\\
 &- \Pr[g_{11} \wedge
g_{1*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot,\cdot}$})]
-\Pr[g_{11} \wedge g_{0*} \wedge g_{*0} \wedge (\text{no other
$g_{\cdot, \cdot}$})]
\end{align*}
 from those $x$ which are in $g_{11}$, in
exactly two of $\{g_{1*}, g_{0*},g_{*1},g_{*0}\}$, and in no other
events. So it remains only to consider subsets which contain
$g_{11}$ and at most one of $g_{1*}$, $g_{0*}$, $g_{*1}$, $g_{*0}$, and
nothing else.

\item $P$ gets a contribution of $0$ from $x$ which are in $g_{11}$
and $g_{1*}$ and in nothing else; likewise from $x$ which are in
$g_{11}$ and $g_{*1}$ and in nothing else. $P$ gets a contribution
of 
\[
\Pr[g_{11} \wedge g_{0*} \wedge (\text{no other
$g_{\cdot,\cdot}$})]
\]
from $x$ which are in $g_{11}$ and $g_{0*}$
and in nothing else, and a contribution of
\[
\Pr[g_{11} \wedge g_{*0}
\wedge (\text{no other $g_{\cdot,\cdot}$})]
\]
from $x$ which are in
$g_{11}$ and $g_{*0}$ and in nothing else.

\item $P$ gets a net contribution of
$\Pr[g_{11} \wedge (\mbox{~no other $g_{\cdot,\cdot}$})]$ from those
$x$ which are in $g_{11}$ and in no other event.

\end{itemize}
\end{proof}

Using the above two lemmas we can show that the value of $P$ is a good
indicator for distinguishing between
all four of $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$ being empty versus
exactly one of them being nonempty:

\begin{lemma} \label{lemma:Pnonmon}
For $n$ sufficiently large and $t \geq 4$,
with probability at least $1 - \dpusat - \dshared - \dmany$
over a random draw of $f$ from $\Dtk,$ we have that:
(i) if $v_1$ and $v_2$ do not co-occur in any term of $f$ then
$P \leq \alpha^2/8 t$;
(ii) if $v_1$ and $v_2$ do co-occur in some term of $f$ and
exactly one of $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$ is nonempty,
then
  $P \geq {3\alpha^2}/ {16 t}$.
\end{lemma}
\begin{proof}
With probability at least $1 - \dpusat - \dshared - \dmany$
a randomly chosen $f$ from $\Dtk$ will have all of the
following properties:
\begin{enumerate}
 \item Each term in $f$ is uniquely satisfied with probability at
   least $\alpha/2^{k+1}$ (by \lemref{lem:uniquesatnon});
 \item Each variable in $f$ appears in at most
$2^{k-1}\alpha^2/(\sqrt{t}\log t)$
terms (by \lemref{lem:vjNotInManyTerms});
and
 \item \begin{sloppypar}
 Each pair of terms $T_i$ and $T_j$ in $f$ are both satisfied
   with probability at most $\log t/2^{2k}$ (by \lemref{lem:sattwonon}).
   \end{sloppypar}
\end{enumerate}

For the sequel assume that we have such an $f$.
We first prove (i) by showing that $P$---as represented by
(\ref{eq:quant2}) of \lemref{lemma:empty}---is at most
${\alpha^4}/(t \log t)$.
By property~3 above, for any pair of terms consisting of
one term from $g_{1*}$ and the other from $g_{*0}$, the probability
that both terms are satisfied is at most $\log t/2^{2k}$.  Since
each of $g_{1*}$ and $g_{*0}$ contains at most
$2^{k-1}\alpha^2/(\sqrt{t}\log t)$ terms
by property~2, a union bound gives
\[
\Pr[g_{1*} \wedge g_{*0} \wedge \text{(no other $g_{\cdot,\cdot}$)}]
\leq
\Pr[g_{1*} \wedge g_{*0}] \leq {\frac {\alpha^4}{4 t \log t}}\enspace.
\]
A similar bound holds for
$\Pr[g_{0*} \wedge g_{*1} \wedge (\text{no other $g_{\cdot, \cdot}$})]$,
which is the only other positive
summand in (\ref{eq:quant2}),
so $P$
is certainly at most ${\alpha^4}/(t \log t)$.
This is at most $\alpha^2/8 t$
since $\alpha \leq 1/2$ and $t\geq 4$.

We now prove (ii).  By an argument similar to the above we have
that the first six summands (not including (\ref{eq:quant2})
in the expression of \lemref{lemma:mess},
namely
$\Pr[g_{11} \wedge g_{1*} \wedge g_{*0} \wedge
(\text{no other $g_{\cdot,\cdot}$})]$ through
$\Pr[g_{11} \wedge g_{*0} \wedge (\text{no other $g_{\cdot,\cdot}$})]$,
are each at most $\alpha^4/(4 t \log t)$
in magnitude.  Now observe that each instance $x$ that uniquely
satisfies a term $T_j$ in $f$ containing both $v_1$ unnegated and $v_2$ unnegated
must satisfy $g_{11}$ and no other $g_{\cdot,\cdot}$.
Thus under the conditions of (ii)
the last summand in \lemref{lemma:mess}, namely
$\Pr[g_{11} \wedge (\mbox{~no other~}g_{\cdot,\cdot})]$,
is at least
$\alpha / 2^{k+1}$ by property~1 above, so we have that (ii) is at least
$$
{\frac \alpha {2^{k+1}}} - {\frac 5 2} {\frac {\alpha^4}{t \log t}}\enspace.
$$
(Here the ${5 \alpha^4}/(2 t \log t)$
comes from the ten summands -- four from
(\ref{eq:quant2}) and six from the first six summands of
\lemref{lemma:mess} -- each of which contributes at most
$\alpha^4/ (4t \log t)$ in magnitude.)
Since $(k,t)$ is $\alpha$-interesting we have
$t/{2^k} \geq \alpha,$ and from this
and the constant bounds on $\alpha$ and $t$
it is easily shown that
\[
{\frac \alpha {2^{k+1}}} \geq {\frac {\alpha^2} {2t}}
\qquad \text{and} \qquad
{\frac 5 2} {\frac {\alpha^4}{t \log t}} \leq
  {\frac {5 \alpha^2} {16 t}}\enspace,
  \]
from which the lemma follows.
\end{proof}

It is clear that an analogue of \lemref{lemma:Pnonmon}
holds for any pair of variables $v_i,v_j$ in place of
$v_1,v_2$.
Thus, for each pair of variables $v_i,v_j,$
if we decide whether $v_i$ and $v_j$ co-occur (negated or otherwise) in any
term on the basis of whether $P_{ij}$ is large or small,
we will err only if two or more of $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$ are
nonempty.


We now show that for $f \in \Dtk,$ with very high probability there
are not too many pairs of variables $(v_i,v_j)$ which co-occur (with
any sign pattern) in at least two terms of $f$. Note that this
immediately bounds the number of pairs $(v_i,v_j)$ which have two or
more of the corresponding $g_{00}$, $g_{01}$, $g_{10}$, and $g_{11}$ nonempty.


\begin{lemma} \label{lem:vivjpair}
Let $d> 0$ and $f \in \Dtk$.
The probability that more than $(d+1)t^2k^4/n^2$
pairs of variables $(v_i,v_j)$ each co-occur in two or more terms of $f$
is at most $\exp(-d^2 t^3k^4/n^4)$.
\end{lemma}
\begin{proof}
We use McDiarmid's inequality, where the random variables are
the terms $T_1,\dots,T_t$ chosen independently from the set of all
possible terms of length $k$ and $F(T_1,\dots,T_t)$ denotes the
number of pairs of variables $(v_i,v_j)$ that co-occur in at least
two terms. For each $\ell=1,\dots,t$ we have 
\[
\Pr[\text{$T_\ell$ contains
both $v_1$ and $v_2$}] \leq {\frac {k^2}{n^2}}\enspace,
\]
so by a union bound
we have 
\[
\Pr[\text{$f$ contains at least two terms which contain both $v_1$
and $v_2$ in any form}] \leq {\frac {t^2 k^4}{n^4}}\enspace.
\]
By linearity
of expectation we have $\mu=\E[F] \leq t^2 k^4/ n^2$.
Since each term involves at most $k^2$ pairs of co-occurring
variables, we have
\[
|F(T_1,\dots,T_t) -
F(T_1,\dots,T_{i-1},T'_i,T_{i+1},\dots,T_t)| \leq c_i = k^2\enspace.
\]
We
thus have by McDiarmid's inequality that $ \Pr[F \geq t^2k^4/n^2 +
\tau] \leq \exp(-\tau^2/(tk^4)). $ Taking $ \tau = d t^2k^4 / n^2,$
we have $\Pr[F \geq (d + 1)t^2k^4/n^x2] \leq \exp(-d^2 t^3k^4/n^4)$.
\end{proof}


Taking $d = n^2/(t^{5/4} k^4)$ in the above lemma (note that $d>1$
for $n$ sufficiently large since $t^{5/4} = O(n^{15/8})$), we have
$(d+1)t^2 k^4/n^2 \leq 2t^{3/4}$ and the failure probability is at
most $\exp(-\sqrt{t}/k^4)$ (we henceforth write $\dcooccur$ to
denote this quantity $\exp(-\sqrt{t}/k^4)$). The results of this
section (together with a standard analysis of error in estimating
each $P_{ij}$) thus yield:

\begin{theorem} \label{thm:pairsnon}
For $n$ sufficiently large and for any
$\delta > 0$, with probability at least $1-\dcooccur - \dpusat - \dshared - \dmany - \delta$
over the random draw
of $f$ from $\Dtk$ and the choice of random examples,
the above algorithm runs in $O(n^2 t^2 \log(n/\delta))$ time and outputs a list of
pairs of variables $(v_i,v_j)$ such that: (i) if $(v_i,v_j)$ is in the
list then $v_i$ and $v_j$ co-occur in some term of $f$; and (ii)
at most $N_0 = 2t^{3/4}$
pairs of variables $(v_i,v_j)$ which do co-occur in $f$ are not on the list.
\end{theorem}

\subsection{Reconstructing an accurate DNF hypothesis} \label{sec:reconstructnon}

It remains to construct a good hypothesis for the target
DNF from a list of pairwise co-occurrence relationships
as provided by \thmref{thm:pairsnon}.
As in the monotone case, we consider the
graph $G$ with vertices $v_1,\dots,v_n$ and
edges for precisely those pairs of variables $(v_i,v_j)$ which co-occur
(with any sign pattern)
in some term of $f$.  As before this graph is a union of $t$
randomly chosen $k$-cliques $S_1,\dots,S_t$
which correspond to the $t$ terms in $f$, and as before
we would like to find a set of $k$-cliques in $G$
that contains the $t$ $k$-cliques corresponding to the terms of $f$
as a subset.
However, there are two differences now:
the first is that instead of having the true graph $G$, we instead
have access only to a graph $G'$ which is formed from $G$ by deleting
some set of
at most $N_0=2t^{3/4}$ edges.  The second difference is that the final
hypothesis must take the signs of literals in each term into account.
To handle these two differences, we use a different
reconstruction procedure than we used for monotone DNF in \secref{sec:reconmon};
this reconstruction procedure only
works for $t = O(n^{3/2 - \gamma})$ where $\gamma > 0$.

We first show how to identify (with high probability over the choice of $f$)
the set of {\em all} $k$-cliques in $G'$; clearly, the $k$-cliques
corresponding to terms in $f$ are a subset of this set.
We then show how to form a DNF hypothesis
from the set of all $k$-cliques in $G'$.

We now describe an algorithm which, for $t = O(n^{3/2 - \gamma})$ with $\gamma > 0,$
with high probability runs in polynomial time and
identifies
all the $k$-cliques in $G'$ which contain edge $(v_1,v_2)$.
Since $G'$ has at most $tk^2$ edges, running the algorithm at most
$t k^2$ times on all edges in $G'$
will give us with high probability all the $k$-cliques in $G'$.
The algorithm is:
\begin{itemize}
\item Let $\Delta$ be the set of vertices $v_j$ such that $v_1$, $v_2$, and $v_j$
form a triangle in $G'$.  Run a brute-force algorithm to find
all $(k - 2)$-cliques in the subgraph induced by $\Delta$
(this is the subgraph of $G'$ whose vertices are the vertices
of $\Delta$, and whose edges are the edges that are present
in $G'$ between vertices of $\Delta$).
\end{itemize}

It is clear that the algorithm finds every $k$-clique which contains
edge $(v_1,v_2)$.  To bound the algorithm's running time, it
suffices to give a high probability bound on the size of $\Delta$ in
the graph $G$ (clearly $\Delta$ only shrinks in passing from $G$ to
$G'$).  The following lemma  gives such a bound:

\begin{lemma} \label{lem:brute}
Let $G$ be a random graph as described above.  For any $t = O(n^{3/2
- \gamma})$ and any $C>0$ we have that with probability $1 - O\left(
{\frac {\log^{6C} n} {n^{2 \gamma C}}}\right)$ the size of $\Delta$
in $G$ is at most $Ck$.
\end{lemma}
\begin{proof}
In order for $v_1$, $v_2$, and $v_j$ to form a triangle in $G$, it must be the
case that either (i) some clique $S_i$ contains $\{1,2,j\}$; or (ii)
there is some pair of cliques $S_a,S_b$ with $2 \notin S_a$ and
$\{1,j\} \subset S_a$ and $1 \notin S_b$ and $\{2,j\} \subset S_b$.

For (i), we have from \lemref{lem:fewpairs} that $v_1$ and $v_2$
co-occur in more than $C$ terms with probability at most $(tk^2/n^2)^C$.  Since each term in which $v_1$ and $v_2$ co-occur
contributes at most $k-2$ vertices $v_j$ to condition (i), the
probability that more than $C(k - 2)$ vertices $v_j$ satisfy
condition (i) is at most $(t k^2/n^2)^C =
O(1/n^{C/2})$.

For (ii), let $A$ be the set of those indices $a \in \{1,\dots,t\}$
such that $2 \notin S_a$ and $1\in S_a$, and let $S_A$ be $\cup_{a
\in A}S_a$.  Similarly let $B$ be the set of indices $b$ such that
$1 \notin S_b$ and $2 \in S_b$, and let $S_B$ be $\cup_{b \in
B}S_b$. It is clear that $A$ and $B$ are disjoint. For each
$\ell=1,\dots,t$ we have that $\ell \in A$ independently with
probability at most $p =  k/n$, so $E[|A|] \leq tk/n$.  We
now consider two cases:

\noindent
{\bf Case 1:}  $t \leq n/\log n$.  In this case we may take $\beta =
n \log n/ (t k)$ in the Chernoff bound, and we have that $
\Pr[|A| \geq \beta p t]$ equals
$$
\Pr[|A| \geq \log n] \leq \left( {\frac e \beta} \right)^{\beta p t}
\leq \left( {\frac {ek}{\log^2 n}} \right)^{\log n} = \left({\frac
{e} {\Omega(\log n)}}\right)^{\log n} = {\frac 1 {n^{\omega(1)}}}\enspace.
$$
The same bound clearly holds for $B$.  Note that in Case 1 we thus
have $|S_A|,|S_B| \leq k\log n$ with probability $1 -
1/n^{\omega(1)}$.

\noindent {\bf Case 2:}  $t > n/\log n$.  In this case we may take
$\beta = \log n$ in the Chernoff bound and we obtain
$$
\Pr[|A| \geq \beta p t] = \Pr\left[|A| \geq {\frac {t k \log n}{n}}\right] \leq
\left( {\frac e {\log n}} \right)^{kt(\log n)/n} < \left({\frac e
{\log n}}\right)^{k} = {\frac 1 {n^{\omega(1)}}}$$ where the last
inequality holds since $k = \Omega(\log n)$ (since $t > n/\log n$
and $(k,t)$ is $\alpha$-interesting). In Case 2 we thus have
$|S_A|,|S_B| \leq (t k^2 \log n)/n$ with probability $1 -
1/n^{\omega(1)}$.


Let $S'_A$ denote $S_A - \{1\}$ and $S'_B$ denote $S_B - \{2\}$.
Since $A$ and $B$ are disjoint, it is easily seen that conditioned
on $S'_A$ being of some particular size $s'_A$, all $\binom{n-2}{s'_A}$ $s'_A$-element subsets of $\{3,\dots,n\}$ are equally likely
for $S'_A$. Likewise, conditioned on $S'_B$ being of size $s'_B$,
all $\binom{n - 2}{s'_B}$ $s'_B$-element subsets of $\{3,\dots,n\}$
are equally likely for $S'_B$.  Thus, the probability that $|S'_A
\cap S'_B| \geq C$ is at most
\begin{equation}
\binom{s'_B}{C} \left({\frac {s'_A} {n-2}}\right)^C \leq
\left({\frac {s'_A s'_B} {n-2}}\right)^C \leq \left({\frac {2s'_A
s'_B} {n}}\right)^C \label{eq:sasb}
\end{equation}
(since the expression on the left is an upper bound on the
probability that any collection of $C$ elements in $S'_B$ all
coincide with elements of $S'_A$).

In Case 1 ($t \leq n/\log n$) we may assume that $s'_A, s'_B$ are
each at most $k \log n$ (recall from above that this holds with
probability $1 - n^{-\omega(1)}$), and thus \eqref{eq:sasb} is at
most $[(2k^2 \log^2 n)/ n]^{C}$. In Case 2 ($t > n/\log n$)
we may assume that $s'_A, s'_B \leq (t k^2 \log n)/n$ (here
too from above we have that this holds with probability $1 -
n^{-\omega(1)}$) and thus \eqref{eq:sasb} is at most
\[
\left({\frac {2t^2
k^4 \log^2 n}{n^3}}\right)^C = O\left( {\frac {\log^{6C} n} {n^{2 \gamma C}}}\right)\enspace.
\]

Thus all in all, we have that except with probability $O(1/n^{C/2})$
event (i) contributes at most $C(k - 2)$ vertices $v_j$ such that
$\{1,2,j\}$ forms a triangle, and except with probability $O\left(
{\frac {\log^{6C} n} {n^{2 \gamma C}}}\right)$ event (ii)
contributes at most $C$ vertices $v_j$ such that $\{1,2,j\}$ forms a
triangle.  This proves the lemma.
\end{proof}




By \lemref{lem:brute},
doing a brute-force search which finds
all $(k-2)$-cliques in the graph induced by $\Delta$ takes at most
$
\binom{C k}{k} \leq \left({\frac {e C k}{k}}\right)^{k} = (eC)^{O(\log n)}
= n^{O(\log C)}
$
time steps.
Thus we can efficiently with high probability identify all the
$k$-cliques in $G'$.
How many of the ``true'' cliques $S_1,\dots,S_t$ in $G$ are not present
as $k$-cliques in $G'$?  By \lemref{lem:fewpairs}, with probability
at least $1 - t^2 (tk^2/n^2)^C$ each edge $(v_i,v_j)$ participates
in at most $C$ cliques from $S_1,\dots,S_t$.  Since $G'$ is missing
at most $N_0$ edges from $G'$, with probability at least $1 - t^2 (tk^2/n^2)^C$
the set of all $k$-cliques in $G'$ is missing at most $CN_0$ ``true''
cliques from $S_1,\dots,S_t$.

Summarizing the results of this section so far, we have:

\begin{theorem} \label{thm:cliques}  Fix $C \geq 2$.
Given a DNF formula $f$ drawn from $\Dtk$ and a list of pairs of
co-occurring variables as described in \thmref{thm:pairsnon},
with probability at least $1 - 1/n^{\Omega(C)}$ the above procedure runs
in $n^{O(\log C)}$ time and constructs a
a list $Z_1,\dots,Z_{N'}$ (where $N' =n^{O(\log C)}$) of $k$-cliques
which contains all but at most $CN_0$ of the cliques
$S_1,\dots,S_t$.
\end{theorem}

We construct a hypothesis DNF from the list $Z_1,\dots,Z_{N'}$ of
candidate $k$-cliques as follows:  for each $Z_i$ we form
all $2^k$ possible terms which could have given rise to $Z_i$
(corresponding to all $2^k$ sign patterns on the $k$ variables
in $Z_i$).   We then test each of these $2^k N'$ potential terms against
a sample of $M$ randomly drawn negative examples and discard any terms which output $1$
on any negative example; the final hypothesis $h$ is the OR
of all surviving terms.  Any candidate term
$T'$ which has
\[
\Pr_{x \in U_n}[T'(x) = 1 \wedge \ f(x) = 0]
\geq {\frac \eps {2^{k+1} N'}}
\]
will survive this test with probability at most
$\exp(-\eps M / (2^{k+1} N'))$.  Taking $\eps = 1/2^k$ and
\[
M = \frac{2^{k+1}N' \log^2 n}{\eps}
\]
we have that with probability $1 - 1/n^{\omega(1)}$ each term in the final
hypothesis contributes at most $\eps/(2^{k+1}N')$ toward the false positive
rate of $h$, so with high probability the false positive rate of
$h$ is at most $\eps = 1/2^k$.

The false negative rate of $h$ is at most ${\frac 1 {2^k}}$ times
the number of terms in $f$ which are missing in $h$.  Since the
above algorithm clearly will not discard any term in $f$ (since such
a term will never cause a false negative mistake), we need only
bound the number of terms in $f$ which are not among our $2^k N'$
candidates. With probability at least  $1 - \dclique := 1 - t^2/{\binom{n}{k}}$, each true clique $S_1,\dots,S_t$ in $G$ gives rise to
exactly one term of $f$ (the only way this does not happen is if two
terms consist of literals over the exact same set of $k$ variables,
and the probability that this occurs is at most $t^2/\binom{n}{k}$), so \thmref{thm:cliques} implies that $h$ is missing at
most $CN_0$ terms of $f$.  Thus the false negative rate is at most
\[
\frac{CN_0}{2^k} \leq \frac{2Ct^{3/4}}{2^k} = \frac{1}{\Omega(t^{1/4})}\enspace.
\]

\medskip

All in all the following is our main learning result for non-monotone DNF:

\begin{theorem} \label{thm:mainnonmon}
Fix $\gamma$, $\alpha > 0$ and $C \geq 2$.
  Let $(k,t)$ be a monotone
  $\alpha$-interesting pair.
For $f$ randomly chosen from $\Dtk$, with probability at least
$1 - \dcooccur - \dpusat - \dshared - \dmany -  \dclique  - 1/n^{\Omega(C)}$ 
the above algorithm
runs in $\tilde{O}(n^2t^2 + n^{O(\log C)})$ time and
outputs a hypothesis $h$ whose error rate relative to 
$f$ under the uniform distribution
is at most $1/\Omega(t^{1/4})$.
\end{theorem}
It can be verified from the definitions of the various $\delta$'s that
for any $t = \omega(1)$ as a function of $n$, the failure
probability is $o(1)$ and the accuracy is
$1-o(1)$.

\section{Future work} \label{sec:discuss}

We can currently
only learn random DNFs with $o(n^{3/2})$ terms ($o(n^2)$ terms for monotone
DNF); can stronger results be obtained which hold for all polynomial-size
DNF?
A natural approach here for learning $n^c$-term DNF
might be to first try to identify all $c'$-tuples of variables which co-occur
in a term, where $c'$ is some constant larger than $c$.
Also, our current results for $t = \omega(1)$-term DNF let us learn to
some $1 - o(1)$ accuracy
but we cannot yet achieve an arbitrary inverse polynomial error rate for
non-monotone DNF.
Finally, another interesting direction is to explore other natural models
of random DNF formulas, perhaps by allowing some variation among term sizes or
dependencies between terms.

\medskip

\noindent {\bf Acknowledgement.} Avrim Blum suggested to one of us
(JCJ) the basic strategy that learning monotone DNF with respect to
uniform might be reducible to finding the co-occurring pairs of
variables in the target function.   We thank the anonymous referees for
helpful suggestions and corrections.
This material is based upon work
supported by the National Science Foundation under Grant No.\
CCR-0209064 (JCJ) and CCF-0347282 (RAS).

\bibliography{a008}


\begin{tocauthors}
\begin{tocinfo}[jackson]
Jeffrey C. Jackson\\
Department of Mathematics and Computer Science\\
Duquesne University\\
Pittsburgh, PA 15282, USA\\
\texttt{jacksonj\tocat duq\tocdot edu}\\
\href{http://www.mathcs.duq.edu/~jackson}{\url{http://www.mathcs.duq.edu/~jackson}}
\end{tocinfo}
\begin{tocinfo}[servedio]
Rocco A. Servedio\\
Department of Computer Science\\
Columbia University\\
New York, NY 10027, USA\\
\texttt{rocco\tocat cs\tocdot columbia\tocdot edu}\\
\href{http://www.cs.columbia.edu/~rocco}{\url{http://www.cs.columbia.edu/~rocco}}
\end{tocinfo}
\end{tocauthors}

\begin{tocaboutauthors}
  \begin{tocabout}[jackson] \textsc{Jeffrey C. Jackson}
    has a distinctive educational background, having received his
    B.\,S.\ from \href{http://www.oru.edu}{Oral Roberts University} and
    his Ph.\,D.\ from \href{http://www.cmu.edu}{Carnegie Mellon}, where
    \href{http://www.cc.gatech.edu/component/option,com_peopledb/task,view/contact_id,285069031/}{Merrick
      Furst} was his advisor.  He has been a member of the faculty of
    \href{http://www.duq.edu}{Duquesne University} since 1995, where
    he is currently chair of the
    \href{http://www.mathcs.duq.edu}{Department of Mathematics and
      Computer Science}.  Jeff has also been a software engineer and
    manager in both the aerospace and dot-com industries and is the
    author of the textbook
    \href{http://vig.prenhall.com/catalog/academic/product/0,1144,0131856030,00.html}{Web
      Technologies: A Science Computer Perspective}.  He is the proud
    father of four children (think about his last name for a moment
    and you'll know why he and his wife didn't stop at three).
  \end{tocabout}
  \begin{tocabout}[servedio] \textsc{Rocco A. Servedio} received his B.\,S.,
M.\,S. and Ph.\,D. from \href{http://www.harvard.edu}{Harvard
University}, where his Ph.\,D. was supervised by
\href{http://people.deas.harvard.edu/~valiant}{Leslie Valiant}. For
a change of pace, he then held an NSF postdoc at
\href{http://www.harvard.edu}{Harvard University}, where he was
supervised by \href{http://people.deas.harvard.edu/~valiant}{Leslie
Valiant}. Since 2003 he has been an assistant professor at
\href{http://www.columbia.edu}{Columbia University} in the
\href{http://www.cs.columbia.edu}{Department of Computer Science} He
is interested in computational learning theory and computational
complexity, and has received the NSF Career award and a Sloan
Foundation Fellowship. He enjoys spending time with his family and
hopes to have dinner with
\href{http://en.wikipedia.org/wiki/Herman_Melville}{Herman Melville}
in the afterlife.
\end{tocabout}
\end{tocaboutauthors}

\end{document}
