### abstract ###
We study the problem of partitioning a small sample of SYMBOL individuals, drawn from a mixture of SYMBOL product distributions over a Boolean cube SYMBOL, according to their distributions of origin.
Each distribution is described by a vector of allele frequencies in SYMBOL.
Given two distributions, we use SYMBOL to denote the average SYMBOL distance in frequencies across SYMBOL dimensions, which measures the statistical divergence between them.
We assume that bits are distributed independently across the SYMBOL dimensions.
We show that, for a balanced input instance of SYMBOL, a certain graph-based optimization function returns the correct partition with high probability, provided that SYMBOL and SYMBOL; here a weighted graph SYMBOL is formed over the SYMBOL individuals, with edge weights given by the pairwise Hamming distances between their bit vectors.
The function computes a maximum-weight balanced cut of SYMBOL, where the weight of a cut is the sum of the weights of the edges crossing the cut.
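The maximum-weight balanced cut above can be made concrete with a small sketch: for a handful of bit vectors, the cut can be found by brute force over all balanced splits. The function names and the exhaustive search below are illustrative assumptions, not the paper's algorithm, which must scale beyond brute force.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance between two equal-length bit vectors."""
    return sum(a != b for a, b in zip(u, v))

def max_weight_balanced_cut(vectors):
    """Brute-force the maximum-weight balanced cut of the complete graph
    whose edge weights are pairwise Hamming distances.
    Feasible only for very small samples (time exponential in n)."""
    n = len(vectors)
    assert n % 2 == 0, "a balanced cut needs an even number of points"
    w = {(i, j): hamming(vectors[i], vectors[j])
         for i, j in combinations(range(n), 2)}
    best_cut, best_side = -1, None
    for side in combinations(range(n), n // 2):
        s = set(side)
        # Weight of the cut = total weight of edges with one endpoint per side.
        cut = sum(v for (i, j), v in w.items() if (i in s) != (j in s))
        if cut > best_cut:
            best_cut, best_side = cut, s
    return best_side, best_cut
```

On two well-separated groups of bit vectors, the heaviest balanced cut places each group on its own side, since cross-group edges carry the largest Hamming distances.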
This result highlights an appealing property of the high-dimensional feature space: the number of features required can be traded off against the size of the sample when carrying out tasks such as clustering.
### introduction ###
We explore a type of classification problem that arises in computational biology.
We are given a small sample of size SYMBOL, e.g., the DNA of SYMBOL individuals, each described by the values of SYMBOL features or markers, e.g., SNPs (single nucleotide polymorphisms), where SYMBOL.
The frequencies of the features differ slightly depending on the population to which an individual belongs, and the features are assumed to be independent of one another.
Given the population of origin of an individual, the genotype (represented in this paper as a bit vector) can reasonably be assumed to be generated by drawing alleles independently from the appropriate distribution.
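The generative model just described is easy to simulate. The sketch below (with hypothetical function names) draws each bit independently with the population's allele frequency at that dimension:

```python
import random

def draw_genotype(freqs, rng=random):
    """Draw one bit vector: bit k is 1 with the population's allele
    frequency at dimension k, independently across dimensions."""
    return [1 if rng.random() < p else 0 for p in freqs]

def sample_mixture(pop_freqs, sizes, rng=random):
    """Draw sizes[t] genotypes from population t's frequency vector;
    return the (normally hidden) labels alongside the bit vectors."""
    labels, vectors = [], []
    for t, (freqs, n_t) in enumerate(zip(pop_freqs, sizes)):
        for _ in range(n_t):
            labels.append(t)
            vectors.append(draw_genotype(freqs, rng))
    return labels, vectors
```

In the clustering problem the labels are withheld from the algorithm, which sees only the mixture of bit vectors.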
The objective we consider is to minimize the number of features SYMBOL, and thus the total data size SYMBOL, needed to correctly classify the individuals in the sample according to their population of origin, for any given SYMBOL.
We describe SYMBOL and SYMBOL as functions of the ``average quality'' SYMBOL of the features.
Throughout the paper, we use SYMBOL and SYMBOL as shorthand for SYMBOL and SYMBOL, respectively.
We first describe the general mixture model used in this paper; the same model was previously used in~CITATION and~CITATION.
Statistical model: We have SYMBOL probability spaces SYMBOL over the set SYMBOL.
Further, the components (features) of SYMBOL are independent and SYMBOL (SYMBOL, SYMBOL).
Hence, the probability spaces SYMBOL comprise the distributions of the features for each of the SYMBOL populations.
The input to the algorithm consists of a collection (mixture) of SYMBOL unlabeled samples, SYMBOL points drawn from SYMBOL, and the algorithm must determine, for each data point, from which of SYMBOL it was drawn.
In general we do not assume that SYMBOL are revealed to the algorithm, but we do require some bounds on their relative sizes.
An important parameter of the probability ensemble SYMBOL is the measure of divergence between any two distributions.
Note that SYMBOL provides a lower bound on the Euclidean distance between the means of any two distributions, and thus represents their separation.
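As a concrete illustration, if the divergence between two distributions is read as the average absolute difference of their frequency vectors (one plausible instantiation of the measure above; the paper's exact definition is hidden behind the placeholders), it can be computed directly:

```python
def divergence(p, q):
    """Average absolute difference in allele frequencies across dimensions.
    One plausible reading of the paper's divergence measure, assumed here
    for illustration only."""
    assert len(p) == len(q), "frequency vectors must have equal dimension"
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

Under this reading, two populations whose frequencies differ by 0.2 at every marker have divergence 0.2 regardless of the number of dimensions.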
Further, let SYMBOL (so if the populations were balanced, we would have SYMBOL of each type).
This paper proves the following theorem, which gives a sufficient condition for a balanced (SYMBOL) input instance when SYMBOL.
Variants of the above theorem, based on a model that allows two random draws at each dimension for all points, are given in~CITATION and~CITATION.
The key idea there is the construction of a diploid score at each dimension, for any pair of individuals, under the assumption that two random bits can be drawn from the same distribution at each dimension.
In expectation, summed across all SYMBOL dimensions, diploid scores are higher for pairs from different groups than for pairs from the same group.
In addition,~CITATION shows that when SYMBOL, given two bits at each dimension, one can always classify for any size of SYMBOL, in unbalanced cases with any number of mixture components, using essentially connected-component-based algorithms on the weighted graph described in Theorem~.
The key contribution of this paper is a set of new ideas that accomplish the goal of clustering with the same number of features while requiring only one random bit at each dimension.
While some of the ideas and proofs for Theorem~ in Section~ have appeared in~CITATION, modifications for handling a single bit at each dimension are needed throughout the argument.
We therefore include the complete proof in this paper to give a self-contained exposition.
Finding a max-cut is computationally intractable; a hill-climbing algorithm was given in~CITATION to partition a balanced mixture, under a stronger requirement on SYMBOL, for any given SYMBOL, as the middle green curve in Figure~ shows.
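A generic local-search sketch conveys the flavor of hill climbing on this objective; it is an illustrative assumption, not the cited algorithm, whose details are not reproduced here. Starting from a random balanced split, it repeatedly applies the best single cross-side swap while the cut weight improves:

```python
import random
from itertools import combinations

def hill_climb_balanced_cut(vectors, rng=random):
    """Steepest-ascent local search for a heavy balanced cut.
    Swaps one vertex from each side of the split while the cut weight
    improves; may stop at a local optimum on larger instances."""
    n = len(vectors)
    # Edge weights: pairwise Hamming distances between bit vectors.
    w = {(i, j): sum(a != b for a, b in zip(vectors[i], vectors[j]))
         for i, j in combinations(range(n), 2)}

    def cut(side):
        return sum(v for (i, j), v in w.items() if (i in side) != (j in side))

    side = set(rng.sample(range(n), n // 2))  # random balanced start
    best = cut(side)
    while True:
        best_swap = None
        for i in side:
            for j in set(range(n)) - side:
                cand = (side - {i}) | {j}  # swap i out, j in: stays balanced
                c = cut(cand)
                if c > best:
                    best, best_swap = c, cand
        if best_swap is None:
            return side, best
        side = best_swap
```

Each pass costs SYMBOL cut evaluations, so this is only a sketch of the search dynamics, not a statement about the cited algorithm's guarantees.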
Two simpler algorithms using spectral techniques were constructed in~CITATION, attempting to reproduce the conditions above.
Both spectral algorithms in~CITATION achieve the bound established by Theorem~ without requiring the input instance to be balanced, and work when SYMBOL is a constant; however, they require SYMBOL even when SYMBOL and the input instance is balanced, as the vertical line in Figure~ shows.
Note that when SYMBOL, i.e., when we have enough samples from each distribution, SYMBOL becomes the only requirement in Theorem~.
Exploring the tradeoff between SYMBOL and SYMBOL when SYMBOL is small, as in Theorem~, is of both theoretical interest and practical value in algorithm design.
