Sparse Binary Polynomial Hashing

Sparse binary polynomial hashing is a locality-sensitive hashing technique with Bayesian motivation that uses sparse random polynomial features to map high-dimensional data into compact binary codes for efficient similarity search.

h(x) = sign(Σ_{S ∈ 𝒮} Π_{i ∈ S} x_i)

Sparse binary polynomial hashing (SBPH) emerges from the intersection of hashing, random feature methods, and Bayesian reasoning about text. Originally developed in the context of spam filtering — where it powered the CRM114 discriminator — SBPH creates hash features by taking products of sparse subsets of input features, generating a rich nonlinear representation that captures higher-order interactions while remaining computationally tractable. Its Bayesian motivation arises from treating feature co-occurrences as evidence about class membership, naturally extending the naive Bayes framework to capture feature dependencies.

Mechanism and Construction

The core idea of SBPH is to create hash features from polynomial combinations of the original input features. Given a binary feature vector x ∈ {0, 1}^d, a polynomial hash feature is defined by a subset S of feature indices, and its value is the product of the corresponding input features:

Polynomial feature: φ_S(x) = Π_{i ∈ S} x_i

This product is 1 if and only if all features in the subset are simultaneously active, capturing a specific co-occurrence pattern. The "sparse" qualifier refers to two properties: only a sparse random selection of subsets is used (rather than all possible subsets), and each subset itself is typically small (involving only 2-5 features).

The final hash is constructed by combining multiple such polynomial features into a fixed-size hash table using a conventional hash function to map each subset's identity to a table position:

SBPH hash construction: h(x) = Σ_{S ∈ 𝒮} φ_S(x) · e_{hash(S)}

where e_{hash(S)} is the indicator vector for the hash bucket assigned to subset S, and the sum accumulates counts across all active polynomial features.
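
The construction can be made concrete with a short sketch. The Python below is illustrative rather than CRM114's exact scheme: a sliding window moves over the token stream, every combination of up to max_order tokens in the window that includes the newest token defines one conjunction φ_S, and a CRC32 hash of the conjunction picks its bucket. The window, order, and table sizes are arbitrary defaults chosen for the example.

from itertools import combinations
from zlib import crc32

def sbph_features(tokens, window=5, max_order=3, table_size=2**20):
    # Returns a dict mapping hash-table buckets to counts of active polynomial features.
    counts = {}
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older = tokens[start:end]        # earlier tokens inside the window
        newest = tokens[end]             # the token just reached
        for order in range(max_order):   # conjunctions of 1 .. max_order tokens
            for subset in combinations(older, order):
                feature = subset + (newest,)                              # the conjunction phi_S
                bucket = crc32(" ".join(feature).encode()) % table_size  # hash(S) -> table position
                counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# The bigram ("not", "good") hashes to a different bucket than the unigram ("good",),
# so the two patterns accumulate evidence separately.
print(sbph_features("this is not good".split(), window=3, max_order=2))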

Bayesian Motivation

The connection to Bayesian classification is direct. In naive Bayes, each feature contributes independently to the log-posterior. SBPH extends this by allowing conjunctions of features to contribute, effectively relaxing the independence assumption without fully abandoning the tractability of the naive Bayes framework.

Consider classifying a text document. Naive Bayes treats each word independently, but the phrase "not good" carries very different information than "not" and "good" considered separately. SBPH captures this by creating features for pairs (and higher-order tuples) of words that appear in proximity, allowing the classifier to learn that "not good" is evidence for negative sentiment even though "good" alone is evidence for positive sentiment.
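To show how the hashed conjunctions feed such a classifier, here is a minimal scoring sketch in the spirit of Bernoulli naive Bayes. It assumes per-class dictionaries (spam_counts, ham_counts) recording how many training documents of each class activated each bucket, document totals n_spam and n_ham, and Laplace smoothing with pseudocount alpha; these names and the exact weighting are illustrative assumptions, not CRM114's precise formula.

import math

def log_odds(doc_buckets, spam_counts, ham_counts, n_spam, n_ham, alpha=1.0):
    # Naive-Bayes-style score over hashed conjunction features:
    # each active bucket contributes one log likelihood ratio; > 0 favours spam.
    score = math.log((n_spam + alpha) / (n_ham + alpha))             # log prior odds
    for bucket in set(doc_buckets):
        p_spam = (spam_counts.get(bucket, 0) + alpha) / (n_spam + 2 * alpha)
        p_ham = (ham_counts.get(bucket, 0) + alpha) / (n_ham + 2 * alpha)
        score += math.log(p_spam / p_ham)                            # evidence from one conjunction
    return score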

SBPH in CRM114

The CRM114 Discriminator, created by Bill Yerazunis around 2003, was one of the most accurate spam filters of its era. It used SBPH to create features from sequences of tokens in email messages, hashing polynomial combinations of nearby words into a fixed-size feature space. Classification was then performed using a Bayesian framework over these hashed features. CRM114 achieved spam filtering accuracy exceeding 99.9% in controlled tests, demonstrating that SBPH's ability to capture local context gave it a significant edge over pure unigram naive Bayes approaches.

Connection to Random Feature Methods

SBPH can be understood within the broader framework of random feature approximations to kernel methods. The polynomial features implicitly define a kernel — the polynomial kernel — and the sparse random selection of subsets constitutes a random sampling approximation to this kernel. The celebrated work of Rahimi and Recht (2007) on random Fourier features for approximating shift-invariant kernels follows a similar philosophical approach: replace the full kernel with a manageable random subset of features.

The sparsity of the polynomial subsets serves a dual purpose. Computationally, it makes the feature generation tractable — considering all subsets of size k from d features would require O(d^k) features, which quickly becomes infeasible. Statistically, sparsity acts as a regularizer: by considering only a random subset of possible interactions, the method avoids overfitting to spurious high-order correlations in the training data.
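
The sampling view can be checked numerically. The sketch below estimates the degree-k conjunction kernel K_k(x, y) = Σ_{|S| = k} Π_{i ∈ S} x_i y_i by drawing random size-k subsets and rescaling the sample mean; for binary vectors the exact value is C(m, k), where m is the number of positions active in both vectors, so the estimate can be compared against the closed form. The vectors, subset size, and sample count here are arbitrary.

import random
from math import comb

def conjunction_kernel_estimate(x, y, k=2, n_samples=2000, seed=0):
    # Monte Carlo estimate of K_k(x, y) using only a sparse random selection
    # of subsets instead of all C(d, k) size-k subsets.
    d = len(x)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        S = rng.sample(range(d), k)                    # one sparse random subset
        hits += all(x[i] and y[i] for i in S)          # phi_S(x) * phi_S(y) for binary inputs
    return comb(d, k) * hits / n_samples               # rescaled sample mean

x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
m = sum(a and b for a, b in zip(x, y))                 # positions active in both vectors
print(conjunction_kernel_estimate(x, y, k=2), comb(m, 2))  # estimate vs. exact value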

Applications in Text Classification

Beyond spam filtering, SBPH has been applied to a range of text classification tasks where local word order and phrase-level patterns matter:

Authorship attribution uses SBPH features to capture stylistic patterns — specific word combinations, syntactic constructions, and punctuation habits that distinguish one author from another.

Sentiment analysis benefits from SBPH's ability to capture negation and intensification patterns that pure bag-of-words models miss.

Language identification uses character-level SBPH features to capture orthographic and morphological patterns characteristic of different languages.

Efficiency and Scalability

One of SBPH's key advantages is its computational efficiency. Feature generation is linear in the document length, with a per-token cost determined by the window size and maximum polynomial degree. The hash table provides a fixed-size, bounded-memory representation regardless of vocabulary size, making SBPH naturally suited to streaming and online learning settings. Updates to the Bayesian classifier require only incrementing or decrementing counts in the hash table, enabling real-time adaptation.
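
A minimal sketch of such online updates, under the assumption of one fixed-size count table per class, might look as follows; the class and method names are illustrative, and unlearn shows the decrement path used to roll back a mistaken update.

class OnlineSBPHClassifier:
    # Per-class count tables of fixed size; an update touches only the buckets
    # activated by the document, so memory and update cost stay bounded.
    def __init__(self, table_size=2**16):
        self.spam = [0] * table_size
        self.ham = [0] * table_size
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, buckets, is_spam):
        # Increment the observed class's counts for every active feature bucket.
        table = self.spam if is_spam else self.ham
        for b in buckets:
            table[b] += 1
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def unlearn(self, buckets, is_spam):
        # Decrement counts to undo a document that was learned with the wrong label.
        table = self.spam if is_spam else self.ham
        for b in buckets:
            table[b] = max(0, table[b] - 1)
        if is_spam:
            self.n_spam = max(0, self.n_spam - 1)
        else:
            self.n_ham = max(0, self.n_ham - 1)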

"The trick of SBPH is that it captures just enough feature interaction to break the naive Bayes independence assumption, without drowning in the combinatorial explosion of all possible interactions."— Bill Yerazunis

Limitations and Modern Context

SBPH's primary limitation is that its hash collisions can create noise in the feature space, and the fixed polynomial structure may not capture all relevant interactions. Modern deep learning approaches — particularly transformer architectures with self-attention — learn arbitrary feature interactions from data, subsuming the fixed polynomial structure of SBPH. Nevertheless, SBPH remains relevant in resource-constrained settings where the simplicity, speed, and interpretability of a Bayesian hashing approach outweigh the representational advantages of neural methods.

Interactive Calculator

Each row has an item_id (numeric), hash_value (numeric), and bucket (integer bucket assignment, 0–9). The calculator tracks how many items land in each bucket, detects collisions (multiple items in the same bucket), and uses a Beta-Binomial model to estimate the true collision probability. The prior is Beta(1,1) (uniform), updated with each bucket assignment.
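
The sketch below mirrors that description under one assumption about what counts as a trial: each new item is a Bernoulli observation that either lands in an already-occupied bucket (a collision) or claims an empty one, and the Beta(1, 1) prior over the collision probability is updated accordingly. The function name and the example bucket assignments are illustrative.

def collision_posterior(bucket_assignments):
    occupied = set()
    alpha, beta = 1.0, 1.0                        # Beta(1, 1) uniform prior
    for b in bucket_assignments:
        if b in occupied:
            alpha += 1                            # item collided with an earlier one
        else:
            beta += 1                             # item claimed an empty bucket
            occupied.add(b)
    return alpha, beta, alpha / (alpha + beta)    # posterior mean collision probability

# Example: seven items assigned to buckets 0-9 (values are illustrative)
print(collision_posterior([3, 7, 3, 1, 9, 7, 7]))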

