Chapter 18: Text Analysis Pipeline
18.1 Introduction
Full-text search requires transforming unstructured text into searchable tokens—a process called text analysis. The quality of this transformation directly impacts search relevance: poor analysis leads to missed matches or irrelevant results, while good analysis enables users to find documents using natural language queries.
Text analysis bridges the gap between how humans write and how computers search. When a user searches for "running," they likely want to find documents containing "run," "runs," "running," and "ran." When searching for "café," they expect to find both "café" and "cafe." These linguistic variations must be normalized to enable effective matching.
This chapter explores Cognica's text analysis pipeline, which transforms raw text through three stages: character filtering, tokenization, and token filtering. The pipeline supports multiple languages, Unicode text, and custom analysis configurations.
18.1.1 The Analysis Challenge
Consider the sentence: "The quick brown foxes jumped over the lazy dogs."
A naive approach might split on whitespace and match exactly, but this fails for:
- Case variations: "Quick" vs "quick"
- Morphological variations: "foxes" vs "fox," "jumped" vs "jump"
- Stop words: "the," "over" add noise without semantic value
- Diacritics: "café" should match "cafe"
The analysis pipeline addresses these challenges through a sequence of transformations.
18.1.2 Index-Time vs Query-Time Analysis
The analysis pipeline operates in two distinct modes:
Index-Time Analysis (tokenize): Full processing including stemming and stop word removal. Creates tokens for the inverted index.
Query-Time Analysis (normalize): Lighter processing that matches index terms without over-transforming query intent. Typically skips stemming to preserve user semantics.
This distinction is crucial: if the index contains stemmed terms but queries are not stemmed (or vice versa), matches will fail. Cognica's dual-mode design ensures consistency.
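The consistency requirement can be demonstrated with a toy suffix stemmer standing in for Snowball (the suffix rules below are illustrative only): a query term finds a stemmed index entry only when the query side applies the same transformation.

```cpp
#include <string>
#include <unordered_set>

// Toy suffix stemmer standing in for Snowball; the rules are illustrative.
std::string toy_stem(std::string w) {
    for (const char* suffix : {"ning", "ing", "ed", "s"}) {
        std::string s{suffix};
        if (w.size() > s.size() &&
            w.compare(w.size() - s.size(), s.size(), s) == 0) {
            return w.substr(0, w.size() - s.size());
        }
    }
    return w;
}

// A term matches a stemmed index only if the query side stems too.
bool matches(const std::unordered_set<std::string>& index_terms,
             const std::string& query_term, bool stem_query) {
    return index_terms.count(stem_query ? toy_stem(query_term) : query_term) > 0;
}
```

With an index containing `toy_stem("running") == "run"`, the raw query term "running" misses, while the stemmed query matches — which is why an index must always be searched with the same analysis configuration that built it.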
18.2 Analyzer Architecture
18.2.1 The Analyzer Interface
The Analyzer class defines the contract for text analysis:
class Analyzer {
public:
virtual ~Analyzer() = default;
// Full analysis for indexing
virtual auto tokenize(std::string_view text) const -> Tokens = 0;
virtual auto tokenize(const Value& value) const -> Tokens;
// Lighter analysis for queries
virtual auto normalize(std::string_view text) const -> Tokens = 0;
virtual auto normalize(const Value& value) const -> Tokens;
};
The tokenize and normalize methods may produce different results for the same input:
| Input | tokenize() | normalize() |
|---|---|---|
| "Running" | ["run"] | ["running"] |
| "The fox" | ["fox"] | ["the", "fox"] |
18.2.2 Analyzer Composition
Each analyzer composes three pipeline stages:
class StandardAnalyzer : public Analyzer {
std::unique_ptr<ChainedCharacterFilter> char_filters_;
TokenizerType tokenizer_;
std::unique_ptr<ChainedTokenFilter> token_filters_;
public:
auto tokenize(std::string_view text) const -> Tokens override {
// Stage 1: Character filtering
auto filtered = char_filters_->transform(text);
// Stage 2: Tokenization
auto tokens = tokenizer_.tokenize(filtered);
// Stage 3: Token filtering
return token_filters_->transform(tokens);
}
auto normalize(std::string_view text) const -> Tokens override {
auto filtered = char_filters_->transform(text);
auto tokens = tokenizer_.normalize(filtered);
return token_filters_->normalize(tokens);
}
};
18.2.3 Built-in Analyzers
Cognica provides 12 analyzer types:
| Analyzer | Description | Use Case |
|---|---|---|
| standard | ICU tokenization + stemming + stop words | General text |
| standard_cjk | Standard + n-gram filters | Chinese/Japanese/Korean |
| keyword | No tokenization | Exact match fields |
| custom | User-configured pipeline | Special requirements |
| whitespace | Simple whitespace splitting | Pre-tokenized input |
| regex | Pattern-based tokenization | Structured text |
| datetime | Date/time parsing | Temporal fields |
| number | Numeric value handling | Numeric fields |
| int64 | 64-bit integer specific | Integer fields |
| float64 | Double precision specific | Float fields |
| dense_vector | Vector embedding handling | ML embeddings |
| geopoint | Geographic coordinates | Location fields |
18.2.4 Type Erasure Pattern
Cognica uses type erasure to enable runtime composition without virtual function overhead:
template<typename T, typename Storage = te::local_storage<64>>
using poly = /* type-erased wrapper */;
using TokenizerType = te::poly<Tokenizer, te::local_storage<64>>;
using TokenFilterType = te::poly<TokenFilter>;
using CharacterFilterType = te::poly<CharacterFilter>;
Benefits:
- 64-byte local storage avoids heap allocation for small types
- Runtime polymorphism without virtual dispatch overhead
- Factory-based construction from configuration
18.3 Token Structure
18.3.1 Token Representation
Tokens carry rich metadata beyond the term itself:
template<typename T>
struct TokenTemplate {
TokenType type = TokenType::kString; // Data type
T token{}; // Processed term
std::optional<T> original_token{}; // Original form (for highlighting)
int32_t position = 0; // Position in token stream
Offset offset_bytes{}; // Byte offsets in source
Offset offset_chars{}; // Character offsets in source
};
using Token = TokenTemplate<std::string>;
using Tokens = std::vector<Token>;
18.3.2 Token Types
The TokenType enumeration supports typed tokens:
enum class TokenType : uint8_t {
kNull, // Null value
kBoolean, // Boolean literal
kInt64, // 64-bit integer
kUInt64, // Unsigned 64-bit integer
kDouble, // Double precision float
kString, // Text string
};
Typed tokens enable the keyword analyzer to preserve numeric and boolean values for exact matching.
18.3.3 Offset Tracking
The Offset structure tracks positions in the original text:
struct Offset {
int32_t begin = 0; // Start position
int32_t end = 0; // End position
static auto is_overlapped(const Offset& a, const Offset& b) -> bool {
return a.end > b.begin && b.end > a.begin;
}
};
Offset tracking enables:
- Highlighting: show matched terms in context
- Snippet generation: extract relevant text passages
- Phrase queries: verify term adjacency
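As a quick illustration of how offsets support highlighting, the overlap predicate detects colliding highlight spans that should be merged. This standalone copy mirrors the definition above; spans are half-open [begin, end):

```cpp
#include <cstdint>

// Standalone copy of the Offset helper above: spans are half-open [begin, end).
struct Offset {
    int32_t begin = 0;
    int32_t end = 0;
    static auto is_overlapped(const Offset& a, const Offset& b) -> bool {
        return a.end > b.begin && b.end > a.begin;
    }
};
```

Two spans that merely touch, such as [0, 3) and [3, 6), do not overlap; [0, 5) and [3, 8) do, so a highlighter would merge them into one span.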
18.3.4 Position Semantics
Token positions support phrase and proximity queries:
Input: "The quick brown fox"
Tokens after stop word removal:
Token{term="quick", position=0, ...}
Token{term="brown", position=1, ...}
Token{term="fox", position=2, ...}
Note that "The" is removed but subsequent positions are adjusted to maintain adjacency information. This enables the phrase query "quick brown" to match correctly.
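A minimal sketch of how adjusted positions drive phrase matching — the phrase's terms must occur at consecutive positions. The `PosToken` pair stands in for the full Token structure:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// (term, position) pairs stand in for full Token objects in this sketch.
using PosToken = std::pair<std::string, int32_t>;

// A phrase matches when its terms occur at consecutive positions.
bool phrase_match(const std::vector<PosToken>& doc,
                  const std::vector<std::string>& phrase) {
    for (const auto& [term, pos] : doc) {
        if (term != phrase.front()) continue;
        size_t matched = 1;
        for (const auto& [t, p] : doc) {
            if (matched < phrase.size() && t == phrase[matched] &&
                p == pos + static_cast<int32_t>(matched)) {
                ++matched;
            }
        }
        if (matched == phrase.size()) return true;
    }
    return false;
}
```

With the adjusted tokens above, the phrase "quick brown" matches, while "quick fox" does not — "fox" sits two positions after "quick," not one.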
18.4 Character Filters
18.4.1 Purpose and Interface
Character filters transform raw text before tokenization:
class CharacterFilter {
public:
auto transform(std::string_view text) const -> std::string;
};
Character filters operate on the entire input string, enabling transformations that span token boundaries.
18.4.2 Lowercase Character Filter
The LowerCaseCharacterFilter performs Unicode-aware case folding:
class LowerCaseCharacterFilter {
public:
auto transform(std::string_view text) const -> std::string {
// Convert UTF-8 to ICU UnicodeString
auto input = icu::UnicodeString::fromUTF8({
text.data(),
static_cast<int32_t>(text.size())
});
// Apply Unicode lowercase
input.toLower();
// Convert back to UTF-8
std::string output;
auto sink = icu::StringByteSink<std::string>{&output, input.length()};
input.toUTF8(sink);
return output;
}
};
Unicode Considerations:
- German: "STRASSE" lowercases to "strasse" (not "straße")
- Greek: "Σ" becomes the context-appropriate "σ" mid-word or final "ς"
- Turkish: in a Turkish locale, "I" lowercases to dotless "ı" rather than "i"
18.4.3 Normalization Character Filter
The NormalizationCharacterFilter applies Unicode normalization:
class NormalizationCharacterFilter {
icu::Transliterator* transliterator_;
public:
NormalizationCharacterFilter() {
UErrorCode status = U_ZERO_ERROR;
// NFD decomposition, remove combining marks, NFC recomposition
transliterator_ = icu::Transliterator::createInstance(
"NFD; [:Mn:] Remove; NFC",
UTRANS_FORWARD,
status
);
}
auto transform(std::string_view text) const -> std::string {
auto source = icu::UnicodeString::fromUTF8(text);
transliterator_->transliterate(source);
// Convert back to UTF-8
std::string output;
source.toUTF8String(output);
return output;
}
};
Normalization Forms:
- NFD: Canonical decomposition ("é" becomes "e" + combining accent)
- NFC: Canonical composition ("e" + combining accent becomes "é")
- NFKD: Compatibility decomposition ("ﬁ" ligature becomes "f" + "i")
- NFKC: Compatibility composition
18.4.4 Chained Character Filters
Multiple character filters compose in sequence:
class ChainedCharacterFilter {
std::vector<CharacterFilterType> filters_;
public:
auto transform(std::string_view text) const -> std::string {
std::string output{text};
for (const auto& filter : filters_) {
output = filter.transform(output);
}
return output;
}
};
A typical chain: Normalization -> Lowercase.
18.5 Tokenizers
18.5.1 Tokenizer Interface
Tokenizers segment text into individual tokens:
class Tokenizer {
public:
auto tokenize(std::string_view text) const -> Tokens;
auto normalize(std::string_view text) const -> Tokens;
};
18.5.2 ICU Word Tokenizer
The ICUWordTokenizer is the primary tokenizer, using ICU's BreakIterator for Unicode-aware word segmentation:
class ICUWordTokenizer {
static constexpr size_t kMaxConcurrency = 32;
std::array<std::unique_ptr<std::mutex>, kMaxConcurrency> locks_;
std::array<std::unique_ptr<icu::BreakIterator>, kMaxConcurrency> iterators_;
public:
ICUWordTokenizer() {
UErrorCode status = U_ZERO_ERROR;
for (size_t i = 0; i < kMaxConcurrency; ++i) {
locks_[i] = std::make_unique<std::mutex>();
iterators_[i].reset(
icu::BreakIterator::createWordInstance(
icu::Locale::getRoot(), status));
}
}
auto tokenize(std::string_view text) const -> Tokens {
// Select iterator based on hash for load distribution
size_t idx = std::hash<std::string_view>{}(text) % kMaxConcurrency;
std::lock_guard lock(*locks_[idx]);
auto& iterator = iterators_[idx];
auto source = icu::UnicodeString::fromUTF8(text);
iterator->setText(source);
Tokens tokens;
int32_t position = 0;
int32_t begin = iterator->first();
while (begin != icu::BreakIterator::DONE) {
int32_t end = iterator->next();
if (end == icu::BreakIterator::DONE) break;
// Skip non-word breaks (punctuation, whitespace)
if (iterator->getRuleStatus() == UBRK_WORD_NONE) {
begin = end;
continue;
}
// Extract word
auto word = source.tempSubStringBetween(begin, end);
std::string term;
word.toUTF8String(term);
tokens.push_back({
TokenType::kString,
std::move(term),
std::nullopt,
position++,
{begin, end}, // byte offsets (UTF-16 indices in this sketch)
{begin, end}, // char offsets
});
begin = end;
}
return tokens;
}
};
Thread Safety: The tokenizer maintains 32 iterator instances with per-iterator locks, enabling concurrent tokenization with minimal lock contention.
Word Break Rules: ICU's word break algorithm handles:
- Script boundaries (Latin, CJK, Arabic, etc.)
- Contractions ("don't" as one or two tokens based on locale)
- Numeric sequences ("3.14" as single token)
- Email addresses and URLs (configurable)
18.5.3 N-Gram Tokenizer
The NGramTokenizer generates character n-grams:
class NGramTokenizer {
int64_t min_size_ = 2;
int64_t max_size_ = 3;
bool track_offsets_ = false;
public:
auto tokenize(std::string_view text) const -> Tokens {
Tokens tokens;
int32_t position = 0;
// Generate n-grams for each size
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= text.size(); ++i) {
tokens.push_back({
TokenType::kString,
std::string(text.substr(i, n)),
std::nullopt,
position++,
track_offsets_ ? Offset{static_cast<int32_t>(i),
static_cast<int32_t>(i + n)}
: Offset{},
{},
});
}
}
return tokens;
}
};
Use Cases:
- Substring matching without wildcards
- CJK text where word boundaries are ambiguous
- Typo tolerance through partial matches
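The core n-gram loop can be sketched without the token plumbing. This byte-oriented version assumes single-byte characters for clarity; the real tokenizer must iterate code points, not bytes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Character n-grams of sizes min_n..max_n, assuming single-byte characters.
std::vector<std::string> char_ngrams(const std::string& text,
                                     std::size_t min_n, std::size_t max_n) {
    std::vector<std::string> grams;
    for (std::size_t n = min_n; n <= max_n; ++n) {
        for (std::size_t i = 0; i + n <= text.size(); ++i) {
            grams.push_back(text.substr(i, n));
        }
    }
    return grams;
}
```

For example, `char_ngrams("fox", 2, 3)` produces `["fo", "ox", "fox"]` — every substring of length 2 or 3, which is what makes wildcard-free substring matching possible.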
18.5.4 Keyword Tokenizer
The KeywordTokenizer emits the entire input as a single token:
class KeywordTokenizer {
public:
auto tokenize(std::string_view text) const -> Tokens {
return {{
TokenType::kString,
std::string(text),
std::nullopt,
0, // Position 0
{0, static_cast<int32_t>(text.size())},
{0, static_cast<int32_t>(text.size())},
}};
}
};
Use Cases:
- Exact match fields (product SKUs, IDs)
- Enumeration values
- Tags and categories
18.5.5 MeCab Tokenizer
The MeCabTokenizer provides morphological analysis for Japanese and Korean:
class MeCabTokenizer {
std::unique_ptr<MeCab::Tagger> tagger_;
public:
MeCabTokenizer() {
tagger_.reset(MeCab::createTagger("-Owakati"));
}
auto tokenize(std::string_view text) const -> Tokens {
// MeCab expects NUL-terminated input, so copy the view first
std::string input{text};
const char* result = tagger_->parse(input.c_str());
// Parse MeCab's space-separated output into tokens...
}
};
MeCab performs:
- Dictionary-based word segmentation
- Part-of-speech tagging
- Compound word decomposition
- Reading (furigana) extraction
18.5.6 Tokenizer Factory
Tokenizers are created through a factory:
class TokenizerFactory {
static const std::unordered_map<std::string_view, CreatorFn> kFactoryMap;
public:
static auto create(std::string_view type, const Value& options)
-> std::unique_ptr<TokenizerType> {
auto it = kFactoryMap.find(type);
if (it == kFactoryMap.end()) {
throw std::invalid_argument("Unknown tokenizer: " + std::string(type));
}
return it->second(options);
}
};
const std::unordered_map<std::string_view, CreatorFn>
TokenizerFactory::kFactoryMap = {
{"icu", create<ICUWordTokenizer>},
{"whitespace", create<WhitespaceTokenizer>},
{"keyword", create<KeywordTokenizer>},
{"ngram", create<NGramTokenizer>},
{"mecab", create<MeCabTokenizer>},
{"regex", create<RegexTokenizer>},
// ... more tokenizers
};
18.6 Token Filters
18.6.1 Token Filter Interface
Token filters transform the token stream:
class TokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens; // For indexing
auto normalize(const Tokens& tokens) const -> Tokens; // For queries
};
18.6.2 Lowercase Token Filter
The LowerCaseTokenFilter normalizes token case:
class LowerCaseTokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
result.reserve(tokens.size());
for (const auto& token : tokens) {
auto source = icu::UnicodeString::fromUTF8(token.token);
source.toLower();
std::string lowered;
source.toUTF8String(lowered);
result.push_back({
token.type,
std::move(lowered),
token.token, // Preserve original for highlighting
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
18.6.3 Stop Word Filter
The StopWordsTokenFilter removes common words:
class StopWordsTokenFilter {
std::unordered_set<std::string> stop_words_;
public:
explicit StopWordsTokenFilter(std::string_view language) {
stop_words_ = load_stop_words(language);
}
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
int32_t position_adjustment = 0;
for (const auto& token : tokens) {
if (stop_words_.contains(token.token)) {
++position_adjustment;
continue; // Skip stop word
}
result.push_back({
token.type,
token.token,
token.original_token,
token.position - position_adjustment, // Adjust position
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
Position Adjustment: When stop words are removed, subsequent token positions are decremented to maintain correct phrase matching semantics.
Language-Specific Lists: Each language has its own stop word list. English includes "the," "a," "an," "is," "are," etc.
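A stripped-down version of the position-adjustment logic, using an illustrative `SimpleToken` type in place of the full Token structure:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for the full Token structure.
struct SimpleToken {
    std::string term;
    int32_t position;
};

// Stop word removal with position renumbering, as described above.
std::vector<SimpleToken> remove_stops(
    const std::vector<SimpleToken>& in,
    const std::unordered_set<std::string>& stops) {
    std::vector<SimpleToken> out;
    int32_t adjustment = 0;
    for (const auto& t : in) {
        if (stops.count(t.term) > 0) {
            ++adjustment;  // every removed word shifts later positions left
            continue;
        }
        out.push_back({t.term, t.position - adjustment});
    }
    return out;
}
```

For input [("the", 0), ("quick", 1), ("brown", 2), ("fox", 3)] with "the" as a stop word, the surviving tokens carry positions 0, 1, 2 — so "quick brown" remains adjacent for phrase queries.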
18.6.4 Snowball Stemmer Filter
The SnowballTokenFilter reduces words to their stems:
class SnowballTokenFilter {
static constexpr size_t kMaxStemmers = 32;
std::string language_;
std::array<std::unique_ptr<std::mutex>, kMaxStemmers> locks_;
std::array<sb_stemmer*, kMaxStemmers> stemmers_;
mutable LRUCache<std::string, std::string> cache_; // 64KB cache
public:
SnowballTokenFilter(std::string_view language)
: language_(language),
cache_(64 * 1024) {
for (size_t i = 0; i < kMaxStemmers; ++i) {
locks_[i] = std::make_unique<std::mutex>();
stemmers_[i] = sb_stemmer_new(language_.c_str(), "UTF_8");
}
}
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
result.reserve(tokens.size());
for (const auto& token : tokens) {
// Check cache first
if (auto cached = cache_.get(token.token)) {
result.push_back({token.type, *cached, token.token,
token.position, token.offset_bytes,
token.offset_chars});
continue;
}
// Stem the token
size_t idx = std::hash<std::string>{}(token.token) % kMaxStemmers;
std::lock_guard lock(*locks_[idx]);
const sb_symbol* stemmed = sb_stemmer_stem(
stemmers_[idx],
reinterpret_cast<const sb_symbol*>(token.token.data()),
static_cast<int>(token.token.size())
);
std::string stem(reinterpret_cast<const char*>(stemmed),
sb_stemmer_length(stemmers_[idx]));
cache_.put(token.token, stem);
result.push_back({
token.type,
std::move(stem),
token.token, // Preserve original
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
// Query-time: typically skip stemming to preserve user intent
return tokens;
}
};
Supported Languages: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish, Arabic, and more.
Performance Optimization:
- 32 stemmer instances for concurrent access
- LRU cache (64KB) for frequently stemmed words
- Hash-based load distribution
18.6.5 N-Gram Token Filter
The NGramTokenFilter generates n-grams from tokens:
class NGramTokenFilter {
int64_t min_size_ = 2;
int64_t max_size_ = 2;
bool track_offsets_ = false;
bool allow_small_tokens_ = true;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
if (token.token.size() < min_size_ && allow_small_tokens_) {
result.push_back(token); // Keep small tokens as-is
continue;
}
// Generate n-grams
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= token.token.size(); ++i) {
result.push_back({
TokenType::kString,
token.token.substr(i, n),
std::nullopt,
token.position,
track_offsets_ ? compute_offset(token, i, n) : Offset{},
{},
});
}
}
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
// Query-time optimization: skip overlapping n-grams
Tokens result;
Offset prev_offset{};
for (const auto& token : tokens) {
// Generate n-grams but skip overlapping ones
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= token.token.size(); ++i) {
Offset current = compute_offset(token, i, n);
if (!Offset::is_overlapped(prev_offset, current)) {
result.push_back({TokenType::kString,
token.token.substr(i, n),
std::nullopt, token.position,
current, {}});
prev_offset = current;
}
}
}
}
return result;
}
};
18.6.6 Edge N-Gram Filter
The EdgeNGramTokenFilter generates n-grams anchored at token edges:
class EdgeNGramTokenFilter {
int64_t min_size_ = 1;
int64_t max_size_ = 2;
bool from_end_ = false; // true for suffix n-grams
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
for (int64_t n = min_size_;
n <= std::min(max_size_, static_cast<int64_t>(token.token.size()));
++n) {
if (from_end_) {
// Suffix n-gram
result.push_back({
TokenType::kString,
token.token.substr(token.token.size() - n, n),
std::nullopt,
token.position,
{},
{},
});
} else {
// Prefix n-gram
result.push_back({
TokenType::kString,
token.token.substr(0, n),
std::nullopt,
token.position,
{},
{},
});
}
}
}
return result;
}
};
Use Case: Prefix completion (autocomplete) queries. Indexing "hello" as ["h", "he"] enables prefix search.
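The prefix case reduces to taking leading substrings of increasing length; a minimal sketch (suffix mode, which mirrors this from the other end, is omitted):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Prefix (edge) n-grams; suffix mode would mirror this from the other end.
std::vector<std::string> edge_ngrams(const std::string& term,
                                     std::size_t min_n, std::size_t max_n) {
    std::vector<std::string> grams;
    for (std::size_t n = min_n; n <= std::min(max_n, term.size()); ++n) {
        grams.push_back(term.substr(0, n));
    }
    return grams;
}
```

`edge_ngrams("hello", 1, 2)` yields `["h", "he"]`, matching the indexing example above; an autocomplete query for "he" then hits the document through a plain term lookup.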
18.6.7 Shingle Filter
The ShingleTokenFilter creates word n-grams (shingles):
class ShingleTokenFilter {
int64_t min_size_ = 2;
int64_t max_size_ = 2;
std::string separator_ = " ";
bool output_unigrams_ = true;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
// Output unigrams if configured
if (output_unigrams_) {
result.insert(result.end(), tokens.begin(), tokens.end());
}
// Generate shingles
for (size_t i = 0; i < tokens.size(); ++i) {
for (int64_t n = min_size_; n <= max_size_; ++n) {
if (i + n > tokens.size()) break;
std::string shingle;
for (size_t j = 0; j < n; ++j) {
if (j > 0) shingle += separator_;
shingle += tokens[i + j].token;
}
result.push_back({
TokenType::kString,
std::move(shingle),
std::nullopt,
tokens[i].position,
{},
{},
});
}
}
return result;
}
};
Example: "quick brown fox" with 2-grams produces:
- Unigrams: ["quick", "brown", "fox"]
- Shingles: ["quick brown", "brown fox"]
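A minimal shingle generator reproducing the example above (unigram output and token metadata omitted):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Word shingles of a fixed size n, joined by a separator.
std::vector<std::string> shingles(const std::vector<std::string>& tokens,
                                  std::size_t n,
                                  const std::string& separator = " ") {
    std::vector<std::string> out;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string s = tokens[i];
        for (std::size_t j = 1; j < n; ++j) {
            s += separator + tokens[i + j];
        }
        out.push_back(s);
    }
    return out;
}
```

`shingles({"quick", "brown", "fox"}, 2)` produces `["quick brown", "brown fox"]`, exactly the 2-gram output listed above.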
18.6.8 ASCII Folding Filter
The ASCIIFoldingTokenFilter converts accented characters to ASCII:
class ASCIIFoldingTokenFilter {
static const std::unordered_map<char32_t, std::string> kFoldingMap;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
std::string folded;
folded.reserve(token.token.size());
// Iterate over UTF-8 code points
for (char32_t cp : iterate_utf8(token.token)) {
if (auto it = kFoldingMap.find(cp); it != kFoldingMap.end()) {
folded += it->second; // Folded ASCII equivalent
} else if (cp < 128) {
folded += static_cast<char>(cp); // Already ASCII
}
// Non-ASCII characters without mapping are removed
}
result.push_back({
token.type,
std::move(folded),
token.token,
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
// Example mappings
const std::unordered_map<char32_t, std::string>
ASCIIFoldingTokenFilter::kFoldingMap = {
{U'à', "a"}, {U'á', "a"}, {U'â', "a"}, // a with various accents
{U'ç', "c"}, // c with cedilla
{U'ñ', "n"}, // n with tilde
{U'ß', "ss"}, // German sharp s
// ... extensive mapping table
};
18.6.9 Double Metaphone Filter
The DoubleMetaphoneTokenFilter generates phonetic codes for sound-alike matching:
class DoubleMetaphoneTokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
auto [primary, secondary] = double_metaphone(token.token);
// Emit primary code
result.push_back({
TokenType::kString,
primary,
token.token,
token.position,
token.offset_bytes,
token.offset_chars,
});
// Emit secondary code if different
if (!secondary.empty() && secondary != primary) {
result.push_back({
TokenType::kString,
secondary,
token.token,
token.position, // Same position
token.offset_bytes,
token.offset_chars,
});
}
}
return result;
}
};
Example: "Smith" produces codes "SM0" and "XMT", matching "Smyth," "Schmidt," etc.
18.6.10 Chained Token Filters
Multiple filters compose in sequence:
class ChainedTokenFilter {
std::vector<std::unique_ptr<TokenFilterType>> filters_;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result = tokens;
for (const auto& filter : filters_) {
result = filter->transform(result);
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
Tokens result = tokens;
for (const auto& filter : filters_) {
result = filter->normalize(result);
}
return result;
}
};
Typical Chain: Lowercase -> Stop Words -> Stemming -> Length Filter.
18.7 Language Support
18.7.1 Multi-Language Analysis
Cognica supports language-specific analysis through:
- Stemmer selection: Snowball stemmers for 20+ languages
- Stop word lists: Language-specific common words
- Tokenization rules: Script-aware word breaking
class StandardAnalyzer {
public:
StandardAnalyzer(std::string_view language = "english") {
// Configure language-specific components
token_filters_ = std::make_unique<ChainedTokenFilter>();
token_filters_->add(std::make_unique<LowerCaseTokenFilter>());
token_filters_->add(std::make_unique<StopWordsTokenFilter>(language));
token_filters_->add(std::make_unique<SnowballTokenFilter>(language));
}
};
18.7.2 CJK Analysis
Chinese, Japanese, and Korean text requires special handling due to:
- No whitespace: Words are not separated by spaces
- Character-based: Each character may be a word
- Ambiguous boundaries: Multiple valid segmentations exist
The StandardCJKAnalyzer addresses these challenges:
class StandardCJKAnalyzer : public Analyzer {
public:
StandardCJKAnalyzer(const Value& options) {
// Parse options
auto ngram_type = options.get("ngram_type", "normal");
auto min_size = options.get("min_size", 1);
auto max_size = options.get("max_size", 2);
auto min_length = options.get("min_length", 1); // byte-length bounds for the
auto max_length = options.get("max_length", 255); // ByteLengthTokenFilter below
// Configure CJK-specific pipeline
tokenizer_ = std::make_unique<ICUWordTokenizer>();
token_filters_ = std::make_unique<ChainedTokenFilter>();
token_filters_->add(std::make_unique<LowerCaseTokenFilter>());
// Add n-gram filter for CJK
if (ngram_type == "edge") {
token_filters_->add(
std::make_unique<EdgeNGramTokenFilter>(min_size, max_size));
} else {
token_filters_->add(
std::make_unique<NGramTokenFilter>(min_size, max_size));
}
token_filters_->add(
std::make_unique<ByteLengthTokenFilter>(min_length, max_length));
}
};
N-Gram Strategy: For CJK text, generating 1-2 character n-grams ensures that any substring can be matched, compensating for ambiguous word boundaries.
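The bigram case can be sketched over code points; using std::u32string sidesteps UTF-8 decoding, so each element is one character regardless of its byte length. A common illustration is 東京都 ("Tokyo Metropolis"), whose segmentation is ambiguous between 東京 + 都 and 東 + 京都:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Code-point bigrams sketched over std::u32string, so each element is one
// character regardless of its UTF-8 byte length.
std::vector<std::u32string> cjk_bigrams(const std::u32string& text) {
    std::vector<std::u32string> grams;
    for (std::size_t i = 0; i + 2 <= text.size(); ++i) {
        grams.push_back(text.substr(i, 2));
    }
    return grams;
}
```

`cjk_bigrams(U"東京都")` yields both 東京 ("Tokyo") and 京都 ("Kyoto"), so a query for either matches — recall is preserved at the cost of some precision.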
18.7.3 Japanese with MeCab
For higher-quality Japanese analysis, the MeCab tokenizer provides morphological analysis:
auto tokens = mecab_tokenizer.tokenize("私は寿司を食べる。"); // "I eat sushi."
// Result:
// Token{term="私", pos="pronoun", ...} (watashi)
// Token{term="は", pos="particle", ...} (ha)
// Token{term="寿司", pos="noun", ...} (sushi)
// Token{term="を", pos="particle", ...} (wo)
// Token{term="食べる", pos="verb", ...} (taberu)
MeCab uses a dictionary-based approach with statistical disambiguation, producing more meaningful tokens than character n-grams.
18.8 ICU Integration
18.8.1 Unicode String Handling
Cognica uses ICU for all Unicode operations:
#include <unicode/unistr.h>
#include <unicode/brkiter.h>
#include <unicode/translit.h>
// UTF-8 to ICU UnicodeString
auto source = icu::UnicodeString::fromUTF8(text);
// Process with ICU
source.toLower();
// Back to UTF-8
std::string result;
source.toUTF8String(result);
18.8.2 Word Break Detection
ICU's BreakIterator provides sophisticated word boundary detection:
UErrorCode status = U_ZERO_ERROR;
auto iterator = std::unique_ptr<icu::BreakIterator>{
icu::BreakIterator::createWordInstance(
icu::Locale::getRoot(), status)
};
iterator->setText(source);
int32_t start = iterator->first();
while (start != icu::BreakIterator::DONE) {
int32_t end = iterator->next();
// Check if this is a word (not punctuation/space)
if (iterator->getRuleStatus() != UBRK_WORD_NONE) {
// Process word from start to end
}
start = end;
}
Rule Status Values:
- UBRK_WORD_NONE: Not a word (whitespace, punctuation)
- UBRK_WORD_NUMBER: Numeric sequence
- UBRK_WORD_LETTER: Alphabetic word
- UBRK_WORD_KANA: Japanese kana
- UBRK_WORD_IDEO: Ideographic (CJK)
18.8.3 Transliteration
ICU transliterators perform complex character transformations:
// Create transliterator with rule
auto transliterator = icu::Transliterator::createInstance(
"NFD; [:Mn:] Remove; NFC", // Normalize, remove combining marks, recompose
UTRANS_FORWARD,
status
);
// Apply transformation
auto text = icu::UnicodeString::fromUTF8("café");
transliterator->transliterate(text);
// Result: "cafe" (accent removed)
Common Rules:
- "NFD; [:Mn:] Remove; NFC": Remove accents
- "Any-Latin": Convert any script to Latin
- "Hiragana-Katakana": Convert Japanese scripts
18.9 Custom Analyzer Configuration
18.9.1 Configuration Schema
Custom analyzers are configured through JSON:
{
"type": "custom",
"char_filters": [
{"type": "normalization", "form": "NFKC"},
{"type": "lower_case"}
],
"tokenizer": {
"type": "icu"
},
"token_filters": [
{"type": "stopwords", "language": "english"},
{"type": "snowball", "language": "english"},
{"type": "char_length", "min": 2, "max": 50}
]
}
18.9.2 Custom Analyzer Factory
class CustomAnalyzer : public Analyzer {
public:
CustomAnalyzer(
std::unique_ptr<ChainedCharacterFilter> char_filters,
std::unique_ptr<TokenizerType> tokenizer,
std::unique_ptr<ChainedTokenFilter> token_filters)
: char_filters_(std::move(char_filters)),
tokenizer_(std::move(tokenizer)),
token_filters_(std::move(token_filters)) {}
static auto from_config(const Value& config)
-> std::unique_ptr<CustomAnalyzer> {
// Build character filter chain
auto char_filters = std::make_unique<ChainedCharacterFilter>();
for (const auto& cf_config : config["char_filters"]) {
char_filters->add(CharFilterFactory::create(cf_config));
}
// Build tokenizer
auto tokenizer = TokenizerFactory::create(
config["tokenizer"]["type"], config["tokenizer"]);
// Build token filter chain
auto token_filters = std::make_unique<ChainedTokenFilter>();
for (const auto& tf_config : config["token_filters"]) {
token_filters->add(TokenFilterFactory::create(tf_config));
}
return std::make_unique<CustomAnalyzer>(
std::move(char_filters),
std::move(tokenizer),
std::move(token_filters)
);
}
};
18.10 Performance Considerations
18.10.1 Thread Safety
All analyzers are designed for concurrent use:
- ICU Tokenizer: Pool of 32 BreakIterator instances with per-instance locks
- Snowball Stemmer: Pool of 32 stemmer instances with LRU cache
- Immutable configuration: Analyzer settings cannot change after construction
18.10.2 Memory Efficiency
Token filters operate on token streams without allocating intermediate strings:
// Efficient: modify in place where possible
for (auto& token : tokens) {
to_lowercase_inplace(token.token);
}
// Avoid: creating new string for each token
for (const auto& token : tokens) {
result.push_back({..., to_lowercase(token.token), ...}); // Extra allocation
}
18.10.3 Caching
Frequently used transformations are cached:
- Stemmer cache: LRU cache for stemmed forms (64KB per language)
- Analyzer cache: Parsed analyzer configurations
- Stop word sets: Loaded once, shared across instances
18.11 Summary
The text analysis pipeline is the foundation of full-text search, transforming unstructured text into searchable tokens through a carefully designed sequence of transformations. The key concepts covered in this chapter are:
Three-Stage Pipeline: Character filters operate on raw text, tokenizers segment text into tokens, and token filters transform the token stream. This modular design enables flexible configuration for diverse use cases.
Unicode Support: ICU integration provides proper handling of international text, including Unicode normalization, script-aware word breaking, and locale-specific case folding.
Linguistic Processing: Stop word removal eliminates noise, stemming normalizes morphological variations, and phonetic encoding enables sound-alike matching.
Dual-Mode Analysis: The distinction between tokenize (index-time) and normalize (query-time) ensures consistent matching while preserving query semantics.
Language Support: Built-in support for 20+ languages through Snowball stemmers, language-specific stop word lists, and specialized tokenizers (MeCab for Japanese/Korean).
CJK Handling: N-gram and edge n-gram filters address the unique challenges of Chinese, Japanese, and Korean text where word boundaries are ambiguous.
Type Erasure: The te::poly pattern enables runtime composition of analyzers without virtual function overhead, supporting custom analyzer configurations.
The text analysis pipeline sets the stage for the scoring algorithms covered in the next chapter, where we explore how matched tokens are ranked using BM25 and Bayesian BM25.
References
- Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program.
- Unicode Consortium. (2023). Unicode Standard Annex #29: Unicode Text Segmentation.
- ICU Project. (2023). ICU User Guide. https://unicode-org.github.io/icu/
- Snowball. (2023). Snowball Stemming Algorithms. https://snowballstem.org/
- Kudo, T. (2006). MeCab: Yet Another Part-of-Speech and Morphological Analyzer.