Chapter 18: Text Analysis Pipeline
18.1 Introduction
Full-text search requires transforming unstructured text into searchable tokens—a process called text analysis. The quality of this transformation directly impacts search relevance: poor analysis leads to missed matches or irrelevant results, while good analysis enables users to find documents using natural language queries.
Text analysis bridges the gap between how humans write and how computers search. When a user searches for "running," they likely want to find documents containing "run," "runs," "running," and "ran." When searching for "café," they expect to find both "café" and "cafe." These linguistic variations must be normalized to enable effective matching.
This chapter explores Cognica's text analysis pipeline, which transforms raw text through three stages: character filtering, tokenization, and token filtering. The pipeline supports multiple languages, Unicode text, and custom analysis configurations.
18.1.1 The Analysis Challenge
Consider the sentence: "The quick brown foxes jumped over the lazy dogs."
A naive approach might split on whitespace and match exactly, but this fails for:
- Case variations: "Quick" vs "quick"
- Morphological variations: "foxes" vs "fox," "jumped" vs "jump"
- Stop words: "the," "over" add noise without semantic value
- Diacritics: "café" should match "cafe"
The analysis pipeline addresses these challenges through a sequence of transformations.
18.1.2 Index-Time vs Query-Time Analysis
The analysis pipeline operates in two distinct modes:
Index-Time Analysis (tokenize): Full processing including stemming and stop word removal. Creates tokens for the inverted index.
Query-Time Analysis (normalize): Lighter processing that matches index terms without over-transforming query intent. Typically skips stemming to preserve user semantics.
This distinction is crucial: if the index contains stemmed terms but queries are not stemmed (or vice versa), matches will fail. Cognica's dual-mode design ensures consistency.
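The consistency requirement can be demonstrated with a toy suffix stemmer standing in for Snowball (the suffix rules below are illustrative only): a query term finds a stemmed index entry only when the query side applies the same transformation.

```cpp
#include <string>
#include <unordered_set>

// Toy suffix stemmer standing in for Snowball; the rules are illustrative.
std::string toy_stem(std::string w) {
    for (const char* suffix : {"ning", "ing", "ed", "s"}) {
        std::string s{suffix};
        if (w.size() > s.size() &&
            w.compare(w.size() - s.size(), s.size(), s) == 0) {
            return w.substr(0, w.size() - s.size());
        }
    }
    return w;
}

// A term matches a stemmed index only if the query side stems too.
bool matches(const std::unordered_set<std::string>& index_terms,
             const std::string& query_term, bool stem_query) {
    return index_terms.count(stem_query ? toy_stem(query_term) : query_term) > 0;
}
```

With an index containing `toy_stem("running") == "run"`, the raw query term "running" misses, while the stemmed query matches — which is why an index must always be searched with the same analysis configuration that built it.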
18.2 Analyzer Architecture
18.2.1 The Analyzer Interface
The Analyzer class defines the contract for text analysis:
class Analyzer {
public:
virtual ~Analyzer() = default;
// Full analysis for indexing
virtual auto tokenize(std::string_view text) const -> Tokens = 0;
virtual auto tokenize(const Value& value) const -> Tokens;
// Lighter analysis for queries
virtual auto normalize(std::string_view text) const -> Tokens = 0;
virtual auto normalize(const Value& value) const -> Tokens;
};
The tokenize and normalize methods may produce different results for the same input:
| Input | tokenize() | normalize() |
|---|---|---|
| "Running" | ["run"] | ["running"] |
| "The fox" | ["fox"] | ["the", "fox"] |
18.2.2 Analyzer Composition
Each analyzer composes three pipeline stages:
class StandardAnalyzer : public Analyzer {
std::unique_ptr<ChainedCharacterFilter> char_filters_;
TokenizerType tokenizer_;
std::unique_ptr<ChainedTokenFilter> token_filters_;
public:
auto tokenize(std::string_view text) const -> Tokens override {
// Stage 1: Character filtering
auto filtered = char_filters_->transform(text);
// Stage 2: Tokenization
auto tokens = tokenizer_.tokenize(filtered);
// Stage 3: Token filtering
return token_filters_->transform(tokens);
}
auto normalize(std::string_view text) const -> Tokens override {
auto filtered = char_filters_->transform(text);
auto tokens = tokenizer_.normalize(filtered);
return token_filters_->normalize(tokens);
}
};
18.2.3 Built-in Analyzers
Cognica provides 12 analyzer types:
| Analyzer | Description | Use Case |
|---|---|---|
| standard | ICU tokenization + stemming + stop words | General text |
| standard_cjk | Standard + n-gram filters | Chinese/Japanese/Korean |
| keyword | No tokenization | Exact match fields |
| custom | User-configured pipeline | Special requirements |
| whitespace | Simple whitespace splitting | Pre-tokenized input |
| regex | Pattern-based tokenization | Structured text |
| datetime | Date/time parsing | Temporal fields |
| number | Numeric value handling | Numeric fields |
| int64 | 64-bit integer specific | Integer fields |
| float64 | Double precision specific | Float fields |
| dense_vector | Vector embedding handling | ML embeddings |
| geopoint | Geographic coordinates | Location fields |
18.2.4 Type Erasure Pattern
Cognica uses type erasure to enable runtime composition without virtual function overhead:
template<typename T, typename Storage = te::local_storage<64>>
using poly = /* type-erased wrapper */;
using TokenizerType = te::poly<Tokenizer, te::local_storage<64>>;
using TokenFilterType = te::poly<TokenFilter>;
using CharacterFilterType = te::poly<CharacterFilter>;
Benefits:
- 64-byte local storage avoids heap allocation for small types
- Runtime polymorphism without virtual dispatch overhead
- Factory-based construction from configuration
18.3 Token Structure
18.3.1 Token Representation
Tokens carry rich metadata beyond the term itself:
template<typename T>
struct TokenTemplate {
TokenType type = TokenType::kString; // Data type
T token{}; // Processed term
std::optional<T> original_token{}; // Original form (for highlighting)
int32_t position = 0; // Position in token stream
Offset offset_bytes{}; // Byte offsets in source
Offset offset_chars{}; // Character offsets in source
};
using Token = TokenTemplate<std::string>;
using Tokens = std::vector<Token>;
18.3.2 Token Types
The TokenType enumeration supports typed tokens:
enum class TokenType : uint8_t {
kNull, // Null value
kBoolean, // Boolean literal
kInt64, // 64-bit integer
kUInt64, // Unsigned 64-bit integer
kDouble, // Double precision float
kString, // Text string
};
Typed tokens enable the keyword analyzer to preserve numeric and boolean values for exact matching.
18.3.3 Offset Tracking
The Offset structure tracks positions in the original text:
struct Offset {
int32_t begin = 0; // Start position
int32_t end = 0; // End position
static auto is_overlapped(const Offset& a, const Offset& b) -> bool {
return a.end > b.begin && b.end > a.begin;
}
};
Offset tracking enables:
- Highlighting: show matched terms in context
- Snippet generation: extract relevant text passages
- Phrase queries: verify term adjacency
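As a quick illustration of how offsets support highlighting, the overlap predicate detects colliding highlight spans that should be merged. This standalone copy mirrors the definition above; spans are half-open [begin, end):

```cpp
#include <cstdint>

// Standalone copy of the Offset helper above: spans are half-open [begin, end).
struct Offset {
    int32_t begin = 0;
    int32_t end = 0;
    static auto is_overlapped(const Offset& a, const Offset& b) -> bool {
        return a.end > b.begin && b.end > a.begin;
    }
};
```

Two spans that merely touch, such as [0, 3) and [3, 6), do not overlap; [0, 5) and [3, 8) do, so a highlighter would merge them into one span.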
18.3.4 Position Semantics
Token positions support phrase and proximity queries:
Input: "The quick brown fox"
Tokens after stop word removal:
Token{term="quick", position=0, ...}
Token{term="brown", position=1, ...}
Token{term="fox", position=2, ...}
Note that "The" is removed but subsequent positions are adjusted to maintain adjacency information. This enables the phrase query "quick brown" to match correctly.
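A minimal sketch of how adjusted positions drive phrase matching — the phrase's terms must occur at consecutive positions. The `PosToken` pair stands in for the full Token structure:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// (term, position) pairs stand in for full Token objects in this sketch.
using PosToken = std::pair<std::string, int32_t>;

// A phrase matches when its terms occur at consecutive positions.
bool phrase_match(const std::vector<PosToken>& doc,
                  const std::vector<std::string>& phrase) {
    for (const auto& [term, pos] : doc) {
        if (term != phrase.front()) continue;
        size_t matched = 1;
        for (const auto& [t, p] : doc) {
            if (matched < phrase.size() && t == phrase[matched] &&
                p == pos + static_cast<int32_t>(matched)) {
                ++matched;
            }
        }
        if (matched == phrase.size()) return true;
    }
    return false;
}
```

With the adjusted tokens above, the phrase "quick brown" matches, while "quick fox" does not — "fox" sits two positions after "quick," not one.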
18.4 Character Filters
18.4.1 Purpose and Interface
Character filters transform raw text before tokenization:
class CharacterFilter {
public:
auto transform(std::string_view text) const -> std::string;
};
Character filters operate on the entire input string, enabling transformations that span token boundaries.
18.4.2 Lowercase Character Filter
The LowerCaseCharacterFilter performs Unicode-aware case folding:
class LowerCaseCharacterFilter {
public:
auto transform(std::string_view text) const -> std::string {
// Convert UTF-8 to ICU UnicodeString
auto input = icu::UnicodeString::fromUTF8({
text.data(),
static_cast<int32_t>(text.size())
});
// Apply Unicode lowercase
input.toLower();
// Convert back to UTF-8
std::string output;
auto sink = icu::StringByteSink<std::string>{&output, input.length()};
input.toUTF8(sink);
return output;
}
};
Unicode Considerations:
- German: "STRASSE" lowercases to "strasse" (not "straße")
- Greek: "Σ" becomes the context-appropriate "σ" mid-word or final "ς"
- Turkish: in a Turkish locale, "I" lowercases to dotless "ı" rather than "i"
18.4.3 Normalization Character Filter
The NormalizationCharacterFilter applies Unicode normalization:
class NormalizationCharacterFilter {
icu::Transliterator* transliterator_;
public:
NormalizationCharacterFilter() {
UErrorCode status = U_ZERO_ERROR;
// NFD decomposition, remove combining marks, NFC recomposition
transliterator_ = icu::Transliterator::createInstance(
"NFD; [:Mn:] Remove; NFC",
UTRANS_FORWARD,
status
);
}
auto transform(std::string_view text) const -> std::string {
auto source = icu::UnicodeString::fromUTF8(text);
transliterator_->transliterate(source);
// Convert back to UTF-8
std::string output;
source.toUTF8String(output);
return output;
}
};
Normalization Forms:
- NFD: Canonical decomposition ("é" becomes "e" + combining accent)
- NFC: Canonical composition ("e" + combining accent becomes "é")
- NFKD: Compatibility decomposition ("ﬁ" ligature becomes "f" + "i")
- NFKC: Compatibility composition
18.4.4 Chained Character Filters
Multiple character filters compose in sequence:
class ChainedCharacterFilter {
std::vector<CharacterFilterType> filters_;
public:
auto transform(std::string_view text) const -> std::string {
std::string output{text};
for (const auto& filter : filters_) {
output = filter.transform(output);
}
return output;
}
};
A typical chain: Normalization -> Lowercase.
18.5 Tokenizers
18.5.1 Tokenizer Interface
Tokenizers segment text into individual tokens:
class Tokenizer {
public:
auto tokenize(std::string_view text) const -> Tokens;
auto normalize(std::string_view text) const -> Tokens;
};
18.5.2 ICU Word Tokenizer
The ICUWordTokenizer is the primary tokenizer, using ICU's BreakIterator for Unicode-aware word segmentation:
class ICUWordTokenizer {
static constexpr size_t kMaxConcurrency = 32;
std::array<std::unique_ptr<std::mutex>, kMaxConcurrency> locks_;
std::array<std::unique_ptr<icu::BreakIterator>, kMaxConcurrency> iterators_;
public:
ICUWordTokenizer() {
UErrorCode status = U_ZERO_ERROR;
for (size_t i = 0; i < kMaxConcurrency; ++i) {
locks_[i] = std::make_unique<std::mutex>();
iterators_[i].reset(
icu::BreakIterator::createWordInstance(
icu::Locale::getRoot(), status));
}
}
auto tokenize(std::string_view text) const -> Tokens {
// Select iterator based on hash for load distribution
size_t idx = std::hash<std::string_view>{}(text) % kMaxConcurrency;
std::lock_guard lock(*locks_[idx]);
auto& iterator = iterators_[idx];
auto source = icu::UnicodeString::fromUTF8(text);
iterator->setText(source);
Tokens tokens;
int32_t position = 0;
int32_t begin = iterator->first();
while (begin != icu::BreakIterator::DONE) {
int32_t end = iterator->next();
if (end == icu::BreakIterator::DONE) break;
// Skip non-word breaks (punctuation, whitespace)
if (iterator->getRuleStatus() == UBRK_WORD_NONE) {
begin = end;
continue;
}
// Extract word
auto word = source.tempSubStringBetween(begin, end);
std::string term;
word.toUTF8String(term);
tokens.push_back({
TokenType::kString,
std::move(term),
std::nullopt,
position++,
{begin, end}, // byte offsets (UTF-16 indices in this sketch)
{begin, end}, // char offsets
});
begin = end;
}
return tokens;
}
};
Thread Safety: The tokenizer maintains 32 iterator instances with per-iterator locks, enabling concurrent tokenization with minimal lock contention.
Word Break Rules: ICU's word break algorithm handles:
- Script boundaries (Latin, CJK, Arabic, etc.)
- Contractions ("don't" as one or two tokens based on locale)
- Numeric sequences ("3.14" as single token)
- Email addresses and URLs (configurable)
18.5.3 N-Gram Tokenizer
The NGramTokenizer generates character n-grams:
class NGramTokenizer {
int64_t min_size_ = 2;
int64_t max_size_ = 3;
bool track_offsets_ = false;
public:
auto tokenize(std::string_view text) const -> Tokens {
Tokens tokens;
int32_t position = 0;
// Generate n-grams for each size
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= text.size(); ++i) {
tokens.push_back({
TokenType::kString,
std::string(text.substr(i, n)),
std::nullopt,
position++,
track_offsets_ ? Offset{static_cast<int32_t>(i),
static_cast<int32_t>(i + n)}
: Offset{},
{},
});
}
}
return tokens;
}
};
Use Cases:
- Substring matching without wildcards
- CJK text where word boundaries are ambiguous
- Typo tolerance through partial matches
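The core n-gram loop can be sketched without the token plumbing. This byte-oriented version assumes single-byte characters for clarity; the real tokenizer must iterate code points, not bytes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Character n-grams of sizes min_n..max_n, assuming single-byte characters.
std::vector<std::string> char_ngrams(const std::string& text,
                                     std::size_t min_n, std::size_t max_n) {
    std::vector<std::string> grams;
    for (std::size_t n = min_n; n <= max_n; ++n) {
        for (std::size_t i = 0; i + n <= text.size(); ++i) {
            grams.push_back(text.substr(i, n));
        }
    }
    return grams;
}
```

For example, `char_ngrams("fox", 2, 3)` produces `["fo", "ox", "fox"]` — every substring of length 2 or 3, which is what makes wildcard-free substring matching possible.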
18.5.4 Keyword Tokenizer
The KeywordTokenizer emits the entire input as a single token:
class KeywordTokenizer {
public:
auto tokenize(std::string_view text) const -> Tokens {
return {{
TokenType::kString,
std::string(text),
std::nullopt,
0, // Position 0
{0, static_cast<int32_t>(text.size())},
{0, static_cast<int32_t>(text.size())},
}};
}
};
Use Cases:
- Exact match fields (product SKUs, IDs)
- Enumeration values
- Tags and categories
18.5.5 MeCab Tokenizer
The MeCabTokenizer provides morphological analysis for Japanese and Korean:
class MeCabTokenizer {
std::unique_ptr<MeCab::Tagger> tagger_;
public:
MeCabTokenizer() {
tagger_.reset(MeCab::createTagger("-Owakati"));
}
auto tokenize(std::string_view text) const -> Tokens {
// MeCab expects NUL-terminated input, so copy the view first
std::string input{text};
const char* result = tagger_->parse(input.c_str());
// Parse MeCab's space-separated output into tokens...
}
};
MeCab performs:
- Dictionary-based word segmentation
- Part-of-speech tagging
- Compound word decomposition
- Reading (furigana) extraction
18.5.6 Tokenizer Factory
Tokenizers are created through a factory:
class TokenizerFactory {
static const std::unordered_map<std::string_view, CreatorFn> kFactoryMap;
public:
static auto create(std::string_view type, const Value& options)
-> std::unique_ptr<TokenizerType> {
auto it = kFactoryMap.find(type);
if (it == kFactoryMap.end()) {
throw std::invalid_argument("Unknown tokenizer: " + std::string(type));
}
return it->second(options);
}
};
const std::unordered_map<std::string_view, CreatorFn>
TokenizerFactory::kFactoryMap = {
{"icu", create<ICUWordTokenizer>},
{"whitespace", create<WhitespaceTokenizer>},
{"keyword", create<KeywordTokenizer>},
{"ngram", create<NGramTokenizer>},
{"mecab", create<MeCabTokenizer>},
{"regex", create<RegexTokenizer>},
// ... more tokenizers
};
18.6 Token Filters
18.6.1 Token Filter Interface
Token filters transform the token stream:
class TokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens; // For indexing
auto normalize(const Tokens& tokens) const -> Tokens; // For queries
};
18.6.2 Lowercase Token Filter
The LowerCaseTokenFilter normalizes token case:
class LowerCaseTokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
result.reserve(tokens.size());
for (const auto& token : tokens) {
auto source = icu::UnicodeString::fromUTF8(token.token);
source.toLower();
std::string lowered;
source.toUTF8String(lowered);
result.push_back({
token.type,
std::move(lowered),
token.token, // Preserve original for highlighting
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
18.6.3 Stop Word Filter
The StopWordsTokenFilter removes common words:
class StopWordsTokenFilter {
std::unordered_set<std::string> stop_words_;
public:
explicit StopWordsTokenFilter(std::string_view language) {
stop_words_ = load_stop_words(language);
}
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
int32_t position_adjustment = 0;
for (const auto& token : tokens) {
if (stop_words_.contains(token.token)) {
++position_adjustment;
continue; // Skip stop word
}
result.push_back({
token.type,
token.token,
token.original_token,
token.position - position_adjustment, // Adjust position
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
Position Adjustment: When stop words are removed, subsequent token positions are decremented to maintain correct phrase matching semantics.
Language-Specific Lists: Each language has its own stop word list. English includes "the," "a," "an," "is," "are," etc.
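A stripped-down version of the position-adjustment logic, using an illustrative `SimpleToken` type in place of the full Token structure:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for the full Token structure.
struct SimpleToken {
    std::string term;
    int32_t position;
};

// Stop word removal with position renumbering, as described above.
std::vector<SimpleToken> remove_stops(
    const std::vector<SimpleToken>& in,
    const std::unordered_set<std::string>& stops) {
    std::vector<SimpleToken> out;
    int32_t adjustment = 0;
    for (const auto& t : in) {
        if (stops.count(t.term) > 0) {
            ++adjustment;  // every removed word shifts later positions left
            continue;
        }
        out.push_back({t.term, t.position - adjustment});
    }
    return out;
}
```

For input [("the", 0), ("quick", 1), ("brown", 2), ("fox", 3)] with "the" as a stop word, the surviving tokens carry positions 0, 1, 2 — so "quick brown" remains adjacent for phrase queries.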
18.6.4 Snowball Stemmer Filter
The SnowballTokenFilter reduces words to their stems:
class SnowballTokenFilter {
static constexpr size_t kMaxStemmers = 32;
std::string language_;
std::array<std::unique_ptr<std::mutex>, kMaxStemmers> locks_;
std::array<sb_stemmer*, kMaxStemmers> stemmers_;
mutable LRUCache<std::string, std::string> cache_; // 64KB cache
public:
SnowballTokenFilter(std::string_view language)
: language_(language),
cache_(64 * 1024) {
for (size_t i = 0; i < kMaxStemmers; ++i) {
locks_[i] = std::make_unique<std::mutex>();
stemmers_[i] = sb_stemmer_new(language_.c_str(), "UTF_8");
}
}
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
result.reserve(tokens.size());
for (const auto& token : tokens) {
// Check cache first
if (auto cached = cache_.get(token.token)) {
result.push_back({token.type, *cached, token.token,
token.position, token.offset_bytes,
token.offset_chars});
continue;
}
// Stem the token
size_t idx = std::hash<std::string>{}(token.token) % kMaxStemmers;
std::lock_guard lock(*locks_[idx]);
const sb_symbol* stemmed = sb_stemmer_stem(
stemmers_[idx],
reinterpret_cast<const sb_symbol*>(token.token.data()),
static_cast<int>(token.token.size())
);
std::string stem(reinterpret_cast<const char*>(stemmed),
sb_stemmer_length(stemmers_[idx]));
cache_.put(token.token, stem);
result.push_back({
token.type,
std::move(stem),
token.token, // Preserve original
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
// Query-time: typically skip stemming to preserve user intent
return tokens;
}
};
Supported Languages: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish, Arabic, and more.
Performance Optimization:
- 32 stemmer instances for concurrent access
- LRU cache (64KB) for frequently stemmed words
- Hash-based load distribution
18.6.5 N-Gram Token Filter
The NGramTokenFilter generates n-grams from tokens:
class NGramTokenFilter {
int64_t min_size_ = 2;
int64_t max_size_ = 2;
bool track_offsets_ = false;
bool allow_small_tokens_ = true;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
if (token.token.size() < min_size_ && allow_small_tokens_) {
result.push_back(token); // Keep small tokens as-is
continue;
}
// Generate n-grams
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= token.token.size(); ++i) {
result.push_back({
TokenType::kString,
token.token.substr(i, n),
std::nullopt,
token.position,
track_offsets_ ? compute_offset(token, i, n) : Offset{},
{},
});
}
}
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
// Query-time optimization: skip overlapping n-grams
Tokens result;
Offset prev_offset{};
for (const auto& token : tokens) {
// Generate n-grams but skip overlapping ones
for (int64_t n = min_size_; n <= max_size_; ++n) {
for (size_t i = 0; i + n <= token.token.size(); ++i) {
Offset current = compute_offset(token, i, n);
if (!Offset::is_overlapped(prev_offset, current)) {
result.push_back({TokenType::kString,
token.token.substr(i, n),
std::nullopt, token.position,
current, {}});
prev_offset = current;
}
}
}
}
return result;
}
};
18.6.6 Edge N-Gram Filter
The EdgeNGramTokenFilter generates n-grams anchored at token edges:
class EdgeNGramTokenFilter {
int64_t min_size_ = 1;
int64_t max_size_ = 2;
bool from_end_ = false; // true for suffix n-grams
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
for (int64_t n = min_size_;
n <= std::min(max_size_, static_cast<int64_t>(token.token.size()));
++n) {
if (from_end_) {
// Suffix n-gram
result.push_back({
TokenType::kString,
token.token.substr(token.token.size() - n, n),
std::nullopt,
token.position,
{},
{},
});
} else {
// Prefix n-gram
result.push_back({
TokenType::kString,
token.token.substr(0, n),
std::nullopt,
token.position,
{},
{},
});
}
}
}
return result;
}
};
Use Case: Prefix completion (autocomplete) queries. Indexing "hello" as ["h", "he"] enables prefix search.
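The prefix case reduces to taking leading substrings of increasing length; a minimal sketch (suffix mode, which mirrors this from the other end, is omitted):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Prefix (edge) n-grams; suffix mode would mirror this from the other end.
std::vector<std::string> edge_ngrams(const std::string& term,
                                     std::size_t min_n, std::size_t max_n) {
    std::vector<std::string> grams;
    for (std::size_t n = min_n; n <= std::min(max_n, term.size()); ++n) {
        grams.push_back(term.substr(0, n));
    }
    return grams;
}
```

`edge_ngrams("hello", 1, 2)` yields `["h", "he"]`, matching the indexing example above; an autocomplete query for "he" then hits the document through a plain term lookup.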
18.6.7 Shingle Filter
The ShingleTokenFilter creates word n-grams (shingles):
class ShingleTokenFilter {
int64_t min_size_ = 2;
int64_t max_size_ = 2;
std::string separator_ = " ";
bool output_unigrams_ = true;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
// Output unigrams if configured
if (output_unigrams_) {
result.insert(result.end(), tokens.begin(), tokens.end());
}
// Generate shingles
for (size_t i = 0; i < tokens.size(); ++i) {
for (int64_t n = min_size_; n <= max_size_; ++n) {
if (i + n > tokens.size()) break;
std::string shingle;
for (size_t j = 0; j < n; ++j) {
if (j > 0) shingle += separator_;
shingle += tokens[i + j].token;
}
result.push_back({
TokenType::kString,
std::move(shingle),
std::nullopt,
tokens[i].position,
{},
{},
});
}
}
return result;
}
};
Example: "quick brown fox" with 2-grams produces:
- Unigrams: ["quick", "brown", "fox"]
- Shingles: ["quick brown", "brown fox"]
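A minimal shingle generator reproducing the example above (unigram output and token metadata omitted):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Word shingles of a fixed size n, joined by a separator.
std::vector<std::string> shingles(const std::vector<std::string>& tokens,
                                  std::size_t n,
                                  const std::string& separator = " ") {
    std::vector<std::string> out;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string s = tokens[i];
        for (std::size_t j = 1; j < n; ++j) {
            s += separator + tokens[i + j];
        }
        out.push_back(s);
    }
    return out;
}
```

`shingles({"quick", "brown", "fox"}, 2)` produces `["quick brown", "brown fox"]`, exactly the 2-gram output listed above.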
18.6.8 ASCII Folding Filter
The ASCIIFoldingTokenFilter converts accented characters to ASCII:
class ASCIIFoldingTokenFilter {
static const std::unordered_map<char32_t, std::string> kFoldingMap;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
std::string folded;
folded.reserve(token.token.size());
// Iterate over UTF-8 code points
for (char32_t cp : iterate_utf8(token.token)) {
if (auto it = kFoldingMap.find(cp); it != kFoldingMap.end()) {
folded += it->second; // Folded ASCII equivalent
} else if (cp < 128) {
folded += static_cast<char>(cp); // Already ASCII
}
// Non-ASCII characters without mapping are removed
}
result.push_back({
token.type,
std::move(folded),
token.token,
token.position,
token.offset_bytes,
token.offset_chars,
});
}
return result;
}
};
// Example mappings
const std::unordered_map<char32_t, std::string>
ASCIIFoldingTokenFilter::kFoldingMap = {
{U'à', "a"}, {U'á', "a"}, {U'â', "a"}, // a with various accents
{U'ç', "c"}, // c with cedilla
{U'ñ', "n"}, // n with tilde
{U'ß', "ss"}, // German sharp s
// ... extensive mapping table
};
18.6.9 Double Metaphone Filter
The DoubleMetaphoneTokenFilter generates phonetic codes for sound-alike matching:
class DoubleMetaphoneTokenFilter {
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result;
for (const auto& token : tokens) {
auto [primary, secondary] = double_metaphone(token.token);
// Emit primary code
result.push_back({
TokenType::kString,
primary,
token.token,
token.position,
token.offset_bytes,
token.offset_chars,
});
// Emit secondary code if different
if (!secondary.empty() && secondary != primary) {
result.push_back({
TokenType::kString,
secondary,
token.token,
token.position, // Same position
token.offset_bytes,
token.offset_chars,
});
}
}
return result;
}
};
Example: "Smith" produces codes "SM0" and "XMT", matching "Smyth," "Schmidt," etc.
18.6.10 Chained Token Filters
Multiple filters compose in sequence:
class ChainedTokenFilter {
std::vector<std::unique_ptr<TokenFilterType>> filters_;
public:
auto transform(const Tokens& tokens) const -> Tokens {
Tokens result = tokens;
for (const auto& filter : filters_) {
result = filter->transform(result);
}
return result;
}
auto normalize(const Tokens& tokens) const -> Tokens {
Tokens result = tokens;
for (const auto& filter : filters_) {
result = filter->normalize(result);
}
return result;
}
};
Typical Chain: Lowercase -> Stop Words -> Stemming -> Length Filter.
18.7 Language Support
18.7.1 Multi-Language Analysis
Cognica supports language-specific analysis through:
- Stemmer selection: Snowball stemmers for 20+ languages
- Stop word lists: Language-specific common words
- Tokenization rules: Script-aware word breaking
class StandardAnalyzer {
public:
StandardAnalyzer(std::string_view language = "english") {
// Configure language-specific components
token_filters_ = std::make_unique<ChainedTokenFilter>();
token_filters_->add(std::make_unique<LowerCaseTokenFilter>());
token_filters_->add(std::make_unique<StopWordsTokenFilter>(language));
token_filters_->add(std::make_unique<SnowballTokenFilter>(language));
}
};
18.7.2 CJK Analysis
Chinese, Japanese, and Korean text requires special handling due to:
- No whitespace: Words are not separated by spaces
- Character-based: Each character may be a word
- Ambiguous boundaries: Multiple valid segmentations exist
The StandardCJKAnalyzer addresses these challenges:
class StandardCJKAnalyzer : public Analyzer {
public:
StandardCJKAnalyzer(const Value& options) {
// Parse options
auto ngram_type = options.get("ngram_type", "normal");
auto min_size = options.get("min_size", 1);
auto max_size = options.get("max_size", 2);
auto min_length = options.get("min_length", 1); // byte-length bounds for the
auto max_length = options.get("max_length", 255); // ByteLengthTokenFilter below
// Configure CJK-specific pipeline
tokenizer_ = std::make_unique<ICUWordTokenizer>();
token_filters_ = std::make_unique<ChainedTokenFilter>();
token_filters_->add(std::make_unique<LowerCaseTokenFilter>());
// Add n-gram filter for CJK
if (ngram_type == "edge") {
token_filters_->add(
std::make_unique<EdgeNGramTokenFilter>(min_size, max_size));
} else {
token_filters_->add(
std::make_unique<NGramTokenFilter>(min_size, max_size));
}
token_filters_->add(
std::make_unique<ByteLengthTokenFilter>(min_length, max_length));
}
};
N-Gram Strategy: For CJK text, generating 1-2 character n-grams ensures that any substring can be matched, compensating for ambiguous word boundaries.
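The bigram case can be sketched over code points; using std::u32string sidesteps UTF-8 decoding, so each element is one character regardless of its byte length. A common illustration is 東京都 ("Tokyo Metropolis"), whose segmentation is ambiguous between 東京 + 都 and 東 + 京都:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Code-point bigrams sketched over std::u32string, so each element is one
// character regardless of its UTF-8 byte length.
std::vector<std::u32string> cjk_bigrams(const std::u32string& text) {
    std::vector<std::u32string> grams;
    for (std::size_t i = 0; i + 2 <= text.size(); ++i) {
        grams.push_back(text.substr(i, 2));
    }
    return grams;
}
```

`cjk_bigrams(U"東京都")` yields both 東京 ("Tokyo") and 京都 ("Kyoto"), so a query for either matches — recall is preserved at the cost of some precision.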
18.7.3 Japanese with MeCab
For higher-quality Japanese analysis, the MeCab tokenizer provides morphological analysis:
auto tokens = mecab_tokenizer.tokenize("私は寿司を食べる。"); // "I eat sushi."
// Result:
// Token{term="私", pos="pronoun", ...} (watashi)
// Token{term="は", pos="particle", ...} (ha)
// Token{term="寿司", pos="noun", ...} (sushi)
// Token{term="を", pos="particle", ...} (wo)
// Token{term="食べる", pos="verb", ...} (taberu)
MeCab uses a dictionary-based approach with statistical disambiguation, producing more meaningful tokens than character n-grams.
18.8 ICU Integration
18.8.1 Unicode String Handling
Cognica uses ICU for all Unicode operations:
#include <unicode/unistr.h>
#include <unicode/brkiter.h>
#include <unicode/translit.h>
// UTF-8 to ICU UnicodeString
auto source = icu::UnicodeString::fromUTF8(text);
// Process with ICU
source.toLower();
// Back to UTF-8
std::string result;
source.toUTF8String(result);
18.8.2 Word Break Detection
ICU's BreakIterator provides sophisticated word boundary detection:
UErrorCode status = U_ZERO_ERROR;
auto iterator = std::unique_ptr<icu::BreakIterator>{
icu::BreakIterator::createWordInstance(
icu::Locale::getRoot(), status)
};
iterator->setText(source);
int32_t start = iterator->first();
while (start != icu::BreakIterator::DONE) {
int32_t end = iterator->next();
// Check if this is a word (not punctuation/space)
if (iterator->getRuleStatus() != UBRK_WORD_NONE) {
// Process word from start to end
}
start = end;
}
Rule Status Values:
- UBRK_WORD_NONE: Not a word (whitespace, punctuation)
- UBRK_WORD_NUMBER: Numeric sequence
- UBRK_WORD_LETTER: Alphabetic word
- UBRK_WORD_KANA: Japanese kana
- UBRK_WORD_IDEO: Ideographic (CJK)
18.8.3 Transliteration
ICU transliterators perform complex character transformations:
// Create transliterator with rule
auto transliterator = icu::Transliterator::createInstance(
"NFD; [:Mn:] Remove; NFC", // Normalize, remove combining marks, recompose
UTRANS_FORWARD,
status
);
// Apply transformation
auto text = icu::UnicodeString::fromUTF8("café");
transliterator->transliterate(text);
// Result: "cafe" (accent removed)
Common Rules:
- "NFD; [:Mn:] Remove; NFC": Remove accents
- "Any-Latin": Convert any script to Latin
- "Hiragana-Katakana": Convert Japanese scripts
18.9 Custom Analyzer Configuration
18.9.1 Configuration Schema
Custom analyzers are configured through JSON:
{
"type": "custom",
"char_filters": [
{"type": "normalization", "form": "NFKC"},
{"type": "lower_case"}
],
"tokenizer": {
"type": "icu"
},
"token_filters": [
{"type": "stopwords", "language": "english"},
{"type": "snowball", "language": "english"},
{"type": "char_length", "min": 2, "max": 50}
]
}
18.9.2 Custom Analyzer Factory
class CustomAnalyzer : public Analyzer {
public:
CustomAnalyzer(
std::unique_ptr<ChainedCharacterFilter> char_filters,
std::unique_ptr<TokenizerType> tokenizer,
std::unique_ptr<ChainedTokenFilter> token_filters)
: char_filters_(std::move(char_filters)),
tokenizer_(std::move(tokenizer)),
token_filters_(std::move(token_filters)) {}
static auto from_config(const Value& config)
-> std::unique_ptr<CustomAnalyzer> {
// Build character filter chain
auto char_filters = std::make_unique<ChainedCharacterFilter>();
for (const auto& cf_config : config["char_filters"]) {
char_filters->add(CharFilterFactory::create(cf_config));
}
// Build tokenizer
auto tokenizer = TokenizerFactory::create(
config["tokenizer"]["type"], config["tokenizer"]);
// Build token filter chain
auto token_filters = std::make_unique<ChainedTokenFilter>();
for (const auto& tf_config : config["token_filters"]) {
token_filters->add(TokenFilterFactory::create(tf_config));
}
return std::make_unique<CustomAnalyzer>(
std::move(char_filters),
std::move(tokenizer),
std::move(token_filters)
);
}
};
18.10 Performance Considerations
18.10.1 Thread Safety
All analyzers are designed for concurrent use:
- ICU Tokenizer: Pool of 32 BreakIterator instances with per-instance locks
- Snowball Stemmer: Pool of 32 stemmer instances with LRU cache
- Immutable configuration: Analyzer settings cannot change after construction
18.10.2 Memory Efficiency
Token filters operate on token streams without allocating intermediate strings:
// Efficient: modify in place where possible
for (auto& token : tokens) {
to_lowercase_inplace(token.token);
}
// Avoid: creating new string for each token
for (const auto& token : tokens) {
result.push_back({..., to_lowercase(token.token), ...}); // Extra allocation
}
18.10.3 Caching
Frequently used transformations are cached:
- Stemmer cache: LRU cache for stemmed forms (64KB per language)
- Analyzer cache: Parsed analyzer configurations
- Stop word sets: Loaded once, shared across instances
18.11 Summary
The text analysis pipeline is the foundation of full-text search, transforming unstructured text into searchable tokens through a carefully designed sequence of transformations. The key concepts covered in this chapter are:
Three-Stage Pipeline: Character filters operate on raw text, tokenizers segment text into tokens, and token filters transform the token stream. This modular design enables flexible configuration for diverse use cases.
Unicode Support: ICU integration provides proper handling of international text, including Unicode normalization, script-aware word breaking, and locale-specific case folding.
Linguistic Processing: Stop word removal eliminates noise, stemming normalizes morphological variations, and phonetic encoding enables sound-alike matching.
Dual-Mode Analysis: The distinction between tokenize (index-time) and normalize (query-time) ensures consistent matching while preserving query semantics.
Language Support: Built-in support for 20+ languages through Snowball stemmers, language-specific stop word lists, and specialized tokenizers (MeCab for Japanese/Korean).
CJK Handling: N-gram and edge n-gram filters address the unique challenges of Chinese, Japanese, and Korean text where word boundaries are ambiguous.
Type Erasure: The te::poly pattern enables runtime composition of analyzers without virtual function overhead, supporting custom analyzer configurations.
The text analysis pipeline sets the stage for the scoring algorithms covered in the next chapter, where we explore how matched tokens are ranked using BM25 and Bayesian BM25.
References
- Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program.
- Unicode Consortium. (2023). Unicode Standard Annex #29: Unicode Text Segmentation.
- ICU Project. (2023). ICU User Guide. https://unicode-org.github.io/icu/
- Snowball. (2023). Snowball Stemming Algorithms. https://snowballstem.org/
- Kudo, T. (2006). MeCab: Yet Another Part-of-Speech and Morphological Analyzer.