🔍 Spotbot Indexer

SOLR indexing with language priority and character limits

Language Priority

nl → en → de → fr → eu → unknown → others

Explanation: When processing multi-language content, Spotbot prioritizes languages in this order (1 = highest priority):

  1. nl (Dutch) - Highest priority for Dutch business context
  2. en (English)
  3. de (German)
  4. fr (French)
  5. eu - Collection of EU languages
  6. unknown - Language detection failed
  7. others - All other languages
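The priority order above can be sketched as a small lookup. This is a minimal illustration, not Spotbot's actual code: the function name `language_priority` is hypothetical, and the set of codes grouped under "eu" is an assumption, since the source does not list which EU languages belong to that group.

```python
# Priorities 1-4 come straight from the list above.
FIXED_PRIORITY = {"nl": 1, "en": 2, "de": 3, "fr": 4}

# ASSUMPTION: illustrative subset only; the real "eu" collection is not listed.
EU_LANGUAGES = {"es", "it", "pt", "pl", "sv", "da"}


def language_priority(lang):
    """Return 1 (highest) to 7 (lowest) for a detected language code."""
    code = (lang or "unknown").lower()
    if code in FIXED_PRIORITY:
        return FIXED_PRIORITY[code]
    if code in EU_LANGUAGES:
        return 5
    if code == "unknown":
        return 6  # language detection failed
    return 7      # all other languages


# Sorting language codes by priority puts Dutch first and "others" last.
ordered = sorted(["ja", "fr", "unknown", "nl", "es"], key=language_priority)
print(ordered)  # ['nl', 'fr', 'es', 'unknown', 'ja']
```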

Character Limits

Rule 1: Maximum 1M characters per registration

No single company can have more than 1 million characters indexed in SOLR.

Rule 2: ≤2k characters with priority < 6 → Skip other languages

If content is 2,000 characters or less and has priority 1-5, skip indexing other languages.

Why: Small high-priority content doesn't need multi-language indexing.

Rule 3: >100k characters → Keep nl+en only

If content exceeds 100,000 characters, only index Dutch (nl) and English (en).

Why: Large websites don't need every language indexed, and keeping only nl and en helps them stay under the 1M limit.
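One possible reading of the three rules is sketched below. Several points are assumptions rather than facts from this document: the function name `apply_character_limits`, the order in which the rules are checked, reading Rule 2 as "total content is 2,000 characters or less and the best language present has priority below 6", and enforcing Rule 1 by truncating in priority order.

```python
MAX_TOTAL_CHARS = 1_000_000    # Rule 1
SMALL_CONTENT_CHARS = 2_000    # Rule 2
LARGE_CONTENT_CHARS = 100_000  # Rule 3


def apply_character_limits(text_by_lang, priority_of):
    """text_by_lang: {language code: text}; priority_of: code -> 1..7."""
    if not text_by_lang:
        return {}
    ordered = sorted(text_by_lang, key=priority_of)
    total = sum(len(text) for text in text_by_lang.values())

    if total > LARGE_CONTENT_CHARS:
        # Rule 3: large content keeps Dutch and English only.
        # Read literally, nothing is kept if neither language is present.
        kept = [lang for lang in ordered if lang in ("nl", "en")]
    elif total <= SMALL_CONTENT_CHARS and priority_of(ordered[0]) < 6:
        # Rule 2 (assumed reading): keep only the best-priority language.
        kept = ordered[:1]
    else:
        kept = ordered

    # Rule 1 (assumed enforcement): truncate in priority order at 1M total.
    result, budget = {}, MAX_TOTAL_CHARS
    for lang in kept:
        if budget <= 0:
            break
        result[lang] = text_by_lang[lang][:budget]
        budget -= len(result[lang])
    return result


# Usage with a simplified priority lookup (eu and unknown omitted for brevity).
order = {"nl": 1, "en": 2, "de": 3, "fr": 4}
prio = lambda lang: order.get(lang, 7)

small = apply_character_limits({"en": "a" * 500, "de": "b" * 300}, prio)
large = apply_character_limits({"nl": "x" * 80_000, "de": "y" * 50_000}, prio)
print(list(small), list(large))  # ['en'] ['nl']
```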

Multiple Correct URLs

Edge case: When a company has multiple url_active entries (Correct URLs), each URL is processed sequentially and all content is concatenated. The character limit rules apply to the total concatenated content across all URLs.

Example (edge case): Company has 2 active URLs:
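As a purely hypothetical illustration (the URLs and character counts below are invented, not taken from a real registration), this sketch sums page lengths across both URLs before any rule is checked. Neither URL exceeds 100,000 characters on its own, but the combined total does, so Rule 3 would apply to the company as a whole.

```python
# HYPOTHETICAL data: (language, character count) per page, grouped by URL.
pages_by_url = {
    "https://example.nl": [("nl", 60_000), ("en", 30_000)],   # 90,000 chars
    "https://example.com": [("en", 25_000), ("de", 10_000)],  # 35,000 chars
}


def total_chars_by_language(pages_by_url):
    """Sum character counts per language across all URLs, in URL order."""
    totals = {}
    for _url, pages in pages_by_url.items():  # URLs are handled one by one
        for lang, n_chars in pages:
            totals[lang] = totals.get(lang, 0) + n_chars
    return totals


totals = total_chars_by_language(pages_by_url)
combined = sum(totals.values())

print(totals)              # {'nl': 60000, 'en': 55000, 'de': 10000}
print(combined)            # 125000
print(combined > 100_000)  # True: the 100k rule applies to the total
```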

Indexing Workflow

  1. Trigger: Index is triggered when an entity is inserted into BIQ (bedrijf_index_queue). Any process that changes an entity or anything related to an entity is responsible for adding it to the queue.
  2. Fetch entity: Get the entity document (bedrijf) and all related information, including employees, SBI codes, rechtsvorm (legal form), oprichting (founding date), and other entity properties.
  3. Handle multi-value properties: Some properties have multiple values (labels, websites) and are handled as sub-children of the root document.
    These sub-children belong to and can only be fetched in relation to the root entity.
  4. Fetch websites (custom rules): Get all url_active (Correct URLs) for the entity.
  5. Extract text: Extract each page's text and metadata from NFS.
  6. Sort by language: Sort pages by detected language (if a page's language is not defined, attempt to detect it).
  7. Apply character limits: Check the 1M maximum, the 2k rule, and the 100k rule against the concatenated content.
  8. Index to SOLR: Store the processed content in the search index with bedrijf as the root document.
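Steps 3, 6, and 8 can be sketched as follows. This is an outline under stated assumptions, not the actual Spotbot implementation: the `Page` class, the `detect_language` placeholder, and the field names in the resulting document (`websites`, `url`, `lang`, `text`) are all hypothetical, and the real SOLR schema is not described here. No SOLR client calls are shown.

```python
from dataclasses import dataclass
from typing import Optional

# Priority order from the Language Priority section; anything else is 7.
ORDER = {"nl": 1, "en": 2, "de": 3, "fr": 4, "eu": 5, "unknown": 6}


@dataclass
class Page:
    url: str
    text: str
    lang: Optional[str] = None  # None when no language is defined


def detect_language(text):
    """PLACEHOLDER detector (assumption); always reports a failed detection."""
    return "unknown"


def sort_pages(pages):
    """Step 6: fill in missing languages, then sort by language priority."""
    for page in pages:
        if page.lang is None:
            page.lang = detect_language(page.text)
    # sorted() is stable, so pages in the same language keep their order.
    return sorted(pages, key=lambda p: ORDER.get(p.lang, 7))


def build_root_document(entity, pages):
    """Steps 3 and 8: attach websites as children of the bedrijf root."""
    children = [
        {"url": p.url, "lang": p.lang, "text": p.text} for p in sort_pages(pages)
    ]
    return {**entity, "websites": children}


doc = build_root_document(
    {"id": "bedrijf-1"},
    [Page("u1", "hallo", "de"), Page("u2", "???"), Page("u3", "hoi", "nl")],
)
print([child["lang"] for child in doc["websites"]])  # ['nl', 'de', 'unknown']
```

Keeping the websites nested under the root mirrors the note in step 3 that sub-children can only be fetched in relation to the root entity.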