Spotbot Indexer - Innovatiespotter Pipeline

Language Priority

nl → en → de → fr → eu → unknown → others

Explanation: When processing multi-language content, Spotbot prioritizes languages in this order (0 = highest priority):

nl (Dutch) - Highest priority for Dutch business context
en (English)
de (German)
fr (French)
eu - Collection of EU languages
unknown - Language detection failed
others - All other languages
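
The ordering above can be sketched as a lookup table. This is a minimal illustration, not the Spotbot implementation: it assumes 0-based priorities (following "0 = highest priority"), and the names LANGUAGE_PRIORITY, EU_LANGUAGES, language_priority and sort_pages_by_language are hypothetical. The document does not list which languages the eu bucket contains, so the set below is a placeholder.

```python
# Assumed 0-based priorities, following "0 = highest priority" in the text.
LANGUAGE_PRIORITY = {"nl": 0, "en": 1, "de": 2, "fr": 3, "eu": 4, "unknown": 5}
OTHERS_PRIORITY = 6

# Placeholder only: the real membership of the "eu" bucket is not documented here.
EU_LANGUAGES = {"es", "it", "pt", "pl", "sv", "da", "fi"}


def language_priority(lang):
    """Return the priority of a language code (lower number = higher priority)."""
    if not lang:
        # No language available: treat as failed detection.
        return LANGUAGE_PRIORITY["unknown"]
    if lang in LANGUAGE_PRIORITY:
        return LANGUAGE_PRIORITY[lang]
    if lang in EU_LANGUAGES:
        return LANGUAGE_PRIORITY["eu"]
    return OTHERS_PRIORITY


def sort_pages_by_language(pages):
    """Sort page dicts by language priority; sorted() is stable, so page order
    within one language is preserved."""
    return sorted(pages, key=lambda page: language_priority(page.get("lang")))
```

For example, sort_pages_by_language([{"lang": "fr"}, {"lang": "nl"}, {"lang": "ja"}]) would place the nl page first and the ja page last.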

Character Limits

Rule 1: Maximum 1M characters per registration

No single company can have more than 1 million characters indexed in SOLR.

Rule 2: ≤2k characters with priority < 6 → Skip other languages

If content is 2,000 characters or less and its language priority is below 6, skip indexing other languages.

Why: Small high-priority content doesn't need multi-language indexing.

Rule 3: >100k characters → Keep nl+en only

If content exceeds 100,000 characters, only index Dutch (nl) and English (en).

Why: Large websites do not need every language indexed, and keeping only nl and en helps them stay under the 1M limit.
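
The three thresholds can be written down as small predicates, which makes the boundary cases explicit (2,000 is still "small", 100,000 is not yet "large"). This is a sketch under assumptions: the constant and function names are hypothetical, and the text does not say how the 1M maximum is enforced, so plain truncation is assumed here.

```python
MAX_CHARS_PER_REGISTRATION = 1_000_000  # Rule 1
SMALL_CONTENT_LIMIT = 2_000             # Rule 2
LARGE_CONTENT_LIMIT = 100_000           # Rule 3


def cap_to_max(text):
    """Rule 1 (assumed enforcement): truncate to at most 1M characters."""
    return text[:MAX_CHARS_PER_REGISTRATION]


def skip_other_languages(char_count, priority):
    """Rule 2: content of 2,000 characters or less with priority below 6."""
    return char_count <= SMALL_CONTENT_LIMIT and priority < 6


def keep_only_nl_en(total_chars):
    """Rule 3: content strictly larger than 100,000 characters."""
    return total_chars > LARGE_CONTENT_LIMIT
```

With these definitions, skip_other_languages(2_000, 0) holds while skip_other_languages(2_001, 0) does not, and keep_only_nl_en(100_000) is false while keep_only_nl_en(100_001) is true.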

Multiple Correct URLs

Edge case: When a company has multiple url_active entries (Correct URLs), each URL is processed sequentially and all content is concatenated. The character limit rules apply to the total concatenated content across all URLs.

Example (edge case): Company has 2 active URLs:

URL 1: 60k characters (nl), 30k characters (en), 20k characters (unknown)
URL 2: 40k characters (nl only)
Result: From URL 1, 90k characters of nl and en plus 10k characters of unknown-language content are indexed. From URL 2, 40k characters of nl are indexed. The 140k total exceeds the 100k limit but is indexed anyway, because the URLs are processed sequentially.
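
One possible reading that reproduces the numbers in this example is a running total: nl and en content is always kept, and content in any other language is admitted only while the running total stays within 100k. The actual mechanism is not described here, so the sketch below is an assumption; index_sequentially and its input shape are hypothetical, and the 1M maximum is left out for brevity.

```python
NL_EN = {"nl", "en"}
LARGE_CONTENT_LIMIT = 100_000


def index_sequentially(urls):
    """urls: one list per URL of (language, char_count) tuples, each list
    already sorted by language priority. Returns (indexed_parts, total)."""
    total = 0
    indexed = []
    for url_number, parts in enumerate(urls, start=1):
        for lang, count in parts:
            if lang in NL_EN:
                take = count  # assumed: nl and en are always kept
            else:
                # assumed: other languages only fill what is left up to 100k
                take = max(0, min(count, LARGE_CONTENT_LIMIT - total))
            if take:
                indexed.append((url_number, lang, take))
                total += take
    return indexed, total


example = [
    [("nl", 60_000), ("en", 30_000), ("unknown", 20_000)],  # URL 1
    [("nl", 40_000)],                                       # URL 2
]
indexed, total = index_sequentially(example)
# indexed: 60k nl, 30k en, 10k unknown from URL 1, then 40k nl from URL 2
# total: 140_000
```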

Indexing Workflow

Trigger: Index is triggered when an entity is inserted into BIQ (bedrijf_index_queue). Any process that changes an entity or anything related to an entity is responsible for adding it to the queue.
Fetch entity: Get entity document (bedrijf) and all related information including employees, SBI codes, rechtsvorm, oprichting (founding date), and other entity properties
Handle multi-value properties: Some properties have multiple values (labels, websites) and are handled as sub-children of the root document. These sub-children belong to the root entity and can only be fetched in relation to it.
Fetch websites (custom rules): Get all url_active (Correct URLs) for the entity
Extract text: Extract the text and metadata of each page from NFS
Sort by language: Sort pages by detected language (if a page's language is not defined, attempt to detect it)
Apply character limits: Apply the 1M maximum, the 2k rule, and the 100k rule to the concatenated content
Index to SOLR: Store processed content in search index with bedrijf as root document
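
The workflow steps can be tied together in a small orchestration sketch. Everything here is hypothetical: index_entity and the callables passed into it stand in for components this document does not describe, and the "websites" key is an illustrative field name, not the actual SOLR schema.

```python
def index_entity(
    entity_id,
    fetch_entity,       # hypothetical: entity_id -> dict of entity properties
    fetch_active_urls,  # hypothetical: entity_id -> list of url_active strings
    extract_pages,      # hypothetical: url -> list of {"text": str, "lang": str or None}
    detect_language,    # hypothetical: text -> language code
    priority_of,        # hypothetical: language code -> priority number
    apply_limits,       # hypothetical: list of pages -> list of pages to keep
    send_to_solr,       # hypothetical: document dict -> None
):
    """Build one root document for an entity and hand it to the indexer."""
    root = dict(fetch_entity(entity_id))  # bedrijf is the root document

    pages = []
    for url in fetch_active_urls(entity_id):  # URLs are processed sequentially
        for page in extract_pages(url):
            # Use the stored language if present, otherwise try to detect it.
            lang = page.get("lang") or detect_language(page["text"])
            pages.append({"url": url, "lang": lang, "text": page["text"]})

    pages.sort(key=lambda page: priority_of(page["lang"]))  # stable sort
    root["websites"] = apply_limits(pages)  # sub-children of the root document

    send_to_solr(root)
    return root
```

Passing the collaborators in as arguments keeps the sketch self-contained and makes each step easy to replace with a fake in a test.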