Language Priority
Explanation: When processing multi-language content, Spotbot prioritizes languages in the order below, from highest priority (0) to lowest:
- nl (Dutch) - Highest priority for Dutch business context
- en (English)
- de (German)
- fr (French)
- Collection of EU languages
- unknown - Language detection failed
- others - All other languages
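The ordering above can be sketched as a small ranking function. This is a minimal illustration, not Spotbot's actual code: it assumes ranks are the zero-based positions in the list, and `EU_LANGUAGES`, `language_priority`, and `sort_by_priority` are hypothetical names. The members of the EU collection are not specified in this document, so the set below is only a placeholder.

```python
# Ranks follow the list order (0 = highest priority).
PRIORITY = {"nl": 0, "en": 1, "de": 2, "fr": 3}
EU_PRIORITY = 4
UNKNOWN_PRIORITY = 5
OTHER_PRIORITY = 6

# Placeholder only: the real "collection of EU languages" is not listed here.
EU_LANGUAGES = frozenset({"es", "it", "pt", "pl", "sv", "da"})


def language_priority(lang):
    """Return the priority rank for a language code (lower = more important)."""
    if lang is None or lang == "unknown":
        return UNKNOWN_PRIORITY  # language detection failed
    lang = lang.lower()
    if lang in PRIORITY:
        return PRIORITY[lang]
    if lang in EU_LANGUAGES:
        return EU_PRIORITY
    return OTHER_PRIORITY


def sort_by_priority(langs):
    """Order language codes from highest to lowest priority (stable sort)."""
    return sorted(langs, key=language_priority)
```

For example, `sort_by_priority(["ja", "fr", "unknown", "nl"])` puts `nl` first and `ja` last.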
Character Limits
Rule 1: Maximum 1M characters per registration
No single company can have more than 1 million characters indexed in SOLR.
Rule 2: ≤2k characters with priority < 6 → Skip other languages
If content is 2,000 characters or less and has a priority below 6, indexing of other languages is skipped.
Why: Small high-priority content doesn't need multi-language indexing.
Rule 3: >100k characters → Keep nl+en only
If content exceeds 100,000 characters, only index Dutch (nl) and English (en).
Why: Large websites don't need every language indexed, and dropping the rest helps them stay under the 1M limit.
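One possible reading of the three rules, as a sketch over per-language character counts for a single registration. Several things here are assumptions rather than documented behavior: Rule 2 is read as "the total is at most 2k and the best-ranked language present has rank below 6, so only that language is kept"; Rule 1 is enforced by truncating in priority order; and `apply_character_limits`, `rank`, and the `"eu"` key are hypothetical. The sketch works on one concatenated total and does not model sequential per-URL processing.

```python
MAX_TOTAL = 1_000_000   # Rule 1
SMALL_LIMIT = 2_000     # Rule 2
LARGE_LIMIT = 100_000   # Rule 3

# Zero-based ranks from the priority list. "eu" is a placeholder key
# standing in for the EU-language collection, not a real language code.
RANK = {"nl": 0, "en": 1, "de": 2, "fr": 3, "eu": 4, "unknown": 5}
OTHER_RANK = 6


def rank(lang):
    return RANK.get(lang, OTHER_RANK)


def apply_character_limits(chars_by_lang):
    """Return {language: characters to index} after applying Rules 1-3."""
    total = sum(chars_by_lang.values())
    ordered = sorted(chars_by_lang, key=rank)  # best-ranked language first

    if total > LARGE_LIMIT:
        # Rule 3: large content keeps Dutch and English only.
        kept = [lang for lang in ordered if lang in ("nl", "en")]
    elif ordered and total <= SMALL_LIMIT and rank(ordered[0]) < 6:
        # Rule 2 (assumed reading): small content keeps its best language only.
        kept = ordered[:1]
    else:
        kept = ordered

    # Rule 1 (assumed mechanism): spend a 1M budget in priority order.
    budget = MAX_TOTAL
    result = {}
    for lang in kept:
        take = min(chars_by_lang[lang], budget)
        if take > 0:
            result[lang] = take
        budget -= take
    return result
```

Under these assumptions, a registration with 60k nl, 30k en, and 20k de characters (110k in total) would keep only the nl and en content.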
Multiple Correct URLs
Edge case: When a company has multiple url_active entries (Correct URLs), each URL is processed sequentially and all content is concatenated. The character limit rules apply to the total concatenated content across all URLs.
Example (edge case): Company has 2 active URLs:
- URL 1: 60k characters (nl), 30k characters (en), 20k characters (unknown)
- URL 2: 40k characters (nl only)
- Result: From URL 1, 90k characters of nl and en plus 10k characters of unknown-language content are indexed, and from URL 2, 40k characters of nl. The 140k total exceeds the 100k limit, but it is indexed anyway because the URLs are processed sequentially.
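The totals in this example can be checked with a short tally. The figures are copied from the example (in thousands of characters); the variable names are illustrative only, and the snippet does not reproduce the indexing logic itself.

```python
# Character counts from the example, in thousands of characters.
extracted = {
    "URL 1": {"nl": 60, "en": 30, "unknown": 20},
    "URL 2": {"nl": 40},
}
indexed = {
    "URL 1": {"nl": 60, "en": 30, "unknown": 10},
    "URL 2": {"nl": 40},
}


def total(per_url):
    """Sum the character counts over all URLs and languages."""
    return sum(sum(by_lang.values()) for by_lang in per_url.values())


extracted_total = total(extracted)  # 150
indexed_total = total(indexed)      # 140
over_limit = indexed_total > 100    # True: above the 100k threshold
```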
Indexing Workflow
- Trigger: Indexing is triggered when an entity is inserted into BIQ (bedrijf_index_queue). Any process that changes an entity or anything related to an entity is responsible for adding it to the queue.
- Fetch entity: Get the entity document (bedrijf) and all related information, including employees, SBI codes, rechtsvorm (legal form), oprichting (founding date), and other entity properties.
- Handle multi-value properties: Some properties have multiple values (labels, websites) and are handled as sub-children of the root document. These sub-children belong to the root entity and can only be fetched in relation to it.
- Fetch websites (custom rules): Get all url_active (Correct URLs) for the entity.
- Extract text: Extract the text and metadata of each page from NFS.
- Sort by language: Sort pages by detected language (if the language is not defined, attempt to detect it).
- Apply character limits: Check the 1M max, the 2k rule, and the 100k rule on the concatenated content.
- Index to SOLR: Store the processed content in the search index with bedrijf as the root document.
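The workflow steps can be sketched as a single function with its dependencies passed in. Everything named here is hypothetical: `index_entity`, `Page`, the callables (`fetch_entity`, `fetch_active_urls`, `extract_pages`, `detect_language`, `apply_limits`, `store`), and the `children` key are stand-ins for the database, NFS, language detection, and SOLR calls, none of which are specified in this document. The queue trigger is left out; the function covers what happens once an entity has been picked up from the queue.

```python
from dataclasses import dataclass
from typing import Optional

# Zero-based ranks from the priority list; rank 4 (EU collection) is
# omitted for brevity, and anything unlisted falls back to rank 6.
RANK = {"nl": 0, "en": 1, "de": 2, "fr": 3, "unknown": 5}
OTHER_RANK = 6


@dataclass
class Page:
    text: str
    language: Optional[str] = None  # None means "not yet detected"


def index_entity(entity_id, fetch_entity, fetch_active_urls, extract_pages,
                 detect_language, apply_limits, store):
    """Run the indexing steps for one queued entity and return the document."""
    # Fetch entity: root document plus related properties.
    entity = fetch_entity(entity_id)

    # Fetch websites and extract text: URLs are processed one after another
    # and their pages are concatenated.
    pages = []
    for url in fetch_active_urls(entity_id):
        pages.extend(extract_pages(url))

    # Sort by language: detect missing languages first, then order by rank.
    for page in pages:
        if page.language is None:
            page.language = detect_language(page.text) or "unknown"
    pages.sort(key=lambda p: RANK.get(p.language, OTHER_RANK))

    # Apply character limits on the concatenated content.
    pages = apply_limits(pages)

    # Index: pages become sub-children of the root entity document.
    document = dict(entity)
    document["children"] = [
        {"language": p.language, "text": p.text} for p in pages
    ]
    store(document)
    return document
```

Passing the dependencies as arguments keeps the sketch runnable with in-memory stubs; in a real deployment these would be replaced by the actual database, file system, and search index clients.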