Entity Types
⚠️ Entity (SOLR/SQL field/table bedrijf)
All records (including stichting, vereniging, etc.), with or without an address. Includes inactive records such as bankrupt records.
Be careful: this legacy IT naming covers much more than what the Business team defines as "companies".
Registration / Registratie
All records (including stichting, vereniging, etc.) that are active and have an address.
Company
All records that are active and have an address, excluding some rechtsvormen like stichting, vereniging, etc.
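The three definitions above nest inside each other: every Company is a Registration, and every Registration is an Entity. A minimal sketch of that nesting follows. The `Record` class, its field names, and the exact contents of `EXCLUDED_RECHTSVORMEN` are assumptions made for illustration; the source only says that "some rechtsvormen like stichting, vereniging, etc." are excluded.

```python
from dataclasses import dataclass

# Assumption: illustrative subset only. The real exclusion list is not given here.
EXCLUDED_RECHTSVORMEN = {"stichting", "vereniging"}


@dataclass
class Record:
    """Hypothetical shape of one row in the bedrijf table."""
    active: bool
    has_address: bool
    rechtsvorm: str


def entity_tiers(record: Record) -> list[str]:
    """Return every glossary tier this record belongs to, broadest first."""
    tiers = ["entity"]  # every record counts as an Entity
    if record.active and record.has_address:
        tiers.append("registration")
        if record.rechtsvorm.lower() not in EXCLUDED_RECHTSVORMEN:
            tiers.append("company")
    return tiers
```

Under these assumptions, an active stichting with an address is an Entity and a Registration but not a Company, which is the gap the warning on the Entity term is about.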
Startups
Companies founded less than 10 years ago that are (or might be) innovative.
Innovative Registration/Company
What the client sees as innovative. What is included in Innovatiethemas.
Innovation & Classification
Innovatiethema (Innovation Theme)
Our innovation labels: agrifood, bouw, circulaire economie, energie, health (LSH), ICT, hightech (HTSM), logistiek, sociale impact, water.
Stored as public labels (im_public_*).
Note: "Topic" is the deprecated term for Innovatiethema.
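Because Innovatiethemas are stored under the `im_public_*` prefix, they can be separated from other labels by prefix matching. The sketch below assumes labels arrive as plain strings; the concrete label values used in the usage note (such as `im_public_energie`) are hypothetical, since the source gives only the prefix pattern.

```python
PUBLIC_PREFIX = "im_public_"


def innovation_themes(labels: list[str]) -> list[str]:
    """Return the theme part of every public innovation label, sorted."""
    return sorted(
        label[len(PUBLIC_PREFIX):]
        for label in labels
        if label.startswith(PUBLIC_PREFIX)
    )
```

For example, `innovation_themes(["im_public_energie", "rescrape_priority"])` keeps only the first label and returns its theme part.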
Label
Classification markers attached to companies for categorization, filtering, and search.
Ecosystem
Refers to Netlists: awards, subsidies, networks, and more (ref_scrapers project).
Platform Components & Features
Online / Online Innovatiespotter
Client-facing platform for searching companies and accessing dashboards.
Bedrijfsspotter
Internal platform for company data management and processing.
Subscription
Clients with paid access to Online who can perform their own searches.
Dashboard
Client-specific dashboards created in Online.
Internal Dashboard
Dashboard showing user activity in Bedrijfsspotter (internal use only).
Admin
Bedrijfsspotter admin page for system management.
Qnection
Manual quality control validation process for companies.
API
RESTful API providing structured JSON access to entity data and search functionality for clients.
Detailed documentation at https://online.innovatiespotter.nl/api/documentation.
Query (Business Context)
SOLR search queries. When business discusses "queries," they mean SOLR, not SQL.
Query (IT Context)
SQL or SOLR query. Must be specified based on context.
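To make the business meaning of "query" concrete, here is a minimal sketch of how a SOLR select request could be assembled. The parameters `q`, `fq`, `rows`, and `wt` are standard SOLR query parameters. The base URL, the core name, and the assumption that a field such as `url_active` is filtered with the value `true` are all hypothetical and not taken from the actual setup.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, query: str, filters=(), rows: int = 10) -> str:
    """Build a SOLR /select URL from a main query and optional filter queries."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]  # one fq parameter per filter
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
example = solr_select_url(
    "http://localhost:8983/solr/bedrijf",
    "*:*",
    filters=["url_active:true"],
)
```

A SQL query against the same data would be written and run separately, which is why the IT context asks for the query type to be specified.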
Website Categories
Working URL [Priority 3]
Active and secure websites. Accessible only via SQL, not indexed in SOLR.
Correct URL [Priority 2] → url_active in SOLR
Active, secure, verified, and not blacklisted websites.
Scraped URL → url_scraped in SOLR
Websites with successfully scraped content stored in database.
ML Eligible URL [Priority 1] → url_word_threshold in SOLR
Scraped URLs with more than 100 words, with specific filters on employees and rechtsvormen.
rescrape_priority label [Priority 0]
Manually flagged websites to be re-scraped with highest priority.
dfe_priority label
Manually flagged companies to be processed by WGP (Website Guessing Process) with highest priority.
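The priority numbers in this section can be read as a first-match rule, with 0 as the most urgent. The sketch below assumes that a website takes the most urgent priority it qualifies for; the function name and its boolean parameters are hypothetical. The `dfe_priority` label is left out on purpose, because it prioritizes WGP processing rather than scraping.

```python
from typing import Optional


def scrape_priority(
    has_rescrape_label: bool,
    ml_eligible: bool,
    correct_url: bool,
    working_url: bool,
) -> Optional[int]:
    """Return the scraping priority (0 is most urgent), or None if no category applies."""
    if has_rescrape_label:  # rescrape_priority label
        return 0
    if ml_eligible:  # ML Eligible URL
        return 1
    if correct_url:  # Correct URL
        return 2
    if working_url:  # Working URL
        return 3
    return None
```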
Website Scraping
Character threshold to scrape → url_char_threshold in SOLR
500 characters minimum per website. SOLR field is true if at least one active_url has >500 characters.
Word threshold for ML processing
100 words minimum.
url_scraped_chars in SOLR
Integer: total character count of all scraped content.
url_scraped_words in SOLR
Integer: total word count of all scraped content.
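The counts and thresholds above could be derived from scraped text roughly as follows. This is a sketch under several assumptions: the input is a hypothetical mapping from active URL to its scraped text, words are counted by splitting on whitespace, and both thresholds are treated as strict ("more than"). The key `meets_word_threshold` is an illustrative name and not a SOLR field.

```python
CHAR_THRESHOLD = 500  # characters, checked per active URL
WORD_THRESHOLD = 100  # words, checked on the total scraped content


def scrape_stats(text_by_url: dict[str, str]) -> dict:
    """Compute totals and threshold flags from scraped text per active URL."""
    chars_per_url = [len(text) for text in text_by_url.values()]
    total_words = sum(len(text.split()) for text in text_by_url.values())
    return {
        "url_scraped_chars": sum(chars_per_url),
        "url_scraped_words": total_words,
        # True if at least one active URL has more than 500 characters.
        "url_char_threshold": any(c > CHAR_THRESHOLD for c in chars_per_url),
        # Only the word-count part of ML eligibility; other filters are not modeled.
        "meets_word_threshold": total_words > WORD_THRESHOLD,
    }
```

Note that the character threshold is evaluated per URL, while the two integer fields are totals across all scraped content.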
Maximum pages per website
20 pages.
Maximum crawl depth
How many levels deep the scraper follows links from the homepage.
Examples: www.example.nl/home = depth 0, www.example.nl/nl/products = depth 1
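The page limit and crawl depth interact as two independent stopping rules. The sketch below shows one common way to apply them, a breadth-first traversal over an in-memory link graph; it is not the actual scraper. The `links` mapping and the function name are hypothetical, and `max_depth` is a required argument because no depth value is stated here.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], homepage: str, max_depth: int, max_pages: int = 20):
    """Return (url, depth) pairs in visit order, bounded by depth and page count."""
    seen = {homepage}
    queue = deque([(homepage, 0)])  # the homepage is depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With breadth-first order, pages closer to the homepage are visited first, so the page limit cuts off the deepest pages before the shallow ones.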
dh_full (Datahunter Full)
Scrapes Priority 0, 1, and 2 URLs completely, up to the maximum number of pages and depth.
dh_fast (Datahunter Fast)
Scrapes Priority 3 URLs partially, with a limited number of pages and depth.
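The split between the two Datahunter modes can be expressed as a lookup from URL priority to mode. This is an illustrative sketch only: the function name is hypothetical, and the concrete page and depth limits of dh_fast are not modeled because they are not stated here.

```python
from typing import Optional

FULL_PRIORITIES = {0, 1, 2}  # scraped completely by dh_full
FAST_PRIORITIES = {3}        # scraped partially by dh_fast


def datahunter_mode(priority: int) -> Optional[str]:
    """Return the Datahunter mode for a URL priority, or None if it is unknown."""
    if priority in FULL_PRIORITIES:
        return "dh_full"
    if priority in FAST_PRIORITIES:
        return "dh_fast"
    return None
```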
url_summary in SOLR
LLM-generated summary of website content and company activities.