Datahunter Instances
dh_full (Datahunter Full)
Scrapes: Correct URLs (active, secure, verified, not blacklisted)
dh_fast (Datahunter Fast)
Scrapes: Working URLs (active and secure only)
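
The two URL classes above can be read as two predicates over per-URL status flags. The following is a minimal sketch of that reading, not the actual Datahunter code: the `UrlStatus` type and its field names are hypothetical, and it assumes "active and secure only" means that only those two checks are required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlStatus:
    # Hypothetical flags; the names mirror the criteria listed above.
    active: bool
    secure: bool
    verified: bool
    blacklisted: bool


def is_working(status: UrlStatus) -> bool:
    """Working URL (dh_fast): active and secure."""
    return status.active and status.secure


def is_correct(status: UrlStatus) -> bool:
    """Correct URL (dh_full): working, verified, and not blacklisted."""
    return is_working(status) and status.verified and not status.blacklisted
```

Under this reading every correct URL is also a working URL, so dh_full's input is a subset of what dh_fast would accept.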
Scraping Parameters
Performance Settings
- 5 concurrent scrapers per instance
- 20 URLs per processing batch
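
One way to combine the two settings is to split the URL list into batches of 20 and cap in-flight scrapes at 5 with a semaphore. This is a sketch under assumptions: the use of asyncio, the function names, and the `scrape_one` callback are all hypothetical, since the notes do not say how concurrency is implemented.

```python
import asyncio
from typing import Awaitable, Callable, Iterator

CONCURRENT_SCRAPERS = 5  # concurrent scrapers per instance
BATCH_SIZE = 20          # URLs per processing batch


def batches(urls: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


async def scrape_all(
    urls: list[str],
    scrape_one: Callable[[str], Awaitable[str]],  # hypothetical per-URL scraper
) -> list[str]:
    """Process URLs batch by batch, never running more than 5 scrapes at once."""
    semaphore = asyncio.Semaphore(CONCURRENT_SCRAPERS)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await scrape_one(url)

    results: list[str] = []
    for batch in batches(urls):
        # gather preserves input order within each batch
        results.extend(await asyncio.gather(*(guarded(u) for u in batch)))
    return results
```

Each batch is finished before the next one starts, so a slow URL holds up its own batch but never pushes concurrency above the limit.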
Content Limits
- 2 MB page limit - Maximum HTML size per page
- 20 pages max - Maximum pages per website
- 350 character minimum per page - Minimum content for a page to count as useful
- 500 character total minimum - Minimum total content across all pages
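
The limits above could be applied roughly as follows. This is a sketch, not the real pipeline: the function names are hypothetical, 2 MB is assumed to mean 2 * 1024 * 1024 bytes, and the total is assumed to count only pages that pass the per-page minimum.

```python
MAX_PAGE_BYTES = 2 * 1024 * 1024  # assumption: 2 MB interpreted as 2 MiB
MAX_PAGES = 20
MIN_PAGE_CHARS = 350
MIN_TOTAL_CHARS = 500


def page_within_size(html: bytes) -> bool:
    """True if the raw HTML does not exceed the page size limit."""
    return len(html) <= MAX_PAGE_BYTES


def useful_pages(page_texts: list[str]) -> list[str]:
    """Keep at most MAX_PAGES pages, dropping those below the per-page minimum."""
    return [t for t in page_texts[:MAX_PAGES] if len(t) >= MIN_PAGE_CHARS]


def site_is_usable(page_texts: list[str]) -> bool:
    """True if the kept pages together reach the total content minimum."""
    return sum(len(t) for t in useful_pages(page_texts)) >= MIN_TOTAL_CHARS
```

Note that under these assumptions a site with a single 400-character page passes the per-page check but still fails the 500-character total.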
Crawling Strategy
- Depth + URL length ordering - Prioritize shallower pages and shorter URLs
- robots.txt and sitemap - Respect robots.txt rules and use sitemaps when available
- Fallback - If no sitemap or robots.txt is found, scrape the homepage only
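
The strategy above can be sketched with the standard library's `urllib.robotparser`. Several details here are assumptions: the `datahunter` user agent string and all function names are hypothetical, depth is taken to be the number of path segments, and the fallback is read as applying when neither a sitemap nor a robots.txt was found.

```python
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

MAX_PAGES = 20  # maximum pages per website


def depth(url: str) -> int:
    """Number of non-empty path segments (assumed definition of depth)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def crawl_order(urls: list[str]) -> list[str]:
    """Shallower pages first, then shorter URLs; the URL itself breaks ties."""
    return sorted(urls, key=lambda u: (depth(u), len(u), u))


def plan_crawl(
    homepage: str,
    sitemap_urls: list[str],
    robots_txt: Optional[str],
    user_agent: str = "datahunter",  # hypothetical user agent
) -> list[str]:
    """Return the pages to scrape for one site, in priority order."""
    if not sitemap_urls and robots_txt is None:
        return [homepage]  # fallback: homepage only
    candidates = sitemap_urls or [homepage]
    if robots_txt is not None:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        candidates = [u for u in candidates if parser.can_fetch(user_agent, u)]
    return crawl_order(candidates)[:MAX_PAGES]
```

Sorting on the tuple `(depth, length, url)` keeps the ordering deterministic, which makes the 20-page cutoff reproducible between runs.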
Language Handling
Languages are normalized to: nl, en, de, fr, eu (here eu is a collective label for EU languages, not the ISO 639-1 code for Basque)
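
The normalization might look like the sketch below. The mapping is an assumption rather than documented behavior: it keeps the four named languages as-is, folds the other official EU languages into `eu`, and returns `None` for anything else. The function name and the handling of region subtags are hypothetical.

```python
from typing import Optional

NAMED_LANGUAGES = {"nl", "en", "de", "fr"}

# Other official EU languages (ISO 639-1), assumed to be what "eu" collects.
OTHER_EU_LANGUAGES = {
    "bg", "hr", "cs", "da", "et", "fi", "el", "hu", "ga", "it",
    "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
}


def normalize_language(tag: str) -> Optional[str]:
    """Map a detected tag such as 'nl-BE' or 'en_US' to a normalized label."""
    primary = tag.strip().lower().replace("_", "-").split("-")[0]
    if primary in NAMED_LANGUAGES:
        return primary
    if primary in OTHER_EU_LANGUAGES:
        return "eu"
    return None  # assumption: languages outside the EU set are not labeled
```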
Scraping Frequency
Rescrape intervals:
- Priority 1 - 4 months
- Priority 2 - 8 months
- Priority 3 - 12 months (1 year)
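
Given the intervals above, the next scrape date can be derived from the last scrape date and the priority. This is a sketch with hypothetical function names; it assumes calendar months, and it clamps the day when the target month is shorter (for example, 31 October plus 4 months becomes the last day of February).

```python
import calendar
from datetime import date

RESCRAPE_MONTHS = {1: 4, 2: 8, 3: 12}  # priority -> interval in months


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_scrape(last_scraped: date, priority: int) -> date:
    """Date on which a URL with the given priority becomes due again."""
    return add_months(last_scraped, RESCRAPE_MONTHS[priority])


def is_due(last_scraped: date, priority: int, today: date) -> bool:
    """True once the rescrape interval for this priority has elapsed."""
    return today >= next_scrape(last_scraped, priority)
```

For example, `next_scrape(date(2024, 1, 15), 2)` returns `date(2024, 9, 15)`.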