🌐 Datahunter

Website scraping with dh_full and dh_fast instances

Datahunter Instances

dh_full (Datahunter Full)

Scrapes: Correct URLs (active, secure, verified, not blacklisted)

dh_fast (Datahunter Fast)

Scrapes: Working URLs (active and secure only)
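
The two URL classes above can be read as predicates over a set of status flags. The sketch below is illustrative only: the `UrlStatus` record, its field names, and the `eligible` helper are hypothetical and not part of Datahunter. It also assumes that dh_fast needs only the "active" and "secure" checks to pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlStatus:
    """Hypothetical status flags for one URL (field names are assumptions)."""
    active: bool
    secure: bool
    verified: bool
    blacklisted: bool


def is_working(status: UrlStatus) -> bool:
    # "Working" URL: active and secure.
    return status.active and status.secure


def is_correct(status: UrlStatus) -> bool:
    # "Correct" URL: active, secure, verified, and not blacklisted.
    return is_working(status) and status.verified and not status.blacklisted


def eligible(instance: str, status: UrlStatus) -> bool:
    """Return True if the given instance would scrape a URL with this status."""
    if instance == "dh_full":
        return is_correct(status)
    if instance == "dh_fast":
        return is_working(status)
    raise ValueError(f"unknown instance: {instance}")
```

Under these assumptions, an active and secure URL that is not yet verified would be picked up by dh_fast but not by dh_full.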

Scraping Parameters

Performance Settings

  • 5 concurrent scrapers per instance
  • 20 URLs per processing batch
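
One possible reading of these two settings is that URLs are handed out in batches of 20 and at most 5 scrapes run at once. The sketch below shows that reading with `asyncio`. The `scrape_one` callable is a placeholder supplied by the caller, and the constant and function names are assumptions rather than Datahunter's actual code.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterator

CONCURRENT_SCRAPERS = 5  # concurrent scrapers per instance
BATCH_SIZE = 20          # URLs per batch


def batches(urls: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


async def scrape_batch(
    urls: list[str],
    scrape_one: Callable[[str], Awaitable[Any]],
) -> list[Any]:
    """Scrape one batch, never running more than CONCURRENT_SCRAPERS at once."""
    semaphore = asyncio.Semaphore(CONCURRENT_SCRAPERS)

    async def guarded(url: str) -> Any:
        async with semaphore:
            return await scrape_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

A caller would loop over `batches(all_urls)` and await `scrape_batch` for each slice.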

Content Limits

  • 2 MB page limit - Maximum HTML size per page
  • 20 pages max - Maximum number of pages scraped per website
  • 350-character minimum per page - Minimum content a page needs to be useful
  • 500-character total minimum - Minimum total content across all pages of a website
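
The limits above can be combined into a single filter, as in the sketch below. Several details are assumptions and not confirmed by this page: that 2 MB means 2 × 1024 × 1024 bytes of UTF-8 encoded HTML, that the character minimums apply to extracted text and not to raw HTML, and that the 20-page cap is applied before the per-page checks. The function name and the `(html, text)` input shape are hypothetical.

```python
MAX_PAGE_BYTES = 2 * 1024 * 1024  # assumption: "2 MB" counted in binary megabytes
MAX_PAGES = 20
MIN_PAGE_CHARS = 350
MIN_TOTAL_CHARS = 500


def filter_site(pages: list[tuple[str, str]]) -> list[str]:
    """Apply the content limits to one website.

    `pages` holds (raw_html, extracted_text) pairs in crawl order.
    Returns the texts worth keeping, or an empty list if the site as a
    whole has too little content.
    """
    kept: list[str] = []
    for html, text in pages[:MAX_PAGES]:
        if len(html.encode("utf-8")) > MAX_PAGE_BYTES:
            continue  # page exceeds the HTML size limit
        if len(text) < MIN_PAGE_CHARS:
            continue  # page is too short to be useful
        kept.append(text)

    if sum(len(text) for text in kept) < MIN_TOTAL_CHARS:
        return []  # site does not reach the total minimum
    return kept
```

Under these assumptions, a website whose only usable page has 400 characters passes the per-page check but is still rejected by the 500-character total.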

Crawling Strategy

  • Depth + URL length ordering - Prioritize shallower pages and shorter URLs
  • robots.txt and sitemap - Respect robots.txt rules and use sitemaps when available
  • Fallback - If neither a sitemap nor a robots.txt file is available, scrape the homepage only
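
The ordering and robots.txt rules above could look roughly like the sketch below, which uses only the Python standard library. The function names, the `datahunter` user-agent string, and the choice to measure depth as the number of non-empty path segments are assumptions made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crawl_key(url: str) -> tuple[int, int]:
    """Sort key: path depth first, then total URL length."""
    path = urlsplit(url).path
    depth = len([segment for segment in path.split("/") if segment])
    return (depth, len(url))


def order_urls(urls: list[str], limit: int = 20) -> list[str]:
    """Return up to `limit` URLs, shallowest and shortest first."""
    return sorted(urls, key=crawl_key)[:limit]


def allowed_urls(robots_txt: str, urls: list[str], agent: str = "datahunter") -> list[str]:
    """Keep only the URLs that the given robots.txt text allows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]
```

Sorting on a `(depth, length)` tuple makes URL length act as a tie-breaker only between pages at the same depth.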

Language Handling

Languages are normalized to one of: nl, en, de, fr, eu (where eu stands for a collection of EU languages)
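
A minimal sketch of such a normalization step follows. It assumes that "eu" is a catch-all bucket for official EU languages other than the four listed, and that anything else is left unmapped. Neither point is stated on this page, and the function name and the `None` return value are hypothetical.

```python
from typing import Optional

PRIMARY_LANGUAGES = {"nl", "en", "de", "fr"}

# Assumption: the "eu" bucket covers the remaining official EU languages.
OTHER_EU_LANGUAGES = {
    "bg", "hr", "cs", "da", "et", "fi", "el", "hu", "ga", "it",
    "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
}


def normalize_language(tag: str) -> Optional[str]:
    """Map a language tag such as 'nl-BE' or 'IT' to a normalized code."""
    # Keep only the primary subtag: 'nl-BE' and 'nl_BE' both become 'nl'.
    base = tag.strip().lower().replace("_", "-").split("-")[0]
    if base in PRIMARY_LANGUAGES:
        return base
    if base in OTHER_EU_LANGUAGES:
        return "eu"
    return None  # assumption: languages outside the list are not normalized
```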

Scraping Frequency

Rescrape intervals:

  • Priority 1 - 4 months
  • Priority 2 - 8 months
  • Priority 3 - 1 year
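
The intervals above translate into a simple due-date check. The sketch below assumes calendar months (with the day clamped to the end of shorter months) and treats one year as 12 months. The function names are hypothetical.

```python
import calendar
from datetime import date

# Rescrape interval in months, keyed by priority.
RESCRAPE_MONTHS = {1: 4, 2: 8, 3: 12}


def add_months(day: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    index = day.month - 1 + months
    year = day.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def next_scrape(last_scraped: date, priority: int) -> date:
    """Date on which a website with this priority becomes due again."""
    return add_months(last_scraped, RESCRAPE_MONTHS[priority])


def is_due(last_scraped: date, priority: int, today: date) -> bool:
    return today >= next_scrape(last_scraped, priority)
```

For example, a priority 1 website last scraped on 2024-01-15 would become due on 2024-05-15 under these assumptions.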