description: Load data from Apify Website Content Crawler.
Apify Website Content Crawler
Apify is a web scraping and data extraction platform that provides an app store with more than a thousand ready-made cloud tools called Actors.
The Website Content Crawler Actor can deeply crawl websites, clean their HTML by removing cookie consent modals, footers, and navigation, and then transform the HTML into Markdown. This Markdown can then be stored in a vector database for semantic search or Retrieval-Augmented Generation (RAG).
Apify Website Content Crawler Node
This module provides a sophisticated web crawler that can:
- Crawl multiple websites from specified start URLs
- Use different crawling engines (Chrome, Firefox, Cheerio, JSDOM)
- Control crawling depth and page limits
- Handle JavaScript-rendered content
- Process extracted content with text splitters
- Customize metadata extraction
Inputs
Required Parameters
- Start URLs: Comma-separated list of URLs where crawling will begin
- Connect Apify API: Apify API credentials
- Crawler Type: Choice of crawling engine:
- Headless web browser (Chrome+Playwright)
- Stealthy web browser (Firefox+Playwright)
- Raw HTTP client (Cheerio)
- Raw HTTP client with JavaScript execution (JSDOM)
Optional Parameters
- Text Splitter: A text splitter to process the extracted content
- Max Crawling Depth: Maximum depth of page links to follow (default: 1)
- Max Crawl Pages: Maximum number of pages to crawl (default: 3)
- Additional Input: JSON object with additional crawler configuration
- Additional Metadata: JSON object with additional metadata
- Omit Metadata Keys: Comma-separated list of metadata keys to omit
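To make the optional parameters concrete, the sketch below shows how "Additional Input" and "Omit Metadata Keys" might be interpreted. The field names inside the JSON object are illustrative assumptions, not an exhaustive schema; check the Actor's input schema for the real options.

```python
import json

# "Additional Input" is a JSON object merged into the crawler's configuration.
# The keys here are hypothetical examples, not a complete schema.
additional_input = json.loads('{"maxResults": 10, "saveMarkdown": true}')

# "Omit Metadata Keys" is a comma-separated list; each listed key is dropped
# from every document's metadata before it is emitted.
omit_keys = "loadedUrl, loadedTime"
omitted = {key.strip() for key in omit_keys.split(",")}

metadata = {
    "source": "https://example.com",
    "loadedUrl": "https://example.com/page",
    "loadedTime": "2024-01-01T00:00:00Z",
}
cleaned = {k: v for k, v in metadata.items() if k not in omitted}
print(cleaned)  # only "source" survives the omit list
```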
Outputs
- Document: Array of document objects containing metadata and pageContent
- Text: Concatenated string from pageContent of documents
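The relationship between the two outputs can be sketched as follows; the exact document shape is an assumption based on the field names above (metadata plus pageContent), and the join separator may differ in practice.

```python
# Hypothetical document objects mirroring the Document output: each one
# carries pageContent (the extracted Markdown) and metadata.
documents = [
    {"pageContent": "Page one text.", "metadata": {"source": "https://example.com/a"}},
    {"pageContent": "Page two text.", "metadata": {"source": "https://example.com/b"}},
]

# The Text output is the concatenation of every document's pageContent.
text = "\n".join(doc["pageContent"] for doc in documents)
print(text)
```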
Features
- Multiple crawling engine support
- Configurable crawling parameters
- JavaScript rendering support
- Depth and page limit controls
- Metadata customization
- Text splitting capabilities
- Error handling
Crawler Types
Headless Chrome (Playwright)
- Best for modern web applications
- Full JavaScript support
- Higher resource usage
Stealthy Firefox (Playwright)
- Good for sites with bot detection
- Full JavaScript support
- More stealthy operation
Cheerio
- Fast and lightweight
- No JavaScript support
- Lower resource usage
JSDOM (Experimental)
- JavaScript execution support
- Lightweight alternative to browsers
- Experimental features
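The four engines above correspond to values of the Actor's `crawlerType` input. The exact strings below are an assumption based on the Actor's documented options and should be verified against its input schema before use.

```python
# Assumed mapping from the node's "Crawler Type" choices to the Actor's
# crawlerType input values (hypothetical; verify against the Actor's schema).
CRAWLER_TYPES = {
    "Headless web browser (Chrome+Playwright)": "playwright:chrome",
    "Stealthy web browser (Firefox+Playwright)": "playwright:firefox",
    "Raw HTTP client (Cheerio)": "cheerio",
    "Raw HTTP client with JavaScript execution (JSDOM)": "jsdom",
}

print(CRAWLER_TYPES["Raw HTTP client (Cheerio)"])
```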
Notes
- Requires valid Apify API token
- Different crawler types have different capabilities
- Resource usage varies by crawler type
- JavaScript support depends on crawler type
- Rate limiting may apply based on Apify plan
- Additional configuration available through JSON input
Crawl Entire Website
- (Optional) Connect Text Splitter.
- Connect Apify API (create a new credential with your Apify API token).
- Input one or more URLs (separated by commas) where the crawler will start, e.g. https://github.com/operativestech/MicroMind-Doc/.
- Select the crawler type. Refer to the Website Content Crawler documentation for more information.
- (Optional) Specify additional parameters such as maximum crawling depth and the maximum number of pages to crawl.
Output
Loads website content as a Document.
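The steps above can be sketched as the run input the node assembles for the apify/website-content-crawler Actor. This is a minimal illustration using only the parameters described in this page; the field names follow the Actor's input schema as I understand it, so treat them as assumptions and confirm against the Actor's documentation.

```python
import json

# Comma-separated Start URLs field, as entered in the node.
start_urls = "https://github.com/operativestech/MicroMind-Doc/"

# Assumed run input for the Actor, mirroring the node's parameters.
run_input = {
    "startUrls": [{"url": u.strip()} for u in start_urls.split(",")],
    "crawlerType": "playwright:firefox",  # the stealthy Firefox engine
    "maxCrawlDepth": 1,  # node default
    "maxCrawlPages": 3,  # node default
}
print(json.dumps(run_input, indent=2))
```

In the hosted node, this configuration is sent to Apify with your API token, and the crawl results come back as the Document output described above.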