description: Load data from Apify Website Content Crawler.

Apify Website Content Crawler

Apify is a web scraping and data extraction platform that provides an app store with more than a thousand ready-made cloud tools called Actors.

The Website Content Crawler Actor can deeply crawl websites, clean their HTML by removing cookie consent modals, footers, and navigation, and then transform the HTML into Markdown. This Markdown can then be stored in a vector database for semantic search or Retrieval-Augmented Generation (RAG).

Apify Website Content Crawler Node

This module provides a sophisticated web crawler that can:

  • Crawl multiple websites from specified start URLs
  • Use different crawling engines (Chrome, Firefox, Cheerio, JSDOM)
  • Control crawling depth and page limits
  • Handle JavaScript-rendered content
  • Process extracted content with text splitters
  • Customize metadata extraction

Inputs

Required Parameters

  • Start URLs: Comma-separated list of URLs where crawling will begin
  • Connect Apify API: Apify API credentials
  • Crawler Type: Choice of crawling engine:
    • Headless web browser (Chrome+Playwright)
    • Stealthy web browser (Firefox+Playwright)
    • Raw HTTP client (Cheerio)
    • Raw HTTP client with JavaScript execution (JSDOM)
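
The required parameters above map directly onto the Actor's run input. The sketch below shows that mapping, assuming the field names `startUrls` and `crawlerType` from the Actor's input schema; the `build_run_input` helper and the crawler-type string are illustrative, not the node's actual implementation.

```python
# Hypothetical helper: turn the node's comma-separated Start URLs and
# chosen crawler type into a run-input dictionary for the Actor.
def build_run_input(start_urls_csv: str, crawler_type: str) -> dict:
    urls = [u.strip() for u in start_urls_csv.split(",") if u.strip()]
    return {
        "startUrls": [{"url": u} for u in urls],  # one object per start URL
        "crawlerType": crawler_type,
    }

run_input = build_run_input(
    "https://docs.apify.com/, https://example.com",
    "playwright:firefox",  # illustrative value for the stealthy Firefox engine
)
```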

Optional Parameters

  • Text Splitter: A text splitter to process the extracted content
  • Max Crawling Depth: Maximum depth of page links to follow (default: 1)
  • Max Crawl Pages: Maximum number of pages to crawl (default: 3)
  • Additional Input: JSON object with additional crawler configuration
  • Additional Metadata: JSON object with additional metadata
  • Omit Metadata Keys: Comma-separated list of metadata keys to omit
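
To make the metadata options concrete, here is a sketch of how Additional Metadata and Omit Metadata Keys could be applied to a single document's metadata. The `apply_metadata_options` helper is hypothetical; it only illustrates the merge-then-omit behavior described above.

```python
import json

def apply_metadata_options(metadata: dict,
                           additional_metadata_json: str,
                           omit_keys_csv: str) -> dict:
    # Merge in the Additional Metadata JSON object...
    merged = {**metadata, **json.loads(additional_metadata_json)}
    # ...then drop any keys listed in Omit Metadata Keys.
    omit = {k.strip() for k in omit_keys_csv.split(",") if k.strip()}
    return {k: v for k, v in merged.items() if k not in omit}

meta = apply_metadata_options(
    {"url": "https://example.com", "loc": "page"},
    '{"source": "apify"}',
    "loc",
)
# meta -> {"url": "https://example.com", "source": "apify"}
```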

Outputs

  • Document: Array of document objects containing metadata and pageContent
  • Text: Concatenated string from pageContent of documents
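
The two outputs relate as follows: each document pairs `pageContent` with `metadata`, and the Text output is the documents' `pageContent` joined into one string. The separator used here is an assumption for illustration.

```python
# Shape of the Document output: a list of objects with metadata and pageContent.
documents = [
    {"pageContent": "First page.", "metadata": {"url": "https://example.com/a"}},
    {"pageContent": "Second page.", "metadata": {"url": "https://example.com/b"}},
]

# The Text output concatenates pageContent (newline separator assumed here).
text = "\n".join(doc["pageContent"] for doc in documents)
# text -> "First page.\nSecond page."
```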

Features

  • Multiple crawling engine support
  • Configurable crawling parameters
  • JavaScript rendering support
  • Depth and page limit controls
  • Metadata customization
  • Text splitting capabilities
  • Error handling

Crawler Types

Headless Chrome (Playwright)

  • Best for modern web applications
  • Full JavaScript support
  • Higher resource usage

Stealthy Firefox (Playwright)

  • Good for sites with bot detection
  • Full JavaScript support
  • More stealthy operation

Cheerio

  • Fast and lightweight
  • No JavaScript support
  • Lower resource usage

JSDOM (Experimental)

  • JavaScript execution support
  • Lightweight alternative to browsers
  • Experimental features
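
The trade-offs above can be condensed into a simple decision rule. `pick_crawler` is a hypothetical helper, not part of the node; it returns the crawler-type labels used in this document.

```python
def pick_crawler(needs_js: bool, bot_protected: bool) -> str:
    """Choose a crawler type from the trade-offs described above."""
    if bot_protected:
        # Bot detection on the target site: prefer the stealthy engine.
        return "Stealthy web browser (Firefox+Playwright)"
    if needs_js:
        # JavaScript-rendered content: a full browser is required.
        return "Headless web browser (Chrome+Playwright)"
    # Static HTML: the raw HTTP client is fastest and cheapest.
    return "Raw HTTP client (Cheerio)"
```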

Notes

  • Requires valid Apify API token
  • Different crawler types have different capabilities
  • Resource usage varies by crawler type
  • JavaScript support depends on crawler type
  • Rate limiting may apply based on Apify plan
  • Additional configuration available through JSON input

Crawl Entire Website

  1. (Optional) Connect Text Splitter.
  2. Connect Apify API (create a new credential with your Apify API token).
  3. Input one or more URLs (separated by commas) where the crawler will start, e.g https://github.com/operativestech/MicroMind-Doc/.
  4. Select the crawler type. Refer to Website Content Crawler documentation for more information.
  5. (Optional) Specify additional parameters such as maximum crawling depth and the maximum number of pages to crawl.
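
The steps above can be sketched end to end with the official `apify-client` Python package (`pip install apify-client`). The token and URLs are placeholders, and the `maxCrawlDepth`/`maxCrawlPages` field names are taken from the Actor's input schema; treat this as a sketch rather than the node's implementation.

```python
def build_crawl_input(start_urls_csv: str,
                      max_depth: int = 1, max_pages: int = 3) -> dict:
    """Assemble the Actor run input from the node's parameters."""
    return {
        "startUrls": [{"url": u.strip()}
                      for u in start_urls_csv.split(",") if u.strip()],
        "maxCrawlDepth": max_depth,
        "maxCrawlPages": max_pages,
    }

def crawl_website(api_token: str, start_urls_csv: str,
                  max_depth: int = 1, max_pages: int = 3) -> list:
    # Imported here so the module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(api_token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_crawl_input(start_urls_csv, max_depth, max_pages)
    )
    # Each dataset item is one crawled page (content plus metadata).
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```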

Output

Loads website content as a Document.

Resources