Puppeteer Web Scraper
Puppeteer is a Node.js library that controls Chrome/Chromium through the DevTools Protocol, typically in headless mode. Keep in mind that when scraping websites, you should always review and comply with the website's terms of service and policies to ensure ethical and legal use of the data.
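As a rough sketch of what happens under the hood, the node drives a headless browser to load the page and extract its rendered text. The helper below is hypothetical (not the node's actual source) and requires Puppeteer lazily so it can be defined without a browser installed:

```javascript
// Hypothetical helper: load a page in headless Chromium and return its visible text.
async function scrapePage(url) {
  const puppeteer = require('puppeteer'); // lazy require: only needed when actually scraping
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Evaluate in the page context so JavaScript-rendered content is included
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```

Because the page is evaluated after navigation settles, content injected by client-side JavaScript is captured as well.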
This module provides a sophisticated web scraper that can:
- Load content from single or multiple web pages
- Handle JavaScript-rendered content
- Support various page load strategies
- Wait for specific elements to load
- Crawl relative links from websites
- Process XML sitemaps
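The sitemap step can be sketched as pulling every `<loc>` entry out of a plain `<urlset>` sitemap. The function name and regex approach below are illustrative assumptions, not the node's actual implementation:

```javascript
// Extract page URLs from a simple XML sitemap (illustrative sketch).
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g; // each <loc> wraps one URL
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `<?xml version="1.0"?>
<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;
console.log(extractSitemapUrls(sitemap));
// → [ 'https://example.com/', 'https://example.com/about' ]
```

A production scraper would use a real XML parser and also handle sitemap index files, but the shape of the output, a list of URLs to visit, is the same.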
Inputs
- URL: The webpage URL to scrape
- Text Splitter (optional): A text splitter to process the extracted content
- Get Relative Links Method (optional): Choose between:
  - Web Crawl: Crawl relative links from an HTML URL
  - Scrape XML Sitemap: Scrape relative links from an XML sitemap URL
- Get Relative Links Limit (optional): Limit for number of relative links to process (default: 10, 0 for all links)
- Wait Until (optional): Page load strategy:
  - Load: When the load event fires (the page and its resources have finished loading)
  - DOM Content Loaded: When the initial HTML document has been loaded and parsed
  - Network Idle 0: When there are no network connections for at least 500ms
  - Network Idle 2: When there are no more than 2 network connections for at least 500ms
- Wait for selector to load (optional): CSS selector to wait for before scraping
- Additional Metadata (optional): JSON object with additional metadata to add to documents
- Omit Metadata Keys (optional): Comma-separated list of metadata keys to omit
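To illustrate how the two metadata inputs could interact, the hypothetical helper below merges Additional Metadata into a document's metadata and then drops any keys listed in Omit Metadata Keys (the function name and merge order are assumptions for illustration):

```javascript
// Merge additional metadata, then remove omitted keys (illustrative sketch).
function applyMetadataOptions(metadata, additional, omitKeys) {
  const merged = { ...metadata, ...additional };
  // Omit Metadata Keys is a comma-separated string, e.g. "title, lang"
  const omit = new Set(omitKeys.split(',').map((key) => key.trim()));
  return Object.fromEntries(
    Object.entries(merged).filter(([key]) => !omit.has(key))
  );
}

console.log(applyMetadataOptions(
  { source: 'https://example.com', title: 'Example' }, // scraped metadata
  { category: 'docs' },                                // Additional Metadata input
  'title'                                              // Omit Metadata Keys input
));
// → { source: 'https://example.com', category: 'docs' }
```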
Outputs
- Document: Array of document objects containing metadata and pageContent
- Text: Concatenated string from pageContent of documents
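The two output shapes can be sketched as follows. Each Document is an object with `pageContent` and `metadata`; the Text output joins the `pageContent` values (the newline separator here is an assumption for illustration):

```javascript
// Document output: an array of { pageContent, metadata } objects.
const documents = [
  { pageContent: 'First page text.', metadata: { source: 'https://example.com/a' } },
  { pageContent: 'Second page text.', metadata: { source: 'https://example.com/b' } },
];

// Text output: the pageContent of every document, concatenated.
const text = documents.map((doc) => doc.pageContent).join('\n');
console.log(text);
```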
Features
- JavaScript execution support
- Configurable page load strategies
- Element wait capabilities
- Web crawling functionality
- XML sitemap processing
- Headless browser operation
- Sandbox configuration
- Error handling for invalid URLs
- Metadata customization
Notes
- Runs in headless mode by default
- Uses no-sandbox mode for compatibility
- Invalid URLs will throw an error
- Setting link limit to 0 will retrieve all available links (may take longer)
- Supports waiting for specific DOM elements before extraction
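The notes above translate into standard Puppeteer launch and navigation options. The option names below are real Puppeteer API; the specific values and the selector are illustrative assumptions:

```javascript
// Launch options matching the notes: headless by default, no-sandbox for compatibility.
const launchOptions = {
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox'], // assumed flags for restricted environments
};

// Navigation options: wait strategy passed to page.goto().
const gotoOptions = {
  waitUntil: 'networkidle2', // no more than 2 network connections for 500ms
};

// With a browser available, these would be used roughly as:
//   const browser = await puppeteer.launch(launchOptions);
//   const page = await browser.newPage();
//   await page.goto(url, gotoOptions);
//   await page.waitForSelector('.main-content'); // hypothetical "Wait for selector" value
console.log(launchOptions.args.includes('--no-sandbox')); // → true
```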
Scrape One URL
- (Optional) Connect Text Splitter.
- Enter the URL you want to scrape.
Crawl & Scrape Multiple URLs
See the Web Crawl guide to enable scraping of multiple pages.
Output
Loads the URL content as Document objects.