Web crawler data source

14 February 2026 04:14

A web crawler data source adds content from web pages. The web pages must be publicly accessible: they can't require authorization or be restricted to specific networks.

The web crawler downloads content from the source URLs into the data source, including subdomains (if selected), linked documents, and pages that those URLs link to. It continues downloading pages and linked documents until either it has downloaded all the linked pages and documents it can find, or it reaches the maximum number of visited pages.
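The stopping behaviour described above can be illustrated with a minimal sketch. It assumes a breadth-first traversal, which this page does not confirm, and it uses a hypothetical in-memory link graph (SITE) and helper names (crawl, get_links) in place of real network requests.

```python
from collections import deque

def crawl(start_urls, get_links, max_pages):
    """Visit pages breadth-first until no links remain or max_pages is reached."""
    queue = deque(dict.fromkeys(start_urls))  # de-duplicated, order preserved
    seen = set(queue)                         # URLs already queued or visited
    visited = []                              # pages in the order they were downloaded
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for real web pages.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}

def get_links(url):
    """Return the links found on a page (here, looked up in SITE)."""
    return SITE.get(url, [])
```

With a large limit, the sketch stops once every reachable page has been visited. With a small limit, it stops as soon as that many pages have been downloaded, even if links remain in the queue.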

You can define up to ten source URLs where the crawler starts its search. You can also add regular expressions that tell the web crawler which URLs to include in or exclude from the download.
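As a rough illustration of how include and exclude patterns might be combined, the sketch below filters URLs with Python regular expressions. The function name should_download and the precedence rule (exclude wins over include, and an empty include list allows everything) are assumptions made for the example. The product's actual regex dialect and precedence may differ.

```python
import re

def should_download(url, include_patterns, exclude_patterns):
    """Decide whether a URL passes the include/exclude filters."""
    # Assumption: any matching exclude pattern rejects the URL outright.
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    # Assumption: with no include patterns, every non-excluded URL is allowed.
    if not include_patterns:
        return True
    return any(re.search(pattern, url) for pattern in include_patterns)

# Hypothetical patterns: keep only the docs section, skip PDF files.
INCLUDE = [r"^https://example\.com/docs/"]
EXCLUDE = [r"\.pdf$"]
```

Under these assumed rules, https://example.com/docs/intro is downloaded, while https://example.com/docs/guide.pdf and https://example.com/blog/post are skipped.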

To add metadata to your web pages, add meta tags to the HTML of those pages.
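To show what meta tags look like and how a crawler could read them, here is a small sketch using Python's standard html.parser module. The sample page, the tag names (description, category), and the helper extract_metadata are hypothetical. This page does not state which meta tag names the data source recognises.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip tags such as <meta charset="..."> that have no name/content pair.
        if name and content is not None:
            self.metadata[name] = content

# Hypothetical page with two metadata entries in its <head>.
PAGE = """<html><head>
<title>Returns policy</title>
<meta charset="utf-8">
<meta name="description" content="How to return an item">
<meta name="category" content="support">
</head><body><p>Page text.</p></body></html>"""

def extract_metadata(html):
    """Return a dict of the meta tag names and contents found in the HTML."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling extract_metadata(PAGE) yields the description and category entries and ignores the charset tag, which has no name attribute.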

Remember to sync if you update the web pages

The web crawler data source ingests a copy of the web page data when you sync the data source. It doesn't automatically re-crawl the pages or detect when they change. Remember to sync the data source again after your web pages change or if you've updated the web crawler configuration.

The updated data source will be available in TestBot for testing after the sync is complete, but you'll need to publish your changes before the updated web content will be available to your chatbot. Make sure you test your changes thoroughly before you publish.

You can: