Source

Sources are the websites from which you extract data into a single database. They define where your data comes from, while the database keeps that data consistent by applying one schema, prompt, and extraction schedule to every source.

Types of Sources

  • Website URLs: Direct links to web pages containing your target data
  • Google Search results: Dynamic listings generated from Google search queries
  • Reddit communities: Content from subreddits
  • LinkedIn pages: User and company information, posts and updates
  • RSS feeds: Regularly updated content streams

Adding Sources to a Database

To include data from multiple websites in the same database, add each website as a source. A database can have one or more sources, so you can collect data from different websites while keeping a unified structure.
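
As a rough illustration of this relationship, the sketch below models a database that holds one schema, prompt, and schedule shared by any number of sources. The class names (`Source`, `Database`), the field names, and the example URLs are hypothetical; they are not taken from the product's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    """One place data is extracted from (hypothetical model)."""
    kind: str      # e.g. "website", "google_search", "reddit", "linkedin", "rss"
    location: str  # URL, search query, or subreddit name


@dataclass
class Database:
    """A database whose schema, prompt, and schedule apply to all sources."""
    name: str
    schema: dict[str, str]
    prompt: str
    schedule: str
    sources: list[Source] = field(default_factory=list)

    def add_source(self, source: Source) -> None:
        # Adding a source never touches the schema, prompt, or schedule.
        if source not in self.sources:
            self.sources.append(source)


db = Database(
    name="job_listings",
    schema={"title": "string", "company": "string", "location": "string"},
    prompt="Extract every open position listed on the page.",
    schedule="daily",
)
db.add_source(Source("website", "https://example.com/careers"))
db.add_source(Source("rss", "https://example.org/jobs/feed.xml"))
```

Keeping the schema on the database rather than on each source is what lets rows from a careers page and an RSS feed land in the same columns.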

Why Are Sources Important?

Choosing sources deliberately lets you extract comprehensive, structured, and up-to-date data from multiple websites into a single, unified database.

  • Flexible data collection – Combine data from multiple sites in a structured way.
  • Scalability – Add or remove sources as needed without affecting the database structure.
  • Automated updates – Sources are reprocessed based on your extraction schedule.
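
The points above can be sketched in a few lines, assuming a stand-in `extract` function and a two-field schema. Both are hypothetical and perform no real fetching. The sketch only shows that each scheduled run reprocesses whatever sources are currently listed, and that adding or removing a source leaves the row shape unchanged.

```python
SCHEMA_FIELDS = ("title", "url")  # hypothetical unified schema


def extract(source: str) -> dict[str, str]:
    """Stand-in extractor: returns one row shaped by the shared schema.

    A real extractor would fetch and parse the source; this stub does not.
    """
    row = {"title": f"Item from {source}", "url": source}
    return {name: row[name] for name in SCHEMA_FIELDS}


def reprocess(sources: list[str]) -> list[dict[str, str]]:
    """One scheduled run: every current source is extracted again."""
    return [extract(source) for source in sources]


sources = ["https://example.com/products", "https://example.org/shop"]
first_run = reprocess(sources)

# Removing one source and adding another leaves the row shape untouched.
sources.remove("https://example.org/shop")
sources.append("https://example.net/catalog")
second_run = reprocess(sources)
```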

Example Use Cases and Their Sources

Use case                          Example sources
Market research                   Home pages of competitors and their LinkedIn pages
E-commerce product tracking       Product pages from different online stores
Job listings                      Company career pages and job boards
News and regulation aggregation   Articles from various news sites and regulation boards