Source
Sources are the websites from which you extract data into a single database. They define where your data comes from, while the database itself keeps the schema, prompt, and extraction schedule consistent across all sources.
Types of Sources
- Website URLs: Direct links to web pages containing your target data
- Google Search results: Dynamic listings generated from Google search queries
- Reddit communities: Content from subreddits
- LinkedIn pages: User and company information, posts and updates
- RSS feeds: Regularly updated content streams
Adding Sources to a Database
To include data from multiple websites in the same database, add each website as a source. A database can have one or more sources, so you can collect data from different websites while maintaining a unified structure.
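The relationship described above can be pictured as a small data model: one database holds the shared schema, prompt, and schedule, and any number of sources attach to it. The sketch below is only an illustration of that idea. The class names, field names, schedule format, and example URLs are all hypothetical and do not represent the product's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceType(Enum):
    # The five source types listed on this page.
    WEBSITE_URL = "website_url"
    GOOGLE_SEARCH = "google_search"
    REDDIT = "reddit"
    LINKEDIN = "linkedin"
    RSS_FEED = "rss_feed"


@dataclass(frozen=True)
class Source:
    type: SourceType
    location: str  # hypothetical field: a URL, search query, or subreddit name


@dataclass
class Database:
    name: str
    schema: dict[str, str]  # field name -> field type, shared by every source
    prompt: str             # one extraction prompt for all sources
    schedule: str           # hypothetical format, e.g. "daily"
    sources: list[Source] = field(default_factory=list)

    def add_source(self, source: Source) -> None:
        # Ignore duplicates so the same page is not registered twice.
        if source not in self.sources:
            self.sources.append(source)

    def remove_source(self, source: Source) -> None:
        self.sources.remove(source)


# Hypothetical example: a job-listings database fed by two sources.
db = Database(
    name="job-listings",
    schema={"title": "string", "company": "string", "location": "string"},
    prompt="Extract every open position on the page.",
    schedule="daily",
)
careers = Source(SourceType.WEBSITE_URL, "https://example.com/careers")
feed = Source(SourceType.RSS_FEED, "https://example.org/jobs.rss")
db.add_source(careers)
db.add_source(feed)
```

In this model, adding or removing a source only changes the `sources` list. The schema, prompt, and schedule live on the database, which is why every source yields records with the same structure.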
Why Are Sources Important?
Well-chosen sources let you extract comprehensive, structured, and up-to-date data from multiple websites into a single, unified database. They provide:
- Flexible data collection – Combine data from multiple sites in a structured way.
- Scalability – Add or remove sources as needed without affecting the database structure.
- Automated updates – Sources are reprocessed based on your extraction schedule.
Example Use Cases and Their Sources
| Use case | Example sources |
|---|---|
| Market research | Home pages of competitors and their LinkedIn pages |
| E-commerce product tracking | Product pages from different online stores |
| Job listings | Company career pages and job boards |
| News and regulation aggregation | Articles from various news sites and regulation boards |