Schema
A schema in Lightfeed defines the structure, format, and organization of your extracted data. It ensures that all data follows a consistent pattern, making it easier to analyze, query, and integrate with other systems.
What is a Schema?
A schema is essentially a blueprint for your data. It defines:
- Field names: The labels for each piece of data you extract
- Data types: What kind of data each field contains (text, URL, number, boolean, etc.)
- Field descriptions: Clear explanations of what each field represents, which guide the AI in extracting appropriate values
- Is ID field: Which field(s) uniquely identify each record for deduplication and updates
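As a rough illustration, the four parts above can be modeled as a small data structure. This is a sketch in plain Python, not Lightfeed's actual API: the `Field` and `DataType` names and the example product fields are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    """The data types listed on this page (names here are illustrative)."""
    TEXT = "text"
    URL = "url"
    NUMBER = "number"
    BOOLEAN = "boolean"


@dataclass(frozen=True)
class Field:
    """One schema field: name, data type, description, and ID flag."""
    name: str
    data_type: DataType
    description: str
    is_id: bool = False


# Hypothetical schema for product listings.
product_schema = [
    Field("product_url", DataType.URL,
          "Link to the product detail page.", is_id=True),
    Field("product_name", DataType.TEXT,
          "The product title as shown on the page."),
    Field("product_price", DataType.NUMBER,
          "Current selling price, numeric only (no currency symbol)."),
    Field("in_stock", DataType.BOOLEAN,
          "True if the product can currently be purchased."),
]

# The ID field(s) are simply the fields flagged with is_id.
id_fields = [f.name for f in product_schema if f.is_id]
```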
Why Schemas Matter
A well-designed schema:
- Ensures consistency: All extracted data follows the same format
- Improves data quality: Sets expectations for what output data should look like during extraction
- Enables effective deduplication: Prevents duplicate records in your database
- Simplifies analysis: Makes it easier to query and analyze your data
- Facilitates integration: Makes it easier to work with other tools and systems
Schema Components
Field Names
Field names should be:
- Descriptive: Clearly indicate what data the field contains
- Concise: Brief but meaningful
- Consistent: Follow a naming convention (e.g., snake_case)
- Unique: Each field name should be distinct
Examples:
✅ product_price, publication_date, author_name
❌ field1, data, info (too vague)
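The naming guidelines can be checked mechanically before a schema is used. The helper below is a minimal sketch and not part of Lightfeed: `check_field_names`, the snake_case pattern, and the small list of vague names are assumptions made for illustration.

```python
import re
from collections import Counter

# Lowercase words separated by single underscores, e.g. product_price.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Illustrative list of names that say nothing about their content.
VAGUE_NAMES = {"field1", "data", "info"}


def check_field_names(names):
    """Return a list of problems found in the given field names."""
    problems = []
    for name, count in Counter(names).items():
        if not SNAKE_CASE.match(name):
            problems.append(f"{name}: not snake_case")
        if name in VAGUE_NAMES:
            problems.append(f"{name}: too vague")
        if count > 1:
            problems.append(f"{name}: duplicated")
    return problems


print(check_field_names(["product_price", "publication_date", "author_name"]))
print(check_field_names(["field1", "ProductPrice", "data", "data"]))
```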
Data Types
Lightfeed supports various data types to ensure your data is stored appropriately:
Data Type | Description | Example |
---|---|---|
Text | General text content | Article titles, product descriptions |
URL | Web addresses | Product links, profile URLs |
Number | Numerical values | Prices, ratings, quantities |
Boolean | True or false | In stock or not |
Choose the appropriate data type for each field to ensure proper validation and formatting.
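To show what type validation can look like, here is a minimal sketch that checks a value against each of the four data types. It is an assumption about reasonable rules, not Lightfeed's own validation logic; the `is_valid` function and the lowercase type names are hypothetical.

```python
from urllib.parse import urlparse


def is_valid(value, data_type):
    """Return True if value is acceptable for the given data type."""
    if data_type == "text":
        return isinstance(value, str)
    if data_type == "url":
        if not isinstance(value, str):
            return False
        parts = urlparse(value)
        # Require an http(s) scheme and a host.
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    if data_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unknown data type: {data_type}")
```

Under these rules, a string such as "N/A" is rejected for a Number field, which is why a placeholder string is a poor way to represent a missing price.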
Field Descriptions
Each field should have a clear description that explains:
- What data the field contains
- Any formatting expectations
- How to handle edge cases or missing data
Good descriptions help ensure consistent data extraction.
Examples:
✅ The current selling price of the product in USD. Extract numeric value only (no currency symbols). If price is shown as a range, extract the lowest price.
✅ The current selling price (numeric only, no currency symbol). Leave empty if no pricing available.
❌ Pricing in number. Use "N/A" if no pricing is available
("Pricing" is vague: it does not say whether to include the currency symbol. "N/A" is also not a valid numeric value.)
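To make the first good description concrete, the sketch below spells out the behavior it asks for: numeric value only, the lowest price when a range is shown, and an empty value when no price is available. In Lightfeed the description guides the AI rather than running code like this; `normalize_price` is a hypothetical helper, and it assumes US-style formatting where commas are thousands separators.

```python
import re
from typing import Optional

# Matches integers and decimals such as 19 or 19.99.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def normalize_price(raw: Optional[str]) -> Optional[float]:
    """Turn displayed price text into a number, or None if there is none."""
    if not raw:
        return None
    # Assumption: commas are thousands separators ("1,200" means 1200).
    cleaned = raw.replace(",", "")
    values = [float(match) for match in _NUMBER.findall(cleaned)]
    # For a range, keep the lowest price; with no digits, leave it empty.
    return min(values) if values else None
```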
ID Field
Every schema needs one or more ID fields that uniquely identify each record. The ID can be a single field or a combination of fields that together form a unique identifier. A composite ID is particularly useful when no single field uniquely identifies a record but a combination of fields does.
The ID field is crucial for:
- Preventing duplicate records in your database
- Updating existing records when data changes
- Tracking historical data over time
When generating a schema from your prompt, Lightfeed will automatically select appropriate ID field(s), but you can always customize this selection to better match your data structure.
Common ID Field(s) by Use Case:
Use Case | Recommended ID Field(s) |
---|---|
News articles | Article URL |
E-commerce products | Product name |
Job listings | Job ID or (job title + company name + location) |
Social media profiles | Profile URL or username |
Real estate listings | Property address or listing ID |
For more on how deduplication works based on ID fields, see the Data Deduplication Guide.
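The idea behind a composite ID can be sketched in a few lines: build a key from the ID fields, then insert or update by that key. This is an illustration only and does not describe how Lightfeed implements deduplication. The `record_key` and `upsert` helpers are hypothetical, and normalizing values by trimming whitespace and lowercasing is an assumption.

```python
def record_key(record, id_fields):
    """Build a tuple key from the ID fields (trimmed, case-insensitive)."""
    return tuple(str(record.get(f, "")).strip().lower() for f in id_fields)


def upsert(store, record, id_fields):
    """Insert a new record or update the existing one with the same key.

    Returns True if the record was new, False if it updated an existing one.
    """
    key = record_key(record, id_fields)
    is_new = key not in store
    # Merge so that newer values overwrite older ones for the same key.
    store[key] = {**store.get(key, {}), **record}
    return is_new


# Job listings identified by (job title + company name + location).
ids = ["job_title", "company_name", "location"]
jobs = {}
upsert(jobs, {"job_title": "Data Engineer", "company_name": "Acme",
              "location": "Austin", "salary": 120000}, ids)
upsert(jobs, {"job_title": "Data Engineer", "company_name": "Acme",
              "location": "Austin", "salary": 130000}, ids)
print(len(jobs))
```

The second call updates the stored listing instead of adding a duplicate, because all three ID fields match.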
Schema Best Practices
Keep It Simple
Start with the essential fields and add more as needed. A simpler schema is easier to maintain and more likely to yield consistent results.
Plan for Variability
Consider how data might vary across different sources and plan your schema accordingly. For example, some product listings might not include ratings or review counts.
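One way to plan for this is to give every record the same set of fields and leave missing ones empty. The sketch below assumes records are plain dictionaries; the `conform` helper is hypothetical.

```python
def conform(record, field_names):
    """Return a record with exactly the schema's fields.

    Fields missing from the source become None; extra keys are dropped.
    """
    return {name: record.get(name) for name in field_names}


fields = ["product_name", "product_price", "rating", "review_count"]
listing = {"product_name": "Standing Desk", "product_price": 349.0}
print(conform(listing, fields))
```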
Use Consistent Naming Conventions
Choose a naming convention (e.g., snake_case) and stick with it throughout your schema.
Include Clear Descriptions
Provide detailed descriptions for each field to guide the AI on what data should be extracted.
Choose the Right Data Types
Select appropriate data types for each field to ensure proper validation and formatting.
Review and Refine
Test your schema with a small extraction before scaling up. Adjust as needed based on the results.