
Schema

A schema in Lightfeed defines the structure, format, and organization of your extracted data. It ensures that all data follows a consistent pattern, making it easier to analyze, query, and integrate with other systems.

What is a Schema?

A schema is essentially a blueprint for your data. It defines:

  • Field names: The labels for each piece of data you extract
  • Data types: What kind of data each field contains (text, URL, number, boolean, etc.)
  • Field descriptions: Clear explanations of what each field represents, which guide the AI in extracting appropriate values
  • Is ID field: Which field(s) uniquely identify each record for deduplication and updates
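To make the four components concrete, here is a minimal sketch of a schema as a plain Python data structure. The `SchemaField` class and the `product_schema` list are hypothetical names chosen for illustration; they are not Lightfeed's API, and the field definitions are made up.

```python
from dataclasses import dataclass

# Hypothetical structure for illustration only; not Lightfeed's actual API.
@dataclass(frozen=True)
class SchemaField:
    name: str             # field name, e.g. "product_price"
    data_type: str        # "text", "url", "number", or "boolean"
    description: str      # guidance for the AI during extraction
    is_id: bool = False   # True if the field is part of the record's unique ID

product_schema = [
    SchemaField("product_url", "url", "Link to the product page.", is_id=True),
    SchemaField("product_name", "text", "The product title as shown on the page."),
    SchemaField("product_price", "number", "Current selling price, numeric only."),
    SchemaField("in_stock", "boolean", "True if the product can be purchased now."),
]

# Collect the names of the fields that identify a record.
id_fields = [f.name for f in product_schema if f.is_id]
print(id_fields)  # ['product_url']
```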

Why Schemas Matter

A well-designed schema:

  • Ensures consistency: All extracted data follows the same format
  • Improves data quality: Sets expectations for what output data should look like during extraction
  • Enables effective deduplication: Prevents duplicate records in your database
  • Simplifies analysis: Makes it easier to query and analyze your data
  • Facilitates integration: Makes it easier to work with other tools and systems

Schema Components

Field Names

Field names should be:

  • Descriptive: Clearly indicate what data the field contains
  • Concise: Brief but meaningful
  • Consistent: Follow a naming convention (e.g., snake_case)
  • Unique: Each field name should be distinct

Examples:

Good: product_price, publication_date, author_name

Avoid: field1, data, info (too vague)
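The naming guidelines above can be checked mechanically before a schema is used. The following is a rough sketch, assuming snake_case as the chosen convention; the `check_field_names` helper and the list of vague names are illustrative assumptions, not part of Lightfeed.

```python
import re

# snake_case: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Illustrative list of names that say nothing about their content (an assumption).
VAGUE_NAMES = {"data", "info", "value", "field"}

def check_field_names(names):
    """Return a list of (name, problem) pairs; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if not SNAKE_CASE.match(name):
            problems.append((name, "not snake_case"))
        if name in VAGUE_NAMES or re.fullmatch(r"field_?\d+", name):
            problems.append((name, "too vague"))
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
    return problems

print(check_field_names(["product_price", "publication_date", "author_name"]))  # []
print(check_field_names(["field1", "data", "info"]))
```

The second call reports all three names as too vague, matching the examples above.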

Data Types

Lightfeed supports various data types to ensure your data is stored appropriately:

Data Type | Description          | Example
----------|----------------------|-------------------------------------
Text      | General text content | Article titles, product descriptions
URL       | Web addresses        | Product links, profile URLs
Number    | Numerical values     | Prices, ratings, quantities
Boolean   | True or false        | In stock or not

Choose the appropriate data type for each field to ensure proper validation and formatting.
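As an illustration of what type validation could look like, the sketch below checks a value against the four data types in the table. The `is_valid` function is a hypothetical helper written for this page; how Lightfeed validates values internally is not described here.

```python
from urllib.parse import urlparse

def is_valid(value, data_type):
    """Check a value against one of the four data types (illustrative only)."""
    if value is None:
        # Missing values are treated as acceptable here (an assumption).
        return True
    if data_type == "text":
        return isinstance(value, str)
    if data_type == "url":
        if not isinstance(value, str):
            return False
        parts = urlparse(value)
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    if data_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unknown data type: {data_type}")

print(is_valid("https://example.com/p/1", "url"))  # True
print(is_valid("N/A", "number"))                   # False
```

A placeholder string such as "N/A" fails the Number check, which is why a missing number is better left empty.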

Field Descriptions

Each field should have a clear description that explains:

  • What data the field contains
  • Any formatting expectations
  • How to handle edge cases or missing data

Good descriptions help ensure consistent data extraction.

Examples:

Good: The current selling price of the product in USD. Extract numeric value only (no currency symbols). If price is shown as a range, extract the lowest price.

Good: The current selling price (numeric only, no currency symbol). Leave empty if no pricing available.

Avoid: Pricing in number. Use "N/A" if no pricing is available. ("Pricing" is vague about whether to include the currency symbol, and "N/A" is not a valid numerical value.)
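In Lightfeed the AI performs the extraction, so no parsing code is needed. Still, a good description is precise enough to be written as a rule. The sketch below expresses the first price description as a hypothetical `parse_price` function, assuming commas are thousands separators.

```python
import re

def parse_price(raw):
    """Return the lowest numeric price found in `raw`, or None if there is none."""
    # Match numbers such as 45, 19.99, or 1,299.00.
    # Commas are assumed to be thousands separators.
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", raw)
    prices = [float(m.replace(",", "")) for m in matches]
    return min(prices) if prices else None

print(parse_price("$19.99"))                  # 19.99
print(parse_price("$1,299.00 - $1,499.00"))   # 1299.0
print(parse_price("Contact us for pricing"))  # None
```

Each case in the description (currency symbol, price range, missing price) maps to one defined result, which is the level of precision to aim for.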

ID Field

Every schema needs one or more ID fields that uniquely identify each record. The ID can be a single field or a combination of fields that together form a unique identifier. A composite ID is particularly useful when no single field uniquely identifies a record but a combination of fields does.

The ID field is crucial for:

  • Preventing duplicate records in your database
  • Updating existing records when data changes
  • Tracking historical data over time

When generating a schema from your prompt, Lightfeed will automatically select appropriate ID field(s), but you can always customize this selection to better match your data structure.

Common ID Field(s) by Use Case:

Use Case              | Recommended ID Field(s)
----------------------|------------------------------------------------------
News articles         | Article URL
E-commerce products   | Product name
Job listings          | Job ID or (job title + company name + location)
Social media profiles | Profile URL or username
Real estate listings  | Property address or listing ID

For more on how deduplication works based on ID fields, see the Data Deduplication Guide.
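To show how a composite ID supports deduplication and updates, here is a minimal sketch using an in-memory dictionary. The `record_key` and `upsert` helpers and the case and whitespace normalization are assumptions made for illustration. They do not describe how Lightfeed implements deduplication.

```python
def record_key(record, id_fields):
    """Build a composite key from the ID fields, ignoring case and outer whitespace."""
    return tuple(str(record[f]).strip().lower() for f in id_fields)

def upsert(store, record, id_fields):
    """Insert a new record, or update the existing record with the same key.

    Returns True if the record was new, False if it updated an existing one.
    """
    key = record_key(record, id_fields)
    is_new = key not in store
    store[key] = {**store.get(key, {}), **record}
    return is_new

# Composite ID for job listings: job title + company name + location.
id_fields = ["job_title", "company_name", "location"]
store = {}
first = {"job_title": "Data Engineer", "company_name": "Acme",
         "location": "Berlin", "salary": None}
second = {"job_title": "data engineer ", "company_name": "Acme",
          "location": "Berlin", "salary": 90000}

print(upsert(store, first, id_fields))   # True: new record
print(upsert(store, second, id_fields))  # False: same listing, salary updated
print(len(store))                        # 1
```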

Schema Best Practices

Keep It Simple

Start with the essential fields and add more as needed. A simpler schema is easier to maintain and more likely to yield consistent results.

Plan for Variability

Consider how data might vary across different sources and plan your schema accordingly. For example, some product listings might not include ratings or review counts.
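One way to account for this downstream is to treat such fields as optional and skip missing values in calculations. The sketch below uses a hypothetical `Product` record with made-up data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record type; fields that some sources omit default to None.
@dataclass
class Product:
    product_name: str
    product_price: Optional[float] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None

def average_rating(products):
    """Average the ratings that are present; return None if none are."""
    rated = [p.rating for p in products if p.rating is not None]
    return sum(rated) / len(rated) if rated else None

products = [
    Product("Desk lamp", product_price=24.0, rating=4.0, review_count=12),
    Product("Floor lamp", product_price=80.0),  # source shows no rating
    Product("Wall lamp", rating=5.0),
]
print(average_rating(products))  # 4.5
```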

Use Consistent Naming Conventions

Choose a naming convention (e.g., snake_case) and stick with it throughout your schema.

Include Clear Descriptions

Provide detailed descriptions for each field to guide the AI on what data should be extracted.

Choose the Right Data Types

Select appropriate data types for each field to ensure proper validation and formatting.

Review and Refine

Test your schema with a small extraction before scaling up. Adjust as needed based on the results.
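One simple way to review a test extraction is to measure how often each field was filled. The sketch below assumes the sample is available as a list of dictionaries; the `fill_rates` helper and the sample data are made up for illustration.

```python
def fill_rates(records, field_names):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    rates = {}
    for name in field_names:
        filled = sum(1 for r in records if r.get(name) not in (None, ""))
        rates[name] = filled / total if total else 0.0
    return rates

sample = [
    {"product_name": "Desk lamp", "product_price": 24.0, "rating": 4.0},
    {"product_name": "Floor lamp", "product_price": 80.0, "rating": None},
    {"product_name": "Wall lamp", "product_price": None, "rating": None},
    {"product_name": "Clip lamp", "product_price": 12.5, "rating": 3.5},
]
print(fill_rates(sample, ["product_name", "product_price", "rating"]))
# {'product_name': 1.0, 'product_price': 0.75, 'rating': 0.5}
```

A low fill rate may mean the source rarely shows that value, or that the field description needs to be more specific.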