Data Deduplication

Lightfeed automatically deduplicates data on every extraction, so your database contains only unique records while changes and updates to those records are tracked.

What is Deduplication?

Deduplication is the process of identifying and managing duplicate records when extracting web data. It ensures that:

  • Data is clean and consistent: Each unique item is represented only once in your database
  • Changes are tracked efficiently: Updates to existing items are recorded on the existing record instead of creating a new one
  • History is preserved: Records that are no longer found on the source are retained, along with the last date they were found

How Deduplication Works in Lightfeed

When Lightfeed extracts data, it uses the ID field value(s) as a unique key (the deduplication key) to determine whether a data row is new or a duplicate of data that has already been extracted and saved.
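
As a rough illustration of the idea, the sketch below builds a key from a row's ID field values and checks it against keys that were already saved. The function name `dedup_key` and the field names `site` and `sku` are hypothetical and chosen only for this example; they are not part of Lightfeed's API.

```python
from typing import Any, Mapping, Sequence


def dedup_key(row: Mapping[str, Any], id_fields: Sequence[str]) -> tuple:
    """Build a hashable key from the values of the chosen ID fields."""
    return tuple(row[field] for field in id_fields)


# Hypothetical ID fields: together they identify one product on one site.
id_fields = ["site", "sku"]

# Keys of rows that were saved by an earlier extraction.
existing_keys = {dedup_key({"site": "site-1", "sku": "A"}, id_fields)}

# A newly extracted row: the price is not an ID field, so it does not
# affect whether the row counts as a duplicate.
row = {"site": "site-1", "sku": "A", "price": 49.99}
is_duplicate = dedup_key(row, id_fields) in existing_keys
print(is_duplicate)
```

Using a tuple keeps the key hashable, so one field or a combination of fields can be handled the same way.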

The Deduplication Process

For each extraction, Lightfeed follows this process:

  1. Extract data from the source websites
  2. Compare each record against existing records using the ID field values as the deduplication key
  3. Process the record based on the comparison result

The action taken depends on the scenario:

Scenario | What happens | Run time value
New data row extracted | Record is added to the database | Current date of extraction
Duplicate of existing data row (no changes) | Original record is retained | Updated to current date of extraction
Updated version of existing data row | Record is updated and marked as "Changed" | Updated to current date of extraction
Data row no longer available on sources | Record is retained in database | Shows the last date when the data was found
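
The four scenarios can be sketched as a small function over an in-memory store. This is a minimal sketch under stated assumptions, not Lightfeed's implementation: the store layout, the `process_row` name, and the `changed` and `run_time` keys are all hypothetical.

```python
from datetime import date


def process_row(store: dict, key: tuple, data: dict, run_date: date) -> str:
    """Apply one extracted row to the store and return the scenario name."""
    record = store.get(key)
    if record is None:
        # New data row: add it with the current extraction date.
        store[key] = {"data": dict(data), "changed": False, "run_time": run_date}
        return "new"
    if record["data"] == data:
        # Duplicate with no changes: keep the record, refresh the run time.
        # (Assumption: an earlier "Changed" mark is left untouched here.)
        record["run_time"] = run_date
        return "duplicate"
    # Updated version: overwrite the data, mark as changed, refresh run time.
    record["data"] = dict(data)
    record["changed"] = True
    record["run_time"] = run_date
    return "changed"


# Rows that are missing from an extraction are never passed to process_row,
# so their run_time keeps the last date on which they were found.
store = {}
key = ("site-1", "A")
print(process_row(store, key, {"price": 49.99}, date(2024, 1, 1)))
print(process_row(store, key, {"price": 49.99}, date(2024, 1, 8)))
print(process_row(store, key, {"price": 39.99}, date(2024, 1, 15)))
```

The year 2024 is arbitrary; any sequence of increasing run dates behaves the same way.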

Example Deduplication Scenarios

Scenario 1: E-commerce Product Tracking

Let's say you're tracking product data across multiple e-commerce sites:

  1. Initial Extraction (Jan 1): Product A is found on Site 1 with price $49.99
    • Added to database with timestamp Jan 1
  2. Second Extraction (Jan 8): Product A still costs $49.99
    • No changes detected, timestamp updated to Jan 8
  3. Third Extraction (Jan 15): Product A now costs $39.99
    • Price is updated, record marked as "Changed", timestamp updated to Jan 15
  4. Fourth Extraction (Jan 22): Product A is no longer on the site
    • Record remains in database with Jan 15 timestamp
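
The timeline above can be replayed with a few lines of code. This sketch assumes a simple in-memory store and a hypothetical key `product-a`. The year 2024 is arbitrary, since the scenario gives only the month and day.

```python
from datetime import date

# Each run is (extraction date, rows found on the site keyed by ID value).
runs = [
    (date(2024, 1, 1), {"product-a": {"price": 49.99}}),
    (date(2024, 1, 8), {"product-a": {"price": 49.99}}),
    (date(2024, 1, 15), {"product-a": {"price": 39.99}}),
    (date(2024, 1, 22), {}),  # Product A is no longer on the site
]

store = {}
history = []
for run_date, rows in runs:
    for key, data in rows.items():
        record = store.get(key)
        if record is None:
            # First time this key is seen: add a new record.
            store[key] = {"data": data, "changed": False, "run_time": run_date}
        else:
            if record["data"] != data:
                # Same key, different values: update and mark as changed.
                record["data"] = data
                record["changed"] = True
            # The row was found in this run, so refresh its run time.
            record["run_time"] = run_date
    snapshot = store["product-a"]
    history.append(
        (run_date.isoformat(), snapshot["data"]["price"],
         snapshot["changed"], snapshot["run_time"].isoformat())
    )

for entry in history:
    print(entry)
```

After the final run the record still holds the $39.99 price and the Jan 15 run time, because the Jan 22 extraction found nothing to refresh it.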

Scenario 2: Job Listing Monitoring

When monitoring job listings across company career pages:

  1. Initial Extraction (Mar 1): Company X posts "Senior Developer" position
    • Added to database with timestamp Mar 1
  2. Second Extraction (Mar 8): Listing updated with new requirements
    • Record updated, marked as "Changed", timestamp updated to Mar 8
  3. Third Extraction (Mar 15): Position is filled and removed from the site
    • Record remains in database with Mar 8 timestamp

Best Practices for Deduplication

To get the most out of Lightfeed's deduplication capabilities:

Choose the Right Deduplication Key

  • Select a field or combination of fields that truly identifies unique records
  • Plan for records that might legitimately share the same key value

Use Consistent Data Formats

  • Ensure consistent formatting across fields used for deduplication
  • Consider how case sensitivity, whitespace, or special characters might affect matches
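
To see why formatting matters, the sketch below shows one possible way to normalize key values before they are compared. The function `normalize_key_value` is a hypothetical helper for illustration; this page does not state whether Lightfeed applies any such normalization itself, so treat it as something you might do to your own data.

```python
import unicodedata


def normalize_key_value(value: str) -> str:
    """Return a canonical form of a key value for comparison."""
    # Fold compatibility characters (full-width letters, non-breaking
    # spaces) into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", value)
    # Trim the ends and collapse internal runs of whitespace to one space.
    text = " ".join(text.split())
    # Case-insensitive comparison form.
    return text.casefold()


a = normalize_key_value("  Senior   Developer ")
b = normalize_key_value("SENIOR\u00a0DEVELOPER")
print(a == b)
```

Without a step like this, the two values would be treated as different keys and stored as separate records.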