Data Deduplication
Lightfeed automatically deduplicates data on every extraction, so your database contains only unique records while changes and updates are tracked properly.
What is Deduplication?
Deduplication is the process of identifying and managing duplicate records when extracting web data. It provides:
- Clean and consistent data: Each unique item is represented only once in your database
- Efficient change tracking: Updates to existing items are properly tracked
- Historical record keeping: Records that disappear from the source are retained with the date they were last found
How Deduplication Works in Lightfeed
When Lightfeed extracts data, it uses the ID field value(s) as a unique key (the deduplication key) to determine whether a data row is new or a duplicate of data that has already been extracted and saved.
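As a rough illustration of the idea (not Lightfeed's actual implementation), a deduplication key can be thought of as a tuple of the ID field values. The helper name `dedup_key` and the field names `site`, `sku`, and `price` are hypothetical.

```python
from typing import Any, Mapping, Sequence, Tuple


def dedup_key(row: Mapping[str, Any], id_fields: Sequence[str]) -> Tuple[Any, ...]:
    """Build a key from the values of the ID fields (hypothetical helper)."""
    return tuple(row[field] for field in id_fields)


# Hypothetical rows: the same product seen on two runs, plus a different product.
first = {"site": "site-1", "sku": "A", "price": 49.99}
second = {"site": "site-1", "sku": "A", "price": 39.99}
other = {"site": "site-1", "sku": "B", "price": 49.99}

id_fields = ["site", "sku"]
print(dedup_key(first, id_fields) == dedup_key(second, id_fields))  # True: same record
print(dedup_key(first, id_fields) == dedup_key(other, id_fields))   # False: different record
```

Because `price` is not an ID field, a price change does not produce a new key; the row is still matched to the existing record.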
The Deduplication Process
For each extraction, Lightfeed follows this process:
- Extract data from the source websites
- Compare each record against existing records using the ID field values as the deduplication key
- Process the record based on the comparison result
The action taken depends on the scenario:
Scenario | What happens | Run time value |
---|---|---|
New data row extracted | Record is added to the database | Current date of extraction |
Duplicate of existing data row (no changes) | Original record is retained | Updated to current date of extraction |
Updated version of existing data row | Record is updated and marked as "Changed" | Updated to current date of extraction |
Data row no longer available on sources | Record is retained in database | Shows the last date when the data was found |
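The four scenarios in the table can be sketched as a small in-memory merge. This is a minimal illustration under stated assumptions, not Lightfeed's implementation: the `Record` class, the `apply_run` function, the `changed` flag, and the sample rows and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, Mapping, Sequence, Tuple

Key = Tuple[Any, ...]


@dataclass
class Record:
    data: Dict[str, Any]
    run_time: date          # last extraction date on which the row was found
    changed: bool = False   # set when a run updates the stored row


def apply_run(db: Dict[Key, Record], rows: Sequence[Mapping[str, Any]],
              id_fields: Sequence[str], run_date: date) -> None:
    """Merge one extraction run into the store (illustrative only)."""
    for row in rows:
        key = tuple(row[f] for f in id_fields)
        existing = db.get(key)
        if existing is None:
            # New data row: add it with the current extraction date.
            db[key] = Record(dict(row), run_date)
        elif existing.data == dict(row):
            # Duplicate with no changes: keep the original, refresh run time.
            # Assumption: the changed flag is left as it was.
            existing.run_time = run_date
        else:
            # Updated version: overwrite, mark as changed, refresh run time.
            existing.data = dict(row)
            existing.run_time = run_date
            existing.changed = True
    # Rows missing from this run are not touched, so their run time keeps
    # showing the last date on which they were found.


db: Dict[Key, Record] = {}
apply_run(db, [{"sku": "A", "price": 49.99}], ["sku"], date(2025, 1, 1))   # new
apply_run(db, [{"sku": "A", "price": 49.99}], ["sku"], date(2025, 1, 8))   # unchanged
apply_run(db, [{"sku": "A", "price": 39.99}], ["sku"], date(2025, 1, 15))  # updated
apply_run(db, [], ["sku"], date(2025, 1, 22))                              # no longer found

record = db[("A",)]
print(record.run_time, record.changed, record.data["price"])  # 2025-01-15 True 39.99
```

The last run finds nothing, yet the record stays in the store with the run time of the previous run, which matches the final row of the table.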
Example Deduplication Scenarios
Scenario 1: E-commerce Product Tracking
Let's say you're tracking product data across multiple e-commerce sites:
- Initial Extraction (Jan 1): Product A is found on Site 1 with price $49.99
  - Added to database with timestamp Jan 1
- Second Extraction (Jan 8): Product A still costs $49.99
  - No changes detected, timestamp updated to Jan 8
- Third Extraction (Jan 15): Product A now costs $39.99
  - Price is updated, record marked as "Changed", timestamp updated to Jan 15
- Fourth Extraction (Jan 22): Product A is no longer on the site
  - Record remains in database with Jan 15 timestamp
Scenario 2: Job Listing Monitoring
When monitoring job listings across company career pages:
- Initial Extraction (Mar 1): Company X posts "Senior Developer" position
  - Added to database with timestamp Mar 1
- Second Extraction (Mar 8): Listing updated with new requirements
  - Record updated, marked as "Changed", timestamp updated to Mar 8
- Third Extraction (Mar 15): Position is filled and removed from the site
  - Record remains in database with Mar 8 timestamp
Best Practices for Deduplication
To get the most out of Lightfeed's deduplication capabilities:
Choose the Right Deduplication Key
- Select a field or combination of fields that uniquely identifies each record
- Plan for records that might legitimately share the same key value
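One way to evaluate a candidate key before relying on it is to count how many rows in a sample share the same key value. The sketch below is an illustration only; the `key_collisions` helper and the job-listing fields (`company`, `title`, `location`) are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Mapping, Sequence, Tuple


def key_collisions(rows: Sequence[Mapping[str, Any]],
                   id_fields: Sequence[str]) -> Dict[Tuple[Any, ...], int]:
    """Return each key value that is shared by more than one row."""
    counts = Counter(tuple(row[f] for f in id_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample: three distinct listings with overlapping titles.
jobs = [
    {"company": "Company X", "title": "Senior Developer", "location": "Berlin"},
    {"company": "Company X", "title": "Senior Developer", "location": "Toronto"},
    {"company": "Company Y", "title": "Senior Developer", "location": "Berlin"},
]

print(key_collisions(jobs, ["title"]))                         # {('Senior Developer',): 3}
print(key_collisions(jobs, ["company", "title"]))              # {('Company X', 'Senior Developer'): 2}
print(key_collisions(jobs, ["company", "title", "location"]))  # {}
```

In this sample, `title` alone would merge three distinct listings into one record, while the three-field combination keeps them separate.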
Use Consistent Data Formats
- Ensure consistent formatting across fields used for deduplication
- Consider how case sensitivity, whitespace, or special characters might affect matches
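To see how formatting differences can split one item into several keys, the sketch below normalizes values before comparing them. It assumes you clean the values yourself; whether Lightfeed applies any normalization of its own is not stated here, and the `normalize_key_value` helper is hypothetical.

```python
import re
import unicodedata


def normalize_key_value(value: str) -> str:
    """Normalize Unicode form, whitespace, and case (hypothetical helper)."""
    text = unicodedata.normalize("NFKC", value)  # e.g. non-breaking space -> space
    text = re.sub(r"\s+", " ", text).strip()     # collapse and trim whitespace
    return text.casefold()                       # case-insensitive comparison


# Hypothetical variants of the same title as they might appear on different pages.
variants = ["Senior Developer", "  senior   developer ", "SENIOR\u00a0DEVELOPER"]

print(len(set(variants)))                              # 3 distinct raw values
print({normalize_key_value(v) for v in variants})      # {'senior developer'}
```

Without normalization, each variant would be treated as a different key value and stored as a separate record.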