SquareTrade: Efficient automation for collecting & posting product details
Project Overview
The Scraper Portal was developed for SquareTrade Allstate Protection Plans, a leading U.S. insurance company. The platform automates collecting and posting up-to-date product data for TVs and furniture from e-commerce giants such as Costco, Walmart, and Amazon.
Problem Statement
SquareTrade required a robust, scalable system to automatically gather and update product data from various e-commerce vendors. This information needed to be posted in real time to the client portal, ensuring accurate warranty plans and up-to-date pricing. The solution also needed to be flexible enough to integrate new vendors quickly without disrupting operations.
Key Findings
- Frequent Product Updates: E-commerce platforms often update product information, making it essential for SquareTrade to have real-time data to offer accurate warranty plans and competitive pricing.
- Scalable Data Scraping: A scalable solution was necessary to handle the constant influx of product data from multiple vendors, ensuring efficient scraping and importing without manual intervention.
- Task Management Efficiency: Efficient scheduling and task management were critical for handling large volumes of data and ensuring that no product details were overlooked or delayed.
Implemented Solution
The portal was designed with the following key modules to address the problem:
- Microservices Architecture: Each module runs as an independent microservice, ensuring scalability and allowing new vendors to be integrated without affecting the rest of the system.
- Discovery Module: Built with Scrapy, this module continuously discovered new product URLs on vendor websites, ensuring that no new listings were missed.
- Scheduler Module: Developed on AWS Lambda for lightweight, cost-efficient scheduling of web scraping tasks, ensuring consistent data collection from all vendors.
- Crawler Module: Vendor-specific Scrapy crawlers, triggered with the scheduled URLs, gathered critical product data such as price, warranty, description, and SKU.
- Importer Module: After collection, the importer cleaned and processed the data before posting it to the client portal via APIs, ensuring accuracy and consistency.
- CI/CD Pipeline with Jenkins: A continuous integration/continuous deployment pipeline built with Jenkins and integrated with AWS EC2 and ECR enabled automated, reliable deployment of the microservices.
- AWS Lambda for Scheduling: Lambda provided a lightweight, scalable, and cost-effective scheduling service that handled high task volumes without significant infrastructure costs.
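The Discovery Module's core job, finding product URLs on a page and filtering out ones already seen, can be illustrated with a minimal sketch. In production this ran as a Scrapy spider; the version below uses only Python's standard library, and the `/product/<id>` URL pattern is an assumption for illustration:

```python
import re
from html.parser import HTMLParser

class ProductLinkExtractor(HTMLParser):
    """Collects hrefs that look like product pages (hypothetical /product/<id> pattern)."""
    PRODUCT_RE = re.compile(r"/product/\d+")

    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if self.PRODUCT_RE.search(href):
            self.urls.add(href)

def discover(html, seen):
    """Return product URLs not seen before, mimicking the discovery module's dedup step."""
    parser = ProductLinkExtractor()
    parser.feed(html)
    return sorted(parser.urls - seen)
```

Passing the set of already-known URLs as `seen` keeps the module stateless; in the real system that state would live in a shared store rather than in memory.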
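The Scheduler Module's role can be sketched as a Lambda handler that turns discovered URLs into fixed-size scrape tasks. The batch size, event shape, and vendor field below are assumptions; in production each batch would be pushed to a queue that triggers the crawler, which is stubbed out here:

```python
import json

BATCH_SIZE = 10  # assumed batch size; tuned per vendor in practice

def chunk(urls, size=BATCH_SIZE):
    """Split a flat list of URLs into fixed-size scrape tasks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def handler(event, context=None):
    """AWS Lambda entry point (sketch): batches pending URLs into scrape
    tasks. In production each batch would be enqueued (e.g. to SQS) to
    trigger a crawler; here we only report how many tasks were scheduled."""
    urls = event.get("urls", [])
    tasks = [{"vendor": event.get("vendor"), "urls": batch} for batch in chunk(urls)]
    return {"statusCode": 200, "body": json.dumps({"scheduled": len(tasks)})}
```

Keeping the handler a thin batching layer is what makes Lambda cheap here: it runs for milliseconds per invocation and holds no state between runs.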
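The Crawler Module's field extraction can be sketched against schema.org Product JSON-LD, which many retail pages embed. This is an illustrative sketch rather than the actual per-vendor parsing code; real vendor pages vary and each crawler needed its own selectors:

```python
import json
import re

# Matches embedded schema.org JSON-LD blocks on a product page.
JSONLD_RE = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)

def parse_product(html):
    """Extract the fields the crawler needs (SKU, name, description, price)
    from an embedded schema.org Product block, when one is present."""
    for match in JSONLD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if data.get("@type") == "Product":
            offer = data.get("offers", {})
            return {
                "sku": data.get("sku"),
                "name": data.get("name"),
                "description": data.get("description"),
                "price": offer.get("price"),
                "currency": offer.get("priceCurrency"),
            }
    return None  # no structured product data on this page
```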
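The Importer Module's clean-then-post flow can be sketched as follows. The required fields, the price normalization, and the portal endpoint are all assumptions for illustration, not the actual client API:

```python
import json
from urllib import request

REQUIRED = ("sku", "name", "price")  # assumed minimum for a valid record

def clean(item):
    """Normalize a scraped record: strip whitespace, parse the price into a
    float, and reject items missing required fields."""
    if any(not item.get(k) for k in REQUIRED):
        return None
    out = {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}
    out["price"] = float(str(out["price"]).replace("$", "").replace(",", ""))
    return out

def post_item(item, endpoint="https://portal.example.com/api/products"):
    """POST one cleaned item to the client portal (placeholder endpoint)."""
    req = request.Request(
        endpoint,
        data=json.dumps(item).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

Rejecting incomplete records before posting is what keeps the portal data consistent: a scrape that misses the price produces no update rather than a bad one.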
Results
The Scraper Portal ensured real-time updates to product data, providing SquareTrade with accurate and timely information for warranty offerings. The scalable architecture allowed for quick addition of new vendors, drastically reducing vendor onboarding time. Automation through Scrapy and the CI/CD pipeline minimized manual intervention, increasing development speed and reliability. The use of AWS Lambda kept infrastructure costs low while maintaining high performance and availability, ensuring a seamless and efficient data collection process.