1, It should scrape about 100M products' info daily or weekly; each product record is 300–1K bytes. If that is too hard, scraping 10M products daily or weekly is also OK.
2, So it seems a distributed crawl is a must.
3, Use proxies to save money; the crawler should detect IP bans and rotate proxies or slow down scraping to avoid them.
4, I found the scrapy-cluster project, which looks great, but I have not tried it.
At the moment, Scrapy (or a custom crawler) + Redis is an option.
5, Save the data to MySQL, SQL Server, or even Oracle to query products.
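For the ban detection and rotation in point 3, here is a minimal sketch of the idea in plain Python: track ban-looking responses per proxy and bench proxies that keep getting banned. The proxy URLs, status codes, and threshold are assumptions for illustration; in Scrapy this logic would live in a downloader middleware.

```python
import random

class ProxyRotator:
    """Rotate proxies and bench ones that look IP-banned."""

    BAN_CODES = {403, 429}   # assumed: responses that suggest a ban
    BAN_THRESHOLD = 3        # consecutive bans before a proxy is benched

    def __init__(self, proxies):
        # e.g. ["http://proxy1.example.com:8080", ...] (hypothetical)
        self.ban_counts = {p: 0 for p in proxies}

    def alive(self):
        """Proxies that have not hit the ban threshold."""
        return [p for p, n in self.ban_counts.items() if n < self.BAN_THRESHOLD]

    def pick(self):
        """Choose a random healthy proxy for the next request."""
        pool = self.alive()
        if not pool:
            raise RuntimeError("all proxies benched; slow down or add proxies")
        return random.choice(pool)

    def report(self, proxy, status_code):
        """Feed back the response status so bans are counted."""
        if status_code in self.BAN_CODES:
            self.ban_counts[proxy] += 1
        else:
            self.ban_counts[proxy] = 0  # a success resets the counter
```

Slowing down (point 3's other option) would be the fallback when `alive()` shrinks: raise the download delay instead of, or in addition to, rotating.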
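For the Scrapy + Redis option in point 4, scrapy-redis needs only a few settings to turn Redis into a shared scheduler, so many worker machines can pull from one queue. A minimal `settings.py` fragment, assuming the scrapy-redis package and a Redis instance on localhost:

```python
# settings.py — distributed crawl via scrapy-redis (assumed setup:
# `pip install scrapy-redis` and Redis running at localhost:6379).

# Redis holds the shared request queue, so any number of workers cooperate.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Request fingerprints are deduplicated in Redis across all workers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so the crawl can be paused and resumed.
SCHEDULER_PERSIST = True

REDIS_URL = "redis://localhost:6379"

# Politeness knobs (values are illustrative) to reduce IP bans, per point 3.
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```

Each extra worker is then just another machine running the same spider against the same `REDIS_URL`.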
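For point 5, at 10M–100M rows per pass, batched upserts matter more than which SQL vendor is chosen. A sketch of a product table and batch writer; sqlite3 stands in for MySQL here so the example runs anywhere, and the column names are assumptions — for real MySQL, swap the driver (e.g. mysqlclient), use `%s` placeholders, and `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

# Product records are small (300–1K bytes), so a single wide table works.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id          INTEGER PRIMARY KEY,
    url         TEXT UNIQUE,   -- dedupe key across re-scrapes
    title       TEXT,
    price_cents INTEGER,
    scraped_at  TEXT           -- ISO-8601 timestamp
)
"""

def save_batch(conn, rows):
    """Insert a batch in one transaction; re-scraped URLs update in place."""
    conn.executemany(
        "INSERT INTO products (url, title, price_cents, scraped_at) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET "
        "title=excluded.title, price_cents=excluded.price_cents, "
        "scraped_at=excluded.scraped_at",
        rows,
    )
    conn.commit()
```

Usage: `conn = sqlite3.connect("products.db"); conn.execute(SCHEMA); save_batch(conn, rows)` — called from a Scrapy item pipeline, flushing every few thousand items.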