Application Models

Sites

A website to be scraped.

Articles

A specific product page on a site that is to be scraped.

Items

An item is the abstraction of a product. As the system is modeled, a site does not have items; it instead has articles, each of which likely matches an item.
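The relationship between sites, articles, and items can be sketched as plain data classes. This is only an illustration: the class and field names (`Site`, `Article`, `Item`, `url`, `item`) are assumptions, not the application's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """Abstraction of a product, independent of any one site (hypothetical fields)."""
    name: str


@dataclass
class Article:
    """A product page on a site; it may be matched to an Item later."""
    url: str
    item: Optional[Item] = None  # None until a likely match has been found


@dataclass
class Site:
    """A site owns articles, never items directly."""
    name: str
    articles: List[Article] = field(default_factory=list)

    def items(self) -> List[Item]:
        # Items are reachable only through articles that have been matched.
        return [a.item for a in self.articles if a.item is not None]
```

In this sketch, an unmatched article simply carries `item=None`, which keeps "the site has this page" separate from "we know which product it is".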

Excerpts

The classification of the data to be scraped from each page, e.g. name, identification number, or image.

Layouts

The layout of the site to be scraped. It contains a list of "locations" on the page, called paths, where excerpted data can be found.

Paths

The location of data to be excerpted from a web page of a site, stored in the form of a CSS path, e.g. #id > div.container > p.name
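To make the path format concrete, here is a minimal sketch that splits a stored CSS path into its child-combinator steps. The `Step` type and `parse_css_path` function are hypothetical helpers written for this example, and the parser handles only the simple `tag#id.class` form shown above, not the full CSS selector grammar.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One element in a CSS path (hypothetical helper type)."""
    tag: Optional[str]
    element_id: Optional[str]
    classes: List[str]


# An optional tag name followed by any number of #id / .class parts.
_STEP = re.compile(r"^(?P<tag>[A-Za-z][\w-]*)?(?P<rest>(?:[#.][\w-]+)*)$")


def parse_css_path(path: str) -> List[Step]:
    """Split a path such as '#id > div.container > p.name' into steps."""
    steps = []
    for raw in path.split(">"):
        part = raw.strip()
        match = _STEP.match(part)
        if not part or match is None:
            raise ValueError(f"unsupported path step: {raw!r}")
        rest = match.group("rest")
        ids = re.findall(r"#([\w-]+)", rest)
        classes = re.findall(r"\.([\w-]+)", rest)
        steps.append(Step(match.group("tag"), ids[0] if ids else None, classes))
    return steps
```

Calling `parse_css_path("#id > div.container > p.name")` yields three steps: an element with id `id`, a `div` with class `container`, and a `p` with class `name`.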

Crawlers

A crawler is used to find new articles on each site. Each site has a different crawling system, defined by various crawler parameters. The crawler is designed to scrape only category pages to find all articles, rather than the entire site.

Crawler Params

A crawler has parameters to guide it and improve efficiency when finding articles, such as URL components to require or ignore.
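As a rough illustration of how "require" and "ignore" URL components might narrow a crawl, the sketch below filters a list of links. The `CrawlerParams` class, its `require` and `ignore` fields, and the `filter_links` function are assumed names for this example; the real parameters may be stored and applied differently.

```python
from dataclasses import dataclass, field
from typing import Iterable, List
from urllib.parse import urlparse


@dataclass
class CrawlerParams:
    """Hypothetical per-site parameters: URL components to require or ignore."""
    require: List[str] = field(default_factory=list)
    ignore: List[str] = field(default_factory=list)

    def allows(self, url: str) -> bool:
        parsed = urlparse(url)
        # Match against the path and query string only, not the host name.
        target = parsed.path + ("?" + parsed.query if parsed.query else "")
        if any(part in target for part in self.ignore):
            return False
        return all(part in target for part in self.require)


def filter_links(links: Iterable[str], params: CrawlerParams) -> List[str]:
    """Keep only the links that the crawler parameters allow."""
    return [link for link in links if params.allows(link)]
```

For example, with `require=["/product/"]` and `ignore=["/reviews", "sort="]`, a link such as `https://shop.example/product/123` is kept, while `https://shop.example/product/123/reviews` and `https://shop.example/about` are dropped.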