recipe-crawler is a Python web crawler for recipes marked up in HTML with https://schema.org/Recipe (Microdata or JSON-LD). It outputs cookbooks in JSON format. The Beautiful Soup library is used to get the anchor tags from the HTML, and the scrape-schema-recipe library is used to extract the recipes from it. My intended use for the cookbooks was to serve as a test suite for recipe software.
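To make the anchor-tag step concrete, here is a minimal sketch of how links could be collected from a page with Beautiful Soup. It is not the crawler's actual code: the function name `extract_links`, the same-site filter, and the sample HTML are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str) -> list[str]:
    """Return absolute, same-site URLs found in the page's anchor tags.

    Hypothetical helper for illustration; not taken from recipe-crawler.
    """
    soup = BeautifulSoup(html, "html.parser")
    site = urlparse(base_url).netloc
    links: list[str] = []
    # href=True skips anchors that have no href attribute at all
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links against the page URL, then drop any fragment
        url = urljoin(base_url, anchor["href"]).split("#", 1)[0]
        if urlparse(url).netloc == site and url not in links:
            links.append(url)
    return links


# Made-up sample page for demonstration
sample = """
<a href="/recipes/soup">Soup</a>
<a href="https://example.com/recipes/bread#comments">Bread</a>
<a href="https://other.example.org/">Elsewhere</a>
<a name="top">No href</a>
"""
links = extract_links(sample, "https://example.com/")
print(links)
```

Restricting the result to the starting site is one way to keep a crawl from wandering across the web; a real crawler may want a different policy.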
Some of the problems that have to be addressed when designing a web crawler:
robots.txt,
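As an illustration of the robots.txt problem, the sketch below checks whether a URL may be fetched using Python's standard `urllib.robotparser` module. Whether recipe-crawler handles robots.txt this way is not stated here; the user agent string and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; a real crawler would use its own name
USER_AGENT = "recipe-crawler-example"


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules for demonstration
rules = """\
User-agent: *
Disallow: /admin/
"""
robots = build_robots_parser(rules)
allowed = robots.can_fetch(USER_AGENT, "https://example.com/recipes/soup")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/admin/login")
print(allowed, blocked)
```

In a live crawl, `RobotFileParser.set_url()` followed by `read()` would fetch the file from the site instead of parsing a string.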
$ ./recipe_crawler.py --help
usage: recipe_crawler.py [-h] [-c CONFIG] [-f FILTER] [--limit LIMIT]
                         [-o OUTPUT] [--version]

Recipe Crawler to that saves a cookbook to a JSON file.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        website configuration YAML file
  -f FILTER, --filter FILTER
                        filter names of websites to crawl
  --limit LIMIT         Limit of number of recipes to collect (default: 20)
  -o OUTPUT, --output OUTPUT
                        Output to a JSON file
  --version             show program's version number and exit
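For readers curious how such a command line could be declared, here is a rough `argparse` sketch reconstructed from the help text. It is a guess at the shape of the parser, not the project's source: the `int` type for `--limit`, the placeholder version string, and the file names in the usage example are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror the help output shown above."""
    parser = argparse.ArgumentParser(
        prog="recipe_crawler.py",
        description="Recipe Crawler that saves a cookbook to a JSON file.",
    )
    parser.add_argument("-c", "--config", help="website configuration YAML file")
    parser.add_argument("-f", "--filter", help="filter names of websites to crawl")
    parser.add_argument(
        "--limit",
        type=int,  # assumption: the limit is parsed as an integer
        default=20,
        help="Limit of number of recipes to collect (default: %(default)s)",
    )
    parser.add_argument("-o", "--output", help="Output to a JSON file")
    # Placeholder version string; the real version is not given in the text
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


# Hypothetical invocation with made-up file names
args = build_parser().parse_args(
    ["-c", "website.yaml", "--limit", "5", "-o", "cookbook.json"]
)
print(args.config, args.limit, args.output)
```

Options that are not passed on the command line, such as `--filter` here, come back as `None`, so the calling code can tell whether a filter was requested.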
BeautifulSoup, CLI, Python — Feb 28, 2023