SCP01: Disallowed domain
What it does
Finds URLs in start_urls whose netloc is not in
allowed_domains.
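The check can be approximated with the standard library alone. The sketch below is not the rule's actual implementation: find_disallowed_start_urls is a hypothetical helper. It also assumes that a host counts as allowed when it equals an entry in allowed_domains or is a subdomain of one, mirroring how Scrapy's offsite filtering interprets allowed_domains. The rule itself may compare more strictly.

```python
from urllib.parse import urlparse


def find_disallowed_start_urls(start_urls, allowed_domains):
    """Return the start URLs whose host is not covered by allowed_domains.

    Hypothetical helper, not the rule's real implementation. Assumption:
    a host is covered when it equals an allowed domain or is a subdomain
    of one.
    """
    allowed = [domain.lower() for domain in allowed_domains]
    disallowed = []
    for url in start_urls:
        # hostname drops any port or credentials present in the netloc.
        host = (urlparse(url).hostname or "").lower()
        covered = any(
            host == domain or host.endswith("." + domain) for domain in allowed
        )
        if not covered:
            disallowed.append(url)
    return disallowed


print(find_disallowed_start_urls(["https://a.example/"], ["b.example"]))
# → ['https://a.example/']
print(find_disallowed_start_urls(["https://a.example/"], ["a.example"]))
# → []
```

Comparing hostnames rather than raw netloc strings avoids false positives for URLs that carry an explicit port, such as https://a.example:8443/.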
Why is this bad?
The default implementation of start() sets dont_filter to
True on the requests it builds from start_urls. As a
result, those requests are sent even if their domain is
not in allowed_domains.
However, any follow-up Request that a callback yields
for that same domain is dropped by offsite filtering, so
the crawl stops after the first response. That is usually
not what you want.
Example
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    # a.example is not covered by allowed_domains.
    allowed_domains = ["b.example"]
    start_urls = [
        "https://a.example/",
    ]
Use instead:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    # allowed_domains now matches the domain used in start_urls.
    allowed_domains = ["a.example"]
    start_urls = [
        "https://a.example/",
    ]