SCP01: Disallowed domain

What it does

Finds URLs in start_urls whose netloc is not in allowed_domains.
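
As a rough illustration of the check, the comparison could look like the sketch below. This is not the linter's actual implementation: the helper name find_disallowed_start_urls is hypothetical, and treating subdomains of an allowed domain as covered is an assumption about how hosts are matched.

```python
from urllib.parse import urlparse


def find_disallowed_start_urls(start_urls, allowed_domains):
    """Return the start URLs whose host is not covered by allowed_domains.

    Assumption: a host is covered if it equals an allowed domain or is a
    subdomain of one. The real rule may compare differently.
    """
    allowed = [domain.lower() for domain in allowed_domains]
    disallowed = []
    for url in start_urls:
        # hostname drops any port or credentials that netloc would include.
        host = (urlparse(url).hostname or "").lower()
        covered = any(
            host == domain or host.endswith("." + domain) for domain in allowed
        )
        if not covered:
            disallowed.append(url)
    return disallowed


print(find_disallowed_start_urls(["https://a.example/"], ["b.example"]))
# ['https://a.example/']
print(find_disallowed_start_urls(["https://a.example/"], ["a.example"]))
# []
```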

Why is this bad?

The default implementation of start() creates a request for each URL in start_urls with dont_filter set to True. As a result, those requests are sent even if their domain is not in allowed_domains.

However, any follow-up request to that domain yielded from a callback will be filtered out as offsite, which is usually not what you want.
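
The difference between the start request and a follow-up request can be pictured with a toy model. This is a simplified sketch, not Scrapy's code: would_be_sent is a hypothetical helper, and it assumes that offsite filtering skips requests with dont_filter=True and otherwise requires the host to match an allowed domain or one of its subdomains.

```python
from urllib.parse import urlparse


def would_be_sent(url, allowed_domains, dont_filter=False):
    """Toy model of offsite filtering (hypothetical helper, not a Scrapy API)."""
    if dont_filter or not allowed_domains:
        # Assumption: requests marked dont_filter bypass the offsite check,
        # and an empty allowed_domains allows every host.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed_domains = ["b.example"]

# The start request is created with dont_filter=True, so it goes out.
print(would_be_sent("https://a.example/", allowed_domains, dont_filter=True))  # True

# A follow-up request from a callback uses the default dont_filter=False.
print(would_be_sent("https://a.example/page/2", allowed_domains))  # False
```

In this model the first page is fetched but every link found on it is dropped, which matches the symptom of a crawl that stops after the start URL.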

Example

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    # a.example is not listed here, so follow-up requests to it are filtered out.
    allowed_domains = ["b.example"]
    start_urls = [
        "https://a.example/",
    ]

Use instead:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["a.example"]
    start_urls = [
        "https://a.example/",
    ]