Zyte Scrapy Cloud
The Manager for Your Robot Workers
What is Zyte?
Think of Zyte Scrapy Cloud as a manager for your robot workers.
When you write a web crawler (a "spider"), it's like building a robot that goes out to the internet to collect information for you. But running this robot on your own computer can be annoying: you have to keep your computer on, the internet might cut out, or your IP address might get blocked.
Zyte is a cloud platform where you can send your robots to live and work.
- You upload your code to Zyte.
- Zyte runs your spiders on their servers 24/7.
- You can schedule them to run automatically (e.g., "every morning at 6 AM").
- You can see all the data they collected in a nice dashboard.
Offers
Free Forever Scrapy Cloud Unit
Zyte offers a generous free tier for developers:
- 1 Free Forever Scrapy Cloud Unit: This is enough to run one spider at a time continuously.
- Unlimited team members: You can invite your whole team.
- Unlimited projects: Organize your spiders however you like.
- Unlimited requests: No cap on how many pages you crawl (within the hardware limits).
- Unlimited crawl time: Your spiders can run as long as they need.
- 120-day data retention: Your scraped data is saved for 4 months.
Technical Examples for Developers
Here is how you actually use Zyte Scrapy Cloud with a standard Scrapy project.
1. Prerequisites
You will need the shub command-line tool:

pip install shub

After installing, authenticate with your Zyte API key (available in the dashboard) by running shub login.
2. Configuration (scrapinghub.yml)
In the root of your Scrapy project, you create a scrapinghub.yml file. This tells the deployment tool which project on Zyte you want to deploy to.
First, create a project in the Zyte Dashboard and get your Project ID (e.g., 12345).
File: scrapinghub.yml
projects:
  default: 12345
stacks:
  default: scrapy:2.11  # Specify the Scrapy environment version
requirements:
  file: requirements.txt  # List extra Python dependencies here
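As an illustration, a minimal requirements.txt might look like the fragment below. The packages listed are hypothetical examples of extra dependencies a spider could need; Zyte itself does not require any of them:

```text
# Example requirements.txt (placeholder dependencies)
beautifulsoup4>=4.12
python-dateutil>=2.8
```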
3. Deployment
Once your configuration is ready, you deploy your code to the cloud with a single command:
shub deploy
If it works, you will see output like:
Packing version 1.0
Deploying to Scrapy Cloud project "12345"
Run your spiders at: https://app.zyte.com/p/12345/
4. Running a Spider
You can run a spider from the web dashboard, or you can schedule it directly from your terminal using shub:
# shub schedule <spider_name>
shub schedule myspider
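If you want to trigger runs from a Python script (say, from a cron job or CI pipeline), one minimal approach is to shell out to the same shub command. This is only a sketch: the wrapper function and its dry_run parameter are assumptions for illustration, not part of the shub CLI itself.

```python
import subprocess


def schedule_spider(spider_name, dry_run=False):
    """Build and (optionally) run the `shub schedule` command for a spider.

    This helper is hypothetical, not part of shub. With dry_run=True it
    returns the command list without executing anything, which is handy
    for testing.
    """
    cmd = ["shub", "schedule", spider_name]
    if dry_run:
        return cmd
    # Raises CalledProcessError if shub exits non-zero (e.g. not logged in).
    subprocess.run(cmd, check=True)
    return cmd


# Example: inspect the command without actually scheduling anything.
print(schedule_spider("myspider", dry_run=True))
```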
5. Example Spider Code
Your code doesn't need to change much to run on the cloud. Here is a simple standard Scrapy spider:
File: myproject/spiders/quotes.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
When this runs on Zyte Scrapy Cloud, all the yielded items (text, author) are automatically captured and stored with the job, ready for you to browse in the dashboard or export as CSV or JSON.
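Items exported in JSON Lines format can be post-processed with nothing but the Python standard library. The snippet below is a small, assumed workflow for converting such an export to CSV; the field names text and author come from the spider above, and the helper function is illustrative, not a Zyte API:

```python
import csv
import io
import json


def jsonlines_to_csv(jl_text, fieldnames):
    """Convert JSON Lines text (one item per line) into CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for line in jl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()


# Example with two items shaped like the quotes spider's output.
sample = "\n".join([
    json.dumps({"text": "To be or not to be.", "author": "Shakespeare"}),
    json.dumps({"text": "Brevity is the soul of wit.", "author": "Shakespeare"}),
])
print(jsonlines_to_csv(sample, ["text", "author"]))
```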
Get Started
- Website: Zyte.com
- Support: Zyte Support