Setting Up Scrapyd on AWS EC2 with SSL and Docker

If you wish to learn web scraping, I highly recommend Scrapy because it is truly an amazing framework. I really have to say kudos to Scrapinghub for a job well done.

AWS is also amazing… but at times so confusing. It’s not always simple to grasp everything.

In this short post, we’ll go through the entire setup process to get you scraping quickly.

At the end of this post, you will have:

  • A running instance of scrapyd on AWS EC2
  • SSL setup with a load balancer

If you do not know how to scrape a website, check out this post.

Setting up the EC2 Instance

The security group setup is important!

Don’t forget to add an inbound rule for port 8080, otherwise scrapyd won’t be reachable.
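
If you prefer the AWS CLI over the console, the same inbound rule can be added with a command along these lines (the security group ID is a placeholder):

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 8080 \
    --cidr 0.0.0.0/0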

Verify you can ssh to the instance.

ssh -i key.pem ec2-user@your.public.ip.v4

Update packages.

sudo yum update -y

Install Git.

sudo yum install git

Clone your repo with git clone (use an HTTPS URL instead of git@).

git clone https://...git

Install Docker.

sudo amazon-linux-extras install docker

Start the Docker service.

sudo service docker start

Add the ec2-user to the docker group so you can execute Docker commands without using sudo.

sudo usermod -a -G docker ec2-user

Check that the new docker group permissions have been applied correctly by logging out of the instance, SSHing back in, and running:

docker info

If you still get a permission denied error, log out of the SSH session and log back in.

Install docker-compose.

sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

Apply executable permissions to the binary.

sudo chmod +x /usr/local/bin/docker-compose
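
A quick check that the binary is installed and executable:

docker-compose --version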

Create the .env file for your production environment variables.

touch .env
vi .env

Your .env file will look something like this (adjust the values to your environment):

CONFIGURATION=development
SCRAPYD_CONFIGURATION=development
USERNAME=admin
PASSWORD=admin
PORT=8080
THIRD_PARTY_URL=http://url.com/api

You’ll need a YML file:

version: '3'
services:
  scrapyd:
    build: .
    ports:
      -  "8080:8080"
    restart: always
    env_file:
      - .env

If you do not have a Docker image, here’s a Dockerfile to build one, courtesy of Captain Data:

# Dockerfile for deploying scrapyd #
FROM debian:stretch

ADD requirements.txt .

RUN set -xe \
    && apt-get update \
    && apt-get install -y python3 \
    && apt-get install -y python3-pip \
    && apt-get install -y autoconf \
                          build-essential \
                          curl \
                          git \
                          libffi-dev \
                          libssl-dev \
                          libtool \
                          libxml2 \
                          libxml2-dev \
                          libxslt1.1 \
                          libxslt1-dev \
                          python \
                          python-dev \
                          vim-tiny \
    && apt-get install -y libtiff5 \
                          libtiff5-dev \
                          libfreetype6-dev \
                          libjpeg62-turbo \
                          libjpeg62-turbo-dev \
                          liblcms2-2 \
                          liblcms2-dev \
                          libwebp6 \
                          libwebp-dev \
                          zlib1g \
                          zlib1g-dev \
    && curl -sSL https://bootstrap.pypa.io/get-pip.py | python \
    && curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \
    && echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \
    && apt-get update && apt-get install --no-install-recommends -y nginx apache2-utils \
    && pip3 install -r requirements.txt \
    && apt-get purge -y --auto-remove autoconf \
                                      build-essential \
                                      libffi-dev \
                                      libssl-dev \
                                      libtool \
                                      libxml2-dev \
                                      libxslt1-dev \
                                      python-dev \
    && apt-get purge -y --auto-remove libtiff5-dev \
                                      libfreetype6-dev \
                                      libjpeg62-turbo-dev \
                                      liblcms2-dev \
                                      libwebp-dev \
                                      zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*


COPY run_container.sh /usr/local/bin/run_container.sh
COPY ./scrapyd.conf /etc/scrapyd/
COPY nginx.conf /etc/nginx/sites-enabled/default

VOLUME /etc/scrapyd/ /var/lib/scrapyd/
CMD /usr/local/bin/run_container.sh
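
The scrapyd.conf copied above isn’t shown in this post. Here’s a minimal sketch of what it could contain, assuming scrapyd keeps its default port 6800 and stores its data under the /var/lib/scrapyd volume declared in the Dockerfile (adapt the values to your project):

[scrapyd]
eggs_dir         = /var/lib/scrapyd/eggs
dbs_dir          = /var/lib/scrapyd/dbs
logs_dir         = /var/lib/scrapyd/logs
jobs_to_keep     = 5
finished_to_keep = 100
max_proc         = 0
max_proc_per_cpu = 4
# scrapyd only listens locally; nginx proxies to it on port 8080
bind_address     = 127.0.0.1
http_port        = 6800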

Use the following script to run the container behind NGINX (which we highly recommend):

#!/bin/bash
# Create the htpasswd file from the USERNAME and PASSWORD environment variables
htpasswd -b -c /etc/nginx/htpasswd $USERNAME $PASSWORD
# Start nginx (it daemonizes by default)
nginx
# Run scrapyd in the foreground, without a pid file, as the container's main process
scrapyd --pidfile=

This way, you’re protecting your scrapyd instance with basic authentication. The setup uses the USERNAME and PASSWORD environment variables that you set up in the .env file.
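
The nginx.conf referenced in the Dockerfile isn’t shown either. A minimal sketch, assuming nginx listens on 8080 (the PORT from the .env) and proxies to scrapyd on its default port 6800, using the htpasswd file created by the script above:

server {
    # Exposed port, published by docker-compose (8080:8080)
    listen 8080;

    location / {
        # Credentials generated by run_container.sh from USERNAME/PASSWORD
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/htpasswd;

        # Forward everything to scrapyd running inside the container
        proxy_pass http://127.0.0.1:6800;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}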

Launch the Docker container.

docker-compose -f docker-compose.yml up
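
Add -d if you want it to run detached. Once the container is up, you can sanity-check it from the instance itself; daemonstatus.json is a standard scrapyd endpoint, and the credentials are the ones from your .env:

curl -u admin:admin http://localhost:8080/daemonstatus.json
# Expect something like: {"node_name": "...", "status": "ok", "pending": 0, "running": 0, "finished": 0}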

Setting up the Elastic IP Address

Go to Elastic IPs on the left side panel in your console.

Click on Allocate new address.

Then choose Associate address from the Actions dropdown menu at the top.
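
The CLI equivalent, if you’d rather script it (the instance and allocation IDs are placeholders):

aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
    --instance-id i-0123456789abcdef0 \
    --allocation-id eipalloc-0123456789abcdef0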

Setting up SSL

Click on Services, search for ACM and click on Certificate Manager.

Click Request a Certificate (a public one) and add your domain scrapy.example.com.

Choose DNS validation (way faster).

Add the CNAME record with your DNS provider (Route 53, GoDaddy, OVH, Kinsta or any other) and hit Continue.
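
You can verify the record is live before waiting on ACM; the record name below is a placeholder, copy the exact one ACM gives you:

dig +short _abc123def456.scrapy.example.com CNAME
# Should return the target ACM gave you, typically ending in acm-validations.aws.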

Once the validation is complete, you will see a green Issued status.

If you’re having trouble while setting up the SSL certificate, check out this guide.

Setting up the Load Balancer

A load balancer distributes incoming traffic across the servers running your application.

Go to Load Balancers (from EC2) on the left side panel in your console and Create Load Balancer.

Choose Application Load Balancer (HTTP/HTTPS, the first one) and hit Create.

Then, add the availability zones you wish to use (below Listeners).

Hit Next, pick Choose a certificate from ACM, and select the one you previously created.

Next and select the security group we created in the first step.

Next and create a Target Group by just adding a name to it.

Next, select the instance and click Add to registered on port 8080. Make sure the port value next to the button is 8080.

Review and Create. You’ll be redirected to the Load Balancers view. Select the load balancer you created and add a Listener with a port value of 80, forwarding to the Target Group we just made (ignore the red warning).

This listener may already be configured by default.
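
Finally, point a DNS record for the domain you put on the certificate (scrapy.example.com here) at the load balancer’s DNS name, then check the whole chain:

curl -u admin:admin https://scrapy.example.com/daemonstatus.json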

And… tada! That’s all for today 😀

You’re now able to push your scrapy project to your scrapyd instance on EC2. Consider using scrapyd-client to do so.
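
For example, with scrapyd-client installed (pip install scrapyd-client), a deploy target in your project’s scrapy.cfg could look like this; the target name, project name and credentials are placeholders, and recent versions of scrapyd-client pick up the username and password for basic auth:

[deploy:aws]
url = https://scrapy.example.com/
project = myproject
username = admin
password = admin

Then deploy with:

scrapyd-deploy aws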


Don’t hesitate to ping us if we made a typo or if something is not up-to-date.

And if you don’t want to manage your own scraping architecture, give Captain Data a try.
