Rails Web Scraper



In this post, I'll walk you through building a web scraper in Ruby on Rails. I'm assuming an intermediate skill level with Rails.


You can find a completed version of this project here.

This application can be used to scrape job postings.

A web scraper is a program that retrieves data from a website (this process is called "scraping"). If you've ever copied and pasted information from a website, you've performed the same function as a web scraper, just manually.

Requirements

  • ruby-2.1.1
  • rails 4.1.1
  • local instance of postgresql

Create new rails project

rails new jobscraper -d postgresql


Install gems

bundle install

Create Database

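Start the PostgreSQL server if it is not already running (the data directory path depends on your installation):

postgres -D /usr/local/pgsql/data

Then create the database:

rake db:create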

Create 'Job' Resource

rails g scaffold job title:string location:string link:text haveapplied:boolean company:string interested:boolean referred:string

Using the scaffold generator gives you a JSON API for free. Then run the migration:

rake db:migrate


Add Active Admin

Add these lines to your Gemfile

gem 'devise'
gem 'activeadmin', github: 'gregbell/active_admin'

and run

bundle install

Install ActiveAdmin

rails g active_admin:install

Register Jobs with ActiveAdmin

rails generate active_admin:resource job

Customize ActiveAdmin Jobs View

Add Rake Task


rails generate task jobs fetch prune clean


If you run rake -T you can see these tasks are registered with rake.

rake jobs:clean # Delete all jobs
rake jobs:fetch # Fill database with Job listings
rake jobs:prune # Delete Jobs that are older than 7 days
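The prune rule (delete jobs older than 7 days) can be sketched in plain Ruby, without Rails. This is only an illustration: JobRecord is a hypothetical stand-in for the Job model, and stale_jobs is a made-up helper name. In the app itself, the same filter would more likely be an ActiveRecord query on created_at inside the jobs:prune task.

```ruby
# Hypothetical stand-in for the Job model; only the fields the prune rule needs.
JobRecord = Struct.new(:title, :created_at)

SECONDS_PER_DAY = 24 * 60 * 60

# Returns the records the prune rule would delete:
# anything created more than `max_age_days` days before `now`.
def stale_jobs(jobs, now: Time.now, max_age_days: 7)
  cutoff = now - (max_age_days * SECONDS_PER_DAY)
  jobs.select { |job| job.created_at < cutoff }
end

now = Time.utc(2014, 6, 15)
jobs = [
  JobRecord.new('Rails Developer', Time.utc(2014, 6, 14)), # 1 day old, kept
  JobRecord.new('Ruby Engineer',   Time.utc(2014, 6, 1))   # 14 days old, pruned
]

puts stale_jobs(jobs, now: now).map(&:title).inspect
```

Passing now in as an argument keeps the rule easy to check against fixed dates.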


Write custom Nokogiri scripts to populate the Job model's ActiveRecord attributes.