In this post, I'll walk you through building a web scraper in Ruby on Rails. I'm assuming an intermediate skill level with Rails.
- Ruby on Rails is a popular web framework that lets you write less code and avoid repetition. Gems such as Nokogiri, HTTParty, and Pry let you set up a web scraper with little hassle.
You can find a completed version of this project here.
This application can be used to scrape job postings.
A web scraper is a program that retrieves data from a website (this process is called “scraping”). If you’ve ever copied and pasted information from a website, you’ve performed the same function as a web scraper, but manually.
Requirements
- ruby-2.1.1
- rails 4.1.1
- local instance of postgresql
Create new rails project
rails new jobscraper -d postgresql
Install gems
bundle install
Create Database
postgres -D /usr/local/pgsql/data
rake db:create
Create 'Job' Resource
rails g scaffold job title:string location:string link:text haveapplied:boolean company:string interested:boolean referred:string
Using the scaffold generator gives you a JSON API for free.
rake db:migrate
Add Active Admin
Add these lines to your Gemfile
gem 'devise'
gem 'activeadmin', github: 'gregbell/active_admin'
and run
bundle install
Install ActiveAdmin
rails g active_admin:install
Register Jobs with ActiveAdmin
rails generate active_admin:resource job
Customize ActiveAdmin Jobs View
Add Rake Task
rails generate task jobs fetch prune clean
If you run
rake -T
you can see these tasks are registered with rake:
rake jobs:clean # Delete all jobs
rake jobs:fetch # Fill database with Job listings
rake jobs:prune # Delete Jobs that are older than 7 days
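The generator only creates empty task stubs, so the behaviour in each description still has to be written. As a rough illustration of the rule behind jobs:prune, here is a minimal plain-Ruby sketch that works on an in-memory list rather than the database. The JobRecord struct and the prune helper are hypothetical names used only for this example; in the Rails app the same 7-day cutoff would be applied to the jobs table instead.

```ruby
# Hypothetical stand-in for the Job model: just a title and a creation time.
JobRecord = Struct.new(:title, :created_at)

SECONDS_PER_DAY = 24 * 60 * 60

# Returns only the jobs created within the last `max_age_days` days.
# `now` is a parameter so the cutoff can be checked with a fixed time.
def prune(jobs, now: Time.now, max_age_days: 7)
  cutoff = now - max_age_days * SECONDS_PER_DAY
  jobs.select { |job| job.created_at >= cutoff }
end

# Fixed reference time and sample data, invented for the example.
now = Time.utc(2014, 6, 15, 12, 0, 0)
jobs = [
  JobRecord.new('Rails Developer', now - 2 * SECONDS_PER_DAY),  # 2 days old
  JobRecord.new('Ruby Engineer',   now - 10 * SECONDS_PER_DAY)  # 10 days old
]

kept = prune(jobs, now: now)
puts kept.map(&:title).inspect
```

Passing the current time in as an argument keeps the rule easy to verify without waiting a week or touching the database.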
Write custom Nokogiri scripts to populate ActiveRecord attributes.