Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to use ScrapedPage subclasses #14

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
# Ignore output of scraper
data.sqlite
.cache
8 changes: 2 additions & 6 deletions Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,6 @@ git_source(:github) { |repo_name| "https://github.com/#{repo_name}.git" }
ruby "2.3.1"

gem "scraperwiki", github: "openaustralia/scraperwiki-ruby", branch: "morph_defaults"
gem "nokogiri"
gem "open-uri-cached"
gem "pry"
gem "colorize"
gem "capybara"
gem "poltergeist"
gem 'scraped_page_archive', github: "everypolitician/scraped_page_archive", branch: "master"
gem 'scraped_page', github: 'everypolitician/scraped_page', branch: 'master'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps worth running a rubocop tidy against this file too, so we don't have mismatched quotes like this?

gem 'open-uri-cached'
59 changes: 18 additions & 41 deletions Gemfile.lock
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
GIT
remote: https://github.com/everypolitician/scraped_page_archive.git
revision: 9d8a6347d122d42a983ba28ff84d24de6bdca8a0
remote: https://github.com/everypolitician/scraped_page.git
revision: 6eb806c01e8b9f3e333575a639e40a68be60ae48
branch: master
specs:
scraped_page_archive (0.4.1)
git (~> 1.3.0)
vcr-archive (~> 0.3.0)
scraped_page (0.1.0)
field_serializer
nokogiri
scraped_page_archive (>= 0.5)

GIT
remote: https://github.com/openaustralia/scraperwiki-ruby.git
Expand All @@ -19,44 +20,29 @@ GIT
GEM
remote: https://rubygems.org/
specs:
addressable (2.4.0)
capybara (2.6.2)
addressable
mime-types (>= 1.16)
nokogiri (>= 1.3.3)
rack (>= 1.0.0)
rack-test (>= 0.5.4)
xpath (~> 2.0)
cliver (0.3.2)
addressable (2.5.0)
public_suffix (~> 2.0, >= 2.0.2)
coderay (1.1.1)
colorize (0.7.7)
crack (0.4.3)
safe_yaml (~> 1.0.0)
field_serializer (0.2.0)
git (1.3.0)
hashdiff (0.3.0)
httpclient (2.7.1)
method_source (0.8.2)
mime-types (3.0)
mime-types-data (~> 3.2015)
mime-types-data (3.2016.0221)
mini_portile2 (2.0.0)
multi_json (1.11.2)
nokogiri (1.6.7.2)
mini_portile2 (~> 2.0.0.rc2)
mini_portile2 (2.1.0)
nokogiri (1.6.8.1)
mini_portile2 (~> 2.1.0)
open-uri-cached (0.0.5)
poltergeist (1.9.0)
capybara (~> 2.1)
cliver (~> 0.3.1)
multi_json (~> 1.0)
websocket-driver (>= 0.2.0)
pry (0.10.3)
coderay (~> 1.1.0)
method_source (~> 0.8.1)
slop (~> 3.4)
rack (1.6.4)
rack-test (0.6.3)
rack (>= 1.0)
public_suffix (2.0.4)
safe_yaml (1.0.4)
scraped_page_archive (0.5.0)
git (~> 1.3.0)
vcr-archive (~> 0.3.0)
slop (3.6.0)
sqlite3 (1.3.11)
sqlite_magic (0.0.6)
Expand All @@ -69,27 +55,18 @@ GEM
addressable (>= 2.3.6)
crack (>= 0.3.2)
hashdiff
websocket-driver (0.6.3)
websocket-extensions (>= 0.1.0)
websocket-extensions (0.1.2)
xpath (2.0.0)
nokogiri (~> 1.3)

PLATFORMS
ruby

DEPENDENCIES
capybara
colorize
nokogiri
open-uri-cached
poltergeist
pry
scraped_page_archive!
scraped_page!
scraperwiki!

RUBY VERSION
ruby 2.3.1p112

BUNDLED WITH
1.13.5
1.13.7
6 changes: 6 additions & 0 deletions lib/core_ext.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# frozen_string_literal: true
class String
def tidy
gsub(/[[:space:]]+/, ' ').strip
end
end
25 changes: 25 additions & 0 deletions lib/date_of_birth.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# frozen_string_literal: true
class DateOfBirth
DATE_REGEX = /(?<day>\d+) de (?<month>[^[:space:]]*) de (?<year>\d+)/

def initialize(date_string)
@date_string = date_string
end

def to_s
return '' if match.nil?
'%d-%02d-%02d' % [match[:year], month(match[:month]), match[:day]]
end

private

attr_reader :date_string

def match
@match ||= date_string.match(DATE_REGEX)
end

def month(str)
['', 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'].find_index(str.downcase) || raise("Unknown month #{str}".magenta)
end
end
115 changes: 115 additions & 0 deletions lib/member_page.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# frozen_string_literal: true
require_relative 'spanish_congress_page'
require_relative 'date_of_birth'
require_relative 'core_ext'

class MemberPage < SpanishCongressPage
field :iddiputado do
query['idDiputado']
Copy link
Contributor

@tmtmtmtm tmtmtmtm Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query is a little more ambiguous in this context than when you're already explicitly dealing with a URL. Perhaps something like query_string would be a little more obvious?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I think query_string is much clearer, will change it to that.

end

field :term do
query['idLegislatura']
end

field :name do
noko.css('div#curriculum div.nombre_dip').text
end

field :family_names do
name.split(/,/).first.to_s.tidy
end

field :given_names do
name.split(/,/).last.to_s.tidy
end

field :gender do
return 'female' if seat.include? 'Diputada'
return 'male' if seat.include? 'Diputado'
end

field :party do
noko.at_css('#datos_diputado .nombre_grupo').text.tidy
end

field :source do
url.to_s
end

field :dob do
DateOfBirth.new(
noko.xpath('.//div[@class="titular_historico"]/following::div/ul/li').first.text
).to_s
end

field :faction do
faction_information[:faction].to_s.tidy
end

field :faction_id do
faction_information[:faction_id].to_s.tidy
end

field :start_date do
start_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Fecha alta")]')
.text.match(/(\d+)\/(\d+)\/(\d+)\./)
return if start_date.nil?
start_date.captures.reverse.join('-')
end

field :end_date do
end_date = noko.xpath('.//div[@class="dip_rojo"][contains(.,"Causó baja")]')
.text.match(/(\d+)\/(\d+)\/(\d+)\./)
return if end_date.nil?
end_date.captures.reverse.join('-')
end

field :email do
noko.css('.webperso_dip a[href*="mailto"]').text.tidy
end

field :twitter do
noko.css('.webperso_dip a[href*="twitter.com"]').text.tidy
end

field :facebook do
noko.css('.webperso_dip a[href*="facebook.com"]').text.tidy
end

field :phone do
noko.css('.texto_dip').text.match(/Teléfono: (.*)$/).to_a.last.to_s.tidy
end

field :fax do
noko.css('.texto_dip').text.match(/Fax: (.*)$/).to_a.last.to_s.tidy
end

field :constituency do
seat[/Diputad. por (.*)\./, 1]
end

field :photo do
foto = noko.at_css('#datos_diputado img[name="foto"]')
return if foto.nil?
URI.join(url, foto[:src]).to_s
end

private

def seat
@seat ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:first').text.tidy
end

def group
@group ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really worth memoizing all these things? I don't think they're particularly costly, and some of them aren't even called more than once anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just habit, but you're quite right, this is a form of unnecessary premature optimization.

end

def query
@query ||= URI.decode_www_form(URI.parse(url).query).to_h
end

def faction_information
@faction_information ||= group.match(/(?<faction>.*?) \((?<faction_id>.*?)\)/) || {}
end
end
16 changes: 16 additions & 0 deletions lib/members_list_page.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# frozen_string_literal: true
require_relative 'spanish_congress_page'

class MembersListPage < SpanishCongressPage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 'ScrapedPage' with no fields is a bit of a red flag. The first two of these look like they should be fields to me (and the next_page_link a private method).

def member_urls
@member_urls ||= noko.css('div#RESULTADOS_DIPUTADOS div.listado_1 ul li a').map { |p| p[:href] }
end

def next_page_url
next_page_link && next_page_link[:href]
end

def next_page_link
@next_page_url ||= noko.css('//div[@class = "paginacion"]//a[contains("Página Siguiente")]').first
end
end
13 changes: 13 additions & 0 deletions lib/spanish_congress_page.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# frozen_string_literal: true
require 'scraped_page'
require 'uri'

class SpanishCongressPage < ScrapedPage
# Remove session information from url
def url
uri = URI.parse(super.to_s)
return uri.to_s unless uri.query
uri.query = uri.query.gsub(/_piref[\d_]+\./, '')
uri.to_s
end
end
Loading