
Refactor to use ScrapedPage subclasses #14

Open · wants to merge 20 commits into master from refactor-to-use-scraped_page
Conversation

@chrismytton (Contributor) commented Nov 8, 2016

This changes the scraper to use ScrapedPage subclasses, which now handle stripping the session information from the URL.
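As a hedged sketch of the pattern (class and parameter names here are illustrative, not the PR's actual code), a base page class can strip the session parameter once and let every page subclass inherit that behaviour:

```ruby
require 'uri'
require 'cgi'

# Illustrative sketch only, not the PR's actual code. A base page class
# strips the session parameter from the query string; subclasses inherit
# that behaviour. 'jsessionid' is a placeholder assumption.
class SessionStrippingPage
  SESSION_PARAM = 'jsessionid'

  def initialize(url)
    @raw_url = url
  end

  # The URL with the session parameter removed from the query string.
  def url
    uri = URI.parse(@raw_url)
    params = CGI.parse(uri.query.to_s)
    params.delete(SESSION_PARAM)
    uri.query = URI.encode_www_form(params)
    uri.to_s
  end
end

class MemberPage < SessionStrippingPage; end

puts MemberPage.new('http://example.com/member?id=174&jsessionid=ABC').url
# => http://example.com/member?id=174
```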

Notes to reviewer

You can see what the scraper is doing by running the following:

VERBOSE=1 bundle exec ruby scraper.rb

This is currently missing the archiver because I've hit a couple of scraped_page_archive bugs relating to VCR (I think because the archive branch on this repo contains some non-standard cassettes, so it was a good test-case!).

Notes to merger

Set https://morph.io/everypolitician-scrapers/spain_congreso_es to auto-run once this is merged.

@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from f2feae0 to a2da5d1 Compare November 11, 2016 17:00
@chrismytton chrismytton changed the title [WIP] Refactor to use scraped page Refactor to use ScrapedPage subclasses Nov 14, 2016
@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from c46019e to bf6b8a4 Compare November 14, 2016 20:00
Use the URI module to make it easier to just replace the unwanted part
of the query string.
This means that rather than having to override the url in every class in
the system we can just do it once here and then inherit from this class.
This should become part of ScrapedPage RSN.
This means these pages also have the session information stripped from
the url.
Rather than just scraping the first page, scrape all pages.
On the last page there is no next link, so trying to call `next[:href]`
blows up. Instead I've moved `next_page_link` into a separate method and
check that it exists in `next_page_url`.
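The guard described in that commit message can be sketched like this (a minimal illustration, not the PR's code; the hash argument stands in for a parsed Nokogiri link node):

```ruby
# Sketch of the nil guard: `next_page_link` may be nil on the last page,
# and `next_page_url` only dereferences it when it is present.
# The constructor argument stands in for a parsed Nokogiri node.
class MembersListPage
  def initialize(next_link = nil)
    @next_link = next_link
  end

  # nil on the last page, so the caller can stop paginating safely.
  def next_page_url
    next_page_link[:href] if next_page_link
  end

  private

  def next_page_link
    @next_link
  end
end

puts MembersListPage.new({ href: '/page2' }).next_page_url # => /page2
puts MembersListPage.new.next_page_url.inspect             # => nil
```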
@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from bf6b8a4 to 4fb737a Compare November 14, 2016 20:03
Most gems we need now come as dependencies of scraped_page.
@chrismytton (Contributor, Author) commented:

@tmtmtmtm 👀

@tmtmtmtm (Contributor) left a comment:


Some comments in place, but the bigger problem here is that this doesn't seem to be fetching the history for each person.

Each person page has a 'Todas las legislaturas' link that goes to a page that lists all the different term memberships they've had: e.g. http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?idDiputado=174&idLegislatura=7

gem "capybara"
gem "poltergeist"
gem 'scraped_page_archive', github: "everypolitician/scraped_page_archive", branch: "master"
gem 'scraped_page', github: 'everypolitician/scraped_page', branch: 'master'
Contributor:

Perhaps worth running a rubocop tidy against this file too, so we don't have mismatched quotes like this?


class MemberPage < SpanishCongressPage
field :iddiputado do
query['idDiputado']
@tmtmtmtm (Contributor) commented Nov 15, 2016:

`query` is a little more ambiguous in this context than when you're already explicitly dealing with a URL. Perhaps something like `query_string` would be a little more obvious?

@chrismytton (Contributor, Author) replied:

Yeah, good point. I think `query_string` is much clearer; will change it to that.
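A hedged sketch of what that rename might look like (the helper is shown standalone here; in the PR it would live on the page superclass):

```ruby
require 'uri'
require 'cgi'

# Illustrative sketch of the renamed helper: `query_string` parses the
# page URL's query into a hash, so fields can read parameters from it.
class MemberPage
  def initialize(url)
    @url = url
  end

  def iddiputado
    query_string['idDiputado'].first
  end

  private

  # Clearer name than `query`: the parsed query-string parameters.
  def query_string
    CGI.parse(URI.parse(@url).query.to_s)
  end
end

url = 'http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?idDiputado=174&idLegislatura=7'
puts MemberPage.new(url).iddiputado # prints "174"
```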

end

def group
@group ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy
Contributor:

Is it really worth memoizing all these things? I don't think they're particularly costly, and some of them aren't even called more than once anyway.

@chrismytton (Contributor, Author) replied:
No, just habit, but you're quite right: this is premature optimization.

# frozen_string_literal: true
require_relative 'spanish_congress_page'

class MembersListPage < SpanishCongressPage
Contributor:

A `ScrapedPage` with no fields is a bit of a red flag. The first two of these look like they should be fields to me (and `next_page_link` a private method).
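A rough sketch of the shape that comment suggests, with a minimal stand-in for the `field` DSL (the real class would inherit it from the scraped_page gem, and the plain-data constructor stands in for real HTML parsing):

```ruby
# Minimal stand-in for the `field` DSL; the real MembersListPage would
# inherit it from the scraped_page gem rather than define it here.
class ScrapedPageStub
  def self.field(name, &block)
    define_method(name, &block)
  end
end

# Shape suggested by the review comment: the data the page exposes
# becomes fields, and next_page_link stays a private helper.
class MembersListPage < ScrapedPageStub
  def initialize(member_urls:, next_href: nil)
    @member_urls = member_urls
    @next_href = next_href
  end

  field :member_urls do
    @member_urls
  end

  field :next_page_url do
    next_page_link && next_page_link[:href]
  end

  private

  def next_page_link
    { href: @next_href } if @next_href
  end
end
```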

This caches a copy of fetched webpages to make the scraper quicker to
develop locally.
@chrismytton chrismytton removed their assignment Nov 27, 2017