
Refactor to use ScrapedPage subclasses #14

Open · wants to merge 20 commits into master from refactor-to-use-scraped_page
Conversation

@chrismytton (Contributor) commented Nov 8, 2016

This changes the scraper to use ScrapedPage subclasses, which now handle stripping the session information from the URL.
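As a hedged sketch of the pattern (class and parameter names here are illustrative, not the PR's actual code), a base page class can strip the session parameter once and let every page subclass inherit that behaviour:

```ruby
require 'uri'
require 'cgi'

# Illustrative sketch only, not the PR's actual code. A base page class
# strips the session parameter from the query string; subclasses inherit
# that behaviour. 'jsessionid' is a placeholder assumption.
class SessionStrippingPage
  SESSION_PARAM = 'jsessionid'

  def initialize(url)
    @raw_url = url
  end

  # The URL with the session parameter removed from the query string.
  def url
    uri = URI.parse(@raw_url)
    params = CGI.parse(uri.query.to_s)
    params.delete(SESSION_PARAM)
    uri.query = URI.encode_www_form(params)
    uri.to_s
  end
end

class MemberPage < SessionStrippingPage; end

puts MemberPage.new('http://example.com/member?id=174&jsessionid=ABC').url
# => http://example.com/member?id=174
```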

Notes to reviewer

You can see what the scraper is doing by running the following:

VERBOSE=1 bundle exec ruby scraper.rb

This is currently missing the archiver because I've hit a couple of scraped_page_archive bugs relating to VCR (I think because the archive branch on this repo contains some non-standard cassettes, so it was a good test-case!).

Notes to merger

Set https://morph.io/everypolitician-scrapers/spain_congreso_es to auto-run once this is merged.

@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from f2feae0 to a2da5d1 Compare November 11, 2016 17:00
@chrismytton chrismytton changed the title [WIP] Refactor to use scraped page Refactor to use ScrapedPage subclasses Nov 14, 2016
@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from c46019e to bf6b8a4 Compare November 14, 2016 20:00
Use the URI module to make it easier to just replace the unwanted part
of the query string.
This means that rather than having to override the url in every class in
the system we can just do it once here and then inherit from this class.
This should become part of ScrapedPage RSN.
This means these pages also have the session information stripped from
the url.
Rather than just scraping the first page, scrape all pages.
On the last page there is no next link, so trying to call `next[:href]`
blows up. Instead I've moved `next_page_link` into a separate method and
check that it exists in `next_page_url`.
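The guard described in that commit message can be sketched like this (a minimal illustration, not the PR's code; the hash argument stands in for a parsed Nokogiri link node):

```ruby
# Sketch of the nil guard: `next_page_link` may be nil on the last page,
# and `next_page_url` only dereferences it when it is present.
# The constructor argument stands in for a parsed Nokogiri node.
class MembersListPage
  def initialize(next_link = nil)
    @next_link = next_link
  end

  # nil on the last page, so the caller can stop paginating safely.
  def next_page_url
    next_page_link[:href] if next_page_link
  end

  private

  def next_page_link
    @next_link
  end
end

puts MembersListPage.new({ href: '/page2' }).next_page_url # => /page2
puts MembersListPage.new.next_page_url.inspect             # => nil
```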
@chrismytton chrismytton force-pushed the refactor-to-use-scraped_page branch from bf6b8a4 to 4fb737a Compare November 14, 2016 20:03
Most gems we need now come as dependencies of scraped_page.
@chrismytton (Contributor, Author) commented:

@tmtmtmtm 👀

@tmtmtmtm (Contributor) left a comment:


Some comments in place, but the bigger problem here is that this doesn't seem to be fetching the history for each person.

Each person page has a 'Todas las legislaturas' link that goes to a page that lists all the different term memberships they've had: e.g. http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?idDiputado=174&idLegislatura=7

gem "capybara"
gem "poltergeist"
gem 'scraped_page_archive', github: "everypolitician/scraped_page_archive", branch: "master"
gem 'scraped_page', github: 'everypolitician/scraped_page', branch: 'master'
Contributor:

Perhaps worth running a rubocop tidy against this file too, so we don't have mismatched quotes like this?


class MemberPage < SpanishCongressPage
field :iddiputado do
query['idDiputado']
@tmtmtmtm (Contributor) commented Nov 15, 2016:

`query` is a little more ambiguous in this context than when you're already explicitly dealing with a URL. Perhaps something like `query_string` would be a little more obvious?

@chrismytton (Contributor, Author) replied:

Yeah, good point. I think `query_string` is much clearer; will change it to that.
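A hedged sketch of what that rename might look like (the helper is shown standalone here; in the PR it would live on the page superclass):

```ruby
require 'uri'
require 'cgi'

# Illustrative sketch of the renamed helper: `query_string` parses the
# page URL's query into a hash, so fields can read parameters from it.
class MemberPage
  def initialize(url)
    @url = url
  end

  def iddiputado
    query_string['idDiputado'].first
  end

  private

  # Clearer name than `query`: the parsed query-string parameters.
  def query_string
    CGI.parse(URI.parse(@url).query.to_s)
  end
end

url = 'http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?idDiputado=174&idLegislatura=7'
puts MemberPage.new(url).iddiputado # prints "174"
```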

end

def group
@group ||= noko.at_css('div#curriculum div.texto_dip ul li div.dip_rojo:last').text.tidy
Contributor:

Is it really worth memoizing all these things? I don't think they're particularly costly, and some of them aren't even called more than once anyway.

@chrismytton (Contributor, Author) replied:
No, just habit, but you're quite right: this is premature optimization.

# frozen_string_literal: true
require_relative 'spanish_congress_page'

class MembersListPage < SpanishCongressPage
Contributor:

A `ScrapedPage` with no fields is a bit of a red flag. The first two of these look like they should be fields to me (and `next_page_link` a private method).
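A rough sketch of the shape that comment suggests, with a minimal stand-in for the `field` DSL (the real class would inherit it from the scraped_page gem, and the plain-data constructor stands in for real HTML parsing):

```ruby
# Minimal stand-in for the `field` DSL; the real MembersListPage would
# inherit it from the scraped_page gem rather than define it here.
class ScrapedPageStub
  def self.field(name, &block)
    define_method(name, &block)
  end
end

# Shape suggested by the review comment: the data the page exposes
# becomes fields, and next_page_link stays a private helper.
class MembersListPage < ScrapedPageStub
  def initialize(member_urls:, next_href: nil)
    @member_urls = member_urls
    @next_href = next_href
  end

  field :member_urls do
    @member_urls
  end

  field :next_page_url do
    next_page_link && next_page_link[:href]
  end

  private

  def next_page_link
    { href: @next_href } if @next_href
  end
end
```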

This caches a copy of fetched webpages to make the scraper quicker to
develop locally.
@chrismytton chrismytton removed their assignment Nov 27, 2017