-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MembersPage class #10
Conversation
lib/members_page.rb
Outdated
decorator Scraped::Response::Decorator::CleanUrls | ||
|
||
field :member_urls do | ||
noko.css('div#RESULTADOS_DIPUTADOS div.listado_1 ul li a').map { |p| p[:href] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the comments on everypolitician-scrapers/spain_congreso_es#17, our normal approach to this is to have the a/@href
in the selector. If there's a reason why that won't work here, you should note it, otherwise it's better to be consistent.
lib/members_page.rb
Outdated
private | ||
|
||
def next_page_link | ||
noko.css('//div[@class = "paginacion"]//a[contains("Página Siguiente")]').first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here.
test/members_page_test.rb
Outdated
end | ||
|
||
it 'should contain the expected url' do | ||
response.member_urls.first.must_equal 'http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?_piref73_1333155_73_1333154_1333154.next_page=/wc/fichaDiputado&idDiputado=87&idLegislatura=10' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little surprised to see the "_piref" here. I thought the plan was to remove this from URLs where possible? Has that plan changed, or is that something that will happen later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with stripping the '_piref' part of the URL out is that when we request the next page without the the '_piref' in the url, we get a page with the members listed twice:
I think the decorator should be used on the PersonMembership
pages though. My plan is to use these URLs to create the MembershipPage
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we get a page with the members listed twice
Is that a problem? Calling .uniq
on the list is fairly trivial…
Following links with session information doesn't work correctly with Ruby's open-uri, so rather than using Capybara this simply strips away the session information.
Uses decorator to 'piref' info from MembersPage urls URLs stripped of piref info point to pages that list the members in duplicate. Calling .uniq ensures we get each URL only once.
425f5f7
to
33ff039
Compare
33ff039
to
b1d5690
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨ Looks good! 👍 ✨
Have re-reviewed the PR and confirmed that the changes requested have been made.
This PR adds the
MembersPage
class which represents a page of a paginated list of member URLs.Part of: everypolitician/everypolitician-data#27746