enqueueLinks usage to filter links with regexps #2844
Unanswered
antonymarion asked this question in Q&A
Replies: 1 comment 3 replies
-
Hello @antonymarion and thank you for the kind words! Unfortunately, we don't know why it doesn't work. Could you please share a runnable example that demonstrates the issue? The globs and regexps in particular would help a lot. Also, if we could see the target page, that could clear up a lot of things.
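While preparing such an example, it can also help to sanity-check the regexps in isolation against a handful of target URLs before wiring them into the crawler. A minimal standalone check (the patterns and URLs below are hypothetical placeholders) might look like:

```javascript
// Hypothetical regexps and sample URLs; replace with the real allow-list.
const allowedListOfRegexps = [/\/blog\//, /\/docs\//];

const sampleUrls = [
  'https://example.com/blog/hello-world', // expected to match
  'https://example.com/shop/item-42',     // expected NOT to match
];

// A URL is kept if it matches at least one of the regexps.
const matched = sampleUrls.filter((url) =>
  allowedListOfRegexps.some((re) => re.test(url))
);

console.log(matched); // → ['https://example.com/blog/hello-world']
```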
-
Hello Apify team, and thanks for your nice tool!
I am using it to take a snapshot of a whole website, and it works well.
I have now added the option to tell my script that only URLs matching a list of regexps should be visited, but it does not seem to work: the regexp-filtered run takes as long as the full-website snapshot run.
Here is the pseudo code:
I start with the sitemap.xml
arrayOfPagesToVisit = await downloadListOfUrls({url: sitemapXmlUrl});
I then define the PuppeteerCrawler's requestHandler like this:
await page.waitForSelector('.sp-snapshot-ready', {timeout: 120000});
const html = await page.evaluate(() => document.documentElement.outerHTML);
then a call to writeFile to store the page.
Then I call enqueueLinks with the following options:
await enqueueLinks(options);
where
options = { globs: allowedListOfGlobs (glob URLs I always want to crawl), regexps: allowedListOfRegexps }
and finally:
crawler.addRequests(arrayOfPagesToVisit)
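One detail worth checking in the steps above: the globs/regexps options only filter links that enqueueLinks discovers on a page, while crawler.addRequests enqueues everything it is given as-is. So if the whole sitemap list should be restricted, it has to be filtered before the addRequests call. A minimal self-contained sketch of that filtering step (the allow-lists and URLs are hypothetical placeholders):

```javascript
// Hypothetical allow-lists for illustration; the real ones come from the script.
const allowedListOfRegexps = [/\/blog\//, /\/docs\//];
const allowedListOfGlobs = ['https://example.com/about*'];

// Convert a simple glob into a RegExp ('*' matches any run of characters).
const globToRegExp = (glob) =>
  new RegExp(
    '^' +
      glob
        .split('*')
        .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*') +
      '$'
  );

// A URL passes if it matches any regexp OR any glob.
const matchesAllowList = (url) =>
  allowedListOfRegexps.some((re) => re.test(url)) ||
  allowedListOfGlobs.some((glob) => globToRegExp(glob).test(url));

// Hypothetical URLs from the sitemap.
const arrayOfPagesToVisit = [
  'https://example.com/blog/post-1',
  'https://example.com/pricing',
  'https://example.com/about-us',
];

// Filter BEFORE crawler.addRequests, because addRequests does not apply
// the globs/regexps filters itself.
const filteredPagesToVisit = arrayOfPagesToVisit.filter(matchesAllowList);
console.log(filteredPagesToVisit);
// → ['https://example.com/blog/post-1', 'https://example.com/about-us']
```

With this in place, crawler.addRequests(filteredPagesToVisit) would only seed the queue with matching URLs, which should be visible in the crawl time.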
Do you know why filtering with the regexps option does not reduce the overall crawl time?
Cheers,
Antony