
Navigate Through All The Search Results Pages With Beautifulsoup

I can't seem to grasp this. How can I make BeautifulSoup parse every page by following the 'Next page' link until the last page, and stop parsing when no 'Next page' link is found?

Solution 1:

Beautiful Soup only gives you the tools; how you navigate between pages is something you need to work out yourself, in a flow diagram sense.

Taking the page you mentioned and clicking through a few of its pages, it seems that on page 1 no page number is shown in the URL:

htt...ru/moskva/transport

and we see in the source of the page:

<div class="pagination-pages clearfix">
   <span class="pagination-page pagination-page_current">1</span><a class="pagination-page" href="/moskva/transport?p=2">2</a>

Let's check what happens when we go to page 2:

ht...ru/moskva/transport?p=2

<div class="pagination-pages clearfix">
  <a class="pagination-page" href="/moskva/transport">1</a><span class="pagination-page pagination-page_current">2</span><a class="pagination-page" href="/moskva/transport?p=3">3</a>

Perfect, now we have the layout. One more thing to know before we make our beautiful soup: what happens when we go to a page past the last available page, which at the time of this writing was 40161?

ht...ru/moskva/transport?p=40161
we change this to:
ht...ru/moskva/transport?p=40162

The page seems to go back to page 1 automatically. Great!

So now we have everything we need to make our soup loop.

Instead of clicking next each time, just construct the URL yourself. You know the elements required:

url = ht...ru/moskva/$searchterm?p=$pagenum

I'm assuming transport is the search term? I don't know, I can't read Russian. But you get the idea: construct the URL, then make a requests call:

import bs4
import requests
response = requests.get(url)
mysoup = bs4.BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
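To make the URL step concrete, here is a rough sketch of building the address for a given page number. The host in the post is elided, so `https://example.com` is only a placeholder, and `page_url` is a made-up helper name, not anything from requests or BeautifulSoup. It assumes, as observed above, that page 1 has no `p` parameter and later pages use `?p=N`.

```python
from urllib.parse import urlencode

# Placeholder base: the real host is elided in the post, so example.com stands in.
BASE_URL = "https://example.com/moskva/transport"


def page_url(page, base=BASE_URL):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 is served without a query string, as seen in the pagination markup.
    if page == 1:
        return base
    return base + "?" + urlencode({"p": page})


print(page_url(1))
print(page_url(2))
```

The result of `page_url(n)` is what you would pass to `requests.get`.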

Now you can wrap that whole thing in a while loop, and on every request except the first, check:

mysoup.select('.pagination-page_current')[0].text == '1'

This says: each time we get a page, find the currently selected page number using the class pagination-page_current. select() is a method, so it is called with parentheses, and it returns a list, so we take the first element with [0], read its text with .text, and see whether it equals '1'. The text is a string, so compare it against '1', not the integer 1.

This should only be true in two cases: the first page you request, and a request past the last page, which falls back to page 1. So you can use it to start and stop the script, or however you want.
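Putting the pieces together, the loop might look something like the sketch below. It is only an illustration under the assumptions already made in this answer: the site falls back to page 1 when you ask for a page past the end, and the current page is marked with the class `pagination-page_current`. The base URL is a placeholder because the real host is elided, and `fetch_html`, `current_page`, and `crawl` are names invented here. Instead of checking for `'1'` specifically, it stops as soon as the highlighted number differs from the one requested, which covers the same fallback case.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/moskva/transport"  # placeholder; real host is elided


def fetch_html(url):
    """Download a page and return its HTML (raises on HTTP errors)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def current_page(html):
    """Read the highlighted page number from the pagination block, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(".pagination-page_current")
    return int(node.get_text(strip=True)) if node else None


def crawl(base=BASE_URL, fetch=fetch_html, max_pages=1000):
    """Yield (page_number, html) until the site stops serving the requested page."""
    page = 1
    while page <= max_pages:
        url = base if page == 1 else f"{base}?p={page}"
        html = fetch(url)
        # Past the last page the site shows page 1 again, so a mismatch between
        # the requested and the highlighted number means we are done.
        if page > 1 and current_page(html) != page:
            break
        yield page, html
        page += 1
```

You would then iterate with `for page, html in crawl(): ...` and parse each `html` with BeautifulSoup as usual. The `max_pages` cap is just a safety net in case the fallback behaviour changes; adding a short pause between requests would also be polite for a site with tens of thousands of pages.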

this should be everything you need to do this properly. :)

Solution 2:

BeautifulSoup by itself does not load pages. You need to use something like requests: scrape the URL you want to follow, load it, and pass its content to another BeautifulSoup object.

import requests
from bs4 import BeautifulSoup

# Scrape your url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser') # You can now scrape the new page
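If you would rather follow the link on the page than build URLs yourself, a sketch along these lines may help. It assumes the pagination markup quoted in Solution 1, where the link to the following page is the next `<a class="pagination-page">` sibling of the highlighted `<span>`; the real site may lay things out differently. `next_page_url` and `iter_pages` are hypothetical helper names, and the start URL you pass in is up to you.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML (raises on HTTP errors)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def next_page_url(html, page_url):
    """Return the absolute URL of the page after the current one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    current = soup.select_one(".pagination-page_current")
    if current is None:
        return None
    # Assumption: the following page is the next <a class="pagination-page">
    # sibling of the highlighted <span>.
    link = current.find_next_sibling("a", class_="pagination-page")
    if link is None or not link.get("href"):
        return None
    # hrefs in the markup are relative, so resolve them against the current URL.
    return urljoin(page_url, link["href"])


def iter_pages(start_url, fetch=fetch_html, max_pages=1000):
    """Yield (url, html) for each page, following next links until none is left."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_page_url(html, url)
```

Each `html` yielded by `iter_pages` can be handed to `BeautifulSoup(html, 'html.parser')` for the actual scraping. The `seen` set guards against a site that links back to an earlier page.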
