Python ( Selenium ) Script For Downloading PDFs, And When Those Aren't Found It Scrapes Pages For Similar Information

February 06, 2023

So essentially, I am writing a script that loops through a list of search terms, googles them, and downloads the first PDF it sees, but if it can't find one then it goes to the fir

Solution 1:

This works as far as I can tell, although it doesn't download every PDF.

import time

import requests
from bs4 import BeautifulSoup
from selenium.common.exceptions import (
    ElementNotInteractableException,
    NoSuchElementException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Assumes driver, keys and links_list are already defined earlier in the script.
sleep_between_interactions = 5
total = len(keys)
driver.implicitly_wait(10)

for k, key in enumerate(keys, start=1):
    try:
        driver.get("https://www.google.com/")

        # find_element_by_name was deprecated in Selenium 4 and later removed;
        # find_element(By.NAME, ...) is the supported form.
        searchbar = driver.find_element(By.NAME, "q")
        searchbar.send_keys(key)
        searchbar.send_keys(Keys.ARROW_DOWN)
        searchbar.send_keys(Keys.RETURN)
        pdf_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
        print(f"{k} out of {total}")

        # The original test also compared the counters as strings
        # (key_length < key_index_number), which silently skipped some
        # search terms. Checking only for a PDF link avoids that.
        if pdf_elements:
            print("pdf found for: " + key)
            pdf_elements[0].click()
            time.sleep(sleep_between_interactions)
            print(f"downloaded {k} out of {total}")
        else:
            print("pdf NOT found for " + key)
            print(key + " pdf not downloaded, moving on...")
            # Fall back to the text of the first result block; passing params
            # lets requests URL-encode the search term.
            response = requests.get("https://www.google.com/search", params={"q": key})
            soup = BeautifulSoup(response.text, "lxml")
            first_result = soup.find("div", class_="BNeawe")
            if first_result is not None:
                links_list.append(first_result.text)

    except IndexError:
        print("Couldn't find pdf file for \"" + key + "\" due to Index Error, moving on...")
        print(f"{k} out of {total}")
        continue
    except NoSuchElementException:
        print("search bar didn't load, iterating next in loop")
        print("pdf NOT found for " + key)
        print(key + " pdf not downloaded, moving on...")
        continue
    except ElementNotInteractableException:
        print("element either didn't load or doesn't exist")
        driver.get("https://www.google.com/")
        continue
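
One caveat with the XPath `//a[contains(@href, '.pdf')]` is that it matches any link whose URL contains `.pdf` anywhere, including in a query string. A minimal sketch of a stricter check follows, assuming the href values have already been collected as plain strings; the helper name `first_pdf_url` and the sample URLs are hypothetical and not part of the original script.

```python
from urllib.parse import urlparse


def first_pdf_url(hrefs):
    """Return the first href whose URL path ends in .pdf, or None."""
    for href in hrefs:
        if not href:
            # Anchors without an href come back as None or an empty string.
            continue
        path = urlparse(href).path
        if path.lower().endswith(".pdf"):
            return href
    return None


# Hypothetical sample data for illustration only.
sample_hrefs = [
    "https://example.com/search?q=handbook.pdf",  # ".pdf" only in the query
    None,
    "https://example.edu/docs/Program-Handbook.PDF",
]
print(first_pdf_url(sample_hrefs))
```

With Selenium, the list of hrefs could be built by calling `get_attribute("href")` on each element returned by `find_elements`.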

Output from the loop:

1 out of 40
2 out of 40
3 out of 40
4 out of 40
5 out of 40
pdf NOT found for computer science Learning Outcomes California Baptist University
computer science Learning Outcomes California Baptist University pdf not downloaded, moving on...
6 out of 40
pdf NOT found for physicsmath Learning Outcomes California Baptist University
physicsmath Learning Outcomes California Baptist University pdf not downloaded, moving on...
7 out of 40
pdf found for: computer science Learning Outcomes California Lutheran University
downloaded 7 out of 40
8 out of 40
pdf found for: physicsmath Learning Outcomes California Lutheran University
downloaded 8 out of 40
9 out of 40
pdf found for: computer science Program Handbook Azusa Pacific University
downloaded 9 out of 40
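
In the output above, terms 1 to 4 print neither a "found" nor a "NOT found" message. That is consistent with a download check that compares the counters as strings, as in `key_length < key_index_number`, because string comparison is lexicographic rather than numeric. The sketch below reproduces that comparison for 40 search terms; the helper name `passes_string_check` is hypothetical.

```python
def passes_string_check(index, total):
    """Mimic comparing the counters as strings: str(total) < str(index)."""
    return str(total) < str(index)


total = 40
passing = [i for i in range(1, total + 1) if passes_string_check(i, total)]
print(passing)  # → [5, 6, 7, 8, 9]
```

Only terms 5 to 9 pass the string test, which may be why not every PDF gets downloaded. Comparing integers, or dropping the counter comparison altogether, would avoid this.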
