Python (Selenium) Script for Downloading PDFs That Scrapes Pages for Similar Information When None Are Found
So essentially, I am writing a script that loops through a list of search terms, googles them, and downloads the first PDF it sees, but if it can't find one then it goes to the fir
Solution 1:
This works as far as I can tell, although it doesn't download every PDF.
import time

import requests
from bs4 import BeautifulSoup
from selenium.common.exceptions import (
    ElementNotInteractableException,
    NoSuchElementException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Assumes `driver` (a Selenium WebDriver), `keys` (the list of search terms)
# and `links_list` (a list for fallback results) are defined earlier in the script.
driver.implicitly_wait(10)  # applies to every element lookup, so set it once
sleep_between_interactions = 5

for k, key in enumerate(keys, start=1):
    progress = f"{k} out of {len(keys)}"
    try:
        driver.get("https://www.google.com/")
        searchbar = driver.find_element(By.NAME, "q")
        searchbar.send_keys(key)
        searchbar.send_keys(Keys.ARROW_DOWN)
        searchbar.send_keys(Keys.RETURN)
        pdf_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
        print(progress)
        # Decide on the link list alone. Comparing the counters as strings
        # would be lexicographic: "40" < "5" is True, but "40" < "4" is False.
        if pdf_elements:
            print("pdf found for: " + key)
            pdf_elements[0].click()
            time.sleep(sleep_between_interactions)  # give the download time to start
            print("downloaded " + progress)
        else:
            print("pdf NOT found for " + key)
            print(key + " pdf not downloaded, moving on...")
            # Fall back to the text of the first result block on the results page;
            # `params` URL-encodes the search term.
            response = requests.get("https://www.google.com/search", params={"q": key})
            soup = BeautifulSoup(response.text, "lxml")
            first_result = soup.find("div", class_="BNeawe")
            if first_result is not None:  # the class may be absent from the page
                links_list.append(first_result.text)
    except NoSuchElementException:
        print("search bar didn't load, iterating next in loop")
        print("pdf NOT found for " + key)
        print(key + " pdf not downloaded, moving on...")
    except ElementNotInteractableException:
        print("element either didn't load or doesn't exist")
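The XPath test `contains(@href, '.pdf')` matches any link with `.pdf` anywhere in its URL, including query strings. A stricter check is sketched below using only the standard library. The helper names `is_pdf_link` and `first_pdf_link` and the example.edu URLs are hypothetical, and the sketch assumes the hrefs have already been read from the elements (for instance with `get_attribute("href")`).

```python
from urllib.parse import urlparse


def is_pdf_link(href):
    """Return True when the URL's path component ends in .pdf (case-insensitive)."""
    if not href:
        return False
    return urlparse(href).path.lower().endswith(".pdf")


def first_pdf_link(hrefs):
    """Return the first href that points at a PDF, or None if there is none."""
    return next((h for h in hrefs if is_pdf_link(h)), None)


if __name__ == "__main__":
    candidates = [
        "https://example.edu/search?q=handbook.pdf",   # ".pdf" only in the query
        "https://example.edu/catalog/outcomes.PDF?download=1",
    ]
    print(first_pdf_link(candidates))
```

Checking the path rather than the whole URL avoids clicking a link that merely mentions a PDF in its query string.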
Outputs
1 out of 40
2 out of 40
3 out of 40
4 out of 40
5 out of 40
pdf NOT found for computer science Learning Outcomes California Baptist University
computer science Learning Outcomes California Baptist University pdf not downloaded, moving on...
6 out of 40
pdf NOT found for physicsmath Learning Outcomes California Baptist University
physicsmath Learning Outcomes California Baptist University pdf not downloaded, moving on...
7 out of 40
pdf found for: computer science Learning Outcomes California Lutheran University
downloaded 7 out of 40
8 out of 40
pdf found for: physicsmath Learning Outcomes California Lutheran University
downloaded 8 out of 40
9 out of 40
pdf found for: computer science Program Handbook Azusa Pacific University
downloaded 9 out of 40
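In the output above, keys 1 to 4 print only the counter while keys 5 to 9 report a result. That pattern is what a comparison between string counters would produce, because Python orders strings character by character. A minimal sketch, using the total of 40 from the output and otherwise hypothetical variable names:

```python
total = 40

# Indices for which "total < index" holds when both sides are strings.
string_true = [i for i in range(1, total + 1) if str(total) < str(i)]

# The same test on integers never holds for indices up to the total.
int_true = [i for i in range(1, total + 1) if total < i]

print(string_true)  # [5, 6, 7, 8, 9]
print(int_true)     # []
```

Neither form is a useful gate for downloading; checking only whether a PDF link was found is probably what is wanted.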