Python Web Crawler Sometimes Returns Half Of The Source Code, Sometimes All Of It... From The Same Website
I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I haven't been able to fix: sometimes the crawler only returns part of a page's source code, and other times it returns all of it, even when hitting the same website.
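For context, the fetch itself isn't shown in the question, but it presumably looks something like the sketch below. The use of requests, the helper name, and the check for a closing </html> tag are all assumptions for illustration; the check is just one quick way to see whether the page arrives already truncated, before any parsing happens.

# Hypothetical sketch -- the question's actual fetch code isn't shown.
# The function name, URL argument, and truncation check are assumptions.
import requests
from bs4 import BeautifulSoup

def fetch_patent_page(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    html = resp.text
    # Crude truncation check: a complete page should end with </html>
    if "</html>" not in html[-200:].lower():
        print("Warning: response looks truncated ({} bytes)".format(len(html)))
    return BeautifulSoup(html, "html5lib")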
Solution 1:
I still don't know what's causing the issue, but in case someone has a similar problem, here is the workaround I figured out: if you write the source code to a file instead of trying to work with it directly, it doesn't get cut off. I suspect the issue comes after the data is downloaded but before it's loaded into the 'workspace'. Here's the piece of code I wrote into the scraper:
if examiner == "Examiner not found":
filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
sys.stdout = open(filename, 'w')
print(patnum)
print(pto_soup.prettify())
sys.stdout = console_out
# Take that logged code and find the examiner name
sec = "Not found"
prim = "Not found"
scraped_code = open(r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.txt')
scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')
# Find all italics (<i>) tags
for italics in scrapedsoup.find_all("i"):
for desc in italics.descendants:
# Check to see if any of them affect the words "Primary Examiner"
if "Primary Examiner:" in desc:
prim = desc.next_element.strip()
#print("Primary found: ", prim)
else:
pass
# Same for "Assistant Examiner"
if "Assistant Examiner:" in desc:
sec = desc.next_element.strip()
#print("Assistant found: ", sec)
else:
pass
# If "Secondary Examiner" in there, set 'examiner' to the next string
# If there is no secondary examiner, use the primary examiner
if sec != "Not found":
examiner = sec
elif prim != "Not found":
examiner = prim
else:
examiner = "Examiner not found"
# Show new results in the console
print(examiner)
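For what it's worth, the same round-trip through disk can be done without reassigning sys.stdout, which also guarantees the file is closed (and therefore fully flushed) before it is read back. This is only a sketch of that variant; the path and variable names simply mirror the snippet above.

# Same workaround without redirecting sys.stdout: write the source straight
# to disk, then parse the saved file. Names and path follow the snippet above.
dump_path = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
with open(dump_path, 'w') as f:
    f.write(patnum + '\n')
    f.write(pto_soup.prettify())

with open(dump_path) as f:
    scrapedsoup = BeautifulSoup(f.read(), 'html5lib')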