Python Web Crawler Sometimes Returns Half Of The Source Code, Sometimes All Of It... From The Same Website
I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I haven't been able to fix: sometimes the crawler only returns part of a page's source code, and other times it returns all of it, even when hitting the same website.
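For context, the fetch itself isn't shown in the question, but it presumably looks something like the sketch below. The use of requests, the helper name, and the check for a closing </html> tag are all assumptions for illustration; the check is just one quick way to see whether the page arrives already truncated, before any parsing happens.

# Hypothetical sketch -- the question's actual fetch code isn't shown.
# The function name, URL argument, and truncation check are assumptions.
import requests
from bs4 import BeautifulSoup

def fetch_patent_page(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    html = resp.text
    # Crude truncation check: a complete page should end with </html>
    if "</html>" not in html[-200:].lower():
        print("Warning: response looks truncated ({} bytes)".format(len(html)))
    return BeautifulSoup(html, "html5lib")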
Solution 1:
I still don't know what's causing the issue, but in case someone has a similar problem, here is the workaround I figured out: if you write the source code to a file instead of trying to work with it directly, it doesn't get cut off. I suspect the issue comes after the data is downloaded but before it's loaded into the 'workspace'. Here's the piece of code I wrote into the scraper:
if examiner == "Examiner not found":
filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
sys.stdout = open(filename, 'w')
print(patnum)
print(pto_soup.prettify())
sys.stdout = console_out
# Take that logged code and find the examiner name
sec = "Not found"
prim = "Not found"
scraped_code = open(r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.txt')
scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')
# Find all italics (<i>) tags
for italics in scrapedsoup.find_all("i"):
for desc in italics.descendants:
# Check to see if any of them affect the words "Primary Examiner"
if "Primary Examiner:" in desc:
prim = desc.next_element.strip()
#print("Primary found: ", prim)
else:
pass
# Same for "Assistant Examiner"
if "Assistant Examiner:" in desc:
sec = desc.next_element.strip()
#print("Assistant found: ", sec)
else:
pass
# If "Secondary Examiner" in there, set 'examiner' to the next string
# If there is no secondary examiner, use the primary examiner
if sec != "Not found":
examiner = sec
elif prim != "Not found":
examiner = prim
else:
examiner = "Examiner not found"
# Show new results in the console
print(examiner)
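For what it's worth, the same round-trip through disk can be done without reassigning sys.stdout, which also guarantees the file is closed (and therefore fully flushed) before it is read back. This is only a sketch of that variant; the path and variable names simply mirror the snippet above.

# Same workaround without redirecting sys.stdout: write the source straight
# to disk, then parse the saved file. Names and path follow the snippet above.
dump_path = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
with open(dump_path, 'w') as f:
    f.write(patnum + '\n')
    f.write(pto_soup.prettify())

with open(dump_path) as f:
    scrapedsoup = BeautifulSoup(f.read(), 'html5lib')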