
Retrieve Only a Portion of an XML Feed

I'm using Scrapy's XMLFeedSpider to parse a big XML feed (60 MB) from a website, and I was wondering if there is a way to retrieve only a portion of it instead of all 60 MB.

Solution 1:

When you are processing large XML documents and don't want to load the whole thing into memory, as DOM parsers do, you need to switch to a SAX parser.

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

For a 60 MB XML document, this is likely to be very low compared to the requirements for creating a DOM. In fact, most DOM-based systems use a SAX-style parser at a lower level to build up the tree.

To make use of SAX, subclass xml.sax.saxutils.XMLGenerator and override endElement, startElement and characters. Then call xml.sax.parse with it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.
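Here is a minimal sketch of the idea, using the standard library's xml.sax.ContentHandler (the base class that XMLGenerator itself extends). The element name `title`, the `StopParsing` exception, and the item limit are illustrative assumptions, not part of the original answer; the point is that the handler sees events one at a time and can abort as soon as it has enough data:

```python
import io
import xml.sax


class StopParsing(Exception):
    """Raised from a handler callback to abort parsing early."""


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of <title> elements, stopping after `limit` of them."""

    def __init__(self, limit=2):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._buf))
            if len(self.titles) >= self.limit:
                # Abort: the rest of the document is never parsed.
                raise StopParsing


# A tiny in-memory stand-in for the 60 MB feed.
feed = io.BytesIO(b"<feed><title>a</title><title>b</title><title>c</title></feed>")
handler = TitleHandler(limit=2)
try:
    xml.sax.parse(feed, handler)
except StopParsing:
    pass
print(handler.titles)  # ['a', 'b']
```

Raising an exception from a callback is the usual way to stop a SAX parse midway, since the parser itself offers no "stop" method.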

Solution 2:

You should set the iterator attribute of your XMLFeedSpider to iternodes (see the XMLFeedSpider documentation):

It’s recommended to use the iternodes iterator for performance reasons

After doing so, you should be able to iterate over your feed and stop at any point.
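Outside of Scrapy, the same streaming idea can be sketched with the standard library's ElementTree.iterparse, which also walks nodes incrementally and lets you bail out at any point. The `<item>` tag and the limit of 2 are assumptions for illustration:

```python
import io
import xml.etree.ElementTree as ET

# A tiny in-memory stand-in for a large feed of <item> elements.
feed = io.BytesIO(b"<feed><item>a</item><item>b</item><item>c</item></feed>")

items = []
# iterparse yields elements as their end tags are seen, never the full tree.
for event, elem in ET.iterparse(feed, events=("end",)):
    if elem.tag == "item":
        items.append(elem.text)
        elem.clear()  # release the element's children to keep memory flat
        if len(items) >= 2:
            break  # stop early; the remainder of the feed is never parsed

print(items)  # ['a', 'b']
```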
