
Retrieve Only a Portion of an XML Feed

I'm using Scrapy's XMLFeedSpider to parse a big XML feed (60 MB) from a website, and I was wondering if there is a way to retrieve only a portion of it instead of all 60 MB.

Solution 1:

When you are processing large XML documents and don't want to load the whole thing into memory, as DOM parsers do, you need to switch to a SAX parser.

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

For a 60 MB XML document, this is likely to be very low compared to the requirements for creating a DOM. In fact, most DOM-based systems use a SAX-style parser at a lower level to build up the tree.

To make use of SAX, subclass xml.sax.saxutils.XMLGenerator and override endElement, startElement and characters. Then call xml.sax.parse with it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.
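Here is a minimal sketch of the idea, using the standard library's xml.sax.ContentHandler (the base class that XMLGenerator itself extends). The element name `title`, the `StopParsing` exception, and the item limit are illustrative assumptions, not part of the original answer; the point is that the handler sees events one at a time and can abort as soon as it has enough data:

```python
import io
import xml.sax


class StopParsing(Exception):
    """Raised from a handler callback to abort parsing early."""


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of <title> elements, stopping after `limit` of them."""

    def __init__(self, limit=2):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._buf))
            if len(self.titles) >= self.limit:
                # Abort: the rest of the document is never parsed.
                raise StopParsing


# A tiny in-memory stand-in for the 60 MB feed.
feed = io.BytesIO(b"<feed><title>a</title><title>b</title><title>c</title></feed>")
handler = TitleHandler(limit=2)
try:
    xml.sax.parse(feed, handler)
except StopParsing:
    pass
print(handler.titles)  # ['a', 'b']
```

Raising an exception from a callback is the usual way to stop a SAX parse midway, since the parser itself offers no "stop" method.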

Solution 2:

You should set the iterator attribute of your XMLFeedSpider to iternodes (see the XMLFeedSpider documentation):

It’s recommended to use the iternodes iterator for performance reasons

After doing so, you should be able to iterate over your feed and stop at any point.
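Outside of Scrapy, the same streaming idea can be sketched with the standard library's ElementTree.iterparse, which also walks nodes incrementally and lets you bail out at any point. The `<item>` tag and the limit of 2 are assumptions for illustration:

```python
import io
import xml.etree.ElementTree as ET

# A tiny in-memory stand-in for a large feed of <item> elements.
feed = io.BytesIO(b"<feed><item>a</item><item>b</item><item>c</item></feed>")

items = []
# iterparse yields elements as their end tags are seen, never the full tree.
for event, elem in ET.iterparse(feed, events=("end",)):
    if elem.tag == "item":
        items.append(elem.text)
        elem.clear()  # release the element's children to keep memory flat
        if len(items) >= 2:
            break  # stop early; the remainder of the feed is never parsed

print(items)  # ['a', 'b']
```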
