Retrieve Only A Portion Of An Xml Feed
Solution 1:
When you are processing large XML documents and don't want to load the whole thing into memory, as DOM parsers do, you need to switch to a SAX parser.
SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).
For a 60 MB XML document, this is likely to be very low compared to the requirements for creating a DOM. Most DOM-based systems actually use SAX at a much lower level to build up the tree.
To make use of SAX, subclass xml.sax.saxutils.XMLGenerator (or its base class xml.sax.handler.ContentHandler, if you don't need the XML written back out) and override endElement, startElement and characters. Then call xml.sax.parse with an instance of it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.
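For what it's worth, a minimal sketch of this approach might look like the following. It subclasses xml.sax.handler.ContentHandler rather than XMLGenerator, since nothing is written back out, and it stops early by raising an exception from the handler. The item and title element names, the StopParsing exception, the FirstTitles class and the first_titles helper are all assumptions made for illustration, not part of the original question.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class StopParsing(Exception):
    """Raised by the handler to abandon the parse once enough items are read."""


class FirstTitles(ContentHandler):
    """Collect the text of <title> elements nested in the first `limit` <item>s."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_item = False
        self._buffer = None  # list of text chunks while inside <item><title>

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times for a single text node,
        # so the chunks are accumulated and joined in endElement().
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None
        elif name == "item":
            self._in_item = False
            if len(self.titles) >= self.limit:
                # Enough items collected: abort the rest of the parse.
                raise StopParsing


def first_titles(source, limit):
    """Parse `source` (a filename or file object) and return up to `limit` titles."""
    handler = FirstTitles(limit)
    try:
        xml.sax.parse(source, handler)
    except StopParsing:
        pass
    return handler.titles


# Tiny made-up feed used only to exercise the sketch.
FEED = b"""<?xml version="1.0"?>
<rss><channel><title>Feed</title>
<item><title>One</title></item>
<item><title>Two</title></item>
<item><title>Three</title></item>
</channel></rss>"""

titles = first_titles(io.BytesIO(FEED), limit=2)
print(titles)
```

Because the handler keeps only the current text buffer and the collected titles, memory use stays small no matter how large the feed is, and the third item is never processed.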
Solution 2:
You should set the iterator mode of your XMLFeedSpider to iternodes. As the Scrapy documentation puts it:
"It's recommended to use the iternodes iterator for performance reasons"
After doing so, you should be able to iterate over your feed and stop at any point.