lxml element.clear() and accessing child elements

Problem description

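I am using lxml's iterparse to parse a fairly large XML file. At some point an out-of-memory exception is raised. I know there are similar questions, and that the usual advice is to call element.clear() to clear the built tree once you no longer need it.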

My code looks like this (shortened):

for event, element in context:
    if element.tag == xmlns + 'initialized':
        attributes = element.findall(xmlns + 'attribute')
        # Python 2: filter() returns a list, so it can be indexed directly
        heapsize = filter(lambda x: x.attrib['name'] == 'maxHeapSize', attributes)[0].attrib['value']
        characteristics['max_heap_size_MB'] = bytes_to_MB(int(heapsize, 16))

    # clear up the built tree to avoid mem alloc fails
    element.clear()
del context

If I comment out element.clear(), this works. If I keep element.clear(), I get a KeyError like the following:

Traceback (most recent call last):
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 289, in <module>
    main()
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 277, in main
    join_characteristics_and_score(logpath, benchmarkscores)
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 140, in join_characteristics_and_score
    parsed_verbose_xml  = parse_xml(verbose)
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in parse_xml
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in <lambda>
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "lxml.etree.pyx", line 2272, in lxml.etree._Attrib.__getitem__ (src\lxml\lxml.etree.c:54751)
KeyError: 'name'

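When I print the elements' attributes without calling element.clear(), they are regular dicts containing the expected values. With element.clear(), those dicts are empty.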

Edit

A minimal runnable Python program that illustrates the problem:

#!/usr/bin/python

from lxml import etree
from pprint import pprint

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    xmlns = "{http://www.ibm.com/j9/verbosegc}"

    if elem.tag == xmlns + "gc-start":
        memelements = elem.findall('.//root:mem', namespaces = {'root':xmlns[1:-1]})
        pprint(memelements)

if __name__ == '__main__':
    # binary mode, so that iterparse reads bytes
    with open('small.xml', "rb") as xmlf:
        context = etree.iterparse(xmlf)
        fast_iter(context, process_element)

The contents of the XML file are as follows:

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
  <gc-start id="5" type="scavenge" contextid="4" timestamp="2013-06-14T15:48:46.815">
    <mem-info id="6" free="3048240" total="4194304" percent="72">
      <mem type="nursery" free="0" total="1048576" percent="0">
        <mem type="allocate" free="0" total="524288" percent="0" />
        <mem type="survivor" free="0" total="524288" percent="0" />
      </mem>
      <mem type="tenure" free="3048240" total="3145728" percent="96">
        <mem type="soa" free="2891568" total="2989056" percent="96" />
        <mem type="loa" free="156672" total="156672" percent="100" />
      </mem>
      <remembered-set count="1593" />
    </mem-info>
  </gc-start>
</verbosegc>
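
The symptom described above follows from the order of events: iterparse delivers an element's 'end' event only after the 'end' events of all its descendants, so a loop that clears every element has already emptied the children by the time their parent is processed. Below is a minimal sketch that reproduces this. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (only iterparse, findall, and clear() are used, and both libraries provide them). The function name trace_end_events and the trimmed, namespace-free XML are illustrative and not part of the original question.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

# Trimmed copy of the sample file; the namespace is omitted to keep the sketch short.
XML = b"""<verbosegc>
  <gc-start id="5" type="scavenge">
    <mem-info id="6" free="3048240" total="4194304" percent="72">
      <mem type="nursery" free="0" total="1048576" percent="0"/>
      <mem type="tenure" free="3048240" total="3145728" percent="96"/>
    </mem-info>
  </gc-start>
</verbosegc>"""


def trace_end_events(xml_bytes):
    """Clear every element on its 'end' event; record what <gc-start> still sees."""
    order = []            # tags in the order their 'end' events arrive
    child_attribs = None  # attribute dicts of the direct children of <gc-start>
    nested_mem_count = None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        order.append(elem.tag)
        if elem.tag == "gc-start":
            # Every descendant has already had its own 'end' event
            # and was cleared by the call at the bottom of this loop.
            child_attribs = [dict(child.attrib) for child in elem]
            nested_mem_count = len(elem.findall(".//mem"))
        elem.clear()  # removes attributes, text, and all subelements
    return order, child_attribs, nested_mem_count


order, child_attribs, nested_mem_count = trace_end_events(XML)
print(order)
print(child_attribs)
print(nested_mem_count)
```

In this sketch the direct child of gc-start is still attached but its attribute dict is empty, which is the situation that produces the KeyError. The nested mem elements are no longer reachable at all, because clearing mem-info detached them.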

Recommended answer

Liza Daly wrote an excellent article on processing large XML files with lxml. Try the fast_iter code presented there:

import lxml.etree as ET
import pprint


def fast_iter(context, func, *args, **kwargs):
    """
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (Liza Daly)
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        # (ancestor loop added by unutbu)
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem, namespaces):
    memelements = elem.findall('.//root:mem', namespaces=namespaces)
    pprint.pprint(memelements)

if __name__ == '__main__':
    xmlns = "http://www.ibm.com/j9/verbosegc"
    namespaces = {'root': xmlns}
    # binary mode, so that iterparse reads bytes
    with open('small.xml', "rb") as xmlf:
        context = ET.iterparse(xmlf, events=('end', ),
                               tag='{{{}}}gc-start'.format(xmlns))
        fast_iter(context, process_element, namespaces)
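
The important difference from the question's code is the tag argument passed to iterparse: the loop only receives gc-start elements, so their descendants are never cleared one by one before process_element looks at them. Below is a minimal sketch of the same idea that filters on the tag inside the loop instead. It assumes the standard library's xml.etree.ElementTree, which has no tag keyword and no getparent()/getprevious(), so the sibling-deletion step of fast_iter is left out. The names mem_types_per_gc, NS, and XML are illustrative.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

NS = "http://www.ibm.com/j9/verbosegc"

# Shortened version of the sample file from the question.
XML = b"""<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
  <gc-start id="5" type="scavenge">
    <mem-info id="6" free="3048240" total="4194304" percent="72">
      <mem type="nursery" free="0" total="1048576" percent="0">
        <mem type="allocate" free="0" total="524288" percent="0"/>
      </mem>
      <mem type="tenure" free="3048240" total="3145728" percent="96"/>
    </mem-info>
  </gc-start>
</verbosegc>"""


def mem_types_per_gc(source):
    """Yield the 'type' of every <mem> under each <gc-start>, then clear it."""
    target = "{%s}gc-start" % NS
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != target:
            # Leave descendants untouched until their <gc-start> has ended.
            continue
        mems = elem.findall(".//root:mem", namespaces={"root": NS})
        yield [m.attrib["type"] for m in mems]
        # Safe now: the whole subtree has been processed.
        elem.clear()


result = list(mem_types_per_gc(io.BytesIO(XML)))
print(result)
```

Because the subtree is cleared only once its gc-start element is complete, the attributes are still present when they are read. Memory use is then bounded by the largest single gc-start subtree plus the emptied shells left under the root, which is why fast_iter additionally deletes preceding siblings.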
