Parsing XML in Python with etree.ElementTree

The xml package in the standard library provides tools for working with XML documents.

The xml.etree.ElementTree module offers an intuitive way of parsing and representing XML data.

An ElementTree object represents XML data as a tree structure whose hierarchy follows the nesting of the XML elements.

Basic parsing example

Suppose we have an XML file called articles.xml with the following content:

<?xml version = '1.0' encoding = 'UTF-8'?>

<articlelist>

  <article>
    <author country = 'India'>John Doe</author>
    <datepublished>2024/04/05</datepublished>  
    <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
    <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
  molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
  numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
  optio, eaque rerum! Provident similique accusantium nemo autem.
   </content>
  </article>
  <article>
    <author country = 'Finland'>Mary Smith</author>
    <datepublished>2024/04/07</datepublished>
    <title>Perspiciatis minima nesciunt dolorem</title>
    <content>Perspiciatis 
  minima nesciunt dolorem! Officiis iure rerum voluptates a cumque velit 
  quibusdam sed amet tempora. Sit laborum ab, eius fugit doloribus tenetur 
  fugiat, temporibus enim commodi iusto libero magni deleniti quod quam 
  consequuntur! Commodi minima excepturi repudiandae velit hic maxime
  doloremque.</content>
  </article>
  
</articlelist>

We can parse the document by passing the open file object to the ElementTree.parse() function, as shown below:

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  print(tree)

<xml.etree.ElementTree.ElementTree object at 0x000001B03DB770E0>

As shown in the above example, the ElementTree.parse() helper function creates an ElementTree instance from the given file object.

The ElementTree object represents the structure of the XML document as a tree, where each node in the tree represents the corresponding element in the document.
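As a small sketch of what can be done with the parsed tree, the example below calls getroot() to reach the top-level element. So that it runs without articles.xml on disk, it parses a shortened, made-up document from an io.StringIO object, which stands in for the open file.

```python
import io
from xml.etree import ElementTree

# A shortened stand-in for articles.xml (hypothetical sample data).
sample = """<articlelist>
  <article><title>First</title></article>
  <article><title>Second</title></article>
</articlelist>"""

# StringIO behaves like an open text file, so parse() can read from it.
tree = ElementTree.parse(io.StringIO(sample))

# getroot() returns the Element at the top of the tree.
root = tree.getroot()
print(root.tag)   # articlelist
print(len(root))  # 2 (number of direct children)
```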

Traversing an ElementTree

The tree.iter() method returns an iterator that yields the nodes of the parsed tree in document order (depth-first), starting from the root. By default, it yields all nodes in the tree.

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)

  for node in tree.iter():
    print(node.tag)

articlelist
article
author
datepublished
title
content
article
author
datepublished
title
content

You can pass a tag as an argument to the tree.iter() method so that it will only iterate over the elements with that tag.

tree.iter(tag = None)

For example, to get only the nodes with the author tag, we can pass 'author' as the tag argument, as shown below:

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  for node in tree.iter('author'):
    print(node.text)

John Doe
Mary Smith
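The iter() method is also available on individual Element objects, where it walks only that element's subtree, starting with the element itself. The sketch below illustrates this on a shortened, made-up document built from a string, so it does not depend on articles.xml.

```python
from xml.etree import ElementTree

# Hypothetical, shortened sample in the shape of articles.xml.
root = ElementTree.XML(
    "<articlelist>"
    "<article><author>John Doe</author><title>First</title></article>"
    "<article><author>Mary Smith</author><title>Second</title></article>"
    "</articlelist>"
)

# Take the first <article> element and walk only its subtree.
first_article = root[0]
tags = [node.tag for node in first_article.iter()]
print(tags)  # ['article', 'author', 'title']

# Filtering by tag works the same way on an Element as on the whole tree.
titles = [node.text for node in root.iter('title')]
print(titles)  # ['First', 'Second']
```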

Search for Nodes

Parsed trees provide some useful methods for searching for nodes with certain characteristics. They allow you to find nodes with a given tag, or nodes that appear at a certain depth of the parse tree.

The two basic methods for searching are find() and findall().

Find single node - `tree.find()`

The tree.find() method returns the first node that matches the search string. It returns None if there is no matching node.

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  n = tree.find('.//author')
  print(n.text)

John Doe

Note how the search string is formatted. A leading "./" selects direct children of the current node (here the root), while ".//" selects matching descendants at any depth. So in the above case, we are looking for the first node with the author tag anywhere below the root. If we are looking for an article element, which is a direct child of the root, we would use a single slash, i.e. "./article".

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)

  article = tree.find('./article')
  for i in article:
    print(i.tag, ':', i.text)

author : John Doe
datepublished : 2024/04/05
title : Lorem ipsum dolor sit amet consectetur adipisicing elit
content : Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
optio, eaque rerum! Provident similique accusantium nemo autem.
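To see how the different path forms behave, here is a minimal sketch on a shortened, made-up document. It compares a direct-child search, a fully spelled-out path, and a descendant search for the same author element.

```python
from xml.etree import ElementTree

# Hypothetical, shortened sample in the shape of articles.xml.
root = ElementTree.XML(
    "<articlelist>"
    "<article><author>John Doe</author></article>"
    "<article><author>Mary Smith</author></article>"
    "</articlelist>"
)

# './author' looks only at direct children of the root, so nothing matches.
direct = root.find('./author')
print(direct)  # None

# './article/author' spells out each level of nesting.
nested = root.find('./article/author')
print(nested.text)  # John Doe

# './/author' searches all descendants, whatever their depth.
anywhere = root.find('.//author')
print(anywhere.text)  # John Doe
```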

If you only want the text value, you can use the findtext() method instead of find().

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  print(tree.findtext('./article/title'))

Lorem ipsum dolor sit amet consectetur adipisicing elit

The './article/title' search string matches a title element that is nested directly inside an article element. This can be especially useful if the XML document contains elements with the same tag at different levels.
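Since find() and findtext() return None when nothing matches, it can help to guard against missing elements. The sketch below, on a made-up document with a hypothetical summary tag that does not exist in it, shows an explicit None check and the optional default argument of findtext().

```python
from xml.etree import ElementTree

# Hypothetical sample; it deliberately has no <summary> element.
root = ElementTree.XML(
    "<articlelist><article><title>First</title></article></articlelist>"
)

# find() returns None when nothing matches; compare with 'is None'
# rather than relying on the truth value of an element.
summary = root.find('./article/summary')
if summary is None:
    print('no summary element')

# findtext() takes an optional default that is returned instead of None.
missing = root.findtext('./article/summary', default='(none)')
present = root.findtext('./article/title', default='(none)')
print(missing)  # (none)
print(present)  # First
```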

Find all matching elements - `tree.findall()`

The tree.findall() method returns a list of all matching nodes for the given search string.

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  nodes = tree.findall('.//author')
  print(nodes)

  for n in nodes:
    print(n.text)

[<Element 'author' at 0x0000011425175FD0>, <Element 'author' at 0x0000011425176160>]
John Doe
Mary Smith
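The search string can also filter on attributes. As a hedged sketch on a shortened, made-up document, the example below uses the [@attribute='value'] and [@attribute] predicates from the XPath subset that ElementTree supports.

```python
from xml.etree import ElementTree

# Hypothetical, shortened sample in the shape of articles.xml.
root = ElementTree.XML(
    "<articlelist>"
    "<article><author country='India'>John Doe</author></article>"
    "<article><author country='Finland'>Mary Smith</author></article>"
    "</articlelist>"
)

# [@country='Finland'] keeps only elements whose attribute has that value.
finnish = root.findall(".//author[@country='Finland']")
names = [author.text for author in finnish]
print(names)  # ['Mary Smith']

# [@country] keeps every element that has the attribute at all.
with_country = root.findall('.//author[@country]')
print(len(with_country))  # 2
```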

A deeper look at nodes

The Element objects returned by methods like tree.iter(), tree.find(), etc. each represent a single node in the XML parse tree.

Element objects have some useful attributes and methods for accessing and manipulating the information of the XML element they represent. We have already used some of these attributes, such as text and tag.

The attrib dictionary of an Element object stores the attributes of the represented xml element.

from xml.etree import ElementTree

with open('articles.xml') as file:
  tree = ElementTree.parse(file)
  n = tree.find('.//author')
  print(n.attrib)
  print(n.text)
  print(n.attrib.get('country'))

{'country': 'India'}
John Doe
India

You can use the tail attribute to get the text that comes after the closing tag of a given node.
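A minimal sketch of the difference between text and tail, using a made-up snippet in which text appears both inside and after a nested element:

```python
from xml.etree import ElementTree

# Hypothetical snippet with text both inside and after the <b> element.
para = ElementTree.XML('<p>Hello <b>big</b> world</p>')

bold = para.find('b')
print(repr(para.text))  # 'Hello '  (text before the first child)
print(repr(bold.text))  # 'big'     (text inside <b>)
print(repr(bold.tail))  # ' world'  (text after </b>)
```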

Parsing Strings

If the XML data is in the form of a string, we can parse it using the XML() function, which takes the XML string as an argument, parses it and creates an Element object representation.

from xml.etree import ElementTree

xml_data = '''
  <article>
    <author country = 'India'>John Doe</author>
    <datepublished>2024/04/05</datepublished>  
    <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
    <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
  molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
  numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
  optio, eaque rerum! Provident similique accusantium nemo autem.
   </content>
  </article>'''

article = ElementTree.XML(xml_data)

print(article.findtext('./author'))
print(article.findtext('./datepublished'))
print(article.findtext('./title'))

John Doe
2024/04/05
Lorem ipsum dolor sit amet consectetur adipisicing elit

Note that unlike parse() which returns an ElementTree instance, the return value of XML() is an Element object.
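If tree-level methods are needed, the Element can be wrapped in an ElementTree. The sketch below, on made-up input, shows this together with fromstring(), which parses a string in the same way as XML(), and the ParseError exception raised for malformed input.

```python
from xml.etree import ElementTree

# fromstring() parses a string into an Element, like XML().
element = ElementTree.fromstring('<article><title>First</title></article>')

# Wrap the Element in an ElementTree to use tree-level methods.
tree = ElementTree.ElementTree(element)
print(tree.getroot() is element)  # True

# Malformed input (here, a missing </title>) raises ParseError.
try:
    ElementTree.XML('<article><title>First</article>')
    parsed_ok = True
except ElementTree.ParseError as err:
    parsed_ok = False
    print('could not parse:', err)
```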
