The xml package in the standard library provides tools for working with XML documents.
The xml.etree.ElementTree module offers an intuitive way of parsing and representing XML data.
Its ElementTree class represents XML data as a tree structure in which the hierarchy follows the nesting of the XML elements.
Basic parsing example
Suppose we have an XML file called articles.xml with the following content:
<?xml version = '1.0' encoding = 'UTF-8'?>
<articlelist>
<article>
<author country = 'India'>John Doe</author>
<datepublished>2024/04/05</datepublished>
<title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
<content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
optio, eaque rerum! Provident similique accusantium nemo autem.
</content>
</article>
<article>
<author country = 'Finland'>Mary Smith</author>
<datepublished>2024/04/07</datepublished>
<title>Perspiciatis minima nesciunt dolorem</title>
<content>Perspiciatis
minima nesciunt dolorem! Officiis iure rerum voluptates a cumque velit
quibusdam sed amet tempora. Sit laborum ab, eius fugit doloribus tenetur
fugiat, temporibus enim commodi iusto libero magni deleniti quod quam
consequuntur! Commodi minima excepturi repudiandae velit hic maxime
doloremque.</content>
</article>
</articlelist>
We can parse the document by passing the opened file object as an argument to the ElementTree.parse() function, as shown below:
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
print(tree)
<xml.etree.ElementTree.ElementTree object at 0x000001B03DB770E0>
As shown in the above example, the ElementTree.parse() function creates an ElementTree instance from the given file object.
The ElementTree object represents the structure of the XML document as a tree, where each node corresponds to an element in the XML document.
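As a minimal sketch of what the parsed tree gives you, the example below parses a shortened, in-memory stand-in for articles.xml (the io.StringIO wrapper and the trimmed document are assumptions made so the snippet runs without a file on disk) and uses getroot() to reach the top-level element.

```python
import io
from xml.etree import ElementTree

# Shortened stand-in for articles.xml, used only for illustration.
xml_text = """<articlelist>
  <article><author country='India'>John Doe</author></article>
  <article><author country='Finland'>Mary Smith</author></article>
</articlelist>"""

# parse() accepts a filename or a file-like object; StringIO plays the file here.
tree = ElementTree.parse(io.StringIO(xml_text))

# getroot() returns the Element at the top of the tree.
root = tree.getroot()
print(root.tag)    # articlelist
print(len(root))   # 2, the number of direct children of the root
```

The root is an ordinary Element, so everything said about nodes later in this article applies to it as well.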
Traversing an ElementTree
The tree.iter() method returns an iterator that yields the nodes of the parsed tree in document order (depth-first), starting from the root. By default, it yields every node in the tree.
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
for node in tree.iter():
    print(node.tag)
articlelist
article
author
datepublished
title
content
article
author
datepublished
title
content
You can pass a tag as an argument to the tree.iter() method so that it only iterates over the elements with that tag.
tree.iter(tag = None)
For example, to get only the nodes with the author tag, we can pass 'author' as the tag argument, as shown below:
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
for node in tree.iter('author'):
    print(node.text)
John Doe
Mary Smith
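To make the iteration order concrete, here is a small sketch using a made-up document (the tags a, b, c and d are purely illustrative). It shows that iter() visits a node, then that node's descendants, before moving on to the next sibling, and that iter() is also available on individual Element objects.

```python
from xml.etree import ElementTree

# Hypothetical nested document: <c> sits inside <b>, and <d> is a sibling of <b>.
root = ElementTree.fromstring("<a><b><c/></b><d/></a>")

# iter() walks the tree depth-first, in document order.
all_tags = [node.tag for node in root.iter()]
print(all_tags)    # ['a', 'b', 'c', 'd']

# Passing a tag restricts the iteration to matching elements at any depth.
only_c = [node.tag for node in root.iter('c')]
print(only_c)      # ['c']
```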
Search for Nodes
Parsed trees provide some useful methods for searching for nodes with certain characteristics. They let you find nodes with a given tag, or nodes that appear at a certain depth of the parse tree.
The two basic methods for searching are find() and findall().
Find single node - tree.find()
The tree.find() method returns the first node that matches the search string. It returns None if there is no matching node.
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
n = tree.find('.//author')
print(n.text)
John Doe
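Because find() returns None when nothing matches, it is usually worth checking the result before using it. The sketch below uses a trimmed in-memory document and a hypothetical editor tag that does not exist in it. The explicit "is None" comparison is deliberate: an Element without children has traditionally tested as false, so a plain truthiness check can be misleading.

```python
from xml.etree import ElementTree

# Trimmed stand-in for articles.xml; it has no <editor> element.
root = ElementTree.fromstring(
    "<articlelist><article><author>John Doe</author></article></articlelist>"
)

node = root.find('.//editor')
if node is None:
    # Accessing node.text here would raise AttributeError.
    print('no editor found')
else:
    print(node.text)
```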
Note how the search string is formatted. The leading . refers to the root of the tree, a single / steps down to direct children, and // matches descendants at any depth. So in the above case, './/author' finds the first author element anywhere below the root. If we are looking for an article element, which is a direct child of the root, we can use a single slash, i.e. './article'.
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
article = tree.find('./article')
# Iterating over an element yields its direct children.
for i in article:
    print(i.tag, ':', i.text)
author : John Doe
datepublished : 2024/04/05
title : Lorem ipsum dolor sit amet consectetur adipisicing elit
content : Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
optio, eaque rerum! Provident similique accusantium nemo autem.
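The difference between the path forms can be seen side by side in the following sketch, which again uses a shortened in-memory version of the articles document (an assumption made to keep the snippet self-contained).

```python
from xml.etree import ElementTree

root = ElementTree.fromstring("""<articlelist>
  <article><author>John Doe</author></article>
  <article><author>Mary Smith</author></article>
</articlelist>""")

# './author' only looks at direct children of the root, so nothing matches.
direct = root.find('./author')
print(direct)                      # None

# Spelling out each level reaches the nested element.
nested = root.find('./article/author')
print(nested.text)                 # John Doe

# './/author' matches author elements at any depth below the root.
anywhere = root.findall('.//author')
print(len(anywhere))               # 2
```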
If you only want the text value, you can use the findtext() method instead of find().
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
print(tree.findtext('./article/title'))
Lorem ipsum dolor sit amet consectetur adipisicing elit
The './article/title' search string looks for a title element that is nested directly inside an article element. This is especially useful if the XML document contains elements with the same tag in different places.
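findtext() also accepts a default argument that is returned when no element matches. The sketch below illustrates this with a small in-memory document; the subtitle tag is hypothetical and intentionally absent.

```python
from xml.etree import ElementTree

root = ElementTree.fromstring(
    "<articlelist><article><title>First title</title></article></articlelist>"
)

found = root.findtext('./article/title')
print(found)                # First title

# Without a default, a missing element gives None.
missing = root.findtext('./article/subtitle')
print(missing)              # None

# With a default, the fallback value is returned instead.
fallback = root.findtext('./article/subtitle', default='(no subtitle)')
print(fallback)             # (no subtitle)
```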
Find all matching elements - tree.findall()
The tree.findall() method returns a list of all nodes that match the given search string.
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
nodes = tree.findall('.//author')
print(nodes)
for n in nodes:
    print(n.text)
[<Element 'author' at 0x0000011425175FD0>, <Element 'author' at 0x0000011425176160>]
John Doe
Mary Smith
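Search strings can also filter on attributes using the limited XPath syntax that ElementTree supports. As a sketch, assuming the same author elements with a country attribute as in articles.xml, a predicate in square brackets narrows the matches:

```python
from xml.etree import ElementTree

root = ElementTree.fromstring("""<articlelist>
  <article><author country='India'>John Doe</author></article>
  <article><author country='Finland'>Mary Smith</author></article>
</articlelist>""")

# [@country='Finland'] keeps only elements whose attribute has that value.
finnish = [a.text for a in root.findall(".//author[@country='Finland']")]
print(finnish)              # ['Mary Smith']

# [@country] keeps elements that have the attribute at all.
with_country = root.findall(".//author[@country]")
print(len(with_country))    # 2
```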
A deeper look at nodes
The Element objects returned by methods like tree.iter(), tree.find(), etc. each represent a single node in the XML parse tree.
Element objects contain some useful attributes and methods for accessing and manipulating the information of the represented XML element. We have already used some of the attributes, such as text and tag.
The attrib dictionary of an Element object stores the attributes of the represented XML element.
from xml.etree import ElementTree
with open('articles.xml') as file:
    tree = ElementTree.parse(file)
# Use an explicit path to the first author element.
n = tree.find('./article/author')
print(n.attrib)
print(n.text)
print(n.attrib.get('country'))
{'country': 'India'}
John Doe
India
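Besides going through attrib, Element objects expose a few shortcut methods for attributes. The sketch below parses a single author element from a string; the email attribute name is hypothetical and only serves to show the fallback value.

```python
from xml.etree import ElementTree

author = ElementTree.fromstring("<author country='India'>John Doe</author>")

# get() reads one attribute and takes an optional default for missing ones.
country = author.get('country')
email = author.get('email', 'unknown')
print(country)    # India
print(email)      # unknown

# keys() and items() expose the attribute names and name/value pairs.
names = sorted(author.keys())
print(names)      # ['country']
for name, value in author.items():
    print(name, '=', value)
```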
You can use the tail attribute to get the text that comes after the closing tag of a given node.
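The tail attribute is easiest to understand with mixed content, where text and elements are interleaved. The following sketch uses a made-up paragraph element (not part of articles.xml) to show how text and tail split the surrounding text:

```python
from xml.etree import ElementTree

# Hypothetical mixed-content element.
p = ElementTree.fromstring("<p>Hello <b>bold</b> world</p>")
b = p.find('b')

# text is what comes before the first child; tail is what follows a closing tag.
print(repr(p.text))   # 'Hello '
print(repr(b.text))   # 'bold'
print(repr(b.tail))   # ' world'
```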
Parsing Strings
If the XML data is in the form of a string, we can parse it using the XML() function, which takes the XML string as an argument, parses it, and creates an Element object representation.
from xml.etree import ElementTree
xml_data = '''
<article>
<author country = 'India'>John Doe</author>
<datepublished>2024/04/05</datepublished>
<title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
<content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
optio, eaque rerum! Provident similique accusantium nemo autem.
</content>
</article>'''
article = ElementTree.XML(xml_data)
print(article.findtext('./author'))
print(article.findtext('./datepublished'))
print(article.findtext('./title'))
John Doe
2024/04/05
Lorem ipsum dolor sit amet consectetur adipisicing elit
Note that unlike parse(), which returns an ElementTree instance, the return value of XML() is an Element object.
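If you need an ElementTree wrapper around an element parsed from a string, one option is to pass the element to the ElementTree class yourself. This is a small sketch with a made-up one-line document; it also uses fromstring(), which the standard library documents as equivalent to XML().

```python
from xml.etree import ElementTree

element = ElementTree.XML("<article><title>Example</title></article>")
print(isinstance(element, ElementTree.Element))     # True

# Wrap the element in an ElementTree to get tree-level methods such as write().
tree = ElementTree.ElementTree(element)
print(tree.getroot() is element)                    # True
print(tree.findtext('./title'))                     # Example

# fromstring() parses a string in the same way as XML().
same = ElementTree.fromstring("<article><title>Example</title></article>")
print(same.findtext('./title'))                     # Example
```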