XML stands for Extensible Markup Language. It is used to represent arbitrary data in plain text, in a way that is both human- and machine-readable.
Because XML data is stored as plain text, it is independent of any particular software or hardware system, so it can be exchanged between different systems.
XML's syntax is much like that of HTML. However, unlike HTML and similar languages, XML is not tied to a single application domain. Its main purpose is the storage and transmission of arbitrary data between systems of any type.
Typically, XML data is stored in files with the .xml extension.
Consider the following XML document, which we can name messages.xml:
<?xml version="1.0" encoding="UTF-8"?>
<messages>
    <message>
        <from>John</from>
        <to>Mary</to>
        <date>2024/04/05</date>
        <time>11:00pm</time>
        <subject>Requesting a date.</subject>
        <body>I hope you are doing fine, I saw you yesterday and...</body>
    </message>
    <message>
        <from>Mary</from>
        <to>John</to>
        <date>2024/04/09</date>
        <time>12:00pm</time>
        <subject>Lorem ipsum dolor sit amet</subject>
        <body>Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur....</body>
    </message>
</messages>
As the example above shows, the tags used in an XML document are arbitrary and invented by the user.
What the xml module does
The xml package in the standard library provides tools for working with and manipulating XML documents. It contains four submodules, as shown in the following table:
| Submodule | Description |
|---|---|
| xml.etree | Provides the ElementTree API, which is useful when processing XML documents. |
| xml.dom | W3C Document Object Model. Used for creating a hierarchical representation of the elements in an XML document. |
| xml.parsers | Wrappers for XML parsers. |
| xml.sax | Provides support for the SAX 2 API. |
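To give a feel for how the event-based SAX style in the table differs from the tree-based styles, here is a minimal sketch. The handler class SenderHandler and the inline sample document are assumptions made for illustration; only xml.sax.ContentHandler and xml.sax.parseString() come from the standard library.

```python
import xml.sax

# A trimmed, inline sample document (an assumption for this sketch).
XML_TEXT = (
    b"<messages>"
    b"<message><from>John</from></message>"
    b"<message><from>Mary</from></message>"
    b"</messages>"
)


class SenderHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of every <from> element."""

    def __init__(self):
        super().__init__()
        self.senders = []
        self._in_from = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called once for each opening tag.
        if name == "from":
            self._in_from = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered.
        if self._in_from:
            self._buffer.append(content)

    def endElement(self, name):
        # Called once for each closing tag.
        if name == "from":
            self.senders.append("".join(self._buffer))
            self._in_from = False


handler = SenderHandler()
xml.sax.parseString(XML_TEXT, handler)
print(handler.senders)
```

The parser never builds the whole document in memory here; it only reports events as it reads, which is the main reason to consider SAX for very large files.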
xml.etree submodule
The xml.etree submodule provides the ElementTree module, whose ElementTree class is useful for parsing XML data. An ElementTree object represents the entire document as a tree in which the hierarchy of the nodes follows the nesting of the elements in the XML document.
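The tree structure can be made visible with a short sketch. It assumes a trimmed copy of the messages document inlined as a string, and describe() is a hypothetical helper written for this illustration; ElementTree.fromstring() is the standard-library call that parses a string and returns the root element.

```python
from xml.etree import ElementTree

# A trimmed copy of the messages document, inlined so the sketch is self-contained.
XML_TEXT = """<messages>
  <message>
    <from>John</from>
    <to>Mary</to>
  </message>
  <message>
    <from>Mary</from>
    <to>John</to>
  </message>
</messages>"""


def describe(element, depth=0):
    """Return one indented line per element, mirroring the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


root = ElementTree.fromstring(XML_TEXT)
outline = describe(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the XML text.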
The ElementTree.parse() helper function parses an XML file and returns an ElementTree for it. Consider the following example:
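Parse an XML document
from xml.etree import ElementTree

# parse() accepts a filename or an open file object
tree = ElementTree.parse('messages.xml')

# visit every <message> element in the document
for node in tree.iter('message'):
    # iterating an element yields its direct children
    for element in node:
        print(element.tag, ':', element.text)
    print()  # blank line between messages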
from : John
to : Mary
date : 2024/04/05
time : 11:00pm
subject : Requesting a date.
body : I hope you are doing fine, I saw you yesterday and...
from : Mary
to : John
date : 2024/04/09
time : 12:00pm
subject : Lorem ipsum dolor sit amet
body : Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur....
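When only some fields are needed, looping over every child is not required. The sketch below, which again assumes a trimmed inline copy of the document, uses the standard findall() and findtext() methods of an element to look up children by tag.

```python
from xml.etree import ElementTree

# A trimmed, inline copy of the messages document (an assumption for this sketch).
XML_TEXT = """<messages>
  <message>
    <from>John</from>
    <to>Mary</to>
    <subject>Requesting a date.</subject>
  </message>
  <message>
    <from>Mary</from>
    <to>John</to>
    <subject>Lorem ipsum dolor sit amet</subject>
  </message>
</messages>"""

root = ElementTree.fromstring(XML_TEXT)

# findall() returns the direct children of root that match the tag.
messages = root.findall("message")

# findtext() returns the text of the first matching child, or the default.
subjects = [m.findtext("subject", default="") for m in messages]

# Filter messages by the text of one of their children.
from_mary = [m for m in messages if m.findtext("from") == "Mary"]

print(subjects)
print(len(from_mary))
```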
xml.dom
The Document Object Model (DOM) is a programming interface for processing XML and similar documents such as HTML.
The xml.dom submodule provides the DOM interface for processing XML documents in Python.
We can use the minidom.parse() function to parse an XML file into a DOM document object.
from xml.dom import minidom

dom = minidom.parse('messages.xml')
messages = dom.getElementsByTagName('message')

# retrieves the text from a node by joining its text children
def getNodeText(node):
    result = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            result.append(child.data)
    return ''.join(result)

for message in messages:
    print('from:', getNodeText(message.getElementsByTagName('from')[0]))
    print('to:', getNodeText(message.getElementsByTagName('to')[0]))
    print('subject:', getNodeText(message.getElementsByTagName('subject')[0]))
    print('body:', getNodeText(message.getElementsByTagName('body')[0]))
    print()  # blank line between messages
from: John
to: Mary
subject: Requesting a date.
body: I hope you are doing fine, I saw you yesterday and...
from: Mary
to: John
subject: Lorem ipsum dolor sit amet
body: Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur....
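The same approach also works on XML held in a string rather than a file. The following is a minimal sketch using minidom.parseString(); the inline document and the first_text() helper are assumptions made for illustration, and the helper returns an empty string when a tag is absent instead of raising an IndexError.

```python
from xml.dom import minidom

# A trimmed, inline sample document (an assumption for this sketch).
XML_TEXT = "<messages><message><from>John</from><to>Mary</to></message></messages>"

dom = minidom.parseString(XML_TEXT)


def first_text(parent, tag):
    """Hypothetical helper: text of the first <tag> descendant, or ''."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        return ""
    return "".join(
        child.data
        for child in matches[0].childNodes
        if child.nodeType == child.TEXT_NODE
    )


message = dom.getElementsByTagName("message")[0]
sender = first_text(message, "from")
recipient = first_text(message, "to")
missing = first_text(message, "subject")  # no <subject> in this sample

print(sender, recipient, repr(missing))
```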