XML Technologies Tutorials

# DTD

DTD - Quick Guide - Tutorialspoint

address.xml

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>

<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>

<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>

address.xml & address.dtd

<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
<!DOCTYPE address SYSTEM "address.dtd">

<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>

address.dtd

<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>

# XSD

XSD - Quick Guide - Tutorialspoint

students.xml

<?xml version = "1.0"?>

<!-- <class xmlns = "http://www.tutorialspoint.com"
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation = "http://www.tutorialspoint.com student.xsd"> -->

<class xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="students.xsd">
<student rollno = "393">
<firstname>Dinkar</firstname>
<lastname>Kad</lastname>
<nickname>Dinkar</nickname>
<marks>85</marks>
</student>

<student rollno = "493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>Vinni</nickname>
<marks>95</marks>
</student>

<student rollno = "593">
<firstname>Jasvir</firstname>
<lastname>Singh</lastname>
<nickname>Jazz</nickname>
<marks>90</marks>
</student>
</class>

students.xsd

<?xml version = "1.0"?>

<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
<xs:element name = 'class'>
<xs:complexType>
<xs:sequence>
<xs:element name = 'student' type = 'StudentType' minOccurs = '0'
maxOccurs = 'unbounded' />
</xs:sequence>
</xs:complexType>
</xs:element>

<xs:complexType name = "StudentType">
<xs:sequence>
<xs:element name = "firstname" type = "xs:string"/>
<xs:element name = "lastname" type = "xs:string"/>
<xs:element name = "nickname" type = "xs:string"/>
<xs:element name = "marks" type = "xs:positiveInteger"/>
</xs:sequence>
<xs:attribute name = 'rollno' type = 'xs:positiveInteger'/>
</xs:complexType>
</xs:schema>

# XSD Example

repost: XML Schema Example

This chapter will demonstrate how to write an XML Schema. You will also learn that a schema can be written in different ways.

# An XML Document

Let’s have a look at this XML document called “shiporder.xml”:

<?xml version="1.0" encoding="UTF-8"?>

<shiporder orderid="889923"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="shiporder.xsd">
<orderperson>John Smith</orderperson>
<shipto>
<name>Ola Nordmann</name>
<address>Langgt 23</address>
<city>4000 Stavanger</city>
<country>Norway</country>
</shipto>
<item>
<title>Empire Burlesque</title>
<note>Special Edition</note>
<quantity>1</quantity>
<price>10.90</price>
</item>
<item>
<title>Hide your heart</title>
<quantity>1</quantity>
<price>9.90</price>
</item>
</shiporder>

The XML document above consists of a root element, “shiporder”, that contains a required attribute called “orderid”. The “shiporder” element contains three different child elements: “orderperson”, “shipto” and “item”. The “item” element appears twice, and it contains a “title”, an optional “note” element, a “quantity”, and a “price” element.
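
As a rough check of the structure just described, the sketch below parses a trimmed copy of the document with Python's standard `xml.etree.ElementTree` and counts the children. The inlined string and the variable names are illustrative assumptions; the `xsi` attributes are left out because they do not change the element structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of shiporder.xml (schema-location attributes omitted).
SHIPORDER_XML = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>"""

root = ET.fromstring(SHIPORDER_XML)

# Count how often each direct child element of <shiporder> occurs.
child_counts = Counter(child.tag for child in root)

# One flag per <item>: does it carry the optional <note>?
items_with_note = [item.find("note") is not None for item in root.findall("item")]

print(root.tag, root.get("orderid"))
print(dict(child_counts))
print(items_with_note)
```

This only inspects one document; it does not validate it against a schema.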

The attribute xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" declares the XML Schema instance namespace, which tells a schema-aware parser that this document should be validated against a schema. The attribute xsi:noNamespaceSchemaLocation="shiporder.xsd" specifies where the schema resides (here it is in the same folder as “shiporder.xml”).

# Create an XML Schema

Now we want to create a schema for the XML document above.

We start by opening a new file that we will call “shiporder.xsd”. To create the schema we could simply follow the structure in the XML document and define each element as we find it. We will start with the standard XML declaration followed by the xs:schema element that defines a schema:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
...
</xs:schema>

In the schema above we use the standard namespace (xs), and the URI associated with this namespace is the Schema language definition, which has the standard value of http://www.w3.org/2001/XMLSchema.

Next, we have to define the “shiporder” element. This element has an attribute and it contains other elements, therefore we consider it a complex type. The child elements of the “shiporder” element are surrounded by an xs:sequence element that defines an ordered sequence of sub-elements:

<xs:element name="shiporder">
<xs:complexType>
<xs:sequence>
...
</xs:sequence>
</xs:complexType>
</xs:element>

Then we have to define the “orderperson” element as a simple type (because it does not contain any attributes or other elements). The type (xs:string) is prefixed with the namespace prefix associated with XML Schema that indicates a predefined schema data type:

<xs:element name="orderperson" type="xs:string"/>

Next, we have to define two elements that are of the complex type: “shipto” and “item”. We start by defining the “shipto” element:

<xs:element name="shipto">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>

With schemas we can define the number of possible occurrences for an element with the maxOccurs and minOccurs attributes. maxOccurs specifies the maximum number of occurrences for an element and minOccurs specifies the minimum number of occurrences for an element. The default value for both maxOccurs and minOccurs is 1!
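
To make the default of 1 concrete, here is a minimal sketch that reads the occurrence attributes from a small schema fragment with Python's standard `xml.etree.ElementTree`. The fragment uses named types for brevity, and `occurrence_bounds` is a hypothetical helper; it only reports the attribute values and is not a validator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Small schema fragment, written for this illustration only.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="itemtype">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="note" type="xs:string" minOccurs="0"/>
      <xs:element name="quantity" type="xs:positiveInteger"/>
      <xs:element name="price" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="shipordertype">
    <xs:sequence>
      <xs:element name="item" type="itemtype" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def occurrence_bounds(schema_text):
    """Map each element name to (minOccurs, maxOccurs), defaulting both to "1"."""
    root = ET.fromstring(schema_text)
    bounds = {}
    for element in root.iter(f"{{{XS}}}element"):
        bounds[element.get("name")] = (
            element.get("minOccurs", "1"),
            element.get("maxOccurs", "1"),
        )
    return bounds


bounds = occurrence_bounds(SCHEMA)
for name, (low, high) in bounds.items():
    print(f"{name}: minOccurs={low}, maxOccurs={high}")
```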

Now we can define the “item” element. This element can appear multiple times inside a “shiporder” element. This is specified by setting the maxOccurs attribute of the “item” element to “unbounded” which means that there can be as many occurrences of the “item” element as the author wishes. Notice that the “note” element is optional. We have specified this by setting the minOccurs attribute to zero:

<xs:element name="item" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="note" type="xs:string" minOccurs="0"/>
<xs:element name="quantity" type="xs:positiveInteger"/>
<xs:element name="price" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>

We can now declare the attribute of the “shiporder” element. Since this is a required attribute we specify use="required".

Note: The attribute declarations must always come last:

<xs:attribute name="orderid" type="xs:string" use="required"/>

Here is the complete listing of the schema file called “shiporder.xsd”:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="shiporder">
<xs:complexType>
<xs:sequence>
<xs:element name="orderperson" type="xs:string"/>
<xs:element name="shipto">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="item" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="note" type="xs:string" minOccurs="0"/>
<xs:element name="quantity" type="xs:positiveInteger"/>
<xs:element name="price" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="orderid" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>

</xs:schema>

# Divide the Schema

The previous design method is very simple, but can be difficult to read and maintain when documents are complex.

The next design method is based on defining all elements and attributes first, and then referring to them using the ref attribute.

Here is the new design of the schema file (“shiporder.xsd”):

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<!-- definition of simple elements -->
<xs:element name="orderperson" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="note" type="xs:string"/>
<xs:element name="quantity" type="xs:positiveInteger"/>
<xs:element name="price" type="xs:decimal"/>

<!-- definition of attributes -->
<xs:attribute name="orderid" type="xs:string"/>

<!-- definition of complex elements -->
<xs:element name="shipto">
<xs:complexType>
<xs:sequence>
<xs:element ref="name"/>
<xs:element ref="address"/>
<xs:element ref="city"/>
<xs:element ref="country"/>
</xs:sequence>
</xs:complexType>
</xs:element>

<xs:element name="item">
<xs:complexType>
<xs:sequence>
<xs:element ref="title"/>
<xs:element ref="note" minOccurs="0"/>
<xs:element ref="quantity"/>
<xs:element ref="price"/>
</xs:sequence>
</xs:complexType>
</xs:element>

<xs:element name="shiporder">
<xs:complexType>
<xs:sequence>
<xs:element ref="orderperson"/>
<xs:element ref="shipto"/>
<xs:element ref="item" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute ref="orderid" use="required"/>
</xs:complexType>
</xs:element>

</xs:schema>

# Using Named Types

The third design method defines classes or types, which enables us to reuse element definitions. This is done by naming the simpleType and complexType elements, and then pointing to them through the type attribute of the element.

Here is the third design of the schema file (“shiporder.xsd”):

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:simpleType name="stringtype">
<xs:restriction base="xs:string"/>
</xs:simpleType>

<xs:simpleType name="inttype">
<xs:restriction base="xs:positiveInteger"/>
</xs:simpleType>

<xs:simpleType name="dectype">
<xs:restriction base="xs:decimal"/>
</xs:simpleType>

<xs:simpleType name="orderidtype">
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{6}"/>
</xs:restriction>
</xs:simpleType>

<xs:complexType name="shiptotype">
<xs:sequence>
<xs:element name="name" type="stringtype"/>
<xs:element name="address" type="stringtype"/>
<xs:element name="city" type="stringtype"/>
<xs:element name="country" type="stringtype"/>
</xs:sequence>
</xs:complexType>

<xs:complexType name="itemtype">
<xs:sequence>
<xs:element name="title" type="stringtype"/>
<xs:element name="note" type="stringtype" minOccurs="0"/>
<xs:element name="quantity" type="inttype"/>
<xs:element name="price" type="dectype"/>
</xs:sequence>
</xs:complexType>

<xs:complexType name="shipordertype">
<xs:sequence>
<xs:element name="orderperson" type="stringtype"/>
<xs:element name="shipto" type="shiptotype"/>
<xs:element name="item" maxOccurs="unbounded" type="itemtype"/>
</xs:sequence>
<xs:attribute name="orderid" type="orderidtype" use="required"/>
</xs:complexType>

<xs:element name="shiporder" type="shipordertype"/>

</xs:schema>

The restriction element indicates that the datatype is derived from a W3C XML Schema namespace datatype. So, the following fragment means that the value of the element or attribute must be a string value:

<xs:restriction base="xs:string">

The restriction element is more often used to apply restrictions to elements. Look at the following lines from the schema above:

<xs:simpleType name="orderidtype">
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{6}"/>
</xs:restriction>
</xs:simpleType>

This indicates that the value of the element or attribute must be a string of exactly six characters, each of which must be a digit from 0 to 9.
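
The same constraint can be tried out in Python. This is a minimal sketch, assuming Python's `re` module as a stand-in for the schema regex engine (the two dialects differ in general, but agree on this simple pattern); `is_valid_order_id` is a hypothetical helper name.

```python
import re

# XSD patterns are implicitly anchored to the whole value, so fullmatch()
# is the closest counterpart in Python's re module.
ORDER_ID = re.compile(r"[0-9]{6}")


def is_valid_order_id(value):
    """Return True when value consists of exactly six digits 0-9."""
    return ORDER_ID.fullmatch(value) is not None


for candidate in ["889923", "88992", "8899234", "88a923"]:
    print(candidate, is_valid_order_id(candidate))
```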

# DOM

XML DOM - Quick Guide - Tutorialspoint

node.xml

<Company>
<Employee category = "Technical" id = "firstelement">
<FirstName>Tanmay</FirstName>
<LastName>Patil</LastName>
<ContactNo>1234567890</ContactNo>
<Email>tanmaypatil@xyz.com</Email>
</Employee>

<Employee category = "Non-Technical">
<FirstName>Taniya</FirstName>
<LastName>Mishra</LastName>
<ContactNo>1234667898</ContactNo>
<Email>taniyamishra@xyz.com</Email>
</Employee>

<Employee category = "Management">
<FirstName>Tanisha</FirstName>
<LastName>Sharma</LastName>
<ContactNo>1234562350</ContactNo>
<Email>tanishasharma@xyz.com</Email>
</Employee>
</Company>

index.html

<!DOCTYPE html>
<html>
<body>
<div>
<b>FirstName:</b> <span id="FirstName"></span><br />
<b>LastName:</b> <span id="LastName"></span><br />
<b>ContactNo:</b> <span id="ContactNo"></span><br />
<b>Email:</b> <span id="Email"></span>
</div>
<script>
let xmlhttp;

//if browser supports XMLHttpRequest
if (window.XMLHttpRequest) {
// Create an instance of XMLHttpRequest object. code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp = new XMLHttpRequest();
} else {
// code for IE6, IE5
xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
}

// sets and sends the request for calling "node.xml"
xmlhttp.open("GET", "/dom/node.xml", false);
xmlhttp.send();

// sets and returns the content as XML DOM
const xmlDoc = xmlhttp.responseXML;

//parsing the DOM object
document.getElementById("FirstName").innerHTML =
xmlDoc.getElementsByTagName("FirstName")[0].childNodes[0].nodeValue;
document.getElementById("LastName").innerHTML =
xmlDoc.getElementsByTagName("LastName")[0].childNodes[0].nodeValue;
document.getElementById("ContactNo").innerHTML =
xmlDoc.getElementsByTagName("ContactNo")[0].childNodes[0].nodeValue;
document.getElementById("Email").innerHTML =
xmlDoc.getElementsByTagName("Email")[0].childNodes[0].nodeValue;
</script>
</body>
</html>
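
The DOM calls used in the page (getElementsByTagName, childNodes, nodeValue) are also available outside the browser. The sketch below repeats the lookup with Python's standard `xml.dom.minidom` on a shortened, inlined copy of node.xml; `first_text` is a hypothetical helper, and no HTTP request is made.

```python
from xml.dom.minidom import parseString

# Shortened copy of node.xml with only the first employee.
NODE_XML = """<Company>
  <Employee category="Technical" id="firstelement">
    <FirstName>Tanmay</FirstName>
    <LastName>Patil</LastName>
    <ContactNo>1234567890</ContactNo>
    <Email>tanmaypatil@xyz.com</Email>
  </Employee>
</Company>"""

xml_doc = parseString(NODE_XML)


def first_text(tag_name):
    """Text of the first element with this tag, using the same DOM calls as the page."""
    return xml_doc.getElementsByTagName(tag_name)[0].childNodes[0].nodeValue


fields = {
    name: first_text(name)
    for name in ("FirstName", "LastName", "ContactNo", "Email")
}
print(fields)
```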

# XPath

XPath - Quick Guide - Tutorialspoint

students.xml

<?xml version = "1.0"?>
<?xml-stylesheet type = "text/xsl" href = "students.xsl"?>
<class>
<student rollno = "393">
<firstname>Dinkar</firstname>
<lastname>Kad</lastname>
<nickname>Dinkar</nickname>
<marks>85</marks>
</student>
<student rollno = "493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>Vinni</nickname>
<marks>95</marks>
</student>
<student rollno = "593">
<firstname>Jasvir</firstname>
<lastname>Singh</lastname>
<nickname>Jazz</nickname>
<marks>90</marks>
</student>
</class>

students.xsl

<?xml version = "1.0" encoding = "UTF-8"?>
<xsl:stylesheet version = "1.0"
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">

<xsl:template match = "/">
<html>
<body>
<h2>Students</h2>
<table border = "1">
<tr bgcolor = "#9acd32">
<th>Roll No</th>
<th>First Name</th>
<th>Last Name</th>
<th>Nick Name</th>
<th>Marks</th>
</tr>
<xsl:for-each select = "class/student">
<tr>
<td><xsl:value-of select = "@rollno"/></td>
<td><xsl:value-of select = "firstname"/></td>
<td><xsl:value-of select = "lastname"/></td>
<td><xsl:value-of select = "nickname"/></td>
<td><xsl:value-of select = "marks"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

# XPath Cheatsheet

from: Xpath cheatsheet


Comments

Prefix Relative

//section[.//h1[@id='hi']] The . before //h1 makes the inner path relative to the current section. Without it, the h1 match starts from the document root, not from the current element.
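
A minimal sketch of the relative search in Python, using the standard `xml.etree.ElementTree`. ElementTree supports only a subset of XPath and cannot evaluate the nested predicate directly, so the outer filter is written as a Python condition; the sample markup is an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
  <section id="a"><div><h1 id="hi">Hello</h1></div></section>
  <section id="b"><h1 id="other">Other</h1></section>
</body>"""

root = ET.fromstring(PAGE)

# The leading "." makes the search start at each <section>, so only
# sections that themselves contain the heading are kept.
matching = [
    section.get("id")
    for section in root.findall(".//section")
    if section.find(".//h1[@id='hi']") is not None
]
print(matching)  # ['a']
```

In a full XPath 1.0 engine, dropping the dot would make the predicate true for every section here, because the document as a whole contains such an h1.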

String Function ends-with

//a[ends-with(@href, '.pdf')] The ends-with function is part of XPath 2.0, but browsers generally support only XPath 1.0, so you have to implement it yourself with a combination of string-length, substring, and an equality test.

a[substring(@href, string-length(@href) - string-length('.pdf') +1) = '.pdf']

or

a[@href[substring(., string-length(.) - string-length('.pdf') +1) = '.pdf']]
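
The index arithmetic is easier to follow outside XPath. This sketch mirrors it in Python; `xpath_substring` and `ends_with` are hypothetical helpers that imitate the two-argument XPath 1.0 substring for integer positions only.

```python
def xpath_substring(text, start):
    """Mimic XPath 1.0 substring(text, start); positions are 1-based."""
    # A start of 0 or less still returns the string from its first character.
    return text[max(start - 1, 0):]


def ends_with(text, suffix):
    """Same computation as the XPath 1.0 workaround for ends-with."""
    start = len(text) - len(suffix) + 1
    return xpath_substring(text, start) == suffix


for href in ["report.pdf", "report.pdf.html", "pdf"]:
    print(href, ends_with(href, ".pdf"))
```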

String Function normalize-space

The normalize-space function strips leading and trailing white-space from a string, replaces sequences of whitespace characters by a single space, and returns the resulting string.
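
A small Python sketch of the same behavior, assuming only the four XML whitespace characters (space, tab, carriage return, line feed) count as whitespace; `normalize_space` is a hypothetical helper, not a library function.

```python
import re

# XPath treats only space, tab, carriage return and line feed as whitespace.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse whitespace runs to one space and trim both ends."""
    return _XML_WHITESPACE.sub(" ", text).strip(" ")


print(repr(normalize_space("  Learn \t\n  Java  ")))  # 'Learn Java'
```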

# XQuery

XQuery - Quick Guide - Tutorialspoint

books.xml

<?xml version="1.0" encoding="UTF-8"?>
<books>

<book category="JAVA">
<title lang="en">Learn Java in 24 Hours</title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="DOTNET">
<title lang="en">Learn .Net in 24 hours</title>
<author>Peter</author>
<year>2011</year>
<price>40.50</price>
</book>

<book category="XML">
<title lang="en">Learn XQuery in 24 hours</title>
<author>Robert</author>
<author>Peter</author>
<year>2013</year>
<price>50.00</price>
</book>

<book category="XML">
<title lang="en">Learn XPath in 24 hours</title>
<author>Jay Ban</author>
<year>2010</year>
<price>16.50</price>
</book>

</books>

books.xqy

for $x in doc("books.xml")/books/book
where $x/price>30
return $x/title

(: result :)
(: <title lang="en">Learn .Net in 24 hours</title> :)
(: <title lang="en">Learn XQuery in 24 hours</title> :)
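
For comparison, the for / where / return steps of the query can be sketched in Python with the standard `xml.etree.ElementTree`. The inlined data is a shortened copy of books.xml (title and price only), and the sketch returns title text rather than title elements.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
  <book category="JAVA">
    <title lang="en">Learn Java in 24 Hours</title>
    <price>30.00</price>
  </book>
  <book category="DOTNET">
    <title lang="en">Learn .Net in 24 hours</title>
    <price>40.50</price>
  </book>
  <book category="XML">
    <title lang="en">Learn XQuery in 24 hours</title>
    <price>50.00</price>
  </book>
  <book category="XML">
    <title lang="en">Learn XPath in 24 hours</title>
    <price>16.50</price>
  </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

titles = [
    book.findtext("title")                  # return $x/title (text only)
    for book in root.findall("book")        # for $x in .../books/book
    if float(book.findtext("price")) > 30   # where $x/price > 30
]
print(titles)
```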

# XSLT

XSLT - Quick Guide - Tutorialspoint

students.xml

<?xml version = "1.0"?>
<?xml-stylesheet type = "text/xsl" href = "students.xsl"?>
<class>
<student rollno = "393">
<firstname>Dinkar</firstname>
<lastname>Kad</lastname>
<nickname>Dinkar</nickname>
<marks>85</marks>
</student>
<student rollno = "493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>Vinni</nickname>
<marks>95</marks>
</student>
<student rollno = "593">
<firstname>Jasvir</firstname>
<lastname>Singh</lastname>
<nickname>Jazz</nickname>
<marks>90</marks>
</student>
</class>

students.xsl

<?xml version = "1.0" encoding = "UTF-8"?>
<xsl:stylesheet version = "1.0"
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">

<xsl:template match = "/">
<html>
<body>
<h2>Students</h2>
<table border = "1">
<tr bgcolor = "#9acd32">
<th>Roll No</th>
<th>First Name</th>
<th>Last Name</th>
<th>Nick Name</th>
<th>Marks</th>
</tr>
<xsl:for-each select = "class/student">
<tr>
<td><xsl:value-of select = "@rollno"/></td>
<td><xsl:value-of select = "firstname"/></td>
<td><xsl:value-of select = "lastname"/></td>
<td><xsl:value-of select = "nickname"/></td>
<td><xsl:value-of select = "marks"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>

</xsl:stylesheet>

# XML Tool

BaseX | The XML Framework: Lightweight and High-Performance Data Processing

XML - Visual Studio Code Plugin

XML Tools - Visual Studio Code Plugin

# XML, HTML and Excel

XML and HTML files can be opened in Excel, and CSS styles can be used in the HTML.

customers.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>table</title>
<style>
#customers {
font-family: "Trebuchet MS", Arial, Helvetica, sans-serif;
width: 100%;
border-collapse: collapse;
}

#customers td,
#customers th {
font-size: 1em;
border: 1px solid #98bf21;
padding: 3px 7px 2px 7px;
}

#customers th {
font-size: 1.1em;
text-align: left;
padding-top: 5px;
padding-bottom: 4px;
background-color: #a7c942;
color: #ffffff;
}

#customers tr.alt td {
color: #000000;
background-color: #eaf2d3;
}
</style>
</head>
<body>
<table id="customers">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr class="alt">
<td>Berglunds snabbköp</td>
<td>Christina Berglund</td>
<td>Sweden</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr class="alt">
<td>Ernst Handel</td>
<td>Roland Mendel</td>
<td>Austria</td>
</tr>
<tr>
<td>Island Trading</td>
<td>Helen Bennett</td>
<td>UK</td>
</tr>
<tr class="alt">
<td>Königlich Essen</td>
<td>Philip Cramer</td>
<td>Germany</td>
</tr>
<tr>
<td>Laughing Bacchus Winecellars</td>
<td>Yoshi Tannamuri</td>
<td>Canada</td>
</tr>
<tr class="alt">
<td>Magazzini Alimentari Riuniti</td>
<td>Giovanni Rovelli</td>
<td>Italy</td>
</tr>
<tr>
<td>North/South</td>
<td>Simon Crowther</td>
<td>UK</td>
</tr>
<tr class="alt">
<td>Paris spécialités</td>
<td>Marie Bertrand</td>
<td>France</td>
</tr>
</table>
</body>
</html>

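Issues noticed so far:

  • 01 is displayed as 1 in Excel. To keep the leading zero, write it as ="01" in the XML or HTML; this is the same behavior as opening a CSV file in Excel.
  • Element attributes in XML also become columns in Excel.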
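
A minimal sketch of the ="01" workaround when generating such an HTML table from Python. The helper names `excel_text_cell` and `html_table` are hypothetical, and the sketch assumes the wrapped values contain no double quotes themselves.

```python
from html import escape


def excel_text_cell(value):
    """Wrap a value as ="value" so Excel treats it as text, keeping leading zeros."""
    return f'="{value}"'


def html_table(rows):
    """Render rows (lists of strings) as a bare HTML table."""
    lines = ["<table>"]
    for row in rows:
        # quote=False keeps the double quotes of ="01" literal in the markup.
        cells = "".join(f"<td>{escape(cell, quote=False)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [
    ["Code", "Company"],
    [excel_text_cell("01"), "Alfreds Futterkiste"],
]
table = html_table(rows)
print(table)
```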

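Extension: the xls file generated when developing reports with RTF in an ERP system is actually an HTML file with styles. Instead of relying on an RTF Template, you can therefore output HTML directly and change the file extension to xls (Excel opens it directly and supports formula calculation and so on). This approach is flexible: for example, you can write different style files for different customers, or define several style sets and pick one at random for each report run. Also, because an RTF report is essentially an HTML table, it can be parsed with pandas.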

import pandas as pd

tables = pd.read_html('./DN.html')
# read_html returns a list with one DataFrame per table in the HTML
df = tables[0]
print(df.head())
# strip the ="data" wrapper, leaving data
df.replace(regex={r'^="(.*?)"$': r'\1'}, inplace=True)
print(df.head())

Extracting an Excel xlsx file with a tool such as 7z shows that its contents are also XML.

# XML Parser

Different parsers may handle the same XML differently. The XML below contains the Chinese word 学习 but declares encoding ASCII; it was opened in both IE and Chrome.

<?xml version="1.0" encoding="ASCII"?>
<books>

<book category="JAVA">
<title lang="en">Learn Java in 24 Hours</title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="DOTNET">
<title lang="en">Learn .Net in 24 hours</title>
<author>Peter</author>
<year>2011</year>
<price>40.50</price>
</book>

<book category="XML">
<title lang="en">学习 XQuery</title>
<author>Robert</author>
<author>Peter</author>
<year>2013</year>
<price>50.00</price>
</book>

</books>

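The result: IE stops parsing when it finds a non-ASCII character, while Chrome attempts to parse but still displays the Chinese text incorrectly, because the XML declares encoding="ASCII" and ASCII does not include Chinese characters. If the XML contains Chinese, use encoding="UTF-8".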
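
The mismatch between the declared encoding and the actual bytes can also be reproduced outside a browser. This is a minimal sketch with Python's standard `xml.etree.ElementTree`, which is stricter than the browsers and rejects the document outright; `try_parse` is a hypothetical helper and the one-element document is an assumption for brevity.

```python
import xml.etree.ElementTree as ET

BODY = "<title>学习 XQuery</title>"


def try_parse(declared_encoding):
    """Parse UTF-8 bytes whose XML declaration names declared_encoding."""
    document = f'<?xml version="1.0" encoding="{declared_encoding}"?>{BODY}'
    try:
        return ET.fromstring(document.encode("utf-8")).text
    except ET.ParseError as error:
        return f"ParseError: {error}"


ascii_result = try_parse("ASCII")
utf8_result = try_parse("UTF-8")
print(ascii_result)
print(utf8_result)
```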

# XML Preserve Space

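In the XML below, the lang attribute value of title has 4 spaces in the middle, and the element content has 4 spaces before, in the middle, and after. Chrome nevertheless displays each run of consecutive spaces as a single space.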

<?xml version="1.0" encoding="ASCII"?>
<books>
<book category="JAVA">
<title lang="e    n">    Learn    Java    </title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>
</books>

Setting xml:space="preserve" asks for whitespace to be preserved in an element and its child elements, but it has no effect in the browser.
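
The collapsing happens when the text is rendered, not when it is parsed. As a minimal sketch, Python's standard `xml.etree.ElementTree` hands the spaces to the application unchanged even without xml:space; the one-line document is a shortened assumption, and the example says nothing about how any browser lays the text out.

```python
import xml.etree.ElementTree as ET

DOC = '<book><title lang="e    n">    Learn    Java    </title></book>'

title = ET.fromstring(DOC).find("title")

# repr() makes the surviving spaces visible.
print(repr(title.get("lang")))  # 'e    n'
print(repr(title.text))         # '    Learn    Java    '
```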

<?xml version="1.0" encoding="ASCII"?>
<books xml:space="preserve">
<book category="JAVA">
<title lang="e    n">    Learn    Java    </title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>
</books>

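In HTML, content with multiple consecutive spaces is likewise displayed with a single space. Use the pre tag, or set { white-space: pre; } with CSS. Alternatively, use the HTML entity &nbsp; for each space, for example &nbsp;&nbsp;&nbsp;&nbsp; for 4 spaces.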

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Books Table</title>
<style>
body {
display: flex;
justify-content: center;
}

table {
width: 80%;
border-collapse: collapse;
}

table td,
table th {
font-size: 1em;
border: 1px solid #98bf21;
padding: 3px 7px 2px 7px;
}

table th {
font-size: 1.3em;
text-align: left;
padding: 5px auto;
background-color: #a7c942;
color: #ffffff;
}
</style>
</head>
<body>
<table>
<thead>
<tr>
<th>title</th>
<th>author</th>
<th>year</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<td>    Learn    Java    </td>
<td>Robert</td>
<td>2005</td>
<td>30.00</td>
</tr>
</tbody>
</table>
</body>
</html>

Use pre tag

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Books Table</title>
<style>
body {
display: flex;
justify-content: center;
}

table {
width: 80%;
border-collapse: collapse;
}

table td,
table th {
font-size: 1em;
border: 1px solid #98bf21;
padding: 3px 7px 2px 7px;
}

table th {
font-size: 1.3em;
text-align: left;
padding: 5px auto;
background-color: #a7c942;
color: #ffffff;
}
</style>
</head>
<body>
<table>
<thead>
<tr>
<th>title</th>
<th>author</th>
<th>year</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<td><pre>    Learn    Java    </pre></td>
<td>Robert</td>
<td>2005</td>
<td>30.00</td>
</tr>
</tbody>
</table>
</body>
</html>

Use CSS { white-space: pre; }

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Books Table</title>
<style>
body {
display: flex;
justify-content: center;
}

table {
width: 80%;
border-collapse: collapse;
}

table td,
table th {
font-size: 1em;
border: 1px solid #98bf21;
padding: 3px 7px 2px 7px;
}

table th {
font-size: 1.3em;
text-align: left;
padding: 5px auto;
background-color: #a7c942;
color: #ffffff;
}

tr td:first-child {
white-space: pre;
}
</style>
</head>
<body>
<table>
<thead>
<tr>
<th>title</th>
<th>author</th>
<th>year</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<td>    Learn    Java    </td>
<td>Robert</td>
<td>2005</td>
<td>30.00</td>
</tr>
</tbody>
</table>
</body>
</html>

# XML Locate Errors

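When an XML document contains an error and the file is too large to format with tools such as VS Code, drag the XML file into a browser. The browser reports the reason for the error, and you can then jump to the reported line in VS Code or Notepad++ with Ctrl + G.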
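
A parser can report the same location programmatically. This is a minimal sketch with Python's standard `xml.etree.ElementTree`, whose `ParseError` carries a `position` tuple of line and column; `locate_error` and the broken sample are hypothetical. For a large file on disk, `ET.parse(path)` raises the same exception.

```python
import xml.etree.ElementTree as ET

# The <book> element on line 2 is never closed, so line 4 is mismatched.
BROKEN = "<books>\n  <book>\n    <title>Learn Java</title>\n</books>\n"


def locate_error(xml_text):
    """Return (line, column, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return line, column, str(error)
    return None


result = locate_error(BROKEN)
print(result)
```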

# XML and Python Pandas DataFrame

pandas.read_xml

pandas.DataFrame.to_xml

# Read XML to DataFrame

# Simple Read

import pandas as pd
from io import StringIO

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data xmlns="http://example.com">
<row>
<shape>square</shape>
<degrees>360</degrees>
<sides>4.0</sides>
</row>
<row>
<shape>circle</shape>
<degrees>360</degrees>
<sides/>
</row>
<row>
<shape>triangle</shape>
<degrees>180</degrees>
<sides>3.0</sides>
</row>
</data>'''

# df = pd.read_xml('./data.xml')  # alternative: read from a file path
df = pd.read_xml(StringIO(xml))  # read from the in-memory string above
print(df)
# shape degrees sides
# 0 square 360 4.0
# 1 circle 360 NaN
# 2 triangle 180 3.0

# Read Using XPath

import pandas as pd
from io import StringIO

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
<row shape="square" degrees="360" sides="4.0"/>
<row shape="circle" degrees="360"/>
<row shape="triangle" degrees="180" sides="3.0"/>
</data>'''

df = pd.read_xml(StringIO(xml), xpath=".//row")
print(df)
# shape degrees sides
# 0 square 360 4.0
# 1 circle 360 NaN
# 2 triangle 180 3.0

# Read Using iterparse

from io import BytesIO, StringIO
import pandas as pd

xml_data = '''
<data>
<record>
<name>Babb</name>
<age>30</age>
<address postcode="123">
<city>New York</city>
<state>NY</state>
</address>
</record>
<record>
<name>John</name>
<age>25</age>
<address postcode="456">
<city>BRENTWOOD</city>
<state>ESSEX</state>
</address>
</record>
</data>
'''


iterparse_config = {
"record": ["name", "age", "city", "state", "postcode"]
}

# Read the XML using iterparse; the default parser is 'lxml'
# With the default parser, passing a StringIO fails:
# df = pd.read_xml(StringIO(xml_data), iterparse=iterparse_config) # TypeError: reading file objects must return bytes objects
df = pd.read_xml(BytesIO(xml_data.encode()), iterparse=iterparse_config)
# from lxml import etree
# for event, element in etree.iterparse(BytesIO(xml_data.encode())):
#     pass

# Use parser 'etree'; here both BytesIO and StringIO work
df = pd.read_xml(BytesIO(xml_data.encode()), iterparse=iterparse_config, parser='etree')
df = pd.read_xml(StringIO(xml_data), iterparse=iterparse_config, parser='etree')

print(df)

# name age postcode city state
# 0 Babb 30 123 New York NY
# 1 John 25 456 BRENTWOOD ESSEX

# Read with Namespace

import pandas as pd
from io import StringIO

xml = '''<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
<doc:row>
<doc:shape>square</doc:shape>
<doc:degrees>360</doc:degrees>
<doc:sides>4.0</doc:sides>
</doc:row>
<doc:row>
<doc:shape>circle</doc:shape>
<doc:degrees>360</doc:degrees>
<doc:sides/>
</doc:row>
<doc:row>
<doc:shape>triangle</doc:shape>
<doc:degrees>180</doc:degrees>
<doc:sides>3.0</doc:sides>
</doc:row>
</doc:data>'''

df = pd.read_xml(StringIO(xml),
xpath="//doc:row",
namespaces={"doc": "https://example.com"})
print(df)
# shape degrees sides
# 0 square 360 4.0
# 1 circle 360 NaN
# 2 triangle 180 3.0

If the document defines only a default xmlns namespace, it cannot be read with a plain xpath.

import pandas as pd
from io import StringIO

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data xmlns="http://example.com">
<row>
<shape>square</shape>
<degrees>360</degrees>
<sides>4.0</sides>
</row>
<row>
<shape>circle</shape>
<degrees>360</degrees>
<sides/>
</row>
<row>
<shape>triangle</shape>
<degrees>180</degrees>
<sides>3.0</sides>
</row>
</data>'''

df = pd.read_xml(StringIO(xml), xpath='.//row') # ValueError: xpath does not return any nodes or attributes. Be sure to specify in `xpath` the parent nodes of children and attributes to parse. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.
print(df)

Parsing XML that defines only a default xmlns namespace with etree.

<?xml version='1.0' encoding='utf-8'?>
<data xmlns="http://example.com">
<record>
<name>Babb</name>
<age>30</age>
<address>
<city>New York</city>
<state>NY</state>
</address>
</record>
<record>
<name>John</name>
<age>25</age>
<address>
<city>New York</city>
<state>NY</state>
</address>
</record>
</data>

Walk the tree recursively (the XML above is saved as data.xml):

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()


def parse_element(element, item):
if len(list(element)) == 0:
item[element.tag] = element.text
else:
for child in list(element):
parse_element(child, item)


data = []

for child in root:
item = {}
parse_element(child, item)
data.append(item)

print(data)
# [{'{http://example.com}name': 'Babb', '{http://example.com}age': '30', '{http://example.com}city': 'New York', '{http://example.com}state': 'NY'}, {'{http://example.com}name': 'John', '{http://example.com}age': '25', '{http://example.com}city': 'New York', '{http://example.com}state': 'NY'}]
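
The namespace URI in the keys can be stripped if plain names are preferred. A minimal sketch, assuming a shortened inline copy of the data and a hypothetical `local_name` helper:

```python
import xml.etree.ElementTree as ET

XML = """<data xmlns="http://example.com">
  <record><name>Babb</name><age>30</age></record>
  <record><name>John</name><age>25</age></record>
</data>"""


def local_name(tag):
    """Drop the '{namespace-uri}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


records = [
    {local_name(child.tag): child.text for child in record}
    for record in ET.fromstring(XML)
]
print(records)
# [{'name': 'Babb', 'age': '30'}, {'name': 'John', 'age': '25'}]
```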

Using XPath:

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

for record in root.findall('.//record'):
for element in record:
print(element.tag, element.text) # No result output

With a prefix mapped to the namespace URI, the same search matches:

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
for element in record:
print(element.tag, element.text)

# {http://example.com}name Babb
# {http://example.com}age 30
# {http://example.com}address

# {http://example.com}name John
# {http://example.com}age 25
# {http://example.com}address
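
A prefix map is not the only way to match namespaced elements. As a sketch of two alternatives that ElementTree also accepts: the URI can be written inline in braces, and on Python 3.8 or newer `{*}` matches a tag in any namespace. The XML is inlined and shortened here so the snippet stands alone.

```python
import xml.etree.ElementTree as ET

xml_text = '''<data xmlns="http://example.com">
  <record><name>Babb</name><age>30</age></record>
  <record><name>John</name><age>25</age></record>
</data>'''

root = ET.fromstring(xml_text)

# The namespace URI is written inline in braces, no prefix map needed.
by_uri = root.findall('.//{http://example.com}record')

# '{*}' matches the tag in any namespace (Python 3.8 or newer).
by_wildcard = root.findall('.//{*}record')

names = [rec.findtext('{*}name') for rec in by_wildcard]
print(len(by_uri), len(by_wildcard), names)
# 2 2 ['Babb', 'John']
```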

Parse XML with ElementTree when a prefixed xmlns namespace is declared but the elements carry no prefix.

<?xml version='1.0' encoding='utf-8'?>
<data xmlns:doc="http://example.com">
<record>
<name>Babb</name>
<age>30</age>
<address>
<city>New York</city>
<state>NY</state>
</address>
</record>
<record>
<name>John</name>
<age>25</age>
<address>
<city>New York</city>
<state>NY</state>
</address>
</record>
</data>

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()


def parse_element(element, item):
    # Leaf elements become key/value pairs; containers are walked recursively.
    if len(list(element)) == 0:
        item[element.tag] = element.text
    else:
        for child in list(element):
            parse_element(child, item)


data = []

for child in root:
    item = {}
    parse_element(child, item)
    data.append(item)

print(data)
# [{'name': 'Babb', 'age': '30', 'city': 'New York', 'state': 'NY'}, {'name': 'John', 'age': '25', 'city': 'New York', 'state': 'NY'}]

Use XPath. The elements carry no prefix, so they are in no namespace and a plain path matches.

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

for record in root.findall('.//record'):
    for element in record:
        print(element.tag, element.text)

# name Babb
# age 30
# address

# name John
# age 25
# address

A prefixed path finds nothing, because the declared doc namespace is never applied to any element:

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
    for element in record:
        print(element.tag, element.text)  # prints nothing: no element matches
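
To see why the prefixed path matches nothing, it can help to list which namespaces a document declares and compare them with the tags that are actually parsed. A minimal sketch using `iterparse` with `start-ns` events, with the XML inlined as bytes (an assumption made so the snippet does not need `data.xml`):

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b'''<data xmlns:doc="http://example.com">
  <record><name>Babb</name></record>
</data>'''

# 'start-ns' events yield a (prefix, uri) pair for each namespace declaration.
declared = [ns for _, ns in ET.iterparse(io.BytesIO(xml_bytes), events=['start-ns'])]

root = ET.fromstring(xml_bytes)

print(declared)      # [('doc', 'http://example.com')]
print(root.tag)      # data
print(root[0].tag)   # record
```

The namespace is declared, but the tags stay unqualified because no element uses the `doc:` prefix.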

Parse XML with ElementTree when a prefixed xmlns namespace is declared and every element carries the prefix.

<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="http://example.com">
<doc:record>
<doc:name>Babb</doc:name>
<doc:age>30</doc:age>
<doc:address>
<doc:city>New York</doc:city>
<doc:state>NY</doc:state>
</doc:address>
</doc:record>
<doc:record>
<doc:name>John</doc:name>
<doc:age>25</doc:age>
<doc:address>
<doc:city>New York</doc:city>
<doc:state>NY</doc:state>
</doc:address>
</doc:record>
</doc:data>

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()


def parse_element(element, item):
    # Leaf elements become key/value pairs; containers are walked recursively.
    if len(list(element)) == 0:
        item[element.tag] = element.text
    else:
        for child in list(element):
            parse_element(child, item)


data = []

for child in root:
    item = {}
    parse_element(child, item)
    data.append(item)

print(data) # [{'{http://example.com}name': 'Babb', '{http://example.com}age': '30', '{http://example.com}city': 'New York', '{http://example.com}state': 'NY'}, {'{http://example.com}name': 'John', '{http://example.com}age': '25', '{http://example.com}city': 'New York', '{http://example.com}state': 'NY'}]

Use XPath. The prefix in the path is resolved through the namespace map, so it only has to map to the same URI as in the document.

import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
    for element in record:
        print(element.tag, element.text)

# {http://example.com}name Babb
# {http://example.com}age 30
# {http://example.com}address

# {http://example.com}name John
# {http://example.com}age 25
# {http://example.com}address
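
The `address` lines in the output show no value because `address` only contains child elements. To reach the nested values, each step of the path needs the prefix. A sketch with `findtext`, using an inlined single-record document and illustrative dictionary keys:

```python
import xml.etree.ElementTree as ET

xml_text = '''<doc:data xmlns:doc="http://example.com">
  <doc:record>
    <doc:name>Babb</doc:name>
    <doc:address>
      <doc:city>New York</doc:city>
      <doc:state>NY</doc:state>
    </doc:address>
  </doc:record>
</doc:data>'''

ns = {'doc': 'http://example.com'}
root = ET.fromstring(xml_text)

rows = []
for record in root.findall('doc:record', ns):
    rows.append({
        'name': record.findtext('doc:name', namespaces=ns),
        # Every step of a nested path carries the prefix.
        'city': record.findtext('doc:address/doc:city', namespaces=ns),
        'state': record.findtext('doc:address/doc:state', namespaces=ns),
    })

print(rows)
# [{'name': 'Babb', 'city': 'New York', 'state': 'NY'}]
```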

# Parse Null and Date

import pandas as pd
from io import StringIO

xml_data = '''
<data>
<row>
<index>0</index>
<a>1</a>
<b>2.5</b>
<c>True</c>
<d>a</d>
<e>2019-12-31 00:00:00</e>
</row>
<row>
<index>1</index>
<b>4.5</b>
<c>False</c>
<d>b</d>
<e>2019-12-31 00:00:00</e>
</row>
</data>
'''

df = pd.read_xml(StringIO(xml_data),
                 dtype_backend="numpy_nullable",
                 parse_dates=["e"])
print(df)
#    index     a    b      c  d          e
# 0      0     1  2.5   True  a 2019-12-31
# 1      1  <NA>  4.5  False  b 2019-12-31
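
For comparison, the same two ideas (a missing element becomes a null, a timestamp string becomes a date object) can be sketched with the standard library alone. This is not how pandas implements it, only an illustration; the document is shortened to the `index`, `a` and `e` columns.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

xml_text = '''<data>
  <row><index>0</index><a>1</a><e>2019-12-31 00:00:00</e></row>
  <row><index>1</index><e>2019-12-31 00:00:00</e></row>
</data>'''

rows = []
for row in ET.fromstring(xml_text).findall('row'):
    a_text = row.findtext('a')  # None when <a> is absent
    rows.append({
        'index': int(row.findtext('index')),
        'a': int(a_text) if a_text is not None else None,
        'e': datetime.strptime(row.findtext('e'), '%Y-%m-%d %H:%M:%S'),
    })

print(rows[1]['a'], rows[0]['e'].date())
# None 2019-12-31
```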

# DataFrame to XML

# Simple to_xml

import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

print(df.to_xml())
# <?xml version='1.0' encoding='utf-8'?>
# <data>
# <row>
# <index>0</index>
# <shape>square</shape>
# <degrees>360</degrees>
# <sides>4.0</sides>
# </row>
# <row>
# <index>1</index>
# <shape>circle</shape>
# <degrees>360</degrees>
# <sides/>
# </row>
# <row>
# <index>2</index>
# <shape>triangle</shape>
# <degrees>180</degrees>
# <sides>3.0</sides>
# </row>
# </data>

# attr_cols

import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# Write the listed columns as attributes of each row instead of child elements.
print(df.to_xml(attr_cols=[
    'index', 'shape', 'degrees', 'sides'
]))

# <?xml version='1.0' encoding='utf-8'?>
# <data>
# <row index="0" shape="square" degrees="360" sides="4.0"/>
# <row index="1" shape="circle" degrees="360"/>
# <row index="2" shape="triangle" degrees="180" sides="3.0"/>
# </data>

# namespaces

import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# Declare the namespace on the root and put its prefix on every element.
print(df.to_xml(namespaces={"doc": "https://example.com"},
                prefix="doc"))

# <?xml version='1.0' encoding='utf-8'?>
# <doc:data xmlns:doc="https://example.com">
# <doc:row>
# <doc:index>0</doc:index>
# <doc:shape>square</doc:shape>
# <doc:degrees>360</doc:degrees>
# <doc:sides>4.0</doc:sides>
# </doc:row>
# <doc:row>
# <doc:index>1</doc:index>
# <doc:shape>circle</doc:shape>
# <doc:degrees>360</doc:degrees>
# <doc:sides/>
# </doc:row>
# <doc:row>
# <doc:index>2</doc:index>
# <doc:shape>triangle</doc:shape>
# <doc:degrees>180</doc:degrees>
# <doc:sides>3.0</doc:sides>
# </doc:row>
# </doc:data>
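
The same prefixed layout can be produced with ElementTree directly, which may clarify what `namespaces` and `prefix` amount to. A minimal sketch, assuming a single row and only two of the columns: the tags are built with the URI in braces, and `register_namespace` asks the serializer to use `doc` as the prefix for that URI.

```python
import xml.etree.ElementTree as ET

URI = 'https://example.com'

# Ask the serializer to use 'doc' as the prefix for this URI.
ET.register_namespace('doc', URI)

# Tags are created in '{uri}name' form; the prefix appears only on output.
root = ET.Element(f'{{{URI}}}data')
row = ET.SubElement(root, f'{{{URI}}}row')
ET.SubElement(row, f'{{{URI}}}shape').text = 'square'
ET.SubElement(row, f'{{{URI}}}sides')  # no text: serialized as an empty element

xml_out = ET.tostring(root, encoding='unicode')
print(xml_out)
```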

# root_name and row_name

import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# Rename the root element and the per-row element.
print(df.to_xml(root_name='records', row_name='record'))

# <?xml version='1.0' encoding='utf-8'?>
# <records>
# <record>
# <index>0</index>
# <shape>square</shape>
# <degrees>360</degrees>
# <sides>4.0</sides>
# </record>
# <record>
# <index>1</index>
# <shape>circle</shape>
# <degrees>360</degrees>
# <sides/>
# </record>
# <record>
# <index>2</index>
# <shape>triangle</shape>
# <degrees>180</degrees>
# <sides>3.0</sides>
# </record>
# </records>

# Write to file

import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# With a path, to_xml writes the file and returns None, so there is nothing to print.
df.to_xml('records.xml', index=False, root_name='records',
          row_name='record', xml_declaration=False)

records.xml

<records>
<record>
<shape>square</shape>
<degrees>360</degrees>
<sides>4.0</sides>
</record>
<record>
<shape>circle</shape>
<degrees>360</degrees>
<sides/>
</record>
<record>
<shape>triangle</shape>
<degrees>180</degrees>
<sides>3.0</sides>
</record>
</records>
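
A quick way to check output like this is to read it back with `read_xml`. The sketch below keeps the XML in a string instead of writing `records.xml`, and passes `parser='etree'` on both sides; both are assumptions made so the snippet runs without lxml and without touching the file system.

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# Without a path, to_xml returns the document as a string.
xml_text = df.to_xml(index=False, root_name='records', row_name='record',
                     parser='etree')

# The xpath names the renamed row element.
back = pd.read_xml(StringIO(xml_text), xpath='.//record', parser='etree')
print(back)
```

The empty `<sides/>` element comes back as a missing value, so the circle row has NaN in `sides` again.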