XML Technologies Tutorials
# DTD
DTD - Quick Guide - Tutorialspoint
address.xml

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <!DOCTYPE address [
      <!ELEMENT address (name,company,phone)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT company (#PCDATA)>
      <!ELEMENT phone (#PCDATA)>
    ]>

    <address>
      <name>Tanmay Patil</name>
      <company>TutorialsPoint</company>
      <phone>(011) 123-4567</phone>
    </address>

address.xml & address.dtd

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE address SYSTEM "address.dtd">
    <address>
      <name>Tanmay Patil</name>
      <company>TutorialsPoint</company>
      <phone>(011) 123-4567</phone>
    </address>

address.dtd:

    <!ELEMENT address (name,company,phone)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT company (#PCDATA)>
    <!ELEMENT phone (#PCDATA)>

# XSD
XSD - Quick Guide - Tutorialspoint
students.xml

    <?xml version="1.0"?>
    <class xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="students.xsd">
      <student rollno="393">
        <firstname>Dinkar</firstname>
        <lastname>Kad</lastname>
        <nickname>Dinkar</nickname>
        <marks>85</marks>
      </student>
      <student rollno="493">
        <firstname>Vaneet</firstname>
        <lastname>Gupta</lastname>
        <nickname>Vinni</nickname>
        <marks>95</marks>
      </student>
      <student rollno="593">
        <firstname>Jasvir</firstname>
        <lastname>Singh</lastname>
        <nickname>Jazz</nickname>
        <marks>90</marks>
      </student>
    </class>

students.xsd

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name='class'>
        <xs:complexType>
          <xs:sequence>
            <xs:element name='student' type='StudentType'
                        minOccurs='0' maxOccurs='unbounded'/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>

      <xs:complexType name="StudentType">
        <xs:sequence>
          <xs:element name="firstname" type="xs:string"/>
          <xs:element name="lastname" type="xs:string"/>
          <xs:element name="nickname" type="xs:string"/>
          <xs:element name="marks" type="xs:positiveInteger"/>
        </xs:sequence>
        <xs:attribute name='rollno' type='xs:positiveInteger'/>
      </xs:complexType>
    </xs:schema>

# XSD Example
repost: XML Schema Example
This chapter will demonstrate how to write an XML Schema. You will also learn that a schema can be written in different ways.
# An XML Document
Let’s have a look at this XML document called “shiporder.xml”:

    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder orderid="889923"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:noNamespaceSchemaLocation="shiporder.xsd">
      <orderperson>John Smith</orderperson>
      <shipto>
        <name>Ola Nordmann</name>
        <address>Langgt 23</address>
        <city>4000 Stavanger</city>
        <country>Norway</country>
      </shipto>
      <item>
        <title>Empire Burlesque</title>
        <note>Special Edition</note>
        <quantity>1</quantity>
        <price>10.90</price>
      </item>
      <item>
        <title>Hide your heart</title>
        <quantity>1</quantity>
        <price>9.90</price>
      </item>
    </shiporder>

The XML document above consists of a root element, “shiporder”, that contains a required attribute called “orderid”. The “shiporder” element contains three different child elements: “orderperson”, “shipto” and “item”. The “item” element appears twice, and it contains a “title”, an optional “note” element, a “quantity”, and a “price” element.
The line xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" declares the XML Schema instance namespace, which tells the XML parser that this document should be validated against a schema. The line xsi:noNamespaceSchemaLocation="shiporder.xsd" specifies where the schema resides (here, in the same folder as “shiporder.xml”).
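As a rough illustration of how an application might pick up that location hint, the sketch below reads the attribute with Python's standard `xml.etree.ElementTree`. The sample string is a trimmed copy of shiporder.xml and the helper name `schema_hint` is hypothetical. ElementTree only reads the attribute here; it does not validate the document against the schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the xsi prefix in shiporder.xml.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Trimmed copy of shiporder.xml; only the root element matters here.
SHIPORDER_XML = """<shiporder orderid="889923"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
</shiporder>"""


def schema_hint(xml_text):
    """Return the xsi:noNamespaceSchemaLocation value of the root, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as "{uri}local".
    return root.get(f"{{{XSI_NS}}}noNamespaceSchemaLocation")


print(schema_hint(SHIPORDER_XML))
```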
# Create an XML Schema
Now we want to create a schema for the XML document above.
We start by opening a new file that we will call “shiporder.xsd”. To create the schema we could simply follow the structure in the XML document and define each element as we find it. We will start with the standard XML declaration followed by the xs:schema element that defines a schema:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    ...
    </xs:schema>

In the schema above we use the conventional namespace prefix (xs); the URI bound to this prefix identifies the XML Schema language and has the standard value http://www.w3.org/2001/XMLSchema.
Next, we have to define the “shiporder” element. This element has an attribute and contains other elements, so we treat it as a complex type. The child elements of the “shiporder” element are surrounded by an xs:sequence element, which defines an ordered sequence of sub-elements:

    <xs:element name="shiporder">
      <xs:complexType>
        <xs:sequence>
          ...
        </xs:sequence>
      </xs:complexType>
    </xs:element>

Then we have to define the “orderperson” element as a simple type (because it does not contain any attributes or other elements). The type (xs:string) is prefixed with the namespace prefix associated with XML Schema that indicates a predefined schema data type:

    <xs:element name="orderperson" type="xs:string"/>

Next, we have to define two elements that are of the complex type: “shipto” and “item”. We start by defining the “shipto” element:

    <xs:element name="shipto">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="address" type="xs:string"/>
          <xs:element name="city" type="xs:string"/>
          <xs:element name="country" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

With schemas we can define the number of possible occurrences of an element with the maxOccurs and minOccurs attributes. maxOccurs specifies the maximum number of occurrences and minOccurs specifies the minimum number. The default value for both maxOccurs and minOccurs is 1.
Now we can define the “item” element. This element can appear multiple times inside a “shiporder” element. This is specified by setting the maxOccurs attribute of the “item” element to “unbounded” which means that there can be as many occurrences of the “item” element as the author wishes. Notice that the “note” element is optional. We have specified this by setting the minOccurs attribute to zero:

    <xs:element name="item" maxOccurs="unbounded">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="title" type="xs:string"/>
          <xs:element name="note" type="xs:string" minOccurs="0"/>
          <xs:element name="quantity" type="xs:positiveInteger"/>
          <xs:element name="price" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

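To make the defaults concrete, here is a small sketch that lists the effective occurrence bounds of each element in the “item” fragment, filling in 1 wherever minOccurs or maxOccurs is omitted. It uses Python's standard `xml.etree.ElementTree`; the helper name `occurrence_bounds` is hypothetical, and the xs namespace declaration was added to the fragment only so that it parses on its own.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The "item" fragment from the text, with the xs namespace declared on it
# so that it can be parsed as a standalone document.
ITEM_XSD = """<xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema"
    name="item" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="note" type="xs:string" minOccurs="0"/>
      <xs:element name="quantity" type="xs:positiveInteger"/>
      <xs:element name="price" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>"""


def occurrence_bounds(xsd_text):
    """Map element name -> (minOccurs, maxOccurs), defaulting both to "1"."""
    root = ET.fromstring(xsd_text)
    return {
        el.get("name"): (el.get("minOccurs", "1"), el.get("maxOccurs", "1"))
        for el in root.iter(f"{XS}element")  # iter() includes the root itself
    }


for name, bounds in occurrence_bounds(ITEM_XSD).items():
    print(name, bounds)
```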
We can now declare the attribute of the “shiporder” element. Since this is a required attribute we specify use=“required”.
Note: The attribute declarations must always come last:

    <xs:attribute name="orderid" type="xs:string" use="required"/>

Here is the complete listing of the schema file called “shiporder.xsd”:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="shiporder">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="orderperson" type="xs:string"/>
            <xs:element name="shipto">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="address" type="xs:string"/>
                  <xs:element name="city" type="xs:string"/>
                  <xs:element name="country" type="xs:string"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="item" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="title" type="xs:string"/>
                  <xs:element name="note" type="xs:string" minOccurs="0"/>
                  <xs:element name="quantity" type="xs:positiveInteger"/>
                  <xs:element name="price" type="xs:decimal"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="orderid" type="xs:string" use="required"/>
        </xs:complexType>
      </xs:element>

    </xs:schema>

# Divide the Schema
The previous design method is very simple, but can be difficult to read and maintain when documents are complex.
The next design method is based on defining all elements and attributes first, and then referring to them using the ref attribute.
Here is the new design of the schema file (“shiporder.xsd”):

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <!-- definition of simple elements -->
      <xs:element name="orderperson" type="xs:string"/>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="address" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
      <xs:element name="country" type="xs:string"/>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="note" type="xs:string"/>
      <xs:element name="quantity" type="xs:positiveInteger"/>
      <xs:element name="price" type="xs:decimal"/>

      <!-- definition of attributes -->
      <xs:attribute name="orderid" type="xs:string"/>

      <!-- definition of complex elements -->
      <xs:element name="shipto">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="name"/>
            <xs:element ref="address"/>
            <xs:element ref="city"/>
            <xs:element ref="country"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>

      <xs:element name="item">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="title"/>
            <xs:element ref="note" minOccurs="0"/>
            <xs:element ref="quantity"/>
            <xs:element ref="price"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>

      <xs:element name="shiporder">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="orderperson"/>
            <xs:element ref="shipto"/>
            <xs:element ref="item" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute ref="orderid" use="required"/>
        </xs:complexType>
      </xs:element>

    </xs:schema>

# Using Named Types
The third design method defines classes or types, which enables us to reuse element definitions. This is done by naming the simpleType and complexType elements and then pointing to them through the type attribute of the element.
Here is the third design of the schema file (“shiporder.xsd”):

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:simpleType name="stringtype">
        <xs:restriction base="xs:string"/>
      </xs:simpleType>

      <xs:simpleType name="inttype">
        <xs:restriction base="xs:positiveInteger"/>
      </xs:simpleType>

      <xs:simpleType name="dectype">
        <xs:restriction base="xs:decimal"/>
      </xs:simpleType>

      <xs:simpleType name="orderidtype">
        <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]{6}"/>
        </xs:restriction>
      </xs:simpleType>

      <xs:complexType name="shiptotype">
        <xs:sequence>
          <xs:element name="name" type="stringtype"/>
          <xs:element name="address" type="stringtype"/>
          <xs:element name="city" type="stringtype"/>
          <xs:element name="country" type="stringtype"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="itemtype">
        <xs:sequence>
          <xs:element name="title" type="stringtype"/>
          <xs:element name="note" type="stringtype" minOccurs="0"/>
          <xs:element name="quantity" type="inttype"/>
          <xs:element name="price" type="dectype"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="shipordertype">
        <xs:sequence>
          <xs:element name="orderperson" type="stringtype"/>
          <xs:element name="shipto" type="shiptotype"/>
          <xs:element name="item" maxOccurs="unbounded" type="itemtype"/>
        </xs:sequence>
        <xs:attribute name="orderid" type="orderidtype" use="required"/>
      </xs:complexType>

      <xs:element name="shiporder" type="shipordertype"/>

    </xs:schema>

The restriction element indicates that the datatype is derived from a W3C XML Schema namespace datatype. So, the following fragment means that the value of the element or attribute must be a string value:

    <xs:restriction base="xs:string">

The restriction element is more often used to apply restrictions to elements. Look at the following lines from the schema above:

    <xs:simpleType name="orderidtype">
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{6}"/>
      </xs:restriction>
    </xs:simpleType>

This indicates that the value of the element or attribute must be a string, that it must be exactly six characters long, and that each character must be a digit from 0 to 9.
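The effect of that pattern can be approximated outside a schema validator. The sketch below uses Python's `re` module as a stand-in; the function name `is_valid_orderid` is hypothetical, and XSD regular expressions are not identical to Python's in general, although this particular pattern behaves the same in both.

```python
import re

# Same expression as the xs:pattern facet of "orderidtype".
ORDERID_PATTERN = re.compile(r"[0-9]{6}")


def is_valid_orderid(value):
    """True if value consists of exactly six digits 0-9."""
    # An XSD pattern has to match the whole value, so fullmatch (not search)
    # is the closest Python analogue.
    return ORDERID_PATTERN.fullmatch(value) is not None


for candidate in ["889923", "88992", "8899234", "88992a"]:
    print(candidate, is_valid_orderid(candidate))
```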
# DOM
XML DOM - Quick Guide - Tutorialspoint
node.xml

    <Company>
      <Employee category="Technical" id="firstelement">
        <FirstName>Tanmay</FirstName>
        <LastName>Patil</LastName>
        <ContactNo>1234567890</ContactNo>
        <Email>tanmaypatil@xyz.com</Email>
      </Employee>

      <Employee category="Non-Technical">
        <FirstName>Taniya</FirstName>
        <LastName>Mishra</LastName>
        <ContactNo>1234667898</ContactNo>
        <Email>taniyamishra@xyz.com</Email>
      </Employee>

      <Employee category="Management">
        <FirstName>Tanisha</FirstName>
        <LastName>Sharma</LastName>
        <ContactNo>1234562350</ContactNo>
        <Email>tanishasharma@xyz.com</Email>
      </Employee>
    </Company>

index.html

    <!DOCTYPE html>
    <html>
      <body>
        <div>
          <b>FirstName:</b> <span id="FirstName"></span><br />
          <b>LastName:</b> <span id="LastName"></span><br />
          <b>ContactNo:</b> <span id="ContactNo"></span><br />
          <b>Email:</b> <span id="Email"></span>
        </div>
        <script>
          // Create the request object; ActiveXObject is the fallback for legacy IE.
          let xmlhttp;
          if (window.XMLHttpRequest) {
            xmlhttp = new XMLHttpRequest();
          } else {
            xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
          }

          // Synchronous GET (third argument is false), then take the parsed XML.
          xmlhttp.open("GET", "/dom/node.xml", false);
          xmlhttp.send();
          const xmlDoc = xmlhttp.responseXML;

          // Copy the text of the first matching element into each span.
          document.getElementById("FirstName").innerHTML =
            xmlDoc.getElementsByTagName("FirstName")[0].childNodes[0].nodeValue;
          document.getElementById("LastName").innerHTML =
            xmlDoc.getElementsByTagName("LastName")[0].childNodes[0].nodeValue;
          document.getElementById("ContactNo").innerHTML =
            xmlDoc.getElementsByTagName("ContactNo")[0].childNodes[0].nodeValue;
          document.getElementById("Email").innerHTML =
            xmlDoc.getElementsByTagName("Email")[0].childNodes[0].nodeValue;
        </script>
      </body>
    </html>

# XPath
XPath - Quick Guide - Tutorialspoint
students.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="students.xsl"?>
    <class>
      <student rollno="393">
        <firstname>Dinkar</firstname>
        <lastname>Kad</lastname>
        <nickname>Dinkar</nickname>
        <marks>85</marks>
      </student>
      <student rollno="493">
        <firstname>Vaneet</firstname>
        <lastname>Gupta</lastname>
        <nickname>Vinni</nickname>
        <marks>95</marks>
      </student>
      <student rollno="593">
        <firstname>Jasvir</firstname>
        <lastname>Singh</lastname>
        <nickname>Jazz</nickname>
        <marks>90</marks>
      </student>
    </class>

students.xsl

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html>
          <body>
            <h2>Students</h2>
            <table border="1">
              <tr bgcolor="#9acd32">
                <th>Roll No</th>
                <th>First Name</th>
                <th>Last Name</th>
                <th>Nick Name</th>
                <th>Marks</th>
              </tr>
              <xsl:for-each select="class/student">
                <tr>
                  <td><xsl:value-of select="@rollno"/></td>
                  <td><xsl:value-of select="firstname"/></td>
                  <td><xsl:value-of select="lastname"/></td>
                  <td><xsl:value-of select="nickname"/></td>
                  <td><xsl:value-of select="marks"/></td>
                </tr>
              </xsl:for-each>
            </table>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>

# XPath Cheatsheet
from: Xpath cheatsheet
Comments
Prefix Relative
//section[.//h1[@id='hi']]
The `.` before `//h1` makes the search relative to the current `section`; without it, `//h1` is matched from the document root rather than from the current element.
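The same idea can be tried with Python's standard `xml.etree.ElementTree`, shown here as a minimal sketch. ElementTree implements only a subset of XPath and does not accept the nested predicate above, so the outer `section[...]` test is emulated with a loop, while the inner `.//h1[@id='hi']` is evaluated relative to each section. The sample markup and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only section "a" contains an h1 with id="hi".
PAGE = """<body>
  <section id="a"><div><h1 id="hi">Hello</h1></div></section>
  <section id="b"><h1 id="other">Other</h1></section>
</body>"""


def sections_containing_hi(xml_text):
    """Return the ids of sections that have a descendant h1 with id='hi'."""
    root = ET.fromstring(xml_text)
    return [
        section.get("id")
        for section in root.iter("section")
        # The leading "." anchors the search at this section, not at the root.
        if section.find(".//h1[@id='hi']") is not None
    ]


print(sections_containing_hi(PAGE))
```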
String Function ends-with
//a[ends-with(@href, '.pdf')]
The `ends-with` function is part of XPath 2.0, but browsers generally only support 1.0, so you have to implement it yourself with a combination of `string-length`, `substring`, and an equality test.
a[substring(@href, string-length(@href) - string-length('.pdf') +1) = '.pdf']
or
a[@href[substring(., string-length(.) - string-length('.pdf') +1) = '.pdf']]
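The index arithmetic in these expressions is easy to get wrong, because XPath's `substring` is 1-based. The sketch below mirrors it in Python to show why the `+ 1` is needed. Both function names are hypothetical, and `xpath_substring` covers only the two-argument form with an integer start.

```python
def xpath_substring(s, start):
    """Simplified XPath 1.0 substring(s, start) for an integer start.

    Positions are 1-based; a start below 1 simply begins at the first character.
    """
    return s[max(start, 1) - 1:]


def ends_with(s, suffix):
    """Emulate ends-with() as substring + string-length + equality."""
    # string-length(s) - string-length(suffix) + 1 is the 1-based position
    # at which the suffix would have to begin.
    return xpath_substring(s, len(s) - len(suffix) + 1) == suffix


for href in ["report.pdf", "report.pdf.bak", "pdf"]:
    print(href, ends_with(href, ".pdf"))
```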
String Function normalize-space
The `normalize-space` function strips leading and trailing whitespace from a string, replaces each sequence of whitespace characters with a single space, and returns the resulting string.
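For readers who want to see the behavior on a concrete value, here is a rough Python equivalent. The function name is hypothetical; the character class is limited to space, tab, carriage return and line feed, which are the characters XML treats as whitespace.

```python
import re


def normalize_space(s):
    """Approximate XPath normalize-space() for a Python string."""
    # Collapse each run of XML whitespace into one space, then trim the ends.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


print(repr(normalize_space("  Learn \n\t Java  ")))
```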
# XQuery
XQuery - Quick Guide - Tutorialspoint
books.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <books>
      <book category="JAVA">
        <title lang="en">Learn Java in 24 Hours</title>
        <author>Robert</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="DOTNET">
        <title lang="en">Learn .Net in 24 hours</title>
        <author>Peter</author>
        <year>2011</year>
        <price>40.50</price>
      </book>
      <book category="XML">
        <title lang="en">Learn XQuery in 24 hours</title>
        <author>Robert</author>
        <author>Peter</author>
        <year>2013</year>
        <price>50.00</price>
      </book>
      <book category="XML">
        <title lang="en">Learn XPath in 24 hours</title>
        <author>Jay Ban</author>
        <year>2010</year>
        <price>16.50</price>
      </book>
    </books>

books.xqy

    for $x in doc("books.xml")/books/book
    where $x/price > 30
    return $x/title

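For comparison, the same for/where/return query can be written as a list comprehension. This is only a sketch using Python's standard `xml.etree.ElementTree`, not an XQuery engine. The data is a trimmed copy of books.xml (titles and prices only) and the function name `titles_over` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of books.xml: only the fields the query touches.
BOOKS_XML = """<books>
  <book category="JAVA"><title lang="en">Learn Java in 24 Hours</title><price>30.00</price></book>
  <book category="DOTNET"><title lang="en">Learn .Net in 24 hours</title><price>40.50</price></book>
  <book category="XML"><title lang="en">Learn XQuery in 24 hours</title><price>50.00</price></book>
  <book category="XML"><title lang="en">Learn XPath in 24 hours</title><price>16.50</price></book>
</books>"""


def titles_over(xml_text, min_price):
    """Titles of books whose price is strictly greater than min_price."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")                       # return $x/title
        for book in root.findall("book")             # for $x in /books/book
        if float(book.findtext("price")) > min_price  # where $x/price > 30
    ]


print(titles_over(BOOKS_XML, 30))
```

The 30.00 book is excluded because the comparison is strict.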
# XSLT
XSLT - Quick Guide - Tutorialspoint
The XSLT example uses the same students.xml and students.xsl files listed in the XPath section above.
BaseX | The XML Framework: Lightweight and High-Performance Data Processing
XML - Visual Studio Code Plugin
XML Tools - Visual Studio Code Plugin
# XML, HTML and Excel
XML and HTML files can be opened in Excel, and CSS styles can be used in the HTML.
customers.html

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <title>table</title>
        <style>
          #customers {
            font-family: "Trebuchet MS", Arial, Helvetica, sans-serif;
            width: 100%;
            border-collapse: collapse;
          }

          #customers td,
          #customers th {
            font-size: 1em;
            border: 1px solid #98bf21;
            padding: 3px 7px 2px 7px;
          }

          #customers th {
            font-size: 1.1em;
            text-align: left;
            padding-top: 5px;
            padding-bottom: 4px;
            background-color: #a7c942;
            color: #ffffff;
          }

          #customers tr.alt td {
            color: #000000;
            background-color: #eaf2d3;
          }
        </style>
      </head>

      <body>
        <table id="customers">
          <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
          <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
          <tr class="alt"><td>Berglunds snabbköp</td><td>Christina Berglund</td><td>Sweden</td></tr>
          <tr><td>Centro comercial Moctezuma</td><td>Francisco Chang</td><td>Mexico</td></tr>
          <tr class="alt"><td>Ernst Handel</td><td>Roland Mendel</td><td>Austria</td></tr>
          <tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>
          <tr class="alt"><td>Königlich Essen</td><td>Philip Cramer</td><td>Germany</td></tr>
          <tr><td>Laughing Bacchus Winecellars</td><td>Yoshi Tannamuri</td><td>Canada</td></tr>
          <tr class="alt"><td>Magazzini Alimentari Riuniti</td><td>Giovanni Rovelli</td><td>Italy</td></tr>
          <tr><td>North/South</td><td>Simon Crowther</td><td>UK</td></tr>
          <tr class="alt"><td>Paris spécialités</td><td>Marie Bertrand</td><td>France</td></tr>
        </table>
      </body>
    </html>

Issues noticed so far:
`01` is displayed as `1` in Excel; write it as `="01"` in the XML or HTML. This is the same behavior as a CSV file opened in Excel.
Attributes of XML elements also become columns in Excel.
Extension: in an ERP system, the xls file produced when a report is developed with RTF is actually an HTML file with styles. So instead of relying on an RTF Template, you can output HTML directly and change the file extension to xls (Excel opens it directly and can evaluate formulas, and so on). This approach is flexible: for example, you can write a different style file for each customer, or define several style sets and pick one at random for each report run. Also, because the RTF report is essentially an HTML table, it can be parsed with pandas.
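
    import pandas as pd

    # read_html returns a list of DataFrames, one per <table> in the file
    tables = pd.read_html('./DN.html')
    df = pd.DataFrame(tables[0])
    print(df.head())

    # Strip the ="..." wrapper that keeps values as text in Excel
    df.replace(regex={'^="(.*?)"$': '\\1'}, inplace=True)
    print(df.head())

An Excel xlsx file can be unpacked with 7z or a similar tool; its contents are in fact XML as well.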
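The wrapping convention and its reversal can also be tried on plain strings, without pandas or an input file. In this sketch the two function names are hypothetical; the regular expression is the one used in the pandas snippet, and values that themselves contain double quotes are not handled.

```python
import re


def as_excel_text(value):
    """Wrap a value as ="..." so Excel keeps it as text (e.g. leading zeros)."""
    return f'="{value}"'


def strip_excel_text(cell):
    """Undo the wrapping; cells without it are returned unchanged."""
    return re.sub(r'^="(.*?)"$', r"\1", cell)


wrapped = as_excel_text("01")
print(wrapped, strip_excel_text(wrapped))
```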
# XML Parser
Different parsers may handle the same XML differently. The XML below contains the Chinese text `学习` but declares its encoding as `ASCII`; it was opened in both IE and Chrome.

    <?xml version="1.0" encoding="ASCII"?>
    <books>
      <book category="JAVA">
        <title lang="en">Learn Java in 24 Hours</title>
        <author>Robert</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="DOTNET">
        <title lang="en">Learn .Net in 24 hours</title>
        <author>Peter</author>
        <year>2011</year>
        <price>40.50</price>
      </book>
      <book category="XML">
        <title lang="en">学习 XQuery</title>
        <author>Robert</author>
        <author>Peter</author>
        <year>2013</year>
        <price>50.00</price>
      </book>
    </books>

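IE stops parsing as soon as it finds a non-ASCII character. Chrome tries to parse the file but also displays the Chinese text incorrectly, because the XML specifies `encoding="ASCII"` and ASCII does not include Chinese characters. XML that contains Chinese text should use `encoding="UTF-8"`.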
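The underlying mismatch is easy to reproduce outside a browser. This sketch only checks which encodings can represent the title text; it says nothing about how a particular browser recovers from the error. The helper name `can_encode` is hypothetical.

```python
TITLE = "学习 XQuery"


def can_encode(text, encoding):
    """True if every character of text is representable in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("ascii:", can_encode(TITLE, "ascii"))
print("utf-8:", can_encode(TITLE, "utf-8"))
```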
# XML Preserve Space
In the XML below, the `lang` attribute value of `title` has 4 spaces in the middle, and its content has 4 spaces at the start, in the middle, and at the end. In Chrome, however, consecutive spaces are displayed as a single space.

    <?xml version="1.0" encoding="ASCII"?>
    <books>
      <book category="JAVA">
        <title lang="e    n">    Learn    Java    </title>
        <author>Robert</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
    </books>

You can set `xml:space="preserve"` to keep runs of whitespace in an element and its child elements, but this has no effect in browsers.

    <?xml version="1.0" encoding="ASCII"?>
    <books xml:space="preserve">
      <book category="JAVA">
        <title lang="e    n">    Learn    Java    </title>
        <author>Robert</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
    </books>

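A quick check with a non-browser parser suggests that the spaces themselves survive parsing and that the collapsing happens when the browser displays the document. The sketch uses Python's standard `xml.etree.ElementTree` on the `title` element alone, with the 4-space runs described above; other parsers may expose whitespace differently.

```python
import xml.etree.ElementTree as ET

# The title element from the example, with 4-space runs in the attribute
# value and before, inside, and after the text.
TITLE_XML = '<title lang="e    n">    Learn    Java    </title>'

title = ET.fromstring(TITLE_XML)

# repr() makes the preserved spaces visible.
print(repr(title.get("lang")))
print(repr(title.text))
```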
In HTML, content with several consecutive spaces is likewise displayed with a single space. You can use the `pre` tag, set `{ white-space: pre; }` through CSS, or write the spaces as the HTML entity `&nbsp;`; for example, 4 spaces can be written as `&nbsp;&nbsp;&nbsp;&nbsp;`.

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Books Table</title>
        <style>
          body {
            display: flex;
            justify-content: center;
          }

          table {
            width: 80%;
            border-collapse: collapse;
          }

          table td,
          table th {
            font-size: 1em;
            border: 1px solid #98bf21;
            padding: 3px 7px 2px 7px;
          }

          table th {
            font-size: 1.3em;
            text-align: left;
            padding: 5px auto;
            background-color: #a7c942;
            color: #ffffff;
          }
        </style>
      </head>
      <body>
        <table>
          <thead>
            <tr>
              <th>title</th>
              <th>author</th>
              <th>year</th>
              <th>price</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>&nbsp;&nbsp;&nbsp;&nbsp;Learn&nbsp;&nbsp;&nbsp;&nbsp;Java&nbsp;&nbsp;&nbsp;&nbsp;</td>
              <td>Robert</td>
              <td>2005</td>
              <td>30.00</td>
            </tr>
          </tbody>
        </table>
      </body>
    </html>

Use the `pre` tag:

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Books Table</title>
        <style>
          body {
            display: flex;
            justify-content: center;
          }

          table {
            width: 80%;
            border-collapse: collapse;
          }

          table td,
          table th {
            font-size: 1em;
            border: 1px solid #98bf21;
            padding: 3px 7px 2px 7px;
          }

          table th {
            font-size: 1.3em;
            text-align: left;
            padding: 5px auto;
            background-color: #a7c942;
            color: #ffffff;
          }
        </style>
      </head>
      <body>
        <table>
          <thead>
            <tr>
              <th>title</th>
              <th>author</th>
              <th>year</th>
              <th>price</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td><pre>    Learn    Java    </pre></td>
              <td>Robert</td>
              <td>2005</td>
              <td>30.00</td>
            </tr>
          </tbody>
        </table>
      </body>
    </html>

Use CSS { white-space: pre; }

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Books Table</title>
        <style>
          body {
            display: flex;
            justify-content: center;
          }

          table {
            width: 80%;
            border-collapse: collapse;
          }

          table td,
          table th {
            font-size: 1em;
            border: 1px solid #98bf21;
            padding: 3px 7px 2px 7px;
          }

          table th {
            font-size: 1.3em;
            text-align: left;
            padding: 5px auto;
            background-color: #a7c942;
            color: #ffffff;
          }

          tr td:first-child {
            white-space: pre;
          }
        </style>
      </head>
      <body>
        <table>
          <thead>
            <tr>
              <th>title</th>
              <th>author</th>
              <th>year</th>
              <th>price</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>    Learn    Java    </td>
              <td>Robert</td>
              <td>2005</td>
              <td>30.00</td>
            </tr>
          </tbody>
        </table>
      </body>
    </html>

# XML Locate Errors
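When an XML document contains errors and the file is too large to format with tools such as VS Code, drag the XML file into a browser. The browser reports the cause of the error, and you can then jump to the reported line in VS Code or Notepad++ with Ctrl + G.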
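A parser can report the same location without a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, whose `ParseError` carries a `position` tuple of line and column. The broken sample and the helper name `first_error` are made up for illustration; for a large file on disk, `ET.parse(path)` raises the same exception.

```python
import xml.etree.ElementTree as ET

# Hypothetical broken document: the closing tag on line 3 is misspelled.
BROKEN_XML = "<books>\n  <book>\n    <title>Learn Java</titel>\n  </book>\n</books>"


def first_error(xml_text):
    """Return (line, column, message) of the first parse error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print(first_error(BROKEN_XML))
```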
# XML and Python Pandas DataFrame
pandas.read_xml
pandas.DataFrame.to_xml
# Read XML to DataFrame
# Simple Read

    import pandas as pd
    from io import StringIO

    xml = '''<?xml version='1.0' encoding='utf-8'?>
    <data xmlns="http://example.com">
     <row>
       <shape>square</shape>
       <degrees>360</degrees>
       <sides>4.0</sides>
     </row>
     <row>
       <shape>circle</shape>
       <degrees>360</degrees>
       <sides/>
     </row>
     <row>
       <shape>triangle</shape>
       <degrees>180</degrees>
       <sides>3.0</sides>
     </row>
    </data>'''

    # Read from a file on disk (requires ./data.xml to exist)
    # df = pd.read_xml('./data.xml')

    # Read from an in-memory string
    df = pd.read_xml(StringIO(xml))
    print(df)

# Read Using XPath

    import pandas as pd
    from io import StringIO

    xml = '''<?xml version='1.0' encoding='utf-8'?>
    <data>
      <row shape="square" degrees="360" sides="4.0"/>
      <row shape="circle" degrees="360"/>
      <row shape="triangle" degrees="180" sides="3.0"/>
    </data>'''

    # Select every <row>; its attributes become the columns
    df = pd.read_xml(StringIO(xml), xpath=".//row")
    print(df)

# Read Using Iterparse

    from io import BytesIO, StringIO

    import pandas as pd

    xml_data = '''
    <data>
        <record>
            <name>Babb</name>
            <age>30</age>
            <address postcode="123">
                <city>New York</city>
                <state>NY</state>
            </address>
        </record>
        <record>
            <name>John</name>
            <age>25</age>
            <address postcode="456">
                <city>BRENTWOOD</city>
                <state>ESSEX</state>
            </address>
        </record>
    </data>
    '''

    # Key: the repeating element; value: descendant elements/attributes to collect
    iterparse_config = {
        "record": ["name", "age", "city", "state", "postcode"]
    }

    # Default parser, bytes input
    df = pd.read_xml(BytesIO(xml_data.encode()), iterparse=iterparse_config)

    # etree parser, bytes input
    df = pd.read_xml(BytesIO(xml_data.encode()), iterparse=iterparse_config, parser='etree')

    # etree parser, text input
    df = pd.read_xml(StringIO(xml_data), iterparse=iterparse_config, parser='etree')

    print(df)

# Read with Namespace

    import pandas as pd
    from io import StringIO

    xml = '''<?xml version='1.0' encoding='utf-8'?>
    <doc:data xmlns:doc="https://example.com">
      <doc:row>
        <doc:shape>square</doc:shape>
        <doc:degrees>360</doc:degrees>
        <doc:sides>4.0</doc:sides>
      </doc:row>
      <doc:row>
        <doc:shape>circle</doc:shape>
        <doc:degrees>360</doc:degrees>
        <doc:sides/>
      </doc:row>
      <doc:row>
        <doc:shape>triangle</doc:shape>
        <doc:degrees>180</doc:degrees>
        <doc:sides>3.0</doc:sides>
      </doc:row>
    </doc:data>'''

    # Map the prefix used in the xpath to the namespace URI
    df = pd.read_xml(StringIO(xml),
                     xpath="//doc:row",
                     namespaces={"doc": "https://example.com"})
    print(df)

If the document only defines a default `xmlns` namespace, an `xpath` without a prefix cannot read it.

    import pandas as pd
    from io import StringIO

    xml = '''<?xml version='1.0' encoding='utf-8'?>
    <data xmlns="http://example.com">
     <row>
       <shape>square</shape>
       <degrees>360</degrees>
       <sides>4.0</sides>
     </row>
     <row>
       <shape>circle</shape>
       <degrees>360</degrees>
       <sides/>
     </row>
     <row>
       <shape>triangle</shape>
       <degrees>180</degrees>
       <sides>3.0</sides>
     </row>
    </data>'''

    # './/row' carries no prefix, so it does not select the namespaced <row> elements
    df = pd.read_xml(StringIO(xml), xpath='.//row')
    print(df)

Parsing XML that defines only a default `xmlns` namespace with `etree`:

    <?xml version='1.0' encoding='utf-8'?>
    <data xmlns="http://example.com">
        <record>
            <name>Babb</name>
            <age>30</age>
            <address>
                <city>New York</city>
                <state>NY</state>
            </address>
        </record>
        <record>
            <name>John</name>
            <age>25</age>
            <address>
                <city>New York</city>
                <state>NY</state>
            </address>
        </record>
    </data>

The parsing script:

    import xml.etree.ElementTree as ET

    tree = ET.parse('./data.xml')
    root = tree.getroot()


    def parse_element(element, item):
        # Leaf element: store its text; otherwise recurse into the children
        if len(list(element)) == 0:
            item[element.tag] = element.text
        else:
            for child in list(element):
                parse_element(child, item)


    data = []
    for child in root:
        item = {}
        parse_element(child, item)
        data.append(item)

    print(data)

Using XPath. A path without a prefix finds nothing here, because every element belongs to the default namespace:

    import xml.etree.ElementTree as ET

    tree = ET.parse('./data.xml')
    root = tree.getroot()

    for record in root.findall('.//record'):
        for element in record:
            print(element.tag, element.text)

Mapping a prefix to the namespace URI makes the lookup match:

    import xml.etree.ElementTree as ET

    tree = ET.parse('./data.xml')
    root = tree.getroot()

    for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
        for element in record:
            print(element.tag, element.text)

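With a default namespace, ElementTree reports each tag in the form `{http://example.com}name`, so dictionaries built from `element.tag` carry the URI in every key. The sketch below shows one possible way to look records up through a prefix map and strip the namespace part from the tags. The inline sample is a shortened copy of the XML above, and `local_name` and `records` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the default-namespace sample.
XML = """<data xmlns="http://example.com">
  <record><name>Babb</name><age>30</age></record>
</data>"""

# The prefix is arbitrary; only the URI has to match the document.
NS = {"doc": "http://example.com"}


def local_name(tag):
    """Drop the leading "{uri}" that ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def records(xml_text):
    """Return one dict per record, keyed by namespace-free tag names."""
    root = ET.fromstring(xml_text)
    return [
        {local_name(el.tag): el.text for el in rec}
        for rec in root.findall(".//doc:record", NS)
    ]


print(records(XML))
```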
Parsing XML that declares a prefixed `xmlns` namespace but does not put the prefix on its elements with `etree`:

    <?xml version='1.0' encoding='utf-8'?>
    <data xmlns:doc="http://example.com">
        <record>
            <name>Babb</name>
            <age>30</age>
            <address>
                <city>New York</city>
                <state>NY</state>
            </address>
        </record>
        <record>
            <name>John</name>
            <age>25</age>
            <address>
                <city>New York</city>
                <state>NY</state>
            </address>
        </record>
    </data>

The parsing script:

    import xml.etree.ElementTree as ET

    tree = ET.parse('./data.xml')
    root = tree.getroot()


    def parse_element(element, item):
        # Leaf element: store its text; otherwise recurse into the children
        if len(list(element)) == 0:
            item[element.tag] = element.text
        else:
            for child in list(element):
                parse_element(child, item)


    data = []
    for child in root:
        item = {}
        parse_element(child, item)
        data.append(item)

    print(data)

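Use XPath. The `doc` prefix is declared but never used, so the elements are in no namespace: the unqualified path matches them, and the prefixed path matches nothing.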
```python
import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

# The elements are in no namespace, so the unqualified path finds both records.
for record in root.findall('.//record'):
    for element in record:
        print(element.tag, element.text)
```
```python
import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

# No element in this document is in the http://example.com namespace,
# so this loop finds no elements and prints nothing.
for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
    for element in record:
        print(element.tag, element.text)
```
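When it is not known in advance whether a document puts its elements in a namespace, the `{*}` wildcard in ElementTree paths (Python 3.8 and later) matches a tag in any namespace or in none. A minimal sketch, assuming two trimmed inline variants of `data.xml`; the names `UNPREFIXED`, `DEFAULT_NS` and `record_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two trimmed variants of data.xml, inlined (an assumption) so the sketch runs on its own.
UNPREFIXED = """<data xmlns:doc="http://example.com">
  <record><name>Babb</name></record>
  <record><name>John</name></record>
</data>"""

DEFAULT_NS = """<data xmlns="http://example.com">
  <record><name>Babb</name></record>
  <record><name>John</name></record>
</data>"""


def record_names(xml_text):
    """Return the text of every record/name, whatever namespace the tags are in."""
    root = ET.fromstring(xml_text)
    # '{*}tag' matches 'tag' in any namespace, or in no namespace (Python 3.8+).
    return [name.text for name in root.findall('.//{*}record/{*}name')]


print(record_names(UNPREFIXED))
print(record_names(DEFAULT_NS))
```

The wildcard trades precision for convenience: it would also match a `record` element from an unrelated namespace, so an explicit prefix mapping is the stricter choice when the namespace is known.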
etree
Parse XML that defines xmlns
Namespace declared, and prefix added to the elements.
```xml
<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="http://example.com">
  <doc:record>
    <doc:name>Babb</doc:name>
    <doc:age>30</doc:age>
    <doc:address>
      <doc:city>New York</doc:city>
      <doc:state>NY</doc:state>
    </doc:address>
  </doc:record>
  <doc:record>
    <doc:name>John</doc:name>
    <doc:age>25</doc:age>
    <doc:address>
      <doc:city>New York</doc:city>
      <doc:state>NY</doc:state>
    </doc:address>
  </doc:record>
</doc:data>
```
```python
import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()


def parse_element(element, item):
    # Leaf elements store their text; container elements recurse into children.
    if len(element) == 0:
        item[element.tag] = element.text
    else:
        for child in element:
            parse_element(child, item)


data = []
for child in root:
    item = {}
    parse_element(child, item)
    data.append(item)

# Prefixed elements are stored with their namespace URI,
# so the keys look like '{http://example.com}name'.
print(data)
```
Use XPath. Every element carries the `doc` prefix, so the query must map a prefix to `http://example.com`.
```python
import xml.etree.ElementTree as ET

tree = ET.parse('./data.xml')
root = tree.getroot()

# The prefix used in the query is resolved through the mapping, not through the file.
for record in root.findall('.//doc:record', {'doc': 'http://example.com'}):
    for element in record:
        print(element.tag, element.text)
```
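The recursive parser above yields keys such as `{http://example.com}name` for namespaced documents. If plain keys are preferred, one option is to drop the `{uri}` part before storing each value. This is a sketch under that assumption; `local_name` and `flatten` are hypothetical helper names, and `XML_TEXT` is a trimmed inline copy of the prefixed `data.xml`.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the prefixed data.xml (an assumption for self-containment).
XML_TEXT = """<doc:data xmlns:doc="http://example.com">
  <doc:record>
    <doc:name>Babb</doc:name>
    <doc:age>30</doc:age>
    <doc:address>
      <doc:city>New York</doc:city>
      <doc:state>NY</doc:state>
    </doc:address>
  </doc:record>
</doc:data>"""


def local_name(tag):
    """Drop a leading '{uri}' from a tag; tags without one are returned unchanged."""
    return tag.rsplit('}', 1)[-1]


def flatten(element, item):
    """Collect the text of every leaf element under `element`, keyed by local name."""
    if len(element) == 0:
        item[local_name(element.tag)] = element.text
    else:
        for child in element:
            flatten(child, item)
    return item


root = ET.fromstring(XML_TEXT)
records = [flatten(record, {}) for record in root]
print(records)
```

Dropping the namespace makes the dictionaries the same for all three variants of `data.xml`, at the cost of losing the distinction between same-named elements from different namespaces.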
# Parse Null and Date
```python
import pandas as pd
from io import StringIO

# The second row has no <a> element, so that cell is read as a missing value.
xml_data = '''<data>
  <row>
    <index>0</index>
    <a>1</a>
    <b>2.5</b>
    <c>True</c>
    <d>a</d>
    <e>2019-12-31 00:00:00</e>
  </row>
  <row>
    <index>1</index>
    <b>4.5</b>
    <c>False</c>
    <d>b</d>
    <e>2019-12-31 00:00:00</e>
  </row>
</data>
'''

# dtype_backend needs pandas 2.0 or later; parse_dates converts column "e" to datetimes.
df = pd.read_xml(StringIO(xml_data),
                 dtype_backend="numpy_nullable",
                 parse_dates=["e"])
print(df)
```
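To make explicit what the two options ask for, here is a standard-library sketch of the same idea: a missing element becomes `None` and the date string becomes a `datetime`. This is not how pandas implements `read_xml`; `read_row` is a hypothetical helper, and the XML is a trimmed copy of the data above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the XML above: the second row has no <a> element.
XML_TEXT = """<data>
  <row><index>0</index><a>1</a><e>2019-12-31 00:00:00</e></row>
  <row><index>1</index><e>2019-12-31 00:00:00</e></row>
</data>"""


def read_row(row):
    """Convert one <row> to a dict, keeping a missing <a> as None."""
    a_text = row.findtext('a')  # None when the <a> element is absent
    return {
        'index': int(row.findtext('index')),
        'a': int(a_text) if a_text is not None else None,
        'e': datetime.strptime(row.findtext('e'), '%Y-%m-%d %H:%M:%S'),
    }


rows = [read_row(row) for row in ET.fromstring(XML_TEXT)]
for parsed in rows:
    print(parsed)
```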
# DataFrame to XML
# Simple to
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# With no path argument, to_xml() returns the document as a string.
print(df.to_xml())
```
# attr_cols
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# attr_cols writes the listed columns as attributes of each row element.
print(df.to_xml(attr_cols=['index', 'shape', 'degrees', 'sides']))
```
# namespace
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# namespaces declares the URI on the root; prefix is applied to every element name.
print(df.to_xml(namespaces={"doc": "https://example.com"},
                prefix="doc"))
```
# root_name and row_name
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# root_name and row_name replace the default <data> and <row> element names.
print(df.to_xml(root_name='records', row_name='record'))
```
# Write to file
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, np.nan, 3]})

# When a path is given, to_xml() writes the file and returns None,
# so there is nothing useful to print.
df.to_xml('records.xml', index=False, root_name='records',
          row_name='record', xml_declaration=False)
```
records.xml
```xml
<records>
  <record>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </record>
  <record>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </record>
  <record>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </record>
</records>
```
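The empty `<sides/>` element is how the `NaN` value was written. Reading the file back with the standard library therefore needs an explicit rule for empty elements. A minimal sketch, assuming the file content shown above is inlined as `RECORDS_XML`; `to_number` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Inline copy of records.xml as shown above (an assumption for self-containment).
RECORDS_XML = """<records>
  <record><shape>square</shape><degrees>360</degrees><sides>4.0</sides></record>
  <record><shape>circle</shape><degrees>360</degrees><sides/></record>
  <record><shape>triangle</shape><degrees>180</degrees><sides>3.0</sides></record>
</records>"""


def to_number(text):
    """Convert element text to float; empty or missing text becomes None."""
    # findtext() returns '' for an element that is present but empty.
    return float(text) if text else None


root = ET.fromstring(RECORDS_XML)
rows = [
    {
        'shape': record.findtext('shape'),
        'degrees': int(record.findtext('degrees')),
        'sides': to_number(record.findtext('sides')),
    }
    for record in root.iter('record')
]
print(rows)
```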