Saat ini, alat I / O panda tidak mempertahankan read_xml()
metode dan pasangannya to_xml()
. Namun, read_json
membuktikan struktur seperti pohon dapat diterapkan untuk impor bingkai data dan read_html
untuk format markup.
Jika tim panda tidak mempertimbangkan seperti read_xml
metode untuk versi panda masa depan, apa yang pelaksanaannya akan mereka mengejar: parsing dengan built-in xml.etree.ElementTree
dengan yang iterfind()
atau iterparse()
fungsi atau modul pihak ketiga, lxml
dengan XPath yang 1.0 dan XSLT 1.0 metode?
Di bawah ini adalah pengujian saya untuk empat jenis metode pada input XML sederhana, datar, dan berpusat pada elemen. Semua disiapkan untuk penguraian umum untuk turunan root level kedua dan setiap metode harus menghasilkan dataframe pandas yang sama persis. Semua kecuali panggilan terakhir pd.Dataframe()
dalam daftar kamus. Metode XSLT mengubah XML ke CSV untuk dicor StringIO()
di pd.read_csv()
.
Pertanyaan (multi-bagian)
KINERJA: Bagaimana Anda menjelaskan lambat yang
iterparse
sering direkomendasikan untuk file yang lebih besar karena file diurai secara berulang? Apakah sebagian karenaif
pemeriksaan logika?MEMORI: Apakah memori CPU berhubungan dengan pengaturan waktu dalam panggilan I / O? XSLT dan XPath 1.0 cenderung tidak dapat diskalakan dengan baik dengan dokumen XML yang lebih besar karena seluruh file harus dibaca di memori untuk diurai.
STRATEGI: Apakah daftar kamus merupakan strategi yang optimal untuk
Dataframe()
menelepon? Lihat jawaban menarik ini: versi generator dan versi iterwalk yang ditentukan pengguna . Kedua daftar upcast ke dataframe.
Input Data ( pengguna teratas Stack Overflow saat ini menurut tahun teman panda kami disertakan)
<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
<topusers>
<user>Gordon Linoff</user>
<link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
<location>New York, United States</location>
<year_rep>5,985</year_rep>
<total_rep>499,408</total_rep>
<tag1>sql</tag1>
<tag2>sql-server</tag2>
<tag3>mysql</tag3>
</topusers>
<topusers>
<user>Günter Zöchbauer</user>
<link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
<location>Linz, Austria</location>
<year_rep>5,835</year_rep>
<total_rep>154,439</total_rep>
<tag1>angular2</tag1>
<tag2>typescript</tag2>
<tag3>javascript</tag3>
</topusers>
<topusers>
<user>jezrael</user>
<link>http://www.stackoverflow.com//users/2901002/jezrael</link>
<location>Bratislava, Slovakia</location>
<year_rep>5,740</year_rep>
<total_rep>83,237</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>VonC</user>
<link>http://www.stackoverflow.com//users/6309/vonc</link>
<location>France</location>
<year_rep>5,577</year_rep>
<total_rep>651,397</total_rep>
<tag1>git</tag1>
<tag2>github</tag2>
<tag3>docker</tag3>
</topusers>
<topusers>
<user>Martijn Pieters</user>
<link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
<location>Cambridge, United Kingdom</location>
<year_rep>5,337</year_rep>
<total_rep>525,176</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>T.J. Crowder</user>
<link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
<location>United Kingdom</location>
<year_rep>5,258</year_rep>
<total_rep>508,310</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>akrun</user>
<link>http://www.stackoverflow.com//users/3732271/akrun</link>
<location></location>
<year_rep>5,188</year_rep>
<total_rep>229,553</total_rep>
<tag1>r</tag1>
<tag2>dplyr</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>Wiktor Stribi?ew</user>
<link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
<location>Warsaw, Poland</location>
<year_rep>4,948</year_rep>
<total_rep>158,134</total_rep>
<tag1>regex</tag1>
<tag2>javascript</tag2>
<tag3>c#</tag3>
</topusers>
<topusers>
<user>Darin Dimitrov</user>
<link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
<location>Sofia, Bulgaria</location>
<year_rep>4,936</year_rep>
<total_rep>709,683</total_rep>
<tag1>c#</tag1>
<tag2>asp.net-mvc</tag2>
<tag3>asp.net-mvc-3</tag3>
</topusers>
<topusers>
<user>Eric Duminil</user>
<link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
<location></location>
<year_rep>4,854</year_rep>
<total_rep>12,557</total_rep>
<tag1>ruby</tag1>
<tag2>ruby-on-rails</tag2>
<tag3>arrays</tag3>
</topusers>
<topusers>
<user>alecxe</user>
<link>http://www.stackoverflow.com//users/771848/alecxe</link>
<location>New York, United States</location>
<year_rep>4,723</year_rep>
<total_rep>233,368</total_rep>
<tag1>python</tag1>
<tag2>selenium</tag2>
<tag3>protractor</tag3>
</topusers>
<topusers>
<user>Jean-François Fabre</user>
<link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
<location>Toulouse, France</location>
<year_rep>4,526</year_rep>
<total_rep>30,027</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>piRSquared</user>
<link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
<location>Bellevue, WA, United States</location>
<year_rep>4,482</year_rep>
<total_rep>41,183</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>CommonsWare</user>
<link>http://www.stackoverflow.com//users/115145/commonsware</link>
<location>Who Wants to Know?</location>
<year_rep>4,475</year_rep>
<total_rep>616,135</total_rep>
<tag1>android</tag1>
<tag2>java</tag2>
<tag3>android-intent</tag3>
</topusers>
<topusers>
<user>Quentin</user>
<link>http://www.stackoverflow.com//users/19068/quentin</link>
<location>United Kingdom</location>
<year_rep>4,464</year_rep>
<total_rep>509,365</total_rep>
<tag1>javascript</tag1>
<tag2>html</tag2>
<tag3>css</tag3>
</topusers>
<topusers>
<user>Jon Skeet</user>
<link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
<location>Reading, United Kingdom</location>
<year_rep>4,348</year_rep>
<total_rep>921,690</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>.net</tag3>
</topusers>
<topusers>
<user>Felix Kling</user>
<link>http://www.stackoverflow.com//users/218196/felix-kling</link>
<location>Sunnyvale, CA</location>
<year_rep>4,324</year_rep>
<total_rep>411,535</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>asynchronous</tag3>
</topusers>
<topusers>
<user>matt</user>
<link>http://www.stackoverflow.com//users/341994/matt</link>
<location></location>
<year_rep>4,313</year_rep>
<total_rep>220,515</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>xcode</tag3>
</topusers>
<topusers>
<user>Psidom</user>
<link>http://www.stackoverflow.com//users/4983450/psidom</link>
<location>Atlanta, GA, United States</location>
<year_rep>4,236</year_rep>
<total_rep>36,950</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>r</tag3>
</topusers>
<topusers>
<user>Martin R</user>
<link>http://www.stackoverflow.com//users/1187415/martin-r</link>
<location>Germany</location>
<year_rep>4,195</year_rep>
<total_rep>269,380</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>swift3</tag3>
</topusers>
<topusers>
<user>Barmar</user>
<link>http://www.stackoverflow.com//users/1491895/barmar</link>
<location>Arlington, MA</location>
<year_rep>4,179</year_rep>
<total_rep>289,989</total_rep>
<tag1>javascript</tag1>
<tag2>php</tag2>
<tag3>jquery</tag3>
</topusers>
<topusers>
<user>Alexey Mezenin</user>
<link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
<location>??????</location>
<year_rep>4,142</year_rep>
<total_rep>31,602</total_rep>
<tag1>laravel</tag1>
<tag2>php</tag2>
<tag3>laravel-5.3</tag3>
</topusers>
<topusers>
<user>BalusC</user>
<link>http://www.stackoverflow.com//users/157882/balusc</link>
<location>Amsterdam, Netherlands</location>
<year_rep>4,046</year_rep>
<total_rep>703,046</total_rep>
<tag1>java</tag1>
<tag2>jsf</tag2>
<tag3>servlets</tag3>
</topusers>
<topusers>
<user>GurV</user>
<link>http://www.stackoverflow.com//users/6348498/gurv</link>
<location></location>
<year_rep>4,016</year_rep>
<total_rep>7,932</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>sql-server</tag3>
</topusers>
<topusers>
<user>Nina Scholz</user>
<link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
<location>Berlin, Deutschland</location>
<year_rep>3,950</year_rep>
<total_rep>61,135</total_rep>
<tag1>javascript</tag1>
<tag2>arrays</tag2>
<tag3>object</tag3>
</topusers>
<topusers>
<user>JB Nizet</user>
<link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
<location>Saint-Etienne, France</location>
<year_rep>3,923</year_rep>
<total_rep>418,780</total_rep>
<tag1>java</tag1>
<tag2>hibernate</tag2>
<tag3>java-8</tag3>
</topusers>
<topusers>
<user>Frank van Puffelen</user>
<link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
<location>San Francisco, CA</location>
<year_rep>3,920</year_rep>
<total_rep>86,520</total_rep>
<tag1>firebase</tag1>
<tag2>firebase-database</tag2>
<tag3>android</tag3>
</topusers>
<topusers>
<user>dasblinkenlight</user>
<link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
<location>United States</location>
<year_rep>3,886</year_rep>
<total_rep>475,813</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>Tim Biegeleisen</user>
<link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
<location>Singapore</location>
<year_rep>3,814</year_rep>
<total_rep>77,211</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>Greg Hewgill</user>
<link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
<location>Christchurch, New Zealand</location>
<year_rep>3,796</year_rep>
<total_rep>529,137</total_rep>
<tag1>git</tag1>
<tag2>python</tag2>
<tag3>git-pull</tag3>
</topusers>
<topusers>
<user>unutbu</user>
<link>http://www.stackoverflow.com//users/190597/unutbu</link>
<location></location>
<year_rep>3,735</year_rep>
<total_rep>401,595</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>numpy</tag3>
</topusers>
<topusers>
<user>Hans Passant</user>
<link>http://www.stackoverflow.com//users/17034/hans-passant</link>
<location>Madison, WI</location>
<year_rep>3,688</year_rep>
<total_rep>672,118</total_rep>
<tag1>c#</tag1>
<tag2>.net</tag2>
<tag3>winforms</tag3>
</topusers>
<topusers>
<user>Jonathan Leffler</user>
<link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
<location>California, USA</location>
<year_rep>3,649</year_rep>
<total_rep>455,157</total_rep>
<tag1>c</tag1>
<tag2>bash</tag2>
<tag3>unix</tag3>
</topusers>
<topusers>
<user>paxdiablo</user>
<link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
<location></location>
<year_rep>3,636</year_rep>
<total_rep>507,043</total_rep>
<tag1>c</tag1>
<tag2>c++</tag2>
<tag3>bash</tag3>
</topusers>
<topusers>
<user>Pranav C Balan</user>
<link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
<location>Ramanthali, Kannur, Kerala, India</location>
<year_rep>3,604</year_rep>
<total_rep>64,476</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>html</tag3>
</topusers>
<topusers>
<user>Suragch</user>
<link>http://www.stackoverflow.com//users/3681880/suragch</link>
<location>Hohhot, China</location>
<year_rep>3,580</year_rep>
<total_rep>71,032</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>android</tag3>
</topusers>
</stackoverflow>
Metode Python
import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet
def read_xml_iterfind():
tree = et.parse('Input.xml')
data = []
inner = {}
for el in tree.iterfind('./*'):
for i in el.iterfind('*'):
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_iterparse():
data = []
inner = {}
i = 1
for (ev, el) in et.iterparse(path):
if i <= 2:
first_tag = el.tag
if el.tag == first_tag and len(inner) != 0:
data.append(inner)
inner = {}
if el.text is not None and len(el.text.strip()) > 0:
inner[el.tag] = el.text
i += 1
df = pd.DataFrame(data)
def read_xml_lxml_xpath():
tree = lxet.parse('Input.xml')
data = []
inner = {}
for el in tree.xpath('/*/*'):
for i in el:
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_lxml_xsl():
xml = lxet.parse('Input.xml')
xslstr = '''
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>
<!-- HEADERS -->
<xsl:template match = "/*">
<xsl:for-each select="*[1]/*">
<xsl:value-of select="local-name()" />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>,</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
<!-- DATA ROWS (COMMA-SEPARATED) -->
<xsl:template match="/*/*" priority="2">
<xsl:for-each select="*">
<xsl:if test="position() = 1">
<xsl:text>"</xsl:text>
</xsl:if>
<xsl:value-of select="." />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>","</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>"
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>
</xsl:transform>
'''
xsl = lxet.fromstring(xslstr)
transform = lxet.XSLT(xsl)
newdom = transform(xml)
df = pd.read_csv(StringIO(str(newdom)))
Pengaturan waktu (dengan XML dan XML saat ini dengan 25 kali anak-anak (yaitu, 900 catatan pengguna StackOverflow)
# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop
# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop