I'm having trouble parsing an XML file into a data frame in R.

I have the following XML:

<?xml version="1.0" encoding="windows-1251"?>
<dlc ac="ED29099541DB7B022D00E4179F00" softversion="0.2">
  <statistics enterprise="Организация">
  <shop Id="4" GUID="{F5D518E4-3C80-44E9-835B-D87CC35A7BDB}" 
worktimefrom="2015-04-03 08:00:00" worktimeto="2015-04-03 20:00:00" 
name="Объект" clientId="Client 1">
  <sensor GUID="{63017726-D121-4EB3-A684-BC3D27AED119}" GCGUID="00000000-
 0000-0000-0000-000000000000" Id="25" type="1" minortype="1" address="01" 
 name="Устройство" balance="0" devtype="1">
    <stat datetime="2017-01-20 09:37:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:38:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:39:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:40:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:41:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:42:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:43:00" realin="1" realout="1" />
    <stat datetime="2017-01-20 09:44:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:52:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:53:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:56:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:57:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:08:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:16:00" realin="0" realout="1" />
  </sensor>
</shop>

I need to parse it into a data frame in R. How do I do this?

rjdkolb

1 Answer

It's unclear exactly what you want in the data frame, but here is my solution:

First, the data:

file <- '
<?xml version="1.0" encoding="windows-1251"?>
 <dlc ac="ED29099541DB7B022D00E4179F00" softversion="0.2">
<statistics enterprise="Организация">
<shop Id="4" GUID="{F5D518E4-3C80-44E9-835B-D87CC35A7BDB}" 
worktimefrom="2015-04-03 08:00:00" worktimeto="2015-04-03 20:00:00" 
name="Объект" clientId="Client 1">
  <sensor GUID="{63017726-D121-4EB3-A684-BC3D27AED119}" GCGUID="00000000-
  0000-0000-0000-000000000000" Id="25" type="1" minortype="1" address="01" 
 name="Устройство" balance="0" devtype="1">
    <stat datetime="2017-01-20 09:37:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:38:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:39:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:40:00" realin="0" realout="1" />
   <stat datetime="2017-01-20 09:41:00" realin="1" realout="0" />
   <stat datetime="2017-01-20 09:42:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:43:00" realin="1" realout="1" />
    <stat datetime="2017-01-20 09:44:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:52:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:53:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:56:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:57:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:08:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:16:00" realin="0" realout="1" />
  </sensor>
</shop>'

Now we use rvest to extract the attributes of each `stat` node and put them in a data frame:

library(rvest)

# Parse the string and select every <stat> node
lines <- read_html(file) %>% html_nodes('stat')

# html_attr() returns each attribute as a character vector
time <- lines %>% html_attr('datetime')
realin <- lines %>% html_attr('realin')
realout <- lines %>% html_attr('realout')

df <- data.frame(time, realin, realout, stringsAsFactors = FALSE)

The result is:

> df

##                   time realin realout
## 1  2017-01-20 09:37:00      1       2
## 2  2017-01-20 09:38:00      1       2
## 3  2017-01-20 09:39:00      1       0
## 4  2017-01-20 09:40:00      0       1
## 5  2017-01-20 09:41:00      1       0
## 6  2017-01-20 09:42:00      1       0
## 7  2017-01-20 09:43:00      1       1
## 8  2017-01-20 09:44:00      0       1
## 9  2017-01-20 09:52:00      1       0
## 10 2017-01-20 09:53:00      0       1
## 11 2017-01-20 09:56:00      1       0
## 12 2017-01-20 09:57:00      0       1
## 13 2017-01-20 10:08:00      0       1
## 14 2017-01-20 10:16:00      0       1
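
Note that `html_attr()` returns character vectors, so all three columns of `df` are strings. If you need real date-times and integers, one possible follow-up is sketched below. The two-row `df` is only a stand-in for the one built above, and `tz = "UTC"` is an assumption, since the XML doesn't state a time zone.

```r
# Stand-in for the df built above (first two rows only); all columns are character
df <- data.frame(
  time    = c("2017-01-20 09:37:00", "2017-01-20 09:38:00"),
  realin  = c("1", "1"),
  realout = c("2", "2"),
  stringsAsFactors = FALSE
)

# Convert the strings to proper types; tz = "UTC" is an assumption
df$time    <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df$realin  <- as.integer(df$realin)
df$realout <- as.integer(df$realout)

# Inspect the resulting column types
str(df)
```

After this, sums and time-based filtering should work directly on the columns.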
Oriol Mirosa
  • Thanks, but maybe you know how to extract it after I produce some variable data? – Panchenko Andrey Aug 11 '17 at 18:04
  • I'm sorry, I don't understand what you mean in your comment. Can you clarify? – Oriol Mirosa Aug 11 '17 at 18:05
  • I have many XML files and I need to convert all of them into data frames, but none of the solutions work. And I don't want to convert them to text before building a data frame. – Panchenko Andrey Aug 11 '17 at 18:09
  • I don't see why my solution wouldn't work. You don't need to convert anything to text. I used the `file` object above because that's what you provided. If you have the names of the files that you need to parse, you can just read them with rvest directly. For instance: `file – Oriol Mirosa Aug 11 '17 at 18:25
  • I used this and it doesn't work: https://www.tutorialspoint.com/r/r_xml_files.htm – Panchenko Andrey Aug 11 '17 at 18:26
  • The solution I wrote works with the data you provided. If you need something different, you should be more specific and/or provide different data. – Oriol Mirosa Aug 11 '17 at 18:31
  • file % html_nodes('stat') Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')" – Panchenko Andrey Aug 11 '17 at 18:34
  • In what you wrote in the comments you are using `read_html()` twice, and that's why you see the error. Notice that in my code there is only one instance of `read_html()`. – Oriol Mirosa Aug 11 '17 at 18:44
  • It worked, thank you. Last question: how do I read the shop ID, i.e. `shop Id="4"`? – Panchenko Andrey Aug 11 '17 at 19:24
  • If what you want is to get the '4' and there is only one 'shop ID' in each file, then you can do this: `shopID % html_node('shop') %>% html_attr('id')`. If, instead, what you're looking for is the long GUID, then you can do this: `shopID % html_node('shop') %>% html_attr('guid')`. – Oriol Mirosa Aug 11 '17 at 19:32
  • Notice the pattern here: you read the file with `read_html()`, and then you extract the nodes with `html_node()` (specifying the node between the brackets), and the attributes with `html_attr()` (again, specifying the attribute of those nodes). Hope this makes sense. – Oriol Mirosa Aug 11 '17 at 19:33
  • Thank you so much. – Panchenko Andrey Aug 11 '17 at 19:37
  • No problem! Please mark the answer as correct so that others can find it. – Oriol Mirosa Aug 11 '17 at 19:42
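
Pulling the comment thread together, here is a minimal sketch of how the same pattern might be applied to many files at once, with the shop ID attached to every row. The helper name `parse_stats` and the folder name `xml_dir` are hypothetical and not part of the original answer. It also assumes each file contains a single `shop` node. As in the comments above, the attribute is requested as `'id'` because `read_html()` uses an HTML parser that lower-cases attribute names.

```r
library(rvest)

# Hypothetical helper: parse one file into a data frame of its <stat> rows,
# with the shop id repeated on every row. Assumes one <shop> node per file.
parse_stats <- function(path) {
  doc   <- read_html(path)            # read_html() is called exactly once
  stats <- html_nodes(doc, "stat")
  data.frame(
    shop_id = rep(html_attr(html_node(doc, "shop"), "id"), length(stats)),
    time    = html_attr(stats, "datetime"),
    realin  = html_attr(stats, "realin"),
    realout = html_attr(stats, "realout"),
    stringsAsFactors = FALSE
  )
}

# "xml_dir" is a hypothetical folder name; point it at your own files
paths <- list.files("xml_dir", pattern = "\\.xml$", full.names = TRUE)

# One data frame per file, stacked row-wise (NULL if no files are found)
all_stats <- do.call(rbind, lapply(paths, parse_stats))
```

Passing the file path straight to `read_html()` avoids pasting the XML into a string first, and calling it only once per file avoids the `no applicable method for 'read_xml'` error discussed above.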