3

I am facing a difficult challenge, given my very low skills at text mining… Basically, I have a list of approx. 200 individuals described in a plain text file following a simple structure:

N: (name)
Y: (year of birth)
S: (sibling)

N: (name)
Y: (year of birth)
S: (sibling)

[etc.]

Here's the catch:

  • each individual can have missing data, in which case the line is omitted;
  • multiple siblings, in which case his/her entry has more than one S: line.

The only real constants in the text file are:

  • each individual is separated by a blank file from each other;
  • all lines feature the appropriate N:, Y: or S: prefix descriptor.

Is there a method, using pretty much anything from Excel (!) to Stata to R or even Google Refine or Wrangler, to turn that organised chaos into a standard dataset, by which I mean, one column per descriptor, with S1, S2... Sn columns for siblings?

If that is the wrong forum, please redirect me. I just figured that the stats crew would be the most acquainted with text mining.

Fr.
  • 1,453

2 Answers2

3

Load data

Assuming fd.txt contains the following

N: toto
Y: 2000
S: tata

N: titi
Y: 2004
S: tutu

N: toto
Y: 2000
S: tata2

N: toto
Y: 2000
S: tata3

N: tete
Y: 2002
S: tyty

N: tete
Y: 2002
S: tyty2

here is one solution in R:

tmp <- scan("fd.txt", what="character")
res <- data.frame(matrix(tmp[seq(2, length(tmp), by=2)], nc=3, byrow=TRUE))

The first command read everything a single vector of character, skipping blank lines; then we remove every odd elements ("N:", "S:", "Y:"); finally, we arrange them in a data.frame (this a convenient way to make each column a factor).

The output is

    X1   X2   X3
1 toto 2000 tata
2 titi 2004 tutu
3 toto 2000 tata2
4 toto 2000 tata3
5 tete 2002  tyty
6 tete 2002 tyty2

Please note that if you have some GNU utilities on your machine, you can use awk

sed 's/[NYS]: //' fd.txt | awk 'ORS=(FNR%4)?FS:RS' > res.txt

The first command uses sed to filter the descriptor (replace by blank); then awk will produce its output (Output Record Separator) as follows: arrange each record using default Field Separator (space), and put a new Record Separator (new line) every 4 fields. Of note, we could filter the data using awk directly, but I like separating tasks a little.

The result is written in res.txt and can be imported into R using read.table():

toto 2000 tata 
titi 2004 tutu 
toto 2000 tata2 
toto 2000 tata3 
tete 2002 tyty 
tete 2002 tyty2 

Process and transform data

I didn't find a very elegant solution in R, but the following works:

library(plyr)
tmp <- ddply(res, .(X1,X2), mutate, S=list(X3))[,-3]
resf <- tmp[!duplicated(tmp[,1:2]),]

Then, resf has three columns, where column S list the levels of the X3 factor (siblings' name). So, instead of putting siblings in different columns, I concatenated them in a list. In other words,

as.character(resf$S[[1]])

gives you the name of tete's siblings, which are tyty and tyty2.

I'm pretty sure there's a better way to do this with plyr, but I didn't manage to get a nice solution for the moment.


With repeated "S:" motif, here is one possible quick and dirty solution. Say fd.txt now reads

N: toto
Y: 2000
S: tata
S: tata2
S: tata3

N: titi
Y: 2004
S: tutu

N: tete
Y: 2002
S: tyty
S: tyty2

then,

tmp <- read.table("fd.txt")
tmp$V1 <- gsub(":","",tmp$V1)
start <- c(which(tmp$V1=="N"), nrow(tmp)+1)
max.col <- max(diff(start))
res <- matrix(nr=length(start)-1, nc=max.col)
for (i in 1:(length(start)-1))
  res[i,1:diff(start[i:(i+1)])] <- t(tmp[start[i]:(start[i+1]-1),])[2,]
res <- data.frame(res)
colnames(res) <- c("name","year",paste("S",1:(max.col-2),sep=""))

will produce

  name year   S1    S2    S3
1 toto 2000 tata tata2 tata3
2 titi 2004 tutu  <NA>  <NA>
3 tete 2002 tyty tyty2  <NA>  
chl
  • 53,725
  • Thans for your help, especially on building the original matrix. Unfortunately, it fails because your example uses fixed width entries (NYS-NYS-NYS) while the actual file has entries with multiple siblings (NYS-NYSSS-NYS). Yes, I know… – Fr. Jun 04 '11 at 11:09
  • @Fr. I added a PS (well, it's so ugly code that I have no doubt someone will come with a better one :-) – chl Jun 04 '11 at 14:14
  • Dirty perhaps, but it worked alright! Thanks a thousand :) – Fr. Jun 04 '11 at 15:22
1

Interesting question...here's a Stata-based solution (that doesnt require OS utilities like awk). My example uses the data example contributed by chl, except that I added in the data problems described by Fr. (missing years, multiple siblings, etc).

Run the code below from your do-file editor. Note that there are some comments explaining alternatives for handling multiple siblings by reshaping the data wide.

**********************! Begin example
clear

**input dataset:
input str20(v1)
""
"N: toto"
"Y: 2000"
"S: tata"
""
"N: titi"
"Y: 2004"
"S: tutu"
""
"N: toto"
"Y: 2000"
"S: tata2"
""
"N: toto"
"Y: 2000"
"S: tata3"
""
"N: tete"
"Y: 2002"
"S: tyty"
""
"N: tete"
"Y: 2002"
"S: tyty2"
"S: tyty99"
""
"N: tete2"
"S: tyty22"
""
"N: tete3"
"Y: 2004"
"S: tytya"
"S: tytyb"
"S: tytyc"
end

**parse data/v1
split v1, parse(": ")
l in 1/7
drop v1

**create panel id
g id = _n if mi(v12)
replace id = id[_n-1] ///
     if mi(id) & !mi(v12)
drop if mi(v12)
//get max value for later//
bys id: g i = _n
qui sum i
loc max `r(max)'
rename v12 attribute
l in 1/6

**reshape data
foreach x in N Y S {
g `x' = attribute if v11=="`x'" ///
    & !mi(attribute)
    forval n = 1/`=`max'-1' {
    foreach s in + - {
bys id: replace `x' = `x'[_n`s'`n'] if !mi(`x'[_n`s'`n']) /// 
    & mi(`x')
    } //end x.loop
  } //end n.loop
} //end s.loop

drop v11 attribute i
duplicates drop

**if you want multiple siblings on one line, reshape wide:
bys id N Y: g j = _n
reshape wide S, i(id) j(j)
order id N Y
li

**********************! End example
suncoolsu
  • 6,622