
I am trying to split up a data frame that has lists in each row of some columns. There are 3 columns: jobId (which is unique), skills, and skillTypeId.

I am hoping to create two new columns that split the vectors in "skills" and "skillTypeId" into one element per row, with each skill matched to its corresponding skillTypeId. For example:

[Images: the original data frame, and the data frame after separation]

Currently, I have managed to separate them by creating one data frame for "skills" and another for "skillTypeId". The "skills" data frame contains just jobId and skills; the "skillTypeId" data frame contains just jobId and skillTypeId. I then use separate_rows on each, and finally use cbind to merge the two data frames back together.

However, one problem arises: the two data frames end up with different numbers of rows (they differ by 100+ rows out of the million rows), and I have too much data to troubleshoot which rows went wrong.
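One way to locate the problem rows without inspecting the data by hand is to count the elements in each cell before splitting. The following is only a minimal sketch in base R: the toy data is hypothetical, and it assumes both columns hold comma-separated strings (the real delimiter or storage format may differ).

```r
# Hypothetical toy data; ASSUMES skills and skillTypeId are comma-separated strings.
df <- data.frame(
  jobId = c(1, 2, 3),
  skills = c("microsoft excel,product development", "sql", "python,r,git"),
  skillTypeId = c("2,2", "1", "3,3"),  # job 3 has 3 skills but only 2 type ids
  stringsAsFactors = FALSE
)

# Count the elements in each cell after splitting on literal commas.
n_skills <- lengths(strsplit(df$skills, ",", fixed = TRUE))
n_types  <- lengths(strsplit(df$skillTypeId, ",", fixed = TRUE))

# Rows where the two counts disagree are the ones that would misalign.
mismatched <- df[n_skills != n_types, ]
print(mismatched$jobId)  # job 3 in this toy data
```

On the real data, `mismatched` would list only the jobIds whose two cells have different lengths, which should be a much smaller set to inspect than the full table.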

I understand that my approach is rather manual, so I am hoping to get some help in making it less manual and, most importantly, in making sure no rows go missing.
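For reference, `tidyr::separate_rows` accepts several columns in one call, which avoids building two data frames and joining them with cbind. This is a hedged sketch rather than a confirmed fix: the toy data is hypothetical, and it assumes comma-separated strings with the same number of elements in both columns for every row.

```r
library(tidyr)

# Hypothetical toy data; ASSUMES comma-separated strings of equal length per row.
df <- data.frame(
  jobId = c(1, 2),
  skills = c("microsoft excel,product development", "sql"),
  skillTypeId = c("2,2", "1"),
  stringsAsFactors = FALSE
)

# Split both columns together so each skill stays beside its skillTypeId.
# sep is set explicitly so that only commas act as delimiters.
long <- separate_rows(df, skills, skillTypeId, sep = ",")
print(long)
```

Setting `sep` explicitly matters here: the default `sep` of `separate_rows` is the regular expression `"[^[:alnum:].]+"`, which also splits on spaces, so a skill such as "microsoft excel" would be broken into two rows. That could be one source of a row-count difference between the two columns.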

ThomasIsCoding
Koh
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Jul 22 '20 at 09:31
  • What do you mean by different number of entries? Is it that the value of `skill` for a certain `jobID` is of varying length? – Romain Jul 22 '20 at 09:40
  • See https://stackoverflow.com/questions/15347282/split-delimited-strings-in-a-column-and-insert-as-new-rows and https://stackoverflow.com/questions/26194298/unlist-data-frame-column-preserving-information-from-other-column . – Ronak Shah Jul 22 '20 at 09:55
  • @Romain yup! After splitting them into 2 data frames, they should still have the same number of entries after separate_rows(). The first data frame has jobId & skill, the second has jobId & skillTypeId. In my example above, "microsoft excel" is of skillTypeId 2, and "product development" is of skillTypeId 2 as well. Each skill belongs to a skillTypeId. In my data, after unlisting each row, the length of skill should thus equal the length of skillTypeId. But somehow it didn't, so I suspect separate_rows() removed some entries that should have been kept. – Koh Jul 24 '20 at 01:44

0 Answers