There is a similar question, but the output I am looking for is different.

I have a dataframe whose columns are words and whose rows are documents; each cell holds the number of times that word occurs in that document.

It looks like this:

{'orange': {0: '1',
            1: '3'},
 'blue': {0: '0',
          1: '2'}}

The output should "re-create" the original document as a bag of words in this way:

corpus = [
    ['orange'],
    ['orange', 'orange', 'orange', 'blue', 'blue']]

How can I do this?

Nick stands with Ukraine

1 Answer

If you want a DataFrame at the end, you could do:

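import pandas as pd
from collections import defaultdict

data = {'orange': {0: '1',
                   1: '3'},
        'blue': {0: '0',
                 1: '2'}}

# Map each row (document) index to its list of repeated words.
results = defaultdict(list)
for color, placement in data.items():
    for row, count in placement.items():
        # Counts are stored as strings, so convert before repeating.
        results[row].extend(int(count) * [color])

# One row per document; shorter rows are padded with missing values.
df = pd.DataFrame.from_dict(results, orient='index')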

If you want a list of lists instead, skip the DataFrame construction and use:

[v for row, v in results.items()]
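
If the counts are already in a DataFrame rather than a dict, a minimal sketch along the same lines might look like the following. It assumes the frame is called df (a name not given in the question) and that the counts may be stored as strings, as in the sample data:

```python
import pandas as pd

# Assumed input: the question's sample data loaded into a DataFrame named df.
df = pd.DataFrame({'orange': {0: '1', 1: '3'},
                   'blue': {0: '0', 1: '2'}})

# Convert the string counts to integers once, up front.
counts = df.astype(int)

# For each document (row), repeat each word (column name) by its count.
corpus = [
    [word for word, n in row.items() for _ in range(int(n))]
    for _, row in counts.iterrows()
]
```

Within each document the words come out in column order, and a word with a count of 0 is skipped.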

Steven G