
I want to work with the NSF Research Award Abstracts 1990-2003 Data Set in R. Every record of the data is stored in a separate text file, and the text files are grouped in folders (see data files here).

Is there any easy and straightforward way to load the NSF data in Python or R?

Hamideh
  • How much programming experience do you have with R? The way I would do it: iterate over the two directory levels with two for loops and read each file line by line. Merge any line that starts with a space with the previous line (adding a separator such as ,), then split each line at :. Write the data into one row or one column of a 2D array (or another 2D data structure); proceed with the next file the same way and write its data into the next row or column. I would suggest writing the data row-wise into the array (a rough Python sketch of this idea appears after these comments). – daniel.heydebreck Aug 04 '16 at 10:46
  • I don't know how similar the files are over the years, but this one example is begging for the data to be repackaged and put in a single sqlite3 database file for easier access. https://i.imgur.com/DTdDviN.png – philshem Mar 21 '17 at 12:15
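Below is a minimal sketch of the line-wise approach described in the first comment, written in Python for consistency with the answer that follows. The 'Key : value' layout with indented continuation lines is an assumption about the file format, and the directory path is a placeholder, so adjust both to match the actual files:

import os

def parse_award_file(path):
    # Parse one award file into a dict, assuming 'Key : value' lines
    # where continuation lines start with whitespace
    record = {}
    last_key = None
    with open(path, 'r', errors='replace') as fh:
        for line in fh:
            if line.startswith((' ', '\t')) and last_key is not None:
                # Continuation line: append it to the previous field
                record[last_key] += ' ' + line.strip()
            elif ':' in line:
                key, _, value = line.partition(':')
                last_key = key.strip()
                record[last_key] = value.strip()
    return record

# Walk both directory levels and collect one dict per text file
records = []
for root, dirs, files in os.walk('./Part1/'):
    for name in files:
        if name.endswith('.txt'):
            records.append(parse_award_file(os.path.join(root, name)))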

1 Answer


This is one way of working with the dataset in Python 3.5. The code and document are available here: https://github.com/HamidehIraj/ReadNSF

Import Libraries

import re
import os

Loading Data

indir = './Part1/'
# indir = './Part1-Complete/'

file_list = []
year_list = []
abstract_list = []

for root, dirs, filenames in os.walk(indir):
    for file in filenames:

        # Filtering text files
        if file.endswith('.txt'):

            # Reading the whole text file into one string for the regular expressions below
            with open(os.path.join(root, file), 'r') as log:
                raw = log.read()

            # Finding the abstract
            # Assuming that Abstract is the last item in all files
            pat_abstract = re.compile('Abstract.*', re.M | re.DOTALL)
            abstract = pat_abstract.findall(raw)

            # Converting the resulting list to a string
            abstract = ''.join(abstract)

#             print(abstract)
#             print(type(abstract))

            # Finding the term containing the start year
            # Assuming that both Start Date and Expires exist in all text files

            pat_year = re.compile('Start Date.*Expires', re.M | re.DOTALL)
            year_term = pat_year.findall(raw)

            # Converting the resulting list to a string
            year_term = ''.join(year_term)

            # Finding the start year; findall returns a list of all four-digit matches
            year = re.findall('[1-2][0-9][0-9][0-9]', year_term)

            # Converting the first match to an integer (None if no year was found)
            year = int(year[0]) if year else None

#             print(type(year))
#             print(year)

            # Creating lists for filename, year and abstract
            # filename is saved for reference
            file_list.append(file)
            year_list.append(year)
            abstract_list.append(abstract)

# print(file_list)
# print(year_list)
# print(abstract_list)

print('Number of Files')
print(len(file_list))
# print(len(year_list))
# print(len(abstract_list))

A Sample of the Data

print('FileName', 'Year', 'Abstract', sep='|')
print(file_list[0], year_list[0], abstract_list[0], sep='|')
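
If you prefer a single tabular structure over three parallel lists, they can be combined into a DataFrame. This is a minimal sketch and assumes pandas is installed (it is not used elsewhere in this answer); the variable and column names are arbitrary:

import pandas as pd

# Combine the parallel lists into one table
nsf_df = pd.DataFrame({'FileName': file_list,
                       'Year': year_list,
                       'Abstract': abstract_list})
print(nsf_df.head())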

Data Preparation

clean_list = []
punctuation = re.compile(r'[+-/\?!$%&,":;()<>@©*.|0-9]')

for element in abstract_list:

    # Remove the term "Abstract" from the abstracts
    element = element.replace('Abstract', ' ')

    # Convert to lowercase
    element = element.lower()

    # Remove punctuation and numbers from the abstracts
    element = punctuation.sub("", element)

    # Collapse multiple spaces into single spaces
    element = " ".join(element.split())

    # Add the cleaned abstract to the new list
    clean_list.append(element)

# Print the first cleaned abstract as a check
print(clean_list[0])

# Replace the original abstracts with the cleaned versions
abstract_list = clean_list
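
Following the sqlite3 suggestion in the comments on the question, the cleaned records can also be written to a single database file for easier re-use; this is a minimal sketch with an arbitrary file, table, and column naming:

import sqlite3

# Store filename, start year and cleaned abstract in one SQLite file
conn = sqlite3.connect('nsf_awards.db')
conn.execute('CREATE TABLE IF NOT EXISTS awards (filename TEXT, year INTEGER, abstract TEXT)')
conn.executemany('INSERT INTO awards VALUES (?, ?, ?)',
                 zip(file_list, year_list, abstract_list))
conn.commit()
conn.close()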
Hamideh