
I have two dataframes with Reddit data.

The first dataframe (df1) has the baseline posts. Each row has the user account name (author) and the time of the post (created_utc), along with other fields such as the body of the post, id, permalink, etc.

The second dataframe (df2) contains all posts made during a 4-month period, but only by the authors who appear in the first dataframe.

I am trying to build a dataframe that keeps, for each author, only the df2 posts made within 30 days before or 30 days after that author's baseline post in df1.

For example, if "user1" posted on Sept 13, I want all of that user's posts from Aug 13 to Oct 13. If "user2" posted on Oct 10, I want all of that user's posts from Sept 10 to Nov 10.
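
To make the window concrete, here is a minimal sketch of how it could be computed with pandas. It assumes created_utc holds Unix epoch seconds (common for Reddit data, but an assumption here), and the toy frame and the year 2021 are made up for illustration. Note that a fixed 30-day offset gives Aug 14 to Oct 13 for a Sept 13 post, which is slightly different from the calendar-month dates in the example above.

```python
import pandas as pd

# Toy baseline frame with made-up values; created_utc is assumed to be epoch seconds.
df1 = pd.DataFrame({
    "author": ["user1", "user2"],
    "created_utc": [
        int(pd.Timestamp("2021-09-13", tz="UTC").timestamp()),
        int(pd.Timestamp("2021-10-10", tz="UTC").timestamp()),
    ],
})

# Convert epoch seconds to timezone-aware timestamps.
df1["created_dt"] = pd.to_datetime(df1["created_utc"], unit="s", utc=True)

# Fixed 30-day offset on either side of the baseline post.
window = pd.Timedelta(days=30)
df1["thirty_days_before"] = df1["created_dt"] - window
df1["thirty_days_after"] = df1["created_dt"] + window

print(df1[["author", "thirty_days_before", "thirty_days_after"]])
```

If calendar months are what is wanted instead, pd.DateOffset(months=1) could be swapped in for the Timedelta.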

I already have the 30-days-before and 30-days-after times for each unique user, but I do not know how to go through each row of df2 and delete the row if it falls outside that user's window. There are about 500k rows of comments, so I really can't do this manually.
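
One way the row filtering described above might be done without a Python loop is to merge each author's baseline time onto df2 and keep the rows whose distance from the baseline is at most 30 days. This is only a sketch under assumptions: created_utc is taken to be epoch seconds in both frames, each author is taken to have a single baseline post (the earliest is kept otherwise), and filter_to_window, baseline_utc and the toy data are hypothetical names and values introduced for illustration.

```python
import pandas as pd

DAY = 24 * 60 * 60  # seconds per day


def filter_to_window(df1, df2, days=30):
    """Keep rows of df2 within +/- `days` of the same author's baseline post in df1."""
    # Assumption: one baseline per author; keep the earliest if there are several.
    base = (
        df1.sort_values("created_utc")
        .drop_duplicates("author")
        .loc[:, ["author", "created_utc"]]
        .rename(columns={"created_utc": "baseline_utc"})
    )
    # Inner merge attaches the baseline time to every df2 row of a known author.
    merged = df2.merge(base, on="author", how="inner")
    # Assumption: created_utc is epoch seconds, so the difference is in seconds.
    delta = merged["created_utc"] - merged["baseline_utc"]
    return merged[delta.abs() <= days * DAY].reset_index(drop=True)


# Tiny made-up example: user1's baseline is at "day 100".
df1 = pd.DataFrame({"author": ["user1"], "created_utc": [100 * DAY]})
df2 = pd.DataFrame({
    "author": ["user1", "user1", "user1", "user1", "user3"],
    "created_utc": [80 * DAY, 69 * DAY, 130 * DAY, 131 * DAY, 100 * DAY],
})

result = filter_to_window(df1, df2)
print(result)
```

Because the comparison is vectorized over the merged frame, this should scale to 500k rows far better than iterating row by row.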

Any help or direction toward getting this data cleaned would be greatly appreciated.

I have tried something like this:

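import csv

# Look-up table of each author's window. thirty_days_before / thirty_days_after
# are already computed in df1 and must use the same units as created_utc.
# Assumes one baseline row per author (the first is kept otherwise).
windows = df1.drop_duplicates("author").set_index("author")

counter = 0
before_30_count = 0  # running totals across all kept rows
after_30_count = 0

# create a .csv with headers ("w" so the header is only written once)
with open("reddit_comments.csv", "w", newline="", encoding="utf-8") as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(["author","created_utc","body","controversiality","post_id","permalink","score","subreddit","before_30_count", "after_30_count"])

    # iterating through 500k lines of reddit comments
    for index, row in df2.iterrows():
        author_id = row["author"]
        created_at = row["created_utc"]
        body_id = row["body"]
        controversiality_id = row["controversiality"]
        post_id = row["id"]
        permalink = row["permalink"]
        score = row["score"]
        subreddit = row["subreddit"]

        # `author_id in df1["author"]` would test the index labels, not the values
        if author_id not in windows.index:
            continue
        baseline = windows.loc[author_id]

        # compare against this author's own window, not the whole df1 column
        if baseline["created_utc"] <= created_at <= baseline["thirty_days_after"]:
            after_30_count += 1
        elif baseline["thirty_days_before"] <= created_at < baseline["created_utc"]:
            before_30_count += 1
        else:
            continue  # outside the window, so skip the row

        # same column order as the header row
        res = [author_id, created_at, body_id, controversiality_id, post_id, permalink, score, subreddit, before_30_count, after_30_count]
        csvWriter.writerow(res)
        counter += 1

print(counter)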
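
If the before_30_count and after_30_count columns are meant to be per-author totals, a groupby could produce them directly. The following is a sketch on made-up data, again assuming created_utc is epoch seconds and one baseline post per author; baseline_utc and side are hypothetical helper columns, and a post made exactly at the baseline time is counted as "after".

```python
import numpy as np
import pandas as pd

DAY = 24 * 60 * 60  # seconds per day

# Made-up baselines and comments for illustration.
df1 = pd.DataFrame({"author": ["user1", "user2"],
                    "created_utc": [100 * DAY, 200 * DAY]})
df2 = pd.DataFrame({
    "author": ["user1", "user1", "user1", "user2", "user2"],
    "created_utc": [90 * DAY, 110 * DAY, 120 * DAY, 185 * DAY, 240 * DAY],
})

# Attach each author's baseline time to their comments.
merged = df2.merge(df1.rename(columns={"created_utc": "baseline_utc"}), on="author")

# Keep only comments within 30 days of the baseline.
delta = merged["created_utc"] - merged["baseline_utc"]
in_window = merged[delta.abs() <= 30 * DAY].copy()

# Label each kept comment by which side of the baseline it falls on.
in_window["side"] = np.where(
    in_window["created_utc"] < in_window["baseline_utc"],
    "before_30_count",
    "after_30_count",
)

# One row per author, one column per side; missing combinations become 0.
counts = in_window.groupby(["author", "side"]).size().unstack(fill_value=0)
print(counts)
```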
  • Please provide a [mre](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)). – Timus Apr 25 '22 at 09:31
