
I have a question regarding a data science approach I'm currently working on.

I have two-dimensional data (x, y). Each data entry also carries the date (t) on which it was recorded, so each data point has the attributes (x, y, t).

Now I want to run a linear regression between x and y, but with t taken into account so that older data points are less decisive than more recent ones. Each data point would be rated by its date and enter the regression with a weight based on that rating. In other words, newer data would be more important and would influence the regression more strongly than older data.

I've been looking around for a couple of hours now, but I haven't found a suitable solution yet. Do you guys know a fitting methodology for this and a way to implement it in Python?

Cheers and thank you!

Soda

1 Answer


If the date is in string format, try this approach.

Parse the dates, currently coded as strings, into datetime format

import pandas as pd
import seaborn as sns

raw_df['Date'] = pd.to_datetime(raw_df['Date'])

Extract year from date

raw_df['Year'] = raw_df['Date'].dt.year

raw_df['Year'].head()

Extract month from date

raw_df['Month'] = raw_df['Date'].dt.month

raw_df['Month'].head()

Extract day from date

raw_df['Day'] = raw_df['Date'].dt.day

raw_df['Day'].head()

Plot the distribution of observations per year

sns.countplot(x=raw_df['Date'].dt.year)

Drop the original Date variable.

  • Thank you for your comment! However, that's not exactly what I meant. I'd like to know whether there is a methodology worth implementing in Python where the observations (x, y) for my linear regression are weighted down more the older they are. So basically, newer data would be more important and would influence the linear regression more strongly. – Soda Aug 29 '21 at 14:53
  • If nothing else you could probably represent dates as number of days (weeks? years?) since a reference date, and use some function of that to weight cases. See [this previous answer](https://stackoverflow.com/a/40217971/16327476). As a next step, maybe add some info to the question showing a few representative rows of data and some code that does the regression you want except without taking the date into account. Also, it might help to describe the range of dates you are interested in. Do they span a month, a year, a millennium, or what? – TMBailey Aug 30 '21 at 07:49
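
The idea in the comment above (turn each date into an age, then weight each case by some function of that age) could be sketched roughly as follows. This is only a minimal illustration, not a definitive method: the function name `time_weighted_linear_fit`, the exponential decay, the `half_life_days` parameter, and the sample data are all assumptions made up for the example. An observation that is `half_life_days` old counts half as much as the newest one.

```python
import numpy as np
import pandas as pd


def time_weighted_linear_fit(x, y, t, half_life_days=180.0):
    """Fit y = slope * x + intercept, down-weighting older observations.

    Hypothetical helper: weights decay exponentially with age, measured in
    days relative to the most recent date in t.
    """
    dates = pd.to_datetime(pd.Series(t))
    age_days = (dates.max() - dates).dt.total_seconds() / 86400.0
    weights = 0.5 ** (age_days.to_numpy() / half_life_days)

    # np.polyfit minimises sum((w_i * residual_i) ** 2), so pass the square
    # root of the weights to minimise sum(weights_i * residual_i ** 2).
    slope, intercept = np.polyfit(
        np.asarray(x, dtype=float),
        np.asarray(y, dtype=float),
        deg=1,
        w=np.sqrt(weights),
    )
    return slope, intercept, weights


# Made-up data: the old points follow y = x, the recent ones follow y = 3x.
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [1, 2, 3, 4, 3, 6, 9, 12]
t = ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
     "2021-08-26", "2021-08-27", "2021-08-28", "2021-08-29"]

slope, intercept, weights = time_weighted_linear_fit(x, y, t, half_life_days=30.0)
plain_slope, plain_intercept = np.polyfit(x, y, deg=1)  # ignores the dates

print("weighted slope:", slope, "unweighted slope:", plain_slope)
```

With a short half-life the weighted slope is pulled towards the recent points, while the unweighted fit averages both groups. The same weights can also be passed as `sample_weight` to scikit-learn's `LinearRegression.fit`, which takes the weights directly rather than their square roots. Which decay function and half-life make sense depends on the range of dates in the data.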