
I have a dataframe that looks like the following:

    s1        s2       s3       s4
0   v1        v2       v3       v4
0   v5        v6       v7       np.nan
0   v8      np.nan     v9       np.nan
0   v10     np.nan     np.nan   np.nan

Essentially, reading each column from the top down, there are numerical values until, at some (different) row per column, the values switch to np.nan and stay that way.

I've used .apply(pd.Series.last_valid_index) to get the index of the last numerical value in each column; however, I'm not sure of the most efficient way to retrieve the actual value at that last valid index for each column.

Ideally I'd be able to derive a series that looks like:

   value
s1 v10
s2 v6
s3 v9
s4 v4

or as a dataframe that looks like

   s1 s2 s3 s4
0 v10 v6 v9 v4
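
For reference, a frame with this shape can be built along these lines (the numbers below are arbitrary stand-ins for v1 through v10, and the repeated 0 index mirrors the display above):

    import numpy as np
    import pandas as pd

    # arbitrary numbers standing in for v1..v10; each column switches to NaN at a different row
    df = pd.DataFrame({'s1': [1.0, 5.0, 8.0, 10.0],
                       's2': [2.0, 6.0, np.nan, np.nan],
                       's3': [3.0, 7.0, 9.0, np.nan],
                       's4': [4.0, np.nan, np.nan, np.nan]},
                      index=[0, 0, 0, 0])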

Many thanks!

wingsoficarus116

3 Answers


This is one way using NumPy indexing:

import numpy as np
import pandas as pd

# ensure index is normalised to the default 0..n-1 range
df = df.reset_index(drop=True)

# calculate last valid index across dataframe
idx = df.apply(pd.Series.last_valid_index)

# create result using NumPy indexing
res = pd.Series(df.values[idx, np.arange(df.shape[1])],
                index=df.columns,
                name='value')

print(res)

s1    v10
s2     v6
s3     v9
s4     v4
Name: value, dtype: object
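
The reset_index step matters here because last_valid_index returns index labels, while df.values[idx, np.arange(df.shape[1])] selects rows by position, so the labels need to line up with 0-based positions.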
jpp

Here is another way to do it, without resetting the index:

df.apply(lambda x: x[x.notnull()].values[-1])

s1    v10
s2     v6
s3     v9
s4     v4
sacuL
  • Nice solution +1, would you know if `last_valid_index` is more or less efficient than Boolean indexing via `pd.Series.notnull`? – jpp Jun 14 '18 at 16:00
  • Thanks, Not sure, but it should be easy to test... give me a second – sacuL Jun 14 '18 at 16:01
  • Based on a super crude test on a large df, I think that `last_valid_index` is actually quite a bit faster than `notnull()` – sacuL Jun 14 '18 at 16:09
  • Thanks for testing. Yeh I thought so, maybe `last_valid_index` goes backwards like a generator rather than testing each and every item. – jpp Jun 14 '18 at 16:10
  • `x[x.notnull()]` can be replaced with `x.dropna()`. – EliadL Jul 08 '21 at 09:46
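
To give the timing question in the comments something concrete, a crude test could look like the following; the frame size and where each column switches to NaN are assumptions, not taken from the thread:

from timeit import timeit

import numpy as np
import pandas as pd

# a large frame where each column switches to NaN at a different row,
# roughly mirroring the structure in the question
n = 10**5
df = pd.DataFrame(np.random.rand(n, 4), columns=['s1', 's2', 's3', 's4'])
df.iloc[n // 2:, 1] = np.nan       # s2 runs out halfway down
df.iloc[3 * n // 4:, 2] = np.nan   # s3 runs out three quarters of the way down
df.iloc[1:, 3] = np.nan            # s4 only keeps its first value

# last_valid_index-based lookup vs Boolean masking via notnull
print(timeit(lambda: df.apply(lambda x: x[x.last_valid_index()]), number=100))
print(timeit(lambda: df.apply(lambda x: x[x.notnull()].values[-1]), number=100))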

You need to normalize the index, find the last valid index per column, and build a dataframe from those values.

df = df.reset_index(drop=True)
ser = df.apply(lambda x: x.last_valid_index())
pd.DataFrame([df[col][ser[col]] for col in df.columns], index=df.columns).T

Output:

     s1 s2  s3  s4
0   v10 v6  v9  v4

Also, this way, your original df stays intact.

harvpan
  • Thanks for the quick reply! I tried this but I get the following error: TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:51
  • Are you sure, it is coming from my code? Which line? What version are you running for pandas? – harvpan Jun 14 '18 at 15:53
  • It seems the be the second line: Traceback (most recent call last): File "", line 26, in print(pd.DataFrame([df.reset_index()[col][ser[col]] for col in df.columns], index=df.columns).T) TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:56
  • `df.reset_index()[col].iloc[ser[col]]`, does that work for you? If so, I would update my answer. I am using '0.23.0' version for pandas. It works for me. Which version are you using? – harvpan Jun 14 '18 at 16:01
  • I figured it out, I had to drop columns where all the values were np.nan - then it works like a charm! Thank you very much! – wingsoficarus116 Jun 14 '18 at 16:04
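
For reference, the fix described in the last comment (dropping columns that are entirely NaN before the lookup) could look roughly like this; the exact call the asker used is not stated:

# drop all-NaN columns first, then proceed as in the answer
df = df.dropna(axis=1, how='all').reset_index(drop=True)
ser = df.apply(lambda x: x.last_valid_index())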