
I have a dataframe that looks like the following:

    s1        s2       s3       s4
0   v1        v2       v3       v4
0   v5        v6       v7       np.nan
0   v8      np.nan     v9       np.nan
0   v10     np.nan     np.nan   np.nan

Essentially, reading each column from the top down, there are numerical values until, at some (different) row per column, the values switch to np.nan and stay that way.

I've used .apply(pd.Series.last_valid_index) to get the index of the last numerical value in each column; however, I'm not sure of the most efficient way to retrieve the actual value at that last valid index for each column.

Ideally I'd be able to derive a series that looks like:

   value
s1 v10
s2 v6
s3 v9
s4 v4

or as a dataframe that looks like

   s1 s2 s3 s4
0 v10 v6 v9 v4
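
For reference, a frame with this shape can be built along these lines (the numbers below are arbitrary stand-ins for v1 through v10, and the repeated 0 index mirrors the display above):

    import numpy as np
    import pandas as pd

    # arbitrary numbers standing in for v1..v10; each column switches to NaN at a different row
    df = pd.DataFrame({'s1': [1.0, 5.0, 8.0, 10.0],
                       's2': [2.0, 6.0, np.nan, np.nan],
                       's3': [3.0, 7.0, 9.0, np.nan],
                       's4': [4.0, np.nan, np.nan, np.nan]},
                      index=[0, 0, 0, 0])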

Many thanks!

wingsoficarus116

3 Answers


This is one way using NumPy indexing:

import numpy as np
import pandas as pd

# ensure index is normalised to the default 0..n-1 range
df = df.reset_index(drop=True)

# calculate last valid index across dataframe
idx = df.apply(pd.Series.last_valid_index)

# create result using NumPy indexing
res = pd.Series(df.values[idx, np.arange(df.shape[1])],
                index=df.columns,
                name='value')

print(res)

s1    v10
s2     v6
s3     v9
s4     v4
Name: value, dtype: object
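
The reset_index step matters here because last_valid_index returns index labels, while df.values[idx, np.arange(df.shape[1])] selects rows by position, so the labels need to line up with 0-based positions.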
jpp

Here is another way to do it, without resetting the index:

df.apply(lambda x: x[x.notnull()].values[-1])

s1    v10
s2     v6
s3     v9
s4     v4
sacuL
  • Nice solution +1, would you know if `last_valid_index` is more or less efficient than Boolean indexing via `pd.Series.notnull`? – jpp Jun 14 '18 at 16:00
  • Thanks, Not sure, but it should be easy to test... give me a second – sacuL Jun 14 '18 at 16:01
  • Based on a super crude test on a large df, I think that `last_valid_index` is actually quite a bit faster than `notnull()` – sacuL Jun 14 '18 at 16:09
  • Thanks for testing. Yeh I thought so, maybe `last_valid_index` goes backwards like a generator rather than testing each and every item. – jpp Jun 14 '18 at 16:10
  • `x[x.notnull()]` can be replaced with `x.dropna()`. – EliadL Jul 08 '21 at 09:46
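
To give the timing question in the comments something concrete, a crude test could look like the following; the frame size and where each column switches to NaN are assumptions, not taken from the thread:

from timeit import timeit

import numpy as np
import pandas as pd

# a large frame where each column switches to NaN at a different row,
# roughly mirroring the structure in the question
n = 10**5
df = pd.DataFrame(np.random.rand(n, 4), columns=['s1', 's2', 's3', 's4'])
df.iloc[n // 2:, 1] = np.nan       # s2 runs out halfway down
df.iloc[3 * n // 4:, 2] = np.nan   # s3 runs out three quarters of the way down
df.iloc[1:, 3] = np.nan            # s4 only keeps its first value

# last_valid_index-based lookup vs Boolean masking via notnull
print(timeit(lambda: df.apply(lambda x: x[x.last_valid_index()]), number=100))
print(timeit(lambda: df.apply(lambda x: x[x.notnull()].values[-1]), number=100))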

You need to normalize the index, find the last valid index per column, and build a dataframe from those values.

df = df.reset_index(drop=True)
ser = df.apply(lambda x: x.last_valid_index())
pd.DataFrame([df[col][ser[col]] for col in df.columns], index=df.columns).T

Output:

     s1 s2  s3  s4
0   v10 v6  v9  v4

Also, this way, your original df stays intact.

harvpan
  • Thanks for the quick reply! I tried this but I get the following error: TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:51
  • Are you sure, it is coming from my code? Which line? What version are you running for pandas? – harvpan Jun 14 '18 at 15:53
  • It seems the be the second line: Traceback (most recent call last): File "", line 26, in print(pd.DataFrame([df.reset_index()[col][ser[col]] for col in df.columns], index=df.columns).T) TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:56
  • `df.reset_index()[col].iloc[ser[col]]`, does that work for you? If so, I would update my answer. I am using '0.23.0' version for pandas. It works for me. Which version are you using? – harvpan Jun 14 '18 at 16:01
  • I figured it out, I had to drop columns where all the values were np.nan - then it works like a charm! Thank you very much! – wingsoficarus116 Jun 14 '18 at 16:04
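
For reference, the fix described in the last comment (dropping columns that are entirely NaN before the lookup) could look roughly like this; the exact call the asker used is not stated:

# drop all-NaN columns first, then proceed as in the answer
df = df.dropna(axis=1, how='all').reset_index(drop=True)
ser = df.apply(lambda x: x.last_valid_index())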