Good morning to everyone (assuming it is morning wherever you are). I am honoured to write on this wonderful site; this is my first question, and I hope you'll find it interesting. I have a series from which I want to extract rows for another DataFrame by slicing it. Here is some context: the series represents a process variable generated by a machine in a lane. I have the times at which sections of material pass under this machine, respectively the arrival and exit times, contained in two other series.
- I'm going to use these signals in a ML project I'm trying to develop, and the algorithm is supervised, so I have the labels. I thought of putting each process variable in a separate DataFrame and eventually doing a join later, instead of having an array in a column field, since I read that pandas is faster when every column of a DataFrame holds only scalars. Is this correct, or does it make no difference when creating a DataFrame?
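To illustrate what I mean by keeping each variable in its own frame and joining later, here is a toy sketch (the variable names and values are made up, not my real data):

```python
import pandas as pd

# hypothetical per-variable frames sharing the same key
df_temp = pd.DataFrame({'temp': [1.0, 2.0]}, index=['a', 'b'])
df_speed = pd.DataFrame({'speed': [10.0, 20.0]}, index=['a', 'b'])

# join aligns on the index, so every column stays scalar
df_all = df_temp.join(df_speed)
```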
- Assuming that my previous reasoning is correct, I have the three series tarr, tend and s, similar to the ones below:
import pandas as pd
import numpy as np
TCamp = np.timedelta64(250,'ms')
idx = pd.MultiIndex.from_tuples([(464288, 55, 0),(464288, 55, 1),(464288, 55, 2),(464288, 55, 3),(464288, 55, 4), (464288, 55, 5),(464288, 55, 6),(464288, 55, 7),(464288, 55, 8),(464288, 55, 9)], names=['Colata(Par)', 'Slab', 'Slice'])
tarr = pd.Series(np.array(['2021-11-20T01:37:22.090000000','2021-11-20T01:37:14.090000000','2021-11-20T01:37:06.340000000','2021-11-20T01:36:58.340000000','2021-11-20T01:36:50.590000000','2021-11-20T01:36:42.840000000','2021-11-20T01:36:34.840000000','2021-11-20T01:36:27.090000000','2021-11-20T01:36:19.090000000','2021-11-20T01:36:11.090000000'],dtype='datetime64[ns]'),index=idx)
tend = pd.Series(np.array(['2021-11-20T01:37:29.590000000','2021-11-20T01:37:21.840000000','2021-11-20T01:37:13.840000000','2021-11-20T01:37:06.090000000','2021-11-20T01:36:58.090000000','2021-11-20T01:36:50.340000000','2021-11-20T01:36:42.590000000','2021-11-20T01:36:34.590000000','2021-11-20T01:36:26.840000000','2021-11-20T01:36:18.840000000'],dtype='datetime64[ns]'),index=idx)
s = pd.Series(np.random.uniform(1.4,1.5,315), index=[tarr.values[-1]+i*TCamp for i in range(315)])
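As a quick check on how the slicing I want behaves, a toy example (with made-up values) of label slicing on a datetime index, which includes both endpoints:

```python
import numpy as np
import pandas as pd

step = np.timedelta64(250, 'ms')
t0 = np.datetime64('2021-11-20T01:36:11.090')
toy = pd.Series(range(10), index=[t0 + i * step for i in range(10)])

# label-based slicing on a datetime index includes both endpoints
chunk = toy[t0 + 2 * step : t0 + 5 * step]
```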
I want to create, assuming it's the best option, a DataFrame whose index is taken from tarr (remembering it's the same index as tend) and whose values are the slices of s taken from tarr to tend. I wrote this code:
idx = tarr.index
# convert to NumPy arrays, hoping positional access beats label lookups
tarr = tarr.values
tend = tend.values
# build one single-row DataFrame per slice, then stack them
df = pd.concat([pd.DataFrame([s[x:y].values], index=[idx[i]]) for i,(x,y) in enumerate(zip(tarr, tend))])
Maybe it's silly, but I first converted the series to NumPy arrays, hoping this would speed up the computation.
Out[101]:
0 1 2 3 4 5 6 7 8 ... 23 24 25 26 27 28 29 30 31
(464288, 55, 0) 1.440785 1.416969 1.422908 1.442870 1.436515 1.414262 1.474375 1.426283 1.460844 ... 1.445614 1.455839 1.434067 1.428702 1.456841 1.406867 1.416680 1.498526 NaN
(464288, 55, 1) 1.424743 1.491728 1.441107 1.468172 1.428047 1.425284 1.442666 1.408087 1.492354 ... 1.449526 1.421042 1.442521 1.466031 1.429108 1.462252 1.439046 1.421225 1.486515
(464288, 55, 2) 1.475340 1.465274 1.494914 1.401652 1.453811 1.419799 1.436432 1.475635 1.412079 ... 1.476395 1.404988 1.470828 1.472004 1.494946 1.468327 1.423555 1.406835 NaN
(464288, 55, 3) 1.475291 1.431561 1.495104 1.460506 1.421245 1.411963 1.411887 1.468771 1.471905 ... 1.419006 1.476418 1.427846 1.426180 1.476835 1.499209 1.472152 1.443316 1.449456
(464288, 55, 4) 1.423420 1.429175 1.440244 1.476869 1.442356 1.459726 1.475783 1.453469 1.428708 ... 1.494100 1.409971 1.412222 1.446604 1.453895 1.472602 1.408404 1.406487 NaN
(464288, 55, 5) 1.462259 1.422866 1.428152 1.449186 1.414712 1.498510 1.495371 1.416978 1.432208 ... 1.486691 1.420766 1.409290 1.422413 1.495155 1.472826 1.407366 1.418593 NaN
(464288, 55, 6) 1.428384 1.449596 1.477857 1.407660 1.435210 1.491050 1.458693 1.492235 1.483423 ... 1.477553 1.447089 1.429405 1.433497 1.484601 1.485276 1.430502 1.464646 1.449117
(464288, 55, 7) 1.486584 1.405553 1.446462 1.468316 1.499108 1.405232 1.410002 1.450734 1.467617 ... 1.493026 1.431031 1.462516 1.479795 1.470909 1.435462 1.412458 1.488296 NaN
(464288, 55, 8) 1.430702 1.455248 1.485239 1.451106 1.452792 1.416129 1.468898 1.487687 1.454509 ... 1.423530 1.469781 1.430409 1.498169 1.467050 1.428302 1.489567 1.416920 1.405749
(464288, 55, 9) 1.498188 1.458571 1.476445 1.491516 1.479044 1.477391 1.455229 1.403565 1.483421 ... 1.445026 1.416440 1.453774 1.406295 1.499867 1.486285 1.437962 1.446287 1.404403
The point is that the series tarr and tend have about 6-7k rows (while the s series has about 70k). I measured with the debugger that processing each element of the list comprehension takes about 2 seconds, which means a very long wait for a single signal. Is there any other way to do the above operation that is faster? Should I think about writing a multithreaded model, like the map function in the map-reduce approach (even though I have no real experience in this field), or is there a way to vectorize the problem? Thank you all for the help.
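For context, one direction I tried to sketch (not sure it is correct or actually faster) is mapping the timestamps to integer positions with np.searchsorted and then slicing the underlying NumPy array directly instead of label-slicing the Series; the t0, tarr and tend below are toy values, not my real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
TCamp = np.timedelta64(250, 'ms')
t0 = np.datetime64('2021-11-20T01:36:11.090')

# toy signal with a fixed sampling period, like my s
s = pd.Series(rng.uniform(1.4, 1.5, 315),
              index=[t0 + i * TCamp for i in range(315)])

# toy arrival/exit times (two slices instead of 6-7k)
tarr = np.array([t0, t0 + 32 * TCamp])
tend = np.array([t0 + 30 * TCamp, t0 + 64 * TCamp])

# translate timestamps into integer positions in s's index;
# side='right' on the exit time keeps the last sample,
# mimicking the inclusive label slicing s[x:y]
starts = np.searchsorted(s.index.values, tarr, side='left')
stops = np.searchsorted(s.index.values, tend, side='right')

# slice the raw NumPy array by position (no label lookups),
# then let pandas pad the ragged rows with NaN
rows = [s.values[a:b].tolist() for a, b in zip(starts, stops)]
df = pd.DataFrame(rows)
```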