Python vectorization -- Are there any limitations?

Question

I have a technical question about Python vectorization. By definition, vectorization is a technique for implementing array operations without explicit for loops, which reduces the running time of the code. But what about precision? I applied vectorization to a relatively large NumPy matrix (800000 x 1000) to scale it. I then scaled the same matrix using a for loop, which of course takes much more time. In other words, I implemented the same mathematical operation in two ways: with and without vectorization. When I use the scaled data for further processing, I obtain different results: the non-vectorized dataset gives better results when fed to the model (PCA).

Is that possible?
Can vectorizing an operation on a large dataset save time at the cost of precision?

Check out [ask] - there are very few cases where a question without code is on-topic here. — Daniel F, May 12 '22 at 12:38
As long as you are working with floats there shouldn't be a difference in values. For integer work, Python integers can be larger than the NumPy `int64`. — hpaulj, May 12 '22 at 14:20
For floating-point values, see https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate . Additionally, the precision of the result can change depending on the chosen algorithm, since floating-point operations are not associative. Since NumPy almost certainly does not use the same algorithm as you, the results are different. By the way, doing a PCA on 800 million numbers (presumably 6.4 GB of data) using naive loops executed by the CPython *interpreter* is insanely *inefficient*. — Jérôme Richard, May 12 '22 at 18:24
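To illustrate the non-associativity point in the comment above, here is a minimal pure-Python sketch. The helper name `running_sum` and the sample values are illustrative assumptions and are not taken from the question. It only shows that grouping or ordering the same additions differently can change the rounded result.

```python
import math

def running_sum(values):
    """Plain left-to-right accumulation, as a naive loop would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two groupings: the rounded results differ.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Same three numbers, two orderings: the small term is lost in one of them.
values = [1e16, 1.0, -1e16]
in_order = running_sum(values)                 # 1.0 is absorbed by 1e16
reordered = running_sum([1e16, -1e16, 1.0])    # large terms cancel first
exact = math.fsum(values)                      # correctly rounded sum

print(in_order, reordered, exact)  # 0.0 1.0 1.0
```

Neither order is "the precise one" in general. Differences of this kind are usually tiny relative to the data, so they would not normally explain a visibly better PCA result on their own.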
@JérômeRichard: yes, indeed it is about that volume of data. The first scaled dataset is obtained using (scaler = StandardScaler() ; scaler.fit(data) ; scaler.transform(data) ; return scaled_data). The second one is obtained using (for each element i in range(len(data)) do: scaler.fit_(i) ; scaler.transform(i) ; return scaled_data). To my surprise, PCA returns better results with method 2. I think that CPython integers are larger than int64, as mentioned by hpaulj, and that this leads to more precise results with the loop, even if it is a naive approach! — seif elfetni, May 13 '22 at 07:35
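The two procedures described in the comment above may not be the same mathematical operation. The sketch below assumes that the loop refits the scaler on each row separately, which is only one possible reading of the pseudocode. It uses a hypothetical NumPy helper, `standardize`, as a stand-in for the zero-mean, unit-variance scaling that `StandardScaler` performs, together with made-up random data.

```python
import numpy as np

def standardize(block, axis):
    """Subtract the mean and divide by the population std along `axis`."""
    mean = block.mean(axis=axis, keepdims=True)
    std = block.std(axis=axis, keepdims=True)
    return (block - mean) / std

rng = np.random.default_rng(0)
# Small made-up matrix: 6 samples, 3 features with very different scales.
data = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(6, 3))

# Method 1: fit once on the whole matrix, one mean/std per column (feature).
scaled_once = standardize(data, axis=0)

# Method 2 (assumed reading of the loop): refit on every row,
# so each row is scaled with its own mean/std.
scaled_per_row = np.vstack(
    [standardize(row[np.newaxis, :], axis=1) for row in data]
)

print(np.allclose(scaled_once, scaled_per_row))  # False
print(scaled_once.mean(axis=0))      # close to zero for every column
print(scaled_per_row.mean(axis=1))   # close to zero for every row
```

Under that assumption, the two datasets differ because different statistics are used, not because the loop is more precise. Both are float arrays, so the larger range of Python integers would not come into play.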

0 Answers