
I have a technical question about Python vectorization. By definition, vectorization is a technique for implementing array operations without explicit for loops, which reduces the execution time of the code. But what about precision? I applied vectorization to a relatively large NumPy matrix (800000 x 1000) to scale it. The first scaling method uses vectorization; the second uses a for loop, which of course takes much longer. In other words, I implemented the same mathematical operation in two ways: with and without vectorization. When I use the scaled data for further processing, I obtain different results: the "non-vectorized" dataset gives better results when fed to the model (PCA).

Is that possible?
Can vectorizing an operation on a large dataset save time at the cost of precision?
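
To make the comparison concrete, here is a minimal sketch of what "the same mathematical operation" done both ways could look like. It is not my actual code: the small random matrix, the variable names, and the choice of column-wise standardization with the population standard deviation (`ddof=0`) are all assumptions made for illustration.

```python
import numpy as np

# Small stand-in for the 800000 x 1000 matrix (assumed data, for illustration only).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 20))

# Vectorized: standardize every column with array operations.
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)
scaled_vec = (data - mean) / std

# Loop: the same formula, computed one column and one element at a time.
scaled_loop = np.empty_like(data)
for j in range(data.shape[1]):
    col = data[:, j]
    m = sum(col) / len(col)  # plain sequential Python sum
    s = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
    for i in range(data.shape[0]):
        scaled_loop[i, j] = (data[i, j] - m) / s

# The two results differ only by floating-point rounding noise.
max_abs_diff = np.max(np.abs(scaled_vec - scaled_loop))
print(max_abs_diff)
```

If both versions really implement the same formula, any difference should be at the level of rounding error, which is far too small to change a PCA in a meaningful way.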

  • Check out [ask] - there are very few cases where a question without code is on-topic here. – Daniel F May 12 '22 at 12:38
  • As long as you are working with floats, there shouldn't be a difference in values. For integer work, Python integers can be larger than the NumPy `int64`. – hpaulj May 12 '22 at 14:20
  • For floating-point values, see https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate . Additionally, the precision of the result can change depending on the chosen algorithm, since floating-point operations are not associative. Since NumPy certainly does not use the same algorithm as you, the results are different. By the way, doing a PCA on 800 million numbers (certainly 6.4 GB of data) using (naive) basic loops executed by the CPython *interpreter* is insanely *inefficient*. – Jérôme Richard May 12 '22 at 18:24
  • @JérômeRichard: yes, indeed, it is about this volume of data. The first scaled dataset is obtained using (scaler = StandardScaler() ; scaler.fit(data) ; scaler.transform(data) ; return scaled_data). The second one is obtained using (for each element i in range(len(data)) do: scaler.fit_(i) ; scaler.transform(i); return scaled_data). To my surprise, PCA returns better results with method 2. I think that CPython integers are larger than int64, as mentioned by hpaulj, and that this leads to more precise results with the loop, even though it is a naive approach! – seif elfetni May 13 '22 at 07:35
  • as mentioned by @hpaulj – seif elfetni May 13 '22 at 07:36
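
The two points raised in the comments can be sketched in plain Python. The first part shows that floating-point addition is not associative. The second part compares fitting the scaling statistics once on the whole dataset with fitting them again for each row. Reading the loop in the comment above as "fit per row" is an assumption, and the tiny `data` list and the `standardize` helper are hypothetical names used only for illustration.

```python
import statistics

# 1) Floating-point addition is not associative: grouping changes the last bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# 2) Toy dataset: 3 samples (rows) x 3 features (columns), assumed for illustration.
data = [
    [1.0, 100.0, 10.0],
    [2.0, 200.0, 20.0],
    [3.0, 300.0, 30.0],
]

# Fit once: each column is scaled with statistics computed over all rows.
scaled_columns = [standardize(col) for col in zip(*data)]
fit_once = [list(row) for row in zip(*scaled_columns)]

# Fit per row: each row is scaled with its own mean and standard deviation.
fit_per_row = [standardize(row) for row in data]

print(fit_once[0])
print(fit_per_row[0])
```

The rounding effect in the first part only touches the last digits, whereas the two scaling procedures in the second part are different mathematical operations and produce visibly different values. If the loop really refits the scaler on each element, that, rather than precision, would be the more likely reason the PCA results differ.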

0 Answers