Questions tagged [preprocessing]

Data preprocessing is a data mining technique that involves transforming raw data into a more understandable or more useful format.

Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing addresses these issues and prepares raw data for further processing or predictive modeling.

533 questions

4 answers

Different Test Set and Training Set Distribution

I am working on a data science competition for which the distribution of my test set is different from that of the training set. I want to subsample observations from the training set that closely resemble the test set. How can I do this?

preprocessing

asked Feb 26 '18 at 20:29

Pooja

3 answers

Organising the preprocessing of a dataset that will be consumed by multiple models

In a data science project, data is typically preprocessed. We also build, test, and select different models. Models also come with their own preprocessing requirements that can vary greatly from model to model, e.g., some models require scaling,…

preprocessing

asked Oct 08 '21 at 07:38

Oaty

2 answers

How to preprocess Acoustic Data

I am dealing with acoustic data with a very high sampling frequency of 2 MHz and want to build a classifier. I was wondering if there are any rules of thumb for preprocessing acoustic data. Is it better to directly use raw data (time signal) or first to…

preprocessing

asked Aug 31 '17 at 07:59

Andreas Look

1 answer

What does normalizing and mean centering data do?

Are there any concerns with normalizing data to be within the range 0 - 1 and mean centering the data as well? Does it matter which comes first? If you do one, is the other not required?

preprocessing

asked Jul 15 '16 at 17:36

atomsmasher
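
The ordering question above can be explored with a small sketch. It is plain Python with made-up sample values, and the helper names `min_max_scale` and `mean_center` are hypothetical, not taken from the question. Under these assumptions, min-max scaling applied after centering simply overwrites the centering, while centering applied after scaling keeps the unit range width but shifts the values so they average to zero.

```python
from statistics import mean

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mean_center(values):
    """Subtract the mean so the result averages to (approximately) zero."""
    m = mean(values)
    return [v - m for v in values]

# Made-up sample values, for illustration only.
data = [2.0, 4.0, 6.0, 12.0]

# Scale first, then center: range width stays 1, mean becomes ~0,
# so the values no longer lie inside [0, 1].
scaled_then_centered = mean_center(min_max_scale(data))

# Center first, then scale: the result lies in [0, 1] again and matches
# plain min-max scaling, because scaling removes any constant shift.
centered_then_scaled = min_max_scale(mean_center(data))
```

So the two operations are not interchangeable: whichever runs last determines whether the output is zero-mean or bounded to [0, 1].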

1 answer

Reconstituting estimated/predicted values to original scale from MinMaxScaler

I am playing around with a deterministic function in order to understand machine learning as in this tutorial blog. In the program I am using a deterministic function $y = f(x)$ where $f(x) = x^2$. I get a beautiful plot with the ($x$, predicted…

preprocessing

asked Oct 03 '19 at 20:34

Anthony from Sydney
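
As a rough illustration of mapping scaled values back to the original scale, here is a plain-Python sketch of the min-max arithmetic and its inverse. The helpers `fit_min_max`, `transform`, and `inverse_transform` are hypothetical stand-ins written for this example; scikit-learn's `MinMaxScaler` provides an `inverse_transform` method for the same purpose. The sample range for $x$ is an assumption.

```python
def fit_min_max(values):
    """Remember the minimum and maximum seen during fitting."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def inverse_transform(scaled, lo, hi):
    """Map values from [0, 1] back onto the original [lo, hi] scale."""
    return [s * (hi - lo) + lo for s in scaled]

# Assumed sample inputs for f(x) = x^2.
xs = [float(x) for x in range(-5, 6)]
ys = [x ** 2 for x in xs]

# Fit on the targets, scale them, then undo the scaling.
y_lo, y_hi = fit_min_max(ys)
ys_scaled = transform(ys, y_lo, y_hi)
ys_restored = inverse_transform(ys_scaled, y_lo, y_hi)
```

The point to note is that predictions must be inverted with the minimum and maximum fitted on the targets, not those fitted on the inputs.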

1 answer

How to treat Compass data in random forest regression

I'm working on a project where two of the features are entryHeading and exitHeading. Both state the direction (N, NE, E, SE, S, SW, W) of a vehicle at multiple points. My question is how would I go about pre-processing this? My first thought would…

preprocessing

asked Sep 24 '19 at 16:01

brokenfulcrum
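
One commonly used option for headings like these is a cyclical sine/cosine encoding, sketched below in plain Python. This is only an illustration, not necessarily what the answer recommends. The degree mapping, the `encode_heading` helper, and the inclusion of NW (which the question does not list) are assumptions.

```python
import math

# Assumed mapping of compass labels to degrees clockwise from north.
# NW is added for completeness; the question lists only seven directions.
COMPASS_DEGREES = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}

def encode_heading(label):
    """Return (sin, cos) of the heading so that nearby directions stay nearby."""
    angle = math.radians(COMPASS_DEGREES[label])
    return math.sin(angle), math.cos(angle)

def heading_distance(a, b):
    """Euclidean distance between two encoded headings."""
    return math.dist(encode_heading(a), encode_heading(b))

# N is equally close to NE and NW, which an ordinal 0..7 code would not capture.
north_to_ne = heading_distance("N", "NE")
north_to_nw = heading_distance("N", "NW")
```

Each heading column becomes two numeric features, so entryHeading and exitHeading would yield four columns in total.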

1 answer

Cleaning data automatically

Do you use automatic cleaning tools for data? I mean something similar to h2o.ai's AutoML function but applied to preprocessing data. Or do you always clean data 'by hand'?

preprocessing

asked Dec 02 '18 at 16:28

CezarySzulc

0 answers

Are there some resources for filters specifically applicable in big data applications?

Are there some resources for filters specifically applicable in big data applications? Particularly, are there major differences between filter design for other domains and filter design for data mining?

preprocessing

asked Sep 11 '18 at 12:41

mavavilj

2 answers

Rolling window features for multiclass classification

I'm doing multiclass classification, and the data is not considered a time series. I'm working on feature engineering and trying to solve the problem with classic KNN, RF, boosting, etc. I'm creating new features based on a rolling window and found…

preprocessing

asked Oct 15 '20 at 08:17

imitusov

0 answers

Outliers in test data

This famous dataset has a lot of zero values (above 500), but it is not really clear whether some of them are outliers or not. My question is: what if I decided that ALL 0's are outliers (it's pretty likely) and removed objects with any 0 from the dataset…

preprocessing

asked Feb 15 '24 at 22:20

Тима

0 answers

Encoding necessary for numerical data that can be summarised into a few groups

I have an input parameter that has 200 values. However, among the 200 values, there are only 3 distinct values. For example: X1 10 14 14 10 22 22 10 10 14 . . . Should I treat this parameter as a categorical input and encode it before…

preprocessing

asked Nov 22 '22 at 09:50

Steven Chan

1 answer

Preserve relations between data points when preprocessing

I am tasked with a project that aims to predict the probability of a product being returned before the product is even ordered. I have an Excel file containing a bunch of orders. In order to make predictions, it is important to predict each item in the…

preprocessing

asked Oct 05 '22 at 14:51

christallclear