Questions tagged [preprocessing]

Data preprocessing is a data mining technique that involves transforming raw data into a more understandable or more useful format.

Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing addresses such issues and prepares raw data for further processing or predictive modeling.

533 questions
26
votes
4 answers

Different Test Set and Training Set Distribution

I am working on a data science competition in which the distribution of my test set is different from that of the training set. I want to subsample observations from the training set that closely resemble the test set. How can I do this?
Pooja
3
votes
3 answers

Organising the preprocessing of a dataset that will be consumed by multiple models

In a data science project, data is typically preprocessed. We also build, test, and select different models. Models also come with their own preprocessing requirements that can vary greatly from model to model, e.g., some models require scaling,…
Oaty
2
votes
2 answers

How to preprocess Acoustic Data

I am dealing with acoustic data with a very high sampling frequency of 2 MHz and want to build a classifier. I was wondering if there are any rules of thumb for preprocessing acoustic data. Is it better to directly use raw data (time signal) or first to…
Andreas Look
2
votes
1 answer

What does normalizing and mean centering data do?

Are there any concerns with normalizing data to the range 0 to 1 and mean centering it as well? Does it matter which comes first? If you do one, is the other not required?
atomsmasher
1
vote
1 answer

Reconstituting estimated/predicted values to original scale from MinMaxScaler

I am playing around with a deterministic function in order to understand machine learning, as in this tutorial blog. I am using a deterministic function $y = f(x)$ where $f(x) = x^2$. I get a beautiful plot with the ($x$, predicted…
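As a minimal sketch of the idea in the title, the snippet below scales the targets of $f(x) = x^2$ to the range 0 to 1 and then maps scaled predictions back to the original scale. It uses only plain Python; the helper names `fit_min_max`, `scale`, and `unscale` are hypothetical, and the "predictions" are simply the scaled targets standing in for a model's output. In scikit-learn, a fitted `MinMaxScaler` offers `inverse_transform` for the same purpose.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the training values."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Invert scale(): map values from [0, 1] back onto [lo, hi]."""
    return [s * (hi - lo) + lo for s in scaled]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [v ** 2 for v in x]                 # y = f(x) = x^2

y_lo, y_hi = fit_min_max(y)             # fit on the training targets only
y_scaled = scale(y, y_lo, y_hi)

# Assumption: a model trained on y_scaled returns predictions on the same
# 0-to-1 scale; here the scaled targets stand in for those predictions.
y_pred_scaled = y_scaled
y_pred = unscale(y_pred_scaled, y_lo, y_hi)
```

The key point is that the same minimum and maximum used for scaling the targets must be reused when mapping predictions back.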
1
vote
1 answer

How to treat Compass data in random forest regression

I'm working on a project where two of the features are entryHeading and exitHeading. Both state the direction (N, NE, E, SE, S, SW, W) of a vehicle at multiple points. My question is how would I go about preprocessing this? My first thought would…
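One commonly used option for circular features such as headings, sketched here under assumptions, is to map each direction to an angle and encode it as a (sine, cosine) pair, so that directions adjacent on the compass stay close in feature space. The mapping `HEADING_DEGREES` and the function `encode_heading` are hypothetical names, and NW is included as an assumption since the question lists only seven directions. One-hot encoding is another reasonable choice for tree-based models.

```python
import math

# Assumed mapping from compass labels to degrees, clockwise from north.
# NW is not listed in the question and is added here as an assumption.
HEADING_DEGREES = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}


def encode_heading(heading):
    """Return (sin, cos) of the heading angle as two numeric features."""
    angle = math.radians(HEADING_DEGREES[heading])
    return math.sin(angle), math.cos(angle)


# Example rows with the two heading features from the question.
rows = [("N", "NE"), ("SW", "W")]
features = [encode_heading(entry) + encode_heading(exit_) for entry, exit_ in rows]
```

With this encoding, NW and N are as close to each other as N and NE, which a plain integer code (0 to 7) would not preserve.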
1
vote
1 answer

Cleaning data automatically

Do you use automatic cleaning tools for data? I mean something similar to h2o.ai's AutoML function, but applied to preprocessing data. Or do you always clean data 'by hand'?
CezarySzulc
1
vote
0 answers

Are there some resources for filters specifically applicable in big data applications?

Are there some resources for filters specifically applicable in big data applications? Particularly, are there major differences between filter design for other domains and filter design for data mining?
mavavilj
0
votes
2 answers

Rolling window features for multiclass classification

I'm doing multiclass classification, and the data is not considered a time series. I'm working on feature engineering and trying to solve the problem with classic KNN, RF, boosting, etc. I'm creating new features based on a rolling window and found…
imitusov
0
votes
0 answers

Outliers in test data

This famous dataset has a lot of zero values (above 500), but it is not really clear whether some of them are outliers or not. My question is: what if I decided that ALL 0's are outliers (it's pretty likely) and removed objects with any 0 from the dataset…
Тима
0
votes
0 answers

Encoding necessary for numerical data that can be summarised into a few groups

I have an input parameter that has 200 values. However, among the 200 values, there are only 3 distinct values. For example: X1 = 10, 14, 14, 10, 22, 22, 10, 10, 14, ... Should I treat this parameter as a categorical input and encode it before…
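If the three values are treated as categories, one possible encoding is one-hot, sketched below in plain Python on the sample values from the question. The function name `one_hot` is hypothetical. Whether to encode at all is a modeling choice: if the magnitudes 10 < 14 < 22 carry meaning, keeping the feature numeric may be more appropriate.

```python
def one_hot(values):
    """One-hot encode a list of values; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1   # mark the column of this value's category
        rows.append(row)
    return categories, rows


# Sample values of X1 as given in the question.
x1 = [10, 14, 14, 10, 22, 22, 10, 10, 14]
categories, encoded = one_hot(x1)
```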
0
votes
1 answer

Preserve relations between data points when preprocessing

I am tasked with a project that aims to predict the probability of a product being returned before the product is even ordered. I have an Excel file containing a number of orders. In order to make predictions, it is important to predict each item in the…