
It's a well-known fact that several hedge funds have a handful of PhDs just doing data cleansing. All day. Every day.

What kind of data cleansing are they actually doing? Is it really that difficult? How much depth is there in such a topic? Why do they need PhDs to do this?

Dylan Kerler
  • Could you please be more specific? What kind of data are you referring to? What's the source for your fact? Thanks a lot – Kermittfrog Aug 07 '21 at 11:21
  • Any data related to trading - so just about everything. There are quite a few sources but here is one from Nick Patterson (former Rentec employee) who says that they had 7 PhDs just working on data cleansing at minute 38:00: http://www.thetalkingmachines.com/episodes/ai-safety-and-legacy-bletchley-park – Dylan Kerler Aug 07 '21 at 11:39
  • "It's a well-known fact that several hedge funds have a handful of PhDs just doing data cleansing". Be aware that many large institutions using vast amounts of data for their internal models (banks, pension and hedge funds, insurance) usually have their own division for data cleaning and gathering. Often, to strengthen internal quantitative models, companies might rely on external data bought from another firm, which needs further cleaning in order to be reliable. [1/2] – Pleb Aug 07 '21 at 11:49
  • In general, employing proper data cleaning is an important part of creating a working quantitative model/strategy, since feeding noisy (improperly cleaned) data into a quantitative model will always yield bad results. In my honest opinion, I do not believe you need a PhD to do the job. However, there is a large supply of job-seeking quant developers/IT guys wanting to work in a hedge fund. Thus, hedge funds can be selective and get "the best of the best" for the job, which is usually PhDs. [2/2] – Pleb Aug 07 '21 at 11:52
  • I guess if you listen to your own source, Nick Patterson, you have your answer. Firstly, he does not claim all they do is data cleansing. Secondly, he just mentions that in his view (look up what he does and did, so some caution is needed) they mainly did simple regression. The reason they need to hire smart people is not that the model is so difficult, but to understand when the data is rubbish (his words, and he elaborates on that). "The most important thing is to do the simple things right". "Nobody tells you what to regress, what's the target, what's the source...." – AKdemy Aug 07 '21 at 14:28
  • Is it really that difficult to run 100m? The answer is no. If the goal is to run it in less than 9.85s, it becomes almost impossible. Very good speech for this topic. – AKdemy Aug 07 '21 at 14:32
  • @Pleb Can you make it an answer? – Bob Jansen Aug 07 '21 at 14:47
  • @BobJansen I can. But the comment is very generic. – Pleb Aug 07 '21 at 15:18
  • @Pleb Agreed, but that’s inherent to this question unless Jim Simons himself decides to answer ;) – Bob Jansen Aug 07 '21 at 15:22

2 Answers


Data cleaning is important for many large institutions:

"It's a well-known fact that several hedge funds have a handful of PhDs just doing data cleansing". Be aware that many large institutions using vast amount of data for their internal models (banks, pension- and hedge-funds, insurance etc.) usually have their own division for data cleaning and gathering. Often, to strengthen internal quantitative models, companies might rely on external data bought from another firm, which needs further cleaning in order to be reliable.

Employing proper data cleaning is an important part of creating a working quantitative model/strategy, since feeding noisy (improperly cleaned) data into a quantitative model will always yield bad results. In my honest opinion, you do not need a PhD to do the job. However, there is a large supply of job-seeking quant developers/IT guys wanting to work in a hedge fund. Thus, hedge funds can be selective and hire "the best of the best" for the job, which is usually PhDs.


An example of a simple cleaning procedure, for better insight:

When you are working with high-frequency trades and quotes (TAQ) stock data (i.e. intraday stock data), you need to clean it before the data will be useful. A well-known cleaning procedure is described in Barndorff-Nielsen et al. (2009), "Realized kernels in practice: Trades and quotes" (see section 3.1), which gives the necessary steps to delete outliers, abnormal trades, misrecordings of timestamps and prices in the database, and more. In the paper, they provide a detailed analysis of how the realized variance changes drastically as more of their specified data-cleaning rules are applied (see section 4, "Data analysis"). Note that this cleaning procedure applies only to high-frequency stock data; the procedure will differ when you need to clean alternative data.
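
For intuition, here is a minimal pandas sketch of a few of those rules (P1, P2, T3 and a T4-style outlier filter). The `price` column name and the exact windowing are my own illustrative assumptions, not the paper's definitive implementation:

```python
import pandas as pd

def clean_taq_trades(trades: pd.DataFrame) -> pd.DataFrame:
    """Simplified subset of the Barndorff-Nielsen et al. (2009) trade-cleaning
    rules. Assumes a DatetimeIndex and a 'price' column (illustrative name)."""
    # P1: keep only entries time-stamped within normal exchange hours.
    df = trades.between_time("09:30", "16:00")

    # P2: delete entries with a transaction price of zero.
    df = df[df["price"] > 0]

    # T3: aggregate entries sharing the same timestamp into their median price.
    df = df.groupby(level=0)[["price"]].median()

    # T4 (outlier rule): drop prices deviating by more than 10 mean absolute
    # deviations from a rolling centred median of ~50 observations.
    # (Simplification: the paper excludes the current observation itself.)
    med = df["price"].rolling(51, center=True, min_periods=1).median()
    mad = (df["price"] - med).abs().rolling(51, center=True, min_periods=1).mean()
    return df[(df["price"] - med).abs() <= 10 * mad]
```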

To conclude the answer, I have provided a graphical illustration of cleaned vs. raw (noisy) trade data for a single arbitrary day on SPY. The cleaning procedure follows exactly the rules provided in the above paper:

[Figure: cleaned vs. raw SPY trade prices for a single trading day]

We see how the cleaning procedure is able to detect outliers. Also, notice the odd behaviour of trades in the pre- and after-market hours; this is the main reason for cleaning rule P1, which discards entries time-stamped outside normal exchange opening hours.

Pleb

Some years ago I worked for a large institution that was not a hedge fund, and interacted with some folks who worked primarily on data cleanup. I'll share some observations that I hope will help explain what they do.

They focused on two kinds of data: securities indicative data (stock dividends, bond maturities and coupons) and market data (prices, rates). I think these days "alternative" data (things like the number of people who visited a particular mall on a given date) has also grown more prominent.

Some of the people developing the processes and procedures for cleanup had PhDs. However, I'm pretty certain that no one executing the operational procedures did.

Typical examples of indicative data cleanup: Bloomberg is missing a local identifier, an amortization schedule, an exotic coupon formula... all of these need to be amended in the internal databases.
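
A sketch of the mechanical part of such a completeness check, with hypothetical field names (nothing here is Bloomberg's actual schema):

```python
import pandas as pd

# Hypothetical security-master extract; the field names are illustrative only.
REQUIRED_FIELDS = ["local_id", "amortization_schedule", "coupon_formula"]

def missing_field_report(feed: pd.DataFrame) -> list:
    """List (security, field) pairs where the vendor feed is missing a
    required indicative field, so an operator can amend it internally."""
    gaps = feed[REQUIRED_FIELDS].isna()
    return [(security, field)
            for security, row in gaps.iterrows()
            for field in REQUIRED_FIELDS if row[field]]
```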

Typical examples of market data cleanup: some value is an outlier according to some criteria, so it needs to be investigated with the source (vendor or internal) and possibly replaced with a "missing" value; conversely, an unexpectedly "missing" value needs to be investigated with the source and hopefully populated.
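
As a toy illustration of that "outlier becomes missing" workflow (the robust z-score criterion below is one arbitrary choice, not a standard; real desks use whatever criteria they have agreed on):

```python
import pandas as pd

def flag_outliers_as_missing(series: pd.Series, threshold: float = 5.0) -> pd.Series:
    """Replace values failing a robust z-score check with NaN, so they
    surface as 'missing' values to be investigated with the source."""
    med = series.median()
    mad = (series - med).abs().median()        # median absolute deviation
    robust_z = 0.6745 * (series - med) / mad   # ~N(0,1) for Gaussian data
    return series.mask(robust_z.abs() > threshold)
```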

As you can see, such tasks are seldom quantitative, and much of this work could be replaced with AI/automation.

Dimitri Vulis