
I'm trying to predict whether the stock market will be up or down each day using Twitter.
I researched this topic a lot and found this article as a starting point.
Basically, what I've done is get the Dow Jones data from Yahoo Finance and calculate whether each day was positive or negative.
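
The labelling step could look like the sketch below. It assumes a day counts as positive when the index closes above its open; the post does not say which definition was used, and the prices shown are made-up placeholders, not real Dow Jones values. The name `label_day` is hypothetical.

```python
def label_day(open_price, close_price):
    """Return 'pos' if the index closed above its open, otherwise 'neg'.

    Assumption: the post does not define "positive day"; close > open is
    one common choice (another is close > previous day's close).
    """
    return "pos" if close_price > open_price else "neg"


# Made-up (open, close) pairs, only to show the shape of the data.
prices = {
    "2010-01-04": (100.0, 101.5),
    "2010-01-05": (101.5, 100.2),
}

labels = {day: label_day(o, c) for day, (o, c) in prices.items()}
print(labels)
```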

For the same date, I get all the tweets that contain phrases like "I'm", "feel", or "makes me", in order to collect only the tweets that express a sentiment.
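
A minimal sketch of that filter, assuming the tweets are already downloaded as plain strings and that a case-insensitive substring match is good enough. The phrase list comes from the examples above; `expresses_sentiment` and the sample tweets are hypothetical.

```python
# Phrases taken from the examples in the post; the real list may be longer.
SENTIMENT_PHRASES = ("i'm", "feel", "makes me")


def expresses_sentiment(tweet_text):
    """Return True if the tweet contains any of the marker phrases.

    Substring matching is an assumption: "feel" will also match "feeling".
    """
    text = tweet_text.lower()
    return any(phrase in text for phrase in SENTIMENT_PHRASES)


# Invented sample tweets, for illustration only.
tweets = [
    "I feel great about today",
    "Traffic update: road closed",
    "This weather makes me sad",
]

kept = [t for t in tweets if expresses_sentiment(t)]
print(kept)
```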

I have a list of words (positive and negative), without score, just words.
For every day analyzed, I create a Python dictionary whose keys are the words of the list and whose values are scores, calculated in the following way:

score of a word = num of times the word matches tweets in a day /
                  num of total matches of all words
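
In code, the score above might be computed as in the sketch below. It assumes "matches" means word occurrences (a word used twice in one tweet counts twice) and uses a simple regex tokenizer; both are guesses, and `daily_word_scores` plus the tiny lexicon are hypothetical.

```python
import re
from collections import Counter


def daily_word_scores(tweets, lexicon):
    """Score each lexicon word as its share of all lexicon matches in a day."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: lowercase alphabetic tokens (plus apostrophes) are enough.
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token in lexicon:
                counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        # No lexicon word appeared that day; avoid dividing by zero.
        return {word: 0.0 for word in lexicon}
    return {word: counts[word] / total for word in lexicon}


# Tiny invented lexicon and tweets, for illustration only.
lexicon = {"happy", "sad", "great"}
tweets = ["I feel happy, so happy", "This makes me sad"]

scores = daily_word_scores(tweets, lexicon)
print(scores)
```

Note that with this definition the scores of a day always sum to 1 (or are all 0), so they describe the mix of words used, not how much sentiment-bearing text there was.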

To predict the stock market, I train a naive Bayes classifier, using as data the Python dictionary of words and their scores, and as target 'pos' or 'neg' according to the finance data.
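
A minimal sketch of that training step with scikit-learn, assuming one dictionary of word scores per day. The post does not say which naive Bayes variant was used; `MultinomialNB` is one option that accepts non-negative fractional features. All numbers below are invented.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One dict of word scores per day (invented values, two-word lexicon).
daily_scores = [
    {"happy": 0.7, "sad": 0.3},
    {"happy": 0.2, "sad": 0.8},
    {"happy": 0.6, "sad": 0.4},
    {"happy": 0.1, "sad": 0.9},
]
# Matching 'pos'/'neg' labels derived from the finance data (invented).
daily_labels = ["pos", "neg", "pos", "neg"]

# DictVectorizer turns each dict into a fixed-length numeric row;
# the choice of MultinomialNB is an assumption, not the poster's stated model.
model = make_pipeline(DictVectorizer(sparse=False), MultinomialNB())
model.fit(daily_scores, daily_labels)

prediction = model.predict([{"happy": 0.8, "sad": 0.2}])
print(prediction[0])
```

With roughly 250 trading days and 18540 word features, accuracy on the training days says little; scoring the model on held-out days gives a more honest picture.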

I collected one year of data (from 1-1-2010 to 31-12-2010).
The list contains 18540 words.
I'm working with Python 3.4, tweepy and scikit-learn.

The classifier doesn't work well, and since I'm a novice in this field, I would like to ask whether there is something wrong in my procedure or whether you have any suggestions to help me.

Any help is appreciated.

Richard Hardy
  • 67,272
Giordano
  • 121

1 Answer


You have an outcome of interest: whether the stock market rises or falls on a given day. While I have my doubts about how useful this is, a binary outcome is a fine place to start practicing machine learning.

You also have predictor variables that you will consider, which you calculate from your text analysis.

Thus, you are asking if this text-based feature is predictive of whether or not the stock market increases or decreases.

After doing this, you got a model that performs poorly. There are a few possible reasons for that.

  1. Is there a reason to believe that the text-based data would be predictive of the stock market? There must be economic and political reasons for stock market movements, and you seem not to capture such data.

  2. Even if the Tweets contain a great deal of information that is predictive of stock market movements, does your way of extracting information from the Tweets preserve that information?

  3. While investor sentiment about investments might be reasonably regarded as predictive of stock market movement, your Tweets capture much information that is unrelated. For instance, a Tweet like, "I feel so down after my date stood me up," has a negative sentiment, but I see little connection to the stock market. I would expect much of Twitter to deal with the kind of Tweet I gave above instead of investor sentiment about investing.

I suspect a major cause of your poor performance is that your features simply do not have much to do with investing, so they should not be expected to predict stock market movements.

Dave
  • 62,186