Model | ROC AUC | F1-score |
---|---|---|
Random Forest | 0.99782 | 0.99345 |
Logistic Regression | 0.99851 | 0.98359 |
Machine Learning
Executive Summary
We examined the popularity of Taylor Swift-related content on Reddit and used machine learning to identify the elements that help certain posts go viral. We first defined what makes a post viral, taking into account the number of comments, the score, and similar signals, and then trained models to predict which posts are likely to go viral. Both models score close to the maximum on standard evaluation metrics: the random forest reaches a ROC AUC of 0.998 and an F1-score of 0.993, and logistic regression reaches a ROC AUC of 0.999 and an F1-score of 0.984. These predictions give content planning a data-driven basis for identifying and promoting posts that are likely to be popular on Reddit.
We also tried to find out what makes a post receive more upvotes. The results show that when a post's topic invites discussion, the number of comments grows and the post has a better chance of receiving upvotes. Reddit users also prefer original posts written by the OP (original poster), because these may contain unique insights; non-original content can already be viewed on other platforms, which reduces the click rate. Posts also do not need to be highly specialized or tied to a particular subreddit, so that users can share them to other subreddits (crossposting). The posting month is worth considering, because user activity changes from month to month. People may visit Reddit more for several reasons, such as spare time during the winter holidays or recent big news. Posting in a period when more users are active on Reddit tends to bring more upvotes. Posting hour matters much less than posting month: based on the analysis, choosing a specific time of day (such as after work or before bed) to attract more views and upvotes is unlikely to help. In short, anyone who wants more upvotes should focus on the quality of the post's text.
Predicting Taylor Swift Submissions Popularity Using Machine Learning Models
To predict the popularity (`score`) of a submission in the Taylor Swift subreddit, we considered the variables `post_month`, `post_hour`, `number_of_comments`, `number_of_crossposts`, `self_post`, `stickied`, `text_sentiment`, and `text_length`. We fitted four machine learning models to find the best description of the data: Random Forest Regressor, Decision Tree Regressor, Gradient Boosting Regressor, and Generalized Linear Regressor. During data preparation, we applied a string indexer to convert string-type variables into numeric indices, then one-hot encoding to expand those indices into indicator vectors. The data were rescaled with MinMaxScaler to reduce the computation needed by some algorithms (the linear model). All transformation steps (indexers, one-hot encoders, vector assembler, and scaler) and the model were passed to a single pipeline for faster model building and computation.
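The three transformations named above can be illustrated in plain Python. This is only a sketch of what each step does, not the project's pipeline code: the helper names, the frequency-based index order, the handling of constant columns, and the toy rows are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def string_index(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct label to an integer index, most frequent first."""
    counts: Dict[str, int] = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[float]:
    """Expand an integer index into a vector holding a single 1.0."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 everywhere.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy rows, made up for illustration only (not the project's data).
months = ["dec", "jan", "dec", "jul"]
text_lengths = [120.0, 40.0, 300.0, 40.0]

month_index = string_index(months)
scaled_lengths = min_max_scale(text_lengths)

# "Assemble" one feature vector per row: one-hot month + scaled text length.
features = [
    one_hot(month_index[m], len(month_index)) + [length]
    for m, length in zip(months, scaled_lengths)
]
```

Each row ends up as a single numeric vector, which is the form a regressor expects as input.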
Models are evaluated on two metrics: R-squared and RMSE. The summary tables show the performance of each model. The Decision Tree model fits the given data best, with the smallest RMSE (303.79) and the largest R-squared (0.497). The Random Forest and Gradient Boosting models achieve similar but slightly worse results. The Generalized Linear Model does not fit the data well (R-squared of 0.058).
Model | R-squared |
---|---|
Random Forest | 0.482579 |
Decision Tree | 0.497129 |
Gradient Boosting | 0.455375 |
Generalized Linear Regressor | 0.0577222 |
Model | RMSE |
---|---|
Random Forest | 308.156 |
Decision Tree | 303.793 |
Gradient Boosting | 316.154 |
Generalized Linear Regressor | 415.852 |
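For reference, the two evaluation metrics can be computed from actual and predicted scores as in the minimal sketch below. The function names and the toy values are illustrative assumptions and are unrelated to the project's data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: typical size of a prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
```

RMSE is expressed in the same unit as the target (score points here), so lower is better, while R-squared approaches 1 as the fit improves.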
Feature importance was examined in the Decision Tree Regressor model. The horizontal bar plot in Figure 4 below displays the importance of each variable considered in this project. The three variables that affect the `score` most are `number of comments`, `is self post`, and `number of crossposts`. Variables such as `month` and `text length` have some effect on the model, but not a significant one. The variables `hour`, `stickied`, and `sentiment` are not helpful for predicting the outcome.
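Regression trees commonly derive feature importance from how much the splits on a feature reduce the variance of the target. The sketch below illustrates that idea with a deliberate simplification: it scores only the best single split per feature rather than accumulating gains over a fitted tree. The helper names and the toy data are assumptions made for illustration, and this is not the computation behind Figure 4.

```python
from typing import Dict, Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split_gain(feature: Sequence[float], target: Sequence[float]) -> float:
    """Largest variance reduction achievable by one threshold on `feature`."""
    n = len(target)
    parent = variance(target)
    best = 0.0
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, parent - weighted)
    return best


# Toy data, made up for illustration only.
toy_features: Dict[str, Sequence[float]] = {
    "number_of_comments": [1, 2, 50, 60],
    "post_hour": [9, 21, 9, 21],
}
toy_score = [5, 7, 400, 420]

gains = {name: best_split_gain(col, toy_score) for name, col in toy_features.items()}
total = sum(gains.values())
# Normalise so the importances sum to 1, as importance plots usually do.
importances = {name: gain / total for name, gain in gains.items()}
```

In this toy example, splitting on the comment count separates low and high scores almost perfectly, so it receives nearly all of the importance, while the posting hour receives almost none.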
Conclusions can be drawn from the variables in the best model. To earn more upvotes on a post, one may bring up an original topic that interests others, so that it attracts more comments. One may also consider the posting time (month) and how informative the text is. Surprisingly, posting hour has almost no effect on the final score, and users do not seem to care about the sentiment of the text.
The model predicts the score, that is, the number of upvotes minus the number of downvotes for a submission. However, a variable like `number of comments` is of limited help for questions such as “how do I post a popular submission and gain more upvotes?”, because `number of comments` is a lagging variable: it only appears after the submission is posted, and it depends on the post itself. A better model should also consider the topics in the text, so that users can think about which topic to discuss in order to attract more discussion and, in turn, more upvotes.
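The target definition and the lagging-variable concern can be expressed in a short sketch. Only `number_of_comments` is marked as lagging here, because it is the only variable the text identifies as such; whether other variables are also unknown at posting time is left open. The helper and constant names are hypothetical.

```python
from typing import List, Set


def score(upvotes: int, downvotes: int) -> int:
    """Score of a submission: upvotes minus downvotes."""
    return upvotes - downvotes


# The eight predictors considered in this project.
ALL_FEATURES: List[str] = [
    "post_month",
    "post_hour",
    "number_of_comments",
    "number_of_crossposts",
    "self_post",
    "stickied",
    "text_sentiment",
    "text_length",
]

# Assumption: only the variable the text names as lagging is excluded.
LAGGING_FEATURES: Set[str] = {"number_of_comments"}

# Features a poster could act on, under the assumption above.
posting_time_features = [f for f in ALL_FEATURES if f not in LAGGING_FEATURES]
```

A model restricted to `posting_time_features` would be closer to answering what an author can control before posting, at the cost of losing the strongest predictor.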