Machine Learning

Executive Summary

We examined the popularity of Taylor Swift-related content on Reddit and built models to identify which elements help a post go viral. We first defined what makes a post viral, taking into account the number of comments, the score, and similar engagement measures, and then trained classifiers to predict which posts are likely to go viral. Both models score close to 1 on standard evaluation metrics: logistic regression has the slightly higher ROC AUC (0.99851 versus 0.99782), while random forest has the higher F1 score (0.99345 versus 0.98359). These results suggest a data-driven way to plan content by identifying posts that are likely to become popular on Reddit.

We also investigated what makes a post receive more upvotes. When the topic of a post invites discussion, the number of comments grows and the post is more likely to receive upvotes. Reddit users also prefer original content shared by the OP (original poster), because such posts may contain unique insights, whereas non-original posts can already be seen on other platforms, which reduces the click rate. Posts do not need to be highly specialized or tied to a single subreddit; content that can be shared to other subreddits (crossposted) tends to do better. The posting month also matters, because user activity changes from month to month: people may visit Reddit more when they have spare time during winter holidays or when major news has recently broken, and posting when more users are active tends to bring more upvotes. The posting hour matters far less than the posting month, so choosing a specific time of day (such as after work or before bed) to gain more views and upvotes is unlikely to help. In short, anyone who wants more upvotes should focus on the quality of the post's content.

Viral Posts Detection Under Taylor Swift Subreddit

See the code: https://joycetaoyuuu.github.io/dsan-6000-project/mx109_ml.html

The project began by setting up a PySpark environment to handle the large data set efficiently. We selected Taylor Swift-related posts on Reddit and organized the data for analysis. We evaluated key metrics such as number of comments, number of shares, total score, and post length, and also took post sentiment into account. Categorical variables were prepared for modeling with PySpark's StringIndexer and OneHotEncoder.
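The indexing and one-hot encoding steps can be illustrated with a minimal pure-Python sketch. It is not the project's PySpark code: the helper names `string_index` and `one_hot` and the sample sentiment labels are hypothetical, and the ordering rule (most frequent category first, ties broken alphabetically) is an assumption chosen for the illustration.

```python
from collections import Counter


def string_index(values):
    """Map each category to an integer index, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical sentiment labels for five posts (not project data).
sentiments = ["positive", "neutral", "positive", "negative", "positive"]

mapping = string_index(sentiments)
encoded = [one_hot(mapping[s], len(mapping)) for s in sentiments]

print(mapping)
print(encoded)
```

Each string category becomes an integer first, and the integer is then expanded into an indicator vector so that a model does not read an ordering into the category codes.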

We chose two models, random forest and logistic regression, because of their ability to handle different types of data. After splitting the data into training and test sets, we trained each model and evaluated its performance on ROC AUC and F1 score, which are common measures for this type of task (see Table 1 and Figure 1). Both models score close to 1 on both metrics. Random forest has the higher F1 score (0.99345 versus 0.98359), while logistic regression has a marginally higher ROC AUC (0.99851 versus 0.99782), so both separate viral from non-viral posts well.
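As a reminder of what ROC AUC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen viral post receives a higher predicted score than a randomly chosen non-viral post. The function name `roc_auc` and the toy labels and scores are hypothetical, and this is not the evaluator used in the project.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = viral, 0 = not viral; scores are made-up model outputs.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]

auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of examples, so it is only suitable for illustration; library evaluators compute the same quantity more efficiently.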

Table 1: Model Summary Table
	Model	ROC AUC	F1-score
0	Random Forest	0.99782	0.99345
1	Logistic Regression	0.99851	0.98359

Figure 1: Model Evaluation Plot

Figure 2: Confusion Matrix for Random Forest Classifier

Figure 3: Confusion Matrix for Logistic Regression Classifier
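The relationship between a confusion matrix and the F1 score can be sketched as follows. The helper names and the toy predictions are hypothetical. The sketch computes the binary F1 for the viral class only; the F1 values reported in Table 1 may have been averaged differently (for example, weighted across both classes), which is not confirmed here.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = viral, 0 = not viral (not project data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
print((tp, fp, fn, tn), f1)
```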

The performance of both models indicates a strong connection between certain characteristics, such as sentiment and engagement metrics, and the viral potential of a post. This insight is useful for authors who want their posts to go viral on platforms like Reddit.

Visualizations help communicate these findings. Plots of the most influential features in the random forest model, together with the performance curves of the two models, make the analysis easier to follow for readers unfamiliar with the technical details.

The project lays a foundation for ongoing analysis. Both models can be retrained as new data arrive, so the approach to identifying viral content can be kept up to date.

Predicting Taylor Swift Submissions Popularity Using Machine Learning Models

See the code: https://joycetaoyuuu.github.io/dsan-6000-project/yt560_ml.html

To predict the popularity (score) of a submission in the Taylor Swift subreddit, we considered the following variables: post_month, post_hour, number_of_comments, number_of_crossposts, self_post, stickied, text_sentiment, and text_length. Four machine learning models were fitted to find the best description of the data: Random Forest Regressor, Decision Tree Regressor, Gradient Boosting Regressor, and Generalized Linear Regressor. During data preparation, we applied a string indexer to convert string-type variables into numeric indices, and one-hot encoding to expand those indices into indicator vectors. The data were rescaled with MinMaxScaler to reduce computation for some algorithms (the linear model). All transformation steps (indexers, one-hot encoders, vector assembler, and scaler) and the model were combined in a single pipeline for faster model building and computation.
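Min-max rescaling maps each feature onto the range [0, 1] using the formula (x - min) / (max - min). The sketch below shows this in plain Python rather than with the PySpark scaler; the function name `min_max_scale` and the sample text lengths are hypothetical, and returning 0.0 for a constant column is an arbitrary choice of this sketch that a library may handle differently.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale. Returning zeros is an
        # arbitrary choice made for this sketch.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical text lengths for four posts (not project data).
text_length = [120, 40, 300, 40]

scaled = min_max_scale(text_length)
print(scaled)
```

Putting all features on the same scale mainly benefits the linear model, whose fitting is sensitive to feature magnitudes; tree-based models split on thresholds and are unaffected by monotonic rescaling.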

Models were evaluated on two metrics: R-squared and RMSE. Tables 2 and 3 summarize the performance of each model. The Decision Tree model fits the given data best, with the smallest RMSE (303.79) and the largest R-squared (0.497). The Random Forest and Gradient Boosting models produce similar but slightly worse results. The Generalized Linear Regressor does not fit the data well (R-squared of about 0.058).

Table 2: Table for Models and R-squared Values
	Models	R-squared
0	Random Forest	0.482579
1	Decision Tree	0.497129
2	Gradient Boosting	0.455375
3	Generalized Linear Regressor	0.0577222

Table 3: Table for Models and RMSE Values
	Models	RMSE
0	Random Forest	308.156
1	Decision Tree	303.793
2	Gradient Boosting	316.154
3	Generalized Linear Regressor	415.852
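
The two metrics reported in Tables 2 and 3 can be computed from actual and predicted scores as in the sketch below. The function names and the toy values are hypothetical and unrelated to the project data; the formulas are the standard definitions of RMSE and the coefficient of determination.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy post scores and predictions (not project data).
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
print(round(rmse_value, 4), round(r2_value, 4))
```

RMSE is expressed in the same units as the score, so an RMSE near 300 means predictions are typically off by a few hundred points, while an R-squared near 0.5 means roughly half of the variance in score is explained.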

Feature importance was examined for the Decision Tree Regressor. The horizontal bar plot in Figure 4 below shows the importance of each variable considered in this project. The three variables that affect the score most are number of comments, is self post, and number of crossposts. Variables such as month and text length have some effect on the model, but not a significant one. Hour, stickied, and sentiment contribute little to the prediction.

Figure 4: Feature Importance for Decision Tree Regressor

Several conclusions follow from the variables in the best model. To earn a higher score, a poster can raise an original topic that interests others, which tends to attract more comments. The posting month and how informative the post is also matter, to a lesser degree. Surprisingly, posting hour has almost no effect on the final score, and the sentiment of the text does not appear to influence it either.

The models predict the score, that is, the number of upvotes minus the number of downvotes for a submission. However, a variable such as number of comments is of limited use for answering a question like “how can I write a popular submission and gain more upvotes”, because it is a lagging variable: it is only observed after the submission has been posted, and it depends on the post itself. A better model would also consider the topics in the text, so that users can decide which topic to discuss in order to attract more discussion and, in turn, more upvotes.
