Presentation updates.

This commit is contained in:
Carl Colglazier 2024-05-03 18:07:01 -05:00
parent f1db23ca75
commit 02227e6bc2
28 changed files with 8085 additions and 459 deletions

1
.gitignore vendored
View File

@ -10,6 +10,7 @@ index_files/
*.revealjs.md
site_libs/
.quarto/
images/server_images/
# R stuff
.Rproj.user

View File

@ -13,9 +13,11 @@ envs <- Sys.getenv()
# Introduction
Following Twitter's 2022 acquisition, Mastodon---an open-source, decentralized social network and microblogging community---saw an increase in activity and attention as a potential Twitter alternative [@heFlockingMastodonTracking2023; @cavaDriversSocialInfluence2023]. While millions of people set up new accounts and significantly increased the size of the network, many of these newcomers and potential newcomers found the process confusing and many accounts did not remain active. Unlike centralized social media platforms, Mastodon is a network of independent servers with their own rules and norms [@nicholsonMastodonRulesCharacterizing2023]. Each server can communicate with each other using the shared ActivityPub protocols and accounts can move between Mastodon servers, but the local experience can vary widely from server to server.
Following Twitter's 2022 acquisition, Mastodon---an open-source, decentralized social network and microblogging community---saw an increase in activity and attention as a potential Twitter alternative [@heFlockingMastodonTracking2023; @cavaDriversSocialInfluence2023]. While millions of people set up new accounts and significantly increased the size of the network, many newcomers found the process confusing and many accounts did not remain active. Unlike centralized social media platforms, Mastodon is a network of independent servers with their own rules and norms [@nicholsonMastodonRulesCharacterizing2023]. Each server can communicate with each other using the shared ActivityPub protocols and accounts can move between Mastodon servers, but the local experience can vary widely from server to server.
Although attracting and retaining newcomers is a key challenge for online communities [@krautBuildingSuccessfulOnline2011 p. 182], Mastodon's onboarding process has not always been straightforward. Variation among servers can also present a challenge for newcomers who may not even be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022]. Further, many Mastodon servers have specific norms which people coming from Twitter may find confusing, such as local norms around content warnings [@nicholsonMastodonRulesCharacterizing2023]. Various guides and resources for people trying to join Mastodon offered mixed advice on choosing a server. Some suggest that the most important thing is to simply join any server and work from there [@krasnoffMastodon101How2022; @silberlingBeginnerGuideMastodon2023], while others have created tools and guides to help people find potential servers of interest by size and location [@thekinrarMastodonInstances; @kingMastodonMe2024].
<!-- Further, many Mastodon servers have specific norms which people coming from Twitter may find confusing, such as local norms around content warnings [@nicholsonMastodonRulesCharacterizing2023]. -->
Although attracting and retaining newcomers is a key challenge for online communities [@krautBuildingSuccessfulOnline2011 p. 182], Mastodon's onboarding process has not always been straightforward. Variation among servers can also present a challenge for newcomers who may not even be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022]. Various guides and resources for people trying to join Mastodon offered mixed advice on choosing a server. Some suggest that the most important thing is to simply join any server and work from there [@krasnoffMastodon101How2022; @silberlingBeginnerGuideMastodon2023], while others have created tools and guides to help people find potential servers of interest by size and location [@thekinrarMastodonInstances; @kingMastodonMe2024].
Mastodon's decentralized design has long been in tension with the disproportionate popularity of a small set of large, general-topic servers within the system [@ramanChallengesDecentralisedWeb2019a]. Analysing the activity of new accounts that join the network, we find that users who sign up on such servers are less likely to remain active after 91 days. We also find that many users who move accounts tend to gravitate toward smaller, more niche servers over time, suggesting that established users may also find additional utility from such servers.
@ -396,4 +398,44 @@ Based on analysis of trace data from millions of new Fediverse accounts, we find
*Federated timeline*: A timeline which includes all posts from users followed by other users on their server.
*Local timeline*: A timeline with all public posts from the local server.
:::
::: {.content-visible when-format="html"}
```{r}
library(tidyverse)
library(arrow)
library(ggrepel)
"data/scratch/server_svd.feather" %>% arrow::read_ipc_file() %>%
as_tibble %>%
ggplot(aes(x = x, y = y, label = server)) +
geom_text_repel(size = 2, max.overlaps = 10) +
#geom_point() +
theme_minimal()
```
```{r}
library(tidyverse)
library(arrow)
library(ggrepel)
library(here)
library(jsonlite)
top_tags <- "data/scratch/tag_svd.feather" %>% arrow::read_ipc_file() %>%
as_tibble %>%
mutate(s = variance * log(count)) %>% arrange(desc(s))
top_tags %>%
select(tag, index) %>%
jsonlite::write_json(here("recommender/data/top_tags.json"))
top_tags %>%
head(100) %>%
ggplot(aes(x = x, y = y, label = tag)) +
geom_text_repel(size = 3, max.overlaps = 10) +
#geom_point() +
theme_minimal()
```
:::

View File

@ -1,31 +0,0 @@
# gensim dict
from gensim.corpora.dictionary import Dictionary
from gensim.models import Nmf
host_bow_clusters = all_tag_posts_filtered.explode("tags").rename({"tags":"tag"}).join(
clusters, on="tag", how="inner"
).drop("tag").join(
clusters, on="cluster", how="inner"
).drop("cluster").unique(["host", "id", "tag"]).group_by("host").agg([
pl.col("tag")
])
bow_str = host_bow_clusters["tag"].to_list()
dict = Dictionary(bow_str)
bow = [dict.doc2bow(x) for x in bow_str]
nmf = Nmf(bow, num_topics=10)
##
#tf_idf
host_names = tf_idf["host"].unique().sort().to_list()
n_servers = len(host_names)
host_name_lookup = {host_names[i]: i for i in range(n_servers)}
n_clusters = tf_idf["cluster"].max() + 1#len(tf_idf.unique("cluster"))
id_names = {i: clusters.unique("cluster")["tag"].to_list()[i] for i in range(n_clusters)}
m = lil_matrix((n_clusters, n_servers), dtype=int)
for row in tf_idf.iter_rows(named=True):
m[row["cluster"], host_name_lookup[row["host"]]] = row["count"]
dict = Dictionary([host_names])
nmf = Nmf(corpus=m.tocsc(), num_topics=128, id2word=id_names)

View File

@ -1,62 +1,79 @@
from federated_design import *
from sklearn.cluster import AffinityPropagation
from sklearn.decomposition import TruncatedSVD
from scipy.sparse.linalg import svds
from sklearn.preprocessing import normalize
import polars as pl
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
import json
if __name__ == '__main__':
class ReccModel:
def __init__(self):
jm = pl.read_json("data/joinmastodon-2023-08-25.json")
jm_servers = set(jm["domain"].unique().to_list())
jm_td = TagData(jm_servers, 256, min_server_accounts=2)
jm_tfidf = jm_td.bm(n_server_accounts=0, n_servers=2, n_accounts=10)
mat = built_tfidf_matrix(jm_tfidf, jm_td.tag_to_index, jm_td.host_to_index)
m = (mat.T / (scipy.sparse.linalg.norm(mat.T, ord=2, axis=0) + 0.0001))
server_similarlity = cosine_similarity(m.tocsr())
l = []
for i in range(np.shape(server_similarlity)[0] - 1):
#s_index = min(i, np.shape(baseline_similarlity)[0] - 1)
l.append(
pl.DataFrame({
"Source": list(jm_td.host_to_index.keys())[i],
"Target": list(jm_td.host_to_index.keys())[i+1:],
"Similarity": server_similarlity[i][i+1:]
self.td = TagData(jm_servers, 256, min_server_accounts=2)
# Build the tfidf matrix
self.tfidf = self.td.bm(n_server_accounts=0, n_servers=2, n_accounts=5)
self.mat = built_tfidf_matrix(self.tfidf, self.td.tag_to_index, self.td.host_to_index)
#self.tag_use_counts = np.sum(self.mat > 0, axis=1).T
self.tag_use_counts = np.array([self.mat.getrow(i).getnnz() for i in range(self.mat.shape[0])])
self.has_info = (self.tag_use_counts >= 2).tolist()
self.tag_names = np.array(list(self.td.tag_to_index.keys()))[self.has_info]
self.server_has_info = (np.sum(self.mat[self.has_info], axis=0) > 0).tolist()[0]
self.server_names = np.array(list(self.td.host_to_index.keys()))[self.server_has_info]
self.m_selected = normalize(self.mat.T.tocsr()[:, self.has_info][self.server_has_info], norm="l2", axis=1)
#self.svd = TruncatedSVD(n_components=50, n_iter=25, random_state=42).fit(self.m_selected)
def svd(self, k=50, norm_axis=None):
m = self.m_selected
if norm_axis is not None:
m = normalize(m, norm="l2", axis = norm_axis)
u, s, v = svds(m, k=k, which="LM")
return u, s, v
def top_tags(self):
u, s, v = self.svd(k=25)
tag_stuff = np.diag(s) @ v
return pl.DataFrame({
"tag": self.tag_names,
"x": tag_stuff[-1],
"y": tag_stuff[-2],
"variance": np.var(tag_stuff, axis=0),
"count": self.tag_use_counts[self.has_info].tolist(),
"index": np.arange(len(self.tag_names))
})
def top_servers(self):
u, s, v = self.svd(k=25)
server_stuff = normalize((u @ np.diag(s)).T, norm="l2")
return pl.DataFrame({
"server": self.server_names,
"x": server_stuff[-1],
"y": server_stuff[-2],
"index": np.arange(len(self.server_names))
})
)
similarity_df = pl.concat(l).filter(pl.col("Similarity") > 0.0)
# This one seem pretty good!
def sim_from_tag_index(index=1000):
u, s, v = rm.svd(k=50, norm_axis=0)
m = (np.diag(s) @ v).T
pos = m[index]
server_matrix = u @ np.diag(s)
server_sim = cosine_similarity(pos.reshape(1, -1), server_matrix)
return server_sim
if __name__ == "__main__":
rm = ReccModel()
rm.top_tags().write_ipc("data/scratch/tag_svd.feather")
rm.top_servers().write_ipc("data/scratch/server_svd.feather")
u, s, v = rm.svd(k=100, norm_axis=None)
pos_m = v.T#(v.T @ np.diag(s))#v.T#
server_matrix = u#u @ np.diag(s)#u#
with open("recommender/data/positions.json", "w") as f:
f.write(json.dumps(pos_m.tolist()))
with open("recommender/data/server_matrix.json", "w") as f:
f.write(json.dumps(server_matrix.tolist()))
with open("recommender/data/server_names.json", "w") as f:
f.write(json.dumps(rm.server_names.tolist()))
with open("recommender/data/tag_names.json", "w") as f:
f.write(json.dumps(rm.tag_names.tolist()))
jm = pl.read_json("data/joinmastodon-2023-08-25.json")
server_samples = set(pl.scan_ipc("data/scratch/all_tag_posts.feather").select("host").unique().collect().sample(fraction = 1.0)["host"].to_list())
jm_servers = set(jm["domain"].unique().to_list())
jm_td = TagData(server_samples, 256, min_server_accounts=2)
jm_tfidf = jm_td.bm(n_server_accounts=0, n_servers=2, n_accounts=10)
mat = built_tfidf_matrix(jm_tfidf, jm_td.tag_to_index, jm_td.host_to_index)
m = (mat.T / (scipy.sparse.linalg.norm(mat.T, ord=2, axis=0) + 0.0001))
server_similarlity = cosine_similarity(m.tocsr())
#has_info = np.array((np.sum(mat, axis=1).T > 0).tolist()[0])
tag_use_counts = np.sum(mat > 0, axis=1).T
has_info = (tag_use_counts >= 3).tolist()[0]
tag_names = np.array(list(jm_td.tag_to_index.keys()))[has_info]
m_selected = m.tocsr()[:, has_info]
tag_sm = cosine_similarity(m_selected.T)
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(tag_sm)
clusters = pl.DataFrame({"tag": tag_names, "cluster": ap.labels_, "servers": tag_use_counts[[has_info]].tolist()[0]})
clusters.sort("servers", descending=True).unique("cluster")
tag_index_included = (np.sum(tag_sm, axis=0) > 0)
included_tag_strings = np.array(list(jm_td.tag_to_index.keys()))[tag_index_included]
tag_sm_matrix = tag_sm[np.ix_(tag_index_included, tag_index_included)]
# import Affinity Prop
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(tag_sm_matrix)
clusters = pl.DataFrame({"tag": included_tag_strings, "cluster": ap.labels_})
# select a random element from each cluster
clusters.group_by("cluster").agg([pl.col("tag").shuffle().first().alias("tag")]).sort("cluster")["tag"].to_list()
clusters.group_by("cluster").agg([pl.col("tag").len().alias("count")]).sort("count", descending=True)
clusters.filter(pl.col("servers") >= 10)
#rm.server_names[np.argsort(-cosine_similarity(pos_m[779].reshape(1, -1), server_matrix))].tolist()[0][0:10]

View File

@ -30,7 +30,7 @@ class TagData:
)
all_tag_posts_topn = all_tag_posts.explode("tags").unique(["host", "acct", "tags"]).group_by(["host", "tags"]).agg([
pl.col("id").len().alias("accounts"), # How many accounts on the server are using this tag?
]).sort("accounts", descending=True).with_columns(pl.lit(1).alias("counter")).with_columns(
]).sort(["accounts", "tags"], descending=True).with_columns(pl.lit(1).alias("counter")).with_columns(
pl.col("counter").cumsum().over("host").alias("running_count")
).filter(pl.col("running_count") <= n_tags).drop("counter", "running_count").filter(pl.col("accounts") >= min_server_accounts)
self._all_tag_posts_topn = all_tag_posts_topn

View File

@ -0,0 +1,149 @@
if __name__ == '__main__':
jm = pl.read_json("data/joinmastodon-2023-08-25.json")
jm_servers = set(jm["domain"].unique().to_list())
jm_td = TagData(jm_servers, 256, min_server_accounts=2)
jm_tfidf = jm_td.bm(n_server_accounts=0, n_servers=2, n_accounts=10)
mat = built_tfidf_matrix(jm_tfidf, jm_td.tag_to_index, jm_td.host_to_index)
m = (mat.T / (scipy.sparse.linalg.norm(mat.T, ord=2, axis=0) + 0.0001))
server_similarlity = cosine_similarity(m.tocsr())
tag_use_counts = np.sum(mat > 0, axis=1).T
has_info = (tag_use_counts >= 0).tolist()[0]
tag_names = np.array(list(jm_td.tag_to_index.keys()))[has_info]
m_selected = m.tocsr()[:, has_info]
tag_sm = cosine_similarity(m_selected.T)
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(tag_sm)
clusters = pl.DataFrame({"tag": tag_names, "cluster": ap.labels_, "servers": tag_use_counts[[has_info]].tolist()[0]})
clusters.sort("servers", descending=True).unique("cluster").filter(pl.col("servers") >= 10)
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
lsa = make_pipeline(TruncatedSVD(n_components=10), Normalizer(copy=False))
X_lsa = lsa.fit_transform(m_selected)
pl.DataFrame({
"server": server_names,
#"x": svd.components_[0],
#"y": svd.components_[1],
#"z": svd.components_[2]
"x": X_lsa[:, 0],
"y": X_lsa[:, 1],
"z": X_lsa[:, 2],
}).write_ipc("data/scratch/server_svd.feather")
# Apply SVM to find the tags that provide the most information
X_lsa = lsa.fit_transform(m_selected.T)
pl.DataFrame({
"tag": tag_names,
"x": X_lsa[:, 0],
"y": X_lsa[:, 1],
"z": X_lsa[:, 2],
"variance": np.var(X_lsa, axis=1),
"count": tag_use_counts.tolist()[0]
}).write_ipc("data/scratch/tag_svd.feather")
# import AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=0.7, metric="l2", linkage="average").fit(m.tocsr()[:, has_info].T.toarray())
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=0.01, metric="precomputed", linkage="average").fit(tag_sm)
clusters = pl.DataFrame({"tag": tag_names, "cluster": ac.labels_, "servers": tag_use_counts[[has_info]].tolist()[0]})
clusters.sort("servers", descending=True)[0:10]#.unique("cluster").filter(pl.col("servers") >= 10).sort("cluster")
clusters.sort("servers", descending=True).unique("cluster").filter(pl.col("servers") >= 10).sort("servers")
# Apply SVM to find the tags that provide the most information
has_info = (tag_use_counts >= 10).tolist()[0]
tag_names = np.array(list(jm_td.tag_to_index.keys()))[has_info]
m_selected = m.tocsr()[:, has_info]
U, S, VT = np.linalg.svd(m_selected.toarray(), full_matrices=False)
tag_names[np.argsort(-np.abs(np.sum(VT, axis=1)))]
np.linalg.norm(m_selected.toarray())  # Frobenius norm (compute_uv is an svd argument, not a norm argument)
tag_names[np.argsort(-np.abs(np.var(VT, axis=0)))]
mytags = ["eurovision2023", "lgbtq", "disney", "marvel"]
rank = 5
U_sub = U[:, :rank]
VT_sub = VT[:rank, :]
S_sub = np.diag(S[:rank])
A_low_rank = np.dot(np.dot(U_sub, S_sub), VT_sub)
Vk = VT[:rank, :]
np.linalg.norm(m_selected.toarray() - A_low_rank)  # low-rank reconstruction error
tag_names[np.argsort(np.abs(Vk).sum(axis=0))[::-1]][0:25]
tag_names[np.argsort(Vk[0])[::-1]][0:5]
"""
l = []
for i in range(np.shape(tag_sm)[0] - 1):
l.append(
pl.DataFrame({
"Source": list(tag_names)[i],
"Target": list(tag_names)[i+1:],
"Similarity": tag_sm[i][i+1:]
})
)
similarity_df = pl.concat(l).filter(pl.col("Similarity") > 0.0)
l = []
for i in range(np.shape(server_similarlity)[0] - 1):
#s_index = min(i, np.shape(baseline_similarlity)[0] - 1)
l.append(
pl.DataFrame({
"Source": list(jm_td.host_to_index.keys())[i],
"Target": list(jm_td.host_to_index.keys())[i+1:],
"Similarity": server_similarlity[i][i+1:]
})
)
similarity_df = pl.concat(l).filter(pl.col("Similarity") > 0.0)
jm = pl.read_json("data/joinmastodon-2023-08-25.json")
server_samples = set(pl.scan_ipc("data/scratch/all_tag_posts.feather").select("host").unique().collect().sample(fraction = 1.0)["host"].to_list())
jm_servers = set(jm["domain"].unique().to_list())
jm_td = TagData(server_samples, 256, min_server_accounts=2)
jm_tfidf = jm_td.bm(n_server_accounts=0, n_servers=2, n_accounts=10)
mat = built_tfidf_matrix(jm_tfidf, jm_td.tag_to_index, jm_td.host_to_index)
m = (mat.T / (scipy.sparse.linalg.norm(mat.T, ord=2, axis=0) + 0.0001))
server_similarlity = cosine_similarity(m.tocsr())
#has_info = np.array((np.sum(mat, axis=1).T > 0).tolist()[0])
tag_use_counts = np.sum(mat > 0, axis=1).T
has_info = (tag_use_counts >= 3).tolist()[0]
tag_names = np.array(list(jm_td.tag_to_index.keys()))[has_info]
m_selected = m.tocsr()[:, has_info]
tag_sm = cosine_similarity(m_selected.T)
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(tag_sm)
clusters = pl.DataFrame({"tag": tag_names, "cluster": ap.labels_, "servers": tag_use_counts[[has_info]].tolist()[0]})
tag_index_included = (np.sum(tag_sm, axis=0) > 0)
included_tag_strings = np.array(list(jm_td.tag_to_index.keys()))[tag_index_included]
tag_sm_matrix = tag_sm[np.ix_(tag_index_included, tag_index_included)]
# import Affinity Prop
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(tag_sm_matrix)
clusters = pl.DataFrame({"tag": included_tag_strings, "cluster": ap.labels_})
# select a random element from each cluster
clusters.group_by("cluster").agg([pl.col("tag").shuffle().first().alias("tag")]).sort("cluster")["tag"].to_list()
clusters.group_by("cluster").agg([pl.col("tag").len().alias("count")]).sort("count", descending=True)
clusters.filter(pl.col("servers") >= 10)
"""

438
icwsm.qmd
View File

@ -20,7 +20,7 @@ format:
keep-md: true
link-citations: false
abstract: |
When trying to join Mastodon, a decentralized collection of interoperable social networking servers, new users face the dilemma of choosing a home server. Using trace data from millions of new Mastodon accounts, we show that new accounts are less likely to remain active on the network's largest general instances compared to others. Additionally, we observe a trend of users migrating from larger to smaller servers. Addressing the challenge of onboarding and server selection, the paper proposes a decentralized recommendation system for servers using hashtags and the Okapi BM25 algorithm. This system leverages servers' top hashtags and their frequency to create a recommendation mechanism that respects Mastodon's decentralized ethos. Simulations demonstrate that such a tool can be effective even with limited data on each local server.
When trying to join Mastodon, a decentralized collection of interoperable social networking servers, new users face the dilemma of choosing a home server. Using trace data from millions of new Mastodon accounts, we show that new accounts are less likely to remain active on the network's largest general instances compared to others. Additionally, we observe a trend of users migrating from larger to smaller servers. Addressing the challenge of onboarding and server selection, the paper proposes a decentralized recommendation system for servers using hashtags and the Okapi BM25 algorithm. This system leverages servers' top hashtags and their frequency to create a recommendation mechanism that respects Mastodon's decentralized ethos.
execute:
echo: false
error: false
@ -34,4 +34,438 @@ knitr:
verbose: true
---
{{< include _article.qmd >}}
```{r}
#| label: setup
profile <- Sys.getenv("QUARTO_PROFILE", unset="acm")
if (profile == "acm") {
class_wide <- ".column-body"
} else {
class_wide <- ".column-page"
}
envs <- Sys.getenv()
```
# Introduction
Following Twitter's 2022 acquisition, Mastodon---an open-source, decentralized social network and microblogging community---saw an increase in activity and attention as a potential Twitter alternative [@heFlockingMastodonTracking2023; @cavaDriversSocialInfluence2023]. While millions of new accounts significantly increased the size of the network, many newcomers found the process confusing and did not remain active. Unlike centralized social media platforms, Mastodon is a network of independent servers, each with their own rules and norms [@nicholsonMastodonRulesCharacterizing2023], which can communicate with each other using the shared ActivityPub protocols. Although accounts can move between Mastodon servers, the local experience can vary widely from server to server.
Attracting and retaining newcomers is a key challenge for online communities [@krautBuildingSuccessfulOnline2011 p. 182]. On Mastodon, the onboarding process has not always been straightforward: variation among servers means newcomers may not even be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022]. Various guides and resources for people trying to join Mastodon offer mixed advice on choosing a server. Some suggest that the most important thing is to simply join any server and work from there [@krasnoffMastodon101How2022; @silberlingBeginnerGuideMastodon2023]; others have created tools and guides to help people find potential servers of interest by size and location [@thekinrarMastodonInstances; @kingMastodonMe2024].
Mastodon's decentralized design has long been in tension with the disproportionate popularity of a small set of large, general-topic servers within the system [@ramanChallengesDecentralisedWeb2019a]. Analysing the activity of new accounts that join the network, we find that users who sign up on such servers are less likely to remain active after 91 days. We also find that users who move accounts tend to gravitate toward smaller, more niche servers over time, suggesting that established users may also find additional utility from such servers.
In response to these findings, we propose a potential extension to Mastodon to facilitate server and tag recommendations by having each server report their most popular local hashtags. This recommendation system could both help newcomers find servers that match their interests and help established accounts discover "neighborhoods" of related servers.
# Background
## Empirical Setting
The Fediverse is a set of decentralized online social networks which interoperate using shared protocols like ActivityPub. Mastodon is a software program used by many Fediverse servers and offers a user experience similar to the Tweetdeck client for Twitter. It was first created in late 2016 and saw a surge in interest in 2022 during and after Elon Musk's Twitter acquisition.
Mastodon features three kinds of timelines: a "home" timeline which shows all posts from accounts followed by the user; a "local" timeline which shows all public posts from the local server; and a "federated" timeline which includes all posts from users followed by other users on their server. The local timeline is unique to each server. On larger servers, this timeline can be unwieldy; however, on smaller servers, it presents the opportunity to discover new posts and users of potential interest.
Discovery has been challenging on Mastodon. Text search, for instance, was impossible on most servers until support for this feature was added on an opt-in basis using Elasticsearch in late 2023 [@rochkoMastodon2023]. Recommendation systems remain a relatively novel problem in the context of decentralized online social networks. @trienesRecommendingUsersWhom2018 developed a recommendation system for finding new accounts to follow on the Fediverse which used collaborative filtering based on BM25, an early example of a content discovery system on Mastodon.
Individual Mastodon servers can have an effect on the end experience of users. For example, some servers may choose to federate with some servers but not others, altering the topology of the Fediverse network for their users. At the same time, accounts need not be locked into one specific server. Because of Mastodon's data portability, users can move their accounts freely between servers while retaining their followers, though their post history remains with their original account.
## The Mastodon Migrations
Mastodon saw a surge in interest in 2022 and 2023, particularly after Elon Musk's Twitter acquisition. Four events of interest drove measurable increases in new users to the network: the announcement of the acquisition (April 14, 2022), the closing of the acquisition (October 27, 2022), a day when Twitter suspended a number of prominent journalists (December 15, 2022), and a day when Twitter experienced an outage and started rate limiting accounts (July 1, 2023). Many Twitter accounts announced they were setting up Mastodon accounts and linked their new accounts to their followers, often using tags like #TwitterMigration [@heFlockingMastodonTracking2023], driving interest in Mastodon in a process @cavaDriversSocialInfluence2023 found consistent with social influence theory.
Some media outlets have framed reports on Mastodon [@hooverMastodonBumpNow2023] through what @zulliRethinkingSocialSocial2020 calls the "Killer Hype Cycle", whereby the media finds a new alternative social media platform, declares it a potential killer of some established platform, and later calls it a failure if it does not displace the existing platform. Such framing fails to take systems like the Fediverse seriously for their own merits: completely replacing existing commercial systems is not the only way to measure success, nor does it account for the real value the Fediverse provides for its millions of active users.
Mastodon's approach to onboarding has also changed over time. In much of 2020 and early 2021, the Mastodon developers closed sign-ups to their flagship server and linked to an alternative server, which saw increased sign-ups during this period. They also linked to a list of servers on the "Join Mastodon" webpage [@mastodonggmbhServers], where all servers are pre-approved and follow the Mastodon Server Covenant which guarantees certain content moderation standards and data protections. Starting in 2023, the Mastodon developers shifted toward making the flagship server the default when people sign up on the official Mastodon Android and iOS apps [@rochkoNewOnboardingExperience2023; @rothItGettingEasier2023].
## Newcomers in Online Communities
Onboarding newcomers is an important part of the life cycle of online communities. Any community can expect a certain amount of turnover, and so it is important for the long-term health and longevity of the community to be able to bring in new members [@krautBuildingSuccessfulOnline2011 p. 182]. However, the process of onboarding newcomers is not always straightforward.
The series of migrations of new users into Mastodon in many ways reflects folk stories of "Eternal Septembers" on previous communication networks, where a large influx of newcomers challenged the existing norms [@driscollWeMisrememberEternal2023; @kieneSurvivingEternalSeptember2016]. Many Mastodon servers do have specific norms which people coming from Twitter may find confusing, such as local norms around content warnings [@nicholsonMastodonRulesCharacterizing2023]. Variation among servers can also present a challenge for newcomers who may not even be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022]. Mastodon servers open to new accounts must thus accommodate newcomers while at the same time ensuring the propagation of their norms and culture, either through social norms or through technical means.
# Data
```{r}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panels shows the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel represent the annoucement day of the Elon Musk Twitter acquisition, the acquisition closing day, a day where Twitter suspended a number of prominent journalist, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 2.75
#| fig-width: 6.75
#| fig-env: figure*
#| fig-pos: tb!
library(here)
source(here("code/helpers.R"))
account_timeline_plot()
```
Mastodon has an extensive API which allows for the collection of public posts and account information. We collected data from the public timelines of Mastodon servers using the Mastodon API with a crawler which runs once per day. We also collected account information from the opt-in public profile directories on these servers.
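As an illustration, the sketch below polls a server's public local timeline through Mastodon's `GET /api/v1/timelines/public` endpoint; the host, paging depth, and error handling are simplifying assumptions rather than the exact crawler used for this dataset.
```python
# Minimal sketch of collecting public posts from one Mastodon server.
# Endpoint and parameters follow the public Mastodon API; the paging
# depth and host are illustrative assumptions, not our exact crawler.
import requests

def fetch_local_timeline(host: str, pages: int = 5, limit: int = 40):
    statuses, max_id = [], None
    for _ in range(pages):
        params = {"local": "true", "limit": limit}
        if max_id:
            params["max_id"] = max_id  # page backward through older statuses
        resp = requests.get(f"https://{host}/api/v1/timelines/public",
                            params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        statuses.extend(batch)
        max_id = batch[-1]["id"]  # oldest status returned in this batch
    return statuses

# statuses = fetch_local_timeline("mastodon.social")
```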
```{r}
#| label: data-counts
#| cache: true
library(arrow)
library(tidyverse)
library(here)
source(here("code/helpers.R"))
accounts <- load_accounts(filt = FALSE) %>%
filter(created_at >= "2020-08-14") %>%
filter(created_at < "2024-01-01")
tag_posts <- "data/scratch/all_tag_posts.feather" %>%
arrow::read_ipc_file(. , col_select = c("host", "acct", "created_at")) %>%
filter(created_at >= as.Date("2023-05-01")) %>%
filter(created_at < as.Date("2023-08-01"))
text_format <- function(df) {
return (format(nrow(df), big.mark=","))
}
num_tag_posts <- tag_posts %>% text_format()
num_tag_accounts <- tag_posts %>% distinct(host, acct) %>% text_format()
num_tag_servers <- tag_posts %>% distinct(host) %>% text_format()
num_accounts_unfilt <- accounts %>% text_format()
num_account_bots <- accounts %>% filter(bot) %>% text_format()
num_account_nostatuses <- accounts %>% filter(is.na(last_status_at)) %>% text_format()
num_account_suspended <- accounts %>% mutate(suspended = replace_na(suspended, FALSE)) %>% filter(suspended) %>% text_format()
num_accounts_moved <- accounts %>% filter(has_moved) %>% text_format()
num_account_limited <- accounts %>% filter(limited) %>% text_format()
num_account_samedaystatus <- accounts %>% filter(last_status_at <= created_at) %>% text_format()
num_account_filt <- load_accounts(filt = TRUE) %>% text_format()
```
**Mastodon Profiles**: We collected accounts using data previously collected from posts on public Mastodon timelines from October 2020 to August 2023. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved as of February 2024. Through this process, we discovered a total of `r num_accounts_unfilt` accounts created between August 14, 2020 and January 1, 2024. We then filtered out accounts which were bots (`r num_account_bots` accounts), had been suspended (`r num_account_suspended` accounts), had been marked as moved to another account (`r num_accounts_moved` accounts), had been limited by their local server (`r num_account_limited` accounts), had no statuses (`r num_account_nostatuses` accounts), or had posted their last status on the same day as their account creation (`r num_account_samedaystatus` accounts). This gave us a total of `r num_account_filt` accounts which met all the filtering criteria. Note that because we got updated information on each account, we include only accounts on servers which still existed at the time of our profile queries and which returned records for the account.
**Tags**: Mastodon supports hashtags, which are user-generated metadata tags that can be added to posts. Clicking the link for a tag shows a stream of posts which also have that tag from the federated timeline, which includes accounts on the same server and posts from accounts followed by the accounts on the local server. We collected `r num_tag_posts` statuses posted by `r num_tag_accounts` accounts on `r num_tag_servers` unique servers between May and July 2023 which contained at least one hashtag.
# Analysis and Results
## Survival Model
*Are accounts on suggested general servers less likely to remain active than accounts on other servers?*
```{r, cache.extra = tools::md5sum("code/survival.R")}
#| cache: true
#| label: fig-survival
#| fig-env: figure
#| fig-cap: "Survival probabilities for accounts created during May 2023."
#| fig-width: 3.375
#| fig-height: 2.5
#| fig-pos: h!
library(here)
source(here("code/survival.R"))
plot_km
```
```{r}
#| label: table-coxme
library(ehahelper)
library(broom)
cxme_table <- tidy(cxme) %>%
mutate(conf.low = exp(conf.low), conf.high=exp(conf.high)) %>%
mutate(term = case_when(
term == "factor(group)1" ~ "Join Mastodon",
term == "factor(group)2" ~ "General Servers",
term == "small_serverTRUE" ~ "Small Server",
TRUE ~ term
)) %>%
mutate(exp.coef = paste("(", round(conf.low, 2), ", ", round(conf.high, 2), ")", sep="")) %>%
select(term, estimate, exp.coef , p.value)
```
Using `r text_format(sel_a)` accounts created from May 1 to June 30, 2023, we create a Kaplan-Meier estimator for the probability that an account will remain active based on whether the account is on one of the largest general instances[^1] featured at the top of the Join Mastodon webpage or, otherwise, on another server in the Join Mastodon list. Accounts are considered active if they have made at least one post after the censoring period of `r active_period` days after account creation.
[^1]: `r paste(general_servers, collapse=", ")`
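For reference, a minimal sketch of this kind of Kaplan-Meier estimate using the Python `lifelines` package is shown below; the analysis in the paper itself is run in R (see `code/survival.R`), and the column names and grouping variable here are illustrative assumptions.
```python
# Minimal Kaplan-Meier sketch with lifelines; column names are assumptions.
# `duration` is days from account creation to the last observed post
# (censored at the end of the observation window) and `active` marks
# whether the account was still posting at that point.
import pandas as pd
from lifelines import KaplanMeierFitter

def km_by_group(df: pd.DataFrame) -> dict:
    fits = {}
    for group, sub in df.groupby("server_group"):
        kmf = KaplanMeierFitter()
        # The "event" is the account going inactive, so still-active accounts are censored.
        kmf.fit(sub["duration"], event_observed=~sub["active"], label=group)
        fits[group] = kmf
    return fits

# fits = km_by_group(accounts)
# fits["general"].survival_function_.head()
```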
::: {.content-visible unless-profile="icwsm"}
::: {#tbl-cxme .column-body}
```{r}
if (knitr::is_latex_output()) {
cxme_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
cxme_table %>% knitr::kable(digits = 3)
}
```
Coefficients for the Cox Proportional Hazard Model with Mixed Effects. The model includes a random effect for the server.
:::
We also construct a Mixed Effects Cox Proportional Hazard Model
$$
h(t_{ij}) = h_0(t) \exp\left(\begin{aligned}
&\beta_1 \text{Join Mastodon} \\
&+ \beta_2 \text{General Servers} \\
&+ \beta_3 \text{Small Server} \\
&+ b_{j}
\end{aligned}\right)
$$
where $h(t_{ij})$ is the hazard for account $i$ on server $j$ at time $t$, $h_0(t)$ is the baseline hazard, $\beta_1$ is the coefficient for whether the account is on a server featured on Join Mastodon, $\beta_2$ is the coefficient for whether the account is on one of the largest general instances, $\beta_3$ is the coefficient for whether the account is on a small server with less than 100 accounts, and $b_{j}$ is the random effect for server $j$.
<!-- with coefficients for whether the account is on a small server (less than a hundred accounts), and whether the account in featured on JoinMastodon or is featured as one of the largest general instances. -->
We again find that accounts on the largest general instances are less likely to remain active than accounts on other servers, while accounts created on smaller servers are more likely to remain active.
:::
## Moved Accounts
*Do accounts tend to move to larger or smaller servers?*
Mastodon users can move their accounts to another server while retaining their connections (but not their posts) to other Mastodon accounts. This feature, built into the Mastodon software, offers data portability and helps avoid lock-in.
```{r}
#| label: table-ergm-table
#| echo: false
#| warning: false
#| message: false
#| error: false
library(here)
library(modelsummary)
library(kableExtra)
library(purrr)
library(stringr)
load(file = here("data/scratch/ergm-model-early.rda"))
load(file = here("data/scratch/ergm-model-late.rda"))
if (knitr::is_latex_output()) {
format <- "latex_tabular"
} else {
format <- "html"
}
x <- modelsummary(
list("Coef." = model.early, "Std.Error" = model.early),#, "Coef." = model.late, "Std.Error" = model.late),
estimate = c("{estimate}", "{stars}{std.error}"),# "{estimate}", "{stars}{std.error}"),
statistic = NULL,
gof_omit = ".*",
coef_rename = c(
"sum" = "Sum",
"nonzero" = "Nonzero",
"diff.sum0.h-t.accounts" = "Smaller server",
"nodeocov.sum.accounts" = "Server size\n(outgoing)",
"nodeifactor.sum.registrations.TRUE" = "Open registrations\n(incoming)",
"nodematch.sum.language" = "Languages match"
),
align="lrr",
stars = c('*' = .05, '**' = 0.01, '***' = .001),
output = format
)# %>% add_header_above(c(" " = 1, "Model A" = 2, "Model B" = 2))
```
:::: {#tbl-ergm-table }
```{r}
x # `r class_wide`
```
Exponential family random graph models for account movement between Mastodon servers. Accounts were created in May 2022 and moved to another account at some later point.
::::
To corroborate our findings, we also use data from thousands of accounts which moved between Mastodon servers, taking advantage of the data portability of the platform. Conceiving of these moved accounts as edges within a weighted directed network where nodes represent servers, edges represent accounts, and weights represent the number of accounts that moved between servers, we construct an exponential family random graph model (ERGM) with terms for server size, open registrations, and language match between servers. We find that accounts are more likely to move from larger servers to smaller servers.
# Proposed Recommendation System
*How can we build an opt-in, low-resource recommendation system for finding Fediverse servers?*
Based on these findings, we suggest a need for better ways for newcomers to find servers and propose a viable way to create server and tag recommendations on Mastodon. This system could both help newcomers find servers that match their interests and help established accounts discover "neighborhoods" of related servers.
One challenge in building such a system is the decentralized nature of the network. A single, central actor which collects data from servers and then distributes recommendations would be antithetical to the decentralized ethos of Mastodon. Instead, we propose a system where servers can report the top hashtags by the number of unique accounts on the server using them during the last three months; a hypothetical report is sketched below. Such a system would be opt-in and require few additional server resources since tags already have their own database table.
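The example below shows what such a report could look like; the field names, window format, tags, and counts are all illustrative assumptions, not an existing Mastodon feature or API.
```python
# Hypothetical shape of a server's opt-in report: its top local hashtags and
# the number of unique local accounts using each over the last three months.
# All field names, tags, and counts here are illustrative.
example_report = {
    "server": "hci.social",
    "window": {"start": "2023-05-01", "end": "2023-08-01"},
    "tags": [
        {"tag": "hci", "accounts": 57},
        {"tag": "introduction", "accounts": 43},
        {"tag": "academia", "accounts": 38},
    ],
}
```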
## Recommendation System Design
We use Okapi BM25 to construct a term frequency-inverse document frequency (TF-IDF) model that associates the top tags with each server, using counts of tag-account pairs from each server for the term frequency and the number of servers that use each tag for the inverse document frequency. We then L2-normalize the vectors for each tag and calculate the cosine similarity between servers based on their tag vectors.
$$
tf = \frac{f_{t,s} \cdot (k_1 + 1)}{f_{t,s} + k_1 \cdot (1 - b + b \cdot \frac{|s|}{avgstl})}
$$
where $f_{t,s}$ is the number of accounts using the tag $t$ on server $s$, $k_1$ and $b$ are tuning parameters, and $avgstl$ is the average sum of account-tag pairs across servers. For the inverse document frequency, we use the following formula:
$$
idf = \log \frac{N - n + 0.5}{n + 0.5}
$$
where $N$ is the total number of servers and $n$ is the number of servers where the tag appears as one of the top tags. We then apply L2 normalization:
$$
tfidf = \frac{tf \cdot idf}{\| tf \cdot idf \|_2}
$$
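The sketch below shows one way to assemble these weights into a server-by-tag matrix, assuming a sparse count matrix whose entry $(s, t)$ is the number of accounts on server $s$ using tag $t$; the function name and the default values of $k_1$ and $b$ are illustrative assumptions rather than the exact implementation used here.
```python
# Sketch of the BM25 weighting described above. `counts[s, t]` is the number
# of accounts on server s using tag t; the k1 and b defaults are assumptions.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

def bm25_weights(counts: csr_matrix, k1: float = 1.2, b: float = 0.75) -> csr_matrix:
    n_servers = counts.shape[0]
    server_len = np.asarray(counts.sum(axis=1)).ravel()   # |s|: account-tag pairs per server
    avgstl = server_len.mean()                            # average over servers
    n = np.asarray((counts > 0).sum(axis=0)).ravel()      # servers where each tag appears
    idf = np.log((n_servers - n + 0.5) / (n + 0.5))       # can go negative for very common tags
    coo = counts.tocoo()
    f = coo.data.astype(float)
    tf = f * (k1 + 1) / (f + k1 * (1 - b + b * server_len[coo.row] / avgstl))
    weighted = csr_matrix((tf * idf[coo.col], (coo.row, coo.col)), shape=counts.shape)
    return normalize(weighted, norm="l2", axis=1)         # L2-normalize each server's vector

# tfidf = bm25_weights(counts)
# server_similarity = cosine_similarity(tfidf)  # pairwise server similarity
```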
## Applications
```{r}
#| eval: false
library(tidyverse)
library(igraph)
library(arrow)
sim_servers <- "data/scratch/server_similarity.feather" %>% arrow::read_ipc_file() %>% rename("weight" = "Similarity")
#sim_net <- as.network(sim_servers)
g <- graph_from_data_frame(sim_servers, directed = FALSE)
g_strength <- log(sort(strength(g)))
normalized_strength <- (g_strength - min(g_strength)) / (max(g_strength) - min(g_strength))
server_centrality <- enframe(normalized_strength, name="server", value="strength")
server_centrality %>% arrow::write_ipc_file("data/scratch/server_centrality.feather")
```
### Server Similarity Neighborhoods
Mastodon provides two feeds in addition to a user's home timeline populated by accounts they follow: a local timeline with all public posts from their local server and a federated timeline which includes all posts from users followed by other users on their server. We suggest a third kind of timeline, a *neighborhood timeline*, which filters the federated timeline by topic. We calculate the pairwise similarity between two servers using cosine similarity.
::: {#tbl-sim-servers .content-visible unless-profile="icwsm"}
```{r}
#| label: table-sim-servers
library(tidyverse)
library(arrow)
sim_servers <- "data/scratch/server_similarity.feather" %>% arrow::read_ipc_file()
server_of_interest <- "hci.social"
server_table <- sim_servers %>%
arrange(desc(Similarity)) %>%
filter(Source == server_of_interest | Target == server_of_interest) %>%
head(5) %>%
pivot_longer(cols=c(Source, Target)) %>%
filter(value != server_of_interest) %>%
select(value, Similarity) %>%
rename("Server" = "value")
if (knitr::is_latex_output()) {
server_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
server_table %>% knitr::kable(digits = 3)
}
```
Top five servers most similar to hci.social
:::
### Tag Similarity
We also calculate the similarity between tags using the same method. This can be used to suggest related tags to users based on their interests or tags related to already selected tags in the recommendation system.
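Under the same assumptions as the earlier sketch (servers as rows and tags as columns of the weighted matrix `tfidf`), tag-tag similarity is simply the cosine similarity between columns:
```python
# Tag-tag similarity from the same weighted matrix; assumes `tfidf` from the
# earlier sketch with servers as rows and tags as columns.
from sklearn.metrics.pairwise import cosine_similarity
tag_similarity = cosine_similarity(tfidf.T)
```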
### Server Discovery
Given a set of popular tags and a list of servers, we build a recommendation system[^rec] where users select tags from a list of popular tags and receive server suggestions. The system first creates a subset of vectors based on the TF-IDF matrix which represents the top clusters of topics. After a user selects the top tags of interest to them, it suggests servers which match their preferences using the singular value decomposition (SVD) of the TF-IDF matrix.
[^rec]: A live demo for the system is available at https://cheesecake.live/files/jsdemo/witch.html
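A minimal sketch of this suggestion step is given below; it assumes the weighted server-by-tag matrix and name lists from the earlier sketches, and the rank, the averaging of selected tag vectors, and all variable names are illustrative assumptions rather than the demo's exact implementation.
```python
# Sketch of SVD-based server suggestions: project servers and tags into a
# shared latent space and rank servers by similarity to the selected tags.
# Assumes `tfidf` (servers x tags), `server_names`, and `tag_names` (lists)
# from the earlier sketches; k and top_n are assumptions.
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import cosine_similarity

def suggest_servers(tfidf, server_names, tag_names, selected_tags, k=50, top_n=10):
    u, s, vt = svds(tfidf, k=k)        # low-rank factorization of the matrix
    server_vecs = u * s                # server positions in the latent space
    tag_vecs = vt.T                    # tag positions in the latent space
    idx = [tag_names.index(t) for t in selected_tags]
    profile = tag_vecs[idx].mean(axis=0, keepdims=True)  # user interest vector
    sims = cosine_similarity(profile, server_vecs).ravel()
    order = np.argsort(-sims)[:top_n]
    return [(server_names[i], float(sims[i])) for i in order]

# suggest_servers(tfidf, server_names, tag_names, ["hci", "academia"])
```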
::: {.content-visible unless-profile="icwsm"}
## Robustness to Limited Data
```{r}
#| label: fig-simulations-rbo
#| fig-env: figure*
#| cache: true
#| fig-width: 6.75
#| fig-height: 3
#| fig-pos: tb
library(tidyverse)
library(arrow)
simulations <- arrow::read_ipc_file("data/scratch/simulation_rbo.feather")
simulations %>%
group_by(servers, tags, run) %>% summarize(rbo=mean(rbo), .groups="drop") %>%
mutate(ltags = as.integer(log2(tags))) %>%
ggplot(aes(x = factor(ltags), y = rbo, fill = factor(ltags))) +
geom_boxplot() +
facet_wrap(~servers, nrow=1) +
#scale_y_continuous(limits = c(0, 1)) +
labs(x = "Tags (log2)", y = "RBO", title = "Rank Biased Overlap with Baseline Rankings by Number of Servers") +
theme_minimal() + theme(legend.position = "none")
```
A challenge for a federated recommendation system like the one we propose is that it needs buy-in from a sufficient number of servers to provide value. There is also a tradeoff between the number of tags to expose for each server and potential concerns about exposing too much data.
We simulated various scenarios that limit both the servers that report data and the number of tags they report. We then used rank-biased overlap (RBO) to compare the outputs from these simulations to the baseline with more complete information from all tags on all servers [@webberSimilarityMeasureIndefinite2010]. In particular, we gave higher weight to suggestions with a higher rank, with weights decaying geometrically in rank with a persistence parameter of $p = 0.80$. @fig-simulations-rbo shows how the average agreement with the baseline, which takes the top 256 tags from each server, scales with the number of reporting servers and reported tags.
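For concreteness, a finite-prefix RBO sketch with persistence parameter `p` is shown below; ignoring the residual weight beyond the observed depth is a simplification of this sketch, not of the reported simulations.
```python
# Finite-prefix rank-biased overlap (RBO). `p` is the persistence parameter;
# smaller p concentrates more weight at the top of the rankings. Ignoring the
# residual beyond the observed depth is a simplification of this sketch.
def rbo(ranking_a, ranking_b, p=0.80):
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d   # A_d: overlap proportion at depth d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

# rbo(["hci", "academia", "art"], ["hci", "art", "music"], p=0.80)
```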
:::
# Discussion
The analysis could also be improved by focusing on the factors that lead to accounts remaining active or dropping out, with a particular focus on the actual activity of accounts over time. For instance, do accounts that interact with other users more remain active longer? Are there particular markers of activity that are more predictive of account retention? Future work could use these to suggest ways to help newcomers during the onboarding process.
The observational nature of the data limit some of the causal claims we can make. It is unclear, for instance, if accounts on general servers are less likely to remain active because of the server itself or because of the type of users who join such servers. For example, it is conceivable that the kind of person who spends more time researching which server to join is more invested in their Mastodon experience than one who simply joins the first server they find.
Future work is necessary to determine how well the recommendation system helps users find servers that match their interests. This may involve user studies and interviews to determine how well the system works in practice.
While the work presented here is based on observed posts on the public timelines, simulations may be helpful in determining the robustness of the system to targeted attacks. Due to the decentralized nature of the system, it is feasible that a bad actor could set up zombie accounts on servers to manipulate the recommendation system. Simulations could help determine how well the system can resist such attacks and ways to mitigate this risk.
# Conclusion
Based on analysis of trace data from millions of new Fediverse accounts, we find evidence that suggests that servers matter and that users tend to move from larger servers to smaller servers. We then propose a recommendation system that can help new Fediverse users find servers with a high probability of being a good match based on their interests. Based on simulations, we demonstrate that such a tool can be effectively deployed in a federated manner, even with limited data on each local server.
# References {#references}
::: {.content-visible unless-profile="icwsm"}
# Glossary {.appendix}
*ActivityPub*: A decentralized social networking protocol based on the ActivityStreams 2.0 data format.
*Fediverse*: A set of decentralized online social networks which interoperate using shared protocols like ActivityPub.
*Mastodon*: An open-source, decentralized social network and microblogging community.
*Hashtag*: A user-generated metadata tag that can be added to posts.
*Federated timeline*: A timeline which includes all posts from users followed by other users on their server.
*Local timeline*: A timeline with all public posts from the local server.
:::
::: {.content-visible when-format="html"}
```{r}
library(tidyverse)
library(arrow)
library(ggrepel)
"data/scratch/server_svd.feather" %>% arrow::read_ipc_file() %>%
as_tibble %>%
ggplot(aes(x = x, y = y, label = server)) +
geom_text_repel(size = 2, max.overlaps = 10) +
#geom_point() +
theme_minimal()
```
```{r}
library(tidyverse)
library(arrow)
library(ggrepel)
library(here)
library(jsonlite)
top_tags <- "data/scratch/tag_svd.feather" %>% arrow::read_ipc_file() %>%
as_tibble %>%
mutate(s = variance * log(count)) %>% arrange(desc(s))
top_tags %>%
select(tag, index) %>%
jsonlite::write_json(here("recommender/data/top_tags.json"))
top_tags %>%
head(100) %>%
ggplot(aes(x = x, y = y, label = tag)) +
geom_text_repel(size = 3, max.overlaps = 10) +
#geom_point() +
theme_minimal()
```
:::

Binary file not shown.

After

Width:  |  Height:  |  Size: 509 KiB

BIN
images/network_types.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 241 KiB

View File

@ -1,442 +1,559 @@
---
title: "Onboarding the Fediverse"
subtitle: "Building community discovery in decentralized online social networks"
title: "Do Servers Matter on Mastodon?"
subtitle: "Data-driven Design for Decentralized Social Media"
author: "Carl Colglazier"
institute:
- "Community Data Science Collective"
- "Northwestern University"
date: "2024-03-14"
bibliography: ../references.bib
title-slide-attributes:
data-background: "#4c3854"
format:
revealjs:
theme: presentation.scss
#embed-resources: true
width: 1600
height: 900
date-format: long
margin: 0.2
center-title-slide: false
#disable-layout: true
theme: [default, presentation.scss]
slide-number: true
keep-md: true
pdf-max-pages-per-slide: 1
reference-location: document
template-partials:
- title-slide.html
knitr:
opts_chunk:
dev: "ragg_png"
dev: "svg" #"ragg_png"
retina: 1
dpi: 200
dpi: 300
execute:
freeze: auto
cache: true
echo: false
fig-width: 5
fig-height: 6
# fig-width: 5
# fig-height: 6
prefer-html: true
---
## Growth on the Fediverse
## Goals for Today
::: {.big}
```{r}
#| label: fig-account-timeline
#| fig-height: 3
#| fig-width: 6.75
- Contextualize work on decentralized online social networks like Mastodon
library(arrow)
library(tidyverse)
library(lubridate)
library(scales)
library(here)
source(here("code/helpers.R"))
- Present a data-driven analysis of server choice on Mastodon
jm <- arrow::read_feather(here("data/scratch/joinmastodon.feather"))
moved_to <- arrow::read_feather(here("data/scratch/individual_moved_accounts.feather"))
accounts_unfilt <- arrow::read_feather(
here("data/scratch/all_accounts.feather"),
col_select=c(
"server", "username", "created_at", "last_status_at",
"statuses_count", "has_moved", "bot", "suspended",
"following_count", "followers_count", "locked",
"noindex", "group", "discoverable"
))
accounts <- accounts_unfilt %>%
filter(!bot) %>%
# TODO: what's going on here?
filter(!is.na(last_status_at)) %>%
mutate(suspended = replace_na(suspended, FALSE)) %>%
# sanity check
filter(created_at >= "2020-10-01") %>%
filter(created_at < "2024-01-01") %>%
# We don't want accounts that were created and then immediately stopped being active
filter(statuses_count >= 1) %>%
filter(last_status_at >= created_at) %>%
mutate(active = last_status_at >= "2024-01-01") %>%
mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2024-01-01 00:00:00", tz = "UTC"), last_status_at)) %>%
mutate(active_time = difftime(last_status_at, created_at, units="days")) #%>%
#filter(!has_moved)
acc_data <- accounts %>%
#filter(!has_moved) %>%
mutate(created_month = format(created_at, "%Y-%m")) %>%
mutate(created_week = floor_date(created_at, unit = "week")) %>%
mutate(active_now = active) %>%
mutate(active = active_time >= 45) %>%
mutate("Is mastodon.social" = server == "mastodon.social") %>%
mutate(jm = server %in% jm$domain) %>%
group_by(created_week) %>%
summarize(
`JoinMastodon Server` = sum(jm) / n(),
`Is mastodon.social` = sum(`Is mastodon.social`)/n(),
Suspended = sum(suspended)/n(),
Active = (sum(active)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
active_now = (sum(active_now)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
Moved=sum(has_moved)/n(),
count=n()) %>%
pivot_longer(cols=c("JoinMastodon Server", "active_now", "Active", "Moved", "Is mastodon.social"), names_to="Measure", values_to="value") # "Suspended"
- Introduce a recommendation system for server choice
p1 <- acc_data %>%
ggplot(aes(x=as.Date(created_week), group=1)) +
geom_line(aes(y=value, group=Measure, color=Measure)) +
geom_point(aes(y=value, color=Measure), size=0.7) +
scale_y_continuous(limits = c(0, 1.0)) +
labs(y="Proportion") + scale_x_date(labels=date_format("%Y-%U"), breaks = "8 week") +
theme_bw_small_labels() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2 <- acc_data %>%
distinct(created_week, count) %>%
ggplot(aes(x=as.Date(created_week), y=count)) +
geom_bar(stat="identity", fill="black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-10-27"))),
linetype="dashed", color = "black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-04-14"))),
linetype="dashed", color = "black") +
# https://twitter.com/elonmusk/status/1675187969420828672
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-12-15"))),
linetype="dashed", color = "black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2023-07-01"))),
linetype="dashed", color = "black") +
#scale_y_continuous(limits = c(0, max(acc_data$count) + 100000)) +
scale_y_continuous(labels = scales::comma) +
labs(y="Count", x="Created Week") +
theme_bw_small_labels() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_date(labels=date_format("%Y-%U"), breaks = "8 week")
library(patchwork)
p1 + p2 + plot_layout(ncol = 1)
```
## The Million Account Elephant in the Room
::::: {.columns}
::: {.column width="40%"}
![](images/mastodon-social-signups-2020-11-01.png)
:::
:::: {.column width="60%"}
::: {.smaller}
Mastodon.social (MS), the flagship server from the Mastodon developers, has always been the largest Mastodon server.
The server has been closed to new registrations many times throughout the years.
- Discuss directions for future work
:::
# The Big Picture {.center}
What is decentralized social media and why does it matter?
![Figure from @baranDistributedCommunicationsNetworks1964](images/network_types.png)
## Emergence of the Social Web
:::::: {.spread}
Internet technologies are _sociotechnical_ systems.
The social internet as we know it today emerged both from the development of **protocols** and systems [@abbateInventingInternet2000] and from thousands of largely non-commercial **social communities** [@driscollModemWorldPrehistory2022].
:::: {.columns}
::: {.column width=33%}
:::: {.fragment fragment-index=1}
#### Era
::::
:::: {.fragment fragment-index=2}
ARPANET
::::
:::: {.fragment fragment-index=3}
Early Internet
::::
:::: {.fragment fragment-index=4}
Commercial Web
::::
:::
::: {.column width=33%}
:::: {.fragment fragment-index=1}
#### Spaces
::::
:::: {.fragment fragment-index=2}
Email, Usenet
::::
:::: {.fragment fragment-index=3}
BBS, IRC
::::
:::: {.fragment fragment-index=4}
Social media
::::
:::
::: {.column width=33%}
:::: {.fragment fragment-index=1}
#### Technologies
::::
:::: {.fragment fragment-index=2}
TCP/IP
::::
:::: {.fragment fragment-index=3}
HTML
::::
:::: {.fragment fragment-index=4}
APIs, AJAX
::::
:::
:::
::::::
## Current Trends
::: {}
+ High **distrust** of social media companies [@AmericansWidelyDistrust2021]
+ Challenges in performing content moderation and maintaining social communities at **scale** [@gillespieContentModerationAI2020]
+ Post-API Era: **closure** of APIs on major platforms to researchers and tinkerers [@freelonComputationalResearchPostAPI2018]
:::
## Protocol-based Social Media
::::: {.spread}
The commercial internet has trended toward centralization, but this may be neither desirable nor sustainable [@masnickProtocolsNotPlatforms].
::: {.columns}
::: {.column}
#### Platforms
We have accounts on the same website
:::
::: {.column}
#### Protocols
We use the same protocol
:::
:::
::: {.columns}
::: {.column}
The (single) website controls:
- My data
- Content moderation
- Monetization
:::
::: {.column}
I can choose who controls:
- My data
- Content moderation (local)
- Monetization (if any)
:::
:::
:::::
## Empirical Context
::: {.columns}
:::: {.column}
- **The Fediverse**: A set of decentralized online social networks which interoperate using shared protocols like ActivityPub.
- **Mastodon**: An open-source, decentralized social network and microblogging community.
::::
:::: {.column}
![A screenshot of Mastodon 2.9 (2019), from the Mastodon Blog.](images/Mastodon_Single-column-layout.png)
::::
:::
## Closure and Opening of MS (2022) {.tiny}
```{r}
#| fig-width: 9
library(jsonlite)
library(here)
library(tidyverse)
library(tsibble)
library(fable)
server_list <- c(
  "mastodon.social", "mastodon.online"
)
early.jm_servers <- as_tibble(fromJSON(here("data/joinmastodon-2020-09-18.json")))$domain
early.day_counts <- accounts %>%
  filter(created_at < "2021-09-01") %>%
  mutate(created_day = as.Date(floor_date(created_at, unit = "day"))) %>%
  mutate(server_code = ifelse(server %in% early.jm_servers, "joinmastodon", "other")) %>%
  mutate(server_code = ifelse(server == "mastodon.social", "mastodon.social", server_code)) %>%
  mutate(server = ifelse(server == "mastodon.online", "mastodon.online", server_code)) %>%
  group_by(created_day, server) %>%
  summarize(count = n(), .groups = "drop") %>%
  as_tsibble(., key=server, index=created_day) %>%
  fill_gaps(count=0) %>%
  mutate(first_open = ((created_day >= "2020-09-18") & (created_day < "2020-11-01"))) %>%
  #mutate(second_open = ((created_day > "2020-11-02") & (created_day < "2020-11-05"))) %>%
  mutate(third_open = (created_day >= "2021-04-17")) %>%
  mutate(open = (first_open | third_open))
early.data_plot <- early.day_counts %>%
  mutate(created_week = as.Date(floor_date(created_day, unit = "week"))) %>%
  ggplot(aes(x = created_day, y=count)) +
  geom_rect(data = (early.day_counts %>% filter(open)),
            aes(xmin = created_day - 0.5, xmax = created_day + 0.5, ymin = 0, ymax = Inf),
            fill = "lightblue", alpha = 0.3) + # Adjust color and transparency as needed
  geom_bar(stat="identity") +
  facet_wrap(~ server, ncol=1, strip.position = "left") + #, scales="free_y") +
  scale_x_date(expand = c(0, 0), date_labels = "%B %Y") +
  scale_y_log10() +
  labs(
    title = "Open registration periods on mastodon.social (August 2020 - August 2021)",
    x = "Account Created Date",
    y = "Count"
  ) +
  theme_bw_small_labels()
model_data <- early.day_counts %>%
  mutate(count = log1p(count)) %>%
  ungroup %>%
  arrange(created_day) %>%
  mutate(day = row_number())
fit <- model_data %>%
  model(arima = ARIMA(count ~ open + day + open:day + fourier(period=7, K=2) + pdq(2,0,0) + PDQ(0,0,0,period=7)))
early.table <- fit %>% tidy %>%
  mutate(p.value = scales::pvalue(p.value)) %>%
  pivot_wider(names_from=server, values_from = c(estimate, std.error, statistic, p.value)) %>%
  select(-c(.model)) %>%
  select(term,
    estimate_mastodon.online, p.value_mastodon.online,
    estimate_mastodon.social, p.value_mastodon.social,
    estimate_joinmastodon, p.value_joinmastodon,
    estimate_other, p.value_other
  ) %>%
  knitr::kable(
    .,
    col.names = c("Term", "mastodon.online", "", "mastodon.social", "", "joinmastodon", "", "other", ""),
    digits = 4,
    align = c("l", "r", "r", "r", "r", "r", "r", "r", "r")
  )
early.data_plot
```
## Closure and Opening of MS (2022) {.tiny}
```{r}
early.table
```
# The Fediverse is a network of _thousands_ of interconnected servers {background-color="black" data-background-image="images/mastodon_map.png" background-repeat="repeat" background-size="200px" background-opacity="0.5" .center auto-animate=true .fade-out}
::: {.footer}
Background image: Jaz-Michael King
:::
## A Timeline of Mastodon
```{mermaid}
timeline
title Mastodon and Fediverse Timeline
2008: OStatus Protocol
2016: Mastodon releases v0.1
2018: ActivityPub standard published
2019: Mastodon drops OStatus
2022: Elon Musk Twitter acquisition
: Truth Social launches using Mastodon code
2023: Mastodon reaches 2M active users
: Threads (Meta) begins experimental support for ActivityPub
```
## Avoiding the "Twitter Killer" Hype
@zulliRethinkingSocialSocial2020 describe this pattern:
1. A writer discovers an alternative technology system
2. Media hypes it as a "killer" of a major platform
3. The system does not in fact "kill" the major platform
4. The system is declared a failure
This has happened multiple times already.
## How do we define success for systems like Mastodon?
:::: {.columns}
::: {.column}
We should instead take social communities on their own terms.
**Do people find value in the system?**
In my view, the most interesting thing about Mastodon is the "local timeline", which shows posts from your server.
:::
::: {.column}
> "One of the things the Internet was good for was gathering together people in different places who shared a common interest"
--Michael Lewis, _Moneyball_ (2003)
:::
:::
## Which server should I join?
### Conflicting advice
::: {.columns}
::: {.column}
Just join any server!
:::
::: {.column}
Join the _right_ server!
:::
:::
::: {.fragment}
### Which is right? {.center-xy}
:::
## There are a lot of options {autoslide=2500 .fade-in}
```{r}
#| fig-width: 9
library(jsonlite)
library(here)
library(tidyverse)
library(tsibble)
library(fable)
email.jm_servers <- as_tibble(fromJSON(here("data/joinmastodon-2023-08-25.json")))$domain
email.day_counts <- accounts %>%
filter(created_at > "2022-07-01") %>%
filter(created_at < "2022-10-26") %>%
mutate(created_day = as.Date(floor_date(created_at, unit = "day"))) %>%
mutate(server_code = ifelse(server %in% email.jm_servers, "joinmastodon", "other")) %>%
mutate(server = ifelse(server == "mastodon.social", "mastodon.social", server_code)) %>%
#mutate(server = server_code) %>%
#filter(server != "other") %>%
group_by(created_day, server) %>%
summarize(count = n(), .groups = "drop") %>%
as_tsibble(., key = server, index = created_day) %>%
fill_gaps(count = 0) %>%
mutate(open = ((created_day < "2022-08-13") |
(created_day > "2022-10-03")))
email.data_plot <- email.day_counts %>%
#filter(server != "other") %>%
mutate(created_week = as.Date(floor_date(created_day, unit = "week"))) %>%
ggplot(aes(x = created_day, y = count)) +
geom_rect(
data = (email.day_counts %>% filter(open)),
aes(
xmin = created_day - 0.5,
xmax = created_day + 0.5,
ymin = 0,
ymax = Inf
),
fill = "lightblue",
alpha = 0.3
) + # Adjust color and transparency as needed
geom_bar(stat = "identity") +
facet_wrap( ~ server, ncol = 1, strip.position = "left") + #, scales="free_y") +
scale_x_date(expand = c(0, 0), date_labels = "%B %Y") +
labs(
title = "Closure of mastodon.social (2022)",
x = "Account Created Date",
y = "Count"
) +
theme_bw_small_labels()
email.data_plot
```
## Closure and Opening of MS (Early 2023) {.tiny}
```{r}
model_data <- email.day_counts %>%
mutate(count = log1p(count)) %>%
ungroup %>%
arrange(created_day) %>%
mutate(day = row_number())
fit <- model_data %>%
model(arima = ARIMA(count ~ open + day + open:day + fourier(period=7, K=2) + pdq(2,0,0) + PDQ(0,0,0,period=7)))
email.table <- fit %>% tidy %>%
mutate(p.value = scales::pvalue(p.value)) %>%
pivot_wider(names_from=server, values_from = c(estimate, std.error, statistic, p.value)) %>%
select(-c(.model)) %>%
select(term,
estimate_mastodon.social, p.value_mastodon.social,
estimate_joinmastodon, p.value_joinmastodon,
estimate_other, p.value_other
) %>%
knitr::kable(
.,
col.names = c("Term", "mastodon.social", "", "joinmastodon", "", "other", ""),
digits = 4,
align = c("l", "r", "r", "r", "r", "r", "r")
)
email.table
```
## A Change in Strategy
Mastodon has shifted away from _discouraging_ newcomers from using mastodon.social to using the flagship server as the default.
. . .
Today, almost half of new Mastodon accounts join mastodon.social
<!--- ## Do some servers retain newcomers better than others? --->
## A Change in Strategy
![](images/joinmastodon-screenshot.png)
## Moving Accounts on Mastodon
+ Accounts can move freely between Mastodon servers
+ Moved accounts retain their followers (but not their posts)
## Are people moving to larger or smaller servers? {.tiny}
```{r}
#| label: server-images
#| results: asis
#| cache: true
library(here)
library(tidyverse)
library(jsonlite)
jm <- here("data/joinmastodon.json") %>% jsonlite::fromJSON() %>% as_tibble
dir_name <- "images/server_images/"
if (!dir.exists(dir_name)) {
dir.create(dir_name, recursive = TRUE)
}
# save all the server images locally if they are not already saved
# location "images/server_images/{domain}.png"
save_image <- function(domain, proxied_thumbnail) {
file_path <- paste0(dir_name, domain, ".png") # Corrected file path
tryCatch({
if (!file.exists(file_path)) { # Check if file doesn't exist
download.file(proxied_thumbnail, file_path, mode = "wb")
}
return(file_path)
}, error = function(e) {
return(NA)
})
}
server_images <- jm %>%
filter(!is.na(blurhash)) %>%
select(domain, proxied_thumbnail) %>%
rowwise() %>%
mutate(image = save_image(domain, proxied_thumbnail)) %>%
ungroup()
```
```{r}
#| results: asis
web_image <- function(url) {
random_number <- as.integer(5*runif(1, 0, 1))
paste0('<img src="', url, '" data-fragment-index="', random_number, '" class="fragment fade-in" data-autoslide="1000" style="max-width: 100px;"/>')
}
server_images %>%
select(image) %>%
mutate(thumb = map(image, web_image)) %>%
head(125) %>%
pull(thumb) %>%
paste0(collapse = "\n") %>%
cat()
```
# But does server choice matter? {.center}
## Mastodon grew significantly in 2022 and 2023
```{r}
#| label: fig-account-timeline
#| fig-width: 6
#| fig-height: 2.5
#| fig-cap: "Number of accounts created on Mastodon each week from late 2020 to 2023. The top of the graph shows the proportion of these accounts which moved or remained active after 91 days."
library(here)
source(here("code/helpers.R"))
account_timeline_plot()
```
# The Mastodon Onboarding Process Has Changed Over Time
![The "Join Mastodon" website as it currently appears.](images/joinmastodon-screenshot.png)
## The Flagship Instance
:::: {.columns}
::: {.column width=60%}
+ **Mastodon.social** was the first Mastodon instance and is the largest.
+ There have been some historical concerns that its size was an issue.
+ At certain times, it has **closed** registrations.
1. An extended period through the end of October 2020.
2. A temporary issue when the email host limited the server in mid-2022.
3. Two periods in late 2022 and early 2023.
:::
::: {.column width=40%}
![A screenshot of mastodon.social as it appeared in 2020 with a message redirecting signups to mastodon.online or to "Join Mastodon"](images/mastodon-social-signups-2020-11-01.png)
:::
:::
## The Pull-Pull Effect: Did Closing Mastodon.social Affect Other Servers?
We can use an interrupted time series analysis to test this.
$$
\begin{aligned}
y_t &= \beta_0 + \beta_1 \text{open}_t + \beta_2 \text{day}_t + \beta_3 (\text{open} \times \text{day})_t \\
&\quad + \beta_4 \sin\left(\frac{2\pi t}{7}\right) + \beta_5 \cos\left(\frac{2\pi t}{7}\right) \\
&\quad + \beta_6 \sin\left(\frac{4\pi t}{7}\right) + \beta_7 \cos\left(\frac{4\pi t}{7}\right) \\
&\quad + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t
\end{aligned}
$$
where $y_t$ is the number of new accounts on a server at time $t$, $\text{open}_t$ is a binary variable indicating whether the server is open to new sign-ups, $\text{day}_t$ is an increasing integer representing the date, and $\epsilon_t$ is a white noise error term. We use the sine and cosine terms to account for weekly seasonality.
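This is the same specification fit with `fable` in the analysis chunks above. A minimal, self-contained sketch on synthetic data (the toy tsibble, its dates, and the Poisson rates are assumptions for illustration only):

```r
# Minimal sketch of the interrupted time series model on synthetic data.
# The toy series below is an assumption for illustration; the analysis chunks
# fit the same formula to the real per-server account counts.
library(tidyverse)
library(tsibble)
library(fable)

set.seed(1)
toy <- tibble(
  created_day = as.Date("2022-07-01") + 0:120,
  open = created_day < as.Date("2022-08-13") | created_day > as.Date("2022-10-03"),
  count = log1p(rpois(121, lambda = ifelse(open, 40, 15)))
) %>%
  mutate(day = row_number()) %>%
  as_tsibble(index = created_day)

fit <- toy %>%
  model(arima = ARIMA(count ~ open + day + open:day +             # level shift, trend, slope change
                        fourier(period = 7, K = 2) +              # weekly seasonality (sine/cosine pairs)
                        pdq(2, 0, 0) + PDQ(0, 0, 0, period = 7))) # AR(2) errors, no seasonal ARIMA terms
report(fit)
```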
## Mastodon.online used to be more influential
| Period | Setting | $p < 0.05$ |
|------------|:----------------|:----|
| 2020-2021 | mastodon.online | Yes |
| | JoinMastodon | No |
| | Other | No |
| Mid 2022 | JoinMastodon | No |
| | Other | No |
| Early 2023 | JoinMastodon | No |
| | Other | No |
Results from ARIMA models of the number of new accounts on mastodon.online, on servers listed on joinmastodon.org, and on all other servers.
## The current Mastodon onboarding process
:::: {.columns}
::: {.column width=60%}
+ While Mastodon once pushed newcomers _away_ from mastodon.social, it now treats it like the **default server**
+ Secondarily, newcomers are directed to "Join Mastodon"
:::
::: {.column width=40%}
![](images/mastodon_blog_onboarding.png)
:::
:::
## Accounts on the largest general servers are less likely to remain active after 91 days
::: {.columns}
::: {.column}
```{r, cache.extra = tools::md5sum("code/survival.R")}
#| cache: true
#| label: fig-survival
#| fig-env: figure
#| fig-cap: "Survival probabilities for accounts created during May 2023."
#| fig-width: 3.375
#| fig-height: 2.25
#| fig-pos: h!
library(here)
source(here("code/survival.R"))
plot_km
```
:::
::: {.column .small}
```{r}
#| label: tbl-coxme
library(ehahelper)
library(broom)
cxme_table <- tidy(cxme) %>%
mutate(conf.low = exp(conf.low), conf.high=exp(conf.high)) %>%
mutate(term = case_when(
term == "factor(group)1" ~ "Join Mastodon",
term == "factor(group)2" ~ "General Servers",
term == "small_serverTRUE" ~ "Small Server",
TRUE ~ term
)) %>%
mutate(exp.coef = paste("(", round(conf.low, 2), ", ", round(conf.high, 2), ")", sep="")) %>%
select(term, estimate, exp.coef , p.value)
cxme_table %>% knitr::kable(digits = 3)
```
:::
:::
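The fitted object `cxme` comes from `code/survival.R`, which is not shown here. As a rough, hypothetical sketch, a mixed-effects Cox model consistent with the terms in the table could look like the following; the `activity` data frame, its columns, and the synthetic values are assumptions for illustration.

```r
# Hypothetical sketch only: a mixed-effects Cox model of account survival with
# a random intercept per server, consistent with the terms reported above.
library(survival)
library(coxme)

set.seed(1)
activity <- data.frame(
  server = sample(paste0("server", 1:20), 500, replace = TRUE),
  group = sample(0:2, 500, replace = TRUE),            # 0 = other, 1 = Join Mastodon, 2 = general servers
  small_server = sample(c(TRUE, FALSE), 500, replace = TRUE),
  time = pmin(rexp(500, rate = 1 / 60), 91)            # days until inactivity, censored at 91
)
activity$status <- as.integer(activity$time < 91)      # 1 = went inactive, 0 = still active at 91 days

cxme <- coxme(
  Surv(time, status) ~ factor(group) + small_server + (1 | server),
  data = activity
)
print(cxme)
```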
## Accounts that move between servers are more likely to move to smaller servers
::: {.small}
```{r}
#| label: tbl-ergm-table
#| echo: false
#| warning: false
#| message: false
#| error: false
library(here)
library(modelsummary)
library(kableExtra)
library(purrr)
library(stringr)
load(file = here("data/scratch/ergm-model-early.rda"))
load(file = here("data/scratch/ergm-model-late.rda"))
if (knitr::is_latex_output()) {
  format <- "latex_tabular"
} else {
  format <- "html"
}
x <- modelsummary(
  list("Coef." = model.early, "Std.Error" = model.early, "Coef." = model.late, "Std.Error" = model.late),
  estimate = c("{estimate}", "{stars}{std.error}", "{estimate}", "{stars}{std.error}"),
  statistic = NULL,
  gof_omit = ".*",
  coef_rename = c(
    "sum" = "Sum",
    "nonzero" = "Nonzero",
    "diff.sum0.h-t.accounts" = "Smaller server",
    "nodeocov.sum.accounts" = "Server size (outgoing)",
    "nodeifactor.sum.registrations.TRUE" = "Open registrations (incoming)",
    "nodematch.sum.language" = "Languages match"
  ),
  align = "lrrrr",
  stars = c('*' = .05, '**' = 0.01, '***' = .001),
  output = format
) %>% add_header_above(c(" " = 1, "Model A" = 2, "Model B" = 2))
x
```
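The fitted models above are loaded from `.rda` files, so the exact call is not shown in this document. As an assumption about their general form only, a valued ERGM of account flows between servers that would yield coefficients with these names might look roughly like the sketch below; the synthetic `flows` network, its attributes, and the exact term arguments are illustrative assumptions, not the authors' code.

```r
# Hypothetical sketch of a valued ERGM on a synthetic server-to-server
# account-flow network (counts of accounts moving from server i to server j).
library(ergm)
library(ergm.count)
library(network)

set.seed(1)
n <- 30
moves <- matrix(rpois(n * n, 0.3), n, n)                       # assumed move counts i -> j
flows <- network(moves, directed = TRUE, loops = FALSE,
                 ignore.eval = FALSE, names.eval = "weight")
flows %v% "accounts" <- rpois(n, 500)                          # server size
flows %v% "registrations" <- sample(c(TRUE, FALSE), n, TRUE)   # open registrations?
flows %v% "language" <- sample(c("en", "de", "ja"), n, TRUE)   # primary language

model.sketch <- ergm(
  flows ~ sum + nonzero +
    diff("accounts", dir = "h-t") +   # size difference between destination and origin
    nodeocov("accounts") +            # origin server size
    nodeifactor("registrations") +    # destination has open registrations
    nodematch("language"),            # origin and destination share a language
  response = "weight",
  reference = ~Poisson
)
summary(model.sketch)
```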
:::
# Our analysis suggests server choice _does_ matter {.center}
## The Local Timeline: Mastodon's Secret Killer Feature
If you join a server focused on a particular topic or community of interest, you get a timeline about that topic without having to follow anyone.
While discovery is challenging in decentralized online social networks, joining the right server can make it easier.
Can we build a system that helps people find servers?
## Challenges in Building Recommendation Systems on DOSNs {.small}
1. **Tensions around centralization**: a single service providing recommendations for all servers probably won't work.
1. **Local control**: the system should be opt-in, and server admins should be able to filter servers they accept data from.
1. **Computing power**: needs to be able to run on servers with limited resources.
# Recommendation System Concept
## Concept: Use Hashtags
Advantages:

1. Hashtags already have their own table in the database.
2. Clear opt-in toward public visibility.

## Design
For the most popular tags used by their local users, each server reports:

1. A list of top tags
2. The number of accounts using each tag in the last 6 months
3. The number of accounts using any tag on the server.

. . .

- Report top **hashtags** used by the most accounts on each server
- Build an $M \times N$ server-tag matrix
- Weight with Okapi BM25 TF-IDF (term frequency-inverse document frequency) and apply L2 normalization

::: {.fragment}
Using this matrix, we can

- Calculate similarity between servers using tags
- Calculate similarity between tags using servers
- Recommend servers based on affinity toward certain tags
:::
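A minimal sketch of this weighting step on a toy server-tag count matrix; the toy counts, the simple IDF variant, and the BM25 parameters `k1` and `b` are assumptions for illustration:

```r
# Toy server-tag count matrix: rows are servers, columns are tags.
counts <- matrix(
  c(20, 5, 0,
     2, 0, 9,
     0, 7, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("hci.social", "mastodon.art", "fosstodon.org"),
    c("hci", "art", "linux")
  )
)

k1 <- 1.2
b <- 0.75
doc_len <- rowSums(counts)                       # accounts using any tag on each server
idf <- log(nrow(counts) / colSums(counts > 0))   # a simple IDF variant

# Okapi BM25 term weighting, then L2-normalize each server's row
tf <- counts * (k1 + 1) / (counts + k1 * (1 - b + b * doc_len / mean(doc_len)))
weighted <- sweep(tf, 2, idf, `*`)
l2 <- weighted / sqrt(rowSums(weighted^2))

# Cosine similarity between servers is then a matrix product of the rows
server_similarity <- l2 %*% t(l2)
round(server_similarity, 2)
```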
## Challenge
How many servers do we need?
How many tags do they need to report?
## Baseline comparison
+ Data from all servers with over 100 accounts using hashtags.
+ Use cosine similarity to find pairwise similarity between all servers.
+ Compare to simulations with limits on the number of servers and number of tags reported.
Comparison metric: rank biased overlap (RBO).
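Rank biased overlap compares two rankings while putting more weight on agreement near the top. A minimal sketch of the truncated form, where the persistence parameter `p` and the example rankings are assumptions for illustration:

```r
# Truncated rank biased overlap between two ranked lists.
rbo <- function(ranking_a, ranking_b, p = 0.9,
                depth = min(length(ranking_a), length(ranking_b))) {
  agreement_at <- function(d) {
    length(intersect(ranking_a[1:d], ranking_b[1:d])) / d
  }
  agreement <- vapply(seq_len(depth), agreement_at, numeric(1))
  (1 - p) * sum(p^(seq_len(depth) - 1) * agreement)
}

baseline  <- c("hci.social", "mastodon.social", "fosstodon.org", "mastodon.online")
simulated <- c("hci.social", "fosstodon.org", "mastodon.social", "mas.to")
rbo(baseline, simulated)
```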
## Overlap with Baseline in Various Simulations
```{r}
#| label: fig-simulations-rbo
#| fig-width: 10
library(tidyverse)
library(arrow)
library(here)
simulations <- arrow::read_ipc_file(here("data/scratch/simulation_rbo.feather"))
simulations %>%
  group_by(servers, tags, run) %>% summarize(rbo=mean(rbo), .groups="drop") %>%
  mutate(ltags = as.integer(log2(tags))) %>%
  ggplot(aes(x = factor(ltags), y = rbo, fill = factor(ltags))) +
  geom_boxplot() +
  facet_wrap(~servers, nrow=1) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Tags (log2)", y = "RBO", title = "Rank Biased Overlap with Baseline Rankings by Number of Servers") +
  theme_minimal() + theme(legend.position = "none")
```
## Example: Server Similarity
::: {#tbl-sim-servers}
```{r}
#| label: table-sim-servers
library(tidyverse)
library(here)
sim_servers <- here("data/scratch/server_similarity.feather") %>% arrow::read_ipc_file()
server_of_interest <- "hci.social"
server_table <- sim_servers %>%
  arrange(desc(Similarity)) %>%
  filter(Source == server_of_interest | Target == server_of_interest) %>%
  head(7) %>%
  pivot_longer(cols=c(Source, Target)) %>%
  filter(value != server_of_interest) %>%
  select(value, Similarity) %>%
  rename("Server" = "value")
if (knitr::is_latex_output()) {
  server_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
  server_table %>% knitr::kable(digits = 3)
}
```
Top five servers most similar to hci.social
:::
## Example Recommendation System
+ Use just servers from joinmastodon.org
+ Ask for preferences from a bag of common tags.
+ Suggest top servers according to similarity.
## User 1: education, science, academia
Top suggestions:

+ mathstodon.xyz
+ sciences.social
+ mastodon.education
+ hcommons.social
+ mas.to

## User 2: tech, linux, hacking
Top suggestions:

+ snabelen.no
+ social.anoxinon.de
+ peoplemaking.games
+ mastodon.gamedev.place
+ discuss.systems

## Server Recs
<iframe width="100%" height="100%" src="https://carlcolglazier.com/files/jsdemo/witch.html" title="Mastodon server recommender demo"></iframe>

# Future Work
- Evaluation of the recommendation system
- More specific analysis of account attributes
- Simulations for robustness
# References {#refs .scrollable}

File diff suppressed because it is too large Load Diff

View File

@ -1,6 +1,6 @@
/*-- scss:defaults --*/
@import url(https://fonts.googleapis.com/css?family=Montserrat:300,300i&display=swap);
@import url(https://fonts.googleapis.com/css?family=Montserrat:400,400i,600&display=swap);
@import url(https://fonts.googleapis.com/css?family=Josefin+Sans&display=swap);
@import url(https://fonts.googleapis.com/css?family=Fira+Mono&display=swap);
@ -104,6 +104,10 @@ $black: #000 !default;
font-size: 0.4em
}
.big {
font-size: 1.4em
}
.Large {
font-size: 1.6em
}
@ -120,7 +124,7 @@ section.has-dark-background a:hover {
}
.reveal h2 {
padding-top: 20rem;
padding-top: 50rem;
padding-bottom: 2rem;
padding-left: 20rem;
padding-right: 20rem;
@ -128,7 +132,7 @@ section.has-dark-background a:hover {
color: $white;
position: relative;
//top: -38rem;
margin-top: -22rem;
margin-top: -52rem;
margin-bottom: 2rem;
left: -20rem;
width: 100%;
@ -137,4 +141,25 @@ section.has-dark-background a:hover {
/*
.slide {
padding: 2rem;
}*/
}*/
@keyframes tilt-n-move-shaking {
0% { transform: translate(0, 0) rotate(0deg); }
25% { transform: translate(1px, 1px) rotate(1deg); }
50% { transform: translate(0, 0) rotate(0deg); }
75% { transform: translate(-1px, 1px) rotate(-1deg); }
100% { transform: translate(0, 0) rotate(0deg); }
}
.server-image {
// animate jiggle
animation: tilt-n-move-shaking 0.5s infinite;
margin: 0!important;
}
.spread {
text-align: left;
display: flex !important;
flex-direction: column;
height: 80%;
justify-content: space-around;
}

View File

@ -0,0 +1,3 @@
{
"extends": "next/core-web-vitals"
}

36
recommender/.gitignore vendored Normal file
View File

@ -0,0 +1,36 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
# dependencies
/node_modules
/.pnp
.pnp.js
.yarn/install-state.gz
# testing
/coverage
# next.js
/.next/
/out/
# production
/build
# misc
.DS_Store
*.pem
# debug
npm-debug.log*
yarn-debug.log*
yarn-error.log*
# local env files
.env*.local
# vercel
.vercel
# typescript
*.tsbuildinfo
next-env.d.ts

36
recommender/README.md Normal file
View File

@ -0,0 +1,36 @@
This is a [Next.js](https://nextjs.org/) project bootstrapped with [`create-next-app`](https://github.com/vercel/next.js/tree/canary/packages/create-next-app).
## Getting Started
First, run the development server:
```bash
npm run dev
# or
yarn dev
# or
pnpm dev
# or
bun dev
```
Open [http://localhost:3000](http://localhost:3000) with your browser to see the result.
You can start editing the page by modifying `app/page.tsx`. The page auto-updates as you edit the file.
This project uses [`next/font`](https://nextjs.org/docs/basic-features/font-optimization) to automatically optimize and load Inter, a custom Google Font.
## Learn More
To learn more about Next.js, take a look at the following resources:
- [Next.js Documentation](https://nextjs.org/docs) - learn about Next.js features and API.
- [Learn Next.js](https://nextjs.org/learn) - an interactive Next.js tutorial.
You can check out [the Next.js GitHub repository](https://github.com/vercel/next.js/) - your feedback and contributions are welcome!
## Deploy on Vercel
The easiest way to deploy your Next.js app is to use the [Vercel Platform](https://vercel.com/new?utm_medium=default-template&filter=next.js&utm_source=create-next-app&utm_campaign=create-next-app-readme) from the creators of Next.js.
Check out our [Next.js deployment documentation](https://nextjs.org/docs/deployment) for more details.

View File

@ -0,0 +1,4 @@
@tailwind base;
@tailwind components;
@tailwind utilities;

View File

@ -0,0 +1,22 @@
import type { Metadata } from "next";
import { Inter } from "next/font/google";
import "./globals.css";
const inter = Inter({ subsets: ["latin"] });
export const metadata: Metadata = {
title: "Mastodon Server Recommender",
description: "by Carl Colglazier",
};
export default function RootLayout({
children,
}: Readonly<{
children: React.ReactNode;
}>) {
return (
<html lang="en">
<body className={inter.className}>{children}</body>
</html>
);
}

View File

@ -0,0 +1,193 @@
"use client";
import React, { useState, useEffect } from 'react';
//import top_tags from "@/data/top_tags.json";
import positions from "@/data/positions.json";
import server_matrix from "@/data/server_matrix.json";
import server_names from "@/data/server_names.json";
import tag_names from "@/data/tag_names.json";
interface Tag {
index: number;
tag: string;
}
// Read from data/top_tags.json FILE
//const tags: Tag[] =
const selected_tags = [
"politics", "gardening", "art", "nature", "pride", "cycling", "climate",
"programming", "dogs", "privacy", "cats", "gaming", "education",
"science", "music", "movies", "food", "books", "lgbtq", "python",
"emacs", "gay", "trans", "furry", "photography", "cooking", "literature",
"television"
];
function dotProduct(a: number[], b: number[]): number {
return a.map((x, i) => x * b[i]).reduce((sum, current) => sum + current, 0);
}
function magnitude(arr: number[]): number {
return Math.sqrt(arr.map(x => x * x).reduce((sum, current) => sum + current, 0));
}
function cosineSimilarity(arr1: number[], arr2: number[]): number {
if (arr1.length !== arr2.length) {
throw new Error("Arrays must have the same length");
}
const dotProd = dotProduct(arr1, arr2);
const magnitudeProd = magnitude(arr1) * magnitude(arr2);
if (magnitudeProd === 0) {
return 0;
}
return dotProd / magnitudeProd;
}
function averageOfArrays(arr: number[][]): number[] {
// Get the length of the first sub-array
const length = arr[0].length;
// Initialize an array to store the sums
const sums = Array(length).fill(0);
// Loop over each sub-array
for (let i = 0; i < arr.length; i++) {
// Loop over each element in the sub-array
for (let j = 0; j < arr[i].length; j++) {
// Add the element to the corresponding sum
sums[j] += arr[i][j];
}
}
// Divide each sum by the number of arrays to get the average
const averages = sums.map(sum => sum / arr.length);
return averages;
}
const TagSelector: React.FC = () => {
//const top_tags: Tag[] = [];
const all_tags: Tag[] = tag_names.map((tag: string, index: number) => ({index, tag}))
const tags: Tag[] = all_tags.filter((tag: Tag) => selected_tags.includes(tag.tag));
//top_tags.filter((tag) => selected_tags.includes(tag.tag));
// State to keep track of selected tag IDs.
const [selectedTagIds, setSelectedTagIds] = useState<number[]>([]);
const [suggestedTagIds, setSuggestedTagIds] = useState<number[]>([]);
const [topServerIds, setTopServerIds] = useState<number[]>([]);
// Function to handle tag selection toggling.
const toggleTag = (tagId: number) => {
setSelectedTagIds((currentTagIds) =>
currentTagIds.includes(tagId)
? currentTagIds.filter((id) => id !== tagId)
: [...currentTagIds, tagId],
)
};
function find_most_similar_tags(all_tags: Tag[], selectedTagIds: number[], positions: number[][]) {
let most_similar: Tag[] = [];
// get the average position of all selected tags
if (selectedTagIds.length > 0) {
// loop through all selected tags and get their positions
const selected_positions = selectedTagIds.map((tagId) => positions[tagId]);
for (let i = 0; i < selected_positions.length; i++) {
let tag_similarity = positions.map((row) => cosineSimilarity(selected_positions[i], row));
let tag_rank = tag_similarity.map((similarity, index) => ({index, similarity})).sort((a, b) => b.similarity - a.similarity).map((item, index) => ({index: item.index, similarity: item.similarity, name: tag_names[item.index]}));
console.log(tag_rank.slice(0, 10));
let topServerIds = tag_rank.slice(0, 10).map((item) => item.index);
for (let tagId of all_tags.filter((tag) => topServerIds.includes(tag.index))) {
if (!selectedTagIds.includes(tagId.index) && !most_similar.includes(tagId)) {
most_similar.push(tagId);
}
}
}
}
return most_similar;
}
useEffect(() => {
// get the average position of all selected tags
if (selectedTagIds.length > 0) {
let selected_positions = selectedTagIds.map((tagId) => positions[tagId]);
// get the average of selected positions
let average_position = averageOfArrays(selected_positions);
// loop through each row of the server_matrix and calculate the cosine similarity
const server_similarity = server_matrix.map((row) => cosineSimilarity(average_position, row));
const server_rank = server_similarity.map((similarity, index) => ({index, similarity})).sort((a, b) => b.similarity - a.similarity).map((item, index) => ({index: item.index, similarity: item.similarity, name: server_names[item.index]}));
setTopServerIds(server_rank.slice(0, 10).map((item) => item.index));
// Find the most similar tags among all tags
const tag_similarity = all_tags.map((tag) => ({t: tag, sim: cosineSimilarity(average_position, positions[tag.index])})).sort((a, b) => b.sim - a.sim);
const most_similar = find_most_similar_tags(all_tags, selectedTagIds, positions);
//tag_similarity.slice(0, 50).map((item) => item.t);
console.log(most_similar);
setSuggestedTagIds(most_similar.map((tag) => tag.index));
console.log(suggestedTagIds);
}
}, [selectedTagIds]);
return (
<div>
<div className="flex flex-wrap gap-2">
<div>
<h3>Selected</h3>
{all_tags.filter((tag) => selectedTagIds.includes(tag.index)).map((tag) => (
<button
key={tag.index}
onClick={() => toggleTag(tag.index)}
className={`px-3 py-1 rounded-full text-sm ${
selectedTagIds.includes(tag.index) ? 'bg-blue-500 text-white' : 'bg-gray-200 text-gray-800'
} transition-colors duration-300 ease-in-out focus:outline-none focus:ring-2 focus:ring-blue-400 focus:ring-opacity-50`}
>
{tag.tag}
</button>
))}
</div>
<div>
<h3>Categories</h3>
{tags.filter((tag) => !selectedTagIds.includes(tag.index)).map((tag) => (
<button
key={tag.index}
onClick={() => toggleTag(tag.index)}
className={`px-3 py-1 rounded-full text-sm bg-gray-200 text-gray-800 transition-colors duration-300 ease-in-out focus:outline-none focus:ring-2 focus:ring-blue-400 focus:ring-opacity-50`}
>
{tag.tag}
</button>
))}
</div>
<div>
<h3>Suggested Tags ({suggestedTagIds.filter((tag) => !selectedTagIds.includes(tag)).length})</h3>
{all_tags.filter((tag) => suggestedTagIds.includes(tag.index)).filter((tag) => !selectedTagIds.includes(tag.index)).map((tag) => (
<button
key={tag.index}
onClick={() => toggleTag(tag.index)}
className={`px-3 py-1 rounded-full text-sm bg-gray-200 text-gray-800 transition-colors duration-300 ease-in-out focus:outline-none focus:ring-2 focus:ring-blue-400 focus:ring-opacity-50`}
>
{tag.tag}
</button>
))}
</div>
</div>
<div>
<ul>
{topServerIds.map((id) => (
<li key={id}>{server_names[id]}</li>
))}
</ul>
</div>
</div>
);
};
export default function Home() {
return (
<main className="flex min-h-screen flex-col items-center justify-between p-24">
<TagSelector />
</main>
);
}

View File

@ -0,0 +1,7 @@
/** @type {import('next').NextConfig} */
const nextConfig = {
output: 'export',
basePath: '/files/jsdemo'
};
export default nextConfig;

4897
recommender/package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

27
recommender/package.json Normal file
View File

@ -0,0 +1,27 @@
{
"name": "recommender",
"version": "0.1.0",
"private": true,
"scripts": {
"dev": "next dev",
"build": "next build",
"start": "next start",
"lint": "next lint"
},
"dependencies": {
"next": "14.1.3",
"react": "^18",
"react-dom": "^18"
},
"devDependencies": {
"@types/node": "^20",
"@types/react": "^18",
"@types/react-dom": "^18",
"autoprefixer": "^10.4.18",
"eslint": "^8",
"eslint-config-next": "14.1.3",
"postcss": "^8.4.35",
"tailwindcss": "^3.4.1",
"typescript": "^5"
}
}

View File

@ -0,0 +1,6 @@
module.exports = {
plugins: {
tailwindcss: {},
autoprefixer: {},
},
};

View File

@ -0,0 +1,13 @@
/** @type {import('tailwindcss').Config} */
module.exports = {
content: [
"./app/**/*.{js,ts,jsx,tsx,mdx}",
"./pages/**/*.{js,ts,jsx,tsx,mdx}",
"./components/**/*.{js,ts,jsx,tsx,mdx}"
],
theme: {
extend: {},
},
plugins: [],
}

View File

@ -0,0 +1,20 @@
import type { Config } from "tailwindcss";
const config: Config = {
content: [
"./pages/**/*.{js,ts,jsx,tsx,mdx}",
"./components/**/*.{js,ts,jsx,tsx,mdx}",
"./app/**/*.{js,ts,jsx,tsx,mdx}",
],
theme: {
extend: {
backgroundImage: {
"gradient-radial": "radial-gradient(var(--tw-gradient-stops))",
"gradient-conic":
"conic-gradient(from 180deg at 50% 50%, var(--tw-gradient-stops))",
},
},
},
plugins: [],
};
export default config;

26
recommender/tsconfig.json Normal file
View File

@ -0,0 +1,26 @@
{
"compilerOptions": {
"lib": ["dom", "dom.iterable", "esnext"],
"allowJs": true,
"skipLibCheck": true,
"strict": true,
"noEmit": true,
"esModuleInterop": true,
"module": "esnext",
"moduleResolution": "bundler",
"resolveJsonModule": true,
"isolatedModules": true,
"jsx": "preserve",
"incremental": true,
"plugins": [
{
"name": "next"
}
],
"paths": {
"@/*": ["./*"]
}
},
"include": ["next-env.d.ts", "**/*.ts", "**/*.tsx", ".next/types/**/*.ts"],
"exclude": ["node_modules"]
}

View File

@ -1,3 +1,42 @@
@book{abbateInventingInternet2000,
title = {Inventing the {{Internet}}},
author = {Abbate, Janet},
year = {2000},
series = {Inside Technology},
edition = {3rd printing},
publisher = {MIT Press},
address = {Cambridge, Mass.},
isbn = {978-0-262-51115-5},
langid = {english}
}
@misc{AmericansWidelyDistrust2021,
title = {Americans Widely Distrust {{Facebook}}, {{TikTok}} and {{Instagram}} with Their Data, Poll Finds},
year = {2021},
month = dec,
journal = {Washington Post},
urldate = {2024-03-09},
abstract = {Pulled between not trusting some tech companies and still wanting to use their products, people look to government regulation.},
chapter = {Technology},
howpublished = {https://www.washingtonpost.com/technology/2021/12/22/tech-trust-survey/},
langid = {english}
}
@article{baranDistributedCommunicationsNetworks1964,
title = {On {{Distributed Communications Networks}}},
author = {Baran, P.},
year = {1964},
month = mar,
journal = {IEEE Transactions on Communications Systems},
volume = {12},
number = {1},
pages = {1--9},
issn = {1558-2647},
doi = {10.1109/TCOM.1964.1088883},
abstract = {This paper briefly reviews the distributed communication network concept in which each station is connected to all adjacent stations rather than to a few switching points, as in a centralized system. The payoff for a distributed configuration in terms of survivability in the cases of enemy attack directed against nodes, links or combinations of nodes and links is demonstrated. A comparison is made between diversity of assignment and perfect switching in distributed networks, and the feasibility of using low-cost unreliable communication links, even links so unreliable as to be unusable in present type networks, to form highly reliable networks is discussed. The requirements for a future all-digital data distributed network which provides common user service for a wide range of users having different requirements is considered. The use of a standard format message block permits building relatively simple switching mechanisms using an adaptive store-and-forward routing policy to handle all forms of digital data including digital voice. This network rapidly responds to changes in the network status. Recent history of measured network traffic is used to modify path selection. Simulation results are shown to indicate that highly efficient routing can be performed by local control without the necessity for any central, and therefore vulnerable, control point.},
keywords = {Buildings,Centralized control,Communication networks,Communication switching,Communication system control,History,Information systems,Network synthesis,Routing,Telecommunication network reliability}
}
@inproceedings{burkeFeedMeMotivating2009,
title = {Feed {{Me}}: {{Motivating Newcomer Contribution}} in {{Social Network Sites}}},
shorttitle = {Feed {{Me}}},
@ -42,6 +81,19 @@
langid = {english}
}
@book{driscollModemWorldPrehistory2022,
title = {The Modem World: {{A}} Prehistory of Social Media},
shorttitle = {The Modem World},
author = {Driscoll, Kevin},
year = {2022},
month = apr,
publisher = {Yale University Press},
abstract = {The untold story about how the internet became social, and why this matters for its future``Whether you're reading this for a nostalgic romp or to understand the dawn of the internet, The Modem World will delight you with tales of BBS culture and shed light on how the decisions of the past shape our current networked world.''---danah boyd, author of It's Complicated: The Social Lives of Networked TeensFifteen years before the commercialization of the internet, millions of amateurs across North America created more than 100,000 small-scale computer networks. The people who built and maintained these dial-up bulletin board systems (BBSs) in the 1980s laid the groundwork for millions of others who would bring their lives online in the 1990s and beyond. From ham radio operators to HIV/AIDS activists, these modem enthusiasts developed novel forms of community moderation, governance, and commercialization. The Modem World tells an alternative origin story for social media, centered not in the office parks of Silicon Valley or the meeting rooms of military contractors, but rather on the online communities of hobbyists, activists, and entrepreneurs. Over time, countless social media platforms have appropriated the social and technical innovations of the BBS community. How can these untold stories from the internet's past inspire more inclusive visions of its future?},
isbn = {978-0-300-26512-5},
langid = {english},
keywords = {Computers / History,Computers / Internet / General,History / Modern / 20th Century / General}
}
@misc{driscollWeMisrememberEternal2023,
title = {Do We Misremember {{Eternal September}}?},
shorttitle = {Do We Misremember {{Eternal September}}?},
@ -68,6 +120,40 @@
abstract = {When online platforms rise and fall, sometimes communities fade away, and sometimes they pack their bags and relocate to a new home. To explore the causes and effects of online community migration, we examine transformative fandom, a longstanding, technology-agnostic community surrounding the creation, sharing, and discussion of creative works based on existing media. For over three decades, community members have left and joined many different online spaces, from Usenet to Tumblr to platforms of their own design. Through analysis of 28 in-depth interviews and 1,886 survey responses from fandom participants, we traced these migrations, the reasons behind them, and their impact on the community. Our findings highlight catalysts for migration that provide insights into factors that contribute to success and failure of platforms, including issues surrounding policy, design, and community. Further insights into the disruptive consequences of migrations (such as social fragmentation and lost content) suggest ways that platforms might both support commitment and better support migration when it occurs.}
}
@article{freelonComputationalResearchPostAPI2018,
title = {Computational {{Research}} in the {{Post-API Age}}},
author = {Freelon, Deen},
year = {2018},
month = oct,
journal = {Political Communication},
volume = {35},
number = {4},
pages = {665--668},
publisher = {Routledge},
issn = {1058-4609},
doi = {10.1080/10584609.2018.1477506},
urldate = {2022-04-21},
keywords = {API,computational,Facebook,social media,Twitter}
}
@article{gillespieContentModerationAI2020,
title = {Content Moderation, {{AI}}, and the Question of Scale},
author = {Gillespie, Tarleton},
year = {2020},
month = jul,
journal = {Big Data \& Society},
volume = {7},
number = {2},
pages = {2053951720943234},
publisher = {SAGE Publications Ltd},
issn = {2053-9517},
doi = {10.1177/2053951720943234},
urldate = {2021-09-28},
abstract = {AI seems like the perfect response to the growing challenges of content moderation on social media platforms: the immense scale of the data, the relentlessness of the violations, and the need for human judgments without wanting humans to have to make them. The push toward automated content moderation is often justified as a necessary response to the scale: the enormity of social media platforms like Facebook and YouTube stands as the reason why AI approaches are desirable, even inevitable. But even if we could effectively automate content moderation, it is not clear that we should.},
langid = {english},
keywords = {Artificial intelligence,bias,content moderation,platforms,scale,social media}
}
@inproceedings{heFlockingMastodonTracking2023,
title = {Flocking to {{Mastodon}}: {{Tracking}} the {{Great Twitter Migration}}},
shorttitle = {Flocking to {{Mastodon}}},
@ -153,6 +239,15 @@
keywords = {Computer networks,internet,Online social networks,Planning,Social aspects,Social aspects Planning,Social psychology}
}
@misc{masnickProtocolsNotPlatforms,
title = {Protocols, {{Not Platforms}}: {{A Technological Approach}} to {{Free Speech}}},
shorttitle = {Protocols, {{Not Platforms}}},
author = {Masnick, Mike},
urldate = {2022-04-21},
howpublished = {https://knightcolumbia.org/content/protocols-not-platforms-a-technological-approach-to-free-speech},
langid = {english}
}
@misc{mastodonggmbhServers,
title = {Servers},
author = {{Mastodon gGmbH}},