---
title: Best Practices for Onboarding on the Fediverse
short-title: Onboarding Fediverse
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    #roles: writing
    corresponding: true
bibliography: references.bib
acm-metadata:
  final: false
  copyright-year: 2024
  acm-year: 2024
  copyright: rightsretained
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "PACMHCI"
  #conference-name: |
  #  Make sure to enter the correct
  #  conference title from your rights confirmation email
  #conference-date: June 03--05, 2018
  #conference-location: Woodstock, NY
  #price: "15.00"
  #isbn: 978-1-4503-XXXX-X/18/06
format:
  acm-pdf:
    keep-tex: true
    documentclass: acmart
    classoption: [acmsmall,manuscript,screen,authorversion,nonacm,timestamp]
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from thousands of new Fediverse accounts, we show that this choice matters and significantly affects the probability that the account remains active in the future. We then use insights from this relationship to build a tool that helps new Fediverse users find a server with a high probability of being a good match for their interests.
execute:
  echo: false
  error: false
  freeze: auto
  fig-width: 6.75
---
```{r}
#| label: r-setup
#| output: false
#| error: false
#| warning: false
library(reticulate)
library(tidyverse)
library(arrow)
library(statnet)
library(network)
library(survival)
library(ggsurvfit)
library(modelsummary)
library(randomForestSRC)
library(grid)
library(scales)
options(arrow.skip_nul = TRUE)
```
We first explore the extent to which server choice matters. We find that accounts that join smaller, more interest-based servers are more likely to continue posting six months after their creation.
Using these findings, we then propose a tool that can help users find servers that match their interests.
# Background
## Newcomers in Online Communities
Onboarding newcomers is an important challenge for online communities. Any community can expect a certain amount of turnover, so attracting and retaining new members is essential for its long-term health and longevity.
RQ: What server attributes correspond with better newcomer retention?
## Migrations in Online Communities
All online communities and accounts trend toward death.
Online fandom communities, for instance...
On Reddit, @newellUserMigrationOnline found that the news aggregator had an advantage over potential competitors because of its catalogue of niche communities: people who migrated to alternative platforms tended to post proportionally more often in popular communities.
+ Fiesler on online fandom communities [@fieslerMovingLandsOnline2020]
+ TeBlunthuis on competition and mutualism [@teblunthuisIdentifyingCompetitionMutualism2022]
+ Work on "alt-tech" communities.
# Empirical Setting
The Fediverse is a set of decentralized online social networks which interoperate using shared protocols like ActivityPub.
Mastodon is a software program used by many Fediverse servers and offers a user experience similar to the Tweetdeck client for Twitter. It was first created in late 2016.
Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines which only show posts in reverse chronological order. Search is also difficult: public hashtags are searchable, but most servers have traditionally not supported searching keywords or simple strings, and accounts can only be searched using their full `username@server` form.
Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; on smaller servers, however, it presents an opportunity to discover new posts and users of potential interest.
Mastodon offers its users high levels of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move to the new account). The choice of an initial instance is consequently not irreversible.
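When an account moves, its public profile includes a `moved` field that points to the new account; the preprocessing in the next section relies on this field to identify migrations. Below is a minimal sketch using a made-up account record (the record contents are hypothetical); it mirrors how the destination server can be derived from the `moved` address.
```{python}
#| eval: false
# Illustrative (made-up) account record shaped like a Mastodon API response.
origin_server = "example.online"
account = {
    "acct": "example_user",
    "moved": {"acct": "example_user@example.social"},
}

moved = account.get("moved")
if moved is not None:
    moved_acct = moved["acct"]
    # A bare username means the new account is on the same server;
    # otherwise the part after '@' names the destination server.
    moved_server = moved_acct.split("@")[1] if "@" in moved_acct else origin_server
    print(f"{account['acct']} moved to {moved_server}")
```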
# Data
```{python}
#| label: py-preprocess-data
#| cache: true
#| output: false
from code.load_accounts import *
from urllib.parse import urlparse
#accounts = pl.concat(
# read_accounts_file("data/accounts.feather"),
# read_accounts_file("data/account_lookup_2023.feather")
#)
accounts = read_accounts_file(
"data/account_lookup_compressed.feather"
).unique(["account", "server"])
# Write a parsed accounts file for R to use
a = accounts.with_columns(
pl.col("url").map_elements(
lambda x: urlparse(x).netloc.encode().decode('idna')
).alias("host"),
pl.col("data_string").str.contains("""\"moved\": \{""").alias("has_moved"),
pl.col("data").struct.field("suspended"),
)
a_save = a.drop(["data", "data_string"])
a_save.select(
sorted(a_save.columns)
).write_ipc("data/scratch/accounts.feather")
moved_accounts = a.filter(pl.col("has_moved")).with_columns(# Do this again now we know the rows are all moved accounts
pl.col("data_string").str.json_decode().alias("data")
).with_columns(
pl.col("data").struct.field("moved")
).drop_nulls("moved").with_columns(
pl.col("moved").struct.field("acct").alias("moved_acct"),
).with_columns(
pl.when(
pl.col("moved_acct").str.contains('@')
).then(
pl.col("moved_acct").str.split('@').list.get(1)
).otherwise(
pl.col("server")
).alias("moved_server")
)
number_of_accounts = len(a)
popular_servers = a.group_by("server").count().sort("count", descending=True)
common_moves = moved_accounts.group_by(
["server", "moved_server"]
).count().sort("count", descending=True)
common_moves.write_ipc("data/scratch/moved_accounts.feather")
common_moves.rename({
"server": "Source",
"moved_server": "Target",
}).write_csv("data/scratch/moved_accounts.csv")
maccounts = moved_accounts.select(["account", "server", "moved_server"])
popular_servers.write_ipc("data/scratch/popular_servers.feather")
jm = pl.read_json("data/joinmastodon.json")
jm.write_ipc("data/scratch/joinmastodon.feather")
read_metadata_file("data/metadata_2023-10-01.feather").drop(
["data", "data_string"]
).write_ipc("data/scratch/metadata.feather")
```
```{python}
#| label: py-preprocess-data2
#| cache: true
#| output: false
from code.load_accounts import read_accounts_file
from urllib.parse import urlparse
import polars as pl
profile_accounts = read_accounts_file("data/profiles_local.feather")
p = profile_accounts.with_columns(
pl.col("url").map_elements(lambda x: urlparse(x).netloc.encode().decode('idna')).alias("host"),
pl.col("username").alias("account"),
pl.lit(False).alias("has_moved"),
pl.lit(False).alias("suspended")
).drop(
["data", "data_string"]
)
p.select(sorted(p.columns)).write_ipc("data/scratch/accounts_processed_profiles.feather")
all_accounts = pl.scan_ipc(
[
"data/scratch/accounts.feather",
#"data/scratch/accounts_processed_recent.feather",
"data/scratch/accounts_processed_profiles.feather"
]).collect()
all_accounts.filter(pl.col("host").eq(pl.col("server"))).unique(["account", "server"]).write_ipc("data/scratch/all_accounts.feather")
```
```{r}
#| eval: false
arrow::read_feather(
"data/scratch/accounts_processed_profiles.feather",
col_select = c(
"server", "username", "created_at",
"last_status_at", "statuses_count",
"has_moved", "bot", "suspended"
)) %>%
mutate(suspended = replace_na(suspended, FALSE)) %>%
filter(!bot) %>%
# TODO: what's going on here?
filter(!is.na(last_status_at)) %>%
# sanity check
filter(created_at >= "2022-01-01") %>%
filter(created_at < "2024-03-01") %>%
# We don't want accounts that were created
# and then immediately stopped being active
filter(statuses_count > 1) %>%
filter(!suspended) %>%
filter(!has_moved) %>%
filter(server == "mastodon.social") %>%
#filter(last_status_at >= created_at) %>%
mutate(created_month = format(created_at, "%Y-%m")) %>%
group_by(created_month) %>%
summarize(count=n()) %>%
distinct(created_month, count) %>%
ggplot(aes(x=created_month, y=count)) +
geom_bar(stat="identity", fill="black") +
labs(y="Count", x="Created Month") +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
```{r}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panel shows the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel represent the announcement day of the Elon Musk Twitter acquisition, the acquisition closing day, a day when Twitter suspended a number of prominent journalists, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 3
#| fig-width: 6.75
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
accounts_unfilt <- arrow::read_feather(
"data/scratch/all_accounts.feather",
col_select=c(
"server", "username", "created_at", "last_status_at",
"statuses_count", "has_moved", "bot", "suspended",
"following_count", "followers_count"
))
accounts <- accounts_unfilt %>%
filter(!bot) %>%
# TODO: what's going on here?
filter(!is.na(last_status_at)) %>%
mutate(suspended = replace_na(suspended, FALSE)) %>%
# sanity check
filter(created_at >= "2020-10-01") %>%
filter(created_at < "2024-01-01") %>%
# We don't want accounts that were created and then immediately stopped being active
filter(statuses_count >= 1) %>%
filter(last_status_at >= created_at) %>%
mutate(active = last_status_at >= "2024-01-01") %>%
mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2024-01-01 00:00:00", tz = "UTC"), last_status_at)) %>%
mutate(active_time = difftime(last_status_at, created_at, units="days")) #%>%
#filter(!has_moved)
acc_data <- accounts %>%
#filter(!has_moved) %>%
mutate(created_month = format(created_at, "%Y-%m")) %>%
mutate(created_week = floor_date(created_at, unit = "week")) %>%
mutate(active_now = active) %>%
mutate(active = active_time >= 45) %>%
mutate("Is mastodon.social" = server == "mastodon.social") %>%
mutate(jm = server %in% jm$domain) %>%
group_by(created_week) %>%
summarize(
`JoinMastodon Server` = sum(jm) / n(),
`Is mastodon.social` = sum(`Is mastodon.social`)/n(),
Suspended = sum(suspended)/n(),
Active = (sum(active)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
active_now = (sum(active_now)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
Moved=sum(has_moved)/n(),
count=n()) %>%
pivot_longer(cols=c("JoinMastodon Server", "active_now", "Active", "Moved", "Is mastodon.social"), names_to="Measure", values_to="value") # "Suspended"
theme_bw_small_labels <- function(base_size = 9) {
theme_bw(base_size = base_size) %+replace%
theme(
plot.title = element_text(size = base_size * 0.8),
plot.subtitle = element_text(size = base_size * 0.75),
plot.caption = element_text(size = base_size * 0.7),
axis.title = element_text(size = base_size * 0.9),
axis.text = element_text(size = base_size * 0.8),
legend.title = element_text(size = base_size * 0.9),
legend.text = element_text(size = base_size * 0.8)
)
}
p1 <- acc_data %>%
ggplot(aes(x=as.Date(created_week), group=1)) +
geom_line(aes(y=value, group=Measure, color=Measure)) +
geom_point(aes(y=value, color=Measure), size=0.7) +
scale_y_continuous(limits = c(0, 1.0)) +
labs(y="Proportion") + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week") +
theme_bw_small_labels() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2 <- acc_data %>%
distinct(created_week, count) %>%
ggplot(aes(x=as.Date(created_week), y=count)) +
geom_bar(stat="identity", fill="black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-10-27"))),
linetype="dashed", color = "black") +
#geom_text(
# aes(x=as.Date("2022-10-27"),
# y=max(count),
# label=" Elon Musk Twitter Acquisition Completed"),
# vjust=-1, hjust=0, color="black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-04-14"))),
linetype="dashed", color = "black") +
# https://twitter.com/elonmusk/status/1675187969420828672
geom_vline(
aes(xintercept = as.numeric(as.Date("2022-12-15"))),
linetype="dashed", color = "black") +
geom_vline(
aes(xintercept = as.numeric(as.Date("2023-07-01"))),
linetype="dashed", color = "black") +
#scale_y_continuous(limits = c(0, max(acc_data$count) + 100000)) +
scale_y_continuous(labels = scales::comma) +
labs(y="Count", x="Created Week") +
theme_bw_small_labels() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week")
#grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
library(patchwork)
p1 + p2 + plot_layout(ncol = 1)
```
**Mastodon Profiles**: We identified accounts from posts previously collected from public Mastodon timelines between October 2020 and January 2024. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved. This gave us a total of `r nrow(accounts)` accounts.
**Moved Profiles**: We found a subset of `r accounts %>% filter(has_moved) %>% nrow` accounts that had moved from one server to another.
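As a minimal sketch of this lookup step (the server and account name below are placeholders, and some servers may rate limit or restrict this endpoint), an account's current public profile, including its most recent status date and any `moved` field, can be fetched like this:
```{python}
#| eval: false
import requests

# Fetch the current public profile for an account by its handle.
# The server and account name are placeholders for illustration.
resp = requests.get(
    "https://mastodon.social/api/v1/accounts/lookup",
    params={"acct": "example_user"},
    timeout=30,
)
resp.raise_for_status()
profile = resp.json()

print(profile.get("last_status_at"))  # date of the most recent status
print(profile.get("statuses_count"))  # total number of posted statuses
print("moved" in profile)             # whether the account has moved
```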
# Results
## Activity By Server Size
```{r}
#| label: fig-active-accounts
#| eval: false
#library(betareg)
library(lme4)
activity <- arrow::read_feather(
"data/scratch/activity.feather",
col_select = c("server", "logins")
) %>%
arrange(desc(logins)) %>%
mutate(server_count = logins)
account_data <- inner_join(accounts, activity, by="server") %>%
mutate(active = active_time >= 45)
a_data <- account_data %>%
#mutate(active = active_time >= 45) %>%
group_by(server) %>%
summarize(active_prop = sum(active)/n(), active_count = sum(active), count=n()) %>%
inner_join(., activity, by="server")
a_model <- glmer(active ~ log1p(logins) + (1|server), data=account_data, family=binomial)
#betareg(active_prop ~ log10(count), data = a_data)
logins_seq <- seq(min(log1p(account_data$logins)), max(log1p(account_data$logins)), length.out = 100)
a_pred <- predict(
a_model,
newdata = data.frame(logins = logins_seq, server = factor(1)),
type = "response",
re.form = NA)
pred_data <- data.frame(logins = logins_seq, active_prop = a_pred)
a_data %>%
mutate(logins = log1p(logins)) %>%
ggplot(aes(y=active_prop, x=logins)) +
geom_point(alpha=0.1) +
# help here
#geom_line(aes(y = a_pred)) +
geom_line(data = pred_data, aes(x = logins, y = active_prop), color = "red") + # Use pred_data for line
labs(
y = "Active after 45 Days",
x = "Accounts"
) +
scale_x_continuous(labels = scales::comma) +
#scale_y_log10() +
theme_bw_small_labels()
```
```{r}
#| eval: false
library(fable)
#library(fable.binary)
library(tsibble)
library(lubridate)
ad_time <- account_data |>
mutate(created_at = yearweek(created_at)) |>
group_by(server, created_at) |>
summarize(count = n(), active = sum(active)) |>
as_tsibble(key="server", index=created_at)
```
```{r}
#| eval: false
fit <- ad_time |>
model(
logistic = LOGISTIC(active ~ fourier(K = 5, period = "year"))
)
```
```{r}
#| eval: false
ad_time |>
filter(server == "mastodon.social") |>
sample_n(100) |>
autoplot(active)
```
```{r}
#| label: fig-account-activity-prop
#| fig-cap: "Account Activity Over Time"
#| fig-height: 4
#| eval: false
study_period <- 45
last_day <- "2024-01-15"
#formerly accounts_processed_recent
#server_counts <- arrow::read_feather(
# "data/scratch/accounts.feather",
# col_select=c("server", "username", "created_at", "bot")
# ) %>%
# filter(created_at <= "2023-03-01") %>%
# filter(!bot) %>%
# group_by(server) %>%
# summarize(server_count = n()) %>%
# arrange(desc(server_count)) %>%
# mutate(server_count_bin = floor(log10(server_count)))
metadata <- arrow::read_feather("data/scratch/metadata.feather", col_select=c("server", "user_count")) %>%
arrange(desc(user_count)) %>%
mutate(server_count = user_count) %>%
mutate(server_count_bin = floor(log10(server_count))) %>%
mutate(server_count_bin = ifelse(server_count_bin >= 4, 4, server_count_bin)) %>%
mutate(server_count_bin = ifelse(server_count_bin <= 2, 2, server_count_bin))
activity <- arrow::read_feather(
"data/scratch/activity.feather",
col_select = c("server", "logins")
) %>%
arrange(desc(logins)) %>%
mutate(server_count = logins) %>%
mutate(server_count_bin = floor(log10(server_count))) %>%
# Merge 4 and 5
#mutate(server_count_bin = ifelse(server_count_bin >= 5, 4, server_count_bin)) %>%
# Merge below 2
#mutate(server_count_bin = ifelse((server_count_bin <= 2) & (server_count_bin >= 1), 2, server_count_bin)) %>%
mutate(server_count_bin = ifelse(server_count == 0, -1, server_count_bin))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
a <- accounts %>%
filter(!has_moved) %>%
#filter(created_at >= "2023-06-01") %>%
#filter(created_at < "2023-08-01") %>%
filter(created_at >= "2023-10-15") %>%
filter(created_at < "2023-12-01") %>%
inner_join(activity, by="server") %>%
filter(created_at < last_status_at) %>%
#mutate(large_server = server_count > 1000) %>%
mutate(active_time = as.integer(active_time)) %>%
mutate(active_time_weeks = active_time) %>%
mutate(status = ifelse(active, 0, 1)) %>%
mutate(jm = server %in% jm$domain) #%>% filter(server_count > 0)
survfit2(Surv(active_time_weeks, status) ~ strata(server_count_bin) + 1, data = a) %>% # strata(server_count_bin)
ggsurvfit() +
add_confidence_interval() +
scale_y_continuous(limits = c(0, 1)) +
labs(
y = "Overall survival probability",
x = "Time (days)",
) +
#scale_x_continuous(
# breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
# labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
#) +
theme_bw_small_labels() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r}
a %>% filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
mutate(is_general = category=="general") %>%
mutate(is_en = language == "en") %>%
mutate(is_large = last_week_users >= 585) %>% #filter(following_count < 10) %>%
survfit2(Surv(active_time_weeks, status) ~ is_general + is_large, data = .) %>% # strata(server_count_bin)
ggsurvfit(linetype_aes=TRUE, type = "survival") +
add_confidence_interval() +
scale_y_continuous(limits = c(0, 1)) +
labs(
y = "Overall survival probability",
x = "Time (days)",
) +
#facet_wrap(~strata, nrow = 3) +
#scale_x_continuous(
# breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
# labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
#) +
add_censor_mark() +
theme_bw_small_labels() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r}
#| eval: false
library(coxme)
sel_a <- a %>%
filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
#mutate(is_general = category=="general") %>%
rowwise() %>%
mutate(is_regional = "regional" %in% categories) %>%
mutate(is_general = "general" %in% categories) %>%
mutate(is_neither = !(is_regional | is_general)) %>%
mutate(is_en = language == "en") %>%
rowwise() %>%
mutate(n_categories = length(categories) - is_regional - is_general) %>%
mutate(many_categories = n_categories > 0) %>%
mutate(is_large = last_week_users >= 585) %>%
mutate(follows_someone = followers_count > 0) %>% filter(server_count > 0) %>%
ungroup
#cx <- coxme(Surv(active_time_weeks, status) ~ is_large + is_general + approval_required + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ many_categories + is_general*is_regional + is_general:log1p(server_count), data = sel_a, x=TRUE)
coxme(Surv(active_time_weeks, status) ~ is_neither + is_general:log1p(server_count) + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ is_neither + many_categories + is_general:log10(server_count), data = sel_a, x=TRUE)
cz <- cox.zph(cx)
#plot(cz)
cz
```
```{r}
#| eval: false
options(rf.cores=2, mc.cores=2)
for_data <- sel_a #%>% slice_sample(n=2500)
obj <- rfsrc.fast(Surv(active_time_weeks, status) ~ is_neither + is_general*server_count, data = for_data, ntree=100, forest=TRUE)
#predictions <- predict(obj, newdata = newData)$predicted
#plot(get.tree(obj, 1))
reg.smp.o <- subsample(obj, B = 10, verbose = TRUE)#, subratio = .5)
plot.subsample(reg.smp.o)
```
## Moved Accounts
```{r}
#| label: fig-moved-accounts
#| fig-height: 4
#| eval: false
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
server_movement_data <- left_join(
(moved_accounts %>% group_by(server) %>% summarize(out_count = sum(count)) %>% select(server, out_count)),
(moved_accounts %>% group_by(moved_server) %>% summarize(in_count = sum(count)) %>% select(moved_server, in_count) %>% rename(server=moved_server)),
by="server"
) %>% replace_na(list(out_count = 0, in_count = 0)) %>%
mutate(diff = in_count - out_count) %>%
arrange(diff) %>%
left_join(., popular_servers, by="server") %>%
rename(user_count = count) %>% arrange(desc(user_count))
server_movement_data %>%
ggplot(aes(x=user_count, y=diff)) +
geom_point() + scale_x_log10() + theme_bw_small_labels()
```
If there were no relationship, we would expect these moves to be random with respect to server size.
```{r}
popular_servers <-
arrow::read_feather("data/scratch/popular_servers.feather")
moved_accounts <-
arrow::read_feather("data/scratch/moved_accounts.feather") %>%
# Remove loops
filter(server != moved_server)
activity <-
arrow::read_feather("data/scratch/activity.feather",
col_select = c("server", "logins")) %>%
arrange(desc(logins))
popular_and_large_servers <-
popular_servers %>% filter(count >= 1) %>%
mutate(count = log10(count))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
ma <- moved_accounts %>%
filter(server %in% popular_and_large_servers$server) %>%
filter(moved_server %in% popular_and_large_servers$server)
# Construct network
edgeNet <- network(ma, matrix.type = "edgelist")
edgeNet %v% "user_count" <-
left_join((as_tibble(edgeNet %v% 'vertex.names') %>% rename(server = value)),
popular_and_large_servers,
by = "server") %>%
select(count) %>%
unlist()
edgeNet %v% "in_jm" <-
as_tibble(edgeNet %v% 'vertex.names') %>%
mutate(in_jm = value %in% jm$domain) %>%
select(in_jm) %>% unlist()
```
We construct an exponential family random graph model (ERGM) where nodes represent servers and weighted directed edges represent the number of accounts that moved between servers.
$$
\begin{aligned}
\text{Sum}_{i,j} = & \ \beta_0 + \beta_1 \left( \log_{10}(\text{user count}_j) - \log_{10}(\text{user count}_i) \right) \\
& + \beta_2 \left( \log_{10}(\text{user count}_i) + \log_{10}(\text{user count}_j) \right) \\
& + \beta_3 \, [\text{both } i \text{ and } j \text{ in JoinMastodon}] \\
& + \beta_4 \, [\text{neither } i \text{ nor } j \text{ in JoinMastodon}]
\end{aligned}
$$
```{r}
#| label: ergm-model
#| cache: true
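# Valued ERGM terms (response = "count", the number of accounts moving along each edge):
#   sum                   -- baseline edge weight (intercept)
#   diff("user_count")    -- difference in log10 user count between origin and destination server
#   nodecov("user_count") -- combined log10 user count of the two endpoint servers
#   nodematch("in_jm")    -- whether both endpoints are (or both are not) listed on JoinMastodon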
m1 <-
ergm(
edgeNet ~ sum +
diff("user_count", pow = 1, form = "sum") +
nodecov("user_count", form = "sum") +
nodematch("in_jm", diff = TRUE, form = "sum"),
response = "count",
reference = ~ Binomial(3),
control=control.ergm(parallel=4, parallel.type="PSOCK")
)
save(m1, file = "data/scratch/ergm-model.rda")
```
```{r}
#| label: tag-ergm-result
#| output: asis
ergm_model <- load("data/scratch/ergm-model.rda")
modelsummary(
m1,
escape = FALSE,
coef_rename = c(
"sum" = "$\\beta_0$ Intercept",
"diff.sum.t-h.user_count" = "$\\beta_1$ User Count Difference",
"nodecov.sum.user_count" = "$\\beta_2$ User Count (Node Covariate)",
"nodematch.sum.in_jm.TRUE" = "$\\beta_3$ In JoinMastodon (Both True)",
"nodematch.sum.in_jm.FALSE" = "$\\beta_4$ In JoinMastodon (Both False)"
),
)
```
We find a strong preference for accounts to move from large servers to smaller servers.
```{python}
#| eval: false
#| include: false
import random
def simulate_account_moves(origin: str, servers: dict, n: int):
server_list = list(set(servers.keys()) - {origin})
weights = [servers[x] for x in server_list]
return pl.DataFrame({
"simulation": list(range(n)),
"server": [origin] * n,
"moved_server": random.choices(server_list, weights=weights, k=n)
})
simulations = pl.concat([simulate_account_moves(row["server"], {x["server"]: x["count"] for x in popular_servers.iter_rows(named=True)}, 1000) for row in maccounts.iter_rows(named=True)])
m_counts = maccounts.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
sim_counts = simulations.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
```
## Tag Clusters
We found _number_ posts which contained between two and five tags.
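One way such tag co-occurrence counts could be assembled before clustering is sketched below; this is illustrative only, and the frame and column names (`posts`, `tags`) are hypothetical rather than drawn from our pipeline.
```{python}
#| eval: false
import polars as pl
from itertools import combinations

# Hypothetical posts frame with a list-of-strings `tags` column.
posts = pl.DataFrame({
    "post_id": [1, 2, 3],
    "tags": [["mastodon", "fediverse"], ["art"], ["art", "photography", "cats"]],
})

tag_pairs = (
    posts
    # Keep posts that use between two and five tags.
    .filter(pl.col("tags").list.len().is_between(2, 5))
    # Enumerate unordered tag pairs within each post.
    .with_columns(
        pl.col("tags").map_elements(
            lambda tags: ["|".join(p) for p in combinations(sorted(set(tags)), 2)],
            return_dtype=pl.List(pl.Utf8),
        ).alias("pair")
    )
    .explode("pair")
    # Count how often each pair of tags appears together across posts.
    .group_by("pair")
    .count()
    .sort("count", descending=True)
)
print(tag_pairs)
```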
# References {#references}