---
title: Best Practices for Onboarding on the Fediverse
short-title: Onboarding Fediverse
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    #roles: writing
    corresponding: true
bibliography: references.bib
acm-metadata:
  final: false
  copyright-year: 2024
  acm-year: 2024
  copyright: rightsretained
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "PACMHCI"
  #conference-name: |
  # Make sure to enter the correct
  # conference title from your rights confirmation email
  #conference-date: June 03--05, 2018
  #conference-location: Woodstock, NY
  #price: "15.00"
  #isbn: 978-1-4503-XXXX-X/18/06
format:
  acm-pdf:
    keep-tex: true
    documentclass: acmart
    classoption: [acmsmall,manuscript,screen,authorversion,nonacm,timestamp]
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from thousands of new Fediverse accounts, we show that this choice matters and significantly affects the probability that the account remains active in the future. We then use insights from this relationship to build a tool that can help new Fediverse users find a server with a high probability of being a good match based on their interests.
execute:
  echo: false
  error: false
  freeze: auto
fig-width: 6.75
---

```{r}
#| label: r-setup
#| output: false
#| error: false
#| warning: false
library(reticulate)
library(tidyverse)
library(arrow)
library(statnet)
library(network)
library(survival)
library(ggsurvfit)
library(modelsummary)
library(randomForestSRC)
library(grid)
library(scales)

options(arrow.skip_nul = TRUE)
```

We first explore the extent to which server choice matters. We find that accounts that join smaller, more interest-based servers are more likely to continue posting six months after their creation.

Using these findings, we then propose a tool that can help users find servers that match their interests.
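
To make the idea concrete, the following is a minimal sketch of such a matcher, not the tool itself: it assumes we already have per-server hashtag counts (all names and inputs below are hypothetical) and ranks servers by the cosine similarity between a user's stated interest tags and each server's tag distribution.

```python
# Illustrative sketch only (not the deployed tool): rank servers by cosine
# similarity between a user's interest tags and each server's tag counts.
# `server_tag_counts` maps server -> {tag: count}; inputs are hypothetical.
import math

def recommend_servers(interests, server_tag_counts, k=5):
    interests = set(interests)
    scores = []
    for server, tag_counts in server_tag_counts.items():
        # Dot product between a binary interest vector and the server's counts.
        dot = sum(c for t, c in tag_counts.items() if t in interests)
        norm = math.sqrt(sum(c * c for c in tag_counts.values())) * math.sqrt(len(interests))
        scores.append((server, dot / norm if norm else 0.0))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical usage:
# recommend_servers(["histodons", "rstats"],
#                   {"hcommons.social": {"histodons": 120, "art": 30},
#                    "fosstodon.org": {"rstats": 80, "linux": 400}})
```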

# Background

## Newcomers in Online Communities

Onboarding newcomers is essential for online communities. Any community can expect a certain amount of turnover, so the ability to bring in new members is important for its long-term health and longevity.

RQ: What server attributes correspond with better newcomer retention?

## Migrations in Online Communities

All online communities and accounts trend toward death.

Online fandom communities, for instance...

On Reddit, @newellUserMigrationOnline found that the news aggregator had an advantage over potential competitors because of its catalogue of niche communities: people who migrated to alternative platforms tended, proportionally, to post most often in popular communities.

+ Fiesler on online fandom communities [@fieslerMovingLandsOnline2020]
+ TeBlunthuis on competition and mutualism [@teblunthuisIdentifyingCompetitionMutualism2022]
+ Work on "alt-tech" communities.
# Empirical Setting

The Fediverse is a set of decentralized online social networks which interoperate using shared protocols such as ActivityPub.

Mastodon, first created in late 2016, is a software program used by many Fediverse servers and offers a user experience similar to the Tweetdeck client for Twitter.

Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines which only show posts in reverse chronological order. Search is also limited: public hashtags are searchable, but most servers have traditionally not supported searching keywords or arbitrary strings, and accounts can only be searched using their full `username@server` form.

Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; on smaller servers, however, it presents an opportunity to discover new posts and accounts of potential interest.

Mastodon offers its users a high level of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move with the new account). The choice of an initial instance is consequently not irreversible.
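
When an account has moved, the record returned for it includes a `moved` field pointing to the new account. As a minimal illustration (the record below is made up), the target server can be recovered the same way the preprocessing in the next section does: split the `acct` value on `@`, falling back to the home server when no domain is given.

```python
# Illustrative only: a made-up account record with the `moved` field present.
record = {
    "acct": "alice",
    "url": "https://mastodon.social/@alice",
    "moved": {"acct": "alice@hci.social"},
}

def moved_target_server(record: dict, home_server: str):
    """Return the server an account moved to, or None if it has not moved."""
    moved = record.get("moved")
    if moved is None:
        return None
    acct = moved["acct"]
    # A bare username means the target account lives on the same server.
    return acct.split("@", 1)[1] if "@" in acct else home_server

moved_target_server(record, "mastodon.social")  # "hci.social"
```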

# Data

```{python}
#| label: py-preprocess-data
#| cache: true
#| output: false

from code.load_accounts import *
from urllib.parse import urlparse

#accounts = pl.concat(
#    read_accounts_file("data/accounts.feather"),
#    read_accounts_file("data/account_lookup_2023.feather")
#)
accounts = read_accounts_file(
    "data/account_lookup_compressed.feather"
).unique(["account", "server"])
# Write a parsed accounts file for R to use
a = accounts.with_columns(
    pl.col("url").map_elements(
        lambda x: urlparse(x).netloc.encode().decode('idna')
    ).alias("host"),
    pl.col("data_string").str.contains("""\"moved\": \{""").alias("has_moved"),
    pl.col("data").struct.field("suspended"),
)

a_save = a.drop(["data", "data_string"])
a_save.select(
    sorted(a_save.columns)
).write_ipc("data/scratch/accounts.feather")

# Do this again now we know the rows are all moved accounts
moved_accounts = a.filter(pl.col("has_moved")).with_columns(
    pl.col("data_string").str.json_decode().alias("data")
).with_columns(
    pl.col("data").struct.field("moved")
).drop_nulls("moved").with_columns(
    pl.col("moved").struct.field("acct").alias("moved_acct"),
).with_columns(
    pl.when(
        pl.col("moved_acct").str.contains('@')
    ).then(
        pl.col("moved_acct").str.split('@').list.get(1)
    ).otherwise(
        pl.col("server")
    ).alias("moved_server")
)

number_of_accounts = len(a)

popular_servers = a.group_by("server").count().sort("count", descending=True)

common_moves = moved_accounts.group_by(
    ["server", "moved_server"]
).count().sort("count", descending=True)

common_moves.write_ipc("data/scratch/moved_accounts.feather")
common_moves.rename({
    "server": "Source",
    "moved_server": "Target",
}).write_csv("data/scratch/moved_accounts.csv")

maccounts = moved_accounts.select(["account", "server", "moved_server"])

popular_servers.write_ipc("data/scratch/popular_servers.feather")

jm = pl.read_json("data/joinmastodon.json")
jm.write_ipc("data/scratch/joinmastodon.feather")

read_metadata_file("data/metadata_2023-10-01.feather").drop(
    ["data", "data_string"]
).write_ipc("data/scratch/metadata.feather")
```

```{python}
#| label: py-preprocess-data2
#| cache: true
#| output: false

from code.load_accounts import read_accounts_file
from urllib.parse import urlparse
import polars as pl

profile_accounts = read_accounts_file("data/profiles_local.feather")
p = profile_accounts.with_columns(
    pl.col("url").map_elements(lambda x: urlparse(x).netloc.encode().decode('idna')).alias("host"),
    pl.col("username").alias("account"),
    pl.lit(False).alias("has_moved"),
    pl.lit(False).alias("suspended")
).drop(
    ["data", "data_string"]
)
p.select(sorted(p.columns)).write_ipc("data/scratch/accounts_processed_profiles.feather")
all_accounts = pl.scan_ipc(
    [
        "data/scratch/accounts.feather",
        #"data/scratch/accounts_processed_recent.feather",
        "data/scratch/accounts_processed_profiles.feather"
    ]).collect()
all_accounts.filter(pl.col("host").eq(pl.col("server"))).unique(["account", "server"]).write_ipc("data/scratch/all_accounts.feather")
```

```{r}
#| eval: false
arrow::read_feather(
  "data/scratch/accounts_processed_profiles.feather",
  col_select = c(
    "server", "username", "created_at",
    "last_status_at", "statuses_count",
    "has_moved", "bot", "suspended"
  )) %>%
  mutate(suspended = replace_na(suspended, FALSE)) %>%
  filter(!bot) %>%
  # TODO: what's going on here?
  filter(!is.na(last_status_at)) %>%
  # sanity check
  filter(created_at >= "2022-01-01") %>%
  filter(created_at < "2024-03-01") %>%
  # We don't want accounts that were created
  # and then immediately stopped being active
  filter(statuses_count > 1) %>%
  filter(!suspended) %>%
  filter(!has_moved) %>%
  filter(server == "mastodon.social") %>%
  #filter(last_status_at >= created_at) %>%
  mutate(created_month = format(created_at, "%Y-%m")) %>%
  group_by(created_month) %>%
  summarize(count=n()) %>%
  distinct(created_month, count) %>%
  ggplot(aes(x=created_month, y=count)) +
  geom_bar(stat="identity", fill="black") +
  labs(y="Count", x="Created Month") +
  theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

```{r}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panel shows the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel represent the announcement day of the Elon Musk Twitter acquisition, the acquisition closing day, a day when Twitter suspended a number of prominent journalists, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 3
#| fig-width: 6.75

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
accounts_unfilt <- arrow::read_feather(
  "data/scratch/all_accounts.feather",
  col_select=c(
    "server", "username", "created_at", "last_status_at",
    "statuses_count", "has_moved", "bot", "suspended",
    "following_count", "followers_count"
  ))
accounts <- accounts_unfilt %>%
  filter(!bot) %>%
  # TODO: what's going on here?
  filter(!is.na(last_status_at)) %>%
  mutate(suspended = replace_na(suspended, FALSE)) %>%
  # sanity check
  filter(created_at >= "2020-10-01") %>%
  filter(created_at < "2024-01-01") %>%
  # We don't want accounts that were created and then immediately stopped being active
  filter(statuses_count >= 1) %>%
  filter(last_status_at >= created_at) %>%
  mutate(active = last_status_at >= "2024-01-01") %>%
  mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2024-01-01 00:00:00", tz = "UTC"), last_status_at)) %>%
  mutate(active_time = difftime(last_status_at, created_at, units="days")) #%>%
  #filter(!has_moved)
acc_data <- accounts %>%
  #filter(!has_moved) %>%
  mutate(created_month = format(created_at, "%Y-%m")) %>%
  mutate(created_week = floor_date(created_at, unit = "week")) %>%
  mutate(active_now = active) %>%
  mutate(active = active_time >= 45) %>%
  mutate("Is mastodon.social" = server == "mastodon.social") %>%
  mutate(jm = server %in% jm$domain) %>%
  group_by(created_week) %>%
  summarize(
    `JoinMastodon Server` = sum(jm) / n(),
    `Is mastodon.social` = sum(`Is mastodon.social`)/n(),
    Suspended = sum(suspended)/n(),
    Active = (sum(active)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
    active_now = (sum(active_now)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
    Moved=sum(has_moved)/n(),
    count=n()) %>%
  pivot_longer(cols=c("JoinMastodon Server", "active_now", "Active", "Moved", "Is mastodon.social"), names_to="Measure", values_to="value") # "Suspended"
theme_bw_small_labels <- function(base_size = 9) {
  theme_bw(base_size = base_size) %+replace%
    theme(
      plot.title = element_text(size = base_size * 0.8),
      plot.subtitle = element_text(size = base_size * 0.75),
      plot.caption = element_text(size = base_size * 0.7),
      axis.title = element_text(size = base_size * 0.9),
      axis.text = element_text(size = base_size * 0.8),
      legend.title = element_text(size = base_size * 0.9),
      legend.text = element_text(size = base_size * 0.8)
    )
}

p1 <- acc_data %>%
  ggplot(aes(x=as.Date(created_week), group=1)) +
  geom_line(aes(y=value, group=Measure, color=Measure)) +
  geom_point(aes(y=value, color=Measure), size=0.7) +
  scale_y_continuous(limits = c(0, 1.0)) +
  labs(y="Proportion") + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week") +
  theme_bw_small_labels() +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2 <- acc_data %>%
  distinct(created_week, count) %>%
  ggplot(aes(x=as.Date(created_week), y=count)) +
  geom_bar(stat="identity", fill="black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-10-27"))),
    linetype="dashed", color = "black") +
  #geom_text(
  #  aes(x=as.Date("2022-10-27"),
  #      y=max(count),
  #      label=" Elon Musk Twitter Acquisition Completed"),
  #  vjust=-1, hjust=0, color="black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-04-14"))),
    linetype="dashed", color = "black") +
  # https://twitter.com/elonmusk/status/1675187969420828672
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-12-15"))),
    linetype="dashed", color = "black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2023-07-01"))),
    linetype="dashed", color = "black") +
  #scale_y_continuous(limits = c(0, max(acc_data$count) + 100000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(y="Count", x="Created Week") +
  theme_bw_small_labels() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week")
#grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
library(patchwork)
p1 + p2 + plot_layout(ncol = 1)
```

**Mastodon Profiles**: We identified accounts from posts previously collected from public Mastodon timelines between October 2020 and January 2024. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved. This gave us a total of `r nrow(accounts)` accounts.

**Moved Profiles**: We found a subset of `r accounts %>% filter(has_moved) %>% nrow` accounts which had moved from one server to another.
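
As an illustration of the refresh step, the following is a minimal sketch assuming the public Mastodon `/api/v1/accounts/lookup` endpoint; it is illustrative only and omits the rate limiting and error handling a real crawl requires.

```python
# Sketch: refresh one account's public record from its home server.
# Assumes the public Mastodon REST API; not the exact collection pipeline.
import requests

def lookup_account(server: str, username: str) -> dict:
    resp = requests.get(
        f"https://{server}/api/v1/accounts/lookup",
        params={"acct": username},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "account": data["acct"],
        "server": server,
        "created_at": data["created_at"],
        "last_status_at": data.get("last_status_at"),
        "statuses_count": data["statuses_count"],
        "has_moved": "moved" in data,
        "suspended": data.get("suspended", False),
    }
```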

# Results

## Activity By Server Size

```{r}
#| label: fig-active-accounts
#| eval: false
#library(betareg)
library(lme4)
activity <- arrow::read_feather(
  "data/scratch/activity.feather",
  col_select = c("server", "logins")
) %>%
  arrange(desc(logins)) %>%
  mutate(server_count = logins)

account_data <- inner_join(accounts, activity, by="server") %>%
  mutate(active = active_time >= 45)

a_data <- account_data %>%
  #mutate(active = active_time >= 45) %>%
  group_by(server) %>%
  summarize(active_prop = sum(active)/n(), active_count = sum(active), count=n()) %>%
  inner_join(., activity, by="server")

a_model <- glmer(active ~ log1p(logins) + (1|server), data=account_data, family=binomial)
#betareg(active_prop ~ log10(count), data = a_data)

logins_seq <- seq(min(log1p(account_data$logins)), max(log1p(account_data$logins)), length.out = 100)
a_pred <- predict(
  a_model,
  # logins_seq is already on the log1p scale; convert back so the model's
  # log1p() transformation is only applied once during prediction
  newdata = data.frame(logins = expm1(logins_seq), server = factor(1)),
  type = "response",
  re.form = NA)
pred_data <- data.frame(logins = logins_seq, active_prop = a_pred)

a_data %>%
  mutate(logins = log1p(logins)) %>%
  ggplot(aes(y=active_prop, x=logins)) +
  geom_point(alpha=0.1) +
  #geom_line(aes(y = a_pred)) +
  geom_line(data = pred_data, aes(x = logins, y = active_prop), color = "red") + # Use pred_data for line
  labs(
    y = "Active after 45 Days",
    x = "Accounts"
  ) +
  scale_x_continuous(labels = scales::comma) +
  #scale_y_log10() +
  theme_bw_small_labels()
```

```{r}
#| eval: false
library(fable)
#library(fable.binary)
library(tsibble)
library(lubridate)

ad_time <- account_data |>
  mutate(created_at = yearweek(created_at)) |>
  group_by(server, created_at) |>
  summarize(count = n(), active = sum(active)) |>
  as_tsibble(key="server", index=created_at)
```

```{r}
#| eval: false
fit <- ad_time |>
  model(
    logistic = LOGISTIC(active ~ fourier(K = 5, period = "year"))
  )
```

```{r}
#| eval: false
ad_time |>
  filter(server == "mastodon.social") |>
  sample_n(100) |>
  autoplot(active)
```

```{r}
#| label: fig-account-activity-prop
#| fig-cap: "Account Activity Over Time"
#| fig-height: 4
#| eval: false
study_period <- 45
last_day <- "2024-01-15"
#formerly accounts_processed_recent
#server_counts <- arrow::read_feather(
#    "data/scratch/accounts.feather",
#    col_select=c("server", "username", "created_at", "bot")
#  ) %>%
#  filter(created_at <= "2023-03-01") %>%
#  filter(!bot) %>%
#  group_by(server) %>%
#  summarize(server_count = n()) %>%
#  arrange(desc(server_count)) %>%
#  mutate(server_count_bin = floor(log10(server_count)))

metadata <- arrow::read_feather("data/scratch/metadata.feather", col_select=c("server", "user_count")) %>%
  arrange(desc(user_count)) %>%
  mutate(server_count = user_count) %>%
  mutate(server_count_bin = floor(log10(server_count))) %>%
  mutate(server_count_bin = ifelse(server_count_bin >= 4, 4, server_count_bin)) %>%
  mutate(server_count_bin = ifelse(server_count_bin <= 2, 2, server_count_bin))

activity <- arrow::read_feather(
  "data/scratch/activity.feather",
  col_select = c("server", "logins")
) %>%
  arrange(desc(logins)) %>%
  mutate(server_count = logins) %>%
  mutate(server_count_bin = floor(log10(server_count))) %>%
  # Merge 4 and 5
  #mutate(server_count_bin = ifelse(server_count_bin >= 5, 4, server_count_bin)) %>%
  # Merge below 2
  #mutate(server_count_bin = ifelse((server_count_bin <= 2) & (server_count_bin >= 1), 2, server_count_bin)) %>%
  mutate(server_count_bin = ifelse(server_count == 0, -1, server_count_bin))

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")

a <- accounts %>%
  filter(!has_moved) %>%
  #filter(created_at >= "2023-06-01") %>%
  #filter(created_at < "2023-08-01") %>%
  filter(created_at >= "2023-10-15") %>%
  filter(created_at < "2023-12-01") %>%
  inner_join(activity, by="server") %>%
  filter(created_at < last_status_at) %>%
  #mutate(large_server = server_count > 1000) %>%
  mutate(active_time = as.integer(active_time)) %>%
  mutate(active_time_weeks = active_time) %>%
  mutate(status = ifelse(active, 0, 1)) %>%
  mutate(jm = server %in% jm$domain) #%>% filter(server_count > 0)

survfit2(Surv(active_time_weeks, status) ~ strata(server_count_bin) + 1, data = a) %>%
  ggsurvfit() +
  add_confidence_interval() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    y = "Overall survival probability",
    x = "Time (days)",
  ) +
  #scale_x_continuous(
  #  breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
  #  labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
  #) +
  theme_bw_small_labels() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

```{r}
a %>% filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
  mutate(is_general = category=="general") %>%
  mutate(is_en = language == "en") %>%
  mutate(is_large = last_week_users >= 585) %>% #filter(following_count < 10) %>%
  survfit2(Surv(active_time_weeks, status) ~ is_general + is_large, data = .) %>%
  ggsurvfit(linetype_aes=TRUE, type = "survival") +
  add_confidence_interval() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    y = "Overall survival probability",
    x = "Time (days)",
  ) +
  #facet_wrap(~strata, nrow = 3) +
  #scale_x_continuous(
  #  breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
  #  labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
  #) +
  add_censor_mark() +
  theme_bw_small_labels() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

```{r}
#| eval: false
library(coxme)

sel_a <- a %>%
  filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
  #mutate(is_general = category=="general") %>%
  rowwise() %>%
  mutate(is_regional = "regional" %in% categories) %>%
  mutate(is_general = "general" %in% categories) %>%
  mutate(is_neither = !(is_regional | is_general)) %>%
  mutate(is_en = language == "en") %>%
  rowwise() %>%
  mutate(n_categories = length(categories) - is_regional - is_general) %>%
  mutate(many_categories = n_categories > 0) %>%
  mutate(is_large = last_week_users >= 585) %>%
  mutate(follows_someone = followers_count > 0) %>% filter(server_count > 0) %>%
  ungroup
#cx <- coxme(Surv(active_time_weeks, status) ~ is_large + is_general + approval_required + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ many_categories + is_general*is_regional + is_general:log1p(server_count), data = sel_a, x=TRUE)
coxme(Surv(active_time_weeks, status) ~ is_neither + is_general:log1p(server_count) + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ is_neither + many_categories + is_general:log10(server_count), data = sel_a, x=TRUE)
cz <- cox.zph(cx)
#plot(cz)
cz
```

```{r}
#| eval: false
options(rf.cores=2, mc.cores=2)
for_data <- sel_a #%>% slice_sample(n=2500)
obj <- rfsrc.fast(Surv(active_time_weeks, status) ~ is_neither + is_general*server_count, data = for_data, ntree=100, forest=TRUE)
#predictions <- predict(obj, newdata = newData)$predicted
#plot(get.tree(obj, 1))
reg.smp.o <- subsample(obj, B = 10, verbose = TRUE)#, subratio = .5)
plot.subsample(reg.smp.o)
```

## Moved Accounts

```{r}
#| label: fig-moved-accounts
#| fig-height: 4
#| eval: false
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
server_movement_data <- left_join(
  (moved_accounts %>% group_by(server) %>% summarize(out_count = sum(count)) %>% select(server, out_count)),
  (moved_accounts %>% group_by(moved_server) %>% summarize(in_count = sum(count)) %>% select(moved_server, in_count) %>% rename(server=moved_server)),
  by="server"
) %>% replace_na(list(out_count = 0, in_count = 0)) %>%
  mutate(diff = in_count - out_count) %>%
  arrange(diff) %>%
  left_join(., popular_servers, by="server") %>%
  rename(user_count = count) %>% arrange(desc(user_count))
server_movement_data %>%
  ggplot(aes(x=user_count, y=diff)) +
  geom_point() + scale_x_log10() + theme_bw_small_labels()
```

If there were no relationship between server size and migration, we would expect these moves to be random with respect to server size.

```{r}
popular_servers <-
  arrow::read_feather("data/scratch/popular_servers.feather")
moved_accounts <-
  arrow::read_feather("data/scratch/moved_accounts.feather") %>%
  # Remove loops
  filter(server != moved_server)
activity <-
  arrow::read_feather("data/scratch/activity.feather",
                      col_select = c("server", "logins")) %>%
  arrange(desc(logins))
popular_and_large_servers <-
  popular_servers %>% filter(count >= 1) %>%
  mutate(count = log10(count))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
ma <- moved_accounts %>%
  filter(server %in% popular_and_large_servers$server) %>%
  filter(moved_server %in% popular_and_large_servers$server)
# Construct network
edgeNet <- network(ma, matrix.type = "edgelist")
edgeNet %v% "user_count" <-
  left_join((as_tibble(edgeNet %v% 'vertex.names') %>% rename(server = value)),
            popular_and_large_servers,
            by = "server") %>%
  select(count) %>%
  unlist()
edgeNet %v% "in_jm" <-
  as_tibble(edgeNet %v% 'vertex.names') %>%
  mutate(in_jm = value %in% jm$domain) %>%
  select(in_jm) %>% unlist()
```

We construct an exponential family random graph model (ERGM) in which nodes represent servers and weighted, directed edges represent the number of accounts that moved between servers. Edge counts are modeled as a function of the difference in (log) user counts between the origin and destination servers, the combined (log) user counts of both servers, and whether both or neither server is listed in the JoinMastodon directory:

$$
\begin{aligned}
\text{Sum}_{i,j} = \; & \beta_0 + \\
& \beta_1 \left(\log_{10}(\text{user count}_j) - \log_{10}(\text{user count}_i)\right) + \\
& \beta_2 \left(\log_{10}(\text{user count}_i) + \log_{10}(\text{user count}_j)\right) + \\
& \beta_3 \, \mathrm{1}\{\text{both } i \text{ and } j \text{ in JoinMastodon}\} + \\
& \beta_4 \, \mathrm{1}\{\text{neither } i \text{ nor } j \text{ in JoinMastodon}\}
\end{aligned}
$$

```{r}
#| label: ergm-model
#| cache: true
m1 <-
  ergm(
    edgeNet ~ sum +
      diff("user_count", pow = 1, form = "sum") +
      nodecov("user_count", form = "sum") +
      nodematch("in_jm", diff = TRUE, form = "sum"),
    response = "count",
    reference = ~ Binomial(3),
    control=control.ergm(parallel=4, parallel.type="PSOCK")
  )

save(m1, file = "data/scratch/ergm-model.rda")
```

```{r}
#| label: tag-ergm-result
#| output: asis
ergm_model <- load("data/scratch/ergm-model.rda")

modelsummary(
  m1,
  escape = FALSE,
  coef_rename = c(
    "sum" = "$\\beta_0$ Intercept",
    "diff.sum.t-h.user_count" = "$\\beta_1$ User Count Difference",
    "nodecov.sum.user_count" = "$\\beta_2$ User Count (Node Covariate)",
    "nodematch.sum.in_jm.TRUE" = "$\\beta_3$ In JoinMastodon (Both True)",
    "nodematch.sum.in_jm.FALSE" = "$\\beta_4$ In JoinMastodon (Both False)"
  ),
)
```

We find a strong preference for accounts to move from large servers to smaller servers.

```{python}
#| eval: false
#| include: false
import random

def simulate_account_moves(origin: str, servers: dict, n: int):
    server_list = list(set(servers.keys()) - {origin})
    weights = [servers[x] for x in server_list]
    return pl.DataFrame({
        "simulation": list(range(n)),
        "server": [origin] * n,
        "moved_server": random.choices(server_list, weights=weights, k=n)
    })

simulations = pl.concat([simulate_account_moves(row["server"], {x["server"]: x["count"] for x in popular_servers.iter_rows(named=True)}, 1000) for row in maccounts.iter_rows(named=True)])
m_counts = maccounts.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
sim_counts = simulations.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
```

## Tag Clusters

We found _number_ posts which contained between two and five tags.
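
A minimal sketch of this filtering step, assuming a hypothetical posts table with a list-typed `tags` column (the path and schema below are placeholders, not the actual dataset):

```python
# Sketch: keep posts with between two and five hashtags (inclusive).
import polars as pl

posts = pl.scan_ipc("data/scratch/posts.feather")  # hypothetical path
tagged = (
    posts
    .with_columns(pl.col("tags").list.len().alias("n_tags"))
    .filter(pl.col("n_tags").is_between(2, 5))
    .collect()
)
number_of_tagged_posts = len(tagged)
```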

# References {#references}