---
title: Onboarding The Fediverse (working title)
short-title: Onboarding Fediverse
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    #roles: writing
    corresponding: true
bibliography: references.bib
acm-metadata:
  final: false
  copyright-year: 2024
  acm-year: 2024
  copyright: rightsretained
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "PACMHCI"
  #conference-name: |
  #  Make sure to enter the correct
  #  conference title from your rights confirmation email
  #conference-date: June 03--05, 2018
  #conference-location: Woodstock, NY
  #price: "15.00"
  #isbn: 978-1-4503-XXXX-X/18/06
format:
  acm-pdf:
    documentclass: acmart
    classoption: [acmsmall,manuscript,screen,authorversion,nonacm,timestamp]
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from thousands of new Fediverse accounts, we show that this choice matters and significantly affects the probability that the account remains active in the future. We then use insights from this relationship to build a tool that can help new Fediverse users find a server with a high probability of being a good match based on their interests.
execute:
  echo: false
  error: false
  freeze: auto
---

```{r}
#| label: r-setup
#| output: false

library(reticulate)
library(tidyverse)
library(arrow)
library(statnet)
library(network)
library(survival)
library(ggsurvfit)
library(modelsummary)

options(arrow.skip_nul = TRUE)
```

We first explore the extent to which server choice matters. We find that accounts that join smaller, more interest-based servers are more likely to continue posting six months after their creation. Using these findings, we then propose a tool that can help users find servers that match their interests.

# Background

## Newcomers in Online Communities

Onboarding newcomers is a critical task for online communities. Any community can expect a certain amount of turnover, so the ability to bring in and retain new members is important for its long-term health and longevity.

RQ: What server attributes correspond with better newcomer retention?

## Migrations in Online Communities

All online communities and accounts trend toward death.

+ Fiesler on online fandom communities [@fieslerMovingLandsOnline2020]
+ TeBlunthuis on competition and mutualism [@teblunthuisIdentifyingCompetitionMutualism2022]
+ Work on "alt-tech" communities.

# Empirical Setting

The Fediverse is a set of decentralized online social networks which interoperate using shared protocols like ActivityPub. Mastodon is a software program used by many Fediverse servers and offers a user experience similar to the TweetDeck client for Twitter.

Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines which only show posts in reverse chronological order. Search is also difficult. Public hashtags are searchable, but most servers have traditionally not supported searching keywords or simple strings. Accounts can only be searched using their full `username@server` form.

Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; however, on smaller servers, it presents the opportunity to discover new posts and users of potential interest.
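For a concrete sense of how this server-level feed is exposed, the sketch below fetches one page of a server's local timeline through the public Mastodon API. It is illustrative only and not part of our analysis pipeline; the `httr2` package, the helper name, and the example server are assumptions made for this example.

```{r}
#| eval: false
# Illustrative sketch only (not part of the analysis pipeline): fetch one
# page of a server's local timeline via the public Mastodon API.
# Assumes the httr2 package; the helper name and example server are
# placeholders for illustration.
library(httr2)

get_local_timeline <- function(server, limit = 40) {
  request(paste0("https://", server, "/api/v1/timelines/public")) |>
    req_url_query(local = "true", limit = limit) |>
    req_perform() |>
    resp_body_json()
}

# Each returned status embeds its author's account record, so on a small
# server this single feed surfaces much of the community's public activity.
# statuses <- get_local_timeline("mastodon.social")
```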
Mastodon offers its users a high level of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move to the new account). Consequently, the choice of an initial instance is not irreversible.

# Data

```{python}
#| label: py-preprocess-data
#| cache: true
#| output: false

from code.load_accounts import *

accounts = read_accounts_file("data/accounts.feather")

# Write a parsed accounts file for R to use
accounts.with_columns(
    pl.col("data").struct.field("moved").is_not_null().alias("has_moved")
).drop(
    ["data", "data_string"]
).write_ipc("data/scratch/accounts.feather")

# Accounts that have moved carry a non-null `moved` record; keep those with a
# full username@server handle and extract the destination server
moved_accounts = accounts.with_columns(
    pl.col("data").struct.field("moved")
).drop_nulls("moved").with_columns(
    pl.col("moved").struct.field("acct").alias("moved_acct"),
).filter(
    pl.col("moved_acct").str.contains('@')
).with_columns(
    pl.col("moved_acct").str.split('@').list.get(1).alias("moved_server")
)

number_of_accounts = len(accounts)
popular_servers = accounts.group_by("server").count().sort("count", descending=True)

common_moves = moved_accounts.group_by(
    ["server", "moved_server"]
).count().sort("count", descending=True)
common_moves.write_ipc("data/scratch/moved_accounts.feather")
common_moves.rename({
    "server": "Source",
    "moved_server": "Target",
}).write_csv("data/scratch/moved_accounts.csv")

maccounts = moved_accounts.select(["account", "server", "moved_server"])
popular_servers.write_ipc("data/scratch/popular_servers.feather")

jm = pl.read_json("data/joinmastodon.json")
jm.write_ipc("data/scratch/joinmastodon.feather")

read_metadata_file("data/metadata_2023-10-01.feather").drop(
    ["data", "data_string"]
).write_ipc("data/scratch/metadata.feather")
```

```{r}
#| label: r-load-accounts
#| output: false

accounts <- arrow::read_feather(
    "data/scratch/accounts.feather",
    col_select=c("server", "username", "created_at", "last_status_at", "statuses_count", "has_moved", "bot")
  ) %>%
  filter(!has_moved) %>%
  filter(!bot) %>%
  # Exclude accounts with no recorded last status
  filter(!is.na(last_status_at)) %>%
  # sanity check
  filter(created_at >= "2022-01-01") %>%
  filter(created_at < "2023-03-01") %>%
  # We don't want accounts that were created and then immediately stopped being active
  filter(statuses_count >= 5) %>%
  filter(last_status_at >= created_at) %>%
  mutate(active = last_status_at >= "2023-09-01") %>%
  # Censor still-active accounts at 2023-09-01
  mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2023-09-01 00:00:00", tz = "UTC"), last_status_at)) %>%
  mutate(active_time = difftime(last_status_at, created_at, units="secs"))
```

**Mastodon Profiles**: We identified accounts from posts previously collected from public Mastodon timelines between October 2020 and February 2023. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved. This gave us a total of `r nrow(accounts)` accounts.
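The sketch below illustrates the kind of per-account query behind this refresh step. It is a simplified stand-in for our actual collection code: it assumes the `httr2` package and the public `/api/v1/accounts/lookup` endpoint, and the helper name is hypothetical.

```{r}
#| eval: false
# Simplified stand-in for the account refresh step (not the actual collection
# code). Assumes httr2 and the public Mastodon lookup endpoint; the helper
# name is hypothetical. Some fields (e.g., last_status_at) may be missing for
# inactive accounts.
library(httr2)

refresh_account <- function(server, acct) {
  record <- request(paste0("https://", server, "/api/v1/accounts/lookup")) |>
    req_url_query(acct = acct) |>
    req_perform() |>
    resp_body_json()
  tibble::tibble(
    server = server,
    username = record$username,
    created_at = record$created_at,
    last_status_at = record$last_status_at,
    statuses_count = record$statuses_count,
    bot = isTRUE(record$bot),
    # A non-null `moved` attribute marks an account that has migrated
    has_moved = !is.null(record$moved)
  )
}
```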
```{r}
adv_server_counts <- arrow::read_feather(
    "data/scratch/accounts.feather",
    col_select=c("server", "username", "created_at", "bot")
  ) %>%
  filter(!bot) %>%
  filter(created_at > "2017-01-01") %>%
  filter(created_at <= "2023-01-01") %>%
  group_by(server) %>%
  arrange(created_at) %>%
  mutate(r = row_number()) %>%
  arrange(desc(r)) %>%
  distinct(server, created_at, .keep_all=TRUE) %>%
  select(server, created_at, r) %>%
  ungroup() %>%
  mutate(server_date = as.Date(created_at))
```

```{r}
adv_server_counts %>%
  filter(server == "mastodon.social") %>%
  ggplot(aes(x=server_date, y=r)) +
  geom_line() +
  theme_minimal()
```

```{r}
#| label: fig-account-activity-prop
#| fig-cap: "Account Activity Over Time"
#| fig-height: 4

server_counts <- arrow::read_feather(
    "data/scratch/accounts.feather",
    col_select=c("server", "username", "created_at", "bot")
  ) %>%
  filter(created_at <= "2023-01-01") %>%
  group_by(server) %>%
  summarize(server_count = n()) %>%
  arrange(desc(server_count)) %>%
  mutate(server_count_bin = floor(log10(server_count)))

metadata <- arrow::read_feather(
    "data/scratch/metadata.feather",
    col_select=c("server", "user_count")
  ) %>%
  arrange(desc(user_count)) %>%
  mutate(server_count = user_count) %>%
  mutate(server_count_bin = floor(log10(server_count)))

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")

a <- accounts %>%
  inner_join(metadata, by="server") %>%
  mutate(metadata = server_count > 500) %>%
  mutate(active_time = as.integer(active_time)) %>%
  mutate(active_time_weeks = active_time / 60 / 60 / 24 / 7) %>%
  mutate(status = ifelse(active, 0, 1)) %>%
  mutate(jm = server %in% jm$domain) # %>% filter(jm)

survfit2(Surv(active_time_weeks, status) ~ server_count_bin, data = a) %>%
  ggsurvfit() +
  add_confidence_interval() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    y = "Overall survival probability",
    x = "Time (weeks)",
    colour = "Server Size (log10)",
    fill = "Server Size (log10)",
  ) +
  add_risktable() +
  scale_x_continuous(
    breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 52),
    labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 52)
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

To determine the relationship between server size and user retention, we estimate Kaplan-Meier survival curves for continued posting activity, binning each account's home server by the base-10 logarithm of its user count (@fig-account-activity-prop).

## Moved Accounts

```{r}
#| label: fig-moved-accounts
#| fig-height: 4

moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")

server_movement_data <- left_join(
    (moved_accounts %>%
       group_by(server) %>%
       summarize(out_count = sum(count)) %>%
       select(server, out_count)),
    (moved_accounts %>%
       group_by(moved_server) %>%
       summarize(in_count = sum(count)) %>%
       select(moved_server, in_count) %>%
       rename(server=moved_server)),
    by="server"
  ) %>%
  replace_na(list(out_count = 0, in_count = 0)) %>%
  mutate(diff = in_count - out_count) %>%
  arrange(diff) %>%
  left_join(., popular_servers, by="server") %>%
  rename(user_count = count) %>%
  arrange(desc(user_count))

server_movement_data %>%
  ggplot(aes(x=user_count, y=diff)) +
  geom_point() +
  scale_x_log10() +
  theme_minimal()
```

@fig-moved-accounts plots each server's net change from account moves (accounts moving in minus accounts moving out) against its size. If there were no relationship, we would expect these moves to be random with respect to server size.
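The sketch below makes that null expectation concrete. It is illustrative rather than the reported analysis: each observed move keeps its origin server but is reassigned a destination drawn with probability proportional to server size, so the resulting destination-size distribution can be compared against the observed one.

```{r}
#| eval: false
# Illustrative null model (a sketch, not the reported analysis): redraw each
# observed move's destination with probability proportional to server size.
set.seed(1)

null_moves <- moved_accounts %>%
  tidyr::uncount(count) %>%  # one row per moved account
  mutate(null_target = sample(
    popular_servers$server,
    size = n(),
    replace = TRUE,
    prob = popular_servers$count
  ))

# Systematic differences between the sizes of observed destinations
# (moved_server) and simulated ones (null_target) would indicate that moves
# are not random with respect to server size.
```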
To model which servers gain and lose accounts through these moves, we construct a directed server-to-server migration network, with edges weighted by the number of accounts that moved between each pair of servers, and fit a valued exponential random graph model (ERGM).

```{r}
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
activity <- arrow::read_feather("data/scratch/activity.feather", col_select=c("server", "logins")) %>%
  arrange(desc(logins))

popular_and_large_servers <- popular_servers %>%
  filter(count >= 1) %>%
  mutate(count = log10(count))

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")

ma <- moved_accounts %>%
  filter(server %in% popular_and_large_servers$server) %>%
  filter(moved_server %in% popular_and_large_servers$server)

edgeNet <- network(ma, matrix.type="edgelist")
edgeNet %v% "user_count" <- left_join(
    (as_tibble(edgeNet %v% 'vertex.names') %>% rename(server=value)),
    popular_and_large_servers,
    by="server"
  ) %>%
  select(count) %>%
  unlist()
edgeNet %v% "in_jm" <- as_tibble(edgeNet %v% 'vertex.names') %>%
  mutate(in_jm = value %in% jm$domain) %>%
  select(in_jm) %>%
  unlist()
```

```{r}
#| label: ergm-model
#| cache: true

m1 <- ergm(
  edgeNet ~ sum +
    diff("user_count", pow=1, form="sum") +
    nodecov("user_count", form="sum") +
    nodematch("in_jm", diff=TRUE, form="sum"),
  response="count",
  reference=~Binomial(3)
)
save(m1, file="data/scratch/ergm-model.rda")
```

```{r}
#| label: tag-ergm-result
#| output: asis

load("data/scratch/ergm-model.rda")
modelsummary(
  m1,
  coef_rename = c(
    "sum" = "Intercept",
    "diff.sum.t-h.user_count " = "User Count Difference",
    "nodecov.sum.user_count " = "User Count (Node Covariate)",
    "nodematch.sum.in_jm.TRUE" = "In JoinMastodon (Both True)",
    "nodematch.sum.in_jm.FALSE" = "In JoinMastodon (Both False)"
  ),
)
```

```{python}
#| eval: false
#| include: false

import random

def simulate_account_moves(origin: str, servers: dict, n: int):
    # Draw n hypothetical destinations for an account leaving `origin`,
    # with probability proportional to server size
    server_list = list(set(servers.keys()) - {origin})
    weights = [servers[x] for x in server_list]
    return pl.DataFrame({
        "simulation": list(range(n)),
        "server": [origin] * n,
        "moved_server": random.choices(server_list, weights=weights, k=n)
    })

simulations = pl.concat([
    simulate_account_moves(
        row["server"],
        {x["server"]: x["count"] for x in popular_servers.iter_rows(named=True)},
        1000
    )
    for row in maccounts.iter_rows(named=True)
])

m_counts = maccounts.join(
    popular_servers, how="inner", on="server"
).rename({"count": "origin_count"}).join(
    popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server"
).rename({"count": "target_count"})

sim_counts = simulations.join(
    popular_servers, how="inner", on="server"
).rename({"count": "origin_count"}).join(
    popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server"
).rename({"count": "target_count"})
```

## Tag Clusters

We found _number_ posts that contained between two and five tags.
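A minimal sketch of that filter is below; the `statuses` table and its `tags` list-column are hypothetical stand-ins for however the post data is actually stored.

```{r}
#| eval: false
# Hypothetical sketch of the tag filter: `statuses` and its `tags`
# list-column are assumed structures, not the project's actual tables.
tagged_posts <- statuses %>%
  mutate(n_tags = lengths(tags)) %>%
  filter(n_tags >= 2, n_tags <= 5)
```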