---
title: "Do Servers Matter on Mastodon?"
subtitle: "Data-driven Design for Decentralized Social Media"
author: "Carl Colglazier"
institute:
  - "Community Data Science Collective"
  - "Northwestern University"
date: "2024-03-14"
bibliography: ../references.bib
title-slide-attributes:
  data-background: "#4c3854"
format:
  revealjs:
    #embed-resources: true
    width: 1600
    height: 900
    date-format: long
    margin: 0.2
    center-title-slide: false
    #disable-layout: true
    theme: [default, presentation.scss]
    slide-number: true
    keep-md: true
    pdf-max-pages-per-slide: 1
    reference-location: document
    template-partials:
      - title-slide.html
knitr:
  opts_chunk:
    dev: "svg" #"ragg_png"
    retina: 1
    dpi: 300
execute:
  freeze: auto
  cache: true
  echo: false
# fig-width: 5
# fig-height: 6
prefer-html: true
---

## Goals for Today

::: {.big}

- Contextualize work on decentralized online social networks like Mastodon
- Present a data-driven analysis of server choice on Mastodon
- Introduce a recommendation system for server choice
- Discuss directions for future work

:::

# The Big Picture {.center}

What is decentralized social media and why does it matter?

![Figure from @baranDistributedCommunicationsNetworks1964](images/network_types.png)

## Emergence of the Social Web

:::::: {.spread}
Internet technologies are _sociotechnical_ systems. The social internet as we know it today emerged both from the development of **protocols** and systems [@abbateInventingInternet2000] and from thousands of largely non-commercial **social communities** [@driscollModemWorldPrehistory2022].

:::: {.columns}

::: {.column width=33%}

:::: {.fragment fragment-index=1}
#### Era
::::

:::: {.fragment fragment-index=2}
ARPANET
::::

:::: {.fragment fragment-index=3}
Early Internet
::::

:::: {.fragment fragment-index=4}
Commercial Web
::::

:::

::: {.column width=33%}

:::: {.fragment fragment-index=1}
#### Spaces
::::

:::: {.fragment fragment-index=2}
Email, Usenet
::::

:::: {.fragment fragment-index=3}
BBS, IRC
::::

:::: {.fragment fragment-index=4}
Social media
::::

:::

::: {.column width=33%}

:::: {.fragment fragment-index=1}
#### Technologies
::::

:::: {.fragment fragment-index=2}
TCP/IP
::::

:::: {.fragment fragment-index=3}
HTML
::::

:::: {.fragment fragment-index=4}
APIs, AJAX
::::

:::

:::

::::::

## Current Trends

::: {}

+ High **distrust** of social media companies [@AmericansWidelyDistrust2021]
+ Challenges in performing content moderation and maintaining social communities at **scale** [@gillespieContentModerationAI2020]
+ Post-API Era: **closure** of APIs on major platforms to researchers and tinkerers [@freelonComputationalResearchPostAPI2018]

:::

## Protocol-based Social Media

::::: {.spread}
The commercial internet has trended toward centralization, but this may be neither desirable nor sustainable [@masnickProtocolsNotPlatforms].

::: {.columns}

::: {.column}
#### Platforms
We have accounts on the same website
:::

::: {.column}
#### Protocols
We use the same protocol
:::

:::

::: {.columns}

::: {.column}
The (single) website controls:

- My data
- Content moderation
- Monetization

:::

::: {.column}
I can choose who controls:

- My data
- Content moderation (local)
- Monetization (if any)

:::

:::

:::::

## Empirical Context

::: {.columns}

:::: {.column}

- **The Fediverse**: A set of decentralized online social networks which interoperate using shared protocols like ActivityPub.
- **Mastodon**: An open-source, decentralized social network and microblogging community.

::::

:::: {.column}
![A screenshot of Mastodon 2.9 (2019), from the Mastodon Blog.](images/Mastodon_Single-column-layout.png)
::::

:::

# The Fediverse is a network of _thousands_ of interconnected servers {background-color="black" data-background-image="images/mastodon_map.png" background-repeat="repeat" background-size="200px" background-opacity="0.5" .center auto-animate=true .fade-out}

::: {.footer}
Background image: Jaz-Michael King
:::

## A Timeline of Mastodon

```{mermaid}
timeline
    title Mastodon and Fediverse Timeline
    2008: OStatus Protocol
    2016: Mastodon releases v0.1
    2018: ActivityPub standard published
    2019: Mastodon drops OStatus
    2022: Elon Musk Twitter acquisition
        : Truth Social launches using Mastodon code
    2023: Mastodon reaches 2M active users
        : Threads (Meta) begins experimental support for ActivityPub
```

## Avoiding the "Twitter Killer" Hype

@zulliRethinkingSocialSocial2020 describe this pattern:

1. A writer discovers an alternative technology system
2. Media hypes it as a "killer" of a major platform
3. The system does not in fact "kill" the major platform
4. The system is declared a failure

This has happened multiple times already.

## How do we define success for systems like Mastodon?

:::: {.columns}

::: {.column}
We should instead take social communities on their own terms.

**Do people find value in the system?**

In my view, the most interesting thing about Mastodon is the "local timeline", which shows posts from your server.
:::

::: {.column}

> "One of the things the Internet was good for was gathering together people in different places who shared a common interest" --Michael Lewis, _Moneyball_ (2003)

:::

:::

## Which server should I join?

### Conflicting advice

::: {.columns}

::: {.column}
Just join any server!
:::

::: {.column}
Join the _right_ server!
:::

:::

::: {.fragment}
### Which is right? {.center-xy}
:::

## There are a lot of options {autoslide=2500 .fade-in}

```{r}
#| results: asis
#| cache: true
library(here)
library(tidyverse)
library(jsonlite)

# Load the server directory scraped from Join Mastodon
jm <- here("data/joinmastodon.json") %>%
  jsonlite::fromJSON() %>%
  as_tibble

dir_name <- "images/server_images/"
if (!dir.exists(dir_name)) {
  dir.create(dir_name, recursive = TRUE)
}

# save all the server images locally if they are not already saved
# location "images/server_images/{domain}.png"
save_image <- function(domain, proxied_thumbnail) {
  file_path <- paste0(dir_name, domain, ".png")
  tryCatch({
    if (!file.exists(file_path)) { # Only download files we do not already have
      download.file(proxied_thumbnail, file_path, mode = "wb")
    }
    return(file_path)
  }, error = function(e) {
    # A failed download yields a missing path instead of stopping the render
    return(NA_character_)
  })
}

server_images <- jm %>%
  filter(!is.na(blurhash)) %>%
  select(domain, proxied_thumbnail) %>%
  rowwise() %>%
  mutate(image = save_image(domain, proxied_thumbnail)) %>%
  ungroup()
```

```{r}
#| results: asis
web_image <- function(url) {
  random_number <- as.integer(5*runif(1, 0, 1))
  paste0('')
}

server_images %>%
  select(image) %>%
  mutate(thumb = map(image, web_image)) %>%
  head(125) %>%
  pull(thumb) %>%
  paste0(collapse = "\n") %>%
  cat()
```

# But does server choice matter? {.center}

## Mastodon grew significantly in 2022 and 2023

```{r}
#| label: fig-account-timeline
#| fig-width: 6
#| fig-height: 2.5
#| fig-cap: "Number of accounts created on Mastodon each week from late 2020 through 2023. The top of the graph shows the proportion of these accounts that moved or remained active after 91 days."
library(here)
source(here("code/helpers.R"))
account_timeline_plot()
```

# The Mastodon Onboarding Process Has Changed Over Time

![The "Join Mastodon" website as it currently appears.](images/joinmastodon-screenshot.png)

## The Flagship Instance

:::: {.columns}

::: {.column width=60%}

+ **Mastodon.social** was the first Mastodon instance and is the largest.
+ There have been some historical concerns that its size was an issue.
+ At certain times, it has **closed** registrations.
    1. An extended period through the end of October 2020.
    2. A temporary issue when the email host limited the server in mid-2022.
    3. Two periods in late 2022 and early 2023.

:::

::: {.column width=40%}
![A screenshot of mastodon.social as it appeared in 2020 with a message redirecting signups to mastodon.online or to "Join Mastodon"](images/mastodon-social-signups-2020-11-01.png)
:::

:::

## The Pull-Pull Effect: Did Closing Mastodon.social Affect Other Servers?

We can use an interrupted time series analysis to test this.

$$
\begin{aligned}
y_t &= \beta_0 + \beta_1 \text{open}_t + \beta_2 \text{day}_t + \beta_3 (\text{open} \times \text{day})_t \\
&\quad + \beta_4 \sin\left(\frac{2\pi t}{7}\right) + \beta_5 \cos\left(\frac{2\pi t}{7}\right) \\
&\quad + \beta_6 \sin\left(\frac{4\pi t}{7}\right) + \beta_7 \cos\left(\frac{4\pi t}{7}\right) \\
&\quad + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t
\end{aligned}
$$

where $y_t$ is the number of new accounts on a server at time $t$, $\text{open}_t$ is a binary variable indicating whether the server is open to new sign-ups, $\text{day}_t$ is an increasing integer representing the date, and $\epsilon_t$ is a white noise error term. The sine and cosine terms account for weekly seasonality.

## Mastodon.online used to be more influential

| Period     | Setting         | $p < 0.05$ |
|------------|:----------------|:-----------|
| 2020-2021  | mastodon.online | Yes        |
|            | JoinMastodon    | No         |
|            | Other           | No         |
| Mid 2022   | JoinMastodon    | No         |
|            | Other           | No         |
| Early 2023 | JoinMastodon    | No         |
|            | Other           | No         |

Results from ARIMA models for the number of new accounts on mastodon.online, on servers linked from joinmastodon.org, and on all other servers.

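## Sketch: Fitting the Interrupted Time Series Model

The interrupted time series specification can be illustrated on simulated data. This is a minimal sketch, not the code behind the reported results: the data, the variable names (`its_data`, `its_fit`, `y_lag1`, `y_lag2`), and the use of ordinary least squares with lagged outcomes in place of ARIMA fitting are all assumptions made for illustration.

```r
# Hypothetical example on simulated data; nothing here comes from the real dataset.
set.seed(42)
n <- 200
day <- seq_len(n)
open <- as.integer(day > 100)  # 0 = sign-ups closed, 1 = sign-ups open

# Simulated daily sign-ups: a level shift when registrations open,
# a weekly cycle, and autocorrelated noise.
noise <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
y <- 20 + 10 * open + 0.02 * day + 3 * sin(2 * pi * day / 7) + noise

its_data <- data.frame(
  y = y,
  open = open,
  day = day,
  sin1 = sin(2 * pi * day / 7), cos1 = cos(2 * pi * day / 7),  # weekly harmonic
  sin2 = sin(4 * pi * day / 7), cos2 = cos(4 * pi * day / 7),  # second harmonic
  y_lag1 = c(NA, head(y, -1)),      # y_{t-1}
  y_lag2 = c(NA, NA, head(y, -2))   # y_{t-2}
)

# open * day expands to open + day + open:day (beta_1, beta_2, beta_3).
# lm() drops the first two rows, where the lagged outcomes are missing.
its_fit <- lm(
  y ~ open * day + sin1 + cos1 + sin2 + cos2 + y_lag1 + y_lag2,
  data = its_data
)
coef(summary(its_fit))
```

The coefficient on `open` estimates the level change when registrations open, and `open:day` estimates the change in trend. The $p < 0.05$ column in the results table corresponds to a test of this kind on the intervention terms.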
## The current Mastodon onboarding process

:::: {.columns}

::: {.column width=60%}

+ While Mastodon once pushed newcomers _away_ from mastodon.social, it now treats it like the **default server**
+ Secondarily, newcomers are directed to "Join Mastodon"

:::

::: {.column width=40%}
![](images/mastodon_blog_onboarding.png)
:::

:::

## Accounts on the largest general servers are less likely to remain active after 91 days

::: {.columns}

::: {.column}

```{r, cache.extra = tools::md5sum("code/survival.R")}
#| cache: true
#| label: fig-survival
#| fig-env: figure
#| fig-cap: "Survival probabilities for accounts created during May 2023."
#| fig-width: 3.375
#| fig-height: 2.25
#| fig-pos: h!
library(here)
source(here("code/survival.R"))
plot_km
```

:::

::: {.column .small}

```{r}
#| label: tbl-coxme
library(ehahelper)
library(broom)

cxme_table <- tidy(cxme) %>%
  mutate(conf.low = exp(conf.low), conf.high=exp(conf.high)) %>%
  mutate(term = case_when(
    term == "factor(group)1" ~ "Join Mastodon",
    term == "factor(group)2" ~ "General Servers",
    term == "small_serverTRUE" ~ "Small Server",
    TRUE ~ term
  )) %>%
  mutate(exp.coef = paste("(", round(conf.low, 2), ", ", round(conf.high, 2), ")", sep="")) %>%
  select(term, estimate, exp.coef , p.value)

cxme_table %>% knitr::kable(digits = 3)
```

:::

:::

## Accounts that move between servers are more likely to move to smaller servers

::: {.small}

```{r}
#| label: tbl-ergm-table
#| echo: false
#| warning: false
#| message: false
#| error: false
library(here)
library(modelsummary)
library(kableExtra)
library(purrr)
library(stringr)

load(file = here("data/scratch/ergm-model-early.rda"))
load(file = here("data/scratch/ergm-model-late.rda"))

if (knitr::is_latex_output()) {
  format <- "latex_tabular"
} else {
  format <- "html"
}

x <- modelsummary(
  list("Coef." = model.early, "Std.Error" = model.early, "Coef."
       = model.late, "Std.Error" = model.late),
  estimate = c("{estimate}", "{stars}{std.error}", "{estimate}", "{stars}{std.error}"),
  statistic = NULL,
  gof_omit = ".*",
  coef_rename = c(
    "sum" = "Sum",
    "nonzero" = "Nonzero",
    "diff.sum0.h-t.accounts" = "Smaller server",
    "nodeocov.sum.accounts" = "Server size\n(outgoing)",
    "nodeifactor.sum.registrations.TRUE" = "Open registrations\n(incoming)",
    "nodematch.sum.language" = "Languages match"
  ),
  align="lrrrr",
  stars = c('*' = .05, '**' = 0.01, '***' = .001),
  output = format
) %>%
  add_header_above(c(" " = 1, "Model A" = 2, "Model B" = 2))
x
```

:::

# Our analysis suggests server choice _does_ matter {.center}

Can we build a system that helps people find servers?

# Recommendation System Concept

- Report the **hashtags** used by the most accounts on each server
- Build an $M \times N$ server-tag matrix
- Normalize with Okapi BM25 TF-IDF and L2 normalization

::: {.fragment}
Using this matrix, we can

- Calculate similarity between servers using tags
- Calculate similarity between tags using servers
- Recommend servers based on affinity toward certain tags

:::

## Example: Server Similarity

::: {#tbl-sim-servers}

```{r}
#| label: table-sim-servers
library(tidyverse)
library(arrow)
library(here)

sim_servers <- here("data/scratch/server_similarity.feather") %>%
  arrow::read_ipc_file()
server_of_interest <- "hci.social"

# Keep the strongest pairs that involve the server of interest,
# then report the other server in each pair.
server_table <- sim_servers %>%
  arrange(desc(Similarity)) %>%
  filter(Source == server_of_interest | Target == server_of_interest) %>%
  head(7) %>%
  pivot_longer(cols=c(Source, Target)) %>%
  filter(value != server_of_interest) %>%
  select(value, Similarity) %>%
  rename("Server" = "value")

if (knitr::is_latex_output()) {
  server_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
  server_table %>% knitr::kable(digits = 3)
}
```

Servers most similar to hci.social

:::

## Server Recs

# Future Work

- Evaluation of the recommendation system
- More specific analysis of account attributes
- Simulations for
  robustness

# References {#refs .scrollable}