Why Reddit’s decision to cut off researchers is bad for its business

By Frances Hunter On Jan 22, 2024

Last April, Reddit CEO Steve Huffman made a strategic error that, at the time, made perfect Silicon Valley sense. For years, large firms had been using freely available public data from Reddit to train their large language models. With the explosion of generative AI tools entering the market and a long-awaited IPO in the works, Huffman saw an opportunity to finally cash in on this untapped potential resource by introducing new paywalls for accessing Reddit data. But he made this decision seemingly without consideration for how some of Reddit’s most valuable community members—including volunteer moderators and independent researchers—used these data tools every day (and without any backup option in place for them). After attempts at polite negotiation on the new policy reached a standstill, these communities organized widespread boycotts and public campaigns, angered by the company’s apparent disregard for their work—work that had enabled Reddit to scale more quickly than its competitors and cemented its reputation as an innovator in the digital-media space.

Reddit promised that it would maintain a free-tier API, but researchers and moderators repeatedly stressed that the new access was too limited and would cut off many tools, projects, and archives that their work depended on. Finally, Reddit’s olive branch came in the form of an online application for moderators and researchers to request increased access to the API and (for moderators only) access to Reddit archives that the updates took offline. The mistake seemed to have taught Huffman that ignoring the needs of communities that the platform had long relied on posed a greater risk than benefit to the profitability of the company in the long run. Yet six months later, many of us working in public interest research fields have heard nothing back from Reddit in response to our applications, and key archives of historical data remain inaccessible to researchers.

Reddit data has long powered public interest research across a variety of fields, including computing, medicine, and the social sciences. In the field of mental health, Reddit data has enabled researchers to develop innovative methods for detecting people who may need help, informed by an evidence-based understanding of why people may not seek help when they need it. Reddit data has supported groundbreaking research on substance use, which led to the development of tools to help quickly detect adverse drug reactions and added weight to the growing body of research highlighting the importance of social support in recovering from addiction. Within social and computer science, researchers have used Reddit data to develop tools for detecting fake news and to understand pathways to extremism and the adoption of conspiracy theories.

Independent research has also benefited Reddit itself, making the platform safer and more sustainable. For example, after academic research identified a way to reduce harassment and increase newcomer participation, Reddit moderators quickly adopted this intervention. Research has also played a key role in helping Reddit evaluate its existing policies: When a study found that Reddit’s ban on discriminatory communities significantly reduced hate speech, Reddit ramped up its site-wide enforcement of policies prohibiting harassment and hateful speech on the platform. Researchers have also measured the value of Reddit’s volunteer moderation system, finding that, at a bare minimum, volunteer moderation saves Reddit millions of dollars for services that cost other major platforms hundreds of millions of dollars annually.

In 2024, more than 2 billion people will vote in elections around the world, and a much smaller subset of those people will decide whether to buy stock in Reddit when the 19-year-old company finally goes live as a publicly traded company. Independent research provides clear value to both the public and potential investors, which is why policymakers and shareholders alike have pressured leaders of larger technology companies like Google and Facebook to embrace transparency and, in particular, to share their data with researchers. Reddit users, who highly value privacy, should have a say in this, too. The company has not yet made it clear under what conditions and consent models user data will be shared, opening up the company to financial and reputational risk.

Though Reddit’s leadership claims to be “leaning into its humanity,” from our vantage point, the company seems more committed to leaving humanity in the dark. At Cornell’s Citizens and Technology Lab, our request went unanswered for months until we were able to leverage a personal connection at Reddit. Members of the Coalition for Independent Tech Research found themselves in an infinite application loop when they tried to make a request, and still more researchers have posted directly to Reddit to confront the company about its data-access policies.

Beyond Reddit’s own API, advanced research tools that rely on access to Reddit data have also been impacted. For example, NodeXL, a powerful data analysis tool cited in over 2,200 academic studies, now has only very limited access to Reddit data. And Pushshift, the largest archive of Reddit data, dating back to the platform’s founding, is no longer available to researchers. At present, researchers requiring archival data (which amounts to a significant portion of Reddit research) are relying on torrents of Reddit data; much like torrented movies, there are no assurances as to the quality or the legality of what is contained within them.

By controlling access to its once-open data, Reddit has put itself in a powerful role as the gatekeeper of information about its platform. How it leverages this role will prove critical to its success. By partnering with the research community to develop a data-use policy that guarantees the ethical use of Reddit data, Reddit could give back to the public by enabling research that supports our physical and mental health and, during a global election year, helps election officials become aware of harmful rumors and detect foreign influence.

Internally, as the company inches towards an IPO, these partnerships could help it make decisions that would improve the platform and make it an appealing venture for investors. Ensuring researchers can access Reddit data is good for business. Reddit takes pride in being a company that does things differently than the other tech giants. In a moment when public interest access to data is becoming collateral damage in the battles over generative AI, Reddit should be the company leading on an ethical way forward, not the one clamping down.

Sarah Gilbert is the research director at Cornell University’s Citizens and Technology Lab, which works with online communities to study the effects of technology on the public interest. Brandi Geurkink is the executive director of the Coalition for Independent Tech Research, a nonprofit that seeks to advance, defend, and sustain the right to ethically study the impacts of technology on society.
