Introduction
Computational social science sits at the intersection between the social sciences and data science. It seeks to develop quantitative and computational approaches to understand human behaviour, social interactions and societal change at scale. Drawing on disciplines such as demography, geography, sociology, economics and political science, computational social science studies how individual actions aggregate into collective patterns, and how social, economic, political and environmental contexts shape social outcomes.
What makes computational social science different from social science, however, is its explicit engagement with the ongoing digital and computational revolution characterised by advances in computing power, data storage, digital connectivity and algorithmic methods (Hilbert and López 2011). While the study of social behaviour and population processes has traditionally relied mostly on structured data sources such as censuses, surveys and administrative records, computational social science increasingly leverages novel data sources.
In particular, the digital revolution that started in the 1990s has resulted in a data revolution. Technological advances in computational power, storage and digital network platforms have enabled the production, processing, analysis and storage of large volumes of heterogeneous data. Analysing data for 1986-2007, Hilbert and López (2011) estimated that the world had already passed the point at which more data were being collected than could be physically stored. They estimated that the global general-purpose computing capacity grew at an annual rate of 58% between 1986 and 2007, exceeding that of global storage capacity (23%). We can now digitally capture and generate forms of data that previously could not easily be recorded, stored or analysed.
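To give a sense of what these annual rates imply, the short sketch below compounds them over the 1986-2007 period. It assumes, purely for illustration, that each rate applied as a constant compound rate in every year; the function name `growth_factor` is hypothetical and not taken from the cited study.

```python
def growth_factor(annual_rate, years):
    """Cumulative growth factor under a constant compound annual rate."""
    return (1 + annual_rate) ** years


# Number of years between 1986 and 2007
years = 2007 - 1986

# Annual growth rates reported by Hilbert and López (2011)
computing = growth_factor(0.58, years)  # general-purpose computing capacity
storage = growth_factor(0.23, years)    # storage capacity

# How many times faster computing capacity grew than storage capacity
ratio = computing / storage

print(f"Computing capacity grew about {computing:,.0f}-fold")
print(f"Storage capacity grew about {storage:,.0f}-fold")
print(f"Ratio of the two growth factors: about {ratio:,.0f}")
```

Under this simplifying assumption, a gap of 35 percentage points in annual growth compounds into a difference of several orders of magnitude over two decades.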
The unprecedented amount of information that we can now capture through digital technology offers unique opportunities to advance our understanding of micro social behaviour (e.g. individual-level decision making, preferences, interactions) and macro population processes (e.g. structural population processes and trends). New and traditional data sources can be combined to study human behaviour at unprecedented spatial and temporal resolution, often in near real time and at large scale.
We can capture and study micro individual behaviours such as time use, purchasing behaviour, communication patterns and mobility trajectories from a variety of data sources, including mobile phones, social media platforms, smart infrastructure and administrative systems. These behaviours can be aggregated to shed light on macro-level structural processes and trends, such as urban dynamics, labour markets, consumer demand, transport usage, population ageing and political participation. Fundamentally, computational approaches thus have the potential to become a key pillar informing and supporting decision making. They can inform business, public services and governments in addressing major societal issues, such as pandemics, climate change, inequality and migration, influencing policy, practice and governance structures.
Yet, the growing use of large-scale and computational data also poses major conceptual, methodological and ethical challenges (Rowe 2021). These challenges motivate this module. Many contemporary data sources and algorithms are not collected or designed for research purposes, and turning these into rigorous social science knowledge requires a unique combination of computational expertise and domain-specific understanding. Traditionally, university programmes have tended to separate technical training from substantive social science education. This module aims to fill this gap by offering an integrated training in computational methods and social science applications, encouraging students to critically analyse data, design computational studies and interpret results in their social context.
The name of this module, Computational Social Science, reflects the inclusive and interdisciplinary perspective we hope to capture. The data and computational revolutions have led to the emergence of a range of sub-disciplines, such as digital demography (Kashyap et al. 2022), computational sociology, network science (Lazer et al. 2009) and geographic data science (Singleton and Arribas-Bel 2019). Computational social science seeks to integrate these perspectives and provide a fertile framework for critique, collaboration and co-creation across these emerging areas of scholarship in the study of human behaviour and social systems.
Specifically, this chapter aims to discuss key opportunities and challenges with the use of large-scale digital data and algorithms to analyse social and population dynamics. We place a particular focus on the challenges relating to bias, privacy and ethics.
Data and algorithms in Computational Social Science
Contemporary computational social science draws on a wide range of data sources, including traditional structured data such as censuses, surveys and administrative records, as well as newer forms of large-scale digital data often referred to as “digital trace data”. These newer forms are distinctive in their volume, velocity, variety, exhaustiveness, resolution, relational nature and flexibility (Kitchin 2014). Traditionally, social science data were mostly numeric and highly structured. The expansion of digital technologies has facilitated the collection, storage and analysis of unstructured and semi-structured data, including text (e.g. social media posts), images (e.g. photographs and satellite imagery) and video (e.g. CCTV footage). As a result, computational social science increasingly integrates heterogeneous data types within a single analytical framework.
Multiple digital and administrative systems contribute to the generation and storage of contemporary social data. Kitchin (2014) identified three broad systems: directed, automated and volunteered systems. Directed systems comprise digital administrative systems operated by humans recording data on places or people, e.g. immigration control, biometric scanning and health records. Automated systems involve digital infrastructures that automatically and autonomously record and process data with little human intervention, e.g. mobile phone networks, electronic smartcard ticketing, energy smart meters and traffic sensors. Volunteered systems involve digital spaces in which individuals actively contribute data through interactions on online platforms (e.g. Twitter and Facebook) or through crowdsourcing initiatives (e.g. OpenStreetMap and Wikipedia).
While data constitute the empirical foundation of Computational Social Science, their analysis and interpretation depend on the use of statistical, computational and algorithmic methods. Large-scale and high-dimensional data cannot be analysed using only traditional tools. Instead, computational social science draws on a broad range of methods, including statistical modelling, machine learning, network analysis and spatial analysis, to identify patterns, make predictions and test hypotheses. Algorithms play a central role in transforming raw data into knowledge in social science. They are used to clean and preprocess data, detect structure in complex datasets, model relationships between individuals and groups, and simulate social processes. At the same time, algorithms are not “neutral”, as they are often based on assumptions, design choices and training data that may influence the patterns they reveal. Understanding how algorithms operate, and how they interact with data, is therefore a core competency in Computational Social Science.
Opportunities of data and algorithms
Large-scale digital data and computational algorithms offer unique opportunities for the analysis of human behaviour and population dynamics. As Rowe (2021) argues, contemporary digital trace data offer three key promises in relation to traditional data sources, such as surveys and censuses. They generally provide greater spatio-temporal granularity, wider coverage and timeliness. When combined with advances in statistical and machine learning methods, these data enable new forms of measurement, modelling and inference in the social sciences.
Large-scale digital data offer high geographic and temporal granularity. Most digital footprint data are time-stamped and geographically referenced with high precision. Digital technologies such as mobile phone networks and Global Positioning Systems (GPS) enable the generation of continuous streams of time-stamped location data. Such information thus provides an opportunity to trace and enhance our understanding of human populations over highly granular spatial scales and time intervals, going beyond the static representation afforded by most traditional data sources. Spatial interactions, mobility patterns and the ways in which people use and are influenced by their environment can be analysed in a temporally dynamic way.
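As a minimal illustration of how time-stamped, geographically referenced records can be turned into a temporally dynamic picture of a population, the sketch below counts unique devices per zone and hour. The records, the zone labels and the function name `unique_devices_by_zone_hour` are all made up for this example; real location data would first need to be cleaned and assigned to spatial units.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, made-up records: (device identifier, ISO timestamp, zone)
records = [
    ("a", "2024-01-01T08:05:00", "zone_1"),
    ("a", "2024-01-01T08:40:00", "zone_1"),
    ("b", "2024-01-01T08:15:00", "zone_1"),
    ("a", "2024-01-01T09:10:00", "zone_2"),
    ("c", "2024-01-01T09:30:00", "zone_2"),
]


def unique_devices_by_zone_hour(records):
    """Count distinct devices observed in each zone during each hour."""
    seen = defaultdict(set)
    for device_id, timestamp, zone in records:
        # Truncate the timestamp to the start of its hour
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        seen[(zone, hour)].add(device_id)
    return {key: len(devices) for key, devices in seen.items()}


counts = unique_devices_by_zone_hour(records)
for (zone, hour), n in sorted(counts.items()):
    print(zone, hour.isoformat(), n)
```

Counting distinct devices rather than raw records is a design choice: it prevents heavy users who generate many records from dominating the hourly totals.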
Large-scale digital data also provide extensive coverage. In contrast to traditional random sampling, many contemporary data sources capture universal or near-universal populations or geographical systems. Social media platforms, such as Instagram, generate data to capture the entire universe of Instagram users. Satellite technologies produce imagery snapshots that can be composited to represent the Earth. Electronic smartcard ticketing systems produce information to capture the population of users within transport networks. Because these data sources are typically collected consistently and at scale, they offer the potential to study human behaviour and social systems at regional, national and even global scales based on harmonised definitions, which is rarely possible using traditional data sources.
A further opportunity lies in the timeliness of contemporary data and algorithms. Unlike traditional systems of data collection and release, many digital data sources can be streamed continuously in real- or near real-time. Commercial transactions are recorded as bank card payments occur at retail outlets. Individual mobile phone locations are captured as applications interact with cellular antennas. When combined with automated algorithms for data processing and analysis, such information offers an opportunity to monitor and respond to rapidly evolving situations, such as the COVID-19 pandemic (Wang et al. 2022), natural disasters (Rowe 2022) and conflicts (Rowe, Neville, and González-Leonardo 2022).
The opportunities offered by data are inseparable from advances in algorithms and computational methods. Machine learning, artificial intelligence (AI), network analysis and spatial models enable researchers to extract structure from high-dimensional data, detect patterns that are not visible to traditional methods, and build predictive and explanatory models of social processes. Algorithms make it possible to integrate heterogeneous data sources, automate large-scale analyses and simulate complex social systems, extending the scope of questions that can be addressed in the social sciences.
We also argue that while large-scale digital data and computational methods should be seen as key assets to support government and business decision-making processes, they should not be considered at the expense of traditional data sources. Contemporary digital trace data and traditional data sources should be used to complement one another. As indicated earlier, many digital data sources are by-products of administrative processes or services and were not designed for research purposes. They require considerable data re-engineering work to be re-purposed into analysis-ready products (Arribas-Bel et al. 2021). As the saying goes, “all data are dirty, but some data are useful”. Our message is that traditional data, contemporary data and computational algorithms should be triangulated to leverage their respective strengths and mitigate their weaknesses.
Current challenges of Computational Social Science
Large-scale digital data and computational algorithms also impose key conceptual, methodological and ethical challenges. In this section, we provide a brief discussion of challenges in these areas, focusing particularly on issues relating to bias, privacy, ethics and methods. We focus on these issues because they are of practical importance and are likely to be of greatest interest to the readers of this book. Excellent discussions of these challenges can be found in Kitchin (2014), Cesare et al. (2018), Lazer et al. (2020), Rowe (2021) and (cabrera2025?).
Conceptual challenges
Conceptually, the emergence of large-scale digital data and computational methods has led to a rethinking and questioning of existing theoretical approaches in the social sciences (Franklin 2022). On the one hand, contemporary data and algorithms provide new opportunities to explore existing theories and hypotheses through different lenses and to test the consistency of long-standing beliefs. For example, economic theories discuss the existence of temporal and spatial equilibrium. Resulting hypotheses are generally tested through mathematical models or empirical analyses relying on temporally static data. The existence of equilibrium has therefore remained difficult to assess empirically. The availability of temporally dynamic data, combined with computational modelling, provides new opportunities to test temporal and spatial equilibrium using fine-grained longitudinal information, enabling the analysis of causal processes, rather than only focusing on static associations.
On the other hand, contemporary data and algorithms also raise fundamentally new conceptual questions. New data sources capture activities that were previously difficult to measure, such as personal communications, social networks, information seeking and fine-grained mobility patterns (Cabrera-Arnau et al. 2022). These data offer opportunities to expand existing theories by opening the “black box” of households, organisations and markets. They may also motivate entirely new research questions, such as the role of digital technologies in shaping social behaviour, the influence of algorithmic systems on decision making, and the impact of AI on productivity, labour markets and financial systems (cabrera2025a?).
At the same time, the increasing role of algorithms raises questions about explanation and interpretation. Many machine learning models prioritise predictive accuracy over interpretability. This challenges traditional notions of causal inference and theory testing in the social sciences. Understanding how to reconcile predictive modelling with explanatory aims is a central conceptual challenge for Computational Social Science.
Methodological challenges
Methodologically, the need for a wide and new set of computational skills to handle, store and analyse large volumes of data represents a major challenge. Many contemporary data sources are not created for research purposes and must be re-engineered for scientific analysis. Large data streams cannot be stored on local machines, cannot easily be processed as a single unit, and often require repeated processing over time. This requires storage capacity, computational infrastructure and computer science expertise.
The manipulation and storage of large-scale data often require technical expertise in data management systems such as SQL, Google Cloud Storage and Amazon S3, as well as in efficient computing frameworks for distributed and parallel processing. The analysis and modelling of contemporary data increasingly entail competencies in machine learning, AI, network analysis and simulation. While these competencies are commonly taught within computer science programmes, they are rarely integrated with substantive training in social science theory and applied problem solving.
An additional methodological challenge concerns biases in both data and algorithms. Many contemporary data sources represent specific segments of the population, but little is often known about which segments are over- or under-represented and how their representation varies across data sets and contexts. Biases reflect differences in access to digital technologies, patterns of use, or differences in the frequency and intensity of interaction with digital systems (cabrera2025?). These biases may be further amplified by algorithmic decisions embedded in platforms, such as content recommendation systems designed to maximise engagement.
Moreover, machine learning models trained on biased data may reproduce or amplify existing social inequalities. Considerable work has been devoted to measuring, understanding and mitigating such biases (Ribeiro, Benevenuto, and Zagheni 2020), but developing robust methods for bias correction and uncertainty quantification remains an open methodological challenge.
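One simple way to think about bias correction is to compare the composition of a digital platform's users with census counts and to derive a weight for each population group. The sketch below illustrates this idea with made-up numbers. It is not the procedure used in any of the cited studies; the group labels, counts and the function name `correction_factors` are hypothetical, and the approach assumes that platform users within a group behave like non-users in that group.

```python
def correction_factors(census_counts, platform_counts):
    """Ratio of census population to platform users for each group.

    A factor above 1 means the group is under-represented on the platform.
    """
    factors = {}
    for group, population in census_counts.items():
        users = platform_counts.get(group, 0)
        if users == 0:
            raise ValueError(f"No platform users in group {group!r}")
        factors[group] = population / users
    return factors


# Hypothetical census population and platform users by age group
census = {"18-29": 1000, "30-49": 1500, "50+": 2000}
platform = {"18-29": 800, "30-49": 600, "50+": 200}

factors = correction_factors(census, platform)

# Hypothetical number of platform users observed doing something of
# interest (for example, changing their place of residence)
observed = {"18-29": 80, "30-49": 30, "50+": 5}

# Weighted estimate for the whole population
weighted_total = sum(observed[g] * factors[g] for g in observed)

print(factors)
print(weighted_total)
```

In this toy example the oldest group receives the largest weight because it is the least represented on the platform, so the few observations it contributes carry considerable influence and uncertainty.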
Ethical challenges
Ethical issues represent a central challenge for Computational Social Science. Privacy is perhaps the most prominent concern. Many contemporary data sources contain highly sensitive personal information and require anonymisation and disclosure control. Individual records must be protected to prevent re-identification. Yet, the high degree of spatial and temporal granularity of these data increases the risk that individuals can be identified, even after anonymisation.
High-profile cases illustrate the ethical risks associated with the misuse of data and algorithms. For example, Cambridge Analytica used information from Facebook users to segment the population and target politically motivated content (Cadwalladr and Graham-Harrison 2018). More generally, algorithmic systems are increasingly used to influence information exposure, political participation and consumer behaviour. This raises important concerns about social manipulation, transparency and accountability.
Anonymising information, however, introduces a trade-off between accuracy and privacy (Petti and Flaxman 2020). The greater the degree of privacy protection, the lower the potential degree of accuracy and utility of the resulting data and vice versa. Identifying an appropriate balance between these objectives is a key ethical and methodological challenge in Computational Social Science. If handled incorrectly, privacy-preserving techniques may distort population-level patterns and introduce artificial structure to the data. The application of differential privacy to the US census provides a good recent example of this challenge. An emblematic case is New York’s Liberty Island, which has no resident population but for which official US census data reported 48 residents, a result of the statistical noise added to the data to enhance privacy.
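The trade-off between accuracy and privacy can be illustrated with the textbook Laplace mechanism, in which random noise with scale equal to sensitivity divided by epsilon is added to a count. The sketch below is an illustration under stated assumptions and is not the algorithm used for the US census; the function names, the privacy budgets (`epsilon`) and the number of simulated draws are all hypothetical choices.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity / epsilon to a count."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
true_count = 0           # an area with no residents

# Smaller epsilon means stronger privacy protection and more noise
errors = {}
for epsilon in (0.1, 1.0, 10.0):
    draws = [noisy_count(true_count, epsilon, rng) for _ in range(10_000)]
    errors[epsilon] = sum(abs(d - true_count) for d in draws) / len(draws)

for epsilon, error in errors.items():
    print(f"epsilon={epsilon}: mean absolute error of about {error:.2f}")
```

The mean absolute error shrinks as epsilon grows, which is the accuracy side of the trade-off. The raw noisy values can also be negative or fractional, so a true count of zero will rarely be reported as exactly zero unless further processing is applied.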
Beyond privacy, ethical challenges also concern fairness, accountability and governance. Algorithms may systematically disadvantage certain social groups, operate obscurely or be deployed without adequate oversight. Ensuring that data and algorithms are used responsibly, transparently and in the public interest remains a central challenge for Computational Social Science.
Conclusion
Large-scale digital trace data and computational algorithms present unique opportunities to enhance our understanding of human behaviours, social interactions and population processes, and to support individual, business and government decision-making. Businesses have used data-driven methods to segment their consumer populations and improve the targeting of marketing content, products and services, ultimately increasing sales and revenue (Dolega, Rowe, and Branagan 2021). Governments and health care institutions, particularly during the COVID-19 pandemic, have leveraged digital traces to monitor the spread of disease and develop appropriate mitigation responses (Green, Pollock, and Rowe 2021).
However, the use of contemporary data and algorithms poses major conceptual, methodological and ethical challenges that must be addressed to unleash their full potential. These challenges include biases in data and models, limits to inference and interpretability, risks to privacy and the responsible governance of algorithmic systems. The aim of this book is to address some of the key methodological challenges in Computational Social Science. In particular, the book provides applied training in the practical use of statistical, machine learning and AI approaches to analyse large-scale data and to advance our understanding of human behaviour and social and population processes.
Arribas-Bel, Dani, Mark Green, Francisco Rowe, and Alex Singleton. 2021.
“Open Data Products-A Framework for Creating Valuable Analysis Ready Data.” Journal of Geographical Systems 23 (4): 497–514.
https://doi.org/10.1007/s10109-021-00363-5.
Cabrera-Arnau, Carmen, Chen Zhong, Michael Batty, Ricardo Silva, and Soong Moon Kang. 2022.
“Inferring Urban Polycentricity from the Variability in Human Mobility Patterns.” arXiv.
https://doi.org/10.48550/ARXIV.2212.03973.
Cadwalladr, Carole, and Emma Graham-Harrison. 2018. “Revealed: 50 Million Facebook Profiles Harvested for Cambridge Analytica in Major Data Breach.” The Guardian 17 (1): 22.
Cesare, Nina, Hedwig Lee, Tyler McCormick, Emma Spiro, and Emilio Zagheni. 2018.
“Promises and Pitfalls of Using Digital Traces for Demographic Research.” Demography 55 (5): 1979–99.
https://doi.org/10.1007/s13524-018-0715-2.
Dolega, Les, Francisco Rowe, and Emma Branagan. 2021.
“Going Digital? The Impact of Social Media Marketing on Retail Website Traffic, Orders and Sales.” Journal of Retailing and Consumer Services 60 (May): 102501.
https://doi.org/10.1016/j.jretconser.2021.102501.
Franklin, Rachel. 2022.
“Quantitative Methods II: Big Theory.” Progress in Human Geography 47 (1): 178–86.
https://doi.org/10.1177/03091325221137334.
Green, Mark, Frances Darlington Pollock, and Francisco Rowe. 2021.
“New Forms of Data and New Forms of Opportunities to Monitor and Tackle a Pandemic.” In, 423–29. Springer International Publishing.
https://doi.org/10.1007/978-3-030-70179-6_56.
Hilbert, Martin, and Priscila López. 2011.
“The World’s Technological Capacity to Store, Communicate, and Compute Information.” Science 332 (6025): 60–65.
https://doi.org/10.1126/science.1200970.
Kashyap, Ridhi, R. Gordon Rinderknecht, Aliakbar Akbaritabar, Diego Alburez-Gutierrez, Sofia Gil-Clavel, André Grow, Jisu Kim, et al. 2022.
“Digital and Computational Demography.” http://dx.doi.org/10.31235/osf.io/7bvpt.
Kitchin, Rob. 2014.
“Big Data, New Epistemologies and Paradigm Shifts.” Big Data & Society 1 (1): 205395171452848.
https://doi.org/10.1177/2053951714528481.
Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, et al. 2009.
“Computational Social Science.” Science 323 (5915): 721–23.
https://doi.org/10.1126/science.1167742.
Lazer, David, Alex Pentland, Duncan J. Watts, Sinan Aral, Susan Athey, Noshir Contractor, Deen Freelon, et al. 2020.
“Computational Social Science: Obstacles and Opportunities.” Science 369 (6507): 1060–62.
https://doi.org/10.1126/science.aaz8170.
Liang, Hai, and King-wa Fu. 2015.
“Testing Propositions Derived from Twitter Studies: Generalization and Replication in Computational Social Science.” Edited by Zi-Ke Zhang.
PLOS ONE 10 (8): e0134270.
https://doi.org/10.1371/journal.pone.0134270.
Petti, Samantha, and Abraham Flaxman. 2020.
“Differential Privacy in the 2020 US Census: What Will It Do? Quantifying the Accuracy/Privacy Tradeoff.” Gates Open Research 3 (April): 1722.
https://doi.org/10.12688/gatesopenres.13089.2.
Ribeiro, Filipe N., Fabrício Benevenuto, and Emilio Zagheni. 2020.
“How Biased Is the Population of Facebook Users? Comparing the Demographics of Facebook Users with Census Data to Generate Correction Factors.” 12th ACM Conference on Web Science, July.
https://doi.org/10.1145/3394231.3397923.
Rowe, Francisco. 2021.
“Big Data and Human Geography.” http://dx.doi.org/10.31235/osf.io/phz3e.
———. 2022.
“Using Digital Footprint Data to Monitor Human Mobility and Support Rapid Humanitarian Responses.” Regional Studies, Regional Science 9 (1): 665–68.
https://doi.org/10.1080/21681376.2022.2135458.
Rowe, Francisco, Ruth Neville, and Miguel González-Leonardo. 2022.
“Sensing Population Displacement from Ukraine Using Facebook Data: Potential Impacts and Settlement Areas.” http://dx.doi.org/10.31219/osf.io/7n6wm.
Schlosser, Frank, Vedran Sekara, Dirk Brockmann, and Manuel Garcia-Herranz. 2021.
“Biases in Human Mobility Data Impact Epidemic Modeling.” https://doi.org/10.48550/ARXIV.2112.12521.
Singleton, Alex, and Daniel Arribas-Bel. 2019.
“Geographic Data Science.” Geographical Analysis 53 (1): 61–75.
https://doi.org/10.1111/gean.12194.
Wang, Yikang, Chen Zhong, Qili Gao, and Carmen Cabrera-Arnau. 2022. “Understanding Internal Migration in the UK Before and During the COVID-19 Pandemic Using Twitter Data.” Urban Informatics 1 (1): 15.
Zagheni, Emilio, and Ingmar Weber. 2015.
“Demographic Research with Non-Representative Internet Data.” Edited by Nikolaos Askitas and Professor Klaus F. Zimmermann.
International Journal of Manpower 36 (1): 13–25.
https://doi.org/10.1108/ijm-12-2014-0261.