The Datafied Society: Challenges and Strategies in Big Data Research for Social Sciences and Humanities

Document Type : Original article

Author

PhD in Communication and New Media studies, University of Vienna; Member of the Executive Committee of the UNESCO Chair in Cyberspace and Culture

DOI: 10.22059/jcss.2024.378294.1106

Abstract

The advent of big data marks a profound shift in our epistemological framework, introducing a new knowledge paradigm in which the social landscape is shaped by data processing that is perceived as both comprehensive and natural. This shift challenges traditional notions of human agency in societal understanding and places empirical quantification at the forefront of inquiry. Beyond these philosophical implications, big data research faces pragmatic challenges: issues of commensuration, the influence of action grammars, the dominance of correlational over causal relationships, the prevalence of everyday data over historical archives, and the pervasive impact of algorithms on data ecosystems. This manuscript explores these challenges and proposes strategies for navigating them within emerging disciplines such as Digital Humanities, Social Computing, and Cultural Analysis. Methodologically anchored in constructivist principles and critical discourse analysis (CDA), the study investigates how socio-cultural contexts shape data and knowledge production. Drawing on extensive literature and meta-analyses, it synthesizes diverse perspectives to underscore the need for methodological innovation and reflexivity in big data research, so that social inquiry retains its integrity and depth amid evolving data-driven methodologies.

