Tiziano Rotesi
Text as Data for Social Science Research
SOC 2070 - Brown University
Course Outline:
The explosion of accessible digital text is rapidly changing the work of researchers interested in studying culture, decision-making, and human interaction. For example, the narratives, debates, laws, and opinions that form the core of political discourse are predominantly text-based, which makes it essential to understand what is being communicated and written. These data provide a complementary dimension to the more traditional, structured datasets typically used in social science research. From analyzing news articles and social media posts to understand public sentiment and opinion, to examining online forums and comment sections to gain insights into community dynamics and social issues, the applications of text analysis are broad and impactful.
This graduate-level course provides an overview of, and hands-on experience with, the methods that make up the essential toolkit for text analysis. Aimed at equipping students with practical skills, the course covers a wide range of topics, including data collection strategies and ethical considerations related to text analysis. From the perspective of social science researchers, the course explores various methods to discover patterns, measure variables of interest, and assess causal relationships using textual data. Through theoretical discussions, engagement with recent literature, and practical exercises, students will gain the knowledge and expertise needed to analyze text data effectively in their own research.
Textbook:
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. (GRS)
This is the main reference for most of the course. It does not cover some of the topics in the second part of the course, for which we will need to use other material.
Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing, 3rd Edition.
Available online HERE. (JM)
Bird, S., Klein, E., & Loper, E. (2019). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
Available online HERE. (BKL)
This book is a good reference point for the basic NLP tasks in Python.
Calendar:
Week 1
Date: 01/26/2024
Topic: Introduction. Discovery, Measurement, and Inference
Required Readings:
GRS, Chapter 2.
Gentzkow, M., Kelly, B., & Taddy, M. (2019). Text as Data. Journal of Economic Literature, 57(3), 535-74.
Optional Readings:
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational Social Science and Sociology. Annual Review of Sociology, 46, 61-81.
Ash, E., & Hansen, S. (2023). Text algorithms in economics. Annual Review of Economics.
Python: Intro to Python.
Week 2
Date: 02/02/2024
Topic: Acquiring and Selecting Data. Methods and Sources
Required Readings:
GRS, Chapters 3, 4.
Python: Web Scraping, APIs, loading and cleaning data.
Week 3
Date: 02/09/2024
Topic: Bag of Words and Dictionary Methods
Required Readings:
GRS, Chapters 5, 10, 11, 15, 16.
Optional Readings:
Advani, A., Ash, E., Cai, D., & Rasul, I. (2021). Race-related research in economics and other social sciences.
Cheng, M., Smith, D. S., Ren, X., Cao, H., Smith, S., & McFarland, D. A. (2023). How New Ideas Diffuse in Science. American Sociological Review, 88(3), 522-561.
Dunivin, Z. O., Yan, H. Y., Ince, J., & Rojas, F. (2022). Black Lives Matter protests shift public discourse. Proceedings of the National Academy of Sciences, 119(10), e2117320119.
Enke, B. (2020). Moral values and voting. Journal of Political Economy, 128(10), 3679-3729.
Esposito, E., Rotesi, T., Saia, A., & Thoenig, M. (2023). Reconciliation Narratives: "The Birth of a Nation" after the US Civil War. American Economic Review.
Michalopoulos, S., & Xue, M. M. (2021). Folklore. The Quarterly Journal of Economics, 136(4), 1993-2046.
Rho, E. H. R., & Mazmanian, M. (2020, April). Political hashtags & the lost art of democratic discourse. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-13).
Triplett, J. (2022). Articulating the Pueblo Cubano: Women's Politicization and Productivity in Revolutionary Cuba, 1959 to 1969. American Sociological Review, 87(1), 80-104.
Python: Tokenization, dictionary methods, sentiment analysis, and mutual information.
Week 4
Date: 02/16/2024
Topic: Multinomial Models and Vector Models
Required Readings:
GRS, Chapters 6, 7, 11.2-11.4.
Optional Readings:
Gentzkow, M., & Shapiro, J. M. (2010). What drives media slant? Evidence from US daily newspapers. Econometrica, 78(1), 35-71.
Cagé, J., Hervé, N., & Viaud, M. L. (2020). The Production of Information in an Online World. The Review of Economic Studies, 87(5), 2126-2164.
Python: Multinomial Models and Vector Models.
Week 5
Date: 02/23/2024
Topic: Clustering, Topic Models, Principal Component Analysis
Required Readings:
GRS, Chapters 10, 11, 13, 14.
Optional Readings:
Barron, A. T., Huang, J., Spang, R. L., & DeDeo, S. (2018). Individuals, institutions, and innovation in the debates of the French Revolution. Proceedings of the National Academy of Sciences, 115(18), 4607-4612.
Greve, H. R., Rao, H., Vicinanza, P., & Zhou, E. Y. (2022). Online Conspiracy Groups: Micro-Bloggers, Bots, and Coronavirus Conspiracy Talk on Twitter. American Sociological Review, 87(6), 919-949.
Heiberger, R. H., Munoz-Najar Galvez, S., & McFarland, D. A. (2021). Facets of specialization and its relation to career success: An analysis of US Sociology, 1980 to 2015. American Sociological Review, 86(6), 1164-1192.
Na, R. W., & DeDeo, S. (2022). The Diversity of Argument-Making in the Wild: from Assumptions and Definitions to Causation and Anecdote in Reddit's "Change My View". arXiv preprint arXiv:2205.07938.
Python: PCA, Topic Models.
Week 6
Date: 03/01/2024
Topic: Supervised Learning
Required Readings:
GRS, Chapters 17, 18, 19, 20.
Optional Readings:
Gentzkow, M., Shapiro, J. M., & Taddy, M. (2019). Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech. Econometrica, 87(4), 1307-1340.
Widmer, P., Galletta, S., & Ash, E. (2022). Media slant is contagious. arXiv preprint arXiv:2202.07269.
Python: Machine learning methods applied to text.
Week 7
Date: 03/08/2024
Topic: Word Embeddings
Required Readings:
GRS, Chapter 8.
Rodriguez, P. L., & Spirling, A. (2022). Word Embeddings: What works, what doesn't, and how to tell the difference for applied research. The Journal of Politics, 84(1), 101-115.
Optional Readings:
Gennaro, G., & Ash, E. (2022). Emotion and Reason in Political Language. The Economic Journal, 132(643), 1037-1059.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review, 84(5), 905-949.
Stampi-Bombelli, A., Gennaro, G., Ash, E., & Hangartner, D. (2022). Immigration and Social Distance: Evidence from Newspapers during the Age of Mass Migration. Mimeo.
Python: Word embeddings.
Week 8
Date: 03/15/2024
Topic: Neural Networks and Sequence Models
Required Readings:
JM, Chapter 7.
GRS, Chapter 9.
Optional Readings:
JM, Chapter 8.
Adukia, A., Eble, A., Harrison, E., Runesha, H. B., & Szasz, T. (2023). What we teach about race and gender: Representation in images and text of children's books. The Quarterly Journal of Economics.
Fetzer, T. (2020). Can workfare programs moderate conflict? Evidence from India. Journal of the European Economic Association, 18(6), 3337-3375.
Papasavva, A., Blackburn, J., Stringhini, G., Zannettou, S., & De Cristofaro, E. (2021). "Is it a qoincidence?": An exploratory study of QAnon on Voat. In Proceedings of the Web Conference 2021 (pp. 460-471).
Python: Parsing, Named Entities, Semantic Role Labeling, application to gendered language.
Week 9
Date: 03/22/2024
Topic: Transformers and Pretrained Language Models
Required Readings:
JM, Chapters 10, 11.
Optional Readings:
Bingler, J. A., Kraus, M., Leippold, M., & Webersinke, N. (2022). Cheap talk and cherry-picking: What ClimateBert has to say on corporate climate risk disclosures. Finance Research Letters, 47, 102776.
Hansen, S., Lambert, P. J., Bloom, N., Davis, S. J., Sadun, R., & Taska, B. (2023). Remote Work across Jobs, Companies, and Space (No. w31007). National Bureau of Economic Research.
Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
Stammbach, D., Antoniak, M., & Ash, E. (2022). Heroes, Villains, and Victims, and GPT-3--Automated Extraction of Character Roles Without Training Data. arXiv preprint arXiv:2205.07557.
Python: Transformers and LLMs.
Week 10
Date: 04/05/2024
Topic: Causal Inference
Required Readings:
GRS, Chapters 24, 25, 26, 27.
Optional Readings:
Ash, E., Chen, D. L., & Naidu, S. (2022). Ideas have consequences: The impact of law and economics on American justice (No. w29788). National Bureau of Economic Research.
Bail, C. A., Volfovsky, A., Argyle, L. P., Brown, T. W., Bumpus, J. P., et al. (2018). Exposure to opposing views on social media can increase political polarization. Proceedings of the National Academy of Sciences, 115, 9216-9221.
Djourelova, M., Durante, R., & Martin, G. (2021). The impact of online competition on local newspapers: Evidence from the introduction of Craigslist.
Other Applications HERE.
Week 11
Date: 04/12/2024
Topic: Privacy, Ethics, and Interpretability
Readings:
Barocas, S., Hardt, M., & Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. Available HERE. Chapters 1 and 7.
Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. California Law Review, 671-732.
Hovy, D., & Spruit, S. L. (2016, August). The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 591-598).
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.