Saturday, July 19, 2025

Artificial intelligence tools for historians


Historical research may seem to be a field in which AI tools will be especially useful. Historians are often confronted with very large unstructured digital collections of documents, letters, images, treaties, legal settlements, contracts, and diplomatic exchanges that far exceed the ability of a single human researcher to sift and analyze for valuable historical insights. Can emerging tools in the AI revolution help to make systematic use of such historical collections?

Earlier applications of new quantitative methods of analysis of historical data

Several earlier periods of innovation in twentieth-century historical research suggest that historians can often borrow fruitfully from new methods and analytical tools developed in other sciences. The cliometric revolution of the 1960s and 1970s (Fogel and Elton 1983; Rawski 1996; Wright 2015) brought tools of econometrics, demography, and statistics more fully into play in historical inquiry. Historians have made extensive and productive use of quantitative methods borrowed from the social sciences to investigate questions concerning the health status of various historical populations and the standard of living in different cities and regions (Crafts 1980; Lee and Feng 1999; Allen 2000; Allen, Bengtsson, and Dribe 2005). These tools usually depend upon the availability of structured databases of comparable data over time: for example, census data; birth, marriage, and death records; military records of recruits; and price data for representative goods (wheat, rice, salt). Issues of comparability, reliability, and validity arise in these applications of large historical datasets, but these issues are no more difficult for historians than for sociologists or political scientists.

Another major area of innovation was the geospatial revolution of the 1990s (Presner and Shepard 2016; Skinner, Henderson, and Yuan 2000; Thill 2020). Efforts to place historical data and events into spatial order have been very productive in suggesting new historical patterns and causal influences not visible in purely narrative accounts. G. William Skinner’s pathbreaking work on the economic regionalization of China is an outstanding example (Skinner 1977), and Peter Bol and colleagues have collaborated in the establishment of a major historical GIS database for China (Bol 2006; Bol 2007). So it is quite understandable that some contemporary historians are interested in the potential value of emerging tools of digital humanities, semantic search, and big-data analytics in their efforts to make sense of very large archives of digitized text and image materials.

However, archival collections of interest to historians present special obstacles to digital research. They are usually unstructured, consisting of heterogeneous text documents, contracts, local regulations, trial documents, imperial decrees, personal letters, and artifacts and images. Moreover, the meaning of legal, political, and religious vocabulary is sometimes unclear from a modern perspective, so translation and interpretation are problematic. The written language of the documents is itself problematic: often handwritten, interspersed with references and asides in other languages, and frequently using vocabulary with no exact modern equivalent, the documents are challenging to interpret for both the historian and the software system. Are there tools that allow the historian to sift, summarize, categorize, and highlight the texts, sentences, and paragraphs included in a large archival collection? Major new capabilities have emerged that substantially enhance the ability of historians to classify and analyze very large unstructured text databases and archives. These capabilities involve advances in machine learning, large language models, semantic search tools, and big-data analytics. As with any innovation in methods of inquiry and inference, it is crucial for researchers to carefully evaluate the epistemic reliability of the tools they use.

Digital humanities

In the past several decades scholars in the humanities, including comparative literature, art history, and various national literatures, have explored applications of computational tools for the analysis of digital texts that permit a breadth and depth of analysis not previously available. These research efforts are now described as digital humanities. Several handbooks and overviews of digital humanities have appeared (Schreibman, Siemens, and Unsworth 2004; Schreibman, Siemens, and Unsworth 2016; Eve 2022). The goals of research within the field are varied, but in almost all cases the research involves computational analysis of large databases of text, image, and video documents, with the general goal of discovering large patterns that may be undetectable through traditional tools of literary or art-history analysis. Franco Moretti’s Graphs, Maps, Trees: Abstract Models for a Literary History (2005) and Distant Reading (2013) offer excellent examples. Moretti wishes to explore “world literature,” and the field of documents included under this rubric is too large for any single critic or research team to read all the available works closely. Moretti writes, “A larger literary history requires other skills: sampling; statistics; work with series, titles, concordances, incipits—and perhaps also the ‘trees’ that I discuss in this essay” (2013: 67). In place of the insights of close reading, Moretti emphasizes the value of “distant reading” and the effort to discover broad and long patterns across national literatures and centuries. This requires using analytical tools of computational social science to classify texts, identify word patterns, create spatial networks, and (perhaps) algorithmically assign markers to topics and styles in the texts under analysis. Martin Paul Eve writes, “Under such a model, the idea is that computational detection of style, theme, content, named entities, geographic place names, etc. could be discerned at scale and aggregated into a broader and continuous literary history that would not suffer from the same defects as a model that required one to read everything” (Eve 2022: 130).

Efforts in the digital humanities have evident relevance to the problems presented by large text and image datasets available in many areas of historical research. One promising area of application involves using big data tools of text analysis—for example, machine learning, content extraction, and semantic search—to systematically survey and classify all the documents in a collection. The impetus and initiatives of the field of “digital or computational history” are described in Siebold and Valleriani 2022 and Graham, Milligan, Weingart, and Martin 2022. The methods currently creating a great deal of interest among historians are based on joining machine learning methods, big-data analytics, and large language models (LLMs) in order to permit analysis and codification of the semantic content of documents. To what extent can emerging computational tools designed for management and analysis of large unstructured text and image databases be adapted to assist the historian in the task of assimilating, interpreting, and analyzing very large databases of historical documents and artifacts?

Pre-processing and information extraction

An avenue of research in computer science that supports analysis of large unstructured datasets containing texts and images is the field of information extraction (Adnan and Akbar 2019). Information extraction technology consists of algorithms developed to analyze patterns in text (and images or videos) to apply labels or tags to segments of the data. These are generally “big data” tools using machine learning to identify patterns in target documents or images. Adnan and Akbar put the goal of information extraction tools in these terms: “It takes collection of documents as input and generates different representations of relevant information satisfying different criteria. IE techniques efficiently analyze the text in free form by extracting most valuable and relevant information in a structured format” (Adnan and Akbar 2019: 6). In general terms, information extraction tools are expected to provide a structured basis for answers to questions like these: What is the document about? What persons or things are mentioned? What relationships are specified within the document? What events are named? The tools are often based on natural-language models that require training on large text datasets and sometimes make use of machine learning based on neural networks (Rithani et al. 2023). “The concept is to automatically extract characteristics from massive artificial neural networks and then use these features to inform choices” (Rithani et al. 2023: 14766).
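To make the idea of extracting relevant information “in a structured format” concrete, here is a minimal sketch of rule-based extraction in Python. It is an illustration only: the patterns, field names, and sample sentence are invented for this example, and a real system would use far richer rules or a trained model.

    import re

    # Toy rule-based extractor: scan free text for years, price mentions,
    # and famine vocabulary, and emit one structured record per document.
    YEAR_PATTERN = re.compile(r"\b(1[0-9]{3})\b")               # four-digit years 1000-1999
    PRICE_PATTERN = re.compile(r"(\d+)\s*(?:taels|shillings)")  # crude price mentions

    def extract_record(doc_id, text):
        """Return a structured meta-record for one free-text document."""
        return {
            "doc_id": doc_id,
            "years": sorted(set(YEAR_PATTERN.findall(text))),
            "prices": [int(p) for p in PRICE_PATTERN.findall(text)],
            "mentions_famine": bool(re.search(r"famine|starvation|hunger", text, re.I)),
        }

    sample = "In 1846 the county reported famine; rice sold at 3 taels per picul."
    print(extract_record("gazetteer-042", sample))
    # -> {'doc_id': 'gazetteer-042', 'years': ['1846'], 'prices': [3], 'mentions_famine': True}

Structured records of this kind, one per document, are what downstream tools such as search indexes and statistical models actually consume.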

A useful tool developed within the field of information extraction that has clear relevance for historians attempting to analyze large unstructured databases is named entity recognition and classification (Goyal, Gupta, and Kumar 2018). This is a group of text-analysis algorithms designed to identify meaningful information contained in a given document: for example, “person, organization, location, date/time, quantities, numbers” (Goyal et al. 2018: 22). The named entities may be specialized to a particular content area; for example, public health historians may wish to include disease and symptom names. These tools are used as a basis for pre-processing of a set of documents: the tool creates a meta-file for each document listing the named entities and classes that it contains, along with other contextual information. For example, historians interested in the role that agriculture played over long periods of time may wish to quickly identify the documents that refer to hunger, famine, or starvation. Goyal, Gupta, and Kumar carefully review the methods currently in use to identify named entities in a body of texts, including rule-based identification and machine-learning identification, with or without supervision. They emphasize that none of these methods is error-free; false positives and false negatives continue to arise after training. This means that some lexical items that refer to a named entity are missed, while others are incorrectly associated with a named entity. Nonetheless, a historian can certainly use named-entity recognition and classification as a basis for important exploration and discovery in a large unstructured text database.
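As a concrete sketch, the open-source spaCy library supports exactly this kind of pre-processing. The example below assumes an English-language corpus and that the small English model has been installed (python -m spacy download en_core_web_sm); the FAMINE_EVENT label and its patterns are hypothetical domain extensions of the kind described above, not part of spaCy’s stock entity set.

    import spacy

    # General-purpose English pipeline with built-in named entity recognition.
    nlp = spacy.load("en_core_web_sm")

    # Add domain-specific entities a famine or public-health historian might
    # need; the label and patterns here are illustrative, not standard.
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns([
        {"label": "FAMINE_EVENT", "pattern": "famine"},
        {"label": "FAMINE_EVENT", "pattern": "starvation"},
        {"label": "FAMINE_EVENT", "pattern": [{"LOWER": "crop"}, {"LOWER": "failure"}]},
    ])

    doc = nlp("After the crop failure of 1877, famine spread through Shanxi.")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. 'crop failure' FAMINE_EVENT, '1877' DATE, 'Shanxi' GPE

Even this toy example exhibits the error modes Goyal, Gupta, and Kumar describe: the pattern list misses synonyms such as “dearth” (false negatives), while a metaphorical use of “famine” would be tagged incorrectly (a false positive).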

Keller, Shiue, and Yan (2024) provide a recent example of a machine-learning approach to automatic text analysis based on the most common large language model technique, bidirectional encoder representations from transformers (BERT). They use GUWEN-BERT, a BERT model pre-trained on classical Chinese characters, and they evaluate the power and accuracy of this tool in analyzing the Veritable Records of the Qing Dynasty to identify events of social unrest. The document archive is vast, encompassing more than 1,200 volumes of records from the late sixteenth century to the end of the Qing Dynasty. Their research task is to identify episodes of social unrest, and then to classify these episodes into three categories: peasant unrest, militia unrest, and secret-society unrest (Keller et al. 2024: 4). This process of event identification and classification then permitted the researchers to seek out correlates of unrest, including fluctuations in grain prices. A useful example applying the same technology is provided by Liu, Wang, and Bol (2023), demonstrating large-scale extraction of biographical information from a large collection of local gazetteers. Machine recognition of handwritten Chinese characters and translation of sentences and phrases in classical Chinese have made great progress in the past twenty years (Liu, Jaeger, and Nakagawa 2004; Leung and Leung 2010). This capability represents a major step forward in the ability of historians working in Chinese and other Asian languages to make extensive use of large databases of historical documents such as the Veritable Records.
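For readers curious what such a classifier looks like in code, the sketch below uses the Hugging Face transformers library. The checkpoint name is illustrative (ethanyt/guwenbert-base is one publicly released classical-Chinese BERT variant; Keller, Shiue, and Yan’s exact configuration may differ), and the classification head is untrained until fine-tuned on hand-labeled passages, so this is scaffolding rather than a working replication.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Illustrative checkpoint pre-trained on classical Chinese; the actual
    # model used by Keller, Shiue, and Yan (2024) may differ.
    CHECKPOINT = "ethanyt/guwenbert-base"
    LABELS = ["no unrest", "peasant unrest", "militia unrest", "secret-society unrest"]

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    # The classification head added here is randomly initialized: it must be
    # fine-tuned on labeled passages before its predictions mean anything.
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=len(LABELS)
    )

    passage = "..."  # one passage from the Veritable Records would go here
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    print(LABELS[int(logits.argmax(dim=-1))])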

RAG, GraphRAG, and vector-similarity search

An important tool that has been of interest to historians exploring digital methods is retrieval-augmented generation (RAG), a complement to LLM text-generation systems. This area of research attempts to provide a basis for joining LLM query engines to specialized databases so that responses to queries will be grounded in data contained in the associated database. RAG tools are sometimes celebrated as addressing two persistent problems arising in the application of natural-language generative chat functions based on large language models: the lack of auditability and the generation of fictitious responses (hallucinations) by the generative chat program. Kim Martineau describes a RAG tool in these terms: “Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information. RAG implementation in an LLM-based question-answer system has two main benefits: It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted” (Martineau 2024). A RAG framework is intended to allow the introduction of real, documented data into a natural-language query-and-response system, and it is designed to be auditable. RAG picks up where the pre-processing tools discussed above leave off: the retriever parses a given query into component questions and then retrieves relevant data from pre-existing databases of documents (Lewis et al. 2021; Zhao et al. 2024).
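A minimal sketch of the retrieve-then-generate loop may help. The retriever below uses TF-IDF similarity from scikit-learn as a stand-in for the learned embeddings a production system would use; call_llm is a hypothetical placeholder for whatever LLM endpoint the historian actually queries, and the three “documents” are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus standing in for a pre-processed document archive.
    documents = [
        "1877: famine reported in Shanxi; grain prices tripled.",
        "1851: secret-society rising suppressed in Guangxi.",
        "1794: treaty negotiation over border tariffs and tribute.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    def retrieve(query, k=2):
        """Return the k documents most similar to the query."""
        scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
        ranked = sorted(range(len(documents)), key=lambda i: -scores[i])
        return [documents[i] for i in ranked[:k]]

    def answer(query):
        # Ground the generator in retrieved sources, which can be shown to
        # the user afterward; this is what makes the response auditable.
        context = "\n".join(retrieve(query))
        prompt = f"Answer using ONLY these sources:\n{context}\n\nQuestion: {query}"
        return call_llm(prompt)  # hypothetical LLM endpoint, not a real API

    print(retrieve("famine and grain prices"))

In a fuller implementation the retrieved sources would be returned alongside the generated answer, which is what allows the historian to audit any claim against the underlying documents.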

RAG tools have in turn been extended with two related innovations. Vector similarity search is a semantic search tool that represents a document as a vector of abstract terms (like those identified in the discussion of named entity recognition and classification above) (Mohoney et al. 2023). This further simplifies the task of querying the database for documents that are “about” one or more entities or events. A second valuable analytical tool is GraphRAG, which permits the construction of a network graph of the links among the elements in a document collection. Introduced by research scientists at Microsoft in 2024, GraphRAG was designed to permit analysis of global features of a large unstructured data collection. (See Larson and Truitt 2024, Edge et al. 2024a, and Edge et al. 2024b for technical descriptions of GraphRAG’s capabilities.) GraphRAG combines the data provided by RAG tools and connects these to LLM generative response systems, thus integrating indexing, retrieval, and generation. The key output of GraphRAG analysis of a database of text documents is a knowledge graph showing relationships among the various documents, based on the content vectors associated with each document. (Experienced historians who make use of RAG and GraphRAG tools note that scaling up from moderate to large databases is challenging and computationally demanding.)
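The knowledge-graph idea can be illustrated with the networkx library. The sketch below is a drastically simplified stand-in for GraphRAG’s indexing stage, not Microsoft’s implementation: the documents and their extracted entities are invented, edges link documents that share entities, and community detection stands in for GraphRAG’s clustering of the graph into themes that an LLM can then summarize.

    import networkx as nx
    from networkx.algorithms import community
    from itertools import combinations

    # Hypothetical output of the entity-extraction stage: document -> entities.
    doc_entities = {
        "memorial-01": {"Shanxi", "famine", "grain price"},
        "gazetteer-07": {"Shanxi", "famine", "relief granary"},
        "edict-12": {"Guangxi", "secret society"},
        "report-03": {"Guangxi", "secret society", "militia"},
    }

    # Link documents that share at least one entity; weight edges by overlap.
    G = nx.Graph()
    for (d1, e1), (d2, e2) in combinations(doc_entities.items(), 2):
        shared = e1 & e2
        if shared:
            G.add_edge(d1, d2, weight=len(shared), shared=sorted(shared))

    # Community detection groups documents into thematic clusters: the units
    # a GraphRAG-style system summarizes to answer "global" questions.
    for group in community.greedy_modularity_communities(G, weight="weight"):
        print(sorted(group))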

Limitations of the tools for historians

These tools suggest research strategies for historians confronting very large digital collections of documents and images. They permit computational procedures that classify and index the materials in the data archive, allowing the historian to quickly identify items relevant to particular research questions: the occurrence of famine, civil strife, dynastic unrest, or the transmission of ideas. They also permit natural-language queries of the target database that provide suggestive avenues of further investigation for the historian. Crucially, these tools provide the ability to “audit” the results of a query by returning to the specific documents on which a response is based. The problem of “hallucination” that is endemic to large language models on their own is substantially reduced by tying responses to specific items in the database. And the algorithms of vector search allow the AI agent to quickly pull together the documents and “chunks” of text that are most relevant to the query.

These applications present powerful new opportunities for historians to make extensive use of very large databases of texts, but they also pose novel questions for the philosophy of history. In particular, they require that historians and philosophers develop new standards and procedures for validating the computational methods chosen for the research tasks presented by large text collections. This means that we need to examine the strengths and limitations of each of these methods of analysis. Crucially, the designers and researchers of these tools are quite explicit in acknowledging that the tools are subject to error: the problem of hallucination is not fully removed, the content database itself may be error-prone, there may be flaws and limitations inherent in the training database in use, and any errors created during the information-extraction stage will be carried forward into the results. It is therefore incumbent upon the historian who uses such tools to validate and evaluate the information provided by searches and natural-language queries. Nothing in the design of these tools guarantees that they are highly reliable; rather, they are best viewed as exploratory tools permitting the historian to look more deeply into a collection of documents than traditional methods would permit. Historians will need to think critically about the quality and limitations of the information they extract from these forms of big-data analysis of historical databases.

References

Adnan, Kiran, and Rehan Akbar, 2019. “An analytical study of information extraction from unstructured and multidimensional big data,” Journal of Big Data, 6(1): 91. doi:10.1186/s40537-019-0254-8

Allen, Robert C., 2000. “Economic Structure and Agricultural Productivity in Europe, 1300–1800,” European Review of Economic History, 3: 1–25.

Allen, Robert C., Tommy Bengtsson, and Martin Dribe (eds.), 2005. Living standards in the past: New perspectives on well-being in Asia and Europe, Oxford; New York: Oxford University Press.

Bol, Peter, 2006. “Creating the China Historical Geographic Information System,” in History in the Digital Age Symposium, University of Nebraska-Lincoln, video lecture. Available online.

–––, 2007. Creating the China Historical Geographic Information System (text and slides), Digital History Project, University of Nebraska-Lincoln. http://digitalhistory.unl.edu/essays/bolessay.php.

Crafts, N.F.R., 1980. “National income estimates and the British standard of living debate: A reappraisal of 1801–1831,” Explorations in Economic History, 17: 176–88.

Edge, Darren, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson, 2024a. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization,” manuscript at arXiv.org.

Edge, Darren, Ha Trinh, Steven Truitt, and Jonathan Larson, 2024b. “GraphRAG: New Tool for Complex Data Discovery Now on GitHub,” blog post at Microsoft Research, 2 July 2024.

Eve, Martin Paul, 2022. The digital humanities and literary studies, first edition, Oxford: Oxford University Press.

Fogel, Robert William, and G. R. Elton, 1983. Which road to the past? Two views of history, New Haven: Yale University Press.

Goyal, Archana, Vishal Gupta, and Manish Kumar, 2018. “Recent named entity recognition and classification techniques: A systematic review,” Computer Science Review, 29: 21–43. Available online.

Graham, Shawn, Ian Milligan, Scott Weingart, and Kimberley Martin, 2022. Exploring big historical data: The historian’s macroscope, second edition, New Jersey: World Scientific.

Kamath, Uday, Kevin Keenan, Garrett Somers, and Sarah Sorenson, 2024. Large language models: A deep dive: Bridging theory and practice, Cham: Springer.

Keller, Wolfgang, Carol H. Shiue, and Sen Yan, 2024. “Mining Chinese historical sources at scale: A machine learning approach to Qing state capacity,” Working Paper 32982, National Bureau of Economic Research, Cambridge, MA. Available online.

Larson, Jonathan, and Steven Truitt, 2024. “GraphRAG: Unlocking LLM Discovery on Narrative Private Data,” blog post at Microsoft Research, 13 February 2024.

Lee, James Z., and Wang Feng, 1999. One quarter of humanity: Malthusian mythology and Chinese realities, 1700–2000, Cambridge, Mass.: Harvard University Press.

Leung, K. C., and C. H. Leung, 2010. “Recognition of handwritten Chinese characters by critical region analysis,” Pattern Recognition, 43(3): 949–961.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela, 2021. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” manuscript available at arXiv.org.

Liu, Cheng-Lin, Stefan Jaeger, and Masaki Nakagawa, 2004. “Online recognition of Chinese characters: The state-of-the-art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 26: 198–213.

Liu, Zhou, Hongsu Wang, and Peter K. Bol, 2023. “Automatic biographical information extraction from local gazetteers with Bi-LSTM-CRF model and BERT,” International Journal of Digital Humanities, 4: 195–212.

Martineau, Kim, 2024. “What is retrieval-augmented generation?”, IBM Research blog, accessed 23 November 2024. https://research.ibm.com/blog/retrieval-augmented-generation-RAG.

Mitchell, Melanie, Alessandro B. Palmarini, and Arseny Moskvichev, 2023. “Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks,” manuscript at arXiv.org.

Mohoney, Jason, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas, 2023. “High-Throughput Vector Similarity Search in Knowledge Graphs,” manuscript at arXiv.org (arXiv:2304.01926).

Moretti, Franco, 2005. Graphs, maps, trees: Abstract models for a literary history, London; New York: Verso.

–––, 2013. Distant reading, London; New York: Verso.

Presner, Todd, and David Shepard, 2016. “Mapping the geospatial turn,” in A new companion to digital humanities, edited by Susan Schreibman, Raymond George Siemens, and John Unsworth, Chichester: Wiley/Blackwell.

Rawski, Thomas G. (ed.), 1996. Economics and the historian, Berkeley: University of California Press.

Rithani, M., R. Prasanna Kumar, and Srinath Doss, 2023. “A review on big data based on deep neural network approaches,” Artificial Intelligence Review, 56(12): 14765–14801.

Schreibman, Susan, Raymond George Siemens, and John Unsworth (eds.), 2004. A companion to digital humanities (Blackwell Companions to Literature and Culture), Malden, MA: Blackwell Publishing.

Schreibman, Susan, Raymond George Siemens, and John Unsworth (eds.), 2016. A new companion to digital humanities, Chichester, West Sussex, UK: Wiley/Blackwell.

Siebold, Anna, and Matteo Valleriani, 2022. “Digital perspectives in history,” Histories, 2(2): 170–177.

Skinner, G. William, 1977. “Regional Urbanization in Nineteenth-Century China,” in The City in Late Imperial China, edited by G. William Skinner and Hugh D. R. Baker, Stanford, CA: Stanford University Press.

Skinner, G. William, Mark Henderson, and Yuan Jianhua, 2000. “China’s Fertility Transition through Regional Space: Using GIS and Census Data for a Spatial Analysis of Historical Demography,” Social Science History, 24(3): 613–652.

Thill, Jean-Claude (ed.), 2020. Innovations in urban and regional systems: Contributions from GIS&T, spatial analysis and location modeling, 1st edition, Cham: Springer International Publishing.

Wang, Dongbo, Chang Liu, Zhixiao Zhao, Si Shen, Liu Liu, Bin Li, Haotian Hu, Mengcheng Wu, Litao Lin, Xue Zhao, and Xiyu Wang, 2023. “GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts,” manuscript at arXiv.org.

Zhao, Penghao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui, 2024. “Retrieval-Augmented Generation for AI-Generated Content: A Survey,” manuscript at arXiv.org.
