Toronto Data Workshop (TDW) – Accessible Investigative Journalism: Navigating Canada’s Largest Corpus of Government Documents – Sept 20, 2024
Session Description
September 20, 2024 @ 12:00 pm - 1:00 pm
“Open By Default” (OBD) is a dataset from the Investigative Journalism Foundation and Canada’s largest collection of government documents, comprising over 4.5 million pages of Access to Information and Privacy (ATIP) requests and the corresponding government documentation. This project enhanced data capture using optical character recognition (OCR), improved search performance through Large Language Model (LLM) vectorization, and applied topic modelling to reveal the high-level subject matter represented in the OBD dataset. The final stage of the project was a Retrieval-Augmented Generation (RAG) LLM pipeline, which enables a chatbot to provide tailored, context-rich responses to user queries, paired with follow-up research directions.
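For readers new to the idea, the retrieval step of a RAG pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the three sample documents are invented, the word-count `embed` function stands in for LLM vectorization, and the names `DOCUMENTS`, `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical sample documents; these are NOT taken from the OBD dataset.
DOCUMENTS = {
    "doc-1": "Briefing note on federal housing funding and rental construction.",
    "doc-2": "Email thread about fisheries quotas and salmon stock assessments.",
    "doc-3": "Memo on housing affordability programs for first-time buyers.",
}


def embed(text):
    """Toy vectorization: lowercase word counts.

    Assumption: a real pipeline would call an LLM embedding model here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return the ids of the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        DOCUMENTS.items(),
        key=lambda item: cosine(query_vec, embed(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def build_prompt(query, k=2):
    """Assemble the augmented prompt that would be sent to a chat model."""
    context = "\n".join(f"[{doc_id}] {DOCUMENTS[doc_id]}" for doc_id in retrieve(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("housing funding"))
```

The generation step is left out on purpose: the string returned by `build_prompt` is what a chatbot would pass to a language model, so that the answer is grounded in the retrieved documents rather than in the model's memory alone.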