Extraction of knowledge models from textbooks

Alpizar-Chacón, Isaac

Date

2023-03-08

Author

Alpizar-Chacón, Isaac

Abstract

Many adaptive educational systems and other artificial intelligence applications rely on high-quality knowledge representations. Still, knowledge acquisition remains the primary bottleneck hindering large-scale deployment and adoption of knowledge-based systems. Creating knowledge models manually requires considerable time and effort. One path to scalable knowledge extraction is using digital textbooks, given their domain-oriented content, structure, and availability. The authors’ knowledge encoded in the textbooks’ elements that facilitate navigation and understanding of the material (table of contents, index, formatting styles) can be leveraged to generate knowledge models. Nevertheless, extracting this hidden knowledge from digital (PDF) textbooks is challenging. This dissertation presents a unified approach for automatically extracting high-quality, domain-specific knowledge models from digital textbooks. The approach extracts an initial set of information (structure, content, and terminology) from the textbooks, then gradually adds new information (links and semantic content), and finally analyses and refines the knowledge about the domain (concepts). The proposed approach consists of seven phases, each focusing on different aspects of knowledge that can be extracted from textbooks. Additionally, multiple evaluations verify the quality of the extracted models. The first phase of the approach, extraction (Chapter 3), describes multiple steps to automatically recognize the structure, content, and domain terms in textbooks. Structural information refers to the list of chapters and subchapters of the textbook. The textbook’s content is represented hierarchically (words, lines, text fragments, pages, and sections). Lastly, the domain terms are extracted from the back-of-the-book index.
The extracted elements are recognized with very high accuracy: almost absolute precision and recall for the structure, higher recognition for the content than state-of-the-art tools (e.g., 97% vs. 85%), and very high precision and recall for the domain terms (between 96% and 99%). Linking/enrichment is the second phase of the approach (Chapter 4), where the domain terms are used as a bridge to connect the textbooks to an external knowledge graph. Specifically, domain terms are matched to corresponding entities in DBpedia, a publicly available knowledge graph based on Wikipedia. The linking mechanism achieves high and balanced precision and recall values (e.g., 97% and 92%, respectively), which allows for the enrichment of the domain terms with the semantic information (abstracts, Wikipedia links, categories, synonyms, and relations to other terms) that matches the terms’ actual meanings ("sense"). Domain analysis is performed in the third, fourth, and fifth phases. The third phase of the approach, integration (Chapter 4), describes integrating the domain terms from multiple textbooks into a single model by merging semantically equal terms. When more textbooks from the target domain are combined, the coverage of the domain grows significantly. For example, ten textbooks cover one-third of the target domain, whereas a single textbook covers less than one-tenth. Categorization (Chapter 5) is the fourth phase of the approach. It details the steps to identify the relevance of concepts to the target domain. The information from DBpedia (categories, links, and abstracts) and textbooks (index terms and domain information) is used to identify how individual concepts are related to the target domain: most essential concepts in the target domain, other concepts in the target domain, concepts in neighboring domains, and concepts unrelated to the target domain.
Distinguishing the relevance of the concepts to the target domain, i.e., their specificity, is achieved with high accuracy (e.g., 92%). The fifth phase, validation (Chapter 6), describes an experiment that analyses the cognitive validity of textbook concepts for knowledge modeling using learning curve analysis. The results show that, generally, textbook concepts are cognitively valid knowledge units (learning takes place during student practice for 44 out of 46 studied concepts). Additionally, the experiment demonstrates that, in terms of granularity, fine-grained concepts model knowledge better. The sixth phase of the automatic approach is formalization (Chapter 3), where all the extracted knowledge is serialized as a descriptive XML file using the Text Encoding Initiative. Finally, after the approach has produced knowledge models, they can be used in various applications. This dissertation introduces three educational systems that use the extracted models at their core (Chapter 7). (1) Interlingua uses the knowledge models as a semantic bridge between textbooks written in different languages. (2) Intextbooks supports the adaptation and interactivity of all possible elements in HTML textbooks thanks to the fine-grained identification of content elements in the extracted knowledge models. (3) The integration of Intextbooks with P4, a personalized Python programming practice system, shows how the knowledge models can facilitate linking between textbook content and smart learning activities. Multiple domains are used for the evaluation of the proposed approach. Extraction of the textbook content is tested in the domains of statistics, computer science, history, and literature. Terms in the domains of statistics and information retrieval are linked correctly to their respective entities in DBpedia. The specificity of terms in statistics and ancient philosophy is identified.
Finally, concepts extracted from Python programming textbooks are shown to be valid knowledge components. In conclusion, this dissertation explores and presents an approach for automatically extracting knowledge models from digital textbooks, taking into account multiple quality aspects. Future research involves extending and further evaluating the approach, evaluating the extracted knowledge models from the perspective of pedagogical usefulness, and extending the use of the approach toward creating intelligent textbooks.
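
As an illustration of the integration phase described in the abstract, the following is a minimal Python sketch and not the dissertation's implementation. It assumes that each index term has already been linked to a knowledge-graph entity and treats two terms as semantically equal when they share that entity. The function names `integrate` and `coverage`, the `dbr:`-prefixed identifiers, the sample terms, and the reference set of domain entities are all hypothetical.

```python
from collections import defaultdict


def integrate(textbooks):
    """Merge index terms from several textbooks into one model.

    `textbooks` maps a textbook id to {index term: linked entity id}.
    Assumption: terms sharing an entity id are semantically equal.
    """
    model = defaultdict(lambda: {"labels": set(), "sources": set()})
    for book_id, terms in textbooks.items():
        for label, entity in terms.items():
            model[entity]["labels"].add(label)
            model[entity]["sources"].add(book_id)
    return dict(model)


def coverage(model, domain_entities):
    """Fraction of a (hypothetical) reference set of domain entities in the model."""
    domain = set(domain_entities)
    if not domain:
        raise ValueError("domain_entities must not be empty")
    return len(domain & set(model)) / len(domain)


# Hypothetical toy data: two textbooks with partly overlapping terminology.
book_a = {"mean": "dbr:Mean", "standard deviation": "dbr:Standard_deviation"}
book_b = {"average": "dbr:Mean", "variance": "dbr:Variance"}
domain = {"dbr:Mean", "dbr:Standard_deviation", "dbr:Variance", "dbr:Median"}

single = integrate({"book_a": book_a})
merged = integrate({"book_a": book_a, "book_b": book_b})

print(coverage(single, domain))   # coverage with one textbook
print(coverage(merged, domain))   # coverage after adding a second textbook
print(sorted(merged["dbr:Mean"]["labels"]))
```

In this toy setting, "mean" and "average" collapse into a single concept because both point to the same entity, and coverage grows as textbooks are added, which mirrors the trend reported in the abstract on a much smaller scale.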

Description

Graduation project (Doctorate in Information and Computing Sciences), Utrecht University, Department of Information and Computing Sciences, 2023

URI

https://hdl.handle.net/2238/14378

Collections

Desarrollo Académico [65]