Extraction of knowledge models from textbooks
Abstract
Many adaptive educational systems and other artificial intelligence applications rely
on high-quality knowledge representations. Still, knowledge acquisition remains the
primary bottleneck hindering large-scale deployment and adoption of knowledge-based
systems. The manual creation of knowledge models requires a lot of time and
effort. One path to scalable knowledge extraction is using digital textbooks, given
their domain-oriented content, structure, and availability. The authors’ knowledge
encoded in the textbooks’ elements that facilitate navigation and understanding of
the material (table of contents, index, formatting styles) can be leveraged to generate
knowledge models. Nevertheless, extracting this hidden knowledge from digital
(PDF) textbooks is challenging. This dissertation presents a unified approach for
automatically extracting high-quality and domain-specific knowledge models from
digital textbooks.
This dissertation’s approach extracts an initial set of information (structure, content,
and terminology) from the textbooks, then gradually adds new information
(links and semantic content), and finally analyses and refines the knowledge about
the domain (concepts). The proposed approach consists of six phases, each focusing
on different aspects of knowledge that can be extracted from textbooks. Additionally,
multiple evaluations verify the quality of the extracted models.
The first phase of the approach, extraction (see Chapter 3), describes multiple
steps to automatically recognize the structure, content, and domain terms in textbooks.
Structural information refers to the list of chapters and subchapters of the
textbook. The textbook’s content is represented hierarchically (words, lines, text
fragments, pages, and sections). Lastly, the domain terms are extracted from the
back-of-the-book index. The extracted elements are recognized with a very high accuracy:
almost absolute precision and recall for the structure, higher recognition for
the content than state-of-the-art tools (e.g., 97% vs. 85%), and very high precision
and recall for the domain terms (between 96% and 99%).
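
As an illustration of how precision and recall figures like those above are computed, the following minimal sketch compares a set of extracted index terms against a manually checked gold set. The function name and the toy terms are hypothetical and are not taken from the dissertation.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: one spurious term ("see also") was extracted
# and one real index term ("standard deviation") was missed.
gold_terms = {"mean", "median", "variance", "standard deviation"}
extracted_terms = {"mean", "median", "variance", "see also"}

precision, recall = precision_recall(extracted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision penalizes spurious extractions, while recall penalizes missed terms, which is why the abstract reports both.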
Linking/enrichment is the second phase of the approach (see Chapter 4), where the
domain terms are used as a bridge to connect the textbooks to an external knowledge
graph. Specifically, domain terms are matched to corresponding entities in
DBpedia—a publicly available knowledge graph based on Wikipedia. The linking
mechanism achieves high and balanced precision and recall values (e.g., 97% and
92%, respectively), which allows for the enrichment of the domain terms with the semantic
information (abstracts, Wikipedia links, categories, synonyms, and relations to other terms) that matches the terms’ actual meanings ("sense").
Domain analysis is performed in the third, fourth, and fifth phases. The third
phase of the approach, integration (see Chapter 4), describes integrating the domain
terms from multiple textbooks into a single model by merging semantically equal
terms. When more textbooks from the target domain are combined, the coverage of
the domain grows significantly. For example, ten textbooks cover one-third of the target
domain, whereas a single textbook covers less than one-tenth.
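
The idea of integration can be sketched as follows. This is only an illustration under a stated assumption: two terms are treated as semantically equal when they link to the same knowledge-graph entity, which may differ from the criterion used in the dissertation. The identifiers (`dbr:Mean`, `book_a`, and so on) are hypothetical toy data.

```python
from collections import defaultdict


def integrate(textbooks):
    """Merge terms that link to the same entity into one concept.

    `textbooks` maps a textbook id to a dict {index term: entity id}.
    Returns {entity id: set of (textbook id, term) pairs}.
    """
    concepts = defaultdict(set)
    for book_id, terms in textbooks.items():
        for term, entity in terms.items():
            concepts[entity].add((book_id, term))
    return dict(concepts)


def coverage(concepts, domain_entities):
    """Fraction of the domain's entities present in the integrated model."""
    domain = set(domain_entities)
    return len(set(concepts) & domain) / len(domain)


# Hypothetical toy data: "average" and "mean" point to the same entity.
book_a = {"average": "dbr:Mean", "variance": "dbr:Variance"}
book_b = {"mean": "dbr:Mean", "median": "dbr:Median"}
domain = ["dbr:Mean", "dbr:Variance", "dbr:Median", "dbr:Mode"]

one_book = integrate({"book_a": book_a})
two_books = integrate({"book_a": book_a, "book_b": book_b})
print(coverage(one_book, domain), coverage(two_books, domain))
```

In this toy setting, adding the second textbook raises coverage while the two differently worded terms for the mean collapse into a single concept.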
Categorization (see Chapter 5) is the fourth phase of the approach. It details the
steps to identify the relevance of concepts to the target domain. The information
from DBpedia (categories, links, and abstracts) and textbooks (index terms and domain
information) is used to identify how individual concepts are related to the
target domain: most essential concepts in the target domain, other concepts in the
target domain, concepts in neighboring domains, and concepts unrelated to the target
domain. Distinguishing the relevance of the concepts to the target domain, i.e.,
their specificity, is achieved with high accuracy (e.g., 92%).
The fifth phase, validation (see Chapter 6), describes an experiment that analyses
the cognitive validity of textbook concepts for knowledge modeling using learning
curve analysis. The results show that, generally, textbook concepts are cognitively
valid knowledge units (learning takes place during student practice for 44 out of
46 studied concepts). Additionally, the experiment demonstrates that in terms of
granularity, fine-grained concepts model knowledge better.
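
To make the notion of a learning curve concrete, the sketch below fits a power-law curve, error = a * opportunity^(-b), to per-opportunity error rates for one concept. The abstract does not state which statistical model the dissertation uses, so the power law is a simple stand-in chosen here for illustration, and the error rates are hypothetical.

```python
import math


def fit_power_law(error_rates):
    """Fit error = a * opportunity**(-b) by least squares in log-log space.

    `error_rates[i]` is the error rate at practice opportunity i + 1.
    Requires at least two values, all strictly positive.
    Returns (a, b); b > 0 means the error rate falls with practice.
    """
    xs = [math.log(i) for i in range(1, len(error_rates) + 1)]
    ys = [math.log(e) for e in error_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical error rates over five practice opportunities for one concept.
observed = [0.60, 0.45, 0.38, 0.33, 0.30]
a, b = fit_power_law(observed)
print(f"a={a:.2f} b={b:.2f} learning={'yes' if b > 0 else 'no'}")
```

A positive exponent b indicates that errors decrease as students practice the concept, which is the sense in which a concept behaves as a valid knowledge unit.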
The sixth phase of the automatic approach is formalization (see Chapter 3), where
all the extracted knowledge is serialized as a descriptive XML file that follows the Text
Encoding Initiative (TEI) guidelines.
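
A rough sketch of such a serialization is shown below, using Python's standard `xml.etree.ElementTree` and the TEI namespace. The element selection is an assumption made for illustration: it is not the schema used in the dissertation, and the header omits parts that a fully valid TEI document requires (for example `publicationStmt` and `sourceDesc`). The title and section text are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements in the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, sections):
    """Build a small TEI-style tree from (heading, paragraph) pairs."""
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for heading, paragraph in sections:
        div = ET.SubElement(body, tei("div"), {"type": "section"})
        ET.SubElement(div, tei("head")).text = heading
        ET.SubElement(div, tei("p")).text = paragraph
    return root


# Hypothetical textbook content.
root = build_tei(
    "Hypothetical Statistics Textbook",
    [("Descriptive statistics", "The mean summarizes a sample.")],
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```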
Finally, after the approach has produced knowledge models, they can be used in
various applications. This dissertation introduces three educational systems that use
the extracted models at their core (see Chapter 7). (1) Interlingua uses the knowledge
models as a semantic bridge between textbooks written in different languages.
(2) Intextbooks supports the adaptation and interactivity of all possible elements in
HTML textbooks thanks to the fine-grained identification of content elements in the
extracted knowledge models. (3) The integration of Intextbooks with P4—a personalized
Python programming practice system—shows how the knowledge models
can facilitate linking between textbook content and smart learning activities.
Multiple domains are used for the evaluation of the proposed approach. Extraction
of the textbook content is tested in the domains of statistics, computer science,
history, and literature. Terms in the domains of statistics and information retrieval
are linked correctly to their respective entities in DBpedia. The specificity of terms in
statistics and ancient philosophy is identified. Finally, concepts extracted from Python
programming textbooks are shown to be valid knowledge components.
In conclusion, this dissertation explores and presents an approach for automatically
extracting knowledge models from digital textbooks taking into account multiple
quality aspects. Future research involves extending and further evaluating the
approach, evaluating the extracted knowledge models from the perspective of pedagogical
usefulness, and extending the use of the approach toward creating intelligent
textbooks.
Description
Graduation project (Doctorate in Information and Computing Sciences), Utrecht University, Department of Information and Computing Sciences, 2023