The French Lexicon Ontology
December 8, 2020 · 4 min read
The second rundown of my past projects. I still have one more in mind before reaching the present.
The French Wiktionary is an immense source of lexical knowledge about the French language. Its community model has allowed it to reach broad coverage of the language, making it one of the best sources available online. Moreover, it is open source. Yet its content is barely machine-readable, which prevents its use in general applications. We tackle this by parsing the Wiktionary data and populating a handcrafted ontology. Check it out at https://chalier.fr/flont!
1. Building the ontology schema
First on the list came crafting the ontology. I took a notepad and started figuring out how lexical data should be represented. I wrote down three target features:
- a part-of-speech taxonomy,
- outgoing links from a word to its synonyms, antonyms, etc.,
- outgoing links from a word to its inflections.
I tried several structures but finally chose to mimic the way Wiktionary already organizes its data. Each literal is described in one article, each article is composed of entries (usually one entry per part of speech or per global meaning), and each entry may have several slightly different definitions, along with examples. That led me to the following basic schema (under the classes are the associated data properties):
After a quick study of the existing articles within the French Wiktionary, I broke lexical entries down into the following taxonomy using the rdfs:subClassOf property:
Information about semantic links and inflections (including conjugations) is represented as object properties between lexical entries and literals. The domains and ranges of those properties are not trivial to define: I chose the simplest representation for WikiText parsing, but this representation may not be the best. I wrote about this here.
The resulting schema can be found at https://ontology.chalier.fr/flont-schema.owl.
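To make the rdfs:subClassOf idea concrete, here is a minimal sketch that serializes a small part-of-speech taxonomy as Turtle triples. The class names, the `ex:` namespace and the helper functions are assumptions made up for illustration; the actual classes are the ones in the published schema file.

```python
# Hypothetical namespace and class names, for illustration only.
PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix ex: <http://example.org/lexicon#> .\n"
)

# Parent class -> direct subclasses (an invented excerpt of a POS taxonomy).
TAXONOMY = {
    "LexicalEntry": ["Noun", "Verb", "Adjective"],
    "Noun": ["CommonNoun", "ProperNoun"],
}


def taxonomy_to_turtle(taxonomy):
    """Declare every class, then emit one rdfs:subClassOf triple per edge."""
    classes = set(taxonomy)
    for children in taxonomy.values():
        classes.update(children)
    lines = [PREFIXES]
    for name in sorted(classes):
        lines.append(f"ex:{name} a owl:Class .")
    for parent in sorted(taxonomy):
        for child in taxonomy[parent]:
            lines.append(f"ex:{child} rdfs:subClassOf ex:{parent} .")
    return "\n".join(lines)


def ancestors(taxonomy, name):
    """Walk up the taxonomy from a class to the root, nearest parent first."""
    parent_of = {c: p for p, children in taxonomy.items() for c in children}
    chain = []
    while name in parent_of:
        name = parent_of[name]
        chain.append(name)
    return chain


if __name__ == "__main__":
    print(taxonomy_to_turtle(TAXONOMY))
    print(ancestors(TAXONOMY, "ProperNoun"))
```

Because the subclass relation is transitive, a query for all lexical entries would also match every proper noun, which is the main reason to encode parts of speech as a class hierarchy rather than as a plain string attribute.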
2. Parsing the data
The actual parsing needed data first: it came from an XML dump. I first parsed the XML and loaded it into a SQLite database, so that small queries would run quickly during a preliminary study of the data.
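A minimal sketch of that loading step, using only the standard library, could look like the following. The table layout, the function names and the tiny sample dump are assumptions for illustration; the real dump is far larger and carries more fields per page.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML dump (illustrative content only).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>chat</title><revision><text>== {{langue|fr}} ==</text></revision></page>
  <page><title>chien</title><revision><text>== {{langue|fr}} ==</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree prepends to tag names."""
    return tag.rsplit("}", 1)[-1]


def load_dump(xml_source, connection):
    """Stream pages out of the dump and store (title, wikitext) rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, wikitext TEXT)"
    )
    count = 0
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if local_name(element.tag) != "page":
            continue
        title, text = None, None
        for child in element.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if title is not None and text is not None:
            connection.execute(
                "INSERT OR REPLACE INTO articles VALUES (?, ?)", (title, text)
            )
            count += 1
        element.clear()  # release the page's children once stored
    connection.commit()
    return count


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(load_dump(io.BytesIO(SAMPLE_DUMP), db), "pages loaded")
```

Once the articles are in a table, a question such as "how many pages mention a given template?" becomes a single `SELECT ... WHERE wikitext LIKE ...` query rather than another pass over the XML.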
The data is first parsed into a Python representation. This parsing involves reading WikiText strings. Fortunately, a module, wikitextparser, already exists for that. What it does not do is handle the various templates specific to Wiktionary (which specify all the relevant data, such as POS or gender). The task is therefore very tedious and involves a lot of very specific bits of code.
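The "very specific bits of code" typically end up as one small handler per template. The sketch below illustrates that dispatch pattern. To stay dependency-free it swaps wikitextparser for a simple regular expression that only understands flat, non-nested templates, so it is not how the project reads WikiText. The handler names are hypothetical, and the templates shown (`S`, `m`, `f`, `pron`) are only meant to resemble what French Wiktionary articles contain.

```python
import re

# Matches flat templates such as {{S|nom|fr}}; nested templates are NOT handled.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def parse_template(body):
    """Split a template body into its name, positional and named arguments."""
    parts = [part.strip() for part in body.split("|")]
    name, positional, named = parts[0], [], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(part)
    return name, positional, named


def handle_section(positional, named, entry):
    # Assumption: the first argument of a section template is the POS label.
    if positional:
        entry["pos"] = positional[0]


def handle_masculine(positional, named, entry):
    entry["gender"] = "masculine"


def handle_feminine(positional, named, entry):
    entry["gender"] = "feminine"


# One handler per supported template; everything else is reported as unknown.
HANDLERS = {"S": handle_section, "m": handle_masculine, "f": handle_feminine}


def extract_features(wikitext):
    """Run every recognized template through its handler."""
    entry, unknown = {}, []
    for match in TEMPLATE_RE.finditer(wikitext):
        name, positional, named = parse_template(match.group(1))
        handler = HANDLERS.get(name)
        if handler is None:
            unknown.append(name)
        else:
            handler(positional, named, entry)
    return entry, unknown


if __name__ == "__main__":
    print(extract_features("=== {{S|nom|fr}} ===\n'''chat''' {{pron|ʃa|fr}} {{m}}"))
```

Collecting the unknown template names is a cheap way to decide which handler to write next: counting them over the whole dump shows which templates matter most.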
One main issue is the correctness of the input data: as it is written by humans, it comes with lots of errors, misspellings, and other inconsistencies. I tried to add fixes for the most common ones, but detecting and covering everything would be an excessive task for such a project.
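One possible way to absorb such errors, not necessarily the one used in the project, is to map each label onto a known vocabulary and to fall back on fuzzy matching for near misses. The vocabulary, the alias table and the cutoff below are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical canonical POS labels and hand-written fixes.
KNOWN_LABELS = ["nom", "verbe", "adjectif", "adverbe"]
MANUAL_FIXES = {"adj": "adjectif", "subst": "nom"}


def normalize_label(raw, cutoff=0.8):
    """Return the canonical label for a raw one, or None if nothing is close."""
    label = raw.strip().lower()
    if label in KNOWN_LABELS:
        return label
    if label in MANUAL_FIXES:
        return MANUAL_FIXES[label]
    # Fuzzy fallback: accept only the single best match above the cutoff.
    matches = difflib.get_close_matches(label, KNOWN_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for raw in ["Nom", "adj", "adjectf", "zzz"]:
        print(repr(raw), "->", normalize_label(raw))
```

Returning `None` instead of guessing keeps unrecoverable labels visible, so they can be logged and either fixed by hand or deliberately ignored.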
Once that first parsing is done, the data is cast into an OWL ontology using the owlready2 interface. The ontology is then exported to a file. The whole process takes about half an hour, but requires a lot of free RAM.
3. Visualizing the output
The final step was to implement a way to visualize the output data. In essence, I recreated a minimalist French Wiktionary, but with a highly structured template. All referenced entities are linked and browsable. All relationships are also linked and browsable. Meta-information about the entities can be displayed. A SPARQL endpoint is also included, but it is not available to the public (I wish to preserve my poor Raspberry Pi).
As I usually do, I built a Django application for that. My main concern was RAM consumption, as the server is hosted on a Raspberry Pi with 1 GB of RAM. Fortunately, the SQLite backend of Owlready2 is relatively RAM-friendly. However, features such as indexing (for literal searching and ranking) are not possible. Anyway, you had better see it for yourself at https://chalier.fr/flont.
A second version of the ontology comes with global enhancements and some more features, such as a way to parse and represent etymology data and a more general inflection representation system, allowing for a complete representation of conjugations.
The code is available on GitHub at ychalier/flont.