
The French Lexicon Ontology

Dec. 8, 2020

*The second rundown of my past projects. I still have one in mind before reaching the Present.*

The [French Wiktionary](https://fr.wiktionary.org/wiki/Wiktionnaire:Page_d%E2%80%99accueil) is an immense source of lexical knowledge about the French language. Its community model has allowed it to reach broad coverage of the language, making it one of the best sources available online. Moreover, it is open source. Yet its content is hardly machine-readable, which prevents its use in general applications. We tackle this by parsing the Wiktionary data and populating a handcrafted ontology. Check it out at !

FLOnt Illustration

# Building the ontology schema

First on the list came crafting the ontology. I took a notepad and started figuring out how lexical data should be represented. I wrote down three target features:

1. a part-of-speech taxonomy,
2. given a word, outgoing links towards its synonyms, antonyms, etc.,
3. given a word, outgoing links towards its inflections.

I tried several structures but finally chose to mimic the way Wiktionary already organizes its data. Each **literal** is described in one article, each article is composed of **entries** (usually one entry per part of speech or per global meaning), and each entry may have several **definitions** that differ slightly, along with examples. That led me to the following basic schema (the associated data properties are listed under each class):
Top-level Ontology Classes: Literal, Lexical Entry and Lexical Sense
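To make the three-level structure concrete, here is a minimal sketch using plain Python dataclasses rather than the actual OWL classes. The field names (`label`, `part_of_speech`, `definition`, `examples`) are assumptions made for illustration; the real data properties are the ones shown in the schema figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalSense:
    """One definition of an entry, along with its examples."""
    definition: str  # hypothetical field name
    examples: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    """One entry of an article (roughly one per part of speech)."""
    part_of_speech: str  # hypothetical field name
    senses: List[LexicalSense] = field(default_factory=list)


@dataclass
class Literal:
    """The written form described by one Wiktionary article."""
    label: str  # hypothetical field name
    entries: List[LexicalEntry] = field(default_factory=list)


# A literal with a single noun entry holding a single (placeholder) sense.
epee = Literal(
    label="épée",
    entries=[
        LexicalEntry(
            part_of_speech="noun",
            senses=[LexicalSense(definition="(placeholder definition)")],
        )
    ],
)
```

The nesting mirrors the article layout: one literal per article, several entries per literal, several senses per entry.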
After a quick study of the existing articles within the French Wiktionary, I broke lexical entries down into the following taxonomy, using the `rdfs:subClassOf` property:
Part-of-speech Taxonomy
Information about semantic links and inflections (including conjugations) is represented as object properties between lexical entries and literals. The domains and ranges of those properties are not trivial to define: I chose the simplest representation for WikiText parsing, but it may not be the best one. [I wrote about this here](https://github.com/ychalier/flont/blob/master/TODO.md#re-targeting-flonthaslink-properties).
Object Relations Example for the common word 'épée'
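As a rough illustration of links going from a lexical entry to a literal, the sketch below stores relations as plain (subject, property, object) tuples. The property names `hasSynonym` and `hasInflection`, the `épée#noun` identifier, and the synonym chosen are all assumptions for the sake of the example; they are not taken from the ontology itself.

```python
from collections import defaultdict

# (lexical entry, object property, target literal)
# All identifiers and property names here are hypothetical.
triples = [
    ("épée#noun", "hasSynonym", "glaive"),      # illustrative synonym
    ("épée#noun", "hasInflection", "épées"),    # plural form
]


def outgoing(triples, subject):
    """Group the outgoing links of one entry by property name."""
    links = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            links[p].append(o)
    return dict(links)


links = outgoing(triples, "épée#noun")
```

Keeping literals as the targets means a link can be recorded as soon as the target word is read in the WikiText, without first resolving which of its entries is meant.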
The resulting schema can be found at .

# Parsing the data

The actual parsing needed data first: it came from an XML [dump](https://wikimedia.mirror.us.dev/backup-index.html). I first parsed the XML and loaded it into a SQLite database, so that small queries would run quickly during a preliminary data study.

The data is then parsed into a Python representation. This involves reading WikiText strings. Fortunately, a module, [`wikitextparser`](https://github.com/5j9/wikitextparser), already exists for that. What it does not do is handle the various templates specific to Wiktionary (which specify all the relevant data, such as POS or gender). The task is therefore very grindy and involves a lot of very specific bits of code. One main issue is the correctness of the input data: as it is written by humans, it comes with lots of errors, misspellings, and other inconsistencies. I added fixes for the most common ones, but detecting and covering everything would be an excessive task for such a project.

Once that first parsing is done, the data is cast into an OWL ontology using the [`owlready2`](https://pythonhosted.org/Owlready2/) interface. The ontology is then exported to a file. The whole process takes about half an hour, but requires a lot of free RAM.

# Visualizing the output

The final step was to implement a way to visualize the output data. In essence, I recreated a minimalist French Wiktionary, but with a highly structured template. All referenced entities are linked and browsable. All relationships are also linked and browsable. Meta-information about the entities can be displayed. A SPARQL endpoint is also included, though it is not available to the public (I wish to preserve my poor Raspberry Pi).

As I usually do, I built a [Django](https://www.djangoproject.com/) application for that. My main concern was RAM consumption, as the server is hosted on a Raspberry Pi with 1 GB of RAM.
Fortunately, using the SQLite backend of Owlready2 is relatively RAM-friendly. However, features such as indexing (for literal searching and ranking) are not possible. Anyway, you had better see it for yourself at .

# Update

A second version of the ontology comes with global enhancements and some more features, such as a way to parse and represent etymology data and a more general inflection representation system, allowing for a complete representation of conjugations.

The code is available on GitHub at [ychalier/flont](https://github.com/ychalier/flont).
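
For readers curious about the first step described under "Parsing the data" (loading the XML dump into SQLite), here is a minimal sketch using only the standard library. It is not the project's actual code: the table name `articles`, its columns, and the helper names are assumptions, and the sample XML is a tiny stand-in for a real dump.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def load_dump(xml_source, connection):
    """Stream <page> elements from a MediaWiki-style dump into SQLite."""
    # Hypothetical table layout: one row per article.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(title TEXT PRIMARY KEY, wikitext TEXT)"
    )
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if local_name(element.tag) != "page":
            continue
        title, text = None, ""
        for child in element.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if title is not None:
            connection.execute(
                "INSERT OR REPLACE INTO articles VALUES (?, ?)",
                (title, text),
            )
        element.clear()  # release the children of the processed page
    connection.commit()


# Tiny stand-in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>placeholder</title>
    <revision><text>== {{langue|fr}} ==</text></revision>
  </page>
</mediawiki>"""

connection = sqlite3.connect(":memory:")
load_dump(io.BytesIO(SAMPLE), connection)
rows = connection.execute("SELECT title, wikitext FROM articles").fetchall()
```

Streaming with `iterparse` and clearing each page after it is stored avoids holding the whole dump in memory, which matters for a file of that size.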