Research on improving RimWorld's text-generating system

Neitronik · February 09, 2020, 12:40:52 AM

Hello, great people!

I'm a fan of RimWorld, just like all of you.
And I want RimWorld to generate beautiful text in languages where one word can be written in many variations, depending on the context.

I found some posts on this forum which are very similar to mine. People have different problems which could be solved by just one proper concept.
The French guy: https://ludeon.com/forums/index.php?topic=43979.msg456809#msg456809
The Russian guy: https://ludeon.com/forums/index.php?topic=49993.msg468309#msg468309

English isn't my native language, so my explanations could seem rather strange to ya...
So let me first define some simple terms for the sake of clarity. I'll try to explain them in my own words and as simply as I can. Please use your imagination at full power!

a Word - is a sequence of letters (or "chars", if you are a programmer) that we usually separate with "space" symbols, and which has a meaning. (Let us just pretend for 10 minutes that articles and prepositions are not Words).
a Lexeme - is a set of similar Words which we can unite under the same meaning and say: "These words mean exactly the same thing". An English example: the set "eat", "eats", "ate", "eaten" is the Lexeme for the meaning "EAT", and this is an English verb in different forms. A French example: the set "cheval", "chevaux" is the Lexeme for the meaning "CHEVAL", and this is a French noun in singular and plural forms.

(Let me spell these terms starting with Capital Letters, to distinguish them from other words).

Now let's imagine how else sentences could be generated. I have some experience in modding Civilization V, and there I learned a very simple and effective way of creating perfect sentences. That game uses an alternative concept, which is very easily understood by translators. But it requires more complex text parsing.

Here is an example of German translation files:

Here you can see the text keys, which are conceptually similar to the Keyed part of the RimWorld localization system. Please note that the entities in this picture actually are Lexemes. Within the "Text" tag you can see the full sets of Words of which those Lexemes consist. Words are separated by the "vertical line" symbol.
Thus, the meaning "adjective for France" could be represented in German as the set of Words "französisch", "französischen", "französische", "französischer", "französisches".

But in that xml-file we can also find this:

And it's something more than Lexemes. For a programmer these are just arrays of strings, of course, but linguists could say that they are word combinations. And you know, Mr. and Mrs. Linguists would be right, but the term they use wouldn't work well for us. I suggest creating new terms for this:

a Strings Enumeration - is just an array of strings. And nothing more.
a Word Chain - is two or more Words which are combined into one string, and which could be used as one element of a Strings Enumeration array.

So, now we can say: "Civilization V uses Strings Enumerations for translation purposes". And we can also say: "Strings Enumerations could represent Lexemes as well as different forms of Word Chains".
Here is an example of using Strings Enumerations:

The game just doesn't care what each string in Enumerations means. But translators do! Translators define content of Strings Enumerations by yourselves, and use this content to define other text-strings. They simply specify the index along with the string placeholder (in square brackets). This index is one-based, and the game uses it to extract the proper string from an Enumeration.
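
To make the one-based lookup concrete, here is a minimal Python sketch. The key name and the [KEY:index] placeholder syntax are assumptions made up for illustration; the real Civilization V file format may look different.

```python
import re

# Hypothetical data: one Strings Enumeration per text key, forms separated by "|".
ENUMERATIONS = {
    "TXT_KEY_ADJ_FRANCE": "französisch|französischen|französische|französischer|französisches",
}

# Assumed placeholder syntax: [KEY:index], where index is one-based.
PLACEHOLDER = re.compile(r"\[(\w+):(\d+)\]")


def resolve(text, enumerations):
    """Replace every [KEY:index] placeholder with the chosen form."""
    def pick(match):
        key, index = match.group(1), int(match.group(2))
        forms = enumerations[key].split("|")
        return forms[index - 1]  # one-based index, as described above
    return PLACEHOLDER.sub(pick, text)


print(resolve("die [TXT_KEY_ADJ_FRANCE:3] Armee", ENUMERATIONS))  # die französische Armee
```

The game side stays completely ignorant of grammar here: it only splits a string and picks an element, and the translator decides what each position means.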

Moreover, as you may notice, each Strings Enumeration has attributes, such as "Gender" and "Plurality" (if they are omitted, they just take default values). For the game these attributes still mean nothing. But for translators these attributes mean a lot. Translators specify these attributes by themselves, and then use them in other text-strings, like this:

And even more, as you can see now, this concept allows you to resolve articles.

All this works for translations because words in sentences can depend on other words. And languages differ in the strength of these dependencies. For example, English has very weak dependencies: it doesn't even divide all Words by Gender (as French and Spanish do). German is more complex because of the varied endings of Words: for example "gut", "gute", "guter", "gutes", "gutem" is the Lexeme "GUT". Endings of words, as well as the articles around them, indicate the form of dependency which was used. In German almost every singular noun can be written in four forms, in Russian - in six forms, and in some other languages - even more.

The stronger the dependencies a language uses, the more words are involved in mutual transformations as a result. Nouns can be linked with adjectives, with verbs, and even with other nouns. So if you replace one main word in a phrase with another word, you could cause a domino effect, and all words in this phrase could change because of the strong dependencies. And in the end you may get a completely different string (which could still have the same or a similar meaning).
It's important that each Word Chain obeys the laws of word inflection used by the language, and each language defines its own rules for this.

Unfortunately, some languages don't allow you to use simple mathematical formulas to just "calculate" the proper forms of words for Word Chains. But Word Chains nicely suit the Strings Enumeration concept. And now we can stop pretending that articles and prepositions are not words.

Of course they are, and you can use them within a Word Chain as well.

But how could this concept be adapted to RimWorld's text-generating system?
I thought about all this for several days, and developed a new model of text generation. Please, check it out! ^^

Before I begin, let's refresh our memory.
This is how the current algorithm works:

This algorithm is a sorta recursive tree crawler (or jumper?), which searches for nodes in a flat list. The scope of nodes to search should be formed before this algorithm runs, and you also need to specify the name of the root node from which you want to start. Besides, the algorithm can use a list of constants containing some information about pawns or any other information.
All these input parameters can be called the "context" of the algorithm, because that's what they actually are. So, this algorithm works within a context. And its context is constant during the algorithm's execution, because the algorithm doesn't allow changing any property of its context. (The only thing that changes is the randomizer, but it shouldn't depend on the translator's will).
As you can see in the picture above, you are allowed to retrieve any property from the context, at any node. And it's a very useful paradigm. But think for a minute: what if we could directly affect the algorithm's execution by changing its context? It would give translators much more power over text generation, for the sake of great translation quality!

Disclaimer: You don't need to change the scope of the search; it's the worst method, I think. Instead you need program-like features to control the algorithm's execution context. Speaking of programs: if you pass something in, you are always able to retrieve what it returns to you. And that is what we need.

Here you can see how node processing could work within this conception:

About the first example:
"Current context" is the context of the node currently being processed, and it should be passed to each independent child node as is. The independent child nodes are only B, C and F, because they lack the "d" modifier. Because d is for dependency, as you could guess.

When the B node is processed, it returns another context object, which is required by nodes A, D and E, and then they can be processed using the new context object. As you could guess, it is possible to process child nodes in any order where B stands before A, D and E, because nothing else matters. Actually, any node returns a context object, but the algorithm just doesn't pass the returned context object anywhere if no other child node requires it. It is important that the returned context object encapsulates a result string, which was made during [placeholder] resolving. We could use this string to perform some checks; they are described later below.
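
The ordering constraint from the first example can be sketched in Python as follows. This is only an illustration: the function name and the dictionary that records "who depends on whom" are my assumptions, and the real crawler could order its nodes in some other way.

```python
def processing_order(children, depends_on):
    """Order child nodes so that each node comes after the node it depends on.

    children   - node names in the order they are written in the entry
    depends_on - hypothetical map: dependent node -> node whose context it needs
    """
    ordered, placed = [], set()
    pending = list(children)
    while pending:
        progressed = False
        for name in list(pending):
            dependency = depends_on.get(name)
            # Independent nodes (no "d" modifier) can be processed right away.
            if dependency is None or dependency in placed:
                ordered.append(name)
                placed.add(name)
                pending.remove(name)
                progressed = True
        if not progressed:
            raise ValueError("circular or unresolved dependency")
    return ordered


# A, D and E need the context returned by B; B, C and F are independent.
order = processing_order(list("ABCDEF"), {"A": "B", "D": "B", "E": "B"})
print(order)  # ['B', 'C', 'D', 'E', 'F', 'A']
```

Any other order with B ahead of A, D and E would be equally valid; the output above is just what this particular loop produces.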

About the second example:
Here you can see the proposed way to change the context on the fly. I called it "feedback"; maybe it isn't a very suitable word, I'm sorry if so, please have patience until I finish.

The feedback clause should be specified at the end of a node entry. Thanks to the symbols ")^" at the tail of the string, you can quickly and easily check for the presence of a feedback clause. Now let me explain what a "context formula" is... This notation just looks like a formula, and nothing else.

So I like this new term xD Nevermind. With this notation you can describe the context: c is for grammatical Case, g is for grammatical Gender and q is for Quantity. A context could carry much more information, depending on the language. These three grammatical categories would be enough for English, I think. So, when the algorithm has processed all child nodes of the current one (or when the current node just has no children), it can process the feedback clause, as the last action at the current node. At this step, the algorithm copies the current context and applies the feedback modifiers to this copy. It's very important to provide stack-like copying behaviour for context objects, so the context object type should be defined as a struct.
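
The copy-then-modify behaviour can be sketched in Python with an immutable dataclass standing in for a struct. The field defaults (apart from c=1 and q=1) and the apply_feedback name are assumptions for illustration only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Context:
    c: int = 1      # grammatical Case
    g: str = "n"    # grammatical Gender (this default is an assumption)
    q: int = 1      # Quantity
    text: str = ""  # result string encapsulated by the context


def apply_feedback(context, **modifiers):
    """Return a modified copy; the caller's context stays untouched."""
    return replace(context, **modifiers)


current = Context()
returned = apply_feedback(current, q=2, text="two")

print(current.q, returned.q)  # 1 2
```

Because the original object can never be mutated, a parent node keeps its own context no matter what its children return, which is the value-type behaviour the struct requirement is after.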

About the third example:
Here you can see a simplified scheme of how one context could be replaced by another while processing nodes. The Quantity property of the initial context is 1. While processing the [Quantity_adjphrase] node, the crawler chooses which node to jump to next. When it chooses "two", it gets a modified context with (q=2). After that the crawler returns to the previous node and passes this context object to the node [appear]. It should perform checks to find all nodes with compatible formulas, and then decide which one to jump to. Remember that we got (q=2) previously, so within this example it can only jump to (q=2..m) - this means that the value of q must be in the specified range, inclusive. And m is for "many"! Or "maximum". Or positive infinity if you like... A large but unknown quantity, okay. As a result, processing of [appear] returns the string "appear", which is suitable for "two".
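
Here is a rough Python sketch of the compatibility check for formulas such as (q=1) and (q=2..m). The parser only covers integer values and inclusive ranges, and treating m as infinity is one of the readings offered above; everything else about the function names and data layout is assumed.

```python
import math


def parse_formula(formula):
    """Parse "(q=2..m;c=1)" into {"q": (2, inf), "c": (1, 1)}."""
    conditions = {}
    body = formula.strip("()")
    for part in filter(None, body.split(";")):
        key, value = part.split("=")
        low, _, high = value.partition("..")
        high = high or low  # a single value is a range of one element
        conditions[key] = tuple(
            math.inf if bound == "m" else int(bound) for bound in (low, high)
        )
    return conditions


def satisfies(context, formula):
    """True if every key in the formula is within its inclusive range."""
    return all(
        low <= context[key] <= high
        for key, (low, high) in parse_formula(formula).items()
    )


candidates = {"(q=1)": "appears", "(q=2..m)": "appear"}
context = {"q": 2}
matches = [text for formula, text in candidates.items() if satisfies(context, formula)]
print(matches)  # ['appear']
```

If several candidates were compatible, the crawler would still have to choose among them, for example by weight; that step is left out here.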

Please also notice that we should use w to specify a node's probability instead of p, because p is for grammatical Person now (and w is for weight, which is similar to probability).

Perhaps you want to ask me why I didn't use the "!=" operator or the ">" operator. Well, I think that these two make a formula less clear to read. And the same goes for the "-" sign: I prefer double points ".." because of the "->" operator. Of course, this notation is not ideal, so we have a lot of topics for discussion. But let me finish my presentation first, please.

Maybe you think that the symbol @ means a link to a child node in the current context. But it isn't so: this symbol means a link to a named context object which was returned by a child node, and which you can use anywhere within the current node. But some context objects could be global; in the previous example we have two - HUMAN, which you can specify as @1, and ANIMAL, which you can specify as @2. Maybe it's better to use @HUMAN and @ANIMAL instead, and in that case we could start local link names from @1... I dunno. It makes sense, but I already made these images, so we should go with what we have for now, sorry ^^

Here you can see examples illustrating how else you can specify dependencies:

In addition to the examples above, you can also select modifiers from other contexts, for example:

<li>current_node->[@3;child_A] [@4;child_B] [child_C(g=@3;d=@4)]</li>
This means that you want the algorithm to pass @4 to C, applying the gender selected from @3.

This notation also works fine within a feedback clause:

<li>current_node->[@3;child_A]=>(d=@3)^</li>
This means that you want the algorithm to return @3 instead of the current context.

<li>current_node->[@3;child_A]=>(g=@3)^</li>
This means that you want the algorithm to select the gender modifier from @3 and apply it to the current context before returning.

<li>current_node->[@3;child_A] [@4;child_B]=>(g=@3;d=@4)^</li>
And this means that you want the algorithm to select the gender modifier from @3, apply it to @4, and then return the modified @4 instead of the current context.
And so on
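
The three feedback variants above can be mimicked with a small Python sketch. The Context fields, the feedback function and the sample French words are all hypothetical; only the meaning of g=@N and d=@N follows the descriptions above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Context:
    g: str = "m"    # grammatical Gender
    q: int = 1      # Quantity
    text: str = ""  # result string


def feedback(current, named, g=None, d=None):
    """Build the context a node returns from its feedback clause.

    named - contexts returned by child nodes, keyed by link name ("@3", ...)
    d     - link whose context replaces the current one
    g     - link whose gender is copied onto the returned context
    """
    result = named[d] if d is not None else current
    if g is not None:
        result = replace(result, g=named[g].g)
    return result


current = Context(g="m", q=1, text="")
named = {"@3": Context(g="f", text="lame"), "@4": Context(g="m", q=2, text="deux")}

print(feedback(current, named, d="@3"))          # returns @3 as is
print(feedback(current, named, g="@3"))          # current context with g="f"
print(feedback(current, named, g="@3", d="@4"))  # @4 with gender taken from @3
```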

Let's see now what's in the files:
(please, look under the tags "file_Enemy" and "file_Weapon")

Within this concept you should use the xml format for random-entries files. It's very useful even for English, because every language has its irregularities. As you can see, you can specify all proper forms of words (or Word Chains) by using Strings Enumerations. The crawler extracts the needed string from an Enumeration using the Case property of the current context: this property works here like the index in Civilization's approach. Also, you can use formula checking for your list of forms. Please notice that I used the wildcard formula after (q=1). That's possible because the situation is different here: in this case the algorithm doesn't need to make a choice, it just scans all formulas from top to bottom and takes the first one satisfying the context. So you can use a wildcard here as an "else" clause.

So, when the crawler goes into a file, it makes a random choice of which entry to scan. Then it scans all formulas in this entry and finds the first which satisfies the current context. And then the crawler "jumps" into it, just like into a regular node. Do you remember the example with [Quantity_adjphrase]? The crawler already got (q=2), and c is 1 by default, so the algorithm takes the first case from the second Strings Enumeration. As a result, this node returns a new context which encapsulates the string "brigands". A feedback clause shouldn't be used in files; you should specify feedback modifiers within the special <feedback/> tag instead. I think that English doesn't need any feedback for nouns, because it lacks grammatical gender. But it would be extremely useful for languages such as French, Spanish and German, and also for many Slavic languages.
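
The "first satisfying formula, then index by Case" lookup could look like this in Python. The entry layout, the use of None as the wildcard and the possessive forms in the second position are assumptions added for illustration.

```python
# Hypothetical file entry: each row is (condition, Strings Enumeration).
# A condition of None plays the role of the wildcard ("else") formula.
BRIGAND = [
    ({"q": 1}, "brigand|brigand's"),
    (None, "brigands|brigands'"),
]


def pick_form(entry, context):
    """Scan rows top to bottom, take the first match, then index by Case."""
    for condition, enumeration in entry:
        if condition is None or all(
            context.get(key) == value for key, value in condition.items()
        ):
            forms = enumeration.split("|")
            return forms[context.get("c", 1) - 1]  # c is one-based, default 1
    raise LookupError("no formula satisfies the context")


print(pick_form(BRIGAND, {"q": 1}))          # brigand
print(pick_form(BRIGAND, {"q": 2, "c": 1}))  # brigands
```

Since the scan stops at the first match, the order of rows matters: the wildcard has to come last, or it would shadow the (q=1) row.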

Here is an example for French:

And for German:

And you no longer need to create separate files for masculine and feminine words, hahaha

You can also notice that node names are missing before the -> operator. That's because you just don't need to name nodes in files. But you can call Language Global Rules from here. What for? And what rules?

I also propose to use a new format instead of "Grammar.xml". Check it out please:

In this file you can define special rules which you need to call from everywhere, such as: rules for using articles, rules for replacing or adding letters (a/an), pronouns and much more. This section works on the same principle as the scanning of forms in files: the algorithm scans these rules from top to bottom and takes the first one satisfying the context. So wildcards look pretty nice here, I think. Though English actually uses grammatical Person, I think that we can go without it. So, in the example above you can see rules for the first personal pronoun, for the second one, and for the third one separately. Besides, it looks more readable than this chain:

Though this variant also satisfies the concept, we lose the self-explanatory node names - [third_pro], [second_pro], [first_pro]. In most situations we know exactly which grammatical Person and grammatical Case we need, so we don't need such a level of abstraction for English, I think.

The greatest thing about all of this is adaptability. Translators could specify all rules here and use them. Also we could use special sub-sections for writing these rules, for optimization purposes. For example, the feedbackDependencies section of languageGlobalRules could be checked by the crawler first of all when it passes context between child nodes of the current one.

In the picture you can also see examples of additional checks of the current context's string (I mentioned them somewhere above), such as: "StartsWithConsonant", "StartsWithVowel", "StartsWith". The latter is very useful for checking char sequences. I think there could also be "EndsWith", "EndsWithConsonant" and "EndsWithVowel". And maybe many more.
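
A global rule built on such string checks might be sketched in Python like this. The rule table and function names are hypothetical, and the letter-based vowel test is a simplification: real English a/an follows pronunciation ("an hour", "a unicorn"), which is exactly the kind of irregularity extra rules would have to cover.

```python
VOWELS = "aeiou"


def starts_with_vowel(text):
    """Letter-based stand-in for the StartsWithVowel check."""
    return bool(text) and text[0].lower() in VOWELS


# Rules are scanned top to bottom; the last one acts as the wildcard.
INDEFINITE_ARTICLE_RULES = [
    (starts_with_vowel, "an"),
    (lambda text: True, "a"),
]


def indefinite_article(text):
    """Return the article of the first rule whose check accepts the string."""
    for check, article in INDEFINITE_ARTICLE_RULES:
        if check(text):
            return article


phrase = "knife-wielding brigand"
print(indefinite_article(phrase), phrase)  # a knife-wielding brigand
print(indefinite_article("armed brigand"), "armed brigand")  # an armed brigand
```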

You can define checks using small letters only, thanks to the uppercase and lowercase strings provided (the game could build extended rules from these implicitly when it performs loading, for optimization purposes).

So, these special Global Rules are sorta a library of utility methods for translators, whose methods could be requested from anywhere. I presume that this concept could even replace the concept of LanguageWorker's child classes, thanks to its great independence from the game developers' code. But maybe I just miss something...

Now I want to return to the example with [Quantity_adjphrase]. After all these explanations I can show you the continuation of that story.

Here you can see the whole list of rules which crawler will use within my example:

As you can see, two strings could be generated in this example:
For (q=1): a knife-wielding brigand appears in the upper part of the image
And for (q=2): two knife-wielding brigands appear in the upper part of the image

And I find it great ^^ Yaaay!

Neitronik · February 09, 2020, 12:49:33 AM

Now let's see some more useful things! Here you can see the usage of the global context objects, which are HUMAN and ANIMAL in my previous example:

I really believe that we need rules to be spread over the whole translation system. Many languages use declension of personal names, and English does too:

This example is a bit messy, because I just haven't thought yet about how it may look better. Sorry.

When we describe a pawn or any other object in RimWorld we should use grammar rules, because all translators should x) Thus, all objects, including pawns, should provide proper feedback modifiers for the language context.

About plural forms. It is very important to pass quantity as an integer number, not as an enum. Because some languages use not only singular and plural noun forms, but also a dual (for two), and even a trial (for three). Moreover, some languages, for example Slavic ones, use paucal number systems, where ranges can be useful but, unfortunately, still not enough. I think this problem could be solved with this approach:

The notation (q#=#) means "check that separate digits satisfy the specified mask". (q##=1#) means that the second digit from the end of the number should be equal to 1. Also, please notice that I introduced a new special attribute pf; that's why it's important to allow translators to specify context formulas by themselves. They could use their own terms and grammatical categories, which are specific to their language. For the game this formula just means "please, create a dictionary with these string keys and values". I think that all user-defined values should also be of the string type. But keys c, d and q should be reserved (as well as "w", which means weight), because of their special interpretation. Maybe it would also be better to reserve an additional key qq for another integer (as well as q). You know, some foreign languages could be extremely strange.
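
One possible reading of the digit mask, sketched in Python. Aligning the mask to the last digits and padding short numbers with zeros are my assumptions, as are the rule table and the "one"/"few"/"many" labels; the rules follow the usual Russian pattern (1 год, 2 года, 5 лет, 11 лет, 21 год) and assume non-negative integers.

```python
def mask_matches(number, mask):
    """Compare the trailing digits of number with mask; "#" accepts any digit."""
    digits = str(number).zfill(len(mask))[-len(mask):]
    return all(m in ("#", d) for m, d in zip(mask, digits))


# Top-down rules, first match wins; None is the wildcard.
# "1#" catches 10..19 in the last two digits before the single-digit masks run.
PLURAL_RULES = [
    ("1#", "many"),
    ("1", "one"),
    ("2", "few"),
    ("3", "few"),
    ("4", "few"),
    (None, "many"),
]


def plural_form(number):
    """Return the label of the first rule whose mask fits the number."""
    for mask, form in PLURAL_RULES:
        if mask is None or mask_matches(number, mask):
            return form


for n in (1, 2, 5, 11, 21, 112):
    print(n, plural_form(n))
```

Ranges alone cannot express this, because 21 behaves like 1 while 11 behaves like 5; checking individual digits is what makes the table short.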

I don't know any language for which a qq integer would be useful, but who really knows?
Here is an alternative variant of the previous example for Russian:

I really dunno what is better. Maybe we'd better use LanguageWorker's child classes for calculating this? u___u Russian_LanguageWorker does this fine.

And to conclude this looong presentation, I can show you a comparison overview of the current rules and the proposed ones.

Thank you veeeeery much for reading all of this ^^ I'm sorry for my wall of text..
I hope my research will be helpful for all of us.
If you have questions, please be patient, I can't visit the forum every day :/

With best regards,
Neitronik

Neitronik · February 09, 2020, 12:59:11 AM

Quote: The game just doesn't care what each string in Enumerations means. But translators do! Translators define content of Strings Enumerations by yourselves

I mean "by themselves", of course /)___u