The tokenize function / limitation

KaBe · 11 March 2022 09:36

Hello,
let’s take the following description inside a logical component of our system.

Using the tokenize function:

selection.eAllContents(la::LogicalComponent)->at(14).description.tokenize('///')

to obtain 3 sentences does not work well, for a simple reason: there is a “/” inside the first sentence and a “//” in the third one, so I obtain 5 sentences instead of 3.

My question:
Why does .tokenize('///') also behave like .tokenize('//') and .tokenize('/')?
Isn’t it possible to have a tokenize function that separates text only where it has seen “///”? (Meaning it should do nothing when it sees “/” or “//”.)

Thanks again.

Edit: It’s especially bothersome because HTML tags contain “/” (such as <b/>, etc.), and the tokenize function would not work on the type returned by fromHTMLBodyString(), but that’s not the only problem expressed here.

YvanLussaud · 11 March 2022 11:18

The tokenize() service uses the Java StringTokenizer class. This class treats every character of the given delimiter string as a delimiter in its own right, which is why you see this behavior. If you have control over the delimiter in the description, you can change it to some Unicode character that you are not likely to encounter in the description itself.
We should probably add a service that uses String.split(), which takes a regular expression as a parameter.
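
To make the difference concrete, here is a minimal plain-Java sketch (not the actual AQL service code) comparing StringTokenizer with String.split(). The class name, the helper names and the sample description are made up for illustration; only StringTokenizer, String.split() and Pattern.quote() are real JDK APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class TokenizeDemo {

    // StringTokenizer treats every character of `delimiters` as a separate
    // one-character delimiter, so "///" is equivalent to "/".
    static List<String> tokenizePerChar(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Splits only where the whole delimiter string occurs. Pattern.quote()
    // makes String.split() read the delimiter literally, not as a regex.
    static List<String> splitOnWholeDelimiter(String text, String delimiter) {
        return Arrays.asList(text.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        // Hypothetical description: "/" in the first sentence, "//" in the third.
        String description = "Input/output unit.///Second sentence.///See // comments.";
        System.out.println(tokenizePerChar(description, "///").size());       // 5 tokens
        System.out.println(splitOnWholeDelimiter(description, "///").size()); // 3 tokens
    }
}
```

With this sample text the per-character tokenizer cuts at every “/” and yields 5 pieces, while the split on the whole “///” yields the 3 sentences that were expected.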

YvanLussaud · 11 March 2022 11:47

You could try something like this as a workaround:

description.replaceAll('///', '\u2600').tokenize('\u2600')

You might want to change the Unicode character to something else.
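
For reference, the same replace-then-tokenize idea can be sketched in plain Java as follows. This is only an illustration of the workaround, not the AQL implementation: the class name, the method name and the guard against a pre-existing placeholder are assumptions added for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class PlaceholderTokenize {

    // Hypothetical placeholder; any single character absent from the text works.
    static final String PLACEHOLDER = "\u2600";

    static List<String> tokenizeOnTripleSlash(String description) {
        // The trick breaks if the placeholder already occurs in the text.
        if (description.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("placeholder already occurs in the text");
        }
        // Step 1: collapse each literal "///" into the single placeholder character.
        String replaced = description.replace("///", PLACEHOLDER);
        // Step 2: tokenize on that one character; "/" and "//" are left untouched.
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(replaced, PLACEHOLDER);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeOnTripleSlash("a/b///c///d//e"));
    }
}
```

Because the delimiter handed to the tokenizer is now a single character, the per-character behavior no longer matters. Note that StringTokenizer skips empty tokens, so two consecutive “///” separators do not produce an empty sentence.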