I'm using a Scanner with a delimiter and I've come across a strange behaviour I'd like to understand.

I'm using this program:

    Scanner sc = new Scanner("Aller à : Navigation, rechercher");
    sc.useDelimiter("\\s+|\\s*\\p{Punct}+\\s*");
    String word="";
    while(sc.hasNext()){
        word = sc.next();
        System.out.println(word);
    }

The output is:

Aller
à

Navigation
rechercher

First, I don't understand why I'm getting a blank token. The documentation says:

Depending upon the type of delimiting pattern, empty tokens may be returned. For example, the pattern "\s+" will return no empty tokens since it matches multiple instances of the delimiter. The delimiting pattern "\s" could return empty tokens since it only passes one space at a time.

I'm using \\s+, so why does it return a blank token?

There is another thing I'd like to understand about the regex. If I change the delimiter to the "reversed" regex:

    sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+");

The output is correct and I get:

Aller
à
Navigation
rechercher

Why does it work this way?

EDIT:

With this case:

    Scanner sc = new Scanner("(23 ou 24 minutes pour les épisodes avec introduction) (approx.)1");
    sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+"); //second regex

I still have a blank token between introduction and approx. Is it possible to avoid it?

I have a feeling that you are causing two delimiter captures in places where there's a blank space followed by punctuation. Why not simply use "[\\s\\p{Punct}]+"? Or am I over-simplifying the problem? – Hovercraft Full Of Eels 37 mins ago
@HovercraftFullOfEels Thanks, your regex is perfect for my needs! I thought that \\s+|\\p{Punct}+ (I started with this one, didn't mention it) did the same as yours, but it doesn't. Why? – alain.janinm 29 mins ago
And I'm still looking for an explanation of the difference between \\s*\\p{Punct}+\\s*|\\s+ and \\s+|\\s*\\p{Punct}+\\s* – alain.janinm 27 mins ago

1 Answer

I have a feeling that you are causing two delimiter captures in places where there's a blank space followed by punctuation. Why not simply use [\\s\\p{Punct}]+?

The regex \\s+|\\p{Punct}+ first matches the whitespace and consumes it as one delimiter, then matches the punctuation as the next delimiter. That gives two delimiters next to each other with nothing in between, which the Scanner returns as an empty token.
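
To compare the delimiters side by side, here is a minimal sketch that runs the question's input through each regex and prints the token list. The class name `DelimiterDemo` and the helper `tokens` are made up for illustration, and the `à` is written as `\u00e0` only so the file compiles under any source encoding. The second and third patterns should reproduce the two outputs shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {

    // Hypothetical helper: collects every token the Scanner returns
    // for the given delimiter regex, including empty tokens.
    static List<String> tokens(String input, String delimiter) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(delimiter);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00e0" is the letter 'à' from the question's input.
        String input = "Aller \u00e0 : Navigation, rechercher";
        String[] delimiters = {
            "\\s+|\\p{Punct}+",          // whitespace and punctuation as separate delimiters
            "\\s+|\\s*\\p{Punct}+\\s*",  // first regex from the question
            "\\s*\\p{Punct}+\\s*|\\s+",  // "reversed" regex from the question
            "[\\s\\p{Punct}]+"           // one character class, as suggested above
        };
        for (String d : delimiters) {
            // List.toString() shows an empty token as nothing between two commas.
            System.out.println(d + " -> " + tokens(input, d));
        }
    }
}
```

The same helper can be called with the input from the EDIT. With the character class, a run such as `) (` is consumed as a single delimiter, so no empty token should appear between `introduction` and `approx`.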

Thanks a lot. So in my example, the second pattern works because \\s*\\p{Punct}+\\s* already matches ` : ` as a whole, so the \\s+ is never used and there is no blank token. Am I right? – alain.janinm 19 mins ago
@Alain: that sounds about right to me. – Hovercraft Full Of Eels 17 mins ago
Ok thanks a lot for your help! I've learned something today! – alain.janinm 12 mins ago
@alain: quite welcome! – Hovercraft Full Of Eels 12 mins ago
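
The point about alternation order discussed in the comments can be checked without a Scanner. This is a small sketch, assuming only `java.util.regex`; the class `AlternationOrderDemo` and the helper `consumedAt` are hypothetical names. Java tries the alternatives from left to right and keeps the first one that matches, not the longest one, so the two orderings consume different amounts of text at the space before the colon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {

    // Hypothetical helper: returns the text the regex consumes when it must
    // match starting exactly at index `from`, or null if it does not match there.
    static String consumedAt(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(from, input.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "\u00e0" is the letter 'à' from the question's input.
        String input = "Aller \u00e0 : Navigation";
        int start = input.indexOf(" :"); // index of the space just before ':'

        // \s+ is tried first and succeeds on the single space, so ':' is left
        // over and becomes a second delimiter directly after the first.
        System.out.println("[" + consumedAt("\\s+|\\s*\\p{Punct}+\\s*", input, start) + "]");

        // The punctuation alternative is tried first and takes " : " in one go.
        System.out.println("[" + consumedAt("\\s*\\p{Punct}+\\s*|\\s+", input, start) + "]");
    }
}
```

The first call consumes only the space and the second consumes the whole ` : `, which is why only the first ordering leaves two adjacent delimiters and an empty token between them.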