Re: [Midnight Commander] #266: Backward search doesn't find underlined text



#266: Backward search doesn't find underlined text
----------------------+-----------------------------------------------------
  Reporter:  egmont   |       Owner:       
      Type:  defect   |      Status:  new  
  Priority:  minor    |   Milestone:  4.7  
 Component:  mc-core  |     Version:  4.6.2
Resolution:           |    Keywords:       
  Blocking:           |   Blockedby:       
----------------------+-----------------------------------------------------

Comment(by egmont):

 I was thinking on approaches to fix this issue, as well as these other
 forthcoming issues:

 - basic utf8 support (included in the utf8 patch)

 - case insensitive search in utf8 (tricky, not included in the current
 utf8 patch)

 - combination with manpage formats (that is, to be hardcore, have a utf8
 manpage with backspace notation for bold/underlined, and do case
 insensitive forward/backward search in this). Even displaying a utf8
 manpage is not implemented in mc + the utf8 patch, but I have a patch for
 this. However, my patch breaks when it comes to searching highlighted
 text; searching simple text is case sensitive for accented letters; and
 probably horribly breaks for backward search (never tried).

 Basically there are two approaches for text search. Summarizing them:

 - The "always fast" one: Build up a state machine first, based on the
 needle string. Then just go through the whole input once.

 - The simple one: See whether it matches from the given offset. If not,
 try from the next offset.

 Let's see these in detail.

 The "always fast" approach, with the state machine:

 The advantage is: no matter what the needle is and how large it is, the
 file has to be processed only once, constant amount of memory is needed
 and runtime is always linear to the length of the file.

 Hard to implement. By each and every newly added feature it becomes even
 more complicated. Man page format, utf8, case insensitiveness, backwards
 option, and all combinations thereof make it extremely complicated.

 As long as the only "extra" option was utf8, backward could be implemented
 by reversing both haystack and needle. However, as soon as either manpage
 format, or utf8 case insensitive search is a requirement, this no longer
 works, since the state machine would have to be different anyway for the
 reversed string. And there's no point in reversing the whole haystack and
 needle in either cases: it's way easier and faster to simply walk
 downwards in memory.

 The simple solution:

 The fear of the simple solution could be that under some circumstances it
 might be slow. E.g. if both the needle and haystack are a series of letter
 A's then it will require H*N time (H and N stand for haystack and needle
 size). This is worst case in algorithmical sense, but not a typical use
 case.

 Note that with typical user-entered search strings, the run-time is still
 linear (to the filesize), does not grow as you enter longer string. Even
 if the needle is a random string, the average runtime is linear.

 So in real life usage I think that this algorithm is absolutely perfect
 for us. If we were about to write software that was potentially DoS-able
 (such as a public web-based searching library), this would be an issue.
 Within mc, where the user types the short query string, this algorithm is
 perfect.

 Okay, let's suppose we have manpage format, utf8, case insensitiveness and
 all combination of them implemented in forward only mode with this
 algorithm. And we were about to add backward searching capability.
 Reversing the haystack and needle raises endless numbers of headaches
 (different approach for manpage format, different tricks for utf8, no out-
 of-the-box utf8 library usable anymore etc.) And what would we gain with
 it? Absolutely nothing! Why?

 Because what's a "backward search" from the user's point of view, can
 simply be implemented as a loop of forward matches. The inner loop, which
 is a kind of clever strcmp (with features for manpage format, utf8, case
 insensitiveness and what not) can still go forward in memory. Only one
 minor thing has to be changed: the outer loop, which tries out the
 different matching positions, has to go downwards. Absolutely trivial
 change.

 Conclusion:

 We don't need any super clever algorithm. The simplest way of implementing
 search is absolutely perfect.

 Reversing strings is a completely wrong idea. We should get rid of it, and
 simply change the outer loop of _icase_search() to go downwards (and of
 course adjust parameters and such). (Note that _icase_search() is
 currently implemented as a single loop, which sometimes resets the loop
 variable to a previous value, but logically what it does is way easier to
 imagine as two nested loops.) This would automatically fix the original
 bug of this ticket.

 Once it's done (shouldn't be a hard change), I see two independent issues
 that would still need to be solved, both are for the utf8 patch only:
 support case insensitive utf8 search, and support search in utf8 manpage
 format. Forward mode only, luckily. I might implement these on a random
 rainy Sunday afternoon...

-- 
Ticket URL: <www.midnight-commander.org/ticket/266#comment:1>
Midnight Commander <www.midnight-commander.org>
Midnight Development Center


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]