Ticket 2980: Diff viewer has problems with control characters

2980: Diff viewer has problems with control characters

Fixed
Review Board
Component: DiffViewer

Priority: Medium

Type: Defect

Reported by:	pekka.t.*******@gmai*** (Google Code) (Is this you? Claim this profile.) 12 years, 1 month
Unassigned
Closed:	June 1, 2015

What version are you running?
1.7.6

What's the URL of the page containing the problem?
-

What steps will reproduce the problem?
1. Create change in a file that contains Form Feed control character. Make the change BELOW the FF character.
2. Submit review request.

What is the expected output? What do you see instead?
ReviewBoard diff viewer highlites wrong row.

What operating system are you using? What browser?
ReviewBoard is running on Linux. Using Windows7 + Chrome for viewing review requests. Our source control system is SVN 1.7.8 (Windows).

Please provide any additional information below.

Our codebase is ancient. It seems to contain some control characters (besides CR and LF) for example in C source files. ReviewBoard diff viewer seems to struggle with them, I have so far identified at least Form Feed (FF) control character to cause problems. Change highliting gets displaced by one line per FF. When the FFs accumulate, the highlited part can get pretty far from the actual change. See attached picture. I also attached the actual diff.

Files

diff_viewer_form_feed_problem.png

svn_diff.txt

Index: //ReviewBox/aemxmn__mx.c

===================================================================

--- //ReviewBox/aemxmn__mx.c	(revision 408)

+++ //ReviewBox/aemxmn__mx.c	(revision 409)

@@ -498,7 +498,9 @@

                         m_src << ", code:" << m_code << ") IGNORED" << endm;

+This is a change on one line above FF.

+This a change on one line below FF.

 /*======================================================================*/

 /*	mpex_handle_info_msg						*/

/*

#1 satish3

Hi Team,
We are also facing the same issue. Any input on this.

The strange thing is it used to work on ReviewBoard 1.6.3.

Thanks,
Satish

#2 david

Tags:
+ Component-DiffViewer

+	Component-DiffViewer

#3 bar****@gmai***** (Google Code) (Is this you? Claim this profile.)

The cause is that splitlines() behaves differently for unicode vs
regular strings. For unicode, it considers form-feed a line breaking
character and splits on that. But for regular strings it does not.
For example:

41 zara:~/> python -c 'print "A\n\f\nB\n".splitlines()'
['A', '\x0c', 'B'] # good
42 zara:~/> python -c 'print u"A\n\f\nB\n".splitlines()'
[u'A', u'', u'', u'B'] # ate \f and added extra empty elem

References:
http://bugs.python.org/issue12855
http://stackoverflow.com/questions/24453713/python-string-splitlines-removes-certain-unicode-control-characters

This means the lines will be off by the number of ^L chars above it in
the diff and won't match the diff.

The diff ends up getting stored in the DB w/o the ^L chars. Sometimes
it won't patch, and sometimes it does but renders incorrectly.

There are a few places where splitlines() is used on the diff. As an
experiment I modified DiffParser, DiffChunkGenerator,
TextBasedReviewUI, and filter_interdiff_opcodes to use a hacked
splitlines() routine that first swaps ^L with NUL, splits, then swaps
back and that makes things work (attached). Not sure that is the best
strategy, but it ends up being a simple change. A similar approach
would be to cons together a random number or UUID along with
"reviewboard" and use that as the swap string, or use some other
non-printable.

It would be nice to see this fixed, but I'm aware that the group of
Lisp-influenced Emacs-lovers who employ ^L in this manner might not be
a growing population. However, if the RB developers think this is
worth persuing I can spend more time on a fix.

Files:
+
0001-diffviewer-handle-L-characters-in-diffs.patch

+	0001-diffviewer-handle-L-characters-in-diffs.patch

#4 alan.law*********@gmai***** (Google Code) (Is this you? Claim this profile.)

There are a number of popular open-source repositories and projects with e.g. ^L form-feed in their sources. For example, GCC.

#5 dan.sch*******@gmai***** (Google Code) (Is this you? Claim this profile.)

I believe this issue is the root cause of bug 3609.

#6 dccarson

Any movement on this issue?  This is a real bug, even if using ^L in source code is less common than it used to be.  In our case, we have no choice, as the code in question is third-party code, not our own.

Thanks.

#7 david

We're working on it right now, but as you might expect, it's a pretty sticky problem.

#8 pekka.t.*********@gmai***** (Google Code) (Is this you? Claim this profile.)

This seems to be fixed in 2.0.13? Briefly tested - looking good now.

#9 david

Yes, this was fixed in 2.0.13

Status:
+ Fixed

Welcome to the Splat alpha!

2980: Diff viewer has problems with control characters