WTELF. I am trying to – yes, yes, I know – do regexes on dumps of the LJ Inbox, and I just discovered that something Very Weird is going on with the linefeeds.
Mostly everything is fine. Except the part that isn't: the actual message contents seem to have line breaks between, er, lines (at least, everything that renders them thinks they're line ends – View Source in Firefox, more at the command line, BBEdit), except that perl isn't seeing them as line breaks. Perl thinks (okay, while (my $line = <INBOXFILE>) returns) they're all one big happy line with some sort of line break in them. Running a regex that matches just the HTML at the end ($line =~ /<div class=\'actions\'>/) against that string doesn't just return the last line of the message, it returns almost the whole damned message.
To confirm this, I wrote a perl script to simply number the lines of the raw HTML. It returned something like the following (heavily redacted because privacy and not escaping all the damned angle brackets):
522: <td class="item">
523: <div class="InboxItem_Controls"><a href='http://www.livejournal.com/inbox/?page=1&bookmark_off=3320'><img src='../../l-stat.livejournal.net/img/flag_off.gif' width='16' height='18' class='InboxItem_Bookmark' border='0' /></a>
524: <a href="http://www.livejournal.com/inbox/?page=1&expand=3320"><img src="../../l-stat.livejournal.net/img/expand.gif%3Fv=8234" class="InboxItem_Expand" border="0" onclick="return false" /></a>
526: <span class="InboxItem_Title InboxItem_Unread" id="all_Title_3320"><div class='pkg'><div style='width: 60px; float: left;'><img src="../../l-userpic.livejournal.com/19583817/961489" width="50" align="top" /></div><div>Re: (no subject)<br />from HTMLHTMLHTML...
528: <div class="InboxItem_Content" style="display: block;">Yes, exactly that one! Thank you so much. blah blah blah it's an awesome post, and I'm so glad you helped me find it again.
<br />--- I wrote:
<br />> This one? URLgoeshere
<br />> <br />> --- randomusername wrote:
<br />> > Sorry for bothering you, but was it you that wrote the awesome post about the blah blah blah? I can't find it despite googling a lot, so I thought I'd ask... <div class='actions'> <a href='http://www.livejournal.com/inbox/compose.bml?mode=reply&msgid=80595787'>Reply</a> | <a href='http://www.livejournal.com/friends/add.bml?user=randomusername'>Add as friend</a> | <a href='http://www.livejournal.com/inbox/markspam.bml?msgid=80595787' class='mark-spam'>Mark as Spam</a></div></div>
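The numbering script itself isn't shown above. Here is a minimal sketch of something like it, with one addition: it prints carriage returns, linefeeds and any other non-printable bytes as visible escapes, so whatever is sitting between those "lines" shows up instead of silently acting like a line break on screen. The sub names and the filename on the command line are placeholders, not the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Make CR, LF and any other non-printable byte visible as an escape
# sequence, so a bare "\r" prints as \r instead of breaking the line.
sub show_invisibles {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    $text =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $text;
}

# Number the lines exactly as Perl's readline sees them
# (it splits on $/, which is "\n" unless you change it).
sub number_lines {
    my ($fh) = @_;
    my @out;
    while (my $line = <$fh>) {
        push @out, sprintf('%d: %s', $., show_invisibles($line));
    }
    return @out;
}

# Usage: perl number_lines.pl inbox.html   (placeholder filename)
if (@ARGV) {
    # :raw keeps the bytes untouched, so no newline translation hides anything
    open(my $in, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for number_lines($in);
    close $in;
}
```

If the breaks inside the message body print as \r, they are bare carriage returns; if they print as something else, that something else is the culprit.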
What I think should happen is that only the "line" that starts <br />> > Sorry for bothering you etc. should be returned by matching on the attribute of that tag at the end. What I'm getting is everything here marked 528: the whole block from the InboxItem_Content div down to the closing </div></div>.
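For what it's worth, the "only the last line" behaviour can be sketched without fixing the file at all, assuming the in-message breaks are carriage returns or linefeeds (a guess, not something the dump proves). The idea is to let the regex itself refuse to cross a CR or LF on its way back from the actions div. The sub name last_visual_line is made up for the example.

```perl
use strict;
use warnings;

# Return the text from the last CR or LF up to and including the opening
# actions div, i.e. the final visual "line" of the message, or undef if
# the div isn't there.
sub last_visual_line {
    my ($record) = @_;
    return $record =~ /([^\r\n]*<div class='actions'>)/ ? $1 : undef;
}
```

Called on the big string Perl hands back for 528, this should return just the piece starting at the final <br />, not the whole message.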
As always, the questions are:
1) Why is it doing this to me? (Is there something actually different about those linefeeds? My text editor seems to think they're identical.)
2) What is doing this to me? (Is this perl being weird?)
3) How do I make it stop? (Can I convince perl not to do what it's doing and do something that makes more sense?)
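One thing that is certain: Perl's <FILEHANDLE> only splits on whatever is in $/, which is "\n" by default, so any other kind of line break stays inside the string. A plausible but unconfirmed explanation is that the message bodies use bare carriage returns while the rest of the page uses linefeeds. Under that assumption, a minimal sketch of making it stop is to slurp the file, turn every CRLF or lone CR into LF, and only then split into lines. The sub names and filename are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert CRLF and bare CR line endings to LF, so every visual line
# break becomes one that Perl's default line handling recognises.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Return only the lines that contain the opening actions div.
sub lines_with_actions {
    my ($text) = @_;
    return grep { /<div class='actions'>/ } split /\n/, normalize_newlines($text);
}

# Usage: perl actions.pl inbox.html   (placeholder filename)
if (@ARGV) {
    open(my $in, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $html = do { local $/; <$in> };   # undef $/ means "read it all at once"
    close $in;
    print "$_\n" for lines_with_actions($html);
}
```

If the breaks turn out to be something other than CR, the same shape still works; only the pattern in normalize_newlines needs to change.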
Advice welcome. Meanwhile, I'm necessarily just going to live with it. Don't have time to wrestle this alligator on my way across the swamp.