New Lines in Unix vs. Windows

I had a moment of enlightenment today while working with text files. In summary: Unix ends each line with the "\n" escape sequence (line feed), but Windows looks for "\r\n" (carriage return plus line feed).

The files were MIME messages that came into a Unix-based mail server. These files then get moved to a Windows machine for processing. Opening the files in an advanced text editor like Notepad++ revealed no difference, but opening them in Notepad showed little boxes in place of the new lines.

In MIME messages, new lines separate the various items of a header (Content-Type, To, From, Subject). My application could not find all the headers because it could not find all the line breaks.
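As an illustration of the problem, here is a minimal sketch of a header parser that accepts both kinds of line break. HeaderSketch and ParseHeaders are hypothetical names, not part of my application or of any MIME library, and the sketch deliberately ignores folded (multi-line) header values.

```csharp
using System;
using System.Collections.Generic;

public static class HeaderSketch
{
    // Splits a raw header block into name/value pairs, accepting both
    // "\r\n" (Windows) and bare "\n" (Unix) as line breaks.
    public static Dictionary<string, string> ParseHeaders(string rawHeaders)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = rawHeaders.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        foreach (string line in lines)
        {
            // A blank line marks the end of the header section.
            if (line.Length == 0) break;

            // Lines starting with whitespace are folded continuations; skipped in this sketch.
            if (char.IsWhiteSpace(line[0])) continue;

            int colon = line.IndexOf(':');
            if (colon <= 0) continue; // not a "Name: value" line

            string name = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            headers[name] = value; // a repeated header overwrites the earlier value
        }
        return headers;
    }
}
```

A parser that splits only on "\r\n" would see a Unix-formatted header block as one long line, which is exactly the symptom described above.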

A simple fix for this is to replace "\n" with "\r\n" if the file is from a Unix system.

            FileStream fs = File.OpenRead(txtInputPath.Text);
            StreamReader sr = new StreamReader(fs);
            string msgString = sr.ReadToEnd();
            fs.Position = 0;

            // Windows needs \r\n but Unix formats docs with only \n.
            // Collapse any existing \r\n first so it is not doubled into \r\r\n.
            msgString = msgString.Replace("\r\n", "\n").Replace("\n", "\r\n");
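The same conversion can be wrapped in a small helper that is safe to run on text with mixed line endings, or on text that has already been converted. This is only a sketch; LineEndings and ToCrLf are names I made up for the example.

```csharp
using System;

public static class LineEndings
{
    // Converts every line break in the text to the Windows "\r\n" style.
    public static string ToCrLf(string text)
    {
        // First collapse all existing styles ("\r\n", lone "\r") down to "\n",
        // so that already-converted breaks are not turned into "\r\r\n".
        string unix = text.Replace("\r\n", "\n").Replace("\r", "\n");

        // Then expand every "\n" to "\r\n".
        return unix.Replace("\n", "\r\n");
    }
}
```

Because everything is collapsed to "\n" before being expanded, calling ToCrLf twice gives the same result as calling it once.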

For a thorough explanation and history, see "The Absolute Minimum Every Software Developer Absolutely Positively Must Know About Unicode and Character Sets" by Joel Spolsky.

For a simple program that will do the conversion, see ToFroDos.

Converting (text, html, word) to PDF in C#

Here's the problem...

We need an imaging solution that will process e-mail messages and attachments and convert them to PDF or TIF so that they can be stored and viewed through a management system. A single message could consist of any number of documents with varying formats (text, .doc, html, .xls, etc.), and I need to convert anything that can be converted into a PDF document. This all needs to happen on the fly.

The Tools

I am finally putting together a number of tools to use for this task.  First is the MIME processing required to extract the different parts of a message.

Mime4Net is likely going to be my tool for extracting MIME messages. I plan to create text emails as .txt files and HTML emails as .html files. Attachments will be extracted as they are. (Mime4Net is a commercial product; there is also an open source tool called SharpMime that works but is not as refined as Mime4Net.)

The next step is to convert the messages (.txt and .html) and the files.  Right now, I am only really concerned with converting all .txt, .doc, .tif, and any images (.jpg, .gif, etc).  

This is where things get tricky. There are posts all over the web about various tools for this, but I have not found any straightforward solution. In the end, it seems that the only way to get an accurate reproduction of the original document is to open the document in its native application and print it to a PostScript file. Ghostscript comes with a print driver for performing this task. There is a good walkthrough of this at ASPAlliance.

This PostScript file can then be converted to PDF using Ghostscript. Ghostscript is an "interpreter for the PostScript language and for PDF".
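One way to drive this step from C# is to call the Ghostscript console executable directly. The sketch below does that instead of using a wrapper library, so treat it as an illustration rather than my final approach. GhostscriptSketch and its method names are hypothetical, and the path to the executable (for example gswin32c.exe on 32-bit Windows) is something you would have to supply for your own installation.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptSketch
{
    // Builds the command line for a PostScript-to-PDF conversion.
    // -dBATCH and -dNOPAUSE make Ghostscript run without prompting,
    // -sDEVICE=pdfwrite selects PDF output, -sOutputFile names the result.
    public static string BuildArguments(string psPath, string pdfPath)
    {
        return "-dBATCH -dNOPAUSE -sDEVICE=pdfwrite " +
               "-sOutputFile=\"" + pdfPath + "\" \"" + psPath + "\"";
    }

    // Runs Ghostscript and returns its exit code (0 normally means success).
    // gsExePath is an assumption: the full path to the console executable.
    public static int ConvertToPdf(string gsExePath, string psPath, string pdfPath)
    {
        var info = new ProcessStartInfo(gsExePath, BuildArguments(psPath, pdfPath));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process gs = Process.Start(info))
        {
            gs.WaitForExit();
            return gs.ExitCode;
        }
    }
}
```

The paths are wrapped in quotes so that file names containing spaces survive the trip through the command line.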

Finally, I will need to merge all my PDF docs into one file for each message. This can be accomplished using iTextSharp.

The Solution

After some prototyping, I think the entire process for one message is going to look like this:

  • Extract the various message parts
  • Open and print all .txt, .doc, .jpg, .gif files to PostScript
  • Convert PostScript files to PDF using a C# Ghostscript Wrapper
  • Merge PDF files using iTextSharp
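The second step above needs to decide which extracted files are worth sending to the printer. A minimal sketch of that selection, assuming the extracted parts are already on disk and using the extensions from the list; ConversionPlan and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ConversionPlan
{
    // Extensions that will be opened and printed to PostScript.
    // Compared case-insensitively so "Report.DOC" is treated like "report.doc".
    private static readonly HashSet<string> Printable =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".doc", ".jpg", ".gif" };

    public static bool CanPrintToPostScript(string path)
    {
        return Printable.Contains(Path.GetExtension(path));
    }

    // Returns only the extracted files that should go on to the print step,
    // preserving their original order.
    public static List<string> SelectPrintable(IEnumerable<string> extractedFiles)
    {
        var selected = new List<string>();
        foreach (string file in extractedFiles)
        {
            if (CanPrintToPostScript(file)) selected.Add(file);
        }
        return selected;
    }
}
```

Anything not in the set (an .xls attachment, for example) is simply left out of the print step in this sketch.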

More to come in later posts.

== Edit == 

I just found a commercial tool that promises to convert just about anything to PDF: activePDF.

 
