I'm not one for using techonology for the sake of it, but I have a hurdle...
On one side, I have this datasource which has been kindly exported into Excel for me. It contains about 12 columns which, at their most simple, are Page ID's, Titles and Date/Time's. The difficult column is the Page Content column which, at its "peak" is about 26,000 characters!. This is effectively a database of articles.
Now I want to get this into a MySQL database to be used with Drupal (a CMS). I have the PHP code to take the data and thump it in as needed - that was well within my range
Initially, I exported it all as a CSV and went to parse it that way... Hurdle 1 - the articles contain comma's!! Ok, lets try something else? Tabs.. Ah ha!... Ah, no... They contain tabs too (partial HTML). I have now got it exporting into neat XML (even with a nice XML schema too!), BUT... VERY kindly, Excel has decided to tidy up the HTML tags and has converted symbols such as '<', '>' and '&' into their HTML entities (eg, '& g t ;', '& l t ;', etc.. Had to put in spaces as the forum converts them otherwise!). Its also got its knickers in a twist with some other tags too, such as line breaks. I have tried using PHP to replace all the entities and this worked partially - but not entirely.
Has anyone got any idea of how to get this data out of Excel? The "worst" bit is this is just a 100 article test file. The actual file I will need to work with is going to be abuot 1000 articles, so any processes must be automatic. I'm not going to manually look through 1000 articles!!!!
Cheers guys