live from China::Hong Kong S.A.R. (中国::香港特別行政區)

How to parse a certain column from a CSV string using Regular Expressions

2008-06-17, at 15:54 | In computer science | 3 Comments | This article in German

I need to parse the string below using regular expressions. Yes, it must be a regular expression. :)

"123","06/16/2008","","123456","1","1234","This is a title string","4.99","USD","","","","kozen@kozen.de","HK","Hong Kong","210000D1","Individual String","Site string","Moep","","Not required","","","","","","kozen","the bozen","da kozl street 23","","kozmode","12345","23232323232323"

I need a regular expression that matches the content of the field with a given number, e.g.:

  • 13 » kozen@kozen.de
  • 27 » kozen
  • 28 » the bozen
  • 29 » da kozl street 23

That’s how far I got:

"([^"]*)",

which captures the content of one field and yields 123 for the first field. But that is only the first one; I need number 13.
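
To see what this pattern does on its own, here is a minimal sketch using Python's re module. Python is an assumption (the post does not say which tool evaluates the expression), and the sample line is shortened to the first five fields.

```python
import re

# Shortened sample line, same quoting style as the CSV line in the post.
line = '"123","06/16/2008","","123456","1"'

# The pattern from the post: one quoted field followed by a comma.
field = re.compile(r'"([^"]*)",')

first = field.search(line)
print(first.group(1))        # 123

# findall returns every field that is followed by a comma, so the last
# field of the line (which has no trailing comma) is not included.
print(field.findall(line))   # ['123', '06/16/2008', '', '123456']
```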

Another expression is:

("([^"]*)",){13}

which matches 13 times, and the last repetition yields "kozen@kozen.de", including the quotation marks and the comma, which should not be there :(
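
In engines that expose numbered groups, the inner parentheses of this expression form a second group, which holds the bare field content after the last repetition. The sketch below checks that with Python's re module. Python is an assumption, and whether the tool in question lets you pick group 2 is not known from the post.

```python
import re

# First 14 fields of the CSV line from the post.
line = ('"123","06/16/2008","","123456","1","1234",'
        '"This is a title string","4.99","USD","","","",'
        '"kozen@kozen.de","HK"')

m = re.match(r'("([^"]*)",){13}', line)

# A repeated group keeps only what its last repetition matched.
print(m.group(1))   # "kozen@kozen.de",   (outer group: quotes and comma)
print(m.group(2))   # kozen@kozen.de      (inner group: bare content)
```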

Actually, I thought the following should match the expression 13 times, but for some weird reason it does not work:

"([^"]*)",\13

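A likely reason this does not work: a backslash followed by a number is a backreference ("the same text that group N captured"), not a repeat count, and some engines may read \13 as an octal character escape instead. Repetition is written with braces, as in {13}. A small sketch of the difference, assuming Python's re module:

```python
import re

# \1 means "the same text that group 1 captured", not "repeat once".
same_again = re.compile(r'"([^"]*)","\1"')
print(bool(same_again.search('"a","a"')))   # True: both fields are identical
print(bool(same_again.search('"a","b"')))   # False: second field differs

# The pattern has only one group, so Python's re rejects \13 at compile time.
try:
    re.compile(r'"([^"]*)",\13')
    compiled = True
except re.error:
    compiled = False
print(compiled)                             # False
```
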
If anyone has an idea, I would appreciate your thoughts. I have been messing around with this for 6 hours now and googled all over the net, but didn't find a solution. Something 'universal' like the one above would be nice, so that I can just replace the number ('13') with another column number to grab another column's content.

Here are some helpers by the way:

3 Comments »


  1. Thanks a lot to x who solved this issue in a second!

    Here is the example for getting field number 13 without the quotation marks:

    (?:\"[^"]*\",){12}\"([^"]*)\",.*

    I forgot to mention that the CSV line is part of a bigger file, which means there can be lines before and after this CSV line that do not contain any relevant data :)

    Comment by kozen — 2008-06-17 #

  2. Off the top of my head, the simple ^"[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","[^"]*","([^"]*)", would have done the job too. You were already on that track, although it is not as elegant. :)

    Both solutions only work, however, if the "," separator never contains whitespace and empty fields are always written as ,"", and not as ,,

    Comment by fok — 2008-06-17 #

  3. Unfortunately, the field where I have to enter the regular expression is only about 30 characters long. That gets tight quickly if I want position 27 :)

    Comment by kozen — 2008-06-18 #
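
The ideas from the comments can be combined into one short, parameterised pattern. The sketch below is only an illustration under a few assumptions: it uses Python's re module, the helper names field_pattern and extract are made up for this example, and the ^ anchor relies on a multiline mode that the tool in question may or may not offer. Compared with the solution in comment 1, it drops the backslashes before the quotes and the trailing ,.* to save characters.

```python
import re
from typing import Optional

# The CSV line from the post, surrounded by unrelated lines.
CSV_LINE = (
    '"123","06/16/2008","","123456","1","1234","This is a title string",'
    '"4.99","USD","","","","kozen@kozen.de","HK","Hong Kong","210000D1",'
    '"Individual String","Site string","Moep","","Not required","","","",'
    '"","","kozen","the bozen","da kozl street 23","","kozmode","12345",'
    '"23232323232323"'
)
TEXT = "some unrelated header\n" + CSV_LINE + "\nsome unrelated footer\n"


def field_pattern(n: int) -> str:
    """Return a pattern whose group 1 is quoted field number n (1-based)."""
    if n < 1:
        raise ValueError("field numbers start at 1")
    # Skip n-1 quoted fields from the start of a line, then capture the next.
    return r'^(?:"[^"]*",){%d}"([^"]*)"' % (n - 1)


def extract(text: str, n: int) -> Optional[str]:
    """Content of field n on the first line that matches, else None."""
    m = re.search(field_pattern(n), text, re.MULTILINE)
    return m.group(1) if m else None


for n in (13, 27, 28, 29):
    print(n, extract(TEXT, n))

# Pattern length for field 27, relevant when the input box is short.
print(len(field_pattern(27)))

# Limitation noted in comment 2: an unquoted empty field (,,) breaks it.
print(extract('"a",,"c"', 3))   # None
```

Only the number in the braces changes between columns (12 for field 13, 26 for field 27), so the pattern length stays nearly constant regardless of the column.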
