bentglasstube · June 4, 2013 23:31
diff --git a/gistfile1.txt b/gistfile1.txt
 John Holleran: How do I do "Q1 performance (70), refueling 20, pax 20" to "(70),20,20"

 It really depends on what the whole data looks like not just one line, but I'll give it a shot.  I have no idea how (un)familiar you are with regular expressions so I will assume you know nothing to err on the side of overexplanation.

 As a warning, these are Perl regular expressions.  Whatever you are using probably has a different (and shittier) regex engine than Perl does, so some things might be a little bit different for you.  If you tell me what language you are using, I might know more about its particular engine.  If you tell me it is something that costs money like Oracle DB regex, you are on your own, sorry.

 Assuming that the text is mostly the same you could do something like

  ^Q[1-4] performance (\(\d+\)), refueling (\d+), pax (\d+)$

 I'll try to break it down so you can figure out what you need to change for your specific problem.

  ^         Beginning of line
  Q[1-4]    Q followed by a character in 1-4
  (\(\d+\)) The outside parentheses indicate that you want to capture this bit.  The inner parentheses have a \ to escape them so they are treated as a literal parenthesis in your input text.  The \d+ means one or more digits.
  (\d+)     These both just mean capture (as above) one or more digits
  $         End of line
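
 Since I don't know what language you are using, here is a minimal sketch in Python, whose re module accepts this pattern unchanged.  It turns the captured groups into the "(70),20,20" form from the question.  The function name shorten is made up for illustration.

```python
import re

# Same pattern as above; \1, \2 and \3 refer to the three capture groups.
PATTERN = re.compile(r"^Q[1-4] performance (\(\d+\)), refueling (\d+), pax (\d+)$")

def shorten(line: str) -> str:
    """Return "(70),20,20"-style output, or the line unchanged if it does not match."""
    return PATTERN.sub(r"\1,\2,\3", line)

print(shorten("Q1 performance (70), refueling 20, pax 20"))  # (70),20,20
```

 A line that does not fit the pattern exactly (say, one starting with Q5) comes back untouched, which is the price of being this specific.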

 This is the most specific way to solve for the exact string that you gave as an example.  The most general way, which will find just a number in parentheses followed by two more numbers somewhere in the text is as follows:

  (\(\d+\)).*(\d+).*(\d+)

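 Here, the capture groups are the same as before (as you are trying to extract the same data), but the beginning and end of line anchors are gone and the shit between the groups is replaced with ".*".  The . means "any character" and the * means "repeated zero or more times" so ".*" means "anything zero or more times" and might be overly vague for your data.  Also note that ".*" is greedy: on the sample line it swallows as much as it can, so the second and third groups end up capturing "2" and "0" rather than "20" and "20".  Putting "\D+" (one or more non-digits) between the groups instead avoids that.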
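
 To see what "overly vague" means in practice, here is a small Python sketch that runs the general pattern on the sample line and compares it with a variant that uses \D+ (one or more non-digits) between the groups.  The \D+ variant is my own suggestion, not something your data necessarily calls for.

```python
import re

line = "Q1 performance (70), refueling 20, pax 20"

# Greedy .* grabs as much as it can and leaves only single digits for the groups.
greedy = re.search(r"(\(\d+\)).*(\d+).*(\d+)", line)

# \D+ can only cross non-digits, so each group gets its whole number.
non_digit = re.search(r"(\(\d+\))\D+(\d+)\D+(\d+)", line)

print(greedy.groups())     # ('(70)', '2', '0')
print(non_digit.groups())  # ('(70)', '20', '20')
```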

 As an example of a sort of middle ground, assuming your text always says "performance (##), someword ##, otherword ##" you could use the following:

  performance (\(\d+\)), [\w ]+ (\d+), [\w ]+ (\d+)

 Again, the capture groups are the same, but the "[\w ]+" construct is introduced.  The square brackets mean that you are matching a class of characters which you will describe inside.  This was used in the first example to match any of 1-4, but this one matches \w or a space.  \w means a "word character" which is a-z A-Z 0-9 or underscore.  The + after the last bracket means one or more times as in \d+.  Altogether [\w ]+ more or less matches some words with spaces between them.
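
 As a rough Python sketch of the middle-ground pattern applied to several lines: only the first input line comes from the question, the other two are invented to show a different pair of labels and a non-matching line.  The helper name extract is hypothetical.

```python
import re

MIDDLE = re.compile(r"performance (\(\d+\)), [\w ]+ (\d+), [\w ]+ (\d+)")

# Only the first line is from the question; the rest are made-up test data.
lines = [
    "Q1 performance (70), refueling 20, pax 20",
    "Q2 performance (85), ground crew 12, cargo 7",
    "no numbers here",
]

def extract(line):
    """Join the three captured values with commas, or return None if there is no match."""
    m = MIDDLE.search(line)
    return ",".join(m.groups()) if m else None

results = [extract(l) for l in lines]
print(results)  # ['(70),20,20', '(85),12,7', None]
```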

 This got really long and out of hand.  Let me know if you have queries.
	John Holleran: How do I do "Q1 performance (70), refueling 20, pax 20" to "(70),20,20"

	It really depends on what the whole data looks like not just one line, but I'll give it a shot. I have no idea how (un)familiar you are with regular expressions so I will assume you know nothing to err on the side of overexplanation.

	As a warning, these are Perl regular expressions. Whatever you are using probably has a different (and shittier) regex engine than Perl does, so some things might be a little bit different for you. If you tell me what language you are using, I might know more about its particular engine. If you tell me it is something that costs money like Oracle DB regex, you are on your own, sorry.

	Assuming that the text is mostly the same you could do something like

	^Q[1-4] performance (\(\d+\)), refueling (\d+), pax (\d+)$

	I'll try to break it down so you can figure out what you need to change for your specific problem.

	^         Beginning of line
	Q[1-4]    Q followed by a character in 1-4
	(\(\d+\)) The outside parentheses indicate that you want to capture this bit. The inner parentheses have a \ to escape them so they are treated as a literal parenthesis in your input text. The \d+ means one or more digits.
	(\d+)     These both just mean capture (as above) one or more digits
	$         End of line

	This is the most specific way to solve for the exact string that you gave as an example. The most general way, which will find just a number in parentheses followed by two more numbers somewhere in the text is as follows:

	(\(\d+\)).*(\d+).*(\d+)

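	Here, the capture groups are the same as before (as you are trying to extract the same data), but the beginning and end of line anchors are gone and the shit between the groups is replaced with ".*". The . means "any character" and the * means "repeated zero or more times" so ".*" means "anything zero or more times" and might be overly vague for your data. Also note that ".*" is greedy: on the sample line it swallows as much as it can, so the second and third groups end up capturing "2" and "0" rather than "20" and "20". Putting "\D+" (one or more non-digits) between the groups instead avoids that.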

	As an example of a sort of middle ground, assuming your text always says "performance (##), someword ##, otherword ##" you could use the following:

	performance (\(\d+\)), [\w ]+ (\d+), [\w ]+ (\d+)

	Again, the capture groups are the same, but the "[\w ]+" construct is introduced. The square brackets mean that you are matching a class of characters which you will describe inside. This was used in the first example to match any of 1-4, but this one matches \w or a space. \w means a "word character" which is a-z A-Z 0-9 or underscore. The + after the last bracket means one or more times as in \d+. Altogether [\w ]+ more or less matches some words with spaces between them.

	This got really long and out of hand. Let me know if you have queries.