<?xml version="1.0"?>
<!DOCTYPE document [ <!ENTITY sect-num '21'> ]>
<document prev="functions.html" next="hints_and_tips.html" id="$Id$">
<properties>
<title>User's Manual: Regular Expressions</title>
</properties>
<body>
<section name="&sect-num;. Regular Expressions" anchor="regex">
<subsection name="&sect-num;.1 Overview" anchor="overview">
JMeter includes the pattern matching software
<a href="http://attic.apache.org/projects/jakarta-oro.html">Apache Jakarta ORO</a>.
There is some documentation for this on the Jakarta web-site, for example
<a href="http://archimedes.fas.harvard.edu/scrapbook/jakarta-oro-2.0.6/docs/api/org/apache/oro/text/regex/package-summary.html">a summary of the pattern matching characters</a>.
There is also documentation on an older incarnation of the product at
<a href="http://www.savarese.org/oro/docs/OROMatcher/index.html">OROMatcher User's guide</a>, which might prove useful.
<note>As of JMeter 5.5, the regex implementation can be switched from ORO to the JDK-based one
by setting the JMeter property <code>jmeter.regex.engine</code> to a value other than <code>oro</code>.</note>
The pattern matching is very similar to the pattern matching in Perl.
A full installation of Perl will include plenty of documentation on regular expressions - look for
<code>perlrequick</code>, <code>perlretut</code>, <code>perlre</code> and <code>perlreref</code>.
It is worth stressing the difference between "contains" and "matches", as used on the Response Assertion test element:
<dl>
<dt>"contains"</dt><dd>means that the regular expression matched at least some part of the target,
so '<code>alphabet</code>' "contains" '<code>ph.b.</code>' because the regular expression matches the substring '<code>phabe</code>'.
</dd>
<dt>"matches"</dt><dd>means that the regular expression matched the whole target.
So '<code>alphabet</code>' is "matched" by '<code>al.*t</code>'.
</dd>
</dl>
In this case, it is equivalent to wrapping the regular expression in <code>^</code> and <code>$</code>, viz '<code>^al.*t$</code>'.
However, this is not always the case.
For example, the regular expression '<code>alp|.lp.*</code>' is "contained" in '<code>alphabet</code>', but does not "match" '<code>alphabet</code>'.
Why? Because when the pattern matcher finds the sequence '<code>alp</code>' in '<code>alphabet</code>', it stops trying any other combinations - and '<code>alp</code>' is not the same as '<code>alphabet</code>', as it does not include '<code>habet</code>'.
<note>
Unlike Perl, there is no need to (i.e. do not) enclose the regular expression in <code>//</code>.
</note>
So how does one use the modifiers <code>ismx</code> etc. if there is no trailing <code>/</code>?
The solution is to use extended regular expressions, i.e. <code>/abc/i</code> becomes <code>(?i)abc</code>.
See also <a href="#placement">Placement of modifiers</a> below.
</subsection>
<subsection name="&sect-num;.2 Examples" anchor="examples">
<h3>Extract single string</h3>
Suppose you want to match the following portion of a web-page:
<code>name="file" value="readme.txt"></code>
and you want to extract <code>readme.txt</code>.
A suitable regular expression would be:
<code>name="file" value="(.+?)"></code>
The special characters above are:
<dl>
<dt><code>(</code> and <code>)</code></dt><dd>these enclose the portion of the match string to be returned</dd>
<dt><code>.</code></dt><dd>match any character</dd>
<dt><code>+</code></dt><dd>one or more times</dd>
<dt><code>?</code></dt><dd>don't be greedy, i.e. stop when first match succeeds</dd>
</dl>
Note: without the <code>?</code>, the <code>.+</code> would continue past the first <code>"></code> until it found the last possible <code>"></code> - which is probably not what was intended.
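The difference between the lazy and greedy forms can be checked outside JMeter with a few lines of Java. The following is only a minimal sketch: it uses the JDK's <code>java.util.regex</code> classes rather than ORO (assumed here to behave the same way for these simple patterns), and the class name, the helper method and the sample page fragment are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: JDK regex engine, hypothetical names and sample data.
class ExtractSingleString {

    // Returns group 1 of the first place the regex is found, or null if it is not found.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical fragment with two "> sequences on the same line.
        String page = "name=\"file\" value=\"readme.txt\"> name=\"other\" value=\"x\">";

        // Lazy quantifier: stops at the first ">
        System.out.println(firstGroup("name=\"file\" value=\"(.+?)\">", page));
        // prints: readme.txt

        // Greedy quantifier: runs on to the last ">
        System.out.println(firstGroup("name=\"file\" value=\"(.+)\">", page));
        // prints: readme.txt"> name="other" value="x

        // Negated character class: cannot run past the closing quote
        System.out.println(firstGroup("name=\"file\" value=\"([^\"]+)\">", page));
        // prints: readme.txt
    }
}
```

The backslashes before the double quotes are Java string escapes only; in a JMeter Regular Expression Extractor the expression is entered without them.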
Note: although the expression <code>name="file" value="(.+?)"></code> works, it is more efficient to use the following expression:
<code>name="file" value="([^"]+)"></code>
where <code>[^"]</code> means match anything except <code>"</code>.
In this case, the matching engine can stop looking as soon as it sees the first <code>"</code>,
whereas in the previous case the engine has to check that it has found <code>"></code> rather than say <code>" ></code>.
<h3>Extract multiple strings</h3>
Suppose you want to match the following portion of a web-page:
<code>name="file.name" value="readme.txt"</code>
and you want to extract both <code>file.name</code> and <code>readme.txt</code>.
A suitable regular expression would be:
<code>name="([^"]+)" value="([^"]+)"</code>
This would create 2 groups, which could be used in the JMeter Regular Expression Extractor template as <code>$1$</code> and <code>$2$</code>.
The JMeter Regex Extractor saves the values of the groups in additional variables.
For example, assume:
<ul>
<li>Reference Name: <code>MYREF</code></li>
<li>Regex: <code>name="(.+?)" value="(.+?)"</code></li>
<li>Template: <code>$1$$2$</code></li>
</ul>
<note>Do not enclose the regular expression in <code>/ /</code></note>
The following variables would be set:
<dl>
<dt><code>MYREF</code></dt><dd><code>file.namereadme.txt</code></dd>
<dt><code>MYREF_g0</code></dt><dd><code>name="file.name" value="readme.txt"</code></dd>
<dt><code>MYREF_g1</code></dt><dd><code>file.name</code></dd>
<dt><code>MYREF_g2</code></dt><dd><code>readme.txt</code></dd>
</dl>
These variables can be referred to later on in the JMeter test plan, as <code>${MYREF}</code>, <code>${MYREF_g1}</code> etc.
</subsection>
<subsection name="&sect-num;.3 Line mode" anchor="line_mode">
The pattern matching behaves in various slightly different ways, depending on the setting of the multi-line and single-line modifiers.
Note that the single-line and multi-line operators have nothing to do with each other; they can be specified independently.
<h3>Single-line mode</h3>
Single-line mode only affects how the '<code>.</code>' meta-character is interpreted.
Default behaviour is that '<code>.</code>' matches any character except newline.
In single-line mode, '<code>.</code>' also matches newline.
<h3>Multi-line mode</h3>
Multi-line mode only affects how the meta-characters '<code>^</code>' and '<code>$</code>' are interpreted.
Default behaviour is that '<code>^</code>' and '<code>$</code>' only match at the very beginning and end of the string.
When multi-line mode is used, the '<code>^</code>' meta-character matches at the beginning of every line,
and the '<code>$</code>' meta-character matches at the end of every line.
</subsection>
<subsection name="&sect-num;.4 Meta characters" anchor="meta_chars">
Regular expressions use certain characters as meta characters - these characters have a special meaning to the RE engine.
Such characters must be escaped by preceding them with <code>\</code> (backslash) in order to treat them as ordinary characters.
Here is a list of the meta characters and their meaning (please check the ORO documentation if in doubt).
<dl>
<dt><code>(</code> and <code>)</code></dt><dd>grouping</dd>
<dt><code>[</code> and <code>]</code></dt><dd>character classes</dd>
<dt><code>{</code> and <code>}</code></dt><dd>repetition</dd>
<dt><code>*</code>, <code>+</code> and <code>?</code></dt><dd>repetition</dd>
<dt><code>.</code></dt><dd>wild-card character</dd>
<dt><code>\</code></dt><dd>escape character</dd>
<dt><code>|</code></dt><dd>alternatives</dd>
<dt><code>^</code> and <code>$</code></dt><dd>start and end of string or line</dd>
</dl>
<note>
Please note that ORO does not support the <code>\Q</code> and <code>\E</code> meta-characters.
[In other RE engines, these can be used to quote a portion of an RE so that the meta-characters stand for themselves.]
You can use a function to do the equivalent, see <a href="functions.html#__escapeOroRegexpChars">${__escapeOroRegexpChars(valueToEscape)}</a>.
</note>
The following Perl5 extended regular expressions are supported by ORO.
<dl>
<dt><code>(?#text)</code></dt>
<dd>An embedded comment causing text to be ignored.</dd>
<dt><code>(?:regexp)</code></dt>
<dd>Groups things like "<code>()</code>" but doesn't cause the group match to be saved.</dd>
<dt><code>(?=regexp)</code></dt>
<dd>A zero-width positive lookahead assertion. For example, <code>\w+(?=\s)</code> matches a word followed by whitespace,
without including whitespace in the MatchResult.</dd>
<dt><code>(?!regexp)</code></dt>
<dd>A zero-width negative lookahead assertion. For example <code>foo(?!bar)</code> matches any occurrence of
"<code>foo</code>" that isn't followed by "<code>bar</code>". Remember that this is a zero-width assertion,
which means that <code>a(?!b)d</code> will match <code>ad</code> because <code>a</code> is followed by a character
that is not <code>b</code> (the <code>d</code>) and a <code>d</code> follows the zero-width assertion.</dd>
<dt><code>(?imsx)</code></dt>
<dd>One or more embedded pattern-match modifiers. <code>i</code> enables case insensitivity,
<code>m</code> enables multiline treatment of the input, <code>s</code> enables single line treatment of the input,
and <code>x</code> enables extended whitespace comments.</dd>
</dl>
Note that <code>(?&lt;=regexp)</code> - lookbehind - is not supported.
</subsection>
<subsection name="&sect-num;.5 Placement of modifiers" anchor="placement">
Modifiers can be placed anywhere in the regex, and apply from that point onwards.
[A bug in ORO means that they cannot be used at the very end of the regex.
However they would have no effect there anyway.]
The single-line <code>(?s)</code> and multi-line <code>(?m)</code> modifiers are normally placed at the start of the regex.
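The effect of placing <code>(?s)</code> and <code>(?m)</code> at the start of a regex, and the fact that the two modifiers are independent, can be seen in a short Java sketch. This is an illustration under stated assumptions: it runs on the JDK's <code>java.util.regex</code> engine rather than ORO, and the class name, helper method and two-line sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: JDK regex engine, hypothetical names and sample data.
class LineModeDemo {

    // Counts how many times the regex is found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";

        // Default: ^ matches only at the very beginning of the input.
        System.out.println(countMatches("^\\w+", text));                  // prints: 1
        // Multi-line: ^ also matches at the beginning of every line.
        System.out.println(countMatches("(?m)^\\w+", text));              // prints: 2

        // Default: . does not match newline, so the match cannot span lines.
        System.out.println(countMatches("first.*second", text));          // prints: 0
        // Single-line: . also matches newline.
        System.out.println(countMatches("(?s)first.*second", text));      // prints: 1

        // Both modifiers together; each changes only its own meta-characters.
        System.out.println(countMatches("(?sm)^first.*^second", text));   // prints: 1
    }
}
```

The last pattern needs both modifiers: without <code>s</code> the <code>.*</code> cannot cross the newline, and without <code>m</code> the second <code>^</code> cannot match in the middle of the input.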
The ignore-case modifier <code>(?i)</code> may be usefully applied to just part of a regex, for example:
<source>
Match ExAct case or (?i)ArBiTrARY(?-i) case
</source>
would match <code>Match ExAct case or arbitrary case</code> as well as <code>Match ExAct case or ARBitrary case</code>,
but not <code>Match exact case or ArBiTrARY case</code>.
</subsection>
</section>
<section name="&sect-num;.6 Testing Regular Expressions" anchor="testing_expressions">
Since JMeter 2.4, the listener <a href="component_reference.html#View_Results_Tree">View Results Tree</a> includes a RegExp Tester
to test regular expressions directly on sampler response data.
There is also a <a href="http://www.regexplanet.com/advanced/java/index.html">website</a> for testing Java regular expressions.
Another approach is to use a simple test plan to test the regular expressions.
The Java Request sampler can be used to generate a sample, or the HTTP Sampler can be used to load a file.
Add a Debug Sampler and a View Results Tree listener, and changes to the regular expression can be tested quickly,
without needing to access any external servers.
</section>
</body>
</document>
