<?xml version="1.0"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to you under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<!DOCTYPE document
[
<!ENTITY sect-num '21'>
]>
<document prev="functions.html" next="hints_and_tips.html" id="$Id$">
<properties>
<title>User's Manual: Regular Expressions</title>
</properties>
<body>
<section name="&sect-num;. Regular Expressions" anchor="regex">
<subsection name="&sect-num;.1 Overview" anchor="overview">
<p>
JMeter includes the pattern matching software <a href="http://attic.apache.org/projects/jakarta-oro.html">Apache Jakarta ORO</a>.
<br/>
There is some documentation for this on the Jakarta web-site, for example
<a href="http://archimedes.fas.harvard.edu/scrapbook/jakarta-oro-2.0.6/docs/api/org/apache/oro/text/regex/package-summary.html">
a summary of the pattern matching characters</a>.
</p>
<p>
There is also documentation on an older incarnation of the product at
<a href="http://www.savarese.org/oro/docs/OROMatcher/index.html">OROMatcher User's guide</a>, which might prove useful.
</p>
<note>Since JMeter 5.5, the regex implementation can be switched from ORO to the JDK-based one by setting
the JMeter property <code>jmeter.regex.engine</code> to any value other than <code>oro</code>.</note>
<p>
The pattern matching is very similar to the pattern matching in Perl.
A full installation of Perl will include plenty of documentation on regular expressions - look for <code>perlrequick</code>,
<code>perlretut</code>, <code>perlre</code> and <code>perlreref</code>.
</p>
<p>
It is worth stressing the difference between "<em>contains</em>" and "<em>matches</em>", as used on the Response Assertion test element:
</p>
<dl>
<dt>"<em>contains</em>"</dt><dd> means that the regular expression matched at least some part of the target,
so '<code>alphabet</code>' "<em>contains</em>" '<code>ph.b.</code>' because the regular expression matches the substring '<code>phabe</code>'.
</dd>
<dt>
"<em>matches</em>"</dt><dd> means that the regular expression matched the whole target.
So '<code>alphabet</code>' is "<em>matched</em>" by '<code>al.*t</code>'.
</dd>
</dl>
<p>In this case, it is equivalent to wrapping the regular expression in <code>^</code> and <code>$</code>, viz '<code>^al.*t$</code>'.
</p>
<p>However, this is not always the case.
For example, the regular expression '<code>alp|.lp.*</code>' is "<em>contained</em>" in '<code>alphabet</code>',
but does not "<em>match</em>" '<code>alphabet</code>'.
</p>
<p>Why? Because when the pattern matcher finds the sequence '<code>alp</code>' in '<code>alphabet</code>', it stops trying any other
combinations - and '<code>alp</code>' is not the same as '<code>alphabet</code>', as it does not include '<code>habet</code>'.
</p>
<note>
Unlike Perl, there is no need to (i.e. do not) enclose the regular expression in <code>//</code>.
</note>
<p>
So how does one use the modifiers <code>ismx</code> etc. if there is no trailing <code>/</code>?
The solution is to use <i>extended regular expressions</i>, i.e. <code>/abc/i</code> becomes <code>(?i)abc</code>.
See also <a href="#placement">Placement of modifiers</a> below.
</p>
</subsection>
<subsection name="&sect-num;.2 Examples" anchor="examples">
<h3>Extract single string</h3>
<p>
Suppose you want to match the following portion of a web-page:
<br/>
<code>name="file" value="readme.txt"></code>
<br/>
and you want to extract <code>readme.txt</code>.
<br/>
A suitable regular expression would be:
<br/>
<code>name="file" value="(.+?)"></code>
<p>
The special characters above are:
</p>
<dl>
<dt><code>(</code> and <code>)</code></dt><dd>these enclose the portion of the match string to be returned</dd>
<dt><code>.</code></dt><dd>match any character</dd>
<dt><code>+</code></dt><dd>one or more times</dd>
<dt><code>?</code></dt><dd>don't be greedy, i.e. stop when first match succeeds</dd>
</dl>
<p>
Note: without the <code>?</code>, the <code>.+</code> would continue past the first <code>"></code>
until it found the last possible <code>"></code> - which is probably not what was intended.
</p>
<p>
Note: although the above expression works, it's more efficient to use the following expression:
<br/>
<code>name="file" value="([^"]+)"></code>
where<br/>
<code>[^"]</code> means match anything except <code>"</code>.<br/>
In this case, the matching engine can stop looking as soon as it sees the first <code>"</code>,
whereas in the previous case the engine has to check that it has found <code>"></code> rather than, say, <code>" ></code>.
</p>
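<p>
The three variants discussed above (greedy, non-greedy and negated character class) can be compared with a short Java sketch.
It assumes the JDK <code>java.util.regex</code> engine, not ORO, and the class name, helper method and sample page fragment are hypothetical.
</p>
<source>
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractSingle {

    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical page fragment containing two attribute pairs on one line.
    static final String PAGE =
            "name=\"file\" value=\"readme.txt\"> name=\"mode\" value=\"text\">";

    public static void main(String[] args) {
        // Greedy: runs on to the last possible "> in the input.
        System.out.println(firstGroup("name=\"file\" value=\"(.+)\">", PAGE));
        // Non-greedy: stops at the first ">.
        System.out.println(firstGroup("name=\"file\" value=\"(.+?)\">", PAGE));
        // Negated character class: stops at the first double quote.
        System.out.println(firstGroup("name=\"file\" value=\"([^\"]+)\">", PAGE));
    }
}
```
</source>
<p>
Only the greedy variant returns more than <code>readme.txt</code> for this sample input.
</p>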
<h3>Extract multiple strings</h3>
<p>
Suppose you want to match the following portion of a web-page:<br/>
<code>name="file.name" value="readme.txt"</code>
and you want to extract both <code>file.name</code> and <code>readme.txt</code>.
<br/>
A suitable regular expression would be:
<br/>
<code>name="([^"]+)" value="([^"]+)"</code>
<br/>
This would create 2 groups, which could be used in the JMeter Regular Expression Extractor template as <code>$1$</code> and <code>$2$</code>.
</p>
<p>
The JMeter Regex Extractor saves the values of the groups in additional variables.
</p>
<p>
For example, assume:
</p>
<ul>
<li>Reference Name: <code>MYREF</code></li>
<li>Regex: <code>name="(.+?)" value="(.+?)"</code></li>
<li>Template: <code>$1$$2$</code></li>
</ul>
<note>Do not enclose the regular expression in <code>/ /</code>.</note>
<p>
The following variables would be set:
</p>
<dl>
<dt><code>MYREF</code></dt><dd><code>file.namereadme.txt</code></dd>
<dt><code>MYREF_g0</code></dt><dd><code>name="file.name" value="readme.txt"</code></dd>
<dt><code>MYREF_g1</code></dt><dd><code>file.name</code></dd>
<dt><code>MYREF_g2</code></dt><dd><code>readme.txt</code></dd>
</dl>
These variables can be referred to later on in the JMeter test plan, as <code>${MYREF}</code>, <code>${MYREF_g1}</code> etc.
</p>
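<p>
The way the groups map onto the variables listed above can be imitated in Java.
This is a sketch under stated assumptions, not the JMeter implementation: it uses <code>java.util.regex</code>,
hard-codes the template <code>$1$$2$</code>, and the class and method names are hypothetical.
</p>
<source>
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupVariables {

    // Returns "name=value" entries imitating the variables listed above.
    // Assumption: the regex has two groups and the template is $1$$2$.
    static String[] variables(String refName, String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] vars = new String[m.groupCount() + 2];
        // Template $1$$2$: group 1 immediately followed by group 2.
        vars[0] = refName + "=" + m.group(1) + m.group(2);
        // Group 0 is the whole match; groups 1..n are the parenthesised parts.
        for (int g = 0; g != m.groupCount() + 1; g++) {
            vars[g + 1] = refName + "_g" + g + "=" + m.group(g);
        }
        return vars;
    }

    public static void main(String[] args) {
        String input = "name=\"file.name\" value=\"readme.txt\"";
        String regex = "name=\"(.+?)\" value=\"(.+?)\"";
        for (String v : variables("MYREF", regex, input)) {
            System.out.println(v);
        }
    }
}
```
</source>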
</subsection>
<subsection name="&sect-num;.3 Line mode" anchor="line_mode">
<p>The pattern matching behaves slightly differently
depending on the setting of the multi-line and single-line modifiers.
Note that the single-line and multi-line modifiers have nothing to do with each other;
they can be specified independently.
</p>
<h3>Single-line mode</h3>
<p>
Single-line mode only affects how the '<code>.</code>' meta-character is interpreted.
</p>
<p>
Default behaviour is that '<code>.</code>' matches any character except newline.
In single-line mode, '<code>.</code>' also matches newline.
</p>
<h3>Multi-line mode</h3>
<p>
Multi-line mode only affects how the meta-characters '<code>^</code>' and '<code>$</code>' are interpreted.
</p>
<p>
Default behaviour is that '<code>^</code>' and '<code>$</code>' only match at the very beginning and end of the string.
When Multi-line mode is used, the '<code>^</code>' metacharacter matches at the beginning of every line,
and the '<code>$</code>' metacharacter matches at the end of every line.</p>
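<p>
The effect of the two modifiers can be tried out with a small Java sketch.
It assumes the JDK <code>java.util.regex</code> engine, which accepts the same <code>(?s)</code> and <code>(?m)</code> syntax;
details such as which characters count as a newline may differ from ORO. The class name, method name and sample text are hypothetical.
</p>
<source>
```java
import java.util.regex.Pattern;

class LineModes {

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";

        // Default: '.' does not match the newline between the two lines.
        System.out.println(found("first.*second", text));     // false
        // Single-line mode: '.' also matches newline.
        System.out.println(found("(?s)first.*second", text)); // true

        // Default: '^' only matches at the very beginning of the string.
        System.out.println(found("^second", text));           // false
        // Multi-line mode: '^' matches at the beginning of every line.
        System.out.println(found("(?m)^second", text));       // true

        // Multi-line mode: '$' matches at the end of every line.
        System.out.println(found("first line$", text));       // false
        System.out.println(found("(?m)first line$", text));   // true
    }
}
```
</source>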
</subsection>
<subsection name="&sect-num;.4 Meta characters" anchor="meta_chars">
<p>
Regular expressions use certain characters as meta characters - these characters have a special meaning to the RE engine.
Such characters must be escaped by preceding them with <code>\</code> (backslash) in order to treat them as ordinary characters.
Here is a list of the meta characters and their meaning (please check the ORO documentation if in doubt).
</p>
<dl>
<dt><code>(</code> and <code>)</code></dt><dd>grouping</dd>
<dt><code>[</code> and <code>]</code></dt><dd>character classes</dd>
<dt><code>{</code> and <code>}</code></dt><dd>repetition</dd>
<dt><code>*</code>, <code>+</code> and <code>?</code></dt><dd>repetition</dd>
<dt><code>.</code></dt><dd>wild-card character</dd>
<dt><code>\</code></dt><dd>escape character</dd>
<dt><code>|</code></dt><dd>alternatives</dd>
<dt><code>^</code> and <code>$</code></dt><dd>start and end of string or line</dd>
</dl>
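<p>
As a small illustration of escaping, the following Java sketch compares an unescaped and an escaped <code>.</code>.
It assumes the JDK <code>java.util.regex</code> engine rather than ORO; the class and method names are hypothetical.
Remember that inside a Java string literal the backslash itself has to be doubled.
</p>
<source>
```java
import java.util.regex.Pattern;

class EscapeMeta {

    // True if the regex matches the whole input.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped '.' is a wild-card, so it also accepts '-'.
        System.out.println(wholeMatch("file.name", "file-name"));   // true
        // Escaped '\.' only accepts a literal dot.
        System.out.println(wholeMatch("file\\.name", "file-name")); // false
        System.out.println(wholeMatch("file\\.name", "file.name")); // true
    }
}
```
</source>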
<note>
Please note that ORO does not support the <code>\Q</code> and <code>\E</code> meta-characters.
[In other RE engines, these can be used to quote a portion of an RE so that the meta-characters stand for themselves.]
You can use a function to do the equivalent; see <a href="functions.html#__escapeOroRegexpChars">${__escapeOroRegexpChars(valueToEscape)}</a>.
</note>
<p>
The following Perl5 extended regular expressions are supported by ORO.
<dl>
<dt><code>(?#text)</code></dt>
<dd>An embedded comment causing text to be ignored.</dd>
<dt><code>(?:regexp)</code></dt>
<dd>Groups things like "<code>()</code>" but doesn't cause the group match to be saved.</dd>
<dt><code>(?=regexp)</code></dt>
<dd>A zero-width positive lookahead assertion. For example, <code>\w+(?=\s)</code> matches a word followed by whitespace, without including whitespace in the MatchResult.</dd>
<dt><code>(?!regexp)</code></dt>
<dd>A zero-width negative lookahead assertion. For example <code>foo(?!bar)</code> matches any occurrence of "<code>foo</code>" that
isn't followed by "<code>bar</code>". Remember that this is a zero-width assertion, which means that <code>a(?!b)d</code> will
match <code>ad</code> because <code>a</code> is followed by a character that is not <code>b</code> (the <code>d</code>) and a <code>d</code>
follows the zero-width assertion.</dd>
<dt><code>(?imsx)</code></dt>
<dd>One or more embedded pattern-match modifiers. <code>i</code> enables case insensitivity, <code>m</code> enables multiline treatment
of the input, <code>s</code> enables single line treatment of the input, and <code>x</code> enables extended whitespace comments.</dd>
</dl>
<b>Note that <code>(?&lt;=regexp)</code> - lookbehind - is not supported.</b>
</p>
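<p>
The lookahead examples above can be checked with a short Java sketch.
It assumes the JDK <code>java.util.regex</code> engine, which supports the same lookahead syntax, rather than ORO;
the class name, method names and sample inputs are hypothetical.
</p>
<source>
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Lookahead {

    // Text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Start offset of the first match, or -1 if there is none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Positive lookahead: the whitespace is required but not returned.
        System.out.println(firstMatch("\\w+(?=\\s)", "hello world"));     // hello
        // Negative lookahead: the first "foo" is skipped because "bar" follows.
        System.out.println(firstStart("foo(?!bar)", "foobar foobaz"));    // 7
        // Zero-width: the 'd' is checked by the lookahead and then consumed.
        System.out.println(firstMatch("a(?!b)d", "ad"));                  // ad
        System.out.println(firstMatch("a(?!b)d", "abd"));                 // null
    }
}
```
</source>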
</subsection>
<subsection name="&sect-num;.5 Placement of modifiers" anchor="placement">
<p>
Modifiers can be placed anywhere in the regex, and apply from that point onwards.
[A bug in ORO means that they cannot be used at the very end of the regex.
However they would have no effect there anyway.]
</p>
<p>
The single-line <code>(?s)</code> and multi-line <code>(?m)</code> modifiers are normally placed at the start of the regex.
</p>
<p>
The ignore-case modifier <code>(?i)</code> may be usefully applied to just part of a regex,
for example:</p>
<source>
Match ExAct case or (?i)ArBiTrARY(?-i) case
</source>
<p>
This would match <code>Match ExAct case or arbitrary case</code> as well as <code>Match ExAct case or ARBitrary case</code>, but not <code>Match exact case or ArBiTrARY case</code>.
</p>
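<p>
The scope of the ignore-case modifier in this regex can be verified with a Java sketch.
It assumes the JDK <code>java.util.regex</code> engine, which also accepts <code>(?i)</code> and <code>(?-i)</code> in mid-pattern, rather than ORO;
the class and method names are hypothetical.
</p>
<source>
```java
import java.util.regex.Pattern;

class ModifierPlacement {

    // Ignore-case applies only between (?i) and (?-i).
    static final String REGEX = "Match ExAct case or (?i)ArBiTrARY(?-i) case";

    // True if the regex matches the whole input.
    static boolean matches(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("Match ExAct case or arbitrary case")); // true
        System.out.println(matches("Match ExAct case or ARBitrary case")); // true
        // The leading part is still case-sensitive.
        System.out.println(matches("Match exact case or ArBiTrARY case")); // false
        // So is the part after (?-i).
        System.out.println(matches("Match ExAct case or arbitrary CASE")); // false
    }
}
```
</source>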
</subsection>
</section>
<section name="&sect-num;.6 Testing Regular Expressions" anchor="testing_expressions">
<p>
Since JMeter 2.4, the listener <a href="component_reference.html#View_Results_Tree">View Results Tree</a>
includes a RegExp Tester to test regular expressions directly on sampler response data.
</p>
<p>
There is also a <a href="http://www.regexplanet.com/advanced/java/index.html">website</a> for testing Java regular expressions.
</p>
<p>
Another approach is to use a simple test plan to test the regular expressions.
The Java Request sampler can be used to generate a sample, or the HTTP Sampler can be used to load a file.
Add a Debug Sampler and a View Results Tree listener; changes to the regular expression can then be tested quickly,
without needing to access any external servers.
</p>
</section>
</body>
</document>