Package sunlabs.brazil.util
Class LexHTML
java.lang.Object
sunlabs.brazil.util.LexML
sunlabs.brazil.util.LexHTML
This class breaks up HTML into tokens.
This class differs slightly from LexML as follows: after certain tags,
like the <script> tag, the body that follows is uninterpreted data and
ends only at the next, in this case, </script> tag, not at just the next
"<" or ">" character. This is one way that HTML is not fully
compliant with XML.
The default set of tags that have this special processing is
<script>, <style>, and <xmp>. The user can change this by retrieving
the Vector of special tags via getClosingTags, and modifying it as needed.
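The special body processing described above can be illustrated with a small, self-contained sketch. This is not the Brazil implementation: the class RawBodySketch and its methods are hypothetical names introduced here, and the tokenizer is deliberately simplified (it does not treat comments or quoted attribute values specially). It only shows the idea that, after an opening tag from the special list, everything up to the matching closing tag is kept as one uninterpreted token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified tokenizer; not the Brazil implementation. */
public class RawBodySketch {

    /** Splits html into tag and text tokens; bodies of closingTags are kept whole. */
    public static List<String> tokenize(String html, List<String> closingTags) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) != '<') {
                // Plain text runs up to the next "<".
                int next = html.indexOf('<', pos);
                if (next < 0) next = html.length();
                tokens.add(html.substring(pos, next));
                pos = next;
                continue;
            }
            int end = html.indexOf('>', pos);
            if (end < 0) {
                tokens.add(html.substring(pos)); // unterminated tag: keep the rest as-is
                break;
            }
            String tag = html.substring(pos, end + 1);
            tokens.add(tag);
            pos = end + 1;

            String name = openingTagName(tag);
            if (name != null && closingTags.contains(name)) {
                // Raw body: everything up to the matching "</name", ignoring case.
                int close = indexOfIgnoreCase(html, "</" + name, pos);
                if (close < 0) close = html.length();
                if (close > pos) tokens.add(html.substring(pos, close));
                pos = close;
            }
        }
        return tokens;
    }

    /** Lower-cased name of an opening tag, or null for a closing tag. */
    static String openingTagName(String tag) {
        if (tag.startsWith("</")) return null;
        int end = 1;
        while (end < tag.length() && "> \t\r\n/".indexOf(tag.charAt(end)) < 0) end++;
        return tag.substring(1, end).toLowerCase(Locale.ROOT);
    }

    /** Case-insensitive indexOf; returns -1 if needle does not occur at or after from. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }
}
```

With the list "script", "style", "xmp", the input `<SCRIPT>if (a<b) x();</Script><p>hi</p>` yields the script body `if (a<b) x();` as a single token. With an empty list, the same input is split at the "<" inside the script, which is the behavior the special processing is meant to avoid.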
Version: 2.2
Author: Colin Stevens (colin.stevens@sun.com)
Method Summary
Modifier and Type    Method              Description
Vector               getClosingTags()    Get the set of HTML tags that have the special body-processing behavior mentioned above.
String               getTag()            Gets the tag name at the beginning of the current tag.
boolean              nextToken()         Advances to the next token, correctly handling HTML tags that have the special body-processing behavior mentioned above.
void                 replace(String)     Changes the string that this LexHTML is parsing.

Methods inherited from class sunlabs.brazil.util.LexML
findClose, getArgs, getAttributes, getBody, getLocation, getString, getToken, getType, isSingleton, rest
Constructor Details
LexHTML
Creates a new HTML parser, which can be used to iterate over the tokens in the given string.
Parameters:
str - The HTML to parse.
Method Details
getClosingTags
Get the set of HTML tags that have the special body-processing behavior mentioned above. The Vector itself is returned; the caller may modify it after calling this method, and any changes affect this parser's settings.
Parameters:
tags - The array of case-insensitive tag names that are only closed by seeing their "slashed" version.
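Because the returned Vector is the parser's own list, editing it changes which tags get the special handling. The following sketch shows that pattern in isolation. ClosingTagsSketch and isSpecial are hypothetical names used only for illustration, and storing the names in lower case is an assumption of this sketch, not something stated in this documentation.

```java
import java.util.Locale;
import java.util.Vector;

/** Hypothetical stand-in for a parser that exposes its live list of special tags. */
public class ClosingTagsSketch {
    // Assumption of this sketch: names are stored in lower case.
    private final Vector<String> closingTags = new Vector<>();

    public ClosingTagsSketch() {
        closingTags.add("script");
        closingTags.add("style");
        closingTags.add("xmp");
    }

    /** Returns the live Vector, not a copy, so caller edits change this object's settings. */
    public Vector<String> getClosingTags() {
        return closingTags;
    }

    /** True if the tag name, compared case-insensitively, currently gets raw-body handling. */
    public boolean isSpecial(String tagName) {
        return closingTags.contains(tagName.toLowerCase(Locale.ROOT));
    }
}
```

In this sketch, a caller could add "textarea" with `getClosingTags().add("textarea")` or drop "xmp" with `getClosingTags().remove("xmp")`, and later lookups would reflect the change without any setter method.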
nextToken
public boolean nextToken()
Advances to the next token, correctly handling HTML tags that have the special body-processing behavior mentioned above. The user can then call the other methods in this class to get information about the new current token.
This method returns the uninterpreted data making up the body of a special HTML tag as a token of type LexML.STRING, even if the body was actually a comment or another tag.
getTag
Gets the tag name at the beginning of the current tag. In HTML, tag names are defined as case-insensitive, so the name returned is converted to lower case for the convenience of the user.
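The lower-casing described above means callers can compare tag names with plain equals. A minimal sketch of that normalization follows. TagNameSketch is a hypothetical helper, not part of the Brazil API, and it only handles opening tags; how the real class reports closing tags is not stated here.

```java
import java.util.Locale;

/** Hypothetical helper showing lower-case normalization of an opening tag's name. */
public class TagNameSketch {

    /** Returns the lower-cased name of an opening tag such as "<IMG src=x>". Assumes tag starts with "<". */
    public static String tagName(String tag) {
        int end = 1; // skip the leading "<"
        while (end < tag.length()) {
            char c = tag.charAt(end);
            if (Character.isWhitespace(c) || c == '>' || c == '/') break;
            end++;
        }
        // Locale.ROOT avoids locale-specific case rules (for example the Turkish dotless i).
        return tag.substring(1, end).toLowerCase(Locale.ROOT);
    }
}
```

With this normalization, `<IMG src=x>`, `<Img>`, and `<img/>` all produce the same name, so a caller needs only one comparison per tag.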
replace
Changes the string that this LexHTML is parsing.