Diffstat (limited to 'vendor/regex-syntax/src/lib.rs')
-rw-r--r--  vendor/regex-syntax/src/lib.rs  431
1 file changed, 0 insertions, 431 deletions
diff --git a/vendor/regex-syntax/src/lib.rs b/vendor/regex-syntax/src/lib.rs
deleted file mode 100644
index 20f25db7..00000000
--- a/vendor/regex-syntax/src/lib.rs
+++ /dev/null
@@ -1,431 +0,0 @@
-/*!
-This crate provides a robust regular expression parser.
-
-This crate defines two primary types:
-
-* [`Ast`](ast::Ast) is the abstract syntax of a regular expression.
- An abstract syntax corresponds to a *structured representation* of the
- concrete syntax of a regular expression, where the concrete syntax is the
- pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it
- can be converted back to the original concrete syntax (modulo some details,
- like whitespace). To a first approximation, the abstract syntax is complex
- and difficult to analyze.
-* [`Hir`](hir::Hir) is the high-level intermediate representation
-  ("HIR" or "high-level IR" for short) of a regular expression. It corresponds
-  to an intermediate state of a regular expression that sits between the
-  abstract syntax and the low level compiled opcodes that are eventually
-  responsible for executing a regular expression search. Given some high-level
-  IR, it is not possible to produce the original concrete syntax (it is
-  possible to produce an equivalent concrete syntax, but it will likely bear
-  little resemblance to the original pattern). To a first approximation, the
-  high-level IR is simple and easy to analyze.
-
-These two types come with conversion routines:
-
-* An [`ast::parse::Parser`] converts concrete syntax (a `&str`) to an
-[`Ast`](ast::Ast).
-* A [`hir::translate::Translator`] converts an [`Ast`](ast::Ast) to a
-[`Hir`](hir::Hir).
-
-As a convenience, the above two conversion routines are combined into one via
-the top-level [`Parser`] type. This `Parser` will first convert your pattern to
-an `Ast` and then convert the `Ast` to an `Hir`. The same convenience is also
-exposed via the top-level [`parse`] free function; both the convenience route
-and the explicit two-step route are shown in the examples below.
-
-
-# Examples
-
-This example shows how to parse a pattern string into its HIR using the
-convenience [`parse`] function:
-
-```
-use regex_syntax::{hir::Hir, parse};
-
-let hir = parse("a|b")?;
-assert_eq!(hir, Hir::alternation(vec![
- Hir::literal("a".as_bytes()),
- Hir::literal("b".as_bytes()),
-]));
-# Ok::<(), Box<dyn std::error::Error>>(())
-```
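-
-The same `Hir` can also be produced by driving the two conversion steps
-described above explicitly, which is useful when access to the intermediate
-`Ast` is needed (a brief sketch; the pattern is arbitrary):
-
-```
-use regex_syntax::ast::parse::Parser as AstParser;
-use regex_syntax::hir::translate::Translator;
-
-let pattern = "a|b";
-// Step 1: concrete syntax (`&str`) -> `Ast`.
-let ast = AstParser::new().parse(pattern)?;
-// Step 2: `Ast` -> `Hir`. The translator also takes the original pattern
-// text so that errors can be reported with spans into it.
-let hir = Translator::new().translate(pattern, &ast)?;
-assert_eq!(hir, regex_syntax::parse(pattern)?);
-# Ok::<(), Box<dyn std::error::Error>>(())
-```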
-
-
-# Concrete syntax supported
-
-The concrete syntax is documented as part of the public API of the
-[`regex` crate](https://docs.rs/regex/%2A/regex/#syntax).
-
-
-# Input safety
-
-A key feature of this library is that it is safe to use with end user facing
-input. This plays a significant role in the internal implementation. In
-particular:
-
-1. Parsers provide a `nest_limit` option that permits callers to control how
-   deeply nested a regular expression is allowed to be (see the example below
-   this list). This makes it possible to do case analysis over an `Ast` or an
-   `Hir` using recursion without worrying about stack overflow.
-2. Since relying on a particular stack size is brittle, this crate goes to
- great lengths to ensure that all interactions with both the `Ast` and the
- `Hir` do not use recursion. Namely, they use constant stack space and heap
- space proportional to the size of the original pattern string (in bytes).
- This includes the type's corresponding destructors. (One exception to this
- is literal extraction, but this will eventually get fixed.)
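-
-For example, the `nest_limit` option from item 1 above can be configured
-through a [`ParserBuilder`]. The limit and pattern below are only
-illustrative:
-
-```
-use regex_syntax::ParserBuilder;
-
-// Permit at most 10 levels of nesting in the parsed pattern.
-let mut parser = ParserBuilder::new().nest_limit(10).build();
-assert!(parser.parse("(a)").is_ok());
-// A pattern nested well beyond the limit is rejected with an error
-// instead of risking unbounded recursion later on.
-let deep = format!("{}a{}", "(".repeat(50), ")".repeat(50));
-assert!(parser.parse(&deep).is_err());
-```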
-
-
-# Error reporting
-
-The `Display` implementations on all `Error` types exposed in this library
-provide nice human readable errors that are suitable for showing to end users
-in a monospace font.
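-
-For example, printing the error for an invalid pattern produces a multi-line
-message (the exact text is not part of the API; this is only a sketch):
-
-```
-use regex_syntax::parse;
-
-// An unclosed group is a parse error.
-let err = parse("(foo").unwrap_err();
-// The `Display` output includes the pattern itself with an indicator
-// pointing at the offending span, suitable for showing to end users.
-println!("{}", err);
-```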
-
-
-# Literal extraction
-
-This crate provides limited support for [literal extraction from `Hir`
-values](hir::literal). Be warned that literal extraction uses recursion and,
-therefore, stack space proportional to the size of the `Hir`.
-
-The purpose of literal extraction is to speed up searches. That is, if you
-know a regular expression must match a prefix or suffix literal, then it is
-often quicker to search for instances of that literal, and then confirm or deny
-the match using the full regular expression engine. These optimizations are
-done automatically in the `regex` crate.
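-
-A minimal sketch of the extractor in [`hir::literal`] (the pattern and the
-assertion are illustrative):
-
-```
-use regex_syntax::{hir::literal::Extractor, parse};
-
-let hir = parse("foo|fob")?;
-// By default the extractor computes prefix literals. Every match of this
-// pattern starts with one of a small set of literals, so the extracted
-// sequence is finite.
-let prefixes = Extractor::new().extract(&hir);
-assert!(prefixes.is_finite());
-# Ok::<(), Box<dyn std::error::Error>>(())
-```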
-
-
-# Crate features
-
-An important feature provided by this crate is its Unicode support. This
-includes things like case folding, boolean properties, general categories,
-scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`.
-However, a downside of this support is that it requires bundling several
-Unicode data tables that are substantial in size.
-
-A fair number of use cases do not require full Unicode support. For this
-reason, this crate exposes a number of features to control which Unicode
-data is available.
-
-If a regular expression attempts to use a Unicode feature that is not available
-because the corresponding crate feature was disabled, then translating that
-regular expression to an `Hir` will return an error. (It is still possible
-to construct an `Ast` for such a regular expression, since Unicode data is not
-used until translation to an `Hir`.) Stated differently, enabling or disabling
-any of the features below can only add or subtract from the total set of valid
-regular expressions. Enabling or disabling a feature will never modify the
-match semantics of a regular expression.
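-
-As a small sketch of that distinction, building an `Ast` for a pattern that
-uses a Unicode property always succeeds, while the failure, if any, happens
-at translation time:
-
-```
-use regex_syntax::ast::parse::Parser as AstParser;
-
-// Parsing to an `Ast` never consults Unicode data, so this succeeds
-// regardless of which `unicode-*` features are enabled. Translating the
-// `Ast` to an `Hir` is the step that can fail when, say, the
-// `unicode-script` feature is disabled.
-let _ast = AstParser::new().parse(r"\p{Greek}")?;
-# Ok::<(), Box<dyn std::error::Error>>(())
-```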
-
-The following features are available:
-
-* **std** -
- Enables support for the standard library. This feature is enabled by default.
- When disabled, only `core` and `alloc` are used. Otherwise, enabling `std`
- generally just enables `std::error::Error` trait impls for the various error
- types.
-* **unicode** -
- Enables all Unicode features. This feature is enabled by default, and will
- always cover all Unicode features, even if more are added in the future.
-* **unicode-age** -
- Provide the data for the
- [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
- This makes it possible to use classes like `\p{Age:6.0}` to refer to all
-  codepoints first introduced in Unicode 6.0.
-* **unicode-bool** -
- Provide the data for numerous Unicode boolean properties. The full list
- is not included here, but contains properties like `Alphabetic`, `Emoji`,
- `Lowercase`, `Math`, `Uppercase` and `White_Space`.
-* **unicode-case** -
- Provide the data for case insensitive matching using
- [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
-* **unicode-gencat** -
- Provide the data for
- [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
- This includes, but is not limited to, `Decimal_Number`, `Letter`,
- `Math_Symbol`, `Number` and `Punctuation`.
-* **unicode-perl** -
- Provide the data for supporting the Unicode-aware Perl character classes,
- corresponding to `\w`, `\s` and `\d`. This is also necessary for using
- Unicode-aware word boundary assertions. Note that if this feature is
- disabled, the `\s` and `\d` character classes are still available if the
- `unicode-bool` and `unicode-gencat` features are enabled, respectively.
-* **unicode-script** -
- Provide the data for
- [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
- This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
- `Latin` and `Thai`.
-* **unicode-segment** -
-  Provide the data necessary for the properties used to implement the
- [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
- This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
- `\p{sb=ATerm}`.
-* **arbitrary** -
- Enabling this feature introduces a public dependency on the
- [`arbitrary`](https://crates.io/crates/arbitrary)
- crate. Namely, it implements the `Arbitrary` trait from that crate for the
- [`Ast`](crate::ast::Ast) type. This feature is disabled by default.
-*/
-
-#![no_std]
-#![forbid(unsafe_code)]
-#![deny(missing_docs, rustdoc::broken_intra_doc_links)]
-#![warn(missing_debug_implementations)]
-#![cfg_attr(docsrs, feature(doc_auto_cfg))]
-
-#[cfg(any(test, feature = "std"))]
-extern crate std;
-
-extern crate alloc;
-
-pub use crate::{
- error::Error,
- parser::{parse, Parser, ParserBuilder},
- unicode::UnicodeWordError,
-};
-
-use alloc::string::String;
-
-pub mod ast;
-mod debug;
-mod either;
-mod error;
-pub mod hir;
-mod parser;
-mod rank;
-mod unicode;
-mod unicode_tables;
-pub mod utf8;
-
-/// Escapes all regular expression meta characters in `text`.
-///
-/// The string returned may be safely used as a literal in a regular
-/// expression.
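-///
-/// # Example
-///
-/// A small illustration using a couple of the characters listed in
-/// [`is_meta_character`]:
-///
-/// ```
-/// use regex_syntax::escape;
-///
-/// // `.` and `*` are meta characters, so they get a backslash; the other
-/// // characters are copied through unchanged.
-/// assert_eq!(r"a\.b\*c", escape("a.b*c"));
-/// ```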
-pub fn escape(text: &str) -> String {
- let mut quoted = String::new();
- escape_into(text, &mut quoted);
- quoted
-}
-
-/// Escapes all meta characters in `text` and writes the result into `buf`.
-///
-/// This will append escape characters into the given buffer. The characters
-/// that are appended are safe to use as a literal in a regular expression.
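-///
-/// # Example
-///
-/// A small illustration that appends to an existing buffer:
-///
-/// ```
-/// use regex_syntax::escape_into;
-///
-/// let mut buf = String::from("^");
-/// escape_into("a+b", &mut buf);
-/// // Only the appended text is escaped; the existing contents of `buf`
-/// // are left untouched.
-/// assert_eq!(r"^a\+b", buf);
-/// ```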
-pub fn escape_into(text: &str, buf: &mut String) {
- buf.reserve(text.len());
- for c in text.chars() {
- if is_meta_character(c) {
- buf.push('\\');
- }
- buf.push(c);
- }
-}
-
-/// Returns true if the given character has significance in a regex.
-///
-/// Generally speaking, these are the only characters which _must_ be escaped
-/// in order to match their literal meaning. For example, to match a literal
-/// `|`, one could write `\|`. However, escaping isn't always necessary. For
-/// example, `-` is treated as a meta character because of its significance
-/// for writing ranges inside of character classes, but the regex `-` will
-/// match a literal `-` because `-` has no special meaning outside of character
-/// classes.
-///
-/// In order to determine whether a character may be escaped at all, the
-/// [`is_escapeable_character`] routine should be used. The difference between
-/// `is_meta_character` and `is_escapeable_character` is that the latter will
-/// return true for some characters that are _not_ meta characters. For
-/// example, `%` and `\%` both match a literal `%` in all contexts. In other
-/// words, `is_escapeable_character` includes "superfluous" escapes.
-///
-/// Note that the set of characters for which this function returns `true` or
-/// `false` is fixed and won't change in a semver compatible release. (In this
-/// case, "semver compatible release" actually refers to the `regex` crate
-/// itself, since reducing or expanding the set of meta characters would be a
-/// breaking change for not just `regex-syntax` but also `regex` itself.)
-///
-/// # Example
-///
-/// ```
-/// use regex_syntax::is_meta_character;
-///
-/// assert!(is_meta_character('?'));
-/// assert!(is_meta_character('-'));
-/// assert!(is_meta_character('&'));
-/// assert!(is_meta_character('#'));
-///
-/// assert!(!is_meta_character('%'));
-/// assert!(!is_meta_character('/'));
-/// assert!(!is_meta_character('!'));
-/// assert!(!is_meta_character('"'));
-/// assert!(!is_meta_character('e'));
-/// ```
-pub fn is_meta_character(c: char) -> bool {
- match c {
- '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
- | '}' | '^' | '$' | '#' | '&' | '-' | '~' => true,
- _ => false,
- }
-}
-
-/// Returns true if the given character can be escaped in a regex.
-///
-/// This returns true in all cases that `is_meta_character` returns true, but
-/// also returns true in some cases where `is_meta_character` returns false.
-/// For example, `%` is not a meta character, but it is escapeable. That is,
-/// `%` and `\%` both match a literal `%` in all contexts.
-///
-/// The purpose of this routine is to provide knowledge about what characters
-/// may be escaped. Namely, most regex engines permit "superfluous" escapes
-/// where characters without any special significance may be escaped even
-/// though there is no actual _need_ to do so.
-///
-/// This will return false for some characters. For example, `e` is not
-/// escapeable. Therefore, `\e` will either result in a parse error (which is
-/// true today), or it could backwards compatibly evolve into a new construct
-/// with its own meaning. Indeed, that is the purpose of banning _some_
-/// superfluous escapes: it provides a way to evolve the syntax in a compatible
-/// manner.
-///
-/// # Example
-///
-/// ```
-/// use regex_syntax::is_escapeable_character;
-///
-/// assert!(is_escapeable_character('?'));
-/// assert!(is_escapeable_character('-'));
-/// assert!(is_escapeable_character('&'));
-/// assert!(is_escapeable_character('#'));
-/// assert!(is_escapeable_character('%'));
-/// assert!(is_escapeable_character('/'));
-/// assert!(is_escapeable_character('!'));
-/// assert!(is_escapeable_character('"'));
-///
-/// assert!(!is_escapeable_character('e'));
-/// ```
-pub fn is_escapeable_character(c: char) -> bool {
- // Certainly escapeable if it's a meta character.
- if is_meta_character(c) {
- return true;
- }
- // Any character that isn't ASCII is definitely not escapeable. There's
- // no real need to allow things like \☃ right?
- if !c.is_ascii() {
- return false;
- }
- // Otherwise, we basically say that everything is escapeable unless it's a
- // letter or digit. Things like \3 are either octal (when enabled) or an
- // error, and we should keep it that way. Otherwise, letters are reserved
- // for adding new syntax in a backwards compatible way.
- match c {
- '0'..='9' | 'A'..='Z' | 'a'..='z' => false,
-        // `\<` and `\>` are supported as word boundary assertions, so we
-        // must keep `<` and `>` as *not* escapeable here: those escape
-        // sequences carry their own meaning and are not superfluous escapes.
-        // (They were originally reserved precisely so that this syntax could
-        // be added without a backwards incompatible change.)
- '<' | '>' => false,
- _ => true,
- }
-}
-
-/// Returns true if and only if the given character is a Unicode word
-/// character.
-///
-/// A Unicode word character is defined by
-/// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
-/// In particular, a character
-/// is considered a word character if it is in either of the `Alphabetic` or
-/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
-/// or `Connector_Punctuation` general categories.
-///
-/// # Panics
-///
-/// If the `unicode-perl` feature is not enabled, then this function
-/// panics. For this reason, it is recommended that callers use
-/// [`try_is_word_character`] instead.
-pub fn is_word_character(c: char) -> bool {
- try_is_word_character(c).expect("unicode-perl feature must be enabled")
-}
-
-/// Returns true if and only if the given character is a Unicode word
-/// character.
-///
-/// A Unicode word character is defined by
-/// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
-/// In particular, a character
-/// is considered a word character if it is in either of the `Alphabetic` or
-/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
-/// or `Connector_Punctuation` general categories.
-///
-/// # Errors
-///
-/// If the `unicode-perl` feature is not enabled, then this function always
-/// returns an error.
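-///
-/// # Example
-///
-/// This sketch passes whether or not the `unicode-perl` feature is enabled:
-///
-/// ```
-/// use regex_syntax::try_is_word_character;
-///
-/// // With `unicode-perl` enabled (the default), this returns `Ok(true)`;
-/// // with it disabled, it returns an error instead of panicking.
-/// if let Ok(is_word) = try_is_word_character('β') {
-///     assert!(is_word);
-/// }
-/// ```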
-pub fn try_is_word_character(
- c: char,
-) -> core::result::Result<bool, UnicodeWordError> {
- unicode::is_word_character(c)
-}
-
-/// Returns true if and only if the given character is an ASCII word character.
-///
-/// An ASCII word character is defined by the following character class:
-/// `[_0-9a-zA-Z]`.
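-///
-/// # Example
-///
-/// A brief illustration of the character class above:
-///
-/// ```
-/// use regex_syntax::is_word_byte;
-///
-/// assert!(is_word_byte(b'_'));
-/// assert!(is_word_byte(b'7'));
-/// assert!(!is_word_byte(b'-'));
-/// assert!(!is_word_byte(b' '));
-/// ```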
-pub fn is_word_byte(c: u8) -> bool {
- match c {
- b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true,
- _ => false,
- }
-}
-
-#[cfg(test)]
-mod tests {
- use alloc::string::ToString;
-
- use super::*;
-
- #[test]
- fn escape_meta() {
- assert_eq!(
- escape(r"\.+*?()|[]{}^$#&-~"),
- r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string()
- );
- }
-
- #[test]
- fn word_byte() {
- assert!(is_word_byte(b'a'));
- assert!(!is_word_byte(b'-'));
- }
-
- #[test]
- #[cfg(feature = "unicode-perl")]
- fn word_char() {
- assert!(is_word_character('a'), "ASCII");
- assert!(is_word_character('à'), "Latin-1");
- assert!(is_word_character('β'), "Greek");
- assert!(is_word_character('\u{11011}'), "Brahmi (Unicode 6.0)");
- assert!(is_word_character('\u{11611}'), "Modi (Unicode 7.0)");
- assert!(is_word_character('\u{11711}'), "Ahom (Unicode 8.0)");
- assert!(is_word_character('\u{17828}'), "Tangut (Unicode 9.0)");
- assert!(is_word_character('\u{1B1B1}'), "Nushu (Unicode 10.0)");
- assert!(is_word_character('\u{16E40}'), "Medefaidrin (Unicode 11.0)");
- assert!(!is_word_character('-'));
- assert!(!is_word_character('☃'));
- }
-
- #[test]
- #[should_panic]
- #[cfg(not(feature = "unicode-perl"))]
- fn word_char_disabled_panic() {
- assert!(is_word_character('a'));
- }
-
- #[test]
- #[cfg(not(feature = "unicode-perl"))]
- fn word_char_disabled_error() {
- assert!(try_is_word_character('a').is_err());
- }
-}