pywb.rewrite package¶
Submodules¶
pywb.rewrite.content_rewriter module¶
-
class
pywb.rewrite.content_rewriter.
BaseContentRewriter
(rules_file, replay_mod='')[source]¶ Bases:
object
-
CHARSET_REGEX
= re.compile(b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')¶
-
TITLE
= re.compile('<\\s*title\\s*>(.*)<\\s*\\/\\s*title\\s*>', re.IGNORECASE|re.MULTILINE|re.DOTALL)¶
-
html_unescape
()¶ Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters. This function uses the rules defined by the HTML 5 standard for both valid and invalid character references, and the list of HTML 5 named character references defined in html.entities.html5.
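As a quick illustration, the two patterns listed for BaseContentRewriter can be exercised directly with the standard `re` module. The docstring of `html_unescape` matches that of the standard library's `html.unescape`, so the sketch assumes they behave the same. The sample documents are made up, and how pywb itself applies these patterns is not shown on this page.

```python
import html
import re

# Patterns copied from the attribute listing above.
CHARSET_REGEX = re.compile(b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')
TITLE = re.compile('<\\s*title\\s*>(.*)<\\s*\\/\\s*title\\s*>',
                   re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Hypothetical sample inputs, not taken from pywb.
raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=ISO-8859-1"></head></html>')
page = '<html><head><TITLE>Tom &amp; Jerry</TITLE></head></html>'

# The charset pattern works on bytes, the title pattern on text.
charset = CHARSET_REGEX.search(raw).group(1)

# Assumption: html_unescape behaves like html.unescape.
title = html.unescape(TITLE.search(page).group(1))
```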
-
-
class
pywb.rewrite.content_rewriter.
RewriteInfo
(record, content_rewriter, url_rewriter, cookie_rewriter=None)[source]¶ Bases:
object
-
JSONP_CONTAINS
= ['callback=jQuery', 'callback=jsonp', '.json?']¶
-
JSON_REGEX
= re.compile(b'^\\s*[{[][{"]')¶
-
TAG_REGEX
= re.compile(b'^(\xef\xbb\xbf)?\\s*\\<')¶
-
TAG_REGEX2
= re.compile(b'^.*<\\w+[\\s>]')¶
-
content_stream
¶
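The names of these patterns suggest they are used to sniff whether a payload looks like JSON or markup, although this page does not show how RewriteInfo combines them. A minimal sketch, assuming the patterns are applied to the first bytes of the content (the `sniff` helper is hypothetical):

```python
import re

# Patterns copied from the attribute listing above.
JSON_REGEX = re.compile(b'^\\s*[{[][{"]')
TAG_REGEX = re.compile(b'^(\xef\xbb\xbf)?\\s*\\<')


def sniff(head):
    """Rough guess at the payload kind from its leading bytes (illustrative only)."""
    if JSON_REGEX.match(head):
        return 'json'
    if TAG_REGEX.match(head):
        # An optional UTF-8 byte order mark and whitespace, then '<'.
        return 'markup'
    return 'unknown'
```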
-
pywb.rewrite.cookie_rewriter module¶
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Rewrite cookies using only the exact path. Useful for live rewriting without a timestamp and for minimizing cookie pollution.
If a path or domain is present, it is simply removed.
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Attempt to rewrite cookies to the current host URL.
If a path is present, rewrite it to the current host. This only makes sense in live proxy or no-redirect mode, as otherwise the timestamp may change.
If a domain is present, remove the domain and set the path to the path prefix.
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Attempt to rewrite cookies to the minimal scope possible.
If a path is present, rewrite it to the current rewritten URL only. If a domain is present, remove the domain and set the path to the path prefix.
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Sometimes it is necessary to rewrite cookies to root scope in order to work across time boundaries and modifiers.
This rewriter simply sets all cookies to be in the root.
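The root-scope idea can be sketched with the standard library's `http.cookies` module. This is not pywb's implementation: the `to_root_scope` helper is hypothetical, and the root is assumed to be `/`, whereas pywb may use a configured root path.

```python
from http.cookies import SimpleCookie


def to_root_scope(set_cookie_value):
    """Return rewritten Set-Cookie values with every cookie scoped to the root."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    results = []
    for morsel in cookie.values():
        morsel['path'] = '/'  # assumption: the root scope is '/'
        results.append(morsel.OutputString())
    return results
```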
Bases:
object
Base cookie rewriter for wburl-based requests.
If an HttpOnly cookie is set to a path ending in / and the current modifier is mp_ or if_, assume the path is meant to be a prefix and is likely needed for other content. Set the cookie with the same prefix for all common modifiers: (mp_, js_, cs_, oe_, if_, sw_, wkrf_)
pywb.rewrite.default_rewriter module¶
-
class
pywb.rewrite.default_rewriter.
DefaultRewriter
(replay_mod='', config=None)[source]¶ Bases:
pywb.rewrite.content_rewriter.BaseContentRewriter
-
DEFAULT_REWRITERS
= {'amf': <class 'pywb.rewrite.rewrite_amf.RewriteAMF'>, 'cookie': <class 'pywb.rewrite.cookie_rewriter.HostScopeCookieRewriter'>, 'css': <class 'pywb.rewrite.regex_rewriters.CSSRewriter'>, 'dash': <class 'pywb.rewrite.rewrite_dash.RewriteDASH'>, 'header': <class 'pywb.rewrite.header_rewriter.DefaultHeaderRewriter'>, 'hls': <class 'pywb.rewrite.rewrite_hls.RewriteHLS'>, 'html': <class 'pywb.rewrite.html_rewriter.HTMLRewriter'>, 'html-banner-only': <class 'pywb.rewrite.html_insert_rewriter.HTMLInsertOnlyRewriter'>, 'js': <class 'pywb.rewrite.regex_rewriters.JSLocationOnlyRewriter'>, 'js-proxy': <class 'pywb.rewrite.regex_rewriters.JSNoneRewriter'>, 'js-worker': <class 'pywb.rewrite.rewrite_js_workers.JSWorkerRewriter'>, 'json': <class 'pywb.rewrite.jsonp_rewriter.JSONPRewriter'>, 'xml': <class 'pywb.rewrite.regex_rewriters.XMLRewriter'>}¶
-
default_content_types
= {'css': 'text/css', 'html': 'text/html', 'js': 'text/javascript'}¶
-
rewrite_types
= {'': 'guess-text', 'application/dash+xml': 'dash', 'application/javascript': 'js', 'application/json': 'json', 'application/octet-stream': 'guess-bin', 'application/vnd.apple.mpegurl': 'hls', 'application/x-amf': 'amf', 'application/x-javascript': 'js', 'application/x-mpegURL': 'hls', 'application/xhtml': 'html', 'application/xhtml+xml': 'html', 'text/css': 'css', 'text/html': 'guess-html', 'text/javascript': 'js', 'text/plain': 'guess-text'}¶
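The two tables above read naturally as a two-step lookup: a MIME type maps to a rewrite type, which in turn maps to a rewriter class. A minimal sketch using a subset of the entries, with class names as plain strings. The `pick_rewriter` helper is hypothetical, and how DefaultRewriter resolves the `guess-*` values is not documented here.

```python
# Subsets of rewrite_types and DEFAULT_REWRITERS from the listing above.
REWRITE_TYPES = {
    'text/html': 'guess-html',
    'text/css': 'css',
    'application/json': 'json',
    'application/x-mpegURL': 'hls',
}
REWRITERS = {'css': 'CSSRewriter', 'json': 'JSONPRewriter', 'hls': 'RewriteHLS'}


def pick_rewriter(content_type):
    """Map a Content-Type header value to (rewrite type, rewriter name)."""
    # Drop parameters such as '; charset=utf-8' before the lookup.
    mime = content_type.split(';', 1)[0].strip()
    text_type = REWRITE_TYPES.get(mime)
    return text_type, REWRITERS.get(text_type)
```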
-
pywb.rewrite.header_rewriter module¶
-
class
pywb.rewrite.header_rewriter.
DefaultHeaderRewriter
(rwinfo, header_prefix='X-Archive-Orig-')[source]¶ Bases:
object
-
header_rules
= {'accept-patch': 'keep', 'accept-ranges': 'keep', 'access-control-allow-credentials': 'prefix-if-url-rewrite', 'access-control-allow-headers': 'prefix-if-url-rewrite', 'access-control-allow-methods': 'prefix-if-url-rewrite', 'access-control-allow-origin': 'prefix-if-url-rewrite', 'access-control-expose-headers': 'prefix-if-url-rewrite', 'access-control-max-age': 'prefix-if-url-rewrite', 'age': 'prefix', 'allow': 'keep', 'alt-svc': 'prefix', 'cache-control': 'prefix', 'connection': 'prefix', 'content-base': 'url-rewrite', 'content-disposition': 'keep', 'content-encoding': 'prefix-if-content-rewrite', 'content-language': 'keep', 'content-length': 'content-length', 'content-location': 'url-rewrite', 'content-md5': 'prefix', 'content-range': 'keep', 'content-security-policy': 'prefix', 'content-security-policy-report-only': 'prefix', 'content-type': 'keep', 'date': 'prefix', 'etag': 'prefix', 'expires': 'prefix', 'last-modified': 'prefix', 'link': 'keep', 'location': 'url-rewrite', 'p3p': 'prefix', 'pragma': 'prefix', 'proxy-authenticate': 'keep', 'public-key-pins': 'prefix', 'retry-after': 'prefix', 'server': 'prefix', 'set-cookie': 'cookie', 'status': 'prefix', 'strict-transport-security': 'prefix', 'tk': 'prefix', 'trailer': 'prefix', 'transfer-encoding': 'transfer-encoding', 'upgrade': 'prefix', 'upgrade-insecure-requests': 'prefix', 'vary': 'prefix', 'via': 'prefix', 'warning': 'prefix', 'www-authenticate': 'keep', 'x-frame-options': 'prefix', 'x-xss-protection': 'prefix'}¶
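To make the rule table concrete, the sketch below applies only the two simplest rules, `keep` and `prefix`, using the default `header_prefix` from the constructor signature. The `apply_simple_rules` helper is hypothetical. Headers governed by any other rule (or by no rule) are left out of the result, because those rules depend on rewriter state that this page does not describe.

```python
HEADER_PREFIX = 'X-Archive-Orig-'

# A small subset of header_rules from the listing above.
RULES = {
    'content-type': 'keep',
    'etag': 'prefix',
    'server': 'prefix',
    'set-cookie': 'cookie',
}


def apply_simple_rules(headers):
    """Apply the 'keep' and 'prefix' rules to a list of (name, value) pairs."""
    out = []
    for name, value in headers:
        rule = RULES.get(name.lower())
        if rule == 'keep':
            out.append((name, value))
        elif rule == 'prefix':
            out.append((HEADER_PREFIX + name, value))
        # Other rules are intentionally not handled in this sketch.
    return out
```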
-
pywb.rewrite.html_insert_rewriter module¶
-
class
pywb.rewrite.html_insert_rewriter.
HTMLInsertOnlyRewriter
(url_rewriter, **kwargs)[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
Insert a custom string into the head of the HTML, before the first tag that is not <head> or <html>. No other rewriting is performed.
-
NOT_HEAD_REGEX
= re.compile('(<\\s*\\b)(?!(html|head))', re.IGNORECASE)¶
-
XML_HEADER
= re.compile('<\\?xml.*\\?>')¶
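A minimal sketch of the insertion step, assuming the string is placed immediately before the first match of NOT_HEAD_REGEX. The `insert_before_first_tag` helper and the sample markup are hypothetical; the real class is a streaming rewriter and handles buffering that is omitted here.

```python
import re

# Pattern copied from the attribute listing above.
NOT_HEAD_REGEX = re.compile('(<\\s*\\b)(?!(html|head))', re.IGNORECASE)


def insert_before_first_tag(html_text, insert_str):
    """Insert insert_str before the first opening tag that is not <html> or <head>."""
    m = NOT_HEAD_REGEX.search(html_text)
    if not m:
        return html_text
    return html_text[:m.start()] + insert_str + html_text[m.start():]
```

Closing tags and declarations such as `<!DOCTYPE html>` are skipped because the pattern requires a word boundary right after `<`.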
-
pywb.rewrite.html_rewriter module¶
-
class
pywb.rewrite.html_rewriter.
HTMLRewriter
(*args, **kwargs)[source]¶ Bases:
pywb.rewrite.html_rewriter.HTMLRewriterMixin, html.parser.HTMLParser
-
PARSETAG
= re.compile('[<]')¶
-
-
class
pywb.rewrite.html_rewriter.
HTMLRewriterMixin
(url_rewriter, head_insert=None, js_rewriter_class=None, js_rewriter=None, css_rewriter=None, css_rewriter_class=None, url='', defmod='', parse_comments=False, charset='utf-8')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
HTML-Parsing Rewriter for custom rewriting, also delegates to rewriters for script and css
-
ADD_WINDOW
= re.compile('(?<![.])(WB_wombat_)')¶
-
BEFORE_HEAD_TAGS
= ['html', 'head']¶
-
DATA_RW_PROTOCOLS
= ('http://', 'https://', '//')¶
-
META_REFRESH_REGEX
= re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$', re.IGNORECASE|re.MULTILINE)¶
-
PRELOAD_TYPES
= {'audio': 'oe_', 'document': 'if_', 'embed': 'oe_', 'fetch': 'mp_', 'font': 'oe_', 'image': 'im_', 'object': 'oe_', 'script': 'js_', 'style': 'cs_', 'track': 'oe_', 'video': 'oe_', 'worker': 'js_'}¶
-
SRCSET_REGEX
= re.compile('\\s*(\\S*\\s+[\\d\\.]+[wx]),|(?:\\s*,(?:\\s+|(?=https?:)))')¶
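Two of these attributes are easy to try in isolation. The sketch below extracts the target URL from a `<meta http-equiv="refresh">` content value and looks up a modifier in a subset of PRELOAD_TYPES. The `refresh_target` helper is hypothetical, and what the rewriter does with the extracted URL is not shown here.

```python
import re

# Pattern and a subset of the mapping, copied from the listing above.
META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
                                re.IGNORECASE | re.MULTILINE)
PRELOAD_TYPES = {'font': 'oe_', 'image': 'im_', 'script': 'js_', 'style': 'cs_'}


def refresh_target(content):
    """Return the URL part of a meta refresh content value, or None."""
    m = META_REFRESH_REGEX.match(content)
    return m.group(1) if m else None
```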
-
pywb.rewrite.jsonp_rewriter module¶
-
class
pywb.rewrite.jsonp_rewriter.
JSONPRewriter
(url_rewriter, align_to_line=True, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
-
CALLBACK
= re.compile('[?].*callback=([^&]+)')¶
-
JSONP
= re.compile('(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{', re.MULTILINE)¶
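One plausible use of these two patterns is to replace the callback name inside an archived JSONP response with the name requested in the current URL. The sketch below illustrates that idea only; the `swap_callback` helper and the sample values are hypothetical, and this is not necessarily the exact logic of JSONPRewriter.

```python
import re

# Patterns copied from the attribute listing above.
CALLBACK = re.compile('[?].*callback=([^&]+)')
JSONP = re.compile(
    '(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{',
    re.MULTILINE)


def swap_callback(request_url, body):
    """Replace the JSONP callback name in body with the one from request_url."""
    m_body = JSONP.search(body)
    m_url = CALLBACK.search(request_url)
    if not m_body or not m_url:
        return body
    # Keep everything after the old callback name, starting at '({'.
    return m_url.group(1) + body[m_body.end(1):]
```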
-
pywb.rewrite.regex_rewriters module¶
-
class
pywb.rewrite.regex_rewriters.
CSSRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.CSSRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
CSSRules
[source]¶ Bases:
pywb.rewrite.regex_rewriters.RxRules
-
CSS_IMPORT_REGEX
= '@import\\s+(?:url\\s*)?\\(?\\s*[\'"]?([\\w.:/\\\\-]+)'¶
-
CSS_URL_REGEX
= 'url\\s*\\(\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*([^)\'"]+)\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*\\)'¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLinkAndLocationRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriterRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLinkAndLocationRewriterRules
(prefix='WB_wombat_')[source]¶ Bases:
pywb.rewrite.regex_rewriters.JSLocationRewriterRules
JS Rewriter rules which also rewrite absolute http://, https:// and // urls at the beginning of a string
-
JS_HTTPX
= '(?:(?<=["\\\';])https?:|(?<=["\\\']))\\\\{0,4}/\\\\{0,4}/[A-Za-z0-9:_@%.\\\\-]+/'¶
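The lookbehinds in JS_HTTPX restrict matches to URLs that start a string literal (or follow a `;`), so bare URLs elsewhere in the script are ignored. A small demonstration on a made-up script:

```python
import re

# Pattern string copied from the attribute listing above.
JS_HTTPX = '(?:(?<=["\\\';])https?:|(?<=["\\\']))\\\\{0,4}/\\\\{0,4}/[A-Za-z0-9:_@%.\\\\-]+/'

# Hypothetical sample script: two quoted URLs and one bare URL in a comment.
script = ('var a = "http://example.com/app.js"; '
          "var b = '//cdn.example.com/lib.js'; "
          '// docs at http://plain.example.com/')

# Only the scheme-and-host part of each quoted URL is matched.
matches = re.findall(JS_HTTPX, script)
```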
-
-
class
pywb.rewrite.regex_rewriters.
JSLocationOnlyRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSLocationRewriterRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLocationRewriterRules
(prefix='WB_wombat_')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RxRules
JS Rewriter mixin which rewrites location and domain to the specified prefix (default: WB_wombat_)
-
class
pywb.rewrite.regex_rewriters.
JSNoneRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶
-
class
pywb.rewrite.regex_rewriters.
JSReplaceFuzzy
(*args, **kwargs)[source]¶ Bases:
object
-
rx_obj
= None¶
-
-
pywb.rewrite.regex_rewriters.
JSRewriter
¶ alias of
pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriter
-
class
pywb.rewrite.regex_rewriters.
JSWombatProxyRewriter
(rewriter, extra_rules=None)[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
JS Rewriter mixin which wraps the contents of the script in an anonymous block scope and inserts Wombat js-proxy setup
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSWombatProxyRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
RegexRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.RxRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
RxRules
(rules=None)[source]¶ Bases:
object
-
HTTPX_MATCH_STR
= 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'¶
-
-
class
pywb.rewrite.regex_rewriters.
XMLRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.XMLRules object>¶
-
pywb.rewrite.rewrite_amf module¶
pywb.rewrite.rewrite_dash module¶
pywb.rewrite.rewrite_hls module¶
pywb.rewrite.rewrite_js_workers module¶
-
class
pywb.rewrite.rewrite_js_workers.
JSWorkerRewriter
(url_rewriter, align_to_line=True, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
A simple rewriter for web or service workers. The only rewriting that occurs is the injection of the init code for wombatWorkers.js. This allows all of them to operate as they would on the live web.
pywb.rewrite.rewriteinputreq module¶
-
class
pywb.rewrite.rewriteinputreq.
RewriteInputRequest
(env, urlkey, url, rewriter)[source]¶ Bases:
pywb.warcserver.inputrequest.DirectWSGIInputRequest
-
RANGE_ARG_RX
= re.compile('.*.googlevideo.com/videoplayback.*([&?]range=(\\d+)-(\\d+))')¶
-
RANGE_HEADER
= re.compile('bytes=(\\d+)-(\\d+)?')¶
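Both range patterns can be exercised on their own. The sketch below parses a `Range` header value and a range query argument; the helper name and sample values are hypothetical, and how RewriteInputRequest uses the results is not described here.

```python
import re

# Patterns copied from the attribute listing above.
RANGE_ARG_RX = re.compile('.*.googlevideo.com/videoplayback.*([&?]range=(\\d+)-(\\d+))')
RANGE_HEADER = re.compile('bytes=(\\d+)-(\\d+)?')


def parse_range_header(value):
    """Return (start, end) from a Range header value; end is None for open ranges."""
    m = RANGE_HEADER.match(value)
    if not m:
        return None
    start, end = m.groups()
    return int(start), (int(end) if end is not None else None)


# Hypothetical sample URL carrying the range as a query argument.
arg_match = RANGE_ARG_RX.match(
    'https://r1.googlevideo.com/videoplayback?id=abc&range=0-1023')
```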
-
pywb.rewrite.templateview module¶
-
class
pywb.rewrite.templateview.
BaseInsertView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
object
Base class of all template views used by Pywb
-
render_to_string
(env, **kwargs)[source]¶ Render this template.
Parameters: - env (dict) – The WSGI environment associated with the request causing this template to be rendered
- kwargs (any) – The keyword arguments to be supplied to the Jinja template render method
Returns: The rendered template
Return type:
-
-
class
pywb.rewrite.templateview.
HeadInsertView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
pywb.rewrite.templateview.BaseInsertView
The template view class associated with rendering the HTML inserted into the head of the pages replayed (WB Insert).
-
create_insert_func
(wb_url, wb_prefix, host_prefix, top_url, env, is_framed, coll='', include_ts=True, **kwargs)[source]¶ Create the function used to render the header insert template for the current request.
Parameters: - wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
- wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
- host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
- top_url (str) – The full URL for this request (e.g. http://localhost:8080/live/http://example.com)
- env (dict) – The WSGI environment dictionary for this request
- is_framed (bool) – Is pywb or a specific collection running in framed mode
- coll (str) – The name of the collection this request is associated with
- include_ts (bool) – Should a timestamp be included in the rendered template
- kwargs – Additional keyword arguments to be supplied to the Jinja template render method
Returns: A function to be used to render the header insert for the request this template is being rendered for
Return type: callable
-
-
class
pywb.rewrite.templateview.
JinjaEnv
(paths=None, packages=None, assets_path=None, globals=None, overlay=None, extensions=None, env_template_params_key='pywb.template_params', env_template_dir_key='pywb.templates_dir')[source]¶ Bases:
object
Pywb JinjaEnv class that provides utility functions used by the templates, configured template loaders and template paths, and contains the actual Jinja env used by each template.
-
template_filter
(param=None)[source]¶ Returns a decorator that adds the wrapped function to dictionary of template filters.
The wrapped function is keyed by either the supplied param (if supplied) or by the wrapped function's name.
Parameters: param – Optional name to use instead of the name of the function to be wrapped
Returns: A decorator to wrap a template filter function
Return type: callable
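The decorator pattern described for `template_filter` can be sketched without Jinja. `FilterRegistry` is a hypothetical stand-in for JinjaEnv that only keeps the dictionary of filters; the real class also configures the Jinja environment.

```python
class FilterRegistry:
    """Hypothetical stand-in that only tracks registered template filters."""

    def __init__(self):
        self.filters = {}

    def template_filter(self, param=None):
        """Return a decorator that registers the wrapped function as a filter."""
        def decorator(func):
            # Key by the supplied name, falling back to the function's own name.
            self.filters[param or func.__name__] = func
            return func
        return decorator


registry = FilterRegistry()


@registry.template_filter()
def shout(value):
    return value.upper()


@registry.template_filter('quiet')
def make_lowercase(value):
    return value.lower()
```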
-
-
class
pywb.rewrite.templateview.
PkgResResolver
[source]¶ Bases:
webassets.env.Resolver
Class for resolving pywb package resources when installed via PyPI or setup.py
-
get_pkg_path
(item)[source]¶ Get the package path for the
Parameters: item (str) – A resource's full package path
Returns: The netloc and path from the item's package path
Return type: tuple[str, str]
-
resolve_source
(ctx, item)[source]¶ Given item from a Bundle’s contents, this has to return the final value to use, usually an absolute filesystem path.
Note
It is also allowed to return urls and bundle instances (or generally anything else the calling Bundle instance may be able to handle). Indeed this is the reason why the name of this method does not imply a return type.
The incoming item is usually a relative path, but may also be an absolute path, or a url. These you will commonly want to return unmodified.
This method is also allowed to resolve item to multiple values, in which case a list should be returned. This is commonly used if item includes glob instructions (wildcards).
Note
Subclasses should consider implementing search_for_source() instead of this method.
-
-
class
pywb.rewrite.templateview.
RelEnvironment
(block_start_string='{%', block_end_string='%}', variable_start_string='{{', variable_end_string='}}', comment_start_string='{#', comment_end_string='#}', line_statement_prefix=None, line_comment_prefix=None, trim_blocks=False, lstrip_blocks=False, newline_sequence='\n', keep_trailing_newline=False, extensions=(), optimized=True, undefined=<class 'jinja2.runtime.Undefined'>, finalize=None, autoescape=False, loader=None, cache_size=400, auto_reload=True, bytecode_cache=None, enable_async=False)[source]¶ Bases:
jinja2.environment.Environment
Override join_path() to enable relative template paths.
-
join_path
(template, parent)[source]¶ Join a template with the parent. By default all the lookups are relative to the loader root so this method returns the template parameter unchanged, but if the paths should be relative to the parent template, this function can be used to calculate the real template name.
Subclasses may override this method and implement template path joining here.
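One way to compute a parent-relative template name is to join the template onto the parent's directory. The function below is a standalone sketch of that calculation using `posixpath`; it is an assumption about the general approach, not the body of RelEnvironment.join_path.

```python
import posixpath


def join_relative(template, parent):
    """Resolve template relative to the directory of the parent template."""
    joined = posixpath.join(posixpath.dirname(parent), template)
    # Collapse any '..' segments so the loader sees a clean name.
    return posixpath.normpath(joined)
```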
-
-
class
pywb.rewrite.templateview.
TopFrameView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
pywb.rewrite.templateview.BaseInsertView
The template view class associated with rendering the replay iframe
-
get_top_frame
(wb_url, wb_prefix, host_prefix, env, frame_mod, replay_mod, coll='', extra_params=None)[source]¶ Parameters: - wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
- wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
- host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
- env (dict) – The WSGI environment dictionary for the request this template is being rendered for
- frame_mod (str) – The modifier to be used for framing (e.g. if_)
- replay_mod (str) – The modifier to be used in the URL of the page being replayed (e.g. mp_)
- coll (str) – The name of the collection this template is being rendered for
- extra_params (dict) – Additional parameters to be supplied to the Jinja template render method
Returns: The frame insert string
Return type:
-
pywb.rewrite.url_rewriter module¶
-
class
pywb.rewrite.url_rewriter.
IdentityUrlRewriter
(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]¶ Bases:
pywb.rewrite.url_rewriter.UrlRewriter
No rewriting performed, return original url
-
class
pywb.rewrite.url_rewriter.
SchemeOnlyUrlRewriter
(*args, **kwargs)[source]¶ Bases:
pywb.rewrite.url_rewriter.IdentityUrlRewriter
A URL rewriter which ensures that any URLs have the same scheme (http or https) as the base URL. Other URLs and input are left unchanged.
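The behaviour described for SchemeOnlyUrlRewriter can be sketched as a plain function. The `match_scheme` helper is hypothetical and ignores details such as scheme-relative or escaped URLs, which the real class may treat differently.

```python
def match_scheme(url, base_url):
    """Give an http(s) url the same scheme as base_url; leave other input alone."""
    base_scheme = 'https://' if base_url.startswith('https://') else 'http://'
    for scheme in ('http://', 'https://'):
        if url.startswith(scheme):
            return base_scheme + url[len(scheme):]
    return url
```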
-
class
pywb.rewrite.url_rewriter.
UrlRewriter
(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]¶ Bases:
object
Main pywb UrlRewriter which rewrites absolute and relative urls to be relative to the current page, as specified via a WbUrl instance and an optional full path prefix
-
NO_REWRITE_URI_PREFIX
= ('#', 'javascript:', 'data:', 'mailto:', 'about:', 'file:', '{')¶
-
PARENT_PATH
= '../'¶
-
PROTOCOLS
= ('http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:')¶
-
REL_PATH
= '/'¶
-
REL_SCHEME
= ('//', '\\/\\/', '\\\\/\\\\/')¶
-
pywb_static_prefix
¶ Returns the static path URL.
Return type: str
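The prefix tuples above lend themselves to `str.startswith`, which accepts a tuple of candidates. The sketch below sorts an input URL into four rough categories. The `classify` helper and its labels are hypothetical, and UrlRewriter's actual decision logic is more involved than this.

```python
# Constants copied from the attribute listing above.
NO_REWRITE_URI_PREFIX = ('#', 'javascript:', 'data:', 'mailto:', 'about:', 'file:', '{')
PROTOCOLS = ('http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:')
REL_SCHEME = ('//', '\\/\\/', '\\\\/\\\\/')


def classify(url):
    """Return a rough category for url based on its prefix."""
    if url.startswith(NO_REWRITE_URI_PREFIX):
        return 'no-rewrite'
    if url.startswith(REL_SCHEME):
        # Includes the backslash-escaped forms of '//' found in scripts.
        return 'scheme-relative'
    if url.startswith(PROTOCOLS):
        return 'absolute'
    return 'relative'
```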
-
pywb.rewrite.wburl module¶
WbUrl represents the standard wayback archival url format. A regular url is a subset of the WbUrl (latest replay).
The WbUrl expresses the common interface for interacting with the wayback machine.
The WbUrl may represent one of the following forms:
query form: [/modifier]/[timestamp][-end_timestamp]*/<url>
modifier, timestamp and end_timestamp are optional:
*/example.com
20101112030201*/http://example.com
2009-2015*/http://example.com
/cdx/*/http://example.com
url query form: used to indicate query across urls
same as query form but with a final *:
*/example.com*
20101112030201*/http://example.com*
replay form:
20101112030201/http://example.com
20101112030201im_/http://example.com
latest_replay: (no timestamp):
http://example.com
Additionally, the BaseWbUrl provides the base components (url, timestamp, end_timestamp, modifier, type) which can be used to provide a custom representation of the wayback url format.
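The forms above can be told apart with the QUERY_REGEX and REPLAY_REGEX patterns documented for WbUrl further down this page. The sketch below is a simplified classifier, not WbUrl's parsing code: the `classify_wburl` helper is hypothetical, and it assumes the query pattern is tried first and that a replay match without a timestamp counts as latest_replay.

```python
import re

# Patterns copied from the WbUrl attribute listing on this page.
QUERY_REGEX = re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')
REPLAY_REGEX = re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')


def classify_wburl(value):
    """Return one of the BaseWbUrl type names for value (simplified)."""
    m = QUERY_REGEX.match(value)
    if m:
        # A trailing '*' on the url part marks a query across urls.
        return 'url_query' if m.group(4).endswith('*') else 'query'
    m = REPLAY_REGEX.match(value)
    if m and m.group(1):
        return 'replay'
    return 'latest_replay'
```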
-
class
pywb.rewrite.wburl.
BaseWbUrl
(url='', mod='', timestamp='', end_timestamp='', type=None)[source]¶ Bases:
object
-
LATEST_REPLAY
= 'latest_replay'¶
-
QUERY
= 'query'¶
-
REPLAY
= 'replay'¶
-
URL_QUERY
= 'url_query'¶
-
-
class
pywb.rewrite.wburl.
WbUrl
(orig_url)[source]¶ Bases:
pywb.rewrite.wburl.BaseWbUrl
-
DEFAULT_SCHEME
= 'http://'¶
-
FIRST_PATH
= re.compile('(?<![:/])[/?](?![/])')¶
-
QUERY_REGEX
= re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')¶
-
REPLAY_REGEX
= re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')¶
-
SCHEME_RX
= re.compile('[a-zA-Z0-9+-.]+(:/)')¶
-
is_embed
¶
-
is_identity
¶
-
is_url_rewrite_only
¶
-
static
percent_encode_host
(url)[source]¶ Convert the host of a URI formatted with to_uri() to a %-encoded host instead of a punycode host. The rest of the URL should be unchanged.
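The conversion can be approximated with the standard library alone: decode the punycode host with the `idna` codec, then percent-encode its UTF-8 bytes. This is a sketch of the idea rather than pywb's implementation. The `percent_encode_host_sketch` helper is hypothetical and assumes a netloc without port or user info.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_host_sketch(url):
    """Replace a punycode host with its %-encoded UTF-8 form."""
    parts = urlsplit(url)
    # Assumption: netloc is a bare host name (no port, no credentials).
    host = parts.netloc.encode('ascii').decode('idna')
    return urlunsplit(parts._replace(netloc=quote(host)))
```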
-